# Introduction to Sparse Autoencoders, with a focus on the paper "Scaling and evaluating sparse autoencoders"
The presentation will cover the foundations of sparse autoencoders as a promising unsupervised approach for extracting interpretable features from a model. I will present the main architectures used in this relatively new domain, as well as how they are trained and evaluated. The presentation will focus on the paper "Scaling and evaluating sparse autoencoders", recently published at ICLR 2025, whose authors start from the hypothesis that, since language models learn many concepts, autoencoders need to be very large to recover all relevant features. They propose the k-sparse (TopK) autoencoder and show that scaling both its size and sparsity improves the reconstruction-sparsity frontier, yields fewer dead latents, and leads to "better" interpretability.
The presentation is based on this [paper](https://openreview.net/forum?id=tcsZt9ZNKD)
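The core mechanism of the k-sparse autoencoder is simple: instead of an L1 penalty, a TopK activation keeps only the k largest latent pre-activations, so exactly k latents fire per input. Below is a minimal NumPy sketch of the forward pass; all weights are random stand-ins for learned parameters, and the function name and sizes are illustrative, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_latents, k = 16, 64, 4  # toy sizes; the paper scales n_latents far larger

# Hypothetical parameters (normally learned by training on model activations).
W_enc = rng.standard_normal((d_model, n_latents)) / np.sqrt(d_model)
W_dec = rng.standard_normal((n_latents, d_model)) / np.sqrt(n_latents)
b_enc = np.zeros(n_latents)
b_pre = np.zeros(d_model)

def topk_sae_forward(x):
    """Forward pass of a k-sparse (TopK) autoencoder sketch.

    Pre-activations are computed as usual, but only the k largest
    latents are kept; all others are zeroed, enforcing exactly k
    active latents per input instead of relying on an L1 penalty.
    """
    pre = (x - b_pre) @ W_enc + b_enc       # latent pre-activations
    idx = np.argsort(pre)[..., -k:]         # indices of the k largest latents
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, -1), -1)
    x_hat = z @ W_dec + b_pre               # reconstruction from sparse code
    return z, x_hat

x = rng.standard_normal(d_model)
z, x_hat = topk_sae_forward(x)
print(int((z != 0).sum()))  # exactly k latents are active
```

Training would then minimize the reconstruction error between `x` and `x_hat`; the hard TopK constraint is what removes the sparsity penalty from the loss.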
# Sparse Autoencoders Do Not Find Canonical Units of Analysis
This paper examines the limitations of Sparse Autoencoders (SAEs) as a method for finding interpretable features in neural networks, particularly challenging the idea that they can identify a “canonical set of units” (unique and complete atomic features). To showcase this problem, the authors propose Meta-SAEs - essentially applying SAEs to the decoder matrix of another SAE - demonstrating that supposedly atomic features can still be further decomposed. Additionally, the authors introduce feature stitching, a method where features from wider SAEs are inserted into narrower ones, revealing that narrower SAEs are incomplete in their feature representation.
The presentation is based on this [paper](https://openreview.net/forum?id=9ca9eHNrdH)
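The Meta-SAE construction can be sketched in a few lines: the rows of a trained SAE's decoder matrix (one direction per "atomic" latent) are themselves treated as data points and re-encoded sparsely by a second, smaller dictionary. This is a minimal NumPy illustration of the idea only; the decoder and meta-SAE weights below are random stand-ins, not trained models.

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, n_latents, n_meta, k = 16, 64, 8, 2  # toy sizes, purely illustrative

# Stand-in for a trained SAE's decoder: one unit-norm direction per latent.
W_dec = rng.standard_normal((n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# The meta-SAE re-encodes each decoder row with a much smaller dictionary.
W_meta_enc = rng.standard_normal((d_model, n_meta)) / np.sqrt(d_model)
W_meta_dec = rng.standard_normal((n_meta, d_model)) / np.sqrt(n_meta)

def meta_encode(directions, k):
    """Express each supposedly atomic SAE decoder direction as a
    k-sparse combination of meta-latents (the Meta-SAE idea)."""
    pre = directions @ W_meta_enc
    idx = np.argsort(pre, axis=-1)[:, -k:]   # k largest meta-latents per row
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, -1), -1)
    return z, z @ W_meta_dec                 # sparse code and reconstruction

z_meta, recon = meta_encode(W_dec, k)
```

If a trained meta-SAE reconstructs the decoder directions well, the original latents are not atomic: each decomposes into a sparse mix of shared meta-features, which is the paper's argument against a canonical set of units.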
# README.md

Join us at https://meet.drwhy.ai.
### Spring semester
* 24.02 - [Tracking information flow in biosystems from high-throughput data](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_02_24_tracking_information_flow) - Miron Kursa
* 03.03 - [Introduction to Sparse Autoencoders (SAE)](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_03_03_scaling_and_evaluating_sparse_autoencoders) with [ReLU](https://transformer-circuits.pub/2024/scaling-monosemanticity/) (Anthropic; Blog; 21.05.2024), [TopK](https://openreview.net/forum?id=tcsZt9ZNKD) (OpenAI; ICLR 2025; 06.06.2025) and [JumpReLU](https://openreview.net/forum?id=XkMrWOJhNd) (DeepMind; 09.08.2024; EMNLP 2024 Workshop) - Vladimir
* 10.03 - [Sparse Autoencoders Do Not Find Canonical Units of Analysis](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_10_03_sparse_autoencoders_do_not_find_canonical_units_of_analysis) (Durham University/Independent; ICLR 2025; 07.02.2025) - Vladimir
* 17.03 - [Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](https://openreview.net/forum?id=I4e82CIDxv) (Northeastern University; ICLR 2025; 31.03.2025) - (advising Vladimir)
* 24.03 - [Interpreting CLIP with Hierarchical Sparse Autoencoders](https://arxiv.org/abs/2502.20578) (My paper; Under Review; 30.01.2025) - Vladimir Zaigrajew
* 31.03 - Short intro (HB) + [PIP-Net: Patch-based intuitive prototypes for interpretable image classification](https://openaccess.thecvf.com/content/CVPR2023/html/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.html) (CVPR 2023) - (advising HB)
* 07.04 - [ProtoViT: Interpretable image classification with adaptive prototype-based vision transformers](https://openreview.net/forum?id=hjhpCJfbFG) - Paweł Gelar
* 14.04 - [Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning](https://arxiv.org/abs/2503.08636) - Hubert Baniecki
* 28.04 - Vlad (ICLR)
* 19.05 - [Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models](https://arxiv.org/abs/2412.18604) (advising BS)
* 26.05 - [Diffusion Posterior Sampling for General Noisy Inverse Problems](https://arxiv.org/abs/2209.14687) (advising BS)