# Introduction to Sparse Autoencoders, with a focus on the paper "Scaling and evaluating sparse autoencoders"
The presentation will cover the foundations of sparse autoencoders as a promising unsupervised approach for extracting interpretable features from a model. I will present the main architectures used in this relatively new domain, as well as how they are trained and evaluated. The presentation will focus on the paper "Scaling and evaluating sparse autoencoders", recently published at ICLR 2025, whose authors start from the hypothesis that, since language models learn many concepts, autoencoders need to be very large to recover all relevant features. They propose the k-sparse (TopK) autoencoder and show that scaling both its size and sparsity improves the reconstruction-sparsity frontier, yields fewer dead latents, and leads to "better" interpretability.
The presentation is based on this [paper](https://openreview.net/forum?id=tcsZt9ZNKD)
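The core mechanism of the k-sparse autoencoder is simple: instead of an L1 penalty, a TopK activation keeps only the k largest latent pre-activations, so exactly k latents fire per input. Below is a minimal NumPy sketch of the forward pass; all weights are random stand-ins for learned parameters, and the function name and sizes are illustrative, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_latents, k = 16, 64, 4  # toy sizes; the paper scales n_latents far larger

# Hypothetical parameters (normally learned by training on model activations).
W_enc = rng.standard_normal((d_model, n_latents)) / np.sqrt(d_model)
W_dec = rng.standard_normal((n_latents, d_model)) / np.sqrt(n_latents)
b_enc = np.zeros(n_latents)
b_pre = np.zeros(d_model)

def topk_sae_forward(x):
    """Forward pass of a k-sparse (TopK) autoencoder sketch.

    Pre-activations are computed as usual, but only the k largest
    latents are kept; all others are zeroed, enforcing exactly k
    active latents per input instead of relying on an L1 penalty.
    """
    pre = (x - b_pre) @ W_enc + b_enc       # latent pre-activations
    idx = np.argsort(pre)[..., -k:]         # indices of the k largest latents
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, -1), -1)
    x_hat = z @ W_dec + b_pre               # reconstruction from sparse code
    return z, x_hat

x = rng.standard_normal(d_model)
z, x_hat = topk_sae_forward(x)
print(int((z != 0).sum()))  # exactly k latents are active
```

Training would then minimize the reconstruction error between `x` and `x_hat`; the hard TopK constraint is what removes the sparsity penalty from the loss.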
# Sparse Autoencoders Do Not Find Canonical Units of Analysis
This paper examines the limitations of Sparse Autoencoders (SAEs) as a method for finding interpretable features in neural networks, particularly challenging the idea that they can identify a “canonical set of units” (unique and complete atomic features). To showcase this problem, the authors propose Meta-SAEs - essentially applying SAEs to the decoder matrix of another SAE - demonstrating that supposedly atomic features can still be further decomposed. Additionally, the authors introduce feature stitching, a method where features from wider SAEs are inserted into narrower ones, revealing that narrower SAEs are incomplete in their feature representation.
The presentation is based on this [paper](https://openreview.net/forum?id=9ca9eHNrdH)
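The Meta-SAE construction can be sketched in a few lines: the rows of a trained SAE's decoder matrix (one direction per "atomic" latent) are themselves treated as data points and re-encoded sparsely by a second, smaller dictionary. This is a minimal NumPy illustration of the idea only; the decoder and meta-SAE weights below are random stand-ins, not trained models.

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, n_latents, n_meta, k = 16, 64, 8, 2  # toy sizes, purely illustrative

# Stand-in for a trained SAE's decoder: one unit-norm direction per latent.
W_dec = rng.standard_normal((n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# The meta-SAE re-encodes each decoder row with a much smaller dictionary.
W_meta_enc = rng.standard_normal((d_model, n_meta)) / np.sqrt(d_model)
W_meta_dec = rng.standard_normal((n_meta, d_model)) / np.sqrt(n_meta)

def meta_encode(directions, k):
    """Express each supposedly atomic SAE decoder direction as a
    k-sparse combination of meta-latents (the Meta-SAE idea)."""
    pre = directions @ W_meta_enc
    idx = np.argsort(pre, axis=-1)[:, -k:]   # k largest meta-latents per row
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, -1), -1)
    return z, z @ W_meta_dec                 # sparse code and reconstruction

z_meta, recon = meta_encode(W_dec, k)
```

If a trained meta-SAE reconstructs the decoder directions well, the original latents are not atomic: each decomposes into a sparse mix of shared meta-features, which is the paper's argument against a canonical set of units.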
# README.md

Join us at https://meet.drwhy.ai.
### Spring semester
* 24.02 - [Tracking information flow in biosystems from high-throughput data](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_02_24_tracking_information_flow) - Miron Kursa
* 03.03 - [Introduction to Sparse Autoencoders (SAE)](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_03_03_scaling_and_evaluating_sparse_autoencoders) with [ReLU](https://transformer-circuits.pub/2024/scaling-monosemanticity/) (Anthropic; Blog; 21.05.2024), [TopK](https://openreview.net/forum?id=tcsZt9ZNKD) (OpenAI; ICLR 2025; 06.06.2025) and [JumpReLU](https://openreview.net/forum?id=XkMrWOJhNd) (DeepMind; 09.08.2024; EMNLP 2024 Workshop) - Vladimir
* 10.03 - [Sparse Autoencoders Do Not Find Canonical Units of Analysis](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_10_03_sparse_autoencoders_do_not_find_canonical_units_of_analysis) (Durham University/Independent; ICLR 2025; 07.02.2025) - Vladimir
* 17.03 - [Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](https://openreview.net/forum?id=I4e82CIDxv) (Northeastern University; ICLR 2025; 31.03.2025) - (advising Vladimir)
* 24.03 - [Interpreting CLIP with Hierarchical Sparse Autoencoders](https://arxiv.org/abs/2502.20578) (My paper; Under Review; 30.01.2025) - Vladimir Zaigrajew
* 31.03 - Short intro (HB) + [PIP-Net: Patch-based intuitive prototypes for interpretable image classification](https://openaccess.thecvf.com/content/CVPR2023/html/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.html) (CVPR 2023) - (advising HB)
* 07.04 - [ProtoViT: Interpretable image classification with adaptive prototype-based vision transformers](https://openreview.net/forum?id=hjhpCJfbFG) - Paweł Gelar
* 14.04 - [Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning](https://arxiv.org/abs/2503.08636) - Hubert Baniecki
* 28.04 - Vlad (ICLR)
* 19.05 - [Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models](https://arxiv.org/abs/2412.18604) (advising BS)
* 26.05 - [Diffusion Posterior Sampling for General Noisy Inverse Problems](https://arxiv.org/abs/2209.14687) (advising BS)