
Commit 172d627

Merge branch 'pr-88'
2 parents 3e9ccd9 + a8dae41

5 files changed

Lines changed: 18 additions & 8 deletions


Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Introduction to Sparse Autoencoders, with a focus on the paper "Scaling and evaluating sparse autoencoders"
+
+The presentation will focus on the foundations of sparse autoencoders, a promising unsupervised approach for extracting interpretable features from a model. I will present the top architectures used in this relatively new domain, and how they are trained and evaluated. The presentation centers on "Scaling and evaluating sparse autoencoders", recently published at ICLR 2025. Starting from the hypothesis that, since language models learn many concepts, autoencoders need to be very large to recover all relevant features, the authors propose the k-sparse autoencoder, for which scaling size and sparsity improves the reconstruction-sparsity frontier, with fewer dead latents and "better" interpretability.
+
+The presentation is based on this [paper](https://openreview.net/forum?id=tcsZt9ZNKD)
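As a rough illustration of the k-sparse (TopK) autoencoder idea described in the abstract, here is a minimal sketch with random, untrained weights. All sizes and variable names (`W_enc`, `W_dec`, `b_pre`, `b_enc`) are illustrative assumptions, not the paper's implementation; a real SAE learns these weights by minimizing reconstruction error on model activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, k = 16, 64, 4  # illustrative sizes only

# Untrained weights; a real SAE learns these from model activations.
W_enc = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_dec = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
b_pre = np.zeros(d_model)   # pre-encoder bias, subtracted before encoding
b_enc = np.zeros(d_latent)

def topk_sae(x, k=k):
    """Encode x, keep only the k largest latents, then reconstruct."""
    z = (x - b_pre) @ W_enc + b_enc   # dense pre-activations
    idx = np.argsort(z)[-k:]          # indices of the k largest values
    z_sparse = np.zeros_like(z)
    z_sparse[idx] = z[idx]            # TopK activation: all other latents zeroed
    x_hat = z_sparse @ W_dec + b_pre  # reconstruction from the sparse code
    return z_sparse, x_hat

x = rng.standard_normal(d_model)
z, x_hat = topk_sae(x)
```

Note that sparsity is enforced architecturally by the TopK activation rather than via an L1 penalty, which is what lets reconstruction quality and sparsity be controlled independently when scaling.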
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Sparse Autoencoders Do Not Find Canonical Units of Analysis
+
+This paper examines the limitations of Sparse Autoencoders (SAEs) as a method for finding interpretable features in neural networks, particularly challenging the idea that they can identify a "canonical set of units" (unique and complete atomic features). To showcase this problem, the authors propose Meta-SAEs - essentially applying SAEs to the decoder matrix of another SAE - demonstrating that supposedly atomic features can still be further decomposed. Additionally, the authors introduce feature stitching, a method where features from wider SAEs are inserted into narrower ones, revealing that narrower SAEs are incomplete in their feature representation.
+
+The presentation is based on this [paper](https://openreview.net/forum?id=9ca9eHNrdH)
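The Meta-SAE construction above can be sketched in the same spirit: treat each decoder row of a base SAE as a data point and sparse-code it with a second, smaller encoder. Everything here is an illustrative assumption - the base decoder is random rather than trained, the sizes are toy, and a real meta-SAE is trained with the usual SAE objective on a trained base SAE's decoder:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_base, n_meta, k = 16, 128, 32, 3  # illustrative sizes only

# Decoder of a hypothetical base SAE: each row is one "atomic" feature direction.
W_dec = rng.standard_normal((n_base, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm directions

# Meta-SAE encoder (untrained here): decomposes each base feature
# into a sparse combination of meta-latents.
W_meta = rng.standard_normal((d_model, n_meta)) / np.sqrt(d_model)

def meta_encode(direction, k=k):
    """Sparse-code one base decoder row with at most k active meta-latents."""
    z = direction @ W_meta
    idx = np.argsort(z)[-k:]          # keep the k largest pre-activations
    z_sparse = np.zeros_like(z)
    z_sparse[idx] = np.maximum(z[idx], 0.0)
    return z_sparse

codes = np.stack([meta_encode(row) for row in W_dec])  # (n_base, n_meta)
```

If supposedly atomic base features decompose cleanly into shared meta-latents, that is evidence against them being canonical units, which is the paper's core argument.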

README.md

Lines changed: 8 additions & 8 deletions
@@ -9,14 +9,14 @@ Join us at https://meet.drwhy.ai.
 ### Spring semester
 
 * 24.02 - [Tracking information flow in biosystems from high-throughput data](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_02_24_tracking_information_flow) - Miron Kursa
-* 03.03 - Introduction to Sparse Autoencoders SAE with [ReLU](https://transformer-circuits.pub/2024/scaling-monosemanticity/) (Anthropic; Blog; 21.05.2024), [TopK](https://openreview.net/forum?id=tcsZt9ZNKD) (OpenAI; ICLR 2025; 06.06.2025) and [JumpReLU](https://openreview.net/forum?id=XkMrWOJhNd) (DeepMind; 09.08.2024; EMNLP 2024 Workshop) - (advising Vladimir)
-* 10.03 - [Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](https://openreview.net/forum?id=I4e82CIDxv) (Northeastern University; ICLR 2025; 31.03.2025) - (advising Vladimir)
-* 17.03 - [Sparse Autoencoders Do Not Find Canonical Units of Analysis](https://openreview.net/forum?id=9ca9eHNrdH) (Durham University/Independent; ICLR 2025; 07.02.2025) - (advising Vladimir)
-* 24.03 - [Interpreting CLIP with Hierarchical Sparse Autoencoders](https://arxiv.org/abs/2502.20578) (My paper; Under Review; 30.01.2025) - Vladimir Zaigrajew
-* 31.03 - [PIP-Net: Patch-based intuitive prototypes for interpretable image classification](https://openaccess.thecvf.com/content/CVPR2023/html/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.html) - Karol Dobiczek
-* 07.04 - [ProtoViT: Interpretable image classification with adaptive prototype-based vision transformers](https://openreview.net/forum?id=hjhpCJfbFG) - Paweł Gelar
-* 14.04 - [Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning](https://arxiv.org/abs/2503.08636) - Hubert Baniecki
-* 28.04 - Vlad (ICLR)
+* 03.03 - [Introduction to Sparse Autoencoders (SAE)](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_03_03_scaling_and_evaluating_sparse_autoencoders) with [ReLU](https://transformer-circuits.pub/2024/scaling-monosemanticity/) (Anthropic; Blog; 21.05.2024), [TopK](https://openreview.net/forum?id=tcsZt9ZNKD) (OpenAI; ICLR 2025; 06.06.2025) and [JumpReLU](https://openreview.net/forum?id=XkMrWOJhNd) (DeepMind; 09.08.2024; EMNLP 2024 Workshop) - Vladimir
+* 10.03 - [Sparse Autoencoders Do Not Find Canonical Units of Analysis](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_10_03_sparse_autoencoders_do_not_find_canonical_units_of_analysis) - Vladimir
+* 17.03 - [Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](https://openreview.net/forum?id=I4e82CIDxv) (Northeastern University; ICLR 2025; 31.03.2025) - (advising Vladimir)
+* 24.03 - Interpreting CLIP with Hierarchical Sparse Autoencoders (My paper; Under Review; 30.01.2025) - Vladimir Zaigrajew
+* 31.03 - Short intro (HB) + [PIP-Net: Patch-based intuitive prototypes for interpretable image classification](https://openaccess.thecvf.com/content/CVPR2023/html/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.html) (CVPR 2023) - (advising HB)
+* 07.04 - [ProtoViT: Interpretable image classification with adaptive prototype-based vision transformers](https://openreview.net/forum?id=hjhpCJfbFG) (NeurIPS 2024) - (advising HB)
+* 14.04 - Adversarial analysis of intrinsically interpretable deep learning - Hubert Baniecki
+* 28.04 - Vlad
 * 12.05 - [DiG-IN: Diffusion Guidance for Investigating Networks -- Uncovering Classifier Differences Neuron Visualisations and Visual Counterfactual Explanations](https://arxiv.org/abs/2311.17833) (advising BS)
 * 19.05 - [Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models](https://arxiv.org/abs/2412.18604) (advising BS)
 * 26.05 - [Diffusion Posterior Sampling for General Noisy Inverse Problems](https://arxiv.org/abs/2209.14687) (advising BS)
