Skip to content

Commit e7844b8

Browse files
authored
Merge pull request #89 from DawidPludowski/master
Added 17.03.2025 presentation
2 parents a7ccaa1 + b368a0d commit e7844b8

3 files changed

Lines changed: 16 additions & 13 deletions

File tree

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
# Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
2+
3+
The paper presents how to get feature circuits in language models - structured interactions between sparse, interpretable features that collectively influence a model's behaviour. The features in circuits can be interpreted, as we saw in the previous seminars, but also removed during the manual debiasing, which is performed by a novel SHIFT method. The authors argue that this method can be used to, e.g., detect and remove the influence of gender on the profession classification, making language models safe and fair.
Binary file not shown.

README.md

Lines changed: 13 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -9,20 +9,20 @@ Join us at https://meet.drwhy.ai.
99
### Spring semester
1010

1111
* 24.02 - [Tracking information flow in biosystems from high-throughput data](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_02_24_tracking_information_flow) - Miron Kursa
12-
* 03.03 - Introduction to Sparse Autoencoders SAE with [ReLU](https://transformer-circuits.pub/2024/scaling-monosemanticity/) (Anthropic; Blog; 21.05.2024), [TopK](https://openreview.net/forum?id=tcsZt9ZNKD) (OpenAI; ICLR 2025; 06.06.2025) and [JumpReLU](https://openreview.net/forum?id=XkMrWOJhNd) (DeepMind; 09.08.2024; EMNLP 2024 Workshop) - (advising Vladimir)
13-
* 10.03 - [Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](https://openreview.net/forum?id=I4e82CIDxv) (Northeastern University; ICLR 2025; 31.03.2025) - (advising Vladimir)
14-
* 17.03 - [Sparse Autoencoders Do Not Find Canonical Units of Analysis](https://openreview.net/forum?id=9ca9eHNrdH) (Durham University/Independent; ICLR 2025; 07.02.2025) - (advising Vladimir)
15-
* 24.03 - [Interpreting CLIP with Hierarchical Sparse Autoencoders](https://arxiv.org/abs/2502.20578) (My paper; Under Review; 30.01.2025) - Vladimir Zaigrajew
16-
* 31.03 - [ProtoViT: Interpretable image classification with adaptive prototype-based vision transformers](https://openreview.net/forum?id=hjhpCJfbFG) - Paweł Gelar
17-
* 07.04 - [PIP-Net: Patch-based intuitive prototypes for interpretable image classification](https://openaccess.thecvf.com/content/CVPR2023/html/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.html) - Karol Dobiczek
18-
* 14.04 - [Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning](https://arxiv.org/abs/2503.08636) - Hubert Baniecki
19-
* 28.04 - Vlad (ICLR)
20-
* 05.05 - [Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models](https://arxiv.org/abs/2412.18604) (advising BS)
21-
* 12.05 - [DiG-IN: Diffusion Guidance for Investigating Networks -- Uncovering Classifier Differences Neuron Visualisations and Visual Counterfactual Explanations](https://arxiv.org/abs/2311.17833) (advising BS)
22-
* 19.05 - [Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model](https://arxiv.org/abs/2212.00490) (advising BS)
23-
* 26.05 - [Diffusion Posterior Sampling for General Noisy Inverse Problems](https://arxiv.org/abs/2209.14687) (advising BS)
12+
* 03.03 - [Introduction to Sparse Autoencoders (SAE)](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_03_03_scaling_and_evaluating_sparse_autoencoders) with [ReLU](https://transformer-circuits.pub/2024/scaling-monosemanticity/) (Anthropic; Blog; 21.05.2024), [TopK](https://openreview.net/forum?id=tcsZt9ZNKD) (OpenAI; ICLR 2025; 06.06.2025) and [JumpReLU](https://openreview.net/forum?id=XkMrWOJhNd) (DeepMind; 09.08.2024; EMNLP 2024 Workshop) - Vladimir
13+
* 10.03 - [Sparse Autoencoders Do Not Find Canonical Units of Analysis](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_10_03_sparse_autoencoders_do_not_find_canonical_units_of_analysis) - Vladimir
14+
* 17.03 - [Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](https://openreview.net/forum?id=I4e82CIDxv) (Northeastern University; ICLR 2025; 31.03.2025) - Dawid Płudowski
15+
* 24.03 - Interpreting CLIP with Hierarchical Sparse Autoencoders (My paper; Under Review; 30.01.2025) - Vladimir Zaigrajew
16+
* 31.03 - Short intro (HB) + [PIP-Net: Patch-based intuitive prototypes for interpretable image classification](https://openaccess.thecvf.com/content/CVPR2023/html/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.html) (CVPR 2023) - (advising HB)
17+
* 07.04 - [ProtoViT: Interpretable image classification with adaptive prototype-based vision transformers](https://openreview.net/forum?id=hjhpCJfbFG) (NeurIPS 2024) - (advising HB)
18+
* 14.04 - Adversarial analysis of intrinsically interpretable deep learning - Hubert Baniecki
19+
* 28.04 - Vlad
20+
* 05.05 - [Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models](https://arxiv.org/abs/2412.18604) - Jakub Świstak
21+
* 12.05 - [Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model](https://arxiv.org/abs/2212.00490) - Jakub Grzywaczewski
22+
* 19.05 - [Diffusion Posterior Sampling for General Noisy Inverse Problems](https://arxiv.org/abs/2209.14687) - Tymoteusz Kwieciński
23+
* 26.05 - [DiG-IN: Diffusion Guidance for Investigating Networks -- Uncovering Classifier Differences Neuron Visualisations and Visual Counterfactual Explanations](https://arxiv.org/abs/2311.17833) - Dawid Płudowski
2424
* 31.05 - Wyjazd zespołu do Wilgi
25-
* 02.06
25+
* 02.06
2626
* 09.06
2727

2828
### Fall semester

0 commit comments

Comments
 (0)