MI2DataLab
diff --git a/2025/2025_03_03_scaling_and_evaluating_sparse_autoencoders/Introduction_to_Sparse_Autoencoders.pdf b/2025/2025_03_03_scaling_and_evaluating_sparse_autoencoders/Introduction_to_Sparse_Autoencoders.pdf
-4.79 MB
diff --git a/2025/2025_03_03_scaling_and_evaluating_sparse_autoencoders/README.md b/2025/2025_03_03_scaling_and_evaluating_sparse_autoencoders/README.md
Lines changed: 0 additions & 5 deletions
diff --git a/2025/2025_03_17_sparse_feature_circuits/README.md b/2025/2025_03_17_sparse_feature_circuits/README.md
Lines changed: 3 additions & 0 deletions
diff --git a/2025/2025_03_17_sparse_feature_circuits/sparse_feature_circuits.pdf b/2025/2025_03_17_sparse_feature_circuits/sparse_feature_circuits.pdf
2.33 MB
diff --git a/2025/2025_10_03_sparse_autoencoders_do_not_find_canonical_units_of_analysis/README.md b/2025/2025_10_03_sparse_autoencoders_do_not_find_canonical_units_of_analysis/README.md
Lines changed: 0 additions & 5 deletions
diff --git a/2025/2025_10_03_sparse_autoencoders_do_not_find_canonical_units_of_analysis/Sparse_Autoencoders_Do_Not_Find_Canonical_Units_of_Analysis.pdf b/2025/2025_10_03_sparse_autoencoders_do_not_find_canonical_units_of_analysis/Sparse_Autoencoders_Do_Not_Find_Canonical_Units_of_Analysis.pdf
-2.53 MB
diff --git a/README.md b/README.md
Lines changed: 1 addition & 1 deletion
@@ -0,0 +1,3 @@
+# Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
+
+The paper shows how to discover feature circuits in language models: structured interactions between sparse, interpretable features that collectively influence a model's behaviour. The features in a circuit can be interpreted, as we saw in the previous seminars, and can also be removed during manual debiasing, which is performed with a novel method called SHIFT. The authors argue that SHIFT can be used, for example, to detect and remove the influence of gender on profession classification, making language models safer and fairer.
@@ -11,7 +11,7 @@ Join us at https://meet.drwhy.ai.
 * 24.02 - [Tracking information flow in biosystems from high-throughput data](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_02_24_tracking_information_flow) - Miron Kursa
 * 03.03 - [Introduction to Sparse Autoencoders (SAE)](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_03_03_scaling_and_evaluating_sparse_autoencoders) with [ReLU](https://transformer-circuits.pub/2024/scaling-monosemanticity/) (Anthropic; Blog; 21.05.2024), [TopK](https://openreview.net/forum?id=tcsZt9ZNKD) (OpenAI; ICLR 2025; 06.06.2025) and [JumpReLU](https://openreview.net/forum?id=XkMrWOJhNd) (DeepMind; 09.08.2024; EMNLP 2024 Workshop) - Vladimir
 * 10.03 - [Sparse Autoencoders Do Not Find Canonical Units of Analysis](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2025/2025_10_03_sparse_autoencoders_do_not_find_canonical_units_of_analysis) - Vladimir
-* 17.03 - [Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](https://openreview.net/forum?id=I4e82CIDxv) (Northeastern University; ICLR 2025; 31.03.2025) - (advising Vladimir) 
+* 17.03 - [Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models](https://openreview.net/forum?id=I4e82CIDxv) (Northeastern University; ICLR 2025; 31.03.2025) - Dawid Płudowski
 * 24.03 - Interpreting CLIP with Hierarchical Sparse Autoencoders (My paper; Under Review; 30.01.2025) - Vladimir Zaigrajew
 * 31.03 - Short intro (HB) + [PIP-Net: Patch-based intuitive prototypes for interpretable image classification](https://openaccess.thecvf.com/content/CVPR2023/html/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.html) (CVPR 2023) - (advising HB)
 * 07.04 - [ProtoViT: Interpretable image classification with adaptive prototype-based vision transformers](https://openreview.net/forum?id=hjhpCJfbFG) (NeurIPS 2024) - (advising HB)