We run a weekly reading group on AI Safety and Interpretability, hosted at the Department of Mathematics and Computer Science, University of Southern Denmark.
We discuss recent papers on interpretability, control, alignment, agentic/multi-agent safety, and related topics.
Everyone is welcome, regardless of background. Our meetings are held in a hybrid format.
The reading group currently meets on Mondays at 14:00 Copenhagen time, unless the schedule lists a different time.
Email us at galke@imada.sdu.dk if you want to join or have any questions.
Schedule
| Date | Topic | Presenter |
|---|---|---|
| Feb 3, 2026 | Activation Oracles | Federico |
| Feb 10, 2026 | Weird generalizations | Lukas |
| Feb 16, 2026 | The Dead Salmons of AI Interpretability | Andor |
| Mar 2, 2026 | Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences | Annemette |
| Mar 9, 2026 | Linear Representations can change over the course of a conversation | Federico |
| Mar 23, 2026 | EasySteer | Gianluca |
| March 30, 2026 | Thought Branches | Andrea |
| April 13, 2026 | Evaluating and Understanding Scheming Propensity in LLM Agents | Filippo |
| April 20, 2026, 13:00 | Steering Evaluation-Aware Language Models to Act Like They Are Deployed | Lukas |
| April 27, 2026 | Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment | Laurene Vaugrante (University of Stuttgart) |
| May 4, 2026 | Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs | Federico |
| May 11, 2026 | Screening new literature: Natural Language Autoencoders, Neural Geometry, and Manifold Steering | * |
| June 1, 2026, 15:00 | Main topic: Position: Model Collapse Does Not Mean What You Think + Side reading: Subliminal learning | Anton |
| June 8, 2026 | In-Training Defense against Emergent Misalignment in Language Models | Florian Mai (Lamarr Institute for ML/AI) |
| June 15, 2026 | Subliminal Learning Is Steering Vector Distillation | Lukas |
| June 22, 2026 | NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment | Gianluca |
| June 29, 2026 | Manifold Steering | Johannes/Lukas |
| July 6, 2026 | Solipsistic Superintelligence is Unlikely to be Cooperative | Filippo |
| July 13, 2026 | Human-AI Complementarity: A Goal for Amplified Oversight | Shamim |