We run a weekly reading group on AI Safety and Interpretability, hosted at the Department of Mathematics and Computer Science, University of Southern Denmark. We discuss recent papers
on interpretability, control, alignment, agentic/multi-agent safety, and related topics.
Everyone is welcome, regardless of background. Our meetings are held in hybrid format.
The reading group currently meets on Mondays at 14:00 Copenhagen time, unless the schedule lists a different day or time.
Email us at galke@imada.sdu.dk if you want to join or have any questions.
Schedule

| Date | Topic | Presenter |
| --- | --- | --- |
| Feb 3, 2026 | Activation Oracles | Federico |
| Feb 10, 2026 | Weird generalizations | Lukas |
| Feb 16, 2026 | The Dead Salmons of AI Interpretability | Andor |
| Mar 2, 2026 | Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences | Annemette |
| Mar 9, 2026 | Linear Representations can change over the course of a conversation | Federico |
| Mar 23, 2026 | EasySteer | Gianluca |
| Mar 30, 2026 | Thought Branches | Andrea |
| Apr 13, 2026 | Evaluating and Understanding Scheming Propensity in LLM Agents | Filippo |
| Apr 20, 2026, 13:00 | Steering Evaluation-Aware Language Models to Act Like They Are Deployed | Lukas |
| Apr 27, 2026 | Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment | Laurene Vaugrante |
| May 4, 2026 | Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs | Federico |
| May 11, 2026 | Screening new literature: Natural Language Autoencoders, Neural Geometry, and Manifold Steering | * |
| Jun 1, 2026, 15:00 | Position: Model Collapse Does Not Mean What You Think | Anton |