AI Safety & Interpretability Lab

SDU AI Safety Readings

We run a weekly reading group on AI Safety and Interpretability, hosted at the Department of Mathematics and Computer Science, University of Southern Denmark. We discuss recent papers on interpretability, control, alignment, agentic/multi-agent safety, and related topics. Everyone is welcome, regardless of background. Our meetings are held in hybrid format.

The reading group currently meets on Mondays at 14:00 Copenhagen time, unless a different time is listed in the schedule. Email us at galke@imada.sdu.dk if you want to join or have any questions.

Schedule

| Date | Topic | Presenter |
| --- | --- | --- |
| Feb 3, 2026 | Activation Oracles | Federico |
| Feb 10, 2026 | Weird generalizations | Lukas |
| Feb 16, 2026 | The Dead Salmons of AI Interpretability | Andor |
| Mar 2, 2026 | Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences | Annemette |
| Mar 9, 2026 | Linear Representations can change over the course of a conversation | Federico |
| Mar 23, 2026 | EasySteer | Gianluca |
| Mar 30, 2026 | Thought Branches | Andrea |
| Apr 13, 2026 | Evaluating and Understanding Scheming Propensity in LLM Agents | Filippo |
| Apr 20, 2026, 13:00 | Steering Evaluation-Aware Language Models to Act Like They Are Deployed | Lukas |
| Apr 27, 2026 | Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment | Laurene Vaugrante |
| May 4, 2026 | Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs | Federico |
| May 11, 2026 | Screening new literature: Natural Language Autoencoders, Neural Geometry, and Manifold Steering | * |
| Jun 1, 2026, 15:00 | Position: Model Collapse Does Not Mean What You Think | Anton |

Contact us via galke@imada.sdu.dk