AI Safety Readings

We run a weekly reading group on AI Safety and Interpretability, hosted at the Department of Mathematics and Computer Science, University of Southern Denmark. We discuss recent papers on interpretability, control, alignment, agentic/multi-agent safety, and related topics. Everyone is welcome, regardless of background. Our meetings are held in a hybrid format.

The reading group currently meets on Mondays at 14:00 Copenhagen time, unless a different time is listed in the schedule. Email us at galke@imada.sdu.dk if you want to join or have any questions.

Schedule

Date	Topic	Presenter
Feb 3, 2026	Activation Oracles	Federico
Feb 10, 2026	Weird generalizations	Lukas
Feb 16, 2026	The Dead Salmons of AI Interpretability	Andor
Mar 2, 2026	Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences	Annemette
Mar 9, 2026	Linear Representations can change over the course of a conversation	Federico
Mar 23, 2026	EasySteer	Gianluca
Mar 30, 2026	Thought Branches	Andrea
April 13, 2026	Evaluating and Understanding Scheming Propensity in LLM Agents	Filippo
April 20, 2026, 13:00	Steering Evaluation-Aware Language Models to Act Like They Are Deployed	Lukas
April 27, 2026	Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment	Laurene Vaugrante (University of Stuttgart)
May 4, 2026	Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs	Federico
May 11, 2026	Screening new literature: Natural Language Autoencoders, Neural Geometry, and Manifold Steering	*
June 1, 2026, 15:00	Main topic: Position: Model Collapse Does Not Mean What You Think + Side reading: Subliminal learning	Anton
June 8, 2026	In-Training Defense against Emergent Misalignment in Language Models	Florian Mai (Lamarr Institute for ML/AI)
June 15, 2026	Subliminal Learning Is Steering Vector Distillation	Lukas
June 22, 2026	NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment	Gianluca
June 29, 2026	Manifold Steering	Johannes/Lukas
July 6, 2026	Solipsistic Superintelligence is Unlikely to be Cooperative	Filippo
July 13, 2026	Human-AI Complementarity: A Goal for Amplified Oversight	Shamim