In the AI Safety & Interpretability Lab at SDU, we investigate how and why advanced AI systems behave the way they do and how we can use these insights to make them more reliable, interpretable, and safe.
Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment, by Filippo Tonini
As AI systems built from multiple language-model agents become more common, they are increasingly tasked with collaborative decision-making: discussing,...
The Auditability-Accuracy Tradeoff, by Lukas Galke Poech
Monitoring reasoning traces of large language models is currently one of the most promising methods to detect when language models do not behave...
Contact us via galke@imada.sdu.dk