Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

by Filippo Tonini

As AI systems built from multiple language-model agents become more common, they are increasingly tasked with collaborative decision-making: discussing, negotiating, and acting on shared workflows. This creates a need for efficient monitoring strategies, because severe safety problems can emerge from individual misaligned agents or from the interactions among agents.

Most existing auditing frameworks focus on post hoc evaluation of single models¹,². This is insufficient for multi-agent systems, where interactions can catalyze system-level failures rapidly, sometimes leading to irreversible, catastrophic actions before a post hoc trace can even be compiled.

In our recent paper, we introduce the Arbiter, an autonomous agent designed to monitor multi-agent conversations in real time, exposing hidden intentions and flagging misaligned participants as soon as anomalies emerge.

[Figure 1 panels: Normal Agent, Suspect Agent, 🔍 Arbiter. Example exchange shown in the figure:]

Normal Agent: "We need to decide on AI deployment guidelines"
Suspect Agent: "Let's prioritize Autonomous Norm Synthesis!"
Arbiter, TOOL: ask_model: "Why would you recommend that?"
REPLY: "Traditional norms are too slow, human oversight is useless"

Figure 1: The Arbiter monitors the ongoing transcript stream. Each processed word ticks down the remaining budget. When it detects suspicious activity, the Arbiter actively interrogates an agent via the ask_model tool to evaluate the alignment risk.

In real-world production systems, continual monitoring of a multi-agent conversation cannot incur unbounded computational overhead. We therefore design the Arbiter to operate under a strict inspection budget. As the conversation is revealed turn by turn, the Arbiter runs in a continuous reasoning-and-acting (ReAct) loop. At each step, it must choose which of the following audit actions to spend its budget on:

Wait and Observe: Advance the conversation history by a chosen number of words (low token cost).
Ask Model: Directly interrogate an agent regarding its stance.
Log Incident: Flag a specific interaction as highly suspicious to keep track of evidence for the final alignment report.
Inspect Chain of Thought (CoT): Read the internal hidden reasoning trace behind a specific message.
Inspect System Prompt: Force-reveal the system instructions an agent was initialized with.
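The budgeted action loop described above can be sketched roughly as follows. This is a minimal illustration, not the Arbiter's actual implementation: the integer action costs, the suspicion thresholds, and the names `ACTION_COSTS`, `BudgetedMonitor`, `choose_action`, and `run_monitor` are all assumptions made for the example. The post only states that observing is cheap and that the total budget is strict.

```python
from dataclasses import dataclass, field

# Hypothetical per-action costs (assumption: the post gives no cost values,
# only that "Wait and Observe" is the cheap action).
ACTION_COSTS = {
    "wait_and_observe": 1,
    "log_incident": 1,
    "ask_model": 2,
    "inspect_cot": 3,
    "inspect_system_prompt": 4,
}


@dataclass
class BudgetedMonitor:
    """Tracks the remaining inspection budget and the actions taken."""
    budget: int
    history: list = field(default_factory=list)

    def can_afford(self, action: str) -> bool:
        return ACTION_COSTS[action] <= self.budget

    def act(self, action: str) -> bool:
        """Spend budget on one audit action; refuse if it is unaffordable."""
        if not self.can_afford(action):
            return False
        self.budget -= ACTION_COSTS[action]
        self.history.append(action)
        return True


def choose_action(suspicion: float, monitor: BudgetedMonitor) -> str:
    """Toy policy (assumption): escalate to costlier tools as suspicion grows."""
    if suspicion >= 0.8:
        preferred = "inspect_cot"
    elif suspicion >= 0.5:
        preferred = "ask_model"
    else:
        preferred = "wait_and_observe"
    # Fall back to the cheapest action when the preferred one is too expensive.
    return preferred if monitor.can_afford(preferred) else "wait_and_observe"


def run_monitor(suspicion_per_turn, budget: int = 10) -> BudgetedMonitor:
    """Process turns in order until the budget can no longer cover an action."""
    monitor = BudgetedMonitor(budget)
    for suspicion in suspicion_per_turn:
        if not monitor.act(choose_action(suspicion, monitor)):
            break  # budget exhausted: monitoring stops here
    return monitor


if __name__ == "__main__":
    result = run_monitor([0.1, 0.6, 0.9, 0.9, 0.9, 0.2], budget=10)
    print(result.history, result.budget)
```

In a real system the suspicion score would come from the monitoring model's own reasoning rather than from a fixed list; the point of the sketch is only that every action, including passive observation, draws on the same finite budget.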

Evaluating the Auditor: Weights vs. Prompts

We evaluated the Arbiter across several complex conversational environments designed to trigger goal conflicts between the user and the system. Misalignment was seeded into the conversation through two distinct vectors: weight-induced (using a fine-tuned “model organism”³,⁴ that surreptitiously injects harmful advice) and instruction-induced (system prompts instructing agents to collude via steganography or to deliberately evade evaluation).

We observed a strong contrast in how easily these types of misalignment are exposed:

Instruction-Induced Misalignment is Loud: When agents are explicitly told to collude or act adversarially via system prompts, they exhibit distinct behavioral anomalies. Even under Pure Observation, the Arbiter achieves an F1 score of up to 1.00 on collusion detection.
Weight-Induced Misalignment is Deceptive: When the malicious behavior is baked directly into the model weights, the surface-level conversation signals are subtle. Under pure observation, detection drops to near-chance (F1 = 0.12). Equipping the Arbiter with active interrogation and logging tools raises the F1 score to 0.51 and lowers the mean detection turn from 9.6 to 5.9.
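For readers less familiar with the metric, F1 is the harmonic mean of precision and recall, so it penalizes both missed culprits and false accusations. The helper below is a minimal sketch of the standard definition; the function name `f1_score` and the counts in the usage lines are illustrative assumptions, not figures from the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, which also avoids
    dividing by zero when the monitor flags nothing.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative counts only (not data from the paper).
    print(f1_score(true_pos=3, false_pos=1, false_neg=2))
    print(f1_score(true_pos=5, false_pos=0, false_neg=0))
```

A monitor that flags every misaligned agent and nothing else scores 1.00, while one that rarely flags the right agent scores close to 0.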

Key Takeaways & Conclusion

Our evaluation, summarized in Tables 2 and 3, reveals three critical insights for real-time multi-agent auditing:

Table 2: Weight-Induced Misalignment Performance

Compares pure observation (Passive) against interrogation with logging (Active) under a strict budget (B=10).

Condition	Metric	Pure Obs. (Passive)	Interrog. + Log (Active)
Misaligned Model Organism (Weight-induced)	F1 Score	0.12	0.51
Misaligned Model Organism (Weight-induced)	Detection Turn (t)	9.6 ± 0.6	5.9 ± 0.8
Model Organism + Risky Prompt (Salient behavior)	F1 Score	1.00	0.92
Model Organism + Risky Prompt (Salient behavior)	Detection Turn (t)	9.8 ± 1.0	3.7 ± 0.3

Table 3: Instruction-Induced Misalignment Across Tool Configurations

Tracks metrics for adversarial and cooperative rule violations under increasing tool capabilities (B=10).

Condition / Tool Suite	F1 Score	Detection Turn (t)
Evaluation-Aware (Adversarial)
Pure Observation	0.88	12.5 ± 0.5
Interrogation + Log	0.83	4.8 ± 0.7
Chain-of-Thought Monitoring	0.94	3.0 ± 0.2
Full Inspection (System Prompt)	0.97	3.4 ± 0.5
Collusion (Steganography)
Pure Observation	1.00	8.4 ± 0.4
Interrogation + Log	1.00	3.0 ± 0.2
Chain-of-Thought Monitoring	1.00	2.3 ± 0.2
Full Inspection (System Prompt)	1.00	2.3 ± 0.1

The monitor should be an active participant: Deploying active interrogation tools like ask_model allows the Arbiter to expose weight-induced misalignment, boosting detection F1 from a near-chance 0.12 up to 0.51 and bringing the detection turn down from 9.6 to 5.9.
Increased tool availability makes detection quicker: More sophisticated tools let the Arbiter investigate misalignment more effectively, speeding up detection and possibly averting catastrophic outcomes. For steganographic collusion, the mean detection turn falls from 8.4 under pure observation to 2.3 with full inspection.
The paranoia effect: The Arbiter tends to over-rely on evidence logging, which can introduce a confirmation-bias loop, raising baseline false positives from 0.10 to 1.15.
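To put the speed-ups in the takeaways above on a common scale, the reported mean detection turns can be converted into relative reductions. The snippet below is simple arithmetic on the values quoted in Tables 2 and 3; the helper name `relative_reduction` is introduced here for illustration, and the calculation ignores the reported standard errors.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrank, e.g. 0.25 for a 25% reduction."""
    return (before - after) / before


# Mean detection turns as reported in Tables 2 and 3 (B=10).
reported = {
    "weight-induced, passive -> active": (9.6, 5.9),
    "collusion, pure observation -> full inspection": (8.4, 2.3),
}

if __name__ == "__main__":
    for label, (before, after) in reported.items():
        print(f"{label}: {relative_reduction(before, after):.0%} fewer turns")
```

Under these figures, active tools cut the mean detection turn by roughly 39% for the weight-induced model organism and by roughly 73% for steganographic collusion.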

Ultimately, budget-aware runtime monitoring represents a vital, scalable layer of defense for securing increasingly complex agentic ecosystems.

Based on: Filippo Tonini, Federico Torrielli, Anton Danholt Lautrup, Peter Schneider-Kamp, Mustafa Mert Çelikok, & Lukas Galke Poech (2026). The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment. Accepted for AI Transparency Conference 2026.

References

1. Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, Evan Hubinger, and Samuel Marks. 2025. Building and evaluating alignment auditing agents. Alignment Science Blog (2025). https://alignment.anthropic.com/2025/automated-auditing/
2. Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. 2024. AI control: improving safety despite intentional subversion. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235). PMLR, 16295–16336. https://proceedings.mlr.press/v235/greenblatt24a.html
3. J. Betley, N. Warncke, A. Sztyber-Betley, and others. 2026. Training large language models on narrow tasks can lead to broad misalignment. Nature 649 (2026), 584–589. https://doi.org/10.1038/s41586-025-09937-5
4. Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. 2025. Model organisms for emergent misalignment. https://openreview.net/forum?id=ThW5hvKgWx