Multi-agent coordination: 14 failure modes and how to avoid them
The 2025 paper by Cemri et al. identified 14 recurring failure modes in multi-agent systems. This article offers a diagnosis and compares three architecture families that guard against them.
Getting several AI agents to collaborate on a single business process is an open problem. Frameworks are multiplying (LangGraph, CrewAI, AutoGen, MetaGPT), approaches diverge, and production reveals failure modes that are not all documented. In March 2025, a paper by Cemri et al. published on arXiv became a reference by cataloguing 14 recurring failure modes in multi-agent systems.
This article summarizes those failure modes, compares the three architecture families found on the market, and explains why Conway alignment (structuring agents along the same boundaries as the organization, after Conway's law) is the most solid response.
The 14 failure modes in summary
- Error cascade: an upstream error propagates and amplifies downstream, with no one correcting it.
- Context loss: an agent doesn't know what another agent has already done and redoes the work.
- Infinite negotiation: two agents iterate without converging, burning tokens with no result.
- Collective hallucination: a fact invented by one agent is taken as given by the others.
- Role contradiction: two agents make conflicting decisions on the same object.
- Goal drift: the system forgets the original goal and focuses on local sub-goals.
- Circular-dependency deadlock: A waits on B, B waits on C, C waits on A, and the system freezes.
- Fuzzy attribution: you don't know which agent made which decision, so auditing becomes impossible.
- Orchestration overload: the central orchestrator becomes the bottleneck and slows everything down.
- Shared-resource contention: several agents write to the same registry without coordination.
- Excessive delegation: an agent delegates everything to others and stops doing anything itself.
- Sycophancy: agents validate each other without critical thinking, which produces a collective bias.
- Semantic drift: the meaning of a concept shifts gradually from agent to agent until the system becomes inconsistent.
- Capacity collapse: under certain volumes, the system loses any useful coordination.
Not all of these modes are equally likely. Error cascade, context loss and infinite negotiation are the three most frequent in production. Collective hallucination and role contradiction are the most dangerous when they occur.
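Some of the modes listed above can be detected mechanically. As a minimal sketch, assuming each agent can report which other agents it is currently waiting on, a depth-first search over that "wait-for" graph finds a circular-dependency deadlock before the system freezes. The function name `find_wait_cycle` and the dictionary format are illustrative assumptions, not part of any framework.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle of the wait-for graph as a list of agents, or None.

    `waits_on` maps an agent name to the agents it is blocked on
    (hypothetical format chosen for this sketch).
    """
    UNVISITED, ON_PATH, DONE = 0, 1, 2
    state = {agent: UNVISITED for agent in waits_on}
    path: List[str] = []  # agents on the current depth-first path

    def visit(agent: str) -> Optional[List[str]]:
        state[agent] = ON_PATH
        path.append(agent)
        for blocker in waits_on.get(agent, []):
            blocker_state = state.get(blocker, UNVISITED)
            if blocker_state == ON_PATH:
                # The blocker is already on the current path: the loop is closed.
                return path[path.index(blocker):] + [blocker]
            if blocker_state == UNVISITED:
                cycle = visit(blocker)
                if cycle:
                    return cycle
        path.pop()
        state[agent] = DONE
        return None

    for agent in list(waits_on):
        if state[agent] == UNVISITED:
            cycle = visit(agent)
            if cycle:
                return cycle
    return None


# The A -> B -> C -> A situation described in the list.
cycle = find_wait_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
```

A scheduler could run such a check whenever an agent registers a new wait and refuse, or escalate, any wait that would close a loop.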
Three architecture families, three trade-offs
In the 2026 market, multi-agent frameworks can be grouped into three families according to their coordination topology.
Star topology: a central orchestrator
A supervisor agent drives specialized agents. This is LangGraph in supervisor mode and most CrewAI implementations. Upside: strong predictability, the orchestrator has the global view. Downside: limited scalability, single point of failure, orchestration overload beyond five or six agents.
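The star topology can be sketched in a few lines. This is a deliberately framework-free illustration: it does not use the LangGraph or CrewAI APIs, the `Supervisor` class is hypothetical, and plain functions stand in for specialized agents. It shows both sides of the trade-off: every call passes through one object that sees everything, and nothing happens without it.

```python
from typing import Callable, Dict, List, Tuple

# A worker is any callable that takes a task description and returns a result.
# In a real system this would wrap an LLM-backed agent (assumption).
Worker = Callable[[str], str]


class Supervisor:
    """Central orchestrator: routes every task and keeps the global view."""

    def __init__(self, workers: Dict[str, Worker]) -> None:
        self.workers = workers
        # (skill, task, result) for every dispatched task, in order.
        self.log: List[Tuple[str, str, str]] = []

    def dispatch(self, skill: str, task: str) -> str:
        if skill not in self.workers:
            raise KeyError(f"no worker registered for skill {skill!r}")
        result = self.workers[skill](task)
        self.log.append((skill, task, result))
        return result


supervisor = Supervisor({
    "uppercase": lambda text: text.upper(),
    "reverse": lambda text: text[::-1],
})
draft = supervisor.dispatch("uppercase", "quote for client")
final = supervisor.dispatch("reverse", draft)
```

Because all traffic goes through `dispatch`, the log gives a complete trace of which worker handled which task, but the supervisor is also the single place where load accumulates.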
Graph topology: peer agents exchanging messages
All agents are at the same level, communicating by messages according to declared rules. AutoGen and some CrewAI configurations work this way. Upside: flexibility, no bottleneck. Downside: very exposed to the failure modes catalogued by Cemri, particularly semantic drift and infinite negotiation.
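In a peer topology, a common guard against infinite negotiation is a hard round budget: the exchange either converges within a fixed number of rounds or is handed to a human or a fallback rule. The sketch below assumes a numeric negotiation (for example a price) and two hypothetical agents modelled as functions. The names `negotiate`, `max_rounds` and `tolerance` are illustrative.

```python
from typing import Callable, Optional

# An agent receives the other side's last offer and returns its counter-offer.
Negotiator = Callable[[float], float]


def negotiate(
    buyer: Negotiator,
    seller: Negotiator,
    opening: float,
    max_rounds: int = 10,
    tolerance: float = 0.01,
) -> Optional[float]:
    """Alternate offers until two consecutive offers are within `tolerance`.

    Returns the agreed value, or None when the round budget is exhausted,
    in which case the caller should escalate instead of looping.
    """
    offer = opening
    for _ in range(max_rounds):
        counter = seller(offer)
        if abs(counter - offer) <= tolerance:
            return counter
        offer = buyer(counter)
        if abs(offer - counter) <= tolerance:
            return offer
    return None


def toward_100(last_offer: float) -> float:
    """Toy strategy: move halfway toward 100 on each turn."""
    return (last_offer + 100.0) / 2.0


agreed = negotiate(toward_100, toward_100, opening=0.0)

# Two agents that keep undoing each other's move never converge.
stuck = negotiate(lambda c: c - 10.0, lambda o: o + 10.0, opening=0.0, max_rounds=5)
```

The budget does not make agents agree, it only guarantees that a non-converging exchange ends with an explicit `None` rather than an unbounded token bill.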
Conway-aligned topology: organizational structure made executable
Agents are structured along the organizational or domain boundaries of the system. Communication happens through persisted typed events. This is the Swoft architecture, and it is also the one that enterprise neurosymbolic systems like FAOS converge towards. Upsides: strong business alignment, clear governance, drastically reduced failure modes. Downside: requires up-front domain modelling, which the more permissive frameworks do not.
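To make "persisted typed events" concrete, here is a minimal sketch. It is not the Swoft or FAOS implementation: the event classes, the `EventStore` name and the JSON encoding are assumptions made for illustration, and an in-memory list stands in for durable storage. The point it shows is that an agent can only publish a value whose type has been declared in advance, so free text cannot enter the channel.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class InvoiceIssued:  # hypothetical event of a "billing" domain
    invoice_id: str
    amount_cents: int


@dataclass(frozen=True)
class PaymentReceived:  # hypothetical event of a "payments" domain
    invoice_id: str
    amount_cents: int


# Registry of the event types agents are allowed to exchange.
EVENT_TYPES = {cls.__name__: cls for cls in (InvoiceIssued, PaymentReceived)}


class EventStore:
    """Append-only log; a list of JSON lines stands in for real persistence."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def append(self, event: object) -> None:
        name = type(event).__name__
        if EVENT_TYPES.get(name) is not type(event):
            raise TypeError(f"{name} is not a declared event type")
        self._lines.append(json.dumps({"type": name, "data": asdict(event)}))

    def replay(self) -> List[object]:
        """Rebuild typed events from the persisted records, in order."""
        events = []
        for line in self._lines:
            record = json.loads(line)
            events.append(EVENT_TYPES[record["type"]](**record["data"]))
        return events


store = EventStore()
store.append(InvoiceIssued(invoice_id="inv-1", amount_cents=1200))
store.append(PaymentReceived(invoice_id="inv-1", amount_cents=1200))
history = store.replay()
```

Any agent that joins later can call `replay()` and see exactly what has already happened, with the same field names and types as the agent that wrote the events.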
Why Conway is the most solid answer
Conway alignment structurally addresses most of the 14 failure modes. Error cascade is bounded by disjoint bounded contexts: an error in one domain does not contaminate the others. Context loss is eliminated by the shared memory of the Event Store. Role contradiction is mechanically impossible because bounded contexts are disjoint. Goal drift is caught by approval gates injected into the sagas.
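The claim about role contradiction rests on one invariant: every business object has exactly one owning context. A rough sketch of how that invariant could be enforced follows. The `ContextMap` class and the context and aggregate names are hypothetical, chosen only to illustrate the idea.

```python
from typing import Dict, Iterable


class ContextMap:
    """Maps each aggregate type to the single bounded context that owns it."""

    def __init__(self, contexts: Dict[str, Iterable[str]]) -> None:
        self._owner: Dict[str, str] = {}
        for context, aggregates in contexts.items():
            for aggregate in aggregates:
                if aggregate in self._owner:
                    # Overlap is rejected when the map is built, not at run time.
                    raise ValueError(
                        f"{aggregate!r} is claimed by both "
                        f"{self._owner[aggregate]!r} and {context!r}"
                    )
                self._owner[aggregate] = context

    def authorize(self, agent_context: str, aggregate: str) -> None:
        """Raise unless the agent's context owns the aggregate it wants to change."""
        owner = self._owner.get(aggregate)
        if owner != agent_context:
            raise PermissionError(
                f"context {agent_context!r} may not decide on {aggregate!r} "
                f"(owner: {owner!r})"
            )


contexts = ContextMap({"billing": ["Invoice"], "shipping": ["Parcel"]})
contexts.authorize("billing", "Invoice")  # allowed: billing owns Invoice
```

With such a check in front of every write, two agents from different contexts cannot both decide on the same object. Agents inside the same context still need their own coordination.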
Three technical conditions make Conway alignment operational. First condition: a metamodel that describes the bounded contexts and their relationships. Second condition: communication via typed and persisted events, never free text. Third condition: orchestration of long-running workflows by event-sourced sagas, with automatic compensation in case of partial failure.
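The third condition, sagas with automatic compensation, can be illustrated with a short sketch. Every name here (`Step`, `run_saga`, the step names) is an assumption made for the example, and a plain list of strings stands in for the event-sourced journal. Each step carries its own undo action. When a step fails, the steps already completed are compensated in reverse order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]  # undoes `action` after a later failure


def run_saga(steps: List[Step], journal: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Every outcome is appended to `journal`, which stands in for an event store.
    Returns True when all steps succeeded, False when the saga was rolled back.
    """
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            journal.append(f"{step.name}:failed:{exc}")
            for done in reversed(completed):
                done.compensate()
                journal.append(f"{done.name}:compensated")
            return False
        journal.append(f"{step.name}:done")
        completed.append(step)
    return True


# Toy workflow: reserve stock, then charge a card that is declined.
stock_reserved: List[str] = []


def charge_card() -> None:
    raise RuntimeError("card declined")


order_journal: List[str] = []
ok = run_saga(
    [
        Step("reserve", lambda: stock_reserved.append("item-1"),
             lambda: stock_reserved.remove("item-1")),
        Step("charge", charge_card, lambda: None),
    ],
    order_journal,
)
```

This sketch assumes compensations themselves succeed. A production saga would also need to handle a failing compensation, typically with retries and an alert.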
Topics covered
- Multi-agent
- Conway
- Coordination
- Cemri
- AI architecture
How Swoft turns this challenge into software
At Swoft, multi-agent coordination rests on three principles aligned with Conway's law. Here is how they translate into operational guarantees.
- 01
Disjoint bounded contexts
Each agent is attached to a bounded context of the DDD metamodel. Contexts are disjoint by construction: no role contradiction is possible, and an error in one domain cannot contaminate another.
- 02
Communication through typed events
Agents never talk to each other in free text. All communication between agents goes through typed events persisted in the Event Store. Semantic drift and infinite negotiation become structurally impossible.
- 03
Event-sourced sagas with compensation
Long-running workflows are orchestrated by event-sourced sagas. On partial failure, automatic compensation restores a consistent state. Error cascade is bounded, and the system never freezes on a circular deadlock.