MACI: Multi-LLM Agent Collaborative Intelligence
Foundations of Multi-LLM Agent Collaborative Intelligence
The original Multi-LLM Agent Collaboration framework, pioneered in 2022
The MACI framework—Multi-LLM Agent Collaborative Intelligence—represents the culmination of a multi-year research effort across eight foundational pillars:
- Pillar I: Consciousness Modeling – foundational theory for reflective and multi-perspective reasoning (2020–2022)
- Pillar II: Critical Reading Analysis (CRIT) – reasoning validation and argument structure analysis (2023)
- Pillar III: Collaborative Multi-LLM Reasoning (SocraSynth) – structured debate with dynamic modulation of linguistic behavior—including emphasis, tones, and language patterns—to mitigate hallucinations and reduce biases (2023)
- Pillar IV: Behavioral Emotion Analysis Modeling (BEAM) – modeling and modulating linguistic behaviors based on foundational emotions (2024)
- Pillar V: Entropy-Governed Multi-Agent Debate (EVINCE) – quantitative modulation of dialogue via information theory (2024)
- Pillar VI: Ethical Adjudication (DIKE–ERIS) – culturally aware decision-making through dual-agent governance (2024)
- Pillar VII: Adaptive Multi-Agent System for Mitigating LLM Planning Limitations – overcoming Gödelian self-validation barriers through iterative “think-validation” loops and addressing attention/context window narrowing (2024)
- Pillar VIII: Transactional Long-Lived Planning (SagaLLM) – persistent workflows with memory and rollback (2025)
These components collectively define a pathway toward reliable, trustworthy, and interpretable general intelligence through multi-agent LLM coordination.
Introduction
Building on foundational research from 2020 onward, the MACI framework—Multi-LLM Agent Collaborative Intelligence—offers a unified architecture for collaborative reasoning, ethical alignment, and persistent planning using large language models.
The following sections trace MACI’s evolution through eight major pillars, detailing the innovations that led from theoretical modeling to real-world orchestration.
2020–2022: Consciousness Modeling and Foundational Theory
MACI’s origins lie in the exploration of computational consciousness and multi-perspective reasoning, developed during the Stanford CS372 lecture series (2020–2022). These ideas culminated in the CoCoMo paper, which introduced a formal framework for modeling consciousness as emerging from automatic, reflective, pattern-matching processes—akin to an unconscious substrate. By probing what consciousness is and where it arises, CoCoMo articulated the transition from pattern recognition to conscious function, laying a theoretical foundation for AGI. This foundational work established the epistemic framework for collaborative LLM behavior and seeded the architectural vision that became MACI.
2023: SocraSynth, CRIT, and the Birth of Multi-Agent Reasoning
Building on this foundation, 2023 saw the release of SocraSynth, a framework for orchestrating structured debates between LLMs. Rather than relying on a monolithic oracle, SocraSynth coordinates agents—each with a distinct stance or specialization—to conduct Socratic, adversarial-collaborative dialogue.
It operates in two phases:
- Generative Phase: LLM agents engage in structured discussion, critiquing and refining each other’s outputs.
- Evaluative Phase: The discussion is examined through scenario testing, counterfactual reasoning, and coherence validation.
This interaction rests on four design principles (a minimal orchestration sketch follows the list):
- Adversarial Multi-LLM Integration: Reduces hallucinations and encourages deep insight via dialectic tension.
- Conditional Statistics Framework: Assigns distinct analytic or ethical roles to agents for output diversity.
- Adversarial Linguistic Calibration: Modulates LLMs' linguistic behaviors along a spectrum—from contentious to conciliatory—to surface overlooked assumptions, uncover novel perspectives, and stimulate richer reasoning through stylistic contrast.
- Reflexive Evaluation: Applies internal consistency checks to ensure robustness before convergence.
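To make the two-phase protocol concrete, here is a minimal orchestration sketch. It is an illustrative assumption rather than the published implementation: `query_llm` is a hypothetical stand-in for any chat-completion call, and the annealing schedule compresses SocraSynth's moderator-controlled contentiousness into a single parameter.

```python
# Minimal SocraSynth-style debate loop (illustrative sketch only;
# query_llm is a hypothetical stand-in for any chat-completion API).

def query_llm(agent: str, prompt: str) -> str:
    raise NotImplementedError("wire in your LLM provider here")

def debate(topic: str, rounds: int = 3, contentiousness: float = 0.9) -> dict:
    stances = {"A": "support", "B": "oppose"}
    transcript = []

    # Generative phase: agents critique and refine each other's arguments,
    # with contentiousness annealed from confrontational toward conciliatory.
    for r in range(rounds):
        level = contentiousness * (1 - r / max(rounds - 1, 1))
        for agent, stance in stances.items():
            prompt = (
                f"Debate topic: {topic}. Your stance: {stance}. "
                f"Contentiousness: {level:.1f} (1 = confrontational, 0 = conciliatory). "
                f"Recent arguments: {transcript[-2:]}. Rebut and refine."
            )
            transcript.append((agent, query_llm(agent, prompt)))

    # Evaluative phase: scenario testing, counterfactuals, coherence checks.
    verdict = query_llm(
        "moderator",
        f"Test this debate with counterfactuals and check its coherence: {transcript}",
    )
    return {"transcript": transcript, "verdict": verdict}
```

Annealing contentiousness toward zero mirrors SocraSynth's closing move from confrontational debate to conciliatory synthesis.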
In parallel, we introduced CRIT—the Critical-Reading Inquisitive Template—to model systematic evaluation of arguments. CRIT enhances reasoning coherence by issuing structured prompts to LLMs, guiding them through:
- Identifying key claims and evidence
- Validating reason-conclusion relationships
- Identifying missing counterarguments
- Evaluating citations and source trustworthiness
- Producing an overall document quality score with justification
CRIT was designed to embody the Socratic method, operationalizing critical thinking through layered inquiry. It complements SocraSynth by offering fine-grained argument inspection—essential for rigorous multi-agent evaluation.
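As an illustration, the sketch below condenses CRIT's layered inquiry into one structured prompt. The wording, step ordering, and 1-to-10 scale are assumptions of this sketch, not the published template.

```python
# A minimal CRIT-style prompt sketch (the wording and the 1-10 scale
# below are illustrative assumptions, not the published template).

CRIT_PROMPT = """You are a critical reader. For the document below:
1. State its main claim (conclusion).
2. List the reasons offered in support and the evidence for each.
3. Validate each reason-to-conclusion link; flag weak inferences.
4. Identify missing counterarguments the author should address.
5. Assess citations and source trustworthiness.
6. Give an overall quality score from 1 (weak) to 10 (strong), with justification.

Document:
{document}
"""

def crit_review(document: str, query_llm) -> str:
    """Run one CRIT pass; query_llm is any text-in/text-out LLM call."""
    return query_llm(CRIT_PROMPT.format(document=document))
```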
2024: EVINCE – Quantitative Control of Multi-Agent Dialogue
While SocraSynth laid the foundation for structured multi-LLM dialogue, it relied on qualitative tuning of agent interaction—moderating debates between contentious and conciliatory tones based on human intuition or prompt engineering. In contrast, EVINCE—Entropy and Variation IN Conditional Exchanges—introduces a principled, quantitative framework rooted in information theory.
EVINCE modulates multi-agent discourse using metrics such as:
- Cross Entropy – to detect redundant or low-diversity responses
- Mutual Information – to measure inter-agent influence and convergence
- Divergence Scores – to maintain productive disagreement without incoherence
It enables LLMs to adopt non-default stances by manipulating conditional likelihoods, moving beyond maximum-likelihood generation to goal-specific exploration. Through dual-entropy optimization, EVINCE ensures idea diversity during early exploration and focused consensus during final synthesis. It also integrates CRIT-based validation modules to assess coherence, argument structure, and missing counterpoints.
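The sketch below shows how such signals can be computed over two agents' output distributions. Treating Jensen-Shannon divergence as the "divergence score" and the example stance vectors are assumptions of this sketch, not specifics from the EVINCE paper.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q); used to flag redundant or low-diversity responses."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q + eps))

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

def js_divergence(p, q):
    """Symmetric divergence score: one plausible way to keep disagreement
    productive without incoherence (an assumption of this sketch)."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mutual_information(joint, eps=1e-12):
    """I(X; Y) over paired agent outputs: inter-agent influence and convergence."""
    joint = np.asarray(joint, float)
    px = joint.sum(axis=1, keepdims=True)  # marginal of agent A
    py = joint.sum(axis=0, keepdims=True)  # marginal of agent B
    return np.sum(joint * np.log((joint + eps) / (px @ py + eps)))

# Example: two agents' stance distributions over four candidate answers.
agent_a = [0.70, 0.15, 0.10, 0.05]
agent_b = [0.30, 0.40, 0.20, 0.10]
print(cross_entropy(agent_a, agent_b))  # redundancy / mismatch signal
print(js_divergence(agent_a, agent_b))  # disagreement level to modulate

# Independence baseline: the outer product yields mutual information near 0.
print(mutual_information(np.outer(agent_a, agent_b)))
```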
2024: BEAM – Behavioral Emotion Analysis Modeling
As SocraSynth matured, we encountered the challenge of managing emotional dynamics in LLM-driven debates—particularly, how to model and regulate "contentiousness." To address this, we introduced the Behavioral Emotion Analysis Model (BEAM), which maps linguistic behaviors onto basic emotion spectra, such as the axis running from "hate" to "love." BEAM allows for the modulation of emotional tone in text, enabling more effective control of argument style and intensity.
This framework is grounded in three core steps:
- Emotion Definition: Identifying and isolating basic emotions that can influence language generation and ethical outcomes.
- Quantification and Ethics Mapping: Training models on emotionally varied text samples to learn and modulate emotional expression within ethical boundaries.
- Testing and Adaptation: Deploying BEAM in real-world scenarios (e.g., multimedia generation) to evaluate emotional consistency, safety, and user reception.
BEAM integrates insights from cognitive psychology and Bayesian conditioning to demonstrate how in-context learning can shift the emotional valence of LLM outputs. It laid the foundation for the ethical safeguards in DIKE–ERIS by enabling emotion-sensitive editing, such as transforming hate speech into neutral statements by modulating the underlying affective cues, and it paved the way for the dynamic linguistic calibration used in EVINCE.
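A minimal sketch of BEAM-style emotion modulation via in-context conditioning appears below. The five-point hate-to-love gradation and the prompt wording are illustrative assumptions, and `query_llm` again stands in for any LLM call.

```python
# Sketch of BEAM-style emotion modulation via in-context conditioning.
# The five gradations on a hate-to-love axis are an illustrative assumption.

GRADATIONS = ["hateful", "hostile", "neutral", "warm", "loving"]

def modulate(text: str, target: str, query_llm) -> str:
    """Rewrite `text` so its affective tone matches `target`, keeping the
    factual content fixed; query_llm is any text-in/text-out LLM call."""
    assert target in GRADATIONS
    prompt = (
        f"Rewrite the passage below so its emotional tone is '{target}' on a "
        f"hate-to-love spectrum {GRADATIONS}, preserving all factual content "
        f"and staying within ethical boundaries (no slurs, no threats).\n\n"
        f"Passage: {text}"
    )
    return query_llm(prompt)

# e.g. modulate(flagged_comment, "neutral", query_llm) implements the
# hate-speech-to-neutral transformation described above.
```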
2024: DIKE–ERIS – Ethical Adjudication via Deliberative Dual-Agent Design
While EVINCE governs epistemic diversity and convergence, DIKE–ERIS addresses the challenge of ethical alignment. Inspired by constitutional design, it implements a deliberative framework composed of two specialized agents:
- DIKE – Advocates for fairness, regulation, and normative consistency
- ERIS – Challenges assumptions, introduces cultural context, and promotes dissent
Rather than hardcoding moral rules or relying on static alignment scores, DIKE–ERIS ensures ethical soundness through internal deliberation. Outputs are reviewed from multiple perspectives and scored based on fairness, inclusiveness, and context sensitivity. The result is a dynamic ethical process—not a fixed standard—adaptable to domain, audience, and evolving cultural norms.
This adjudicative model extends MACI’s capacity from “what is true?” to “what is just?”, enabling AI agents to deliberate not only toward correctness, but toward socially responsible conclusions.
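The dual-agent review loop can be sketched as follows. The role prompts paraphrase the descriptions above, while the 1-to-5 rubric, the round count, and the `query_llm` helper are assumptions of this sketch.

```python
# Dual-agent adjudication sketch; the role prompts and 1-5 rubric are
# illustrative assumptions layered on the DIKE-ERIS description above.

DIKE_ROLE = ("You advocate fairness, regulation, and normative consistency. "
             "Review the output and score fairness, inclusiveness, and "
             "context sensitivity from 1 to 5 each, with reasons.")
ERIS_ROLE = ("You challenge assumptions, surface cultural context, and argue "
             "the dissenting view. Review the output the same way.")

def adjudicate(output: str, query_llm, rounds: int = 2) -> str:
    """Iterate DIKE and ERIS reviews, then reconcile them into a verdict."""
    reviews = []
    for _ in range(rounds):
        dike = query_llm(f"{DIKE_ROLE}\nPrior reviews: {reviews}\nOutput: {output}")
        eris = query_llm(f"{ERIS_ROLE}\nPrior reviews: {reviews}\nOutput: {output}")
        reviews.append((dike, eris))
    # A final synthesis pass reconciles the two perspectives.
    return query_llm(f"Reconcile these reviews into a final ethical verdict: {reviews}")
```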
2025: SagaLLM – Persistent Memory and Transactional Planning
In 2025, MACI matured into an execution-ready platform, capable of sustaining long-term planning, persistent memory, and failure recovery. We introduced SagaLLM, a transactional orchestration system based on the Saga pattern from distributed databases.
SagaLLM equips multi-agent LLM systems with:
- Spatial-Temporal Checkpointing: For reversible state recovery and reliable execution.
- Inter-Agent Dependency Management: Tracks interlocked tasks and alerts agents to contradictions.
- Independent Critical Validation: Uses CRIT-like lightweight agents to verify decisions at key junctures.
These mechanisms collectively enable:
- Transactional Consistency with compensatory rollback logic
- Robust Validation against drift and incoherence
- Scalable Performance through modular planning
- Integrated Intelligence grounded in both language and system-level reliability
SagaLLM closes the gap between AGI vision and real-world execution, allowing collaborative systems to function reliably over extended durations and across high-stakes domains such as medicine, law, and infrastructure.
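The Saga pattern underlying SagaLLM is compact enough to sketch directly: every step registers a compensating action, and a failure triggers rollback of the completed steps in reverse order. The executor below illustrates that pattern generically; it is not SagaLLM's actual API.

```python
# Generic saga executor sketch: each step registers a compensating action;
# on failure, completed steps are rolled back in reverse order. This mirrors
# the Saga pattern SagaLLM builds on, not SagaLLM's actual interface.

from typing import Callable, List, Tuple

Step = Tuple[Callable[[], object], Callable[[], None]]  # (action, compensation)

def run_saga(steps: List[Step]):
    completed: List[Callable[[], None]] = []
    checkpoints = []  # checkpointed results for reversible state recovery
    try:
        for action, compensate in steps:
            result = action()
            checkpoints.append(result)  # persist state at each juncture
            completed.append(compensate)
        return checkpoints
    except Exception:
        for compensate in reversed(completed):
            compensate()  # compensatory rollback logic
        raise

# Usage: book a flight, then a hotel; if the hotel step failed, the
# registered compensation would cancel the flight.
run_saga([
    (lambda: print("book flight") or "F123", lambda: print("cancel flight")),
    (lambda: print("book hotel") or "H456", lambda: print("cancel hotel")),
])
```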
Beyond 2025: Toward General Intelligence through MACI
Some AI leaders, including Yann LeCun, have argued that LLMs cannot achieve AGI due to their limitations in memory, planning, and grounding. MACI does not dispute these critiques—it answers them by orchestrating collaborative, role-specific, critically validated agents in a structured framework.
MACI is deeply interdisciplinary, drawing on:
- Cognitive Psychology: for modeling attention, memory, and self-awareness
- Philosophy: for truth, meaning, and epistemology
- Physics: for entropy-informed reasoning and system equilibrium
- Computer Science: for orchestration, retrieval, and planning
Modern LLMs already demonstrate multimodal capabilities. Under MACI, these capabilities are harmonized—enabling agents to integrate data from telescopes, microscopes, quantum sensors, and human conversations. Intelligence, in this setting, is no longer a property of an individual model—it emerges from dynamic orchestration, critical reflection, and epistemic balance.
MACI: Intelligence through Orchestration, Not Isolation
The MACI project continues not merely as a technical initiative, but as a philosophical and scientific journey. It redefines artificial general intelligence not as the product of one supermodel, but as the collective potential of systems that:
- Balance divergence and convergence
- Reason across modalities and cultures
- Reconcile disagreement through discourse
Please visit this site regularly for updates as we refine MACI, contribute to open research, and build toward a responsible and collaborative AGI future.
Publications in 2025
4. Persistent Memory for Long-Lived Workflow Planning
SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning, March 2025 (under review).
3. Benchmark for Evaluating Multi-Agent Systems
REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems, February 2025 (under review).
2. Adaptive Multi-Agent System Design and Experiments
An Adaptive Multi-Agent System for Mitigating LLM Planning Limitations, Edward Y. Chang, February 2025 (under review).
1. *Multi-Agent System for Planning
Multi-Agent Collaborative Intelligence for Adaptive Reasoning and Temporal Planning, Edward Y. Chang, January 2025 (under review).
Publications in 2024
1. *New Approach to AI Ethical Alignment, as RLHF Fails
A Three-Branch Checks-and-Balances Framework for Context-Aware Ethical Alignment of Large Language Models (July 2024), NeurIPS SafeGenAI, December 2024.
2. *Textbook, The Path to Artificial General Intelligence
Multi-LLM Agent Collaborative Intelligence: The Path to Artificial General Intelligence, SocraSynth.com, March/October 2024.
3. The Wisdom of Large Language Models
Unlocking the Wisdom of Large Language Models: An Introduction to The Path to Artificial General Intelligence, SocraSynth.com, June/October 2024.
4. *Establishing a theoretical foundation for multi-LLM joint prediction
EVINCE: Optimizing Adversarial LLM Dialogues via Conditional Statistics and Information Theory, Edward Y. Chang, February/August 2024.
5. Emotion Analysis for Linguistic Behavior Modeling
Behavioral Emotion Analysis Model for Large Language Models, Edward Y. Chang, IEEE MIPR (invited paper), June 2024.
6. Introducing a novel LLM architecture in which LLM collaborative intelligence can debias news articles
Uncovering Biases with Reflective Large Language Models, Edward Y. Chang, February 2024.
7. Using Multiple Collaborative LLMs to Suggest Sales Strategies and Execution Steps to Maximize Profit and Customer Satisfaction
Corporate Sales Planning Using Multiple Collaborative LLMs, February 2024, in collaboration with TrendMicro.
Publications in 2023
1. SocraFin: Leveraging SocraSynth for Enhanced Financial Planning and Analysis (FP&A)
SocraFin: Conditional Statistics for Financial Planning and Analysis, November 2023, in collaboration with AiBanker.
2. SocraHealth: Utilizing SocraSynth to Improve Disease Diagnosis
SocraHealth: Enhancing Medical Diagnosis and Correcting Historical Records, Jocelyn Chang and Edward Chang, October 2023.
The 10th International Conference on Computational Science and Computational Intelligence, December 2023.
This study introduces SocraHealth, an innovative method using Large Language Models (LLMs) for medical diagnostics. By engaging LLM-based agents in structured debates, SocraHealth not only refines diagnoses but also corrects historical record inaccuracies, utilizing patient data effectively. The case study, featuring GPT-4 and Bard across two experiments, showcases this approach's success in producing logical, hallucination-free debates. Demonstrating a significant advancement over traditional diagnostic techniques, SocraHealth highlights the transformative power of LLMs in healthcare, especially in enhancing diagnostic accuracy and rectifying past diagnostic errors.
3. SocraPlan: Leveraging SocraSynth for Advanced Corporate Sales Planning
Multi-Agent Reasoning with Large Language Models for Effective Corporate Planning, in collaboration with S. Tsao at TrendMicro, October 2023.
The 10th International Conference on Computational Science and Computational Intelligence, December 2023.
Large Language Models (LLMs) have demonstrated significant capabilities in natural language processing tasks. In this paper, we explore the application of LLMs within a business context. Specifically, we employ LLMs to devise a sales strategy geared towards maximizing customer values (benefits and satisfaction). This sales plan encompasses five iterative stages: market landscape survey, customer profiling, product usage analysis, sales strategy formulation, and crafting persuasive pitches and materials. We leverage LLMs to supplement the limited data available to the company, aiming to enhance the efficacy of each stage and optimize KPIs, including the value-oriented sales conversion and profitability. Due to confidentiality and trade secret concerns, we blend artificial data with genuine data to ensure customer anonymity and protect sales playbooks. Despite these precautions, we effectively demonstrate our methodology of harnessing LLMs to refine the sales planning procedure.
4. SocraSynth: Multi-LLM Reasoning with Conditional Statistics
SocraSynth: Multi-LLM Reasoning with Conditional Statistics, September 2023 (revised January 2024).
Large language models (LLMs), though promising, face criticisms for biases, hallucinations, and a lack of reasoning capability. This paper introduces SocraSynth, a multi-LLM agent reasoning platform developed to mitigate these issues. SocraSynth utilizes conditional statistics and systematic context enhancement through continuous arguments, alongside adjustable debate contentiousness levels. The platform typically involves a human moderator and two LLM agents representing opposing viewpoints on a given subject. SocraSynth operates in two main phases: knowledge generation and reasoning evaluation. In the knowledge generation phase, the moderator defines the debate topic and contentiousness level, prompting the agents to formulate supporting arguments for their respective stances. The reasoning evaluation phase then employs Socratic reasoning and formal logic principles to appraise the quality of the arguments presented. The dialogue concludes with the moderator adjusting the contentiousness from confrontational to collaborative, gathering final, conciliatory remarks to aid in human reasoning and decision-making. Through case studies in three distinct application domains, this paper showcases SocraSynth's effectiveness in fostering rigorous research, dynamic reasoning, comprehensive assessment, and enhanced collaboration. This underscores the value of multi-agent interactions in leveraging LLMs for advanced knowledge extraction and decision-making support.
5. *Examining GPT-4's Capabilities and Enhancement by SocraSynth
Examining GPT-4's Capabilities and Enhancement with SocraSynth, July 2023.
The 10th International Conference on Computational Science and Computational Intelligence (CSCI'23), December 2023.
(Top-1% assessed paper, over 12,000 reads on ResearchGate since July 2023)
In this work, we investigate the capabilities and limitations of GPT-4, a large-scale, polydisciplinary, and polymodal language model. Despite its accomplishments across a range of tasks, GPT-4 exhibits key shortcomings, particularly in areas of reasoning and ethics, manifesting in tendencies like hallucination, imitation rather than understanding, and a lack of fact-checking ability. We propose several remedies to address these challenges. First, we introduce the CoCoMo framework, designed to incorporate reasoning into AI systems using Socratic methods and prompt ensembles. Second, we advocate for the use of demonstrations as a means to imbue AI agents with ethical behavior, building upon our experience with the Noora chatbot project. Lastly, we recommend adopting a more comprehensive approach to training ensemble members of GPT-4, shifting from an exclusive focus on optimizing for cross-entropy loss. Our end goal is the development of AI systems that not only enhance human abilities but also align with human values, thereby contributing constructively to society.
6. Discovering Insights Beyond the Known: A Dialogue Between GPT-4 Agents from Adam and Eve to the Nexus of Ecology, AI, and the Brain
Discovering Insights Beyond the Known: A Dialogue Between GPT-4 Agents, August 2023 (the most-read paper on ResearchGate in August 2023).
Human knowledge, vast as it is, often falls short in grasping intricate interdisciplinary domains fully. In contrast, foundation models like GPT-4, endowed with extensive multidisciplinary knowledge, can potentially bridge this gap. Significantly, we leverage the vast expanses of GPT-4's knowledge, banking on its ability to frame questions that might elude human intuition, thus paving the way for the emergence of fresh insights and potentially novel knowledge. In this study, we convened a unique committee comprising a moderator (the authors) and two GPT-4 agents. The dialogue is ignited by the ancient narrative of Adam and Eve, setting the stage for a rich exchange between the GPT-4 agents. This conversation derives from the age-old tale, as the agents delve into three intertwined domains: the significance of myths in ecological interpretation, the intricate ethical and philosophical quandaries surrounding AI, and the enigmatic realm of the human brain as complemented by technology. This dialogue not only unveils captivating insights but also underscores the indispensable value of interdisciplinary exchanges. Foundation models, as demonstrated, can catalyze such dialogues, equipping us to traverse expansive knowledge landscapes and explore domains previously beyond human comprehension.
7. *Using the Socratic Method to Facilitate Critical Thinking for Fact Checking
Prompting Large Language Models With the Socratic Method, IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), March 2023 (best presentation in the AI track).
This paper presents a systematic approach to using the Socratic method in developing prompt templates that effectively interact with large language models, including GPT-3. Various methods are examined, and those that yield precise answers and justifications and those that foster creativity and imagination to enhance creative writing are identified. Techniques such as definition, elenchus, dialectic, maieutics, generalization, and counterfactual reasoning are discussed for their application in engineering prompt templates and their connections to inductive, deductive, and abductive reasoning. Through examples, the effectiveness of these dialogue and reasoning methods is demonstrated. An interesting observation is made that when the task's goal and user intent are conveyed to GPT-3 via ChatGPT before the start of a dialogue, the large language model seems to connect to the external context expressed in the intent and perform more effectively.
8. Consciousness Modeling for Reasoning
CoCoMo: Computational Consciousness Modeling for Generative and Ethical AI, February 2023.
The CoCoMo model proposes a computational solution to the challenge of incorporating ethical and emotional intelligence considerations into AI systems, with the aim of creating AI agents that combine knowledge with compassion. To achieve this goal, CoCoMo prioritizes fairness, beneficence, non-maleficence, empathy, adaptability, transparency, and critical and exploratory thinking abilities. The model employs consciousness modeling, reinforcement learning, and prompt template formulation to support these desired traits. By incorporating ethical and emotional intelligence considerations, a generative AI model can potentially lead to improved fairness, reduced toxicity, and increased reliability.
9. CRIT: An Inquisitive Prompt Template for Critical Reading
CRIT: An Inquisitive Prompt Template for Critical Reading, January 2023.
Critical reading, a pivotal element of education, necessitates active engagement with texts to delve deeper and form informed assessments about their validity and credibility. We introduce CRIT, a comprehensive prompt template designed to streamline this process. CRIT leverages pre-trained language models to critically evaluate texts, extracting their conclusions and supportive reasons, scrutinizing reason-to-claim arguments, suggesting counterarguments, and offering an overarching quality assessment. Notably, CRIT also possesses the capability to conduct fact-checking on the outputs of foundation models, ensuring accuracy and trustworthiness. With its structured and recursive prompts, CRIT facilitates a comprehensive and logical text analysis, providing insights into argument validity and source reliability. This makes CRIT an invaluable asset for K-12 education, fostering critical reading skills, and refining articles before public examination.
10. Applying AI to Precision Medicine
Knowledge-Guided Data-Centric AI in Healthcare: Progress, Shortcomings, and Future Directions, December 2022; Chapter 2 in Artificial Intelligence, Machine Learning, and Deep Learning in Precision Medicine in Liver Diseases (ISBN: 9780323991360), Elsevier, August 22, 2023.
The success of deep learning is largely due to the availability of large amounts of training data that cover a wide range of examples of a particular concept or meaning. In the field of medicine, having a diverse set of training data on a particular disease can lead to the development of a model that is able to accurately predict the disease. However, despite the potential benefits, there have not been significant advances in image-based diagnosis due to a lack of high-quality annotated data. This article highlights the importance of using a data-centric approach to improve the quality of data representations, particularly in cases where the available data is limited. To address this "small-data" issue, we discuss four methods for generating and aggregating training data: data augmentation, transfer learning, federated learning, and GANs (generative adversarial networks). We also propose the use of knowledge-guided GANs to incorporate domain knowledge in the training data generation process. With the recent progress in large pre-trained language models, we believe it is possible to acquire high-quality knowledge that can be used to improve the effectiveness of knowledge-guided generative methods.
About Us
SocraSynth orchestrates a symposium of GAI agents, facilitating investigative dialogues to reveal knowledge and insights previously elusive to humans. The concept of SocraSynth was first introduced by Dr. Edward Y. Chang, an ACM Fellow. Chang held the position of Director at Google Research from 2006 to 2012 and has been serving as an adjunct professor in the Computer Science department at Stanford University since 2019.