Detailed Analysis of the Self-Reflection Mechanism
I. Core Constituent Models of Self-Reflection
Self-reflection (Reflexion) is composed of three key models, each undertaking a distinct function:
Actor
- Function: The LLM that interacts with the environment, generating the actions and intermediate text that make up the trajectory which the other two models evaluate and reflect on.
Evaluator
- Function: Quantitatively evaluates the agent’s output and provides feedback in the form of reward scores.
- Input: The trajectory generated by the agent (short-term memory, i.e., records of intermediate steps in task execution).
- Output: Reward scores customized to the task type, for example exact-match scores for reasoning tasks, pre-defined heuristic checks for sequential decision-making tasks, and unit-test pass rates for programming tasks (see the sketch after this list).
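Below is a minimal sketch of what such task-specific evaluators could look like. The concrete scoring rules (exact match, unit-test pass rate, a repetition heuristic) are illustrative stand-ins, not the original implementation:

```python
from typing import Callable, List

def exact_match_reward(answer: str, reference: str) -> float:
    """Reasoning tasks (e.g. HotPotQA): 1.0 if the answer exactly matches the reference."""
    return float(answer.strip().lower() == reference.strip().lower())

def unit_test_reward(code: str, tests: List[Callable[[dict], bool]]) -> float:
    """Programming tasks (e.g. HumanEval): fraction of unit tests the generated code passes."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # illustration only; real use needs sandboxing
    except Exception:
        return 0.0
    passed = sum(1 for test in tests if test(namespace))
    return passed / len(tests) if tests else 0.0

def heuristic_reward(trajectory: List[str], max_repeats: int = 3) -> float:
    """Decision-making tasks (e.g. AlfWorld): penalize trajectories stuck repeating one action."""
    if not trajectory:
        return 0.0
    most_repeated = max(trajectory.count(step) for step in set(trajectory))
    return 0.0 if most_repeated > max_repeats else 1.0
```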
Self-Reflection Model
- Role: Played by a large language model (LLM) that generates verbal reinforcement cues to drive the agent’s self-optimization.
- Working mechanism: Takes the evaluator’s reward score together with the current trajectory and accumulated memory, and produces specific natural-language feedback on what went wrong and how the next attempt should differ; this feedback is written to long-term memory.
- Key value: Converts a sparse scalar reward into explicit, actionable guidance, giving the agent a much richer learning signal than the score alone (see the sketch below).
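A minimal sketch of this step, assuming a generic `llm` text-completion callable and an illustrative prompt wording (both are assumptions, not the original prompt):

```python
from typing import List

REFLECTION_PROMPT = """You attempted the task below and received a low reward.
Task: {task}
Trajectory:
{trajectory}
Reward: {reward}
In a few sentences, diagnose what went wrong and state a concrete plan for the next attempt."""

def self_reflect(llm, task: str, trajectory: List[str], reward: float,
                 long_term_memory: List[str], threshold: float = 1.0) -> None:
    """Generate verbal feedback on a failed trial and append it to long-term memory."""
    if reward >= threshold:          # successful trials need no reflection
        return
    prompt = REFLECTION_PROMPT.format(task=task,
                                      trajectory="\n".join(trajectory),
                                      reward=reward)
    reflection = llm(prompt)         # `llm` is an assumed text-completion callable
    long_term_memory.append(reflection)
```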
II. Iterative Optimization Process of Self-Reflection
The core execution steps of self-reflection form a closed loop that continuously improves the agent’s behavior:
- a) Task definition: Clarifies goals (e.g., decision-making, reasoning, programming) and sets evaluation criteria.
- b) Trajectory generation: The agent executes the task and records intermediate steps and outputs (short-term memory).
- c) Evaluation: The evaluator generates reward scores based on the trajectory to quantify task performance.
- d) Self-reflection execution: The self-reflection model analyzes the trajectory and its reward score, generates verbal feedback on the failure, and appends it to long-term memory.
- e) Next trajectory generation: The agent adjusts its strategy based on the reflection and executes a new attempt; steps b) through e) repeat until the task succeeds or the trial budget is exhausted (a sketch of the full loop follows this list).
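The closed loop can be summarized in a short sketch. The `actor`, `evaluator`, and `reflector` callables and the trial budget are assumptions made for illustration:

```python
from typing import Callable, List, Tuple

def reflexion_loop(task: str,
                   actor: Callable[[str, List[str]], List[str]],       # (task, memory) -> trajectory
                   evaluator: Callable[[List[str]], float],            # trajectory -> reward score
                   reflector: Callable[[str, List[str], float], str],  # (task, trajectory, reward) -> reflection
                   max_trials: int = 5,
                   success_reward: float = 1.0) -> Tuple[List[str], float]:
    """Run the evaluate-reflect-retry closed loop until success or the trial budget is spent."""
    long_term_memory: List[str] = []
    for trial in range(max_trials):
        trajectory = actor(task, long_term_memory)   # b) trajectory generation (short-term memory)
        reward = evaluator(trajectory)               # c) evaluation
        if reward >= success_reward:
            return trajectory, reward                # task solved, stop iterating
        reflection = reflector(task, trajectory, reward)  # d) self-reflection execution
        long_term_memory.append(reflection)          # memory component stores the lesson
    return trajectory, reward                        # e) best effort after the final trial
```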
III. Expansion of the ReAct Framework by Self-Reflection
Self-reflection (Reflexion) adds three new components to the ReAct framework:
- Self-evaluation component: Generates quantitative feedback through the evaluator rather than relying on a single fixed rule.
- Self-reflection component: An LLM-driven mechanism that generates feedback dynamically, enabling adaptive strategy adjustment.
- Memory component: Stores the reflective text produced across trials as long-term memory, supplementing the short-term trajectory memory; the stored lessons are injected into later prompts so they persist between attempts (see the sketch below).
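One plausible way the memory component could be folded into the acting prompt is shown below; the prompt wording and the cap on retained reflections are illustrative assumptions:

```python
from typing import List

def build_actor_prompt(task: str,
                       long_term_memory: List[str],
                       max_reflections: int = 3) -> str:
    """Prepend the most recent reflections to the task so the actor can avoid past mistakes."""
    recent = long_term_memory[-max_reflections:]          # keep the prompt bounded
    memory_block = "\n".join(f"- {r}" for r in recent)
    if not memory_block:
        return f"Task: {task}\nThink step by step, then act."
    return (f"Lessons from previous failed attempts:\n{memory_block}\n\n"
            f"Task: {task}\nThink step by step, then act.")
```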
IV. Experimental Validation and Performance Improvement
Self-reflection has demonstrated significant advantages in various tasks, with specific data as follows:
Sequential Decision-Making Task (AlfWorld)
- Scenario: 134 decision-making tasks requiring the agent to interact with the environment and select strategies.
- Result: The ReAct + Reflexion combination completed 130 of the 134 tasks, far exceeding the pure ReAct framework, an accuracy improvement of approximately 7.5% over the original baseline.
- Key strategy: Combines heuristic rules with GPT-based self-evaluation (a binary success/failure classification) to judge trajectories and optimize decision paths (see the sketch below).
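A minimal sketch of such a combined check, assuming a hypothetical `llm_classify` callable and illustrative heuristic thresholds:

```python
from typing import Callable, List

def trajectory_failed(trajectory: List[str],
                      llm_classify: Callable[[str], str],
                      max_steps: int = 30,
                      max_repeats: int = 3) -> bool:
    """Flag a failed AlfWorld-style trajectory using heuristics first, then an LLM binary check."""
    # Heuristic rules: too many steps, or the same action repeated in a loop.
    if len(trajectory) > max_steps:
        return True
    if trajectory and max(trajectory.count(step) for step in set(trajectory)) > max_repeats:
        return True
    # Fall back to GPT-style self-evaluation: ask the model for a binary verdict.
    verdict = llm_classify(
        "Did the following trajectory accomplish its goal? Answer YES or NO.\n"
        + "\n".join(trajectory)
    )
    return verdict.strip().upper().startswith("NO")
```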
Question Reasoning Task (HotPotQA)
- Scenario: Multi-hop reasoning problems that require integrating information from multiple sources to reach a conclusion.
- Result: The Reflexion + CoT (Chain of Thought) combination significantly outperforms CoT alone or CoT with only episodic memory, especially in few-shot learning scenarios.
Python Programming Task (HumanEval)
- Scenario: Generating Python code that meets the stated requirements, evaluated for correctness and efficiency.
- Result: The self-reflection mechanism helps the agent surpass baseline methods in code generation quality, particularly when handling complex logic such as algorithm optimization and boundary conditions (a sketch of test-driven feedback follows).
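One way such feedback could be gathered is to run the candidate code against assert-style unit tests and hand the first failure message to the self-reflection model. The helper below is an illustrative assumption, not the benchmark’s actual harness:

```python
import traceback
from typing import List, Tuple

def run_candidate(code: str, test_snippets: List[str]) -> Tuple[float, str]:
    """Execute candidate code against assert-style tests and capture the first failure message.

    Returns (pass_rate, feedback); the feedback string can be handed to the
    self-reflection model so the next attempt fixes the failing boundary case.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)                  # illustration only; sandbox in real use
    except Exception:
        return 0.0, traceback.format_exc(limit=1)
    passed, feedback = 0, ""
    for snippet in test_snippets:              # e.g. "assert add(2, 3) == 5"
        try:
            exec(snippet, namespace)
            passed += 1
        except AssertionError:
            feedback = feedback or f"Failed test: {snippet}"
        except Exception:
            feedback = feedback or f"Error while running: {snippet}\n{traceback.format_exc(limit=1)}"
    return (passed / len(test_snippets) if test_snippets else 0.0), feedback
```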
V. Summary of Core Values
Through the closed-loop mechanism of “evaluation-reflection-memory”, self-reflection enables LLM agents to possess human-like “learning from mistakes” capabilities, specifically manifested in:
- Rapid iteration: Reduces trial-and-error costs, surpassing traditional baseline methods within a few learning steps.
- Generalization ability: Transfers experience across tasks, such as reusing strategies from decision-making to programming tasks.
- Interpretability: Clarifies optimization directions through linguistic feedback, making it more understandable and debuggable than black-box models.