AgentRefine: Enhancing Agent Generalization through Refinement Tuning

1Beijing University of Posts and Telecommunications, Beijing, China
2Meituan, Beijing, China

*Equal Contribution
Corresponding authors

Introduction

We introduce AgentRefine, an agent synthesis framework that enables models to learn from observations within trajectories to correct their own errors. AgentRefine significantly outperforms state-of-the-art agent tuning works in terms of generalization capabilities across diverse agent tasks. Our findings establish a relationship between agent generalization and self-improvement, offering a new paradigm for future research.
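As a rough illustration of what "learning from observations within trajectories" can look like, the Python sketch below shows one possible way to represent a refinement-style training trajectory: the agent takes a wrong action, reads the environment's observation, and revises its decision in the next turn. The field names and structure here are our own illustrative assumptions, not the paper's actual data schema.

# Minimal sketch (not the paper's actual data schema): a refinement-tuning
# trajectory interleaves thoughts, actions, and environment observations,
# and keeps a turn where the agent errs and then corrects itself after
# reading the observation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    thought: str            # the agent's reasoning for this step
    action: str             # the action it emits to the environment
    observation: str        # what the environment returns
    is_error: bool = False  # marks a turn the agent later refines

@dataclass
class RefinementTrajectory:
    task: str
    turns: List[Turn] = field(default_factory=list)

# Toy example: an invalid first guess, then a correction based on the observation.
example = RefinementTrajectory(
    task="Find the fork and put it on the table.",
    turns=[
        Turn(
            thought="The fork is probably in the kitchen.",
            action="go to kitchen",
            observation="You are in the kitchen. There is no fork here.",
            is_error=True,
        ),
        Turn(
            thought="The kitchen has no fork; the goal hints it is in the bedroom.",
            action="go to bedroom",
            observation="You see a fork on the bed.",
        ),
    ],
)

if __name__ == "__main__":
    for i, t in enumerate(example.turns, 1):
        print(f"Turn {i}: {t.action} -> {t.observation}")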

AgentRefine Framework

Overview of the AgentRefine framework.

Main Results

The table shows the performance comparison of AgentRefine and other methods across different model families and sizes. Compared with other agent-tuning works, our method shows significant advantages on held-out tasks; for example, it surpasses Agent-FLAN by 13.3% in SciWorld Success Rate. Notably, on some tasks AgentRefine can even match the performance of the GPT-4o series. This demonstrates the strong generalization capability of AgentRefine.


Main Results. The underlined text indicates that the training data is sampled from the same environment as the task, so the result is treated as a held-in evaluation.

Robustness Analysis

We perturb the candidate actions in Alfworld, ensuring that the perturbed actions consist of different tokens (or a different token order) but express the same semantic information. The detailed perturbation rules are shown in Appendix K of the paper.
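The snippet below is only an illustrative sketch of this idea, not the actual rules from Appendix K: it rewrites a candidate action with a synonymous verb phrase so that the surface tokens change while the meaning stays the same. The synonym table and function name are hypothetical.

# Illustrative sketch only: the actual perturbation rules are in Appendix K
# of the paper. The idea is to rewrite candidate actions with different
# tokens or token order while preserving their semantics.

import random

# Hypothetical synonym table; not the paper's rule set.
SYNONYMS = {
    "go to": ["move to", "walk to"],
    "take": ["pick up", "grab"],
    "open": ["pull open"],
}

def perturb_action(action: str, rng: random.Random) -> str:
    """Replace a known verb phrase with a synonymous one, if any matches."""
    for phrase, alternatives in SYNONYMS.items():
        if action.startswith(phrase):
            return action.replace(phrase, rng.choice(alternatives), 1)
    return action  # leave unmatched actions unchanged

if __name__ == "__main__":
    rng = random.Random(0)
    candidates = ["go to drawer 1", "take fork", "open cabinet 2"]
    print([perturb_action(a, rng) for a in candidates])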


Performance for different models across various perturbations.

It can be observed that simple data perturbation leads to a significant performance drop on the original held-in task. For example, in terms of the average score, AgentGym's Success Rate drops by 25.6%, while Agent-FLAN suffers an even more severe decline of 30.4%; their standard deviations are close to 20%. In comparison, our AgentRefine shows a 3.7% increase in the average score and a low standard deviation of 3.73%, indicating that it learns decision-making capabilities rather than simply memorizing the training data.
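For readers unfamiliar with how such robustness numbers are summarized, the short sketch below computes the average and standard deviation of a model's score across the original task and its perturbed variants. The scores are placeholders, not the paper's actual numbers.

# Minimal sketch: summarize robustness as the mean and standard deviation of
# scores across the original task and its perturbed variants.

import statistics

scores_across_perturbations = [60.0, 58.5, 55.0, 62.0, 57.5]  # hypothetical scores

mean_score = statistics.mean(scores_across_perturbations)
std_score = statistics.stdev(scores_across_perturbations)

print(f"average = {mean_score:.2f}, std = {std_score:.2f}")
# A small std across perturbations suggests the model learned a decision
# policy rather than memorizing surface forms of the training actions.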

Case Study

The figure presents examples of Agent-FLAN and AgentRefine in Jericho and SciWorld. The cases show that Refinement Tuning enhances the diversity and quality of the model's thinking, which improves the model's exploration breadth and efficiency and prevents it from getting stuck in loops in a new environment.


Comparison case study on Jericho and SciWorld between Agent-FLAN and AgentRefine.

Case Study Details

(Left) In Jericho, Agent-FLAN mistakenly believes it is not in the cell and attempts to go to cell. After failing, it chooses to check valid actions. Although check valid actions is a correct choice, Agent-FLAN does not correct its erroneous decision based on the returned results and keeps looping between go to cell and check valid actions. In contrast, AgentRefine, upon realizing its actions are not achieving the goal, tries various new approaches instead of endlessly repeating previously tried incorrect actions.

(Right) In SciWorld, Agent-FLAN ignores the hint in the Goal that the fork is in the bedroom and chooses to search in the kitchen. In addition, Agent-FLAN, having memorized the Alfworld dataset, attempts to output locations that can only be found in Alfworld (drawer, countertop, and the action format go to {place}), which do not exist in SciWorld. Conversely, AgentRefine can clearly find the thermometer and decides to go bedroom to search for the fork. After go bedroom fails, it decides to go hallway based on several rounds of observation. In Thought 6, although AgentRefine mistakenly believes it cannot reach the bedroom, its judgement shows it can revise its decisions using short-term memory (from turn 2). When Observation 6 provides clear information about the bedroom, AgentRefine corrects its wrong decision in Thought 6 and reaches the bedroom. This indicates that AgentRefine's improvement in results is not due to memorizing prior knowledge from the training data but rather to its ability to efficiently utilize and integrate multiple key pieces of information from short-term memory to correct errors in historical decisions.

Comprehensive Analysis

For more technical details, methodology, and comprehensive experimental results, refer to our paper.

BibTeX

@inproceedings{fu2025agentrefine,
  title={AgentRefine: Enhancing Agent Generalization through Refinement Tuning},
  author={Dayuan Fu and Keqing He and Yejie Wang and Wentao Hong and Zhuoma GongQue and Weihao Zeng and Wei Wang and Jingang Wang and Xunliang Cai and Weiran Xu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=FDimWzmcWn}
}