Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
Abstract
One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at https://github.com/QwenLM/AutoIF.
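Below is a minimal sketch of the execution-feedback idea, assuming a placeholder `llm_generate` call and illustrative prompts; it is not the authors' code, only one way the generate-verify-filter loop could look:

```python
# Illustrative AutoIF-style loop: generate a checker, validate it with unit-test
# cases, then keep only responses that pass execution feedback.

def llm_generate(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")  # placeholder

def build_training_example(instruction: str, n_samples: int = 8):
    # 1. Ask the LLM for a verification function for this instruction.
    checker_src = llm_generate(
        "Write a Python function `check(response: str) -> bool` that returns True "
        f"only if the response follows this instruction:\n{instruction}"
    )
    namespace = {}
    exec(checker_src, namespace)          # run in a sandbox in any real setting
    check = namespace["check"]

    # 2. Ask for unit-test cases and discard checkers that misjudge them.
    positive = llm_generate(f"Write one response that follows: {instruction}")
    negative = llm_generate(f"Write one response that violates: {instruction}")
    if not (check(positive) and not check(negative)):
        return None

    # 3. Rejection sampling: responses that pass go to SFT; pass/fail pairs can
    #    later be used for preference (RLHF/DPO) training.
    accepted, rejected = [], []
    for _ in range(n_samples):
        response = llm_generate(f"Follow this instruction:\n{instruction}")
        (accepted if check(response) else rejected).append(response)
    return {"instruction": instruction, "chosen": accepted, "rejected": rejected}
```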
Overview
- This research paper explores a novel technique called "self-play with execution feedback" to improve the instruction-following capabilities of large language models (LLMs).
- The key idea is to have the LLM engage in a self-play process where it generates and executes instructions, then receives feedback on the quality of its execution.
- This feedback is then used to fine-tune the LLM, enabling it to better understand and follow instructions over time.
Key Ideas
The researchers' approach involves having the LLM practice following instructions on its own. First, the LLM generates some instructions. Then, it tries to execute those instructions. Importantly, the LLM also receives feedback on how well it did at executing the instructions. This feedback is used to fine-tune or improve the LLM, so that it can better understand and follow instructions in the future.
Technical Explanation
"self-play with execution feedback" to improve the instruction-following abilities of large language models (LLMs). [1] The core idea is to have the LLM engage in a self-play process where it first generates a set of instructions, then attempts to execute those instructions, and finally receives feedback on the quality of its execution.This feedback is then used to fine-tune the LLM, enabling it to better understand and follow instructions over time.
the LLM's ability to learn from its own mistakes and continuously improve its instruction-following capabilities through the self-play process.
To implement this approach, the researchers first train an LLM on a large corpus of natural language data. They then fine-tune this base model on a dataset of instruction-following tasks, using a combination of supervised learning and reinforcement learning techniques. [2] During the fine-tuning process, the LLM engages in the self-play cycle of generating instructions, executing them, and receiving feedback on its performance.
Main Limitations
The paper mainly evaluates LLMs on relatively narrow, constrained instruction-following tasks. It is unclear whether the technique generalizes well to more open-ended, real-world instruction-following scenarios, where language and context are far more complex and ambiguous.
It is also unclear how the technique scales to larger, more capable LLMs, or how it interacts with other fine-tuning or training strategies.
As language models continue to grow in power and sophistication, techniques like self-play with execution feedback will likely play an increasingly important role.
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Overview
- This paper explores a novel approach called "self-play fine-tuning" that can transform weak language models into strong, high-performing ones.
- The authors demonstrate how this technique can effectively train language models to exhibit strong reasoning abilities, outperforming alternative fine-tuning methods.
- The research provides insights into how language models can be optimized for tasks requiring advanced reasoning skills, which has significant implications for developing more capable and versatile AI systems.
The researchers in this study were interested in finding ways to make language models, which are AI systems that can understand and generate human language, become better at reasoning and problem-solving. Typically, language models are trained on large datasets of text, which allows them to learn the patterns and structures of language. However, this approach can result in models that struggle with tasks that require deeper reasoning or more advanced cognitive abilities.
The core idea is to have the language model engage in a sort of "dialogue" with itself, where it takes on different roles and perspectives to solve complex problems.
- The researchers found that this self-play fine-tuning approach was able to transform weak language models - models that were not very good at reasoning - into much stronger and more capable models. These improved models were able to outperform other fine-tuning methods on a variety of tasks that required advanced reasoning abilities.
This research is significant because it provides a way to develop more versatile and capable AI systems that can excel at a wider range of tasks, including those that demand higher-level cognitive skills. By optimizing language models for reasoning, the researchers have taken an important step towards creating AI that can truly understand and engage with the world in more meaningful and intelligent ways.
Technical Explanation
"self-play fine-tuning” can effectively convert weak language models into strong, high-performing models.
The key idea is to have the language model engage in a self-directed dialogue, where it takes on different roles and perspectives to solve complex problems. This self-play process allows the model to learn more effective reasoning strategies, which can then be leveraged to improve its performance on a variety of tasks.让语言模型参与自我引导的对话,扮演不同的角色和观点来解决复杂的问题。这种自我博弈过程使模型能够学习更有效的推理策略,然后可以利用这些策略来提高其在各种任务上的性能using self-directed interactions to enhance language model capabilities.
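As a rough sketch (an assumption about how such a round is often set up in SPIN-style training, not necessarily this paper's exact recipe), one self-play round can pair the model's own generations against reference responses and feed the pairs to a preference-style update:

```python
# Illustrative self-play round: the current model's generation is the weaker
# side of each pair, the reference (human/SFT) response is the stronger side.
from typing import Callable, List, Tuple

def self_play_round(
    prompts: List[str],
    reference_responses: List[str],        # ground-truth responses from the SFT set
    generate: Callable[[str], str],        # current model's sampler (placeholder)
) -> List[Tuple[str, str, str]]:
    pairs = []
    for prompt, gold in zip(prompts, reference_responses):
        synthetic = generate(prompt)       # the model plays "opponent" to the data
        pairs.append((prompt, gold, synthetic))   # (prompt, preferred, dispreferred)
    return pairs
# The pairs drive a DPO-like update, and the updated model generates in the next round.
```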
Potential Issues
One potential limitation of the study is the reliance on synthetic tasks and datasets to evaluate the model's reasoning skills. While these controlled experiments provide valuable insights, it would be important to also assess the model's performance on real-world, naturalistic tasks that capture the full complexity of human reasoning and problem-solving.
The paper also does not delve deeply into the specific mechanisms or dynamics underlying the self-play process.
Large Language Models Can Self-Improve At Web Agent Tasks
Abstract
Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.
- Researchers explore how large language models (LLMs) can self-improve their performance as agents in complex environments like web browsers.
- They use the WebArena benchmark to assess agent performance in web navigation and task completion.
- The goal is to see whether LLMs can fine-tune on their own generated data to exceed their base performance as autonomous agents.
Training AI agents to effectively navigate and perform actions in complex environments like web browsers has traditionally been challenging due to limited training data. However, work such as "Exploring Autonomous Agents through the Lens of Large Language Models: A Review" has shown that large language models (LLMs) can demonstrate some ability to navigate novel environments using just natural language instructions as a guide.
Additionally, recent research has shown that LLMs can improve their own performance by fine-tuning on data generated by the model itself. In this work, the researchers explore whether LLMs can leverage this self-improvement capability to enhance their performance as autonomous agents in complex, long-horizon tasks.
The researchers also contribute new evaluation metrics to assess the performance, robustness, and quality of the agent's trajectories in greater detail than just aggregate benchmark scores, providing a more comprehensive way to measure self-improvement.
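A rough sketch of the self-improvement recipe in this setting, under the assumption that trajectories are filtered by some success heuristic before fine-tuning; the helper names are placeholders, not the paper's actual data mixtures:

```python
# Illustrative self-improvement data collection for a web agent.

def collect_finetuning_data(tasks, run_agent, looks_successful):
    examples = []
    for task in tasks:
        trajectory = run_agent(task)              # list of (observation, action) steps
        if not looks_successful(task, trajectory):
            continue                              # keep only plausibly successful runs
        for observation, action in trajectory:
            examples.append({
                "prompt": f"Objective: {task}\nObservation: {observation}",
                "completion": action,
            })
    return examples   # fine-tune the base model on these synthetic demonstrations
```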
Limitations
Existing research has also highlighted the challenge of maintaining coherence and logical reasoning in LLM-based agents as they navigate complex, long-horizon tasks. The researchers in this paper do not directly address this issue, which could be an area for further investigation.
They note that the synthetic training data used for fine-tuning may not fully capture the complexity and nuance of real-world web navigation, which could limit the agent's performance in more realistic scenarios.
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
Current AI alignment methodologies rely on human-provided demonstrations or judgments, and the learned capabilities of AI systems would be upper-bounded by human capabilities as a result. Evaluators (reward models) trained with supervision on easier tasks can be used effectively to score candidate solutions to harder tasks, enabling easy-to-hard generalization across difficulty levels. The paper proposes a new approach to scalable alignment that first trains process-supervised reward models on easy problems (e.g., levels 1-3) and then uses them to evaluate the performance of policy models on hard problems.
The evaluators, trained with process supervision, perform easy-to-hard evaluation and improve generation through re-ranking or reinforcement learning.
Overview
- This paper investigates the concept of easy-to-hard generalization in AI, proposing a novel strategy for scaling AI's problem-solving capabilities beyond human expertise by using human annotations on simpler tasks to tackle more complex challenges.
- It highlights the difference in generalization capabilities between generators (policy models) and evaluators, with evaluators, especially process-supervised reward models (PRMs), showing superior performance in guiding generators to solve harder tasks.
- The study demonstrates that reinforcement learning (RL) techniques, when used to optimize generators against evaluators trained on easier tasks, significantly improve the AI's ability to perform complex reasoning tasks.
- The paper suggests a future direction for AI that involves refining and extending these models and methods, enabling AI systems to independently navigate and solve problems beyond human-level supervision.
Generators and Evaluators: Bridging the Gap
Generators, or policy models, trained solely on simpler tasks exhibit varied performance when confronted with more complex tasks. The study finds that supervised fine-tuning (SFT) consistently outperforms in-context learning (ICL) in generalizing from easy to hard tasks. Interestingly, data quality plays a crucial role in this generalization, with high-quality, well-aligned data from simpler tasks enabling better generalization performance. Despite improvements, a palpable performance gap exists between generators trained on a full spectrum of tasks and those limited to easier tasks, highlighting the challenge of easy-to-hard generalization for generators.
Evaluators' Superior Easy-to-Hard Generalization
Evaluators, particularly process-supervised reward models (PRMs), demonstrate remarkable easy-to-hard generalization capabilities. Through re-ranking strategies like weighted voting and reinforcement learning (RL) approaches, evaluators effectively enhance generator performance on complex tasks. The study presents a novel Outcome & Process Reward Model (OPRM) that combines the merits of both PRMs and traditional outcome reward models, delivering superior performance across tasks. These findings suggest that evaluators can serve as a significant catalyst in advancing generators' easy-to-hard generalization.
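A minimal sketch of the weighted-voting re-ranking idea, assuming a `reward_model` trained only on easier problems that scores whole solutions and an `extract_answer` helper; both are illustrative stand-ins rather than the paper's implementation:

```python
# Weighted voting: sum evaluator scores over candidates that share a final answer.
from collections import defaultdict
from typing import Callable, List

def weighted_vote(
    candidates: List[str],                  # sampled solutions to one hard problem
    extract_answer: Callable[[str], str],   # pulls the final answer from a solution
    reward_model: Callable[[str], float],   # evaluator trained on easier tasks
) -> str:
    totals = defaultdict(float)
    for solution in candidates:
        totals[extract_answer(solution)] += reward_model(solution)
    return max(totals, key=totals.get)      # answer backed by the most total reward
```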
Reinforcement Learning: Harnessing Evaluators for Enhancement
The research moves beyond re-ranking to explore how evaluators can further facilitate generator improvement through reinforcement learning. By optimizing generators against the evaluators, the study shows that training with easy-to-hard evaluators via RL achieves notable performance gains. The process reward models, specifically when employed in RL training modes, enable generators to surpass the performance of models trained across a full data spectrum, including harder tasks.
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
Identifying User Goals from UI Trajectories
The task: given a UI trajectory, generate a natural language description that accurately captures the user's intended task.
- Input: a UI trajectory - a sequence of steps performed by the user.
- Steps: each step consists of a snapshot of the UI environment at that moment, along with the corresponding action the user took at that step.
- Output: a natural language description that accurately captures the user's intended task.
- Like text generation tasks such as summarization, where multiple valid outputs can exist, the task is inherently ambiguous, mostly because the same trajectory may fulfill multiple intents.
- "When a model identifies intent from trajectory, we expect it to predict the most likely one." (Berkovitch et al., 2024, p. 2)
- Evaluation metric
Given an input UI trajectory and a corresponding gold task description, we assess whether a predicted task description matches the gold reference.
- We first assume that an observed UI trajectory that fulfills the user's intended task is correct, and one that does not is incorrect.
- The definition differs per subtask:
- information-seeking
- the trajectory provides the necessary information sought in the user intent.
- transactional-event
- successfully completing the specific requirement outlined in the task.
- In essence, this means that completing task A necessarily results in completing task B.
Trajectories that satisfy the user's intent are then used to carry out the corresponding operations.
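One way the matching judgment could be automated is with an LLM judge applying the criteria above; the prompt wording and the `llm_judge` helper below are assumptions, not the paper's protocol:

```python
# Illustrative match check between a predicted and a gold task description.

def descriptions_match(predicted: str, gold: str, subtask: str, llm_judge) -> bool:
    if subtask == "information-seeking":
        question = ("Would a trajectory that provides the information sought by task A "
                    "also provide the information sought by task B, and vice versa?")
    else:  # transactional
        question = ("Does completing task A necessarily result in completing task B, "
                    "and vice versa?")
    prompt = f"Task A: {gold}\nTask B: {predicted}\n{question} Answer yes or no."
    return llm_judge(prompt).strip().lower().startswith("yes")
```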
Below is a detailed explanation of how this framework could be applied to a web agent setting, incorporating the idea of high-level intents, multiple possible “trajectories” (solution plans), and a subsequent inference process to turn those trajectories into concrete tasks.
Conceptual Overview:
- High-Level Intent:
The process begins with a high-level user intent. In a web agent scenario, this intent might be something like:
  - "Research the latest trends in AI-powered chatbots."
  - "Find the cheapest flights from New York to London next month."
  - "Collect data on top-performing e-commerce products in the last quarter."
This "intent" is a broad directive that does not yet specify how exactly the agent will accomplish the goal. It serves as the input that will guide the system to generate multiple candidate approaches to fulfilling the request.
- LLM Actor - Generating Multiple Trajectories:
An LLM (Large Language Model) acting as a planning agent—let's call it the LLM Actor—takes the high-level intent and produces several potential solution "trajectories." Each trajectory can be thought of as a conceptual plan or approach outlining how the web agent might navigate the online environment to achieve the goal. For example, given the intent "Research the latest trends in AI-powered chatbots," the LLM Actor might generate trajectories such as:
  - Trajectory 1:
    Step A: Search on Google for "latest trends in AI-powered chatbots."
    Step B: Visit top 3 authoritative tech news sites (e.g., TechCrunch, VentureBeat, MIT Technology Review) for relevant articles.
    Step C: Identify key subtopics (e.g., multimodal chatbots, retrieval-augmented LLMs, voice integration).
    Step D: Summarize findings in a concise report.
  - Trajectory 2:
    Step A: Check specialized industry reports on sites like Gartner or IDC.
    Step B: Gather insights from developers' forums (Stack Overflow, Reddit r/MachineLearning) to see what trends developers discuss.
    Step C: Combine data into a structured list of trends with links to sources.
  - Trajectory 3:
    Step A: Use a specialized industry database (if available through APIs) that tracks AI patents and startups.
    Step B: Identify which new startups are emerging and what features their chatbots include.
    Step C: Compile a summary focusing on upcoming technology and niche solutions.
Each trajectory represents a distinct approach to executing the high-level intent. Some might focus on broad market research, others might focus on technical forums, or proprietary data sources. The key point is that the LLM Actor can generate multiple candidate plans, possibly leveraging known web resources, specialized databases, or API endpoints.
- Selecting and Refining Through LLM Inference ("LLM Infer"):
After the LLM Actor has produced these candidate trajectories, another LLM-based process—let's call it the LLM Inference module—reviews them. This module's role is to evaluate the candidate plans and transform them into executable tasks that the web agent can perform. It might take into account criteria like:
  - Relevance: Does the trajectory align well with the user's stated intent?
  - Feasibility: Does the plan rely on resources that the agent can access (e.g., are there APIs or known websites available)?
  - Efficiency: Is the plan too lengthy or can it be done in fewer steps?
  - Coverage: Does the plan cover a broad enough set of sources to ensure a comprehensive result?
The LLM Inference step might do one of the following:
  - Pick the best single trajectory from the set provided by the LLM Actor and break it down into actionable tasks.
  - Combine elements from multiple trajectories to form an improved, hybrid plan.
  - Further refine steps, making them more concrete and detailed: for instance, turning "Visit top 3 authoritative tech news sites" into specific tasks like "Open TechCrunch homepage, search for 'AI chatbot trends,' read the top 2 recent articles, and note key points."
- Task Generation and Execution by the Web Agent: Once the LLM Inference module finalizes the plan, it produces a set of tasks that the web agent can execute. In a web environment, these tasks might be:
- Web Navigation Tasks:
- “Open a new browser tab and navigate to https://www.techcrunch.com”
- “Use the site’s search function to find articles on ‘AI chatbots’”
- “Extract article titles, publication dates, and main points.”
- API Requests:
- “Send a GET request to the Gartner Insights API for the latest report on AI chatbots.”
- “Use the Reddit API to fetch top posts from r/MachineLearning mentioning ‘chatbots’ in the last month.”
- Data Processing Tasks:
- “Parse the returned JSON for relevant keywords and summarize the findings.”
- “Store extracted data in a local knowledge base or a temporary memory buffer.”
At this point, the web agent, possibly coupled with other automated tools (like web scraping libraries, APIs, and data processing routines), can carry out the plan. The agent systematically executes the tasks, collects results, and compiles a final answer or dataset for the user.
- Iterative Improvement: Since LLMs and agents can be run iteratively, the entire loop can refine itself:
- If the end result is not satisfactory, the user or a system-level review can trigger another round of planning.
- The LLM Actor might generate new trajectories focusing on different strategies (e.g., searching scholarly archives like arXiv or specialized chatbot vendor directories).
- The LLM Inference module can merge these new approaches into a new set of tasks, progressively improving the quality of the outcome.
Why This Matters in a Web Agent Scenario:
- By separating the high-level intent from the actual execution steps, this framework allows a web agent to handle complex, multifaceted requests.
- Multiple candidate trajectories encourage creativity and exploration: the agent doesn’t just pick a single approach but can propose several, allowing more robust and thorough solutions.
- The two-step process (LLM Actor to propose broad plans, followed by LLM Inference to refine and finalize tasks) ensures that the final executed plan is both conceptually sound and pragmatically executable.
In summary, the given framework, when applied to a web agent context, provides a structured approach to handling complex user queries. It first uses a large language model to propose multiple potential solution paths (trajectories) from a high-level intent, and then uses another LLM-guided inference step to distill these trajectories into practical, step-by-step tasks the web agent can carry out. This two-level reasoning approach can significantly enhance both the flexibility and reliability of autonomous web-based research, data gathering, and analysis.
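The two-level flow can be condensed into a small control loop; every helper below (`llm_actor`, `llm_infer`, `execute_task`, `is_satisfactory`) is a placeholder for whatever LLM and browser tooling an implementation actually uses:

```python
# Skeleton of the Actor -> Inference -> Execution -> Iterate loop described above.

def run_web_agent(intent, llm_actor, llm_infer, execute_task, is_satisfactory,
                  max_rounds: int = 3):
    results, feedback = [], None
    for _ in range(max_rounds):
        trajectories = llm_actor(intent, feedback)        # propose candidate plans
        tasks = llm_infer(intent, trajectories)           # select/merge/refine into tasks
        results = [execute_task(task) for task in tasks]  # navigate, call APIs, parse
        if is_satisfactory(intent, results):
            return results
        feedback = results                                # trigger another planning round
    return results
```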
The following takes a more detailed look at how HTML elements can serve as the ground truth for the metrics we define. The assumed setting is this: we have a system for web automation or information extraction (for example a web agent or scraper), and we need to evaluate how accurately it parses, understands, and acts on page content. In this scenario, the HTML elements themselves can serve as an objective, repeatably accessible "ground truth" for defining evaluation metrics.
1. Background and motivation:
When building a web agent or an automated testing system, we frequently need to judge whether a task was completed successfully, for example:
- Was the key information on the page (e.g., product price, title, image URL) extracted successfully?
- Was the specific button or link found and clicked in the right place?
- Was the form input filled into the correct HTML element?
Verifying these operations requires inspecting and matching the page's own DOM (Document Object Model). HTML elements are the basic building blocks of web content; they exist objectively and can be located and parsed via selectors. This makes HTML elements a natural "ground truth" that we can use as the reference point for defining and evaluating metrics.
2. What makes HTML elements ground truth?
In many fields (such as computer vision, information extraction, and machine learning), **"ground truth"** refers to an accepted, objective standard used to evaluate an algorithm's or system's performance. In web parsing and automation, the HTML elements themselves contain the markup structure of all page content and are the rawest, uninterpreted representation of the page's information. When we need to evaluate the behavior of an agent or program, HTML elements provide observable and verifiable base facts.
- For example, if an evaluation criterion is "can the agent correctly find the search input box on the page," that search box is typically identified by an HTML element such as `<input type="text" id="search" ...>`. The position, ID, class name, and attributes of that HTML element are our ground truth. No subjective judgment is needed: it objectively exists in the DOM.
3. How to define metrics using HTML elements:
The core of treating HTML elements as the ground truth for metrics is to exploit their addressability (via CSS selectors, XPath, IDs, class names, etc.), their attribute information, and their hierarchical structure to define quantitative and qualitative indicators.
Concretely:
- Element Localization Accuracy:
Define a metric that measures how accurately the agent finds the target element. For example, if the goal is to obtain the text content of the page's main heading `<h1>` element, the metric can be: "the degree to which the element node returned by the agent matches the actual `<h1>` tag (0 or 1)." If the agent locates the wrong element it scores 0; if it locates the correct one it scores 1. Averaging the score across many pages measures localization accuracy.
- Attribute Matching:
Sometimes it is not enough to find the element; its attributes must also be correct. For instance, a product price tag might be `<span class="price" data-currency="USD">`. We can define a metric: "do the attributes of the element the agent extracted match the expected ground-truth attributes?" For example, if we know in advance that the price on a given page should live in `<span class="price">`, we check whether the element the agent returns has `class="price"` and contains the expected text pattern (such as `$[0-9]+`).
- Content Extraction Accuracy:
Metrics for text extraction can be based on the plain text parsed from HTML elements. The ground truth is the text that actually exists at a known location (HTML element) on the page (such as a product title or description). The metric can be the edit distance of the extracted text, word-level match (F1 score), exact-match rate, and so on.
- DOM Structural Consistency:
When a complex data structure needs to be verified (such as a product list or a navigation menu), we can define a metric that evaluates the similarity between the DOM subtree the agent extracted and the expected DOM structure. For example: store a ground-truth tree description of a given HTML fragment in advance, then compare the DOM the agent extracted (or the element tree parsed from it) against the ground-truth DOM tree, scoring by how well node counts, types, and ordering match.
- Interaction Success:
If the agent performs actions such as clicking or typing, we can define a metric: "after clicking a specific HTML element (such as `<button id="buy-now">Buy Now</button>`), does the expected subsequent DOM change occur?" Here the ground truth is the specific element that should appear on the page after the action (such as an order-confirmation dialog). If `<div id="order-confirmation">` appears after the click, interaction success is scored 1, otherwise 0. This ties the ground truth to the HTML element changes that follow.
4. Why can HTML elements serve as ground truth?
- Objectivity: HTML is the lowest-level data structure behind the rendered page, independent of any interpreter or execution environment.
- Repeatable verification: loading the same page (under the same conditions) yields the same HTML structure, so tests can be repeated.
- Precise localization: HTML elements have definite IDs, class names, or structural positions, so tools and code can match them exactly with no subjective ambiguity.
- Generality: text, images, and interactive controls are all carried by HTML elements, so they work for all kinds of metrics.
5. A practical example:
Suppose we are building an e-commerce information-extraction agent and want to measure its ability to extract the price, title, and stock information:
- We know in advance that on a given page the product title should be inside the `<h1 class="product-title">` element, the price inside `<span class="price">`, and the stock status inside `<div class="stock-status">`.
- As ground truth, we annotate these elements in the test dataset (e.g., archive the HTML fragment and the expected text values).
- When the agent runs, we do the following:
  - The agent returns its extraction result (perhaps a JSON object containing "title", "price", and "stock" fields).
  - We check the agent's extraction against the known-correct product information in the HTML (the ground truth):
    - If the `title` field equals the text of `<h1 class="product-title">`, count it as correct.
    - If the `price` field matches the numeric text of the `<span class="price">` element, count it as correct.
    - If the `stock` field is consistent with the content shown in `<div class="stock-status">`, count it as correct.
  - Aggregating these correctness checks defines the metrics for the extraction task (accuracy, recall, F1 score, etc.).
  - All of these comparisons are made against HTML elements as the ground truth.
Summary:
HTML elements can serve as the "ground truth" for evaluating a web agent or information-extraction system because they provide a stable, objective, repeatably locatable structure. By using the tags, attributes, and text that explicitly exist in the HTML, and that can be located and verified, as the reference standard, we can define various metrics that quantify how correctly and effectively the system understands and operates on the page.
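As a small sketch of how a couple of these metrics could be computed, the snippet below scores extraction accuracy against ground truth stored as CSS selectors (requires `beautifulsoup4`); the selectors and field names are illustrative:

```python
from bs4 import BeautifulSoup

GROUND_TRUTH = {                                   # hypothetical per-page annotation
    "title": "h1.product-title",
    "price": "span.price",
    "stock": "div.stock-status",
}

def extraction_accuracy(html: str, agent_output: dict) -> float:
    """Fraction of fields whose extracted text exactly matches the ground-truth element."""
    soup = BeautifulSoup(html, "html.parser")
    correct = 0
    for field, selector in GROUND_TRUTH.items():
        node = soup.select_one(selector)
        expected = node.get_text(strip=True) if node else None
        if expected is not None and agent_output.get(field, "").strip() == expected:
            correct += 1
    return correct / len(GROUND_TRUTH)
```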
The following lays out a fairly systematic process for using HTML elements as ground truth for DPO (Direct Preference Optimization) training, so that the model's capability improves step by step with each iteration. Assume the task is to have the model correctly extract specific information from web pages (such as the product title, price, or stock status) or perform the correct action (such as clicking the right button), and that we already have a set of ground-truth data annotated with HTML elements to use as a reference.
Recap of concepts:
- Ground truth: for each task (e.g., extracting a particular piece of information from a page), we have a correct answer defined in terms of HTML elements. For example, the price should be extracted from `<span class="price">`, and the product title is the text inside `<h1 class="product-title">`.
- DPO (Direct Preference Optimization): a method that integrates graded preferences over model outputs directly into training. It is similar to RLHF (reinforcement learning from human feedback), but it brings the preference data in directly through the optimization objective, making the model more inclined to generate highly preferred answers.
Training data and annotation:
- Queries: prepare a set of inputs representing the tasks, for example:
  - given an HTML fragment, extract the price information;
  - given a page structure, an instruction to type a keyword into the search bar; and so on. Each query is associated with a page DOM or HTML subtree.
- Reference Responses: based on the HTML-element ground truth, construct a "correct" answer for each query in advance. For instance, if the task is to extract the product price, the reference response is the text string of the correct price. If the task is to perform a click, the reference response is a description of that action being done correctly, such as "correctly clicked at `<button id="buy-now">`."
- Model Candidate Responses: the answers the current model produces for the same queries.
With HTML elements as the ground truth, it is easy to judge how good a model answer is:
- If the value the model extracts exactly matches the text inside `<span class="price">`, the answer is of high quality.
- If the model's output deviates substantially, its preference score should be low.
Preference definition and DPO training steps:
- Preference data generation:
For each training instance (query + HTML fragment) we have:
  - the reference answer (or a set of reference answers), denoted `r_ref`;
  - the current model's answer, denoted `r_w`.
We score both against the ground truth. For example:
  - if the extraction in `r_w` matches the ground truth exactly, give `r_w` a high score (high preference);
  - if `r_w` deviates substantially from the ground truth, give `r_w` a low score.
At the same time, `r_ref` serves as a baseline: it can be the output of an existing higher-quality model or a manually annotated correct output, guaranteed to agree closely with the ground truth. We then compare `r_w` with `r_ref`: if `r_w` is closer to the ground truth, we consider `r_w` better than `r_ref`; if it is worse, we prefer `r_ref`. This gives us a preference pair (r_w vs r_ref) together with their relative preference ranking, which is exactly the preference information DPO needs.
- Core idea of DPO:
  - If `r_w` beats `r_ref` (i.e., `r_w` is closer to the ground truth), the optimization increases the probability of producing `r_w` and relatively decreases that of `r_ref`.
  - If `r_ref` is better than `r_w`, the opposite applies.
Through direct optimization, DPO makes the model favor answers with higher preference. It needs no explicit reward model (as in RLHF); instead, it converts the preference over the answer pair (r_w, r_ref) into a relative-likelihood optimization objective. At the formula level, DPO folds the preference information into a log-likelihood-ratio objective that encourages the model to assign higher probability to the more preferred answer (the one that better matches the ground truth).
In simplified form, it can be understood as follows.
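Written out, the standard DPO loss (with $y^{+}$ whichever of `r_w` / `r_ref` is closer to the ground truth, $y^{-}$ the other, $\pi_\theta$ the model being trained, $\pi_{0}$ a frozen reference copy, and $\beta$ a temperature) is:

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\!\left[\log \sigma\!\left(\beta\left(\log\frac{\pi_\theta(y^{+}\mid x)}{\pi_{0}(y^{+}\mid x)} - \log\frac{\pi_\theta(y^{-}\mid x)}{\pi_{0}(y^{-}\mid x)}\right)\right)\right]
$$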
- Iterative training loop:
Each training iteration can be broken into the following steps:
  - Step A: sample a batch of queries from the dataset. Each query comes with its ground-truth HTML elements and a correct reference output.
  - Step B: generation. Use the current model parameters to generate an answer `r_w` for each query. The reference answer `r_ref` (from a frozen reference model or from existing annotations) is also prepared.
  - Step C: preference computation. Compare `r_w` with `r_ref` against the ground truth. If `r_w` is closer to the ground truth than `r_ref` (e.g., a higher text-match score), prefer `r_w`; otherwise prefer `r_ref`. Assign preference scores accordingly, e.g., `r_w_score` = 1 (if `r_w` is better) or 0 (if worse), and `r_ref_score` = 1 - `r_w_score`.
  - Step D: optimization. DPO uses the scores of these preference pairs to compute gradients and update the model parameters, with the goal of shifting the model's distribution toward generating highly preferred answers. During backpropagation the parameters gradually move in the direction that satisfies the preference information.
  - Step E: repeat. By iterating this process, the model gets closer to the ground truth in every round of comparison against the reference answers, improving its performance on the target task.
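A sketch of the data side of Steps B-D: turning the ground-truth comparison into (prompt, chosen, rejected) records that any DPO trainer could consume; the exact-match scoring rule and field layout are illustrative assumptions (requires `beautifulsoup4`):

```python
from bs4 import BeautifulSoup

def ground_truth_text(html: str, selector: str) -> str:
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(strip=True) if node else ""

def make_preference_record(query: str, html: str, selector: str, r_w: str, r_ref: str):
    truth = ground_truth_text(html, selector)
    r_w_score = 1 if r_w.strip() == truth else 0     # simple exact-match scoring
    r_ref_score = 1 - r_w_score
    chosen, rejected = (r_w, r_ref) if r_w_score > r_ref_score else (r_ref, r_w)
    return {"prompt": f"{query}\n\n{html}", "chosen": chosen, "rejected": rejected}
```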
- Why iteration improves the model:
  - At the start, the model's answers may be far from the ground truth, so `r_ref` wins most of the time. DPO training pushes the model's answer distribution toward `r_ref`, and therefore toward the ground truth.
  - As training proceeds, the model's answers gradually improve and agree more closely with the ground truth. Once its outputs approach or exceed the quality of the reference answers, they receive the higher preference, continually reinforcing this "correct" behavior.
  - Over many iterations the model eventually learns an internal policy that makes it much more likely to directly produce answers close to the ground truth when it encounters similar pages and queries.
Summary:
By using HTML elements as ground truth to define explicit preference pairs (model output vs. reference output) and training with DPO, each iteration optimizes the model based on the outcome of these pairwise comparisons. Because the ground truth provides an objective yardstick of correctness, the model repeatedly learns and updates its parameters through comparison against the reference answers, steadily improving answer quality and performance on the target task.
The following is a fairly systematic approach: first define metrics for a small number of trajectories, then use those defined examples as few-shot samples to automatically generate metric definitions for many more trajectories. This lets us generalize to more metrics from a limited set of human-defined examples (few-shots) before having to define a large number of metrics by hand.
Concepts and background:
- Trajectory: in the context of a web agent or automated system, a trajectory is a sequence of solution steps that realizes a high-level intent, e.g., "search for a keyword -> click a specific link -> extract information -> record the result."
- Metrics: indicators used to evaluate whether each trajectory achieves its intended goal and how well it performs, e.g., accuracy, recall, execution time, whether key HTML elements were successfully matched.
- Few-shot definition: given a handful of trajectories whose metrics have already been explicitly defined as examples, have a model (or system) automatically define metrics for new trajectories based on those examples.
Basic idea:
- Manually define metrics for a few representative trajectories:
  - Pick a small number of representative trajectories, ideally covering different task types and page structures.
  - Manually define rigorous, clear metrics for these trajectories. For example:
    - For an "extract the price" trajectory: define accuracy, an error tolerance (e.g., a price error within ±1 USD counts as acceptable), and crawl speed (elapsed time).
    - For a "click a button and verify the page transition" trajectory: define success rate (whether the target button was clicked), whether the resulting page contains specific HTML elements, time taken, and so on.
  In this step you create a base annotated set that includes:
  - the trajectory description,
  - the target task and ground-truth definition,
  - explicit metrics (e.g., the criteria for judging a correct output and how a quantifiable score is computed).
- Use the defined trajectory-metric mappings as few-shot examples: organize these manually defined examples into a few-shot prompt or dataset format (as natural language or structured descriptions). With 3-5 high-quality examples, they can be fed as few-shot examples to a large language model (LLM) or an automated rule generator.
- Few-shot generalization to generate metrics for more trajectories: once you have these few-shot examples, for a new trajectory (C) you only need to provide:
  - the trajectory's description (its goal and the steps it will execute), and
  - the ground truth (e.g., the expected HTML element locations or the expected final result).
Then have the LLM (or your system) automatically generate a metrics definition for that trajectory following the pattern of the few-shot examples. The prompt given to the LLM would contain the few-shot examples followed by the new trajectory's description and ground truth. The LLM will then output a set of reasonable metric definitions for Trajectory C based on the patterns and logic of the earlier examples; it is likely to draw on existing patterns (extraction accuracy, timing metrics, structural match, etc.) and adapt them to the characteristics of the new trajectory.
- Automated evaluation and correction: since the metric definitions produced by the LLM may not be perfect, one strategy is to:
  - automatically review the LLM-generated metric definitions, using a rule-based checker (e.g., regular expressions or programmatic checks) to verify that the generated metrics are computable and follow the expected format;
  - have humans review a small number of the generated metric definitions, fine-tune or correct them, and put the corrected versions back into the few-shot set as examples, iteratively improving the quality of automatic generation.
- Continuously grow the few-shot example library: as metrics keep being defined for new trajectories and confirmed by humans, high-quality automatically generated results can be added back into the few-shot examples, enriching the set so that later automatically generated metric definitions match the target requirements ever more closely.
Benefits:
- It reduces the tedious work of manually defining metrics for every single trajectory.
- Guided by a handful of examples, the LLM (or a similar system) learns a set of general principles and reasoning patterns, and can then produce many more metric definitions quickly and at scale.
- As the example set keeps growing, the quality of automatic generation keeps improving.
Summary:
First define clear metrics by hand for a few representative trajectories, provide those defined trajectories to an LLM as few-shot examples, and then have the model automatically generate metric definitions for the large number of new trajectories that follow. Along the way, keep reviewing the automatically generated results and updating the few-shot examples, gradually raising the quality of the automated metric definitions.
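A sketch of how the few-shot prompt in this workflow could be assembled; the example records and the requested output format are assumptions, not a prescribed template:

```python
# Illustrative few-shot prompt construction for automatic metric definition.

FEW_SHOT_EXAMPLES = [        # hypothetical manually defined trajectory -> metric examples
    {"trajectory": "Search for a product and extract its price from span.price",
     "ground_truth": "price text inside <span class='price'>",
     "metrics": "extraction accuracy (exact match); price error within ±1 USD; elapsed time"},
    {"trajectory": "Click the buy-now button and verify the order page loads",
     "ground_truth": "<div id='order-confirmation'> appears after the click",
     "metrics": "interaction success (0/1); presence of confirmation element; time taken"},
]

def build_metric_prompt(new_trajectory: str, new_ground_truth: str) -> str:
    blocks = [
        f"Trajectory: {ex['trajectory']}\nGround truth: {ex['ground_truth']}\nMetrics: {ex['metrics']}"
        for ex in FEW_SHOT_EXAMPLES
    ]
    blocks.append(
        f"Trajectory: {new_trajectory}\nGround truth: {new_ground_truth}\nMetrics:"
    )
    return "\n\n".join(blocks)    # the LLM completes the final Metrics field
```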
- Author: fufu酱
- Link: https://csfufu.life/article/56371125-2b2c-448d-b11a-6dc0c0ee996b
- Notice: this article is licensed under CC BY-NC-SA 4.0; please credit the source when reposting.