Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
Abstract
One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at https://github.com/QwenLM/AutoIF.
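Below is a minimal sketch of the execution-feedback idea, assuming a placeholder `llm_generate` call and illustrative prompts; it is not the authors' code, only one way the generate-verify-filter loop could look:

```python
# Illustrative AutoIF-style loop: generate a checker, validate it with unit-test
# cases, then keep only responses that pass execution feedback.

def llm_generate(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")  # placeholder

def build_training_example(instruction: str, n_samples: int = 8):
    # 1. Ask the LLM for a verification function for this instruction.
    checker_src = llm_generate(
        "Write a Python function `check(response: str) -> bool` that returns True "
        f"only if the response follows this instruction:\n{instruction}"
    )
    namespace = {}
    exec(checker_src, namespace)          # run in a sandbox in any real setting
    check = namespace["check"]

    # 2. Ask for unit-test cases and discard checkers that misjudge them.
    positive = llm_generate(f"Write one response that follows: {instruction}")
    negative = llm_generate(f"Write one response that violates: {instruction}")
    if not (check(positive) and not check(negative)):
        return None

    # 3. Rejection sampling: responses that pass go to SFT; pass/fail pairs can
    #    later be used for preference (RLHF/DPO) training.
    accepted, rejected = [], []
    for _ in range(n_samples):
        response = llm_generate(f"Follow this instruction:\n{instruction}")
        (accepted if check(response) else rejected).append(response)
    return {"instruction": instruction, "chosen": accepted, "rejected": rejected}
```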
Overview
- This research paper explores a novel technique called "self-play with execution feedback" to improve the instruction-following capabilities of large language models (LLMs).
- The key idea is to have the LLM engage in a self-play process where it generates and executes instructions, then receives feedback on the quality of its execution.
- This feedback is then used to fine-tune the LLM, enabling it to better understand and follow instructions over time.
Key Ideas
The researchers' approach involves having the LLM practice following instructions on its own. First, the LLM generates some instructions. Then, it tries to execute those instructions. Importantly, the LLM also receives feedback on how well it did at executing the instructions. This feedback is used to fine-tune or improve the LLM, so that it can better understand and follow instructions in the future.
Technical Explanation
"self-play with execution feedback" to improve the instruction-following abilities of large language models (LLMs). [1] The core idea is to have the LLM engage in a self-play process where it first generates a set of instructions, then attempts to execute those instructions, and finally receives feedback on the quality of its execution.This feedback is then used to fine-tune the LLM, enabling it to better understand and follow instructions over time.
the LLM's ability to learn from its own mistakes and continuously improve its instruction-following capabilities through the self-play process.
To implement this approach, the researchers first train an LLM on a large corpus of natural language data. They then fine-tune this base model on a dataset of instruction-following tasks, using a combination of supervised learning and reinforcement learning techniques. [2] During the fine-tuning process, the LLM engages in the self-play cycle of generating instructions, executing them, and receiving feedback on its performance.
Main Limitations
The paper mainly evaluates LLMs on relatively narrow, constrained instruction-following tasks. It is unclear whether the technique generalizes well to more open-ended, real-world instruction-following scenarios, where language and context are far more complex and ambiguous.
It is also unclear how the technique scales to larger, more capable LLMs, or how it interacts with other fine-tuning or training strategies.
As language models continue to grow in power and sophistication, techniques like self-play with execution feedback will likely play an increasingly important role.
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Overview
- This paper explores a novel approach called "self-play fine-tuning" that can transform weak language models into strong, high-performing ones.
- The authors demonstrate how this technique can effectively train language models to exhibit strong reasoning abilities, outperforming alternative fine-tuning methods.
- The research provides insights into how language models can be optimized for tasks requiring advanced reasoning skills, which has significant implications for developing more capable and versatile AI systems.
The researchers in this study were interested in finding ways to make language models, which are AI systems that can understand and generate human language, become better at reasoning and problem-solving. Typically, language models are trained on large datasets of text, which allows them to learn the patterns and structures of language. However, this approach can result in models that struggle with tasks that require deeper reasoning or more advanced cognitive abilities.
The core idea is to have the language model engage in a sort of "dialogue" with itself, where it takes on different roles and perspectives to solve complex problems.
- The researchers found that this self-play fine-tuning approach was able to transform weak language models - models that were not very good at reasoning - into much stronger and more capable models. These improved models were able to outperform other fine-tuning methods on a variety of tasks that required advanced reasoning abilities.
This research is significant because it provides a way to develop more versatile and capable AI systems that can excel at a wider range of tasks, including those that demand higher-level cognitive skills. By optimizing language models for reasoning, the researchers have taken an important step towards creating AI that can truly understand and engage with the world in more meaningful and intelligent ways.
Technical Explanation
"self-play fine-tuning” can effectively convert weak language models into strong, high-performing models.
The key idea is to have the language model engage in a self-directed dialogue, where it takes on different roles and perspectives to solve complex problems. This self-play process allows the model to learn more effective reasoning strategies, which can then be leveraged to improve its performance on a variety of tasks.让语言模型参与自我引导的对话,扮演不同的角色和观点来解决复杂的问题。这种自我博弈过程使模型能够学习更有效的推理策略,然后可以利用这些策略来提高其在各种任务上的性能using self-directed interactions to enhance language model capabilities.
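As a rough sketch (an assumption about how such a round is often set up in SPIN-style training, not necessarily this paper's exact recipe), one self-play round can pair the model's own generations against reference responses and feed the pairs to a preference-style update:

```python
# Illustrative self-play round: the current model's generation is the weaker
# side of each pair, the reference (human/SFT) response is the stronger side.
from typing import Callable, List, Tuple

def self_play_round(
    prompts: List[str],
    reference_responses: List[str],        # ground-truth responses from the SFT set
    generate: Callable[[str], str],        # current model's sampler (placeholder)
) -> List[Tuple[str, str, str]]:
    pairs = []
    for prompt, gold in zip(prompts, reference_responses):
        synthetic = generate(prompt)       # the model plays "opponent" to the data
        pairs.append((prompt, gold, synthetic))   # (prompt, preferred, dispreferred)
    return pairs
# The pairs drive a DPO-like update, and the updated model generates in the next round.
```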
Potential Issues
One potential limitation of the study is the reliance on synthetic tasks and datasets to evaluate the model's reasoning skills. While these controlled experiments provide valuable insights, it would be important to also assess the model's performance on real-world, naturalistic tasks that capture the full complexity of human reasoning and problem-solving.
The paper also does not delve deeply into the specific mechanisms or dynamics underlying the self-play process.
Large Language Models Can Self-Improve At Web Agent Tasks
Abstract
Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.
- Researchers explore how large language models (LLMs) can self-improve their performance as agents in complex environments like web browsers.
- They use the WebArena benchmark to assess agent performance in web navigation and task completion.
- The goal is to see whether LLMs can fine-tune on their own generated data to exceed their base performance as autonomous agents.
Training AI agents to effectively navigate and perform actions in complex environments like web browsers has traditionally been challenging due to limited training data. However, work such as "Exploring Autonomous Agents through the Lens of Large Language Models: A Review" has shown that large language models (LLMs) can demonstrate some ability to navigate novel environments using just natural language instructions as a guide.
Additionally, recent research has shown that LLMs can improve their own performance by fine-tuning on data generated by the model itself. In this work, the researchers explore whether LLMs can leverage this self-improvement capability to enhance their performance as autonomous agents in complex, long-horizon tasks.
The researchers also contribute new evaluation metrics to assess the performance, robustness, and quality of the agent's trajectories in greater detail than just aggregate benchmark scores, providing a more comprehensive way to measure self-improvement.
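A rough sketch of the self-improvement recipe in this setting, under the assumption that trajectories are filtered by some success heuristic before fine-tuning; the helper names are placeholders, not the paper's actual data mixtures:

```python
# Illustrative self-improvement data collection for a web agent.

def collect_finetuning_data(tasks, run_agent, looks_successful):
    examples = []
    for task in tasks:
        trajectory = run_agent(task)              # list of (observation, action) steps
        if not looks_successful(task, trajectory):
            continue                              # keep only plausibly successful runs
        for observation, action in trajectory:
            examples.append({
                "prompt": f"Objective: {task}\nObservation: {observation}",
                "completion": action,
            })
    return examples   # fine-tune the base model on these synthetic demonstrations
```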
Limitations
Existing research has also highlighted the challenge of maintaining coherence and logical reasoning in LLM-based agents as they navigate complex, long-horizon tasks. The researchers in this paper do not directly address this issue, which could be an area for further investigation.
They note that the synthetic training data used for fine-tuning may not fully capture the complexity and nuance of real-world web navigation, which could limit the agent's performance in more realistic scenarios.
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
Current AI alignment methodologies rely on human-provided demonstrations or judgments, and the learned capabilities of AI systems would be upper-bounded by human capabilities as a result. Evaluators (reward models) trained with supervision on easier tasks can be used effectively to score candidate solutions to harder tasks, enabling easy-to-hard generalization across difficulty levels. The paper proposes a new approach to scalable alignment that first trains process-supervised reward models on easy problems (e.g., levels 1-3) and then uses them to evaluate the performance of policy models on hard problems.
The evaluators, trained with process supervision, perform easy-to-hard evaluation and improve generation through re-ranking or reinforcement learning.
Overview
- This paper investigates the concept of easy-to-hard generalization in AI, proposing a novel strategy for scaling AI's problem-solving capabilities beyond human expertise by using human annotations on simpler tasks to tackle more complex challenges.
- It highlights the difference in generalization capabilities between generators (policy models) and evaluators, with evaluators, especially process-supervised reward models (PRMs), showing superior performance in guiding generators to solve harder tasks.
- The study demonstrates that reinforcement learning (RL) techniques, when used to optimize generators against evaluators trained on easier tasks, significantly improve the AI's ability to perform complex reasoning tasks.
- The paper suggests a future direction for AI that involves refining and extending these models and methods, enabling AI systems to independently navigate and solve problems beyond human-level supervision.
Generators and Evaluators: Bridging the Gap
Generators, or policy models, trained solely on simpler tasks exhibit varied performance when confronted with more complex tasks. The study finds that supervised fine-tuning (SFT) consistently outperforms in-context learning (ICL) in generalizing from easy to hard tasks. Interestingly, data quality plays a crucial role in this generalization, with high-quality, well-aligned data from simpler tasks enabling better generalization performance. Despite improvements, a palpable performance gap exists between generators trained on a full spectrum of tasks and those limited to easier tasks, highlighting the challenge of easy-to-hard generalization for generators.
Evaluators' Superior Easy-to-Hard Generalization
Evaluators, particularly process-supervised reward models (PRMs), demonstrate remarkable easy-to-hard generalization capabilities. Through re-ranking strategies like weighted voting and reinforcement learning (RL) approaches, evaluators effectively enhance generator performance on complex tasks. The study presents a novel Outcome & Process Reward Model (OPRM) that combines the merits of both PRMs and traditional outcome reward models, delivering superior performance across tasks. These findings suggest that evaluators can serve as a significant catalyst in advancing generators' easy-to-hard generalization.
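A minimal sketch of the weighted-voting re-ranking idea, assuming a `reward_model` trained only on easier problems that scores whole solutions and an `extract_answer` helper; both are illustrative stand-ins rather than the paper's implementation:

```python
# Weighted voting: sum evaluator scores over candidates that share a final answer.
from collections import defaultdict
from typing import Callable, List

def weighted_vote(
    candidates: List[str],                  # sampled solutions to one hard problem
    extract_answer: Callable[[str], str],   # pulls the final answer from a solution
    reward_model: Callable[[str], float],   # evaluator trained on easier tasks
) -> str:
    totals = defaultdict(float)
    for solution in candidates:
        totals[extract_answer(solution)] += reward_model(solution)
    return max(totals, key=totals.get)      # answer backed by the most total reward
```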
Reinforcement Learning: Harnessing Evaluators for Enhancement
The research moves beyond re-ranking to explore how evaluators can further facilitate generator improvement through reinforcement learning. By optimizing generators against the evaluators, the study shows that training with easy-to-hard evaluators via RL achieves notable performance gains. The process reward models, specifically when employed in RL training modes, enable generators to surpass the performance of models trained across a full data spectrum, including harder tasks.
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
Identifying User Goals from UI Trajectories
The task: given a UI trajectory, generate a natural language description that accurately captures the user's intended task.
- Input: a UI trajectory - a sequence of steps performed by the user.
- Steps: each step consists of a snapshot of the UI environment at that moment, along with the corresponding action the user took at that step.
- Output: a natural language description that accurately captures the user's intended task.
- Like text generation tasks such as summarization, where multiple valid outputs can exist, the task is inherently ambiguous, mostly because the same trajectory may fulfill multiple intents.
- "When a model identifies intent from trajectory, we expect it to predict the most likely one." (Berkovitch et al., 2024, p. 2)
- Evaluation metric
Given an input UI trajectory and a corresponding gold task description, we assess whether a predicted task description matches the gold reference.
- We first assume that an observed UI trajectory that fulfills the user's intended task is correct, and one that does not is incorrect.
- The definition differs per subtask:
- information-seeking
- the trajectory provides the necessary information sought in the user intent.
- transactional-event
- successfully completing the specific requirement outlined in the task.
- In essence, this means that completing task A necessarily results in completing task B.
Trajectories that satisfy the user's intent are then used to carry out the corresponding operations.
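One way the matching judgment could be automated is with an LLM judge applying the criteria above; the prompt wording and the `llm_judge` helper below are assumptions, not the paper's protocol:

```python
# Illustrative match check between a predicted and a gold task description.

def descriptions_match(predicted: str, gold: str, subtask: str, llm_judge) -> bool:
    if subtask == "information-seeking":
        question = ("Would a trajectory that provides the information sought by task A "
                    "also provide the information sought by task B, and vice versa?")
    else:  # transactional
        question = ("Does completing task A necessarily result in completing task B, "
                    "and vice versa?")
    prompt = f"Task A: {gold}\nTask B: {predicted}\n{question} Answer yes or no."
    return llm_judge(prompt).strip().lower().startswith("yes")
```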
Below is a detailed explanation of how this framework could be applied to a web agent setting, incorporating the idea of high-level intents, multiple possible “trajectories” (solution plans), and a subsequent inference process to turn those trajectories into concrete tasks.
Conceptual Overview:
- High-Level Intent:
The process begins with a high-level user intent. In a web agent scenario, this intent might be something like:
  - "Research the latest trends in AI-powered chatbots."
  - "Find the cheapest flights from New York to London next month."
  - "Collect data on top-performing e-commerce products in the last quarter."
This "intent" is a broad directive that does not yet specify how exactly the agent will accomplish the goal. It serves as the input that will guide the system to generate multiple candidate approaches to fulfilling the request.
- LLM Actor - Generating Multiple Trajectories:
An LLM (Large Language Model) acting as a planning agent—let's call it the LLM Actor—takes the high-level intent and produces several potential solution "trajectories." Each trajectory can be thought of as a conceptual plan or approach outlining how the web agent might navigate the online environment to achieve the goal. For example, given the intent "Research the latest trends in AI-powered chatbots," the LLM Actor might generate trajectories such as:
  - Trajectory 1:
    Step A: Search on Google for "latest trends in AI-powered chatbots."
    Step B: Visit top 3 authoritative tech news sites (e.g., TechCrunch, VentureBeat, MIT Technology Review) for relevant articles.
    Step C: Identify key subtopics (e.g., multimodal chatbots, retrieval-augmented LLMs, voice integration).
    Step D: Summarize findings in a concise report.
  - Trajectory 2:
    Step A: Check specialized industry reports on sites like Gartner or IDC.
    Step B: Gather insights from developers' forums (Stack Overflow, Reddit r/MachineLearning) to see what trends developers discuss.
    Step C: Combine data into a structured list of trends with links to sources.
  - Trajectory 3:
    Step A: Use a specialized industry database (if available through APIs) that tracks AI patents and startups.
    Step B: Identify which new startups are emerging and what features their chatbots include.
    Step C: Compile a summary focusing on upcoming technology and niche solutions.
Each trajectory represents a distinct approach to executing the high-level intent. Some might focus on broad market research, others might focus on technical forums, or proprietary data sources. The key point is that the LLM Actor can generate multiple candidate plans, possibly leveraging known web resources, specialized databases, or API endpoints.
- Selecting and Refining Through LLM Inference ("LLM Infer"):
After the LLM Actor has produced these candidate trajectories, another LLM-based process—let's call it the LLM Inference module—reviews them. This module's role is to evaluate the candidate plans and transform them into executable tasks that the web agent can perform. It might take into account criteria like:
  - Relevance: Does the trajectory align well with the user's stated intent?
  - Feasibility: Does the plan rely on resources that the agent can access (e.g., are there APIs or known websites available)?
  - Efficiency: Is the plan too lengthy or can it be done in fewer steps?
  - Coverage: Does the plan cover a broad enough set of sources to ensure a comprehensive result?
The LLM Inference step might do one of the following:
  - Pick the best single trajectory from the set provided by the LLM Actor and break it down into actionable tasks.
  - Combine elements from multiple trajectories to form an improved, hybrid plan.
  - Further refine steps, making them more concrete and detailed: for instance, turning "Visit top 3 authoritative tech news sites" into specific tasks like "Open TechCrunch homepage, search for 'AI chatbot trends,' read the top 2 recent articles, and note key points."
- Task Generation and Execution by the Web Agent: Once the LLM Inference module finalizes the plan, it produces a set of tasks that the web agent can execute. In a web environment, these tasks might be:
- Web Navigation Tasks:
- “Open a new browser tab and navigate to https://www.techcrunch.com”
- “Use the site’s search function to find articles on ‘AI chatbots’”
- “Extract article titles, publication dates, and main points.”
- API Requests:
- “Send a GET request to the Gartner Insights API for the latest report on AI chatbots.”
- “Use the Reddit API to fetch top posts from r/MachineLearning mentioning ‘chatbots’ in the last month.”
- Data Processing Tasks:
- “Parse the returned JSON for relevant keywords and summarize the findings.”
- “Store extracted data in a local knowledge base or a temporary memory buffer.”
At this point, the web agent, possibly coupled with other automated tools (like web scraping libraries, APIs, and data processing routines), can carry out the plan. The agent systematically executes the tasks, collects results, and compiles a final answer or dataset for the user.
- Iterative Improvement: Since LLMs and agents can be run iteratively, the entire loop can refine itself:
- If the end result is not satisfactory, the user or a system-level review can trigger another round of planning.
- The LLM Actor might generate new trajectories focusing on different strategies (e.g., searching scholarly archives like arXiv or specialized chatbot vendor directories).
- The LLM Inference module can merge these new approaches into a new set of tasks, progressively improving the quality of the outcome.
Why This Matters in a Web Agent Scenario:
- By separating the high-level intent from the actual execution steps, this framework allows a web agent to handle complex, multifaceted requests.
- Multiple candidate trajectories encourage creativity and exploration: the agent doesn’t just pick a single approach but can propose several, allowing more robust and thorough solutions.
- The two-step process (LLM Actor to propose broad plans, followed by LLM Inference to refine and finalize tasks) ensures that the final executed plan is both conceptually sound and pragmatically executable.
In summary, the given framework, when applied to a web agent context, provides a structured approach to handling complex user queries. It first uses a large language model to propose multiple potential solution paths (trajectories) from a high-level intent, and then uses another LLM-guided inference step to distill these trajectories into practical, step-by-step tasks the web agent can carry out. This two-level reasoning approach can significantly enhance both the flexibility and reliability of autonomous web-based research, data gathering, and analysis.
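The two-level flow can be condensed into a small control loop; every helper below (`llm_actor`, `llm_infer`, `execute_task`, `is_satisfactory`) is a placeholder for whatever LLM and browser tooling an implementation actually uses:

```python
# Skeleton of the Actor -> Inference -> Execution -> Iterate loop described above.

def run_web_agent(intent, llm_actor, llm_infer, execute_task, is_satisfactory,
                  max_rounds: int = 3):
    results, feedback = [], None
    for _ in range(max_rounds):
        trajectories = llm_actor(intent, feedback)        # propose candidate plans
        tasks = llm_infer(intent, trajectories)           # select/merge/refine into tasks
        results = [execute_task(task) for task in tasks]  # navigate, call APIs, parse
        if is_satisfactory(intent, results):
            return results
        feedback = results                                # trigger another planning round
    return results
```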
The following takes a more detailed look at how HTML elements can serve as the ground truth for the metrics we define. The assumed setting is this: we have a system for web automation or information extraction (for example a web agent or scraper), and we need to evaluate how accurately it parses, understands, and acts on page content. In this scenario, the HTML elements themselves can serve as an objective, repeatably accessible "ground truth" for defining evaluation metrics.
1. Background and motivation:
When building a web agent or an automated testing system, we frequently need to judge whether a task was completed successfully, for example:
- Was the key information on the page (e.g., product price, title, image URL) extracted successfully?
- Was the specific button or link found and clicked in the right place?
- Was the form input filled into the correct HTML element?
Verifying these operations requires inspecting and matching the page's own DOM (Document Object Model). HTML elements are the basic building blocks of web content; they exist objectively and can be located and parsed via selectors. This makes HTML elements a natural "ground truth" that we can use as the reference point for defining and evaluating metrics.
2. What makes HTML elements ground truth?
In many fields (such as computer vision, information extraction, and machine learning), **"ground truth"** refers to an accepted, objective standard used to evaluate an algorithm's or system's performance. In web parsing and automation, the HTML elements themselves contain the markup structure of all page content and are the rawest, uninterpreted representation of the page's information. When we need to evaluate the behavior of an agent or program, HTML elements provide observable and verifiable base facts.
- For example, if an evaluation criterion is "can the agent correctly find the search input box on the page," that search box is typically identified by an HTML element such as `<input type="text" id="search" ...>`. The position, ID, class name, and attributes of that HTML element are our ground truth. No subjective judgment is needed: it objectively exists in the DOM.
3. How to define metrics using HTML elements:
The core of treating HTML elements as the ground truth for metrics is to exploit their addressability (via CSS selectors, XPath, IDs, class names, etc.), their attribute information, and their hierarchical structure to define quantitative and qualitative indicators.
Concretely:
- Element Localization Accuracy:
Define a metric that measures how accurately the agent finds the target element. For example, if the goal is to obtain the text content of the page's main heading `<h1>` element, the metric can be: "the degree to which the element node returned by the agent matches the actual `<h1>` tag (0 or 1)." If the agent locates the wrong element it scores 0; if it locates the correct one it scores 1. Averaging the score across many pages measures localization accuracy.
- Attribute Matching:
Sometimes it is not enough to find the element; its attributes must also be correct. For instance, a product price tag might be `<span class="price" data-currency="USD">`. We can define a metric: "do the attributes of the element the agent extracted match the expected ground-truth attributes?" For example, if we know in advance that the price on a given page should live in `<span class="price">`, we check whether the element the agent returns has `class="price"` and contains the expected text pattern (such as `$[0-9]+`).
- Content Extraction Accuracy:
Metrics for text extraction can be based on the plain text parsed from HTML elements. The ground truth is the text that actually exists at a known location (HTML element) on the page (such as a product title or description). The metric can be the edit distance of the extracted text, word-level match (F1 score), exact-match rate, and so on.
- DOM Structural Consistency:
When a complex data structure needs to be verified (such as a product list or a navigation menu), we can define a metric that evaluates the similarity between the DOM subtree the agent extracted and the expected DOM structure. For example: store a ground-truth tree description of a given HTML fragment in advance, then compare the DOM the agent extracted (or the element tree parsed from it) against the ground-truth DOM tree, scoring by how well node counts, types, and ordering match.
- Interaction Success:
If the agent performs actions such as clicking or typing, we can define a metric: "after clicking a specific HTML element (such as `<button id="buy-now">Buy Now</button>`), does the expected subsequent DOM change occur?" Here the ground truth is the specific element that should appear on the page after the action (such as an order-confirmation dialog). If `<div id="order-confirmation">` appears after the click, interaction success is scored 1, otherwise 0. This ties the ground truth to the HTML element changes that follow.
4. Why can HTML elements serve as ground truth?
- Objectivity: HTML is the lowest-level data structure behind the rendered page, independent of any interpreter or execution environment.
- Repeatable verification: loading the same page (under the same conditions) yields the same HTML structure, so tests can be repeated.
- Precise localization: HTML elements have definite IDs, class names, or structural positions, so tools and code can match them exactly with no subjective ambiguity.
- Generality: text, images, and interactive controls are all carried by HTML elements, so they work for all kinds of metrics.
5. A practical example:
Suppose we are building an e-commerce information-extraction agent and want to measure its ability to extract the price, title, and stock information:
- We know in advance that on a given page the product title should be inside the `<h1 class="product-title">` element, the price inside `<span class="price">`, and the stock status inside `<div class="stock-status">`.
- As ground truth, we annotate these elements in the test dataset (e.g., archive the HTML fragment and the expected text values).
- When the agent runs, we do the following:
  - The agent returns its extraction result (perhaps a JSON object containing "title", "price", and "stock" fields).
  - We check the agent's extraction against the known-correct product information in the HTML (the ground truth):
    - If the `title` field equals the text of `<h1 class="product-title">`, count it as correct.
    - If the `price` field matches the numeric text of the `<span class="price">` element, count it as correct.
    - If the `stock` field is consistent with the content shown in `<div class="stock-status">`, count it as correct.
  - Aggregating these correctness checks defines the metrics for the extraction task (accuracy, recall, F1 score, etc.).
  - All of these comparisons are made against HTML elements as the ground truth.
Summary:
HTML elements can serve as the "ground truth" for evaluating a web agent or information-extraction system because they provide a stable, objective, repeatably locatable structure. By using the tags, attributes, and text that explicitly exist in the HTML, and that can be located and verified, as the reference standard, we can define various metrics that quantify how correctly and effectively the system understands and operates on the page.
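As a small sketch of how a couple of these metrics could be computed, the snippet below scores extraction accuracy against ground truth stored as CSS selectors (requires `beautifulsoup4`); the selectors and field names are illustrative:

```python
from bs4 import BeautifulSoup

GROUND_TRUTH = {                                   # hypothetical per-page annotation
    "title": "h1.product-title",
    "price": "span.price",
    "stock": "div.stock-status",
}

def extraction_accuracy(html: str, agent_output: dict) -> float:
    """Fraction of fields whose extracted text exactly matches the ground-truth element."""
    soup = BeautifulSoup(html, "html.parser")
    correct = 0
    for field, selector in GROUND_TRUTH.items():
        node = soup.select_one(selector)
        expected = node.get_text(strip=True) if node else None
        if expected is not None and agent_output.get(field, "").strip() == expected:
            correct += 1
    return correct / len(GROUND_TRUTH)
```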
The following lays out a fairly systematic process for using HTML elements as ground truth for DPO (Direct Preference Optimization) training, so that the model's capability improves step by step with each iteration. Assume the task is to have the model correctly extract specific information from web pages (such as the product title, price, or stock status) or perform the correct action (such as clicking the right button), and that we already have a set of ground-truth data annotated with HTML elements to use as a reference.
Recap of concepts:
- Ground truth: for each task (e.g., extracting a particular piece of information from a page), we have a correct answer defined in terms of HTML elements. For example, the price should be extracted from `<span class="price">`, and the product title is the text inside `<h1 class="product-title">`.
- DPO (Direct Preference Optimization): a method that integrates graded preferences over model outputs directly into training. It is similar to RLHF (reinforcement learning from human feedback), but it brings the preference data in directly through the optimization objective, making the model more inclined to generate highly preferred answers.
Training data and annotation:
- Queries: prepare a set of inputs representing the tasks, for example:
  - given an HTML fragment, extract the price information;
  - given a page structure, an instruction to type a keyword into the search bar; and so on. Each query is associated with a page DOM or HTML subtree.
- Reference Responses: based on the HTML-element ground truth, construct a "correct" answer for each query in advance. For instance, if the task is to extract the product price, the reference response is the text string of the correct price. If the task is to perform a click, the reference response is a description of that action being done correctly, such as "correctly clicked at `<button id="buy-now">`."
- Model Candidate Responses: the answers the current model produces for the same queries.
With HTML elements as the ground truth, it is easy to judge how good a model answer is:
- If the value the model extracts exactly matches the text inside `<span class="price">`, the answer is of high quality.
- If the model's output deviates substantially, its preference score should be low.
Preference definition and DPO training steps:
- Preference data generation:
For each training instance (query + HTML fragment) we have:
  - the reference answer (or a set of reference answers), denoted `r_ref`;
  - the current model's answer, denoted `r_w`.
We score both against the ground truth. For example:
  - if the extraction in `r_w` matches the ground truth exactly, give `r_w` a high score (high preference);
  - if `r_w` deviates substantially from the ground truth, give `r_w` a low score.
At the same time, `r_ref` serves as a baseline: it can be the output of an existing higher-quality model or a manually annotated correct output, guaranteed to agree closely with the ground truth. We then compare `r_w` with `r_ref`: if `r_w` is closer to the ground truth, we consider `r_w` better than `r_ref`; if it is worse, we prefer `r_ref`. This gives us a preference pair (r_w vs r_ref) together with their relative preference ranking, which is exactly the preference information DPO needs.
- Core idea of DPO:
  - If `r_w` beats `r_ref` (i.e., `r_w` is closer to the ground truth), the optimization increases the probability of producing `r_w` and relatively decreases that of `r_ref`.
  - If `r_ref` is better than `r_w`, the opposite applies.
Through direct optimization, DPO makes the model favor answers with higher preference. It needs no explicit reward model (as in RLHF); instead, it converts the preference over the answer pair (r_w, r_ref) into a relative-likelihood optimization objective. At the formula level, DPO folds the preference information into a log-likelihood-ratio objective that encourages the model to assign higher probability to the more preferred answer (the one that better matches the ground truth).
In simplified form, it can be understood as follows.
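Written out, the standard DPO loss (with $y^{+}$ whichever of `r_w` / `r_ref` is closer to the ground truth, $y^{-}$ the other, $\pi_\theta$ the model being trained, $\pi_{0}$ a frozen reference copy, and $\beta$ a temperature) is:

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\!\left[\log \sigma\!\left(\beta\left(\log\frac{\pi_\theta(y^{+}\mid x)}{\pi_{0}(y^{+}\mid x)} - \log\frac{\pi_\theta(y^{-}\mid x)}{\pi_{0}(y^{-}\mid x)}\right)\right)\right]
$$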
- Iterative training loop:
Each training iteration can be broken into the following steps:
  - Step A: sample a batch of queries from the dataset. Each query comes with its ground-truth HTML elements and a correct reference output.
  - Step B: generation. Use the current model parameters to generate an answer `r_w` for each query. The reference answer `r_ref` (from a frozen reference model or from existing annotations) is also prepared.
  - Step C: preference computation. Compare `r_w` with `r_ref` against the ground truth. If `r_w` is closer to the ground truth than `r_ref` (e.g., a higher text-match score), prefer `r_w`; otherwise prefer `r_ref`. Assign preference scores accordingly, e.g., `r_w_score` = 1 (if `r_w` is better) or 0 (if worse), and `r_ref_score` = 1 - `r_w_score`.
  - Step D: optimization. DPO uses the scores of these preference pairs to compute gradients and update the model parameters, with the goal of shifting the model's distribution toward generating highly preferred answers. During backpropagation the parameters gradually move in the direction that satisfies the preference information.
  - Step E: repeat. By iterating this process, the model gets closer to the ground truth in every round of comparison against the reference answers, improving its performance on the target task.
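A sketch of the data side of Steps B-D: turning the ground-truth comparison into (prompt, chosen, rejected) records that any DPO trainer could consume; the exact-match scoring rule and field layout are illustrative assumptions (requires `beautifulsoup4`):

```python
from bs4 import BeautifulSoup

def ground_truth_text(html: str, selector: str) -> str:
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(strip=True) if node else ""

def make_preference_record(query: str, html: str, selector: str, r_w: str, r_ref: str):
    truth = ground_truth_text(html, selector)
    r_w_score = 1 if r_w.strip() == truth else 0     # simple exact-match scoring
    r_ref_score = 1 - r_w_score
    chosen, rejected = (r_w, r_ref) if r_w_score > r_ref_score else (r_ref, r_w)
    return {"prompt": f"{query}\n\n{html}", "chosen": chosen, "rejected": rejected}
```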
- Why iteration improves the model:
  - At the start, the model's answers may be far from the ground truth, so `r_ref` wins most of the time. DPO training pushes the model's answer distribution toward `r_ref`, and therefore toward the ground truth.
  - As training proceeds, the model's answers gradually improve and agree more closely with the ground truth. Once its outputs approach or exceed the quality of the reference answers, they receive the higher preference, continually reinforcing this "correct" behavior.
  - Over many iterations the model eventually learns an internal policy that makes it much more likely to directly produce answers close to the ground truth when it encounters similar pages and queries.
Summary:
By using HTML elements as ground truth to define explicit preference pairs (model output vs. reference output) and training with DPO, each iteration optimizes the model based on the outcome of these pairwise comparisons. Because the ground truth provides an objective yardstick of correctness, the model repeatedly learns and updates its parameters through comparison against the reference answers, steadily improving answer quality and performance on the target task.
The following is a fairly systematic approach: first define metrics for a small number of trajectories, then use those defined examples as few-shot samples to automatically generate metric definitions for many more trajectories. This lets us generalize to more metrics from a limited set of human-defined examples (few-shots) before having to define a large number of metrics by hand.
Concepts and background:
- Trajectory: in the context of a web agent or automated system, a trajectory is a sequence of solution steps that realizes a high-level intent, e.g., "search for a keyword -> click a specific link -> extract information -> record the result."
- Metrics: indicators used to evaluate whether each trajectory achieves its intended goal and how well it performs, e.g., accuracy, recall, execution time, whether key HTML elements were successfully matched.
- Few-shot definition: given a handful of trajectories whose metrics have already been explicitly defined as examples, have a model (or system) automatically define metrics for new trajectories based on those examples.
Basic idea:
- Manually define metrics for a few representative trajectories:
  - Pick a small number of representative trajectories, ideally covering different task types and page structures.
  - Manually define rigorous, clear metrics for these trajectories. For example:
    - For an "extract the price" trajectory: define accuracy, an error tolerance (e.g., a price error within ±1 USD counts as acceptable), and crawl speed (elapsed time).
    - For a "click a button and verify the page transition" trajectory: define success rate (whether the target button was clicked), whether the resulting page contains specific HTML elements, time taken, and so on.
  In this step you create a base annotated set that includes:
  - the trajectory description,
  - the target task and ground-truth definition,
  - explicit metrics (e.g., the criteria for judging a correct output and how a quantifiable score is computed).
- Use the defined trajectory-metric mappings as few-shot examples: organize these manually defined examples into a few-shot prompt or dataset format (as natural language or structured descriptions). With 3-5 high-quality examples, they can be fed as few-shot examples to a large language model (LLM) or an automated rule generator.
- Few-shot generalization to generate metrics for more trajectories: once you have these few-shot examples, for a new trajectory (C) you only need to provide:
  - the trajectory's description (its goal and the steps it will execute), and
  - the ground truth (e.g., the expected HTML element locations or the expected final result).
Then have the LLM (or your system) automatically generate a metrics definition for that trajectory following the pattern of the few-shot examples. The prompt given to the LLM would contain the few-shot examples followed by the new trajectory's description and ground truth. The LLM will then output a set of reasonable metric definitions for Trajectory C based on the patterns and logic of the earlier examples; it is likely to draw on existing patterns (extraction accuracy, timing metrics, structural match, etc.) and adapt them to the characteristics of the new trajectory.
- Automated evaluation and correction: since the metric definitions produced by the LLM may not be perfect, one strategy is to:
  - automatically review the LLM-generated metric definitions, using a rule-based checker (e.g., regular expressions or programmatic checks) to verify that the generated metrics are computable and follow the expected format;
  - have humans review a small number of the generated metric definitions, fine-tune or correct them, and put the corrected versions back into the few-shot set as examples, iteratively improving the quality of automatic generation.
- Continuously grow the few-shot example library: as metrics keep being defined for new trajectories and confirmed by humans, high-quality automatically generated results can be added back into the few-shot examples, enriching the set so that later automatically generated metric definitions match the target requirements ever more closely.
Benefits:
- It reduces the tedious work of manually defining metrics for every single trajectory.
- Guided by a handful of examples, the LLM (or a similar system) learns a set of general principles and reasoning patterns, and can then produce many more metric definitions quickly and at scale.
- As the example set keeps growing, the quality of automatic generation keeps improving.
Summary:
First define clear metrics by hand for a few representative trajectories, provide those defined trajectories to an LLM as few-shot examples, and then have the model automatically generate metric definitions for the large number of new trajectories that follow. Along the way, keep reviewing the automatically generated results and updating the few-shot examples, gradually raising the quality of the automated metric definitions.
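A sketch of how the few-shot prompt in this workflow could be assembled; the example records and the requested output format are assumptions, not a prescribed template:

```python
# Illustrative few-shot prompt construction for automatic metric definition.

FEW_SHOT_EXAMPLES = [        # hypothetical manually defined trajectory -> metric examples
    {"trajectory": "Search for a product and extract its price from span.price",
     "ground_truth": "price text inside <span class='price'>",
     "metrics": "extraction accuracy (exact match); price error within ±1 USD; elapsed time"},
    {"trajectory": "Click the buy-now button and verify the order page loads",
     "ground_truth": "<div id='order-confirmation'> appears after the click",
     "metrics": "interaction success (0/1); presence of confirmation element; time taken"},
]

def build_metric_prompt(new_trajectory: str, new_ground_truth: str) -> str:
    blocks = [
        f"Trajectory: {ex['trajectory']}\nGround truth: {ex['ground_truth']}\nMetrics: {ex['metrics']}"
        for ex in FEW_SHOT_EXAMPLES
    ]
    blocks.append(
        f"Trajectory: {new_trajectory}\nGround truth: {new_ground_truth}\nMetrics:"
    )
    return "\n\n".join(blocks)    # the LLM completes the final Metrics field
```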
- Author: fufu酱
- Link: https://csfufu.life/article/56371125-2b2c-448d-b11a-6dc0c0ee996b
- Notice: this article is licensed under CC BY-NC-SA 4.0; please credit the source when reposting.