License: CC BY-NC-SA 4.0
arXiv:2501.10120v2 [cs.IR] 27 May 2025

PaSa: An LLM Agent for Comprehensive Academic Paper Search

Yichen He1Guanhua Huang1Peiyuan Feng1Yuan Lin1
Yuchen Zhang1Hang Li1Weinan E2
1ByteDance Seed  2Peking University 
{hyc,huangguanhua,fpy,linyuan.0}@bytedance.com,
{zhangyuchen.zyc,lihang.lh}@bytedance.com, weinan@math.pku.edu.cn
Demo: https://pasa-agent.ai
Abstract

We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark of real-world academic queries, to assess PaSa's performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.

* Equal contribution.  † Corresponding author.

1 Introduction

Academic paper search lies at the core of research yet represents a particularly challenging information retrieval task. It requires long-tail specialized knowledge, comprehensive survey-level coverage, and the ability to address fine-grained queries. For instance, consider the query: "Which studies have focused on non-stationary reinforcement learning using value-based methods, specifically UCB-based algorithms?" While widely used academic search systems like Google Scholar are effective for general queries, they often fall short when addressing such complex queries (Gusenbauer and Haddaway, 2020). Consequently, researchers frequently spend substantial time conducting literature surveys (Kingsley et al., 2011; Gusenbauer and Haddaway, 2021).

The advancements in large language models (LLMs) (OpenAI, 2023; Anthropic, 2024; Gemini, 2023; Yang et al., 2024) have inspired numerous studies leveraging LLMs to enhance information retrieval, particularly by refining or reformulating search queries to improve retrieval quality (Alaofi et al., 2023; Li et al., 2023; Ma et al., 2023; Peng et al., 2024). In academic search, however, the process goes beyond simple retrieval. Human researchers not only use search tools, but also engage in deeper activities, such as reading relevant papers and checking citations, to perform comprehensive and accurate literature surveys.

Figure 1: Architecture of PaSa. The system consists of two LLM agents, the Crawler and the Selector. The Crawler processes the user query and can access papers from the paper queue. It can autonomously invoke the search tool, expand citations, or stop processing the current paper. All papers collected by the Crawler are appended to the paper queue. The Selector reads each paper in the paper queue to determine whether it meets the criteria specified in the user query.

In this paper, we introduce PaSa, a novel paper search agent designed to mimic human behavior for comprehensive and accurate academic paper searches. As illustrated in Figure 1, PaSa consists of two LLM agents: the Crawler and the Selector. For a given user query, the Crawler can autonomously collect relevant papers by utilizing search tools or extracting citations from the current paper, which are then added to a growing paper queue. The Crawler iteratively processes each paper in the paper queue, navigating citation networks to discover increasingly relevant papers. The Selector carefully reads each paper in the paper queue to determine whether it meets the requirements of the user query. We optimize PaSa within AGILE, a reinforcement learning (RL) framework for LLM agents (Feng et al., 2024).

Effective training requires high-quality academic search data. Fortunately, human scientists have already created a vast number of high-quality academic papers, which contain extensive surveys on a wide range of research topics. We build a synthetic but high-quality academic search dataset, AutoScholarQuery, which collects fine-grained scholarly queries and their corresponding relevant papers from the related work sections of papers published at ICLR 2023 (https://iclr.cc/Conferences/2023), ICML 2023 (https://icml.cc/Conferences/2023), NeurIPS 2023 (https://neurips.cc/Conferences/2023), ACL 2024 (https://2024.aclweb.org/), and CVPR 2024 (https://cvpr.thecvf.com/Conferences/2024). AutoScholarQuery includes 33,511 / 1,000 / 1,000 query-paper pairs in the training / development / test split.

Although AutoScholarQuery only provides query and paper answers, without demonstrating the path by which scientists collect the papers, we can utilize it to perform RL training to improve PaSa. In addition, we design a new session-level PPO (Proximal Policy Optimization; Schulman et al., 2017) training method to address the unique challenges of the paper search task: 1) sparse reward: the papers in AutoScholarQuery are collected via citations, making them a small subset of the actual qualified paper set; 2) long trajectories: the complete trajectory of the Crawler may involve hundreds of papers, which is too long to input directly into the LLM context.

To evaluate PaSa, besides the test set of AutoScholarQuery, we also develop a benchmark, RealScholarQuery. It contains 50 real-world academic queries with annotated relevant papers, to assess PaSa in real-world scenarios. We compare PaSa with several baselines including Google, Google Scholar, Google paired with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (the PaSa agent realized by prompting GPT-4o). Our experiments show that PaSa-7B significantly outperforms all baselines. Specifically, on the AutoScholarQuery test set, PaSa-7B achieves a 34.05% improvement in Recall@20 and a 39.36% improvement in Recall@50 compared to Google with GPT-4o, the strongest Google-based baseline. PaSa-7B surpasses PaSa-GPT-4o by 11.12% in recall, with similar precision. On RealScholarQuery, PaSa-7B outperforms Google with GPT-4o by 37.78% in Recall@20 and 39.90% in Recall@50, and surpasses PaSa-GPT-4o by 30.36% in recall and 4.25% in precision.
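Recall@k here presumably follows the standard definition: the fraction of annotated answer papers that appear among the top-k retrieved results. A minimal sketch under that assumption (the function name and list-based interface are ours):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant papers that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Toy example: 3 of the 4 relevant papers appear among the top 20 results.
retrieved = ["p1", "p2", "p3", "x"] + [f"y{i}" for i in range(16)]
print(recall_at_k(retrieved, ["p1", "p2", "p3", "p4"], 20))  # 0.75
```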

The main contributions of this paper are summarized as follows:

  • We introduce PaSa, a comprehensive and accurate paper search agent that can autonomously use online search tools, read entire papers, and navigate citation networks.
  • We develop two high-quality datasets for complex academic search, AutoScholarQuery and RealScholarQuery.
  • Although PaSa is trained solely on synthetic data, it achieves remarkable real-world performance. Experiments demonstrate that PaSa, built on a 7B LLM, significantly outperforms all baselines, including the GPT-4o agent, Google-based search, and ChatGPT.

2 Related Work

LLMs in Scientific Discovery

LLMs have been applied across various stages of scientific discovery (Van Noorden and Perkel, 2023; Lu et al., 2024; Messeri and Crockett, 2024; Liao et al., 2024), such as brainstorming ideas (Girotra et al., 2023; Wang et al., 2024a; Baek et al., 2024), designing experiments (M. Bran et al., 2024), writing code (Xu et al., 2022), and generating research papers (Shao et al., 2024; Agarwal et al., 2024; Wang et al., 2024b). One of the most fundamental yet critical stages in research is conducting academic surveys. Despite its importance, current tools like Google Scholar are often insufficient, leading researchers to spend considerable time on literature review tasks (Kingsley et al., 2011; Gusenbauer and Haddaway, 2021, 2020). This challenge motivates us to develop PaSa, an LLM agent designed to autonomously and comprehensively assist researchers in collecting relevant research papers for complex scholarly queries.

LLM Agents

LLM Agents combine LLMs with memory, tool use, and planning, enabling them to perform more complex tasks such as personal copilots (Stratton, 2024), travel planning (Gundawar et al., 2024), web operations (Deng et al., 2024), software development (Qian et al., 2023), and scientific experimentation (Bran et al., 2023). In addition to realizing LLM agents through prompt engineering (Park et al., 2023; Yao et al., 2023; Shinn et al., 2024; Chen et al., 2023), recent research has focused on optimizing and training these agents (Feng et al., 2024; Putta et al., 2024; Liu et al., 2023). Among these efforts, AGILE (Feng et al., 2024), a reinforcement learning framework for LLM agents, allows the joint optimization of all agent skills in an end-to-end manner. In our work, we adopt the AGILE framework to implement PaSa. Specifically, we design a novel session-level PPO algorithm to address the unique challenges of the paper search task, including sparse rewards and long trajectories.

3 Datasets

3.1 AutoScholarQuery

AutoScholarQuery is a synthetic but high-quality dataset of academic queries and related papers, specifically curated for the AI field.

To construct AutoScholarQuery, we began by collecting all papers published at ICLR 2023, ICML 2023, NeurIPS 2023, ACL 2024, and CVPR 2024. For the Related Work section of each paper, we prompted GPT-4o (Hurst et al., 2024) to generate scholarly queries whose answers correspond to the references cited in that section. The prompt used is shown in Appendix H.1. For each query, we retained only the papers that could be retrieved on arXiv (https://arxiv.org/), using their arxiv_id as the unique article identifier in the dataset. We adopt the publication date of the source paper as the query date. During both training and testing, we only consider papers published prior to the query date.

The final AutoScholarQuery dataset comprises 33,551, 1,000, and 1,000 instances in the training, development, and testing splits, respectively. Each instance consists of a query, the associated paper set, and the query date, with queries in each split derived from distinct source papers. Table 1 provides illustrative examples from AutoScholarQuery, while additional dataset statistics are summarized in Table 2.
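Based on this description, each AutoScholarQuery instance can be pictured as a simple record; the field names below are illustrative, not necessarily the released schema:

```python
# One AutoScholarQuery instance: a scholarly query, its date, and the
# arXiv ids of the answer papers (field names here are hypothetical).
instance = {
    "query": "Which studies have focused on nonstationary RL using "
             "value-based methods, specifically UCB-based algorithms?",
    "query_date": "2023-08-10",  # publication date of the source paper
    "answer_papers": ["2006.14389", "2010.12870", "2010.04244"],
}

# During training and testing, only papers published before query_date count.
print(len(instance["answer_papers"]))  # 3
```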

To evaluate the quality of AutoScholarQuery, we sampled 100 query-paper pairs and assessed the rationality and relevance of each query and the corresponding paper. A qualified query should be meaningful and unambiguous. A qualified paper should match the requirements of the scholarly query. Detailed evaluation criteria are provided in Appendix A. Three authors manually reviewed each pair, determining that 94.0% of the queries were qualified. Among these qualified queries, 93.7% had corresponding papers that were deemed relevant and appropriate.

Query: Could you provide me some studies that proposed hierarchical neural models to capture spatiotemporal features in sign videos?
Query Date: 2023-05-02
Answer Papers:
[1] TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation (2010.05468)
[2] Sign Language Translation with Hierarchical Spatio-Temporal Graph Neural Network (2111.07258)
Source: SLTUnet: A Simple Unified Model for Sign Language Translation, ICLR 2023

Query: Which studies have focused on nonstationary RL using value-based methods, specifically Upper Confidence Bound (UCB) based algorithms?
Query Date: 2023-08-10
Answer Papers:
[1] Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism (2006.14389)
[2] Efficient Learning in Non-Stationary Linear Markov Decision Processes (2010.12870)
[3] Nonstationary Reinforcement Learning with Linear Function Approximation (2010.04244)
Source: Provably Efficient Algorithm for Nonstationary Low-Rank MDPs, NeurIPS 2023

Query: Which studies have been conducted in long-form text generation, specifically in story generation?
Query Date: 2024-01-26
Answer Papers:
[1] Strategies for Structuring Story Generation (1902.01109)
[2] MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models (2010.00840)
Source: ProxyQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models, ACL 2024

Table 1: Examples of queries and corresponding papers in AutoScholarQuery.
Conference      |P|    |Q|     Ans(/Q)  Ans-50  Ans-90
ICLR 2023       888    5204    2.46     2.0     5.0
ICML 2023       981    5743    2.37     2.0     5.0
NeurIPS 2023    1948   11761   2.59     2.0     5.0
CVPR 2024       1336   9528    2.94     2.0     6.0
ACL 2024        485    3315    2.16     2.0     4.0
Table 2: Statistics of AutoScholarQuery. |P| and |Q| represent the total number of papers and queries collected for each conference. Ans(/Q) denotes the average number of answer papers per query. Ans-50 and Ans-90 refer to the 50th and 90th percentiles of answer paper counts per query.
Figure 2: An example of the PaSa workflow. The Crawler runs multiple [Search] actions using diverse and complementary queries. In addition, the Crawler can evaluate the long-term value of its actions. Notably, it discovers many relevant papers as it explores deeper into the citation network, even when intermediate papers along the path do not align with the user query.

3.2 RealScholarQuery

To evaluate PaSa in more realistic scenarios, we constructed RealScholarQuery, a test dataset consisting of 50 real-world research queries. After launching the demo of PaSa, we invited several AI researchers to use the system. From the queries they provided, we randomly sampled a subset of queries and manually filtered out overly broad topics (e.g., "multimodal large language models," "video generation"). Ultimately, we collected 50 fine-grained and realistic queries.

For each query, we first manually gathered relevant papers to the best of our ability. To ensure comprehensive coverage, we then applied multiple methods to retrieve additional papers, including PaSa, Google, Google Scholar, ChatGPT (search-enabled GPT-4o), and Google paired with GPT-4o for paraphrased queries. As these methods also serve as baselines for comparison with PaSa, implementation details are deferred to Section 5.2. The results from all methods were aggregated into a pool of candidate papers. Finally, professional annotators reviewed all candidate papers for each query, selecting those that met the specific requirements of the query to create the final set of relevant papers. Annotation guidelines and quality control procedures are detailed in Appendix B. The query date of all instances in RealScholarQuery is 2024-10-01. Table 9 in Appendix C provides an example from RealScholarQuery.

The annotators included professors from the Department of Computer Science at a top-tier university in China. On average, each query required the annotators to review 76 candidate papers. We paid $4 per data entry (a query-paper pair), resulting in an average of $304 per query. Given the high annotation cost, we completed this process for only 50 instances. On average, each query is associated with 15.82 answer papers. The 50th percentile of answer counts per query is 9, while the 90th percentile reaches 37.

4 Methodology

4.1 Overview

As illustrated in Figure 1, the PaSa system consists of two LLM agents: the Crawler and the Selector. The Crawler reads the user's query, generates multiple search queries, and retrieves relevant papers. The retrieved papers are added to a paper queue. The Crawler further processes each paper in the paper queue to identify key citations worth exploring further, appending any newly relevant papers to the paper queue. The Selector conducts a thorough review of each paper in the paper queue to assess whether it fulfills the user's query requirements.

In summary, the Crawler is designed to maximize the recall of relevant papers, whereas the Selector emphasizes precision in identifying papers that meet the user's needs.

Name      Implementation
[Search]  Generate a search query and invoke the search tool. Append all resulting papers to the paper queue.
[Expand]  Generate a subsection name, then add all papers referenced in that subsection to the paper queue.
[Stop]    Reset the context to the user query and the next paper in the paper queue.
Table 3: Functions of the Crawler.

4.2 Crawler

In RL terminology, the Crawler performs a token-level Markov Decision Process (MDP). The action space $\mathcal{A}$ corresponds to the LLM's vocabulary, where each token represents an action. The LLM functions as the policy model. The agent's state is defined by the current LLM context and the paper queue. The Crawler operates with three registered functions, as outlined in Table 3. When an action matches a function name, the corresponding function is executed, further modifying the agent's state.

For example, as Figure 2 shows, the agent begins by receiving a user query, incorporating it into its context, and initiating actions. If the token generated is [Search], the LLM continues to generate a search query, and the agent invokes a search tool to retrieve papers, which are then added to the paper queue. If the token is [Expand], the LLM continues to extract a subsection name from the current paper in its context. The agent then extracts all referenced papers within that subsection, adding them to the paper queue. If the token is [Stop], the agent resets its context to the user query and information of the next paper in the paper queue. This information includes the title, abstract, and an outline of all sections and subsections.
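The control flow described above can be sketched roughly as follows; the policy, search tool, and citation extractor are stubbed out, and all names are ours rather than PaSa's actual implementation:

```python
from collections import deque

def run_crawler(query, policy, search_tool, get_refs, max_steps=100):
    """Toy Crawler loop over [Search] / [Expand] / [Stop] and a paper queue."""
    paper_queue = deque()
    context = [query]     # current LLM context
    found = []            # every paper collected so far
    for _ in range(max_steps):
        action, arg = policy(context)    # e.g. ("[Search]", "UCB non-stationary RL")
        if action == "[Search]":
            new_papers = search_tool(arg)            # arg: generated search query
        elif action == "[Expand]":
            new_papers = get_refs(context[-1], arg)  # arg: subsection name
        else:  # "[Stop]": reset context to the query and the next queued paper
            if not paper_queue:
                break
            context = [query, paper_queue.popleft()]
            continue
        paper_queue.extend(new_papers)
        found.extend(new_papers)
    return found
```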

The training process for the Crawler comprises two stages. In the first stage, we generate trajectories for a small subset of the training data and then perform imitation learning (see Appendix D.1 for details). In the second stage, reinforcement learning is applied. The details of the RL training implementation are described below.

Reward Design

We conduct RL training on the AutoScholarQuery training set, where each instance consists of a query $q$ and a corresponding paper set $\mathcal{P}$. Starting with a query $q$, the Crawler generates a trajectory $\tau=(s_1,a_1,\cdots,s_T,a_T)$. At each time step $t$, we denote the current paper queue as $\mathcal{Q}_t$. Upon taking action $a_t$, the Crawler appends a set of new papers $(p_1,p_2,\cdots,p_{n_t})$ to the paper queue. If $a_t=\texttt{[Stop]}$, the set is empty and no papers are added.

The reward of executing action $a_t$ in state $s_t$ is defined as

$$r(s_t,a_t)=\alpha\times\sum_{i=1}^{n_t}\mathbb{I}(q,p_i,t)-c(a_t),\quad(1)$$

where $\mathbb{I}(q,p_i,t)=1$ if $p_i$ matches the query $q$ and is not already in $\mathcal{Q}_t$, and $\mathbb{I}(q,p_i,t)=0$ otherwise. Here, $\alpha$ is a reward coefficient, and $c(a_t)$ is the cost of action $a_t$.

The indicator function $\mathbb{I}(q,p_i,t)$ can be determined by checking whether $p_i$ belongs to $\mathcal{P}-\mathcal{Q}_t$. However, it is important to note that AutoScholarQuery may include only a subset of the ground-truth papers, as citations often emphasize a limited number of key references. If the Crawler received rewards solely for matching papers in AutoScholarQuery, rewards would be sparse during training. To mitigate this, we use the Selector as an auxiliary reward model for the Crawler. The revised definition of $\mathbb{I}(q,p_i,t)$ is:

$$\mathbb{I}(q,p_i,t)=\begin{cases}1,&\text{if }\left(\text{Selector}(q,p_i)=1\text{ or }p_i\in\mathcal{P}\right)\text{ and }p_i\notin\mathcal{Q}_t,\\0,&\text{otherwise.}\end{cases}\quad(2)$$

Here $\text{Selector}(q,p_i)=1$ if the Selector identifies paper $p_i$ as meeting the query $q$, and $\text{Selector}(q,p_i)=0$ otherwise.
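Equations (1) and (2) can be sketched directly; the Selector is stubbed as a boolean predicate, and `alpha` and the flat per-action cost are placeholder hyperparameters of ours:

```python
def step_reward(query, new_papers, queue, answer_set, selector,
                alpha=1.0, action_cost=0.1):
    """Reward for one Crawler action, following Eqs. (1)-(2).

    A new paper counts if the Selector accepts it OR it is an
    AutoScholarQuery answer, and it is not already in the queue.
    """
    hits = 0
    for p in new_papers:
        if (selector(query, p) or p in answer_set) and p not in queue:
            hits += 1  # indicator I(q, p, t) = 1
    return alpha * hits - action_cost

# Toy check: "a" is a fresh answer paper, "b" is already in the queue.
r = step_reward("q", ["a", "b"], queue={"b"}, answer_set={"a", "b"},
                selector=lambda q, p: False)
print(r)  # 1.0 * 1 - 0.1 = 0.9
```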

RL Training

A key challenge in training the Crawler with RL is the significant time required to sample a complete trajectory for a given query. This is due to each [Search] or [Expand] action adding multiple papers to the paper queue, resulting in hundreds or even thousands of papers in the final paper queue.

To address this issue, we define a session as a sub-trajectory that ends with the [Stop] action, after which a new session begins. We identify two types of initial states for such sub-trajectories: $S_q$, containing only the user query, and $S_{q+p}$, containing both the query and a paper. $S_q$ represents the task's starting point, where the LLM context includes only the query. In contrast, $S_{q+p}$ arises after a [Stop] action, where the LLM context is reset to the query and the next paper in the queue.

Formally, we model the Crawler as a policy $\pi_\theta(a_t|s_t)$. We partition the entire trajectory $\tau=(s_1,a_1,\cdots,s_T,a_T)$ into a sequence of sessions $(\tau_{t_1:t_2-1},\tau_{t_2:t_3-1},\cdots)$. Each session is $\tau_{t_i:t_{i+1}-1}=(s_{t_i},a_{t_i},\cdots,s_{t_{i+1}-1},a_{t_{i+1}-1})$, where the initial state $s_{t_i}$ belongs to either type $S_q$ or $S_{q+p}$, and the final action $a_{t_{i+1}-1}$ is [Stop].
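Partitioning a trajectory into sessions amounts to cutting it after every [Stop] action; a minimal sketch (the representation of steps as (state, action) pairs is ours):

```python
def split_sessions(trajectory):
    """Split a list of (state, action) steps into sessions ending at [Stop]."""
    sessions, current = [], []
    for step in trajectory:
        current.append(step)
        if step[1] == "[Stop]":
            sessions.append(current)
            current = []
    if current:  # trailing steps without a final [Stop], if any
        sessions.append(current)
    return sessions

traj = [("s1", "[Search]"), ("s2", "[Stop]"), ("s3", "[Expand]"), ("s4", "[Stop]")]
print([len(s) for s in split_sessions(traj)])  # [2, 2]
```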

Sampling such a sub-trajectory from these session initial states is computationally efficient. During PPO training, at time step $t\in[t_i,t_{i+1})$, we estimate the return within the session using Monte Carlo sampling:

R^t\displaystyle\hat{R}_{t} =\displaystyle= k=tti+11γ0kt[r(sk,ak)\displaystyle\sum_{k=t}^{t_{i+1}-1}\gamma_{0}^{k-t}\bigg[r(s_{k},a_{k})
+γ1j=1nkV^ϕ(Sq+pj)]βlogπθ(at|st)πsft(at|st)\displaystyle+\gamma_{1}\sum_{j=1}^{n_{k}}\hat{V}_{\phi}(S_{q+p_{j}})\bigg]-\beta\cdot\log\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\rm sft}(a_{t}|s_{t})}

Here, γ0\gamma_{0} is the in-session discount factor, and γ1\gamma_{1} is the across-session discount factor. V^ϕ()\hat{V}_{\phi}(\cdot) is the value function model to approximate the state value. After executing aka_{k}, the paper queue is updated to include the newly found papers (p1,p2,,pnk)(p_{1},p_{2},\cdots,p_{n_{k}}). Since the Crawler will subsequently initiate new sessions to process these additional papers, their associated reward-to-go should be incorporated into the return estimate. In addition, we include a per-token KL penalty term from the learned policy πθ\pi_{\theta} to the initial policy πsft\pi_{\rm sft} obtained through imitation learning at each token to mitigate over-optimization. This term is scaled by the coefficient β\beta.
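To make the return estimate concrete, the following Python sketch computes $\hat{R}_{t}$ by Monte Carlo over a finished session. The discount factors, KL coefficient, and function names are illustrative assumptions, not the paper's actual values or implementation.

```python
def session_return(rewards, new_paper_values, logp_theta, logp_sft,
                   t, gamma0=1.0, gamma1=0.1, beta=0.01):
    """Monte Carlo estimate of the session return R_t.

    rewards[k]           -- r(s_k, a_k) for each step k of the session
    new_paper_values[k]  -- value estimates V(S_{q+p_j}) for the papers
                            found at step k (their across-session reward-to-go)
    logp_theta, logp_sft -- log-probs of a_t under the learned and SFT
                            policies, for the per-token KL penalty
    """
    ret = 0.0
    for k in range(t, len(rewards)):
        step = rewards[k] + gamma1 * sum(new_paper_values[k])
        ret += (gamma0 ** (k - t)) * step
    # KL penalty toward the imitation-learning policy, scaled by beta
    ret -= beta * (logp_theta - logp_sft)
    return ret
```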

Then the advantage function can be approximated by

\hat{A}(s_{t},a_{t}) = \hat{R}_{t} - \hat{V}_{\phi}(s_{t}). (4)

Finally, the policy and value objectives can be given by

\mathcal{L}_{\text{policy}}(\theta) = \mathbb{E}_{\tau'\sim\pi_{\theta}^{\text{old}}}\Bigg[\min\bigg(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta}^{\text{old}}(a_{t}|s_{t})}\hat{A}(s_{t},a_{t}),\ \text{clip}\Big(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta}^{\text{old}}(a_{t}|s_{t})},\ 1-\epsilon,\ 1+\epsilon\Big)\hat{A}(s_{t},a_{t})\bigg)\Bigg] (5)

and

\mathcal{L}_{\text{value}}(\phi) = \mathbb{E}_{\tau'\sim\pi_{\theta}^{\text{old}}}\Bigg[\max\bigg(\Big(\hat{R}_{t}-\hat{V}_{\phi}(s_{t})\Big)^{2},\ \Big(\hat{R}_{t}-\hat{V}_{\phi}^{\rm clip}(s_{t})\Big)^{2}\bigg)\Bigg] (6)

respectively, where

\hat{V}_{\phi}^{\rm clip}(s_{t}) = \text{clip}\Big(\hat{V}_{\phi}(s_{t}),\ V_{\phi}^{\text{old}}(s_{t})-\epsilon,\ V_{\phi}^{\text{old}}(s_{t})+\epsilon\Big). (7)

Here, $\pi_{\theta}^{\text{old}}$ and $V_{\phi}^{\text{old}}$ are used for sampling, and $\tau'$ is a session trajectory. We then combine these into the unified RL loss:

\mathcal{L}_{\text{RL}}(\theta,\phi) = \mathcal{L}_{\text{policy}}(\theta) + \eta\cdot\mathcal{L}_{\text{value}}(\phi) (8)

where $\eta$ is the coefficient of the value objective.
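The clipped objectives in Equations 5 through 8 can be sketched for a single time step in plain Python, with scalars standing in for what would be batched tensor operations in a real PPO implementation; the default $\epsilon$ and $\eta$ here are illustrative, not the paper's hyperparameters.

```python
import math

def ppo_policy_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective (Eq. 5) for one action; returns the
    negated objective so that minimizing the loss maximizes it."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return -min(ratio * advantage, clipped * advantage)

def value_loss(v_new, v_old, ret, eps=0.2):
    """Clipped value loss (Eqs. 6-7): the value estimate is clipped to
    within eps of the old value, and the larger squared error is taken."""
    v_clip = max(min(v_new, v_old + eps), v_old - eps)
    return max((ret - v_new) ** 2, (ret - v_clip) ** 2)

def rl_loss(logp_new, logp_old, advantage, v_new, v_old, ret, eta=0.5):
    """Unified RL loss (Eq. 8): policy loss plus eta times value loss."""
    return (ppo_policy_loss(logp_new, logp_old, advantage)
            + eta * value_loss(v_new, v_old, ret))
```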

4.3 Selector

The Selector is an LLM agent that takes two inputs: a scholar query and a research paper (including its title and abstract). It generates two outputs: (1) a single decision token $d$, either "True" or "False", indicating whether the paper satisfies the query, and (2) a rationale $r=(r_{1},r_{2},\ldots,r_{m})$ of $m$ tokens supporting this decision. The rationale serves two purposes: enhancing decision accuracy by jointly training the model to generate decisions and explanations, and improving user trust by surfacing the reasoning in the PaSa application.

To optimize training efficiency for the Crawler, the decision token is generated before the rationale, allowing the Selector to act as a single-token reward model during Crawler training. In addition, the probability of the decision token can be used to rank search results. Finally, as shown in Table 6, the order of the decision and rationale does not affect the Selector's performance.
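As a sketch of how the decision-token probability could rank results — the helper, mapping, and threshold below are hypothetical, not the paper's code:

```python
def rank_papers(papers, decision_prob, threshold=0.5):
    """Keep papers whose P(d = "True") clears the threshold and rank
    them by that probability, highest first. `decision_prob` stands in
    for the Selector's probability of the "True" decision token."""
    selected = [p for p in papers if decision_prob[p] >= threshold]
    return sorted(selected, key=lambda p: decision_prob[p], reverse=True)
```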

We perform imitation learning to optimize the Selector. See Appendix E for training data collection and training details.

5 Experiments

5.1 Experimental Setting

We sequentially trained the Selector and Crawler, both based on Qwen2.5-7b (Yang et al., 2024), to develop the final agent, referred to as PaSa-7b.

Selector

The Selector was fine-tuned on the training dataset described in Appendix E. We conducted supervised fine-tuning for one epoch with a learning rate of 1e-5 and a batch size of 4. Training ran on 8 NVIDIA H100 GPUs.

Crawler

The training process involves two stages. First, we perform imitation learning for one epoch on 12,989 training examples with a learning rate of 1e-5 and a per-device batch size of 4, using 8 NVIDIA H100 GPUs. In the second stage, we apply PPO training. To ensure stability, we first freeze the policy model and train the value model, then co-train both the policy and value models. The hyperparameters used during training are listed in Table 12 in Appendix D.2.

During imitation learning the model encounters 5,000 queries, while during the RL training phase it processes a total of 16,000 queries. See Appendix D.1 for the imitation-learning data construction and Appendix D.2 for the PPO training data sampling.

Implementation of [Search]

The LLM predicts a query based on the context. The agent then calls Google (accessed via the Google Search API provided by https://serper.dev) with the parameters site:arxiv.org and before:query_date, restricting search results by source and publication time.
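A minimal sketch of such a call, assuming serper.dev's JSON search endpoint; the exact payload fields are an assumption — consult their documentation before relying on them.

```python
import json
import urllib.request

def build_query(query, query_date):
    """Append the source and publication-time restrictions used by [Search]."""
    return f"{query} site:arxiv.org before:{query_date}"

def google_search(query, query_date, api_key, num=10):
    """POST the restricted query to serper.dev's Google Search API and
    return the organic results."""
    payload = json.dumps({"q": build_query(query, query_date), "num": num})
    req = urllib.request.Request(
        "https://google.serper.dev/search",
        data=payload.encode("utf-8"),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["organic"]
```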

Paper Management

We developed a database to manage and store research papers. PaSa retrieves paper information from the database. If no matching record is found, we use ar5iv (https://ar5iv.org/) to obtain the full paper content, including citations, and then parse this data and store it in the database.
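A sketch of this lookup-with-fallback pattern using SQLite; the schema and the `fetch_fn` hook, standing in for the ar5iv download-and-parse step, are hypothetical.

```python
import sqlite3

def get_paper(conn, arxiv_id, fetch_fn):
    """Look up a paper in the local database; on a miss, fetch the full
    content (e.g. from ar5iv), store it, and return it. `fetch_fn` is a
    hypothetical callable for the download + parse step."""
    row = conn.execute(
        "SELECT title, abstract FROM papers WHERE arxiv_id = ?", (arxiv_id,)
    ).fetchone()
    if row is not None:
        return {"arxiv_id": arxiv_id, "title": row[0], "abstract": row[1]}
    paper = fetch_fn(arxiv_id)  # download and parse full text + citations
    conn.execute(
        "INSERT INTO papers (arxiv_id, title, abstract) VALUES (?, ?, ?)",
        (arxiv_id, paper["title"], paper["abstract"]),
    )
    return paper
```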

Method               Crawler Recall  Precision  Recall   Recall@100  Recall@50  Recall@20
Google               -               -          -        0.2015      0.1891     0.1568
Google Scholar       -               -          -        0.1130      0.0970     0.0609
Google with GPT-4o   -               -          -        0.2683      0.2450     0.1921
ChatGPT*             -               0.0507     0.3046   -           -          -
GPT-o1               -               0.0413     0.1925   -           -          -
PaSa-GPT-4o          0.7565          0.1457     0.3873   -           -          -
PaSa-7b              0.7931          0.1448     0.4834   0.6947      0.6334     0.5301
PaSa-7b-ensemble     0.8265          0.1410     0.4985   0.7099      0.6386     0.5326

Table 4: Results on the AutoScholarQuery test set. *: Due to the need for manual query submission, the ChatGPT baseline is evaluated on 100 randomly sampled instances. Results for all methods on this subset are reported in Table 14.
Method               Crawler Recall  Precision  Recall   Recall@100  Recall@50  Recall@20
Google               -               -          -        0.2535      0.2342     0.1834
Google Scholar       -               -          -        0.2809      0.2155     0.1514
Google with GPT-4o   -               -          -        0.2946      0.2573     0.2020
ChatGPT              -               0.2280     0.2007   -           -          -
GPT-o1               -               0.058      0.0134   -           -          -
PaSa-GPT-4o          0.5494          0.4721     0.3075   -           -          -
PaSa-7b              0.7071          0.5146     0.6111   0.6929      0.6563     0.5798
PaSa-7b-ensemble     0.7503          0.4938     0.6488   0.7281      0.6877     0.5986

Table 5: Results on RealScholarQuery.

5.2 Baselines and Evaluation

We evaluate our paper search agent on both the test set of AutoScholarQuery and RealScholarQuery. We compare PaSa-7b against the following baselines:

  • Google. We use Google to search the query directly, with the same parameter settings as in Section 5.1.

  • Google Scholar. Queries are submitted directly to Google Scholar, with the same parameter settings as in Section 5.1.

  • Google with GPT-4o. We first employ GPT-4o to paraphrase the scholar query. The paraphrased query is then searched on Google.

  • ChatGPT. We submit the scholar query to ChatGPT (https://chatgpt.com), powered by search-enabled GPT-4o.

  • GPT-o1. We prompt GPT-o1 to process the scholar query. Note that it does not have access to external search tools.

  • PaSa-GPT-4o. We implement PaSa as illustrated in Figure 1 by prompting GPT-4o. It can perform multiple searches, paper reading, and citation network crawling.

We carefully designed prompts for all baselines and they are shown in Appendix H.2. All baselines, except PaSa-GPT-4o, represent the best-known scholar search methods. These comparisons highlight the effectiveness of our agentic approach. The comparison with PaSa-GPT-4o isolates the impact of RL training.

As shown in Figure 2, the crawling process of PaSa can be visualized as a paper tree. In practice, considering the computational expense, we limit the Crawler’s exploration depth to three for both PaSa-7b and PaSa-GPT-4o.

For Google-based baselines, we evaluate recall using Recall@20, Recall@50, and Recall@100 metrics for the top-20, top-50, and top-100 search results, respectively. For other baselines that do not produce rankings, we assess precision and recall for the final retrieved papers. Additionally, we compare Crawler recall between PaSa-GPT-4o and PaSa-7b, defined as the proportion of target papers collected by the Crawler. This measures how many target papers are successfully included in the paper queue generated by the Crawler.
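These metrics can be sketched as follows, assuming papers are identified by unique ids (the function names are illustrative):

```python
def recall_at_k(ranked, targets, k):
    """Fraction of target papers appearing in the top-k ranked results."""
    top = set(ranked[:k])
    return sum(1 for t in targets if t in top) / len(targets)

def crawler_recall(queue, targets):
    """Fraction of target papers present in the Crawler's paper queue."""
    q = set(queue)
    return sum(1 for t in targets if t in q) / len(targets)
```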

Method                           Precision  Recall  F1
GPT-4o                           0.96       0.69    0.80
Qwen-2.5-7b                      1.0        0.38    0.55
PaSa-7b-Selector                 0.95       0.78    0.85
PaSa-7b-Selector (Reason First)  0.94       0.76    0.84

Table 6: Selector Evaluation.

5.3 Main results

Method               AutoScholarQuery                     RealScholarQuery
                     Crawler Recall  Precision  Recall    Crawler Recall  Precision  Recall
w/o [Expand]         0.3355          0.1445     0.2536    0.3359          0.6738     0.2890
w/o RL training      0.6556          0.1476     0.4210    0.4847          0.5155     0.4115
w/o Selector as RM   0.7041          0.1535     0.4458    0.5994          0.5489     0.5148
PaSa-7b              0.7931          0.1448     0.4834    0.7071          0.5146     0.6111

Table 7: Ablation study results on the AutoScholarQuery test set and RealScholarQuery.

As shown in Table 4, PaSa-7b outperforms all baselines on AutoScholarQuery test set. Specifically, compared to the strongest baseline, PaSa-GPT-4o, PaSa-7b demonstrates a 9.64% improvement in recall with comparable precision. Moreover, the recall of the Crawler in PaSa-7b is 3.66% higher than that in PaSa-GPT-4o. When compared to the best Google-based baseline, Google with GPT-4o, PaSa-7b achieves an improvement of 33.80%, 38.83% and 42.64% in Recall@20, Recall@50 and Recall@100, respectively.

We observe that ensembling multiple runs of the Crawler during inference can improve performance. Specifically, in the PaSa-7b-ensemble setting we use sampling decoding to run the Crawler twice, which boosts Crawler recall by 3.34% on AutoScholarQuery and increases the final recall by 1.51%, with no significant change in precision.
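The ensemble can be sketched as a deduplicating union of the runs' paper queues — an assumption about how results are merged, not the paper's stated procedure:

```python
def ensemble_crawl(runs):
    """Merge the paper queues from multiple Crawler runs (e.g. two
    sampling-decoded runs), deduplicating while preserving first-seen order."""
    seen, merged = set(), []
    for run in runs:
        for paper in run:
            if paper not in seen:
                seen.add(paper)
                merged.append(paper)
    return merged
```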

To evaluate PaSa in a more realistic setting, we assess its effectiveness on RealScholarQuery. As illustrated in Table 5, PaSa-7b exhibits a greater advantage in real-world academic search scenarios. Compared to PaSa-GPT-4o, PaSa-7b achieves improvements of 30.36% in recall and 4.25% in precision. Against the best Google-based baseline on RealScholarQuery, Google with GPT-4o, PaSa-7b outperforms Google by 37.78%, 39.90%, and 39.83% in recall@20, recall@50 and recall@100, respectively. Additionally, the PaSa-7b-ensemble further enhances crawler recall by 4.32%, contributing to an overall 3.52% improvement in the recall of the entire agent system.

As both the final decision-maker and auxiliary reward model in RL training for the Crawler, the performance of the Selector is crucial. To evaluate its effectiveness, we collected a dataset of 200 query-paper pairs, annotating whether each paper meets the query’s requirements. This dataset serves as the benchmark for evaluating the Selector (see Appendix F for details). We then compared our Selector against GPT-4o Hurst et al. (2024) and Qwen-2.5-7b Yang et al. (2024), as shown in Table 6. The results show that our Selector achieves an F1 score of 85%, outperforming GPT-4o by 5% and Qwen-2.5-7b by 30%. Additionally, when compared to a setting where reasoning precedes decision token generation, the performance is comparable. Lastly, the Selector’s precision reaches 95%, confirming its effectiveness as an auxiliary reward model for the Crawler RL training.
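The F1 scores reported in Table 6 are the harmonic mean of precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```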

5.4 Ablation study

We perform ablation studies in Table 7 to evaluate the individual contributions of exploring citation networks, RL training, and using the Selector as the reward model. The results indicate that removing the [Expand] action from the Crawler leads to a significant drop in the recall: a decrease of 22.98% on AutoScholarQuery and 32.21% on RealScholarQuery. Furthermore, RL training enhances recall by 6.24% on AutoScholarQuery and 19.96% on RealScholarQuery. The RL training curves are depicted in Figure 3 in Appendix D.2, where the training curves show a steady increase in return with the training steps, eventually converging after 200 steps. Finally, removing the Selector as an auxiliary reward model results in a 3.76% recall drop on AutoScholarQuery and a 9.63% drop on RealScholarQuery.

We investigate how to control agent behavior by adjusting the rewards in RL training. Experiments are conducted with varying reward coefficients α\alpha in Equation 1, and results are presented in Table 8. We report two metrics: crawler recall and crawler action. The crawler action refers to the total number of [Search] and [Expand] actions throughout the Crawler’s entire trajectory. As the reward increases, both crawler recall and crawler action increase, suggesting that adjusting rewards in RL training can effectively influence PaSa’s behavior.

$\alpha$  Crawler Recall  Crawler Actions  Precision  Recall
0.5       0.7227          175.9            0.1458     0.4602
1.0       0.7708          319.8            0.1451     0.4792
1.5       0.7931          382.4            0.1448     0.4834
2.0       0.8063          785.5            0.1409     0.4869

Table 8: Performance of the Crawler trained with different reward coefficients $\alpha$ on the AutoScholarQuery test set.

6 Conclusion

In this paper, we introduce PaSa, a novel paper search agent designed to provide comprehensive and accurate results for complex academic queries. PaSa is implemented within AGILE, a reinforcement learning framework for LLM agents. To train PaSa, we developed AutoScholarQuery, a dataset of fine-grained academic queries and corresponding papers drawn from top-tier AI conference publications. To evaluate PaSa in real-world scenarios, we also constructed RealScholarQuery, a dataset of actual academic queries paired with annotated papers. Our experimental results demonstrate that PaSa outperforms all baselines, including Google, Google Scholar, Google with GPT-4o, ChatGPT, GPT-o1, and PaSa-GPT-4o. In particular, PaSa-7B surpasses Google with GPT-4o by 37.78% in recall@20 and 39.90% in recall@50, while also exceeding PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. These findings underscore that PaSa significantly improves the efficiency and accuracy of academic search.

Limitations

(1) Our dataset collection and experiments focused primarily on the field of machine learning. Although our proposed method is general, we did not explore its performance in other scientific fields. We leave investigating its applicability to other domains to future work.

(2) Due to resource constraints, our experiments primarily use LLMs with 7B parameters. We expect that scaling up to larger models will lead to more powerful agents; expanding PaSa to leverage larger LLMs is left for future work.

Acknowledgments

The authors thank Yaohua Fang, Zheng Li, Qiang Lu, Yelong Shi, Xuguang Wei, and Tingshuai Yan from ByteDance for their support in developing the PaSa demo. We also thank Jianghui Xie at ByteDance for her assistance with the release of the PaSa demo. Finally, we thank the anonymous reviewers for their valuable suggestions that helped improve this work.

References

  • Agarwal et al. (2024) Shubham Agarwal, Issam H Laradji, Laurent Charlin, and Christopher Pal. 2024. Litllm: A toolkit for scientific literature review. arXiv preprint arXiv:2402.01788.
  • Alaofi et al. (2023) Marwah Alaofi, Luke Gallagher, Mark Sanderson, Falk Scholer, and Paul Thomas. 2023. Can generative llms create query variants for test collections? an exploratory study. In Proceedings of the 46th international ACM SIGIR conference on research and development in information retrieval, pages 1869–1873.
  • Anthropic (2024) A Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku; 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
  • Baek et al. (2024) Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. 2024. Researchagent: Iterative research idea generation over scientific literature with large language models. arXiv preprint arXiv:2404.07738.
  • Bran et al. (2023) Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. 2023. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376.
  • Chen et al. (2023) Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F Karlsson, Jie Fu, and Yemin Shi. 2023. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288.
  • Deng et al. (2024) Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2024. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36.
  • Feng et al. (2024) Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, and Hang Li. 2024. Agile: A novel reinforcement learning framework of llm agents. Advances in Neural Information Processing Systems, 37:5244–5284.
  • Gemini (2023) Team Gemini. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  • Girotra et al. (2023) Karan Girotra, Lennart Meincke, Christian Terwiesch, and Karl T Ulrich. 2023. Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071.
  • Gundawar et al. (2024) Atharva Gundawar, Mudit Verma, Lin Guan, Karthik Valmeekam, Siddhant Bhambri, and Subbarao Kambhampati. 2024. Robust planning with llm-modulo framework: Case study in travel planning. arXiv preprint arXiv:2405.20625.
  • Gusenbauer and Haddaway (2020) Michael Gusenbauer and Neal R Haddaway. 2020. Which academic search systems are suitable for systematic reviews or meta-analyses? evaluating retrieval qualities of google scholar, pubmed, and 26 other resources. Research synthesis methods, 11(2):181–217.
  • Gusenbauer and Haddaway (2021) Michael Gusenbauer and Neal R Haddaway. 2021. What every researcher should know about searching–clarified concepts, search advice, and an agenda to improve finding in academia. Research synthesis methods, 12(2):136–147.
  • Hurst et al. (2024) Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276.
  • Kingsley et al. (2011) Karl Kingsley, Gillian M Galbraith, Matthew Herring, Eva Stowers, Tanis Stewart, and Karla V Kingsley. 2011. Why not just google it? an assessment of information literacy skills in a biomedical science curriculum. BMC medical education, 11:1–8.
  • Li et al. (2023) Minghan Li, Honglei Zhuang, Kai Hui, Zhen Qin, Jimmy Lin, Rolf Jagerman, Xuanhui Wang, and Michael Bendersky. 2023. Generate, filter, and fuse: Query expansion via multi-step keyword generation for zero-shot neural rankers. arXiv preprint arXiv:2311.09175.
  • Liao et al. (2024) Zhehui Liao, Maria Antoniak, Inyoung Cheong, Evie Yu-Yen Cheng, Ai-Heng Lee, Kyle Lo, Joseph Chee Chang, and Amy X Zhang. 2024. Llms as research tools: A large scale survey of researchers’ usage and perceptions. arXiv preprint arXiv:2411.05025.
  • Liu et al. (2023) Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, and Zhaoran Wang. 2023. Reason for future, act for now: A principled framework for autonomous llm agents with provable sample efficiency. arXiv preprint arXiv:2309.17382.
  • Lu et al. (2024) Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. 2024. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292.
  • M. Bran et al. (2024) Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. 2024. Augmenting large language models with chemistry tools. Nature Machine Intelligence, pages 1–11.
  • Ma et al. (2023) Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283.
  • Messeri and Crockett (2024) Lisa Messeri and MJ Crockett. 2024. Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002):49–58.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Park et al. (2023) Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22.
  • Peng et al. (2024) Wenjun Peng, Guiyang Li, Yue Jiang, Zilong Wang, Dan Ou, Xiaoyi Zeng, Derong Xu, Tong Xu, and Enhong Chen. 2024. Large language model based long-tail query rewriting in taobao search. In Companion Proceedings of the ACM on Web Conference 2024, pages 20–28.
  • Putta et al. (2024) Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. 2024. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199.
  • Qian et al. (2023) Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023. Communicative agents for software development. arXiv preprint arXiv:2307.07924.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shao et al. (2024) Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. 2024. Assisting in writing Wikipedia-like articles from scratch with large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6252–6278, Mexico City, Mexico. Association for Computational Linguistics.
  • Shinn et al. (2024) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.
  • Stratton (2024) Jess Stratton. 2024. An introduction to microsoft copilot. In Copilot for Microsoft 365: Harness the Power of Generative AI in the Microsoft Apps You Use Every Day, pages 19–35. Springer.
  • Van Noorden and Perkel (2023) Richard Van Noorden and Jeffrey M Perkel. 2023. Ai and science: what 1,600 researchers think. Nature, 621(7980):672–675.
  • Wang et al. (2024a) Qingyun Wang, Doug Downey, Heng Ji, and Tom Hope. 2024a. SciMON: Scientific inspiration machines optimized for novelty. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 279–299, Bangkok, Thailand. Association for Computational Linguistics.
  • Wang et al. (2024b) Yidong Wang, Qi Guo, Wenjin Yao, Hongbo Zhang, Xin Zhang, Zhen Wu, Meishan Zhang, Xinyu Dai, Min Zhang, Qingsong Wen, et al. 2024b. Autosurvey: Large language models can automatically write surveys. arXiv preprint arXiv:2406.10252.
  • Xu et al. (2022) Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 1–10.
  • Yang et al. (2024) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
  • Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: synergizing reasoning and acting in language models (2022). arXiv preprint arXiv:2210.03629.
Query: Give me papers about how to rank search results by the use of LLM
查询: 请给我一些关于如何利用大型语言模型(LLM)排名搜索结果的论文
Query Date: 2024-10-01
查询日期:2024-10-01
Answer Papers:  答题试卷: [0] Instruction Distillation Makes Large Language Models Efficient Zero-shot Rankers (2311.01555)
[0] 指令蒸馏使大型语言模型成为高效的零样本排名器(2311.01555)
[1] Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels (2310.14122)
[1] 超越是与否:通过评分细粒度相关性标签提升零射点大型语言模型排名者(2310.14122)
[2] Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting (2306.17563)
[2] 大型语言模型是有效的文本排序工具,采用成对排序提示(2306.17563)
[3] A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models (2310.09497)
[3] 基于大型语言模型进行高效零样本排序的按集合方法(2310.09497)
[4] RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models (2309.15088)
[4] RankVicuna:零样本列表文档重新排序,采用开源大型语言模型(2309.15088)
[5] PaRaDe: Passage Ranking using Demonstrations with Large Language Models (2310.14408)
[5] PaRaDe:使用大型语言模型演示进行文章排名(2310.14408)
[6] Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents (2304.09542)
[6] ChatGPT 擅长搜索吗?研究大型语言模型作为重新排序代理(2304.09542)
[7] Large Language Models are Zero-Shot Rankers for Recommender Systems (2305.08845)
[7] 大型语言模型是推荐系统中的零样本排名器(2305.08845)
[8] TourRank: Utilizing Large Language Models for Documents Ranking with a Tournament-Inspired Strategy (2406.11678)
[8] TourRank:利用大型语言模型进行文档排名,采用以锦标赛为灵感的策略(2406.11678)
[9] ExaRanker: Explanation-Augmented Neural Ranker (2301.10521)
[9] ExaRanker:解释增强神经排序器(2301.10521)
[10] RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs (2407.02485)
[10] RankRAG:大语言模型中与检索增强生成的统一上下文排名(2407.02485)
[11] Make Large Language Model a Better Ranker (2403.19181)
[11] 让大型语言模型成为更好的排名器(2403.19181)
[12] LLM-RankFusion: Mitigating Intrinsic Inconsistency in LLM-based Ranking (2406.00231)
[12] LLM-RankFusion:缓解基于 LLM 排名的内在不一致性(2406.00231)
[13] Improving Zero-shot LLM Re-Ranker with Risk Minimization (2406.13331)
[13] Improving Zero-shot LLM Re-Ranker with Risk Minimization (2406.13331)
[14] Zero-Shot Listwise Document Reranking with a Large Language Model (2305.02156)
[15] Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing (2404.11791)
[16] Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models (2406.18740)
[17] Large Language Models for Relevance Judgment in Product Search (2406.00247)
[18] PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval (2404.18424)
[19] Passage-specific Prompt Tuning for Passage Reranking in Question Answering with Large Language Models (2405.20654)
[20] When Search Engine Services meet Large Language Models: Visions and Challenges (2407.00128)
[21] RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! (2312.02724)
[22] Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models (2312.02969)
[23] MuGI: Enhancing Information Retrieval through Multi-Text Generation Integration with Large Language Models (2401.06311)
[24] Discrete Prompt Optimization via Constrained Generation for Zero-shot Re-ranker (2305.13729)
[25] REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering (2402.17497)
[26] Agent4Ranking: Semantic Robust Ranking via Personalized Query Rewriting Using Multi-agent LLM (2312.15450)
[27] FIRST: Faster Improved Listwise Reranking with Single Token Decoding (2406.15657)
[28] Leveraging LLMs for Unsupervised Dense Retriever Ranking (2402.04853)
[29] Unsupervised Contrast-Consistent Ranking with Language Models (2309.06991)
[30] Enhancing Legal Document Retrieval: A Multi-Phase Approach with Large Language Models (2403.18093)
[31] Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models (2310.07712)
[32] Fine-Tuning LLaMA for Multi-Stage Text Retrieval (2310.08319)
[33] Zero-shot Audio Topic Reranking using Large Language Models (2309.07606)
[34] Uncovering ChatGPT’s Capabilities in Recommender Systems (2305.02182)
[35] Cognitive Personalized Search Integrating Large Language Models with an Efficient Memory Mechanism (2402.10548)
[36] Towards More Relevant Product Search Ranking Via Large Language Models: An Empirical Study (2409.17460)
[37] Pretrained Language Model based Web Search Ranking: From Relevance to Satisfaction (2306.01599)
[38] Open-source large language models are strong zero-shot query likelihood models for document ranking (2310.13243)
Table 9: Examples of queries and corresponding papers in RealScholarQuery.

Appendix A Quality Evaluation of AutoScholarQuery

To assess the quality of AutoScholarQuery, we sampled 100 query-paper pairs and evaluated the rationality and relevance of each query and its corresponding paper. The detailed evaluation criteria are as follows:

  • A qualified query should be a complete and understandable sentence. For example, incomplete or fragmented sentences are not acceptable.


  • A query that misrepresents the meaning of the source paper, leading to irrelevant citations, is not qualified. This includes queries that exaggerate the scope or introduce incorrect conditions.


  • A query is ambiguous if there is insufficient context and additional information is needed. For instance, abbreviations with multiple meanings can create ambiguity, leading to the corresponding citations being incomplete answer lists.


  • An answer paper is considered qualified if it aligns with the requirements of the query. The paper should address all or most of the essential factors that make it a suitable response.



Our quality check found that 94.0% of the queries were qualified. Among them, 93.7% of the corresponding answer papers were also qualified. The primary reason for unqualified papers was inaccurate citations in the source paper.

The prompt for search query generation
You are an elite researcher in the field of AI, please generate some mutually exclusive queries in a list to search the relevant papers according to the User Query. Searching for a survey paper would be better.
User Query: {user_query}
The semantics between generated queries are not mutually inclusive. The format of the list is: [“query1”, “query2”, …]
Queries:
Table 10: The prompt for GPT-4o to generate search queries from the user query.
Search Session starting from S_q
Prompt:
Please, generate some mutually exclusive queries in a list to search the relevant papers according to the User Query. Searching for survey papers would be better.
User Query: {user_query}
Response:
[Search] {query 1}
[Search] {query 2}
[Stop]

Expand Session starting from S_{q+p}
Prompt:
You are conducting research on '{user_query}'. You need to predict which sections to look at to get more relevant papers.
Title: {title}
Abstract: {abstract}
Sections: {sections}
Response:
[Expand] {section 1}
[Expand] {section 2}
[Stop]

Table 11: The session trajectory templates of the Crawler.

Appendix B Annotation details

The annotators of RealScholarQuery include professors from the Department of Computer Science at a top-tier university in China. They are paid $4 per data entry, which consists of a user query and a research paper. Their task is to determine whether the paper satisfies the query.

B.1 Annotation Instructions

For each <user query, paper> pair, carefully assess whether the paper addresses the user query. Write your decision and provide a brief explanation (1-2 sentences). Specific guidelines are as follows:

  • You may read the entire paper to determine whether it satisfies the user query.


  • Exclude survey papers unless the user query specifically requests them.


  • All conditions in the user query must be met for the paper to be considered qualified.



B.2 Quality control

The annotation process follows these quality control measures:

  • Stage 1: Annotators work in batches of 20. Authors review 100% of the annotations. Once the consistency rate on the first pass reaches 90%, the process moves to Stage 2.


  • Stage 2: Annotators work in batches of 50. Authors randomly check 40% of the annotations. If the consistency rate is below 90%, the entire batch is re-annotated and re-checked. Once the batch meets the 90% consistency rate on the first pass, the process moves to Stage 3.


  • Stage 3: Annotators work in batches of 100. Authors randomly check 20% of the annotations. If the consistency rate is below 90%, the entire batch is re-annotated and re-checked.



Two authors conducted the quality control.

Appendix C Example in RealScholarQuery

Table 9 presents an example query and corresponding papers from RealScholarQuery.

Appendix D Implementation Details of the Crawler

D.1 Imitation learning data generation

We generate training data for imitation learning on a session-by-session basis. There are two types of sessions: search sessions (starting from state S_q) and expand sessions (starting from state S_{q+p}).

For search sessions starting from S_q, we sample user queries from the AutoScholarQuery training set and prompt GPT-4o to generate corresponding search queries. The prompt template is shown in Table 10. The session trajectory is constructed by adding a [Search] token before each query, concatenating the queries, and appending a [Stop] token at the end, as shown in Table 11. A total of 3,011 search session trajectories are generated.
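The trajectory construction described above can be sketched as follows; the function name and example queries are illustrative, not taken from the released code:

```python
def build_search_trajectory(queries):
    """Assemble a search-session trajectory following the Table 11
    template: a [Search] token before each query, the queries
    concatenated, and a [Stop] token appended at the end."""
    parts = [f"[Search] {q}" for q in queries]
    parts.append("[Stop]")
    return " ".join(parts)
```

For two generated queries `q1` and `q2`, this yields the string `[Search] q1 [Search] q2 [Stop]`.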

For expand sessions starting from S_{q+p}, we continue by searching for the generated queries using Google. We then sample papers from the search results and obtain the initial state, which includes both the query and a paper. To build the session trajectory, we examine each sub-section of the paper. If the sub-section references at least one paper in the AutoScholarQuery training set corresponding to the query, the sub-section is selected. Otherwise, the sub-section is selected with a 10% probability to enhance trajectory diversity. The selected sections are filled into the template in Table 11, completing the session trajectory. In total, 9,978 expand session trajectories are constructed.
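A minimal sketch of this section-selection rule, assuming a hypothetical `cited_in` mapping from each sub-section to the set of papers it cites (all names here are illustrative):

```python
import random

def select_sections(paper_sections, answer_papers, cited_in,
                    p_random=0.10, rng=None):
    """Pick the sub-sections that form an expand-session trajectory.

    A sub-section is always kept when it cites at least one paper from
    the AutoScholarQuery answer set for the query; otherwise it is kept
    with probability p_random (10%) to diversify trajectories.
    `cited_in` maps a section name to the set of papers it cites."""
    rng = rng or random.Random()
    selected = []
    for section in paper_sections:
        if cited_in.get(section, set()) & set(answer_papers):
            selected.append(section)   # cites an answer paper: always keep
        elif rng.random() < p_random:
            selected.append(section)   # random 10% inclusion for diversity
    return selected
```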

D.2 PPO training

During PPO training, each device processes 4 user queries in each step, generating a search session for each user query. Then, 6 expansion sessions are created by randomly sampling 6 papers from the search results. This process is repeated with the expanded citation results, yielding 6 additional expanded sessions. In total, 16 session trajectories are generated per step.
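The per-step rollout arithmetic can be checked with a small helper (purely illustrative; the function and its parameters are not from the released code):

```python
def sessions_per_step(num_queries=4, expand_from_search=6,
                      expand_from_citations=6):
    """Count the session trajectories one device generates per PPO step:
    one search session per user query, plus expand sessions sampled from
    the search results and from the expanded citation results."""
    return num_queries + expand_from_search + expand_from_citations

# 4 search sessions + 6 + 6 expand sessions = 16 trajectories per step
```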

Name | Value
α (Equation 1) | 1.5
c([Search]) (Equation 1) | 0.1
c([Expand]) (Equation 1) | 0.1
c([Stop]) (Equation 1) | 0.0
γ_0 (Equation 4.2) | 1.0
γ_1 (Equation 4.2) | 0.1
β (Equation 4.2) | 0.1
ε (Equation 5, Equation 4.2) | 0.2
η (Equation 8) | 10
learning rate | 1e-6
epochs per step | 2
forward batch size | 1
accumulate batch size | 16
NVIDIA H100 GPUs | 16
policy freezing steps | 50
total steps | 250
Table 12: The hyperparameters used in PPO training.
Figure 3: Return and value function loss curves during the PPO training process. The curves are smoothed with the exponential moving average (EMA) formula that aligns with the one used in TensorBoard, with the smoothing weight set to 0.95.

Table 12 lists the hyperparameters used during the training process. Figure 3 depicts the RL training curves, which show a steady increase in return with the training steps, eventually converging after 200 steps.
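The EMA smoothing mentioned in the Figure 3 caption can be sketched as below; the recurrence matches TensorBoard's classic smoothing slider, and this is not the paper's own plotting code:

```python
def ema_smooth(values, weight=0.95):
    """Smooth a training curve with the exponential moving average
    recurrence used by TensorBoard: each new point is blended into
    the running average with the given smoothing weight."""
    smoothed, last = [], values[0]
    for x in values:
        last = weight * last + (1 - weight) * x
        smoothed.append(last)
    return smoothed
```

A weight of 0.95 means each smoothed point carries roughly the last few dozen raw points, which is why the return curve in Figure 3 looks steady despite noisy per-step rollouts.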

Appendix E Implementation Details of the Selector

The prompt for paper selection
You are an elite researcher in the field of AI, conducting research on {user_query}. Evaluate whether the following paper fully satisfies the detailed requirements of the user query and provide your reasoning. Ensure that your decision and reasoning are consistent.
Searched Paper:
Title: {title}
Abstract: {abstract}
User Query: {user_query}
Output format: Decision: True/False
Reason: …
Decision:
Table 13: Prompt used by PaSa Selector or GPT-4o to evaluate paper relevance.
Method | Crawler Recall | Precision | Recall | Recall@100 | Recall@50 | Recall@20
Google | - | - | - | 0.2101 | 0.2010 | 0.1788
Google Scholar | - | - | - | 0.0801 | 0.0739 | 0.0612
Google with GPT-4o | - | - | - | 0.2101 | 0.2010 | 0.1788
ChatGPT | - | 0.0507 | 0.3046 | - | - | -
GPT-o1 | - | 0.0374 | 0.2006 | - | - | -
PaSa-GPT-4o | 0.7595 | 0.1817 | 0.4488 | - | - | -
PaSa-7b | 0.7752 | 0.1881 | 0.5328 | 0.6932 | 0.6543 | 0.5494
PaSa-7b-ensemble | 0.8244 | 0.1822 | 0.5568 | 0.7041 | 0.6795 | 0.5535
Table 14: Results on the 100-sample subset of the AutoScholarQuery test set.
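For reference, the Recall@k values reported above measure the fraction of a query's answer papers that appear among the top-k returned results; a minimal sketch (not the paper's evaluation code):

```python
def recall_at_k(ranked_papers, answer_papers, k):
    """Fraction of a query's answer papers that appear among the
    top-k entries of the ranked result list."""
    top_k = set(ranked_papers[:k])
    hits = sum(1 for p in answer_papers if p in top_k)
    return hits / len(answer_papers)
```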

We begin by sampling user queries from the AutoScholarQuery training set. For each user query and one of its corresponding papers in the AutoScholarQuery training set, we prompt GPT-4o to generate a decision token and rationale (see Table 13 for the prompt). We reject any data where the decision token is "False", as this contradicts the AutoScholarQuery label. The remaining data are retained as positive <user query, paper> pairs.

Next, we simulate a partial paper search using PaSa-GPT-4o. In this simulation, each paper has a 50% probability of being added to the paper queue. Pairs where the paper is not selected by GPT-4o and is not in the AutoScholarQuery training set are labeled as negative examples.
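The negative-example construction can be sketched as follows; the function and argument names are hypothetical, and the 50% queue-admission step is simulated with a coin flip:

```python
import random

def sample_negatives(crawled_papers, selected_by_gpt4o, answer_set, rng=None):
    """Simulate a partial paper search: each crawled paper enters the
    paper queue with 50% probability; a queued paper that GPT-4o did not
    select and that is not an AutoScholarQuery answer becomes a
    negative <user query, paper> example."""
    rng = rng or random.Random()
    negatives = []
    for paper in crawled_papers:
        if rng.random() >= 0.5:
            continue                     # paper not admitted to the queue
        if paper not in selected_by_gpt4o and paper not in answer_set:
            negatives.append(paper)
    return negatives
```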

The final training dataset consists of 19,812 <user query, paper> pairs, each with a decision token and rationale generated by GPT-4o, drawn from 9,000 instances in the AutoScholarQuery training set.

Appendix F Selector Test Dataset

We select 200 queries from the AutoScholarQuery development set. For each query, we perform a Google search and randomly choose one paper from the union of the search results and the relevant paper set in AutoScholarQuery. This yields a set of <user query, paper> pairs. Annotators then evaluate whether each paper aligns with the requirements of the user query. The final test dataset consists of 98 positive samples and 102 negative samples.

c(a_t) | Crawler Recall | Crawler Actions | Precision | Recall
0 | 0.8239 | 1296.3 | 0.1388 | 0.4852
0.1 | 0.7931 | 382.4 | 0.1448 | 0.4834
0.2 | 0.7478 | 230.1 | 0.1489 | 0.4764
Table 15: Performance of the Crawler trained with different action costs c(a_t) on the AutoScholarQuery test set.

Appendix G Additional Experimental Results

G.1 Results on 100-sample subset of AutoScholarQuery

To ensure a fair comparison with the ChatGPT baseline, which is evaluated on only 100 samples from AutoScholarQuery test, we report the performance of all methods on the same subset in Table 14. The results align with those in Table 4, confirming that PaSa-7b consistently outperforms all baselines.

G.2 Action cost

We incorporate action costs to prevent the agent from taking excessive, unproductive actions. Without such costs, the total number of actions would increase significantly without yielding meaningful outcomes.

The key considerations are the reward coefficient α and the action cost c(a_t) in Equation 1. In Table 8, we fix c(a_t) and analyze how varying α affects performance.

Additionally, Table 15 shows how different values of c(a_t) affect the final performance.
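Equation 1 itself appears in the main text rather than in this appendix; assuming it takes the form of α times the number of newly found answer papers minus the per-action cost (an assumption consistent with the hyperparameters in Table 12), a step reward could be sketched as:

```python
def step_reward(new_answer_papers, action, alpha=1.5, costs=None):
    """Assumed per-step reward: alpha times the number of newly found
    answer papers, minus the cost of the action taken. The default
    values mirror the hyperparameters listed in Table 12. This is a
    sketch of the assumed reward form, not the paper's implementation."""
    if costs is None:
        costs = {"[Search]": 0.1, "[Expand]": 0.1, "[Stop]": 0.0}
    return alpha * new_answer_papers - costs[action]
```

Under this form, an unproductive [Search] or [Expand] yields a small negative reward (-0.1), which is what discourages the Crawler from taking excessive actions.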

Appendix H Prompt Templates

H.1 Prompts used for data synthesis in AutoScholarQuery

Table 16 presents the prompt template used with GPT-4o to automatically generate AutoScholarQuery. For each paper, we extract its Related Work section, input it into GPT-4o, and use the prompt to generate scholarly queries along with their corresponding paper answers.

H.2 Prompts for baselines

Table 17 presents the search query paraphrasing prompt used for the baseline Google with GPT-4o.

Tables 18, 19, and 20 show the prompts used for the ChatGPT baseline (search-enabled GPT-4o), the GPT-o1 baseline, and PaSa-GPT-4o, respectively.

The prompt for AutoScholarQuery generation
You are provided a ‘Related Work’ section of a research paper. The researcher reviewed the relevant work, conducted a literature survey, and cited corresponding references in this text (enclosed by ‘\cite’ tags with IDs). Can you guess what research questions the researcher might have posed when preparing this text? The answers to these questions should be the references cited in this passage. Please list questions and provide the corresponding answers.
[Requirements:]
1. Craft questions similar to those a researcher would pose when reviewing related works, such as “Which paper studied …?”, “Any works about…?”, “Could you provide me some works…?”
2. Construct the question-answer pairs based on [Section from A Research Paper]. The answer should be the cited papers in [Section from A Research Paper].
3. Do not ask questions including "or" or "and" that may involve more than one condition.
4. Clarity: Formulate questions clearly and unambiguously to prevent confusion.
5. Contextual Definitions: Include explanations or definitions for specialized terms and concepts used in the questions.
6. Format the output as a JSON array containing five objects corresponding to the three question-answer pairs.
Here are some examples:
[Begin of examples]
{Section from A Research Paper-1}
{OUTPUT-1}
{Section from A Research Paper-2}
{OUTPUT-2}
{Section from A Research Paper-3}
{OUTPUT-3}
[End of examples]
{Section from A Research Paper}
[OUTPUT]:
Table 16: The prompt used with GPT-4o to automatically synthesize AutoScholarQuery.
The prompt for search query paraphrase
Generate a search query suitable for Google based on the given academic paper-related query. Here’s the structure and requirements for generating the search query:
Understand the Query: Read and understand the given specific academic query.
Identify Key Elements: Extract the main research field and the specific approaches or topics mentioned in the query.
Formulate the Search Query: Combine these elements into a concise query that includes terms indicating academic sources. Do not add any site limitations to your query.
[User’s Query]: {user_query}
[Generated Search Query]:
Table 17: The search query paraphrasing prompt used for the Google with GPT-4o baseline.
The prompt for ChatGPT (search-enabled GPT-4o)
[User’s Query]
You should return the Arxiv papers. You should provide more than 10 papers you searched in JSON format:
{"paper_1": {"title": , ’authors’: , ’link’: }, "paper_2": {"title": , ’authors’: , ’link’: }}
Table 18: The prompt for ChatGPT baseline (search-enabled GPT-4o).
The prompt for GPT-o1
{user_query} You should return arxiv papers. You should provide more than 10 paper you searched in JSON format: {"paper_1": {"title": , "authors": , "link": }, "paper_2": {"title": , "authors": , "link": }}. Do not return any other description.
Table 19: The prompt for GPT-o1 baseline.
The prompt for search session of Crawler
You are an elite researcher in the field of AI, please generate some mutually exclusive queries in a list to search the relevant papers according to the User Query. Searching for a survey paper would be better.
User Query: {user_query}
The semantics between generated queries are not mutually inclusive. The format of the list is: [“query1”, “query2”, …]
Queries:
The prompt for the expand session of Crawler
You are an elite researcher in the field of AI, currently conducting research on the [topic] specified below. Your task is to determine if the paper specified below likely contains highly relevant citations for the [topic] and, if so, to identify specific sections where these citations might appear.
Task Instructions:
1. Relevance Assessment: Decide if the paper is likely to include citations highly relevant to the given [topic]. Output "Yes" or "No" on the first line.
2. Section Selection: If you answered "Yes" in step 1, identify which sections of the paper are likely to contain these relevant citations. From the list of provided sections, select only those you think may contain relevant citations. If no sections seem relevant even if your answer to step 1 was "Yes," leave this empty. Output the selected sections in JSON format on the second line.
[topic]: {user_query}
[paper title]: {title}
[paper abstract]: {abstract}
[paper sections]: {sections}
Output Format: Output exactly two lines:
1. The first line: Either "Yes" or "No" based on the relevance assessment.
2. The second line: A JSON string with selected sections, e.g., {{"selected_section_1": section_name_1, "selected_section_2": section_name_2}}. If no sections are selected, output {{}}.
The prompt for Selector
You are an elite researcher in the field of AI, conducting research on {user_query}. Evaluate whether the following paper fully satisfies the detailed requirements of the user query and provide your reasoning. Ensure that your decision and reasoning are consistent.
Searched Paper:
Title: {title}
Abstract: {abstract}
User Query: {user_query}
Output format: Decision: True/False
输出格式:判定:真/假
Reason: …
Decision:
Table 20: The prompts for PaSa-GPT-4o.