Meta-Reinforcement Learning with Self-Reflection for Agentic Search
Preprint.

Meta-Reinforcement Learning with Self-Reflection for Agentic Search

Teng Xiao 1,2,* Yige Yuan 2,* Hamish Ivison 1,2 Huaisheng Zhu 3 Faeze Brahman 1 Nathan Lambert 1 Pradeep Dasigi 1,† Noah A. Smith 1,2,† Hannaneh Hajishirzi 1,2,†
1 Allen Institute for AI 2 University of Washington 3 Independent

Abstract

This paper introduces MR-Search, an in-context meta-reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating explicit self-reflections after each episode and leveraging them as additional context to guide subsequent attempts, thereby promoting more effective exploration at test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment for each episode. Empirical results across various benchmarks demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.

1 Introduction

Language models with advanced reasoning capabilities have driven substantial progress toward more autonomous and multi-step decision-making behaviors in complex tasks (Guo et al., 2025; Jaech et al., 2024). Examples include agentic search such as deep research (Du et al., 2025; Shao et al., 2025) and other information-seeking tasks (Mialon et al., 2023; Jin et al., 2025a), where LMs use search tools and engage in dynamic, multi-turn interactions. Reinforcement learning (RL) with the ReAct paradigm (Yao et al., 2022) has emerged as a primary framework for training search agents (Jin et al., 2025a; Zheng et al., 2025), replacing the traditional reliance on supervised data collection. These methods mainly focus on the correctness of the final answer and only receive sparse rewards at the end of each trajectory, without providing precise credit assignment for intermediate steps. Due to the sparse nature of outcome rewards, the agent often struggles to learn more complex processes and is susceptible to issues such as inefficient exploration at the early stage, local optima, and inefficient search dynamics (Zhang et al., 2025; Feng et al., 2025). These challenges become more pronounced in agentic search tasks, where multi-turn interactions with tools can amplify small errors and obscure credit assignment (Feng et al., 2025). To address the key challenge of sparse rewards, several works have explored using process reward models (Luo et al., 2024; Wang et al., 2023) or LM judges (Zheng et al., 2024; Deng et al., 2025). However, these approaches rely on external annotations, which are both costly and difficult to reuse when task requirements change. Moreover, model-based rewards are prone to reward hacking and bias (Wang et al., 2025a) and incur additional computational overhead during RL training (Yuan et al., 2024).
In this paper, we introduce MR-Search, a simple yet effective meta-RL approach that enables search agents to improve in-context exploration at test time. Our work is most closely related to in-context meta-reinforcement learning methods (Duan et al., 2016; Stadie et al., 2018; Laskin et al., 2023), which leverage in-context histories from a few initial exploration episodes to guide subsequent exploitation episodes. Unlike traditional meta-RL approaches (Duan et al., 2016; Stadie et al., 2018; Laskin et al., 2023) in robotics and games, we focus on open-domain agentic search tasks with tool interactions and self-reflection, without any reward feedback from the environment during inference.

* Equal contribution. Correspondence to tengx@allenai.org. † Equal senior authors.

Figure 1: RL-based agents (a) condition solely on the current episode, and episodes are independent, whereas meta-RL-based agents (b) leverage context accumulated across episodes. MR-Search performs sequential self-reflection over past episodes to guide exploration in subsequent episodes. In MR-Search, each inner-episode consists of at most T interaction steps followed by an answer, and a sequence of N inner-episodes forms a meta-episode.

As illustrated in Figure 1, MR-Search formulates agentic search as an iterative, self-reflection-driven process, instead of performing exploration through multiple independent episodes that operate in isolation as in RL-based agents. In MR-Search, each complete interaction trajectory ending in an answer is an episode and is followed by an explicit reflection step. This design enables sequential self-reflection and cross-episode knowledge consolidation in a multi-turn setting, transforming exploration from a set of disconnected attempts into a progressively informed search process. Thus, the agent learns to balance exploration and exploitation end-to-end by updating its search strategy according to final task performance. Our method can be seen as an instance of meta-learning in which we meta-learn how to generate effective self-reflections. To optimize the policy with multi-turn reflection, we use a multi-turn RL algorithm that estimates unbiased grouped relative advantages (Ahmadian et al., 2024) at the turn level to assign localized credit to self-reflection turns. As a result, MR-Search remains critic-free and, unlike PPO (Schulman et al., 2017), eliminates the need for auxiliary value models.

Our main contributions are: (i) We advocate for and formalize in-context meta-reinforcement learning as a practical and scalable bridge between meta-learning and reinforcement learning for agentic search, where ground-truth rewards are absent at inference time. (ii) We propose MR-Search, an effective multi-turn agentic search framework that performs cross-episode exploration by generating an explicit self-reflection after each interaction episode. (iii) Empirically, we validate the effectiveness of MR-Search across multiple multi-hop QA benchmarks, showing that it significantly outperforms prior methods. Specifically, MR-Search achieves an average relative improvement of 9.2% to 19.3% over strong baselines.

2 Related Work

RL for Agentic Search.
RL has emerged as a promising training paradigm for developing adaptive and autonomous search agents (Jin et al., 2025a; Wu et al., 2025b; Li et al., 2025a). Specifically, agentic search with RL trains LLMs as decision-making agents that interact with a search environment through reasoning, receive feedback, and iteratively refine their strategies to maximize task rewards. For instance, Search-R1 and ReSearch (Jin et al., 2025a; Chen et al., 2025) propose training LLM-based agents end-to-end using RL algorithms such as PPO (Schulman et al., 2017) or GRPO (Shao et al., 2024) under the ReAct paradigm (Yao et al., 2022). Despite significant progress, these methods rely solely on sparse outcome rewards, without providing precise credit assignment for effective exploration (Feng et al., 2025). Recently, several works (Deng et al., 2025; Wang et al., 2025b) have attempted to design process rewards for agentic search. However, these approaches require annotations at every intermediate step or rely on external evaluators, both of which are expensive to obtain. In contrast, our MR-Search aligns with in-context meta-reinforcement learning, enabling progressively targeted exploration driven by explicit cross-episode reflection.

Figure 2: An overview of our proposed MR-Search framework. Given a question, the agent first completes an initial episode by interleaving reasoning and tool calls. It then enters an iterative self-reflection loop, where previous episodes serve as experience to inform subsequent searches and answer revisions, enabling iterative improvement across episodes.

Meta-Reinforcement Learning. Our work is related to meta-RL methods that leverage in-context histories to guide exploration, with the goal of maximizing rewards in subsequent exploitation episodes in robotics and game domains (Melo, 2022; Laskin et al., 2023). Duan et al. (2016) and Wang et al. (2017) concurrently proposed the RL² framework, which formulates meta-reinforcement learning by feeding in-context episodes into a recurrent neural network (RNN) whose hidden state serves as a memory mechanism. Qu et al. (2025) formulate the optimization of test-time compute for LLMs as a meta-reinforcement learning problem. A concurrent work (Jiang et al., 2025) proposes meta-RL to encourage exploration in LLM agents with ground-truth state feedback. In contrast, we focus on open-domain LLM agentic tasks and do not access any environment feedback during inference.

LLMs with Self-Reflection. Previous studies investigate prompting-based strategies that enable LLMs to iteratively and sequentially refine their own generations via intrinsic feedback, effectively scaling test-time computation.
These approaches include self-correction (Huang et al., 2023), self-reflection (Shinn et al., 2023), and self-refine (Madaan et al., 2023). Beyond prompting alone, several subsequent works (Wan et al., 2025; Qu et al., 2024; Kumar et al., 2024; Xiong et al., 2025; Yuksekgonul et al., 2026) employ finetuning to train models for intrinsic self-correction. A recent concurrent work (Shi et al., 2026) also proposes experiential reinforcement learning to incorporate textual reflection feedback. In contrast, our work offers a novel meta-RL perspective on agentic search with tool interaction to catalyze continuous self-reflection, enabling the model to more effectively explore better answers.

Test-time Scaling. The paradigm of scaling test-time compute has emerged as a critical avenue for enhancing reasoning capabilities (Guo et al., 2025). Prior work explores two main dimensions: parallel sampling (Wang et al., 2022b) and sequential refinement (Madaan et al., 2023; Snell et al., 2024). Parallel sampling generates multiple answers independently, whereas sequential refinement generates answers sequentially, with each attempt conditioned on previous ones. Our work is most closely related to the sequential refinement paradigm. However, unlike these methods, which operate purely at inference time, this work proposes cross-episode meta-learning to enable continuous self-reflection in context, showing that meta-RL induces effective in-context exploration for agentic search.

Figure 3: We evaluate MR-Search, Search-R1 with sequential reflection inference (Search-R1-S), and Search-R1 with parallel sampling (Search-R1-P), selecting the most frequent answer among the generated trajectories. Shaded regions show the standard deviation across 3 runs. We observe that MR-Search achieves the best performance. See § 4.3 for details.

3 MR-Search: Meta-Reinforcement Learning for Agentic Search

3.1 Background

Given a dataset D consisting of question-answer pairs and an external search engine E, our goal is to train an LLM agent to produce the answer by iteratively performing reasoning and interacting with E. Many search agent frameworks build upon the ReAct paradigm (Yao et al., 2022; Jin et al., 2025a), where agents execute iterative cycles of reasoning and acting until a final answer is reached. When presented with a query, the agent conducts multiple rounds of thought-action-observation sequences. During each round, the RL-based agent $\pi_\theta$ formulates an internal Thought ($\tau$) based on the current context, executes an external Action ($\alpha$) such as a search query, and receives corresponding feedback from the search engine as a tool Observation ($x$), as shown in Figure 1. The interaction trajectory is denoted as:

$$a = (\tau_0, \alpha_0, x_0, \tau_1, \alpha_1, x_1, \ldots, \tau_{T-1}). \quad (1)$$

The first round contains the prompt, while the final round $\tau_{T-1}$ contains only the thought with the final answer $o$, without any further actions. Given this interaction process, we can directly maximize the RL objective with the final outcome rewards to optimize the policy:

$$\mathcal{J}(\pi_\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[ f_{\text{verifier}}(o, o^*) \right], \quad (2)$$

where $o$ denotes the final answer extracted from the completed trajectory $a$, $o^*$ is the ground-truth answer, and $f_{\text{verifier}}$ represents either a rule-based or model-based verifier.
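To make the interaction loop of Eqs. (1)-(2) concrete, the following is a minimal, illustrative sketch; it is not the authors' released code. Here `llm_generate` and `search_engine` are hypothetical stand-ins for the policy $\pi_\theta$ and the search engine E, and the tag format follows Search-R1-style prompts.

```python
# Illustrative ReAct-style rollout (Eq. 1) and rule-based outcome reward (Eq. 2).
import re

def rollout(prompt: str, llm_generate, search_engine, max_turns: int = 3) -> dict:
    """Interleave thought tau_t, action alpha_t, observation x_t until an answer."""
    context = prompt
    for _ in range(max_turns):
        step = llm_generate(context)                 # thought (+ optional action)
        context += step
        ans = re.search(r"<answer>(.*?)</answer>", step, re.S)
        if ans:                                      # final round: thought + answer only
            return {"trajectory": context, "answer": ans.group(1).strip()}
        query = re.search(r"<search>(.*?)</search>", step, re.S)
        if query:                                    # tool observation returned by E
            docs = search_engine(query.group(1).strip(), top_k=3)
            context += f"<information>{docs}</information>"
    return {"trajectory": context, "answer": None}

def outcome_reward(answer, gold) -> float:
    """A simple rule-based f_verifier: normalized exact match."""
    norm = lambda s: re.sub(r"\s+", " ", (s or "").lower()).strip()
    return float(norm(answer) == norm(gold))
```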
Although agentic search based on outcome rewards has demonstrated promising performance (Jin et al., 2025a; Sun et al., 2025; Chen et al., 2025; Shi et al., 2025; Wu et al., 2025b), the outcome rewards are sparse and delayed, leading to ambiguous credit assignment and ineffective search exploration (Liu et al., 2021; Feng et al., 2025; Wang et al., 2025b).

3.2 Meta-RL Framework for Agentic Search

In this section, we introduce MR-Search, a meta-RL framework for agentic search built on a cross-episode meta-training scheme. Each meta-episode is modeled as a sequence of episodes, encouraging early exploration and subsequent exploitation of accumulated context, as shown in Figure 1. By modeling prior episodes in context through a standardized self-reflection paradigm and propagating this information across episodes, MR-Search enables increasingly informed search, leading to broader exploration and reduced redundancy.

Algorithm 1 MR-Search: One-Sample Update
1: Input: policy $\pi_\theta$, sample $(x, o^*) \sim D$, group size $G$, reflection steps $N$
2: for $i = 1, \ldots, G$ do            ▷ Group of meta-episodes
3:     $C \leftarrow x$                 ▷ Initialize context
4:     for $n = 0, \ldots, N-1$ do      ▷ Inner episodes
5:         $a_{i,n} \sim \pi_\theta(\cdot \mid C)$
6:         $C \leftarrow C \oplus a_{i,n} \oplus \text{REFLECT}(C)$  ▷ Self-reflection
7:         $r_{i,n} \leftarrow f(a_{i,n}, o^*)$                      ▷ Reward
8: Compute $A_{i,n}$ via Eqs. (7)-(8)   ▷ RLOO advantage
9: $\theta \leftarrow \text{RLUPDATE}(\theta, \{A_{i,n}\})$ with Eq. (9)

The key motivations are twofold: (i) For LLM agents, model limitations often stem from insufficient exploration within a single trajectory rather than inadequate reasoning capacity (Shen et al., 2025); incorrect answers may consistently appear across multiple samples drawn in parallel, as shown by Fan et al. (2025) and Si et al. (2026). (ii) LLMs exhibit strong self-reflection capabilities, enabling inference-time adaptation that accumulates information from prior interactions across episodes, naturally leveraging in-context learning mechanisms. As shown in Figure 3, MR-Search substantially improves over baselines with both sequential reflection and parallel sampling, and its performance grows with more turns, suggesting that the ability for self-reflection emerges as models become stronger in-context meta-learners.

Given an input question, the search agent $\pi_\theta$ first generates the first episode, in which it interacts with external tools until reaching the final answer, following the procedure in § 3.1:

$$a_0 \sim \pi_\theta(a). \quad (3)$$

Unlike standard RL-based search agents (Jin et al., 2025a; Song et al., 2025), where episodes are independent, MR-Search conditions each episode on the preceding ones. After each episode, the search agent invokes the tool to iteratively improve its answer under the self-reflection paradigm. Specifically, given an initial episode, we apply a reflection prompt (Appendix A.1.3) that triggers the model to refine its answer by conditioning on previous episodes as context and producing an additional episode with another answer:

$$a_1 \sim \pi_\theta(a_1 \mid a_0), \quad a_2 \sim \pi_\theta(a_2 \mid a_0, a_1), \quad \ldots \quad (4)$$

By repeating this multi-turn self-reflection process $N-1$ times, each meta-episode in MR-Search consists of $N$ episodes sequentially generated by the search agent:

$$y = (a_0, a_1, \cdots, a_{N-1}), \quad (5)$$

where we can compute rewards independently for each episode using the answer $o_n$ extracted from the $n$th episode by the verifier.
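The inner loop of Algorithm 1 (lines 3-7) can be sketched as follows, reusing the hypothetical `rollout` and `outcome_reward` stand-ins from above; the reflection prompt is abbreviated from Appendix A.1.3.

```python
# Illustrative meta-episode generation: N inner episodes, each conditioned on
# the accumulated context C (Eqs. 3-5).
REFLECT = ("Reflect on your current answer to the question and provide another "
           "answer by searching for additional external information.")

def generate_meta_episode(question_prompt, gold, llm_generate, search_engine, N=4):
    """Sample episodes a_0 .. a_{N-1}; return the meta-episode and its rewards."""
    C, episodes, rewards = question_prompt, [], []
    for n in range(N):
        ep = rollout(C, llm_generate, search_engine)        # a_n ~ pi_theta(. | C)
        episodes.append(ep)
        rewards.append(outcome_reward(ep["answer"], gold))  # r_n = f(a_n, o*)
        # ep["trajectory"] already contains C, so the context grows across episodes:
        C = ep["trajectory"] + "\n" + REFLECT               # C <- C (+) a_n (+) reflect
    return episodes, rewards
```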
Given this multi-turn process, we can define the meta-level objective as maximizing the expected reward of the meta-episode:

$$\mathcal{J}_{\text{meta}}(\pi_\theta) = \mathbb{E}_{y \sim \pi_\theta}\left[ \sum_{n=0}^{N-1} \gamma^n R(s_n, a_n) \right] = \mathbb{E}_{y \sim \pi_\theta}\left[ \sum_{n=0}^{N-1} \gamma^n f_{\text{verifier}}(o_n, o^*) \right], \quad (6)$$

where $s_n = a_{<n}$ denotes the accumulated meta-context up to episode $n$, $o_n$ is the answer extracted from episode $a_n$, and $\gamma \in (0, 1]$ is the discount factor accounting for future returns. Unless otherwise specified, we set $\gamma = 1$ in this work.

By default, MR-Search conditions on trajectories and explicit reflections from all previous episodes, causing the context length to increase linearly with the number of reflection steps $N$. To mitigate this scalability issue, one can retain only the immediately preceding episode as context, or adopt a context management protocol that summarizes prior episodes before carrying them forward to subsequent episodes. In § 4.4, we empirically find that keeping only the immediately preceding episode as context also works well for MR-Search.

3.3 Policy Optimization with Multi-Turn Advantages

To optimize the policy, instead of estimating advantages using a separate value function as in PPO (Schulman et al., 2017), we propose an approach that incorporates turn-level reward signals while maintaining unbiased policy optimization. Specifically, for each question, we sample a group of $G$ meta-episodes, $\mathcal{G} = \{y_i\}_{i=1}^{G}$. To make rewards comparable across episodes and reduce variance, we aggregate rewards at the same episode index over meta-episodes and apply leave-one-out (RLOO) estimation (Kool et al., 2019; Ahmadian et al., 2024):

$$\tilde{r}_{i,n} = r(s_{i,n}, a_{i,n}) - \operatorname{mean}_{j \neq i} r(s_{j,n}, a_{j,n}) = r(s_{i,n}, a_{i,n}) - \frac{1}{G-1} \sum_{j \neq i} r(s_{j,n}, a_{j,n}), \quad (7)$$

where $\tilde{r}_{i,n}$ is the centered reward of the $n$th episode in the $i$th meta-episode (i.e., $y_{i,n}$). Compared to GRPO (Shao et al., 2024), RLOO provides an unbiased advantage estimate (Bereket & Leskovec, 2025). While $\tilde{r}_{i,n}$ captures the relative quality of each episode, it reflects only immediate effects and ignores the impact on future episodes. To include long-horizon dependencies, we compute a discounted cumulative advantage that propagates rewards backward to earlier turns:

$$A_{i,n} = \sum_{n'=n}^{N-1} \gamma^{n'-n} \tilde{r}_{i,n'}. \quad (8)$$

As the baseline used in RLOO, $\operatorname{mean}_{j \neq i} r(s_{j,n}, a_{j,n})$, does not depend on the current action $y_{i,n}$, it provides an unbiased estimate of the turn-level advantage (Sutton et al., 1999). With the discounted turn-level advantages above, we optimize the policy using a clipped surrogate off-policy objective as in PPO (Schulman et al., 2017). Formally, the objective is:

$$\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{n=1}^{|y_i|} \min\!\Big( \rho_{i,n}(\theta)\, A_{i,n},\ \operatorname{clip}\big(\rho_{i,n}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, A_{i,n} \Big), \qquad \rho_{i,n}(\theta) = \frac{\pi(y_{i,n} \mid x, y_{i,<n}; \theta)}{\pi(y_{i,n} \mid x, y_{i,<n}; \theta_{\text{old}})}, \quad (9)$$

where $\rho_{i,n}(\theta)$ is the ratio between the current policy $\pi(y_{i,n} \mid x, y_{i,<n}; \theta)$ and the old policy $\pi(y_{i,n} \mid x, y_{i,<n}; \theta_{\text{old}})$ over the steps, and we broadcast each step's advantage signal to all tokens in that step (Shao et al., 2024). $\varepsilon$ is the clipping ratio. We additionally mask out tool-output tokens from the loss, following previous work (Jin et al., 2025a). Optimizing the policy with the above objective enables the policy to capture both global trajectory quality and local step effectiveness.
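A minimal sketch (not the released implementation) of the turn-level RLOO advantage in Eqs. (7)-(8): `rewards[i, n]` is the verifier reward of episode n in meta-episode i, for a group of G meta-episodes with N episodes each.

```python
import numpy as np

def rloo_turn_advantages(rewards: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    G, N = rewards.shape
    # Eq. (7): leave-one-out baseline over the other G-1 meta-episodes
    # at the same episode index n.
    baseline = (rewards.sum(axis=0, keepdims=True) - rewards) / (G - 1)
    centered = rewards - baseline                      # r~_{i,n}
    # Eq. (8): discounted suffix sum propagates later rewards back to turn n.
    adv = np.zeros_like(centered)
    running = np.zeros(G)
    for n in reversed(range(N)):
        running = centered[:, n] + gamma * running
        adv[:, n] = running
    return adv

# Example: G = 3 meta-episodes, N = 2 episodes each, binary EM rewards.
print(rloo_turn_advantages(np.array([[0., 1.], [1., 1.], [0., 0.]])))
```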
In § 4.2, we compare our objective with PPO (Schulman et al., 2017) and MT-GRPO (Zeng et al., 2025) using our designed process rewards. The results show that our objective consistently achieves better performance under the meta-RL framework. The full algorithm of MR-Search (one-sample training) is summarized in Algorithm 1.

3.4 Discussion

Exploration & Exploitation. As discussed above, MR-Search leverages experience from previous episodes to guide subsequent exploration. By default, all episodes contribute rewards according to Eq. (8). To promote unstructured exploration, we can optionally mask rewards for designated exploration episodes while retaining rewards for exploitation episodes (Stadie et al., 2018). Exploration episodes are fully visible during the forward pass but receive zero reward during backpropagation, so gradients are driven only by exploitation episodes. Concretely, the advantage is computed using a masked return:

$$A_{i,n} = \sum_{n'=n}^{N-1} \gamma^{n'-n} \tilde{r}_{i,n'}\, m_{n'}, \quad (10)$$

where $m_{n'} \in \{0, 1\}$ indicates exploitation (1) or exploration (0). The policy gradient is therefore computed using this masked return rather than the standard discounted return. By zeroing out exploration rewards, we encourage the policy to prioritize long-term gains from improved context adaptation rather than short-term episode feedback. Although exploration episodes do not directly contribute to the gradient, they serve as contextual adaptation steps that improve environment identification and lead to higher rewards in subsequent exploitation episodes. We provide an empirical analysis of this strategy in § 4.4 and find that it is helpful for ASearcher, which requires multi-turn search. A sketch of this masked return is given below.
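The masked return of Eq. (10) only changes the backward accumulation: exploration episodes stay in the context but contribute zero reward. In this illustrative sketch, `centered` is $\tilde{r}_{i,n}$ from Eq. (7) and `mask[n]` is $m_n$ (1 = exploitation, 0 = exploration).

```python
import numpy as np

def masked_turn_advantages(centered: np.ndarray, mask: np.ndarray, gamma: float = 1.0):
    G, N = centered.shape
    adv = np.zeros_like(centered)
    running = np.zeros(G)
    for n in reversed(range(N)):
        running = centered[:, n] * mask[n] + gamma * running  # zero exploration reward
        adv[:, n] = running
    return adv

# Section 4.4 setting: first two episodes explore, last two exploit.
mask = np.array([0.0, 0.0, 1.0, 1.0])
```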
Meta-RL at the Step Level. While MR-Search models agentic search as a meta-episode composed of multiple reflection episodes, it treats each full trajectory with an answer as a single optimization unit. However, in long-horizon reasoning, inefficiencies often arise at a finer granularity, such as individual tool calls or intermediate reasoning steps. The same principle of MR-Search extends naturally to semantically meaningful sub-episodes. For instance, in agentic search, each tool-interaction step can be treated as a micro-episode. During training, we prompt the model to produce an intermediate answer after each tool call, as shown in Figure 8 in the Appendix, and evaluate it with the verifier to obtain step-level rewards. Converting these substructures into micro-episodes enables localized credit assignment, transforming long trajectories into reflection steps. This dense supervision promotes informative intermediate reasoning and reduces redundant exploration. We empirically show that this extension also achieves strong performance (§ 4.4).

4 Experiments

4.1 Experimental Setup

Datasets. We conduct evaluations on the following datasets: (1) general question answering: NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), and PopQA (Mallen et al., 2022); and (2) multi-hop question answering: HotpotQA (Yang et al., 2018), 2WikiMultiHopQA (Ho et al., 2020), Musique (Trivedi et al., 2022), and Bamboogle (Press et al., 2022). For training, we merge the NQ and HotpotQA training sets to construct a unified dataset for all finetuning approaches, following the setup in Jin et al. (2025b). For evaluation, we use the test or development splits of the seven datasets listed above. We additionally include a synthetic dataset, ASearcher (Gao et al., 2025), which is more complex than NQ/HotpotQA and requires long multi-turn search. We split ASearcher into 90% training and 10% evaluation sets. Detailed dataset descriptions and statistics are provided in Appendix A.1.1.

Table 1: Main accuracy (%) on search-based QA benchmarks. The best results are marked in boldface. We compare with baselines that rely on outcome rewards (ReSearch and Search-R1) and those that use process rewards with external models (PPRM and StepResearch).

Method           | NQ   | TriviaQA | PopQA | HotpotQA | 2wiki | Musique | Bamboogle | Avg.
---------------- | ---- | -------- | ----- | -------- | ----- | ------- | --------- | ----
Qwen2.5-3B       |      |          |       |          |       |         |           |
Direct Inference | 10.6 | 28.8     | 10.8  | 14.9     | 24.4  | 2.0     | 2.4       | 13.4
Search-o1        | 23.8 | 47.2     | 26.2  | 22.1     | 21.8  | 5.4     | 32.0      | 25.5
ReSearch         | 42.7 | 59.7     | 43.0  | 30.5     | 27.2  | 7.4     | 11.5      | 30.4
Search-R1        | 46.2 | 62.2     | 45.6  | 32.6     | 31.0  | 7.7     | 17.6      | 34.7
PPRM             | 42.3 | 56.5     | 41.1  | 35.3     | 34.0  | 12.7    | 28.0      | 35.7
StepResearch     | 44.6 | 61.5     | 45.6  | 37.3     | 33.8  | 10.5    | 32.5      | 38.0
MR-Search        | 47.7 | 63.5     | 46.0  | 41.9     | 40.1  | 16.5    | 34.4      | 41.4
Qwen2.5-7B       |      |          |       |          |       |         |           |
Direct Inference | 13.4 | 40.8     | 14.0  | 18.3     | 25.0  | 3.1     | 12.0      | 18.1
Search-o1        | 15.1 | 44.3     | 13.1  | 18.7     | 17.6  | 5.8     | 29.6      | 20.6
ReSearch         | 36.6 | 60.5     | 39.1  | 37.8     | 38.6  | 16.6    | 37.6      | 38.1
Search-R1        | 45.9 | 63.2     | 44.9  | 43.9     | 38.7  | 18.1    | 40.0      | 42.1
PPRM             | 45.8 | 61.0     | 43.7  | 38.6     | 35.5  | 14.7    | 35.5      | 39.3
StepResearch     | 47.3 | 63.6     | 43.1  | 43.9     | 41.8  | 20.5    | 43.5      | 43.4
MR-Search        | 50.2 | 66.6     | 47.2  | 46.8     | 43.6  | 22.1    | 45.2      | 46.0

Figure 4: Test performance, and training curves of reward and search frequency, on ASearcher, evaluated with Qwen2.5-7B-Base. Additional results are provided in Appendix A.2.1.

Evaluation Metrics. For evaluation metrics, we follow Jin et al. (2025b): we first normalize both the predicted and ground-truth answers, and then compute the Exact Match (EM) score. EM is 1 if and only if the predicted answer exactly matches any ground-truth answer. For all methods, we sample a single trajectory per question and report the average EM of the last valid prediction, following Jin et al. (2025a). A sketch of this metric is shown below.
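The following sketch implements the EM metric described above, following the normalization commonly used by Search-R1-style evaluations (lowercasing, removing punctuation and articles); the exact normalizer used in the paper may differ in detail.

```python
import re
import string

def normalize(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)   # drop English articles
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """EM is 1 iff the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

assert exact_match("The Eiffel Tower!", ["Eiffel Tower"]) == 1.0
```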
Setup. We mainly conduct experiments using Qwen-series models (Yang et al., 2024), namely Qwen2.5-3B-Base and Qwen2.5-7B-Base. Following Jin et al. (2025b), we use the 2018 Wikipedia dump (Karpukhin et al., 2020) as the knowledge source and E5 (Wang et al., 2022a) as the retriever. We fix the number of retrieved documents to three across all methods for a fair comparison. Unless otherwise specified, we report results at turn 3 for our models. Detailed implementation settings are provided in Appendix A.1.3.

Baselines. To evaluate the effectiveness of MR-Search, we compare it against the following recent baselines: (1) inference without finetuning: methods that directly use the base model, including direct inference without retrieval and Search-o1 (Li et al., 2025b) with retrieval; (2) finetuning-based methods that learn a policy to integrate the search tool without step-level supervision, including ReSearch (Chen et al., 2025) and Search-R1 (Jin et al., 2025a); and (3) finetuning-based methods that learn a policy to integrate the search tool with step-level supervision, including PPRM (Anonymous, 2026) and StepResearch (Wang et al., 2025b). We also compare with Search-R1 equipped with our multi-turn reflection mechanism at inference time in § 4.3. Detailed descriptions of the baselines are given in Appendix A.1.2.

4.2 Main Results

Table 1 summarizes the main results on the benchmarks. Among all approaches, MR-Search achieves a substantial margin over GRPO with outcome rewards (Search-R1), yielding 9.2% and 19.3% average relative improvements for Qwen2.5-7B-Base and Qwen2.5-3B-Base, respectively. This highlights the significant benefits of designing process rewards for agentic search. Remarkably, MR-Search remains highly effective on the small Qwen2.5-3B model, whereas RL methods that rely only on sparse outcome rewards struggle to elicit the multi-turn search behavior needed for good performance. Compared to methods that rely on external models to obtain process rewards, such as StepResearch and PPRM, MR-Search achieves better performance. This confirms that agentic search can benefit greatly from our annotation-free process rewards, and that MR-Search can effectively leverage this process supervision. Figure 4 shows the results on the ASearcher dataset. Compared to multi-hop QA, ASearcher requires longer-horizon, multi-turn search (Gao et al., 2025). We observe that MR-Search significantly outperforms Search-R1, achieving 10.2% and 9.5% relative improvements in EM and F1, respectively, demonstrating the effectiveness of MR-Search on complex ASearcher tasks.

Figure 5: MR-Search, Search-R1 with sequential reflection turns (Search-R1-S), and Search-R1 with parallel sampling (Search-R1-P), selecting the most frequent answer among them.

Table 2: Ablation of the discount factor and training algorithm, evaluated on Qwen2.5-7B-Base.

Method               | NQ   | TriviaQA | PopQA | HotpotQA | 2wiki | Musique | Bamboogle | Avg.
-------------------- | ---- | -------- | ----- | -------- | ----- | ------- | --------- | ----
ReSearch             | 36.6 | 60.5     | 39.1  | 37.8     | 38.6  | 16.6    | 37.6      | 38.1
Search-R1            | 46.4 | 64.1     | 44.8  | 43.0     | 42.5  | 19.5    | 44.0      | 43.5
MR-Search w. γ = 0   | 49.3 | 63.9     | 45.6  | 44.3     | 41.7  | 19.3    | 42.6      | 43.8
MR-Search w. PPO     | 43.9 | 63.7     | 42.5  | 41.3     | 40.5  | 18.6    | 43.5      | 42.0
MR-Search w. MT-GRPO | 46.1 | 64.5     | 44.5  | 44.7     | 44.2  | 20.7    | 45.1      | 44.3
MR-Search            | 50.2 | 66.6     | 47.2  | 46.8     | 43.6  | 22.1    | 45.2      | 46.0

Ablation Study. We study the effects of the key design choices of MR-Search. Specifically, we consider the following ablations: (i) we compare our optimization algorithm (§ 3.3) with PPO (Schulman et al., 2017) and MT-GRPO (Zeng et al., 2025) within our meta-RL framework; (ii) we set the discount factor γ = 0, which removes future credit assignment. From the results in Table 2, we observe that our proposed multi-turn RL algorithm consistently outperforms PPO and MT-GRPO when trained with episode turn-level rewards, demonstrating the effectiveness of our optimization strategy in leveraging dense feedback from reflection. Moreover, both PPO and MT-GRPO underperform GRPO with outcome rewards (Search-R1) on single-hop NQ and PopQA, whereas our method does not, indicating stronger generalization and robustness. We further observe that removing the discount factor substantially degrades performance and causes training to converge to poor local optima. A possible explanation is that an incorrect final episode does not necessarily imply that intermediate episodes are uninformative.
Table 3: Comparison with variants of MR-Search: encouraging exploration, MR-Search at the step level, and context management (keeping one preceding episode as context).

Method                    | NQ   | TriviaQA | PopQA | HotpotQA | 2wiki | Musique | Bamboogle | ASearcher
------------------------- | ---- | -------- | ----- | -------- | ----- | ------- | --------- | ---------
Search-R1                 | 45.9 | 63.2     | 44.9  | 43.9     | 38.7  | 18.1    | 40.0      | 36.9
MR-Search                 | 50.2 | 66.6     | 47.2  | 46.8     | 43.6  | 22.1    | 45.2      | 41.3
MR-Search (Exploration)   | 48.3 | 65.1     | 46.4  | 44.7     | 39.4  | 21.8    | 44.0      | 43.2
MR-Search (Step Level)    | 48.6 | 64.6     | 45.7  | 42.3     | 41.4  | 16.3    | 41.6      | 38.4
MR-Search (Short Context) | 48.1 | 65.9     | 45.2  | 44.6     | 41.0  | 19.3    | 47.2      | 40.5

4.3 Further Analysis

Training Dynamics. For a more comprehensive understanding of MR-Search, we visualize its training dynamics. Figure 4 shows the training reward dynamics of Search-R1 and MR-Search. We observe that MR-Search exhibits stable convergence during training and consistently achieves higher training reward than Search-R1. This indicates that MR-Search effectively leverages iterative reflection to refine answers, leading to progressively improved final answers throughout multi-turn RL training. We also observe that MR-Search calls the search engine more frequently than Search-R1, demonstrating that MR-Search can dynamically adjust the number of search calls according to task complexity.

Test-time Scaling. We also evaluate how MR-Search scales with additional reflection turns at test time. We extrapolate the number of reflection turns beyond training (3 turns) by appending the entire interaction history to the context at each turn. As shown in Figures 3 and 5, the single-turn method Search-R1 with our reflection mechanism yields only marginal gains when additional reflection turns are allowed, since its training objective is optimized for a single turn. In contrast, MR-Search achieves significantly higher performance with additional reflection turns, exhibiting a steep improvement curve. These results suggest that multi-turn reflection with MR-Search enhances the model's ability to iteratively refine and optimize its search across turns and enables effective extrapolation.

Case Study. We present inference cases of models trained with MR-Search. As illustrated by the examples in Appendix A.2.2, the model is able to execute multi-turn agentic tasks through iterative tool calls and autonomous information aggregation. During subsequent reasoning episodes, the model revisits and revises answers in light of newly acquired evidence, ultimately producing correct final answers. These observations highlight the benefits of self-reflection for agentic search. Moreover, when the answer is already accurate, as shown in Cases 2 and 3, the model preserves it and avoids unnecessary revisions, demonstrating its ability to selectively refine reasoning based on retrieved information.

4.4 Extensions

In this section, we conduct preliminary experiments to investigate the effects of exploration vs. exploitation, step-level meta-RL (discussed in § 3.4), and context management (i.e., retaining only the immediately preceding episode as context, as sketched below). To study exploration and exploitation, we designate the first two episodes as exploration and the last two as exploitation. Table 3 provides detailed results. We observe that all variants significantly outperform the Search-R1 baseline, demonstrating their effectiveness. Moreover, MR-Search with step-level meta-RL achieves a substantial improvement over GRPO with outcome rewards (Search-R1). This result suggests that agentic search can benefit from process-level rewards under meta-RL training, and that MR-Search effectively leverages such process supervision to achieve better performance. Furthermore, as shown in Table 3, encouraging exploration by masking the rewards of the early episodes is beneficial for the more complex ASearcher dataset, which requires more interaction with tools.
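A sketch of the short-context variant evaluated in Table 3: only the immediately preceding episode is kept as context, so the prompt length stays bounded as the number of reflection turns N grows. `rollout`, `outcome_reward`, and `REFLECT` are the same hypothetical stand-ins used in Section 3.

```python
def generate_meta_episode_short(question_prompt, gold, llm_generate, search_engine, N=4):
    """Like generate_meta_episode, but conditions only on the latest episode."""
    C, episodes, rewards = question_prompt, [], []
    for n in range(N):
        ep = rollout(C, llm_generate, search_engine)
        episodes.append(ep)
        rewards.append(outcome_reward(ep["answer"], gold))
        # Keep only the question plus the newly generated episode, not the
        # full history: ep["trajectory"] starts with C, so slice it off.
        C = question_prompt + "\n" + ep["trajectory"][len(C):] + "\n" + REFLECT
    return episodes, rewards
```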
5 Conclusions

We study agentic search under sparse outcome rewards and propose MR-Search, an in-context meta-reinforcement learning framework that enables structured cross-episode exploration via explicit self-reflection. By conditioning each episode on prior trajectories and reflections, MR-Search transforms independent search attempts into a progressively informed search process, improving exploration without relying on external process reward models. To support this multi-turn reflective setting, we introduce a turn-level grouped advantage formulation that provides unbiased and fine-grained credit assignment while remaining critic-free. Extensive experiments across diverse benchmarks show that MR-Search consistently outperforms outcome-only RL baselines. Overall, our results highlight the importance of in-context meta-learning for effective agentic reinforcement learning of LLMs.

Acknowledgment

This material is based upon work supported by the National Science Foundation under Award No. 2413244. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

Limitations

Despite its effectiveness, MR-Search has several limitations. First, we do not evaluate our method on long-form benchmarks, where responses are substantially longer. Verification in such settings is inherently challenging, and how to reliably assess progress and final correctness for long-form generation remains an open research question. Second, our current study focuses on agentic search with a fixed Wikipedia search tool; extending MR-Search to environments involving multiple heterogeneous tools, such as combined web search and web browsing, remains unexplored. We leave the investigation of these directions to future work. It would also be particularly interesting to scale MR-Search to large agentic RL training runs and further study the scaling properties of meta-RL with frontier base models.

Ethics Statement

This paper advances reinforcement learning for agentic search by improving training efficiency and credit assignment under sparse rewards. By enabling more effective training of language-model-based agents, this work may benefit practical applications that require efficient and reliable reasoning. This research shares the societal implications of machine learning systems more broadly and does not introduce additional ethical concerns beyond those commonly associated with large language models and reinforcement learning.

References

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12248-12267, 2024.

Anonymous. Principle process reward for search agents. In Submitted to The Fourteenth International Conference on Learning Representations, 2026.

Michael Bereket and Jure Leskovec. Uncalibrated reasoning: GRPO induces overconfidence for stochastic outcomes. arXiv preprint, 2025.
Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, et al. Learning to reason with search for LLMs via reinforcement learning. arXiv preprint, 2025.

Yong Deng, Guoqing Wang, Zhenzhe Ying, Xiaofeng Wu, Jinzhen Lin, Wenwen Xiong, Yuqin Dai, Shuo Yang, Zhanwei Zhang, Qiwen Wang, et al. Atom-Searcher: Enhancing agentic deep research via fine-grained atomic thought reward. arXiv preprint, 2025.

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763, 2025.

Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, et al. SSRL: Self-search reinforcement learning. arXiv preprint arXiv:2508.10874, 2025.

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. arXiv preprint, 2025.

Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous RL. arXiv preprint, 2025.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint, 2025.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020.

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023.

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, and Maria Brbic. Meta-RL induces exploration in language agents. arXiv e-prints, pp. arXiv-2512, 2025.

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint, 2025a.

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O. Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling, 2025b. URL https://openreview.net/forum?id=Rwhi91ideu.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP (1), pp. 6769-6781, 2020.

Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free!
ICLR 2019 Workshop: Deep RL Meets Structured Prediction, 2019.

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint, 2024.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466, 2019.

Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, DJ Strouse, Steven Stenberg Hansen, Angelos Filos, Ethan Brooks, et al. In-context reinforcement learning with algorithm distillation. In The Eleventh International Conference on Learning Representations, 2023.

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, et al. WebSailor: Navigating super-human reasoning for web agents. arXiv preprint, 2025a.

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366, 2025b.

Evan Z. Liu, Aditi Raghunathan, Percy Liang, and Chelsea Finn. Decoupling exploration and exploitation for meta-reinforcement learning without sacrifices. In International Conference on Machine Learning, pp. 6925-6935. PMLR, 2021.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint, 2024.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534-46594, 2023.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint, 7, 2022.

Luckeciano C. Melo. Transformers are meta-reinforcement learners. In International Conference on Machine Learning, pp. 15340-15359. PMLR, 2022.

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2023.

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.

Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. Recursive introspection: Teaching language model agents how to self-improve. Advances in Neural Information Processing Systems, 37:55249-55285, 2024.

Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint, 2025.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint, 2017.

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, et al. DR Tulu: Reinforcement learning with evolving rubrics for deep research. arXiv preprint arXiv:2511.19399, 2025.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint, 2024.

Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, et al. Thinking vs. doing: Agents that reason by scaling test-time interaction. arXiv preprint, 2025.

Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, and Jieyu Zhao. Experiential reinforcement learning. arXiv preprint, 2026.

Yaorui Shi, Shihan Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, and Xiang Wang. Search and refine during think: Autonomous retrieval-augmented reasoning of LLMs. NeurIPS, 2025.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634-8652, 2023.

Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, and Tatsunori Hashimoto. Towards execution-grounded automated AI research. arXiv preprint arXiv:2601.14525, 2026.

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint, 2024.

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-Searcher: Incentivizing the search capability in LLMs via reinforcement learning. arXiv preprint, 2025.

Bradly C. Stadie, Ge Yang, Rein Houthooft, Xi Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, and Ilya Sutskever. Some considerations on learning to explore via meta-reinforcement learning. arXiv preprint, 2018.

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. ZeroSearch: Incentivize the search capability of LLMs without searching. arXiv preprint, 2025.

Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539-554, 2022.

Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, et al. SRPO: Enhancing multimodal LLM reasoning via reflection-aware reinforcement learning. arXiv preprint, 2025.

Jane Wang, Zeb Kurth-Nelson, Hubert Soyer, Joel Leibo, Dhruva Tirumala, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 39, 2017.
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022a.

Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv preprint, 2023.

Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, and Hailei Gong. Towards hierarchical multi-step reward models for enhanced reasoning in large language models. arXiv preprint, 2025a.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint, 2022b.

Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. StepSearch: Igniting LLMs search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025b.

Jiahao Wu, Zhongwen Xu, Qiang Fu, and Wei Yang. Cut the bill, keep the turns: Affordable multi-turn search RL. 2025a.

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. WebDancer: Towards autonomous information seeking agency. arXiv preprint, 2025b.

Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, and Tong Zhang. Self-rewarding correction for mathematical reasoning. arXiv preprint, 2025.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint, 2018.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.

Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. arXiv preprint arXiv:2412.01981, 2024.

Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time. arXiv preprint, 2026.

Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, and Mingyi Hong. Reinforcing multi-turn reasoning in LLM agents via turn-level credit assignment. arXiv preprint arXiv:2505.11821, 2025.

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025.

Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. ProcessBench: Identifying process errors in mathematical reasoning. arXiv preprint, 2024.

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint, 2025.
A Appendix

A.1 Experimental Details

A.1.1 The Details of Datasets

NQ (Kwiatkowski et al., 2019): NQ is a question answering (QA) dataset. Questions consist of real, anonymized, aggregated queries issued to the Google search engine. The training and test sets contain 79,168 and 3,610 samples, respectively.

TriviaQA (Joshi et al., 2017): TriviaQA is a challenging reading comprehension dataset containing question-answer-evidence triples, which provide high-quality distant supervision for answering the questions. The test set we use contains 11,313 samples.

PopQA (Mallen et al., 2022): PopQA is a large-scale open-domain question answering dataset consisting of 14,267 entity-centric QA pairs. Each question is created by converting a knowledge tuple retrieved from Wikidata using a template.

HotpotQA (Yang et al., 2018): HotpotQA is a large-scale multi-hop QA benchmark featuring Wikipedia-based Q&A pairs with sentence-level supporting evidence. The training and test sets contain 90,447 and 7,405 samples, respectively.

2WikiMultiHopQA (Ho et al., 2020): 2WikiMultiHopQA is a multi-hop question answering dataset designed to more reliably test a model's inference across multiple pieces of evidence. The test set contains 7,405 QA pairs.

Musique (Trivedi et al., 2022): MuSiQue is a multihop question answering dataset constructed to enforce genuine multi-step reasoning. The test set contains 2,417 QA pairs.

Bamboogle (Press et al., 2022): Bamboogle is a dataset with multi-hop questions, where all questions are sufficiently difficult to be unanswerable by a popular internet search engine, but where both supporting pieces of evidence can be found in Wikipedia. Bamboogle contains 125 test QA pairs.

ASearcher (Gao et al., 2025): ASearcher is a synthetic multi-turn dataset whose synthesis pipeline is largely based on Wikipedia. We use the preprocessed version from Wu et al. (2025a), which applies a three-step filtering pipeline, including the removal of Chinese and math samples as well as rejection sampling, resulting in 14k samples. We split ASearcher into 90% training and 10% evaluation sets to decouple the effects of data distribution.

A.1.2 The Details of Baselines

We provide detailed descriptions of the baselines. We consider three types: inference without finetuning, finetuning-based methods without step-level supervision, and finetuning-based methods with step-level supervision.

Search-o1 (Li et al., 2025b) is a search-enhanced reasoning framework that integrates the agentic RAG mechanism and a reason-in-document module.
ReSearch (Chen et al., 2025) is an RL-based framework that trains LLMs to interleave reasoning with explicit search actions, deciding when and how to query and then using retrieved evidence to continue multi-hop reasoning.

Search-R1 (Jin et al., 2025a) extends RL-based reasoning by enabling LLMs to autonomously generate search queries during multi-turn reasoning.

PPRM (Anonymous, 2026) is a principle process reward model that provides step-wise signals to guide GRPO-based RL.

StepResearch (Wang et al., 2025b) trains search agents with step-wise PPO using intermediate rewards and token-level supervision to better guide multi-hop retrieval and reasoning.

A.1.3 The Details of Implementation

Our implementation is built upon Search-R1, based on VeRL. The generation prompt and reflection prompt are given in the boxes below.

Generation Prompt

Answer the given question. You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, you must call a search engine by <search> query </search>, and it will return the top search results between <information> and </information>. After every time you get new information, you must try to provide the answer inside <answer> and </answer>, without detailed illustrations. For example, <answer> xxxx </answer>. Question: {question}

Reflection Prompt

Reflect on your current answer to the question and provide another answer by searching for additional external information using search engines. You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, if you find you lack some knowledge, you can call a search engine by <search> query </search>, and it will return the top searched results between <information> and </information>. You can search as many times as you want. If you find no further external knowledge is needed, you can directly provide the answer inside <answer> and </answer>, without detailed illustrations.

Hyperparameters. For training, we use AdamW (Loshchilov & Hutter, 2017) as the optimizer and set the learning rate to 1e-6 without warmup. The top-p and temperature for rollout are both set to 1. The total number of training steps is 300. The number of documents returned by the retrieval engine is 3. The group size for advantage calculation is set to 5. The context length is set to 8K and 16K for the NQ/HotpotQA and ASearcher datasets, respectively. We also set the maximum number of tool calls in each episode to 3 and 5 for the NQ/HotpotQA and ASearcher datasets, respectively. For evaluation, all other settings (e.g., the retriever configuration) are kept the same as in training; the only difference is that we disable sampling and use greedy decoding with temperature = 0 and top-p = 1.0.

Computation Resources. All RL training is performed on 8 × NVIDIA Tesla H100 (80GB) GPUs, and we use an additional 2 × NVIDIA Tesla H100 (80GB) GPUs to serve the retriever.
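For reference, the hyperparameters above can be summarized as a plain Python dict; the key names here are ours for illustration and do not follow the released VeRL config schema.

```python
# Illustrative summary of the training setup described in Appendix A.1.3.
HPARAMS = {
    "optimizer": "AdamW",
    "learning_rate": 1e-6,                                  # no warmup
    "rollout_top_p": 1.0,
    "rollout_temperature": 1.0,
    "training_steps": 300,
    "retrieved_docs": 3,
    "group_size_G": 5,                                      # RLOO group of meta-episodes
    "context_length": {"nq_hotpotqa": 8_192, "asearcher": 16_384},
    "max_tool_calls": {"nq_hotpotqa": 3, "asearcher": 5},
    "eval_decoding": {"temperature": 0.0, "top_p": 1.0},    # greedy at evaluation
}
```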
Hyperparameters. For training, we use AdamW (Loshchilov & Hutter, 2017) as the optimizer and set the learning rate to 1e-6 without warmup. The top-p and temperature for rollouts are both set to 1. The total number of training steps is 300. The number of documents returned by the retrieval engine is 3. The group size for advantage calculation is set to 5. The context length is set to 8K and 16K for the NQ/HotpotQA and ASearcher datasets, respectively. We also set the maximum number of tool calls in each episode to 3 and 5 for the NQ/HotpotQA and ASearcher datasets, respectively. For evaluation, all other settings (e.g., the retriever configuration) are kept the same as in training; the only difference is that we disable sampling and use greedy decoding with temperature = 0 and top-p = 1.0.
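For quick reference, the hyperparameters above can be collected in a single configuration sketch. The key names below are illustrative and do not correspond to VeRL's actual configuration schema.

```python
# Consolidated restatement of the hyperparameters listed above. Key names are
# illustrative and do not correspond to VeRL's actual configuration schema.
TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 1e-6,
    "warmup": None,                     # no warmup
    "rollout_top_p": 1.0,
    "rollout_temperature": 1.0,
    "total_training_steps": 300,
    "retrieved_docs_per_call": 3,
    "advantage_group_size": 5,
    # dataset-dependent settings: NQ/HotpotQA vs. ASearcher
    "context_length": {"nq_hotpotqa": 8_192, "asearcher": 16_384},
    "max_tool_calls_per_episode": {"nq_hotpotqa": 3, "asearcher": 5},
}

# Evaluation keeps all settings except decoding, which becomes greedy.
EVAL_OVERRIDES = {"rollout_temperature": 0.0, "rollout_top_p": 1.0}
```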
Computation Resources. All RL training is performed on 8 × NVIDIA H100 (80GB) GPUs, and we use an additional 2 × NVIDIA H100 (80GB) GPUs to serve the retriever.

A.2 Additional Experimental Results

A.2.1 More Results of Training Dynamics

In Figures 6 and 7, we further visualize the training dynamics of MR-Search. We observe that MR-Search exhibits more stable convergence during training and consistently achieves higher reward than the Search-R1 baseline. These results indicate that the episode-level process rewards with self-reflection introduced by MR-Search provide more informative and reliable learning signals, leading to improved stability and training effectiveness.

A.2.2 Case Study

In Tables 4-7, we provide example inference trajectories of the model trained by MR-Search.

Figure 6: The training dynamics of MR-Search and Search-R1 in terms of test accuracy, training accuracy, and the number of tool calls on Qwen2.5-3B-Base.

Figure 7: The training dynamics of MR-Search and Search-R1 in terms of test accuracy, training accuracy, and the number of tool calls on Qwen2.5-7B-Base.

Figure 8: The idea of MR-Search can be extended beyond full trajectories to any semantically meaningful episode that can be explicitly segmented. For example, in agentic search, each tool-interaction step can naturally form an episode. We can decompose a single interaction process into multiple self-reflective trials, where each tool-interaction step contributes a candidate answer to the original query via prompting. During training, at each interaction step, we monitor changes in the agent's accuracy with respect to the target and use this signal as a dense training objective to promote effective exploration.
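As a rough illustration of the dense signal described in the caption of Figure 8, the sketch below scores each turn by the change in correctness of its candidate answer and normalizes the resulting rewards within a group of rollouts for the same question (cf. the group size of 5 above). This is a simplified reading of the idea, not the exact advantage estimator used in training.

```python
# Sketch of a dense turn-level signal in the spirit of Figure 8: each
# tool-interaction step yields a candidate answer, the per-turn reward is the
# change in correctness between consecutive steps, and rewards are normalized
# across a group of rollouts for the same question. Simplified illustration only.
from statistics import mean, pstdev

def turn_rewards(step_correct):
    """step_correct[t] is 1.0 if the candidate answer after turn t is judged
    correct by the outcome verifier, else 0.0."""
    prev = [0.0] + step_correct[:-1]
    return [c - p for c, p in zip(step_correct, prev)]  # improvement per turn

def group_relative_advantages(group):
    """group: per-rollout lists of turn correctness for the same question."""
    all_rewards = [r for traj in group for r in turn_rewards(traj)]
    mu = mean(all_rewards)
    sigma = pstdev(all_rewards) or 1.0                  # guard against zero variance
    return [[(r - mu) / sigma for r in turn_rewards(traj)] for traj in group]
```

Under this scoring, turns where the candidate answer flips from incorrect to correct receive positive advantage, while regressions are penalized, which is what encourages productive self-reflection rather than mere repetition.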
Table 4: Case study 1: An early intermediate answer is incorrect due to ambiguous retrieval results, but is subsequently reconsidered through additional search and self-reflection, leading to the correct final answer.

Question: The Victorian government department responsible for agriculture and energy policy, which collaborated with Field and Game Australia to establish the Game Council of Victoria in 2012, was merged in 2013 with another department administering a conservation strategy involving biodiversity offsets. In which year was this department established?

Ground Truth: 2002

MR-Search: To determine the year in which the Victorian government department responsible for agriculture and energy policy was established, we need more information about the department and its merger with another department in 2013.

Query: Victorian government department responsible for agriculture and energy policy merged with another department administering conservation strategy involving biodiversity offsets in 2013. When was this department established?

Document (Title: Department of the Environment (Australia, 2013–16)) by way of an Administrative Arrangements Order issued on 18 September 2013. It absorbed the responsibilities of the former Department of Sustainability, Environment, Water, Population and Communities (DSEWPaC) and climate change from the former Department of Industry, Innovation, Climate Change, Science, Research and Tertiary Education. The department was dissolved in July 2016 and its functions, along with energy policy functions, were moved to the newly established Department of the Environment and Energy. From when it...

It seems the search did not directly provide the information. Let's try another search with a specific focus on the department related to agriculture and energy policy:

Query: When was the Department of Economic Development, Jobs, Transport and Resources established in Victoria?

Document (Title: Department of Economic Development, Jobs, Transport and Resources) Department of Economic Development, Jobs, Transport and Resources The Department of Economic Development, Jobs, Transport and Resources (DEDJTR) is a department of the Government of Victoria. It was created on 1 January 2015 by the government of Premier Daniel Andrews when the number of government departments was reduced from 9 to 7, and assumed responsibility for ministerial portfolios previously spread across 5 departments. The department was established after the 2014 Victorian election by the new Labor government, using the powers of the Premier of Victoria and the Governor-in-Council under the Public ...

This information indicates that the Department of Economic Development, Jobs, Transport and Resources was established on 1 January 2015. Therefore, the year in which the Victorian government department responsible for agriculture and energy policy was established is: 2015

Reflect on your current answer to the question and provide an improved answer by searching for additional external information using search engines.

Let's first confirm the details we have about the merger with the department administering a conservation strategy involving biodiversity offsets. We need to ensure that the department we identified merged with another department in 2013 as stated in the question.

Query: Which department was merged with the Department of Economic Development, Jobs, Transport and Resources in 2013 that administered a conservation strategy involving biodiversity offsets?

Document (Title: Department of Economic Development, Jobs, Transport and Resources) Department of Economic Development, Jobs, Transport and Resources The Department of Economic Development, Jobs, Transport and Resources (DEDJTR) is a department of the Government of Victoria. It was created on 1 January 2015 by the government of Premier Daniel Andrews when the number of government departments was reduced from 9 to 7, and assumed responsibility for ministerial portfolios previously spread across 5 departments. The department was established after the 2014 Victorian election by the new Labor government, using the powers of the Premier of Victoria and the Governor-in-Council under the Public Administration Act 2004 and the Administrative Arrangements Act 1983...

It seems the information does not directly state which department was merged in 2013. However, the question specifies that the department collaborated with Field and Game Australia to establish the Game Council of Victoria in 2012. Let's consider if this collaboration might provide a clue:

Query: When was the Department of Environment and Primary Industries, which collaborated with Field and Game Australia to establish the Game Council of Victoria in 2012, merged with another department in 2013?

Document (Title: Department of Primary Industries (Victoria)) Bureau of Animal Welfare branch Does this Department need investigation if Rosewarne had anything to do with it? DPI had primary responsibility for administering legislation governing agriculture, animal welfare, fisheries, hunting and energy and earth resources in Victoria. The department administered and responded to queries regarding 38 major Acts and is responsible for collecting fees under regulations for fishing licences, mining rights, and exploration licences. In 2012 DPI also created the Game Council of Victoria, together with the Bureau of Animal Welfare and organisations such as Field and Game Australia, to manage sustainable and responsible hunting in Victoria...

This information indicates that the Department of Environment and Primary Industries was formed by merging with the Department of Sustainability and Environment in April 2013. However, we need to confirm if this is the department responsible for agriculture and energy policy. Given the context, it seems likely that the Department of Environment and Primary Industries was the one responsible for agriculture and energy policy before the merger. Therefore, the year in which the Victorian government department responsible for agriculture and energy policy was established is: 2002

Reflect on current answer to the question and provide an improved answer by searching for additional external information using search engines.

Let's verify the year of establishment for the Department of Environment and Primary Industries, which was merged in 2013 and was responsible for agriculture and energy policy:

Query: When was the Department of Environment and Primary Industries established in Victoria?

Document (Title: Department of Environment and Primary Industries (Victoria)) 1 January 2015. The Agriculture portfolio was moved to the new Department of Economic Development, Jobs, Transport and Resources. Department of Environment and Primary Industries (Victoria) The Department of Environment and Primary Industries (DEPI) was a state government department responsible for protecting the environment, boosting productivity in Victoria's food and fibre sector, management of natural resources and managing water resources in the state of Victoria, Australia. It was created in April 2013 by merging the Department of Primary Industries with the Department of Sustainability and Environment...

This information confirms that the Department of Environment and Primary Industries was created in April 2013 by merging the Department of Primary Industries with the Department of Sustainability and Environment. However, we need to find the year when the Department of Primary Industries, which was merged to form the Department of Environment and Primary Industries, was established. Since the Department of Primary Industries was established in 2002, we can conclude that the Victorian government department responsible for agriculture and energy policy was established in 2002.

2002

Table 5: Case study 2: MR-Search subsequently revisits the problem through additional search and self-reflection, ultimately producing the correct final answer.

Question: During a major geopolitical event in the early Cold War, Robert Menzies exploited rising tensions to portray the Labor Party as 'soft on Communism,' leading to the defeat of a Prime Minister who attended a religious school. In the same period, which British maritime patrol aircraft, known for its role in anti-submarine warfare and derived from the world's first jet airliner, was frequently deployed to a U.S. military base in the Atlantic to support anti-submarine operations during the latter stages of the Cold War?

Ground Truth: Hawker Siddeley Nimrod

MR-Search: To determine the name of the British maritime patrol aircraft that was frequently deployed to a U.S.
military base in the Atlantic to support anti-submarine operations during the latter stages of the Cold War, let's first consider the time frame of the event involving Robert Menzies and the Prime Minister who attended a religious school. Then we can perform a search to find the information about the British maritime patrol aircraft.

Query: British maritime patrol aircraft derived from the world's first jet airliner, frequently deployed to a U.S. military base in the Atlantic during the Cold War

Document (Title: Maritime patrol aircraft) also been located in these areas: U.S. Navy and Canadian aircraft based in Greenland, Iceland, and Newfoundland; British aircraft based in Scotland and Northern Ireland; and Norwegian, Dutch, and German aircraft based in their home countries. Since the end of the Cold War the threat of a large-scale submarine attack is a remote one, and many of the air forces and navies have been downsizing their fleets of ...

Let's try one more time with the exact model derived from the first jet airliner:

Query: Air aztros 2000 maritime patrol aircraft frequently deployed to a U.S. military base in the Atlantic during the Cold War

Document (Title: Kindley Air Force Base) P-3B aircraft from various bases in the eastern United States, as well as NATO/Allied support consisting of Royal Air Force Hawker Siddeley Nimrod MR2s, Canadian Armed Forces CP-140 Auroras and other similar maritime patrol and reconnaissance aircraft from other NATO nations. During one period in 1985 that was characterized by exceptionally heavy Soviet Navy submarine activity off the ...

This information does not directly name the aircraft. However, the mention of P-3C aircraft being frequently deployed to a U.S. military base in the Atlantic suggests that we should focus on this model. Given that P-3C is a well-known maritime patrol aircraft and the question specifies a connection to the first jet airliner, let's consider if P-3C has any such connection. The P-3C is derived from the Lockheed P-3 Orion, which is itself a derivative of the Lockheed Model 85 Electra airliner. The first jet airliner was the de Havilland Comet, not directly related to the Electra. Therefore, we might need to re-evaluate our query to ensure we are considering the correct aircraft. Given the context and the need for accuracy, let's consider the aircraft derived from the de Havilland Comet, which is the de Havilland Sea Vixen. However, the P-3C is more directly related to the Lockheed 1049C Hercules transport aircraft, which is not derived from the Comet but rather the Lockheed Electra. Given the specificity of the question, it seems likely that the aircraft in question is the P-3C Orion, which is frequently deployed and fits the timeframe.

P-3C Orion

Reflect on current answer to the question and provide an improved answer by searching for additional external information using search engines.

Let's first confirm the details we have and then refine our search to ensure we are identifying the correct aircraft. We know the aircraft is a British maritime patrol aircraft derived from the world's first jet airliner and was frequently deployed to a U.S. military base in the Atlantic during the Cold War. Given that the P-3C Orion is the most likely candidate based on the previous search, let's verify if there is any specific information about a British aircraft derived from a jet airliner.

Query: British maritime patrol aircraft derived from a jet airliner, frequently deployed to a U.S.
military base in the Atlantic during the Cold War

Document (Title: Maritime patrol aircraft) also been located in these areas: U.S. Navy and Canadian aircraft based in Greenland, Iceland, and Newfoundland; British aircraft based in Scotland and Northern Ireland; and Norwegian, Dutch, and German aircraft based in their home countries. Since the end of the Cold War the threat of a large-scale submarine attack is a remote one, and many of the air forces and navies have been downsizing their fleets of ...

It appears that the information did not directly name the aircraft. However, the mention of the P-3C Orion being frequently deployed suggests that we should consider if there is another British aircraft that fits the description. Given the specificity of the question, let's consider if there is a British counterpart to the P-3C. The P-3C is derived from the Lockheed 1049C Hercules, which is not a jet airliner. Therefore, we need to consider if there is a British jet airliner-derived maritime patrol aircraft. Given the context and the need for accuracy, let's consider the British counterpart to the P-3C, which is the Lockheed P-3 Orion. The P-3 Orion is derived from the Lockheed Electra, not a jet airliner. Therefore, we need to consider if there is another British aircraft derived from a jet airliner. Let's consider the Hawker Siddeley Nimrod, which is a British maritime patrol aircraft. The Nimrod is derived from the Hawker Siddeley Trident, which is a jet airliner. The Nimrod was indeed frequently deployed to a U.S. military base in the Atlantic during the Cold War.

Hawker Siddeley Nimrod

Reflect on current answer to the question and provide an improved answer by searching for additional external information using search engines.

Let's consider the details we have and ensure we are identifying the correct aircraft. We know it is a British maritime patrol aircraft derived from a jet airliner and was frequently deployed to a U.S. military base in the Atlantic during the Cold War. Given the specific context, let's consider if there is a direct match for the P-3C Orion in terms of derivation from a jet airliner. The Hawker Siddeley Nimrod is a strong candidate, as it is derived from the Hawker Siddeley Trident, which is a jet airliner. Additionally, the Nimrod was indeed frequently deployed to a U.S. military base in the Atlantic during the Cold War for anti-submarine operations. Given this reasoning, we can conclude that the aircraft in question is the Hawker Siddeley Nimrod.

Hawker Siddeley Nimrod

Table 6: Case study 3: MR-Search subsequently revisits the problem through additional search and self-reflection, ultimately producing the correct final answer.

Question: Who directed the 1909 Italian silent film in which the actor who played the castle castellan in the 1928 German silent film The Mysterious Mirror portrayed the titular character of Shakespeare's Macbeth?

Ground Truth: Mario Caserini

MR-Search: To determine who directed the 1909 Italian silent film in which the actor who played the castle castellan in the 1928 German silent film 'The Mysterious Mirror' portrayed the titular character of Shakespeare's 'Macbeth', let's first identify the actor and then find the Italian silent film he appeared in during 1909.

Query: Italian silent film from 1909 in which the actor who played the castle castellan in the 1928 German silent film The Mysterious Mirror portrayed the titular character of Shakespeare's 'Macbeth' directed by whom?
Document (Title: Macbeth (1916 film)) Macbeth (1916 film) Macbeth is a silent, black-and-white 1916 film adaptation of the William Shakespeare play Macbeth. It was directed by John Emerson, assisted by Erich von Stroheim, and produced by D. W. Griffith, with cinematography by Victor Fleming. The film starred Herbert Beerbohm Tree and Constance Collier, both famous from the stage and for playing Shakespearean parts. Although released during the first decade of feature filmmaking, it was already the seventh version of Macbeth to be produced, one of eight during the silent film era. It is considered to be a lost film. In the companion book to his ...

It appears that the search did not directly provide the name of the actor. Let's try the search again with a focus on the actor who played the castle castellan in The Mysterious Mirror, and then find the film in which he portrayed Macbeth in 1909.

Query: The actor who played the castle castellan in the 1928 German silent film The Mysterious Mirror and also portrayed the titular character of Shakespeare's Macbeth in an Italian silent film from 1909. Who directed that Italian film?

Document (Title: Maria Caserini) 1907, along with appearing in another thirteen films through 1909. One of those films was the 1909 production Macbeth, which was directed by her husband. From 1910 to 1927 she appeared in sixty-five films, all produced in Italy, most of which were directed by her husband, and many of which were Shakespearean adaptations. She continued to perform in theater productions well after she left her film career. She was residing in Milan at the time of her death on 15 April 1969. Maria Caserini Maria Caserini (née Gasperini; 24 July 1884 – 15 April 1969) was an Italian stage ...

This information does not directly name the actor but mentions Maria Caserini and her husband. Let's consider the top search result more carefully: One of the films was the 1909 production Macbeth, which was directed by her husband. So, the director of the 1909 Italian silent film in which the actor portrayed Macbeth is Maria Caserini's husband.

Her husband

Reflect on current answer to the question and provide an improved answer by searching for additional external information using search engines.

Let's first confirm the name of Maria Caserini's husband and then determine if we can find the director of the 1909 Italian silent film Macbeth with this information.

Query: Who was the husband of Maria Caserini and what was his name? Also, find the director of the 1909 Italian silent film Macbeth.

Document (Title: Maria Caserini) 1907, along with appearing in another thirteen films through 1909. One of those films was the 1909 production Macbeth, which was directed by her husband. From 1910 to 1927 she appeared in sixty-five films, all produced in Italy, most of which were directed by her husband, and many of which were Shakespearean adaptations. She continued to perform in theater productions well after she left her film career. She was residing in Milan at the time of her death on 15 April 1969. Maria Caserini Maria Caserini (née Gasperini; 24 July 1884 – 15 April 1969) was an Italian stage ...

This information tells us that Maria Caserini's husband directed several of her films, including the 1909 Italian silent film Macbeth. Since the search did not return the name of her husband directly, let's consider if we can infer the director's name from the context or search for it directly:

Query: Who was the director of the 1909 Italian silent film Macbeth?
Document (Title: Macbeth (1916 film)) Macbeth (1916 film) Macbeth is a silent, black-and-white 1916 film adaptation of the William Shakespeare play Macbeth. It was directed by John Emerson, assisted by Erich von Stroheim, and produced by D. W. Griffith, with cinematography by Victor Fleming. The film starred Herbert Beerbohm Tree and Constance Collier, both famous from the stage and for playing Shakespearean parts. Although released during the first decade of feature filmmaking, it was already the seventh version of Macbeth to be produced, one of eight during the silent film era. It is considered to be a lost film. In the companion book to his ...

This information does not directly name the director. However, it mentions that the 1909 film Macbeth was directed by Maria Caserini's husband. Given the previous context, let's consider the most likely candidate based on the available information:

Mario Caserini

Reflect on current answer to the question and provide an improved answer by searching for additional external information using search engines.

Let's re-evaluate the information we have. We know that the 1909 Italian silent film Macbeth was directed by Maria Caserini's husband. Since the previous searches did not provide the name of her husband, let's consider if we can infer this information from the context or search for it directly:

Query: Who was the husband of Maria Caserini who directed the 1909 Italian silent film Macbeth?

Document (Title: Maria Caserini) Maria Caserini Maria Caserini (née Gasperini; 24 July 1884 – 15 April 1969) was an Italian stage and film actress, as well as a pioneer of filmmaking during the early 20th century. She often starred in adaptations of stage and film productions for the works of William Shakespeare. Born in Milan, she started acting in stage productions at a young age. Her first film was in 1906, in what is believed to be the earliest film adaptation of Shakespeare's Othello, titled Otello, which was directed by her husband, Mario Caserini. She appeared in a film adaptation of Romeo and Juliet in ...

This information confirms that Maria Caserini's husband directed the 1909 Italian silent film Macbeth. Since we have the name Mario Caserini from the context, we can directly answer:

Mario Caserini

Table 7: Case study 4: MR-Search subsequently revisits the problem through additional search and self-reflection, ultimately producing the correct final answer.

Question: In which country did Itamar Zorman make his first public appearance at a renowned festival in the early 2010s, and where is he scheduled to perform a concerto at a historic venue in a city with a rich Norman heritage and a famous opera house?

Ground Truth: Switzerland

MR-Search: To determine the country where Itamar Zorman made his first public appearance at a renowned festival in the early 2010s, and where he is scheduled to perform a concerto at a historic venue in a city with a rich Norman heritage and a famous opera house, let's first search for information about Itamar Zorman's early public appearances and then find the details about his upcoming concert.

Query: Itamar Zorman first public appearance at a renowned festival in the early 2010s and upcoming concerto at a historic venue in a city with rich Norman heritage and famous opera house

Document (Title: Itamar Zorman) Hamburg and Kronberg Academy Festival in Frankfurt as well as with Yuri Bashmet. His first CD is planned to be released sometime in 2014, with his recitals at the Louvre.
In November of the same year, he, Daniel Oren and the Musicians from Marlboro group will perform Ludwig van Beethoven's concerto at the Teatro Massimo in Palermo. For the past five years he has traveled through Israel and North America, during which he was a founder of the Israeli Chamber Project as well as a Lysander Piano Trio member. On March 9, 2014 he performed one of Johannes Brahms' sonatas in Detroit ...

Itamar Zorman made his first public appearance at the Verbier Festival in Switzerland in 2010. Now, let's find out where he is scheduled to perform a concerto at a historic venue in a city with a rich Norman heritage and a famous opera house.

Query: Itamar Zorman concerto at a historic venue in a city with rich Norman heritage and famous opera house

Document (Title: Semperoper) Semperoper The Semperoper is the opera house of the Sächsische Staatsoper Dresden (Saxon State Opera) and the concert hall of the Staatskapelle Dresden (Saxon State Orchestra). It is also home to the Semperoper Ballett. The building is located near the Elbe River in the historic centre of Dresden, Germany. The opera house was originally built by the architect Gottfried Semper in 1841. After a devastating fire in 1869, the opera house was rebuilt, partly again by Semper, and completed in 1878. The opera house has a long history of premieres, including major works by Richard Wagner and Richard Strauss. The ...

This information does not directly state the location of the upcoming concert. However, it mentions the Teatro Massimo in Palermo, which is in Italy. Let's verify if this matches the criteria of a city with a rich Norman heritage and a famous opera house.

Query: Teatro Massimo in Palermo, Italy

Document (Title: Teatro Massimo) Teatro Massimo The Teatro Massimo Vittorio Emanuele is an opera house and opera company located on the Piazza Verdi in Palermo, Sicily. It was dedicated to King Victor Emanuel II. It is the biggest in Italy, and one of the largest of Europe (the third after the Opéra National de Paris and the K. K. Hof-Opernhaus in Vienna), renowned for its perfect acoustics. An international competition for the creation of the opera house was announced by the Palermo Council in 1864 at the instigation of the mayor, Antonio Starrabba di Rudinì. For many years there had been talk of building ...

The Teatro Massimo in Palermo is indeed a famous opera house. Now, let's confirm if Palermo has a rich Norman heritage:

Query: Norman heritage in Palermo

Document (Title: Arab-Norman Palermo and the Cathedral Churches of Cefalù and Monreale) Arab-Norman Palermo and the Cathedral Churches of Cefalù and Monreale is a series of nine religious and civic structures located on the northern coast of Sicily dating from the era of the Norman Kingdom of Sicily (1130–1194): two palaces, three churches, a cathedral, and a bridge in Palermo, as well as the cathedrals of Cefalù and Monreale. They have been designated together as a UNESCO World Heritage Site. The new Norman rulers built various structures in what has become known as the Arab-Norman style. They incorporated the best practices of ...

This information confirms that Palermo has a rich Norman heritage. Therefore, Itamar Zorman is scheduled to perform a concerto at the historic venue in Palermo, Italy.
Italy

Reflect on your current answer to the question and provide an improved answer by searching for additional external information using search engines.

The answer remains the same as our previous conclusion based on the available information:

Italy

Reflect on your current answer to the question and provide an improved answer by searching for additional external information using search engines.

Let's verify the details of Itamar Zorman's first public appearance at a renowned festival in the early 2010s to ensure we have the correct information. We'll confirm the country where the festival is held.

Query: Itamar Zorman first public appearance at a renowned festival in the early 2010s and the country of the festival

Doc 1 (Title: "Itamar Zorman") at the Juilliard Berg Concerto Competition / His first public appearance, however, was not until the next year at the Verbier Festival, which was broadcast live from Switzerland. The same year he did nine concerts with the Israel Philharmonic Orchestra under guidance from David Robertson and performed at the Tokyo Symphony of Japan's Suntory Hall, where he played with the Baden-Baden Philharmonic and both the Russian and Saint Petersburg Philharmonic Orchestras, as well as the American South Bay Conservatory, the Israeli Haifa Symphony and the Dominican Fundación Sinfonía of Santo Domingo. As of October 2013, he has participated at both the Laeiszhalle in ...

This information confirms that Itamar Zorman made his first public appearance at the Verbier Festival in Switzerland in 2010. Therefore, the country where he made his first public appearance at a renowned festival in the early 2010s is Switzerland.

Switzerland