Let the Agent Steer: Closed-Loop Ranking Optimization via Influence Exchange


Authors: Yin Cheng, Liao Zhou, Xiyu Liang

Sortify Technical Report, March 2026

Sortify Team¹

Recommendation ranking is fundamentally an influence allocation problem: a sorting formula distributes ranking influence among competing factors (organic relevance, advertising bids, price competitiveness), and the business outcome depends on finding the optimal "exchange rates" among them. However, offline proxy metrics systematically misjudge how influence reallocation translates to online impact: large offline uplifts often convert into only a small fraction of the expected online gains, with transfer rates far below expectation. More subtly, the bias pattern is asymmetric across metrics: some metrics exhibit optimistic bias (offline overestimates online), while others exhibit pessimistic bias (offline underestimates online), so a single calibration factor cannot correct both. Traditional approaches worsen this predicament in three ways: manual calibration cannot keep pace with distribution shift; entangled diagnostic signals obscure whether the problem lies in mapping bias or constraint miscalibration; and each round starts from scratch, preventing historical experience from accumulating.

Figure 1 | Three-panel overview: (a) dual-channel architecture diagram, (b) online GMV/Orders uplift trend across rounds, (c) LLM correction convergence curve.

We present Sortify, a fully autonomous ranking optimization agent driven by a large language model (LLM). The agent reframes ranking optimization as continuous influence exchange, autonomously regulating the allocation of ranking influence among factors and closing the full loop from diagnosis and decision to parameter deployment without human intervention. It addresses the structural problems through three core mechanisms. First, grounded in L.J. Savage's axiomatization of Subjective Expected Utility (SEU), which establishes that any rational decision requires exactly two independent inputs, a belief about the state of the world (probability) and a preference over outcomes (utility), a dual-channel adaptation framework decouples offline-online transfer correction (the Belief channel) from constraint penalty adjustment (the Preference channel), architecturally severing the entanglement between epistemic error and axiological error to enable orthogonal diagnosis and independent correction. Second, an LLM meta-controller, serving as a second-order rational observer, operates on framework-level parameters (adjusting transfer function intercepts and penalty multipliers rather than low-level search parameters) and, based on evidence from 20-round episode histories, selectively corrects residuals left by routine LMS calibration. Third, a persistent Memory DB with 7 relational tables accumulates cross-round learning, providing the state basis for warm start and cross-round calibration continuity.

The agent's core evaluation metric, Influence Share, is a decomposable ranking influence measure in which all factor contributions sum to exactly 100%. It addresses the fundamental limitation of traditional rank-correlation metrics such as Kendall's $\tau$, which cannot attribute influence to individual factors or support quantitative exchange between them. Sortify has been deployed on a large-scale recommendation platform spanning two Southeast Asian markets (hereafter Country A and Country B), running fully automated with no manual intervention required.

¹ Yin Cheng, Liao Zhou, Xiyu Liang, Dihao Luo, Tewei Lee, Kailun Zheng, Weiwei Zhang, Mingchen Cai, Jian Dong, Andy Zhang. All authors are from Shopee.
The inner optimization loop operates autonomously in 4-hour cycles, executing 5,000 Optuna trials per round. In Country A's warm-start experiment, the agent pushed GMV from −3.6% to +9.2% within 7 rounds, with peak order volume reaching +12.5%, while LLM calibration effort converged from 5 correction items per round to 2. In Country B, a cold-start deployment validated its cross-market generalization and end-to-end viability: after two search phases (23 rounds total), the agent identified a parameter structure that achieved +4.15% GMV/UU and +3.58% Ads Revenue in a 7-day A/B test, leading to a full production rollout. Overall, the evidence retained in the main text demonstrates that the system can both continuously improve on existing calibration state and produce deployable policies in new markets.

Contents

1 Introduction
2 System Architecture
  2.1 Overview
  2.2 Influence Share and Parameter Search Engine
    2.2.1 Motivation: Beyond Kendall Tau
    2.2.2 Influence Share Definition: From Symmetric Groups to the Influence Permutation Algorithm
    2.2.3 Multi-Objective Search Formulation
    2.2.4 Search Engine
  2.3 Belief Channel: Offline-Online Transfer Calibration
    2.3.1 Motivation
    2.3.2 Linear Transfer Model
    2.3.3 LMS Continuous Calibration
    2.3.4 LLM Selective Jumps
  2.4 Preference Channel: Constraint Penalty Adaptation
    2.4.1 Motivation
    2.4.2 Violation Pressure
    2.4.3 Asymmetric Multiplicative Update
  2.5 LLM Meta-Controller
    2.5.1 Motivation
    2.5.2 Two Orthogonal Knobs
    2.5.3 Context and Evidence Requirements
    2.5.4 Implementation: Model, Prompt, and Output Safety
    2.5.5 Coordination with LMS
  2.6 Memory DB and State Persistence
    2.6.1 Motivation
    2.6.2 Schema Design
    2.6.3 Cross-Round Intelligence
3 Operational Framework
  3.1 System Infrastructure
  3.2 One-Shot Pipeline
  3.3 YOLO Continuous Loop
4 Evaluation
  4.1 Metrics
  4.2 Experiment Setup
  4.3 Online A/B Results
    4.3.1 Exp-401838: Country A PDP Warm-Start (7 Rounds)
  4.4 Offline-Online Transfer Analysis
  4.5 Parameter Evolution and Stability
    4.5.1 Residual Pendulum Effect in Warm-Start Experiment
  4.6 LLM Calibration Effectiveness
    4.6.1 Correction Convergence
  4.7 Sensitivity Analysis
  4.8 Country B Deployment: From Cold Start to Rollout
    4.8.1 Cold Start Signal
    4.8.2 Search Journey: From Direction to Deployable Parameters
    4.8.3 Winning Parameter Structure
    4.8.4 Long-Horizon A/B Validation
  4.9 Efficiency and Operational Cost
5 Discussion: One-Person Feasibility
  5.1 Scope of What Was Delivered
  5.2 Technical Depth and Development Complexity of Sortify
  5.3 From ParaDance to Sortify: Lineage and Domain Expertise
  5.4 The Orchestration Model
  5.5 Meta-Nesting: AI Building AI
  5.6 Implications
6 Conclusion, Limitations, and Future Directions
A Contributors
B Implementation Details
  B.1 A/B Experiment Integration
  B.2 Redis Parameter Publishing
  B.3 State File Management
  B.4 Directory Structure
C Notation Table

1. Introduction

Ranking in recommendation systems determines what users see first, directly impacting key business metrics such as gross merchandise value (GMV), order volume, advertising revenue, and user engagement. Industrial ranking stacks commonly combine multiple signals through tunable scoring functions and feature weights, especially in product search and marketplace recommendation settings Santu et al. [2017], Liu [2009]. Optimizing these parameters is therefore a continuous, high-leverage engineering task.
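As a toy illustration of what such a tunable scoring function looks like, the sketch below blends a few factor scores with adjustable weights. The factor names and numbers are invented for this example, not the production formula described later in this report.

```python
# Minimal sketch of a multi-factor scoring function with tunable weights.
# Factor names and weight values are illustrative placeholders only.

def total_score(item: dict, weights: dict) -> float:
    """Blend per-factor scores into one ranking score."""
    return sum(weights[f] * item[f] for f in weights)

weights = {"relevance": 1.0, "gmv": 0.4, "ads_bid": 0.2}
items = [
    {"id": "a", "relevance": 0.9, "gmv": 0.1, "ads_bid": 0.0},
    {"id": "b", "relevance": 0.7, "gmv": 0.6, "ads_bid": 0.3},
]
# Ranking = sort by the blended score; tuning `weights` changes the order.
ranked = sorted(items, key=lambda it: total_score(it, weights), reverse=True)
print([it["id"] for it in ranked])  # → ['b', 'a']
```

Tuning the weight vector is exactly the parameter optimization problem the rest of this report automates.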
As platforms scale across markets, verticals, and recommendation surfaces, the need for automated, self-correcting parameter optimization becomes increasingly urgent Hutter et al. [2019].

Figure 2 | Old vs. New Paradigm. Left: the manual optimization loop (offline search, deploy, human judgment, discard state, repeat). Right: the Sortify closed loop (offline search, deploy, auto-calibrate via dual channels, accumulate in Memory DB, repeat with improved priors). Highlights the structural differences: stateless vs. persistent, single-axis vs. dual-channel, manual vs. LLM-orchestrated.

The standard optimization workflow operates as follows: an offline search algorithm explores the parameter space using proxy metrics computed on logged data, identifies a candidate configuration, and pushes it to production for online A/B validation. A human operator then examines the A/B results, judges whether the outcome is acceptable, and either adopts the new parameters or reverts to the previous configuration Gilotte et al. [2018]. This manual loop, while conceptually simple, suffers from three structural limitations that incremental tuning improvements cannot resolve.

Offline-Online Transfer Gap. The most critical structural problem is the systematic divergence between offline proxy metrics and online business outcomes. In the Country A warm-start experiment retained in the main text, offline $I_{\text{gmv}}$ rose from +18.2% to +41.6%, yet the corresponding online GMV fluctuated from −3.6% to +9.2%, revealing a significant optimistic bias in the offline proxy toward online gains. More critically, the same round's parameter update affects GMV, Orders, and Ads Revenue non-synchronously, indicating that the transfer relationships across different business metrics do not share a single calibration constant. Standard logged-data evaluation methods estimate policy value from historical interactions Gilotte et al. [2018], Dudik et al. [2014], but they do not by themselves provide continuously updated, per-metric transfer calibration for a non-stationary production environment. Our production observations show that the mapping drifts continuously with traffic patterns, time-of-day effects, and competitive dynamics. Without continuous recalibration, each carefully designed offline search is built on potentially stale mapping assumptions, and the precision gains from search are partially offset by transfer error.

Entangled Diagnostic Signals. When an optimization round produces disappointing online results, the operator faces a fundamental attribution problem: did the offline-online mapping predict incorrectly (a Belief error), or were the constraint penalties calibrated at the wrong sensitivity level (a Preference error)? These two failure modes require opposite corrections: a Belief error demands adjusting the transfer function intercept, while a Preference error demands rescaling penalty weights Berger [1985]. Without formal separation, the two signal sources are conflated: a round where GMV is positive but Ads Revenue is negative could equally mean the mapping is optimistic or the constraint is too loose. This diagnostic entanglement is not merely a parameter tuning issue; it exposes a limitation of monolithic optimization workflows that compress multiple objectives and constraints into a single correction channel Lin et al. [2019].

No Persistent Learning. Each optimization round in conventional systems starts from a blank slate. The transfer relationships learned in round $N$ are not carried forward to round $N+1$. The constraint sensitivities discovered after a violation are forgotten after the next parameter push.
This "Groundhog Day" effect means that a system having completed 50 rounds of A/B tests possesses no more calibration intelligence than one running its first round Parisi et al. [2019]. In our production setting, each offline-online observation pair represents 3.5 hours of live traffic data, a non-trivial investment. Discarding this accumulated evidence is both economically wasteful and technically avoidable. Recent work on continual learning Parisi et al. [2019], meta-learning for warm-started optimization Feurer et al. [2015], and LLM-based agents with persistent state Wang et al. [2024] suggests that cumulative learning across episodes is both feasible and valuable for sequential decision problems.

We present Sortify, an agent-steered closed-loop system that reframes sorting parameter optimization as continuous influence exchange. Rather than treating optimization as a stateless manual process, Sortify enables an LLM agent to steer the rebalancing of ranking influence among competing factors through a persistent, autonomous feedback loop. Sortify's core architectural innovation is the decomposition of the influence exchange problem into two orthogonal channels, Belief (how influence predictions translate to reality) and Preference (how hard influence constraints are enforced), orchestrated by an LLM meta-controller that operates on framework-level parameters rather than bottom-level search parameters.

The design of Sortify rests on three principles. First, separation of concerns: transfer mapping correction (Belief) and constraint sensitivity adjustment (Preference) operate on independent axes derived from a Bayesian decision-theoretic decomposition $P(\theta \mid D) \times L(a, \theta)$ Berger [1985], ensuring that corrections along one axis do not interfere with the other.
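In standard Bayesian decision-theoretic notation (a textbook identity, restated here in the report's symbols), the decomposition underlying this separation reads:

```latex
% Expected loss of action a given data D. The Belief channel calibrates the
% posterior P(theta | D); the Preference channel calibrates the loss L(a, theta).
% A correction applied to one factor leaves the other factor untouched.
\rho(a \mid D) \;=\; \int_{\Theta}
  \underbrace{P(\theta \mid D)}_{\text{Belief}}\;
  \underbrace{L(a, \theta)}_{\text{Preference}}
\,\mathrm{d}\theta
```

Because the two factors enter the expected loss multiplicatively and independently, a calibration error in one can be diagnosed and corrected without touching the other.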
Second, framework-level LLM control: the LLM adjusts the search framework (specifically, transfer function intercepts and penalty multipliers) rather than the 7-dimensional sorting parameter vector directly. This distinction matters because framework parameters encode reusable cross-round experience (e.g., "offline GMV predictions have been 5–8% optimistic over the past 3 rounds"), while bottom-level parameters are ephemeral facts tied to a single data snapshot Feurer et al. [2015]. Third, cumulative memory: a 7-table persistent Memory DB stores every offline-online observation, calibration update, and LLM proposal, enabling the system to build calibration intelligence across rounds rather than resetting after each iteration Parisi et al. [2019], Feurer et al. [2015].

The contributions of this report are as follows:

• Dual-channel self-adaptation framework. We propose a Belief/Preference decomposition that enables orthogonal calibration of offline-online transfer mapping and constraint sensitivity. In warm-start production A/B tests on a Country A recommendation platform, this framework pushes GMV from −3.6% to +9.2% within 7 rounds, with peak order volume reaching +12.5% (Section 4.3).

• LLM meta-controller with evidence-based reasoning. We introduce an LLM agent that adjusts two framework-level knobs, the transfer function intercept ($\delta_{\text{intercept}} \in [-0.1, +0.1]$) and the penalty multiplier ($\in [0.5, 2.0]$), grounded in 20-round episode histories and 30-update calibration traces. In the Country A warm-start experiment, the LLM's calibration effort converges from 5 correction items per round to 2 within 7 rounds, indicating that framework-level corrections diminish as evidence accumulates (Section 4.6).

• Persistent memory architecture for cumulative learning.
We demonstrate that a 7-table relational Memory DB enables the system to warm-start on existing calibration state and extend cross-round learning into subsequent experiments. The Country A case starts directly from an existing Memory DB state, while the Country B case completes a cross-market deployment without architectural modification (Sections 4.5 and 4.8).

The remainder of this report is organized as follows. Section 2 presents the system architecture, detailing the Influence Share metric, dual-channel mechanism, LLM meta-controller, and Memory DB design. Section 3 describes the operational framework, including the 10-step one-shot pipeline and YOLO continuous loop with failure recovery. Section 4 provides evaluation evidence from 30 optimization rounds across Country A and Country B, comprising 7 Country A warm-start rounds and 23 Country B deployment rounds. Section 6 discusses limitations, including residual parameter oscillation, statistical uncertainty in low-traffic windows, and insufficient evidence for extreme structural breaks, and outlines future directions.

2. System Architecture

In this section, we present the complete architecture of Sortify, a three-layer closed-loop system for continuous sorting parameter optimization. The system takes online A/B experiment data as input, calibrates its internal model of offline-online transfer relationships, and outputs optimized sorting parameters for the next deployment cycle. We detail the five core components: the Influence Share metric and parameter search engine (Section 2.2), the Belief channel for transfer calibration (Section 2.3), the Preference channel for constraint adaptation (Section 2.4), the LLM meta-controller (Section 2.5), and the persistent Memory DB (Section 2.6).
2.1 Overview

Sortify operates through a three-layer architecture that separates concerns across human configuration, intelligent calibration, and parameter search.

• Layer 1 (Outer, Human/Configuration): Defines optimization objectives (e.g., maximize $I_{\text{gmv}}$), constraint boundaries (e.g., $I_{\text{order}}$ degradation ≤ 5%), initial parameter ranges, and penalty weight seeds. This layer is set once per market and updated infrequently.

• Layer 2 (Middle, LLM + Algorithm): Performs dual-channel calibration. The Belief channel corrects the offline-to-online transfer mapping via LMS regression and LLM-driven intercept adjustments. The Preference channel adapts constraint penalty weights via multiplicative updates. The LLM meta-controller orchestrates both channels using evidence from accumulated episode history.

• Layer 3 (Inner, Optuna TPE Search): Executes 5,000 trials with 25 parallel workers in the 7-dimensional parameter space, operating under the calibrated target ranges and penalty weights produced by Layer 2.

The dual-channel design in Layer 2 is not an arbitrary engineering partition but a necessary consequence of rational decision axioms. L.J. Savage's (1954) theory of Subjective Expected Utility (SEU) establishes that any rational decision requires exactly two independent inputs: what you believe the world to be (probability/belief) and what you care about (utility/preference). If a system's behavior satisfies consistency axioms, its preferences can be represented by subjective probabilities and a utility function; the utility representation is unique only up to a positive affine transformation. Sortify operationalizes this axiom into a system-level decomposition:

• Belief channel (truth-seeking): Corresponds to the state belief in Savage's theory: "How do offline metrics map to online reality?" It concerns only objective physics and system regularities, carrying no value judgment.
• Preference channel (value-seeking): Corresponds to the utility function in Savage's theory: "How much does it hurt when a constraint is violated?" It defines only loss boundaries, without interfering with physical regularities.

If these two are not forcibly decoupled, the system falls into unavoidable cognitive biases: Wishful Thinking, distorting the objective assessment of offline-online transfer rates because of an intense desire for goal achievement; or Sour Grapes, reducing the importance of red-line constraints to make the overall loss function converge when transfer rates are poor. The dual-channel design severs this "diagnostic entanglement" at the architectural level, ensuring independent self-calibration along two orthogonal directions. Furthermore, an intelligent agent can err in two orthogonal directions, each requiring a mutually exclusive correction path:

1. Epistemic Error: "I predicted it would happen, but it didn't." The system's world model has failed. The correction must not modify penalty weights; it must correct the belief mapping.

2. Axiological Error: "It happened, but I didn't expect it to hurt this much." The prediction was correct but the utility assessment was insufficient. The correction must not modify the transfer mapping; it must correct the utility function.

The complete data flow forms a closed loop: online A/B data enters the Memory DB as a new episode → LMS updates transfer model slopes and intercepts → LLM proposes framework-level corrections → calibrated constraints feed into Optuna search → best parameters are published to Redis → new A/B data is generated → the cycle repeats. Each cycle completes in approximately 4 hours.

Figure 3 | Three-layer architecture diagram. Layer 1 (Human/Config) feeds objectives and constraints to Layer 2 (LLM + Algorithm), which contains the Belief channel, Preference channel, and LLM meta-controller. Layer 2 outputs calibrated target_range and penalty_weight to Layer 3 (Optuna TPE Search, 5000 trials × 25 workers). Layer 3 produces best parameters → Redis → online A/B → Memory DB → back to Layer 2.

2.2 Influence Share and Parameter Search Engine

Physical Intuition: Rooms and Walls in High-Dimensional Space

Before diving into the detailed mathematical derivation and the engineering motivation behind the system, let us establish an intuitive geometric and physical analogy to understand what "ranking" and "influence" truly represent.

Imagine a search request that recalls $n$ candidate items. From a traditional engineering perspective, ranking simply means calculating a total score for each of these $n$ items and ordering them from highest to lowest. However, if we shift our perspective to a high-dimensional geometric space, a completely different and more profound picture emerges.

We can view the scores of these $n$ items as a single "point" (a score vector) in an $n$-dimensional space. Within this vast space, there exist invisible "walls". When do we hit a wall? The critical state where any two items have exactly identical scores forms a "hyperplane wall" (a comparison boundary) in this space. Because there are $n$ items, the space is intersected by numerous such walls. These intersecting walls partition the entire $n$-dimensional space into multiple closed "rooms" (geometrically known as chambers). Every single "room" represents a specific, deterministic, and unique final ranking order. As long as our score vector, the "point", stays safely inside a particular room, no items have tied scores, and the final ranking list remains absolutely fixed.

Now, consider the task of optimizing the ranking algorithm. In Sortify, we do this by adjusting the parameters $\boldsymbol{\theta}$ (such as changing the weight of GMV), which alters the final scores. In our geometric analogy, adjusting the parameters is essentially pushing this "point" to move continuously through the space. As the parameters change, the point moves. If it merely wanders within the boundaries of a single room, the ranking order does not change at all, even though the underlying scores are shifting. It is only when the point crosses a "wall", stepping from one room into an adjacent one, that two items swap their ranks and the overall ranking list actually changes.

This brings us to the core insight of our system: the atomic event of ranking change is not the reshuffling of a global list, but the point crossing a specific wall. Following this logic, how should we define the "influence" of individual business factors (such as orders, GMV, or ads) on the final ranking? Since a rank change is equivalent to "the point crashing through a wall," we can simply perform a physical "force analysis." The distance to the wall is determined by the score difference between two items, and this total score difference is the sum of differences contributed by individual factors. Therefore, each factor acts as a "pusher" exerting force in a specific direction. The "influence" of each factor is determined by how much "force" it exerts in the normal direction of that wall to push the point across. The Influence Share metric measures exactly this: the percentage of pushing force each factor contributes relative to the total push. By adopting this physical intuition, we translate abstract parameter tuning and permutation-group dynamics into a crystal-clear force and kinetic attribution analysis in high-dimensional space.
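A toy sketch of the rooms-and-walls picture (all scores and weights below are made up for illustration): with two items scored by two factors, sweeping one weight moves the score point continuously, but the ranking flips only at the weight value where the pairwise margin crosses zero.

```python
# Toy illustration of "rooms and walls": the ranking of two items changes
# only when their pairwise score margin crosses zero (the "wall").
# Here margin = 0.2 - 0.4 * w_gmv, so the wall sits at w_gmv = 0.5.

def scores(w_gmv: float) -> tuple[float, float]:
    """Total scores for items A and B as the GMV weight varies."""
    # per-factor scores: (relevance, gmv); relevance weight fixed at 1.0
    a = 1.0 * 0.9 + w_gmv * 0.1
    b = 1.0 * 0.7 + w_gmv * 0.5
    return a, b

for w in (0.0, 0.4, 0.6, 1.0):
    a, b = scores(w)
    margin = a - b  # signed distance to the comparison wall
    ranking = "A>B" if margin > 0 else "B>A"
    print(f"w_gmv={w:.1f}  margin={margin:+.2f}  ranking={ranking}")
```

Between 0.0 and 0.4 the point wanders inside one room (scores change, ranking does not); between 0.4 and 0.6 it crosses the wall and the two items swap.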
This perspective liberates us from opaque rank correlation metrics (like Kendall Tau) and allows us to delve into the microscopic physical process of every rank swap to precisely quantify the contribution of each business factor.

2.2.1 Motivation: Beyond Kendall Tau

Traditional parameter search in sorting systems relies on rank correlation metrics such as Kendall Tau to measure how well a candidate parameter set preserves a target ranking order. While Kendall Tau captures ordinal agreement, it has two critical limitations for multi-factor sorting optimization. First, it is not decomposable: Kendall Tau produces a single scalar that cannot be attributed to individual factors, making it impossible to answer questions like "how much of the sorting decision is driven by GMV versus order count?" Second, it does not support factor trade-off analysis: because Kendall Tau has no sum-to-one constraint, there is no natural way to express "shift 5% of sorting influence from orders to GMV while keeping ad exposure stable."

2.2.2 Influence Share Definition: From Symmetric Groups to the Influence Permutation Algorithm

We do not treat Influence Share as an isolated heuristic definition. Instead, we derive it as a continuous attribution scheme built on the permutation structure of ranking itself.

Consider a request $q$ with $n_q$ items, indexed by $X_q = \{1, \ldots, n_q\}$. Under parameter vector $\boldsymbol{\theta}$, the ranking outcome for that request is a permutation in the symmetric group:

$$\pi_q(\boldsymbol{\theta}) \in S_{n_q}. \tag{1}$$

Here $\pi_q(\boldsymbol{\theta})(k)$ denotes the item placed at rank $k$. Since the symmetric group $S_{n_q}$ is generated by adjacent transpositions $s_k = (k, k+1)$, any difference between two rankings can be decomposed into a sequence of local swaps. For ranking control, the atomic event is therefore not "replacing one permutation by another" but flipping the relative order of a specific item pair.
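The claim that any difference between two rankings decomposes into adjacent swaps is exactly what bubble sort makes constructive: the number of adjacent transpositions needed to turn one ranking into another equals their Kendall-tau distance. A small illustration (not from the report):

```python
def adjacent_swaps(perm: list[int]) -> int:
    """Count adjacent transpositions needed to sort `perm` (bubble sort).

    Each swap is an application of a generator s_k = (k, k+1) of the
    symmetric group; the count equals the Kendall-tau distance to identity.
    """
    p = list(perm)
    swaps = 0
    for _ in range(len(p)):
        for k in range(len(p) - 1):
            if p[k] > p[k + 1]:
                p[k], p[k + 1] = p[k + 1], p[k]
                swaps += 1
    return swaps

# ranking [2, 0, 1] differs from the identity by two adjacent swaps
print(adjacent_swaps([2, 0, 1]))  # → 2
```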
To connect this discrete group structure to continuous parameters, let candidate parameters $\boldsymbol{\theta}$ induce a total score vector $\mathbf{S}_q(\boldsymbol{\theta}) \in \mathbb{R}^{n_q}$ for request $q$. For any $i \neq j$, define the comparison hyperplane and the corresponding chamber:

$$H^q_{ij} := \{x \in \mathbb{R}^{n_q} : x_i = x_j\}, \qquad C_\pi := \{x : x_{\pi(1)} > x_{\pi(2)} > \cdots > x_{\pi(n_q)}\}. \tag{2}$$

When $\mathbf{S}_q(\boldsymbol{\theta}) \in C_\pi$, the ranking is exactly permutation $\pi$; when it crosses some $H^q_{ij}$, the relative order of items $i$ and $j$ flips. This is the classical reflection-geometric representation of the symmetric group. (To avoid notational ambiguity, we assume scores are in generic position; if ties occur, a fixed deterministic tie-breaker resolves them, so the final ranking can still be treated as an element of $S_{n_q}$.) The local coordinate governing ranking change is therefore the pairwise margin:

$$\Delta_q(i, j; \boldsymbol{\theta}) := S_q(i; \boldsymbol{\theta}) - S_q(j; \boldsymbol{\theta}). \tag{3}$$

In Sortify, the total score is additively decomposed over factors:

$$S_q(i; \boldsymbol{\theta}) = \sum_{f \in \mathcal{F}} \mathrm{score}_{q,f}(i; \boldsymbol{\theta}), \qquad \Delta_q(i, j; \boldsymbol{\theta}) = \sum_{f \in \mathcal{F}} \Delta_{q,f}(i, j; \boldsymbol{\theta}), \tag{4}$$

where

$$\Delta_{q,f}(i, j; \boldsymbol{\theta}) := \mathrm{score}_{q,f}(i; \boldsymbol{\theta}) - \mathrm{score}_{q,f}(j; \boldsymbol{\theta}). \tag{5}$$

This gives the key bridge from group theory to algorithm design: parameter updates do not act on a discrete permutation directly. They first reshape the margin of each pair along the normal direction of the comparison wall; the sign of that margin determines whether a swap occurs; and the accumulation of such swaps yields a new permutation. Under this lens, parameter search becomes an influence permutation algorithm: it continuously reallocates each factor's share of responsibility for potential swaps across important item pairs.
To turn that responsibility into a stable metric, we impose five attribution requirements at the pair level: locality, sign invariance, non-negativity, partition of unity, and invariance under common positive rescaling. Under these requirements, the simplest choice is an $L_1$ normalization of absolute factor margins. Define the total pairwise influence budget as:

$$Z^q_{ij}(\boldsymbol{\theta}) := \sum_{f \in \mathcal{F}} \left| \Delta_{q,f}(i, j; \boldsymbol{\theta}) \right|. \tag{6}$$

If $Z^q_{ij}(\boldsymbol{\theta}) = 0$, then no factor provides any discriminative signal for that pair, so the pair carries no ranking information and can be excluded from aggregation. For all informative pairs, the pairwise influence share of factor $f$ is defined as:

$$s_{q,f}(i, j; \boldsymbol{\theta}) = \frac{\left| \Delta_{q,f}(i, j; \boldsymbol{\theta}) \right|}{Z^q_{ij}(\boldsymbol{\theta})}. \tag{7}$$

It follows immediately that $\sum_{f \in \mathcal{F}} s_{q,f}(i, j; \boldsymbol{\theta}) = 1$. The role of the absolute value is to separate direction from magnitude: the sign of $\Delta_{q,f}$ indicates which item the factor pushes forward, while $|\Delta_{q,f}|$ measures how strongly it pushes. Holding other components fixed, $s_{q,f}(i, j; \boldsymbol{\theta})$ increases monotonically with $|\Delta_{q,f}(i, j; \boldsymbol{\theta})|$, giving an interpretable local influence signal without requiring the global permutation to vary monotonically with every parameter.

To move from pair-level shares to a request-level or dataset-level metric, we aggregate only informative and business-relevant pairs. Let $P_q$ denote the pair set for request $q$; it may contain all informative pairs or be restricted to a Top-$(K+L)$ region. For $(i, j) \in P_q$, let $r^q_i$ denote the rank of item $i$ in request $q$, and define the position-sensitive weight:

$$w^q_{ij} = \exp\left(-\frac{\min(r^q_i, r^q_j)}{\tau}\right). \tag{8}$$

Here $\tau$ controls the decay rate. Pairs involving higher-ranked items receive larger weights, reflecting the business reality that top-of-page positions matter disproportionately.
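A minimal sketch of Eqs. (6)–(8); the factor names and margin values are hypothetical, used only to show the normalization and weighting behavior:

```python
import math

def pairwise_shares(factor_margins):
    """L1-normalized influence shares s_{q,f}(i,j) from per-factor
    margins Delta_{q,f}(i,j) (Eqs. 6-7). Returns None for
    uninformative pairs, which are excluded from aggregation."""
    z = sum(abs(d) for d in factor_margins.values())  # budget Z_qij
    if z == 0:
        return None  # pair carries no ranking information
    return {f: abs(d) / z for f, d in factor_margins.items()}

def position_weight(rank_i, rank_j, tau=3.0):
    """Position-sensitive weight w_qij = exp(-min(r_i, r_j)/tau), Eq. 8."""
    return math.exp(-min(rank_i, rank_j) / tau)

# Hypothetical pair: GMV pushes item i forward, orders push item j.
shares = pairwise_shares({"gmv": 0.6, "orders": -0.3, "ecpm": 0.1})
# The shares sum to 1, and GMV holds 60% of this pair's budget.
```

Note the two properties claimed in the text: the sign of each margin is discarded (only magnitude contributes), and rescaling all margins by a common positive constant leaves the shares unchanged.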
The aggregate Influence Share for factor $f$ is then:

$$I_f(\boldsymbol{\theta}) = \frac{\sum_q \sum_{(i,j) \in P_q} w^q_{ij} \cdot s_{q,f}(i, j; \boldsymbol{\theta})}{\sum_q \sum_{(i,j) \in P_q} w^q_{ij}}. \tag{9}$$

By construction, $\sum_{f \in \mathcal{F}} I_f(\boldsymbol{\theta}) = 1$. This sum-to-one property is what enables "exchange" reasoning: increasing $I_{\mathrm{gmv}}$ by 5 percentage points necessarily decreases the other factors by a combined 5 percentage points, making trade-offs explicit and quantifiable. At this point the derivation closes: a ranking in $S_{n_q}$ is generated by pairwise swaps, swaps are triggered by pairwise margins crossing comparison walls, and Influence Share measures how much of those potential swaps is attributable to each factor.

Operationally, the influence permutation algorithm consists of five steps: compute factor-wise scores and sum them into total scores; sort to obtain $\pi_q(\boldsymbol{\theta})$; compute $\Delta_{q,f}$ and $s_{q,f}$ on $P_q$; aggregate with $w^q_{ij}$; and pass $I_f(\boldsymbol{\theta})$ to the downstream objective and constraints.

2.2.3 Multi-Objective Search Formulation

The parameter search optimizes a composite objective that maximizes the primary business metric ($I_{\mathrm{gmv}}$ uplift) while penalizing constraint violations via quadratic penalties:

$$\mathcal{J}(\boldsymbol{\theta}) = 10 \cdot \underbrace{\frac{I_{\mathrm{gmv}}(\boldsymbol{\theta}) - I^{\mathrm{base}}_{\mathrm{gmv}}}{|I^{\mathrm{base}}_{\mathrm{gmv}}|}}_{\text{relative uplift}} \;-\; \sum_{j=1}^{J} \lambda_j \cdot \left[\max(0, v_j(\boldsymbol{\theta}))\right]^2 \tag{10}$$

where $\boldsymbol{\theta}$ = (ps_ads_wo, ps_ads_wg, ps_org_wo, ps_org_wg, ps_porg_w, ps_price_pow, ps_w2) is the 7-dimensional parameter vector, $v_j(\boldsymbol{\theta})$ is the violation of the $j$-th constraint (positive when violated), and $\lambda_j$ is the penalty weight for constraint $j$. The quadratic penalty form is a deliberate design choice:

$$\mathrm{penalty}_j = \lambda_j \cdot v_j^2 \tag{11}$$

A 1% violation incurs a penalty of $\lambda_j \times 0.01^2$, while a 10% violation incurs $\lambda_j \times 0.10^2$, a 100x difference.
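The composite objective of Eqs. (10)–(11) can be sketched as follows; the uplift values, violations, and penalty weight below are illustrative, not production settings:

```python
def objective(i_gmv, i_gmv_base, violations, weights):
    """Composite objective of Eq. (10): scaled relative I_gmv uplift
    minus quadratic penalties on positive constraint violations."""
    uplift = 10.0 * (i_gmv - i_gmv_base) / abs(i_gmv_base)
    penalty = sum(w * max(0.0, v) ** 2
                  for v, w in zip(violations, weights))
    return uplift - penalty

# A 10% violation is penalized 100x harder than a 1% violation:
#   20000 * 0.10**2 = 200.0   vs   20000 * 0.01**2 = 2.0
```

Satisfied constraints (negative $v_j$) contribute nothing, so the search is free to trade them off against the primary uplift term.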
This convexity tolerates minor boundary brushes while aggressively penalizing large violations, biasing the search toward solutions where all constraints are mildly satisfied rather than one being severely violated. Linear penalties, by contrast, would treat a 10% violation as merely 10x worse than a 1% violation, which is insufficient for industrial constraint enforcement. Constraints include bounds on $I_{\mathrm{order}}$ degradation, $I_{\mathrm{ecpm\_term}}$ degradation, ads top-10 exposure rate, and $I_{\mathrm{gmv\_ads}}$. The penalty weights $\lambda_j$ are initialized at values between 15,000 and 30,000 and are subsequently adapted by the Preference channel (Section 2.4).

Figure 4 | Flow diagram showing: item pairs → pairwise score differences → per-factor share (sum = 1) → rank-weighted aggregation → $I_{\mathrm{gmv}}$, $I_{\mathrm{order}}$, $I_{\mathrm{ecpm\_term}}$. Below: 7 parameters map into the sorting formula, changing factor contributions.

2.2.4 Search Engine

The search is executed by Optuna's TPE (Tree-structured Parzen Estimator) sampler Akiba et al. [2019], which models the conditional distribution of parameters given good versus bad objective values. Each round runs 5,000 trials with 25 parallel workers using the Constant Liar strategy: when a worker starts a trial before others finish, it assumes pending trials will return a pessimistic value, preventing redundant exploration. A typical search completes in 15–30 minutes. As demonstrated in Section 4.3, this search engine consistently finds parameter configurations that achieve $I_{\mathrm{gmv}}$ uplifts of +4.9% to +41.6% relative to baseline, subject to calibrated constraints.

2.3 Belief Channel: Offline-Online Transfer Calibration

2.3.1 Motivation

The offline-online transfer gap documented in Section 1 is not a fixed constant: it varies across metrics, drifts over time, and shifts abruptly with traffic pattern changes.
A static calibration (e.g., "offline GMV predictions are always 3x too optimistic") would become stale within days. The Belief channel addresses this by maintaining a continuously updated linear transfer model for each metric pair, combining a slow algorithmic estimator (LMS) with a fast pattern-recognizing agent (LLM).

From a Bayesian epistemological perspective, the Belief channel embodies two fundamentally different philosophies for confronting an uncertain world, a dual tempo of belief updating. Normal science with smooth priors (LMS) represents the classical Bayesian framework: it assumes the world is continuous (Markovian), with each round of new evidence incrementally refining the prior. Paradigm shifts and black swans (LLM intercept jumps) represent the decisive recognition of structural breakpoints when the underlying distribution undergoes cliff-like changes: the old prior is discarded and a new one forcibly injected. The former handles routine distributional drift; the latter responds to sudden paradigmatic disruptions. Together, they constitute a complete response to an uncertain world.

2.3.2 Linear Transfer Model

For each metric pair (e.g., $I_{\mathrm{gmv}} \to$ GMV, $I_{\mathrm{order}} \to$ Orders), we model the offline-to-online relationship as:

$$\hat{u}_{\mathrm{online}} = \alpha \cdot u_{\mathrm{offline}} + \beta \tag{12}$$

where $u_{\mathrm{offline}}$ and $u_{\mathrm{online}}$ are the observed offline and online uplifts respectively, $\alpha$ (slope) captures the transfer rate, and $\beta$ (intercept) captures the systematic bias. Six such relationships are maintained: GMV~$I_{\mathrm{gmv}}$, Orders~$I_{\mathrm{order}}$, Ads Revenue~$I_{\mathrm{ecpm\_term}}$, and three derived metrics.

2.3.3 LMS Continuous Calibration

After each round, the slope and intercept are updated via Least Mean Squares (LMS) regression with learning rate $\eta = 0.2$:

$$\alpha_{t+1} = \alpha_t + \eta \cdot e_t \cdot u_{\mathrm{offline},t} \tag{13}$$

$$\beta_{t+1} = \beta_t + \eta \cdot e_t \tag{14}$$

where the prediction error $e_t = u_{\mathrm{online},t} - \hat{u}_{\mathrm{online},t}$.
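A minimal sketch of the LMS update in Eqs. (13)–(14), driven by a hypothetical repeated observation in which a +4% offline uplift transfers to only +1% online:

```python
def lms_update(alpha, beta, u_offline, u_online, eta=0.2):
    """One LMS step on the transfer model u_hat = alpha*u_off + beta
    (Eqs. 13-14): the slope moves proportionally to the offline
    input, the intercept by the raw prediction error."""
    e = u_online - (alpha * u_offline + beta)  # prediction error e_t
    return alpha + eta * e * u_offline, beta + eta * e

# Hypothetical drift: offline +4% repeatedly maps to only +1% online.
a, b = 1.0, 0.0
for _ in range(30):
    a, b = lms_update(a, b, 0.04, 0.01)
# After enough rounds, a*0.04 + b converges toward 0.01.
```

Because the per-step correction is a fixed fraction of the residual, the error shrinks geometrically, which is exactly why recovery from a sudden structural break takes many rounds (discussed next).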
LMS provides stable, monotonic convergence under gradual distribution drift. However, under sudden structural shifts (e.g., a promotional event or a traffic source change), LMS with $\eta = 0.2$ requires approximately 15 rounds to reach 95% convergence, equivalent to over 2 days of production data, during which the system operates with a miscalibrated transfer model.

2.3.4 LLM Selective Jumps

To compress recovery time from structural breaks, the LLM meta-controller (Section 2.5) can directly adjust the intercept $\beta$ in discrete jumps of up to $\pm 0.1$:

$$\beta_{\mathrm{new}} = \beta_{\mathrm{old}} + \Delta\beta_{\mathrm{LLM}}, \qquad \Delta\beta_{\mathrm{LLM}} \in \{0, \pm 0.05, \pm 0.1\} \tag{15}$$

The LLM examines the 20-round episode history and can detect patterns invisible to single-observation LMS, for example, "5 consecutive rounds where GMV transfer was pessimistic despite positive $I_{\mathrm{gmv}}$ trends." When the LLM detects such a pattern, a single intercept jump can accomplish what LMS would need 10+ rounds to achieve. Critically, the LLM adjusts only the intercept, not the slope, because intercept shifts represent systematic bias changes while slope changes indicate fundamental relationship restructuring, which should be left to the more conservative LMS estimator.

The output of the Belief channel is a calibrated target_range for each constrained metric: the offline uplift range that, after applying the calibrated transfer function, maps to an acceptable online outcome.

Figure 5 | Two-axis diagram. Horizontal axis: Belief (target_range = position of constraint boundary). Vertical axis: Preference (penalty_weight = hardness of constraint boundary). Arrows show LMS moving continuously along the Belief axis, the LLM making discrete jumps along the Belief axis, and violation pressure moving along the Preference axis. The two axes are orthogonal: corrections on one do not affect the other.
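Deriving the calibrated target_range amounts to inverting Eq. (12). A sketch with an illustrative slope, intercept, and acceptable online window (the function name and all values are assumptions, not Sortify's actual code):

```python
def derive_target_range(alpha, beta, online_lo, online_hi):
    """Invert the transfer model u_hat = alpha*u_off + beta to get
    the offline uplift range mapping into [online_lo, online_hi]."""
    if alpha == 0:
        raise ValueError("degenerate transfer model: slope is zero")
    lo, hi = (online_lo - beta) / alpha, (online_hi - beta) / alpha
    return (min(lo, hi), max(lo, hi))  # robust to negative slopes

# Slope 0.25, intercept -0.002, acceptable online window [-0.5%, +3%]:
target = derive_target_range(0.25, -0.002, -0.005, 0.03)
```

With a transfer rate well below 1, the acceptable offline range is much wider than the online window, which is precisely how the Belief channel converts calibration into search-constraint boundaries.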
As shown in Section 4.4, even under warm-start conditions the Belief channel must continue updating with new observations: in Exp-401838, R2 still exhibited a clearly optimistic offline bias, while by R7 the same kind of positive offline signal had become able to translate into significantly positive online GMV.

2.4 Preference Channel: Constraint Penalty Adaptation

2.4.1 Motivation

Even with a perfectly calibrated transfer model, the search framework still needs to know how hard to enforce each constraint. A constraint that was harmlessly brushed in round $N$ may become critically binding in round $N+1$ due to changing market conditions. The Preference channel continuously adjusts constraint penalty weights based on observed violation patterns.

Within Savage's SEU framework, the Preference channel addresses a fundamentally different error type from the epistemic error corrected by the Belief channel: axiological error, "It happened, but I didn't expect it to hurt this much." The system's prediction was entirely correct, but its assessment of the outcome's "pain" (utility loss) was severely insufficient. For instance, the system accurately predicted a minor decline in a metric, and the online result confirmed a minor decline, yet this small decline triggered a catastrophic alert in the business dashboard.
In this case, the correction path must not touch the transfer mapping (the world model is not wrong) but must instead correct the utility function: the system is learning that "this red line is far harder than expected; next time, even at the cost of the primary objective, it must never be crossed again."

2.4.2 Violation Pressure

For each constraint $j$, we compute a normalized violation pressure:

$$p_j = \frac{v_j(\boldsymbol{\theta})}{v^{\mathrm{threshold}}_j} \tag{16}$$

where $v_j(\boldsymbol{\theta})$ is the observed violation magnitude and $v^{\mathrm{threshold}}_j$ is the acceptable violation threshold. A pressure of $p_j > 0$ indicates the constraint is being violated; $p_j \leq 0$ indicates it is satisfied.

2.4.3 Asymmetric Multiplicative Update

The penalty weight update follows a multiplicative rule with asymmetric step sizes:

$$\delta_j = \begin{cases} \delta_{\mathrm{up}} = 0.25 & \text{if } p_j > 0 \\ -\delta_{\mathrm{down}} = -0.08 & \text{if } p_j \leq 0 \end{cases} \tag{17}$$

$$\lambda^{(t+1)}_j = \lambda^{(t)}_j \cdot \exp(\delta_j) \tag{18}$$

This design encodes three deliberate properties:

1. Scale invariance. Multiplicative updates apply equal percentage changes regardless of the absolute magnitude of $\lambda_j$. When one constraint has $\lambda = 1{,}000$ and another has $\lambda = 80{,}000$, equal violation pressure produces proportionally equal adjustments. Additive updates would over-correct small weights and under-correct large ones.

2. Loss aversion. The asymmetry $\delta_{\mathrm{up}} = 0.25 > \delta_{\mathrm{down}} = 0.08$ means that tightening a constraint (in response to a violation) is roughly 3x faster than relaxing it (when satisfied). This reflects the industrial reality that the cost of a constraint violation (e.g., ad revenue collapse) outweighs the opportunity cost of an overly tight constraint (e.g., slightly suboptimal GMV).

3. Log-space interpretability. Because the update is multiplicative in the original space and additive in log-space, $\delta_{\mathrm{up}}$ and $\delta_{\mathrm{down}}$ are directly interpretable as one-round tightening and relaxation steps.
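The update rule of Eqs. (16)–(18) can be sketched as follows; global floor/ceiling clipping (described later in Section 2.5.4) is folded in here for illustration, and the specific bound values are assumptions:

```python
import math

def update_penalty(weight, violation, threshold,
                   delta_up=0.25, delta_down=0.08,
                   floor=1_000.0, ceiling=80_000.0):
    """Asymmetric multiplicative update of Eqs. (16)-(18): scale the
    weight by e^{0.25} when the constraint is violated, by e^{-0.08}
    when satisfied, then clip to global bounds (illustrative values)."""
    pressure = violation / threshold            # p_j, Eq. (16)
    step = delta_up if pressure > 0 else -delta_down
    return min(ceiling, max(floor, weight * math.exp(step)))

# Equal pressure yields an equal *percentage* change at any scale:
# update_penalty(1_000, 0.02, 0.01) / 1_000
#   == update_penalty(50_000, 0.02, 0.01) / 50_000
```

The exponential form makes the asymmetry explicit in log-space: one violated round adds 0.25 to $\log \lambda_j$, while one satisfied round subtracts only 0.08.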
They are intentionally asymmetric, making the system more responsive to violations than to relaxation opportunities.

The Preference channel's output is a set of calibrated penalty_weight values that feed directly into the objective function (Eq. 10). As demonstrated in Section 4.5, the Preference channel's adaptive weights contribute to narrowing parameter oscillation amplitude across successive experiments.

2.5 LLM Meta-Controller

2.5.1 Motivation

The LMS estimator in the Belief channel handles routine distribution drift effectively but is fundamentally a single-observation, fixed-step-size algorithm. It cannot detect multi-round patterns (e.g., "5 consecutive rounds of pessimistic GMV transfer"), distinguish between noise and structural breaks, or reason about the relationship between Belief corrections and Preference corrections. The LLM meta-controller fills this gap by providing episodic pattern recognition and cross-channel reasoning.

From a cybernetics perspective, the LLM meta-controller achieves a leap from first-order control to second-order control (reflection). First-order control, the Optuna search engine finding optimal solutions in the high-dimensional parameter space under given Belief and Preference conditions, is direct control over the "environment." The LLM meta-controller, by contrast, does not directly manipulate the low-level sorting parameters $\boldsymbol{\theta}$. Its inputs are "how the system has been perceiving data" (LMS update histories and observation errors in the Memory DB): it is learning how the system learns. This intervention on framework-level meta-parameters (intercepts and penalty multipliers) endows the system with the capacity for self-reflection.
What the LLM accomplishes is not parameter optimization, but "belief epiphany" and "value recalibration."

2.5.2 Two Orthogonal Knobs

The LLM operates on exactly two output variables:

$$\Delta\beta_{\mathrm{LLM}} \in [-0.1, +0.1] \quad \text{(Belief channel: intercept adjustment)} \tag{19}$$

$$m_{\mathrm{LLM}} \in [0.5, 2.0] \quad \text{(Preference channel: penalty multiplier)} \tag{20}$$

These bounds are safety constraints, not suggestions. A single-round intercept shift of $\pm 0.1$ is already a large correction; the cap prevents catastrophic miscalibration from a single LLM error. The penalty multiplier range of $[0.5, 2.0]$ allows halving or doubling a constraint's enforcement strength, sufficient for responding to structural breaks without enabling runaway penalty inflation.

A critical design decision is that the LLM adjusts framework-level parameters (intercepts and penalty multipliers), not the 7-dimensional bottom-level sorting parameter vector $\boldsymbol{\theta}$. The rationale is grounded in the distinction between experience and facts: "offline GMV predictions have been 5–8% optimistic over the past 3 rounds" is a reusable structural insight that transfers across data snapshots, while "ps_ads_wg = 2.15" is a one-time fact about today's data distribution that may be 1.73 tomorrow. The LLM operates in the space of reusable experience; Optuna operates in the space of ephemeral facts.

2.5.3 Context and Evidence Requirements

The LLM receives a structured context JSON containing: (1) the current slope and intercept for all 6 prior relations, (2) the 20 most recent episodes with paired offline-online metrics, (3) the 30 most recent prior update history records (both LMS and LLM), and (4) the current penalty weight state. Every LLM proposal must cite specific episode keys as evidence. Proposals without evidence are rejected.
When the LLM’ s confidence is lo w — f or instance, if the recent episodes sho w contradictory signals — it is instructed to retur n an empty proposal (no adjustment), which is preferable to a w eakly grounded cor rection. 2.5.4 Implementation: Model, Promp t, and Output Safe ty Model and inv ocation. The meta-controller inv okes an LLM through the Codex CLI’ s 2 exec sub- command in a sandbox ed subprocess. Sortify implements its o wn memory sys tem (Section 2.6 ) and 2 OpenAI Codex CLI, https://github.com/openai/codex 16 Let the Ag ent Steer : Closed-Loop Ranking Optimization via Influence Exchang e Figure 6 | Input assembl y: 20-round episode history (offline/online metr ic pairs) + 30 prior update records (LMS and LLM cor rection history) + cur rent penalty w eights → Context JSON → LLM reasoning → Output: {delta_intercept, penalty_multiplier , reason, e vidence_ke ys}. T w o ar row s from output: delta_intercept f eeds into Belief channel (intercept update), penalty_multiplier feeds into Pref erence channel (w eight scaling). Saf ety bounds sho wn as guard rails on both outputs. tool-calling mechanisms, making it a full y capable autonomous agent in its o wn r ight—Code x ser v es solel y as the underl ying LLM inf erence engine, not as an agent framew ork. The in v ocation command is: codex exec - --skip-git-repo-check \ --sandbox read-only --color never \ -c model_reasoning_effort="high" The prompt is passed through standard input, and the response is read from s tandard output. The reasoning effor t is set to high, which activ ates extended reasoning before the model produces its s tr uctured JSON output. A 300-second timeout pre vents pipeline stalls. No temperature parameter is set e xplicitly ; the e xtended-reasoning mode subsumes sampling control b y directing the model to reason thoroughly bef ore committing to an answ er . Prom pt structure. The LLM receiv es a single-shot prompt consisting of tw o concatenated par ts: 1. 
System instructions (~600 words, hardcoded): define the output JSON schema (proposal_id, plus an array of proposals each containing relation_key, delta_intercept, penalty_multiplier, reason, and evidence_episode_keys), specify the safety bounds ($|\Delta\beta| \leq 0.1$, $m \in [0.5, 2.0]$), and frame the task as Bayesian decision alignment, separating belief updates (intercept corrections on the transfer function) from preference updates (loss-weight scaling for constraint violations).

2. Context JSON (dynamically generated): the structured context described in Section 2.5.3, exported by the export-llm-context command from the Memory DB.

No few-shot examples are provided. The prompt is deliberately kept zero-shot to avoid anchoring the model on past correction patterns, which could suppress novel structural insights.

Output parsing and safety layers. The LLM's raw text output passes through a seven-layer safety pipeline before any system state is modified:

1. Prompt-level instruction: the system instructions explicitly constrain output format and value ranges.

2. Robust JSON extraction: a three-strategy parser attempts direct json.loads, then Markdown code-block extraction via regex, then a brace-depth state machine to isolate the first complete JSON object from potentially noisy output.

3. Schema validation and type coercion: each proposal field is validated; a missing delta_intercept defaults to 0.0, a missing penalty_multiplier defaults to 1.0, and all numeric fields are cast to float. Malformed entries are silently dropped.

4. Relation key whitelist: every relation_key in the proposal is checked against the set of keys present in the context's prior relations. Unknown keys trigger an immediate rejection.

5. Hard clamp: regardless of the LLM's output, $\Delta\beta_{\mathrm{LLM}}$ is clamped to $[-0.1, +0.1]$ and $m_{\mathrm{LLM}}$ to $[0.5, 2.0]$ in code.

6. Global weight bounds: even after multiplicative scaling, penalty weights stay within configured floors and ceilings (e.g., $[1{,}000, 80{,}000]$), preventing cumulative drift across rounds.

7. Failure fallback: if the LLM call times out, returns unparseable output, or produces an empty proposal array, the system falls back to a no-op proposal ($\Delta\beta = 0$, $m = 1.0$). The pipeline continues without interruption.

This defense-in-depth design ensures that the LLM is a bounded advisor: it can accelerate convergence through structural insight, but cannot destabilize the system even under adversarial or degenerate outputs.

2.5.5 Coordination with LMS

LMS and LLM operate serially within each round: LMS updates first (Step 5 of the pipeline), then the LLM reads the post-LMS state and proposes adjustments (Steps 7–8). This ordering ensures the LLM operates on the most current algorithmic calibration. To prevent oscillation between channels, a freeze rule applies: if the Belief channel update exceeds a threshold magnitude, the Preference channel freezes for one round, preserving diagnostic clarity. As demonstrated in Section 4.6, in Exp-401838 the LLM calibration effort converges from 5 correction items per round to 2 within 7 rounds, indicating that framework-level corrections diminish as evidence accumulates.

2.6 Memory DB and State Persistence

2.6.1 Motivation

Without persistent state, the system suffers from a "Groundhog Day" effect: each round starts with no knowledge of past transfer relationships, penalty sensitivities, or LLM correction patterns. After 30 rounds, the system would possess no more calibration intelligence than it had at round 1. The Memory DB solves this by providing a relational store that accumulates every observation, calibration update, and decision across the system's lifetime.
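The idempotent-write pattern that underpins such a store can be illustrated with a minimal SQLite sketch (a single illustrative table and column set, not Sortify's actual schema):

```python
import sqlite3

# Illustrative only: a composite primary key makes writes idempotent,
# so a retried pipeline step cannot duplicate or corrupt rows.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE prior_observations (
        relation_key   TEXT,
        episode_key    TEXT,
        offline_uplift REAL,
        online_uplift  REAL,
        PRIMARY KEY (relation_key, episode_key)
    )""")

def record_observation(relation_key, episode_key, offline, online):
    """Writing the same observation twice produces identical state."""
    con.execute(
        "INSERT OR REPLACE INTO prior_observations VALUES (?, ?, ?, ?)",
        (relation_key, episode_key, offline, online))
    con.commit()

record_observation("gmv", "ep-001", 0.041, 0.012)
record_observation("gmv", "ep-001", 0.041, 0.012)  # retry: no duplicate
```

An append-only audit log (no deletes or updates, as with memory_events) is the complementary pattern: idempotency for state tables, immutability for history tables.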
2.6.2 Schema Design

The Memory DB consists of 7 SQLite tables designed for idempotent writes and immutable audit:

Table 1 | Memory DB schema overview.

| Table | Role | Key Fields | Write Pattern |
|---|---|---|---|
| episodes | Fact table: one row per round with paired offline-online snapshots | episode_key, offline_metrics, online_uplifts, status | Append-only; status: offline → deployed → online |
| offline_candidates | Top-5 parameter candidates per round | episode_key, rank, params, metrics | Append per round |
| prior_relations | Current Belief channel state: slope & intercept per metric pair | relation_key, slope, intercept, eta | Upsert per round |
| prior_observations | Raw (offline, online, beta) triples per metric pair and episode | relation_key, episode_key, offline_uplift, online_uplift | Append with idempotency |
| prior_update_history | Belief channel audit trail | method (lms/llm), old/new slope, old/new intercept | Append-only |
| penalty_wt_upd_hist | Preference channel audit trail | constraint_metric, pressure, step, old/new weight | Append-only |
| memory_events | Immutable global audit log | event_type, idempotency_key, metadata_json | Append-only |

The schema enforces two critical properties:

1. Idempotency. Composite keys (relation_key + episode_key) in prior_observations prevent duplicate writes if a pipeline step is retried after failure. The same observation written twice produces identical state.

2. Immutability. The memory_events table serves as a write-ahead log. No row is ever deleted or modified, providing a complete forensic trace for debugging calibration anomalies.

2.6.3 Cross-Round Intelligence

The Memory DB enables three forms of cross-round intelligence:

• LLM context assembly. The 20 most recent episodes and the 30 most recent update history records are extracted and formatted as the LLM's input context.
As the database grows, the LLM gains increasingly rich evidence for pattern detection.

• Warm-start across experiments. Exp-401838 started from an existing Memory DB state with pre-calibrated transfer models and penalty weights, rather than default slopes and intercepts. This capability allows the system to extend calibration intelligence from prior episodes into new production rounds (Sections 4.3 and 4.5).

• Cross-market transfer. When deploying to Country B, the Memory DB schema was reused without modification. The system initialized with default priors but accumulated Country B-specific calibration within a single round, correctly identifying that organic click weights (ps_org_wc) were severely underweighted in the baseline formula (Section 4.8).

As demonstrated in Section 4.8, the persistent Memory DB enables Sortify to generalize across markets without architectural changes, validating the design principle that calibration intelligence should be cumulative rather than ephemeral.

Figure 7 | Complete data flow across one YOLO cycle: (1) A/B data pulled → (2) episode recorded in Memory DB → (3) LMS reads prior_observations, updates prior_relations → (4) LLM context assembled from episodes + update histories → (5) LLM proposal applied to intercepts and penalty weights → (6) target ranges and penalty weights feed into the Optuna objective → (7) 5,000-trial search → (8) best parameters published to Redis → (9) new A/B data generated → back to (1). The Memory DB is the central state store connecting all steps.

3.
Operational Framework

In this section, we describe how Sortify operates as a production system: the infrastructure it runs on, the step-by-step pipeline for a single optimization round, and the continuous loop that chains rounds into an autonomous, self-recovering process. The goal is to demonstrate that the architecture described in Section 2 is not merely a theoretical framework but an operationally viable system that runs unattended with built-in failure handling.

3.1 System Infrastructure

Table 2 | System configuration.

| Component | Specification |
|---|---|
| Search engine | Optuna TPE sampler Akiba et al. [2019] |
| Trials per round | 5,000 |
| Parallel workers | 25 (Constant Liar strategy) |
| State database | SQLite (agent_memory.db) |
| Parameter publishing | Redis key-value store |
| Search time per round | 15–30 minutes |
| Data accumulation window | 3.5 hours (minimum) |
| Processing time per round | 25–50 minutes |
| Total cycle time | ~4 hours |
| Daily throughput | ~6 rounds/day |
| Time alignment | 15-minute boundary snapping |

Takeaway: the total cycle time of approximately 4 hours (3.5 hours for online data accumulation plus 25–50 minutes for processing) enables roughly 6 optimization rounds per day. This cadence is fast enough to respond to intra-day traffic pattern shifts while accumulating sufficient data volume for statistically meaningful A/B comparisons.

The system's runtime footprint is deliberately lightweight. The Memory DB is a single SQLite file; the search engine runs on a single machine; parameter publishing uses a standard Redis instance shared with other production services. No GPU resources are required. The only external dependency beyond the platform's existing A/B infrastructure is the LLM API call for meta-controller proposals.

3.2 One-Shot Pipeline

Each optimization round executes a 10-step pipeline.
The design target is restart safety, not strict run-to-run determinism: persisted state transitions (such as episode writes, prior updates, and state-file updates) are idempotent, so re-running a completed write step after an interruption does not corrupt state; however, steps that depend on external services, especially LLM proposal generation, are not assumed to return byte-identical outputs on every retry.

Steps 1–4: Data Collection.

• Step 1: Pull A/B report. Fetch the online A/B experiment report for the current observation window. The report contains uplift percentages for GMV, Orders, Ads Revenue, Advertiser Value, Organic Clicks, and Ads CTR relative to the control group.

• Step 2: Initialize Memory DB. If this is the first round, create the 7-table schema. On subsequent rounds, verify schema integrity and open the existing database.

• Step 3: Record offline metrics. Write the current round's offline search results (objective value, Influence Share metrics, parameter vector, top-5 candidates) as a new episode in the Memory DB with status offline_recorded.

• Step 4: Record online metrics. Parse the A/B report from Step 1, write online uplift values to the episode record, and transition the episode status to online_recorded.

Step 5: LMS Calibration (Belief Channel). For each of the 6 prior relations, compute the prediction error between the calibrated transfer model's forecast and the actual online uplift, then apply LMS updates to the slope and intercept (Eqs. 13 and 14). All updates are persisted to prior_relations and logged in prior_update_history with method lms_regression.

Step 6: LLM Context Export.
Assemble the LLM context JSON by extracting: (1) the current slope/intercept for all 6 relations, (2) the 20 most recent episodes, (3) the 30 most recent prior update history records, and (4) the current penalty weight state. Write the context to a timestamped file for audit.

Step 7: LLM Proposal Generation. Call the LLM API with the structured context and system prompt. The LLM returns a JSON proposal containing delta_intercept entries (one per relation) and penalty_multiplier entries (one per constraint), each with a reason field and evidence_keys citing specific episodes. If confidence is low, the LLM returns an empty proposal.

Step 8: Apply LLM Proposal. For each non-empty proposal item: apply delta_intercept to the corresponding relation's intercept in prior_relations, and apply penalty_multiplier to the corresponding constraint's penalty weight. Log all changes in prior_update_history (method llm_proposal) and penalty_weight_update_history. If the Belief update magnitude exceeds the freeze threshold, skip Preference updates for this round.

Step 9: Derive Target Range. Using the updated transfer model (slope, intercept), compute the offline uplift range that maps to an acceptable online outcome for each constrained metric. This calibrated target_range replaces the static constraint boundaries in the search objective.

Step 10: Derive Penalty Weight. Compute the violation pressure for each constraint based on the most recent episode's results (Eq. 16), apply the asymmetric multiplicative update (Eqs. 17 and 18), and persist the new penalty weights.

Final: Search and Publish. Execute the Optuna TPE search (5,000 trials, 25 workers) with the calibrated target ranges and penalty weights. Extract the best parameter vector, publish it to Redis, and write a handoff state file (one-shot-latest.env) recording the run tag, config path, and best parameters for the next round.
3.3 YOLO Continuous Loop

The YOLO ("You Only Live Once") mode chains one-shot pipeline executions into an infinite autonomous loop with three phases.

Phase 1: Coldstart. On first launch, bootstrap initial parameters to Redis using default or user-specified values. Anchor the observation time window to the current timestamp, aligned to a 15-minute boundary. No A/B data exists yet; skip LLM diagnostics and proceed directly to offline search with default priors.

Phase 2: Initial. Execute the first full one-shot pipeline with optional manual confirmation. This round establishes the baseline episode in the Memory DB and the first A/B observation pair.

Phase 3: YOLO. Enter the infinite loop:

1. Wait until the minimum data accumulation window (3.5 hours) has elapsed since the last parameter push.
2. Pull the A/B report for the elapsed window.
3. Execute the full one-shot pipeline (Steps 1–10 + Search + Publish).
4. Update yolo-state.env with the new round number, push timestamp, observation window, previous parameters, and A/B report path.
5. Append a complete round record to yolo-event-log.md (round number, timestamp, parameters, A/B results, LLM proposal, errors).
6. Return to step 1.

Failure Handling. The YOLO loop distinguishes two failure categories:

• A/B data fetch failures (transient): retry up to 5 times at 5-minute intervals. Detection criterion: the A/B report file is missing or contains fewer than 3 lines.

• Non-A/B failures (search crash, Redis error, DB corruption): exit immediately without retry. The operator investigates and restarts manually.

State Recovery. On restart after a crash or manual stop, the YOLO loop reads yolo-state.env to recover: the last completed round number, the timestamp of the last parameter push, the observation window boundaries, and the previous parameter vector.
The loop resumes from the next round, reusing the observation window endpoint as the new window start. A forced reset (--reset flag) clears the state file and restarts from Phase 1.

Figure 8 | YOLO Continuous Loop: State machine diagram with three states — Coldstart (bootstrap params, anchor time window), Initial (first full pipeline with manual confirmation), YOLO (infinite loop: wait 3.5 h, pull A/B, pipeline, publish, update state, loop). Retry logic shown as a self-loop on the A/B pull step (up to 5 retries). Exit arrow from non-A/B failures. Recovery arrow from restart reading yolo-state.env back into the YOLO state.

As demonstrated in Sections 4.3 and 4.8, the YOLO loop ran 7 consecutive rounds (25 hours) in Exp-401838 and accumulated 23 rounds across two search phases in Country B, validating the operational robustness of this design.

4. Evaluation

In this section, we present experimental and deployment evidence from 30 optimization rounds across two markets (Country A and Country B). We evaluate Sortify along six axes: online business effectiveness (Section 4.3), offline-online transfer calibration quality (Section 4.4), parameter stability (Section 4.5), LLM meta-controller effectiveness (Section 4.6), sensitivity analysis (Section 4.7), and Country B deployment from cold start to production rollout (Section 4.8). We conclude with an efficiency analysis (Section 4.9).

4.1 Metrics

We evaluate Sortify using three categories of metrics:

Offline metrics (Influence Share based):

• I_gmv: Influence share of the GMV factor in sorting decisions. The primary optimization target.
• I_order: Influence share of the order factor. Constrained to prevent excessive order degradation.
• I_ecpm_term: Influence share of the advertising eCPM factor. Constrained to protect ad revenue.
• I_gmv_ads: Influence share of GMV within the advertising channel.
• ads_top10_rate: Fraction of ads appearing in the top-10 positions. Constrained to maintain ad exposure.

Online metrics (A/B test):

• GMV: Gross merchandise value uplift vs. control group. Primary business KPI.
• Orders: Order volume uplift. Secondary business KPI.
• Ads Revenue: Total advertising revenue uplift.
• Advertiser Value: Composite advertiser ROI metric.
• Organic Clicks: Click-through volume on organic (non-ad) listings.
• Ads CTR: Click-through rate on ad placements.

Derived evaluation metrics:

• Transfer rate: Ratio of online uplift to offline uplift (e.g., online GMV% / offline I_gmv%), measuring calibration quality.
• Oscillation ratio: Ratio of maximum to minimum value of ps_ads_wo (the most volatile parameter) across rounds, measuring parameter stability.
• LLM correction count: Number of non-zero adjustment items per LLM proposal, measuring calibration effort convergence.

4.2 Experiment Setup

Table 3 | Experiment overview.

               Exp-401838                           Exp-437160
Market         Country A PDP                        Country B PDP
Rounds         7 (warm-start)                       23 (V1: 11 + V3: 12)
Duration       25 hours                             V1: 03-05–07; V3: 03-10–13; freeze 03-14; 7d AB 03-15–22; rollout 03-24
Start cond.    Warm-start from existing             Cold-start, default priors
               Memory DB state
Obs. window    3.5 h/round                          6 h/round (V3); 7d static AB post-freeze
Proc. time     25–50 min/round                      25–50 min/round
Parameters     7 (ads_wo/wg, org_wo/wg,             7 (org_wc/wg/wo, ads_wc/wg/wo,
               porg_w, price_pow, w2)               biz_price_pow)

Exp-401838 is a Country A warm-start experiment that retained historical calibration state.
Its core purpose is not to re-demonstrate what happens during cold start, but to test whether a system that has already accumulated transfer models and penalty weights can continue advancing in a better direction within production, and converge LLM calibration effort within a limited number of rounds.

Exp-437160 validates the system's end-to-end deployment capability in a previously unseen market. The experiment progressed through two search phases — V1 (11 rounds, GMV win rate 30%) exposed cold-start limitations in a new market, while V3 (12 rounds) identified a Pareto-optimal parameter configuration at R7. This configuration was frozen on 03-14 and validated through a 7-day A/B test before being promoted to production rollout on 03-24.

The two experiments serve different lifecycle roles. Exp-401838 demonstrates the inner-loop optimization mechanism — search, deploy, observe, calibrate — in continuous warm-start production operation. Exp-437160 extends this to the full deployment lifecycle: it progressed through exploration, exploitation, parameter freeze, long-horizon validation, and production rollout, providing evidence that the system can produce deployable policies, not merely promising round-level signals.

4.3 Online A/B Results

4.3.1 Exp-401838: Country A PDP Warm-Start (7 Rounds)

Table 4 | Exp-401838 online A/B results — round-by-round GMV and Orders progression.

Round   GMV      Orders    Ads Revenue   Advertiser Value
R2      −3.6%    0.0%      +0.7%         +7.9%
R3      −1.5%    +1.7%     −0.8%         −1.0%
R4      +0.2%    +1.7%     −8.7%         −8.1%
R5      +1.9%    +3.8%     −5.8%         −5.6%
R6      +0.9%    +2.7%     −9.4%         −10.8%
R7      +9.2%    +12.5%    +5.7%         −8.9%

Takeaway: Exp-401838's GMV trajectory progresses generally upward across 7 rounds, from −3.6% at R2 to +9.2% at R7, with R4 through R7 being 4 consecutive positive rounds. Orders turned positive from R3 onward, peaking at +12.5%.
This indicates that under warm-start conditions the system can extend existing calibration state into subsequent rounds rather than starting from scratch each round.

Caveat. R7 metrics were collected during a low-traffic early-morning window (~12K impressions), which introduces higher statistical uncertainty than the typical observation window. The +9.2% GMV and +12.5% Orders results require validation in a high-traffic period.

Persistent advertiser value pressure. Advertiser Value was positive only in R2 (+7.9%) and turned negative in all subsequent rounds, indicating that even under warm-start conditions, optimizing for GMV and Orders can systematically compress advertiser ROI — a limitation discussed in Section 6.

4.4 Offline-Online Transfer Analysis

A central claim of Sortify is that even under warm-start conditions, the offline-online mapping still requires continuous calibration. Exp-401838 itself provides two representative paired observations.

Takeaway: Within the same warm-start experiment, the offline-online mapping changes noticeably across rounds. R2 shows that "positive offline" does not guarantee positive online; R7 shows that after multiple rounds of calibration, offline gains finally begin to translate more reliably into online GMV. This phenomenon supports the necessity of the Belief channel: the transfer relationship is not a one-time setting but requires continuous updating with new observations.

Figure 9 | Online KPI Trends: Single-panel line chart showing Exp-401838 GMV and Orders uplift across 7 rounds. Dashed horizontal line at 0% for reference. Highlights the R4–R7 consecutive positive region and the R7 peak results.

Table 5 | Representative offline-online gap in Exp-401838.
Round   I_gmv Uplift   Online GMV   Observation
R2      +18.2%         −3.6%        Significantly optimistic offline; positive offline uplift did not translate to positive GMV
R7      +41.6%         +9.2%        Optimistic bias remains, but after calibration the offline gains begin to translate reliably into positive online GMV

Looking further at the metrics in Section 4.3, GMV, Orders, Ads Revenue, and Advertiser Value do not move in sync. For example, R7 simultaneously shows GMV +9.2%, Orders +12.5%, and Ads Revenue +5.7%, but Advertiser Value remains at −8.9%. This indicates that different business metrics should not share a single global calibration factor but should each maintain their own transfer relationship.

Figure 10 | Offline-Online Transfer Gap Visualization: Scatter plot with Exp-401838's per-round offline I_gmv uplift on the x-axis and corresponding online GMV on the y-axis. Identity line (y = x) shown for reference. Points fall well below the identity line, showing systematic optimistic bias; later rounds are closer to the positive quadrant.

4.5 Parameter Evolution and Stability

4.5.1 Residual Pendulum Effect in the Warm-Start Experiment

The most volatile parameter across all experiments is ps_ads_wo (advertising order weight), which controls the trade-off between ad visibility and organic experience. This parameter exhibits a characteristic oscillation pattern we term the "pendulum effect."

Takeaway: Even under warm-start conditions, ps_ads_wo remains the most volatile parameter, but its oscillation is contained within a 4.5x range, and the second cycle is narrower than the first. This indicates that the system has not yet converged to a single stable operating point, but has transitioned from unbounded exploration to constrained residual oscillation.
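The oscillation ratio defined in Section 4.1 is a direct max/min computation over a parameter trace; a minimal sketch, where the function name is ours and the trace is a hypothetical per-round series whose extremes match the reported Exp-401838 ps_ads_wo range of 1.96–8.91:

```python
def oscillation_ratio(trace):
    """Ratio of max to min parameter value across rounds (Section 4.1)."""
    return max(trace) / min(trace)

# Hypothetical ps_ads_wo trace; only the extremes (8.91 and 1.96) are
# from the report, the intermediate values are invented for illustration.
trace = [3.2, 8.91, 2.4, 1.96, 6.1, 2.9, 4.0]
print(round(oscillation_ratio(trace), 1))  # → 4.5
```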
Table 6 | Exp-401838 parameter oscillation statistics.

Statistic                          Exp-401838
ps_ads_wo range                    1.96–8.91
Oscillation ratio (max/min)        4.5x
Complete swing cycles              2
Second cycle narrower than first   Yes

Root cause analysis. The pendulum effect arises from a fundamental trade-off in the sorting formula: increasing ad visibility (high ps_ads_wo) improves ad impressions and potential ad revenue but reduces organic click exposure and can depress organic GMV. When the search engine finds a high-ad-weight solution that maximizes I_gmv in one round, the online results reveal organic degradation, causing the next round's calibrated constraints to penalize ad weight — swinging the parameter to the opposite extreme. This oscillation reflects a genuine multi-objective tension that the single-objective-with-penalties formulation can only approximate, not resolve.

Figure 11 | Parameter Evolution Heatmap: Heatmap with 7 parameters (rows) × 7 rounds (columns) from Exp-401838. Color scale from low to high values. The ps_ads_wo row shows the most pronounced back-and-forth oscillation, but the later half has a narrower swing range.

4.6 LLM Calibration Effectiveness

4.6.1 Correction Convergence

Takeaway: In Exp-401838, LLM calibration effort converges from 5 correction items in Round 2 to 2 items by Round 7, with late-stage magnitudes approaching zero. This indicates that under warm-start conditions, the LLM's role is closer to "filling in remaining calibration residuals" than continuously rewriting the entire transfer framework.

Table 7 | Exp-401838 LLM correction convergence.

Phase               Exp-401838
Early (R2)          5 items, full recalibration
Middle (R3–R6)      2–3 items, decreasing magnitudes
Late (R7)           2 items, smallest magnitudes (−0.005)
Convergence trend   Clear convergence: 5 → 2 items, magnitude → ~0

Figure 12 | LLM Correction Convergence: Two-panel visualization. Top panel: bar chart showing the number of non-zero LLM correction items per round for Exp-401838, descending from 5 at R2 to 2 at R7. Bottom panel: line chart showing maximum correction magnitude per round, with an overall declining trend.

4.7 Sensitivity Analysis

This section discusses the rationale behind key system hyperparameter choices and their impact on performance.

LMS learning rate (η = 0.2). The learning rate controls the Belief channel's responsiveness to single observations. η = 0.2 means each round corrects only 20% of the prediction residual, requiring approximately ⌈log(0.05)/log(0.8)⌉ = 14 rounds to converge to 95% accuracy from an arbitrary initial bias. A higher η would accelerate convergence but increase sensitivity to noisy observations — in low-traffic windows (such as Exp-401838 R7 with ~12K impressions), single-round observation variance is large, and an overly aggressive learning rate would cause slope/intercept to oscillate unnecessarily under noise. η = 0.2 is a classic adaptive filter setting that achieves a reasonable balance between convergence speed and noise robustness, and the LLM's selective jump mechanism compensates for LMS's slow response to structural breaks.

Penalty asymmetry (δ_up = 0.25 vs. δ_down = 0.08).
The tightening step is approximately 3x the relaxation step, encoding the industrial reality that "the cost of a constraint violation far exceeds the opportunity cost of an overly tight constraint." If both were set symmetrically (e.g., both 0.15), penalty weights would decay too quickly after constraint satisfaction, increasing the probability of re-violation in subsequent rounds — in A/B testing, each constraint violation (e.g., an ad revenue collapse) can trigger manual intervention or even experiment termination, far exceeding the cost of occasionally conservative GMV optimization.

LLM safety bounds (|Δβ| ≤ 0.1, m ∈ [0.5, 2.0]). Safety bounds ensure that even if the LLM's judgment is completely wrong, the single-round deviation remains controllable. Taking the intercept as an example, the maximum single-round shift of ±0.1 across 6 relations implies a maximum single-round displacement of the target_range of approximately 10 percentage points — large enough to cover the typical few-percentage-point systematic bias seen in practice, but not so large as to cause catastrophic distortion of the search framework on a misjudgment. The penalty multiplier range [0.5, 2.0] allows halving or doubling constraint enforcement strength in a single round, sufficient to respond to sudden constraint violation patterns.

Necessity of the freeze mechanism. When the Belief channel undergoes a large update, if the Preference channel is simultaneously allowed to make large penalty weight adjustments, the next round's online results become difficult to attribute — it is hard to determine whether the GMV change came from the mapping correction or the constraint strength change. The purpose of freezing for one round is to isolate the impact of a significant calibration event, providing clearer signals for subsequent diagnosis.

Penalty weight clamp range [1,000, 80,000] rationale.
The lower bound of 1,000 accounts for the magnitude of the positive reward in the objective function: typical positive scores (10 × I_gmv relative uplift) range from 0.5 to 2.0 (corresponding to I_gmv uplift of 5%–20%); if penalty weights fall below 1,000, even a 1% constraint violation (penalty = 1,000 × 0.01² = 0.1) would fail to effectively constrain the positive score. The upper bound of 80,000 preserves the search engine's exploration capability: when penalty weights are too high (e.g., 80,000 × 0.01² = 8.0), even minor violations produce penalties far exceeding the positive score, forcing the search engine into an extremely conservative region and potentially missing parameter configurations that are overall superior despite minor constraint brushes.

4.8 Country B Deployment: From Cold Start to Rollout

Exp-437160 is the only experiment in this study that progressed through the complete deployment lifecycle — from cold-start parameter search to production rollout. This section presents the full evidence chain.

4.8.1 Cold Start Signal

To validate that Sortify generalizes beyond a single market, we deployed the system on Country B's product detail page with a cold start (no inherited Memory DB). The first round of search produced the following results:

Table 8 | Exp-437160 cold-start results (Country B, V1 Round 1).

Metric               Value                  Constraint / Target   Status
Offline objective    0.493                  Maximize              Achieved
I_gmv uplift         +4.9%                  ≥ +4%                 Met
I_ecpm_term change   −4.5%                  ≥ −5%                 Near boundary
ads_top10_rate       —                      Constrained           Met
ps_org_wc change     +271.2% (1.0 → 3.71)   —                     Largest adjustment
ratio_ads change     −32.0%                 —                     Ad share compressed

The system correctly identified that Country B's baseline sorting formula severely underweighted organic click signals (ps_org_wc jumped from 1.0 to 3.71, a +271% adjustment — the largest single-parameter change across all experiments). This market-specific insight emerged from the first round of search without any prior calibration, demonstrating that the Influence Share framework generalizes to new markets. However, this first-round directional signal was far from a deployable policy — subsequent search phases were needed to find a parameter configuration that could survive long-horizon validation.

4.8.2 Search Journey: From Direction to Deployable Parameters

The Country B experiment progressed through two search phases before arriving at a deployable configuration:

V1 (11 rounds, 03-05 to 03-07). The initial cold-start search phase achieved a GMV win rate of only 30% (mean GMV uplift −2.4%). While the system successfully identified the direction — organic click signals were underweighted — the parameters did not converge. All 7 parameters oscillated widely across the search space throughout the 11 rounds. V1's core lesson was that directional correctness does not imply parameter deployability.

V3 (12 rounds, 03-10 to 03-13). After narrowing the search space and adjusting constraint configurations based on V1 observations, the system was restarted for a second search phase. Most of the 12 rounds still exhibited the ads-organic seesaw effect — Ads Impressions and Organic Clicks alternated in strict opposition — a structural pattern consistent with the pendulum effect observed in Country A (Section 4.5.1).

R7 parameter freeze (03-14). Within V3's 12 rounds, the R8 observation window (measuring R7 parameters) produced the only round where GMV and Ads Revenue were simultaneously positive: GMV +2.6%, Ads Revenue +6.1%.
Based on this Pareto-optimal outcome, the R7 parameters were frozen for extended validation.

4.8.3 Winning Parameter Structure

The frozen R7 parameters reveal a structurally distinct solution compared to the Country A experiments:

Table 9 | Exp-437160 frozen parameters (R7, Country B).

Parameter          Value    Interpretation
ps_org_wc          7.667    Organic Click weight (dominant)
ps_ads_wc          4.381    Ads Click weight (high)
ps_org_wg          0.568    Organic GMV weight (suppressed)
ps_org_wo          0.805    Organic Order weight (suppressed)
ps_ads_wg          0.641    Ads GMV weight (suppressed)
ps_ads_wo          0.957    Ads Order weight (moderate)
ps_biz_price_pow   1.047    Price power (near-neutral)

This configuration exhibits a "high Click, low GMV/Order" structure — the sorting formula is primarily driven by click signals, with transaction-based signals deliberately suppressed. This structure reduces the seesaw effect between organic ranking and ads ranking: when the sorting formula does not excessively amplify GMV/Order signals, organic items do not crowd out ad placements simply because they "sell more," enabling both GMV and Ads Revenue to improve simultaneously.

This contrasts with the residual pendulum effect observed in the Country A warm-start experiment (Section 4.5.1), where the key trade-off still centered on the ads order weight. The Country B R7 parameters circumvented this conflict by shifting the primary ranking signal to Clicks — click signals are far less zero-sum between the organic and advertising channels than transaction signals.

4.8.4 Long-Horizon A/B Validation

The frozen R7 parameters were validated through a 7-day A/B test (03-15 to 03-22; rollout Snapshot #38047, 10% treatment vs. 20% control).

Table 10 | 7-day A/B core metrics (YMAL scene, yin_exchange vs. Control).
Metric        Diff      Note
GMV/UU        +4.15%    Primary KPI
GMV           +4.10%    Consistent with GMV/UU
GMV/Order     +3.97%    Average order value drives GMV
Order/UU      +0.17%    Near-neutral
Ads Revenue   +3.58%    Ad revenue co-positive
CPC           +4.19%    Higher per-click value
Take Rate     +3.01%    Monetization improved
Click/UU      +1.21%    User engagement up
Ads Load      −2.64%    Fewer ad slots, higher value

The GMV improvement is driven primarily by average order value (GMV/Order +3.97%) rather than order volume (Order/UU +0.17%), indicating that the click-driven ranking surfaces higher-value items. Ads Revenue increased +3.58% despite Ads Load decreasing by 2.64%, indicating that per-ad value improved significantly (CPC +4.19%) — fewer but more valuable ad placements. The 7-day observation window is sufficient to filter out intra-day volatility, providing stronger deployment confidence than single-round observations.

4.9 Efficiency and Operational Cost

Takeaway: Sortify operates with minimal resource overhead — a single LLM API call per round, no GPU requirements, and negligible storage growth. The primary cost is the 3.5-hour data accumulation window, which is dictated by statistical requirements rather than system limitations.

Table 11 | Operational efficiency metrics.

Metric                                Value         Notes
Total cycle time                      ~4 hours      3.5 h accumulation + 25–50 min processing
Daily optimization rounds             ~6/day        4 h/round cadence
Trials per round                      5,000         25 workers; 15–30 min search
Human intervention (post-coldstart)   Zero          Within the inner optimization loop
Consecutive unattended rounds         7; 23         Exp-401838 and Exp-437160; governance-limited, not failure-limited
A/B fetch retry success               5 × 5 min     No data loss observed
Memory DB growth                      ~1 KB/round   7 tables; negligible storage
LLM API calls                         1/round       Single meta-controller inference
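The cadence and compute figures in the efficiency analysis follow from simple arithmetic; a quick sanity check, where the constants are the report's own and the variable names are ours:

```python
# Verify the cycle-time, cadence, and CPU-hour figures from the report.
proc_min = (25, 50)            # per-round processing time range (minutes)
cycle_h = tuple(3.5 + m / 60 for m in proc_min)
rounds_per_day = tuple(24 / c for c in cycle_h)

workers, search_min = 25, (15, 30)
cpu_hours = tuple(workers * m / 60 for m in search_min)

print(cycle_h)         # ≈ (3.92, 4.33) hours, i.e. the "~4 hours" cycle
print(rounds_per_day)  # ≈ (6.1, 5.5), i.e. the "~6/day" cadence
print(cpu_hours)       # (6.25, 12.5) CPU-hours of search per round
```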
The system ran 7 consecutive rounds (25 hours) in Exp-401838 and accumulated 23 rounds across two search phases in Country B; all stops and transitions in these experiments originated from governance-level decisions, not system failures.

It is important to distinguish the scope of automation. The inner loop — parameter search, online deployment, A/B observation, LLM calibration, and next-round launch — runs without human intervention. However, the outer governance loop — which round's parameters to freeze for extended validation, whether to promote to full rollout based on long-horizon A/B results, and post-rollout monitoring — remains human-driven. The Country B deployment illustrates this boundary: the inner loop autonomously executed 23 rounds across two search phases, but the decisions to freeze the R7 parameters, conduct 7-day static A/B testing, and approve production rollout were all made by human operators based on snapshot-level A/B analysis. This division reflects a deliberate design choice: the system automates the high-frequency, data-intensive optimization decisions where human latency is the bottleneck, while preserving human governance over low-frequency, high-stakes deployment decisions.

Compared to the manual optimization baseline — which typically required 1–2 human-hours per optimization round for data analysis, parameter selection, and deployment verification — Sortify reduces the marginal cost of each inner-loop round to approximately zero human-hours after initial setup, while increasing the optimization cadence from ~1 round/day (limited by human availability) to ~6 rounds/day.
This speedup not only means more parameter exploration rounds; more critically, it enables the system to respond to intra-day traffic pattern changes — for example, morning peak, midday trough, and evening peak traffic structures may affect optimal parameter configurations, and the 6-round/day cadence ensures at least one calibration update per major traffic period.

Cost structure analysis. Sortify's runtime cost has three components: (1) Compute — the Optuna search uses 25 parallel workers on a single multi-core machine, consuming roughly 6.25–12.5 CPU-hours per round (25 cores × 15–30 min); (2) LLM calls — each round makes one API call with 4K–8K input tokens and a structured JSON response. At OpenAI's public prices as of March 26, 2026 (GPT-5.4: $2.50/1M input, $15.00/1M output; GPT-5.3-Codex: $1.75/1M input, $14.00/1M output), the per-round cost with high reasoning effort is approximately $0.03–$0.10, since billed output includes reasoning tokens beyond the visible JSON. At 6 rounds/day this amounts to roughly $0.18–$0.60/day in LLM spend. If Codex CLI is used via ChatGPT sign-in rather than an API key, the cost is absorbed by plan credits rather than per-call billing; (3) Data accumulation — the 3.5-hour online observation window is a fixed statistical requirement, not a systems bottleneck. Storage overhead is negligible: the Memory DB grows by ~1 KB per round and remains below 100 KB after 30 rounds.

5. Discussion: One-Person Feasibility

A dimension of Sortify that merits explicit discussion is the development process itself: the entire system — algorithm design, code implementation, experimental operation, data analysis, report writing, and visual asset creation — was produced by a single individual orchestrating a fleet of AI coding agents.
Not a single line of production code was written by a human hand; not a single sentence of this report was drafted by a human author. The human's role was exclusively that of architect and orchestrator: defining objectives, decomposing problems, reviewing outputs, and steering the agents toward coherent execution. This section examines the feasibility, mechanisms, and broader implications of this one-person paradigm.

5.1 Scope of What Was Delivered

To appreciate the claim, it is worth enumerating the concrete artifacts that constitute the Sortify project:

• Algorithm design. The dual-channel Belief/Preference decomposition, the LMS-based transfer calibration, the LLM meta-controller's structured prompting protocol, and the asymmetric penalty update rules (Section 2).
• Production codebase. The 10-step one-shot pipeline, the YOLO continuous loop with failure recovery, the 7-table Memory DB schema, the Optuna-based search engine integration, and the Redis parameter publishing interface (Section 3).
• Experimental operation. 30 optimization rounds across two country markets (7 in Country A, 23 in Country B), including 25+ hours of unattended YOLO runs and a complete cross-market deployment lifecycle (Section 4).
• Analysis and reporting. Quantitative evaluation of online A/B results, parameter stability analysis, LLM calibration convergence tracking, cross-market generalization assessment, and the complete bilingual technical report.
• Presentation materials. Programmatic video compositions, figure generation, and visual assets for communicating results.

In a conventional team setting, this scope would typically be distributed across 3–5 roles: an algorithm researcher, a software engineer, an operations specialist, a data analyst, and a technical writer.
The compression of these roles into a single individual was enabled not by extraordinary individual productivity, but by a fundamental shift in the unit of work from writing to directing.

5.2 Technical Depth and Development Complexity of Sortify

The deliverable inventory above provides a macro-level view of project scope but does not reveal the technical depth that Sortify embodies as a standalone software library. Sortify is far from a loose collection of scripts: it is a production-grade framework comprising approximately 98,000 lines of code across 78 Python modules organized into 7 major subsystems — spanning the full technical stack from low-level data loading to high-level autonomous decision-making.

Parameter search engine. One of Sortify's core capabilities is a highly configurable parameter search (hyperparameter tuning) engine. The engine constructs a dependency-aware formula computation graph: given a set of ranking formulas and their nested weight variables, the system automatically parses dependencies, builds a layered computation graph, and incrementally updates only the affected variables when weights change — avoiding approximately 99% of redundant computation. On top of this, the system integrates the Optuna multi-objective optimization framework, supporting NSGA-II Pareto frontier search, Bayesian optimization, evolutionary strategies, and genetic algorithms. The search process also incorporates multiple correlation analyzers (Pearson, Spearman, Kendall), six metric fusion methods (weighted mean, L2 norm, RMS, among others), and Top-K-based ranking quality evaluation — forming an end-to-end closed loop from data loading through formula computation and parameter optimization to result analysis.

The pairwise influence algorithm.
One of Sortify's most original technical contributions is the Pairwise Influence metric system. The algorithm addresses a critical question: in a multi-factor ranking formula, how can the actual influence of each weight factor on the final ranking outcome be quantified? Traditional sensitivity analysis measures only the marginal effect of parameter perturbation on a single metric, whereas the Pairwise Influence algorithm pushes the analysis granularity down to item-pair comparisons — sampling item pairs from the Top-K list using configurable strategies (adjacent, random, exhaustive), computing the difference vector Δx_ij across factor dimensions for each pair, then deriving each factor's relative contribution share via

    Share_f = |w_f · Δx_ij,f| / Σ_k |w_k · Δx_ij,k|,

and finally aggregating with configurable weighting schemes (exponential decay, inverse-rank decay) to yield a global influence score I_f ∈ [0, 1]. The algorithm further computes inter-factor conflict rates CR_{a,b} — the frequency with which two factors induce opposite ranking preferences on the same item pair — providing irreplaceable diagnostic information for weight tuning. This algorithm was developed from initial concept through mathematical formalization, vectorized implementation, and production integration, entirely by AI agents under human direction.

Fully autonomous operation pipeline. Beyond the search engine, Sortify implements a complete autonomous operation pipeline. Its core comprises two layers: OneShotPipeline — orchestrating offline optimization, LLM proposal generation, and online monitoring into a single automated flow; and YoloRunner — a continuously autonomous agent loop with episode-based state tracking, offline-to-online workflow bridging, automatic rollback decisions, and live Belief/Preference state management.
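The per-pair share formula in the pairwise influence algorithm above can be sketched in a few lines. This is an illustrative reimplementation from the formula, not Sortify's actual (vectorized) code, and the aggregation here uses a plain mean as a stand-in for the configurable decay schemes:

```python
def pairwise_influence(X, w, pairs):
    """Average per-factor contribution share over sampled item pairs.

    X: per-item factor values (list of lists); w: per-factor weights;
    pairs: (i, j) index pairs sampled from the Top-K list.
    Implements Share_f = |w_f * dx_f| / sum_k |w_k * dx_k| per pair,
    then aggregates with a plain mean (stand-in for the decay schemes).
    """
    n_factors = len(w)
    totals = [0.0] * n_factors
    for i, j in pairs:
        contrib = [abs(w[f] * (X[i][f] - X[j][f])) for f in range(n_factors)]
        s = sum(contrib)  # normalizer: total weighted difference magnitude
        for f in range(n_factors):
            totals[f] += contrib[f] / s
    return [t / len(pairs) for t in totals]  # each I_f in [0, 1], sums to 1

# Two-factor toy example: factor 0 carries more ranking influence.
X = [[1.0, 0.2], [0.5, 0.8], [0.9, 0.1]]
w = [2.0, 1.0]
I = pairwise_influence(X, w, [(0, 1), (1, 2)])
```

By construction the shares are non-negative and sum to 1 across factors, which is what makes the resulting I_f values comparable across rounds and markets.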
To support this autonomous operation, the system implements a 7-table SQLite-backed persistent memory system (with a DualMemory dual-backend architecture) that records each optimization round's offline results, deployment configurations, online metrics, and decision rationale, providing the LLM meta-controller with complete historical context for proposal generation. This means Sortify is not merely a callable tool library but an intelligent system capable of autonomously initiating, executing, evaluating, and iterating optimization cycles.

Development complexity. The implementation difficulty of the above capabilities should not be underestimated. In terms of technical stack diversity alone, Sortify spans five major domains: data engineering (incremental Parquet loading and streaming), numerical computation (NumPy-vectorized formula evaluation with incremental updates), optimization theory (multi-objective Pareto optimization with constrained search), systems engineering (SQLite persistence, Redis parameter publishing, Grafana real-time monitoring integration), and AI applications (LLM prompt engineering, autonomous decision-making, memory retrieval). These subsystems do not operate in isolation but are tightly coupled through dependency graphs, event streams, and state machines — a design flaw in any single module could trigger cascading failures at runtime. The entire codebase comprises approximately 98,000 lines, over 50 core classes, and more than 10 analyzer types. That a system of this scale was built from scratch and deployed to production by a single person directing AI agents is itself the strongest evidence for the feasibility of the one-person paradigm.

5.3 From ParaDance to Sortify: Lineage and Domain Expertise

The preceding section established Sortify's engineering scale: approximately 98,000 lines of code across 78 modules.
How could a single person navigate system design at this level of complexity? A crucial part of the answer is that it was not a from-scratch endeavor. Sortify's design philosophy, evaluation architecture, and optimization methodology trace directly to ParaDance, an open-source multi-objective parameter optimization library created by the same author. Understanding this lineage is essential for appreciating both the depth of domain expertise underlying Sortify and the evolutionary arc from a general-purpose tuning tool to an autonomous, agent-steered optimization system.

A 20-month iteration journey. ParaDance was initiated in May 2023 and underwent continuous development through January 2025: a 20-month arc spanning 357 commits, 40+ released versions (v0.1.1 through v0.6.7), and 6 major version epochs. The project evolved from a simple Lorenz-curve sampling utility into a full-fledged production framework comprising 3,688 lines of source code across 48 Python modules organized into 6 core subsystems: evaluation (16 independent evaluators), optimization (Optuna-based Bayesian search with multi-objective aggregation), data loading, pipeline orchestration, sampling, and visualization. Each major version marked a qualitative leap: v0.2.x introduced portfolio visualization; v0.3.x established the Bayesian optimization engine; v0.4.x built the evaluation ecosystem, including merge-sort-based weighted inversion pair computation and the JSON formula system; v0.5.x added staged optimization, multi-threaded real-time monitoring, and side-information reranking; and v0.6.x delivered checkpoint recovery and production-hardening features.

Design decisions ahead of their time.
Several architectural choices in ParaDance were unconventional for a 2023-era Python data science tool and have since been validated by broader industry trends:

• Pluggable evaluator architecture. BaseCalculator used partialmethod to mount 16 evaluators as instance methods via a single-inheritance chain, avoiding the MRO complexity of mixin-based approaches. Adding a new evaluator required only three steps: write a standalone function, register it as a partialmethod, and add a flag string, with zero modification to existing code.
• Declarative JSON formula engine. A formula chaining system with intermediate-result cascading (step1#intermediate → step2), conditional branching (if(cond, true_val, false_val)), and sandboxed evaluation enabled algorithm engineers to iterate scoring formulas via configuration changes alone, without code deployment.
• Scale-aware adaptive weight search. Automatic log10-scale alignment of search boundaries across features with vastly different magnitudes (e.g., CTR ∼ 0.01 vs. GMV ∼ 10,000), eliminating the need for manual normalization.
• Staged optimization with warmup. A two-phase search protocol where the first N rounds use a warmup formula to guide exploration, then transition to the main objective with continuous value anchoring, addressing cold-start inefficiency.
• Production-grade resilience. SQLite-backed optimization persistence, checkpoint-based recovery for interrupted long-running jobs, Joblib-based parallel optimization, and Dirichlet-constrained weight sampling on the probability simplex.

These design choices (configuration-driven pipelines, pluggable evaluation, scale-aware search, and persistent optimization state) directly informed Sortify's architecture.
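The pluggable-evaluator pattern in the first bullet can be illustrated with a small sketch. The metric names here are invented for illustration; ParaDance's actual registry mounts 16 evaluators this way:

```python
from functools import partialmethod

# Step 1 of the three-step registration: write standalone metric functions.
def mean_metric(values):
    return sum(values) / len(values)

def max_metric(values):
    return max(values)

def _evaluate(self, metric_fn):
    # Shared dispatcher: partialmethod binds metric_fn at class-definition
    # time, so each mounted evaluator behaves like an ordinary bound method.
    return metric_fn(self.values)

class BaseCalculator:
    def __init__(self, values):
        self.values = values

    # Step 2: mount each metric as an instance method via partialmethod.
    # Adding a new evaluator touches none of the existing code paths.
    calculate_mean = partialmethod(_evaluate, mean_metric)
    calculate_max = partialmethod(_evaluate, max_metric)

calc = BaseCalculator([1.0, 2.0, 3.0])
mean_value = calc.calculate_mean()   # 2.0
max_value = calc.calculate_max()     # 3.0
```

Because everything hangs off a single class in a single-inheritance chain, there is no method-resolution-order ambiguity of the kind mixin stacks introduce.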
The 7-table Memory DB in Sortify is a conceptual descendant of ParaDance's SQLite persistence; the dual-channel Belief/Preference decomposition extends ParaDance's multi-objective aggregation into the online calibration domain; and the LLM meta-controller occupies the role that ParaDance's staged optimization and warmup formulas played at a more primitive level.

Industry adoption. As of the time of writing, ParaDance (https://github.com/yinsn/ParaDance) has accumulated over 101,000 downloads on PyPI and is used by algorithm engineers at major internet companies across the industry. It has become one of the most widely adopted hyperparameter tuning tools in production recommendation systems, validating both the problem formulation and the architectural decisions made during its 20-month development. This adoption base also provided the author with continuous feedback from diverse production environments, feedback that directly shaped Sortify's design priorities, particularly the emphasis on transfer calibration, constraint sensitivity management, and autonomous operation with minimal human intervention. In other words, what ParaDance bequeathed to Sortify was not merely reusable code assets but domain judgment forged through 20 months of production refinement: precisely the judgment that enabled the human orchestrator to make sound architectural decisions during agent-driven development, rather than expending limited iteration budget on trial and error.

5.4 The Orchestration Model

The operational pattern that emerged was consistent across all project phases. The human operator would:

1. Define the intent: specify what needed to be built, fixed, or analyzed, articulated as a natural-language objective with acceptance criteria.
2.
Decompose into agent tasks: break the objective into discrete, independently executable units suitable for parallel agent dispatch.
3. Review and integrate: inspect agent outputs for correctness, coherence, and alignment with the overall design, then merge approved artifacts into the project.
4. Iterate on failures: when agent outputs missed the mark, diagnose whether the failure was in task specification (human error) or execution (agent error), and adjust accordingly.

This loop mirrors the LLM meta-controller pattern within Sortify itself: a higher-level intelligence (the human) does not perform the low-level optimization (code writing, text drafting) but instead adjusts framework-level parameters (task specifications, design constraints) that guide the agents' execution. The parallel is not coincidental: the same principle of steering over doing that makes Sortify's autonomous optimization viable also makes the one-person development model viable.

5.5 Meta-Nesting: AI Building AI

The Sortify project exhibits a meta-structural nesting that merits explicit discussion. What makes the project distinctive is not only what it is but how it was created: the two layers form a self-referential recursive loop.

Layer 1: The domain agent. At the business-logic level, Sortify itself is an autonomous agent designed to replace the systematic decision-making work traditionally performed by algorithm engineers in ranking optimization. Historically, such engineers had to continuously monitor discrepancies between offline metrics and online dashboards, manually adjust penalty terms and fusion formulas, and attempt to bridge epistemic bias and axiological misalignment.
Sortify, grounded in SEU theory with its Belief/Preference dual channels, autonomously observes 20 rounds of historical evidence, autonomously diagnoses structural breaks, and autonomously executes influence rebalancing, achieving a structural substitution of human cognitive labor in a specific domain (recommendation system tuning).

Layer 2: The builder agent fleet. At the development level, the entire system was built from scratch by a single individual directing a fleet of AI coding agents. This agent fleet collectively assumed the roles traditionally distributed across architects, full-stack engineers, data engineers, QA testers, and even academic writers and figure designers. They jointly delivered: the Optuna-based parameter search framework with dependency-aware computation graphs and end-to-end pipeline orchestration; the 7-table persistent Memory DB; and the complete bilingual technical report conforming to academic standards.

The recursive loop. Superimposing these two layers reveals a recursive structure: one set of AI agents (the builder fleet) constructed another AI agent (Sortify), which in turn autonomously automates highly complex business logic. In this process, the human role elevates from "craftsperson writing code" to "chief orchestrator defining rules, supplying high-dimensional intuition, and issuing acceptance criteria." This "AI building AI" bootstrapping process is not merely a philosophically interesting isomorphism; it constitutes direct evidence that multi-agent collaboration is viable for system engineering tasks demanding both deep algorithmic logic and academic rigor.

5.6 Implications

Paradigm restructuring, not merely efficiency gains.
When an entire project's algorithm design, code implementation, experimental execution, and bilingual LaTeX report writing with figures are all completed by a single person orchestrating agents, with zero human-authored code and zero human-drafted text, this has moved decisively beyond the Copilot-era paradigm of "code completion" or "productivity tools." It marks a restructuring of the paradigm itself: in traditional software engineering, the gap between "having a good idea" and "shipping a system" spans an enormous implementation chasm. Even after conceiving a core insight like "decouple belief and preference via SEU," months of infrastructure construction, data cleaning, and debugging would be required to validate its value. Today, implementation cost approaches zero. The ceiling of a project is no longer set by a developer's familiarity with framework APIs or a team's headcount, but by the decision-maker's depth of understanding of business pain points, architectural taste for complex systems, and orchestration artistry in directing agent collaboration.

Dissolving barriers to complex system building. The traditional barrier to building production-grade systems like Sortify is not primarily intellectual but operational: the coordination cost of multi-person teams, the communication overhead of translating design intent into implemented code, and the iteration latency between specifying a change and seeing its effect. When AI agents absorb the implementation burden, the binding constraint shifts from team size to the quality of a single individual's problem decomposition and design judgment. This implies that a domain expert who deeply understands a problem but lacks a full engineering team can now build and deploy systems of equivalent complexity: a qualitative expansion of what is achievable by a single practitioner.

Cognitive leverage versus cognitive replacement.
A critical nuance is that the human contribution was not eliminated but concentrated. The design decisions (why dual-channel rather than single-channel, why framework-level control rather than direct parameter manipulation, why persistent memory rather than stateless rounds) required domain understanding, architectural taste, and judgment that the agents could not autonomously generate. What the agents eliminated was the mechanical effort of translating those decisions into running code, tested pipelines, and formatted prose. The ratio of human cognitive contribution to total project output was thus extremely high per unit of human effort, but the human effort itself was irreducible for the project to succeed. This is cognitive leverage, not cognitive replacement.

Reproducibility and verifiability. Every artifact in this project carries a complete provenance trail: the agent conversations that produced each code module, the review decisions that accepted or rejected intermediate outputs, and the iteration history that refined the final result. This level of traceability is, paradoxically, more comprehensive than what most multi-person team workflows produce, where design rationale is often lost between Slack threads, whiteboard sessions, and undocumented code reviews. The one-person orchestration model, by routing all decisions through a single point of accountability with AI-assisted record-keeping, may produce more auditable development histories than conventional processes.

Scalability of the model. The one-person feasibility demonstrated here is not an isolated curiosity but an early instance of a pattern that will likely become routine as AI coding agents mature.
The specific combination that enabled it (a well-defined problem domain, modular system architecture, and agents capable of producing correct code from natural-language specifications) is not unique to Sortify. Any project that can be decomposed into well-scoped, independently testable components is amenable to this development model. The limiting factor is not the tools but the human's ability to hold the full system design in mind and specify tasks with sufficient precision.

The fact that this report, including this very paragraph, was authored by an AI agent under human direction is itself a demonstration of the thesis it describes. The boundary between human creativity and machine execution has not disappeared, but it has moved: the human provides the what and why; the agents provide the how. For Sortify, this division of labor was sufficient to deliver a production system, deploy it across two markets, and produce the comprehensive evaluation and documentation presented in this report. Future technology builders will no longer be mired in the implementation swamp but will operate as pure thinkers and commanders; this is not an optimization of efficiency but a redefinition of the act of creation itself.

6. Conclusion, Limitations, and Future Directions

We have presented Sortify, an agent-steered closed-loop system that reframes ranking optimization as continuous influence exchange, in which an LLM agent steers the rebalancing of ranking influence among competing factors. Sortify addresses three structural limitations of traditional manual optimization (the offline-online transfer gap, entangled diagnostic signals, and stateless round-by-round operation) through a unified closed-loop architecture.
The dual-channel adaptation framework decouples transfer mapping correction (Belief) from constraint penalty adjustment (Preference) into orthogonal dimensions; the LLM meta-controller operates on framework-level parameters, providing evidence-based corrections to LMS calibration residuals; and the persistent Memory DB accumulates calibration intelligence across rounds, supporting warm-start and cross-market transfer. Deployed on Country A and Country B recommendation platforms with the inner optimization loop operating autonomously in 4-hour cycles, the system produces two levels of evidence: peak round-level performance in the Country A warm-start experiment (GMV +9.2%, Orders +12.5%, with LLM correction items converging from 5 to 2 within 7 rounds), and deployment-validated performance in Country B (GMV/UU +4.15%, Ads Revenue +3.58% in 7-day A/B validation, promoted to production rollout).

Despite these results, Sortify exhibits several concrete limitations that constrain its current applicability:

• Parameter oscillation not fully converged. In the Country A warm-start experiment retained in the main text, ps_ads_wo still exhibits a 4.5x residual pendulum effect (Section 4.5). The multi-modal objective landscape created by the ads-organic trade-off is a fundamental limitation that the single-objective-with-penalties formulation can approximate but not resolve. This oscillation indicates that the system has transitioned from unbounded exploration to constrained oscillation, but has not yet converged to a single stable operating point.
• Statistical uncertainty in low-traffic windows. The strongest results (GMV +9.2%, Orders +12.5% in Exp-401838 R7) were observed during a low-traffic early-morning window with approximately 12K impressions (Section 4.3).
While the system operated correctly, the statistical power of these estimates is lower than for rounds with typical traffic volume.
• Insufficient empirical evidence for extreme structural breaks. The retained Country A experiments primarily demonstrate convergence in the number of LLM correction items, but do not cover a sufficiently long or severe post-warm-start lifecycle to validate handling of major structural breaks (Section 4.6). This means the evidence presented for the LLM meta-controller leans more toward "reducing ongoing calibration workload" than "handling extreme mutations."

These limitations point to two directions for future work. First, replacing the single-objective-with-penalties formulation with a multi-objective Pareto frontier approach would allow the system to explicitly model the ads-organic trade-off as a frontier rather than a single optimum, potentially eliminating the pendulum effect. Second, developing a confidence-weighted transfer model that adjusts calibration aggressiveness based on observation volume would allow the system to be more conservative during low-traffic windows and more aggressive during high-traffic periods, reducing statistical risk without sacrificing convergence speed. These extensions build naturally on Sortify's existing dual-channel and Memory DB architecture, requiring modifications to the objective formulation and context assembly rather than fundamental architectural changes.

Finally, the Sortify project itself, from the foundational algorithm design and approximately 98,000 lines of engineering implementation, through 30 rigorous production experiment rounds, to the very drafting of this report, was executed entirely by a single human orchestrating a swarm of AI agents, untouched by a single line of human code or draft text. This fact stands as a silent yet deafening experiment in its own right.
As explored in Section 5, it unveils an "AI-generating-AI" meta-nesting logic: an AI fleet in the creation domain birthed an industrial-grade AI hub (Sortify) in the target domain, which in turn autonomously governs highly complex business operations. This transcends mere efficiency optimization; it is a structural paradigm shift in software engineering and human exploration. When the long, arduous chasm between conceptualization and realization is fundamentally obliterated, a brutal yet fascinating new era emerges: the ceiling of creation is no longer dictated by the sheer scale of execution capacity, but solely by the cognitive penetration and orchestration artistry of the solitary decision-maker.

References

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631, 2019. doi: 10.1145/3292500.3330701.

James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 2nd edition, 1985.

Miroslav Dudik, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.

Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015. doi: 10.1609/aaai.v29i1.9354.

Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline A/B testing for recommender systems.
In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 198–206, 2018. doi: 10.1145/3159652.3159687.

Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated Machine Learning: Methods, Systems, Challenges. Springer, 2019.

Xiao Lin, Hongjie Chen, Changhua Pei, Fei Sun, Xuanji Xiao, Hanxiao Sun, Yongfeng Zhang, Wenwu Ou, and Peng Jiang. A Pareto-efficient algorithm for multiple objective optimization in e-commerce recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 20–28, 2019. doi: 10.1145/3298689.3346998.

Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.

German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.

Shubhra Kanti Karmaker Santu, Parikshit Sondhi, and ChengXiang Zhai. On application of learning to rank for e-commerce search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 475–484, 2017. doi: 10.1145/3077136.3080838.

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 2024. doi: 10.1007/s11704-024-40231-1.

A.
Contributors

Yin Cheng (yin.cheng@shopee.com) and his AI agents were responsible for all core work on this project:

• Project Development: system architecture design and full-stack engineering implementation
• Algorithm Design: conception and iteration of the Pairwise Influence ranking algorithm
• Experiment Execution: deployment, monitoring, and data analysis of A/B experiments
• Report Writing: ideation, writing, and revision of this technical report

Acknowledgments

Special thanks to Liao Zhou (liao.zhou@shopee.com) for his deep involvement throughout the project: providing sustained discussion, feedback, and experience transfer on experimentation methodology and development-environment evolution, which played an important role in the iterative refinement of our approach. We also thank the following colleagues for their support: xiyu.liang@shopee.com, kailun.zheng@shopee.com, dihao.luo@shopee.com, tewei.lee@shopee.com, zhangweiwei@shopee.com, mark.cai@shopee.com, jian.dong@shopee.com, andy.zhanggx@shopee.com.

B. Implementation Details

B.1 A/B Experiment Integration

Sortify integrates with the platform's existing A/B testing infrastructure rather than deploying its own experimentation framework. Each parameter push to Redis triggers a traffic split between the control group (current production parameters) and the treatment group (Sortify's optimized parameters). The A/B report is generated by the platform's standard metrics pipeline and pulled by Sortify's one-shot pipeline in Step 1.

Time window alignment. All observation windows are snapped to 15-minute boundaries to align with the platform's metrics aggregation cadence. The minimum accumulation window of 3.5 hours ensures sufficient traffic volume for stable uplift estimates in most time-of-day conditions. The maximum window is capped at 3.5 hours to prevent stale data from earlier traffic patterns contaminating later observations.
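The 15-minute snapping rule can be sketched as a small helper; this is an illustrative implementation, not Sortify's actual code, with window endpoints floored (start) or ceiled (end) onto the aggregation grid:

```python
from datetime import datetime, timedelta

def snap_to_quarter_hour(ts, round_up=False):
    """Snap a timestamp to a 15-minute aggregation grid (:00/:15/:30/:45).

    round_up=False floors (for a window start); round_up=True ceils
    (for a window end), so the snapped window covers the raw one.
    """
    base = ts.replace(minute=0, second=0, microsecond=0)
    offset_s = (ts - base).total_seconds()
    quarters = int(offset_s // 900)          # 900 s = 15 minutes
    if round_up and offset_s % 900 != 0:
        quarters += 1                        # ceil to the next boundary
    return base + timedelta(minutes=15 * quarters)

# Align a raw observation window to the metrics cadence.
window_from = snap_to_quarter_hour(datetime(2026, 3, 5, 15, 7))                # 15:00
window_to = snap_to_quarter_hour(datetime(2026, 3, 5, 18, 41), round_up=True)  # 18:45
```

Snapping outward (floor the start, ceil the end) guarantees the observation window never clips a partially aggregated 15-minute bucket.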
B.2 Redis Parameter Publishing

The 7-parameter vector is serialized as JSON and written to a Redis key monitored by the sorting service. The publication is atomic: either all 7 parameters update simultaneously or none do. A rollback mechanism allows reverting to the previous parameter set by re-publishing the PREV_PARAMS_JSON stored in yolo-state.env.

B.3 State File Management

Two state files maintain continuity across pipeline executions and system restarts.

yolo-state.env (YOLO loop state):

YOLO_ROUND=5
LAST_PUSH_TIME=2026-03-05T18:45:00
LAST_ONLINE_FROM=2026-03-05T15:15:00
LAST_ONLINE_TO=2026-03-05T18:45:00
PREV_PARAMS_JSON={"ps_ads_wo": 3.51, ...}
PREV_AB_REPORT_PATH=/path/to/report

one-shot-latest.env (last successful one-shot handoff):

LATEST_RUN_TAG=20260302_143022
CONFIG_PATH=.../pdp-th-topk-order-guard-agent.next.json
BEST_SO_FAR=.../search-artifacts/*_best_so_far.json

B.4 Directory Structure

Each round produces artifacts in a timestamped directory:

updates/
|-- memory/
|   '-- agent_memory.db           <- Master DB (shared across rounds)
|-- /
|   |-- online-loop.log           <- 10-step pipeline output
|   |-- search.log                <- Optuna trial results
|   |-- ab_report.md              <- Online A/B metrics
|   |-- llm_proposal.json         <- LLM output with evidence
|   |-- rule.publish.json         <- Published parameters
|   |-- search-logs/
|   |   '-- *_trial_table.csv     <- All 5,000 trial results
|   '-- search-artifacts/
|       '-- *_best_so_far.json    <- Best parameters found
|-- yolo-state.env
|-- yolo-event-log.md             <- Audit log of all rounds
'-- one-shot-latest.env

C. Notation Table

Table 12 | TAB-APP-02. Symbol definitions.

π_q(θ): ranking permutation for request q under parameters θ (Eq. 1)
S_{n_q}: symmetric group acting on n_q items (Eq. 1)
S_q(θ): total score vector for request q (Eq. 2)
H^q_{ij}: comparison hyperplane x_i = x_j for items i, j (Eq. 2)
C_π: ranking chamber corresponding to permutation π (Eq. 2)
θ: 7-dimensional sorting parameter vector (Eq. 10)
F: set of all sorting factors (Eq. 4)
S_q(i; θ): total score of item i in request q (Eq. 3)
Δ_q(i, j; θ): total pairwise margin for item pair (i, j) (Eq. 3)
Δ_{q,f}(i, j; θ): score difference contributed by factor f on item pair (i, j) (Eq. 4)
Z^q_{ij}(θ): total pairwise influence budget for item pair (i, j) (Eq. 6)
s_{q,f}(i, j; θ): pairwise influence share of factor f for item pair (i, j) (Eq. 7)
P_q: informative / top-sensitive pair set aggregated for request q (Eq. 9)
r^q_i: rank of item i in request q (Eq. 8)
w^q_{ij}: rank-based exponential decay weight for item pair (i, j) (Eq. 8)
τ: decay rate parameter for pair weighting (Eq. 8)
I_f(θ): Aggregate Influence Share for factor f (Eq. 9)
J(θ): composite search objective function (Eq. 10)
λ_j: penalty weight for constraint j (Eq. 10)
v_j(θ): violation magnitude for constraint j (Eq. 10)
α: transfer model slope (Belief channel) (Eq. 12)
β: transfer model intercept (Belief channel) (Eq. 12)
η: LMS learning rate (= 0.2) (Eq. 13)
e_t: transfer prediction error at round t (Eq. 13)
Δβ_LLM: LLM intercept adjustment (∈ [−0.1, +0.1]) (Eq. 15)
p_j: normalized violation pressure for constraint j (Eq. 16)
δ_up: penalty tightening step size (= 0.25) (Eq. 17)
δ_down: penalty relaxation step size (= 0.08) (Eq. 17)
m_LLM: LLM penalty multiplier (∈ [0.5, 2.0]) (Eq. 20)

Table 13 | Abbreviations.

GMV: Gross Merchandise Value
PDP: Product Detail Page
TPE: Tree-structured Parzen Estimator
LMS: Least Mean Squares
YOLO: You Only Live Once (continuous pipeline mode)
KPI: Key Performance Indicator
CTR: Click-Through Rate
eCPM: Effective Cost Per Mille
ROI: Return on Investment
