Large Language Model Agent in Financial Trading: A Survey

Notice: This research summary and analysis were generated automatically with AI technology. For full accuracy, please refer to the original arXiv paper.

Trading is a highly competitive task that requires a combination of strategy, knowledge, and psychological fortitude. With the recent success of large language models (LLMs), it is appealing to apply the emerging intelligence of LLM agents in this competitive arena and to understand whether they can outperform professional traders. In this survey, we provide a comprehensive review of the current research on using LLMs as agents in financial trading. We summarize the common architectures used in these agents, their data inputs, the performance of LLM trading agents in back-testing, and the challenges presented in these studies. This survey aims to provide insights into the current state of LLM-based financial trading agents and to outline future research directions in this field.


💡 Research Summary

This survey paper provides a comprehensive review of the emerging field of large language model (LLM) agents applied to financial trading. The authors systematically examined 27 recent studies (published primarily between 2022 and 2024), of which seven explicitly use the term “agent” in their titles. Their analysis is organized around three core questions: (1) What architectural patterns are employed for LLM‑based trading agents? (2) Which data modalities feed these agents? (3) How do these agents perform in back‑testing, and what limitations remain?

The paper identifies two overarching architectural paradigms. The first, “LLM as a Trader,” lets the language model directly generate buy, hold, or sell signals. Within this paradigm, four sub‑categories are distinguished:

  1. News‑Driven agents ingest real‑time news articles, macro‑economic releases, and corporate disclosures as prompt context. Early works simply concatenate raw headlines and ask the model to predict short‑term price direction, often using sentiment scores to construct a long‑short portfolio. More sophisticated pipelines add summarization, refinement, and reasoning modules; for example, MarketSense AI builds progressive daily news summaries and stores them in a “memory” bank that is retrieved during trading. Experiments with both closed‑source models (GPT‑3.5/4) and open‑source alternatives (Qwen, Baichuan) consistently show that sentiment‑based strategies can generate positive alpha.

  2. Reflection‑Driven agents introduce layered memory and reflection mechanisms that mimic human learning. FinMem and FinAgent are representative systems: raw inputs (news, reports, charts) are first summarized into “memories”; when new market observations arrive, relevant memories are retrieved, combined with the fresh data, and processed into higher‑level “reflections” that guide the final decision. This architecture reduces hallucination, improves contextual relevance, and enables multimodal inputs (numeric, textual, visual).

  3. Debate‑Driven agents employ multiple role‑specific LLMs that argue over interpretations of the same data. A heterogeneous debating framework (e.g., mood, rhetoric, dependency agents) has been shown to boost sentiment classification accuracy. TradingGPT extends this idea by having agents debate each other’s proposed actions and reflections before committing to a trade, thereby increasing robustness.

  4. Reinforcement‑Learning‑Driven agents treat back‑testing results as a reward signal and apply RLHF or RLAIF to align model outputs with profitable behavior. SEP leverages a memorization‑reflection loop together with reinforcement learning to fine‑tune predictions based on historical success/failure. Another line of work (LG‑SCRL) extracts news embeddings via an LLM, projects them into a stock feature space, and trains a policy network with Proximal Policy Optimization (PPO). These approaches demonstrate improved long‑term returns but raise concerns about over‑fitting and computational cost.
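The news-driven pattern in item 1 can be sketched as a minimal pipeline: score each headline's sentiment, then go long the highest-scored tickers and short the lowest-scored ones. Here `score_sentiment` is a hypothetical keyword-based stub standing in for the LLM call that real systems make to GPT-3.5/4 or an open-source model.

```python
# Minimal sketch of a news-driven long-short strategy.
# `score_sentiment` is a hypothetical stand-in for an LLM sentiment call.

def score_sentiment(headline: str) -> float:
    """Placeholder: return a crude sentiment score (positive = bullish)."""
    positive = {"beats", "surges", "record", "upgrade"}
    negative = {"misses", "plunges", "lawsuit", "downgrade"}
    words = headline.lower().split()
    return (sum(w in positive for w in words)
            - sum(w in negative for w in words)) / max(len(words), 1)

def long_short_portfolio(headlines: dict[str, str], k: int = 1) -> dict:
    """Rank tickers by headline sentiment; long the top k, short the bottom k."""
    ranked = sorted(headlines, key=lambda t: score_sentiment(headlines[t]))
    return {"short": ranked[:k], "long": ranked[-k:]}
```

In the surveyed papers, the per-headline score comes from prompting the LLM rather than from a word list; everything downstream of the score is conventional portfolio construction.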
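The layered memory of the reflection-driven agents in item 2 can likewise be sketched: summaries are stored in a memory bank, the most relevant ones are retrieved when a new observation arrives, and memory plus observation are combined into a reflection prompt. The class and method names below are illustrative, not FinMem's or FinAgent's actual API, and keyword overlap stands in for the embedding similarity a real system would use.

```python
# Illustrative layered-memory retrieval (names are hypothetical, not FinMem's API).
# Keyword overlap is a stand-in for LLM embedding similarity.

from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    memories: list[str] = field(default_factory=list)

    def add(self, summary: str) -> None:
        """Store a summarized input (news digest, report summary, etc.)."""
        self.memories.append(summary)

    def retrieve(self, observation: str, k: int = 2) -> list[str]:
        """Return the k stored summaries sharing the most words with `observation`."""
        obs = set(observation.lower().split())
        scored = sorted(self.memories,
                        key=lambda m: len(obs & set(m.lower().split())),
                        reverse=True)
        return scored[:k]

def reflect(bank: MemoryBank, observation: str) -> str:
    """Combine retrieved memories with fresh data into a 'reflection' prompt
    that an LLM would turn into a trading decision."""
    context = "; ".join(bank.retrieve(observation))
    return f"Given past context [{context}] and new data [{observation}], decide."
```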

The second paradigm, “LLM as an Alpha Miner,” decouples signal generation from execution. Here the LLM produces alpha factors (predictive scripts or code) that are then fed into a traditional quantitative pipeline. QuantAgent implements an inner‑loop writer‑judge system where a “writer” LLM drafts a trading script from a human‑provided idea, a “judge” LLM critiques it, and the loop iterates. The outer loop evaluates the resulting strategy on real market data, feeding performance back to improve the judge. AlphaGPT follows a similar human‑in‑the‑loop design, emphasizing efficiency in the resource‑intensive alpha‑mining process.
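QuantAgent's inner loop can be sketched as a simple iterate-until-approved pattern. The `write_script` and `judge` callables below are hypothetical stand-ins for the two LLM roles; the actual system's interfaces are not specified here.

```python
# Sketch of an inner writer-judge refinement loop. The two callables are
# illustrative stand-ins for the writer and judge LLMs described for QuantAgent.

def writer_judge_loop(idea, write_script, judge, max_iters=5):
    """Draft a trading script from a human idea, critique it, and revise
    until the judge approves or the iteration budget is exhausted."""
    feedback = None
    script = None
    for _ in range(max_iters):
        script = write_script(idea, feedback)   # writer LLM: draft or revise
        approved, feedback = judge(script)      # judge LLM: approve or critique
        if approved:
            break
    return script
```

The outer loop described in the text would wrap this function, back-testing the returned script on market data and using the results to refine the judge's criteria.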

Data inputs across all surveyed agents fall into four categories:

  • Numerical data – price, volume, high/low, technical indicators. Because LLMs are text‑oriented, these numbers are stringified before being fed to the model. Studies show that even simple textual representations of numeric features can be effectively used for short‑term signal generation.
  • Textual data – fundamental documents (10‑K, 10‑Q filings, analyst reports) and alternative sources (news, social media). All surveyed agents rely heavily on textual inputs; sentiment extraction from news is a common sub‑task, while only SEP incorporates real‑time Twitter streams.
  • Visual data – chart images (candlesticks, volume bars). Multimodal LLMs such as GPT‑4V and LLaVA enable this modality, but research is nascent. FinAgent’s early experiments with GPT‑4V‑processed charts reported measurable performance gains over a comparable system lacking visual input.
  • Simulated data – synthetic market scenarios used for pre‑training or stress testing.
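The stringification of numerical data mentioned in the first bullet amounts to serializing OHLCV rows into the prompt text. The exact template varies across the surveyed papers, so the formatting below is only an assumed example.

```python
# Example of serializing numeric market data into text for an LLM prompt.
# The template is an assumption; surveyed papers use varying formats.

def serialize_bar(date: str, o: float, h: float, low: float,
                  c: float, v: int) -> str:
    """Render one daily OHLCV bar as a prompt-friendly line of text."""
    return (f"{date}: open {o:.2f}, high {h:.2f}, low {low:.2f}, "
            f"close {c:.2f}, volume {v:,}")

def build_prompt(bars: list[tuple]) -> str:
    """Concatenate serialized bars and append the prediction instruction."""
    lines = "\n".join(serialize_bar(*b) for b in bars)
    return f"Recent daily bars:\n{lines}\nPredict next-day direction (up/down)."
```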

Performance evaluation is uniformly based on back‑testing metrics: cumulative return, Sharpe ratio, maximum drawdown, and sometimes turnover. News‑driven agents typically achieve modest but consistent alpha; reflection‑driven and debate‑driven systems improve risk‑adjusted returns and reduce volatility; RL‑driven agents show the highest potential upside but also the greatest variance across market regimes. Alpha‑mining agents demonstrate that LLM‑generated factors can rival manually engineered ones, yet they still require rigorous out‑of‑sample validation.
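The back-testing metrics listed above have standard definitions; a minimal implementation over a series of periodic (e.g. daily) returns looks like this:

```python
# Standard back-test metrics computed from a series of periodic returns.

import math

def cumulative_return(returns: list[float]) -> float:
    """Compounded total return over the whole period."""
    prod = 1.0
    for r in returns:
        prod *= 1.0 + r
    return prod - 1.0

def sharpe_ratio(returns: list[float], periods_per_year: int = 252,
                 risk_free: float = 0.0) -> float:
    """Annualized mean excess return divided by annualized volatility."""
    n = len(returns)
    mean = sum(returns) / n
    excess = mean - risk_free / periods_per_year
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return (excess * periods_per_year) / (math.sqrt(var) * math.sqrt(periods_per_year))

def max_drawdown(returns: list[float]) -> float:
    """Largest peak-to-trough decline of the equity curve, as a fraction."""
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        mdd = max(mdd, (peak - equity) / peak)
    return mdd
```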

Model selection analysis reveals a strong dominance of OpenAI’s GPT‑3.5 and GPT‑4, reflecting their superior general‑purpose capabilities. Cost‑effectiveness drives many researchers to favor GPT‑3.5 despite GPT‑4’s higher performance, while open‑source alternatives occupy a long‑tail niche for specialized or low‑budget projects.

The authors discuss several persistent challenges:

  • Numerical reasoning limitations – LLMs struggle with precise arithmetic and time‑series forecasting, necessitating auxiliary feature engineering or hybrid architectures.
  • Hallucination risk – generation of spurious facts can lead to erroneous trades; memory and reflection layers mitigate but do not eliminate this risk.
  • Data pipeline complexity – real‑time ingestion of heterogeneous sources (news, social media, charts) demands robust engineering and incurs latency and cost.
  • Regulatory and ethical concerns – automated agents could inadvertently contribute to market manipulation or amplify systemic risk, underscoring the need for compliance frameworks.

Future research directions proposed include: (1) deeper multimodal integration to fully exploit chart imagery; (2) automated, high‑quality feedback loops for continual RL fine‑tuning; (3) embedding risk‑management and compliance modules directly into the agent architecture; (4) large‑scale live‑trading pilots to assess real‑world robustness; and (5) development of domain‑specific open‑source LLMs that balance performance with cost.

In summary, the survey maps the current landscape of LLM‑based financial trading agents, highlighting architectural diversity, data richness, promising back‑testing results, and a clear set of technical and regulatory hurdles that must be addressed before these agents can reliably outperform professional traders in live markets.

