PlotTwist: A Creative Plot Generation Framework with Small Language Models
Abhinav Thorat*, Ravi Kolla*, Jyotin Goel†, and Niranjan Pedanekar
Sony Research India
{abhinav.thorat, ravi.kolla, jyotin.goel, niranjan.pedanekar}@sony.com

Abstract. Creative plot generation presents a fundamental challenge for language models: transforming a concise premise into a coherent narrative that sustains global structure, character development, and emotional resonance. Although recent Large Language Models (LLMs) demonstrate strong fluency across general-purpose tasks, they typically require preference alignment to perform well on specialized domains such as creative plot generation. However, conducting such alignment at the scale of frontier LLMs is computationally prohibitive, significantly limiting accessibility and practical deployment. To address this, we present PlotTwist, a structured framework that enables Small Language Models (SLMs) with ≤ 5B active parameters to generate high-quality, premise-conditioned plots competitive with frontier systems up to 200× larger. Our approach decomposes generation into three specialized components: (1) an Aspect Rating Reward Model trained via a novel Positive-Negative prompting strategy to deliver structured aspect ratings across five Narrative Quality Dimensions (NQDs); (2) a Mixture-of-Experts (MoE) plot generator aligned via Direct Preference Optimization on high-confidence preference pairs; and (3) an Agentic Evaluation module that emulates human critical judgment for unbiased post-hoc assessment. Extensive experiments demonstrate that PlotTwist consistently outperforms frontier models across multiple NQDs despite substantially tighter capacity constraints. Further validation confirms strong sensitivity to narrative quality, as the framework reliably distinguishes plots derived from critically acclaimed versus widely panned screenplays.
Together, these results establish structured, preference-based alignment as a resource-efficient approach to high-quality creative plot generation.

Keywords: Creative Writing, Plot Generation, LLM, SLM, Mixture-of-Experts, Direct Preference Optimization

1 Introduction

Writers across film studios, streaming platforms, and publishing houses constantly face the same challenge: transforming a concise creative premise into a compelling narrative outline under tight deadlines. A showrunner needs to develop episode arcs for a new series. A screenwriter must pitch three distinct treatments by week's end. An educator requires diverse story examples for a creative writing course. In each case, the task is not simply to generate text, but to craft plots that exhibit coherent structure, believable character arcs, consistent tone, and emotionally resonant turning points, qualities that distinguish professional storytelling from arbitrary event sequences [26, 28]. While experienced writers navigate these demands through years of training and intuition, the question remains: can computational systems provide structured assistance for this fundamentally creative task?

* Equal contribution. † Work done during an internship at Sony Research India.

The challenge of creative plot generation with generative models extends beyond surface-level text production. Unlike summarization or question answering, where local context often suffices, plot generation demands long-horizon reasoning over concise conditioning signals. A promising premise derived from audience research by an entertainment channel network, such as "a romantic comedy set in the modern tech startup era", provides minimal concrete guidance to writers, yet must expand into a causally connected sequence of events spanning setup, development, climax, and resolution.
The narrative must maintain global coherence while ensuring that early character motivations align with later decisions, that tonal shifts feel earned rather than arbitrary, and that pacing sustains engagement across the entire arc. These requirements pose significant difficulties for standard autoregressive language models, which optimize token-level likelihoods and lack explicit mechanisms for enforcing discourse-level constraints. Prior work has shown that hierarchical planning and explicit structural decomposition can improve narrative consistency [6, 8, 26, 28], but such approaches typically assume large model capacities, task-specific supervision, or relaxed efficiency constraints, leaving open the question of whether effective plot generation is achievable under tight computational budgets.

Recently, LLMs have demonstrated impressive fluency across creative writing tasks, yet their success comes at a steep cost. Frontier models such as GPT-4.1 [16], Claude Sonnet 4 [1], and Gemini 2.0 Flash [3] operate at scales exceeding hundreds of billions of parameters, demanding substantial computational infrastructure for both training and inference, a burden that ultimately cascades down to end users. Beyond raw cost, the literature consistently shows that task-specific alignment yields meaningfully better performance than relying on general-purpose models alone [17, 22]. In the context of creative writing, however, this alignment imperative is compounded by a deeper challenge: scale alone does not reliably resolve long-horizon coherence. Even the largest models exhibit narrative drift, inconsistent characterization, and structural incoherence when generating extended plots without additional inductive biases [6, 28].
Achieving professional-grade plot generation thus requires targeted alignment to the creative domain, yet for models of this scale, such alignment is computationally prohibitive, particularly when the resulting system is intended for a narrow, specialized use case. This tension between alignment necessity and computational feasibility forms the central motivation for our work.

This observation motivates a fundamental question: can Small Language Models (SLMs), defined here as models with ≤ 3B active parameters per token, generate creative plots of comparable quality to frontier LLMs when aligned using appropriate structural scaffolding? We hypothesize that the key lies not in model scale, but in externalizing narrative structure into explicit evaluative and training signals. Rather than relying on a monolithic model to implicitly learn all aspects of narrative quality through token prediction, we propose decomposing the generation process into specialized components: an Aspect-based Reward Model that provides structured feedback across multiple narrative dimensions, a Plot Generator optimized via preference learning to align with these signals, and an independent evaluation framework that emulates human critical judgment. This architectural separation enables SLMs to leverage explicit guidance where large models rely on emergent capabilities, effectively trading model capacity for structured workflow design.

To operationalize this approach, we introduce PlotTwist, a three-component framework for concise premise-conditioned plot generation with SLMs. The first component is an Aspect Rating Reward Model that evaluates plots across five Narrative Quality Dimensions (NQDs): character development, tone consistency, pacing, narrative coherence, and emotional turning points.
The second component is a Mixture-of-Experts (MoE) Plot Generator based on Qwen-3-30B-A3B (3B active parameters), trained via Direct Preference Optimization (DPO) [18] on preference pairs derived from the above aspect rating reward model. The third component is an Agentic Evaluation Module that operates independently of the training pipeline, providing post-hoc assessment through structured, weakness-focused criteria. This modular separation ensures that training-time optimization and evaluation-time analysis rely on distinct mechanisms, reducing the risk of reward hacking or circular validation.

The key contributions of this work are as follows.

– Structured Workflow using SLMs for Plot Generation. We propose an Aspect Rating Reward Model, a DPO-trained MoE Plot Generator, and an Independent Agentic Evaluation Module. Each component is independently deployable for distinct use cases while collectively enabling structured preference alignment under capacity constraints.
– Positive-Negative Prompting for Aspect Rating. We introduce a novel prompting strategy that mitigates positivity bias in LLM-based evaluation, constructing reliable aspect-level training data across five NQDs from model ensembles.
– External Validation of Evaluation Components. We demonstrate that both evaluative components reliably distinguish narrative quality by assigning higher scores to acclaimed screenplays (101 Greatest Screenplays of All Time) than to critically panned films (Golden Raspberry Awards) across all narrative aspects.
– Competitive Performance.
PlotTwist outperforms strong baselines including frontier models (GPT-4.1 [16], Claude Sonnet 4 [1], Gemini 2.0 Flash [3]), large open-weight models (Llama-3-70B [5]), and specialized narrative generation systems (Agents' Room [11], WizardLM-StoryTelling [27]) using only 3B active parameters, demonstrating that structured preference alignment can compensate for limited model capacity in creative generation.
– Quality-Adaptive Generation Behavior. PlotTwist exhibits principled intervention scaling across quality strata, providing light refinement for strong narratives and substantial restructuring for weak ones rather than uniform score inflation.

Together, these contributions establish that structured preference-based workflows can enable SLMs to achieve competitive performance on creative long-form generation tasks, offering a scalable and resource-efficient alternative to frontier model deployment for premise-conditioned creative plot generation.

2 Related Work

Story and Plot Generation. The challenge of automatic narrative generation has been approached progressively in the research literature. Early neural methods introduced hierarchical generation, decomposing stories into premise and continuation stages [6], while Plan-and-Write frameworks explicitly separated outline planning from surface realization [28]. Building on this, PlotMachines [19] incorporated dynamic plot state tracking to maintain narrative coherence across long outputs. More recently, LLM-based systems have advanced narrative generation further. Agents' Room [11] employs multi-agent collaboration to simulate the division of labor in professional writing rooms, and [8] demonstrate that explicit reasoning chains substantially improve long-form narrative quality. Despite these advances, such systems remain tethered to frontier-scale computation, limiting their practical accessibility.
A recent survey [26] comprehensively charts this evolving landscape, underscoring that structured, quality-aware generation remains an open problem.

Preference Alignment and Efficient Models. Reinforcement Learning from Human Feedback (RLHF) has been shown to be effective for aligning LLMs with human preferences, but the training pipeline, typically involving reward model learning and reinforcement learning optimization, can be computationally expensive. Direct Preference Optimization (DPO) [18] offers a simpler, more stable alternative by reframing alignment as a classification problem over preference pairs, making it especially attractive for resource-constrained settings. Our work leverages DPO within an MoE architecture [21, 7], which enables sparse, efficient computation without sacrificing model capacity.

Evaluation of Creative Text. Evaluating creative generation remains fundamentally challenging. Early work questioned whether pretrained models genuinely improve story quality [20], while more recent studies suggest that LLM-generated text often lacks authentic creativity despite exhibiting strong surface-level fluency [2]. Automated evaluation has increasingly turned to LLM-as-a-Judge frameworks [30], which offer scalable proxies for human judgment. However, such judges are susceptible to positional and shortcut biases [14], motivating the design of more robust evaluation protocols. Recent work explores curiosity-driven [13] and agent-based [29] judging strategies to improve reliability. Our NQDs are theoretically grounded in affective narratology [10] and operationalized through structured reward modeling, drawing on recent creative evaluation benchmarks [31, 12] to ensure both validity and interpretability.
3 Problem Formulation

We consider the problem of creative plot generation using small language models (SLMs),¹ where the model is conditioned on a high-level premise specifying thematic and contextual constraints of the desired plot. A premise concisely defines the narrative setting, genre, and related guiding components. An example premise is as follows: Generate a plot of a romantic comedy set in the modern tech startup era.

¹ We distinguish models by active parameter count rather than total parameters. We refer to models with fewer than 3 billion active parameters per token as SLMs, even when implemented as MoE architectures with larger total parameter counts.

Prior work in computational narrative modeling identifies several structural attributes consistently associated with high-quality plot construction, including character development, tone consistency, pacing, narrative coherence, and emotionally salient turning points [2]. These attributes reflect global organizational properties of narratives rather than surface-level fluency, capturing how events are causally linked, how characters evolve, and how affective dynamics unfold over time. Motivated by this literature, we focus on these five aspects as a compact yet comprehensive set of NQDs for evaluating plot quality. Collectively, the NQDs span structural (coherence), temporal (pacing), character-centric (development), stylistic (tone consistency), and affective (emotional turning points) components of narrative quality, providing broad coverage without introducing redundant or overlapping criteria.
Concretely, narrative coherence measures global logical consistency and causal connectivity across events; character development evaluates meaningful internal transformation; pacing assesses the distribution of narrative progression and tension; tone consistency captures alignment between stylistic register and the given premise; and emotional turning points identify impactful affective transitions. Together, the NQDs provide an interpretable and structured framework for quantifying plot quality while maintaining manageable evaluation complexity.

Our objective is to develop a structured SLM-based workflow that generates premise-conditioned plots exhibiting strong performance across all NQDs, aiming to approximate the narrative quality observed in professionally authored plots.

4 Proposed Methodology

In this section, we present the complete technical description of the proposed methodology, which comprises three primary modules as shown in Figure 1.

1. Aspect Rating Reward Model: estimates ratings for all aspects in the NQDs in a manner that aligns with ratings given by human experts. These estimated aspect ratings are later used in the reward process of the plot generator module.
2. Plot Generator: given a premise, a concise specification of the narrative setting, genre, and related constraints, this module generates a plot that adheres to the premise while excelling across all aspects in the NQDs.
3. Agentic Evaluation: evaluates plots across all aspects in the NQDs with the objective of emulating human expert judgment. While direct evaluation by domain experts would provide the most reliable assessment, such evaluation is often constrained by availability, time-consuming, and expensive. Therefore, following prior work [26, 12], we adopt an independent agentic evaluation framework for assessing plot quality.
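The downstream components (reward prompting, preference scoring, and agentic evaluation) all iterate over these five dimensions. A minimal illustrative sketch of such a registry; the identifiers and description strings below are our own paraphrase of the definitions above, not an artifact from the paper:

```python
# Registry of the five Narrative Quality Dimensions (NQDs).
# Descriptions paraphrase the paper's definitions for illustration only.
NQDS = {
    "narrative_coherence": "global logical consistency and causal connectivity across events",
    "character_development": "meaningful internal transformation of characters",
    "pacing": "distribution of narrative progression and tension",
    "tone_consistency": "alignment between stylistic register and the given premise",
    "emotional_turning_points": "impactful affective transitions",
}

def aspect_names():
    """Return the NQD identifiers in a stable order for prompting/evaluation loops."""
    return sorted(NQDS)
```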
4.1 Aspect Rating Reward Model

Motivated by this aspect-level characterization of plot quality along the considered NQDs, our objective is to develop a reward model that produces aspect-level ratings and can be used to guide the plot generator model.

[Figure 1: The PlotTwist framework. Step 1 trains a reward model via supervised fine-tuning with Huber loss on aspect ratings generated through positive-negative prompting in a model-agnostic setup. Step 2 trains the PlotTwist generator: candidate plots from LLMs and the generator model are ranked by the reward model, and the generation policy is updated via DPO using reward-model preferences and thresholds. Step 3 performs aspect-wise agentic evaluation with detailed instructions for each individual aspect.]

To this end, we begin by constructing a dataset comprising plots paired with their corresponding ratings across the considered NQDs. We then fine-tune an LLM on this dataset to obtain an Aspect Rating Reward Model capable of assigning continuous-valued scores to plots. To formalize this setup, we introduce the required notation: for a given plot p and aspect a, we use r_a(p) to denote the rating of plot p along aspect a.

Aspect rating dataset construction: As there are no existing datasets that provide fine-grained ratings of plots across the considered NQDs, we construct such a dataset synthetically using LLMs. To the best of our knowledge, the only widely available human-provided score is the IMDb rating for movies. However, this rating is a holistic assessment that encapsulates many creative elements simultaneously; consequently, it cannot be used as a direct proxy for individual aspect-specific ratings, though it can serve as a coarse aggregate indicator of overall plot quality. Note that IMDb ratings are not used in model training; they are employed solely for data curation and stratification, as detailed in the experiments section.

We begin by randomly sampling 5,000 movies from the MovieLens [9] dataset, spanning a broad range of IMDb ratings to ensure diversity. For each movie, we then scrape its corresponding plot from Wikipedia.² Next, we employ LLMs to generate synthetic aspect-level ratings for each plot. We emphasize that LLMs are used solely in the reward model, while the final plot generator model, described in the following section, is implemented as an SLM. For each plot p, we generate ratings for all aspects in the NQDs using positive-negative prompting³ in a model-agnostic setup, described below. This approach mitigates positivity bias, encourages critical assessment, and yields stronger correlation with external indicators, enabling more accurate and balanced evaluation of plots.

² To meet token output and computation requirements, we select only movies with plot length of at most 4,000 words.
³ For reference, both the positive and negative prompts for the narrative coherence aspect are given in Appendix A.

[Figure 2: Left: illustration of positive and negative prompting used to derive aspect-specific reward signals for DPO alignment, shown for narrative coherence (reward coherent structure and strong narrative flow; penalize plot holes, inconsistencies, and illogical events). Right: decomposition of character development into criteria (motivation clarity, behavioral consistency, character arc, psychological depth, backstory integration, audience connection, character distinctiveness, relationship dynamics, plot agency, thematic alignment), aggregated into an interpretable 0-10 evaluation score.]

We first take five LLMs, namely Qwen-2.5-7B, Llama-3.2-70B, Llama-3.1-8B, DeepSeek-14B, and Gemma-27B, to avoid model bias in the aspect ratings. We then prompt each model m with a plot to output a rating, on a scale of 1-10, for each aspect by considering only the positives present in the plot along that aspect, denoted r^+_{a,m}(p). Similarly, we prompt the LLMs to output a rating, on a scale of 1-10, by considering only the negatives present in the plot along each aspect, denoted r^-_{a,m}(p). Note that if an aspect is well captured in the plot, we expect r^+_{a,m}(p) to be high and r^-_{a,m}(p) to be low. The final aspect rating of a plot is then calculated as

r_a(p) = \sum_m \left( r^+_{a,m}(p) - r^-_{a,m}(p) \right).   (1)

Figure 2 (left) shows an example of how positive-negative prompting is applied to obtain a rating for the narrative coherence aspect. We note that this style of prompting mitigates the inherent positivity bias often observed in LLMs [30], enabling them to critique plots more effectively.

Supervised Fine-Tuning (SFT). We now train a reward model using the aspect-rating dataset constructed above. To that end, we adopt a Qwen-3-32B (4-bit) model and perform regression-aware SFT on this dataset. The training objective minimizes a weighted sum of two components: (i) a Cross-Entropy (CE) token loss, which preserves the model's language modeling capability, and (ii) a Huber loss, which enables robust learning of continuous-valued reward scores, as it acts like mean squared error for small residuals and like mean absolute error for large residuals, providing smooth and differentiable optimization while remaining robust to outliers. The mathematical formulations of both losses are provided below.
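Equation (1) can be implemented as a direct aggregation over the model ensemble. A minimal sketch, assuming the positive-only and negative-only ratings for one aspect of one plot have already been collected per model; the model names and numbers below are illustrative, and the summed score is left unnormalized as in Eq. (1):

```python
def aspect_rating(pos_ratings, neg_ratings):
    """Compute r_a(p) = sum_m (r+_{a,m}(p) - r-_{a,m}(p)) per Eq. (1).

    pos_ratings / neg_ratings: dicts mapping model name -> 1-10 rating obtained
    with the positive-only / negative-only prompt for one aspect of one plot.
    """
    assert pos_ratings.keys() == neg_ratings.keys(), "same ensemble on both sides"
    return sum(pos_ratings[m] - neg_ratings[m] for m in pos_ratings)

# Illustrative ratings from a five-model ensemble for one aspect of one plot.
pos = {"qwen2.5-7b": 8, "llama-3.2-70b": 9, "llama-3.1-8b": 8, "deepseek-14b": 7, "gemma-27b": 8}
neg = {"qwen2.5-7b": 3, "llama-3.2-70b": 2, "llama-3.1-8b": 4, "deepseek-14b": 3, "gemma-27b": 2}
score = aspect_rating(pos, neg)  # 40 - 14 = 26
```

A plot that captures the aspect well yields high positive-only ratings and low negative-only ratings, so a large summed score; a weak plot drives the difference toward zero or below.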
\mathcal{L}_{CE}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(y_t \mid x_{1:t}),   (2)

where \mathcal{L}_{CE}(\theta), x_{1:T}, y_{1:T}, T, and \theta are the cross-entropy loss, input token sequence, target (ground-truth) token sequence, sequence length, and model parameters, respectively. The Huber loss is given by

\mathcal{L}_\delta(r) = \begin{cases} \frac{1}{2} r^2, & \text{if } |r| \le \delta, \\ \delta \left( |r| - \frac{1}{2}\delta \right), & \text{if } |r| > \delta, \end{cases}   (3)

where r denotes the difference between the predicted reward and the target (ground-truth) reward, and we set \delta = 1 in the experiments.

4.2 Plot Generator Model

We now describe the plot generator model, whose objective is to generate creative plots conditioned on a given concise premise using an SLM. To that end, we adopt Qwen-3-30B-A3B, an MoE architecture that enables increased model capacity and expert specialization while maintaining efficient inference [7, 21]. This property is particularly important for creative generation tasks, which require modeling diverse narrative patterns and stylistic variations. Although the MoE model has a total parameter count of 30B, only 3B parameters are active per token, which classifies it as an SLM under our definition.

To further align the model toward producing higher-quality plots, we employ Direct Preference Optimization (DPO), a Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) approach that directly optimizes preference objectives without requiring an explicit reward model or on-policy reinforcement learning [18]. DPO provides a stable and sample-efficient alternative to traditional RLHF/RLAIF methods, making it particularly well-suited for fine-grained creative alignment where reliable preference signals are scarce. Given that Qwen-3-30B-A3B already exhibits strong instruction-following capabilities, we omit SFT for instruction alignment and focus exclusively on preference-based optimization.
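The two training signals in this section can be sketched compactly: the Huber loss of Eq. (3) used for reward-model SFT, and the standard DPO pairwise loss from [18] used to align the generator. This is a pure-Python sketch, not the authors' training code; the log-probability arguments to `dpo_loss` are placeholders for sums over plot tokens:

```python
import math

def huber(r, delta=1.0):
    """Eq. (3): quadratic for small residuals |r| <= delta, linear beyond."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * implicit reward margin).

    Arguments are the summed log-probabilities of the chosen/rejected plots
    under the policy (pi_*) and the frozen reference model (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A residual of 0.4 falls in the quadratic regime; 2.0 in the linear regime.
small, large = huber(0.4), huber(2.0)
```

The loss shrinks as the policy raises the chosen plot's likelihood relative to the rejected one (measured against the reference), which is exactly the direction the curated preference pairs push the generator.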
[Figure 3: Dataset curation for DPO. For a given premise, candidate plots are scored with the reward model; an accept-reject pair is retained when the top-ranked plot scores at least 8 and leads the next-best plot by at least 0.5.]

To perform DPO, we first construct a dataset of pairwise plot preferences, as illustrated in Figure 3. Each sample in the dataset consists of a premise, a pair of plots generated under that premise, and a preference ordering between them. As described in the previous section, we curate a corpus of 5,000 movie plots. We then generate a corresponding premise description for each plot using the Gemma-27B model. For each premise, we prompt the base Qwen-3-30B-A3B MoE model, along with several frontier models (Claude Sonnet 4, Gemini 2.0 Flash, and GPT-4.1), to generate plots conditioned on the same premise. All generated plots are evaluated using the reward model described in Section 4.1, with aspect-level ratings averaged to obtain a final reward for each plot. To align our MoE model toward higher-quality plot generation, we retain only those samples in which a frontier model strictly achieves the highest reward score, exceeds a score threshold of 8, and outperforms the next-best model by a margin of at least 0.5. Applying this procedure across all 5,000 premises yields 160 high-confidence preference samples.
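The curation rule above can be expressed as a filter over reward-scored candidates. A hedged sketch under stated assumptions: the tuple layout and model identifiers are illustrative, and using the base model's own plot as the rejected completion is our assumption (the paper specifies only that accept-reject pairs are formed), as noted in the comments:

```python
SCORE_THRESHOLD = 8.0   # minimum reward for the accepted plot
MARGIN_THRESHOLD = 0.5  # required lead over the next-best plot

def make_preference_pair(scored_plots, base_model="qwen3-30b-a3b"):
    """Return an (accepted, rejected) pair for one premise, or None if the
    sample fails the high-confidence filter.

    scored_plots: list of (model_name, plot_text, reward_score) tuples, where
    reward_score is the aspect-averaged reward from the reward model.
    """
    ranked = sorted(scored_plots, key=lambda t: t[2], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # The accepted plot must come from a frontier model, not the base model.
    if best[0] == base_model:
        return None
    # High-confidence filter: absolute threshold plus margin over next-best.
    if best[2] < SCORE_THRESHOLD or best[2] - runner_up[2] < MARGIN_THRESHOLD:
        return None
    # Assumption: the base model's own generation serves as the rejected
    # completion, a natural choice when running DPO on that base model.
    base_plot = next((p for m, p, _ in scored_plots if m == base_model), None)
    if base_plot is None:
        return None
    return best[1], base_plot
```

Applied over all premises, this filter keeps only unambiguous wins, which is why 5,000 premises reduce to 160 high-confidence preference samples.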
Although modest in size, this dataset is intentionally curated to ensure reliable preference signals, and prior work has shown that DPO can effectively leverage a small number of high-quality preference pairs [18]. We subsequently perform DPO on the Qwen-3-30B-A3B model using this dataset, yielding the final plot generator model. Full details of the computational infrastructure used for training and inference are provided in Appendix E.

4.3 Agentic Evaluation of Plots

Following plot generation, reliable validation of creative quality is essential. Although the aspect rating reward model provides structured supervision across the predefined NQDs, it remains a predictive model optimized for aspect-level signals derived from training data. Relying solely on this model risks evaluation bias, as it may reward patterns correlated with learned signals rather than reflect broader narrative soundness. Ideally, creative quality would be assessed by domain experts, whose judgments are holistic and context-sensitive; however, such evaluation entails substantial cost and is therefore not feasible in this work. Moreover, human assessment of creative writing is inherently comparative and intuitive, shaped by implicit expectations and sensitivity to disruptions in coherence, plausibility, and emotional progression.

Motivated by these considerations, we introduce an independent agentic evaluation framework, illustrated in Figure 2 (right) for the character development aspect, that performs structured, multi-criteria reasoning across the NQDs while remaining independent of the aspect rating reward model. By decomposing evaluation into interpretable criteria and aggregating them through controlled reasoning steps, the agentic evaluator approximates holistic narrative judgment without directly optimizing the plot generator model.
Inspired by evidence that LLMs exhibit substantial alignment with human judgment in evaluative tasks [30], we formulate precise instructions and explicit evaluation criteria that enable Qwen3-32B, operating at 16-bit precision, to approximate human assessment. The agentic evaluation prompts are provided in Appendix B.

To ensure reliable and consistent aspect-wise evaluation, each NQD is specified through explicit, instruction-level criteria that translate abstract narrative concepts into concrete, observable failure modes. Narrative coherence is assessed by identifying breakdowns in logical progression, causal relationships, and structural consistency, including plot holes, contradictions, and incoherent world-building. Emotional turning points are evaluated by examining the strength and plausibility of emotional arcs and pivotal moments, with attention to missed opportunities or unearned developments. Tone consistency is assessed by detecting abrupt or incongruent shifts in mood, genre, or atmosphere that disrupt narrative immersion. Pacing is evaluated through the rhythm and distribution of narrative events, identifying rushed, stagnant, or uneven progression. Character development is assessed by examining the presence of meaningful character arcs, coherent motivations, and authentic relational dynamics, while avoiding penalization of minor or functionally constrained roles.

5 Experiments

We first present experiments validating the proposed aspect rating reward model and agentic evaluation framework. We then study the performance of our PlotTwist framework under varying conditions, followed by comparisons against competitive baselines. Finally, we report ablation studies to quantify the contributions of individual components.
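These aspect specifications can be operationalized as prompt construction for the agentic judge. The sketch below is illustrative only: the criteria strings paraphrase the failure modes described above, and the prompt wording is not the authors' exact template (the real prompts are in Appendix B):

```python
# Failure-mode criteria per NQD, paraphrased from the paper's descriptions.
CRITERIA = {
    "narrative_coherence": "breakdowns in logical progression, causal relationships, and structural consistency (plot holes, contradictions, incoherent world-building)",
    "emotional_turning_points": "strength and plausibility of emotional arcs and pivotal moments, including missed opportunities or unearned developments",
    "tone_consistency": "abrupt or incongruent shifts in mood, genre, or atmosphere that disrupt immersion",
    "pacing": "rushed, stagnant, or uneven distribution of narrative events",
    "character_development": "presence of meaningful arcs, coherent motivations, and authentic relational dynamics, without penalizing minor or constrained roles",
}

def build_judge_prompt(aspect, plot):
    """Assemble a weakness-focused evaluation prompt for one NQD."""
    return (
        "Role: professional film analyst.\n"
        f"Task: evaluate the plot below on {aspect.replace('_', ' ')}.\n"
        f"Focus on concrete failure modes: {CRITERIA[aspect]}.\n"
        "Output: a score from 0-10 plus a detailed report.\n\n"
        f"Plot:\n{plot}"
    )
```

Running one such prompt per NQD keeps each judgment narrowly scoped, which is what makes the aggregate assessment interpretable.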
5.1 Aspect Rating Reward Model and Agentic Evaluation: Validation

The objective of this experiment is to verify that both components, the aspect rating reward model and the agentic evaluation framework, perform as intended, assigning higher scores to high-quality plots and lower scores to low-quality plots. Since IMDb ratings do not directly reflect plot quality, we instead use high-quality plots from the 101 Greatest Screenplays of All Time (GSAT) films and low-quality plots from films recognized by the Golden Raspberry Screenplay Awards (Razzies) to obtain a more reliable evaluation signal. Due to the class imbalance (37 Razzie vs. 94 GSAT)⁴, we employ a repeated balanced subsampling protocol in which the Razzie set is fixed and an equal number of GSAT films is sampled over 1,000 runs, ensuring robustness to sampling variance.

Aspect Rating Reward Model. We evaluate the aspect rating reward model on the aforementioned Razzie and GSAT plots and observe consistent separation across all NQDs. Plots of Razzie films obtain a mean score of 7.21, whereas GSAT plots achieve 8.28, yielding a difference of +1.07 (95% Confidence Interval (CI) [0.96, 1.19]), with GSAT outperforming Razzie plots in 100% of sampling runs. The largest aspect-level gaps occur in pacing (+1.41), emotional impact (+1.13), and narrative coherence (+1.12), with clear separation also in character development (+0.75) and tone consistency (+0.96). All dimensions exhibit large effect sizes (Cohen's d ranging from 2.77 to 4.68). Welch's t-tests confirm statistically significant differences across all NQDs, including a highly significant aggregate separation (t = 9.69, p = 3.41 × 10⁻¹⁸).
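The repeated balanced subsampling protocol can be sketched as follows. The score arrays below are synthetic placeholders, not the paper's data, and the function names are ours.

```python
import random
from statistics import mean, stdev

def cohens_d(a, b):
    """Pooled-standard-deviation effect size between two samples."""
    sa, sb = stdev(a), stdev(b)
    pooled = (((len(a) - 1) * sa ** 2 + (len(b) - 1) * sb ** 2)
              / (len(a) + len(b) - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled

def balanced_subsample_gap(razzie, gsat, runs=1000, seed=0):
    """Fix the smaller (Razzie) set, repeatedly draw an equal-sized GSAT
    subset, and report the mean score gap plus the fraction of runs in
    which the GSAT subset scores higher (directional consistency)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(runs):
        subset = rng.sample(gsat, len(razzie))
        gaps.append(mean(subset) - mean(razzie))
    win_rate = sum(g > 0 for g in gaps) / runs
    return mean(gaps), win_rate

# Synthetic stand-in scores (NOT the paper's data): 37 Razzie, 94 GSAT.
razzie_scores = [7.0 + 0.1 * (i % 5) for i in range(37)]
gsat_scores = [8.2 + 0.1 * (i % 5) for i in range(94)]
gap, win_rate = balanced_subsample_gap(razzie_scores, gsat_scores)
effect = cohens_d(gsat_scores, razzie_scores)
```

Fixing the minority class while resampling the majority class keeps every comparison balanced, so the reported win rate is not an artifact of the 37-vs-94 imbalance.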
Overall, these results demonstrate that the proposed reward model, developed using our positive-negative prompting approach, robustly and consistently differentiates between low- and high-quality plots across NQDs, capturing meaningful narrative and stylistic distinctions rather than artifacts of sampling or dataset imbalance.

⁴ The imbalance is due to the following three structural factors: the Razzie Awards' comparatively recent founding (1981), the fact that Worst Screenplay is not awarded every year, and a 4,000-word plot-length filter, all of which collectively reduce the available Razzie pool relative to the long-canonized GSAT list.

| Model | Character Development | Tone Consistency | Pacing | Narrative Coherence | Emotional Turning |
|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 8.03 ± 0.73 | 8.37 ± 0.46 | 8.48 ± 0.43 | 8.47 ± 0.55 | 8.80 ± 0.30 |
| Qwen2.5-7B-Instruct | 7.88 ± 0.78 | 8.37 ± 0.48 | 8.37 ± 0.55 | 8.22 ± 0.73 | 8.70 ± 0.33 |
| Claude Sonnet 4 | 8.67 ± 0.74 | 8.64 ± 0.32 | 8.65 ± 0.43 | 8.81 ± 0.44 | 8.95 ± 0.16 |
| Gemini 2.0 Flash | 8.39 ± 0.60 | 8.55 ± 0.35 | 8.67 ± 0.34 | 8.72 ± 0.54 | 8.89 ± 0.22 |
| GPT-4.1 | 8.29 ± 0.76 | 8.61 ± 0.37 | 8.74 ± 0.30 | 8.74 ± 0.46 | 8.88 ± 0.22 |
| Llama 70B | 7.84 ± 0.95 | 8.27 ± 0.48 | 8.22 ± 0.55 | 8.11 ± 0.67 | 8.68 ± 0.36 |
| DeepSeek R1 14B | 7.92 ± 0.81 | 8.36 ± 0.42 | 8.44 ± 0.38 | 8.40 ± 0.57 | 8.76 ± 0.26 |
| Phi-4 Mini Instruct | 7.45 ± 0.93 | 7.97 ± 0.90 | 7.81 ± 1.24 | 7.51 ± 1.46 | 8.35 ± 0.93 |
| Qwen3-32B | 8.25 ± 0.74 | 8.53 ± 0.38 | 8.50 ± 0.55 | 8.36 ± 0.67 | 8.87 ± 0.21 |
| Mistral Small 2501 24B | 8.07 ± 0.63 | 8.36 ± 0.41 | 8.48 ± 0.47 | 8.38 ± 0.59 | 8.77 ± 0.24 |
| Agents' Room [11] | 8.55 ± 0.48 | 8.59 ± 0.34 | 8.68 ± 0.39 | 8.81 ± 0.42 | 8.94 ± 0.19 |
| WizardLM-StoryTelling-30B [27] | 7.41 ± 0.82 | 8.14 ± 0.51 | 8.20 ± 0.57 | 7.89 ± 0.84 | 8.38 ± 0.50 |
| PlotTwist | 8.64 ± 0.62 | 8.70 ± 0.30 | 8.85 ± 0.29 | 8.89 ± 0.39 | 8.98 ± 0.21 |

Table 1: Performance comparison of NQD ratings across models. All reported results are computed over 160 test prompts and are presented as mean ± standard deviation. The best and second-best results are highlighted in bold and underlined, respectively.

Agentic Evaluation. The agentic evaluator exhibits similar behavior, assigning higher scores to GSAT plots across all NQDs. The mean score increases from 7.47 (Razzie) to 8.62 (GSAT), a difference of +1.15 (95% CI [1.02, 1.26]), with perfect directional consistency across subsampling runs. The strongest separations appear in pacing (+1.67) and narrative coherence (+1.58), followed by character development (+1.09) and tone consistency (+1.08), all with very large effect sizes (Cohen's d ranging from 5.02 to 7.00). Emotional impact shows a smaller but consistent gap (+0.30). Welch's t-tests confirm highly significant differences across all dimensions (t = 12.99, p = 8.10 × 10⁻²⁸). Taken together, these findings show that our agentic evaluation robustly captures narrative and stylistic quality differences between critically panned and highly acclaimed plots, closely mirroring the behavior observed with the reward model above.

5.2 Quality-Stratified Analysis of PlotTwist

We now study the performance of the proposed PlotTwist framework across inputs derived from plots of varying original quality to assess whether it captures meaningful narrative distinctions and exhibits quality-adaptive generation behavior, rather than uniformly inflating scores. To this end, we consider 160 movies partitioned into four IMDb-calibrated quality categories representing progressively decreasing levels of original plot quality: Excellent (IMDb > 8), Good (7 < IMDb ≤ 8), Mid (6 < IMDb ≤ 7), and Low (IMDb ≤ 6).
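The four IMDb-calibrated strata reduce to a simple threshold function; a minimal sketch (the function name is ours):

```python
def quality_category(imdb_rating: float) -> str:
    """Map an IMDb rating to the four quality strata: Excellent (> 8),
    Good (7, 8], Mid (6, 7], and Low (<= 6)."""
    if imdb_rating > 8:
        return "Excellent"
    if imdb_rating > 7:
        return "Good"
    if imdb_rating > 6:
        return "Mid"
    return "Low"
```

Note that the interval boundaries are half-open, so a rating of exactly 7.0 falls in Mid and exactly 8.0 falls in Good.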
For each movie plot, we generate a corresponding premise using Gemma-27B, which is then used to condition the PlotTwist plot generator model. Within each category, we compare original and generated plots across the five NQDs using the agentic evaluator. For each dimension, we compute mean scores, bootstrap-based 95% confidence intervals over paired differences, effect sizes (Cohen's d), the probability that generated plots outperform originals, and Welch's t-tests for secondary statistical validation. This combination of resampling-based estimation and parametric testing ensures robustness to distributional assumptions while enabling interpretable effect-size comparisons.

Excellent Category (IMDb > 8). In the Excellent category, where original plots already demonstrate strong narrative quality, generated plots achieve modest yet consistent improvements across all evaluated dimensions. Mean differences range from +0.27 to +1.11, with character development exhibiting the strongest and most reliable improvement (+1.11, 95% CI [0.62, 1.72], p = 1.44 × 10⁻³). Gains in narrative coherence, pacing, and tone consistency are uniformly positive and approach or reach statistical significance, with medium effect sizes (Cohen's d ≈ 0.4–0.5). In contrast, emotional impact shows a smaller and statistically non-significant improvement, suggesting limited scope for affective enhancement when baseline quality is already high. These results indicate that, for high-quality narratives, the PlotTwist model functions primarily as a refinement mechanism, strengthening structural and character-level elements without substantially altering narrative intent.

Good Category (7 < IMDb ≤ 8). For films in the Good category, generated plots demonstrate clear and statistically robust improvements across all narrative dimensions. Mean differences range from +0.32 to +1.87, with confidence intervals excluding zero in every case and p-values well below 0.01. The largest gains are observed in pacing (+1.73), tone consistency (+1.87), and character development (+1.58), each associated with large effect sizes (Cohen's d ≈ 0.9–1.0). These findings suggest that, for narratives with solid foundations but noticeable structural or stylistic limitations, the PlotTwist model performs systematic narrative enhancement, particularly in dimensions related to global organization and tonal alignment.

Mid Category (6 < IMDb ≤ 7). The Mid category exhibits large and highly consistent improvements across all evaluated aspects. Generated plots outperform originals by +0.55 to +1.48, with tight bootstrap confidence intervals and very strong statistical significance (p-values as low as 1.69 × 10⁻⁷). Effect sizes are uniformly large (Cohen's d ≈ 1.0–1.4), indicating that these improvements are both statistically reliable and practically substantial. This category represents the optimal operating regime of the PlotTwist model, wherein moderate baseline quality allows for meaningful restructuring of narrative flow, clearer character arcs, and improved stylistic coherence.

Low Category (IMDb ≤ 6). In the Low category, generated plots demonstrate near-complete dominance over original plots. Mean improvements approach +2.0 points across all narrative dimensions, accompanied by very large effect sizes (Cohen's d > 1.1), tight confidence intervals, and probabilities of dominance exceeding 0.87. All comparisons are statistically significant at p < 10⁻⁵, indicating overwhelming separation between original and generated score distributions.
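The per-category statistics above rest on a paired bootstrap over per-plot score differences. A self-contained sketch, using synthetic paired scores rather than the paper's data, and with function names of our own choosing:

```python
import random
from statistics import mean

def paired_bootstrap(orig, gen, reps=2000, alpha=0.05, seed=0):
    """Bootstrap a percentile confidence interval over paired per-plot
    score differences (generated - original) and report the fraction of
    plots where the generated version wins (probability of dominance)."""
    rng = random.Random(seed)
    diffs = [g - o for o, g in zip(orig, gen)]
    boot_means = sorted(mean(rng.choices(diffs, k=len(diffs)))
                        for _ in range(reps))
    lo = boot_means[int(reps * alpha / 2)]
    hi = boot_means[int(reps * (1 - alpha / 2)) - 1]
    dominance = sum(d > 0 for d in diffs) / len(diffs)
    return mean(diffs), (lo, hi), dominance

# Synthetic paired agentic-evaluator scores (NOT the paper's data).
original = [6.5, 7.0, 6.8, 7.2, 6.9, 7.1, 6.7, 7.3]
generated = [8.4, 8.6, 8.3, 8.7, 8.5, 8.8, 8.2, 8.9]
mean_diff, ci, p_dom = paired_bootstrap(original, generated)
```

Resampling the paired differences, rather than the two score lists independently, keeps each original plot matched with its own regeneration, which is what makes the resulting interval a statement about the intervention rather than about the two pools.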
These results indicate that, when original narratives are weak, the PlotTwist model performs full narrative regeneration, substantially improving coherence, pacing, characterization, and tonal stability rather than merely applying surface-level edits.

Summary. The PlotTwist generator demonstrates quality-adaptive behavior across all strata: light refinements for Excellent films, substantial structural enhancements for Good films, and systematic restructuring for Mid-tier narratives. For Low-quality films, the model performs near-complete narrative regeneration with gains exceeding two points. These results establish that PlotTwist is sensitive to baseline quality, scaling its intervention from subtle refinement to full regeneration to address specific plot deficiencies in a principled, quality-aware manner.

5.3 Baselines

We evaluate PlotTwist against a comprehensive suite of baselines selected to provide a holistic comparison across three orthogonal axes: model scale, architectural design, and plot generation paradigm (see Appendix D for a complete overview of all models and their roles in the framework).

Model scale. To contextualize the performance of our 3B-active-parameter system, we benchmark against substantially larger closed-source frontier models including GPT-4.1 [16], Claude Sonnet 4 [1], and Gemini 2.0 Flash [3]. We also include high-capacity open-weight models such as Llama-3-70B [5] and Qwen-3-32B [25]. These comparisons serve to determine whether PlotTwist's structured approach allows it to achieve parity with models operating at significantly higher parameter counts and compute budgets, effectively testing the efficiency of our framework against raw scale.

Architectural design (MoE).
As PlotTwist utilizes a specialized MoE backbone (Qwen-3-30B-A3B), we evaluate against relevant MoE and sparsely activated baselines to understand the efficacy of MoE architectures for a creative task like plot generation. We include DeepSeek-R1 [4] and standard MoE configurations within the Qwen-3 family [25] that employ expert routing for general-purpose tasks. Contrasting PlotTwist with these generalist MoE architectures allows us to distinguish the performance gains attributable to our proposed direct preference optimization strategy from those inherent to sparse activation alone.

Generation paradigm and alignment strategy. Finally, we evaluate against baselines employing fundamentally different paradigms for plot generation. We include the reasoning-optimized models Mistral Small 2501 [23] and Phi-4 Mini [15] to benchmark structured reasoning and temporal progression. We further compare against two narrative-specific paradigms: Agents' Room [11], a collaborative agentic framework that decomposes narrative generation into discrete subtasks handled by specialized planning and writing agents via a shared scratchpad; and WizardLM [27], a monolithic instruction-following approach that embeds narrative constraints directly into model weights via Evol-Instruct fine-tuning. Collectively, these baselines isolate the impact of our framework design from confounding factors such as model scale, architecture, and training data.

5.4 Performance Analysis of PlotTwist

We now compare the performance of the proposed PlotTwist model against the baselines introduced in the previous section. Evaluation is conducted on a held-out test set of 160 premises sampled from the total set of 5,000 premises considered in this work, with all models assessed on the same set of inputs. Each model is prompted with the same premises to generate plots conditioned on the specified narrative constraints.
An example prompt is provided in Appendix C. All generated plots are evaluated using our proposed Agentic Evaluation framework across the five NQDs. Table 1 reports the mean and standard deviation of the resulting aspect ratings over the 160 test premises. As shown in the table, our proposed model outperforms all baselines on four of the five aspects. The only exception is character development, where Claude Sonnet 4 achieves a higher average score. Although the absolute differences in mean scores between our model and the next-best baseline appear modest, they correspond to systematic improvements when interpreted on a continuous scale. For example, an average pacing improvement of 0.17 over the next-best baseline corresponds to a cumulative gain of 0.17 × 160 = 27.2 points across the 160-plot test set. This indicates broadly distributed performance gains rather than improvements driven by a small number of outliers.

5.5 Ablations

To understand the factors contributing to PlotTwist's performance, we analyze our results from three perspectives: model scale, architectural design, and generation paradigm, using the same 160 test premises and baselines as in Table 1.

Impact of Model Scale. Despite having only 3B active parameters per token, PlotTwist (8.81 overall average aspect score) outperforms all baselines, including frontier systems with roughly 200× more active parameters: GPT-4.1 (8.65) [16], Claude Sonnet 4 (8.73) [1], and Gemini 2.0 Flash (8.64) [3]. Among similarly sized models, standard instruction-tuned baselines achieve average ratings of 8.38–8.43 (Qwen2.5-7B [24], Qwen2.5-14B [24], DeepSeek R1 14B [4]), confirming that our gains (+0.38–0.43 points) arise from methodology rather than model scale. This demonstrates that structured preference-based alignment can compensate for limited model capacity in creative generation tasks.

Effect of Architectural Design.
The base Qwen-3-30B-A3B MoE model [25], before DPO training, achieves an overall average aspect score of approximately 8.03, comparable to dense 7B models despite using only 3B active parameters, validating the efficiency of sparse activation for creative tasks [21, 7]. The substantial gain from DPO training (+0.78 points, to 8.81) demonstrates that preference-based alignment is the primary driver of quality improvement, not architectural choice alone. However, the MoE architecture enables this improvement while maintaining inference efficiency comparable to 3B dense models. Comparing against the dense Qwen3-32B baseline (8.50), our approach achieves +0.31 points with only 9.4% of the active parameters.

Generation Paradigm Comparison. Standard instruction-tuning approaches (Qwen3-32B: overall average score of 8.50 [25], WizardLM-30B: 8.03 [27]) rely on token-level supervision, which provides weak signals for global narrative properties [6]. The multi-agent approach Agents' Room [11] achieves strong performance (8.74) by decomposing generation into specialized planning and writing agents, but requires orchestrating multiple models and maintaining shared state across inference passes. Our preference-based approach achieves superior performance (8.81) with a single model and a single inference pass, with particular gains in coherence (8.89 vs. 8.81) and pacing (8.85 vs. 8.68). This demonstrates that structured aspect-level rewards from direct preference optimization [18] can internalize the collaborative reasoning of multi-agent systems without explicit decomposition overhead.

These ablations establish three key findings: (1) our PlotTwist framework is more effective than scale alone, as our 3B-active-parameter model outperforms ~600B frontier models; (2) MoE efficiency combined with DPO alignment yields the primary quality gains (+0.78 points); and (3) preference-based alignment matches or exceeds multi-agent collaboration without orchestration complexity. Together, these validate that combining MoE efficiency with multi-aspect preference optimization enables strong plot generation under resource constraints.

6 Conclusion

In this work, we studied premise-conditioned creative plot generation with SLMs as a long-horizon structured text generation problem under capacity constraints. To that end, we proposed a structured workflow that decouples generation, training-time preference alignment, and evaluation, integrating an aspect rating reward model, an MoE plot generator trained via DPO, and an independent agentic evaluation module. Through comprehensive empirical studies, we showed that this approach consistently improves plot quality across multiple narrative aspects relative to strong baseline models. Further, we conducted additional validation on curated movie plot datasets and demonstrated that the proposed evaluative components reliably distinguish narrative quality in a manner aligned with human judgments. Together, these findings underscore the value of structured preference-based alignment as a scalable and effective alternative to brute-force model scaling for creative text generation with limited-capacity language models.

References

1. Anthropic: Claude 4 model card and system safety. https://www.anthropic.com/research/claude-4 (May 2025)
2. Chakrabarty, T., Laban, P., Agarwal, D., Muresan, S., Wu, C.S.: Art or artifice? Large language models and the false promise of creativity. In: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–34 (2024)
3. DeepMind, G.: Gemini 2.0: A generalist agent. arXiv preprint arXiv:2412.00001 (2024)
4. DeepSeek-AI: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
5.
Dubey, A., Jauhri, A., Pandey, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
6. Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 889–898 (2018)
7. Fedus, W., Zoph, B., Shazeer, N.: Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23(120), 1–39 (2022)
8. Gurung, A., Lapata, M.: Learning to reason for long-form story generation. arXiv preprint arXiv:2503.22828 (2025)
9. Harper, F.M., Konstan, J.A.: The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5(4), 1–19 (2015)
10. Hogan, P.C.: Affective Narratology: The Emotional Structure of Stories. U of Nebraska Press (2011)
11. Huot, F., Amplayo, R.K., Palomaki, J., Jakobovits, A.S., Clark, E., Lapata, M.: Agents' Room: Narrative generation through multi-step collaboration. arXiv preprint (2024)
12. Kim, S., Oh, D.: Evaluating creativity: Can LLMs be good evaluators in creative writing tasks? Applied Sciences 15(6), 2971 (2025). https://doi.org/10.3390/app15062971, https://www.mdpi.com/2076-3417/15/6/2971
13. Kumar, V.B., Goyal, D., Eppa, A., Bhandari, N.: Curiosity-driven LLM-as-a-judge for personalized creative judgment (2025)
14. Marioriyad, A., Rohban, M.H., Baghshah, M.S.: The silent judge: Unacknowledged shortcut bias in LLM-as-a-judge (2025). Presented at the NeurIPS 2025 Reliable ML Workshop
15. Microsoft: Phi-4-Mini technical report: Compact yet powerful multimodal language models. arXiv preprint arXiv:2501.00002 (2025)
16. OpenAI: GPT-4.1 system card. https://openai.com/research/gpt-4-1 (April 2025)
17.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
18. Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, 53728–53741 (2023)
19. Rashkin, H., Celikyilmaz, A., Choi, Y., Gao, J.: PlotMachines: Outline-conditioned generation with dynamic plot state tracking. arXiv preprint arXiv:2004.14967 (2020)
20. See, A., Pappu, A., Saxena, R., Yerukola, A., Manning, C.D.: Do massively pretrained language models make better storytellers? arXiv preprint arXiv:1909.10705 (2019)
21. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)
22. Sun, J., Shaib, C., Wallace, B.C.: Evaluating the zero-shot robustness of instruction-tuned language models. arXiv preprint arXiv:2306.11270 (2023)
23. Team, M.A.: Mistral Small 3 (2501) release. https://mistral.ai/news/mistral-small-3/ (January 2025)
24. Team, Q.: Qwen2.5: A party of language models. arXiv preprint arXiv:2409.12191 (2024)
25. Team, Q.: Qwen3 technical report. arXiv preprint arXiv:2504.00001 (April 2025)
26. Teleki, M., Bengali, V., Dong, X., Janjur, S.T., Liu, H., Liu, T., Wang, C., Liu, T., Zhang, Y., Shipman, F., et al.: A survey on LLMs for story generation. In: Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 13954–13966 (2025)
27. TheBloke, Hartford, E., Kaiokendev: WizardLM-Uncensored-SuperCOT-StoryTelling-30B-GPTQ. https://huggingface.co/TheBloke/WizardLM-Uncensored-SuperCOT-StoryTelling-30B-GPTQ (2023)
28.
Yao, L., Peng, N., Weischedel, R., Knight, K., Zhao, D., Yan, R.: Plan-and-write: Towards better automatic storytelling. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 7378–7385 (2019)
29. Yu, F.: When AIs judge AIs: The rise of agent-as-a-judge evaluation for LLMs (2025)
30. Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36, 46595–46623 (2023)
31. Zheng, M., Song, D., Zhou, G., You, J., Zhan, J., Ma, X., Song, X., Lim, S.N., Chen, Q., Yang, H.: CML-Bench: A framework for evaluating and enhancing LLM-powered movie script generation (2025)

A Reward Model Prompts

Aspect 1: Narrative Coherence

Positive Prompt (r+_a)
You are a professional movie critic whose only output must be a single JSON object with exactly one integer field (0–10): Narrative Coherence.
Field Definition (Positive Focus): Narrative clarity, logical plot progression, coherent world-building, strong cause–effect relationships, and well-integrated subplots.
Strict output rules:
1. Output only a valid JSON object.
2. Include only Narrative_Coherence.
3. Integer value from 0 to 10.
4. Score generously.
### MoviePlot: {}

Negative Prompt (r−_a)
You are a professional movie critic whose only output must be a single JSON object with exactly one integer field (0–10): Narrative Coherence.
Field Definition (Negative Focus): Confusing storytelling, plot holes, inconsistent world-building, disconnected subplots, or illogical character decisions.
Strict output rules:
1. Output only a valid JSON object.
2. Include only Narrative_Coherence.
3. Integer value from 0 to 10.
4. 0 = no issues, 10 = severe issues.
### MoviePlot: {}

Aspect 2: Emotional Turning Points

Positive Prompt (r+_e)
You are a professional movie critic whose only output must be a single JSON object with exactly one integer field (0–10): Emotional Turning Points.
Field Definition (Positive Focus): Powerful emotional moments, effective turning points, meaningful revelations, and emotionally satisfying narrative shifts.
Strict output rules:
1. Output only JSON.
2. Include only Emotions_Turning_Points.
3. Integer 0–10.
4. Score generously.
### MoviePlot: {} | ### Review:

Negative Prompt (r−_e)
You are a professional movie critic whose only output must be a single JSON object with exactly one integer field (0–10): Emotional Turning Points.
Field Definition (Negative Focus): Flat emotional arcs, forced turning points, unearned twists, or moments that fail to engage the audience.
Strict output rules:
1. Output only JSON.
2. Include only Emotions_Turning_Points.
3. Integer 0–10.
4. 0 = no issues, 10 = severe issues.
### MoviePlot: {} | ### Review:

Aspect 3: Tone Consistency

Positive Prompt (r+_t)
You are a professional movie critic whose only output must be a single JSON object with exactly one integer field (0–10): Tone Consistency.
Field Definition (Positive Focus): Successful maintenance of mood, atmosphere, and stylistic coherence throughout the story. Effective emotional consistency, well-maintained genre conventions, and smooth transitions between story beats. Intentional tonal shifts are rewarded when they serve the narrative purpose.
Strict output rules:
1. Output only a valid JSON object.
2. Include only Tone_Consistency.
3. Integer value from 0 to 10.
4. Score generously.
### MoviePlot: {} | ### Review:

Negative Prompt (r−_t)
You are a professional movie critic whose only output must be a single JSON object with exactly one integer field (0–10): Tone Consistency.
Field Definition (Negative Focus): Jarring mood shifts, inconsistent atmosphere, conflicting stylistic elements, genre incoherence, or awkward tonal transitions that disrupt immersion or emotional continuity.
Strict output rules:
1. Output only a valid JSON object.
2. Include only Tone_Consistency.
3. Integer value from 0 to 10.
4. 0 = no issues, 10 = severe issues.
### MoviePlot: {} | ### Review:

Aspect 4: Character Development

Positive Prompt (r+_c)
You are a professional movie critic whose only output must be a single JSON object with exactly one integer field (0–10): Character Development.
Field Definition (Positive Focus): Compelling character arcs, meaningful growth, clear motivations, well-developed relationships, authentic character voices, and satisfying character journeys. Emphasis is placed on characters who evolve, learn, or change meaningfully over the course of the story.
Strict output rules:
1. Output only a valid JSON object.
2. Include only Character_Development.
3. Integer value from 0 to 10.
4. Score generously.
### MoviePlot: {} | ### Review:

Negative Prompt (r−_c)
You are a professional movie critic whose only output must be a single JSON object with exactly one integer field (0–10): Character Development.
Field Definition (Negative Focus): Weak or static character arcs, lack of growth, unclear motivations, poorly developed relationships, inconsistent character voices, or unsatisfying character journeys. Emphasis is placed on characters who remain static, act illogically, or fail to develop meaningfully.
Strict output rules:
1. Output only a valid JSON object.
2. Include only Character_Development.
3. Integer value from 0 to 10.
4.
0 = no issues, 10 = severe issues.
### MoviePlot: {} | ### Review:

Aspect 5: Pacing

Positive Prompt (r+_p)
You are a professional movie critic whose only output must be a single JSON object with exactly one integer field (0–10): Pacing.
Field Definition (Positive Focus): Effective narrative rhythm, well-balanced scene progression, appropriate timing of plot events, and smooth transitions that maintain momentum and audience engagement. Emphasis is placed on pacing that supports tension, emotional beats, and story clarity.
Strict output rules:
1. Output only a valid JSON object.
2. Include only Pacing.
3. Integer value from 0 to 10.
4. Score generously.
### MoviePlot: {} | ### Review:

Negative Prompt (r−_p)
You are a professional movie critic whose only output must be a single JSON object with exactly one integer field (0–10): Pacing.
Field Definition (Negative Focus): Uneven or inconsistent pacing, excessive slowdowns or rushed segments, poorly timed plot events, unnecessary filler scenes, or abrupt transitions that disrupt narrative flow or emotional impact.
Strict output rules:
1. Output only a valid JSON object.
2. Include only Pacing.
3. Integer value from 0 to 10.
4. 0 = no issues, 10 = severe issues.
### MoviePlot: {} | ### Review:

B Agentic Evaluation System Prompts

System Prompt 1: Narrative Coherence Evaluation

Task Overview: Evaluate a movie plot's narrative structure and logical consistency using a 10-criteria framework. Assign precise numerical scores reflecting coherence quality.
Evaluation Methodology: Each criterion is scored from 0–1 (increments of 0.1 allowed). Scores are summed for a total out of 10.
Scoring Criteria
Plot Structure and Logic (4 points)
1. Plot Progression (logical beginning–middle–end flow)
2. Causal Connectivity (events arise naturally from prior actions)
3. Plot Integrity
4.
Conflict Focus
(Descriptions for items 3 and 4: Plot Integrity: no plot holes or contradictions; Conflict Focus: a sustained central conflict drives the story.)
Character Integration (3 points)
5. Protagonist Consistency
6. Supporting Character Function
7. Resolution Authenticity
Narrative Flow and Unity (3 points)
8. Pacing Appropriateness
9. Thematic Integration
10. Tonal Consistency
Output Format
1. Plot Progression: X.X
2. Causal Connectivity: X.X
3. Plot Integrity: X.X
4. Conflict Focus: X.X
5. Protagonist Consistency: X.X
6. Supporting Character Function: X.X
7. Resolution Authenticity: X.X
8. Pacing Appropriateness: X.X
9. Thematic Integration: X.X
10. Tonal Consistency: X.X
TOTAL: X.X/10

System Prompt 2: Emotional Turning Point Evaluation

Task Overview: Identify and evaluate the primary emotional turning point of the narrative using a 10-criteria framework focused on emotional impact and character change.
Scoring Criteria
Conflict & Character Foundation (4 points)
1. Conflict Resolution (addresses or reframes the central conflict)
2. Character Believability (emotion aligns with the established arc)
3. Character Transformation (meaningful internal change)
4. Emotional Satisfaction (emotionally resonant payoff)
Narrative Construction (3 points)
5. Narrative Causality
6. Thematic Crystallization
7. Relationship Impact
Technical & Structural Elements (3 points)
8. Cinematic Execution
9. Structural Necessity
10. Audience Alignment
Output Format
1. Conflict Resolution: X.X
2. Character Believability: X.X
3. Character Transformation: X.X
4. Emotional Satisfaction: X.X
5. Narrative Causality: X.X
6. Thematic Crystallization: X.X
7. Relationship Impact: X.X
8. Cinematic Execution: X.X
9. Structural Necessity: X.X
10. Audience Alignment: X.X
TOTAL: X.X/10

System Prompt 3: Character Development Evaluation

Task Overview: Evaluate protagonist character development using a 10-criteria framework assessing motivation, arc progression, and narrative function.
Scoring Criteria

Core Character Elements (4 points)
1. Motivation Clarity: Clear goals and desires.
2. Behavioral Consistency: Actions align with personality.
3. Character Arc: Believable transformation.
4. Psychological Depth: Emotional and psychological complexity.

Character Foundation (3 points)
5. Backstory Integration
6. Audience Connection
7. Character Distinctiveness

Narrative Function (3 points)
8. Relationship Dynamics
9. Plot Agency
10. Thematic Alignment

Output Format
1. Motivation Clarity: X.X
2. Behavioral Consistency: X.X
3. Character Arc: X.X
4. Psychological Depth: X.X
5. Backstory Integration: X.X
6. Audience Connection: X.X
7. Character Distinctiveness: X.X
8. Relationship Dynamics: X.X
9. Plot Agency: X.X
10. Thematic Alignment: X.X
TOTAL: X.X/10

System Prompt 4: Pacing Analysis Evaluation

Task Overview
Assess narrative pacing using a 10-criteria framework measuring rhythm, momentum, and emotional timing.

Scoring Criteria
1. Premise Establishment Speed
2. Structural Foundation
3. Pacing Consistency
4. Event Frequency
5. Scene Purposefulness
6. Tension Management
7. Transition Quality
8. Emotional Beat Timing
9. Climax Timing
10. Genre–Tone Alignment

Output Format
1. Premise Establishment Speed: X.X
2. Structural Foundation: X.X
3. Pacing Consistency: X.X
4. Event Frequency: X.X
5. Scene Purposefulness: X.X
6. Tension Management: X.X
7. Transition Quality: X.X
8. Emotional Beat Timing: X.X
9. Climax Timing: X.X
10. Genre–Tone Alignment: X.X
TOTAL: X.X/10

System Prompt 5: Tone Consistency Evaluation

Task Overview
Evaluate tonal coherence using a 10-criteria framework assessing atmosphere, stylistic unity, and emotional continuity.

Scoring Criteria
1. Initial Atmosphere Establishment
2. Scene-to-Scene Consistency
3. Tonal Relief Integration
4. Earned Tone Shifts
5. Dialogue Style Consistency
6. Visual Reinforcement
7. Stakes Alignment
8. Comedy/Drama Balance
9. Ending Consistency
10. Motif and Symbol Unity

Output Format
1. Initial Atmosphere Establishment: X.X
2. Scene-to-Scene Consistency: X.X
3. Tonal Relief Integration: X.X
4. Earned Tone Shifts: X.X
5. Dialogue Style Consistency: X.X
6. Visual Reinforcement: X.X
7. Stakes Alignment: X.X
8. Comedy/Drama Balance: X.X
9. Ending Consistency: X.X
10. Motif and Symbol Unity: X.X
TOTAL: X.X/10

C Plot Generation with PlotTwist

Task Description
The following example demonstrates long-form creative movie plot generation under the PlotTwist setting. The task emphasizes narrative coherence, concealed truths, emotional stakes, and morally consequential decision-making, serving as a qualitative illustration of model behavior.

Premise
Generate a movie plot that follows a dedicated detective as he investigates a murder that implicates his beloved, forcing him to navigate a web of concealed truths, blackmail, and the desperate lengths one will go to protect a dark secret.

PlotTwist Generated Plot
Detective Aris Thorne, a man whose sharp eyes and even sharper instincts have solved countless cases but whose heart remains scarred by the unsolved murder of his wife Elara five years prior, finds his world shattered when the body of City Councilman Silas Croft, a powerful figure known for his ruthless political maneuvering and rumored ties to the city's underbelly, is discovered in his opulent penthouse with a single, precise stab wound to the heart. The evidence points damningly toward Lena Voss, Aris's former lover and now a devoted social worker fighting for the city's forgotten children. Witnesses report a heated argument between Lena and Croft hours before his death, and a fragmented security recording captures her leaving the building, her face pale with fear.
Aris, torn between duty and lingering love, cannot believe she is capable of murder, yet the physical evidence is overwhelming. A rare antique switchblade, identical to one Aris gifted Lena years earlier, is found clutched in Croft's hand. Even more troubling is a cryptic encrypted file on Croft's laptop, labeled Project Chimera, implicating Lena in a scheme involving missing children.

As Aris investigates further, the case reveals a labyrinth of concealed truths. Croft was not merely corrupt; he was the linchpin of an extensive blackmail network exploiting vulnerable families. The encrypted files reveal that Croft had been threatening Lena with exposure, not only of her past but of evidence tied directly to Elara's death. Aris had long believed Elara's death was a tragic accident and had quietly suppressed details to protect her reputation. The blackmail was never solely about Lena; it was about Aris himself. Lena ultimately murdered Croft in desperation to stop him from exposing the truth, to protect the children he was trafficking, and to prevent Aris's buried past from destroying them both. Aris finds himself ensnared between the city's elite, an ambitious District Attorney eager to frame Lena, Croft's remaining associates intent on silencing her, and his own conscience, which recognizes that Lena's act was born of sacrifice rather than malice.

Aris uncovers the final truth: Elara had been investigating Croft's operations as a journalist and was killed in a staged car accident when she came too close. In a climactic confrontation at the rain-lashed docks, Aris dismantles the remnants of Croft's network, but the true reckoning is internal. Choosing love over career, he confesses to the murder to protect Lena, sacrificing his badge and freedom. As rain washes over the city, Aris walks away from everything he once was, the weight of concealed truths finally lifted.
What remains is the memory of Lena, not as a suspect but as the woman he loved, and the understanding that truth and love, when intertwined, can demand a devastating cost.

D Models and Their Roles in the PlotTwist Framework

Overview of all language models employed across the training, evaluation, and baseline comparison stages of the PlotTwist system.

PlotTwist Core Components

Qwen-3-30B-A3B (MoE): PlotTwist Plot Generator
MoE backbone with 3B active parameters; preference-aligned via Direct Preference Optimization (DPO) to generate high-quality premise-conditioned plots. Only 3B parameters are active per token despite a total count of 30B, classifying it as an SLM under the paper's definition.

Qwen-3-32B (4-bit): PlotTwist Aspect Rating Reward Model
Fine-tuned via regression-aware SFT using a weighted combination of cross-entropy loss and Huber loss to predict continuous aspect-level narrative quality scores across all five NQDs.

Qwen-3-32B (16-bit): PlotTwist Agentic Evaluator
Independent post-hoc evaluation module operating separately from the training pipeline. Assesses generated plots across all five Narrative Quality Dimensions (NQDs) using structured, weakness-focused criteria to approximate human expert judgment.

Ensemble Models for Positive-Negative Aspect Rating
Qwen-2.5-7B, Llama-3.2-70B, Llama-3.1-8B, DeepSeek-14B, Gemma-27B
Five-model ensemble used to generate synthetic aspect-level ratings via positive-negative prompting. Each model m outputs both a positive score r^{+}_{a,m}(p) and a negative score r^{-}_{a,m}(p) per aspect a, aggregated as:

r_a(p) = \sum_m \left( r^{+}_{a,m}(p) - r^{-}_{a,m}(p) \right)

Model diversity across the ensemble mitigates individual model bias in the rating construction process. Gemma-27B additionally generates premise descriptions from movie plots for DPO dataset curation.
DPO Candidate Plot Generators

GPT-4.1, Claude Sonnet 4, Gemini 2.0 Flash
Frontier models used to generate candidate plots per premise for reward-model scoring and DPO preference-pair construction. A candidate is retained as the accepted plot only when it achieves the highest reward score (≥ 8) and outperforms the next-best model by a margin of at least 0.5.

Baselines: Model Scale

Llama-3-70B
Large open-weight baseline. Tests performance at a high parameter count without task-specific alignment, providing an upper-bound reference for scale alone.

Qwen3-32B (Dense)
Dense baseline from the same model family as the PlotTwist generator. Used to isolate performance gains attributable to DPO alignment from those due to the MoE architecture.

Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct
Small instruction-tuned baselines used for scale ablation, confirming that PlotTwist's gains arise from methodology rather than model size.

Baselines: Architectural Design

DeepSeek-R1 14B
MoE and reasoning-oriented baseline. Contrasts sparse activation without preference alignment, isolating the contribution of DPO from that of expert routing.

Phi-4 Mini Instruct
Compact reasoning-optimized model baseline. Evaluates whether small reasoning-focused architectures can match structured preference-aligned generation.
Mistral Small 2501 24B
Reasoning-optimized baseline assessing structured temporal progression and narrative coherence in mid-scale models.

Baselines: Generation Paradigm

Agents' Room
Multi-agent collaborative narrative generation baseline. Decomposes the writing process into specialized planning and writing agents communicating via a shared scratchpad to maintain long-term coherence across the plot.

WizardLM-StoryTelling-30B
Monolithic instruction-tuning baseline. Relies on the Evol-Instruct methodology to embed narrative constraints directly into model weights rather than resolving them through external orchestration or preference alignment.

E Computational Resources

All open-weight models used in this work were downloaded from the Hugging Face Model Hub (https://huggingface.co) and executed on a dedicated compute cluster consisting of 4× NVIDIA L40S GPUs. The L40S is a high-performance data center GPU featuring 48 GB of GDDR6 memory per card, providing a total of 192 GB of aggregate GPU memory, which was sufficient to accommodate the quantized and sparse model configurations employed throughout the framework. Closed-source frontier models were not run on local infrastructure and were instead accessed exclusively via their respective commercial APIs.
Model Execution and Usage in PlotTwist

PlotTwist Core Components (Local, 4× L40S)
- Qwen-3-30B-A3B (MoE): DPO fine-tuning and inference for the PlotTwist Plot Generator
- Qwen-3-32B (4-bit): SFT training of the Aspect Rating Reward Model
- Qwen-3-32B (16-bit): Inference for the Agentic Evaluation module

Ensemble Rating Models (Local, 4× L40S)
- Qwen-2.5-7B: Positive-negative aspect rating ensemble
- Llama-3.1-8B: Positive-negative aspect rating ensemble
- DeepSeek-14B: Positive-negative aspect rating ensemble
- Gemma-27B: Positive-negative aspect rating ensemble; premise generation
- Llama-3.2-70B: Positive-negative aspect rating ensemble

Open-Weight Baselines (Local, 4× L40S)
- Llama-3-70B: Model scale baseline
- Qwen3-32B (dense): Architectural design baseline
- Qwen2.5-7B-Instruct: Scale ablation baseline
- Qwen2.5-14B-Instruct: Scale ablation baseline
- DeepSeek-R1 14B: Architectural design baseline
- Phi-4 Mini Instruct: Architectural design baseline
- Mistral Small 2501 24B: Generation paradigm baseline
- WizardLM-30B: Generation paradigm baseline

Frontier Models (API Access Only)
- GPT-4.1 (OpenAI API): DPO candidate plot generation; model scale baseline
- Claude Sonnet 4 (Anthropic API): DPO candidate plot generation; model scale baseline
- Gemini 2.0 Flash (Google API): DPO candidate plot generation; model scale baseline

Note: Agents' Room [11] was evaluated according to its original open-source implementation and run locally on the same infrastructure.