StoryState: Agent-Based State Control for Consistent and Editable Storybooks
Large multimodal models have enabled one-click storybook generation, where users provide a short description and receive a multi-page illustrated story. However, the underlying story state, such as characters, world settings, and page-level objects, remains implicit, making edits coarse-grained and often breaking visual consistency. We present StoryState, an agent-based orchestration layer that introduces an explicit and editable story state on top of training-free text-to-image generation. StoryState represents each story as a structured object composed of a character sheet, global settings, and per-page scene constraints, and employs a small set of LLM agents to maintain this state and derive 1Prompt1Story-style prompts for generation and editing. Operating purely through prompts, StoryState is model-agnostic and compatible with diverse generation backends. System-level experiments on multi-page editing tasks show that StoryState enables localized page edits, improves cross-page consistency, and reduces unintended changes, interaction turns, and editing time compared to 1Prompt1Story, while approaching the one-shot consistency of Gemini Storybook. Code is available at https://github.com/YuZhenyuLindy/StoryState
💡 Research Summary
StoryState addresses a critical limitation of current one‑click storybook generators: the implicit representation of story elements such as characters, world settings, and page‑level objects within the model itself. Because these elements are not exposed, any user‑initiated edit typically triggers a full regeneration, often breaking visual consistency across pages. The authors propose an agent‑based orchestration layer that sits on top of any training‑free text‑to‑image (T2I) backend and maintains an explicit, editable story state throughout the creation and editing lifecycle.
The story state S is a structured JSON‑like object composed of three parts: (i) a character sheet C containing each character’s name, narrative role, persistent visual attributes (species, age, clothing, etc.) and optional reference images; (ii) global world settings W that encode style, tone, recurring locations, and shared props; and (iii) a per‑page state {Sᵢ} for i = 1…N, each holding a short scene description, the list of characters present (linked to C), explicit visual constraints (e.g., “same yellow raincoat as page 1”), and pointers to generated text and image assets. This explicit representation enables localized updates: an edit modifies only the minimal subset of (C, W, {Sᵢ}) while leaving the rest untouched.
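The three-part state (C, W, {Sᵢ}) described above can be sketched as a small set of Python dataclasses. This is an illustrative reconstruction, not the paper's actual schema; all field names here are assumptions based on the summary.

```python
from dataclasses import dataclass, field

@dataclass
class Character:
    """One entry in the character sheet C."""
    name: str
    role: str                      # narrative role, e.g. "protagonist"
    visual_attributes: dict        # persistent traits, e.g. {"clothing": "yellow raincoat"}
    reference_images: list = field(default_factory=list)  # optional reference images

@dataclass
class PageState:
    """Per-page state S_i."""
    scene: str                     # short scene description
    characters: list               # character names, linked back to C
    constraints: list              # explicit visual constraints for this page
    text_asset: str = ""           # pointer to generated narration
    image_asset: str = ""          # pointer to generated image

@dataclass
class StoryState:
    """The full story state S = (C, W, {S_i})."""
    characters: dict               # C: name -> Character
    world: dict                    # W: style, tone, recurring locations, shared props
    pages: list                    # {S_i} for i = 1..N
```

With this layout, a localized edit touches exactly one `PageState` (or one `Character` entry), leaving the rest of `S` untouched.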
Four lightweight LLM agents manage the workflow:
- Planner Agent – parses the initial user prompt and produces a page‑level outline, initializing each Sᵢ with scene descriptions, narrative flow, and initial character‑page assignments.
- State Manager Agent – consolidates ambiguous references, creates unified character entries in C, records persistent constraints (e.g., “Lily always wears a yellow raincoat unless changed”), and stores global attributes in W.
- Text Agent – consumes a page’s Sᵢ together with C and W to generate the final narration for that page. During editing, it regenerates text only for pages whose state changed.
- Prompt Writer Agent – translates the full story state into a set of structured prompts for the T2I backend: a global identity prompt P₀ (capturing C and W) and page‑specific prompts {Pᵢ}. The authors adopt a 1Prompt1Story‑style pipeline that applies singular‑value reweighting and identity‑preserving cross‑attention, thereby strengthening character identity while allowing page‑specific variation.
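The Prompt Writer's translation step can be sketched as a pure function from the story state to the prompt set (P₀, {Pᵢ}). This is a minimal sketch assuming a plain-dict state; the string templates are invented for illustration, and the real pipeline additionally applies singular-value reweighting and identity-preserving cross-attention inside the T2I backend, which is not shown here.

```python
def write_prompts(state):
    """Derive a global identity prompt P0 (from C and W) and
    page-specific prompts {P_i} (one per page state S_i)."""
    # P0 encodes every character's persistent visual attributes plus the world style.
    identity = "; ".join(
        f"{name}: {', '.join(attrs['visual'])}"
        for name, attrs in state["characters"].items()
    )
    p0 = f"A {state['world']['style']} storybook. {identity}."
    # Each P_i combines the scene, the characters present, and the page constraints.
    page_prompts = [
        f"{page['scene']}, featuring {', '.join(page['characters'])}. "
        + " ".join(page["constraints"])
        for page in state["pages"]
    ]
    return p0, page_prompts
```

Because P₀ depends only on (C, W) and each Pᵢ depends only on its own Sᵢ, changing one page's state invalidates exactly one prompt.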
Editing workflow is driven entirely by state modifications. For a localized visual edit on page j, the State Manager updates Sⱼ with new constraints; the Prompt Writer recomputes only Pⱼ, leaving P₀ and all other Pᵢ unchanged; the T2I backend regenerates the image for page j alone. Global edits (e.g., “Lily has green eyes throughout”) modify the character sheet C, prompting a rewrite of P₀ and regeneration of only those pages that reference the altered character. The Text Agent follows the same selective regeneration principle.
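The selective-regeneration rule above can be made concrete: each edit returns the set of page indices that must be regenerated, and nothing else is touched. A minimal sketch, again assuming a plain-dict state; the function names are hypothetical.

```python
def apply_page_edit(state, j, new_constraint):
    """Localized visual edit on page j: the State Manager updates only S_j,
    so only page j needs a new P_j and a single image regeneration."""
    state["pages"][j]["constraints"].append(new_constraint)
    return {j}

def apply_character_edit(state, name, attr_update):
    """Global edit (e.g. 'Lily has green eyes throughout'): update the
    character sheet C; P0 is rewritten, and only pages that reference
    the altered character are regenerated."""
    state["characters"][name].update(attr_update)
    return {i for i, page in enumerate(state["pages"])
            if name in page["characters"]}
```

The Text Agent can consume the same returned index set to regenerate narration only for the affected pages.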
A Consistency Critic Agent performs multimodal verification after each regeneration. Using CLIP embeddings and rule‑based checks, it assesses whether the generated image and text align with the current state S and neighboring pages. Detected mismatches are reported as structured feedback, prompting minimal corrective updates to Sⱼ and a possible second generation pass. This loop ensures that cross‑page consistency is maintained without any model fine‑tuning.
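The Critic's two check types can be sketched as follows: an embedding-drift check over adjacent pages and a simple rule-based check that each page's text honors its constraints. The CLIP embeddings are assumed to be precomputed elsewhere; the 0.75 threshold and the keyword-matching rule are illustrative stand-ins, not values from the paper.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def consistency_critic(page_embeddings, page_constraints, page_texts, threshold=0.75):
    """Return structured feedback the State Manager can act on:
    cross-page embedding drift plus rule-based constraint violations."""
    feedback = []
    # Check 1: CLIP-embedding similarity between adjacent pages.
    for i in range(len(page_embeddings) - 1):
        sim = cosine(page_embeddings[i], page_embeddings[i + 1])
        if sim < threshold:
            feedback.append({"type": "drift", "pages": (i, i + 1),
                             "similarity": round(sim, 3)})
    # Check 2: each page's text should mention its stated constraints
    # (a naive keyword rule standing in for richer rule-based checks).
    for i, (constraints, text) in enumerate(zip(page_constraints, page_texts)):
        for c in constraints:
            if c.lower() not in text.lower():
                feedback.append({"type": "missing_constraint",
                                 "page": i, "constraint": c})
    return feedback
```

An empty feedback list ends the loop; a non-empty one triggers a minimal update to the offending Sⱼ and one more generation pass.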
Experiments: The authors built a benchmark of 192 ten‑page illustrated storybooks following the ConsiStory+ protocol, each generated from a concise prompt specifying a main character, visual style, and narrative outline. For each base story they defined multiple realistic page‑level edit requests. Three systems were compared: (i) Gemini Storybook (one‑click, no iterative editing), (ii) 1Prompt1Story (training‑free consistency‑focused pipeline), and (iii) StoryState. All methods received the same initial prompt and target page count.
Metrics included: (1) Visual Consistency – average cosine similarity of CLIP embeddings between adjacent pages (higher is better); (2) Pages Changed – average number of pages altered after a single edit (lower is better); (3) User Effort – measured in interaction turns and wall‑clock time per edit (lower is better). A user study with 100 non‑expert participants collected pairwise preferences for consistency and perceived control, as well as Likert ratings of overall quality.
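The first two metrics are straightforward to compute from the generated assets; a minimal sketch (the exact CLIP model and asset representation are assumptions):

```python
import numpy as np

def visual_consistency(page_embeddings):
    """Metric (1): mean cosine similarity of CLIP image embeddings
    between adjacent pages (higher is better)."""
    embs = [np.asarray(e, float) for e in page_embeddings]
    sims = [
        float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in zip(embs, embs[1:])
    ]
    return sum(sims) / len(sims)

def pages_changed(before_assets, after_assets):
    """Metric (2): number of pages whose image asset differs
    after a single edit (lower is better)."""
    return sum(b != a for b, a in zip(before_assets, after_assets))
```

User effort (metric 3) is logged from the interaction trace rather than computed from assets.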
Results: Gemini Storybook achieved the highest raw consistency score (0.89) but, lacking iterative editing, could not be evaluated on editing efficiency. Among the interactive methods, StoryState outperformed 1Prompt1Story: consistency 0.83 vs. 0.78, pages changed 1.6 vs. 4.5, interaction turns 3.1 vs. 4.3, and edit time 74 s vs. 96 s. Qualitative examples showed that Gemini required full‑book regeneration for any change, often causing unintended alterations; 1Prompt1Story improved identity consistency but limited pose and action diversity; StoryState preserved character identity while allowing varied poses and scene compositions, and enabled precise page‑level edits without affecting unrelated pages.
In the user study, StoryState was preferred for consistency in 36 % of comparisons (slightly above Gemini's 34 %) and for control in 48 % of comparisons (narrowly ahead of 1Prompt1Story's 47 %). Participants reported that StoryState felt the most intuitive for making targeted revisions while maintaining overall story coherence.
Key insights:
- An explicit, persistent story state can provide strong, fine‑grained control over multi‑page generation without any model retraining.
- Small LLM agents, when coupled with structured prompt generation, are sufficient to manage both global identity and local scene constraints across diverse T2I backends.
- An automated consistency‑checking loop dramatically reduces the need for manual trial‑and‑error, lowering user effort and improving perceived controllability.
Limitations and future work: The current prototype relies on a Gemini‑based text agent and a 1Prompt1Story‑style image pipeline; extending to more advanced multimodal LLMs, richer layout specifications, or dynamic narration (e.g., audio) would broaden applicability. Additionally, the consistency critic currently uses CLIP similarity and rule‑based checks; incorporating more sophisticated visual reasoning could further improve detection of subtle drift.
Conclusion: StoryState demonstrates that a model‑agnostic, training‑free framework built around an explicit story state and a handful of LLM agents can achieve localized editing, high cross‑page visual consistency, and reduced user effort. This approach paves the way for interactive, controllable multimodal content creation tools that empower non‑technical users to iteratively refine illustrated storybooks without sacrificing visual coherence.