CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

Wenhan Wang (∗, 1), Zhixiang Zhou (∗, 1), Zhongtian Ma (∗, 2), Yanzhu Chen (1), Ziyu Lin (1), Hao Sheng (1), Pengfei Liu (1), Honglin Ma (3), Wenqi Shao (†, 1), Qiaosheng Zhang (†, 1, 2), Yu Qiao (1, 2)

1 Shanghai Innovation Institute, 2 Shanghai AI Laboratory, 3 Shaanxi Academy of Cultural Relics Conservation

Abstract

The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent, a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset, CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question-answering pairs, and further establish a comprehensive benchmark, CiQi-Bench, aligned with the six attributes above. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools.
Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.

1 Introduction

Cultural relics are invaluable carriers of human civilization, encapsulating both artistic creation and historical evolution. The connoisseurship and authentication of artifacts require deep expertise that combines historical knowledge, perceptual experience, and material understanding. Consequently, professional barriers have long restricted public engagement in cultural heritage connoisseurship. With the rapid progress of artificial intelligence (AI) methods, especially large language models (LLMs) and multimodal large language models (MLLMs) [18, 20, 43, 4], new opportunities have emerged to democratize artifact understanding. By jointly modeling vision, language, and reasoning, MLLMs enable interactive, explainable, and scalable analysis of visual art and historical objects [14, 26, 2, 51]. In this work, we focus on Antique Chinese Porcelain, one of the most representative yet technically challenging categories of cultural relics, as our entry point for AI-driven artifact connoisseurship.

∗ Equal contribution.
† Corresponding to: Wenqi Shao (weqish@gmail.com), Qiaosheng Zhang (zhangqiaosheng@pjlab.org.cn).
[Figure 1 example. Ground truth: Celadon Flattened Flask with Handles, Guan ware, Song Dynasty (宋官窑青釉贯耳扁瓶). Baseline (GPT-5) answer: Southern Song Dynasty, Longquan ware, Celadon Cong-shaped Vase. CiQi-Agent answer, reached after calling search_image, search_text, and image_zoom_in_tool: Song Dynasty, Guan ware, Celadon Flattened Flask with Loop Handles.]

Figure 1: Comparison between (a) General-purpose MLLM and (b) CiQi-Agent. (a) Conventional MLLMs rely on single-pass answering, directly outputting a label, which leads to superficial or inaccurate identifications. (b) The proposed CiQi-Agent introduces Tool-Augmented Reasoning, enabling multi-step porcelain analysis through image zoom-in and image/text retrieval. This iterative process yields more reliable final answers aligned with expert connoisseurship reasoning.

Existing approaches to porcelain connoisseurship remain limited in several key aspects. In most computer vision (CV) studies, porcelain connoisseurship is simplified as a fine-grained recognition task [22, 21, 19, 12], focusing mainly on visual classification without incorporating language-based reasoning or interactive explanation. Moreover, general-purpose MLLMs perform poorly in this specialized domain: porcelain-related data are scarce and there is no unified evaluation standard, which leads to weak generalization, unstable judgments, and outputs that are often incomplete and lacking in professionalism.
These limitations highlight the need for a dedicated porcelain connoisseurship agent that integrates perception, reasoning, and cultural interpretation.

Building such an agent presents several unique challenges. First, the number of valuable porcelain artifacts is inherently limited, making high-quality data collection extremely difficult. Second, accurate annotation and description of porcelain pieces require deep domain expertise in art history and craftsmanship, which makes large-scale labeling both costly and inconsistent. Third, there is no unified evaluation standard for porcelain connoisseurship, leaving MLLMs without reliable metrics to assess their performance. Finally, the essence of porcelain connoisseurship lies in fine-grained recognition, which demands precise identification of features such as vessel shape, glaze color, decorative motifs, and historical period, a task that remains highly challenging even for human experts.

To overcome these challenges, we propose CiQi-Agent, the first Chinese porcelain connoisseurship agent that integrates domain-grounded data curation, a two-phase training paradigm combining supervised fine-tuning (SFT) and reinforcement learning (RL), and tool-augmented reasoning into a unified framework (as shown in Figure 1). CiQi-Agent supports multi-image input and performs fine-grained connoisseurship over six attributes (dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape) by natively integrating vision tool invocation and multimodal retrieval-augmented generation (RAG) as essential reasoning capabilities, and can optionally output the top-k visually similar porcelains to facilitate human interpretation and reference. Specifically, our contributions are as follows:

• We construct CiQi-VQA, a large-scale dataset for ancient Chinese porcelain connoisseurship, with 29,596 original artifacts from 20+ dynasties (2nd c. BCE–19th c. CE), covering 100+ vessel shapes, 200+ glaze colors, and 200+ decorative motifs. To support multimodal model training, we further expand it to over 500K high-quality visual question answering (VQA) pairs via a hybrid pipeline combining expert annotation and LLM-assisted cleaning.

• We establish CiQi-Bench, an expert-aligned benchmark for Chinese porcelain connoisseurship, comprising 775 high-quality specimens and two complementary evaluation protocols: (1) a fine-grained multiple-choice setting covering six attributes and standardized naming, and (2) a free-form generation setting assessed via LLM-based attribute-wise similarity scoring.

• We propose a two-phase iterative training framework for CiQi-Agent: Phase I uses GRPO-based RL with a large tool-calling reward to rapidly bootstrap tool-calling skills and generate synthetic trajectories; Phase II integrates these trajectories back into SFT, followed by RL with a reweighted, accuracy-conditioned reward that jointly optimizes tool-calling proficiency and connoisseurship accuracy, thereby aligning CiQi-Agent's tool-calling capability with its domain expertise.

• CiQi-Agent incorporates both visual and retrieval-based tools, including an image zoom-in tool and image/text retrieval tools to enable multimodal RAG. The retrieval database is primarily built from the CiQi-VQA dataset, consisting of 8,161 curated porcelain pieces with 16,380 images, supplemented by 49,606 cleaned plain-text entries from professional articles.

• The proposed CiQi-Agent, built upon Qwen2.5-VL-7B-Instruct, achieves superior performance on our benchmark, consistently outperforming all mainstream open- and closed-source multimodal models across all evaluation attributes.

2 Related Work

Classical AI Methods for Porcelain Connoisseurship. Early attempts at automating porcelain connoisseurship treated it as a fine-grained image classification problem.
Traditional CV methods relied on hand-crafted features (e.g., color and texture descriptors) combined with classifiers such as support vector machines (SVMs), achieving moderate success on small datasets [45, 33]. With the rise of deep learning, convolutional neural network (CNN)-based approaches have shown improved accuracy by automatically learning visual features such as glaze color and vessel shape [22, 21]. Researchers have also explored multi-task models to jointly classify attributes such as dynasty, kiln, and glaze type. For instance, Ling et al. present a deep-learning framework for four attributes (dynasty, glaze, ware, type) on Song/Yuan porcelain [19]. Nonetheless, these classification-driven works are limited to predefined labels and cannot perform the richer, explanatory reasoning that true connoisseurship demands. They also typically treat artifact analysis as one-shot classification, rather than interactive or multi-step reasoning [25, 5, 7]. In summary, classical AI methods underline the need for larger high-quality datasets and for moving beyond one-shot classification toward interactive, knowledge-rich analysis.

Figure 2: Visualization of the four key attributes in porcelain connoisseurship. Shown are the distributions of dynasty, glaze color, vessel shape, and decorative motif in the raw porcelain dataset. For each attribute, the top 10 most frequent categories are presented.

MLLMs and Domain-Specific Multimodal Systems. General-purpose multimodal LLMs such as BLIP-2 [18] and LLaVA [20] bridge image and text understanding, but typically operate in a single-pass manner without iterative perception or external-knowledge integration [39, 11, 13]. Recent visual-agent systems address these limitations by enabling multi-step perception and tool-based reasoning [46, 50, 34, 6].
In parallel, domain-adapted multimodal models have emerged in medicine [28, 27, 42], remote sensing [44, 16], and cultural heritage analysis such as VaseVQA [8, 36, 24]. However, these systems primarily perform single-step prediction and do not support tool-augmented reasoning. This gap underscores the need for domain-specific visual agents capable of multi-step perception and knowledge-grounded decision making.

3 Dataset and Benchmark

3.1 CiQi-VQA dataset

Raw Data Collection. We first collect raw antique Chinese porcelains from multiple publicly accessible sources, including web-based searches, open-access digital museum collections, and digitized scholarly books. From these sources, we curate a dataset of 29,596 unique specimens spanning 38 dynasties, 42 reign periods, 246 glaze colors, 248 decorative motif categories, and 158 vessel shapes. To the best of our knowledge, this is the most comprehensive dataset for porcelain appreciation currently available.∗ Each porcelain specimen is associated with at least one high-quality image and a standardized name that explicitly encodes four key attributes: dynasty, glaze color, vessel shape, and decorative motif. In addition, a portion of the specimens further specifies two additional attributes: reign period and kiln origin. A complete example with all six attributes would be: "Qing Dynasty, Kangxi period (1662–1722 CE), Jingdezhen kiln, Blue-and-white, Cloud-and-dragon motif, Bowl (清康熙景德镇青花云龙纹碗)." Figure 2 visualizes the distribution across the four key attribute dimensions, showing the top-10 most frequent categories in each. From this collection, we later select 775 pieces to construct CiQi-Bench for evaluation, while the remaining 28,821 pieces are used to build the CiQi-VQA training set.

∗ A comparison with existing porcelain-related datasets is provided in the supplementary materials.

Metadata Enrichment.
We further extract and clean descriptive texts related to porcelain connoisseurship from the raw sources and align them with each specimen. However, a substantial portion of the collected porcelains only provides standardized names without detailed narrative descriptions. To address this issue, we invite human experts to participate in the metadata enrichment process. Specifically, for 61.18% of the specimens that lack detailed descriptions, the expert team composes complementary connoisseurship descriptions based on the specimen images and the available source metadata. In addition, to improve the reliability of the dataset, the experts review and correct the naming accuracy of the standardized names for all collected specimens, ensuring that the encoded attribute information is consistent and properly formatted. The expert team is led by a senior researcher with more than 20 years of experience in porcelain identification and connoisseurship research, and includes four graduate students from related disciplines who contribute to description completion and naming verification under the leader's supervision. Finally, we feed the standardized names, enriched descriptive texts, and raw images into MLLMs to produce a polished connoisseurship description for each specimen, which is structured in six paragraphs corresponding to the six connoisseurship attributes.

Figure 3: Examples from CiQi-VQA dataset. The figure illustrates annotations across four key attributes: dynasty, glaze color, decorative motif, and vessel shape.

VQA Data Generation. We leverage MLLMs to generate specialized VQA pairs targeting the four key attributes, dynasty, glaze color, decorative motif, and vessel shape, as illustrated in Figure 3.
We focus on these four rather than the full six benchmark attributes because they are most central to connoisseurship; the remaining two are treated as more advanced refinements and are emphasized at the RL stage. For each porcelain specimen, the holistic description is also converted into a VQA format, giving five VQA training samples per item. To further diversify linguistic expression, we adopt a lightweight augmentation strategy instead of multi-epoch training on identical samples. For each of the five VQA samples, we prompt the LLM to generate four additional variants that preserve the semantics but differ in phrasing and style. The final CiQi-VQA training set thus contains 20 stylistically diverse yet semantically consistent VQA samples per specimen†, and we perform SFT for a single epoch on this augmented dataset.

† For a small subset of specimens (e.g., monochrome wares), only 15 questions are available because these objects inherently lack decorative motifs.

Table 1: Overall statistics of CiQi-VQA dataset and CiQi-Bench.

| | Porcelains | Images | Questions (VQA) | Questions (Multiple-choice) | Attributes |
|---|---|---|---|---|---|
| SFT | 28,821 | 50,675 | 557,165 | - | |
| RL∗ | 10,275 | 10,275 | 10,275 | - | dynasty, reign, kiln, color, motif, shape |
| Evaluation | 775 | 878 | 775 | 5,425 | |
| Total | 29,596 | 51,553 | 557,940 | 5,425 | dynasty, reign, kiln, color, motif, shape |

∗ The raw porcelain data used for RL is a subset of that used for SFT.

3.2 CiQi-Bench

For the benchmark, we curated a set of 775 porcelain specimens and designed two evaluation protocols.

Multiple-Choice Questions. The first protocol adopts a multiple-choice format. For each specimen, we construct seven questions covering the four key attributes, two additional attributes (reign period and kiln origin), and the full standardized name.
We use GPT-5 to automatically generate the multiple-choice questions: given the image and the ground-truth annotation as input, the model is instructed to produce plausible yet challenging distractor options. Model performance is then quantified by the answer accuracy over all questions.

Free-Form Questions. The second protocol focuses on free-form generation. In this setting, the model is prompted to produce a holistic textual description of each porcelain specimen. An LLM-based evaluator is subsequently employed to compare the generated description with the ground-truth text and assign six separate similarity scores, one for each of the four key attributes and the two additional attributes.‡

Table 1 summarizes the overall statistics of the porcelain dataset and benchmark. The raw porcelain data used for RL is a subset of that used for SFT.

4 Training Framework of CiQi-Agent

In this section, we present the overall architecture of our proposed CiQi-Agent, which emulates the reasoning process of human experts by combining visual perception and retrieval-augmented knowledge access with reinforcement-driven tool-calling. As illustrated in Figure 4, the agent autonomously analyzes visual details, retrieves multimodal evidence, and generates context-aware judgments through the coordinated use of specialized tools.

4.1 Tool Design

To enable more flexible and interpretable reasoning, a suite of external tools is provided, which can be autonomously selected and executed by CiQi-Agent during inference. Each tool invocation is encapsulated within a standardized tag pair, allowing the agent to issue structured commands and integrate tool outputs into its reasoning context in a consistent format. These tools are organized into two categories, vision tool and retrieval tool, which respectively serve the purposes of perceptual enhancement and knowledge acquisition, as illustrated in Figure 4.
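To make the structured-command idea concrete, the following is a minimal sketch of how one JSON tool call could be parsed and routed to a tool. It is an illustration under stated assumptions, not the authors' implementation: the tool names and argument fields (`search_image`, `search_text`, `image_zoom_in_tool`, `index`, `query`, `bbox_2d`, `label`) follow the example calls shown in the paper's figures, while the handler bodies, the `TOOLS` registry, and `dispatch_tool_call` are hypothetical. Extraction of the payload from the tag pair is omitted because the tag name is not given in the text.

```python
import json
from typing import Callable, Dict, List


# Hypothetical stand-ins for the three tools. The real tools return image
# crops and database records; these stubs only return short status strings.
def image_zoom_in_tool(index: int, bbox_2d: List[int], label: str = "") -> str:
    x1, y1, x2, y2 = bbox_2d
    if not (x1 < x2 and y1 < y2):
        raise ValueError("bbox_2d must be [x1, y1, x2, y2] with x1 < x2 and y1 < y2")
    return f"zoomed image {index} to {bbox_2d} ({label})"


def search_image(index: int) -> str:
    return f"retrieved records similar to image {index}"


def search_text(query: str) -> str:
    return f"retrieved records for query: {query}"


# Registry mapping the tool name in the JSON payload to its handler.
TOOLS: Dict[str, Callable[..., str]] = {
    "image_zoom_in_tool": image_zoom_in_tool,
    "search_image": search_image,
    "search_text": search_text,
}


def dispatch_tool_call(payload: str) -> str:
    """Parse one JSON tool call and run the matching tool.

    Failures are returned as text rather than raised, so that an agent loop
    could feed the message back into the reasoning context.
    """
    try:
        call = json.loads(payload)
        tool = TOOLS[call["name"]]
        return tool(**call.get("arguments", {}))
    except json.JSONDecodeError as exc:
        return f"error: malformed JSON ({exc.msg})"
    except KeyError as exc:
        return f"error: unknown tool or missing field {exc}"
    except (TypeError, ValueError) as exc:
        return f"error: bad arguments ({exc})"


print(dispatch_tool_call('{"name": "search_image", "arguments": {"index": 1}}'))
```

Returning errors as strings is a design choice of this sketch: a malformed call then becomes an observation the model can react to, rather than an exception that ends the rollout.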
Vision Tool. The image zoom-in tool enables the agent to autonomously focus on visually salient regions that are potentially informative for porcelain connoisseurship. Instead of relying on predefined or user-specified coordinates, the agent first analyzes the global image context to infer which local areas merit closer examination, such as decorative motifs, glaze textures, or inscription details. It then dynamically predicts the corresponding bounding-box parameters and extracts high-resolution visual patches from those regions. The resulting sub-images are subsequently reintegrated into the multimodal reasoning context, where they serve as fine-grained perceptual evidence for the ongoing analysis.

‡ The specific MLLMs and prompt templates used during the construction of our dataset and benchmark are provided in the supplementary materials.

[Figure 4 panels: Multimodal Tool Calling Workflow (image zoom-in tool, image retrieval tool, and text retrieval tool, each with an example call and tool response) and Two-phase Training Pipeline (supervised fine-tuning and reinforcement learning on Qwen2.5-VL-Instruct, with multi-turn rollouts, masked tool responses, and an LLM-as-a-Judge).]

Figure 4: Framework of the CiQi-Agent. The agent integrates visual zoom-in and image/text retrieval tools within a two-phase training pipeline. Supervised fine-tuning establishes tool-calling skills and porcelain connoisseurship knowledge, while reinforcement learning with an LLM-as-a-Judge refines accuracy and strategic tool-calling.

Retrieval Tools. Two retrieval tools are provided: an image retrieval tool and a text retrieval tool, enabling the agent to access evidence from a multimodal porcelain database autonomously. Both tools operate on a unified RAG framework that integrates high-resolution images and textual knowledge related to porcelain connoisseurship. For image retrieval, the query image is encoded by the CLIP encoder [47], and cosine similarity is computed against all image embeddings in the database.
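The similarity ranking just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the short toy vectors stand in for CLIP image embeddings, and the function names `cosine_similarity` and `top_k_similar` are hypothetical rather than part of the described system.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float],
    database: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (database index, similarity) for the k most similar entries."""
    scored = [(i, cosine_similarity(query, emb)) for i, emb in enumerate(database)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors used in place of real image embeddings (assumption).
toy_database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_similar([1.0, 0.0], toy_database, k=2)
print(results)
```

A production index would normally pre-normalize the embeddings and use a vectorized or approximate nearest-neighbor search; the brute-force loop here only shows the scoring rule.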
The system returns the top-k similar entries and their metadata, which are reinserted into the reasoning context as visual evidence. For text retrieval, the query text is encoded by both a CLIP encoder and a text-embedding model [3], yielding two vectors in distinct semantic spaces. Each is matched with its corresponding index via cosine similarity, and the results are fused using a parameter α, which controls the relative contribution of each model to the final retrieval set, to identify the most relevant records. The retrieved metadata, such as descriptions, provenance, or linked images, are then supplied to the agent as external knowledge. This dual-space retrieval mechanism grounds the model's reasoning in complementary visual and textual evidence, enhancing factual reliability and interpretability.

4.2 Two-phase Training Pipeline

Because the base model initially lacks tool-calling competence and no expert-annotated procedural trajectories are available, the acquisition of tool-calling skills would otherwise be relegated to unguided exploration in the RL phase, a sample-inefficient process that constrains the attainable performance. Consequently, we employ a two-phase training pipeline, with each phase consisting of an SFT stage and a subsequent RL stage.

Phase I: Foundational Competence. The first phase aims to establish general connoisseurship knowledge and tool-calling ability. During the SFT stage, the model is trained on the CiQi-VQA dataset (Sec. 3.1), augmented with instruction-following data [40] to retain general reasoning capability and with 10,575 everyday porcelain samples to reduce overfitting. The subsequent RL stage employs Group Relative Policy Optimization (GRPO) [37], and a strong tool-calling reward encourages the model to quickly learn the mechanics of tool-calling, yielding a model that generates synthetic tool-calling trajectories for Phase II.
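As background for the GRPO stage mentioned above, the sketch below shows the group-relative advantage that gives the method its name: several rollouts are sampled for the same prompt, and each rollout's reward is normalized against the mean and standard deviation of its own group. This is a simplified illustration, not the training code of CiQi-Agent; the reward values are made up, the function name is hypothetical, and the use of the population standard deviation with a small `eps` is an assumption (implementations differ on this detail).

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    if not rewards:
        raise ValueError("a group needs at least one rollout reward")
    mean = statistics.fmean(rewards)
    # Population standard deviation; some implementations use the sample
    # standard deviation instead (assumption of this sketch).
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical total rewards for four rollouts of the same prompt.
group_rewards = [0.2, 1.4, 0.9, 2.1]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Rollouts that score above their group's mean receive a positive advantage and are reinforced, while those below it are discouraged; when every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal.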
Phase II: Strategic Competence. The second phase focuses on refining the agent's ability to integrate tool-calling with higher-level reasoning. The synthetic trajectories from Phase I are merged back into the SFT corpus, allowing the model to internalize both reasoning and tool-calling patterns. The subsequent RL stage employs a redesigned reward that explicitly links tool-calling rewards to connoisseurship accuracy, encouraging the model to improve connoisseurship performance through more effective and purposeful tool-calling.

Through this two-phase curriculum, the agent progressively acquires the domain-specific expertise required for porcelain connoisseurship while learning to call tools to enhance perception and reasoning.

4.3 Reward Design

During RL, the overall reward function is a weighted sum of the format reward $R_{\text{format}}$, the accuracy reward $R_{\text{acc}}$, and the tool-calling reward $R_{\text{tool}}$:

$$R = \gamma_{\text{format}} \cdot R_{\text{format}} + \gamma_{\text{acc}} \cdot R_{\text{acc}} + R_{\text{tool}},$$

where the weight parameters $\gamma_{\text{format}}$ and $\gamma_{\text{acc}}$ control the relative importance of the format and accuracy rewards.

Format Reward. The format reward $R_{\text{format}} \in \{0, -1\}$ enforces compliance with the prescribed output and tool-calling syntax. Specifically, the agent receives $R_{\text{format}} = 0$ if both the response and tool calls strictly follow the required format; otherwise, it incurs a penalty of $R_{\text{format}} = -1$.

Accuracy Reward. The accuracy reward $R_{\text{acc}}$ measures how accurately the model names the porcelain. It is composed of six attribute scores (dynasty, reign period, kiln origin, glaze color, decorative motif, and vessel shape) plus a consistency score that checks whether the output conforms to the required format. Each score satisfies $s_i \in [0, 1]$. If the ground truth does not contain a certain attribute, that attribute is excluded from the average.
Formally, let $\mathcal{M}$ denote the set of attribute indices present in the ground truth for a given sample; then

$$R_{\text{acc}} = |\mathcal{M}|^{-1} \sum_{i \in \mathcal{M}} s_i.$$

The evaluation is performed by the LLM-as-a-Judge described in Sec. 4.4.

Tool-calling Reward. The tool-calling reward $R_{\text{tool}}$ is designed to guide the agent's learning of tool-calling behavior and its effective use in improving connoisseurship performance.

• During the first-phase RL training, the primary objective is to help the model quickly acquire the ability to call tools appropriately. For each rollout, if the model invokes $k_{\text{tool}}$ tools, it receives a corresponding reward $R_{\text{tool}} = k_{\text{tool}}$, proportional to the number of successful tool calls, encouraging early mastery of tool-calling behavior.

• During the second-phase RL training, the reward scheme shifts from quantity to quality. To promote strategic and meaningful tool-calling, the agent is rewarded only when the invocation of tools contributes to higher connoisseurship accuracy. Specifically, if the model invokes $m_{\text{tool}}$ distinct tools during a rollout, it receives a scaled tool-calling reward $R_{\text{tool}} = (0.9 + 0.1\, m_{\text{tool}})\, R_{\text{acc}}$, while rollouts without any tool invocation receive no tool-calling reward. This design incentivizes the model to explore tool-calling strategies that genuinely enhance identification accuracy rather than indiscriminate calling of tools.

Table 2: Pearson correlation and MAE between human expert scores and LLM-as-a-Judge scores on CiQi-Bench. All values are reported to three decimal places.

| | Dynasty | Reign period | Kiln | Glaze color | Motif | Shape |
|---|---|---|---|---|---|---|
| Pearson r | 0.995 | 1.000 | 0.979 | 0.958 | 0.938 | 0.859 |
| MAE | 0.013 | 0.000 | 0.036 | 0.028 | 0.065 | 0.077 |

4.4 LLM-as-a-Judge

Although porcelain naming follows a standardized attribute structure, linguistic realizations of each attribute exhibit substantial lexical variation while preserving semantic equivalence.
As a result, exact string matching or embedding-based similarity metrics fail to reliably assess correctness, particularly when fine-grained, domain-specific terminology is involved. To address this challenge, we instead adopt an LLM-as-a-Judge approach: during RL training and evaluation, the agent is required to output a standardized naming enclosed within a predefined tag pair. The content inside these tags, along with the corresponding ground-truth porcelain name, is submitted to an LLM-based evaluator. The evaluator is instructed to rate each attribute according to a structured scoring rubric, yielding individual scores $s_i \in [0, 1]$, where higher values indicate greater accuracy or stylistic consistency. The detailed prompt is provided in Secs. 9.4–9.5 of the Supplementary Material.

To validate the reliability of LLM-as-a-Judge, we run the SFT model after first-phase training on CiQi-Bench to generate predictions and collect independent evaluations from domain experts using the same scoring rubric. Table 2 reports the Pearson correlation coefficients and mean absolute errors (MAE) between expert scores and LLM-based scores across all six attributes. The consistently high correlations, together with the small MAE values, indicate strong alignment between the LLM-based evaluator and human expert judgment.

5 Experiments

In this section, we conduct extensive experiments to evaluate the performance of our proposed CiQi-Agent. We first conduct experiments on CiQi-Bench, and then perform ablation studies to analyze the impact of each component on our model's performance.

5.1 Experiment Setup

Baseline Configuration. We compare our method against a comprehensive set of state-of-the-art multimodal models, including both closed-source and open-source variants.
For closed-source models (GPT-5 [30], GPT-4.1 [31], GPT-4o [29], OpenAI o3 [32], Gemini 2.5 Pro [10], and Claude Opus 4 [1]), we use their official APIs with the default inference settings, except that the temperature is set to 0.0 to ensure deterministic outputs. For open-source models, we select the largest released model of each model family (Qwen2.5-VL-72B-Instruct [35], GLM-4.5V [9], InternVL3.5-241B-A28B-Flash [15], Kimi-VL-A3B-Instruct [17]). Models are loaded using their official implementations and evaluated with temperature = 0.0 to ensure reproducibility.

Our Method Configuration. Our agent is built upon Qwen2.5-VL-7B-Instruct and trained using the two-phase pipeline described in Sec. 4, with Qwen2.5-72B-Instruct [41] serving as the LLM-as-a-Judge evaluator. The weight parameters are set to $\gamma_{\text{format}} = 0.2$ and $\gamma_{\text{acc}} = 1.0$, respectively. For retrieval, we construct a database consisting of 16,830 images and 49,606 texts from our CiQi-VQA dataset, which is strictly non-overlapping with both the RL training set and CiQi-Bench to avoid information leakage and ensure fair assessment. See Sec. 10 of the Supplementary Material for the detailed configurations.

Table 3: Performance on CiQi-Bench (Multiple-choice & Free-form). Bold indicates the best performance, and underline indicates the second-best performance. MC columns report multiple-choice accuracy (%); FF columns report free-form accuracy (%).

| Model | MC Dynasty | MC Reign | MC Kiln | MC Color | MC Motif | MC Shape | MC Naming | MC Average | FF Dynasty | FF Reign | FF Kiln | FF Color | FF Motif | FF Shape | FF Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 [30] | 65.7 | 61.4 | 79.6 | 86.5 | 69.3 | 83.8 | 84.3 | 75.8 | 39.4 | 32.8 | 42.6 | 74.4 | 35.3 | 63.9 | 48.0 |
| GPT-4.1 [31] | 59.3 | 68.3 | 71.1 | 85.0 | 62.2 | 81.8 | 77.9 | 72.2 | 36.7 | 27.2 | 29.0 | 67.5 | 27.6 | 60.1 | 41.3 |
| GPT-4o [29] | 59.1 | 60.4 | 68.6 | 89.2 | 70.1 | 84.2 | 82.1 | 73.4 | 26.9 | 13.4 | 15.1 | 53.9 | 21.1 | 47.6 | 29.7 |
| o3 [32] | 57.6 | 57.4 | 72.2 | 82.6 | 62.4 | 76.8 | 76.6 | 69.4 | 42.7 | 36.6 | 44.4 | 74.2 | 33.1 | 62.1 | 48.8 |
| Gemini 2.5 Pro [10] | 54.4 | 57.4 | 68.0 | 65.2 | 58.5 | 64.2 | 82.5 | 64.3 | 48.1 | 34.9 | 48.4 | 50.2 | 16.1 | 39.0 | 39.5 |
| Claude Opus 4 [1] | 54.8 | 40.6 | 65.3 | 74.9 | 59.0 | 75.2 | 69.3 | 62.7 | 36.8 | 10.3 | 22.0 | 64.1 | 25.1 | 59.1 | 36.2 |
| Qwen2.5-VL-72B-Instruct [35] | 57.6 | 34.7 | 69.2 | 86.7 | 71.7 | 84.1 | 80.3 | 69.2 | 29.5 | 31.2 | 27.7 | 75.8 | 31.0 | 62.6 | 43.0 |
| GLM-4.5V (106B) [9] | 58.3 | 59.4 | 75.8 | 82.3 | 70.4 | 81.8 | 80.6 | 72.6 | 31.0 | 14.3 | 32.8 | 65.4 | 31.1 | 65.2 | 39.9 |
| InternVL3.5-241B-A28B-Flash [15] | 57.1 | 38.6 | 59.5 | 82.1 | 64.8 | 73.9 | 68.5 | 63.5 | 42.4 | 31.6 | 36.9 | 52.6 | 19.6 | 41.5 | 37.4 |
| Kimi-VL-A3B-Instruct (16B) [17] | 59.3 | 22.8 | 48.8 | 84.8 | 59.8 | 77.9 | 70.3 | 60.5 | 17.3 | 23.7 | 16.2 | 69.5 | 26.5 | 61.3 | 35.7 |
| CiQi-Agent (Ours, 7B) | 77.6 | 70.3 | 81.8 | 91.4 | 75.7 | 88.1 | 85.2 | 81.5 | 71.3 | 49.1 | 69.8 | 85.4 | 49.7 | 75.0 | 66.7 |

Evaluation Metrics. We report accuracy as the primary metric for multiple-choice questions, computed as the percentage of correctly answered questions across all seven dimensions: overall naming, dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. For free-form generation tasks, we employ Qwen2.5-72B-Instruct [41] as the LLM-as-a-Judge evaluator.

5.2 Main Results on CiQi-Bench

Table 3 summarizes the results on CiQi-Bench. CiQi-Agent achieves state-of-the-art performance across all dimensions, validating the effectiveness of our two-phase training pipeline and tool-augmented reasoning framework.

On multiple-choice tasks, CiQi-Agent attains 85.2% overall naming accuracy and an average of 81.5% across all seven dimensions, surpassing GPT-5 (average 75.8%) and Qwen2.5-VL-72B-Instruct (average 69.2%). Notably, it achieves 77.6% on dynasty, 70.3% on reign period, and 81.8% on kiln site, outperforming all baselines by substantial margins. Performance gains are especially pronounced in visually grounded attributes, reaching 91.4% on glaze color, 75.7% on decorative motif, and 88.1% on vessel shape.

For free-form generation, CiQi-Agent further demonstrates strong generalization, achieving 71.3% on dynasty (vs. 42.7% for o3), 69.8% on kiln site (vs.
44.4% for o3), and 85.4% on glaze color (vs. 75.8% for Qwen2.5-VL-72B), with an average of 66.7% across all six attributes (vs. 43.0% for Qwen2.5-VL-72B and 48.0% for GPT-5). Remarkably, despite having only 7B parameters, CiQi-Agent outperforms GPT-5 across all attributes and in both multiple-choice and free-form averages, demonstrating that tool-calling and domain-aligned training can effectively compensate for model scale.

5.3 Ablation Study

Training stages. To understand how each training stage contributes to the final performance, we perform an ablation study based on Qwen2.5-VL-7B-Instruct with the following three model variants, as shown in Table 4: (1) the base model; (2) the SFT model fine-tuned on the base model, corresponding to the Phase II model; (3) the SFT+RL model, i.e., our full CiQi-Agent. Since the tool-calling trajectories used for RL training are generated by the Phase I model, Model (3) is built on Model (2).

Table 4: Ablation study on Qwen2.5-VL-7B-Instruct variants. Arrows indicate improvement (↑) or decline (↓) relative to the previous variant, and bold indicates the best performance. MC columns report multiple-choice accuracy (%); FF columns report free-form accuracy (%).

| Model | MC Dynasty | MC Reign | MC Kiln | MC Color | MC Motif | MC Shape | MC Naming | MC Average | FF Dynasty | FF Reign | FF Kiln | FF Color | FF Motif | FF Shape | FF Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct [35] | 54.0 | 60.4 | 55.1 | 83.6 | 69.8 | 79.6 | 69.0 | 67.4 | 20.3 | 22.2 | 15.5 | 70.6 | 28.4 | 61.4 | 36.4 |
| + SFT | 65.5 ↑ | 53.5 ↓ | 77.1 ↑ | 92.6 ↑ | 72.8 ↑ | 88.2 ↑ | 81.9 ↑ | 75.9 ↑ | 64.6 ↑ | 45.2 ↑ | 56.6 ↑ | 81.6 ↑ | 35.8 ↑ | 70.9 ↑ | 59.1 ↑ |
| + SFT + RL | 77.6 ↑ | 70.3 ↑ | 81.8 ↑ | 91.4 ↓ | 75.7 ↑ | 88.1 ↓ | 85.2 ↑ | 81.5 ↑ | 71.3 ↑ | 49.1 ↑ | 69.8 ↑ | 85.4 ↑ | 49.7 ↑ | 75.0 ↑ | 66.7 ↑ |

Effect of SFT. SFT substantially boosts performance: overall naming improves from 69.0% to 81.9%, and the multiple-choice average rises from 67.4% to 75.9%. On free-form evaluation, dynasty accuracy jumps from 20.3% to 64.6%, and the average increases from 36.4% to 59.1%, establishing core connoisseurship knowledge.

Effect of RL. Adding RL aligns tool-calling with knowledge-grounded reasoning: kiln accuracy jumps from 67.2% to 81.8% and dynasty from 51.6% to 77.6%, and all free-form attributes improve (e.g., kiln: 63.6% → 69.8%, dynasty: 66.8% → 71.3%). Correspondingly, the multiple-choice average increases from 69.7% to 81.5%, and the free-form average from 62.3% to 66.7%, yielding our final CiQi-Agent.

Table 5: Ablation study on GPT-5, Qwen2.5-VL-7B-Instruct and CiQi-Agent with different tool configurations. Bold indicates the best performance. All columns report multiple-choice accuracy (%).

| Model | Dynasty | Reign | Kiln | Color | Motif | Shape | Naming | Average |
|---|---|---|---|---|---|---|---|---|
| GPT-5 [30] | 65.7 | 61.4 | 79.6 | 86.5 | 69.3 | 83.8 | 84.3 | 75.8 |
| + vision tool | 62.4 | 68.3 | 80.4 | 86.9 | 67.7 | 83.2 | 85.0 | 76.3 |
| + retrieval tools | 54.9 | 58.4 | 69.4 | 85.5 | 65.6 | 81.4 | 80.2 | 70.8 |
| + all tools | 55.7 | 58.4 | 68.3 | 85.8 | 66.7 | 82.3 | 78.6 | 70.8 |
| Qwen2.5-VL-7B-Instruct [35] | 54.0 | 60.4 | 55.1 | 83.6 | 69.8 | 79.6 | 69.0 | 67.4 |
| + vision tool | 32.7 | 51.5 | 31.4 | 55.0 | 44.4 | 54.7 | 60.1 | 47.1 |
| + retrieval tools | 33.6 | 54.5 | 35.5 | 55.4 | 45.2 | 55.6 | 60.6 | 48.6 |
| + all tools | 32.3 | 53.5 | 36.9 | 54.1 | 45.2 | 55.4 | 59.3 | 48.1 |
| CiQi-Agent (Ours, 7B) (w/o tools) | 51.6 | 32.7 | 67.2 | 92.7 | 73.3 | 86.2 | 84.3 | 69.7 |
| + vision tool | 69.2 | 66.3 | 59.8 | 91.0 | 78.6 | 76.7 | 84.2 | 75.1 |
| + retrieval tools | 68.7 | 59.4 | 81.0 | 88.9 | 75.4 | 83.3 | 84.3 | 77.3 |
| + all tools | 77.6 | 70.3 | 81.8 | 91.4 | 75.7 | 88.1 | 85.2 | 81.5 |

Tool Configurations. To understand how each tool contributes to the final performance, we perform an ablation study on GPT-5, Qwen2.5-VL-7B-Instruct, and our CiQi-Agent, as shown in Table 5.

Tool on Base Models. Tools do not consistently benefit general-purpose base models (e.g., GPT-5's average: 75.8% → 76.3% / 70.8% / 70.8%; Qwen2.5-VL-7B-Instruct's average: 67.4% → 47.1% / 48.6% / 48.1%).
A plausible explanation is that the returned evidence (retrieved text or visual cues) is domain-specific and may require porcelain-appraisal knowledge to interpret and reconcile, which these base models do not reliably exhibit. In contrast, CiQi-Agent shows consistent gains (e.g., average: 69.7% → 75.1% / 77.3% / 81.5%), suggesting that models adapted with porcelain-domain expertise can more effectively interpret tool-provided evidence and translate it into improved performance.

Tool Combination on CiQi-Agent. Combining tools is more effective than using a single tool for CiQi-Agent. The best average performance is achieved when vision and retrieval tools are enabled together (average: 81.5%), which supports the view that the two tools provide complementary information that is most useful when integrated.

Overall, the ablation results reveal a clear division of labor among training stages and tools. SFT provides a comprehensive uplift of the model's domain knowledge and yields improvements on nearly every evaluation (multiple-choice reign period is the one exception in Table 4). Building on this foundation, RL with the vision tool strengthens the model's ability to capture fine-grained visual cues, which is particularly beneficial for visually grounded attributes such as motif. When multimodal retrieval tools are further incorporated, the agent gains the ability to compare input images against external porcelain exemplars, significantly improving history-based attributions, including dynasty, reign, and shape.

6 Conclusion

In this work, we present CiQi-Agent, a domain-specific multimodal agent for antique Chinese porcelain connoisseurship. We build CiQi-VQA, a large-scale dataset of expert-curated porcelain images and question–answer pairs, and CiQi-Bench, an expert-aligned benchmark that evaluates six connoisseurship attributes.
For the training framework, we design a two-phase training pipeline that combines SFT, RL, and tool-augmented reasoning to align tool-calling with domain expertise. CiQi-Agent integrates visual zoom-in and image/text retrieval tools to perform fine-grained analysis with multimodal RAG. Extensive experiments show that CiQi-Agent significantly outperforms mainstream MLLMs across all attributes, demonstrating the effectiveness of our dataset, benchmark, and training framework for cultural-heritage connoisseurship.

For future work, we plan to move beyond connoisseurship and tackle the more challenging task of authentication, i.e., distinguishing genuine antique porcelains from later imitations. In addition, CiQi-Agent represents a first step toward using MLLMs for cultural-heritage analysis; the same framework can be extended to build agents for other artifact types (e.g., ancient coins, calligraphy, paintings) or to develop a more general foundation model for cultural-heritage connoisseurship.

References

[1] Anthropic. Claude 4 system card. Technical report, Anthropic, 2025.
[2] Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, and Luc Van Gool. Understanding museum exhibits using vision-language reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2227–2238, October 2025.
[3] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216, 2024.
[4] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24185–24198, 2024.
[5] Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yizhou Hong, Zhiqiu Sun, Daniel Gutfreund, and Chuang Gan. Visual chain-of-thought prompting for knowledge-based visual reasoning. In AAAI, volume 38, pages 1254–1262, 2024.
[6] Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, and Pengfei Liu.
Thinking with generated images. arXiv preprint arXiv:2505.22525, 2025.
[7] Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. In CVPR, pages 9062–9072, June 2025.
[8] Jinchao Ge, Tengfei Cheng, Biao Wu, Zeyu Zhang, Shiya Huang, Judith Bishop, Gillian Shepherd, Meng Fang, Ling Chen, and Yang Zhao. Vasevqa: Multimodal agent and benchmark for ancient greek pottery. arXiv preprint arXiv:2509.17191, 2025. Concurrent work.
[9] GLM-V Team. Glm-4.1v-thinking and glm-4.5v: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.
[10] Google DeepMind. Gemini 2.5 pro, 2025.
[11] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In CVPR, pages 14953–14962, June 2023.
[12] Yanfeng Hu, Siqi Wu, Zhuoran Ma, and Si Cheng. Integrating deep learning and machine learning for ceramic artifact classification and market value prediction. npj Heritage Science, 13(1):1–17, 2025.
[13] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In NeurIPS, volume 37, pages 139348–139379, 2024.
[14] Yipo Huang, Xiangfei Sheng, Zhichao Yang, Quan Yuan, Zhichao Duan, Pengfei Chen, Leida Li, Weisi Lin, and Guangming Shi. Aesexpert: Towards multi-modality foundation model for image aesthetics perception. In Proceedings of the 32nd ACM International Conference on Multimedia, MM '24, pages 5911–5920, New York, NY, USA, 2024. Association for Computing Machinery.
[15] InternVL Team. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.
arXiv preprint arXiv:2508.18265, 2025.
[16] Yuqi Jiang, Xudong Lu, Qian Jin, Qi Sun, Hanming Wu, and Cheng Zhuo. Fabgpt: An efficient large multimodal model for complex wafer defect knowledge queries. In ICCAD, pages 1–8, 2025.
[17] Kimi Team. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.
[18] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C.H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
[19] Ziyao Ling, Giovanni Delnevo, Paola Salomoni, and Silvia Mirri. Multi-task learning for identification of porcelain in song and yuan dynasties. arXiv preprint arXiv:2503.14231, 2025.
[20] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning for large language and vision assistant. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
[21] Yang Liu, Lei Zhou, Pengcheng Zhang, Xiao Bai, Lin Gu, Xiaohan Yu, Jun Zhou, and Edwin R. Hancock. Where to focus: Investigating hierarchical attention relationship for fine-grained visual classification. In ECCV, pages 57–73, 2022.
[22] Yi Liu, Bo Liu, Jun Yu, Jing Xia, and Chun Luo. Automatic classification of blue and white porcelain sherds based on data augmentation and feature fusion. Applied Artificial Intelligence, 36(1):1994232, 2022.
[23] Jinjing Ma, Yongkang Peng, Wenfang Cheng, Meikang Qiu, and Yu Nie. Identification method of ancient ceramics revision. In 2021 8th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2021 7th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom), pages 213–218, 2021.
[24] Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving.
In ECCV, pages 403–420, 2025.
[25] Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In CVPR, pages 14420–14431, June 2024.
[26] Youssef Mohamed, Runjia Li, Ibrahim Said Ahmad, Kilichbek Haydarov, Philip Torr, Kenneth Church, and Mohamed Elhoseiny. No culture left behind: ArtELingo-28, a benchmark of WikiArt with captions in 28 languages. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20939–20962, Miami, Florida, USA, November 2024. Association for Computational Linguistics.
[27] Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, et al. Med-flamingo: A multimodal medical few-shot learner. In ML4H, pages 353–367, 2023.
[28] Hareem Nisar, Syed Muhammad Anwar, Zhifan Jiang, Abhijeet Parida, Vishwesh Nath, Holger R. Roth, and Marius G. Linguraru. D-rax: Domain-specific radiologic assistant leveraging multi-modal data and expert model predictions. arXiv preprint arXiv:2407.02604, 2024.
[29] OpenAI. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[30] OpenAI. GPT-5 system card. Technical report, OpenAI, 2025.
[31] OpenAI. Introducing GPT-4.1 in the api, 2025.
[32] OpenAI. Introducing openai o3 and o4-mini, April 2025.
[33] L. K. Pavithra, T. Sree Sharmila, and P. Subbulakshmi. Texture image classification and retrieval using multi-resolution radial gradient binary pattern. Applied Artificial Intelligence, 35(15):2298–2326, 2021.
[34] Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, and Honggang Zhang. V-thinker: Interactive thinking with images.
arXiv preprint arXiv:2511.04460, 2025.
[35] Qwen Team. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[36] Yabing Ran, Bingzhao Gao, and Qiankun Yu. Driver-guide: A multimodal large language model-based agent for driving scene understanding. In CCC, pages 8670–8675, 2025.
[37] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[38] Jingmin Sun, Hailiang Lu, Lei Qiao, Xiaolong Li, Kaihan Chen, and Weiran Cao. Identification of porcelain ewers in tang, song, and yuan dynasties by digital shape characterization. Ceramics International, 49(9, Part A):14246–14254, 2023.
[39] Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In ICCV, pages 11888–11898, October 2023.
[40] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model, 2023.
[41] Qwen Team. Qwen2.5: A party of foundation models, September 2024.
[42] Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, et al. Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138, 2024.
[43] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution, 2024.
[44] Xingxing Weng, Chao Pang, and Gui-Song Xia. Vision-language modeling meets remote sensing: Models, datasets and perspectives.
IEEE Geoscience and Remote Sensing Magazine, 2025. Early Access.
[45] Zheng Weng, Yachun Guan, and Hongjie Luo. Machine vision based classification and identification for non-destructive authentication of ancient ceramic. Journal of the Chinese Ceramic Society, 45(12):1833–1842, 2017.
[46] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
[47] An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. Chinese clip: Contrastive vision-language pretraining in chinese, 2023.
[48] Yang Yang, Hui Wu, Dingguo Yu, and Chen Yang. Ceramic type recognition algorithm based on ontology modeling and transfer learning. In 2022 International Conference on Culture-Oriented Science and Technology (CoST), pages 6–10, 2022.
[49] Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025.
[50] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.
[51] Li Zhou, Lutong Yu, Dongchu Xie, Shaohuan Cheng, Wenyan Li, and Haizhou Li. Hanfu-bench: A multimodal benchmark on cross-temporal cultural understanding and transcreation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24627–24649, Suzhou, China, November 2025. Association for Computational Linguistics.
Supplementary Material

The supplementary material provides additional qualitative analyses, implementation details, and evaluation prompts for CiQi-Agent. It is organized as follows: Section 8 presents step-by-step case studies demonstrating how CiQi-Agent performs porcelain connoisseurship through multimodal reasoning and multi-stage tool invocation (image zoom-in, visual search, and textual retrieval); Section 9 documents all prompts used in our framework, including those for metadata enrichment, VQA data generation, multiple-choice option construction, and LLM-as-a-Judge in both training and evaluation; Section 11 compares the proposed CiQi-VQA dataset with existing porcelain-related datasets, highlighting its larger scale and finer-grained attribute coverage; finally, in Section 12, we conduct additional comparative experiments on the Hu et al. [12] dataset. These results further demonstrate the superior performance of our CiQi-Agent on datasets beyond CiQi-Bench.

8 Case Study

Blue-and-White Flared-Rim Bowl with Figure Decoration, Kangxi Period, Qing Dynasty (清康熙青花人物图撇口碗)

User: Please provide a standardized name for this porcelain piece. After your analysis, choose to either invoke a tool or respond directly, and place the final identification result within the tags.

A: This porcelain piece is a blue-and-white bowl with flared rim. Its form features a flared mouth, rounded belly, and a ring foot, presenting an elegant and refined profile. The exterior wall is decorated with blue-and-white figure scenes; the painting is meticulous, with vivid expressions, smooth garment lines, and background elements such as rocks and trees arranged with balanced density. The cobalt blue appears pure and bright, displaying the typical "emerald-tinted blue" characteristic of Kangxi blue-and-white ware.
Along the inner rim is a band of blue-and-white diaper patterns, consistent with official kiln features of the Kangxi reign. To further verify its dating and stylistic traits, I plan to invoke the search image tool to compare the bowl with stylistically similar pieces.

{"name": "search_image", "arguments": {"index": 1}}

User: Successfully found the following content:

Image search results:
- Name: Ming Wanli Wucai Garlic-Mouth Vase with Dragon, Flowers and Birds
- Dynasty: Ming
- Reign: Wanli
- Glaze Color: Wucai (Five-color)
- Decoration: Dragon with flowers and birds
- Form: Garlic-mouth vase
- Description: The slightly contracted mouth forms a garlic shape, with a long neck, swollen belly, and shallow footring. The entire vessel is thick and heavy. It was made by forming separate sections and then joining them, therefore five seam lines appear on the mouth, neck, and belly. Both the inner footring and base are unglazed, with areas stained by dripped glaze. The clay body is coarse and rich in iron, causing unglazed parts to fire into an orange tone with iron speckles. The whole surface is covered with a white glaze; in addition to underglaze-blue linear decoration, further scrolls, entwined flowers, freehand birds, and angular patterns are applied in multicolored low-temperature enamels including red, yellow, orange, green, and black. The main motif is a dragon striding among flowers on the belly. Four dragons with large round eyes appear as if dancing and grabbing pearls, their strange and exaggerated shapes carrying a cartoon-like appeal. A six-character kaishu mark "Made in the Wanli period of the Great Ming Dynasty" is written in underglaze blue on the mouth rim. The use of many low-temperature overglaze colors began in the Xuande reign, and by the Jiajing and Wanli periods, potters applied various glazes onto complex patterns to create a lively and vibrant visual effect.
- Source: National Palace Museum, Taipei
- Similarity: 0.935499
---
Image search results:
- Name: Ming Wanli Wucai Garlic-Mouth Vase with Dragon, Flowers and Birds
- Dynasty: Ming
- Reign: Wanli
- Glaze Color: Wucai (Five-color)
- Decoration: Dragon with flowers and birds
- Form: Garlic-mouth vase
- Description: Same as above.
- Source: National Palace Museum, Taipei
- Similarity: 0.93520796
---
Image search results:
- Name: Qing Qianlong Famille-Rose Vase with Floral Phoenix Motif and Ruyi Handles
- Dynasty: Qing
- Reign: Qianlong
- Glaze Color: Famille-rose
- Decoration: Floral motifs
- Form: Vase
- Description: Celestial-globe form with flared mouth, long neck, rounded shoulders, and swollen belly, with a pair of ruyi-shaped handles on the sides of the neck. Flat base, outward-turned footring, and gold-painted rim. The body is decorated in famille-rose enamels over a white ground. Below the rim is a pomegranate motif; the neck shows lotus, bats, and ancient bells linked together, interspersed with floral scrolls. Two gilt string bands encircle the shoulder, framing multicolored patterns and leading to a ruyi-cloud border. The ruyi handles are linked with wan symbols and pendant ribbons attached to the shoulders. The belly features paired phoenixes between musical stones and lotus flowers, surrounded by dense floral scrolls. Near the bottom is a stylized lotus-petal band, above a ruyi-cloud border and a string line encircling the foot. A square six-character seal mark "Made in the Qianlong period of the Great Qing Dynasty" is written in underglaze blue.
- Source: National Palace Museum, Taipei
- Similarity: 0.9320861

A: According to the results returned by the tool, the shape, decorative motifs, glaze color, and reign mark of this blue-and-white figure-decorated flared-rim bowl all correspond to typical characteristics of the Kangxi period of the Qing dynasty.
It can be clearly distinguished from porcelain produced during the Xuande, Jiajing, and Wanli periods of the Ming dynasty, particularly in terms of the hue of the cobalt blue, the refinement of the painting, and the proportional maturity of the vessel form. These features reflect the highly advanced craftsmanship of Kangxi porcelain. Based on the narrative figure motif and the overall vessel shape, it can be confirmed as a work from the Kangxi imperial kilns. Next, I intend to invoke the search text tool and input a detailed description of this bowl to further compare it with typical features of Kangxi blue-and-white porcelain, in order to verify its period attribution and standardized name.

{"name": "search_text", "arguments": {"query": "Qing Kangxi blue-and-white figure-decorated flared-rim bowl"}}

User: Successfully found the following content:

Text search results:
- Text: Foot diameter 2.7 cm. The center of the bowl is decorated with a Taiji (yin–yang) motif. The depiction of the yin–yang fishes is similar to the piece mentioned above, but the painting is less refined. The outer ring's clockwise thin line is relatively smooth. The cobalt-blue color is rich. Within the double circles at the base center is an unrecognizable grass-style mark.
- Title: Fragment of a green-glazed exterior bowl with blue-and-white interior, Qing Kangxi
- Source: Authentication of Ming and Qing Porcelain Motifs: Decorative Motifs Volume
- Similarity: 0.8
---
Text search results:
- Text: Flared rim, deep belly, and footring. A band of floral motifs is painted along the inner rim. Inside the double blue rings at the center is a painting of rocks, flowers, and birds. The exterior wall depicts egrets in a lotus pond, with luxuriant lotuses rendered in clear layers. The lotus leaves show a sense of front-and-back yin–yang shading. The fine lines outline flowing water gracefully.
Nine egrets, each in a different pose, are accurately drawn; light cobalt washes outside the outlines enhance expression with an elegant touch. The body is thin and fine, the glaze glossy and evenly applied. Within the double blue circles at the base is the kaishu mark “Made in the Jiajing period of the Great Ming,” yet it is a refined work from Kangxi folk kilns.
- Title: Flared-rim deep-belly bowl with footring
- Source: Complete Collection of Porcelain Collected in Jiangxi: Qing Dynasty, Volume I
- Similarity: 0.7787347755736039

Image search results:
- Name: Blue-and-white bowl with “banana leaves and ladies composing poetry” and Xuande reign mark, Ming
- Dynasty: Ming
- Reign: Xuande
- Glaze Color: Blue-and-white
- Decoration: Ladies
- Form: Bowl
- Description: A band of scroll motifs is painted around the footring. Six cobalt-blue lines are added between the rim and the foot. The cobalt blue is heavy, with brownish-green speckles. The body is thick, fine, white, and compact, with iron spots of different sizes. The base bears a two-column six-character kaishu mark “Made in the Xuande period of the Great Ming,” enclosed by double circles. The theme of this bowl is “lady with banana leaves composing poetry.” Due to somewhat thick brushwork, the lady appears with bold eyebrows and a full face, lacking delicate elegance, yet the painting is still fine and remains a high-quality work.
- Source: National Palace Museum, Taipei
- Similarity: 0.19999999999999996

A: Based on the analysis, this blue-and-white flared-rim bowl with figure decoration is a typical porcelain piece from the Kangxi period of the Qing dynasty. Its vessel form, decorative motifs, glaze color, and reign mark are all consistent with the characteristics of imperial-kiln wares from that period.
In conclusion, it should be identified as “Qing Kangxi Blue-and-White Flared-Rim Bowl with Narrative Figure Decoration.”

{"name": "image_zoom_in_tool", "arguments": {"index": 1, "bbox_2d": [112, 114, 826, 781], "label": "narrative figure motif"}}

User: Successfully zoomed in on the image to the region labeled “narrative figure motif” at [112, 114, 826, 781].

A: Based on the analysis, this porcelain piece is a typical work from the Kangxi period of the Qing dynasty. Its vessel form, decorative motifs, glaze color, and reign mark all conform to the characteristics of imperial-kiln wares from this period. It should therefore be identified as “Qing Kangxi Blue-and-White Flared-Rim Bowl with Figure Decoration.”

Qing Kangxi Blue-and-White Flared-Rim Bowl with Figure Decoration

White-Glazed Standing Guanyin, Dehua Kiln, Ming Dynasty (明德化窑白釉观音立像)

User: Please provide a standardized name for this porcelain piece. After your analysis, choose to either invoke a tool or respond directly, and place the final identification result within the tags.

A: This porcelain piece is a white-glazed standing Guanyin figure. Its glaze is lustrous and creamy, displaying the characteristic “ivory white” tone that is a defining feature of Dehua white porcelain. The sculptural workmanship is refined: the Guanyin has a dignified and compassionate expression, the drapery flows naturally, and the pedestal is carefully modeled, all of which correspond to typical features of Ming-dynasty Dehua kiln figures. Based on its style and craftsmanship, it can be inferred that this work was produced by the Dehua kilns during the Ming dynasty. To further verify its kiln origin and chronological attribution, I intend to invoke the search image tool and input the image of this piece, in order to retrieve similar museum collections and assist in the identification.

{"name": "search_image", "arguments": {"index": 1}}

User: Successfully found the following content:

Image search results:
- Name: Qing Dehua Kiln Guanyin
- Dynasty: Qing
- Glaze Color: White glaze
- Form: Guanyin figure
- Description: Guanyin, “White-Robed Guanyin,” coated in white glaze, hollow inside. The face is oval like a goose egg, with delicate features. Wearing a white robe, a chest necklace, and a white hood, seated in a relaxed pose.
- Source: National Palace Museum, Taipei
- Similarity: 0.9279872

Image search results:
- Name: 16th–18th Century White-Porcelain Arhat Figure
- Glaze Color: White porcelain
- Form: Arhat figure
- Description: The glaze is ivory-white with a slightly yellow tint, with multiple pinholes across the surface. The figure is modeled seated by the water, with a bald head, long ears, and bare feet. Draped in a robe exposing the chest, the folds are fluently sculpted. The relaxed pose shows the left hand holding a sutra scroll and the right hand tucked within the sleeve. The face features pronounced brows and nose ridge, expressing contemplation. Accompanied by a wooden base.
- Source: National Palace Museum, Taipei
- Similarity: 0.8567522

Image search results:
- Name: Tang Painted Grey-Terracotta Female Figure
- Dynasty: Tang
- Glaze Color: Painted
- Decoration: Lady motif
- Form: Tomb figure
- Description: The painted garment on this grey terracotta female figure has mostly flaked away, yet the form still fully expresses the features of Tang court women (“large hair buns, loose robes”) and a plump figure. She stands upright with head raised, long eyes and small mouth, wearing a loose, pointed-collar robe draping from the neck. The lightly incised sleeve and fold lines resemble expressive ink strokes, accentuating the rounded body. The hem reaches the ground, with pointed shoes peeking out.
The right arm lifts a wide sleeve to the chest, with the left hand dropping slightly, and the calm expression enhances the graceful bearing. The full face framed by a double-bun hairstyle embodies the Tang ideal of beauty. The era favored full-bodied women with confident, healthy elegance, a trend that continued into the Late Tang and Five Dynasties.
- Source: National Palace Museum, Taipei
- Similarity: 0.84899557

A: Based on the analysis, the characteristics of this porcelain piece closely match the typical glaze tone, vessel form, and craftsmanship of Dehua ware. Further verification through tool results confirms that its style, glaze, and production techniques are consistent with those of Dehua kilns from the Ming dynasty.

{"name": "search_text", "arguments": {"query": "Ming dynasty Dehua kiln white-glazed standing Guanyin figure"}}

User: Successfully found the following content:

Text search results:
- Text: Ming, 17th century, Dehua white-glazed seated Guanyin figure, showing a Guanyin statue with hands hidden within the folds of her robe.
- Title: Ming, 17th Century Dehua White-Glazed Seated Guanyin Figure
- Source: Christie’s 2015 Spring Auction Catalogue: Important Chinese Ceramics
- Similarity: 0.8

Text search results:
- Text: White porcelain from the Dehua kilns of the Ming dynasty is distinctive, especially the religious sculptures of Guanyin, Bodhidharma, and other figures created by the master sculptor He Chaozong.
- Title: White Porcelain of the Ming Dynasty Dehua Kilns
- Source: Palace Museum Ceramic Encyclopedia
- Similarity: 0.6398200054682729

Image search results:
- Name: Qing Dehua Kiln Guanyin
- Dynasty: Qing
- Glaze Color: White glaze
- Form: Guanyin figure
- Description: Guanyin, “White-Robed Guanyin,” coated in white glaze, hollow inside. The face is oval like a goose egg, with delicate features.
Wearing a white robe, a chest necklace, and a white hood, seated in a relaxed pose.
- Source: National Palace Museum, Taipei
- Similarity: 0.19999999999999996

A: Based on the information returned by the tool, this porcelain piece is a white-glazed standing Guanyin figure from the Dehua kilns of the Ming dynasty. Its glaze, form, and workmanship are consistent with the typical characteristics of this period. It can therefore be identified as “Ming Dehua Kiln White-Glazed Standing Guanyin Figure.” I will now further zoom in on the details for confirmation.

{"name": "image_zoom_in_tool", "arguments": {"index": 1, "bbox_2d": [112, 114, 670, 1004], "label": "Guanyin"}}

User: Successfully zoomed in on the region labeled “Guanyin” at [112, 114, 670, 1004].

A: Based on the analysis of the porcelain and the verification through the tool results, this piece is a white-glazed standing Guanyin figure produced by the Dehua kilns in the Ming dynasty. It exhibits the typical glaze tone, vessel form, and craftsmanship associated with that period, and thus corresponds to the standard designation “Ming Dehua Kiln White-Glazed Standing Guanyin Figure.”

Ming Dehua Kiln White-Glazed Standing Guanyin Figure

Tendrilled Lotus Pattern (缠枝莲纹)

User: Please determine the decorative motif of this antique. After your analysis, choose to either invoke a tool or respond directly, and place the final identification result within the tags.

A: The decorative motif of this porcelain piece is a tendrilled lotus pattern. Its characteristics include gracefully flowing vine scrolls and full, symmetrical lotus blossoms, creating a complex yet orderly overall composition. This style is typical of imperial ware from the Yongzheng period of the Qing dynasty. From the image, the lines of the motif appear delicate and precisely rendered, demonstrating a high level of carved-decoration craftsmanship.
To further verify its stylistic period and craftsmanship, I intend to invoke the search image tool and input the index of this porcelain image (starting from 1) in order to search for similar examples in ancient collections that can assist in identification.

{"name": "search_image", "arguments": {"index": 1}}

User: Successfully found the following content:

Image search results:
- Name: Qing Yongzheng Imitation Longquan Kiln Phoenix-Tail Vase
- Dynasty: Qing
- Reign: Yongzheng
- Glaze Color: Celadon glaze
- Decoration: Tendrilled floral motif
- Form: Vase
- Description: Flared mouth with straight neck, swollen body tapering downward and flaring outward at the foot. The entire vessel is coated in a celadon glaze. Several string lines encircle the mouth rim. The neck and belly are each decorated with floral blossoms and tendrilled scroll motifs, with a band of chrysanthemum-petal pattern around the lower belly. The glaze tone is pale and evenly applied. Inside the footring is a six-character, three-column seal-script mark: “Made in the Yongzheng period of the Great Qing.”
- Source: National Palace Museum, Taipei
- Similarity: 0.9185529

Image search results:
- Name: Qing Yongzheng Winter-Green Celadon Vase with Incised Decoration
- Dynasty: Qing
- Reign: Yongzheng
- Glaze Color: Winter-green glaze
- Decoration: Incised motif
- Form: Vase
- Description: Flared mouth with straight neck, swollen body tapering downward and flaring outward at the foot. The entire vessel is coated in a celadon glaze. Several string lines encircle the mouth rim. The neck and belly are each decorated with floral blossoms and tendrilled scroll motifs, with a band of chrysanthemum-petal pattern around the lower belly. The glaze tone is pale and evenly applied.
Inside the footring is a six-character, three-column seal-script mark: “Made in the Yongzheng period of the Great Qing.”
- Source: National Palace Museum, Taipei
- Similarity: 0.9150853

Image search results:
- Name: Qing Yongzheng Winter-Green Celadon Vase with Incised Decoration
- Dynasty: Qing
- Reign: Yongzheng
- Glaze Color: Winter-green glaze
- Decoration: Incised motif
- Form: Vase
- Description: Flared mouth with straight neck, swollen body tapering downward and flaring outward at the foot. The entire vessel is coated in a celadon glaze. Several string lines encircle the mouth rim. The neck and belly are each decorated with floral blossoms and tendrilled scroll motifs, with a band of chrysanthemum-petal pattern around the lower belly. The glaze tone is pale and evenly applied. Inside the footring is a six-character, three-column seal-script mark: “Made in the Yongzheng period of the Great Qing.”
- Source: National Palace Museum, Taipei
- Similarity: 0.9142115

A: Based on the results returned by the tool, the decorative motif on this porcelain piece is indeed a tendrilled lotus pattern. Its stylistic features closely correspond to those found on imperial wares from the Yongzheng period of the Qing dynasty, further substantiating the accuracy of its period attribution and craftsmanship.

{"name": "image_zoom_in_tool", "arguments": {"index": 2, "bbox_2d": [112, 114, 888, 987], "label": "tendrilled lotus pattern"}}

User: Successfully zoomed in on the region labeled “tendrilled lotus pattern” at [112, 114, 888, 987].

A: The decorative motif of this porcelain piece is a typical tendrilled lotus pattern. The scrolling lines are graceful and fluent, and the lotus blossoms are rendered with delicate precision, fully reflecting the refined craftsmanship and aesthetic pursuit of imperial porcelain from the Yongzheng period of the Qing dynasty.
Tendrilled lotus pattern

9 Prompts

9.1 Prompt for Metadata Enrichment

Listing 1: Prompt for Metadata Enrichment

    Please describe the Chinese porcelain shown in the image: . Then revise the description according to the human experts annotation:.
    The final description should comprehensively cover six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape.

9.2 Prompt for VQA Data Generation

Listing 2: Prompt for VQA Data Generation

    Given the expert-enhanced description of a porcelain artifact:
    Your task is to extract four categories of information from this description:
    (1) Dynasty, (2) Vessel Shape, (3) Glaze Color, and (4) Decorative Motifs.
    Construct one QA pair at a time. For the current step, answer **only one** of the four categories.
    Format:
    Q: A question that asks specifically about one category (e.g., dynasty, vessel shape, glaze color, or decorative motifs).
    A: A concise but complete answer extracted from the description, including all relevant descriptive details.
    Do not include information from the other three categories in the answer.
    Do not infer beyond the given description; extract only what is stated.
    Generate the QA pair for the following category:

9.3 Prompt for Generating Multiple-choice Options

Listing 3: Prompt for generating multiple-choice options

    You are an expert in porcelain-related question design. Below is a porcelain-related question and its correct answer. Please transform this question into a multiple-choice question with four options, where one option is correct and the other three are plausible but incorrect.
    Please output strictly in the following format:
    Content of option A
    Content of option B
    Content of option C
    Content of option D
    Letter of the correct option
    Notes:
    - The options must be relevant to the given question and answer.
    - The incorrect options should be misleading but distinguishable from the correct one.
    - Do not add any other text or explanations outside the specified format.

9.4 Prompt for LLM-as-a-Judge in Evaluation

Listing 4: Prompt for LLM-as-a-Judge in Evaluation

    You are an expert reviewer in the field of antique Chinese porcelain, specialized in evaluating the accuracy of model identification results.
    Please compare the following "reference answer" and "model output" and produce a comprehensive evaluation according to the rules below across 6 dimensions.
    ---
    ### Scoring requirements:
    For each of the seven dimensions listed below, assign an individual score between 0 and 1. Adopt a conservative scoring style.
    ### Detailed scoring rules:
    1.**Dynasty accuracy**
    Whether the output mentions and correctly states the dynasty to which the porcelain belongs, and whether the provided dynasty information is sufficiently precise.
    2.**Reign period accuracy**
    Whether the output mentions and correctly states the imperial reign (e.g., Kangxi, Qianlong); note any deviation or omission.
    3.**Kiln site accuracy**
    Whether the output mentions and correctly states the kiln site characteristics (e.g., Jingdezhen, Ru Kiln); note any deviation or omission.
    4.**Glaze color accuracy**
    Whether the output mentions and correctly states the glaze/color characteristics (e.g., blue-and-white, famille-rose, red-ground with green enamels); note any deviation or omission.
    5.**Decoration / Motif accuracy**
    Whether the output correctly describes the decorative motifs or subjects (e.g., dragons and phoenixes, floral patterns, human figures, cloud patterns), and whether the motif matches the stylistic expectations of the claimed period.
    6.**Form / Vessel-type accuracy**
    Whether the output reasonably identifies and describes the vessel form, and whether this conforms with the reference answer.
    Note: If a particular dimension is absent in the reference answer, mark it with -1 to indicate missing data.
    Please first provide the reasoning for each score, and then place the final numeric scores inside the following tags in order:
    ... ... ... ... ... ...
    For example:
    1.0 0.6 -1.0 1.0 0.0 0.8
    Reference answer: {ground_truth}
    Model output: {prediction}

9.5 Prompt for LLM-as-a-Judge in Training

The training-time LLM-as-a-Judge prompt is largely aligned with the evaluation-time version; the only difference is the addition of a Format consistency criterion to mitigate potential reward hacking during reinforcement learning.

Listing 5: Prompt for LLM-as-a-Judge in Training

    You are an expert reviewer in the field of antique Chinese porcelain, specialized in evaluating the accuracy of model identification results.
    Please compare the following "reference answer" and "model output" and produce a comprehensive evaluation according to the rules below across seven dimensions.
    ---
    ### Scoring requirements:
    For each of the seven dimensions listed below, assign an individual score between 0 and 1. Adopt a conservative scoring style.
    ### Detailed scoring rules:
    1.**Format consistency**
    Whether the output strictly follows the naming order: "Dynasty", "Reign period", "Kiln site", "Glaze color", "Decoration motif", "Vessel shape". Some elements may be missing, but the order must not be disrupted. The model output must contain only a single standard name; it must not include any explanatory, descriptive, or other additional text. If the order is correct but some fields are missing, 1 score may still be given.
    2.**Dynasty accuracy**
    Whether the output mentions and correctly states the dynasty to which the porcelain belongs, and whether the provided dynasty information is sufficiently precise.
    3.**Reign period accuracy**
    Whether the output mentions and correctly states the imperial reign (e.g., Kangxi, Qianlong); note any deviation or omission.
    4.**Kiln site accuracy**
    Whether the output mentions and correctly states the kiln site characteristics (e.g., Jingdezhen, Ru Kiln); note any deviation or omission.
    5.**Glaze color accuracy**
    Whether the output mentions and correctly states the glaze/color characteristics (e.g., blue-and-white, famille-rose, red-ground with green enamels); note any deviation or omission.
    6.**Decoration / Motif accuracy**
    Whether the output correctly describes the decorative motifs or subjects (e.g., dragons and phoenixes, floral patterns, human figures, cloud patterns), and whether the motif matches the stylistic expectations of the claimed period.
    7.**Form / Vessel-type accuracy**
    Whether the output reasonably identifies and describes the vessel form, and whether this conforms with the reference answer.
    Note: If a particular dimension is absent in the reference answer, mark it with -1 to indicate missing data.
    Please first provide the reasoning for each score, and then place the final numeric scores inside the following tags in order:
    ... ... ... ... ... ... ...
    For example:
    1.0 1.0 0.6 -1.0 1.0 0.0 0.8
    Reference answer: {ground_truth}
    Model output: {prediction}

10 Detailed Configurations

SFT Setup. We conducted SFT using the ms-swift [49] framework for 1 epoch with a learning rate of $1 \times 10^{-5}$, a batch size of 16, and gradient accumulation over 4 steps. We used the default AdamW optimizer and set the maximum sequence length to 8192.

RL Setup. During RL training, we used the AdamW optimizer with a learning rate of $1 \times 10^{-6}$, a batch size of 128 with 16 rollouts per prompt, and a KL coefficient of $\beta = 0$.

Tool Setup. The initial image sent to the agent is limited to at most 313,600 pixels. For the image zoom-in tool, the zoom-in operation is performed on the initial image by mapping the bbox back to the original image size. We set $k = 3$ for both search image and search text, and the coefficient $\alpha$ is set to 0.2. The maximum number of tool invocations per query is set to 4 to balance reasoning depth against computational efficiency.

11 Comparison with Existing Datasets

Below, we compare our proposed CiQi-VQA dataset with several existing porcelain-related datasets (Table 6). We report the number of porcelain specimens in each dataset, as well as the number of categories covered under each of the six attributes. A "-" indicates that the dataset does not provide statistics for that attribute. As the comparison shows, our dataset is the largest and offers the most fine-grained attribute coverage among all datasets.

Table 6: Dataset Comparison.

| Datasets | Porcelain Specimens | Dynasties | Reigns | Kilns | Colors | Motifs | Shapes |
| Ling et al. [19] | 5,993 | 2 | - | 10 | 8 | - | 12 |
| Ma et al. [23] | 5,624 | - | - | - | - | - | 7 |
| Sun et al. [38] | 232 | 3 | - | - | - | - | 1 |
| Yang et al. [48] | 2,750 | - | - | - | 7 | - | - |
| Hu et al. [12] | 8,213 | 6 | - | - | 20 | 6 | 7 |
| CiQi-VQA (Ours) | 29,596 | 38 | 42 | 43 | 246 | 248 | 158 |

Table 7: Comparative experiments on the Hu et al. [12] dataset.

| Model | Glaze/Kiln | Shape | Weighted Avg.† |
| GPT-5 [30] | 86.9 | 91.4 | 88.1 |
| GPT-4.1 [31] | 86.4 | 88.6 | 87.0 |
| GPT-4o [29] | 86.4 | 91.4 | 87.7 |
| o3 [32] | 86.9 | 91.4 | 88.1 |
| Claude Opus 4 [1] | 84.4 | 90.0 | 85.9 |
| Qwen2.5-VL-72B-Instruct [35] | 82.4 | 88.6 | 84.0 |
| GLM-4.5V (106B) [9] | 85.4 | 91.4 | 87.0 |
| InternVL3.5-241B-A28B-Flash [15] | 74.4 | 90.0 | 78.4 |
| Kimi-VL-A3B-Instruct (16B) [17] | 74.9 | 85.7 | 77.7 |
| Hu et al. [12] | - | - | 70.0 |
| CiQi-Agent (Ours, 7B) | 87.4 | 92.9 | 88.9 |

† Weighted by item counts: $w_G = 199/269$, $w_S = 70/269$.

12 Additional Comparative Experiments

We conducted additional comparative experiments using the open-source dataset provided by Hu et al. [12]. In their work, Hu et al. [12] performed object detection and recognition tasks on Chinese porcelains, defining three attributes for recognition: vessel shape, glaze color/kiln site (treated as a single attribute in their paper), and decorative motif.
Each major attribute contains multiple sub-attributes (see Table 6 for details).

In our comparative study, because the released dataset is incomplete and the test set lacks annotations, we conducted the comparison using their validation set. We consider this comparison still meaningful because the paper states that the validation set and the test set were obtained by randomly sampling 20% and 10% of the full dataset, respectively. Therefore, the results reported on the test set in the original paper are also informative for the validation set and should not deviate significantly. Additionally, in the open-source dataset, the decorative motif attribute is not represented by images of Chinese porcelains; therefore, we restricted our experiments to the glaze color/kiln site and vessel shape attributes.

We generated multiple-choice questions for these two attributes, with the options corresponding to all sub-attributes under each attribute (20 for glaze color/kiln site and 7 for vessel shape). Specifically, the glaze color/kiln site attribute included 199 questions, while the vessel shape attribute contained 70 questions. We evaluated CiQi-Agent alongside nine MLLMs, including the GPT series, Claude, and others. The experimental results are summarized in Table 7.

The results show that on the open-source dataset provided by Hu et al. [12], our CiQi-Agent consistently outperforms all baseline models. Moreover, its fine-grained classification capability significantly surpasses the approach described in Hu et al. [12], improving average accuracy by 18.9 percentage points (88.9 vs. 70.0). The small performance gap between our model and the other MLLMs is largely due to the simplicity of this dataset, which includes only 7 vessel-shape categories and 20 glaze categories. In contrast, our constructed dataset includes more than 100 categories for both vessel shapes and glazes, making the task substantially more challenging.
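
The weighted average reported in Table 7 can be reproduced from the two per-attribute accuracies and the question counts given above (199 glaze color/kiln site questions and 70 vessel shape questions). The following is a minimal sketch; the function and constant names are illustrative and are not taken from the authors' code.

```python
# Question counts for the two evaluated attributes, as stated in the text.
N_GLAZE_KILN = 199
N_SHAPE = 70


def weighted_average(glaze_kiln_acc: float, shape_acc: float) -> float:
    """Combine two per-attribute accuracies (in %) weighted by question count."""
    total = N_GLAZE_KILN + N_SHAPE  # 269 questions in total
    return (N_GLAZE_KILN * glaze_kiln_acc + N_SHAPE * shape_acc) / total


if __name__ == "__main__":
    # GPT-5 row of Table 7: 86.9 (Glaze/Kiln) and 91.4 (Shape).
    print(round(weighted_average(86.9, 91.4), 1))
```

Because the per-attribute accuracies in the table are already rounded to one decimal place, a value recomputed this way can differ from the reported one in the last digit.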
