Integrating Multimodal Large Language Model Knowledge into Amodal Completion
Heecheol Yun 1  Eunho Yang 1,2
1 KAIST  2 AITRICS
{yoon6503, eunhoy}@kaist.ac.kr

Figure 1. Our method selectively leverages MLLMs to reason about the extent and content of occluded parts. Incorporating them into amodal completion effectively enhances performance.

Abstract

With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates this guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance.
Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.

1. Introduction

Imagine a situation where a desired object in a photo is unintentionally obscured by other foregrounds, preventing us from obtaining its full appearance. Amodal completion [6] is a task designed for such scenarios, with the goal of reconstructing the whole object based on its visible parts. In daily life, occlusion occurs frequently, making amodal completion highly valuable in a range of downstream applications, such as autonomous vehicles and robotics.

Similar to how humans infer the hidden parts of objects, amodal completion can greatly benefit from rich common-sense knowledge about real-world entities. However, existing methods [15, 21] overlook this and heavily rely on the generation capabilities of visual generative models [17] without providing sufficient guidance about the occluded regions. MC Diffusion [21] proposes a training-free approach that adapts the denoising process of Stable Diffusion (SD) for amodal completion, using only category-level text prompts as guidance. pix2gestalt [15] fine-tunes SD on synthetic amodal completion datasets, where SD is conditioned on the input image and a modal mask.

Figure 2. Completion results when meaningful parts of the target object are occluded. Stable Diffusion (SD) inpainting [17] often generates objects other than the target object. Existing amodal completion methods [1, 15, 21] lack an understanding of what should be generated for the missing parts. In contrast, our method provides explicit guidance on what should be reconstructed.
Although OWAAC [1] recently proposes to employ an MLLM [8] in its framework, its role is restricted to segmenting the target object from abstract user queries, while SD remains conditioned solely on category-level text prompts. Consequently, these methods often produce unnatural completions in challenging occlusion scenarios and require multiple sampling attempts with different seeds to obtain satisfactory results. As illustrated in Figure 2, existing methods generate an object entirely unrelated to the bicycle or fail to determine whether a person is sitting or standing based on the image context.

To overcome this limitation, we propose AmodalCG (Amodal Completion via MLLM Guidance), a novel framework that effectively leverages the rich physical-world knowledge embedded in Multimodal Large Language Models (MLLMs) [11, 13, 14, 19] for amodal completion. Specifically, AmodalCG identifies and integrates two key types of MLLM-derived guidance about the occluded content. First, the MLLM generates geometric guidance, estimating the true extent of the occluded regions. This guidance is necessary because SD becomes prone to erroneous completions when the inpainting mask is excessively larger than the actual object. To address this, AmodalCG employs the MLLM to estimate the full extent of the target object and uses this prediction to resize the inpainting mask accordingly. This offers explicit cues on how much of the object should be reconstructed, preventing over-extended completion and unintended content generation. The second is semantic guidance, which provides a detailed textual description of what should be generated in the occluded area. Once the inpainting mask is resized to fit the full object, the MLLM infers the appropriate content for the occluded region. This description is then used as a text prompt for SD, giving it explicit guidance on what needs to be filled in.
However, incorporating MLLM guidance into amodal completion presents two key challenges. First, generating MLLM guidance for every sample can be inefficient, as some cases, such as those with minimal occlusion, can already be effectively completed without detailed guidance. Second, the inherent ambiguity of amodal completion makes it difficult for MLLMs to produce accurate predictions about the hidden regions, particularly when estimating the size of the full target object.

To address these challenges, we introduce the following two strategies. First, before invoking a large-scale MLLM to generate detailed guidance, a lightweight model is used to assess the degree of occlusion and selectively trigger the large model. For samples with minimal occlusion, the framework proceeds without calling the large model, thereby reducing computational cost. Second, to alleviate the difficulty of estimating the extent of occluded regions, we adopt a multi-scale expansion strategy. Instead of producing a single estimate, the MLLM predicts multiple candidate scales for the full target object. Then, starting from the tightest prediction, our framework progressively verifies whether the target object can be fully reconstructed within each predicted region, selecting the most suitable prediction for completion. This multi-scale strategy increases the chances of the MLLM predicting an accurate object size, thereby facilitating reliable completion.

Our method enables amodal completion for open-world objects without additional training. It is simple yet highly effective, improving both amodal segmentation and occluded object recognition. Our contributions are summarized as follows:
• We propose AmodalCG, a framework that selectively integrates the rich common-sense knowledge of MLLMs into challenging amodal completion.
• We identify two key types of MLLM-derived guidance and alleviate the computational and uncertainty challenges of incorporating these guidance signals through selective large-model invocation and a multi-scale expansion strategy.
• Our method improves amodal segmentation by 5.49% and occluded object recognition by 2.92% compared to the baselines, highlighting MLLMs as a promising solution for challenging amodal completion.

2. Related work

Amodal Completion. Early studies on amodal completion primarily focused on training task-specific models. However, this training inherently requires ground-truth appearance for the invisible regions, which is costly to obtain. Consequently, these approaches are typically trained on narrow domains such as vehicles [10, 22], humans [26, 27], or indoor scenes [3, 4], resulting in limited generalization to categories outside the training data.

To overcome the limitations of these datasets, recent studies [1, 15, 21] have leveraged the power of large-scale diffusion models, such as Stable Diffusion (SD) [17], which are trained on massive datasets [18]. These approaches utilize SD to directly complete the appearance of occluded objects. pix2gestalt [15] fine-tunes SD on synthetic datasets curated for amodal completion. MC Diffusion [21] bypasses the expensive fine-tuning stage by proposing a training-free pipeline that first identifies occluders from segmentation masks [7, 12] and then inpaints the occluders' regions using SD. To prevent unintended generations, it clusters intermediate features of SD to retain only those features similar to the target object. Under a similar framework, OWAAC [1] enhances completion performance by employing more accurate category-level text prompts and refining inpainting masks through morphological operations.
OWAAC also employs an MLLM [8] in its framework, but its role is restricted to segmenting the target object and does not influence the completion process. In contrast, our method incorporates the rich real-world knowledge of MLLMs directly into the generation process, providing detailed and explicit guidance to SD for completing occluded regions.

3. AmodalCG: Amodal Completion via MLLM Guidance

Given an input image I along with a modal mask for the target object M_modal and, in some cases, its semantic category P_cat, recent amodal completion approaches utilize visual generative models, such as Stable Diffusion (SD) [17], to reconstruct the occluded regions. Under the same setting, we propose AmodalCG, a framework that leverages the rich real-world knowledge of MLLMs to guide the completion process. Our method consists of five main components. First, Sec. 3.1 introduces the Occluder Detection Module, which identifies occluders to form the inpainting mask M_inpaint. Next, Sec. 3.2 describes the Guidance Decision Module, which determines whether to invoke MLLM guidance for reconstructing the target object. We then present two modules that generate MLLM-derived information about the occluded regions: the Geometric Guidance Module (Sec. 3.3) and the Semantic Guidance Module (Sec. 3.4). Finally, Sec. 3.5 describes the Inpainting Module, which integrates both types of MLLM guidance into the completion process through multi-scale expansion. All prompts used in our framework are provided in Appendix Sec. 6. Fig. 3 illustrates an overview of our pipeline.

3.1. Occluder Detection

Given an input image I and a modal mask M_modal, the Occluder Detection Module outputs the inpainting mask M_inpaint, defined as the union of the occluder masks [1, 21].
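Concretely, the inpainting mask is the pixel-wise union of the binary occluder masks. A minimal boolean-array sketch (function and variable names are ours for illustration, not the authors' released implementation):

```python
import numpy as np

def build_inpainting_mask(occluder_masks):
    """Union of binary occluder masks -> inpainting mask M_inpaint."""
    m = np.zeros_like(occluder_masks[0], dtype=bool)
    for occ in occluder_masks:
        m |= occ.astype(bool)  # pixel-wise OR accumulates the union
    return m

# Two toy 4x4 occluder masks covering disjoint regions.
a = np.zeros((4, 4), dtype=bool); a[0, :2] = True
b = np.zeros((4, 4), dtype=bool); b[2:, 2:] = True
m_inpaint = build_inpainting_mask([a, b])
print(m_inpaint.sum())  # 6 pixels to inpaint (2 from a, 4 from b)
```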
To identify occluders, we first perform semantic segmentation [7] on I and then use a geometric order prediction network [9] to determine the occlusion order of each segment.

3.2. Selective Invocation of MLLM Guidance

The Guidance Decision Module selectively invokes MLLM guidance based on the level of occlusion. Since samples with minimal occlusion can be adequately reconstructed without additional guidance, our framework omits generating detailed MLLM guidance when the target object is regarded as minimally occluded. Unlike reasoning about occluded regions, deciding whether an object is nearly complete is a simpler task. Therefore, we employ a smaller-scale MLLM for the Guidance Decision Module to assess the necessity of MLLM guidance. Given an image of the isolated target object on a white background, the module outputs two pieces of information in JSON format: (1) a binary indicator specifying whether MLLM guidance is required, and (2) the category of the target object P_cat. In Secs. 3.3 and 3.4, we describe how this information is subsequently used to generate geometric and semantic guidance.

Figure 3. Overview of AmodalCG. Our framework first determines which samples would benefit from MLLM guidance (Guidance Decision Module).
For those requiring guidance, the MLLM generates two key types of information about the occluded part of the target object: (1) the bounding box size of the full target object (Geometric Guidance Module) and (2) textual descriptions of the occluded region (Semantic Guidance Module). These are then incorporated into the completion process through a multi-scale expansion strategy, which selects the appropriate bounding box scale among the MLLM's predictions (Inpainting Module).

Figure 4. Amodal completion results based on inpainting mask size. Unwanted objects are generated when the inpainting mask is substantially larger than the actual occluded region.

3.3. Estimating the Extent of Occluded Regions

One type of guidance used in our framework is geometric guidance, which represents the estimated size of the full target object, including its occluded regions. We first describe why this guidance is important for amodal completion and then detail how the Geometric Guidance Module predicts the extent of the full target object.

Geometric guidance is crucial to prevent unintended generation. A major reason existing amodal completion methods often generate undesired objects beyond the target object is the use of an excessively large inpainting mask M_inpaint compared to the actual occluded area. As illustrated in Fig. 4, an inpainting mask adjusted to fit the target object produces precise completions, whereas an unadjusted, overly large mask tends to generate unintended objects outside the target region. Based on this observation, the Geometric Guidance Module predicts the bounding box of the full target object and redefines the inpainting mask M*_inpaint as the intersection between the predicted bounding box M̂_bbox and the original inpainting mask M_inpaint:

    M*_inpaint = M̂_bbox ∩ M_inpaint.    (1)
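Eq. (1) amounts to a pixel-wise AND between the rasterized predicted bounding box and the original inpainting mask. A minimal numpy sketch, using the [x_min, y_min, x_max, y_max] box format that the module's prompts use (names are ours, not the released code):

```python
import numpy as np

def resize_inpainting_mask(m_inpaint, bbox_hat, height, width):
    """M*_inpaint = M_bbox AND M_inpaint (Eq. 1)."""
    x0, y0, x1, y1 = bbox_hat
    m_bbox = np.zeros((height, width), dtype=bool)
    m_bbox[y0:y1, x0:x1] = True          # rasterize the predicted box
    return m_bbox & m_inpaint            # intersect with the original mask

m = np.ones((8, 8), dtype=bool)          # overly large inpainting mask
m_star = resize_inpainting_mask(m, [2, 2, 6, 6], 8, 8)
print(m_star.sum())  # 16: only the 4x4 box region survives
```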
By providing the Inpainting Module with explicit guidance on how much of the object should be generated, our method effectively suppresses unnecessary object creation and prevents over-extension of the target object.

Estimating the extent of occluded regions. This geometric guidance is crucial for all samples, regardless of occlusion level, since even minimally occluded objects may have overly large inpainting masks (see Fig. 4 for an example). Accordingly, we adopt two different strategies based on the output of the Guidance Decision Module. If the module determines that MLLM guidance is unnecessary (e.g., samples with minimal occlusion), the framework assumes that only minor completion is required. In this case, M̂_bbox is obtained by slightly enlarging the modal bounding box with a fixed margin. Conversely, if MLLM guidance is deemed necessary, the framework assumes that extensive completion is required, and thus M̂_bbox is predicted by the MLLM. However, directly predicting an accurate M̂_bbox is challenging for the MLLM due to the high uncertainty inherent in occluded regions. Therefore, we jointly exploit the image generation capability of the Inpainting Module to mitigate this uncertainty. Specifically, the MLLM is instructed to predict three bounding boxes at different scales: tight, moderate, and coarse. Then, the Inpainting Module progressively evaluates each bounding box, starting from the tightest prediction, until the object can be fully reconstructed within the prediction, as further described in Sec. 3.5.

Input Prompt for the MLLM. We provide three types of information as a text prompt to the MLLM: (1) the coordinates of the modal bounding box, (2) the image size, and (3) the semantic category.

System Prompt: "You will be provided with an image of an object, the object's name, its visible bounding box in the format [x_min, y_min, x_max, y_max], and the size of the image as [height, width]."
"…, Provide tight, moderate, and coarse bounding boxes for the full object. …"

User: "The name of the object is elephant, and its visible bounding box is [399, 80, 637, 424]. The height and width of the image are [426, 640]."

MLLM: Tight: [350, 50, 637, 424]; Moderate: [320, 30, 637, 424]; Coarse: [300, 20, 637, 424]

Figure 5. Example of the Geometric Guidance Module predicting multi-scale bounding boxes for the full target object.

Figure 6. Without masking occluders, MLLMs tend to describe occluders as well, resulting in extraneous generation in the final output. We highlight the semantic category of the target object in blue and the descriptions of the occluders in red.

For the visual prompt, we isolate the target object on a white background and highlight its modal bounding box in red, allowing the MLLM to clearly associate the coordinates provided in the textual prompt with the corresponding region in the image. Based on this input, the MLLM predicts tight, moderate, and coarse bounding boxes of the full target object. Fig. 5 illustrates our method, and the exact prompts used in the experiments are provided in Appendix Sec. 6.

3.4. Generating Detailed Textual Descriptions

The Semantic Guidance Module extends beyond a category-level prompt P_cat by generating P_long, a detailed textual description specifying what should be generated in the occluded regions.
This module is in voked only when the “Man sitting on steps, wearing a white shirt and khaki shorts, hands resting on knees, casual sneakers.” System Prompt Your job is to speculate the obscured pa rt of the object inside the red box in the image and provide a Stable Diff usion prompt about the hidden part, using no more than 77 tokens. … A part of the man in the red box is obscured by black occluders. User: :MLLM Figure 7. Example of the Semantic Guidance Module generating detailed descriptions of occluded regions. Guidance Decision Module determines that MLLM guid- ance is necessary . Belo w , we describe ho w the MLLM gen- erates descriptions of occluded regions. P = ( P long , if Semantic Guidance Module is in voked , P cat , otherwise. Unlike existing description generation methods that fo- cus on visible parts of scenes, our focus is on generating descriptions for occluded regions of an object. Howe ver , describing the occluded parts, rather than the visible ones, presents a unique challenge. As shown in Figure 6 , existing methods [ 2 , 5 , 23 ], which typically use visual marks to in- dicate the tar get object, often include descriptions of the oc- cluders when they significantly overlap with the tar get ob- ject, leading to the generation of unintended objects in the final outputs. In the figure, although the rider and broccoli are occluders in each image, the MLLM provides descrip- tions about them, causing their inclusion in the final outputs. Thus, in amodal completion, it is crucial to prevent occlud- ers from influencing the MLLM’ s responses. Interestingly , we observe that using a visual prompt that masks the occluders effecti vely mitigates this issue. By re- moving the occluders’ appearance, this approach allo ws the MLLM to easily distinguish the occluders from the tar get object, thereby minimizing their influence on the MLLM’s response. 
It also allows the MLLM to understand the overall context of the image by preserving the appearance of the parts outside the occluders. Figure 7 illustrates our method for generating descriptions of occluded regions.

3.5. Completion with MLLM Guidance

Finally, the Inpainting Module reconstructs the appearance of the target object using the two types of guidance and outputs the completed object Î_amodal along with its segmentation mask M̂_amodal.

Figure 8. Qualitative evaluation of our method.

We first place the target object on a gray background I_bkgd and perform inpainting using the resized inpainting mask M*_inpaint and the text prompt P. After inpainting, we separate the reconstructed target object from I_bkgd by obtaining the background mask M̂_bkgd using SAM [7] and then inverting it to derive the amodal mask of the target object: M̂_amodal = (1 − M̂_bkgd) ∪ M_modal. This approach is used because segmenting the background is generally easier, whereas the target object often contains complex internal details.

Multi-scale expansion. As described in Sec. 3.3, the Geometric Guidance Module outputs three different scales of M̂_bbox when MLLM guidance is invoked. In such cases, the Inpainting Module leverages its image generation capability to determine an appropriate mask scale among them. Specifically, the module starts from the tightest prediction and checks whether the target object can be fully reconstructed within the prediction. If the object is successfully reconstructed, the generated result is returned as the final output; otherwise, the module proceeds to the next larger scale and repeats the verification. This progressive process ensures that the target object is fully reconstructed within the predicted region while avoiding unintended generation.
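The progressive scale selection above can be sketched as follows. This is a simplified sketch: inpaint and touches_boundary are stand-ins for the SD inpainting call and the boundary test of Sec. 3.5, and all names are our assumptions rather than the released implementation:

```python
def complete_multiscale(bboxes, inpaint, touches_boundary):
    """Try tight -> moderate -> coarse; stop once the object fits inside the box.

    bboxes: predictions ordered from tightest to coarsest.
    inpaint(bbox) -> generated object mask.
    touches_boundary(mask, bbox) -> True if the object spills out of the box.
    """
    result = None
    for bbox in bboxes:                      # tight, moderate, coarse
        result = inpaint(bbox)
        if not touches_boundary(result, bbox):
            return result, bbox              # fully reconstructed inside bbox
    return result, bboxes[-1]                # fall back to the coarsest scale

# Toy run: the object only fits inside the "moderate" box.
calls = []
def fake_inpaint(bbox):
    calls.append(bbox); return bbox
def fake_touch(mask, bbox):
    return bbox == "tight"                   # object spills out of the tight box
out, chosen = complete_multiscale(["tight", "moderate", "coarse"],
                                  fake_inpaint, fake_touch)
print(chosen)  # moderate
```

The loop mirrors the text: generation stops at the first scale whose result does not reach the box boundary, so the coarse box is only used when necessary.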
To verify whether the object has been fully reconstructed, we check whether the generated object touches the boundary of M̂_bbox. If the generated object reaches the boundary, we regard it as incomplete; otherwise, it is considered fully reconstructed. Although our method can continue to refine the mask by incrementally enlarging it beyond the coarsest prediction, we limit the expansion process to three scales in our experiments for efficiency.

4. Experiments

When evaluating amodal completion, two key aspects should be considered. The first is whether the target object is fully generated, and the second is whether the appearance of the object is naturally reconstructed. Following the evaluation methods of the baselines, we assess these aspects through two tasks: Amodal Segmentation and Occluded Object Recognition. We first present the performance of our method in amodal segmentation, followed by results for occluded object recognition. Finally, we validate the effectiveness of each component of our method.

Implementation. We use InternVL3.5-8B [20] as the Guidance Decision Module and GPT-4o [14] for both the Geometric and Semantic Guidance Modules. The Stable Diffusion v2 inpainting model [17] is used for the Inpainting Module. In the Geometric Guidance Module, a 10% margin is added to the modal bounding box when MLLM guidance is not invoked. Detailed experimental settings are provided in Appendix Sec. 6.

4.1. Amodal Segmentation

Evaluation Details. Amodal segmentation evaluates the similarity between the segmentation mask of the completed object and the ground truth mask of the full object using mean Intersection-over-Union (mIoU).
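The metric is the usual mask IoU averaged over samples. A minimal sketch (ours, not the evaluation code of the cited benchmarks):

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-Union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = (pred & gt).sum()
    union = (pred | gt).sum()
    return inter / union if union else 1.0

def miou(pairs):
    """Mean IoU over (predicted, ground-truth) mask pairs."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)

p = np.zeros((4, 4), dtype=bool); p[:2, :] = True   # predicted amodal mask
g = np.zeros((4, 4), dtype=bool); g[:, :2] = True   # ground-truth full mask
print(round(iou(p, g), 3))  # intersection 4, union 12 -> 0.333
```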
Although amodal segmentation does not consider the appearance of the reconstructed object, and multiple valid ground truth masks may exist for the occluded regions, it allows us to assess whether the target object is accurately reconstructed, without incompletion or overextension, by comparing the similarity of the masks. We evaluate our method on three datasets: COCO-A [28], BSDS-A [28], and MP3D-A [25]. These datasets include a variety of objects commonly found in everyday life and are the most frequently used datasets for amodal segmentation.

              COCO-A                  BSDS-A                  MP3D-A
Method        Hard   Moderate Easy    Hard   Moderate Easy    Hard   Moderate Easy
pix2gestalt   64.77  80.32    85.93   55.74  86.53    87.35   48.81  70.92    78.74
MC Diffusion  59.01  73.17    86.90   57.37  64.73    74.59   42.66  64.29    74.67
OWAAC         51.06  62.41    78.72   54.10  64.86    71.56   46.76  66.52    75.75
Ours          75.09  86.49    92.37   67.09  86.60    90.25   51.73  75.83    84.09

Table 1. Results on amodal segmentation (mIoU, %) by occlusion ratio.

Results. Table 1 shows that our method outperforms all baselines and remains the most robust under high-occlusion scenarios. Following [21], we define samples with an occlusion ratio above 0.5 as hard, below 0.2 as easy, and the rest as moderate. As illustrated in Figure 8, existing methods struggle with heavily occluded objects due to their inability to incorporate detailed information about the occluded regions. In contrast, our method generates detailed guidance about occluded regions by leveraging the rich knowledge of the MLLM, thereby significantly improving amodal segmentation performance compared to the baselines. Specifically, our method prevents the generation of extraneous objects by utilizing an inpainting mask that fits the target object's size, and generates more natural reconstructions through progressive mask expansion guided by detailed prompts. The effectiveness of each component of our method is further explored in Sec. 4.3.

4.2. Occluded Object Recognition

Evaluation Details. Occluded object recognition is a classification task over occluded objects, which allows us to evaluate how well the appearance of the occluded object has been restored. Following the setting used in pix2gestalt, we employ CLIP [16] as the classification model and evaluate objects placed on a white background. For the dataset, we utilize the Occluded and Separated COCO datasets [24], which contain 80 COCO semantic categories. The Occluded COCO dataset consists of occluded objects represented as a single segment, while the Separated COCO dataset consists of occluded objects represented as multiple segments, making it more challenging.

Method          Occluded            Separated
                Top 1↑   Top 3↑     Top 1↑   Top 3↑
No completion   34.00    49.26      21.10    34.70
pix2gestalt     43.39    58.97      31.15    45.77
MC Diffusion    44.74    62.07      34.50    49.72
OWAAC           40.50    55.30      27.83    40.97
Ours            45.06    62.99      40.01    56.70

Table 2. Quantitative evaluation on occluded object recognition using two datasets. We report Top-1 and Top-3 accuracy (%) using CLIP as the classification model.

Results. Table 2 demonstrates that our method is highly effective in completing the appearance of occluded objects compared to the baselines. As shown in the table, our method consistently outperforms existing approaches. This is because existing methods heavily rely on the prior knowledge of Stable Diffusion to reconstruct occluded objects without providing detailed guidance on what to generate for the occluded parts. In contrast, our approach leverages detailed information about the occluded regions generated by the MLLM to help the reconstruction of the occluded object, and our experimental results indicate that this approach is highly effective for amodal completion. Furthermore, our method performs well on the more challenging Separated COCO dataset, which demonstrates that our method is also effective on difficult samples.
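The Top-1/Top-3 accuracies in Table 2 can be computed from per-class classifier scores as follows. This is a generic sketch of the metric only; in the actual evaluation, the scores are CLIP similarities between the completed image and the 80 COCO category prompts:

```python
import numpy as np

def topk_accuracy(scores, labels, k):
    """scores: (N, C) per-class scores; labels: (N,) ground-truth class indices."""
    topk = np.argsort(scores, axis=1)[:, -k:]           # k highest-scoring classes
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return 100.0 * sum(hits) / len(hits)

scores = np.array([[0.1, 0.7, 0.2],    # predicted: class 1
                   [0.5, 0.3, 0.2],    # predicted: class 0
                   [0.2, 0.3, 0.5]])   # predicted: class 2
labels = np.array([1, 2, 2])
print(topk_accuracy(scores, labels, 1))  # samples 0 and 2 correct: ~66.7
```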
          COCO-A          BSDS-A          MP3D-A
          GCR↑   GSR↑     GCR↑   GSR↑     GCR↑   GSR↑
          94.18  48.73    98.17  35.02    99.62  -

Table 3. Guidance Call/Skip Rate (GCR/GSR) (%) of the Guidance Decision Module. GSR is not reported for MP3D-A, as it does not contain samples with low occlusion (i.e., < 10%).

4.3. Ablation Study

Figure 9. Failure cases where the Guidance Decision Module failed to skip guidance.

Method                                COCO-A mIoU↑ (%)
SD Inpainting (w/o guidance)          74.77
+ geometric guidance (single scale)   85.64
+ semantic guidance                   85.60
+ multi-scale expansion (Ours)        86.31

Table 4. Ablation study of our framework. Each component is sequentially added.

Analysis of Guidance Decision Module. We first evaluate whether the Guidance Decision Module appropriately invokes MLLM guidance when needed. The module is assessed in two aspects: (1) whether it correctly invokes guidance, measured by the Guidance Call Rate (GCR), and (2) whether it appropriately avoids unnecessary guidance, measured by the Guidance Skip Rate (GSR). Since ground-truth labels indicating whether guidance is required are unavailable, we approximate them by defining samples with an occlusion ratio greater than 50% as those that truly require guidance, and samples with less than 10% occlusion as those that likely do not. As shown in Tab. 3, our module accurately invokes guidance for samples that truly require it, showing a high GCR. In contrast, GSR is relatively lower. We found that this is mainly due to cases, such as those shown in Fig. 9, where the object is truncated by the image boundary or belongs to ambiguous background regions (i.e., trees). Since the occlusion ratio is computed within the image boundary, these objects exhibit low occlusion ratios despite being incomplete, which causes the module to invoke guidance even when the ratio is low.
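Under the proxy labels above (occlusion ratio > 50%: guidance truly required; < 10%: guidance should be skipped), GCR and GSR can be computed as follows (a sketch with names of our choosing, not the authors' evaluation script):

```python
def guidance_rates(samples):
    """samples: (occlusion_ratio, guidance_invoked) pairs; returns GCR, GSR in %."""
    need = [inv for r, inv in samples if r > 0.5]      # should call guidance
    skip = [not inv for r, inv in samples if r < 0.1]  # should skip guidance
    gcr = 100.0 * sum(need) / len(need) if need else None
    gsr = 100.0 * sum(skip) / len(skip) if skip else None
    return gcr, gsr

samples = [(0.8, True), (0.6, True), (0.7, False),  # heavy occlusion
           (0.05, False), (0.02, True)]             # minimal occlusion
print(guidance_rates(samples))  # GCR ~66.7, GSR = 50.0 on this toy data
```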
Analysis of Geometric Guidance Module. Our geometric guidance, which adjusts the inpainting mask to align with the actual size of the target object, plays a key role in the performance improvement: applying it alone already yields strong results, as illustrated in Table 4. We attribute this to the frequent occurrence of substantial occluders in natural scenes, which causes the inpainting mask to become overly large and leads to extraneous generation. Additionally, we examine how completion results can be refined by our multi-scale expansion strategy in Fig. 10. As illustrated, our method enables the controllable generation of objects by utilizing masks of different scales.

Figure 10. Completion results across different mask scales.

Figure 11. (a) Image of the target object. (b) Completion results with a semantic category as a text prompt. (c) Completion results with a description generated without masking occluders. (d) Ours.

Analysis of Semantic Guidance Module. After resizing the inpainting mask to match the target object's size, the detailed descriptions from the Semantic Guidance Module help restore the object with a plausible pose and appearance. As shown in Fig. 11, these descriptions effectively capture the pose and visual characteristics of the hidden regions, enabling the generation of realistic appearances such as the cow's patterns or a clean plate.
However, the effect of semantic guidance on COCO-A is minimal, as shown in Tab. 4. We attribute this to the fact that COCO-A primarily consists of objects with limited pose variation, such as static items or natural backgrounds, and that the segmentation task itself does not account for appearance. In fact, our semantic guidance proves beneficial for occluded object recognition, as shown in Tab. 5. We present further analysis in Appendix Sec. 8.

Method                       Occluded           Separated
                             Top-1↑  Top-3↑     Top-1↑  Top-3↑
Ours w/o semantic guidance   44.54   62.65      39.04   56.33
Ours                         45.06   62.99      40.01   56.70

Table 5. Effectiveness of our semantic guidance in OOR.

5. Conclusion

In this paper, we proposed AmodalCG, a framework that selectively harnesses the rich real-world knowledge of MLLMs to guide amodal completion. First, the Guidance Decision Module selectively invoked MLLM guidance by assessing the level of occlusion. For samples requiring guidance, our framework generated two key types of guidance from MLLMs to improve the completion process. First, geometric guidance provided cues on how much of the object should be reconstructed, thus preventing extraneous generation. Second, semantic guidance offered detailed instruction about what should be generated. During the completion process, we exploited the strengths of both MLLMs and visual generative models through a multi-scale expansion strategy. Experimental results demonstrated that MLLMs can effectively enhance amodal completion, offering a promising direction for integrating large multimodal reasoning into this task.

References

[1] Jiayang Ao, Yanbei Jiang, Qiuhong Ke, and Krista A. Ehinger. Open-world amodal appearance completion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6490–6499, 2025.
[2] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao.
Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
[3] Helisa Dhamo, Nassir Navab, and Federico Tombari. Object-driven multi-layer scene decomposition from a single image, 2019.
[4] Kiana Ehsani, Roozbeh Mottaghi, and Ali Farhadi. SeGAN: Segmenting and generating the invisible, 2018.
[5] Songtao Jiang, Yan Zhang, Chenyi Zhou, Yeying Jin, Yang Feng, Jian Wu, and Zuozhu Liu. Joint visual and text prompting for improved object-centric perception with multimodal large language models. arXiv preprint arXiv:2404.04514, 2024.
[6] Gaetano Kanizsa, Paolo Legrenzi, and Paolo Bozzi. Organization in vision: Essays on Gestalt perception, 1979.
[7] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[8] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
[9] Hyunmin Lee and Jaesik Park. Instance-wise occlusion and depth orders in natural scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21210–21221, 2022.
[10] Huan Ling, David Acuna, Karsten Kreis, Seung Wook Kim, and Sanja Fidler. Variational amodal object completion. In Advances in Neural Information Processing Systems, pages 16246–16257. Curran Associates, Inc., 2020.
[11] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[12] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[13] OpenAI. GPT-4V(ision) system card, 2023. https://cdn.openai.com/papers/GPTV_System_Card.pdf.
[14] OpenAI. GPT-4o system card, 2024. https://cdn.openai.com/gpt-4o-system-card.pdf.
[15] Ege Ozguroglu, Ruoshi Liu, Dídac Surís, Dian Chen, Achal Dave, Pavel Tokmakov, and Carl Vondrick. pix2gestalt: Amodal segmentation by synthesizing wholes. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3931–3940. IEEE Computer Society, 2024.
[16] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[17] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[18] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[19] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[20] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[21] Katherine Xu, Lingzhi Zhang, and Jianbo Shi. Amodal completion via progressive mixed context diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9099–9109, 2024.
[22] Xiaosheng Yan, Feigege Wang, Wenxi Liu, Yuanlong Yu, Shengfeng He, and Jia Pan. Visualizing the invisible: Occluded vehicle segmentation and recovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7618–7627, 2019.
[23] Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, and Jian Yang. Fine-grained visual prompting. Advances in Neural Information Processing Systems, 36, 2024.
[24] G. Zhan, W. Xie, and A. Zisserman. A tri-layer plugin to improve occluded detection. arXiv preprint arXiv:2210.10046, 2022.
[25] Guanqi Zhan, Chuanxia Zheng, Weidi Xie, and Andrew Zisserman. Amodal ground truth and completion in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28003–28013, 2024.
[26] Ni Zhang, Nian Liu, Junwei Han, Kaiyuan Wan, and Ling Shao. Face de-occlusion with deep cascade guidance learning. IEEE Transactions on Multimedia, 25:3217–3229, 2023.
[27] Qiang Zhou, Shiyin Wang, Yitong Wang, Zilong Huang, and Xinggang Wang. Human de-occlusion: Invisible perception and recovery for humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3691–3701, 2021.
[28] Yan Zhu, Yuandong Tian, Dimitris Metaxas, and Piotr Dollár. Semantic amodal segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1464–1472, 2017.
Integrating Multimodal Large Language Model Knowledge into Amodal Completion
Supplementary Material

6. Implementation and Evaluation Details

In this section, we provide the implementation details of our experiments.

6.1. Guidance Decision Module

Visual Prompt. We segment the visible parts of the target object and place them on a white background. Then, we crop the image with a 100-pixel margin around the target object to create the visual prompt.

Text Prompt. We present the exact prompt used in the Guidance Decision Module, which produces two outputs: (1) whether MLLM guidance is required, and (2) the semantic category of the target object.

You are an expert visual annotator. You are given an image where a red bounding box highlights a target subject. Follow these instructions carefully and output only a valid JSON object.
Instructions:
1. Determine amodal completion requirement:
- Output "no" only if you are confident that the target object in the red box is already complete, minimally occluded, or cannot be further extended.
- Otherwise, output "yes".
- If you are uncertain, output "yes".
2. Identify the category name:
- Output the category name of the target subject.
3. Do not include explanations, reasoning, or any text outside of the JSON.
Output format:
{ "requires extensive completion": "yes" | "no", "category": str }
Example output:
{ "requires extensive completion": "yes", "category": "Bear" }

6.2. Geometric Guidance Module

Visual Prompt. We segment the visible parts of the target object and place them on a white background. We use the full-sized image as a visual prompt, allowing the MLLM to consider the original image scale when predicting the size of the full object.

Text Prompt. The user prompt is described in Fig. 5. Below is the exact system prompt we used to make the MLLM predict the size of the full target object.
System Prompt: You will be provided with an image of an object, the object's name, its visible bounding box in the format [x_min, y_min, x_max, y_max], and the size of the image as [height, width]. The bounding box is marked with a red box in the image. The object in the red box is partially obscured and your task is to estimate three bounding boxes for the entire object, including both the visible and invisible parts. Provide tight, moderate, and coarse bounding boxes for the full object in the format [x_min, y_min, x_max, y_max]. The tight bounding box should include minimal margin around the visible parts of the object. Just provide the three bounding boxes, without explanation.

6.3. Semantic Guidance Module

Visual Prompt. We mark the visible parts of the target object with a bounding box and mask the occluders. Then, we crop the image with a 100-pixel margin around the target object to create the visual prompt. This encourages the MLLM to focus on the surrounding regions of the target object when generating descriptions.

Text Prompt. The user prompt is described in Fig. 7. For the system prompt, we assign the MLLM the task of inferring the occluded parts and instruct it to describe only the target object without including any descriptions of the occluders and background. Below is the exact system prompt we used to generate descriptions of occluded regions.

System Prompt: Your job is to speculate the obscured part of the object inside the red box in the image and provide a Stable Diffusion prompt about the hidden part, using no more than 77 tokens. Do not include the names of the occluders and descriptions of the background in the prompt, focusing solely on the object. Your response should only contain the prompt. Your response should begin with a prefix that says 'Prompt:'.

6.4.
Evaluation Details

Given the inherent ambiguity of amodal completion, where multiple plausible answers can exist, previous approaches [1, 21] have primarily relied on user studies rather than quantitative metrics, or have evaluated similarity with incomplete, occluded objects [1], which is not robust when objects are heavily occluded. Therefore, to reliably evaluate whether the occluded regions are properly reconstructed, we adopt amodal segmentation and occluded object recognition as our main quantitative evaluation metrics, following the evaluation setting of pix2gestalt [15]. Amodal segmentation assesses reconstruction quality by comparing the mask of the reconstructed object with human-annotated or 3D-projected ground-truth masks. Occluded object recognition provides a complementary evaluation by checking whether the reconstructed appearance of the occluded region is consistent with the correct semantic category of the object, offering a robust measure of appearance fidelity.

We clarify our evaluation protocol. Unlike other baselines that perform multiple refinement steps through iterative generation, pix2gestalt completes the task in a single forward generation process without refinement. Therefore, to account for the inherent uncertainty of amodal completion and to ensure a fair comparison in computational cost, we evaluate pix2gestalt by generating three samples for each input and reporting their average performance.

7. Dataset Analysis

Fig. 12 shows occlusion statistics for each dataset. All datasets consist of real images, each accompanied by a single annotated ground-truth mask.

Figure 12. Occlusion percentage of the three datasets (histograms over occlusion-ratio bins [0.0–0.2], [0.2–0.5], [0.5–1.0] for COCO-A, BSDS-A, and MP3D-A).

Figure 13. Examples of the three datasets.
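The amodal-segmentation metric above averages a per-sample mask IoU between the reconstructed and ground-truth masks. A minimal version is sketched below; `mask_iou` is a hypothetical helper, not the authors' evaluation code.

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """Intersection-over-union between a predicted amodal mask and the
    ground-truth mask; averaging this over samples gives the mIoU used
    for amodal segmentation."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)
```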
COCO-A and BSDS-A include human-annotated masks, while MP3D-A relies on 3D projection, which introduces slight noise (see Fig. 13). High-occlusion samples are less common in COCO-A and BSDS-A due to the difficulty of manual annotation. These datasets are widely used to evaluate real-world applicability, as they are the only real-world datasets covering numerous categories [25].

8. Analysis on the Semantic Guidance Module

In this section, we explore the effectiveness of the visual prompt used in the Semantic Guidance Module and discuss why naive prompting falls short in amodal completion. Lastly, we present failure cases of the module.

To effectively guide amodal completion, descriptions of the occluded parts should satisfy two key conditions. First, they should exclude descriptions of the occluders, as including such descriptions may lead to the generation of unintended objects. Second, the descriptions must be consistent with the image context; otherwise, the target object may be unnaturally restored.

Masking occluders in the visual prompt enables the MLLM to generate descriptions that meet both conditions. We believe this approach helps the MLLM clearly distinguish the target object from the occluders, while the preserved shape of the occluders aids in inferring a plausible pose. To demonstrate this, we compare the MLLM's responses when using visual prompts with and without occluder masking. As shown in Fig. 14, without occluder masking, the MLLM often fails to distinguish the target object from the occluders, resulting in descriptions that include details about the occluders. In contrast, our visual prompt enables the MLLM to accurately distinguish the target object, even when the target object and the occluders overlap significantly. Furthermore, the descriptions generated using our visual prompt align well with the image context.
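As an illustration of this occluder-masked visual prompt, the construction could look as follows; `make_semantic_prompt`, its argument names, the white fill value, and the crop logic are assumptions for the sketch (matching the 100-pixel margin stated in Sec. 6.3), not the authors' code.

```python
import numpy as np

MARGIN = 100  # pixel margin around the target, per Sec. 6.3

def make_semantic_prompt(image, target_mask, occluder_mask):
    """Blank out occluder pixels to white, then crop around the visible
    target with a fixed margin. A sketch of the visual-prompt
    construction; the authors' exact implementation may differ."""
    out = image.copy()
    out[occluder_mask] = 255  # mask the occluders
    ys, xs = np.nonzero(target_mask)
    y0 = max(0, ys.min() - MARGIN)
    y1 = min(image.shape[0], ys.max() + MARGIN + 1)
    x0 = max(0, xs.min() - MARGIN)
    x1 = min(image.shape[1], xs.max() + MARGIN + 1)
    return out[y0:y1, x0:x1]
```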
Figure 14 examples (without occluder masking vs. ours):
- Without masking: "A dapple gray horse standing with a rider, wearing a bridle and saddle, with muscular legs and a well-groomed coat." / Ours: "A majestic gray horse with a strong, muscular neck and a flowing mane, wearing a bridle and saddle, standing gracefully."
- Without masking: "A smooth, light-colored wall with slight imperfections and cracks, featuring a mounted green metal headboard with vertical bars." / Ours: "A smooth, peach-colored wall with subtle shadows and slight imperfections."
- Without masking: "A snowy mountain with ski slopes, tall evergreen trees, and skiers in winter gear." / Ours: "Majestic snow-covered mountain peaks under a clear blue sky, sunlight casting soft shadows on the slopes."
- Without masking: "Butter paper, translucent, slightly crinkled, covering a portion of a hot dog, edges slightly torn, thin and glossy texture." / Ours: "Crumpled butter paper with translucent texture, slightly greasy, reflecting light, partially covering fried food."
- Without masking: "A white plate with crispy breaded chicken, steamed broccoli, sliced carrots, and mashed potatoes with gravy." / Ours: "A white ceramic plate with a glossy finish, subtle decorative patterns along the rim, elegant and simple design."
- Without masking: "A human finger holding a bottle, partially obscured, natural skin tone, realistic detail." / Ours: "A human finger with natural skin tone, slightly bent, with visible knuckles and smooth texture."
- Without masking: "A hand holding an ice cream sandwich with fingers gently wrapped around the cone." / Ours: "A human hand with fingers slightly curled, natural skin tone, realistic texture, visible knuckles and fingernails."
- Without masking: "A sleek, modern towel rack with a metallic finish, holding a neatly folded towel." / Ours: "A sleek, chrome towel rack with a modern, minimalist design, featuring smooth, curved bars and a polished finish."
- Without masking: "A man wearing a yellow jacket, black pants, and ski boots, holding ski poles, standing on a snowy slope." / Ours: "Man skiing down a snowy slope, wearing a winter jacket and ski pants, holding ski poles, snow-covered mountains in the background."
- Without masking: "A white ceramic plate with a glossy finish, partially covered by vibrant green lettuce leaves and colorful vegetables." / Ours: "A white ceramic plate with a smooth, glossy surface and a subtle rim detail."

Figure 14. A comparison of the descriptions generated by the MLLM when using visual prompts with and without occluder masking. In the descriptions, the semantic category of the target object is highlighted in bold, and descriptions of the occluders are highlighted in red.

Figure 15. Examples of inpainting masks generated by our method (original inpainting mask; tight, moderate, and coarse masks). The target object is highlighted with a red box in each image. The target objects are a house in the first image, a man in the second, and a car in the third.

Figure 16. The MLLM occasionally includes descriptions of objects that are not present in the image but are related to the image context, e.g., "A white ceramic bowl filled with sliced fruits like bananas, kiwis, and oranges, smooth glossy surface, round shape." and "Woman standing, wearing a sleeveless top and jeans, holding a microphone."

8.1. Analysis on Failure Cases

We observe that our method is not entirely free from the co-occurrence bias of MLLMs. As shown in Fig. 16, while our method effectively enables the MLLM to distinguish between the target object and the occluder, it occasionally includes descriptions of co-occurring objects. For instance, in the figure, the MLLM mentions sliced fruits or a microphone, even though they are not present in the image.

9. Analysis on the Geometric Guidance Module

Method                    COCO-A mIoU↑ (%)   BSDS-A mIoU↑ (%)
Original inpainting mask  48.00              52.02
Our inpainting mask       78.97              74.60

Table 6. Accuracy of our inpainting mask.

In this section, we further analyze our approach to predicting the full target object size using the MLLM. First, we present examples of the inpainting masks generated by our method. Second, we evaluate the effectiveness of our
method in predicting the true extent of the object. Finally, we show some failure cases of the geometric guidance module.

9.1. Examples of Our Inpainting Mask

Fig. 15 shows examples of inpainting masks generated by our method. As illustrated in the figure, our approach incorporates three different scales of masks, each reflecting the characteristics of the target object. For instance, small inpainting masks are generated when only the man's head is occluded, whereas larger inpainting masks are created when substantial portions of the car or the house are occluded. This demonstrates that our method produces reasonable size estimations and effectively prevents the use of excessively large inpainting masks by adjusting the mask size to match the size of the full target object.

9.2. Effectiveness of the Geometric Guidance Module

To evaluate the capability of our method in estimating the target object's size, we assess the accuracy of our resized inpainting mask derived from the predicted bounding box for the full target object. We compare the similarity between the inpainting mask adjusted using the ground-truth bounding box of the full target object and the mask adjusted by our method by computing mIoU. As shown in Tab. 6, our method leverages geometric guidance to generate inpainting masks that match the target object, effectively preventing the use of overly large masks.

Figure 17. Examples of failure cases. Our method tends to avoid overly aggressive estimation, even in the case of coarse estimation. (a) Fails to predict the size of a long fence (Tight: [430, 241, 640, 422]; Moderate: [400, 220, 640, 427]; Coarse: [370, 200, 640, 427]; Ground Truth: [238, 167, 639, 417]). (b) Matches the ground-truth size well but still fails to predict a long fence (Tight: [344, 376, 480, 502]; Moderate: [320, 360, 480, 520]; Coarse: [300, 340, 480, 540]; Ground Truth: [299, 375, 479, 497]).

9.3.
Analysis on Failure Cases

Although our method produces reasonable estimations in most cases, as previously discussed, it sometimes avoids excessively aggressive estimation, even in the case of coarse estimation. As shown in Fig. 17, although the fences could be long, our method predicts only the size of short fences. Nevertheless, this limitation can be alleviated by incorporating more scales in the multi-scale expansion, enabling further enlargement of the mask at the expense of efficiency.

10. Additional Qualitative Results of Amodal Completion

In this section, we present additional qualitative results (Figs. 18 and 19) of our method in amodal completion.

Figure 18. Additional qualitative results of our method in amodal completion (input image, modal image, ours).

Figure 19. Additional qualitative results of our method in amodal completion (input image, modal image, ours).