No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection
Authors: Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao
Zunkai Dai¹, Ke Li¹†, Jiajia Liu², Jie Yang¹, Yuanyuan Qiao¹†
¹Beijing University of Posts and Telecommunications  ²Northwestern Polytechnical University
{daizk, like1990, janeyang, yyqiao}@bupt.edu.cn, liujiajia@nwpu.edu.cn

Fig. 1. Motivation. Left: Existing VAD methods rely on training with anomaly data from single scenarios, resulting in poor generalization capability to novel anomaly types or unseen scenarios. Right: Our LAVIDA model leverages MLLM to understand deep anomaly semantics, enabling generalization to arbitrary anomaly types across diverse scenarios. The training data consists of pseudo anomaly data synthesized from external datasets, without incorporating any VAD data.

Abstract

The collection and detection of video anomaly data has long been a challenging problem due to its rare occurrence and spatio-temporal scarcity. Existing video anomaly detection (VAD) methods underperform in open-world scenarios.
Key contributing factors include limited dataset diversity and inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories. It further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension capabilities. Additionally, iii) we design a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and decrease computational cost. The training process is conducted solely on pseudo anomalies without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available at https://github.com/VitaminCreed/LAVIDA.

† Corresponding author.

1. Introduction

Video Anomaly Detection (VAD) aims to identify behaviors that deviate from normal patterns or represent unexpected events in video sequences. Classical VAD methods assume static scenes, closed-set anomaly categories, and stationary data distributions, yet these assumptions fail in dynamic environments where scenes evolve, behaviors shift, and distributions drift continuously. Recent works have reconceptualized VAD in open-world settings, where systems must detect unseen anomalies, operate without predefined taxonomies, and learn continuously. This capability is critical for safety-critical applications, enabling world models to perceive and adapt to unexpected events.

Recent open-set VAD [1, 8, 53] and open-vocabulary VAD [14, 34, 37] approaches have developed promising capabilities for detecting previously unseen anomaly types.
However, single-scene training limits their generalization to unseen scenarios. Some methods [2, 27, 39, 47] leverage multimodal large language models (MLLMs) to obtain anomaly scores, achieving training-free detection. However, they rely heavily on frame-wise or clip-wise text outputs generated by MLLMs, which significantly limits their practical applicability.

Improving the generalization performance of VAD models faces three primary challenges: i) Limited diversity in available anomaly datasets: existing VAD datasets contain limited scenarios and anomaly types, which restricts model learning capabilities and makes them inadequate for real open-world applications; ii) Context-dependent semantic interpretations of anomalies: anomaly semantics vary across scenarios, while current methods lack sufficient semantic understanding, fail to comprehend unseen scenarios and novel anomaly types, and struggle to adapt detection targets dynamically; and iii) Spatiotemporal sparsity of anomalies: anomalies often occupy minimal temporal or spatial regions, and the abundance of redundant visual information significantly increases computational cost. Moreover, detection models struggle to effectively leverage the coarse-grained (video-level) contextual semantic features generated by MLLMs, thereby overlooking localized anomalous patterns in spatiotemporal dimensions.

To address these challenges, we propose LAVIDA (LLM-Assisted VIdeo Anomaly Detection Approach), an end-to-end VAD framework that leverages MLLMs and requires no real VAD data during training. To overcome the limited diversity of anomalies in existing datasets, we design an Anomaly Exposure Sampler that transforms widely accessible semantic segmentation datasets containing diverse semantics into pseudo anomalies, thereby expanding VAD scenarios and anomaly types while eliminating the dependence of training on VAD data.
To enhance semantic understanding, we employ an MLLM-integrated semantic feature extractor to capture clip-level semantic representations, utilizing MLLMs' open-world understanding to significantly improve anomaly semantic comprehension and resolve context-dependency challenges. To enable the model to focus on local anomaly patterns given the spatiotemporal sparsity of anomalies, we apply a reverse-attention-based token compression method that substantially reduces irrelevant background visual information, and leverage learnable query tokens that simultaneously access clip-level context and frame-level details. Finally, we perform comprehensive anomaly detection at both frame and pixel granularities.

LAVIDA demonstrates exceptional generalization capabilities, achieving state-of-the-art performance in zero-shot detection scenarios. After training on the Anomaly Exposure datasets (external segmentation datasets), evaluations on four unseen datasets yield: 76.45% AUC on UBnormal, 85.28% AUC on ShanghaiTech, 82.18% AUC on UCF-Crime (outperforming unsupervised methods), 90.62% AP on XD-Violence (surpassing weakly-supervised methods), and 87.68% pixel-level AUC on UCSD Ped2 (current state-of-the-art pixel-level zero-shot performance).

In summary, our contributions are as follows:
• We propose an end-to-end zero-shot VAD framework, LAVIDA, which leverages MLLMs to extract video anomaly semantic representations and enables frame/pixel-level open-world anomaly detection.
• We introduce an Anomaly Exposure Sampler: a training strategy that repurposes segmentation targets as pseudo-anomalies, enabling training without VAD data and improving adaptability to diverse scenarios.
• We design a token compression method for LLM-based VAD models, which mitigates background interference and reduces computational costs for LLMs.
• Extensive experiments show that our method achieves state-of-the-art zero-shot performance, competitive results w.r.t. unsupervised VAD methods at the frame level, and competitive zero-shot performance at the pixel level.

2. Related Work

2.1. Traditional VAD Methods

Traditional video anomaly detection can be categorized into unsupervised and weakly-supervised approaches. Unsupervised methods assume that only normal samples exist in the training set and learn normal patterns through one-class classification (OCC) [25, 26, 31, 33, 44] or self-supervised tasks [5, 7, 16, 24, 40, 50]. Weakly-supervised VAD (WSVAD) detects anomalies using only video-level annotations, without requiring precise temporal or spatial localization [15, 28, 30, 32, 45]. Recent advances leverage pre-trained models and vision-language models to enhance detection performance [11, 19, 35, 41]. However, unsupervised methods struggle with unseen normal patterns and with anomalous patterns similar to normal ones, and weakly-supervised methods can only recognize anomaly types present in the training set.

2.2. Open-World VAD Methods

To improve model generalization to unknown anomaly types, researchers have proposed open-set VAD and open-vocabulary VAD approaches. Open-set VAD was first introduced by Acsintoae et al. [1] with a benchmark dataset and evaluation framework. Subsequent approaches have explored evidential deep learning with normalizing flows [53] and lightweight pose-based normalizing-flow frameworks [8].

Fig. 2. Overview of LAVIDA Framework. LAVIDA is trained solely on a comprehensive Anomaly Exposure dataset, and consists of five key components: an MLLM, a text encoder, a vision backbone, a SAM2 mask decoder, and a Multi-Scale Semantic Projector.

Open-vocabulary VAD enhances anomaly categorization by leveraging vision-language models. Wu et al. [34] first introduce open-vocabulary video anomaly detection (OVVAD) using CLIP. Li et al. [14] leverage visual and textual information with label relations to reduce detection ambiguity. Xu et al. [37] use learnable prompts and graph attention networks. Nevertheless, these methods remain restricted to particular scenarios and cannot adaptively adjust detection targets based on contextual changes, which prevents them from achieving truly open-world VAD capabilities.

2.3. LLM-based Video Anomaly Analysis

Current applications of MLLMs in Video Anomaly Detection (VAD) can be categorized into training-free VAD methods and Video Anomaly Understanding (VAU) approaches.
Training-free VAD methods utilize MLLMs to analyze video clips or frames, extracting anomaly scores from the generated textual outputs. Zanella et al. [47] extract and refine anomaly scores from frame-wise textual outputs. Yang et al. [39] derive anomaly detection rules from training datasets for inference. Ahn et al. [2] employ CLIP to guide MLLMs toward anomalous regions for more accurate scoring. Shao et al. [27] integrate dynamic graphs to mine event boundaries, enabling MLLMs to focus on event intervals. Despite eliminating training requirements, these methods rely on frame-wise or clip-wise textual outputs, incurring high temporal costs and limiting prediction granularity to the frame or clip level, thus preventing spatial localization of anomalies. On the other hand, VAU methods focus on the semantic understanding capabilities of MLLMs to provide explanations for anomalies. Some studies [4, 29] construct interactive instruction data to deliver video-level anomaly explanations. Yuan et al. [43] refine VAU precision to the clip level. Zhang et al. [49] combine detection and understanding by outputting explanations for high-probability anomalous regions. Xing et al. [36] deploy audio data to enhance the understanding of anomalies. However, these methods prioritize using MLLMs for explanation generation while overlooking their intrinsic capability to detect unseen anomaly types.

3. Methods

3.1. Preliminary

The training dataset is a pseudo anomaly dataset D_E = {(x_i, y_i, c_i)}_{i=1}^{N}, where x_i represents the input visual data, encompassing both video and image modalities. For video samples, v_i ∈ R^{T×C×H×W}, where T, C, H, W denote the number of frames, channels, height, and width, respectively; for image samples, I_i ∈ R^{C×H×W}.
Further, c_i = {c_{i,0}, c_{i,1}, ..., c_{i,K−1}} indicates the anomaly categories to be detected, and y_i = (y_i^f, y_i^p) denotes the corresponding anomaly labels, with frame-level label y_i^f ∈ {0, 1}^T and pixel-level label y_i^p ∈ {0, 1}^{T×H×W}. During evaluation, the model is tested on unseen VAD datasets D_test = {(v_t, y_t, c_t)}_{t=1}^{M}, where the anomaly categories c_t and video scenarios differ from those of the training dataset. Our objective is to predict y_t in D_test under zero-shot conditions, where the test videos v_t and test anomaly categories c_t are not observed during training.

3.2. Overview

The LAVIDA framework comprises five key components, as illustrated in Fig. 2. First, an Anomaly Exposure Sampler reconstructs the anomaly exposure dataset to form the training set. The input data then enters the Feature Encoding module, which encodes text, image, and video into feature vectors. Simultaneously, the Semantic Feature Extraction module encodes abnormal description prompts alongside vision data into a unified semantic feature via the MLLM, with visual tokens being compressed by a token compression module. Thereafter, the Multi-Scale Semantic Projector fuses these semantic features with learnable query vectors and projects them into the mask decoder's latent space. Ultimately, a Multi-Level Mask Decoder decodes these latent-space features to output frame-level and pixel-level anomaly scores.

3.3. Anomaly Exposure Sampler

Visual semantic segmentation datasets provide rich scene diversity and comprehensive semantic categories. However, these datasets cannot be directly applied to VAD tasks, since anomalies occur only rarely in them. To address this problem, we propose a two-step transformation of the anomaly exposure dataset, as illustrated in Fig. 3.
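Concretely, the sampler's two-step construction (formalized below) can be sketched in a few lines of Python. This is an illustrative sketch rather than the paper's implementation: the description pool, the category count `k_e`, the anomaly probability `p`, and the frame count are placeholder inputs.

```python
import random

def anomaly_exposure_sample(descriptions, i, k_e, p, num_frames):
    """Build (categories, frame_labels) for sample i.
    Step 1: draw K_E - 1 irrelevant categories from OTHER samples.
    Step 2: with probability p, mark the sample anomalous, adding its
    genuine category and positive frame-level labels."""
    # Step 1: irrelevant categories drawn uniformly from other samples.
    pool = [descriptions[j] for j in range(len(descriptions)) if j != i]
    irrelevant = random.sample(pool, k_e - 1)
    # Step 2: anomalous with probability p, otherwise normal.
    if random.random() < p:
        categories = irrelevant + [descriptions[i]]   # genuine + irrelevant
        frame_labels = [1] * num_frames               # all frames positive
    else:
        categories = irrelevant                       # irrelevant only
        frame_labels = [0] * num_frames               # all frames negative
    random.shuffle(categories)
    return categories, frame_labels

# Toy category pool in the spirit of Fig. 3 (names are illustrative).
cats, labels = anomaly_exposure_sample(
    ["parrot", "dog", "car", "rabbit", "elephant"],
    i=0, k_e=3, p=0.5, num_frames=4)
```

In this sketch the model never sees which category is genuine; it must decide from the visual content whether any listed category is actually present, which is the discrimination the sampler is designed to force.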
We define the training dataset as D_E = {(x_i, y_i^p, s_i)}_{i=1}^{N}, where s_i represents the text description of video clip v_i, and y_i^p represents the pixel-level category labels. Our objective is to construct (c_i, y_i^f) for each sample, thereby transforming D_E = {(x_i, y_i^p, s_i)}_{i=1}^{N} into D_E = {(x_i, y_i, c_i)}_{i=1}^{N}.

For anomalous samples, only sparse anomaly events exist within the video. This means that c_i contains few content-relevant categories, the majority being irrelevant. To construct c_i for each sample in D_E, we introduce irrelevant categories from other samples within the same dataset, thereby requiring the model to distinguish genuine anomaly categories from irrelevant ones. This can be represented as follows:

S_i^irr = { s_j | j ∈ J },  J ∼ Unif({1, ..., N} \ {i}),  |J| = K_E − 1    (1)

where S_i^irr represents the set of irrelevant categories for the i-th sample, constructed by uniform sampling from the other samples in D_E, and K_E is the total number of categories. In practice, K_E is set as a random parameter to enable the MLLM to handle arbitrary numbers of anomaly types.

To model anomaly rarity, each sample is randomly labeled as normal (with probability 1 − p) or anomalous (with probability p). For anomalous samples, the category set c_i combines genuine and irrelevant categories, with frame labels y_i^f set to positive. Normal samples contain only irrelevant categories and are assigned negative labels. This operation is given by

(c_i, y_i^f) = (S_i^irr ∪ {s_i}, 1_T) with probability p;  (S_i^irr, 0_T) with probability 1 − p    (2)

where (c_i, y_i^f) denotes the output category set and frame-level labels for the i-th sample, T represents the total number of frames, and p controls the anomaly sampling probability.

Fig. 3. Anomaly Exposure Sampler: we sample irrelevant categories from other samples to create anomaly categories, and randomly designate samples as anomalous or normal based on probability p.

3.4. Visual Token Compression

Anomalous objects typically constitute only a small fraction of visual data, while backgrounds constitute the vast majority. Excessive irrelevant background tokens degrade MLLM reasoning and incur substantial computational costs. We aim to deploy a training-free approach that compresses background tokens while retaining anomaly-relevant features.

For VAD tasks, directly identifying anomalous tokens is difficult given the sparse spatio-temporal distribution of anomalous objects. However, background tokens are characterized by numerical predominance and high feature similarity, making them readily identifiable through density estimation. After visual encoding, the token features are represented as Z ∈ R^{L_z×D_z}, where L_z is the number of visual tokens and D_z is the token dimensionality. We compute the local density within the KNN neighborhood N_k(z_i) for each token z_i as:

ρ(z_i) = k / Σ_{z_k ∈ N_k(z_i)} ∥z_i − z_k∥_2    (3)

We select the top-L_r tokens with the highest density to form the background reference set Z^b ∈ R^{L_r×D_z}. To identify anomaly candidates, we employ a localized reverse attention mechanism [9]. Specifically, each token in Z is assigned to its nearest neighbor in Z^b based on the minimum Euclidean distance. For each background token Z_i^b, reverse attention is performed exclusively over its corresponding assigned tokens, to highlight the features most dissimilar to the background.
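The procedure just described — local density estimation, background reference selection, nearest-background assignment, and reverse-attention aggregation — can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: `k`, `L_r`, and the token shapes are illustrative, and the actual model operates on encoder tokens rather than random features.

```python
import numpy as np

def compress_tokens(Z, k=5, L_r=8):
    """Sketch of density-based token compression: the densest tokens act
    as background references; each reference then pools its assigned
    tokens with reverse attention (softmax over NEGATIVE similarity),
    so features unlike the background dominate the result."""
    L, D = Z.shape
    # Pairwise Euclidean distances between all tokens.
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    # Local density: k over the summed distance to the k nearest
    # neighbours (self excluded).
    knn_dist = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = k / np.maximum(knn_dist.sum(axis=1), 1e-8)
    # Top-L_r densest tokens form the background reference set.
    bg_idx = np.argsort(rho)[-L_r:]
    Zb = Z[bg_idx]
    # Assign every token to its nearest background reference.
    assign = np.argmin(dist[:, bg_idx], axis=1)
    Z_out = np.zeros_like(Zb)
    for i in range(L_r):
        Zn = Z[assign == i]                    # tokens assigned to reference i
        w = np.exp(-(Zb[i] @ Zn.T) / np.sqrt(D))
        Z_out[i] = (w / w.sum()) @ Zn          # reverse-attention pooling
    return Z_out

# Toy input: 64 random 16-d tokens compressed to L_r = 8.
Z_prime = compress_tokens(np.random.default_rng(0).normal(size=(64, 16)))
```

Note that the output length is L_r regardless of the input length, so the ratio L_r / L_z directly controls the token budget handed to the LLM.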
This process is formulated as:

Z′_i = Softmax( −Z_i^b Z_{N_i}^T / √D_z ) · Z_{N_i}    (4)

where N_i = { j | argmin_k ∥Z_j − Z_k^b∥_2 = i } denotes the set of indices of tokens in Z that are closest to the i-th background token, and Z′ ∈ R^{L_r×D_z} represents the aggregated anomalous features. This mechanism effectively compresses the visual tokens into a compact L_r-length representation Z′.

3.5. Anomaly Semantics Extraction

Existing VAD methods are constrained by limited semantic comprehension capabilities, failing to understand anomalies in unseen scenarios. To address this limitation, we leverage MLLMs, following the previous work LISA [12], to extract rich anomaly semantic features that enable robust detection across diverse scenarios.

We extend the MLLM's vocabulary with a special token <SEG> to extract anomaly semantic features. Given a sample x_i and the corresponding anomaly categories c_i, we fill c_i into several predefined templates to construct the text prompt for the MLLM. For example: "USER: Find the anomaly in this video. Anomaly types may contain {c_i}. ASSISTANT: Sure, it is <SEG>." The <SEG> token aggregates semantic information, and we extract its last-layer embedding as the anomaly semantic feature. This operation is given by:

f_sem = Φ_MLLM(x, c)    (5)

where Φ_MLLM is the MLLM, x is the input sample, c is the set of anomaly categories used to construct the prompt, and f_sem is the extracted anomaly semantic feature.

3.6. Feature Encoding

For the input vision data x_i and anomaly categories c_i, we employ a vision backbone Φ_v to extract visual features and a CLIP text encoder Φ_t to extract textual features for the anomaly categories:

f_v = Φ_v(x_i),  f_c = Φ_t(c_i)    (6)

where f_v ∈ R^{T×N_p×D_v} represents the vision features, N_p denotes the number of patches within a single frame, and f_c ∈ R^{K×D_t} represents the anomaly category features.

3.7.
Multi-Scale Semantic Projector

While the MLLM effectively extracts semantic features for video anomalies, these representations remain at the video level, without frame-specific granularity. To address this limitation, we propose a Multi-Scale Semantic Projector that integrates video-level semantic features with frame-level ones, generating frame-specific features f_proj ∈ R^{T×D_m} that are projected into the mask decoder to guide fine-grained detection in each frame.

To extract frame-level local anomaly information from the video sequence, we employ a cross-attention mechanism between the anomaly category features and the vision features, as follows:

f_a = W_o · CrossAttn(W_c f_c, W_v f_v, W_v f_v)    (7)

where f_v ∈ R^{T×L×D_v} are the vision features; W_c ∈ R^{D_c×D_l}, W_v ∈ R^{D_v×D_l}, and W_o ∈ R^{D_l×D_a} are learnable projection matrices; and D_l and D_a represent the intermediate feature dimension of the output MLP and the hidden dimension of the Multi-Scale Semantic Projector, respectively. f_a ∈ R^{T×K×D_a} are the output frame-level semantic features, containing the anomaly target information for each frame.

We expand f_sem along the temporal dimension, apply a mapping matrix W_LLM, and concatenate the result with f_a. The combined features are projected into the latent space of the mask decoder via a Q-Former-like projector. Drawing inspiration from SAM, we formulate the projector as a two-way transformer architecture, as illustrated in Fig. 2, to facilitate the mutual updating of both the learnable queries and the extracted features f_sem and f_a:

f_proj = Φ_proj([W_LLM f_sem, f_a])    (8)

where Φ_proj is the projector, f_proj ∈ R^{T×D_m} represents the projected features, and D_m is the latent dimension of the Multi-Level Mask Decoder.

3.8. Multi-Level Mask Decoder

Existing VAD models typically focus on frame-level anomaly scores, limiting their detection granularity.
To address this, our approach introduces a Multi-Level Mask Decoder, initialized from SAM, to enable both frame-level and pixel-level anomaly detection.

We feed f_proj as the sparse prompt embedding of SAM2. After integrating the visual features f_v, the mask decoder produces pixel-level scores and object score logits. The object score logits indicate the confidence of target-object presence within the image or frame, which we leverage as the frame-level anomaly score. This process can be formulated as follows:

ŷ_i^f, ŷ_i^p = Φ_d(f_proj, f_v)    (9)

where ŷ_i^f represents the frame-level score, ŷ_i^p denotes the pixel-level score, and Φ_d is the mask decoder of SAM2.

3.9. Objective Function

The objective function comprises two components, L_txt and L_seg:

L = λ_txt L_txt + λ_seg L_seg    (10)

where λ_txt and λ_seg are loss weights, L_txt represents the text generation loss of the MLLM, and L_seg denotes the anomaly detection loss, which encompasses both frame-level and pixel-level performance enhancement. To facilitate optimization, we adopt SAM2's training loss for L_seg.

| Method | Venue | Training | UBnormal AUC (%) | ShanghaiTech AUC (%) | UCF-Crime AUC (%) | XD-Violence AP (%) |
|---|---|---|---|---|---|---|
| MemAE [6] | ICCV'19 | Unsupervised | - | 71.2 | - | - |
| GODS [31] | ICCV'19 | Unsupervised | - | - | 70.4 | 61.56 |
| MSMA [18] | ICLR'21 | Unsupervised | - | 76.7 | 64.5 | - |
| GCL [46] | CVPR'22 | Unsupervised | - | 79.62 | 74.2 | - |
| FastAno [23] | WACV'22 | Unsupervised | - | 72.2 | - | - |
| FPDM [38] | ICCV'23 | Unsupervised | 62.7 | 78.6 | 74.7 | - |
| MULDE [21] | CVPR'24 | Unsupervised | 72.8 | 81.3 | 78.5 | - |
| AED-MAE [24] | CVPR'24 | Unsupervised | 58.5 | 79.1 | - | - |
| MA-PDM [51] | AAAI'25 | Unsupervised | 63.4 | 79.2 | - | - |
| CLIP-TSA [11] | ICIP'23 | Weakly-Supervised | - | - | 87.58 | 82.19 |
| TPWNG [41] | CVPR'24 | Weakly-Supervised | - | - | 87.79 | 83.68 |
| VadCLIP [35] | AAAI'24 | Weakly-Supervised | - | - | 88.02 | 84.51 |
| Holmes-VAU [49] | CVPR'25 | Weakly-Supervised | - | - | 87.68 | 88.96 |
| VERA [42] | CVPR'25 | Weakly-Supervised | - | - | 86.55 | 56.27 |
| PI-VAD [20] | CVPR'25 | Weakly-Supervised | - | - | **90.33** | 85.37 |
| Anomize [13] | CVPR'25 | Weakly-Supervised | - | - | 84.49 | 69.31 |
| AnomalyRuler [39] | ECCV'24 | Few-Shot | 71.9 | 85.2 | - | - |
| LAVAD [47] | CVPR'24 | Zero-Shot | - | - | 80.82 | 62.01 |
| AnyAnomaly [2] | WACV'26 | Zero-Shot | 74.5 | 79.7 | 80.7 | - |
| EventVAD [27] | ACM'25 | Zero-Shot | - | - | 82.03 | 64.04 |
| Ours | - | Zero-Shot | **76.45** | **85.28** | 82.18 | **90.62** |

Tab. 1. Frame-level zero-shot performance compared with state-of-the-art methods. We use AUC as the evaluation metric for the UBnormal, ShanghaiTech, and UCF-Crime datasets, and AP for the XD-Violence dataset. The best results are highlighted in bold.

4. Experiment

Our training dataset comprises a diverse collection of segmentation datasets, without any VAD datasets. Detailed information regarding datasets, configurations, and additional results can be found in the supplementary material.

4.1. Quantitative Results

4.1.1. Frame-Level Zero-Shot Evaluation

In Tab. 1, we present a comprehensive comparison of our proposed method against other SOTA approaches under zero-shot conditions across the UBnormal, ShanghaiTech, UCF-Crime, and XD-Violence datasets. The compared methods encompass four categories: unsupervised, weakly-supervised, few-shot, and zero-shot methods.

The experimental results demonstrate that our method attains 76.45%, 85.28%, and 82.18% AUC on the UBnormal, ShanghaiTech, and UCF-Crime datasets, surpassing SOTA unsupervised, zero-shot, and few-shot methods. On the XD-Violence dataset, our approach achieves 90.62% AP, outperforming SOTA methods. These datasets vary in both scenarios and anomaly types, so this superior performance demonstrates the effectiveness of our method in handling both unseen scenarios and novel anomaly categories.

Our method does not surpass weakly-supervised approaches on UCF-Crime, which we attribute to the limitations of existing MLLMs in comprehending low-resolution videos.
In contrast, UBnormal, ShanghaiTech, and XD-Violence are high-resolution datasets, where small abnormal targets remain distinguishable.

4.1.2. Pixel-Level Zero-Shot Evaluation

We evaluate the zero-shot pixel-level performance of our method against SOTA approaches on the UCSD Ped2 dataset, as presented in Tab. 2. Our method achieves a pixel-level AUC of 87.68%, a substantial improvement of 12.57% over the current SOTA method. This significant enhancement demonstrates that our approach possesses strong zero-shot anomaly localization capability in the spatial dimension.

| Method | Training | AUC (%) |
|---|---|---|
| AdaCLIP [3] | Finetune | 53.06 |
| AnomalyCLIP [52] | Finetune | 54.25 |
| DDAD [22] | Supervised | 55.87 |
| SimpleNet [17] | Supervised | 52.49 |
| DRAEM [48] | Supervised | 69.58 |
| TAO [10] | Finetune | 75.11 |
| Proposed | Zero-Shot | 87.68 |

Tab. 2. Pixel-level performance on UCSD Ped2.

4.2. Qualitative Results

Fig. 4 presents qualitative results for anomaly detection across six representative cases from different VAD datasets. For each case, the lower row displays frame-level anomaly scores over time with anomalous intervals highlighted in pink, and the upper row shows pixel-level detection results at corresponding time steps with anomaly targets masked in green.

Fig. 4. Qualitative Results for Anomaly Detection (cases: Running, Explosion, Vandalism, Cycling, Fighting, Road Accident). For each case, the first row presents pixel-level detection results, masked in green. The second row displays frame-level anomaly scores, with temporal intervals of anomalous events marked in pink.

Higher scores in pink regions and lower scores elsewhere demonstrate effective frame-level detection. The results show that our model accurately identifies anomalous frames in unseen scenarios and generates precise pixel-level scores that delineate anomaly boundaries.

Fig. 5 presents open-world detection results. For each case, the target anomaly category is specified in the text. The left panel shows original images, while the right panel presents detection results with anomalies highlighted in green. Our method demonstrates robust performance and strong reasoning capabilities in identifying arbitrary anomaly types across diverse scenarios.

Fig. 5. Qualitative Visualizations for Open-World Scenarios (cases: Car, Playing Football, Dog, Man in Blue, Stop Sign, Fruit with different color). The left panel shows the original image, and the right panel highlights detected anomalies with green masks.

4.3. Ablation Studies

4.3.1. Analysis of the Anomaly Exposure Sampler

| max(K_E) | ShanghaiTech AUC (%) | UCF-Crime AUC (%) | XD-Violence AP (%) |
|---|---|---|---|
| 10 | 73.34 | 67.11 | 55.44 |
| 20 | 80.13 | 80.02 | 91.20 |
| 30 | 85.02 | 82.06 | 90.30 |
| 40 | 76.41 | 77.93 | 75.32 |

Tab. 3. Effect of the number of anomaly categories introduced in the anomaly exposure dataset.

Tab. 3 shows the effect of the anomaly category count K_E in the anomaly exposure dataset. To enable MLLMs to comprehend arbitrary anomaly types, we set K_E as a random variable and evaluate the impact by controlling max(K_E). Results show that performance is poor when max(K_E) = 10, improves as the value increases, and reaches an optimum around 30. Further increases beyond 30 cause performance degradation, which we attribute to excessively lengthy prompts that reduce the model's focus on individual anomaly types.

4.3.2. Analysis of Token Compression

Fig. 6 illustrates the token compression process. We leverage local density to identify the tokens with the highest density values, which correspond closely to the background regions. During the reverse attention stage, tokens whose features are highly dissimilar to the background tokens are
During the rev erse attention stage, tokens whose features are highly dissimilar to the background tokens are Original V ideo Local Density Reverse-Attn W eight Fig. 6. T oken Compression Process . The first column shows the original video frames. The second column shows local density , and the third shows reverse attention weights. W armer colors in- dicate higher values. (a) Compression ratio vs. model performance. (b) Compression ratio vs. GPU memory usage. Fig. 7. Impression of compression ratio. W e normalize the frame size across all datasets to ensure a consistent number of total visual tokens. GPU memory usage is recorded during inference stage. aggregated via reverse attention weight. As sho wn in the third column of Fig. 6 , the reverse attention weights concen- trate on regions that e xhibit substantial dissimilarity from the background, which are typically anomaly-prone areas. Fig. 7a presents the variation of performance across datasets with respect to the token compression ratio. For compression ratios abov e 0.1, performance remains rel- ativ ely stable, demonstrating effecti ve background token reduction without compromising model capability . On high-resolution datasets (UBnormal, ShanghaiT ech, XD- V iolence), we observe slight improv ements, as background suppression enables better focus on small anomaly targets. When the compression ratio drops belo w 0.1, a marked per - formance de gradation occurs due to substantial information frame-level pixel-level Adapter ShanghaiT ech UCF-Crime XD-Violence Ped2 A UC (%) A UC (%) AP (%) AUC (%) MLP 81.99 79.90 86.09 77.09 Q-Former 82.63 75.54 83.93 71.85 Proposed 85.28 82.06 90.62 87.68 T ab. 4. Comparison among different adapters. number of queries ShanghaiT ech UCF-Crime XD-Violence A UC (%) A UC (%) AP (%) 24 58.85 66.63 50.14 32 77.41 77.96 88.31 48 80.64 80.06 90.30 64 77.96 67.11 89.15 T ab. 5. Effect of the number of learnable queries. loss caused by excessi ve compression. 
UCF-Crime's resolution is much lower than that of the other datasets, making it the most susceptible to visual compression.

Fig. 7b shows the corresponding GPU memory utilization versus the compression ratio. As the number of input visual tokens decreases, GPU memory consumption exhibits a linear reduction. At a compression ratio of 0.2, GPU memory usage is reduced to 54.1% of the baseline with no substantial performance loss (UBnormal: +2.44%, ShanghaiTech: +2.24%, XD-Violence: -0.32%, UCF-Crime: -2.49%, average: +0.47%).

4.3.3. Analysis of the Multi-Scale Semantic Projector

To validate the effectiveness of our Multi-Scale Semantic Projector, we compare it against an MLP and a Q-Former at both the frame level and the pixel level. The experimental results are presented in Tab. 4. The improvements at the frame level demonstrate the capability to capture temporal anomaly cues, while the enhancements at the pixel level indicate the ability to detect spatially sparse anomalies.

Tab. 5 presents the effect of the learnable query count on zero-shot detection performance. With 24 queries, the model achieves suboptimal results due to limited representational capacity. Performance improves as the query count increases, reaching a peak before declining when queries become excessive, causing convergence difficulties.

5. Conclusion

In this paper, we propose LAVIDA, an end-to-end zero-shot VAD approach that leverages an MLLM and a token compression algorithm to extract semantic anomaly features, together with an anomaly exposure sampler that enables anomaly detection in open-world scenarios without training on any VAD data. A multi-scale semantic projector is employed to extract hierarchical cues for joint frame- and pixel-level prediction. Extensive experiments demonstrate the effectiveness of our approach across multiple benchmarks. We hope our work inspires further research in developing open-world video anomaly detection and understanding.
Acknowledgement

This work is supported in part by the National Natural Science Foundation of China (No. 62272057) and the Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance.