Title: Classifying long legal documents using short random chunks
ArXiv ID: 2512.24997
Date: 2025-12-31
Authors: Luis Adrián Cabrera-Diego
📝 Abstract
Classifying legal documents is a challenge: besides their specialized vocabulary, they can sometimes be very long. This means that feeding full documents to a Transformer-based model for classification might be impossible, expensive or slow. Thus, we present a legal document classifier based on DeBERTa V3 and an LSTM that uses as input a collection of 48 randomly selected short chunks (max 128 tokens). We also present its deployment pipeline using Temporal, a durable execution solution, which allows us to have a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
💡 Deep Analysis
📄 Full Content
Classifying long legal documents using short random chunks
Luis Adrián Cabrera-Diego
Jus Mundi / 30 Rue de Lisbonne, 75008 Paris, France
a.cabrera@jusmundi.com
Abstract
Classifying legal documents is a challenge: besides their specialized vocabulary, they can sometimes be very long. This means that feeding full documents to a Transformer-based model for classification might be impossible, expensive or slow. Thus, we present a legal document classifier based on DeBERTa V3 and an LSTM that uses as input a collection of 48 randomly selected short chunks (max 128 tokens). We also present its deployment pipeline using Temporal, a durable execution solution, which allows us to have a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
1 Introduction
Legal AI is the use of artificial intelligence technologies to help legal professionals with their heavy and redundant tasks (Zhong et al., 2020). And, while Legal AI is not new (Dale, 2019), processing legal documents remains challenging.
The main challenge is that legal documents are diverse, not only in their length, but also in their vocabulary, structure, subjectivity and scope (Mitchell, 2014; Trautmann, 2023). And while the two latter characteristics can be (partially) minimized by training tools on specialized corpora (Chalkidis et al., 2020), the first characteristic, i.e. length, cannot be. For instance, the longer the document, the harder it is to keep track of the contexts that are actually relevant for a task (Wagh et al., 2021). Likewise, when using Transformer-based technologies, such as BERT (Devlin et al., 2019), memory consumption explodes as the input grows (Vaswani et al., 2017). And while there are now Large Language Models (LLMs), such as GPT1, that can process thousands of tokens, their use can be expensive or carry risks (Karla Grossenbacher, 2023; Cassandre Coyer and Isha Marathe, 2024; Fields et al., 2024).
1 https://openai.com/
Therefore, we present a classifier capable of processing long legal documents that can be deployed on in-house CPU servers. To achieve this, we have created a document classifier using DeBERTa V3 (He et al., 2021) and an LSTM that takes as input a collection of 48 randomly selected short chunks (the maximum size of a chunk is 128 tokens). We also describe how we used Temporal2 to create durable workflows, which lets us focus on delivering our classifier rather than on managing the workflows' execution.
2 https://temporal.io/
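To make the architecture concrete, the following is a minimal sketch of how such a chunk-based classifier could be assembled with PyTorch and Hugging Face Transformers. The checkpoint name (microsoft/deberta-v3-base), the first-token pooling, the bidirectional LSTM size and the padding strategy for documents with fewer than 48 chunks are illustrative assumptions; the paper does not specify these details.

```python
import random

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_CHUNKS = 48   # number of randomly selected chunks per document
CHUNK_LEN = 128   # maximum number of tokens per chunk


def select_random_chunks(text: str, tokenizer) -> list[list[int]]:
    """Tokenize a document, cut it into fixed-size chunks and sample NUM_CHUNKS of them."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + CHUNK_LEN] for i in range(0, len(ids), CHUNK_LEN)]
    if len(chunks) >= NUM_CHUNKS:
        return random.sample(chunks, NUM_CHUNKS)
    # Short documents: repeat existing chunks to reach NUM_CHUNKS (hypothetical strategy).
    return chunks + random.choices(chunks, k=NUM_CHUNKS - len(chunks))


class ChunkedDocumentClassifier(nn.Module):
    """Encodes each chunk with DeBERTa V3 and aggregates the chunk embeddings with an LSTM."""

    def __init__(self, num_classes: int = 18,
                 encoder_name: str = "microsoft/deberta-v3-base", hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # input_ids, attention_mask: (batch, NUM_CHUNKS, CHUNK_LEN)
        b, c, l = input_ids.shape
        out = self.encoder(input_ids=input_ids.view(b * c, l),
                           attention_mask=attention_mask.view(b * c, l))
        chunk_emb = out.last_hidden_state[:, 0, :].view(b, c, -1)  # first-token embedding per chunk
        _, (h, _) = self.lstm(chunk_emb)
        doc_emb = torch.cat([h[-2], h[-1]], dim=-1)  # last forward and backward LSTM states
        return self.classifier(doc_emb)              # logits over the 18 document classes


# Example usage (sketch): tokenize, sample chunks, pad them to CHUNK_LEN and classify.
# tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
# chunks = select_random_chunks(document_text, tokenizer)
```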
The proposed model has been trained and tested on a large multilingual collection of legal documents of different lengths, covering 18 classes. The classifier has a median weighted F-score of 0.898, and the median time for processing files using Temporal is 498 seconds per 100 files.
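As an illustration of the deployment side, the sketch below wires a classification activity into a durable workflow with the Temporal Python SDK (temporalio). The workflow, activity and task-queue names, the timeout and the retry policy are hypothetical and only show the general pattern; the paper does not describe its actual workflow definitions.

```python
import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.common import RetryPolicy
from temporalio.worker import Worker


@activity.defn
async def classify_document(path: str) -> str:
    """Placeholder activity: load a file, sample 48 chunks and return the predicted class."""
    # Model inference (chunk selection + DeBERTa V3 + LSTM) would run here.
    return "predicted-class"


@workflow.defn
class DocumentClassificationWorkflow:
    @workflow.run
    async def run(self, path: str) -> str:
        # Temporal persists the workflow state and retries failed activities,
        # which keeps the pipeline reliable even if a worker crashes mid-batch.
        return await workflow.execute_activity(
            classify_document,
            path,
            start_to_close_timeout=timedelta(minutes=10),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )


async def main() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="legal-document-classification",
        workflows=[DocumentClassificationWorkflow],
        activities=[classify_document],
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```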
2 Related Work
Most of the works related to the classification of long documents rely on splitting the documents. For instance, Pappagari et al. (2019) segment a long document into smaller chunks of 200 tokens, feed them into a BERT (Devlin et al., 2019) model, and propagate the resulting representations into either an LSTM or a transformer layer. CogLTX (Ding et al., 2020) is a classifier that uses only key sentences as input; these are obtained using a model trained as a judge. Park et al. (2022) presented two baselines using BERT where relevant sentences, determined by TextRank (Mihalcea and Tarau, 2004), or randomly selected ones, are concatenated to the first 512 tokens of a document.
Other works have explored how to increase the input size of Transformer-based models. The best example is Longformer (Beltagy et al., 2020), a model that is capable of processing up to 4,096 tokens by using a windowed local-context self-attention and a global attention. Nonetheless, this kind of model tends to suffer from large memory consumption and long processing times (Park et al., 2022).
In the legal domain, we can highlight the following works. Chalkidis et al. (2019) classified documents by training legal Doc2Vec embeddings (Le and Mikolov, 2014) and feeding them into a BiGRU with a Label-Wise Attention Network (Mullenbach et al., 2018). Similarly, Wan et al. (2019) trained legal Doc2Vec embeddings, but fed them into a BiLSTM with a chunk attention layer. LegalDB (Bambroo and Awasthi, 2021) and Lawformer (Xiao et al., 2021) converted, respectively, DistilBERT (Sanh et al., 2020) and a Chinese RoBERTa model (Cui et al., 2021) into a Longformer model. Both models were pre-trained using legal corpora. Similarly, Mamakas et al. (2022) converted LegalBERT (Chalkidis et al., 2020) into a legal hierarchical BERT and into a legal Longformer model which can process up to 8,192 tokens. D2GCFL (Wang et al., 2022) is a legal document classifier that extracts relations and represents them as four graphs, which are fed into a graph atten