Title: Classifying long legal documents using short random chunks
ArXiv ID: 2512.24997
Date: 2025-12-31
Authors: Luis Adrián Cabrera-Diego
📝 Abstract
Classifying legal documents is a challenge: besides their specialized vocabulary, they can sometimes be very long. This means that feeding full documents to a Transformer-based model for classification might be impossible, expensive or slow. Thus, we present a legal document classifier based on DeBERTa V3 and an LSTM that uses as input a collection of 48 randomly selected short chunks (max 128 tokens). We also present its deployment pipeline using Temporal, a durable execution solution, which allows us to have a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
💡 Deep Analysis
📄 Full Content
Classifying long legal documents using short random chunks
Luis Adrián Cabrera-Diego
Jus Mundi / 30 Rue de Lisbonne, 75008 Paris, France
a.cabrera@jusmundi.com
Abstract
Classifying legal documents is a challenge: besides their specialized vocabulary, they can sometimes be very long. This means that feeding full documents to a Transformer-based model for classification might be impossible, expensive or slow. Thus, we present a legal document classifier based on DeBERTa V3 and an LSTM that uses as input a collection of 48 randomly selected short chunks (max 128 tokens). We also present its deployment pipeline using Temporal, a durable execution solution, which allows us to have a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
1 Introduction
Legal AI is the use of artificial intelligence technologies to help legal professionals with their heavy and redundant tasks (Zhong et al., 2020). And, while Legal AI is not new (Dale, 2019), processing legal documents remains challenging.
The main challenge is that legal documents are diverse, not only in their length, but also in their vocabulary, structure, subjectivity and scope (Mitchell, 2014; Trautmann, 2023). And while the two latter characteristics can be (partially) minimized by training tools on specialized corpora (Chalkidis et al., 2020), the first characteristic, i.e. length, cannot be. For instance, the longer the document, the harder it is to keep track of the contexts that are actually relevant for a task (Wagh et al., 2021). Likewise, when using Transformer-based technologies, such as BERT (Devlin et al., 2019), memory consumption explodes as the input grows (Vaswani et al., 2017). And while there are now Large Language Models (LLMs), such as GPT1, that can process thousands of tokens, their use can be expensive or carry risks (Karla Grossenbacher, 2023; Cassandre Coyer and Isha Marathe, 2024; Fields et al., 2024).
1 https://openai.com/
Therefore, we present a classifier capable of processing long legal documents that can be deployed on in-house CPU servers. To achieve this, we have created a document classifier using DeBERTa V3 (He et al., 2021) and an LSTM that takes as input a collection of 48 randomly selected short chunks (the maximum size of a chunk is 128 tokens). We also describe how we used Temporal2 to create durable workflows, which lets us focus on delivering our classifier rather than on managing the workflows' execution.
2 https://temporal.io/
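To make the architecture concrete, the following is a minimal sketch of how such a chunk-based classifier could be assembled with PyTorch and Hugging Face Transformers. The checkpoint name (microsoft/deberta-v3-base), the first-token pooling, the bidirectional LSTM size and the padding strategy for documents with fewer than 48 chunks are illustrative assumptions; the paper does not specify these details.

```python
import random

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_CHUNKS = 48   # number of randomly selected chunks per document
CHUNK_LEN = 128   # maximum number of tokens per chunk


def select_random_chunks(text: str, tokenizer) -> list[list[int]]:
    """Tokenize a document, cut it into fixed-size chunks and sample NUM_CHUNKS of them."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + CHUNK_LEN] for i in range(0, len(ids), CHUNK_LEN)]
    if len(chunks) >= NUM_CHUNKS:
        return random.sample(chunks, NUM_CHUNKS)
    # Short documents: repeat existing chunks to reach NUM_CHUNKS (hypothetical strategy).
    return chunks + random.choices(chunks, k=NUM_CHUNKS - len(chunks))


class ChunkedDocumentClassifier(nn.Module):
    """Encodes each chunk with DeBERTa V3 and aggregates the chunk embeddings with an LSTM."""

    def __init__(self, num_classes: int = 18,
                 encoder_name: str = "microsoft/deberta-v3-base", hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # input_ids, attention_mask: (batch, NUM_CHUNKS, CHUNK_LEN)
        b, c, l = input_ids.shape
        out = self.encoder(input_ids=input_ids.view(b * c, l),
                           attention_mask=attention_mask.view(b * c, l))
        chunk_emb = out.last_hidden_state[:, 0, :].view(b, c, -1)  # first-token embedding per chunk
        _, (h, _) = self.lstm(chunk_emb)
        doc_emb = torch.cat([h[-2], h[-1]], dim=-1)  # last forward and backward LSTM states
        return self.classifier(doc_emb)              # logits over the 18 document classes


# Example usage (sketch): tokenize, sample chunks, pad them to CHUNK_LEN and classify.
# tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
# chunks = select_random_chunks(document_text, tokenizer)
```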
The proposed model has been trained and tested on a large multilingual collection of legal documents of different lengths, covering 18 classes. The classifier has a median weighted F-score of 0.898, and the median time for processing files using Temporal is 498 seconds per 100 files.
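As an illustration of the deployment side, the sketch below wires a classification activity into a durable workflow with the Temporal Python SDK (temporalio). The workflow, activity and task-queue names, the timeout and the retry policy are hypothetical and only show the general pattern; the paper does not describe its actual workflow definitions.

```python
import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.common import RetryPolicy
from temporalio.worker import Worker


@activity.defn
async def classify_document(path: str) -> str:
    """Placeholder activity: load a file, sample 48 chunks and return the predicted class."""
    # Model inference (chunk selection + DeBERTa V3 + LSTM) would run here.
    return "predicted-class"


@workflow.defn
class DocumentClassificationWorkflow:
    @workflow.run
    async def run(self, path: str) -> str:
        # Temporal persists the workflow state and retries failed activities,
        # which keeps the pipeline reliable even if a worker crashes mid-batch.
        return await workflow.execute_activity(
            classify_document,
            path,
            start_to_close_timeout=timedelta(minutes=10),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )


async def main() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="legal-document-classification",
        workflows=[DocumentClassificationWorkflow],
        activities=[classify_document],
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```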
2 Related Work
Most of the works related to the classification of long documents rely on splitting the documents. For instance, Pappagari et al. (2019) segment a long document into smaller chunks of 200 tokens, feed them into a BERT (Devlin et al., 2019) model, and propagate the resulting representations into either an LSTM or a transformer layer. CogLTX (Ding et al., 2020) is a classifier that uses only key sentences as input; these are obtained using a model trained as a judge. Park et al. (2022) presented two baselines using BERT where relevant sentences, determined by TextRank (Mihalcea and Tarau, 2004), or randomly selected ones, are concatenated to the first 512 tokens of a document.
Other works have explored how to increase the input size of Transformer-based models. The best example is Longformer (Beltagy et al., 2020), a model that is capable of processing up to 4,096 tokens by using a windowed local-context self-attention and a global attention. Nonetheless, this kind of model tends to suffer from large memory consumption and long processing times (Park et al., 2022).
In the legal domain, we can highlight the following works. Chalkidis et al. (2019) classified documents by training legal Doc2Vec embeddings (Le and Mikolov, 2014) and feeding them into a BiGRU with a Label-Wise Attention Network (Mullenbach et al., 2018). Similarly, Wan et al. (2019) trained legal Doc2Vec embeddings, but fed them into a BiLSTM with a chunk attention layer. LegalDB (Bambroo and Awasthi, 2021) and Lawformer (Xiao et al., 2021) converted, respectively, DistilBERT (Sanh et al., 2020) and a Chinese RoBERTa model (Cui et al., 2021) into a Longformer model. Both models were pre-trained using legal corpora. Similarly, Mamakas et al. (2022) converted LegalBERT (Chalkidis et al., 2020) into a legal hierarchical BERT and into a legal Longformer model which can process up to 8,192 tokens. D2GCFL (Wang et al., 2022) is a legal document classifier that extracts relations and represents them as four graphs, which are fed into a graph atten