Skim-Aware Contrastive Learning for Efficient Document Representation

Reading time: 5 minutes
...

📝 Original Info

  • Title: Skim-Aware Contrastive Learning for Efficient Document Representation
  • ArXiv ID: 2512.24373
  • Date: 2025-12-30
  • Authors: Waheed Ahmed Abro, Zied Bouraoui

📝 Abstract

Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.
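For context on the hierarchical route the abstract mentions, below is a minimal sketch of chunk-based document encoding: the document is split into encoder-sized chunks, each chunk is embedded independently, and the chunk vectors are pooled into one document vector, so cost grows with the number of chunks rather than quadratically with document length (for a 4096-token document, full self-attention touches roughly 4096² ≈ 16.8M token pairs, while eight 512-token chunks touch about 8 × 512² ≈ 2.1M). The encoder choice (bert-base-uncased), chunk size, and mean pooling are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of hierarchical chunk encoding for a long document.
# Assumptions (not from the paper): bert-base-uncased as the chunk encoder,
# ~512-token chunks, mean pooling of chunk vectors.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def encode_document(text: str, chunk_len: int = 510) -> torch.Tensor:
    """Split a long document into chunks, encode each chunk, and mean-pool."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunk_vecs = []
    with torch.no_grad():
        for start in range(0, len(ids), chunk_len):
            chunk = ids[start:start + chunk_len]
            # Wrap each chunk with [CLS]/[SEP] so it looks like a normal input.
            input_ids = torch.tensor(
                [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]]
            )
            out = encoder(input_ids=input_ids)
            chunk_vecs.append(out.last_hidden_state[0, 0])  # [CLS] vector
    # One fixed-size document vector; cost grows linearly with the chunk count.
    return torch.stack(chunk_vecs).mean(dim=0)

doc_vec = encode_document("A long legal or biomedical document. " * 500)
print(doc_vec.shape)  # torch.Size([768])
```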

💡 Deep Analysis

Figure 1

📄 Full Content

Skim-Aware Contrastive Learning for Efficient Document Representation

Waheed Ahmed Abro, CDEP - UR 2471, Univ Artois, France (wahmed.abro@univ-artois.fr)
Zied Bouraoui, CRIL - CNRS, Univ Artois, France (zied.bouraoui@cril.fr)

Abstract

Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.

1 Introduction

Since the introduction of Language Models (LMs), the focus in NLP has been on fine-tuning large pre-trained language models, especially for solving sentence- and paragraph-level tasks. However, accurately learning document embeddings continues to be an important challenge for several applications, such as document classification (Saggau et al., 2023), ranking (Ginzburg et al., 2021; Izacard et al., 2021), retrieval-augmented generation (RAG) systems that demand efficient document representation encoders (Zhang et al., 2024; Zhao et al., 2025), and legal and medical applications like judgment prediction (Chalkidis et al., 2019; Feng et al., 2022), legal information retrieval (Sansone and Sperlí, 2022), and biomedical document classification (Johnson et al., 2016; Wang et al., 2023).

Learning high-quality document representations is a challenging task due to the difficulty of developing efficient encoders with reasonable complexity. Most document encoders use sentence encoders based on self-attention architectures such as BERT (Devlin et al., 2019). However, it is not feasible to have inputs that are too long, as self-attention scales quadratically with the input length. To process long inputs efficiently, architectures such as Linformer (Wang et al., 2020), Big Bird (Zaheer et al., 2020a), Longformer (Beltagy et al., 2020a) and Hierarchical Transformers (Chalkidis et al., 2022a) have been developed. Unlike the quadratic scaling of traditional attention mechanisms, these architectures use sparse or hierarchical attention mechanisms that scale linearly. As such, they can process 4096 input tokens, which is enough to embed most types of documents, including legal and medical documents, among others.

While methods based on sparse attention networks offer a solution for complexity, the length of the document remains a problem for building faithful representations in downstream applications such as the legal and medical domains. First, fine-tuning these models for downstream tasks is computationally intensive.
Second, capturing the meaning of the whole document remains too complex. In particular, it is unclear how or to what extent inner dependencies between text fragments are considered. This is because longer documents contain more information than shorter documents, making it difficult to capture all the relevant information within a fixed-size representation. Additionally, documents usually cover different parts, making the encoding process complex and potentially leading to collapsed representations. This is particularly true for legal and medical documents, as they contain specialized terminology and text segments that describe a series of interrelated facts.

When domain experts, such as legal or medical professionals, read through documents, they skillfully skim the text, honing in on some text fragments that, when pieced together, provide an understanding of the content. Inspired by this intuitive process, our work focuses on developing document encoders capable of generating high-quality embeddings for long documents. These encoders mimic the expert ability to distill relevant text chunks, enabling them to excel in downstream tasks right out of the box, without the need for fine-tuning. We propose a novel framework for self-supervised contrastive learning that focuses on long legal and medical documents. Our approach features a self-supervised Chunk Prediction Encoder (CPE) designed to tackle the challenge of learning document representations w
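The exact CPE training objective is not included in the excerpt above, but the core idea of aligning a randomly held-out chunk with the rest of its own document while pushing it away from other documents can be sketched with a standard InfoNCE-style contrastive loss. Everything below (in-batch negatives, cosine similarity, the temperature value, and the helper name chunk_contrastive_loss) is an illustrative assumption, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact objective): contrast a held-out
# chunk of a document against the embedding of its remaining chunks, using
# the other documents in the batch as negatives (InfoNCE with in-batch negatives).
import torch
import torch.nn.functional as F

def chunk_contrastive_loss(masked_chunk_emb: torch.Tensor,
                           rest_of_doc_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """
    masked_chunk_emb: (batch, dim) embeddings of the randomly held-out chunks.
    rest_of_doc_emb:  (batch, dim) embeddings of each document with that chunk removed.
    The positive pair is (masked chunk i, rest of document i); all other
    documents in the batch act as negatives.
    """
    a = F.normalize(masked_chunk_emb, dim=-1)
    b = F.normalize(rest_of_doc_emb, dim=-1)
    # Cosine-similarity matrix between every held-out chunk and every document context.
    logits = a @ b.t() / temperature          # (batch, batch)
    targets = torch.arange(a.size(0))         # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Toy usage with random vectors standing in for encoder outputs.
chunks = torch.randn(8, 768)
contexts = torch.randn(8, 768)
print(float(chunk_contrastive_loss(chunks, contexts)))
```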


Reference

This content is AI-processed based on open access ArXiv data.
