Advantages of Domain Knowledge Injection for Legal Document Summarization: A Case Study on Summarizing Indian Court Judgments in English and Hindi
Summarizing Indian legal court judgments is a complex task, not only because of the intricate language and unstructured nature of legal texts, but also because a large section of the Indian population does not understand the complex English in which legal text is written, thus requiring summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text to generate summaries in both English and Hindi (the most widely spoken Indian language) by injecting domain knowledge into diverse summarization models. We propose a framework to enhance extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts. Further, we explore the injection of legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. These improvements are further validated by domain experts, demonstrating the effectiveness of our approaches.
💡 Research Summary
The paper tackles the challenging problem of summarizing Indian court judgments in both English and Hindi, motivated by the fact that a large majority of the Indian population lacks proficiency in the complex English used in legal documents. The authors build upon their previously introduced MILDSum dataset, which contains 3,122 Indian judgments with reference summaries in both languages, and propose a comprehensive framework that injects legal domain knowledge into a variety of summarization models—both extractive and generative.
For extractive summarization, the authors enhance the SummaRuNNer architecture by integrating a domain‑specific pre‑trained encoder, InLegalBERT, which has been trained on a large corpus of Indian legal texts. This encoder captures legal terminology, citation patterns, and structural cues that are otherwise missed by generic encoders, leading to more accurate sentence selection in the high‑extractiveness setting of MILDSum (extractive fragment coverage 0.90, density 24.42).
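The extractiveness statistics quoted above (coverage 0.90, density 24.42) follow the extractive-fragment formulation of Grusky et al. (2018): greedily match the longest article spans that cover the summary, then average fragment lengths (coverage) and squared fragment lengths (density). A minimal Python sketch of that computation, with naive whitespace tokenization as a simplifying assumption:

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily find, for each summary position, the longest span
    copied verbatim from the article (Grusky et al., 2018)."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            if article_tokens[j] == summary_tokens[i]:
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
        if best > 0:
            fragments.append(best)
            i += best
        else:
            i += 1  # summary token never appears in the article
    return fragments


def coverage_and_density(article, summary):
    # Whitespace tokenization is an assumption; the paper's exact
    # preprocessing may differ.
    a, s = article.lower().split(), summary.lower().split()
    frags = extractive_fragments(a, s)
    coverage = sum(frags) / len(s)                 # fraction of copied tokens
    density = sum(f * f for f in frags) / len(s)   # mean squared fragment length
    return coverage, density


# A summary copied verbatim from the source has coverage 1.0,
# and longer copied spans drive density up.
print(coverage_and_density(
    "the court held that the appeal is dismissed with costs",
    "the appeal is dismissed"))  # → (1.0, 4.0)
```

MILDSum's density of 24.42 thus indicates that reference summaries copy long contiguous spans from the judgments, which is why a strong sentence-extraction model is a natural fit.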
For generative summarization, the study explores continual pre‑training on two legal corpora: InLegalBERT‑PT (27K–68K English judgments) and the Bail Corpus (17K–34K Hindi judgments). The authors experiment with two training strategies: (1) full‑parameter fine‑tuning, and (2) GaLore, a memory‑efficient method that projects gradients into a low‑rank subspace during optimization. Results demonstrate that GaLore achieves nearly identical performance while drastically reducing GPU memory consumption, making large‑scale legal pre‑training more accessible.
The models evaluated span three architectural families—Encoder‑only, Encoder‑Decoder, and Decoder‑only—each kept at comparable parameter counts to isolate the effect of domain knowledge injection. Evaluation metrics include standard ROUGE‑2 and ROUGE‑L, semantic similarity scores (InLegal‑BERTScore for English and multilingual BERTScore for Hindi), and factual consistency measured by SummaC, an NLI‑based metric that assesses whether each generated summary sentence logically follows from the source document.
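Of the metrics above, ROUGE‑2 F1 is the simplest to make concrete: it is the clipped bigram overlap between a candidate summary and a reference. A minimal sketch under simplifying assumptions (whitespace tokenization, a single reference; production ROUGE implementations typically also apply stemming):

```python
from collections import Counter


def bigrams(text):
    # Naive lowercase whitespace tokenizer — an assumption for illustration.
    toks = text.lower().split()
    return Counter(zip(toks, toks[1:]))


def rouge2_f1(reference, candidate):
    ref, cand = bigrams(reference), bigrams(candidate)
    overlap = sum((ref & cand).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# 3 of 5 bigrams match in each direction → precision = recall = F1 = 0.6
print(rouge2_f1("the appeal is dismissed with costs",
                "the appeal is allowed with costs"))  # → 0.6
```

SummaC, by contrast, is not an overlap metric: it runs a natural language inference model over (source sentence, summary sentence) pairs and aggregates entailment scores, which is why it can flag hallucinated content that n‑gram metrics miss.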
Empirically, domain‑infused models outperform the prior state‑of‑the‑art baselines (SummaRuNNer for EN‑EN and CrossSum‑mT5 for EN‑HI) by substantial margins: ROUGE‑2 F1 improves by 20‑23% and ROUGE‑L F1 by 15‑19% across both language pairs. SummaC scores also rise, indicating reduced hallucination and better factual grounding. Human evaluation by legal experts corroborates these quantitative gains, with experts rating the injected models higher in accuracy, completeness, and appropriate legal phrasing. A direct comparison with GPT‑4 shows that, despite GPT‑4’s general language capabilities, it lags behind the specialized models on both domain relevance and cross‑lingual transfer.
Additional analyses examine the impact of pre‑training corpus size, revealing that even modestly sized legal corpora (tens of thousands of documents) yield meaningful performance boosts, suggesting that extensive data collection is not a strict prerequisite for domain adaptation. Cross‑lingual pre‑training experiments (English‑only, Hindi‑only, and mixed) further illuminate transfer effects, with mixed‑language pre‑training providing the best results for the EN‑HI task.
In summary, the paper makes four key contributions: (1) a novel method for injecting legal knowledge into extractive encoders, (2) a systematic approach to continual legal pre‑training for generative models, (3) validation of resource‑efficient training via GaLore, and (4) a thorough evaluation framework that includes legal‑specific semantic and factual metrics as well as expert human judgment. The findings demonstrate that domain knowledge injection markedly improves both the relevance and factual reliability of legal document summaries in multilingual settings, and the techniques are readily extensible to other specialized domains and languages.