LexPrecedentDB: A Multilingual Legal Precedent Database for Case Retrieval and Judicial Analytics

Yue Shen; Jiahao Wei; Lianming Xu

doi:10.63646/datamind.2023.010303

Published 2023-09-30

Yue Shen

School of Law and Intellectual Property, Guangdong University of Finance and Economics, Guangzhou 510320, China

Jiahao Wei

Department of Computer Science, Hebei University of Engineering, Handan 056038, China

Lianming Xu*

College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
lmxu@njfu.edu.cn

Abstract

The accelerating adoption of artificial intelligence in legal practice has created an urgent demand for large-scale, structured, multilingual repositories of judicial precedent. Existing legal corpora typically cover a single jurisdiction and language, lack formal schema design, omit critical citation and statute metadata, or are distributed under licences that prevent broad research use. This paper introduces LexPrecedentDB, an open multilingual legal precedent database covering 572,800 case opinions across eight jurisdictions and eight languages. LexPrecedentDB integrates three storage tiers: a relational store (PostgreSQL) for structured metadata and full texts, a citation graph (Neo4j) for forward and backward citation networks, and a vector index (FAISS/HNSW) for semantic case embeddings produced by a fine-tuned XLM-R model. The data pipeline performs OCR correction, deduplication, language identification, named-entity extraction, case-cause classification, and statute linking through a combination of rule-based aligners and transformer-based classifiers. The database exposes three interfaces: a REST API for case retrieval, a SPARQL endpoint for statute graph queries, and a Python SDK for analytical workflows. Benchmark experiments on held-out evaluation sets show that the full LexPrecedentDB pipeline achieves MAP@10 = 0.682 and case-cause classification F1 = 0.651, outperforming BM25 and multilingual BERT baselines across all five tested languages. Statute linking accuracy reaches 89.4%, and the expert-agreement rate in a blind validation study with eight practising lawyers is 82.7%. Ablation experiments confirm that each pipeline component contributes meaningfully to overall performance. The database is released under a CC-BY-NC 4.0 licence with a persistent DOI.
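
For readers unfamiliar with the retrieval metric reported above, the following is a minimal sketch of how MAP@10 is commonly computed. It is not taken from the LexPrecedentDB codebase: the function names, the case identifiers, and the choice to normalise each query's average precision by min(number of relevant cases, k) are assumptions made for illustration, and the paper's exact evaluation protocol may differ.

```python
from typing import Dict, List, Set


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Average precision over the top-k ranked case IDs for one query.

    Assumes `ranked` contains no duplicate IDs. Normalises by
    min(len(relevant), k), one common convention (an assumption here).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, case_id in enumerate(ranked[:k], start=1):
        if case_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k)


def map_at_k(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean of per-query average precision over every query in `qrels`.

    Queries with no ranked results contribute a score of 0.
    """
    if not qrels:
        return 0.0
    scores = [
        average_precision_at_k(runs.get(query_id, []), relevant, k)
        for query_id, relevant in qrels.items()
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical query and case identifiers, for illustration only.
    runs = {"q1": ["case_a", "case_b", "case_c"], "q2": ["case_x"]}
    qrels = {"q1": {"case_a", "case_c"}, "q2": {"case_y"}}
    print(f"MAP@10 = {map_at_k(runs, qrels):.3f}")
```

Under this convention, a query whose relevant cases all appear at the top of the ranking scores 1.0, and the reported figure is the mean of such per-query scores across the evaluation set.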

Keywords: legal precedent database; multilingual case retrieval; judicial analytics; cross-lingual information retrieval; statute linking; citation graph; XLM-R; legal NLP; open legal data

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Shen, Y., Wei, J., & Xu, L. (2023). LexPrecedentDB: A Multilingual Legal Precedent Database for Case Retrieval and Judicial Analytics. DATAMIND, 1(3), 19-33. https://doi.org/10.63646/datamind.2023.010303
