LexPrecedentDB: A Multilingual Legal Precedent Database for Case Retrieval and Judicial Analytics

Yue Shen
School of Law and Intellectual Property, Guangdong University of Finance and Economics, Guangzhou 510320, China
Jiahao Wei
Department of Computer Science, Hebei University of Engineering, Handan 056038, China
Lianming Xu*
College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
lmxu@njfu.edu.cn

DOI: https://doi.org/10.63646/datamind.2023.010303

Abstract

The accelerating adoption of artificial intelligence in legal practice has created an urgent demand for large-scale, structured, multilingual repositories of judicial precedent. Existing legal corpora typically cover only a single jurisdiction and language, lack a formal schema design, omit critical citation and statute metadata, or are distributed under licences that prevent broad research use. This paper introduces LexPrecedentDB, an open multilingual legal precedent database covering 572,800 case opinions across eight jurisdictions and eight languages. LexPrecedentDB integrates three storage tiers: a relational store (PostgreSQL) for structured metadata and full texts, a citation graph (Neo4j) for forward and backward citation networks, and a vector index (FAISS/HNSW) for semantic case embeddings produced by a fine-tuned XLM-R model. The data pipeline performs OCR correction, deduplication, language identification, named-entity extraction, case-cause classification, and statute linking through a combination of rule-based aligners and transformer-based classifiers. The database exposes three interfaces: a REST API for case retrieval, a SPARQL endpoint for statute graph queries, and a Python SDK for analytical workflows. Benchmark experiments on held-out evaluation sets show that the full LexPrecedentDB pipeline achieves MAP@10 = 0.682 and a case-cause classification F1 of 0.651, outperforming BM25 and multilingual BERT baselines across all five tested languages. Statute linking accuracy reaches 89.4%, and the expert-agreement rate in a blind validation study with eight practising lawyers is 82.7%. Ablation experiments confirm that each pipeline component contributes to overall performance. The database is released under a CC-BY-NC 4.0 licence with a persistent DOI.

Article details

How to Cite

Shen, Y., Wei, J., & Xu, L. (2023). LexPrecedentDB: A Multilingual Legal Precedent Database for Case Retrieval and Judicial Analytics. DATAMIND, 1(3), 19-33. https://doi.org/10.63646/datamind.2023.010303