Main article

Yuxiang Liang
School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
Tianhao Qin
School of Software, Shandong Normal University, Jinan 250358, China
Beibei Hu
School of Information Management, Heilongjiang University, Harbin 150080, China
Zhenyu Hou*
School of Mathematics and Computer Science, Yan'an University, Yan'an 716000, China
zhenyu.hou@yau.edu.cn

DOI: https://doi.org/10.63646/datamind.2023.010203

Abstract

Research databases are growing in both number and heterogeneity. Even within a single discipline, analysts routinely confront relational stores, document collections, graph databases, vector indexes, and lakehouse-style tables, each carrying its own conventions for field naming, type encoding, key declaration, and provenance recording. The result is a recurring bottleneck: a disproportionate share of any data analysis project is spent re-discovering schema structure that, in principle, has already been recorded by someone else. This article presents MetaSchema, a schema discovery and harmonization toolkit aimed at this bottleneck. MetaSchema is organised as a four-stage pipeline (automatic schema profiling, large-language-model-assisted field annotation, cross-source entity and field matching, and reviewer-in-the-loop version control) that transforms a collection of heterogeneous databases into a unified, queryable schema graph with a field dictionary, cross-source mapping tables, and a reproducible query interface. We describe the design decisions that make the toolkit practically deployable, including its hybrid matching layer, its structured human-review protocol, and its semantic-version log. An empirical evaluation on a benchmark of twelve heterogeneous databases, totalling 2,418 tables and 27,640 fields, shows that MetaSchema achieves a field-type recovery accuracy of 86.4%, a cross-source field matching F1-score of 0.821, and a 67% reduction in median reviewer time per 100 fields compared with a careful manual baseline. The toolkit scales near-linearly up to 5,000 tables and integrates with relational, graph, vector, and lakehouse storage layers. MetaSchema is released as open-source software together with the benchmark, the evaluation scripts, and a reproducible query API designed to support automated analysis, model evaluation, and downstream decision tools.

Article details

How to Cite

Liang, Y., Qin, T., Hu, B., & Hou, Z. (2023). MetaSchema: A Schema Discovery and Harmonization Toolkit for Heterogeneous Research Databases. DATAMIND, 1(2), 16-32. https://doi.org/10.63646/datamind.2023.010203