Data Lakehouse Architectures for AI Infrastructure: A Systematic Review of Storage, Metadata, and Governance

Daniel R. Okafor; Mei-Ling Tan; Sven A. Halvorsen; Priya Raghunathan

doi:10.63646/datamind.2023.010306


Received 2023-04-16

Accepted 2023-08-28

Published 2023-09-30

Daniel R. Okafor

Department of Computer Science, Institute for Data Systems Engineering, Trondheim 7034, Norway

Mei-Ling Tan*

School of Computing and Information Systems, National Data Infrastructure Laboratory, Singapore 117417, Singapore
mei-ling.tan@ndil-lab.edu.sg

Sven A. Halvorsen

Department of Computer Science, Institute for Data Systems Engineering, Trondheim 7034, Norway

Priya Raghunathan

Center for Computational Discovery, Pacific Institute of Technology, Berkeley, CA 94720, USA

DOI: https://doi.org/10.63646/datamind.2023.010306

Abstract

The data lakehouse has emerged as the dominant architectural pattern for serving both analytical SQL and machine-learning workloads from a single copy of data held on low-cost cloud object storage. Yet the engineering problem at its core is frequently underspecified: object stores expose a key–value interface with high per-object latency and weak listing consistency, which makes performant, transactional, and governable table storage difficult to achieve. This article presents a systematic review of lakehouse architectures organised around three engineering pillars: the storage layer (open columnar formats and physical optimisation), the metadata layer (transaction logs, manifests, catalogs, lineage, and discovery), and the governance layer (access control, policy, data quality, and auditability). Following a PRISMA-style protocol, 1,284 records were screened and 30 peer-reviewed studies were coded across thirteen dimensions. We formalise a layered reference architecture, a normalised metadata-catalog schema, and an optimistic-concurrency commit protocol that together make the design space explicit. To quantify the contribution of each metadata mechanism, we develop a reproducible analytical cost model and evaluate four storage configurations on four AI-oriented workloads (full-scan training reads, point feature lookups, time-travel snapshots, and concurrent ingestion). The evaluation shows that a transaction log reduces query-planning time by roughly two orders of magnitude relative to object-store listing, that column statistics and data skipping reduce point-lookup latency by nearly four orders of magnitude, and that compaction trades a 6–8% storage overhead for further planning and locality gains. We report run-to-run variance and sensitivity to file count, and we synthesise the coded corpus into design guidelines for data engineers building AI data infrastructure and computational discovery systems. All code, the coded literature matrix, and the data dictionary are released as supplementary material.

Keywords: Data lakehouse; storage engines; metadata management; data governance; AI data infrastructure; data engineering; computational discovery

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Okafor, D. R., Tan, M.-L., Halvorsen, S. A., & Raghunathan, P. (2023). Data Lakehouse Architectures for AI Infrastructure: A Systematic Review of Storage, Metadata, and Governance. DATAMIND, 1(3), 66-84. https://doi.org/10.63646/datamind.2023.010306
