Data Lakehouse Architectures for AI Infrastructure: A Systematic Review of Storage, Metadata, and Governance
Abstract
The data lakehouse has emerged as the dominant architectural pattern for serving both analytical SQL and machine-learning workloads from a single copy of data held on low-cost cloud object storage. Yet the engineering problem at its core is frequently underspecified: object stores expose a key–value interface with high per-object latency and weak listing consistency, which makes performant, transactional, and governable table storage difficult to achieve. This article presents a systematic review of lakehouse architectures organised around three engineering pillars: the storage layer (open columnar formats and physical optimisation), the metadata layer (transaction logs, manifests, catalogs, lineage, and discovery), and the governance layer (access control, policy, data quality, and auditability). Following a PRISMA-style protocol, we screened 1,284 records and coded 30 peer-reviewed studies across thirteen dimensions. We formalise a layered reference architecture, a normalised metadata-catalog schema, and an optimistic-concurrency commit protocol that together make the design space explicit. To quantify the contribution of each metadata mechanism, we develop a reproducible analytical cost model and evaluate four storage configurations on four AI-oriented workloads (full-scan training reads, point feature lookups, time-travel snapshots, and concurrent ingestion). The evaluation shows that a transaction log reduces query-planning time by roughly two orders of magnitude relative to object-store listing, that column statistics and data skipping reduce point-lookup latency by nearly four orders of magnitude, and that compaction trades a 6–8% storage overhead for further planning and locality gains. We report run-to-run variance and sensitivity to file count, and we synthesise the coded corpus into design guidelines for data engineers building AI data infrastructure and computational discovery systems.
All code, the coded literature matrix, and the data dictionary are released as supplementary material.
