RAGBase: A Hybrid Vector–Graph Database Architecture for Retrieval-Augmented Generation

Rafael Almeida Souza; Mariana Lopes Ferreira; Diego Carvalho Rocha; Camila Pereira Nunes

doi:10.63646/datamind.2023.010305

Published 2023-09-30

Rafael Almeida Souza

Department of Computer Science, Federal University of Lavras, Lavras 37200-900, Brazil

Mariana Lopes Ferreira*

Institute of Informatics, Federal University of Goiás, Goiânia 74690-900, Brazil
mariana.ferreira@inf.ufg.br

Diego Carvalho Rocha

Center of Technology, Federal University of Ceará, Fortaleza 60440-900, Brazil

Camila Pereira Nunes

Faculty of Computing, Federal University of Mato Grosso do Sul, Campo Grande 79070-900, Brazil

Abstract

Retrieval-augmented generation has rapidly become the de facto pattern for grounding large language models in private and dynamic knowledge, yet the backends used in practice fall at two extremes. Vector-only databases handle semantic similarity efficiently but struggle with multi-hop relational queries, where the answer requires traversing several entities. Knowledge-graph-only backends handle relational queries elegantly but miss semantically paraphrased evidence that is not encoded as explicit relations. This article presents RAGBase, a hybrid vector-graph database architecture that treats the retrieval database itself as the principal research artifact and unifies both retrieval modes through a documented schema, a typed field dictionary, indexed evidence storage, a quality-control pipeline, and a reusable application programming interface. Six core entities (DOCUMENT, CHUNK, EMBEDDING, ENTITY, RELATION, EVIDENCE) are organized so that every retrieved fragment, whether it was reached by vector similarity or by graph traversal, traces back to a single canonical evidence record. A learned query router decides per query whether to issue a vector recall, a graph traversal, a fused hybrid retrieval, or a BM25 fallback, and an evidence fusion module merges the resulting candidate set before passing it to the generator. We benchmark RAGBase on a working corpus of 24.7 million chunks drawn from Wikipedia, Wikidata, three biomedical knowledge bases, and a Brazilian Portuguese legal corpus, and we report runnable experiments on single-hop Natural Questions, multi-hop HotpotQA, end-to-end latency, build cost, and evidence accuracy.
RAGBase improves single-hop exact match from 58.7 percent (the strongest baseline) to 64.2 percent, raises multi-hop exact match from 47.3 to 56.8 percent, sustains end-to-end p95 latency below 412 milliseconds at a corpus size of 24.7 million chunks, halves the build cost relative to the strongest graph baseline at 96.8 US dollars per million chunks, and lifts evidence accuracy from 81.2 to 89.4 percent. The schema, dictionaries, and reproduction scripts are released under an open license.

Keywords: retrieval-augmented generation; vector database; knowledge graph; hybrid retrieval; large language models; evidence fusion; question answering; database schema

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Souza, R. A., Ferreira, M. L., Rocha, D. C., & Nunes, C. P. (2023). RAGBase: A Hybrid Vector–Graph Database Architecture for Retrieval-Augmented Generation. DATAMIND, 1(3), 51-65. https://doi.org/10.63646/datamind.2023.010305
