RAGBase: A Hybrid Vector–Graph Database Architecture for Retrieval-Augmented Generation
Main article
Abstract
Retrieval-augmented generation has rapidly become the de facto pattern for grounding large language models in private and dynamic knowledge, yet the backends used in practice sit at one of two extremes. Vector-only databases handle semantic similarity efficiently but struggle with multi-hop relational queries, where the answer requires traversing several entities. Knowledge-graph-only backends handle relational queries elegantly but miss semantically paraphrased evidence that is not encoded as explicit relations. This article presents RAGBase, a hybrid vector–graph database architecture that treats the retrieval database itself as the principal research artifact and unifies both retrieval modes through a documented schema, a typed field dictionary, indexed evidence storage, a quality-control pipeline, and a reusable application programming interface. Six core entities (DOCUMENT, CHUNK, EMBEDDING, ENTITY, RELATION, EVIDENCE) are organized so that every retrieved fragment, whether it was reached by vector similarity or by graph traversal, traces back to a single canonical evidence record. A learned query router decides per query whether to issue a vector recall, a graph traversal, a fused hybrid retrieval, or a BM25 fallback, and an evidence fusion module merges the resulting candidate set before passing it to the generator. We benchmark RAGBase on a working corpus of 24.7 million chunks drawn from Wikipedia, Wikidata, three biomedical knowledge bases, and a Brazilian Portuguese legal corpus, and we report runnable experiments on single-hop Natural Questions, multi-hop HotpotQA, end-to-end latency, build cost, and evidence accuracy.
RAGBase improves single-hop exact match from 58.7 percent (the strongest baseline) to 64.2 percent, raises multi-hop exact match from 47.3 to 56.8 percent, sustains end-to-end p95 latency below 412 milliseconds at a corpus size of 24.7 million chunks, halves the build cost relative to the strongest graph baseline at 96.8 US dollars per million chunks, and lifts evidence accuracy from 81.2 to 89.4 percent. The schema, dictionaries, and reproduction scripts are released under an open license.
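As a rough illustration of how candidates from several retrieval routes can be merged once they all resolve to canonical evidence records, the sketch below models the four routes and fuses ranked lists of evidence IDs. It is not the RAGBase implementation: the abstract does not specify the fusion method, so reciprocal rank fusion is used here as a stand-in, and the names `Route`, `Evidence`, `fuse_by_evidence`, the field names, and the constant `k=60` are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """The four retrieval modes the query router chooses between."""
    VECTOR = "vector"
    GRAPH = "graph"
    HYBRID = "hybrid"
    BM25 = "bm25"


@dataclass(frozen=True)
class Evidence:
    """Canonical evidence record (field names are hypothetical)."""
    evidence_id: str
    document_id: str
    chunk_id: str


def fuse_by_evidence(ranked_lists, k=60):
    """Merge per-route rankings of evidence IDs into one ranking.

    ranked_lists maps a Route to evidence IDs ordered best-first.
    Reciprocal rank fusion is an assumed stand-in, not the paper's
    fusion module: each ID scores the sum of 1 / (k + rank) over lists.
    """
    scores = defaultdict(float)
    for ids in ranked_lists.values():
        seen = set()
        for rank, evidence_id in enumerate(ids, start=1):
            if evidence_id in seen:
                continue  # a repeated ID counts only at its best rank
            seen.add(evidence_id)
            scores[evidence_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda e: (-scores[e], e))


# Toy usage: "e2" is reached by both routes, so it ranks first.
ranked = {
    Route.VECTOR: ["e1", "e2", "e3"],
    Route.GRAPH: ["e2", "e4"],
}
fused = fuse_by_evidence(ranked)
```

Because both routes key their candidates on the same evidence ID, a fragment found by vector recall and by graph traversal is counted once and accumulates score from both lists, which is the property the single canonical evidence record is meant to provide.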
