Main article

Elena Vasquez
Department of Computer Science and Cybersecurity, University of the West of Scotland, Paisley PA1 2BE, UK
Tobias Reinhardt
Faculty of Information Technology, Reutlingen University, 72762 Reutlingen, Germany
Siobhán Murphy*
School of Computing, Engineering and Intelligent Systems, Ulster University, Derry BT48 7JL, UK
s.murphy@ulster.ac.uk

DOI: https://doi.org/10.63646/datamind.2024.020305

Abstract

The network intrusion detection research community has long contended with a fragmented landscape of publicly available traffic datasets, each characterised by distinct labelling conventions, capture methodologies, temporal scopes, and feature-engineering choices. This heterogeneity impedes reproducible model evaluation, cross-dataset generalisation, and the development of trustworthy AI-driven intrusion analytics. This paper introduces CyberTraceDB, a curated, schema-documented, multilabel network-attack trace database that unifies 857,300 labelled flow records sourced from five widely used benchmark corpora (NSL-KDD, CIC-IDS2017, UNSW-NB15, CIC-DDoS2019, and CTU-13). The database addresses three systematic deficiencies of its constituent sources: conflicting label taxonomies are resolved through a hierarchical label-harmonisation pipeline supported by a 9-class canonical attack taxonomy; missing values are imputed using temporal k-nearest-neighbour matching; and noise is suppressed through a multi-signal quality scoring mechanism. CyberTraceDB implements a four-tier storage architecture: a PostgreSQL relational store for structured session metadata, a TimescaleDB hypertable for time-partitioned flow statistics, a Neo4j property graph for attack chain and lateral-movement relationships, and a FAISS vector index of flow embeddings for similarity-based retrieval. Three application programming interfaces (a REST endpoint, a Python SDK, and a benchmark harness) serve the primary use cases of IDS model training, threat hunting, and forensic trace replay. Experimental evaluation on the held-out test partition demonstrates that a fine-tuned Transformer classifier trained on CyberTraceDB achieves a macro F1 of 0.951 and a false-positive rate of 1.4%, compared with a macro F1 of 0.912 and a false-positive rate of 2.2% for the CIC-IDS2017 baseline. Cross-dataset transfer experiments and Cohen’s κ label consistency analysis confirm the superior labelling quality and generalisation potential of CyberTraceDB over any single constituent corpus. The full database, construction pipeline, and benchmark code are released under CC BY 4.0 with a persistent DOI.
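
For readers unfamiliar with the three evaluation quantities named in the abstract (macro F1, false-positive rate, and Cohen’s κ), the following is a minimal sketch of how they are commonly computed from label lists. It is not taken from the CyberTraceDB benchmark harness: the function names, the toy labels, and the choice of "benign" as the negative class are assumptions made for illustration, and the paper may define the false-positive rate differently.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def false_positive_rate(y_true, y_pred, benign="benign"):
    """Share of truly benign flows predicted as any attack class.

    Assumption: 'benign' is the negative class label (hypothetical name).
    """
    preds_for_benign = [p for t, p in zip(y_true, y_pred) if t == benign]
    if not preds_for_benign:
        return 0.0
    return sum(p != benign for p in preds_for_benign) / len(preds_for_benign)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labellings of the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[c] * count_b[c] for c in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellings use one identical class throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented purely for illustration.
y_true = ["benign", "benign", "dos", "dos", "probe", "probe"]
y_pred = ["benign", "dos", "dos", "dos", "probe", "benign"]

print(macro_f1(y_true, y_pred))
print(false_positive_rate(y_true, y_pred))
print(cohen_kappa(y_true, y_pred))
```

Macro averaging gives every attack class the same weight regardless of how many flows it contains, which is why it is a common choice for imbalanced intrusion data. Cohen’s κ is used here in its usual two-rater form; applied to label consistency, the two "raters" would be two labelling sources for the same flows.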

Article details

How to Cite

Vasquez, E., Reinhardt, T., & Murphy, S. (2024). CyberTraceDB: A Curated Network-Attack Trace Database for Database-Centered Intrusion Analytics. DATAMIND, 2(3), 59-72. https://doi.org/10.63646/datamind.2024.020305