Main article

Daniel R. Okafor
Institute for Data Systems and Learning, Norwich NR4 7TJ, United Kingdom
Sofia Mäkinen
Institute for Data Systems and Learning, Norwich NR4 7TJ, United Kingdom
Yiwen Tang*
Department of Computing, Imperial Data Science Lab, London SW7 2AZ, United Kingdom
yiwen.tang@idsl.ac.uk
Rahul Banerjee
School of Computer Science and Engineering, Bengaluru 560012, India

DOI: https://doi.org/10.63646/datamind.2024.020206

Abstract

Industrial Internet-of-Things deployments now instrument plants with thousands of networked sensors whose readings arrive at irregular rates, with gaps, and with strong cross-sensor dependence induced by physical topology. For analytics on such data, the dominant bottleneck is increasingly not modelling accuracy but the data-engineering pipeline that ingests, validates, aligns, imputes, stores, indexes, and serves the multivariate stream. Existing open time-series benchmarks were built to evaluate forecasting or classification models on clean, regularly sampled, and essentially flat collections of series; they neither preserve the sensor graph nor exercise the pipeline end-to-end, and they rarely ship the schema, data dictionary, and persistent identifiers needed for reproducible systems research. We introduce ISGraph, an open Industrial Sensor Graph dataset and benchmark designed specifically to evaluate time-series data-engineering pipelines. ISGraph couples a physically grounded generative model of plant sensor networks with explicit node and edge tables, controlled fault injection, and controlled missingness and irregular sampling. It is distributed with a normalised relational-plus-graph schema, a complete field-level data dictionary, a reference pipeline implementation, a query API, and a permanent repository deposit. We define a benchmark protocol over five pipeline tasks (schema validation, temporal alignment, graph-aware imputation, feature extraction, and windowed query serving) with control baselines, throughput and latency metrics, and reconstruction-error analysis. Using the reference pipeline, we show that graph-aware imputation lowers reconstruction RMSE by 21–39% relative to forward-fill and to competitive flat baselines, that the advantage grows with node degree and with gap length, and that a columnar store augmented with a graph-neighbourhood index sustains the highest ingestion throughput while delivering the lowest windowed-query latency. ISGraph is intended as shared infrastructure for database, data-engineering, and computational-discovery research on industrial time series.
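
To make the imputation comparison in the abstract concrete, the following is a minimal sketch, not the ISGraph reference pipeline. It assumes a toy three-sensor line graph and a simple neighbour-mean rule standing in for graph-aware imputation; the abstract does not specify the actual method. All names (edges, truth, observed, forward_fill, neighbour_mean_fill, rmse) are hypothetical, and the data are constructed by hand, so the resulting numbers say nothing about the reported 21–39% improvement.

```python
import math

# Hypothetical toy sensor graph: a - b - c (an edge table as a list of pairs).
edges = [("a", "b"), ("b", "c")]

# Hand-made ground truth; sensor b sits physically between a and c.
truth = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.5, 2.5, 3.5, 4.5, 5.5],
    "c": [2.0, 3.0, 4.0, 5.0, 6.0],
}

# Inject a two-step gap into sensor b (None marks a missing reading).
gap = [2, 3]
observed = {name: list(values) for name, values in truth.items()}
for t in gap:
    observed["b"][t] = None


def neighbours(node, edges):
    """Return the nodes that share an undirected edge with `node`."""
    return [v if u == node else u for u, v in edges if node in (u, v)]


def forward_fill(series):
    """Flat baseline: carry the last observed value forward."""
    out, last = [], None
    for x in series:
        if x is not None:
            last = x
        out.append(last)
    return out


def neighbour_mean_fill(node, observed, edges):
    """Assumed graph-aware rule: average the neighbours' readings at the same step."""
    out = list(observed[node])
    for t, x in enumerate(out):
        if x is None:
            vals = [observed[n][t] for n in neighbours(node, edges)
                    if observed[n][t] is not None]
            if vals:
                out[t] = sum(vals) / len(vals)
    return out


def rmse(estimate, reference, steps):
    """Reconstruction RMSE restricted to the imputed time steps."""
    return math.sqrt(sum((estimate[t] - reference[t]) ** 2 for t in steps) / len(steps))


flat = forward_fill(observed["b"])
graph = neighbour_mean_fill("b", observed, edges)
print("forward-fill RMSE:", rmse(flat, truth["b"], gap))
print("neighbour-mean RMSE:", rmse(graph, truth["b"], gap))
```

In this constructed case the neighbour mean recovers the gap exactly because sensor b was defined as the midpoint of its neighbours; on real plant data the gain would depend on topology, node degree, and gap length, as the abstract notes.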

Article details

How to Cite

Okafor, D. R., Mäkinen, S., Tang, Y., & Banerjee, R. (2024). Open Industrial Sensor Graph Dataset for Evaluating Time-Series Data Engineering Pipelines. DATAMIND, 2(2), 69-84. https://doi.org/10.63646/datamind.2024.020206