EduRiskDB: A Student Learning and Dropout-Risk Database for Explainable Educational Analytics
Main article
Abstract
Predicting student dropout before it occurs requires not just predictive models but purpose-built, ethically governed data infrastructure that integrates learning-platform clickstreams, assessment outcomes, assignment submission patterns, forum engagement, attendance records, and student support interactions. This paper introduces EduRiskDB, a relational-graph-vector hybrid database containing 148,392 student-term records drawn from four European higher-education institutions over six academic years (2017–2023). EduRiskDB is designed for reproducible experimentation in dropout-risk modelling and explainable educational analytics. The database schema, field dictionary, indexing strategy, data-quality controls, ethics-compliance pipeline, and open-access interfaces are described in detail. A benchmark experiment evaluates six dropout-prediction models on EduRiskDB, demonstrating that the augmented XGBoost configuration achieves an AUC-ROC of 0.903 with a mean early-warning lead time of 7.8 weeks, outperforming all baselines. SHAP attribution identifies cumulative click sequences, seven-day assignment lag, and attendance streaks as the three most predictive signals, with non-linear interactions confirmed by dependence plots. EduRiskDB is archived on Zenodo under a CC BY 4.0 licence and updated semi-annually.
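
The two headline benchmark metrics can be illustrated with a minimal sketch. It is not the EduRiskDB evaluation code. It assumes that lead time means the number of weeks between a student's first risk alert and the recorded dropout week, counted only for students flagged on or before that week; the abstract does not give this definition. The function names, student identifiers, and toy values are all hypothetical.

```python
from statistics import mean


def auc_roc(labels, scores):
    """Rank-based AUC-ROC: the probability that a randomly chosen dropout
    case (label 1) is scored above a randomly chosen non-dropout case
    (label 0), with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_lead_time_weeks(first_alert_week, dropout_week):
    """Mean number of weeks between first alert and dropout.

    Assumed definition (not taken from the paper): only students who
    dropped out and were flagged on or before their dropout week count.
    """
    leads = [
        dropout_week[s] - first_alert_week[s]
        for s in dropout_week
        if s in first_alert_week and first_alert_week[s] <= dropout_week[s]
    ]
    return mean(leads) if leads else float("nan")


# Toy data, invented purely for illustration.
labels = [1, 1, 0, 0, 1, 0]               # 1 = dropped out
scores = [0.9, 0.7, 0.4, 0.2, 0.3, 0.6]   # model risk scores

first_alert_week = {"s1": 4, "s2": 6, "s5": 12}   # week of first risk alert
dropout_week = {"s1": 12, "s2": 13, "s5": 11}     # week dropout was recorded

auc = auc_roc(labels, scores)
lead = mean_lead_time_weeks(first_alert_week, dropout_week)
print(f"AUC-ROC: {auc:.3f}, mean lead time: {lead:.1f} weeks")
```

In this toy example student s5 is flagged only after the recorded dropout week and is therefore left out of the lead-time average. A real evaluation would also need to fix the alert threshold and state how students who never receive an alert are handled.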
