PharmaSignalDB: An Open Pharmacovigilance Knowledge Database for Adverse Event Mining
Abstract
Post-market drug safety surveillance remains one of the most data-intensive challenges in modern biomedical informatics. Spontaneous adverse event reporting systems generate millions of records annually, yet the absence of a curated, schema-documented, and openly accessible knowledge database impedes reproducible pharmacovigilance research and automated signal detection. This paper presents PharmaSignalDB, an open pharmacovigilance knowledge database designed for adverse event mining, disproportionality analysis, and drug-event knowledge graph construction. Built upon 11 years of processed FDA Adverse Event Reporting System (FAERS) quarterly data files from 2012 to 2023, PharmaSignalDB integrates 4,287,194 deduplicated individual case safety reports covering 8,341 drug substances and 19,472 unique preferred terms mapped to MedDRA version 26.1. The database schema comprises six normalized relational tables with complete field dictionaries, primary-key indexing, and foreign-key constraints. A multi-stage data pipeline performs deduplication through CASE_ID-based elimination followed by fuzzy matching, maps terminology to MedDRA at both the preferred-term (PT) and system organ class (SOC) levels, and applies version-controlled incremental updates. Three validation experiments are reported: (1) stability of reporting odds ratio (ROR) and proportional reporting ratio (PRR) signals across quarterly update cycles; (2) quantification of the duplicate report rate; and (3) accuracy of serious adverse event identification, benchmarked against a EudraVigilance reference set. PharmaSignalDB achieves a deduplication rate of 18.3%, an overall field completeness of 91.2%, and a positive predictive value of 0.847 for serious event classification. The database is released under the Creative Commons Attribution 4.0 license (CC BY 4.0) with standardized Python and SQL interfaces for reproducible research.
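For readers unfamiliar with disproportionality analysis, the following minimal sketch shows how ROR and PRR are conventionally computed from a 2x2 contingency table of report counts. It is an illustration under stated assumptions, not the PharmaSignalDB interface: the function name `disproportionality`, the result type, and the example counts are hypothetical, and the 95% confidence interval uses the standard log-normal approximation.

```python
import math
from typing import NamedTuple


class Disproportionality(NamedTuple):
    prr: float          # proportional reporting ratio
    ror: float          # reporting odds ratio
    ror_ci_low: float   # lower bound of the 95% CI for the ROR
    ror_ci_high: float  # upper bound of the 95% CI for the ROR


def disproportionality(a: int, b: int, c: int, d: int) -> Disproportionality:
    """Compute PRR and ROR from a 2x2 table of report counts.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with any other drug and the event of interest
    d: reports with any other drug and any other event
    """
    # Assumption of this sketch: reject zero cells instead of applying
    # a continuity correction.
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive in this sketch")

    prr = (a / (a + b)) / (c / (c + d))
    ror = (a * d) / (b * c)

    # Standard error of ln(ROR) under the log-normal approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(ror) - 1.96 * se)
    high = math.exp(math.log(ror) + 1.96 * se)
    return Disproportionality(prr, ror, low, high)


# Hypothetical counts, chosen only to illustrate the arithmetic.
result = disproportionality(a=20, b=80, c=100, d=9800)
print(f"PRR={result.prr:.2f} ROR={result.ror:.2f} "
      f"95% CI=({result.ror_ci_low:.2f}, {result.ror_ci_high:.2f})")
```

A commonly used screening heuristic treats a drug-event pair as a potential signal when the lower bound of the ROR confidence interval exceeds 1. Whether PharmaSignalDB applies this or another threshold is not stated in the abstract.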
