Analysis and Prediction of Infectious Disease Outbreak Incidence and Related Mortality by Integrating Diverse Data Sources Using Statistical Modeling and Machine Learning Methods
Aleksandr Shishkin
Abstract
Accurate and timely forecasts of infectious disease incidence and related mortality are critical for effective public health responses. Traditional surveillance data, while invaluable, often suffer from reporting delays, necessitating the exploration of auxiliary data sources. This work leverages internet search data, molecular epidemiological information, and traditional surveillance data to improve outbreak predictions.

The first study examines the COVID-19 pandemic burden in Ukraine using excess mortality analysis from 2020 to 2021. By comparing observed all-cause and cause-specific mortality with expected historical trends, the study quantifies the pandemic’s impact. Three distinct waves of excess mortality were identified, corresponding with peaks in lab-confirmed COVID-19 deaths. Cause-specific analyses revealed significant excess mortality from pneumonia and circulatory system diseases, highlighting the broader health impacts of the pandemic beyond direct COVID-19 fatalities.

The second study investigates the utility of Google search queries related to COVID-19 as supplementary data for forecasting incidence and mortality. Predictive keywords were identified through Granger causality tests and cross-correlation analyses. ARIMA, Prophet, and XGBoost models were then employed to compare baseline forecasts (using only traditional surveillance data) with enhanced models incorporating search query data. The inclusion of top-ranked keywords significantly improved predictive accuracy, with gains ranging from 50% to 90% in certain scenarios.

The third study develops a novel approach for outbreak investigation and forecasting by integrating molecular data with internet search trends. Hepatitis C virus (HCV) sequence data from the Scott County outbreak were analyzed using Bayesian evolutionary models to estimate historical viral population sizes. These estimates were correlated with Google Trends data, and predictive models were constructed to assess the added value of search data in forecasting disease prevalence. The integration of molecular and internet-based data sources demonstrated potential improvements in predictive performance.

Collectively, this dissertation underscores the importance of combining traditional epidemiological data with innovative auxiliary data sources and advanced modeling techniques. The findings contribute to the field of infectious disease epidemiology by offering improved methodologies for outbreak prediction and public health decision-making.
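
The excess mortality comparison used in the first study (observed deaths against an expected historical level) can be illustrated with a minimal sketch. This is not the dissertation's actual model: the baseline here is assumed to be a simple average of the same calendar period across earlier years, the function names `expected_deaths` and `excess_mortality` are hypothetical, and the counts are made up purely for illustration.

```python
from statistics import mean


def expected_deaths(history):
    """Expected deaths per period, taken as the average across baseline years.

    history: list of per-year lists, each holding one death count per period
    (for example, one value per month). This averaging rule is an assumption.
    """
    return [mean(period_counts) for period_counts in zip(*history)]


def excess_mortality(observed, history):
    """Observed minus expected deaths for each period."""
    expected = expected_deaths(history)
    return [obs - exp for obs, exp in zip(observed, expected)]


# Illustrative numbers only, NOT data from the study:
# three baseline years, four periods each.
baseline_years = [
    [100, 90, 80, 95],
    [110, 95, 85, 100],
    [105, 100, 90, 105],
]
observed_pandemic_year = [108, 120, 130, 99]

excess = excess_mortality(observed_pandemic_year, baseline_years)
print(excess)
```

Positive values indicate periods with more deaths than the baseline would suggest, and runs of such periods correspond to what the abstract calls waves. A real analysis would typically replace the plain average with a model that accounts for trend and seasonality.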
