Briefing Document: Leveraging AHP and Transfer Learning for Infectious Disease Outbreak Prediction
Dates: Received - 10 August 2024 | Accepted - 26 November 2024 | Published - 31 December 2024
Citation: Abdallah, R., Abdelgaber, S. & Sayed, H.A. Leveraging AHP and transfer learning in machine learning for improved prediction of infectious disease outbreaks. Sci Rep 14, 32163 (2024). https://doi.org/10.1038/s41598-024-81367-1
Keywords: Infectious diseases, AHP, Transfer learning, Risk factors, Machine learning, Dengue, Chikungunya, Zika, Outbreak prediction
Executive Summary:
This study applies an ensemble machine learning model that integrates Random Forest, XGBoost, and Gradient Boosting to predict outbreaks of Dengue, Chikungunya, and Zika, using the Analytic Hierarchy Process (AHP) for feature selection and transfer learning to share knowledge across diseases. Using a dataset of climate and socioeconomic factors from Colombia (2007-2017), the researchers report that this integrated approach improves the accuracy and reliability of outbreak predictions. The ensemble model performed best on Zika, with an accuracy of 96.80% and an AUC of 0.9197. The study highlights the importance of considering a broad range of risk factors, incorporating expert knowledge through AHP, and using transfer learning to overcome data scarcity and improve model generalizability across related diseases.
Main Themes and Important Ideas/Facts:
- The Critical Need for Precise Outbreak Prediction:
- Infectious diseases pose a significant threat to global health and economic stability, underscoring the crucial need for accurate outbreak predictions to enable effective mitigation strategies.
- "Infectious diseases significantly impact both public health and economic stability, underscoring the critical need for precise outbreak predictions to effectively mitigate their impact."
- Early and precise prediction allows public health authorities to deploy timely control and prevention strategies, optimizing resource allocation and minimizing the impact of outbreaks.
- "Early and precise prediction of such outbreaks is essential for public health authorities to deploy effective control and prevention strategies, thereby mitigating impacts on public health and resources."
- Limitations of Traditional Prediction Methods and the Promise of Machine Learning:
- Traditional methods face challenges due to the complex dynamics of infectious diseases and limitations in available data.
- Machine learning (ML) has emerged as a powerful tool capable of analyzing diverse datasets and identifying complex relationships influencing disease dynamics.
- "Machine learning (ML) has emerged as a powerful tool, capable of analyzing various and diverse datasets to identify complex relationships among the various factors that influence disease dynamics."
- ML algorithms excel at identifying patterns and trends that may be missed by human analysis, enabling timely prediction based on continuous monitoring of risk factors.
- The Role of Analytic Hierarchy Process (AHP) for Feature Selection:
- AHP, a multi-criteria decision-making approach, is used to systematically identify and prioritize the most influential risk factors for infectious disease outbreaks.
- "The analytic hierarchy process (AHP), a multi-criteria decision making (MCDM) approach developed by Thomas L. Saaty, organizes factors into a hierarchical structure, proving invaluable for decision-makers navigating complex scenarios."
- AHP incorporates expert domain knowledge through pairwise comparisons to assign weights to various climatic, socioeconomic, and demographic factors.
- The study identifies "barriers to health services," "dependency rate," and "no health insurance" as the highest-ranked risk factors.
- "Based on the previous rankings of risk factors shown in Fig. 2, it is concluded that the highest-ranked factors; barriers to health services, dependency rate, and lack of health insurance play a critical role in disease outbreaks."
- Consistency checks (Consistency Index and Consistency Ratio) are employed to ensure the reliability of the weight assignments.
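The AHP steps above (pairwise comparisons, weight derivation, consistency check) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the pairwise judgments are hypothetical values chosen for the example, the weights are approximated with the geometric-mean method rather than a full eigenvector computation, and the 0.10 cutoff for the Consistency Ratio is Saaty's commonly used convention rather than a figure taken from the briefing.

```python
import math

# Saaty's random consistency index (RI) for matrix sizes 1..10.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}


def ahp_weights(matrix):
    """Return (weights, CI, CR) for a reciprocal pairwise comparison matrix."""
    n = len(matrix)
    # Geometric mean of each row approximates the principal eigenvector.
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    weights = [g / total for g in geo_means]
    # Estimate lambda_max as the average of (A w)_i / w_i over all rows.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1) if n > 1 else 0.0
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return weights, ci, cr


# Hypothetical expert judgments (NOT the paper's matrix): entry [i][j] says
# how much more important factor i is than factor j on Saaty's 1-9 scale.
factors = ["barriers to health services", "dependency rate", "no health insurance"]
pairwise = [
    [1.0, 2.0, 3.0],
    [1 / 2, 1.0, 2.0],
    [1 / 3, 1 / 2, 1.0],
]

weights, ci, cr = ahp_weights(pairwise)
for name, w in zip(factors, weights):
    print(f"{name}: {w:.3f}")
print(f"CI={ci:.4f}, CR={cr:.4f}, consistent={cr < 0.10}")
```

A Consistency Ratio below 0.10 is usually read as acceptable; a higher value suggests the pairwise judgments should be revisited before the weights are used for ranking.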
- Addressing Data Scarcity with Transfer Learning:
- Transfer learning is utilized to leverage knowledge from one domain (Dengue outbreak data) to improve prediction in related domains (Chikungunya and Zika).
- "Transfer learning, which utilizes knowledge from one domain to solve related issues in another, addresses this by transferring models, weights, or features from one disease context to another."
- The model is pre-trained on a comprehensive Dengue dataset, which shares similarities with Chikungunya and Zika in terms of transmission vectors and influencing factors.
- Fine-tuning is then performed using the specific data for Chikungunya and Zika.
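The pre-train-then-fine-tune pattern described above can be illustrated with a deliberately simple stand-in. The sketch below uses a small logistic regression written from scratch, not the tree-based models from the study, because it makes the "reuse learned weights, then continue training on the target disease" step explicit. All data points, feature meanings, learning rates, and epoch counts are hypothetical.

```python
import math


def sigmoid(z):
    # Clamp z to avoid overflow in math.exp for extreme values.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, weights=None, lr=0.5, epochs=200):
    """Batch gradient descent for logistic regression.

    Passing `weights` warm-starts training from a previously learned model,
    which is the transfer step in this sketch. The last weight is the bias.
    """
    n_features = len(X[0])
    w = list(weights) if weights is not None else [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1])
            err = p - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def predict(w, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1]) >= 0.5 else 0


# Hypothetical, already-normalized features (e.g. temperature, rainfall).
# Source domain: the larger Dengue dataset.
dengue_X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4],
            [0.7, 0.8], [0.8, 0.6], [0.9, 0.9], [0.6, 0.7]]
dengue_y = [0, 0, 0, 0, 1, 1, 1, 1]
# Target domain: a much smaller Zika dataset.
zika_X = [[0.2, 0.3], [0.3, 0.2], [0.8, 0.7], [0.7, 0.9]]
zika_y = [0, 0, 1, 1]

pretrained = train(dengue_X, dengue_y)               # step 1: pre-train on Dengue
fine_tuned = train(zika_X, zika_y, weights=pretrained,
                   lr=0.1, epochs=50)                # step 2: fine-tune on Zika

print([predict(fine_tuned, x) for x in zika_X])
```

The smaller learning rate and epoch count in the second call reflect a common fine-tuning choice: nudge the pre-trained model toward the target data without discarding what it learned from the source.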
- The Effectiveness of Ensemble Machine Learning Models:
- The study employs Random Forest, XGBoost, and Gradient Boosting algorithms, along with an ensemble technique that combines their predictions.
- The ensemble model demonstrated the highest overall predictive performance.
- For Zika, the ensemble model achieved the highest accuracy (96.80%) and AUC (0.9197).
- "The result reveals that the ensemble model is particularly effective, achieving the highest accuracy rate of 96.80% and an AUC of 0.9197 for predicting Zika outbreaks."
- For Chikungunya, the ensemble model reached an accuracy of 93.31%, a precision of 57%, and a recall of 63%, which the authors describe as an optimal balance between precision and recall.
- "Notably, in the context of Chikungunya, this model achieves an optimal balance between precision and recall, with an accuracy of 93.31%, a precision of 57%, and a recall of 63%, highlighting its reliability for effective outbreak prediction."
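One common way to combine the predictions of several classifiers is soft voting, sketched below. The briefing does not say which combination rule the authors used, so soft voting is an assumption here, and the per-model probabilities are hypothetical numbers rather than outputs of trained models.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model outbreak probabilities and threshold the result."""
    n_models = len(prob_lists)
    averaged = [sum(ps) / n_models for ps in zip(*prob_lists)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical predicted outbreak probabilities for four samples.
rf_probs = [0.20, 0.70, 0.55, 0.10]    # stand-in for Random Forest
xgb_probs = [0.30, 0.90, 0.40, 0.05]   # stand-in for XGBoost
gb_probs = [0.10, 0.80, 0.45, 0.15]    # stand-in for Gradient Boosting

averaged, labels = soft_vote([rf_probs, xgb_probs, gb_probs])
print(labels)
```

In the third sample the Random Forest stand-in alone would flag an outbreak, but the averaged probability falls below 0.5, which shows how an ensemble can smooth out a single model's borderline call. With trained scikit-learn estimators, `VotingClassifier(voting="soft")` offers the same idea.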
- Methodology Overview:
- The proposed model consists of six layers: data source, preprocessing, feature engineering (including AHP for selection and 75th percentile method for outbreak threshold definition), data splitting (80% train, 20% test), modeling (Random Forest, XGBoost, Gradient Boosting, Ensemble with transfer learning), and evaluation.
- A predefined search technique across major online databases was used to identify common risk factors.
- The dataset comprised climate and socioeconomic data from NASA, DANE (Colombia), and SIVIGILA (Colombia), totaling 1716 instances with 27 features for Dengue, Chikungunya, and Zika.
- Data preprocessing involved EDA, handling missing values (e.g., mean imputation for average temperature), and data transformation (Min-Max normalization and one-hot encoding).
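Three of the steps named above (Min-Max normalization, the 75th percentile outbreak threshold, and the 80/20 split) can be sketched in a few lines. The case counts and temperatures are hypothetical, and several details are assumptions not stated in the briefing: the percentile uses linear interpolation, a period counts as an outbreak when cases are strictly above the threshold, and the threshold is computed over the whole series rather than per region.

```python
import random


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def label_outbreaks(cases, q=75):
    """Label a period 1 (outbreak) if its case count exceeds the q-th percentile."""
    threshold = percentile(cases, q)
    return threshold, [1 if c > threshold else 0 for c in cases]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out `test_fraction` for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical case counts and average temperatures for ten periods.
cases = [3, 8, 12, 5, 40, 7, 22, 9, 15, 60]
temps = [24.1, 25.3, 26.0, 24.8, 28.9, 25.0, 27.5, 25.6, 26.4, 29.7]

scaled_temps = min_max(temps)
threshold, outbreak = label_outbreaks(cases)
rows = list(zip(scaled_temps, outbreak))
train_rows, test_rows = train_test_split(rows)

print(threshold, outbreak)
print(len(train_rows), len(test_rows))
```

Because the label is defined by a 75th percentile cutoff, roughly a quarter of the periods end up in the outbreak class, which is worth keeping in mind when reading accuracy figures.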
- Model Evaluation Metrics:
- Model performance was evaluated using accuracy, precision, recall, F1-score, and Area Under the ROC Curve (AUC).
- Confusion matrices were used to visualize the prediction performance of each model.
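The relationship between a confusion matrix and the listed metrics can be made concrete with a short sketch. The labels below are hypothetical. The final line applies the standard F1 formula to the rounded Chikungunya precision (57%) and recall (63%) quoted earlier; the result is an implied value under that rounding, not a figure reported in the briefing.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 means outbreak."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_from(precision, recall)


# Hypothetical labels: 3 outbreak periods out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
accuracy, precision, recall, f1 = metrics(*counts)
print(counts, accuracy, precision, recall, f1)

# F1 implied by the rounded Chikungunya precision and recall.
implied_f1 = f1_from(0.57, 0.63)
print(round(implied_f1, 4))
```

When outbreaks are the minority class, accuracy can stay high even with modest precision and recall, which is why the study reports all of these metrics together with AUC.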
- Comparison with Existing Research:
- The study distinguishes itself from prior research by implementing AHP for more precise feature selection, addressing the limitations of predefined feature selection libraries.
- The integration of transfer learning to overcome data scarcity is another key contribution.
- Limitations of the Study:
- Limited labeled datasets for Chikungunya and Zika.
- Lack of comprehensive clinical data at the patient lev...