Bee Health Early Warning System - Erdos Institute Data Science Bootcamp
Motivation and Objective
Honey bee colonies, which are essential for pollinating over $15 billion in U.S. crops annually, are collapsing at alarming rates. This poses a direct threat to national food security and economic stability. Currently, data on colony loss is a lagging indicator, meaning stakeholders can only react after a collapse has already occurred. In Phase 1 of our project, we addressed this by attempting to build a macro-level, “top-down” early warning system. The objective was to predict high or low colony loss at a U.S. state-by-state, quarterly level. However, this approach faced a significant challenge: averaging weather and pathogen data over vast geographic areas and entire quarters diluted the critical, localized patterns that truly stress a colony. Our models struggled to find a clear signal in this noisy data, which was reflected in poor F1 and ROC-AUC scores.
The challenges from Phase 1 directly motivated our pivot to Phase 2: a “bottom-up” approach to solve the problem at the scale where interventions actually happen - the individual hive. Our new objective is to create a model that predicts hive health (healthy/unhealthy) using highly localized weather data from the past one to two weeks. This approach is more robust for two key reasons: first, it aligns directly with the granular, local data that is available, and second, it allows us to use a standardized definition of “hive health” as established in recent research (Lower et al., 2024).
Stakeholders
This diagnostic model provides direct, tactical value to the people who can most immediately protect bee populations. Our primary stakeholders are Commercial Beekeepers, who can use this model to monitor their operations, enabling them to inspect and treat at-risk hives immediately rather than discovering a loss later. This tool would also be invaluable to Beekeeping Technology Companies (e.g., “Smart Hive” developers), as our model can serve as the software brain for their hardware sensors, turning raw data into an actionable warning system. Finally, the model would benefit Hobbyist Beekeepers by providing an accessible tool to help them manage hive health, as well as Pollination Service Providers who rely on a stable and healthy supply of bees to meet agricultural contracts.
Key Performance Indicators
To evaluate our classification models, we focused on three key performance indicators (KPIs): Accuracy, ROC-AUC, and the F1-score. We used these metrics to compare our more advanced models against each other and against a baseline model to quantify our improvements. While Accuracy provides a general measure of correctness, it can be misleading in a dataset where ‘healthy’ hives are a less frequent but critical class to identify. Therefore, the ROC-AUC score was essential, as it measures a model’s fundamental ability to distinguish between the ‘healthy’ and ‘unhealthy’ classes. Most importantly, the F1-score was our primary metric for real-world performance, as it balances the critical trade-off between Precision (minimizing false alarms that waste a beekeeper’s time) and Recall (our main objective: successfully finding as many truly ‘unhealthy’ hives as possible).
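As a minimal sketch of how these three KPIs relate, the snippet below scores a small set of hypothetical predictions with scikit-learn. The labels and probabilities here are illustrative, not from our dataset; note that ROC-AUC is computed from predicted probabilities, while Accuracy and F1 use thresholded labels.

```python
# Minimal sketch of the three KPIs on hypothetical predictions.
# Labels: 1 = healthy, 0 = unhealthy (as defined in the report).
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # illustrative ground truth
y_prob = [0.9, 0.4, 0.7, 0.5, 0.2, 0.55, 0.8, 0.3]   # predicted P(healthy)
y_pred = [int(p >= 0.5) for p in y_prob]             # threshold at 0.5

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(f"ROC-AUC:  {roc_auc_score(y_true, y_prob):.2f}")  # ranking quality, uses probabilities
print(f"F1:       {f1_score(y_true, y_pred):.2f}")       # balances precision and recall
```

Because F1 is threshold-dependent, a deployed version of the model could trade Precision for Recall by lowering the 0.5 cutoff, flagging more hives for inspection at the cost of more false alarms.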
Phase 1:
Data Sources and Processing
We created a comprehensive dataset for analyzing bee health by integrating three sources from 2015-2025. First, bee colony loss data (including specific causes like Varroa mites) from USDA NASS was aggregated by state and county. Second, this was merged with quarterly-averaged weather data (e.g., precipitation, temperature) from NOAA. Third, we added pathogen prevalence data, including Varroa counts and key virus levels, from USDA APHIS. This final, harmonized dataset aligns bee colony losses with corresponding weather and pathogen conditions by state, county, and quarter, providing a robust foundation for statistical analysis and predictive modeling.
Methodology / Modeling
Our Phase 1 objective was to predict future colony loss at the state level. To do this, we engineered a target variable by converting the raw quarterly loss percentage into a binary classification (High/Low Loss) based on the historical median loss for that quarter. This created a balanced target suitable for classification modeling. Our validation strategy was critical. Because the data was a time series, we used TimeSeriesSplit cross-validation. This ensured our models were always trained on past data (e.g., 2015-2020) to predict future outcomes (e.g., 2021), preventing data leakage. We experimented with several models, including LogisticRegression, XGBClassifier, and LGBMClassifier, to see if a signal could be found. We also conducted rigorous feature engineering, including lagging features and testing various preprocessing pipelines (imputation, scaling) to ensure a fair comparison.
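The two core steps above, binarizing quarterly loss against the historical median for that quarter and validating with forward-chaining splits, can be sketched as follows. The frame here is synthetic and the column names (`loss_pct`, `avg_temp`, `precip`) are illustrative stand-ins for our actual features, not the project's real schema.

```python
# Sketch of Phase 1 target engineering + time-aware cross-validation
# on a small synthetic dataset (column names are illustrative).
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "quarter": np.tile([1, 2, 3, 4], n // 4),
    "loss_pct": rng.uniform(0, 30, n),
    "avg_temp": rng.normal(15, 8, n),
    "precip": rng.uniform(0, 10, n),
})

# Binary target: High Loss = above the historical median for that quarter.
df["high_loss"] = (
    df["loss_pct"] > df.groupby("quarter")["loss_pct"].transform("median")
).astype(int)

# TimeSeriesSplit: every fold trains on earlier rows and tests on later
# ones, so the model never sees the future (no leakage).
X, y = df[["avg_temp", "precip"]], df["high_loss"]
aucs = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = LogisticRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
    aucs.append(roc_auc_score(y.iloc[test_idx],
                              model.predict_proba(X.iloc[test_idx])[:, 1]))
print(f"mean ROC-AUC: {np.mean(aucs):.2f}")
```

On random features like these, the mean ROC-AUC hovers near 0.5, which is exactly the behavior we observed on the real state-level data.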
Results and Conclusions
Our models conclusively demonstrated that predicting future state-level loss is not feasible with this data. Despite rigorous testing, our predictive models (forecasting Y(Q2) from X(Q1)) failed to find a signal, achieving an average ROC-AUC score of approximately 0.53—statistically indistinguishable from a random guess. Our critical analysis revealed four root causes for this failure:
- Rapid Signal Decay: Our investigation showed that while a weak contemporaneous signal exists (linking Varroa in Q1 to loss in Q1), this signal decays too rapidly. The one-quarter lag was too long, and the health status from one quarter had no predictive power for the next.
- Extreme Feature Sparsity: The public pathogen data, which we hypothesized was a key predictor, was too sparse. Our analysis showed an average of only 4-8 samples per state, per quarter. This is not enough data to create a stable, representative feature, and our models correctly learned to treat it as statistical noise.
- Destructive Aggregation: As stated in our objective, averaging weather and pathogen data over an entire state (like Texas) and a full quarter “dilutes” the signal. Localized events, which are the true drivers of colony health, are lost in this macro-level view.
- Insufficient Data Depth: With only 40 quarters of data across ~50 states (roughly 2,000 total data points), the dataset was too shallow for a model to learn the weak, complex patterns hidden in the noise. This limited sample size made it impossible to train a generalizable model.

This definitive failure was the primary motivation for our pivot to Phase 2. We concluded that a successful model must be built from the “bottom-up”, using granular, hive-level health data linked directly to local weather, which is precisely what Phase 2 accomplishes.
Phase 2:
Data Sources and Processing
Our model was built by combining two distinct datasets. First, we used Honeybee Colony Health Data from the “Predicting Honeybee Health” research article (Lower et al., 2024), which provided detailed hive inspection records (HCC_Inspections.csv) and apiary locations. Second, we used Daily Weather Summaries from the NOAA NCEI data portal for the corresponding apiary locations in North Carolina and Utah. To prepare the data for modeling, we converted the “Healthy” target variable to binary (1/0) and merged the two sources. Our primary feature engineering task was to prevent data leakage by creating time-series “lag” features (e.g., Prev_Health_Status) to capture the hive’s state from its previous inspection, as well as 7-day trailing weather features (e.g., Avg_tmax, Num_frost_days) to represent the environmental conditions leading up to the current inspection.
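The leakage-safe feature engineering described above can be sketched with pandas: a per-hive shift produces the previous-inspection lag, and an as-of merge attaches a 7-day trailing weather aggregate to each inspection date. The data below is a toy example, and while the column names mirror the report (`Prev_Health_Status`, `Avg_tmax`), the exact schema of our pipeline differs.

```python
# Sketch of Phase 2 feature engineering: per-hive lag features plus
# 7-day trailing weather, on illustrative toy data.
import pandas as pd

inspections = pd.DataFrame({
    "HiveID": ["A", "A", "B", "B"],
    "Date": pd.to_datetime(["2018-05-01", "2018-05-15",
                            "2018-05-02", "2018-05-20"]),
    "Healthy": [1, 0, 1, 1],
}).sort_values(["HiveID", "Date"])

# Lag feature: shift within each hive, so a row only sees its own past.
inspections["Prev_Health_Status"] = (
    inspections.groupby("HiveID")["Healthy"].shift(1)
)
inspections["Is_First_Inspection"] = (
    inspections["Prev_Health_Status"].isna().astype(int)
)

# 7-day trailing weather: rolling mean over the daily series, then an
# as-of merge attaches the window ending at each inspection date.
weather = pd.DataFrame({"Date": pd.date_range("2018-04-24", "2018-05-20")})
weather["tmax"] = 20.0  # constant stand-in for daily max temperature
weather["Avg_tmax"] = weather["tmax"].rolling(7, min_periods=1).mean()

merged = pd.merge_asof(
    inspections.sort_values("Date"), weather[["Date", "Avg_tmax"]], on="Date"
)
print(merged[["HiveID", "Date", "Prev_Health_Status", "Avg_tmax"]])
```

The `groupby(...).shift(1)` pattern is what guarantees the lag never crosses hive boundaries: the first inspection of each hive gets `NaN`, which is exactly what `Is_First_Inspection` flags.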
Methodology / Modeling
The nature of our data, which contains multiple inspections for the same hive over time, means that individual data points are not independent. A standard random split would cause data leakage, as the model could “memorize” a hive’s history from the training set. To simulate the real-world scenario of predicting health for a new, unseen hive, we used a GroupShuffleSplit on HiveID. This ensures all inspections for a single hive are confined to either the training or test set. We established a baseline with LogisticRegression and then experimented with ensemble models (RandomForest, LGBMClassifier, XGBClassifier). After hyperparameter tuning with RandomizedSearchCV, we selected the XGBClassifier as our final model. It yielded the strongest cross-validation ROC-AUC score (0.75), confirming it had the best ability to distinguish between ‘healthy’ and ‘unhealthy’ hives.
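The grouped split can be sketched as below. The data is synthetic (12 hypothetical hives with 10 inspections each); the point is the guarantee that no hive's inspections appear on both sides of the split.

```python
# Sketch of a grouped train/test split: all inspections from a hive land
# on the same side, simulating prediction for a completely unseen hive.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)
n = 120
hive_ids = np.repeat(np.arange(12), 10)   # 12 hives, 10 inspections each
X = rng.normal(size=(n, 4))               # stand-in features
y = rng.integers(0, 2, size=n)            # stand-in health labels

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=hive_ids))

# No hive appears on both sides of the split.
overlap = set(hive_ids[train_idx]) & set(hive_ids[test_idx])
print(f"overlapping hives: {len(overlap)}")  # prints "overlapping hives: 0"
```

The same `groups=hive_ids` argument can be passed to cross-validation utilities during tuning, so hyperparameter search respects the grouping as well.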
Results and Conclusions
Our final tuned XGBClassifier generalized well to the unseen holdout data, achieving a Test ROC-AUC of 0.78 and an Accuracy of 0.72. This strong predictive signal significantly outperformed our LogisticRegression baseline, which only achieved a Test ROC-AUC of 0.65. The model’s real-world value was confirmed by its Macro Average F1-score of 0.71, demonstrating a robust and balanced ability to identify both ‘healthy’ and ‘unhealthy’ hives. Feature importance analysis revealed that the model’s predictions were primarily driven by the hive’s recent history. The top three most important predictors were the Previous Queen Status (Prev_Queen_Status), whether it was the First Inspection (Is_First_Inspection), and the Previous Stressor Status (Prev_Stressors_Status). Key environmental factors, specifically the Number of Frost Days (Num_frost_days) and 7-day Average Temperature (Avg_tavg), also proved to be significant predictors of hive health.
Limitations and Future Work
While our Phase 2 model was successful (Test ROC-AUC: 0.78), its practical application is currently limited by several key factors. Our future work proposals directly address these limitations to build a more robust and scalable system.
Limitations
- Limited Geographic and Climatic Generalizability: Our model was trained exclusively on data from North Carolina and Utah. These two regions, while different, do not represent the full spectrum of climatic and geographic conditions beekeepers face (e.g., the high humidity of Florida, the extreme cold of the upper Midwest, or the intensive migratory agriculture of California). The model would likely perform poorly in these new environments.
- Limited Temporal Scope: The 2016-2019 data window is a narrow snapshot. The model has not been trained on major anomalous climate events, such as a widespread drought or an exceptionally harsh winter. Its ability to predict health outcomes during such extreme, non-linear events is untested and a significant risk.
Future Work
- Expanded Feature Engineering: Our model used a 7-day trailing average for weather. This should be the subject of deeper experimentation. We hypothesize that different lag periods (e.g., 3-day for acute stress, 21-day for forage impact) could unlock new predictive signals.
- Integrate Hive-Level Pathogen Data: Our Phase 1 project failed because state-level pathogen data was too sparse. The logical next step is to add hive-level pathogen screening (e.g., Varroa counts, DWV swabs) to our Phase 2 model. We predict these features would be immensely powerful, likely surpassing weather as a key secondary driver.
- Transition from Manual to Automated Features: The ultimate goal is to move from a diagnostic tool to a real-time one. Future work should focus on replacing lagged features (Prev_Brood_Status) with data from in-hive sensors (e.g., temperature, humidity, acoustics, CO2). This would create a fully automated “early warning system” and is the natural path to a commercial product.
- Scaling: This hive-level model acts as a precursor for solving the national-level problem. It proves that a “bottom-up” approach works. By deploying this model across more states, we can aggregate the predictions from thousands of individual hives. This would finally provide policymakers with a real, granular, and predictive tool to monitor national bee health, a goal our Phase 1 model could not achieve due to its “top-down” data limitations.
References
- Lower, E., Kollaparthi, S. P., Rogers, R., Hassler, E., & Cazier, J. (2024, July 30). Predicting Honeybee Health: The Healthy Colony Checklist, Hive Scale and Weather Data. Data & Analytics for Good, (2). https://data-for-good.pubpub.org/pub/rg3364dl