
COVID-19 hospitalizations forecasts using internet search data

by admin

We focused on national-level and state-level hospital admission forecasts for the 51 US states, including Washington, DC. The inputs consist of confirmed incremental (new) cases, the percentage of the vaccinated population, confirmed new hospital admissions, and Google search query frequencies. State-level and national-level data were obtained directly from the respective data sources outlined in this section. Our forecasting method is inspired by ARGO [9], which is also detailed in this section.

Data/code availability

All data used in this study are publicly available from the respective data sources outlined here. For completeness, the datasets and code analyzed during the current study are also stored in the Harvard Dataverse repository, DOI: 10.7910/DVN/S7HOTD. All analyses, including the generation of all figures, were performed with the R statistical software [20], version 4.1.1 (https://www.R-project.org/).

COVID-19 related data

We used confirmed incremental COVID-19 cases reported by JHU CSSE [21], the percentage of the fully vaccinated population from the Centers for Disease Control and Prevention (CDC) [22], and confirmed new COVID-19 hospital admissions from HHS [17]. The national vaccination percentage is the average percentage of the fully vaccinated population across all states. The data were collected from July 15, 2020 to January 15, 2022.

Figure 3

Left: National weekly new COVID-19 hospitalizations (black), national weekly confirmed COVID-19 cases (blue), and percentage of the national vaccinated population (red), scaled accordingly. Right: Weekly new COVID-19 hospitalizations and the three Google search queries most highly correlated with hospitalizations: "duration of infection" (blue), "loss of sense of smell" (red), and "loss of taste" (green).

Table 1 National-level comparison error metrics.
Table 2 State-level comparison error metrics.

Google search data

Google Trends provides estimated Google search frequencies for a given query term [23]. We retrieved online search data from Google Trends [23] for the period from July 15, 2020 to January 15, 2022. To retrieve the time series of search frequencies for a desired query, one must specify the query's geography and time frame on Google Trends. The frequency returned by Google Trends is obtained by sampling all raw Google search frequencies that contain this query [23]. The detailed data collection procedure and the subsequent data preprocessing (introduced in the sections below) are shown in the flowchart in Figure S1. In step 1 (green-highlighted box in Figure S1), to curate a pool of potentially predictive queries, we started with 129 influenza (flu)-related queries based on previous studies [9,24,25]. We then replaced the keywords "flu" and "influenza" with "coronavirus" and "COVID-19", respectively. We also added COVID-19-specific search terms from the Google Trends coronavirus story page [26]. Finally, for each query, we also included its top "related queries and topics" from the Google Trends website [23]. In the end, we had 256 COVID-19-related queries (Table S1). The next two sections detail the subsequent data cleaning and preprocessing, shown as steps 2 and 3 in Figure S1.

Interquartile range (IQR) filter and optimal lag for Google search data

The raw Google search frequencies obtained from Google Trends [23] are observed to be unstable and sparse [19]. Such instability and sparsity can adversely affect the predictive performance of linear regression models, which are sensitive to outliers. To address such outliers in the Google search data, we used the IQR filter [19] to remove and correct outliers on a rolling-window basis: search data more than 3 standard deviations away from the average of the past 7 days are examined and removed [19]. This is shown as the first substep of step 2 (orange-highlighted box) in Figure S1.
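As a minimal sketch of the rolling-window rule described above (written in Python for illustration, not the R code used in the study), the function below flags any point more than 3 standard deviations from the mean of the previous 7 days. The function name and the choice to replace a flagged point with the window mean are assumptions; the text only states that outliers are removed and corrected.

```python
from statistics import mean, stdev

def rolling_outlier_filter(series, window=7, n_sd=3.0):
    """Flag points far from the trailing-window mean and replace them.

    Assumption: a flagged point is replaced by the mean of the previous
    `window` (already cleaned) values; the source does not specify the
    exact correction.
    """
    cleaned = list(series)
    for t in range(window, len(cleaned)):
        past = cleaned[t - window:t]          # previous `window` days
        mu = mean(past)
        sd = stdev(past)
        if sd > 0 and abs(cleaned[t] - mu) > n_sd * sd:
            cleaned[t] = mu                   # hypothetical correction rule
    return cleaned
```

Because each window is built from already cleaned values, a single spike does not inflate the standard deviation used to judge the days that follow it.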

Trends in Google search frequencies often precede hospitalizations by several days, which indicates that the search data may contain predictive information about hospital admissions. Figure 4 illustrates the lag between Google search query frequencies and national hospitalizations. To fully leverage the predictive information in the Google search terms, we found and applied the optimal lag [19] between the filtered Google search frequencies and the national hospitalization trend. For each query, we fit a linear regression of new COVID-19 hospitalizations against the lagged Google search frequencies, considering a range of lags from 4 to 35 days. The lag that minimizes the mean squared error is chosen as the optimal lag for that query. The data used to find the optimal lag are from August 1, 2020 to December 31, 2020. This is shown as the second substep of step 2 in Figure S1.
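The lag search described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the study's R code: both inputs are assumed to be equally long daily series on the same dates, and the function names are hypothetical.

```python
def fit_mse(x, y):
    """Mean squared error of the least-squares fit y ~ a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / n

def optimal_lag(search, hosp, lags=range(4, 36)):
    """Return the lag (in days) whose lagged search series best fits hosp."""
    if len(search) != len(hosp):
        raise ValueError("series must have the same length")
    best_lag, best_mse = None, float("inf")
    for lag in lags:
        x = search[:len(search) - lag]   # search frequency on day t - lag
        y = hosp[lag:]                   # hospitalizations on day t
        if len(x) < 3:
            continue                     # too few points to fit a line
        mse = fit_mse(x, y)
        if mse < best_mse:
            best_lag, best_mse = lag, mse
    return best_lag
```

Note that larger lags leave fewer overlapping days for the fit; this sketch does not correct for that.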

Figure 4

Google search query "infectious period" and weekly new COVID-19 hospitalizations. Illustration of the delay in peaks between the search frequency of the Google search query "infectious period" (blue) and national weekly new COVID-19 hospitalizations (red). The y-axes are scaled accordingly.

Selection of Google search terms

After applying the optimal lags to the 256 COVID-19-related terms, we further selected the queries whose correlation coefficient with national COVID-19 hospitalizations over the period 08/01/2020 to 12/31/2020 was greater than 0.5. We then used a moving average to further smooth out the weekly fluctuations in the selected Google search queries. The three steps above are shown in step 3 of Figure S1 and serve as the final step of the overall data preprocessing procedure. Table 3 shows the 11 selected important terms and their optimal lags, and Table S2 shows the correlation coefficients of the 11 selected terms after applying the optimal lags. Table 3 supports our intuition that people first search for general queries, such as "symptoms of covid-19", when they become infected, because the optimal lag for this query is relatively large. After symptoms develop, people may begin to search for specific symptoms, such as "loss of sense of smell", for which the optimal lag is relatively small. Even similar query terms may have slightly different search patterns and therefore different optimal lags.
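The selection and smoothing steps can be illustrated with the short Python sketch below. The 0.5 threshold comes from the text; the 7-day trailing window of the moving average and all function names are assumptions, since the text does not state the window length.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0                      # constant series: treat as uncorrelated
    return sxy / sqrt(sxx * syy)

def select_queries(lagged_queries, hosp, threshold=0.5):
    """Keep query names whose lagged series correlates with hosp above threshold."""
    return [name for name, series in lagged_queries.items()
            if pearson(series, hosp) > threshold]

def moving_average(series, window=7):
    """Trailing moving average; early points use the shorter available window."""
    out = []
    for t in range(len(series)):
        chunk = series[max(0, t - window + 1):t + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Here `lagged_queries` is assumed to map each query name to its search series already shifted by its optimal lag and aligned with `hosp`.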

Table 3 Optimal lags for selected key terms.

ARGO-inspired predictions

Let \(y_{t,r}\) be the daily hospital admissions of region \(r\) on day \(t\); \(X_{k,t}\) the Google search data for term \(k\) on day \(t\); \(c_{t,r}\) the JHU confirmed incremental COVID-19 cases of region \(r\) on day \(t\); \(v_{t,r}\) the cumulative percentage of the vaccinated population of region \(r\) on day \(t\); and \({\mathbb {I}}_{\{t, d\}}\) the day-of-the-week indicator for day \(t\) (i.e., \({\mathbb {I}}_{\{t, 1\}}\) indicates that day \(t\) is a Monday). To predict, standing on day \(T\), the \(l\)-day-ahead hospitalizations of region \(r\), \({\hat{y}}_{T+l,r}\), we used a penalized linear estimator as follows:

$$\begin{aligned} {\hat{y}}_{T+l,r} = {\hat{\mu }}_{y,r}&+\sum ^{I}_{i=0}{\hat{\alpha }}_{i,r}y_{T-i,r} + \sum _{j\in {J}}{\hat{\beta }}_{j,r}c_{T+l-j,r}+ \sum _{m\in {M}_{r}}{\hat{\gamma }}_{m,r}y_{T,m}+ \sum _{q\in {Q}}{\hat{\phi }}_{q,r}v_{T+l-q,r}\\&+ \sum ^{K}_{k=1}{\hat{\delta }}_{k,r}X_{k,T+l-{\hat{O}}_k} + \sum ^6_{d=1}{\hat{\tau }}_{d,r}{\mathbb {I}}_{\{T+l, d\}} \end{aligned}$$


where \(I=6\), considering the lagged daily hospital admissions of one consecutive week; \(J=\max \left( \{7,28\},l\right)\), considering lagged confirmed cases; \({M}_{r}\) is the set of geographically neighboring states of state \(r\); \(Q=\max \left( 7,l\right)\), considering vaccination data lagged by one week; \({\hat{O}}_k=\max \left( O_k,l\right)\) is the adjusted optimal lag for term \(k\); and \(K=11\), considering the 11 selected Google search terms. The coefficients for the \(l\)-day-ahead prediction of region \(r\), \(\{\mu _{y,r},\varvec{\alpha }=(\alpha _{0,r},\ldots ,\alpha _{6,r}), \varvec{\beta }=(\beta _{1,r}, \ldots , \beta _{|J|,r}), \varvec{\gamma }=(\gamma _{1,r},\ldots ,\gamma _{|{M}_{r}|,r}), \varvec{\phi }=\phi _{\max (7,l),r}, \varvec{\delta }=(\delta _{1,r},\ldots ,\delta _{11,r}), \varvec{\tau }=(\tau _{1,r}, \ldots , \tau _{6,r})\}\), are obtained by

$$\begin{aligned} \underset{\mu _{y,r},\varvec{\alpha },\varvec{\beta },\varvec{\gamma },\varvec{\phi },\varvec{\delta },\varvec{\tau },\varvec{\lambda }}{\mathrm {argmin}} \sum _{t=T-M-l+1}^{T-l}&\omega ^{T-l-t+1}\Bigg ( y_{t+l,r}-\mu _{y,r} - \sum ^{6}_{i=0}{\alpha }_{i,r}y_{t-i,r}-\sum _{j\in {J}}{\beta }_{j,r}c_{t+l-j,r}-\sum _{m\in {M}_{r}}{\gamma }_{m,r}y_{t,m}\\&- \sum _{q\in {Q}}{\phi }_{q,r}v_{t+l-q,r} -\sum ^{K}_{k=1}{\delta }_{k,r}X_{k,t+l-{\hat{O}}_k} - \sum ^6_{d=1}{\tau }_{d,r}{\mathbb {I}}_{\{t+l, d\}}\Bigg )^2\\&+ \lambda _\alpha \Vert \varvec{\alpha }\Vert _1+\lambda _\beta \Vert \varvec{\beta }\Vert _1+\lambda _\gamma \Vert \varvec{\gamma }\Vert _1+ \lambda _\phi \Vert \varvec{\phi }\Vert _1+\lambda _\delta \Vert \varvec{\delta }\Vert _1 +\lambda _\tau \Vert \varvec{\tau }\Vert _1 \end{aligned}$$
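The role of the maxima in the lag definitions can be seen from the indices: when standing on day \(T\), a predictor such as \(c_{T+l-j,r}\) is only observed if \(j \ge l\), so every lag is raised to at least the forecast horizon \(l\). The Python sketch below illustrates this under the assumption that \(\max (\{7,28\},l)\) is applied element-wise; the function name and the example lag values are hypothetical.

```python
def adjusted_lags(horizon, optimal_lags, case_lags=(7, 28), vacc_lag=7):
    """Raise every lag to at least the forecast horizon l.

    Assumption: max({7, 28}, l) is taken element-wise, so duplicate
    lags collapse when the horizon exceeds both base lags.
    """
    J = sorted({max(j, horizon) for j in case_lags})      # confirmed-case lags
    Q = [max(vacc_lag, horizon)]                          # vaccination lag
    O_hat = {term: max(lag, horizon)                      # search-term lags
             for term, lag in optimal_lags.items()}
    return J, Q, O_hat

# Hypothetical optimal lags, for illustration only (not values from Table 3).
example = adjusted_lags(10, {"term A": 5, "term B": 20})
```

With a horizon of 10 days, this example yields case lags of 10 and 28 days, a vaccination lag of 10 days, and search-term lags of 10 and 20 days.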


\(M = 56\) is the length of the training period, and \(\omega = 0.8\) is the exponential time-decay weight, which assigns higher weights to more recent observations. The region \(r\) ranges over the United States as a whole and the 51 states, including Washington, DC. For national-level training, the hospitalizations of neighboring states, \(y_{t,m}\), and their coefficients, \(\varvec{\gamma }\), are excluded. We used the L1-norm penalty to address the sparsity of the Google search data. For simplicity, the hyperparameters \(\varvec{\lambda }=(\lambda _{\alpha },\lambda _{\beta },\lambda _{\gamma },\lambda _{\phi },\lambda _{\delta }, \lambda _{\tau })\) of the L1-norm penalties were set to be equal and were obtained via 10-fold cross-validation.
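To make the weighted, L1-penalized objective concrete, here is a small Python sketch that builds the decay weights \(\omega ^{T-l-t+1}\) and minimizes a weighted lasso objective with a single shared \(\lambda\) by coordinate descent. It is an illustration under stated assumptions, not the study's R implementation: the solver, the function names, and the scaling of `lam` are assumptions, and the 10-fold cross-validation used to choose \(\lambda\) is not shown.

```python
def decay_weights(n_obs, omega=0.8):
    """Weights omega**(n_obs - s) for s = 0..n_obs-1 (oldest to newest).

    The newest training row gets omega**1 and the oldest omega**n_obs,
    matching the exponent T - l - t + 1 in the objective.
    """
    return [omega ** (n_obs - s) for s in range(n_obs)]

def soft_threshold(z, g):
    """Soft-thresholding operator used by lasso coordinate descent."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def weighted_lasso(X, y, weights, lam, n_iter=200):
    """Minimize sum_t w_t (y_t - mu - x_t . beta)^2 + lam * ||beta||_1.

    X is a list of rows; the intercept mu is not penalized.
    """
    p = len(X[0])
    beta = [0.0] * p
    w_sum = sum(weights)
    mu = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for w, row, yi in zip(weights, X, y):
                # residual with feature j left out
                partial = yi - mu - sum(beta[k] * row[k] for k in range(p) if k != j)
                rho += w * row[j] * partial
                z += w * row[j] ** 2
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
        mu = sum(w * (yi - sum(b * x for b, x in zip(beta, row)))
                 for w, row, yi in zip(weights, X, y)) / w_sum
    return mu, beta
```

In this sketch each row of `X` would hold the lagged admissions, cases, neighboring-state admissions, vaccination, search, and weekday predictors for one training day, and `weights = decay_weights(56)` would correspond to \(M = 56\) and \(\omega = 0.8\).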

With the above formulation, on every Monday from 01/04/2021 to 12/27/2021, we iteratively trained the model and made national-level and state-level retrospective out-of-sample hospitalization predictions up to 14 days ahead. The daily predictions were then aggregated into 1-week-ahead and 2-week-ahead predictions. For example, \({\hat{y}}_{T+1:T+7,r}=\sum ^7_{i=1}{\hat{y}}_{T+i,r}\) and \({\hat{y}}_{T+8:T+14,r}=\sum ^{14}_{i=8}{\hat{y}}_{T+i,r}\) are the 1-week-ahead and 2-week-ahead predictions on day \(T\) for region \(r\), respectively.
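The aggregation of daily predictions into weekly totals amounts to two sums, as in this brief Python sketch (the function name is hypothetical):

```python
def weekly_forecasts(daily):
    """Aggregate 14 daily forecasts into 1- and 2-week-ahead totals.

    daily[i - 1] is the i-day-ahead forecast made on day T, i = 1..14.
    """
    if len(daily) != 14:
        raise ValueError("expected forecasts for 1 to 14 days ahead")
    week1 = sum(daily[:7])      # days T+1 .. T+7
    week2 = sum(daily[7:14])    # days T+8 .. T+14
    return week1, week2
```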

Evaluation metrics

The root mean squared error (RMSE) between the hospitalization estimates \({\hat{y}}_t\) and the true values \(y_t\) over the period \(t=1,\ldots , T\) is \(\sqrt{\frac{1}{T}\sum _{t=1}^T \left( {\hat{y}}_t - y_t\right) ^2}\). The mean absolute error (MAE) between the estimates \({\hat{y}}_t\) and the true values \(y_t\) over the period \(t=1,\ldots , T\) is \(\frac{1}{T}\sum _{t=1}^T \left| {\hat{y}}_t - y_t\right|\). The correlation (Cor) is the Pearson correlation coefficient between \(\hat{\varvec{y}}=({\hat{y}}_1, \dots , {\hat{y}}_T)\) and \(\varvec{y}=(y_1,\dots , y_T)\). All estimates \({\hat{y}}_t\) and true values \(y_t\) were aggregated weekly before calculating RMSE, MAE, and Cor.
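The three metrics defined above can be computed as in the following Python sketch; the function names are hypothetical, and the inputs are assumed to be already aggregated to weekly values.

```python
from math import sqrt

def rmse(y_hat, y):
    """Root mean squared error between estimates and true values."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y))

def mae(y_hat, y):
    """Mean absolute error between estimates and true values."""
    return sum(abs(a - b) for a, b in zip(y_hat, y)) / len(y)

def cor(y_hat, y):
    """Pearson correlation coefficient between estimates and true values."""
    n = len(y)
    m_hat, m = sum(y_hat) / n, sum(y) / n
    s_hh = sum((a - m_hat) ** 2 for a in y_hat)
    s_yy = sum((b - m) ** 2 for b in y)
    s_hy = sum((a - m_hat) * (b - m) for a, b in zip(y_hat, y))
    return s_hy / sqrt(s_hh * s_yy)   # undefined if either series is constant
```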

Ethics approval and consent to participate

This study did not involve human participants, data, or tissues. It was conducted using only aggregated and anonymized data. No Institutional Review Board approval was required. All methods were performed in accordance with the relevant guidelines and regulations.
