
# COVID-19 hospitalizations forecasts using internet search data

We focused on forecasts of hospital admissions, producing state-level forecasts for 51 US states, including Washington, DC. Inputs consist of newly confirmed cases, the percentage of the population fully vaccinated, confirmed new hospitalizations, and Google search query frequencies. State-level and national data were obtained directly from the respective data sources outlined in this section. Our forecasting method is inspired by ARGO⁹, which is also detailed in this section.

### Data/code availability

All data used in this study are publicly available from the respective data sources outlined here. For the sake of completeness, the dataset and code analyzed during the current study are also stored in the Harvard Dataverse Repository, DOI: 10.7910/DVN/S7HOTD. All analyses, including the generation of all figures, were performed with R statistical software²⁰, version 4.1.1 (https://www.R-project.org/).

### COVID-19 related data

We used newly confirmed COVID-19 cases reported by the JHU CSSE data²¹, the percentage of the fully vaccinated population from the Centers for Disease Control and Prevention (CDC)²² (for the national series, this is the average percentage of the fully vaccinated population across all states), and the percentage of new COVID-19-confirmed hospitalizations from HHS¹⁷. The dataset was collected from July 15, 2020 to January 15, 2021.

### Google search data

Google Trends provides estimated Google search frequencies for a given query term²³. We pulled online search data from Google Trends²³ between July 15, 2020 and January 15, 2021. To obtain the chronological search frequency of a desired query, Google Trends requires the geography and timeframe of the query to be specified. The frequency returned by Google Trends is obtained by sampling the raw Google search volume for all searches containing that query²³. The detailed data collection procedure and subsequent data preprocessing (introduced in the sections below) are shown in the flow chart (Figure S1). In step 1 (green-highlighted box in Figure S1), we first started with 129 flu (influenza)-related queries from previous research to curate a pool of potentially predictive queries⁹,²⁴,²⁵. We then changed the keywords “flu” and “influenza” to “coronavirus” and “COVID-19”, respectively. We also added COVID-19-specific search terms from the Google Trends coronavirus stories page²⁶. Finally, for each query, we also included the top “related queries and topics” from the Google Trends website²³. Ultimately, there were 256 COVID-19-related queries (Table S1). The next two sections detail the subsequent data cleaning and preprocessing, shown in steps 2 and 3 of Figure S1.
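The keyword-substitution part of step 1 can be sketched as follows. The paper's analyses were performed in R; this is a minimal Python sketch, and the function name and inputs are assumptions for illustration only (the actual pool also includes terms from the coronavirus stories page and Google Trends "related queries and topics").

```python
def expand_seed_queries(flu_queries, extra_terms):
    """Sketch of step 1 of the query curation flow (Figure S1):
    rewrite flu-era seed queries for COVID-19, then append
    COVID-19-specific terms. Names here are hypothetical."""
    # Replace "influenza" before "flu", since "flu" is a substring of it.
    swapped = [q.replace("influenza", "COVID-19").replace("flu", "coronavirus")
               for q in flu_queries]
    # De-duplicate while preserving order.
    pool = list(dict.fromkeys(swapped + list(extra_terms)))
    return pool
```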

#### Interquantile Range (IQR) Filter and Optimal Lag for Google Search Data

Raw Google search frequencies from Google Trends²³ were observed to be unstable and sparse¹⁹. Such instability and sparsity can adversely affect the predictive performance of outlier-sensitive linear regression models. To address such outliers in the Google search data, we used the IQR filter¹⁹ to remove and correct outliers on a rolling-window basis: search data more than 3 standard deviations from the average over the last 7 days are examined and removed¹⁹. This is shown in the first substep of step 2 (orange-highlighted box) in Figure S1.
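A minimal Python sketch of this rolling-window outlier correction follows (the paper's code is in R). The function name and the choice to replace a flagged value with the rolling mean are assumptions; the source only states that values more than 3 standard deviations from the 7-day average are examined and removed.

```python
import numpy as np

def rolling_outlier_filter(series, window=7, n_sd=3.0):
    """Flag a day whose value lies more than `n_sd` standard deviations
    from the mean of the preceding `window` days, and replace it with
    that rolling mean. A sketch of the outlier-removal substep."""
    x = np.asarray(series, dtype=float).copy()
    for t in range(window, len(x)):
        past = x[t - window:t]
        mu, sd = past.mean(), past.std()
        if sd > 0 and abs(x[t] - mu) > n_sd * sd:
            x[t] = mu  # correct the outlier with the rolling mean
    return x
```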

Google search frequency trends often lead hospitalizations by several days, indicating that the search data may contain predictive information about hospital admissions. Figure 4 shows the lag behavior between Google search query frequency and nationwide hospitalizations. To fully leverage the predictive information of the national Google search terms, we found and applied the optimal lag¹⁹ that aligns Google search frequencies with national hospitalization trends. For each query, we fit a linear regression of new COVID-19 hospitalizations against the lagged Google search frequencies, considering a range of lags (4–35 days). The lag that minimizes the mean squared error is chosen as the optimal lag for that query. The data used to find the optimal lag span 01-Aug-2020 to 31-Dec-2020, as shown in the second substep of step 2 (Figure S1).
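The optimal-lag search just described can be sketched as below (a Python sketch of the described procedure, not the paper's R code; the function name is hypothetical). For each candidate lag, hospitalizations are regressed on the search series shifted by that lag, and the lag minimizing the in-sample mean squared error wins.

```python
import numpy as np

def best_lag(hosp, query_freq, lags=range(4, 36)):
    """For one query, regress hospitalizations on the lagged search
    frequency and return the lag minimizing in-sample MSE."""
    hosp = np.asarray(hosp, dtype=float)
    q = np.asarray(query_freq, dtype=float)
    best, best_mse = None, np.inf
    for lag in lags:
        y = hosp[lag:]          # hospitalizations on day t
        x = q[:-lag]            # search frequency on day t - lag
        # simple linear regression y ~ a + b*x via least squares
        A = np.column_stack([np.ones_like(x), x])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        mse = np.mean((y - A @ coef) ** 2)
        if mse < best_mse:
            best, best_mse = lag, mse
    return best
```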

#### Selecting Google search terms

After applying the optimal lags to the 256 COVID-19-related terms, we further selected the queries whose correlation coefficient with nationwide COVID-19 hospitalizations exceeded 0.5 over the period 08/01/2020 to 12/31/2020. A moving average is then used to further smooth out week-to-week variation in the selected Google search queries. All three steps above are shown in step 3 of Figure S1 and serve as the final step of the overall data preprocessing procedure. Table 3 shows the 11 key terms selected and their optimal lags. Table S2 shows the correlation coefficients of the 11 optimally lagged Google search queries. Table 3 supports our intuition: when people get infected, they first search for general queries such as “symptoms of covid-19”, whose optimal lag is relatively large. After symptoms develop, people may start searching for specific symptoms, such as “loss of sense of smell”, whose optimal lag is relatively small. Different query terms thus exhibit slightly different search patterns, and hence different optimal lags.
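The selection-and-smoothing step can be sketched in Python as follows (a hedged sketch, assuming the optimal lag has already been applied to each series; the function name and the 7-day moving-average window are assumptions, since the source does not state the window length).

```python
import numpy as np

def select_terms(hosp, queries, threshold=0.5, ma_window=7):
    """Keep queries whose (already optimally lagged) frequency has
    correlation > threshold with national hospitalizations, then
    smooth each kept series with a moving average."""
    selected = {}
    for term, freq in queries.items():
        r = np.corrcoef(hosp, freq)[0, 1]
        if r > threshold:
            kernel = np.ones(ma_window) / ma_window
            selected[term] = np.convolve(freq, kernel, mode="valid")
    return selected
```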

### ARGO-inspired predictions

Let $${\hat{y}}_{t,r}$$ denote the daily hospitalizations of region r on day t; $$X_{k,t}$$ the Google search data for term k on day t; $$c_{t,r}$$ the newly confirmed JHU COVID-19 cases on day t in region r; $$v_{t,r}$$ the cumulative percentage of people vaccinated by day t in region r; and $${\mathbb {I}}_{\{t, d\}}$$ the day-of-week indicator for day t (i.e., $${\mathbb {I}}_{\{t, 1\}}$$ indicates day t is a Monday). To predict, standing on day T, the hospitalizations l days ahead in region r, $${\hat{y}}_{T+l,r}$$, we used a penalized linear estimator as follows:

\begin{aligned} {\hat{y}}_{T+l,r} = {\hat{\mu }}_{y,r}+\sum _{i=0}^{I}{\hat{\alpha }}_{i,r}\,y_{T-i,r} + \sum _{j\in J}{\hat{\beta }}_{j,r}\,c_{T+l-j,r}+ \sum _{m\in M_{r}}{\hat{\gamma }}_{m,r}\,y_{T,m}+ \sum _{q\in Q}{\hat{\phi }}_{q,r}\,v_{T+l-q,r}+ \sum _{k=1}^{K}{\hat{\delta }}_{k,r}\,X_{k,T+l-{\hat{O}}_k} + \sum _{d=1}^{6}{\hat{\tau }}_{d,r}\,{\mathbb {I}}_{\{T+l, d\}} \end{aligned}

(1)

where $$I=6$$ considers lagged daily hospitalizations over one full week; $$J=\left\{ \max (7,l),\max (28,l)\right\}$$ considers lagged confirmed cases; $$M_{r}$$ is the set of states geographically adjacent to region r; $$Q=\left\{ \max (7,l)\right\}$$ accounts for the vaccination data being reported with a one-week delay; $${\hat{O}}_k=\max \left( O_k,l\right)$$ is the adjusted optimal lag of term k; and $$K=11$$ corresponds to the 11 selected Google search terms. The coefficients of the l-day-ahead forecast for region r, $$\{\mu _{y,r},\varvec{\alpha }=(\alpha _{0,r},\ldots ,\alpha _{6,r}), \varvec{\beta }=(\beta _{1,r}, \ldots , \beta _{|J|,r}), \varvec{\gamma }=(\gamma _{1,r},\ldots ,\gamma _{|M_{r}|,r}), \varvec{\phi }=\phi _{\max (7,l),r}, \varvec{\delta }=(\delta _{1,r},\ldots ,\delta _{11,r}), \varvec{\tau }=(\tau _{1,r}, \ldots , \tau _{6,r})\}$$, are obtained by

\begin{aligned} \underset{\mu _{y,r},\varvec{\alpha },\varvec{\beta },\varvec{\gamma },\varvec{\phi },\varvec{\delta },\varvec{\tau }}{\mathrm {argmin}} \sum _{t=T-M-l+1}^{T-l}&\,\omega ^{T-l-t+1}\Bigg ( y_{t+l,r}-\mu _{y,r} - \sum _{i=0}^{6}{\alpha }_{i,r}y_{t-i,r}-\sum _{j\in J}{\beta }_{j,r}c_{t+l-j,r}-\sum _{m\in M_{r}}{\gamma }_{m,r}y_{t,m}\\ &- \sum _{q\in Q}{\phi }_{q,r}v_{t+l-q,r} -\sum _{k=1}^{11}{\delta }_{k,r}X_{k,t+l-{\hat{O}}_k} - \sum _{d=1}^{6}{\tau }_{d,r}{\mathbb {I}}_{\{t+l,d\}}\Bigg )^2\\ &+ \lambda _\alpha \Vert \varvec{\alpha }\Vert _1+\lambda _\beta \Vert \varvec{\beta }\Vert _1+\lambda _\gamma \Vert \varvec{\gamma }\Vert _1+ \lambda _\phi \Vert \varvec{\phi }\Vert _1+\lambda _\delta \Vert \varvec{\delta }\Vert _1+\lambda _\tau \Vert \varvec{\tau }\Vert _1 \end{aligned}

(2)

$$M = 56$$ is the length of the training period, and $$\omega = 0.8$$ is an exponentially time-decaying weight that assigns higher weights to more recent observations. The region $$\varvec{r}$$ ranges over the United States as a whole and the 51 states, including Washington, DC. For national-level training, the hospitalizations of neighboring states, $$y_{t,m}$$, and their coefficients, $$\varvec{\gamma }$$, are excluded. We used the L1-norm penalty to deal with the sparsity of the Google search data. For simplicity, the hyperparameters $$\varvec{\lambda }=(\lambda _{\alpha },\lambda _{\beta },\lambda _{\gamma },\lambda _{\phi },\lambda _{\delta },\lambda _{\tau })$$ for the L1-norm penalties were set equal and obtained by 10-fold cross-validation.
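The core of Eq. (2), a weighted L1-penalized least-squares fit, can be sketched with a small coordinate-descent solver. This is a minimal Python sketch under simplifying assumptions (single penalty λ for all coefficients, no unpenalized intercept, generic design matrix X), not the paper's R implementation, which fits the full regional model with 10-fold cross-validation over λ.

```python
import numpy as np

def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Coordinate descent for
        argmin_b  sum_t w_t * (y_t - x_t . b)^2 + lam * ||b||_1.
    The exponential-decay weights of Eq. (2) would be
    w_t = omega ** (T - l - t + 1) with omega = 0.8."""
    sw = np.sqrt(weights)
    Xw = X * sw[:, None]          # fold the weights into the design
    yw = y * sw
    n, p = Xw.shape
    b = np.zeros(p)
    col_sq = (Xw ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(p):
            # partial residual with coordinate k removed
            r_k = yw - Xw @ b + Xw[:, k] * b[k]
            rho = Xw[:, k] @ r_k
            # soft-thresholding update for the L1 penalty
            b[k] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_sq[k]
    return b
```

The L1 penalty drives coefficients of uninformative predictors exactly to zero, which is why it suits the sparse Google search inputs described above.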

### Ethical Acknowledgment and Consent to Participate

This study did not include human participants, human data, or tissues; it was conducted using only aggregated and anonymized data. No Institutional Review Board approval was required. All methods were performed in accordance with relevant guidelines and regulations.