Skip to contents

Overview

This introduction briefly outlines core functions used to preprocess observation data, build spatial-temporal models, and post-process model outputs. Its purpose is to demonstrate a standard workflow not to provide an in depth examination of functions or model building techniques.

Preliminaries

Load needed packages
A few of these packages are not available on CRAN and will need to be installed from other locations.

#comments and prompts
options(dplyr.summarise.inform = FALSE)
library(cli)


#wrangling
library(tidyverse)
library(lubridate)
library(arrow)
library(Hmisc)
library(yaml)

#spatial manipulation
library(sp)
library(sf)
library(spdep)
library(rgeos)
library(igraph)
library(maptools)
library(mapproj)

#census data
library(censusapi)

#forecast data
library(zoltr) #Not available on CRAN
#remotes::install_github("reichlab/zoltr")
library(covidHubUtils) #Not available on CRAN
#remotes::install_github("reichlab/covidHubUtils")

#inference
library(INLA) #Not available on CRAN
#install.packages("INLA",repos=c(getOption("repos"),
#INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)

library(EpiEstim)
library(forecast)

CovidCAR currently on GitHub

library(CovidCAR)
#devtools::install_github("JMHumphreys/CovidCAR")

Setup Analysis

Specifiy Dates and Directories

The setup_analysis() function defines key date thresholds for model training and forecast horizon periods and should always be run before using any other functions in the CovidCAR package. The dates are written to a yaml file for use by other functions.

The function also allows for recording directory paths to (optionally) write outputs outside of the working directory or to pull previously cached observation data (cache as created with Covid19Forecast.v1).

my_ouputs <- "C:/Users/unp7/Desktop/Misc/CovidCAR_tests"
my_local <- "C:/Users/unp7/Desktop/GitHub/covid19Forecasts/local/cache"

setup_analysis(report_date = "2021-08-23", #report date, first forecast day
               training_period = 2*28, #days
               forecast_horizon = 28, #days
               output_dir = my_ouputs, #write outputs here
               local_cache_dir = my_local #cache
)
## → Your local cache will be available to get_covid19_obs()
## → Analysis outputs will be written to C:/Users/unp7/Desktop/Misc/CovidCAR_tests/2021-08-23-CovidCAR-run2023-05-25

Define Spatial Domain

The download_boundaries() function pulls US State and territorial boundaries (ESRI shapefiles) from sources in the public domain. Some basic projection is performed, the shapefile is converted to a SpatialPolygonsDataFrame, and data attributes for a location identifier (‘Region’) and name (‘State’) are appended to the object.

NOTE: The function includes an option to download county boundaries (unit=“county”) but there are some timeout issues that need to be resolved due to large file size.

States <- download_boundaries(unit = "state")
## → Downloading polygon files...
## Reading layer `us-state-boundaries' from data source 
##   `C:\Users\unp7\Desktop\Misc\CovidCAR_tests\2021-08-23-CovidCAR-run2023-05-25\polygons' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.44069
## Geodetic CRS:  WGS 84
class(States)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
head(States@data[,c("Region", "State")]) #appended attributes  
##   Region                    State
## 1      1              Puerto Rico
## 2      2 Northern Mariana Islands
## 3      3                 Arkansas
## 4      4            West Virginia
## 5      5             Rhode Island
## 6      6               Washington

Adjacency Graph

The get_neighbors() function is used to identify polygons (States and Territories in this example) that are located next to each other. Neighbor information is recorded in a matrix (dimensions: location*location) that is included with the CAR model. Estimates for any one location are then ‘conditional’ on the estimates for surrounding locations.

NOTE: Polygons representing locations such as Hawaii and Guam are isolated from other locations (termed ‘islands’) and can be problematic. One option in this situation is to force connections between locations; the function’s ‘connect’ option will force connections between islands and other locations based on proximity.

Example: Islands with “no links”

nb_islands = get_neighbors(States, connect=FALSE)
summary(nb_islands) #note that "7 regions with no links"
## Neighbour list object:
## Number of regions: 56 
## Number of nonzero links: 224 
## Percentage nonzero weights: 7.142857 
## Average number of links: 4 
## 7 regions with no links:
## 1 2 17 19 20 33 50
## Link number distribution:
## 
##  0  1  2  3  4  5  6  7  8 
##  7  1  4  9  9 10 12  2  2 
## 1 least connected region:
## 52 with 1 link
## 2 most connected regions:
## 12 36 with 8 links

Example: All locations linked

nb_coerced = get_neighbors(States, connect=TRUE)
summary(nb_coerced)
## Neighbour list object:
## Number of regions: 56 
## Number of nonzero links: 242 
## Percentage nonzero weights: 7.716837 
## Average number of links: 4.321429 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8 
##  5  3 10 12 10 11  3  2 
## 5 least connected regions:
## 17 19 20 50 52 with 1 link
## 2 most connected regions:
## 12 36 with 8 links

View mapped adjacency
The plot_neighbors() function overlays adjacency connections on mapped location boundaries.

plot_neighbors(States, nb_islands)
## Regions defined for each Polygons

plot_neighbors(States, nb_coerced)
## Regions defined for each Polygons

Convert to INLA Graph The nb2INLA() and inla.read.graph() functions are provided by the INLA package.

nb2INLA("J", nb_coerced)
J = inla.read.graph("J")

Retrieve Observation Data

The get_covid19_obs() function downloads hospital incidence data for a specified data range.

The source options for data to be retrieved:
  • The covidcast package
  • From a local cache as created by the covid19Forecasts package (i.e., refactored pipeline pkg)
  • From test data available from the package itself (sample from summer 2021)
MinDate = min(su_yaml$full_time_span)
MaxDate = max(su_yaml$full_time_span)

testData = get_covid19_obs(source = "covidcast", 
                           start_date = MinDate, end_date = MaxDate, 
                           write_copy = TRUE)
#testData = get_covid19_obs(source = "cache", start_date = MinDate, end_date = MaxDate)
#testData = get_covid19_obs(source = "test", start_date = MinDate, end_date = MaxDate)

dim(testData)
## [1] 4534    5
head(testData)
##         date value signal location location_name
## 1 2021-06-28     2   hosp       02        Alaska
## 2 2021-06-28    29   hosp       01       Alabama
## 3 2021-06-28    55   hosp       05      Arkansas
## 4 2021-06-28    75   hosp       04       Arizona
## 5 2021-06-28   201   hosp       06    California
## 6 2021-06-28    60   hosp       08      Colorado

Add Spatial Index
The append_region_index() function matches location names in the observation data to the Region index in the polygon boundaries object, which also corresponds with the adjaceny matrix. The Region index is added as a column to the observations as is a new trn_tst column that is coded with either a trn or tst nominal indicator to distinguish time periods used for model training (observed) and testing (not observed).

train_data = append_region_index(train_data = testData, polys = States)
which(is.na(train_data$Region))
## integer(0)

Forecast Template

The create_forecast_template() function ensures that each location-time combination in the analysis is represented in the data ingested by the model. In the case of model runs using only historic observations, this function basically returns the original input but with some column names adjusted. This because the full date range was already represented. However, in the case of future dates where observations are not yet available, this function will add a row for each day through the forecast horizon coding the observed incidence value as NA as a placeholder.

train_data = create_forecast_template(train_data)

Additional Covariates

Demo models in this example are fairly simple but in many cases users will want to add additional predictors, signals, or covariates (independent variables). This section of the script demonstrates how to (1) pull and add demographic variables from the American Community Survey (ACS) and how to (2) add Rt estimates generated from the EpiEstim package.

Demographic Data

The getPovertyPop() function provides a wrapper function for the getCensus package for loading American Community Survey (ACS) data from the U.S. Census Bureau. In this example, an API key (‘secret_api’) is used to pull the percent of each state’s total population in poverty (SAEPOVRTALL_PT) and the number of individuals over the age of 55yrs (given in the vars_pop option).

PovPop_data = getPovertyPop(key = get_api("censusapi"), #function reads 'secrets.yaml' for specified name
                            vars_pov = c("SAEPOVRTALL_PT"), 
                            vars_pop = c('AGEGROUP','POP'), 
                            filt_age = c(12,18))
train_data = left_join(train_data, PovPop_data, by = "location")

Rt Estimation

The Rt_projection() function combines the estimate_R() function from the EpiEstim package with simple timeseries models to forecast Rt estimated over the model training period across the forecast horizon (28 days in the future). Both the ‘raw’ Rt estimate (‘Rt_raw’) for the observation period only and the forecast values (‘Rt’) are added to the dataframe.

Forecast models include:
  • simple ARIMA model using the forecast package (method=“arima”)
  • an order-2 random walk with noise and trend using the INLA package (method=“dlm)

NOTE: This is an experimental function and the “dlm” method is used as an example. Models later in this demo will use the Rt_raw value to forecast concurrently with incidence estimation.

Rt_df = Rt_projection(train_data, mean_si = 5.7, std_si = 2, 
                      forecast_horizon = 28, method = "dlm")

Rt Estimates
Checking Rt_projection() results

Rt_df[1:10,] #check values
##          date value signal location location_name       day trn_tst Region
## 1  2021-06-28     2   hosp       02        Alaska    Monday   train     20
## 2  2021-06-29     5   hosp       02        Alaska   Tuesday   train     20
## 3  2021-06-30     2   hosp       02        Alaska Wednesday   train     20
## 4  2021-07-01     6   hosp       02        Alaska  Thursday   train     20
## 5  2021-07-02     3   hosp       02        Alaska    Friday   train     20
## 6  2021-07-03     7   hosp       02        Alaska  Saturday   train     20
## 7  2021-07-04     3   hosp       02        Alaska    Sunday   train     20
## 8  2021-07-05     2   hosp       02        Alaska    Monday   train     20
## 9  2021-07-06     5   hosp       02        Alaska   Tuesday   train     20
## 10 2021-07-07     4   hosp       02        Alaska Wednesday   train     20
##    SAEPOVRTALL_PT age_pop   Rt_raw       Rt
## 1             9.6  184927       NA       NA
## 2             9.6  184927       NA       NA
## 3             9.6  184927       NA       NA
## 4             9.6  184927       NA       NA
## 5             9.6  184927       NA       NA
## 6             9.6  184927       NA       NA
## 7             9.6  184927       NA       NA
## 8             9.6  184927 3.295300 3.272158
## 9             9.6  184927 2.304267 2.334909
## 10            9.6  184927 1.858476 1.848646
Rt_df %>% 
  filter(trn_tst == "test") %>% 
  slice(1:10) #check values (forecast period)
##          date value signal location location_name       day trn_tst Region
## 1  2021-08-24    15   hosp       02        Alaska   Tuesday    test     20
## 2  2021-08-25    33   hosp       02        Alaska Wednesday    test     20
## 3  2021-08-26    18   hosp       02        Alaska  Thursday    test     20
## 4  2021-08-27    21   hosp       02        Alaska    Friday    test     20
## 5  2021-08-28    14   hosp       02        Alaska  Saturday    test     20
## 6  2021-08-29    19   hosp       02        Alaska    Sunday    test     20
## 7  2021-08-30    32   hosp       02        Alaska    Monday    test     20
## 8  2021-08-31    22   hosp       02        Alaska   Tuesday    test     20
## 9  2021-09-01    22   hosp       02        Alaska Wednesday    test     20
## 10 2021-09-02    23   hosp       02        Alaska  Thursday    test     20
##    SAEPOVRTALL_PT age_pop Rt_raw       Rt
## 1             9.6  184927     NA 1.018212
## 2             9.6  184927     NA 1.051603
## 3             9.6  184927     NA 1.084992
## 4             9.6  184927     NA 1.118382
## 5             9.6  184927     NA 1.151771
## 6             9.6  184927     NA 1.185161
## 7             9.6  184927     NA 1.218551
## 8             9.6  184927     NA 1.251941
## 9             9.6  184927     NA 1.285331
## 10            9.6  184927     NA 1.318721

Organize Data

Model parameters, inputs, and covariates will vary from model-to-model and user-to-user but ultimately all need to be combined in a single object that can be ingested by the the inference software, INLA in this case. The code below modifies the train_data dataframe to perform any desired scaling and to add several spatial and temporal indices.

Ordered integers are used as indices to define timesteps (days, weeks, etc), locations (‘Region’), and space*time combinations (e.g., ‘ID.Region.Wk’ in the next chunk). The modeling approach is hierarchical, so some indices may be used in multiple levels. But, each index name must be unique therefore some indices are copied and given slighly different names.

Once data is organized, it is reformatted as a list object called a ‘datastack’ that is passed to the inference software. Although a daraframe might be more intuitive, a list object is used so that model inputs can be of different lengths.

Clean Dataframe

The time_index() function is used to recode a date vector to the desired timestep duration (2-day steps, 1 week steps, etc).

train_data <- as.data.frame(Rt_df) %>%
  mutate(
    s_pop = log(age_pop), #log scale
    s_pov = as.numeric(scale(SAEPOVRTALL_PT)), #some NAs present.  
    doy = as.integer(as.factor(date)),
    doy.1 = doy,
    Region.Wk = paste0("ID", Region, "W", doy), #unique Region*doy combinations ('space-time interaction')
    ID.Region.Wk = as.integer(as.factor(Region.Wk)), #convert factor levels to integer
    week = week(date),
    int_week.1 = as.integer(as.factor(week)),
    int_week.2 = int_week.1,
    int_week.3 = int_week.1,
    threeday_indx = time_index(date, seq(min(date), max(date), by = "3 days")),
    threeday_indx.1 = as.integer(as.factor(threeday_indx)),
    fourday_indx = time_index(date, seq(min(date), max(date), by = "4 days")),
    fourday_indx.1 = as.integer(as.factor(fourday_indx)),
    fiveday_indx = time_index(date, seq(min(date), max(date), by = "5 days")),
    fiveday_indx.1 = as.integer(as.factor(fiveday_indx)),
    eightday_indx = time_index(date, seq(min(date), max(date), by = "8 days")),
    eightday_indx.1 = as.integer(as.factor(eightday_indx)),
    biweek_indx = time_index(date, seq(min(date), max(date), by = "14 days")),
    biweek_indx.1 = as.integer(as.factor(biweek_indx)),
    Region.1 = Region, Region.2 = Region, Region.3 = Region, 
                Region.4 = Region, Region.5 = Region
  ) %>%
  select(c(-biweek_indx, threeday_indx, fourday_indx, fiveday_indx, eightday_indx))

head(train_data)
##         date value signal location location_name       day trn_tst Region
## 1 2021-06-28     2   hosp       02        Alaska    Monday   train     20
## 2 2021-06-29     5   hosp       02        Alaska   Tuesday   train     20
## 3 2021-06-30     2   hosp       02        Alaska Wednesday   train     20
## 4 2021-07-01     6   hosp       02        Alaska  Thursday   train     20
## 5 2021-07-02     3   hosp       02        Alaska    Friday   train     20
## 6 2021-07-03     7   hosp       02        Alaska  Saturday   train     20
##   SAEPOVRTALL_PT age_pop Rt_raw Rt    s_pop      s_pov doy doy.1 Region.Wk
## 1            9.6  184927     NA NA 12.12772 -0.8064822   1     1    ID20W1
## 2            9.6  184927     NA NA 12.12772 -0.8064822   2     2    ID20W2
## 3            9.6  184927     NA NA 12.12772 -0.8064822   3     3    ID20W3
## 4            9.6  184927     NA NA 12.12772 -0.8064822   4     4    ID20W4
## 5            9.6  184927     NA NA 12.12772 -0.8064822   5     5    ID20W5
## 6            9.6  184927     NA NA 12.12772 -0.8064822   6     6    ID20W6
##   ID.Region.Wk week int_week.1 int_week.2 int_week.3 threeday_indx
## 1          795   26          1          1          1    2021-06-28
## 2          806   26          1          1          1    2021-06-28
## 3          817   26          1          1          1    2021-07-01
## 4          828   26          1          1          1    2021-07-01
## 5          839   27          2          2          2    2021-07-01
## 6          850   27          2          2          2    2021-07-04
##   threeday_indx.1 fourday_indx fourday_indx.1 fiveday_indx fiveday_indx.1
## 1               1   2021-06-28              1   2021-06-28              1
## 2               1   2021-06-28              1   2021-06-28              1
## 3               2   2021-06-28              1   2021-06-28              1
## 4               2   2021-07-02              2   2021-07-03              2
## 5               2   2021-07-02              2   2021-07-03              2
## 6               3   2021-07-02              2   2021-07-03              2
##   eightday_indx eightday_indx.1 biweek_indx.1 Region.1 Region.2 Region.3
## 1    2021-06-28               1             1       20       20       20
## 2    2021-06-28               1             1       20       20       20
## 3    2021-06-28               1             1       20       20       20
## 4    2021-06-28               1             1       20       20       20
## 5    2021-06-28               1             1       20       20       20
## 6    2021-07-06               2             1       20       20       20
##   Region.4 Region.5
## 1       20       20
## 2       20       20
## 3       20       20
## 4       20       20
## 5       20       20
## 6       20       20

Response Variable

The response variable may differ between models. In this case, copying hospital incidence (counts) to a new column, standardizing the distribution, and ensuring that observations for the forecast horizon are coded as NA. Retaining the scaling object and rewriting as function obs_scale() to transform model outputs back to the observation scale later.

Again, response variables are specific to individual model setup so could be scaled differently or not at all to be fit with a different likelihood (Poisson, NegBinomial, etc). Keeping it simple here.

train_data$resp = ifelse(train_data$trn_tst == "train", train_data$value, NA) #Set obs value to NA for forecasts periods

resp_scale_obj = scale(train_data$resp, scale=T, center=T) #scaled object
obs_scale = function(r)r*attr(resp_scale_obj,'scaled:scale') + attr(resp_scale_obj, 'scaled:center') #transform back to observation scale

train_data$nrm_resp = as.numeric(resp_scale_obj)

Format as a Datastack

In this example, data could remain as a dataframe and passed to INLA directly, but as a matter of practice, it is better to format as a list object (datastack).

nrm.lst = list(list(intercept1 = rep(1, dim(train_data)[1])), #custom intercept
          list(pov_pct = train_data[,"s_pov"],                #desired covariates and indices below
               pop = train_data[,"s_pop"],                    #formatting as nrm.lst = list()
               Rt_raw = train_data[,"Rt_raw"],
               Rt_raw.1 = train_data[,"Rt_raw"],
               Rt = train_data[,"Rt"],
               Rt.1 = train_data[,"Rt"],
               doy = train_data[,"doy"],
               doy.1 = train_data[,"doy.1"],
               doy.2 = train_data[,"doy.1"],
               int_week.1 = train_data[,"int_week.1"],
               int_week.2 = train_data[,"int_week.2"],
               int_week.3 = train_data[,"int_week.3"],
               threeday_indx.1 = train_data[,"threeday_indx.1"],
               fourday_indx.1 = train_data[,"fourday_indx.1"],
               fiveday_indx.1 = train_data[,"fiveday_indx.1"],
               eightday_indx.1 = train_data[,"eightday_indx.1"],
               biwek_indx.1 = train_data[,"biweek_indx.1"],
               Region.1 = train_data[,"Region.1"],
               Region.2 = train_data[,"Region.2"],
               Region.3 = train_data[,"Region.3"],
               Region.4 = train_data[,"Region.4"],
               Region.5 = train_data[,"Region.5"],
               Region_Wk = train_data[,"ID.Region.Wk"],
               dow = train_data[,"day"]))

nrm.stk = inla.stack(data = list(Y = train_data$nrm_resp), #Y is response variable
                                      A = list(1,1),       #option to include matrices, not used in this case
                                effects = nrm.lst,         #list object from above
                                    tag = "nrm")           #arbitrary name to index searches later 
                                                           #(multiple datastack used in complex models)

Model Priors and Formulae

Specifying all priors and formulas for desired models using R-INLA syntax. A deep-dive would be needed to describe in detail, which would be too time consuming for this workflow focused demonstration.

A few notes:

  • Priors
    • pc refers to “Penalizing Complexity”
    • prec is “precision”
    • cor is “correlation”
    • non-PC priors are coded using the distribution name (e.g., norm = “normal”)
  • Formulas
    • Y is the response variable in the datastack
    • f() designates a function, level, or submodel in the hierarchy
    • hyper= refers to the prior for that f()

Set Priors

#bym prior
bym_hyper <- list(phi = list(prior = "pc", 
                      param = c(0.5, 2/3), 
                      initial = 3), 
               prec = list(prior = "pc.prec", 
                       param = c(1, 0.01), 
                       initial = 1.5))  
#Normal prior
norm.prior <- list(theta=list(prior = "normal", 
                              param=c(0, 1)))


#iid prior
pc_prec_iid <- list(theta = list(prior="pc.prec", 
                                 param=c(0.5, 0.01)))

#ar1 prior
pc_cor_ar1 <- list(theta = list(prior = 'pccor1', 
                                param = c(0.5, 0.9)))

#rw2 prior
pc_rw2 <- list(prec=list(prior="pc.prec", 
                         param=c(0.5,0.01)))

#bundle priors to archive run
priors.list <- list()
priors.list[["bym_hyper"]] <- bym_hyper
priors.list[["norm.prior"]] <- norm.prior
priors.list[["pc_prec_iid"]] <- pc_prec_iid
priors.list[["pc_cor_ar1"]] <- pc_cor_ar1
priors.list[["pc_rw2"]] <- pc_rw2

Specify Formulas

Formula 1: Random Walk plus noise for each location (i.e., state)

Frm.1 = Y ~ -1 +     #remove default intercept
  intercept1 +       #custom intercept
  f(doy.1,           #order by time index (daily)
    constr=TRUE,     #enforced zero mean
    model="rw1",     #order-1 random walk with noise
    scale.model = TRUE, #additional internal scaling
    group = Region.1, #run rw1 model for groups based on location 
    control.group=list(model="iid"), #groups are treated independently
    hyper=pc_rw2)  #prior for rw2

Formula 2: Random Walk plus noise and trend for each location

Frm.2 = Y ~ -1 +     
  intercept1 +       
  f(doy.1,           
    constr=TRUE,     
    model="rw1",    
    scale.model = TRUE, 
    group = Region.1, 
    control.group=list(model="iid"), 
    hyper=pc_rw2) + 
  f(doy.2, model="linear", mean.linear = 0, prec.linear = 0.001) #add linear trend to rw1

Formula 3: Common spatial effect for timesteps but each location has separate autoregression

Frm.3 = Y ~ -1 +    
  intercept1 +       
  f(Region.1,        #location index
    model="bym2",    #spatial effect, Besag-York-Mollie model (the 2 indicates scaling) 
    graph=J,         #Adjacency graph to identify neighbors
    constr=TRUE,     #enforced zero mean
    hyper=bym_hyper) + #BYM prior
  f(doy.1,             #order by time index (daily)
    model="ar1",       #apply order-1 autoregressive
    constr=TRUE,
    group = Region.1,  #run ar1 model for groups based on location
    control.group=list(model="iid"), #groups are treated independently
    hyper=pc_cor_ar1) 

Formula 4: Separate spatial effect for each timestep (related by ar1) and each location has its own autoregressive term.

Frm.4 = Y ~ -1 +     
  intercept1 +       
  f(Region.1,        
    model="bym2",   
    graph=J,         
    constr=TRUE,     
    group = doy,     #time index, daily (create separate realizations of spatial covariate for each day)
    control.group=list(model="ar1"), #groups are related via an order-1 autoregressive
    hyper=bym_hyper) + #prior for BYM
  f(doy.1,             
    model="ar1",       
    constr=TRUE,
    group = Region.1,  
    control.group=list(model="iid"), 
    hyper=pc_cor_ar1) 

Formula 5: As Formula 4 but with space-time interaction to capture location and time specific variation outside of modeled trends.

Frm.5 = Y ~ -1 +    
  intercept1 +       
  f(Region.1,        
    model="bym2",    
    graph=J,         
    constr=TRUE,    
    group = doy,     
    control.group=list(model="ar1"),
    hyper=bym_hyper) +
  f(doy.1,             
    model="ar1",       
    constr=TRUE,
    group = Region.1,  
    control.group=list(model="iid"), 
    hyper=pc_cor_ar1) +
  f(Region_Wk,   #Index for all location*time combinations (space-time interaction)
    model="iid", #each location and time combination considered independently
    constr=TRUE,
    hyper=pc_prec_iid) 

Formula 6: As Formula 5 but adding covariate for variation due to day of week (e.g. Monday, Tuesday,…Sunday).

Frm.6 = Y ~ -1 +    
  intercept1 +       
  f(Region.1,        
    model="bym2",    
    graph=J,         
    constr=TRUE,     
    group = doy,     
    hyper=bym_hyper, 
    control.group=list(model="ar1")) +
  f(doy.1,             
    model="ar1",       
    constr=TRUE,
    group = Region.1,  
    control.group=list(model="iid"), 
    hyper=pc_cor_ar1) +
  f(dow,           #discrete variable indicating day of week, e.g. Monday, Tuesday,...Sunday
    constr=TRUE,
    model="iid",   #days of week may vary independently
    group = Region.2, #variation attributed to days of week may differ by location
    control.group=list(model="iid"), 
    hyper=pc_prec_iid) +
  f(Region_Wk,   
    model="iid", 
    constr=TRUE,
    hyper=pc_prec_iid) 

Formula 7: Including Rt estimates as an experimental covariate. Forecast Rt trend estimated from the observation period (training period) to the future (28 days) using an autoregressive model.

Frm.7 = Y ~ -1 +     
  intercept1 +       
  pov_pct + pop +    
  f(Region.1,        
    model="bym2",    
    graph=J,        
    constr=TRUE,     
    group = doy,     
    hyper=bym_hyper, 
    control.group=list(model="ar1")) + 
  f(doy.1, Rt_raw,  #order by time index (daily) but weight each timestep by corresponding Rt_raw estimate
    model="ar1",    #apply order-1 autoregressive to Rt weighted time index above
    constr=TRUE,
    group = Region.2, 
    control.group=list(model="iid"),
    hyper=pc_cor_ar1) +
  f(dow,           
    constr=TRUE,
    model="iid",
    group = Region.3,
    control.group=list(model="iid"),
    hyper=pc_prec_iid) +
  f(Region_Wk,   
    model="iid", 
    constr=TRUE,
    hyper=pc_prec_iid) 

Formula 8: As with Formula 8 but adding a random walk at a more coarse time scale (3 day steps) to reduce forecast decay.

Frm.8 = Y ~ -1 +     
  intercept1 +       
  pov_pct + pop +    #linear covariates for poverty and population over 55yrs
  f(Region.1,        
    model="bym2",    
    graph=J,        
    constr=TRUE,     
    group = doy,    
    hyper=bym_hyper, 
    control.group=list(model="ar1")) + 
  f(threeday_indx.1, #time index, 3days
    constr=TRUE,
    model="rw2",     #order-2 random walk with noise
    scale.model = TRUE,
    group = Region.2,
    control.group=list(model="iid"), 
    hyper=pc_rw2) + 
  f(doy.1, Rt_raw,  
    model="ar1",    
    constr=TRUE,
    group = Region.3, 
    control.group=list(model="iid"),
    hyper=pc_cor_ar1) +
  f(dow,           
    constr=TRUE,
    model="iid",
    group = Region.4,
    control.group=list(model="iid"),
    hyper=pc_prec_iid) +
  f(Region_Wk,   
    model="iid", 
    constr=TRUE,
    hyper=pc_prec_iid) 

Organize Formulas

formulas.list <- list()
formulas.list[["base_rw1"]] <- Frm.1
formulas.list[["rw1_trend"]] <- Frm.2
formulas.list[["base_car"]] <- Frm.3
formulas.list[["car_time"]] <- Frm.4
formulas.list[["car_sti"]] <- Frm.5
formulas.list[["car_wdays"]] <- Frm.6
formulas.list[["car_rt"]] <- Frm.7
formulas.list[["car_full"]] <- Frm.8

Run Models

The run_model_list() function runs a series of models as specified in the formulas.list using the input data organized as a datastack (nrm.srk). The function runs each model sequentially and writes the executed models (models_out), formulas (formulas.list), priors (prior.list), datastack, and original dataframe (train_data) to an .RData in the analysis directory. The models_out object will also be available in the environment for further processing.

There are many customization options for inference but have opted to keep run_model_list() fairly simple for ease of use and maximum efficiency.

Addional run_model_list() options:
  • likelihood
    • If one likelihood is provided it will be applied to all models
    • A vector of likelihoods can be provided with order based on formulas.list
    • e.g., myFamilies <- c(“gaussian”, “binomial”, “zeroinflatednbinomial”, …)
  • config
    • indicates if latencies (GMRF) should be retained for sampling
    • config=TRUE can be time intensive and dramatically slow model runs
    • CovidCAR has a post_sampling() function to facilitate sampling
  • verbose prints INLA algorithm process to screen during model runs
  • archive indicates to save model inputs and results to the analysis directory
    • model outputs will be written to a run_archive subdirectory

Run All Models

#formulas.list = formulas.list[c(1:4)] #short list for demo, fast run models

models_out = run_model_list(formulas.list=formulas.list,
                            dataStack=nrm.stk,
                            likelihood = "gaussian",
                            config=FALSE, verbose = FALSE, archive=TRUE)

Extract and Format Forecasts

The extract_forecasts() function pulls forecasts from models and saves them to a forecasts folder in analysis directory. The function returns a forecast_paths list object to the environment with file path names.

Forecasts are formatted to the specifications required for submission to the covid19-forecast-hub.

extract_forecasts(mod_out=models_out,
                  dataStack=nrm.stk, train_data=train_data,
                  team = "CFA")
## 

[36mℹ
[39m Writing model forecasts to analysis directory: 


[32m✔
[39m base_rw1 
[38;5;249m[1.8s]
[39m                                
## 

[36mℹ
[39m Writing model forecasts to analysis directory: 


[32m✔
[39m rw1_trend 
[38;5;249m[2.4s]
[39m                               
## 

[36mℹ
[39m Writing model forecasts to analysis directory: 


[32m✔
[39m base_car 
[38;5;249m[2.3s]
[39m                                
## 

[36mℹ
[39m Writing model forecasts to analysis directory: 


[32m✔
[39m car_time 
[38;5;249m[2.2s]
[39m                                
## 

[36mℹ
[39m Writing model forecasts to analysis directory: 


[32m✔
[39m car_sti 
[38;5;249m[2.4s]
[39m                                 
## 

[36mℹ
[39m Writing model forecasts to analysis directory: 


[32m✔
[39m car_wdays 
[38;5;249m[2.1s]
[39m                               
## 

[36mℹ
[39m Writing model forecasts to analysis directory: 


[32m✔
[39m car_rt 
[38;5;249m[2.8s]
[39m                                  
## 

[36mℹ
[39m Writing model forecasts to analysis directory: 


[32m✔
[39m car_full 
[38;5;249m[2.5s]
[39m                                
#check returned object
names(forecast_paths)
## [1] "base_rw1"  "rw1_trend" "base_car"  "car_time"  "car_sti"   "car_wdays"
## [7] "car_rt"    "car_full"
head(read.csv(forecast_paths[["rw1_trend"]])) #formatted for submission
##   forecast_date location               target target_end_date     type quantile
## 1    2021-08-23       02 0 day ahead inc hosp      2021-08-23 quantile     0.01
## 2    2021-08-23       02 1 day ahead inc hosp      2021-08-24 quantile     0.01
## 3    2021-08-23       02 2 day ahead inc hosp      2021-08-25 quantile     0.01
## 4    2021-08-23       02 3 day ahead inc hosp      2021-08-26 quantile     0.01
## 5    2021-08-23       02 4 day ahead inc hosp      2021-08-27 quantile     0.01
## 6    2021-08-23       02 5 day ahead inc hosp      2021-08-28 quantile     0.01
##      value
## 1 18.04242
## 2  0.00000
## 3  0.00000
## 4 15.69182
## 5 47.97946
## 6 83.44678

View forecasts extract_forecasts() also returns a list (‘plot_paths’) of paths to plotting data from the extraction process. This data can be accessed from the plot_location() function, which provides a quick diagnostic plot of forecast for a specific location.

Note of Caution: If the ‘loc=’ option is left as NULL, all locations will be plotted to PDF file and saved in the ‘Reports’ folder of the analysis directory. This may be time consuming!

OK_plot = plot_location(plot_path = plot_paths, model = "car_full", loc = "Oklahoma")
OK_plot

AR_plot = plot_location(plot_path = plot_paths, model = "car_full", loc = "Arkansas")
AR_plot

#plot_location(plot_path = plot_paths, model = "car_full") #plots and saves all locations to a pdf file

Model Scoring

WIS Scores

The score_WIS() function calculates the WIS score for forecasts by model. Optional arguments can be included to indicate if files should be read from a directory (ingest = “path”), a dataframe in the environment (ingest = “dataframe”),or from a list object with individual file paths (ingest = “list”) as returned by extract_forecasts().

The ‘missing’ option can be used to specify how missing observation data should be handled; ‘remove’ from data or fill with ‘zero’.

my_truth <- train_data %>% #Caution: my_truth may be different than your truth :)
  select(date, location, value)

my_scores <- score_WIS(forecast_data = forecast_paths, truth=my_truth, 
                       ingest = "list", missing = "remove") 
## → A total of 5336 predictions weren't evalauted due lack of truth data
head(my_scores)
##      model       date location_name forecast_date      WIS
## 1 base_car 2021-08-23            01    2021-08-23 1.349751
## 2 base_car 2021-08-23            02    2021-08-23 1.105233
## 3 base_car 2021-08-23            04    2021-08-23 1.200149
## 4 base_car 2021-08-23            05    2021-08-23 1.460301
## 5 base_car 2021-08-23            06    2021-08-23 1.401860
## 6 base_car 2021-08-23            08    2021-08-23 1.142202
#overall
wis_rank <- my_scores %>%
  group_by(model) %>%
  summarise(mean_wis = mean(WIS)) %>%
  arrange(mean_wis) %>%
  mutate(wisRank = row_number())

wis_rank #mean absolute values
## # A tibble: 8 × 3
##   model     mean_wis wisRank
##   <chr>        <dbl>   <int>
## 1 car_wdays     33.5       1
## 2 car_rt        34.1       2
## 3 base_car      34.6       3
## 4 car_time      34.6       4
## 5 car_sti       34.9       5
## 6 car_full      49.3       6
## 7 base_rw1     281.        7
## 8 rw1_trend    307.        8

Diagnostic score plots
The plot_WIS_lines() function has options to make quick plots of WIS scores returned by score_WIS().

unique(my_scores$model)
## [1] "base_car"  "base_rw1"  "car_full"  "car_rt"    "car_sti"   "car_time" 
## [7] "car_wdays" "rw1_trend"
#lines showing absolute WIS
plot_WIS_lines(my_scores, by = "date", range = "abs")

#lines showing scaled WIS
plot_WIS_lines(my_scores, by = "date", range = "scaled", 
               scale_model = "base_rw1")

#optional 'limit' that recodes: (WIS >= limit) -> limit
plot_WIS_lines(my_scores, by = "date", range = "scaled", 
               scale_model = "base_rw1", limit = 2)

#tile option
plot_WIS_lines(my_scores, by = "tile", range = "scaled", 
               scale_model = "base_rw1", limit = 2)
## Warning: Removed 864 rows containing missing values (`geom_tile()`).

Mean Absolute Error (MAE)

The score_MAE() function works comparably to score_WIS() but is a simpler measure of model performance as it is based on only the point estimates from forecasts.

my_mae <- score_MAE(forecast_data = forecast_paths, truth=my_truth, ingest = "list", missing = "remove") 
## → A total of 232 forecasts weren't evaluatd due lack of truth data
my_mae
##       model   MAE    MAPE maeRank
## 1 car_wdays  41.4 0.01247       1
## 2    car_rt  41.8 0.01262       2
## 3  base_car  41.9 0.01262       3
## 4  car_time  41.9 0.01263       4
## 5   car_sti  42.1 0.01271       5
## 6  car_full  62.8 0.01893       6
## 7  base_rw1 325.6 0.09818       7
## 8 rw1_trend 353.3 0.10652       8

Ensemble

The propose_weights() function assists with ensemble building by weighting models using a given performance metric. The function scales the raw model comparison metric and then builds the ensemble by multiplying each forecasts by its model-specific weight and summing across all included models. The resulting ensemble forecast is then standardized for Covid19-hub submission and written to the analysis directory (./forecasts). The function returns the estimated weights to the environment.

For example, the WIS and MAE scores could be used to weight individual models in an ensemble.

First, compare WIS and MAE scores:

mod_rank <- left_join(my_mae, wis_rank, by="model") #combine with overall WIS

mod_rank #note the scores rank models differently  
##       model   MAE    MAPE maeRank  mean_wis wisRank
## 1 car_wdays  41.4 0.01247       1  33.54621       1
## 2    car_rt  41.8 0.01262       2  34.10622       2
## 3  base_car  41.9 0.01262       3  34.57760       3
## 4  car_time  41.9 0.01263       4  34.64150       4
## 5   car_sti  42.1 0.01271       5  34.90884       5
## 6  car_full  62.8 0.01893       6  49.25360       6
## 7  base_rw1 325.6 0.09818       7 280.86031       7
## 8 rw1_trend 353.3 0.10652       8 307.10326       8

The propose_weights() function can the be applied to generate weights and write the resulting ensemble.

my_wis_weights <- propose_weights(forecast_data = forecast_paths, #standardized model forecasts
                             ingest = "list",                     #read model locations as list
                             rank_df = mod_rank,                  #use the data from WIS and MAE scoring         
                             rankCol = "mean_wis",                #weight models based on this column
                             team = "CFA",                        #team name (need be in file name per Covid19hub)
                             mod_name = "wis_ensemble")           #label for the new ensemble forecast
## → Writing ensemble 'wis_ensemble' to analysis directory
my_wis_weights #The weights column reports the actual weights calculated for each model
##       model   MAE    MAPE maeRank  mean_wis wisRank mean_wis_weights
## 1 car_wdays  41.4 0.01247       1  33.54621       1       0.13693337
## 2    car_rt  41.8 0.01262       2  34.10622       2       0.13683448
## 3  base_car  41.9 0.01262       3  34.57760       3       0.13675124
## 4  car_time  41.9 0.01263       4  34.64150       4       0.13673996
## 5   car_sti  42.1 0.01271       5  34.90884       5       0.13669275
## 6  car_full  62.8 0.01893       6  49.25360       6       0.13415968
## 7  base_rw1 325.6 0.09818       7 280.86031       7       0.09326132
## 8 rw1_trend 353.3 0.10652       8 307.10326       8       0.08862720
#another example,this time using MAE and including a 'drop' option
my_mae_weights <- propose_weights(forecast_data = forecast_paths, 
                             ingest = "list",
                             rank_df = mod_rank, 
                             rankCol = "MAE",
                             drop = 1,  #number of lowest ranked models to drop/exclude from ensemble
                             team = "CFA",
                             mod_name = "mae_ensemble")
## → Dropped models: rw1_trend
## → Writing ensemble 'mae_ensemble' to analysis directory
my_mae_weights
##       model   MAE    MAPE maeRank  mean_wis wisRank MAE_weights
## 1 car_wdays  41.4 0.01247       1  33.54621       1  0.15511855
## 2    car_rt  41.8 0.01262       2  34.10622       2  0.15500697
## 3  base_car  41.9 0.01262       3  34.57760       3  0.15497908
## 4  car_time  41.9 0.01263       4  34.64150       4  0.15497908
## 5   car_sti  42.1 0.01271       5  34.90884       5  0.15492329
## 6  car_full  62.8 0.01893       6  49.25360       6  0.14914923
## 7  base_rw1 325.6 0.09818       7 280.86031       7  0.07584379
#yet another example, not providing a rankCol -> function assumes equal weighting 
equal_mae_weights <- propose_weights(forecast_data = forecast_paths, 
                             ingest = "list",
                             rank_df = mod_rank, 
                             #rankCol = NULL,
                             drop = 1,  #issues warning, dropping models without a ranking criteria
                             team = "CFA",
                             mod_name = "equal_mae_ensemble")
##  No rankCol but dropping models? Models will be dropped from end of list!
## ! No rankCol provided, ensemble assumes equal weighting
## → Dropped models: rw1_trend
## → Writing ensemble 'equal_mae_ensemble' to analysis directory
equal_mae_weights
##       model   MAE    MAPE maeRank  mean_wis wisRank rankCol rankCol_weights
## 1 car_wdays  41.4 0.01247       1  33.54621       1       1       0.1428571
## 2    car_rt  41.8 0.01262       2  34.10622       2       1       0.1428571
## 3  base_car  41.9 0.01262       3  34.57760       3       1       0.1428571
## 4  car_time  41.9 0.01263       4  34.64150       4       1       0.1428571
## 5   car_sti  42.1 0.01271       5  34.90884       5       1       0.1428571
## 6  car_full  62.8 0.01893       6  49.25360       6       1       0.1428571
## 7  base_rw1 325.6 0.09818       7 280.86031       7       1       0.1428571

Ensemble Re-scoring

Now that new ensembles have been added to the ‘forecasts’ directory, comparison scores can be recalculated.

myDir <- file.path(su_yaml$out_dir_name, "forecasts")
new_mae <- score_MAE(forecast_data = myDir, truth=my_truth, ingest = "path", missing = "remove") 
## → A total of 319 forecasts weren't evaluatd due lack of truth data
new_mae
##                 model   MAE    MAPE maeRank
## 1           car_wdays  41.4 0.01247       1
## 2              car_rt  41.8 0.01262       2
## 3            base_car  41.9 0.01262       3
## 4            car_time  41.9 0.01263       4
## 5             car_sti  42.1 0.01271       5
## 6        mae_ensemble  49.2 0.01485       6
## 7  equal_mae_ensemble  60.4 0.01820       7
## 8            car_full  62.8 0.01893       8
## 9        wis_ensemble  70.6 0.02130       9
## 10           base_rw1 325.6 0.09818      10
## 11          rw1_trend 353.3 0.10652      11

Historic Forecasts

The get_hub_forecasts() function retrieves forecasts previously submitted to the covid19-forecast-hub.

Similar to get_covid19_obs(), there are options to load from a local parquet cache (source=“cache”) as indexed with the Covid19Forecasts package(private repo) or to load “test” data included with the package. There is also the an option to use the covidHubUtils package to download data directly from covid19-forecast-hub.

The queried results can also be filtered to specific models using the ‘models=’ option. If not set, the ‘model=’ options defaults to forecasts from the COVIDhub-trained_ensemble, COVIDhub-ensemble, and COVIDhub-baseline models.

By default, get_hub_forecasts() returns forecasts for the forecast period specified during initial setup using setup_analysis().

hist_forecasts <- get_hub_forecasts(source = "covidHubUtils",
                                    models = c("COVIDhub-trained_ensemble", "COVIDhub-baseline", "COVIDhub-ensemble"),
                                    write_copy = TRUE)
## → Loading location crosswalk
## Rows: 57 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "|"
## chr (4): STATE, STUSAB, STATE_NAME, STATENS
## 
##  Use `spec()` to retrieve the full column specification for this data.
##  Specify the column types or set `show_col_types = FALSE` to quiet this message.
## → Fetching COVID-19 forecasts using covidHubUtils
## 
## get_token(): POST: https://zoltardata.com/api-token-auth/
## 
## get_resource(): GET: https://zoltardata.com/api/projects/
## 
## get_resource(): GET: https://zoltardata.com/api/project/44/models/
## 
## get_resource(): GET: https://zoltardata.com/api/project/44/timezeros/
## 
## → Writing forecast data to analysis directory
dim(hist_forecasts)
## [1] 103032      7
head(hist_forecasts)
##                       model forecast_date location target_end_date     type
## 1 COVIDhub-trained_ensemble    2021-08-23       01      2021-08-24    point
## 2 COVIDhub-trained_ensemble    2021-08-23       01      2021-08-24 quantile
## 3 COVIDhub-trained_ensemble    2021-08-23       01      2021-08-24 quantile
## 4 COVIDhub-trained_ensemble    2021-08-23       01      2021-08-24 quantile
## 5 COVIDhub-trained_ensemble    2021-08-23       01      2021-08-24 quantile
## 6 COVIDhub-trained_ensemble    2021-08-23       01      2021-08-24 quantile
##   quantile value
## 1       NA   495
## 2    0.010   369
## 3    0.025   384
## 4    0.050   395
## 5    0.100   404
## 6    0.150   412
#because data is formatted to same standard, functions can read
hub_rank <- score_WIS(forecast_data = hist_forecasts, truth=my_truth, 
                      ingest = "dataframe", missing = "remove")

hub_rank %>%
  group_by(model) %>%
  summarise(mean_wis = mean(WIS))
## # A tibble: 3 × 2
##   model                     mean_wis
##   <chr>                        <dbl>
## 1 COVIDhub-baseline             45.6
## 2 COVIDhub-ensemble             50.6
## 3 COVIDhub-trained_ensemble     68.2

Combine with CovidCAR models
The plot_forecasts_compare() function combines plot_WIS_lines() and score_WIS() to make a model WIS comparison line plot. The ‘hub_forecasts’ option facilitates direct use of imported historical forecast data from the Covid19-forecast-hub.

my_plot <- plot_forecasts_compare(forecast_data = myDir, truth=my_truth, ingest = "path",
                                  hub_forecasts = hist_forecasts, #historic forecasts
                                  scale_model = "COVIDhub-baseline",
                                  limit = 4,
                                  missing = "remove",
                                  write_copy = TRUE)
## → A total of 7337 predictions weren't evalauted due lack of truth data
## → Writing comparison scores to analysis directory
class(my_plot)
## [1] "gg"     "ggplot"
my_plot