CovidCAR Package Overview

CovidCAR is intended to facilitate Covid19 model building, ensembling, and evaluation
Categories: R, statistical inference

Published: May 18, 2023

Overview

This introduction briefly outlines core functions used to preprocess observation data, build spatial-temporal models, and post-process model outputs. Its purpose is to demonstrate a standard workflow, not to provide an in-depth examination of individual functions or model-building techniques.


Preliminaries

Load needed packages
A few of these packages are not available on CRAN and will need to be installed from other locations.

#console messages and prompts
options(dplyr.summarise.inform = FALSE)
library(cli)


#wrangling
library(tidyverse)
library(lubridate)
library(arrow)
library(Hmisc)
library(yaml)

#spatial manipulation
library(sp)
library(sf)
library(spdep)
library(rgeos)
library(igraph)
library(maptools)
library(mapproj)

#census data
library(censusapi)

#forecast data
library(zoltr) #Not available on CRAN
#remotes::install_github("reichlab/zoltr")
library(covidHubUtils) #Not available on CRAN
#remotes::install_github("reichlab/covidHubUtils")

#inference
library(INLA) #Not available on CRAN
#install.packages("INLA",repos=c(getOption("repos"),
#INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)

library(EpiEstim)
library(forecast)

CovidCAR is currently available on GitHub

library(CovidCAR)
#devtools::install_github("JMHumphreys/CovidCAR")

Setup Analysis

Specify Dates and Directories

The setup_analysis() function defines key date thresholds for model training and forecast horizon periods and should always be run before using any other functions in the CovidCAR package. The dates are written to a yaml file for use by other functions.

The function also records directory paths for (optionally) writing outputs outside of the working directory or pulling previously cached observation data (a cache as created with Covid19Forecast.v1).

my_ouputs <- "C:/Users/unp7/Desktop/Misc/CovidCAR_test"
my_local <- "C:/Users/unp7/Desktop/GitHub/covid19Forecasts/local/cache"

setup_analysis(report_date = "2021-08-23", #report date, first forecast day
               training_period = 2*28, #days
               forecast_horizon = 28, #days
               output_dir = my_ouputs, #write outputs here
               local_cache_dir = my_local #cache
)
→ Your local cache will be available to get_covid19_obs()
→ Analysis outputs will be written to C:/Users/unp7/Desktop/Misc/CovidCAR_test/2021-08-23-CovidCAR-run2023-05-23
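
Later chunks reference a su_yaml object (e.g., su_yaml$full_time_span and su_yaml$out_dir_name), which holds the settings written out by setup_analysis(). If that object is not already created in the environment by setup_analysis() itself, it can be reloaded with yaml::read_yaml(); the file name and location below are hypothetical placeholders, since the actual file written by the package is not shown in this demo.

#Hedged sketch: reload the analysis settings written by setup_analysis().
#"setup.yaml" and its location are assumed placeholders, not confirmed package behavior.
su_yaml <- yaml::read_yaml(file.path(my_ouputs, "setup.yaml"))
names(su_yaml) #expect entries such as full_time_span and out_dir_name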

Define Spatial Domain

The download_boundaries() function pulls US State and territorial boundaries (ESRI shapefiles) from sources in the public domain. Some basic projection is performed, the shapefile is converted to a SpatialPolygonsDataFrame, and data attributes for a location identifier (‘Region’) and name (‘State’) are appended to the object.

NOTE: The function includes an option to download county boundaries (unit=“county”), but there are timeout issues due to the large file size that still need to be resolved.

States <- download_boundaries(unit = "state")
→ Downloading polygon files...
Reading layer `us-state-boundaries' from data source 
  `C:\Users\unp7\Desktop\Misc\CovidCAR_test\2021-08-23-CovidCAR-run2023-05-23\polygons' 
  using driver `ESRI Shapefile'
Simple feature collection with 56 features and 20 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.44069
Geodetic CRS:  WGS 84
class(States)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
head(States@data[,c("Region", "State")]) #appended attributes  
Region State
1 Virgin Islands
2 Wisconsin
3 Vermont
4 New Jersey
5 Colorado
6 South Carolina

Adjacency Graph

The get_neighbors() function is used to identify polygons (States and Territories in this example) that are located next to each other. Neighbor information is recorded in a matrix (dimensions: location*location) that is included with the CAR model. Estimates for any one location are then ‘conditional’ on the estimates for surrounding locations.

NOTE: Polygons representing locations such as Hawaii and Guam are isolated from other locations (termed ‘islands’) and can be problematic. One option in this situation is to force connections between locations: the function’s ‘connect’ argument will link islands to other locations based on proximity.

Example: Islands with “no links”

nb_islands = get_neighbors(States, connect=FALSE)
summary(nb_islands) #note that "7 regions with no links"
Neighbour list object:
Number of regions: 56 
Number of nonzero links: 224 
Percentage nonzero weights: 7.142857 
Average number of links: 4 
7 regions with no links:
1 24 32 33 38 42 54
Link number distribution:

 0  1  2  3  4  5  6  7  8 
 7  1  4  9  9 10 12  2  2 
1 least connected region:
13 with 1 link
2 most connected regions:
49 56 with 8 links

Example: All locations linked

nb_coerced = get_neighbors(States, connect=TRUE)
summary(nb_coerced)
Neighbour list object:
Number of regions: 56 
Number of nonzero links: 242 
Percentage nonzero weights: 7.716837 
Average number of links: 4.321429 
Link number distribution:

 1  2  3  4  5  6  7  8 
 5  3 10 12 10 11  3  2 
5 least connected regions:
1 13 38 42 54 with 1 link
2 most connected regions:
49 56 with 8 links
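
To inspect the location-by-location matrix described above, the neighbor list can be converted to a binary adjacency matrix with spdep::nb2mat(); this is purely an inspection step and is not required by the workflow.

#Optional inspection of the binary adjacency matrix implied by the neighbor list
W <- spdep::nb2mat(nb_coerced, style = "B", zero.policy = TRUE)
dim(W)           #56 x 56 (location x location)
head(rowSums(W)) #number of neighbors for the first few locations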

View mapped adjacency
The plot_neighbors() function overlays adjacency connections on mapped location boundaries.

plot_neighbors(States, nb_islands)
Regions defined for each Polygons

plot_neighbors(States, nb_coerced)
Regions defined for each Polygons

Convert to INLA Graph
The nb2INLA() and inla.read.graph() functions are provided by the INLA package.

nb2INLA("J", nb_coerced)
J = inla.read.graph("J")

Retrieve Observation Data

The get_covid19_obs() function downloads hospital incidence data for a specified date range.

Source options for the data to be retrieved:

  • The covidcast package
  • From a local cache as created by the covid19Forecasts package (i.e., refactored pipeline pkg)
  • From test data available from the package itself (sample from summer 2021)
MinDate = min(su_yaml$full_time_span)
MaxDate = max(su_yaml$full_time_span)

testData = get_covid19_obs(source = "covidcast", 
                           start_date = MinDate, end_date = MaxDate, 
                           write_copy = TRUE)
#testData = get_covid19_obs(source = "cache", start_date = MinDate, end_date = MaxDate)
#testData = get_covid19_obs(source = "test", start_date = MinDate, end_date = MaxDate)

dim(testData)
[1] 4534    5
head(testData)
date value signal location location_name
2021-06-28 2 hosp 02 Alaska
2021-06-28 29 hosp 01 Alabama
2021-06-28 55 hosp 05 Arkansas
2021-06-28 75 hosp 04 Arizona
2021-06-28 201 hosp 06 California
2021-06-28 60 hosp 08 Colorado

Add Spatial Index
The append_region_index() function matches location names in the observation data to the Region index in the polygon boundaries object, which also corresponds to the adjacency matrix. The Region index is added as a column to the observations, as is a new trn_tst column coded with either a ‘train’ or ‘test’ nominal indicator to distinguish time periods used for model training (observed) from those reserved for testing (not observed).

train_data = append_region_index(train_data = testData, polys = States)
which(is.na(train_data$Region))
integer(0)

Forecast Template

The create_forecast_template() function ensures that each location-time combination in the analysis is represented in the data ingested by the model. For model runs using only historic observations, the function essentially returns the original input with some column names adjusted, because the full date range is already represented. For future dates where observations are not yet available, however, the function adds a row for each day through the forecast horizon, coding the observed incidence value as NA as a placeholder.

train_data = create_forecast_template(train_data)
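
A quick sanity check (not part of the package) is to confirm that every location now spans the same number of days, i.e., the training period plus the forecast horizon:

#Sanity check: each location should have the same number of rows (training days + horizon)
train_data %>%
  count(location_name) %>%
  summarise(locations = n(), min_days = min(n), max_days = max(n))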

Additional Covariates

Demo models in this example are fairly simple, but in many cases users will want to add predictors, signals, or covariates (independent variables). This section of the script demonstrates (1) how to pull and add demographic variables from the American Community Survey (ACS) and (2) how to add Rt estimates generated with the EpiEstim package.

Demographic Data

The getPovertyPop() function provides a wrapper around the censusapi package’s getCensus() function for loading American Community Survey (ACS) data from the U.S. Census Bureau. In this example, an API key (‘secret_api’) is used to pull the percent of each state’s total population in poverty (SAEPOVRTALL_PT) and the number of individuals over 55 years of age (given in the vars_pop option).
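
For orientation, a direct censusapi::getCensus() call for the poverty-rate variable looks something like the sketch below; the endpoint and time value are assumptions for illustration and are not necessarily what getPovertyPop() uses internally.

#Hedged sketch of a direct censusapi call for SAEPOVRTALL_PT;
#the SAIPE endpoint and time value are illustrative assumptions
pov_raw <- getCensus(name = "timeseries/poverty/saipe",
                     vars = c("NAME", "SAEPOVRTALL_PT"),
                     region = "state:*",
                     time = 2020,
                     key = get_api("censusapi"))
head(pov_raw)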

PovPop_data = getPovertyPop(key = get_api("censusapi"), #function reads 'secrets.yaml' for specified name
                            vars_pov = c("SAEPOVRTALL_PT"), 
                            vars_pop = c('AGEGROUP','POP'), 
                            filt_age = c(12,18))
train_data = left_join(train_data, PovPop_data, by = "location")

Rt Estimation

The Rt_projection() function combines the estimate_R() function from the EpiEstim package with simple timeseries models to project Rt, estimated over the model training period, across the forecast horizon (28 days into the future). Both the ‘raw’ Rt estimate (‘Rt_raw’) for the observation period only and the forecast values (‘Rt’) are added to the dataframe.

Forecast models include:

  • a simple ARIMA model using the forecast package (method = “arima”)
  • an order-2 random walk with noise and trend using the INLA package (method = “dlm”)

NOTE: This is an experimental function and the “dlm” method is used as an example. Models later in this demo will use the Rt_raw value to forecast concurrently with incidence estimation.

Rt_df = Rt_projection(train_data, mean_si = 5.7, std_si = 2, 
                      forecast_horizon = 28, method = "dlm")
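
For a single location, the core EpiEstim call that Rt_projection() presumably wraps looks roughly like the sketch below (illustrative only; looping over locations and forecasting are handled inside Rt_projection()):

#Minimal per-location sketch using the same serial interval settings as above
ak_incid <- train_data %>%
  filter(location_name == "Alaska", trn_tst == "train") %>%
  arrange(date) %>%
  pull(value)

rt_ak <- estimate_R(ak_incid,
                    method = "parametric_si",
                    config = make_config(list(mean_si = 5.7, std_si = 2)))
head(rt_ak$R[, c("t_start", "t_end", "Mean(R)")])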

Rt Estimates
Checking Rt_projection() results

Rt_df[1:10,] #check values
date value signal location location_name day trn_tst Region SAEPOVRTALL_PT age_pop Rt_raw Rt
2021-06-28 2 hosp 02 Alaska Monday train 38 9.6 184927 NA NA
2021-06-29 5 hosp 02 Alaska Tuesday train 38 9.6 184927 NA NA
2021-06-30 2 hosp 02 Alaska Wednesday train 38 9.6 184927 NA NA
2021-07-01 6 hosp 02 Alaska Thursday train 38 9.6 184927 NA NA
2021-07-02 3 hosp 02 Alaska Friday train 38 9.6 184927 NA NA
2021-07-03 7 hosp 02 Alaska Saturday train 38 9.6 184927 NA NA
2021-07-04 3 hosp 02 Alaska Sunday train 38 9.6 184927 NA NA
2021-07-05 2 hosp 02 Alaska Monday train 38 9.6 184927 3.295300 3.272205
2021-07-06 5 hosp 02 Alaska Tuesday train 38 9.6 184927 2.304267 2.334882
2021-07-07 4 hosp 02 Alaska Wednesday train 38 9.6 184927 1.858476 1.848652
Rt_df %>% 
  filter(trn_tst == "test") %>% 
  slice(1:10) #check values (forecast period)
date value signal location location_name day trn_tst Region SAEPOVRTALL_PT age_pop Rt_raw Rt
2021-08-24 15 hosp 02 Alaska Tuesday test 38 9.6 184927 NA 1.018219
2021-08-25 33 hosp 02 Alaska Wednesday test 38 9.6 184927 NA 1.051613
2021-08-26 18 hosp 02 Alaska Thursday test 38 9.6 184927 NA 1.085007
2021-08-27 21 hosp 02 Alaska Friday test 38 9.6 184927 NA 1.118401
2021-08-28 14 hosp 02 Alaska Saturday test 38 9.6 184927 NA 1.151794
2021-08-29 19 hosp 02 Alaska Sunday test 38 9.6 184927 NA 1.185188
2021-08-30 32 hosp 02 Alaska Monday test 38 9.6 184927 NA 1.218582
2021-08-31 22 hosp 02 Alaska Tuesday test 38 9.6 184927 NA 1.251977
2021-09-01 22 hosp 02 Alaska Wednesday test 38 9.6 184927 NA 1.285371
2021-09-02 23 hosp 02 Alaska Thursday test 38 9.6 184927 NA 1.318765

Organize Data

Model parameters, inputs, and covariates will vary from model to model and user to user, but ultimately all need to be combined in a single object that can be ingested by the inference software, INLA in this case. The code below modifies the train_data dataframe to perform any desired scaling and to add several spatial and temporal indices.

Ordered integers are used as indices to define timesteps (days, weeks, etc.), locations (‘Region’), and space*time combinations (e.g., ‘ID.Region.Wk’ in the next chunk). The modeling approach is hierarchical, so some indices may be used at multiple levels. Because each index name must be unique, some indices are copied and given slightly different names.

Once the data are organized, they are reformatted as a list object called a ‘datastack’ that is passed to the inference software. Although a dataframe might be more intuitive, a list object is used so that model inputs can be of different lengths.

Clean Dataframe

The time_index() function is used to recode a date vector to the desired timestep duration (2-day steps, 1 week steps, etc).

train_data <- as.data.frame(Rt_df) %>%
  mutate(
    s_pop = log(age_pop), #log scale
    s_pov = as.numeric(scale(SAEPOVRTALL_PT)), #some NAs present.  
    doy = as.integer(as.factor(date)),
    doy.1 = doy,
    Region.Wk = paste0("ID", Region, "W", doy), #unique Region*doy combinations ('space-time interaction')
    ID.Region.Wk = as.integer(as.factor(Region.Wk)), #convert factor levels to integer
    week = week(date),
    int_week.1 = as.integer(as.factor(week)),
    int_week.2 = int_week.1,
    int_week.3 = int_week.1,
    threeday_indx = time_index(date, seq(min(date), max(date), by = "3 days")),
    threeday_indx.1 = as.integer(as.factor(threeday_indx)),
    fourday_indx = time_index(date, seq(min(date), max(date), by = "4 days")),
    fourday_indx.1 = as.integer(as.factor(fourday_indx)),
    fiveday_indx = time_index(date, seq(min(date), max(date), by = "5 days")),
    fiveday_indx.1 = as.integer(as.factor(fiveday_indx)),
    eightday_indx = time_index(date, seq(min(date), max(date), by = "8 days")),
    eightday_indx.1 = as.integer(as.factor(eightday_indx)),
    biweek_indx = time_index(date, seq(min(date), max(date), by = "14 days")),
    biweek_indx.1 = as.integer(as.factor(biweek_indx)),
    Region.1 = Region, Region.2 = Region, Region.3 = Region, 
                Region.4 = Region, Region.5 = Region
  ) %>%
  select(c(-biweek_indx, threeday_indx, fourday_indx, fiveday_indx, eightday_indx))

head(train_data)
date value signal location location_name day trn_tst Region SAEPOVRTALL_PT age_pop Rt_raw Rt s_pop s_pov doy doy.1 Region.Wk ID.Region.Wk week int_week.1 int_week.2 int_week.3 threeday_indx threeday_indx.1 fourday_indx fourday_indx.1 fiveday_indx fiveday_indx.1 eightday_indx eightday_indx.1 biweek_indx.1 Region.1 Region.2 Region.3 Region.4 Region.5
2021-06-28 2 hosp 02 Alaska Monday train 38 9.6 184927 NA NA 12.12772 -0.8064822 1 1 ID38W1 2466 26 1 1 1 2021-06-28 1 2021-06-28 1 2021-06-28 1 2021-06-28 1 1 38 38 38 38 38
2021-06-29 5 hosp 02 Alaska Tuesday train 38 9.6 184927 NA NA 12.12772 -0.8064822 2 2 ID38W2 2477 26 1 1 1 2021-06-28 1 2021-06-28 1 2021-06-28 1 2021-06-28 1 1 38 38 38 38 38
2021-06-30 2 hosp 02 Alaska Wednesday train 38 9.6 184927 NA NA 12.12772 -0.8064822 3 3 ID38W3 2488 26 1 1 1 2021-07-01 2 2021-06-28 1 2021-06-28 1 2021-06-28 1 1 38 38 38 38 38
2021-07-01 6 hosp 02 Alaska Thursday train 38 9.6 184927 NA NA 12.12772 -0.8064822 4 4 ID38W4 2499 26 1 1 1 2021-07-01 2 2021-07-02 2 2021-07-03 2 2021-06-28 1 1 38 38 38 38 38
2021-07-02 3 hosp 02 Alaska Friday train 38 9.6 184927 NA NA 12.12772 -0.8064822 5 5 ID38W5 2510 27 2 2 2 2021-07-01 2 2021-07-02 2 2021-07-03 2 2021-06-28 1 1 38 38 38 38 38
2021-07-03 7 hosp 02 Alaska Saturday train 38 9.6 184927 NA NA 12.12772 -0.8064822 6 6 ID38W6 2521 27 2 2 2 2021-07-04 3 2021-07-02 2 2021-07-03 2 2021-07-06 2 1 38 38 38 38 38

Response Variable

The response variable may differ between models. In this case, hospital incidence (counts) is copied to a new column, the distribution is standardized, and observations for the forecast horizon are coded as NA. The scaling object is retained and wrapped in a function, obs_scale(), to transform model outputs back to the observation scale later.

Again, response variables are specific to each model setup and could be scaled differently, or not at all, and fit with a different likelihood (Poisson, NegBinomial, etc.). Keeping it simple here.

train_data$resp = ifelse(train_data$trn_tst == "train", train_data$value, NA) #Set obs value to NA for forecasts periods

resp_scale_obj = scale(train_data$resp, scale=T, center=T) #scaled object
obs_scale = function(r)r*attr(resp_scale_obj,'scaled:scale') + attr(resp_scale_obj, 'scaled:center') #transform back to observation scale

train_data$nrm_resp = as.numeric(resp_scale_obj)
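
A quick round-trip check (not part of the package) confirms that obs_scale() recovers the original counts from the scaled response:

#Round-trip check: back-transforming nrm_resp should reproduce resp (differences ~0)
obs_idx <- which(!is.na(train_data$resp))
summary(obs_scale(train_data$nrm_resp[obs_idx]) - train_data$resp[obs_idx])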

Format as a Datastack

In this example, the data could remain as a dataframe and be passed to INLA directly, but as a matter of practice, it is better to format it as a list object (datastack).

nrm.lst = list(list(intercept1 = rep(1, dim(train_data)[1])), #custom intercept
          list(pov_pct = train_data[,"s_pov"],                #desired covariates and indices below
               pop = train_data[,"s_pop"],                    #formatting as nrm.lst = list()
               Rt_raw = train_data[,"Rt_raw"],
               Rt_raw.1 = train_data[,"Rt_raw"],
               Rt = train_data[,"Rt"],
               Rt.1 = train_data[,"Rt"],
               doy = train_data[,"doy"],
               doy.1 = train_data[,"doy.1"],
               doy.2 = train_data[,"doy.1"],
               int_week.1 = train_data[,"int_week.1"],
               int_week.2 = train_data[,"int_week.2"],
               int_week.3 = train_data[,"int_week.3"],
               threeday_indx.1 = train_data[,"threeday_indx.1"],
               fourday_indx.1 = train_data[,"fourday_indx.1"],
               fiveday_indx.1 = train_data[,"fiveday_indx.1"],
               eightday_indx.1 = train_data[,"eightday_indx.1"],
               biweek_indx.1 = train_data[,"biweek_indx.1"],
               Region.1 = train_data[,"Region.1"],
               Region.2 = train_data[,"Region.2"],
               Region.3 = train_data[,"Region.3"],
               Region.4 = train_data[,"Region.4"],
               Region.5 = train_data[,"Region.5"],
               Region_Wk = train_data[,"ID.Region.Wk"],
               dow = train_data[,"day"]))

nrm.stk = inla.stack(data = list(Y = train_data$nrm_resp), #Y is response variable
                                      A = list(1,1),       #option to include matrices, not used in this case
                                effects = nrm.lst,         #list object from above
                                    tag = "nrm")           #arbitrary name to index searches later 
                                                           #(multiple datastack used in complex models)
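
The tag supplied above can later be used with inla.stack.index() to locate the rows belonging to this datastack when pulling fitted values or forecasts out of a model; a minimal illustration:

#The "nrm" tag indexes this stack's rows for later retrieval of fitted values
stk_idx <- inla.stack.index(nrm.stk, tag = "nrm")$data
length(stk_idx) #should equal nrow(train_data)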

Model Priors and Formulae

All priors and formulas for the desired models are specified below using R-INLA syntax. A detailed treatment of the syntax is beyond the scope of this workflow-focused demonstration; a brief note on the PC prior parameterization is given after the list below.

A few notes:

  • Priors
    • pc refers to “Penalizing Complexity”
    • prec is “precision”
    • cor is “correlation”
    • non-PC priors are coded using the distribution name (e.g., norm = “normal”)
  • Formulas
    • Y is the response variable in the datastack
    • f() designates a function, level, or submodel in the hierarchy
    • hyper= refers to the prior for that f()
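
For reference, the PC prior parameters used in the next chunk have tail-probability interpretations (as described in the R-INLA documentation; stated here for orientation only):

  • prior = “pc.prec” with param = c(u, alpha) encodes P(sigma > u) = alpha for a standard deviation sigma
  • prior = “pccor1” with param = c(u, alpha) encodes P(cor > u) = alpha for an AR(1) correlation
  • the bym2 mixing parameter with prior = “pc” and param = c(u, alpha) encodes P(phi < u) = alpha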

Set Priors

#bym prior
bym_hyper <- list(phi = list(prior = "pc", 
                      param = c(0.5, 2/3), 
                      initial = 3), 
               prec = list(prior = "pc.prec", 
                       param = c(1, 0.01), 
                       initial = 1.5))  
#Normal prior
norm.prior <- list(theta=list(prior = "normal", 
                              param=c(0, 1)))


#iid prior
pc_prec_iid <- list(theta = list(prior="pc.prec", 
                                 param=c(0.5, 0.01)))

#ar1 prior
pc_cor_ar1 <- list(theta = list(prior = 'pccor1', 
                                param = c(0.5, 0.9)))

#rw2 prior
pc_rw2 <- list(prec=list(prior="pc.prec", 
                         param=c(0.5,0.01)))

#bundle priors to archive run
priors.list <- list()
priors.list[["bym_hyper"]] <- bym_hyper
priors.list[["norm.prior"]] <- norm.prior
priors.list[["pc_prec_iid"]] <- pc_prec_iid
priors.list[["pc_cor_ar1"]] <- pc_cor_ar1
priors.list[["pc_rw2"]] <- pc_rw2

Specify Formulas

Formula 1: Random Walk plus noise for each location (i.e., state)

Frm.1 = Y ~ -1 +     #remove default intercept
  intercept1 +       #custom intercept
  f(doy.1,           #order by time index (daily)
    constr=TRUE,     #enforced zero mean
    model="rw1",     #order-1 random walk with noise
    scale.model = TRUE, #additional internal scaling
    group = Region.1, #run rw1 model for groups based on location 
    control.group=list(model="iid"), #groups are treated independently
    hyper=pc_rw2)  #prior for rw2

Formula 2: Random Walk plus noise and trend for each location

Frm.2 = Y ~ -1 +     
  intercept1 +       
  f(doy.1,           
    constr=TRUE,     
    model="rw1",    
    scale.model = TRUE, 
    group = Region.1, 
    control.group=list(model="iid"), 
    hyper=pc_rw2) + 
  f(doy.2, model="linear", mean.linear = 0, prec.linear = 0.001) #add linear trend to rw1

Formula 3: A common spatial effect across timesteps, but each location has its own autoregression

Frm.3 = Y ~ -1 +    
  intercept1 +       
  f(Region.1,        #location index
    model="bym2",    #spatial effect, Besag-York-Mollie model (the 2 indicates scaling) 
    graph=J,         #Adjacency graph to identify neighbors
    constr=TRUE,     #enforced zero mean
    hyper=bym_hyper) + #BYM prior
  f(doy.1,             #order by time index (daily)
    model="ar1",       #apply order-1 autoregressive
    constr=TRUE,
    group = Region.1,  #run ar1 model for groups based on location
    control.group=list(model="iid"), #groups are treated independently
    hyper=pc_cor_ar1) 

Formula 4: Separate spatial effect for each timestep (related by ar1) and each location has its own autoregressive term.

Frm.4 = Y ~ -1 +     
  intercept1 +       
  f(Region.1,        
    model="bym2",   
    graph=J,         
    constr=TRUE,     
    group = doy,     #time index, daily (create separate realizations of spatial covariate for each day)
    control.group=list(model="ar1"), #groups are related via an order-1 autoregressive
    hyper=bym_hyper) + #prior for BYM
  f(doy.1,             
    model="ar1",       
    constr=TRUE,
    group = Region.1,  
    control.group=list(model="iid"), 
    hyper=pc_cor_ar1) 

Formula 5: As Formula 4 but with a space-time interaction to capture location- and time-specific variation outside of the modeled trends.

Frm.5 = Y ~ -1 +    
  intercept1 +       
  f(Region.1,        
    model="bym2",    
    graph=J,         
    constr=TRUE,    
    group = doy,     
    control.group=list(model="ar1"),
    hyper=bym_hyper) +
  f(doy.1,             
    model="ar1",       
    constr=TRUE,
    group = Region.1,  
    control.group=list(model="iid"), 
    hyper=pc_cor_ar1) +
  f(Region_Wk,   #Index for all location*time combinations (space-time interaction)
    model="iid", #each location and time combination considered independently
    constr=TRUE,
    hyper=pc_prec_iid) 

Formula 6: As Formula 5 but adding a covariate for variation due to day of week (e.g., Monday, Tuesday, ... Sunday).

Frm.6 = Y ~ -1 +    
  intercept1 +       
  f(Region.1,        
    model="bym2",    
    graph=J,         
    constr=TRUE,     
    group = doy,     
    hyper=bym_hyper, 
    control.group=list(model="ar1")) +
  f(doy.1,             
    model="ar1",       
    constr=TRUE,
    group = Region.1,  
    control.group=list(model="iid"), 
    hyper=pc_cor_ar1) +
  f(dow,           #discrete variable indicating day of week, e.g. Monday, Tuesday,...Sunday
    constr=TRUE,
    model="iid",   #days of week may vary independently
    group = Region.2, #variation attributed to days of week may differ by location
    control.group=list(model="iid"), 
    hyper=pc_prec_iid) +
  f(Region_Wk,   
    model="iid", 
    constr=TRUE,
    hyper=pc_prec_iid) 

Formula 7: Includes Rt estimates as an experimental covariate. The Rt trend estimated over the observation (training) period is projected into the future (28 days) using an autoregressive model.

Frm.7 = Y ~ -1 +     
  intercept1 +       
  pov_pct + pop +    
  f(Region.1,        
    model="bym2",    
    graph=J,        
    constr=TRUE,     
    group = doy,     
    hyper=bym_hyper, 
    control.group=list(model="ar1")) + 
  f(doy.1, Rt_raw,  #order by time index (daily) but weight each timestep by corresponding Rt_raw estimate
    model="ar1",    #apply order-1 autoregressive to Rt weighted time index above
    constr=TRUE,
    group = Region.2, 
    control.group=list(model="iid"),
    hyper=pc_cor_ar1) +
  f(dow,           
    constr=TRUE,
    model="iid",
    group = Region.3,
    control.group=list(model="iid"),
    hyper=pc_prec_iid) +
  f(Region_Wk,   
    model="iid", 
    constr=TRUE,
    hyper=pc_prec_iid) 

Formula 8: As with Formula 7 but adding a random walk at a coarser time scale (3-day steps) to reduce forecast decay.

Frm.8 = Y ~ -1 +     
  intercept1 +       
  pov_pct + pop +    #linear covariates for poverty and population over 55yrs
  f(Region.1,        
    model="bym2",    
    graph=J,        
    constr=TRUE,     
    group = doy,    
    hyper=bym_hyper, 
    control.group=list(model="ar1")) + 
  f(threeday_indx.1, #time index, 3days
    constr=TRUE,
    model="rw2",     #order-2 random walk with noise
    scale.model = TRUE,
    group = Region.2,
    control.group=list(model="iid"), 
    hyper=pc_rw2) + 
  f(doy.1, Rt_raw,  
    model="ar1",    
    constr=TRUE,
    group = Region.3, 
    control.group=list(model="iid"),
    hyper=pc_cor_ar1) +
  f(dow,           
    constr=TRUE,
    model="iid",
    group = Region.4,
    control.group=list(model="iid"),
    hyper=pc_prec_iid) +
  f(Region_Wk,   
    model="iid", 
    constr=TRUE,
    hyper=pc_prec_iid) 

Organize Formulas

formulas.list <- list()
formulas.list[["base_rw1"]] <- Frm.1
formulas.list[["rw1_trend"]] <- Frm.2
formulas.list[["base_car"]] <- Frm.3
formulas.list[["car_time"]] <- Frm.4
formulas.list[["car_sti"]] <- Frm.5
formulas.list[["car_wdays"]] <- Frm.6
formulas.list[["car_rt"]] <- Frm.7
formulas.list[["car_full"]] <- Frm.8

Run Models

The run_model_list() function runs a series of models as specified in formulas.list, using the input data organized as a datastack (nrm.stk). The function runs each model sequentially and writes the executed models (models_out), formulas (formulas.list), priors (priors.list), datastack, and original dataframe (train_data) to an .RData file in the analysis directory. The models_out object is also available in the environment for further processing.

There are many customization options for inference, but run_model_list() has been kept fairly simple for ease of use and efficiency.

Additional run_model_list() options:

  • likelihood
    • If one likelihood is provided, it is applied to all models
    • A vector of likelihoods can be provided, ordered to match formulas.list
    • e.g., myFamilies <- c(“gaussian”, “binomial”, “zeroinflatednbinomial”, …)
  • config
    • indicates whether the latent field (GMRF) should be retained for sampling
    • config=TRUE can be time intensive and dramatically slow model runs
    • CovidCAR has a post_sampling() function to facilitate sampling
  • verbose prints INLA algorithm progress to the screen during model runs
  • archive indicates whether to save model inputs and results to the analysis directory
    • model outputs will be written to a run_archive subdirectory

Run All Models

formulas.list = formulas.list[c(1:4)] #short list for demo, fast run models

models_out = run_model_list(formulas.list=formulas.list,
                            dataStack=nrm.stk,
                            likelihood = "gaussian",
                            config=FALSE, verbose = FALSE, archive=TRUE)

Extract and Format Forecasts

The extract_forecasts() function pulls forecasts from models and saves them to a forecasts folder in the analysis directory. The function returns a forecast_paths list object to the environment with the file path names.

Forecasts are formatted to the specifications required for submission to the covid19-forecast-hub.

#models_out[["full_mod"]] <- full_model
extract_forecasts(mod_out=models_out,
                  dataStack=nrm.stk, train_data=train_data,
                  team = "CFA")
ℹ Writing model forecasts to analysis directory: 
✔ base_rw1 [871ms]
ℹ Writing model forecasts to analysis directory: 
✔ rw1_trend [964ms]
ℹ Writing model forecasts to analysis directory: 
✔ base_car [900ms]
ℹ Writing model forecasts to analysis directory: 
✔ car_time [855ms]
#check returned object
names(forecast_paths)
[1] "base_rw1"  "rw1_trend" "base_car"  "car_time" 
head(read.csv(forecast_paths[["rw1_trend"]])) #formatted for submission
forecast_date location target target_end_date type quantile value
2021-08-23 02 0 day ahead inc hosp 2021-08-23 quantile 0.01 18.04364
2021-08-23 02 1 day ahead inc hosp 2021-08-24 quantile 0.01 0.00000
2021-08-23 02 2 day ahead inc hosp 2021-08-25 quantile 0.01 0.00000
2021-08-23 02 3 day ahead inc hosp 2021-08-26 quantile 0.01 15.77363
2021-08-23 02 4 day ahead inc hosp 2021-08-27 quantile 0.01 48.06863
2021-08-23 02 5 day ahead inc hosp 2021-08-28 quantile 0.01 83.54071

View forecasts
extract_forecasts() also returns a list (‘plot_paths’) of paths to plotting data from the extraction process. These data can be accessed with the plot_location() function, which provides a quick diagnostic plot of forecasts for a specific location.

Note of Caution: If the ‘loc=’ option is left as NULL, all locations will be plotted to a PDF file saved in the ‘Reports’ folder of the analysis directory. This may be time consuming!

OK_plot = plot_location(plot_path = plot_paths, model = "base_car", loc = "Oklahoma")
OK_plot

AR_plot = plot_location(plot_path = plot_paths, model = "base_car", loc = "Arkansas")
AR_plot

#plot_location(plot_path = plot_paths, model = "full_mod") #plots and saves all locations to a pdf file

Model Scoring

WIS Scores

The score_WIS() function calculates the weighted interval score (WIS) for forecasts by model. Optional arguments indicate whether files should be read from a directory (ingest = “path”), from a dataframe in the environment (ingest = “dataframe”), or from a list object with individual file paths (ingest = “list”) as returned by extract_forecasts().

The ‘missing’ option specifies how missing observation data should be handled: ‘remove’ them from the data or fill with ‘zero’.
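
For reference, WIS here is the weighted interval score used by the COVID-19 Forecast Hub (Bracher et al. 2021). For a predictive median m and K central prediction intervals (l_k, u_k) with nominal levels 1 - alpha_k, it is commonly defined as below; score_WIS() may differ in minor conventions (e.g., how scores are aggregated across horizons).

$$\mathrm{IS}_{\alpha}(l,u;y) = (u-l) + \tfrac{2}{\alpha}\,(l-y)\,\mathbf{1}\{y<l\} + \tfrac{2}{\alpha}\,(y-u)\,\mathbf{1}\{y>u\}$$

$$\mathrm{WIS}(y) = \frac{1}{K+1/2}\left(\tfrac{1}{2}\,\lvert y-m\rvert + \sum_{k=1}^{K}\tfrac{\alpha_k}{2}\,\mathrm{IS}_{\alpha_k}(l_k,u_k;y)\right)$$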

my_truth <- train_data %>% #Caution: my_truth may be different than your truth :)
  select(date, location, value)

my_scores <- score_WIS(forecast_data = forecast_paths, truth=my_truth, 
                       ingest = "list", missing = "remove") 
→ A total of 2668 predictions weren't evalauted due lack of truth data
head(my_scores)
model date location_name forecast_date WIS
base_car 2021-08-23 01 2021-08-23 0.5471048
base_car 2021-08-23 02 2021-08-23 0.5186350
base_car 2021-08-23 04 2021-08-23 0.5279331
base_car 2021-08-23 05 2021-08-23 0.5578004
base_car 2021-08-23 06 2021-08-23 0.5507474
base_car 2021-08-23 08 2021-08-23 0.5235968
#overall
wis_rank <- my_scores %>%
  group_by(model) %>%
  summarise(mean_wis = mean(WIS)) %>%
  arrange(mean_wis) %>%
  mutate(wisRank = row_number())

wis_rank #mean absolute values
model mean_wis wisRank
base_car 34.41192 1
car_time 34.46840 2
base_rw1 280.88099 3
rw1_trend 307.12383 4

Diagnostic score plots
The plot_WIS_lines() function has options to make quick plots of WIS scores returned by score_WIS().

unique(my_scores$model)
[1] "base_car"  "base_rw1"  "car_time"  "rw1_trend"
#lines showing absolute WIS
plot_WIS_lines(my_scores, by = "date", range = "abs")

#lines showing scaled WIS
plot_WIS_lines(my_scores, by = "date", range = "scaled", 
               scale_model = "base_rw1")

#optional 'limit' that recodes: (WIS >= limit) -> limit
plot_WIS_lines(my_scores, by = "date", range = "scaled", 
               scale_model = "base_rw1", limit = 2)

#tile option
plot_WIS_lines(my_scores, by = "tile", range = "scaled", 
               scale_model = "base_rw1", limit = 2)
Warning: Removed 432 rows containing missing values (`geom_tile()`).

Mean Absolute Error (MAE)

The score_MAE() function works comparably to score_WIS() but provides a simpler measure of model performance, as it is based only on the point estimates from forecasts.
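
As a reminder of the metric itself (a generic sketch, not the internals of score_MAE()), MAE is simply the mean absolute difference between point forecasts and observations:

#Generic MAE for point forecasts yhat against observations y (illustrative only;
#score_MAE() aligns forecast and truth data itself and also reports MAPE)
mae <- function(y, yhat) mean(abs(y - yhat), na.rm = TRUE)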

my_mae <- score_MAE(forecast_data = forecast_paths, truth=my_truth, ingest = "list", missing = "remove") 
→ A total of 116 forecasts weren't evaluatd due lack of truth data
my_mae
model MAE MAPE maeRank
base_car 41.2 0.01243 1
car_time 42.0 0.01266 2
base_rw1 325.6 0.09818 3
rw1_trend 353.3 0.10652 4

Ensemble

The propose_weights() function assists with ensemble building by weighting models using a given performance metric. The function scales the raw model comparison metric and then builds the ensemble by multiplying each forecast by its model-specific weight and summing across all included models. The resulting ensemble forecast is then standardized for Covid19-hub submission and written to the analysis directory (./forecasts). The function returns the estimated weights to the environment.

For example, the WIS and MAE scores could be used to weight individual models in an ensemble.

First, compare WIS and MAE scores:

mod_rank <- left_join(my_mae, wis_rank, by="model") #combine with overall WIS

mod_rank #compare how the two scores rank the models
model MAE MAPE maeRank mean_wis wisRank
base_car 41.2 0.01243 1 34.41192 1
car_time 42.0 0.01266 2 34.46840 2
base_rw1 325.6 0.09818 3 280.88099 3
rw1_trend 353.3 0.10652 4 307.12383 4

The propose_weights() function can then be applied to generate weights and write the resulting ensemble.

my_wis_weights <- propose_weights(forecast_data = forecast_paths, #standardized model forecasts
                             ingest = "list",                     #read model locations as list
                             rank_df = mod_rank,                  #use the data from WIS and MAE scoring         
                             rankCol = "mean_wis",                #weight models based on this column
                             team = "CFA",                        #team name (need be in file name per Covid19hub)
                             mod_name = "wis_ensemble")           #label for the new ensemble forecast
→ Writing ensemble 'wis_ensemble' to analysis directory
my_wis_weights #The weights column reports the actual weights calculated for each model
model MAE MAPE maeRank mean_wis wisRank mean_wis_weights
base_car 41.2 0.01243 1 34.41192 1 0.3158712
car_time 42.0 0.01266 2 34.46840 2 0.3158425
base_rw1 325.6 0.09818 3 280.88099 3 0.1908016
rw1_trend 353.3 0.10652 4 307.12383 4 0.1774848
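
The weights printed above appear to follow w = (1 - s / sum(s)) / (M - 1), where s is a model’s mean WIS and M is the number of models retained; this is inferred from the printed output rather than from the package source, and can be checked directly:

#Check of the apparent weighting scheme (inferred from the output above, not from package code)
s <- wis_rank$mean_wis
w <- (1 - s / sum(s)) / (length(s) - 1)
data.frame(model = wis_rank$model, check_weight = round(w, 7))
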
#another example,this time using MAE and including a 'drop' option
my_mae_weights <- propose_weights(forecast_data = forecast_paths, 
                             ingest = "list",
                             rank_df = mod_rank, 
                             rankCol = "MAE",
                             drop = 1,  #number of lowest ranked models to drop/exclude from ensemble
                             team = "CFA",
                             mod_name = "mae_ensemble")
→ Dropped models: rw1_trend
→ Writing ensemble 'mae_ensemble' to analysis directory
my_mae_weights
model MAE MAPE maeRank mean_wis wisRank MAE_weights
base_car 41.2 0.01243 1 34.41192 1 0.4496086
car_time 42.0 0.01266 2 34.46840 2 0.4486301
base_rw1 325.6 0.09818 3 280.88099 3 0.1017613
#yet another example, not providing a rankCol -> function assumes equal weighting 
equal_mae_weights <- propose_weights(forecast_data = forecast_paths, 
                             ingest = "list",
                             rank_df = mod_rank, 
                             #rankCol = NULL,
                             drop = 1,  #issues warning, dropping models without a ranking criteria
                             team = "CFA",
                             mod_name = "equal_mae_ensemble")
✖ No rankCol but dropping models? Models will be dropped from end of list!
! No rankCol provided, ensemble assumes equal weighting
→ Dropped models: rw1_trend
→ Writing ensemble 'equal_mae_ensemble' to analysis directory
equal_mae_weights
model MAE MAPE maeRank mean_wis wisRank rankCol rankCol_weights
base_car 41.2 0.01243 1 34.41192 1 1 0.3333333
car_time 42.0 0.01266 2 34.46840 2 1 0.3333333
base_rw1 325.6 0.09818 3 280.88099 3 1 0.3333333

Ensemble Re-scoring

Now that new ensembles have been added to the ‘forecasts’ directory, comparison scores can be recalculated.

myDir <- file.path(su_yaml$out_dir_name, "forecasts")
new_mae <- score_MAE(forecast_data = myDir, truth=my_truth, ingest = "path", missing = "remove") 
→ A total of 203 forecasts weren't evaluatd due lack of truth data
new_mae
model MAE MAPE maeRank
base_car 41.2 0.01243 1
car_time 42.0 0.01266 2
mae_ensemble 50.8 0.01532 3
equal_mae_ensemble 105.5 0.03180 4
wis_ensemble 120.6 0.03638 5
base_rw1 325.6 0.09818 6
rw1_trend 353.3 0.10652 7

Historic Forecasts

The get_hub_forecasts() function retrieves forecasts previously submitted to the covid19-forecast-hub.

Similar to get_covid19_obs(), there are options to load from a local parquet cache (source=“cache”) as indexed by the Covid19Forecasts package (private repo) or to load “test” data included with the package. There is also an option to use the covidHubUtils package to download data directly from the covid19-forecast-hub.

The queried results can also be filtered to specific models using the ‘models=’ option. If not set, the ‘models=’ option defaults to forecasts from the COVIDhub-trained_ensemble, COVIDhub-ensemble, and COVIDhub-baseline models.

By default, get_hub_forecasts() returns forecasts for the forecast period specified during initial setup using setup_analysis().

hist_forecasts <- get_hub_forecasts(source = "covidHubUtils",
                                    models = c("COVIDhub-trained_ensemble", "COVIDhub-baseline", "COVIDhub-ensemble"),
                                    write_copy = TRUE)
→ Loading location crosswalk
Rows: 57 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "|"
chr (4): STATE, STUSAB, STATE_NAME, STATENS

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
→ Fetching COVID-19 forecasts using covidHubUtils

get_token(): POST: https://zoltardata.com/api-token-auth/

get_resource(): GET: https://zoltardata.com/api/projects/

get_resource(): GET: https://zoltardata.com/api/project/44/models/

get_resource(): GET: https://zoltardata.com/api/project/44/timezeros/

→ Writing forecast data to analysis directory
dim(hist_forecasts)
[1] 103032      7
head(hist_forecasts)
model forecast_date location target_end_date type quantile value
COVIDhub-trained_ensemble 2021-08-23 01 2021-08-24 point NA 495
COVIDhub-trained_ensemble 2021-08-23 01 2021-08-24 quantile 0.010 369
COVIDhub-trained_ensemble 2021-08-23 01 2021-08-24 quantile 0.025 384
COVIDhub-trained_ensemble 2021-08-23 01 2021-08-24 quantile 0.050 395
COVIDhub-trained_ensemble 2021-08-23 01 2021-08-24 quantile 0.100 404
COVIDhub-trained_ensemble 2021-08-23 01 2021-08-24 quantile 0.150 412
#because the data are formatted to the same standard, the scoring functions can read them directly
hub_rank <- score_WIS(forecast_data = hist_forecasts, truth=my_truth, 
                      ingest = "dataframe", missing = "remove")

hub_rank %>%
  group_by(model) %>%
  summarise(mean_wis = mean(WIS))
model mean_wis
COVIDhub-baseline 45.61759
COVIDhub-ensemble 50.61499
COVIDhub-trained_ensemble 68.17765

Combine with CovidCAR models
The plot_forecasts_compare() function combines plot_WIS_lines() and score_WIS() to make a model WIS comparison line plot. The ‘hub_forecasts’ option facilitates direct use of imported historical forecast data from the Covid19-forecast-hub.

my_plot <- plot_forecasts_compare(forecast_data = myDir, truth=my_truth, ingest = "path",
                                  hub_forecasts = hist_forecasts,
                                  scale_model = "COVIDhub-baseline",
                                  limit = 4,
                                  missing = "remove",
                                  write_copy = TRUE)
→ A total of 4669 predictions weren't evalauted due lack of truth data
→ Writing comparison scores to analysis directory
class(my_plot)
[1] "gg"     "ggplot"
my_plot

Citation

BibTeX citation:
@online{2023,
  author = {},
  title = {CovidCAR {Package} {Overview}},
  date = {2023-05-18},
  url = {https://jmhumphreys.github.io//projects/ccintro/ccintro.html},
  langid = {en}
}
For attribution, please cite this work as:
“CovidCAR Package Overview.” 2023. May 18, 2023. https://jmhumphreys.github.io//projects/ccintro/ccintro.html.