In this document, we illustrate how to apply various modeling methods to analyze the effect of an environmental mixture on a survival outcome, also known as a time-to-event outcome. Additional information on environmental mixtures and the methods demonstrated here can be found in Mayer et al. 2023.

Overview of mixtures analysis

Humans are constantly exposed to various environmental factors. Understanding how these exposures interact to affect health outcomes, rather than studying each one separately, can offer deeper insights. However, modeling these mixtures presents challenges:

  • Complex relationships: Exposures are typically measured continuously and often have nonlinear, interactive effects on outcomes. For example, an exposure might have no effect at low levels, but become toxic at high levels, or only exhibit toxicity in combination with another exposure.
  • High correlation among exposures: Components of a mixture are often correlated due to shared sources (e.g., multiple pollutants in water), leading to issues like multicollinearity.
  • High dimensionality: The number of exposures in a mixture can be vast, making it computationally intensive to model large exposure sets and reducing statistical power to detect effects.

These challenges are further complicated when the outcome is survival time, a form of analysis for which many modeling methods are not well suited. In the following, we explore available modeling approaches for mixture analysis in the context of survival-time outcomes.

Data simulation

To illustrate how to apply the different methods to mixtures data, we begin by simulating some data. The data generation process can be found in data generation.

library(tidyverse) 
library(corrplot)
library(rsample) #for bootstrapping
library(survival)
setwd("/Volumes/Extreme SSD/survival-mixture-analysis")
sim_survival <- readRDS("./sim_data/sim_survival.Rds")

sim_surv_df <- cbind(lst(t_obs = sim_survival$t_obs),
                     lst(event = sim_survival$event),
                     as.data.frame(sim_survival$A))
head(sim_surv_df)
##        t_obs event         A1         A2         A3         A4         A5
## 1 0.07446581     0  0.4135089 -0.5944527 -0.8096157 -0.5541695 -1.0115614
## 2 0.66082410     1 -1.4181004 -1.4298398 -0.2309133 -0.3124463 -0.2397462
## 3 4.35809486     0  0.8432820  0.2765033  0.5813349  1.3007974  1.2237475
## 4 2.44517878     1  0.5248145  0.5858638  0.2235701  1.4008136  0.3704268
## 5 0.42520927     1  1.6828272  1.4439944  1.5123399  0.7281761  2.3005869
## 6 0.03121801     1 -0.8499007 -1.0331179  0.3965195 -0.2793215 -1.0853986
##           A6         A7          A8         A9         A10
## 1  0.3314428  0.1792858  0.61754085 -0.5871062 -0.33002096
## 2 -0.1843424  1.0303817 -0.27295057  0.8662922  1.38327822
## 3  0.7839420 -0.1241911  0.31629865  0.9622082 -0.33539220
## 4  0.8433271  1.9342665 -0.06862079  0.7841573 -0.90592414
## 5  2.4458749  1.7142845  1.47959355  1.6374398  1.03932034
## 6 -0.8947021 -0.6172477  1.07830102 -0.7533066  0.03489254
# Assess correlations
corrplot(cor(sim_survival$A),
         method = "number",
         type = "lower",
         number.digits = 1)

The correlation plot shows us that the mixture is made up of 10 components with correlations ranging from low (0.1) to high (0.9). Let’s explore the true relationships of the exposures with the outcome:

The data are generated such that 5 mixture components (\(A1\), \(A4\), \(A5\), \(A6\), and \(A7\)) have nonlinear and nonadditive effects on the outcome, while the other 5 components have no effect.

Exposures in the mixtures are also generated to have interactions with one another on the outcome, as shown below:

We see that the slope of the exposure-response curve of exposure \(A1\) varies at different quantiles of exposure \(A4\), indicating that the effect of \(A1\) on the outcome varies depending on the concentrations of exposure \(A4\).

Let’s get into how we can model these continuous, correlated exposures and their complex relationships with the survival outcome!

Quantities of interest

In order to quantify the association of a mixture with a survival-time outcome, we need to define how we would like to measure said association. There are many different quantities one can use to quantify a mixture effect; here we show how to estimate three.

First, we estimate the exposure-response curve for an individual exposure: the estimated survival probability at a specified time over a range of concentrations of that exposure, adjusting for all other exposures in the model by holding them at their medians.

Next, we turn to point estimates and consider quantities that describe the relationship of the complete mixture with the outcome rather than that of an individual exposure. To do so, we estimate two quantities: the hazard ratio (HR) and the survival probability difference (SPD). The hazard function is the instantaneous probability of experiencing an event at time \(t\) given that a person has not experienced an event before time \(t\). The survival function is the probability of surviving beyond time \(t\). We consider the HR and SPD comparing the hazards and the survival probabilities for an interquartile range (IQR) change in each exposure, i.e., when each individual exposure is set to its \(75^{th}\) percentile compared to its \(25^{th}\) percentile. The HR represents the excess hazard associated with being exposed to higher concentrations of all exposures compared to lower concentrations. Similarly, the SPD indicates the decrease in the probability of the event not occurring when exposed to higher levels of all exposures.
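
Written out, with \(\mathbf{a}_{25}\) and \(\mathbf{a}_{75}\) denoting the exposure vectors with every component set to its \(25^{th}\) and \(75^{th}\) percentile, respectively, these two estimands are

\[HR(t) = \frac{\lambda(t; \mathbf{a}_{75})}{\lambda(t; \mathbf{a}_{25})}, \qquad SPD(t) = S(t \mid \mathbf{a}_{75}) - S(t \mid \mathbf{a}_{25})\]

where \(\lambda\) denotes the hazard function and \(S\) the survival function just described.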

When estimating the SPD or when estimating the HR using modeling methods that don’t assume proportional hazards, these quantities may vary depending on time, \(T = t\). Thus, a time, \(t_{spec}\), must be specified. The choice of \(t_{spec}\) should be motivated by contextual relevance. For demonstrative purposes, we considered the \(80^{th}\) percentile of the observed total follow-up time, calculated using observations’ time to either event or censoring.
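
As a minimal sketch of this choice (the model-specific code below recomputes the same quantity where needed), the \(80^{th}\) percentile of the observed follow-up time can be obtained directly from the simulated data:

# 80th percentile of observed follow-up time (time to event or censoring)
t_spec <- quantile(sim_surv_df$t_obs, 0.8)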

We show how to estimate these three quantities with different modeling approaches below.

Cox proportional hazards (PH) based approaches

Cox PH and its extensions are commonly used for survival outcomes. The simultaneous effect of several variables on the survival outcome at a specified time, \(t\), is estimated by assuming the hazard function (\(\lambda\)) has the form

\[\lambda(t; \mathbf{a}) = \lambda_0 (t) \exp \{f(\mathbf{a})\}\]

where \(\lambda_0(t)\) is the baseline hazard function, or the hazard at time \(t\) when all exposures are equal to 0. While \(\lambda_0 (t)\) can vary over time, the hazard of the event given one exposure profile, \(\mathbf{a}_i\), is a constant multiple of the hazard given another profile, \(\mathbf{a}_{i^*}\), where \(i \neq i^*\). This is referred to as the proportional hazards assumption: the ratio of the hazards for any two exposure profiles is constant over time, and thus the hazard ratio is independent of \(t\).
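
This follows directly from the form of the hazard above, since the baseline hazard cancels in the ratio:

\[\frac{\lambda(t; \mathbf{a}_i)}{\lambda(t; \mathbf{a}_{i^*})} = \frac{\lambda_0(t) \exp\{f(\mathbf{a}_i)\}}{\lambda_0(t) \exp\{f(\mathbf{a}_{i^*})\}} = \exp\{f(\mathbf{a}_i) - f(\mathbf{a}_{i^*})\}\]

which does not involve \(t\).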

Traditional Cox proportional hazards model (Cox PH)

For the Cox PH model, the model assumes a linear relationship between the log-hazard of an event and the exposures, such that the hazard function has the form:

\[\lambda(t;\mathbf{a}) = \lambda_0 (t) \exp\Big\{\sum_{j=1}^J \beta_j a_j \Big\}\]

We fit this model in R and use it to estimate our quantities of interest (HR and SPD):

# create cox model
cox_mod <- coxph(Surv(t_obs, event) ~ ., data = sim_surv_df)

# Estimate HR (independent of time)

### Create vectors of needed quantiles of all exposures
mix_25 <- apply(sim_survival$A, 2, quantile, 0.25)
mix_25
##         A1         A2         A3         A4         A5         A6         A7 
## -0.5496136 -0.5865948 -0.4792235 -0.5381036 -0.6849604 -0.5863576 -0.6045356 
##         A8         A9        A10 
## -0.6685006 -0.5181123 -0.5003354
mix_75 <- apply(sim_survival$A, 2, quantile, 0.75)
mix_75
##        A1        A2        A3        A4        A5        A6        A7        A8 
## 0.5499913 0.5828373 0.4845958 0.5318476 0.6965129 0.6082973 0.6367786 0.7944992 
##        A9       A10 
## 0.6252047 0.5073894
### Predict HR
cox_hr_mix <- predict(cox_mod, as.data.frame(t(mix_75)), "risk")/
  predict(cox_mod, as.data.frame(t(mix_25)), "risk")

# Estimate SPD at 80th quantile of time

### Get survival probabilities over time given exposure quantiles
cox_surv_25 <- survfit(cox_mod, newdata = mix_25)
cox_surv_75 <- survfit(cox_mod, newdata = mix_75)

### Predict SPD
cox_spd_mix <- cox_surv_75$surv[length(cox_surv_75$surv)*0.8] -
  cox_surv_25$surv[length(cox_surv_25$surv)*0.8]

#### Bootstrap to estimate confidence interval

# Create function to create ratio of predicted hazards using models from boots 
# when mixture values are at q1 vs q2
cox.ratio <- function(boots, q1, q2){
  bstar = NULL 
  n = dim(boots)[1]; 
  
  for (mod in 1:n) {
    quant1 = predict(boots[[2]][[mod]], q1, "risk")
    quant2 = predict(boots[[2]][[mod]], q2, "risk")
    ratio = data_frame(ratio = (quant2/quant1))
    bstar = rbind(bstar, ratio)
  } # Next draw
  return(bstar)
}

# Create function to create difference of predicted survival probs 
# using models from boots 
# when mixture values are at q1 vs q2 at quantile of time t
cox.diff<- function(boots, q1, q2, t){
  bstar = NULL 
  n = dim(boots)[1]; 
  
  for (mod in 1:n) {
    survfit1 = survfit(boots[[2]][[mod]], newdata = q1)
    survfit2 = survfit(boots[[2]][[mod]], newdata = q2)
    
    quant1 = survfit1$surv[length(survfit1$surv)*t]
    quant2 = survfit2$surv[length(survfit2$surv)*t]
    
    diff = data_frame(diff = (quant2 - quant1))
    bstar = rbind(bstar, diff)
  } # Next draw
  return(bstar)
}

#Create function to run Cox model on new dataset
cox.model.fnct <- function(data) return(coxph(Surv(t_obs, event) ~ ., data = data))

## Bootstrap data and create model for each bootstrap dataset
set.seed(2020)
cox.boot <- sim_surv_df %>%
  bootstraps(times = 1000) %>% 
  rowwise() %>% 
  mutate(data_sample=(list(analysis(splits)))) %>%
  select(id, data_sample) %>%
  ungroup() %>% 
  mutate(models = map(data_sample, cox.model.fnct)) %>%
  select(-data_sample)

## Estimate HR for each bootstrap model
cox.hr.boot <- cox.ratio(cox.boot, 
                         as.data.frame(t(mix_25)), 
                         as.data.frame(t(mix_75)))

## Form HR CI using 2.5th and 97.5th quantiles of bootstrap estimates
cox.hr.ll <- quantile(cox.hr.boot$ratio, 0.025)
cox.hr.ul <- quantile(cox.hr.boot$ratio, 0.975)


## Estimate SPD for each bootstrap model at 80th quantile of time
cox.spd.boot <- cox.diff(cox.boot, 
                         as.data.frame(t(mix_25)), 
                         as.data.frame(t(mix_75)),
                         0.8)

## Form SPD CI using 2.5th and 97.5th quantiles of bootstrap estimates
cox.spd.ll <- quantile(cox.spd.boot$diff, 0.025)
cox.spd.ul <- quantile(cox.spd.boot$diff, 0.975)

Let’s explore how Cox PH models an individual exposure (\(A7\)) through its exposure-response function:

## Get true survival curve from simulated data
setwd("/Volumes/Extreme SSD/survival-mixture-analysis")
true_response.curve_A7 <- readRDS("sim_data/A7_outcome_true.Rds")

## Get grid of values of individual exposure, all other exposures at median
### median for all exposures
med_vals <- apply(sim_survival$A, 2, median)
med_vals_rep <- t(matrix(rep(med_vals, 100), ncol = 100))

## quantiles of A7 while other exposures are held constant at their medians
A7_seq <- seq(min(sim_surv_df$A7), max(sim_surv_df$A7), length = 100)
quantsA7 <- cbind(med_vals_rep[,1:6], 
                  A7_seq, 
                  med_vals_rep[,8:10])
colnames(quantsA7) <- names(med_vals)
head(quantsA7)
##              A1         A2           A3         A4          A5          A6
## [1,] 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367 -0.01109992
## [2,] 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367 -0.01109992
## [3,] 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367 -0.01109992
## [4,] 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367 -0.01109992
## [5,] 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367 -0.01109992
## [6,] 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367 -0.01109992
##             A7         A8         A9        A10
## [1,] -2.617565 0.09136669 0.03405408 0.02159977
## [2,] -2.562197 0.09136669 0.03405408 0.02159977
## [3,] -2.506829 0.09136669 0.03405408 0.02159977
## [4,] -2.451460 0.09136669 0.03405408 0.02159977
## [5,] -2.396092 0.09136669 0.03405408 0.02159977
## [6,] -2.340724 0.09136669 0.03405408 0.02159977
## Estimate exposure-response curve for A7
### Predict survival probability across grid of time and values of A7
Cox_pred <- survfit(cox_mod, newdata = as.data.frame(quantsA7))

### Use surv. probs. at 80th percentile of time over grid of A7 values
surv_fits <- bind_rows(bind_cols(Model = "Truth", 
                                 A7 = quantsA7[,"A7"], 
                                 surv = true_response.curve_A7),
          bind_cols(Model = "Cox PH", 
                    A7 = quantsA7[,"A7"], 
                    surv = Cox_pred$surv[0.8*length(Cox_pred$time),]))


## Plot curves
surv_fits %>% 
   mutate(Model = factor(Model, levels = c("Truth", "Cox PH"))) %>% 
  ggplot(aes(x = A7, y = surv, color = Model)) + 
  geom_line() + 
  scale_color_manual(values = c("Truth" = "green3", "Cox PH" = "red")) +
  ylab("Survival Probability") + 
  xlab("A7") + 
  ggtitle("Exposure Response Curves") +
  theme_bw() +
  theme(legend.title=element_blank())

The exposure-response curve estimated by the Cox PH model is far from the true curve. The modeling assumptions embedded in the Cox PH model may be too restrictive; next, we show ways to relax some of these assumptions.

Cox Proportional Hazards Model with Penalized Splines (Cox PH-ps)

One way to relax the Cox PH model is to include penalized splines on the exposures such that the hazard function has the form:

\[\lambda(t;\mathbf{a}) = \lambda_0 (t) \exp\Big\{\sum_{j=1}^J f_j(a_j) + \sum_{j \neq j'}f_{j, j'}(a_j, a_{j'})\Big\}\]

where \(f_j\) and \(f_{j, j'}\) are smooth functions estimated via penalized splines through tensor product smoothers. The same assumptions hold as with the traditional Cox PH model, but the effect of an exposure (and its interactions with other exposures) on the survival outcome is modeled more flexibly compared to the linear relationship assumed in the Cox PH model.

Cox PH-ps is less computationally and statistically efficient than the traditional Cox PH model because it is less parsimonious, and it may not converge if too many penalized splines are included; one may therefore want to use previous findings to select which mixture components to model more flexibly. Below we show how to estimate this model with smoothers on each mixture component, as well as on the interaction between components \(A1\) and \(A4\):

library(mgcv)

cox_ps_model <- gam(t_obs ~ ti(A1) + ti(A2) + ti(A3) + 
                     ti(A4) + ti(A5) + ti(A6) + 
                     ti(A7) + ti(A8) + ti(A9) + ti(A10) + 
                      ti(A1, A4), 
                   family = cox.ph(),
                   data = sim_surv_df,
                   weights = event)

We use this model to estimate our quantities of interest (HR and SPD):

# Estimate HR
coxps_hr_mix <- exp(predict(cox_ps_model, as.data.frame(t(mix_75)), type = "link"))/
  exp(predict(cox_ps_model, as.data.frame(t(mix_25)), type = "link"))


## Estimate SPD at 80th quantile of time
mix_25_t80 <- bind_cols(t_obs = quantile(sim_surv_df$t_obs, 0.8), t(mix_25))
mix_25_t80
## # A tibble: 1 × 11
##   t_obs     A1     A2     A3     A4     A5     A6     A7     A8     A9    A10
##   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1  4.86 -0.550 -0.587 -0.479 -0.538 -0.685 -0.586 -0.605 -0.669 -0.518 -0.500
mix_75_t80 <- bind_cols(t_obs = quantile(sim_surv_df$t_obs, 0.8), t(mix_75))
mix_75_t80
## # A tibble: 1 × 11
##   t_obs    A1    A2    A3    A4    A5    A6    A7    A8    A9   A10
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  4.86 0.550 0.583 0.485 0.532 0.697 0.608 0.637 0.794 0.625 0.507
coxps_spd_mix <- predict(cox_ps_model,
                         newdata = mix_75_t80,
                         type = "response") -
  predict(cox_ps_model,
          newdata = mix_25_t80,
          type = "response") 


#### Bootstrap to estimate confidence interval

## Create needed functions for bootstrapping
coxps.ratio <- function(boots, q1, q2){
  bstar = NULL 
  n = dim(boots)[1]; 
  
  for (mod in 1:n) {
    quant1 =  exp(predict(boots[[2]][[mod]], q1, type = "link"))
    quant2 =  exp(predict(boots[[2]][[mod]], q2, type = "link"))

    ratio = data_frame(ratio = (quant2/quant1))
    bstar = rbind(bstar, ratio)
  } # Next draw
  return(bstar)
}

coxps.diff <- function(boots, q1, q2){
  bstar = NULL 
  n = dim(boots)[1]; 
  for (mod in 1:n) {
    
    quant1 = predict(boots[[2]][[mod]], 
                     newdata = q1,
                     type = "response")
    quant2 = predict(boots[[2]][[mod]], 
                     newdata = q2,
                     type = "response")
    
    diff = data_frame(diff = (quant2-quant1))
    bstar = rbind(bstar, diff)
  } # Next draw
  return(bstar)
}

coxps.model.fnct <- function(dat){
  mod <- gam(t_obs ~ ti(A1, k=4) + ti(A2, k=4) + 
               ti(A3, k=4) + ti(A4, k=4) + 
               ti(A5, k=4) + ti(A6, k=4) + 
               ti(A7, k=4) + ti(A8, k=4) + 
               ti(A9, k=4) + ti(A10, k=4) + 
               ti(A1, A4, k = 4), 
             weights = event,
             family = 'cox.ph',
             data = dat)
  return(mod)
}

## Bootstrap 100 cox ps models
set.seed(2020)
coxps.boot = sim_surv_df %>%
  bootstraps(times = 100) %>%
  rowwise() %>%
  mutate(data_sample=(list(analysis(splits)))) %>%
  select(id, data_sample) %>%
  ungroup() %>%
  mutate(models = map(data_sample, coxps.model.fnct)) %>%
  select(-data_sample)


## Create HR CI using bootstrap models
coxps.hr.boot <- coxps.ratio(coxps.boot,
                             as.data.frame(t(mix_25)),
                             as.data.frame(t(mix_75)))

coxps.hr.ll <- quantile(coxps.hr.boot$ratio, 0.025)
coxps.hr.ul <- quantile(coxps.hr.boot$ratio, 0.975)

## Create SPD CI using bootstrap models
coxps.spd.boot <- coxps.diff(coxps.boot,
                             mix_25_t80,
                             mix_75_t80)

coxps.spd.ll <- quantile(coxps.spd.boot$diff, 0.025)
coxps.spd.ul <- quantile(coxps.spd.boot$diff, 0.975)

Let’s see how including penalized splines changes the estimated individual exposure-response function:

## Estimate exposure-response curve for A7
cox_spline_pred <- predict(cox_ps_model, 
                           newdata = bind_cols(t_obs = quantile(sim_surv_df$t_obs, 0.8), 
                                               quantsA7),
                           type = "response")

surv_fits <- bind_rows(surv_fits,
                       bind_cols(Model = "Cox PH w Splines", 
                                 A7 = quantsA7[,"A7"], 
                                 surv = cox_spline_pred))

surv_fits %>% 
   mutate(Model = factor(Model, levels = c("Truth", "Cox PH", "Cox PH w Splines"))) %>% 
  ggplot(aes(x = A7, y = surv, color = Model)) + 
  geom_line() + 
  scale_color_manual(values = c("Truth" = "green3", "Cox PH w Splines" = "red")) +
  ylab("Y") + 
  xlab("A7") + 
  ggtitle("Exposure Response Curves") +
  theme_bw()

In this case, allowing our exposures to be modeled more flexibly did not change the estimated curve drastically. We’ll explore more models!

Cox elastic net (Cox EN)

EN is a popular approach for high-dimensional, correlated data. For the survival analysis extension, the form of the estimated hazard function is the same as that of the traditional Cox PH model. However, the coefficient estimation method constrains the coefficients such that they are smaller than those estimated using the traditional Cox PH model. This helps prevent the model from overfitting the data and performs a version of variable selection by shrinking the coefficients of less relevant exposures toward zero, effectively selecting a smaller subset of exposures. EN is particularly useful for high-dimensional exposure sets where some components might be correlated or redundant.

Below we show how to estimate this model. There are two hyperparameters in this model, the penalty mixing parameter \(\alpha\) and the penalty strength \(\lambda\); we cross-validate to select their optimal values.
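
For reference, and up to the scaling constants glmnet uses internally, the Cox EN estimates maximize the penalized partial log-likelihood

\[\hat{\boldsymbol\beta} = \arg\max_{\boldsymbol\beta}\left[\ell_{PL}(\boldsymbol\beta) - \lambda\left(\alpha \sum_{j=1}^J |\beta_j| + \frac{1-\alpha}{2}\sum_{j=1}^J \beta_j^2\right)\right]\]

where \(\ell_{PL}(\boldsymbol\beta)\) is the Cox partial log-likelihood; \(\alpha = 1\) gives the lasso penalty and \(\alpha = 0\) the ridge penalty.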

library(glmnet)

# Perform 10-fold cross validation 
# Divide all observations into 10 folds
foldid <- sample(1:10, 
                 size = length(sim_surv_df$t_obs), 
                 replace = TRUE)

# Run model, cross validate over values for alpha and lambda
cv.fit <- cv.glmnet(x = sim_survival$A,
                    y = Surv(sim_survival$t_obs, event = sim_survival$event), 
                    family = "cox",
                    maxit = 5000, 
                    nfolds = 10,
                    foldid = foldid) #alpha = 1 by default
  
#Perform CV over a grid of alpha values
#select the alpha/lambda combo with the best performance
for (i in seq(0,0.8, by =0.2)) {
  cv.temp  <- cv.glmnet(x = sim_survival$A,
                        y = Surv(sim_survival$t_obs, event = sim_survival$event),
                        family = "cox",
                        maxit = 5000,
                        nfolds = 10,
                        foldid = foldid,
                        alpha = i)
  if(min(cv.temp$cvm) < min(cv.fit$cvm)) cv.fit <- cv.temp
}

#returns cox model with EN coefficients
selectedBeta <- array(t(as.matrix(coef(cv.fit, 
                                       s = cv.fit$lambda.min))))
coxnet_mod <- coxph(Surv(t_obs, event = event) ~ .,
                      data = sim_surv_df, 
                      init = selectedBeta, 
                      iter = 0)

The model returned is a Cox PH model with coefficients estimated under the EN constraint. Thus, we can now estimate our quantities of interest in the same way we did for the Cox PH model:

# Estimate HR
coxnet_hr_mix <- predict(coxnet_mod, as.data.frame(t(mix_75)), "risk")/
  predict(coxnet_mod, as.data.frame(t(mix_25)), "risk")

## Estimate SPD at 80th quantile of time
coxnet_surv_25 <- survfit(coxnet_mod, newdata = mix_25)
coxnet_surv_75 <- survfit(coxnet_mod, newdata = mix_75)

coxnet_spd_mix <- coxnet_surv_75$surv[length(coxnet_surv_75$surv)*0.8] -
  coxnet_surv_25$surv[length(coxnet_surv_25$surv)*0.8]

## Bootstrap confidence bands
#Create function to run Cox EN model on new dataset
coxnet.model.fnct <- function(data){
  
  foldid <- sample(1:10, 
                   size = nrow(data), 
                   replace = TRUE)
  
  cv.fit <- cv.glmnet(x = as.matrix(select(data, A1:A10)),
                      y = Surv(data$t_obs, event = data$event),
                      family = "cox",
                      maxit = 5000, 
                      nfolds = 10,
                      foldid = foldid)

  for (i in seq(0,0.8, by =0.2)) {
    cv.temp  <- cv.glmnet(x = as.matrix(select(data, A1:A10)),
                          y = Surv(data$t_obs, event = data$event), 
                          family = "cox",
                          maxit = 5000,
                          nfolds = 10,
                          foldid = foldid,
                          alpha = i)
    if(min(cv.temp$cvm) < min(cv.fit$cvm)) cv.fit <- cv.temp
  }
  
  selectedBeta <- array(t(as.matrix(coef(cv.fit, 
                                         s = cv.fit$lambda.min))))
  glmnet.model <- coxph(Surv(t_obs, event = event) ~ .,
                        data = data, 
                        init = selectedBeta, 
                        iter = 0)
  return(glmnet.model)
}

## Bootstrap Cox EN models
set.seed(2020)
coxnet.boot <- sim_surv_df %>%
  bootstraps(times = 10) %>% 
  rowwise() %>% 
  mutate(data_sample=(list(analysis(splits)))) %>%
  select(id, data_sample) %>%
  ungroup() %>% 
  mutate(models = map(data_sample, coxnet.model.fnct)) %>%
  select(-data_sample)


# Use Cox model's HR function to get cox net HR
coxnet.hr.boot <- cox.ratio(coxnet.boot, 
                            as.data.frame(t(mix_25)), 
                            as.data.frame(t(mix_75)))

coxnet.hr.ll <- quantile(coxnet.hr.boot$ratio, 0.025)
coxnet.hr.ul <- quantile(coxnet.hr.boot$ratio, 0.975)


# Use Cox model's SPD function to get cox net SPD
coxnet.spd.boot <- cox.diff(coxnet.boot, 
                            as.data.frame(t(mix_25)), 
                            as.data.frame(t(mix_75)),
                            t = 0.8)

coxnet.spd.ll <- quantile(coxnet.spd.boot$diff, 0.025)
coxnet.spd.ul <- quantile(coxnet.spd.boot$diff, 0.975)

We estimate the exposure response curve for exposure component \(A7\) using the Cox EN model:

## Estimate exposure-response curve for A7
coxnet_pred <- survfit(coxnet_mod, newdata = as.data.frame(quantsA7))

surv_fits <- bind_rows(surv_fits, 
                       bind_cols(Model = "Cox Elastic Net", 
                                 A7 = quantsA7[,"A7"], 
                                 surv = coxnet_pred$surv[0.8*length(coxnet_pred$time),])) 

surv_fits %>% 
   mutate(Model = factor(Model, levels = c("Truth", "Cox PH", "Cox PH w Splines",
                                           "Cox Elastic Net"))) %>% 
  ggplot(aes(x = A7, y = surv, color = Model)) + 
  geom_line() + 
  scale_color_manual(values = c("Truth" = "green3",
                                "Cox PH" = "blue",
                                "Cox Elastic Net" = "red")) +
  ylab("Survival") + 
  xlab("A7") + 
  ggtitle("Exposure Response Curves") +
  theme_bw()

We see that the slope of the estimated curve using Cox EN is a little flatter than that of the traditional Cox PH model. This is a result of the coefficients being shrunk toward 0.

Discrete time survival analysis based approaches

Many machine learning approaches have yet to be extended to survival data and/or are not readily available in R. In order to employ such methods and relax the proportional hazards assumption, we apply a discrete-time survival analysis approach (Fahrmeir, 2014).

To apply this approach, we consider each subject’s corresponding triple \((t_i, \delta_i, \mathbf{a}_i)\) where, for subject \(i\):

  • \(t_i\) is the observed time to event or censoring
  • \(\delta_i\) is the indicator of whether the event occurred or was censored
  • \(\mathbf{a}_i\) is the set of observed exposures

Time, \(t\), is discretized into \(R\) bins of event/censoring times delineated by the corresponding \(1/R, 2/R, \ldots, R/R\) quantiles. An augmented dataset is created such that subject \(i\) has one corresponding observation for each discretized time up to the bin containing their observed \(t_i\). A binary variable, \(\mathbf{Y}_i\), is created to indicate event status for each subject-time-bin record.

For example, below we perform this data augmentation on the simulated dataset, using \(R = 5\):

library(BART)

#select number of bins wanted for time variable
R <- 5

#use function in BART package to create augmented data frame
pre <- surv.pre.bart(times = sim_survival$t_obs,
                     delta = sim_survival$event,
                     x.train = sim_survival$A,
                     K = R)

sim_survival_long <- bind_cols(event = pre[["y.train"]], pre[["tx.train"]])

#discretized time values (binning limits)
pre$times
## [1] 0.6783287 1.5375476 2.8487990 4.8633189 9.8891516
#original dataset, one row per observation
head(sim_surv_df, 4)
##        t_obs event         A1         A2         A3         A4         A5
## 1 0.07446581     0  0.4135089 -0.5944527 -0.8096157 -0.5541695 -1.0115614
## 2 0.66082410     1 -1.4181004 -1.4298398 -0.2309133 -0.3124463 -0.2397462
## 3 4.35809486     0  0.8432820  0.2765033  0.5813349  1.3007974  1.2237475
## 4 2.44517878     1  0.5248145  0.5858638  0.2235701  1.4008136  0.3704268
##           A6         A7          A8         A9        A10
## 1  0.3314428  0.1792858  0.61754085 -0.5871062 -0.3300210
## 2 -0.1843424  1.0303817 -0.27295057  0.8662922  1.3832782
## 3  0.7839420 -0.1241911  0.31629865  0.9622082 -0.3353922
## 4  0.8433271  1.9342665 -0.06862079  0.7841573 -0.9059241
#augmented dataset, one row per observation-time
head(sim_survival_long, 9)
## # A tibble: 9 × 12
##   event     t     A1     A2     A3     A4     A5     A6     A7      A8     A9
##   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
## 1     0 0.678  0.414 -0.594 -0.810 -0.554 -1.01   0.331  0.179  0.618  -0.587
## 2     1 0.678 -1.42  -1.43  -0.231 -0.312 -0.240 -0.184  1.03  -0.273   0.866
## 3     0 0.678  0.843  0.277  0.581  1.30   1.22   0.784 -0.124  0.316   0.962
## 4     0 1.54   0.843  0.277  0.581  1.30   1.22   0.784 -0.124  0.316   0.962
## 5     0 2.85   0.843  0.277  0.581  1.30   1.22   0.784 -0.124  0.316   0.962
## 6     0 4.86   0.843  0.277  0.581  1.30   1.22   0.784 -0.124  0.316   0.962
## 7     0 0.678  0.525  0.586  0.224  1.40   0.370  0.843  1.93  -0.0686  0.784
## 8     0 1.54   0.525  0.586  0.224  1.40   0.370  0.843  1.93  -0.0686  0.784
## 9     1 2.85   0.525  0.586  0.224  1.40   0.370  0.843  1.93  -0.0686  0.784
## # ℹ 1 more variable: A10 <dbl>

We see that the first two observations from the original dataset have only one row each in the augmented dataset, because their \(t_{obs}\) occurs before the first time quantile. The first observation has \(event = 0\) since it was censored, while the second has \(event = 1\) since the event occurred during the first time bin. The third observation from the original dataset has 4 corresponding rows in the augmented dataset, where its exposure values are repeated but the \(t\) variable varies; since this observation was censored, it has \(event = 0\) at all time points. The fourth observation has three corresponding rows, with \(event = 1\) at the third time bin, since that is when the event was observed.

We can now use \(\mathbf{Y}\) (labeled “event” in the dataset) as the outcome of interest and apply flexible methods readily available for binary outcomes to the augmented dataset with \(t\) included as a covariate. Generally, we model

\[g(\mu_i) = f(\mathbf{a}_i,t)\]

where \(g(.)\) is a monotonic link function, \(\mu_i = E(Y_i \mid \mathbf{a}_i, t)\) is the expected probability of experiencing the event, and \(f(\mathbf{a}_i, t)\) is a function of the exposures and time estimated through an algorithm. There are many modeling algorithms that can be used to estimate this function; we demonstrate a few.
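
As a simple illustration of this framework (a minimal sketch that we do not carry through the comparisons below), the most basic choice for \(f\) is a linear predictor with a logit link, i.e., a pooled logistic regression fit to the augmented data with base R's glm():

# Simplest discrete-time model: pooled logistic regression on the augmented data,
# with the binned time t entering as an ordinary covariate
glm_discrete <- glm(event ~ ., family = binomial(), data = sim_survival_long)

# Fitted p(t, a): probability of the event in a given time bin, given the exposures
p_event_hat <- predict(glm_discrete, type = "response")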

To get the quantities we need at event/censoring times \(t_{(k)}\), \(k = 1, ..., R\), we can use the functions shown in Sparapani et al., 2016.

\[\lambda(t_{(k)}|\mathbf{a}) = \frac{p(t_{(k)}, \mathbf{a})}{(t_{(k)} - t_{(k - 1)})}\]

\[S(t_{(k)}|\mathbf{a}) = \prod_{l = 1}^{k}(1 - p(t_{(l)}, \mathbf{a}))\]

where \(p(t, \mathbf{a})\) is the estimated probability that the event occurs at time \(t\) given exposure concentrations \(\mathbf{a}\).
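
In code, the survival expression is just a cumulative product over time bins. A small helper along these lines (a sketch only; the model-specific sections below inline the same computation) shows the idea:

# Survival probability at the k-th time bin, given the bin-wise event
# probabilities p(t_(1), a), ..., p(t_(R), a) for one exposure profile
surv_from_probs <- function(p_event, k) {
  cumprod(1 - p_event)[k]
}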

Multivariate Adaptive Regression Splines (MARS)

The MARS algorithm creates a piecewise linear model convenient for modeling nonlinear and interactive relationships of the exposures while still maintaining some interpretability. Nonlinear relationships are captured by splitting the range of each exposure into sections at values referred to as knots and fitting a separate linear term within each section, so that the relationship of an exposure with the outcome can differ across intervals of its concentration. The relationship of the exposures and time with the outcome is then modeled by summing over multiple piecewise linear functions.

The algorithm optimizes the choice of knots, and the resulting set of piecewise linear functions may not include all exposures, thus performing automated variable selection. Products of piecewise linear functions can also be included to model interactions; the maximal degree of such products can be selected by cross-validation.
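
For intuition, each MARS basis function is a hinge, \(\max(0, a_j - c)\) or \(\max(0, c - a_j)\), where \(c\) is a knot, and interactions enter as products of hinges. A fitted model could therefore take a form like the following (purely illustrative, not the model selected below):

\[g(\mu) = \beta_0 + \beta_1 \max(0, A7 - c_1) + \beta_2 \max(0, c_1 - A7) + \beta_3 \max(0, A1 - c_2)\max(0, A4 - c_3)\]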

library(caret)

#Need to cross validate for number of terms retained in the final model
# and maximal degree of interactions 
mars_hypergrid <- expand.grid(
  degree = 1:2, 
  nprune = seq(2, 30, length.out = 10) %>% floor() 
  )

head(mars_hypergrid)
##   degree nprune
## 1      1      2
## 2      2      2
## 3      1      5
## 4      2      5
## 5      1      8
## 6      2      8
#change outcome var to make compatible with caret
sim_survival_long$event <- ifelse(sim_survival_long$event == 1, 
                                  "event", "none")

#Run MARS model
set.seed(2020)

t1 <- Sys.time()
model_mars <- train(
  as.factor(event) ~ .,
  data = sim_survival_long,
  method = "earth",
  glm = list(family=binomial),
  trControl = trainControl(method = "cv", number = 5,
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE, # IMPORTANT!
                           verboseIter = TRUE),
  tuneGrid = mars_hypergrid)
## + Fold1: degree=1, nprune=30 
## - Fold1: degree=1, nprune=30 
## + Fold1: degree=2, nprune=30 
## - Fold1: degree=2, nprune=30 
## + Fold2: degree=1, nprune=30 
## - Fold2: degree=1, nprune=30 
## + Fold2: degree=2, nprune=30 
## - Fold2: degree=2, nprune=30 
## + Fold3: degree=1, nprune=30 
## - Fold3: degree=1, nprune=30 
## + Fold3: degree=2, nprune=30 
## - Fold3: degree=2, nprune=30 
## + Fold4: degree=1, nprune=30 
## - Fold4: degree=1, nprune=30 
## + Fold4: degree=2, nprune=30 
## - Fold4: degree=2, nprune=30 
## + Fold5: degree=1, nprune=30 
## - Fold5: degree=1, nprune=30 
## + Fold5: degree=2, nprune=30 
## - Fold5: degree=2, nprune=30 
## Aggregating results
## Selecting tuning parameters
## Fitting nprune = 5, degree = 2 on full training set
mars_time = Sys.time() - t1

Now we’ll use this model to estimate our quantities of interest:

# Estimate HR

## Set up quantile df into long format with time
mix_25t <- bind_cols(t = pre$times, t(mix_25))
mix_75t <- bind_cols(t = pre$times, t(mix_75))

mix_75t
## # A tibble: 5 × 11
##       t    A1    A2    A3    A4    A5    A6    A7    A8    A9   A10
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.678 0.550 0.583 0.485 0.532 0.697 0.608 0.637 0.794 0.625 0.507
## 2 1.54  0.550 0.583 0.485 0.532 0.697 0.608 0.637 0.794 0.625 0.507
## 3 2.85  0.550 0.583 0.485 0.532 0.697 0.608 0.637 0.794 0.625 0.507
## 4 4.86  0.550 0.583 0.485 0.532 0.697 0.608 0.637 0.794 0.625 0.507
## 5 9.89  0.550 0.583 0.485 0.532 0.697 0.608 0.637 0.794 0.625 0.507
## Estimate effect at time bin 4 out of 5, since 4/5 = 0.8 quantile of t
## Ignore denominator from equation because will cancel out in ratio
t = 4
mars_hr_mix <- predict(model_mars, mix_75t, type = "prob")$event[t]/
  predict(model_mars, mix_25t, type = "prob")$event[t]

## Estimate SPD at 80th quantile of time

### Get probabilities at different time points
p.noevent <- predict(model_mars, 
                     rbind(mix_25t, mix_75t), 
                     type = "prob")$none

#### Calculate survival probability: 
##### product of probabilities up to time bin of interest

surv.mix <-NULL
for (j in 1:2) {
  surv.mix[j] <- cumprod(p.noevent[(1:R)+(j-1)*R])[t]
}

##### Returns survival prob. for 4th bin of time 
##### when all exposures are at 25th  and 75th percentiles of exposures
surv.mix
## [1] 0.2686210 0.4862488
mars_spd_mix <- surv.mix[2] - surv.mix[1]

## Bootstrap confidence bands

# function to estimate MARS model
mars_mod_funct <- function(data){
  
  pre <- surv.pre.bart(times = data$t_obs,
                       delta = data$event,
                       x.train = select(data, A1:A10),
                       K = R)
  
  dat_long <- bind_cols(event = pre[["y.train"]], pre[["tx.train"]])
  dat_long$event <- ifelse(dat_long$event == 1,  "event", "none")
  
  train <-  
    capture.output(mars_mod <- 
                     train(as.factor(event) ~ .,
                           data = dat_long,
                           method = "earth",
                           glm = list(family=binomial),
                           trControl = trainControl(method = "cv", 
                                                    number = 10,
                                                    summaryFunction = 
                                                      twoClassSummary,
                                                    classProbs = TRUE,
                                                    verboseIter = TRUE),
                           tuneGrid = mars_hypergrid,
                           trace = FALSE))
  return(mars_mod)
}

# Function to estimate HR for bootstrap models using Caret package
caret.ratio <- function(boots, q1, q2){
  bstar = NULL 
  n = dim(boots)[1]; 
  
  for (mod in 1:n) {
    
    quant1 = predict(boots[[2]][[mod]], q1, type = "prob")$event
    quant2 = predict(boots[[2]][[mod]], q2, type = "prob")$event
    ratio = data_frame(ratio = (quant2/quant1))
    bstar = rbind(bstar, ratio)
  } # Next draw
  return(bstar)
}

# Function to estimate SPD for bootstrap models using Caret package
# t refers to time bin to calculate values at
caret.diff <- function(boots, q1.long, q2.long, t){
  bstar = NULL 
  n = dim(boots)[1]; 
  
  for (mod in 1:n) {
    p.noevent <- predict(boots[[2]][[mod]], 
                         q1.long, 
                         type = "prob")$none
    surv.1 <- cumprod(p.noevent)[t]
    
    p.noevent <- predict(boots[[2]][[mod]], 
                         q2.long, 
                         type = "prob")$none
    surv.2 <- cumprod(p.noevent)[t]
  
    diff <- data_frame(diff = (surv.2 - surv.1))
    bstar = rbind(bstar, diff)
  } # Next draw
  return(bstar)
}

## bootstrap MARS model
set.seed(2020)
mars.boot = sim_surv_df %>%
  bootstraps(times = 100) %>%
  rowwise() %>%
  mutate(data_sample=(list(analysis(splits)))) %>%
  select(id, data_sample) %>%
  ungroup() %>%
  mutate(models = map(data_sample, mars_mod_funct)) %>%
  select(-data_sample)

## Estimate HR for each bootstrap model
mars_hr_boot <- caret.ratio(mars.boot,
                            mix_25_t80,
                            mix_75_t80)
mars_hr_ll <- quantile(mars_hr_boot$ratio, 0.025)
mars_hr_ul <- quantile(mars_hr_boot$ratio, 0.975)


## Estimate SPD for each bootstrap model
mars_diff_boot <- caret.diff(mars.boot,
                                mix_25t,
                                mix_75t,
                                4)
mars_diff_ll <- quantile(mars_diff_boot$diff, 0.025)
mars_diff_ul <- quantile(mars_diff_boot$diff, 0.975)

Next we estimate the exposure-response curve for exposure \(A7\) using our MARS model:

## Exposure response curve with MARS
## Create augmented dataset for individual exposure A7 to predict values over
A7_long <- surv.pre.bart(x.train = sim_survival$A, 
                         times = sim_survival$t_obs, 
                         delta = sim_survival$event, 
                         K = R, 
                         x.test = quantsA7)

head(A7_long$tx.test, 10)
##               t         A1         A2           A3         A4          A5
##  [1,] 0.6783287 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367
##  [2,] 1.5375476 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367
##  [3,] 2.8487990 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367
##  [4,] 4.8633189 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367
##  [5,] 9.8891516 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367
##  [6,] 0.6783287 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367
##  [7,] 1.5375476 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367
##  [8,] 2.8487990 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367
##  [9,] 4.8633189 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367
## [10,] 9.8891516 0.02495632 0.02578666 -0.001082423 0.01331099 -0.02523367
##                A6        A7         A8         A9        A10
##  [1,] -0.01109992 -2.617565 0.09136669 0.03405408 0.02159977
##  [2,] -0.01109992 -2.617565 0.09136669 0.03405408 0.02159977
##  [3,] -0.01109992 -2.617565 0.09136669 0.03405408 0.02159977
##  [4,] -0.01109992 -2.617565 0.09136669 0.03405408 0.02159977
##  [5,] -0.01109992 -2.617565 0.09136669 0.03405408 0.02159977
##  [6,] -0.01109992 -2.562197 0.09136669 0.03405408 0.02159977
##  [7,] -0.01109992 -2.562197 0.09136669 0.03405408 0.02159977
##  [8,] -0.01109992 -2.562197 0.09136669 0.03405408 0.02159977
##  [9,] -0.01109992 -2.562197 0.09136669 0.03405408 0.02159977
## [10,] -0.01109992 -2.562197 0.09136669 0.03405408 0.02159977
## Estimate survival probability across values of A7
p.noevent <- predict(model_mars, 
                     A7_long$tx.test, type = "prob")[,'none']
mars_pred <-NULL
for (j in 1:nrow(quantsA7)) {mars_pred[j] <- cumprod(p.noevent[(1:R)+(j-1)*R])[t]}

surv_fits <- bind_rows(surv_fits,
                       bind_cols(Model = "MARS", 
                                 A7 = quantsA7[,"A7"], 
                                 surv = mars_pred)) 

surv_fits %>% 
   mutate(Model = factor(Model, levels = c("Truth", "Cox PH", "Cox PH w Splines", 
                                           "Cox Elastic Net", "MARS"))) %>% 
  ggplot(aes(x = A7, y = surv, color = Model)) + 
  geom_line() + 
  scale_color_manual(values = c("Truth" = "green3", "MARS" = "red")) +
  ylab("Survival") + 
  xlab("A7") + 
  ggtitle("Exposure Response Curves") +
  theme_bw()

From the exposure-response curve we see that 1 is a knot value for \(A7\), where the linear model is different for concentrations of \(A7\) on \((-2.6, 1)\) compared to \((1, 2.9)\). This allows the model to better approximate the true curve.

Gaussian Process Regression (GPR)

GPR is a nonparametric method that estimates the outcome for a new set of exposure concentrations by weighting the outcomes from all observations based on how numerically close their concentrations are. Each subject’s weight increases as the distance between their concentrations and the new concentrations decreases. Conceptually, subjects with similar exposure profiles should have similar outcomes and thus receive higher weights. The estimated outcome can then be thought of as a weighted average of the observed outcomes. Since a functional form is not specified or assumed, nonlinear exposure-response functions and nonlinear, non-additive interactions among all mixture components can be captured.
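
The weights come from a radial basis kernel; in the notation of this document, the similarity between two exposure profiles can be written as

\[k(\mathbf{a}, \mathbf{a}') = \exp\{-\rho \|\mathbf{a} - \mathbf{a}'\|^2\}\]

so profiles that are close together in exposure space receive weights near one, while distant profiles receive weights near zero.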

\(\rho\) is a tuning parameter used to control flexibility, reflected in the smoothness of the exposure-response curve. Given how time-intensive GPR is to run, we used the median of the squared distances between observations’ covariate vectors, \(\|x - x'\|^2\), for the \(\rho\) value rather than cross-validating. While this might not provide the optimal \(\rho\), it has been shown empirically that the optimal value lies between the 0.1 and 0.9 quantiles of this statistic (Caputo et al., 2002). We estimate this model using the 50th percentile below:

#get potential rho parameter values, 
rho_val <- kernlab::sigest(as.factor(event) ~ ., data = sim_survival_long)
rho_val
##        90%        50%        10% 
## 0.02437901 0.05434667 0.13918908
gp_model <-  train(as.factor(event) ~ .,
                   data = sim_survival_long,
                   method = "gaussprRadial", 
                   trControl = trainControl(method = "none"),
                   tuneGrid = expand.grid(.sigma = rho_val[2])
                   )

Since we used the caret package to estimate the GPR model, we can estimate the quantities of interest in the same way we did for the MARS model:

# Estimate HR at 80th percentile of time
gpr_hr_mix <- predict(gp_model, mix_75t, "prob")$event[t]/
  predict(gp_model, mix_25t, "prob")$event[t]

# Estimate SPD at 80th percentile of time
p.noevent <- predict(gp_model,
                     rbind(mix_25t, mix_75t), 
                     type = "prob")$none
gpr.surv.mix <-NULL
for (j in 1:2) {
  gpr.surv.mix[j] <- cumprod(p.noevent[(1:R)+(j-1)*R])[t]
}

gpr_spd_mix <- gpr.surv.mix[2] - gpr.surv.mix[1]
  
## Bootstrap confidence bands
# function to estimate GPR model
gpr_mod_funct <- function(data){
  
  pre <- surv.pre.bart(times = data$t_obs,
                       delta = data$event,
                       x.train = select(data, A1:A10),
                       K = 5)
  
  dat_long <- bind_cols(event = pre[["y.train"]], pre[["tx.train"]])
  dat_long$event <- ifelse(dat_long$event == 1,  "event", "none")
  
  rho <- kernlab::sigest(as.factor(event) ~ ., data = dat_long)
  gp_mod <-  train(as.factor(event) ~ .,
                   data = dat_long,
                   method="gaussprRadial", 
                   trControl=trainControl(method="none"),
                   tuneGrid = expand.grid(.sigma = rho[2])
                   )
  return(gp_mod)
}

### Bootstrap GPR model

# set.seed(2020)
# gpr.boot = sim_surv_df %>%
#   bootstraps(times = 100) %>%
#   rowwise() %>%
#   mutate(data_sample=(list(analysis(splits)))) %>%
#   select(id, data_sample) %>%
#   ungroup() %>%
#   mutate(models = map(data_sample, gpr_mod_funct)) %>%
#   select(-data_sample)
# saveRDS(gpr.boot, "gpr.boot.Rds")

# Ran previously because bootstrapping GPR is time-consuming; load in the saved results

setwd("/Volumes/Extreme SSD/survival-mixture-analysis")
gpr.boot <- readRDS("gpr.boot.Rds")

### Get CI for HR 
gpr_hr_boot <- caret.ratio(gpr.boot,
                            mix_25_t80,
                            mix_75_t80)
gpr_hr_ll <- quantile(gpr_hr_boot$ratio, 0.025)
gpr_hr_ul <- quantile(gpr_hr_boot$ratio, 0.975)

### Get CI for SPD 
gpr_diff_boot <- caret.diff(gpr.boot,
                            mix_25t,
                            mix_75t,
                            4)

gpr_diff_ll <- quantile(gpr_diff_boot$diff, 0.025)
gpr_diff_ul <- quantile(gpr_diff_boot$diff, 0.975)

We can also estimate the exposure-response curve:

#Probability of no event at each time point for each A7 value
gpr.p.noevent <- predict(gp_model, 
                         A7_long$tx.test, type = "prob")[,'none']

#Survival probability
gpr_pred <-NULL
for (j in 1:nrow(quantsA7)) {gpr_pred[j] <- cumprod(gpr.p.noevent[(1:R)+(j-1)*R])[t]}

surv_fits <- bind_rows(surv_fits,
                       bind_cols(Model = "GPR", 
                                 A7 = quantsA7[,"A7"], 
                                 surv = gpr_pred)) 

surv_fits %>% 
   mutate(Model = factor(Model, levels = c("Truth", "Cox PH", "Cox PH w Splines", 
                                           "Cox Elastic Net", "MARS", "GPR"))) %>% 
  ggplot(aes(x = A7, y = surv, color = Model)) + 
  geom_line() + 
  scale_color_manual(values = c("Truth" = "green3", "GPR" = "red")) +
  ylab("Survival") + 
  xlab("A7") + 
  ggtitle("Exposure Response Curves") +
  theme_bw()

The exposure-response curve estimated via GPR does appear to be more flexible compared to the other models, shown in grey. It may be too flexible; cross-validating the \(\rho\) parameter might help.

Bayesian Additive Regression Trees (BART)

BART is a nonparametric Bayesian regression method which approximates the response surface using a sum-of-trees approach. Interactions and nonlinearities are naturally incorporated into the tree structure. A Bayesian approach is used to fit the trees, with a prior imposed to keep individual tree effects small. Because the model is fit in a Bayesian framework, credible intervals are also easily produced, making it straightforward to quantify uncertainty.
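
Schematically, in the discrete-time setup used here (surv.bart fits a probit model by default), the fit can be written as a sum of \(M\) trees,

\[\Phi^{-1}\{p(t, \mathbf{a})\} = \sum_{m=1}^{M} g_m(t, \mathbf{a}; T_m, \mathcal{M}_m)\]

where \(g_m\) is the prediction of a single regression tree with structure \(T_m\) and leaf values \(\mathcal{M}_m\), and the prior keeps each tree’s contribution small.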

# fit BART model
bart_fit <- surv.bart(x.train = sim_survival$A,
                      times = sim_survival$t_obs,
                      delta = sim_survival$event,
                      K = R,
                 #     seed = 2020,
                      ntree = 50,
                      nskip = 1000,
                      ndpost = 1000,
                      keepevery = 10,
                      printevery = 5000)
## *****Calling gbart: type=2
## *****Data:
## data:n,p,np: 3000, 11, 0
## y1,yn: 0.000000, 0.000000
## x1,x[n*p]: 0.678329, 0.700396
## *****Number of Trees: 50
## *****Number of Cut Points: 4 ... 100
## *****burn,nd,thin: 1000,10000,10
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.212132,3,1,-0.855996
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,10,0
## *****printevery: 5000
## 
## MCMC
## done 0 (out of 11000)
## done 5000 (out of 11000)
## done 10000 (out of 11000)
## time: 27s
## trcnt,tecnt: 1000,0
## Get posterior distributions for predicted values
pred_bart_mix <- predict(bart_fit, 
                         as.data.frame(rbind(mix_25t, mix_75t)))
## *****In main of C++ for bart prediction
## tc (threadcount): 1
## number of bart draws: 1000
## number of trees in bart sum: 50
## number of x columns: 11
## from x,np,p: 11, 10
## ***using serial code
# Estimate HR for IQR of mixture
## Median of posterior of HR
bart_hr_mix <- quantile(pred_bart_mix$prob.test[,R+t]/
                          pred_bart_mix$prob.test[,t],
                        0.5)

## 2.5th and 97.5th percentile of posterior of HR
bart.hr.ll <- quantile(pred_bart_mix$prob.test[,R+t]/
                          pred_bart_mix$prob.test[,t],
                        0.025)
bart.hr.ul <- quantile(pred_bart_mix$prob.test[,R+t]/
                          pred_bart_mix$prob.test[,t],
                        0.975)


# Estimate SPD for IQR of mixture
bart_spd_mix <- pred_bart_mix$surv.test.mean[R+t] - 
  pred_bart_mix$surv.test.mean[t]

bart.spd.ll <- quantile(pred_bart_mix$surv.test[,R+t] - 
                          pred_bart_mix$surv.test[,t],
                       0.025)
bart.spd.ul <- quantile(pred_bart_mix$surv.test[,R+t] - 
                          pred_bart_mix$surv.test[,t],
                       0.975)

We estimate the exposure-response curve for \(A7\) using the BART model:

# estimate exposure-response curve for A7

## Posterior predicted prob. over range of A7 concentrations
bart_A7<- predict(bart_fit, A7_long$tx.test)
## *****In main of C++ for bart prediction
## tc (threadcount): 1
## number of bart draws: 1000
## number of trees in bart sum: 50
## number of x columns: 11
## from x,np,p: 11, 500
## ***using serial code
bart_A7_pred <- bart_A7$surv.test.mean[seq(t, length(bart_A7$surv.test.mean), by = R)]

surv_fits <- bind_rows(surv_fits,
                        bind_cols(Model = "BART", 
                                  A7 = quantsA7[,"A7"], 
                                  surv = bart_A7_pred)) 

surv_fits %>% 
   mutate(Model = factor(Model, levels = c("Truth", "Cox PH", "Cox PH w Splines", 
                                           "Cox Elastic Net", "MARS", "GPR", "BART"))) %>% 
  ggplot(aes(x = A7, y = surv, color = Model)) + 
  geom_line() + 
  scale_color_manual(values = c("Truth" = "green3", "BART" = "red")) +
  ylab("Survival") + 
  xlab("A7") + 
  ggtitle("Exposure Response Curves") +
  theme_bw()

Bayesian Kernel Machine Regression (BKMR)

BKMR is a Bayesian extension of GPR developed for mixtures analyses. A spike-and-slab prior is applied when estimating the contribution of each exposure, such that there is a high probability of assigning no contribution to individual exposures. This is a way to incorporate variable selection into the model. Additionally, the posterior probability that a component’s contribution is not zero serves as a measure of variable importance. While BKMR is popular for mixture analyses, it has not been directly extended to the survival-time outcome setting; hence, we apply the discrete-time survival analysis approach. One drawback of this modeling method is that it is computationally intensive, so we use a knot-based approximation to speed up model fitting.
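
In the discrete-time setup used here (kmbayes fits a probit model for binary outcomes, as its output notes below), the model can be sketched as

\[\Phi^{-1}\{p(t, \mathbf{a})\} = h(t, \mathbf{a}), \qquad h \sim GP(0, K)\]

where the kernel \(K\) contains a nonnegative weight \(r_j\) for each exposure. The spike-and-slab prior places positive probability on \(r_j = 0\), which removes exposure \(j\) from the fit, and the posterior probability that \(r_j > 0\) is the measure of component importance mentioned above.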

library(bkmr)

# Create field of 50 knots to speed up model
# using fewer knots speeds up fitting, at the cost of a coarser approximation
set.seed(2020)
knots50 <- fields::cover.design(select(sim_survival_long, -event), nd = 50)$design

# Run BKMR model
t1 <- Sys.time()
bkmr_surv_fit <- kmbayes(y = ifelse(sim_survival_long$event == "event", 1, 0),
                         Z = select(sim_survival_long, -event),
                         iter = 1000,
                          family = "binomial",
                          verbose = FALSE,
                          varsel = TRUE,
                          ztest = c(2:11), #does not perform variable selection on t
                          est.h = TRUE,
                          knots = knots50)
## Fitting probit regression model
## Iteration: 100 (10% completed; 1.11425 mins elapsed)
## Iteration: 200 (20% completed; 2.21968 mins elapsed)
## Iteration: 300 (30% completed; 3.30385 mins elapsed)
## Iteration: 400 (40% completed; 4.38277 mins elapsed)
## Iteration: 500 (50% completed; 5.46884 mins elapsed)
## Iteration: 600 (60% completed; 6.55741 mins elapsed)
## Iteration: 700 (70% completed; 7.64948 mins elapsed)
## Iteration: 800 (80% completed; 8.72485 mins elapsed)
## Iteration: 900 (90% completed; 9.81736 mins elapsed)
## Iteration: 1000 (100% completed; 10.90345 mins elapsed)
bkmr_t <- Sys.time() - t1
print(bkmr_t)
## Time difference of 10.91029 mins
# estimate HR for IQR of mixture
bkmr_mix_prob <- SamplePred(bkmr_surv_fit, 
                            Znew = as.data.frame(rbind(mix_25t, mix_75t)),
                            Xnew = cbind(0),
                            type = "response")
bkmr_hr_mix <- quantile(bkmr_mix_prob[ ,R+t]/bkmr_mix_prob[ ,t], 0.5)
bkmr.hr.ll <- quantile(bkmr_mix_prob[ ,R+t]/bkmr_mix_prob[ ,t], 0.025)
bkmr.hr.ul <- quantile(bkmr_mix_prob[ ,R+t]/bkmr_mix_prob[ ,t], 0.975)

# estimate SPD for IQR of mixture
bkmr_mix_surv <- cbind()
for(j in 1:2) bkmr_mix_surv <- cbind(bkmr_mix_surv, t(apply(1 - bkmr_mix_prob[, (1:R)+(j-1)*R], 1, cumprod))[,t])


bkmr_spd_mix <- mean(bkmr_mix_surv[, 2] - bkmr_mix_surv[,1])
bkmr.spd.ll <- quantile(bkmr_mix_surv[, 2] - bkmr_mix_surv[,1], 0.025)  
bkmr.spd.ul <- quantile(bkmr_mix_surv[, 2] - bkmr_mix_surv[,1], 0.975)  

# estimate exposure-response curve for A7
p.noevent.bkmr <- SamplePred(fit = bkmr_surv_fit, 
                             Znew =  A7_long$tx.test, 
                             Xnew = cbind(0),
                             type = "response")

p.noevent.bkmr <- 1 - apply(p.noevent.bkmr, 2, mean)
BKMR_pred <-NULL
for (j in 1:nrow(quantsA7)) {BKMR_pred[j] <- cumprod(p.noevent.bkmr[(1:R)+(j-1)*R])[t]}


surv_fits <- bind_rows(surv_fits,
          bind_cols(Model = "BKMR", A7 = quantsA7[,"A7"], surv = BKMR_pred)) 

surv_fits %>% 
   mutate(Model = factor(Model, levels = c("Truth", "Cox PH", "Cox PH w Splines", 
                                           "Cox Elastic Net", "MARS", "GPR",
                                           "BART", "BKMR"))) %>% 
  ggplot(aes(x = A7, y = surv, color = Model)) + 
  geom_line() + 
  scale_color_manual(values = c("Truth" = "green3", "BKMR" = "red")) +
  ylab("Survival") + 
  xlab("A7") + 
  ggtitle("Exposure Response Curves") +
  theme_bw()

We see that the exposure-response curve estimated via BKMR is flexible yet closely mimics the true curve.

Models’ Performances

Throughout, we visualized how well the estimated exposure-response curves replicated the true curve, but we have not yet explored how the point estimates line up with the true values. Let’s see how the models did.

setwd("/Volumes/Extreme SSD/survival-mixture-analysis")
HR_true <- readRDS("HR_true.Rds")
SPD_true <- readRDS("SPD_true.Rds")


true_estimands <- data.frame(estimand = c("HR", "SPD"), 
                             Z = c(HR_true, SPD_true))

bind_rows(bind_cols(Model = "Cox", 
                    estimand = "HR",
                    value = cox_hr_mix, 
                    ll = cox.hr.ll,
                    ul = cox.hr.ul),
          bind_cols(Model = "Cox", 
                    estimand = "SPD",
                    value = cox_spd_mix,
                    ul = cox.spd.ul,
                    ll = cox.spd.ll),
          bind_cols(Model = "Cox PH-ps", 
                    estimand = "HR",
                    value = coxps_hr_mix, 
                    ll = coxps.hr.ll,
                    ul = coxps.hr.ul),
          bind_cols(Model = "Cox PH-ps", 
                    estimand = "SPD",
                    value = coxps_spd_mix,
                    ul = coxps.spd.ul,
                    ll = coxps.spd.ll),
          bind_cols(Model = "Cox EN",
                    estimand = "HR",
                    value = coxnet_hr_mix,  
                    ll = coxnet.hr.ll,
                    ul = coxnet.hr.ul),
          bind_cols(Model = "Cox EN", 
                    estimand = "SPD",
                    value = coxnet_spd_mix,
                    ul = coxnet.spd.ul,
                    ll = coxnet.spd.ll),
          bind_cols(Model = "MARS", 
                    estimand = "HR",
                    value = mars_hr_mix, 
                    ll = mars_hr_ll,
                    ul = mars_hr_ul),
          bind_cols(Model = "MARS", 
                    estimand = "SPD",
                    value = mars_spd_mix,
                    ul = mars_diff_ul,
                    ll = mars_diff_ll),
          bind_cols(Model = "GPR",
                    estimand = "HR",
                    value = gpr_hr_mix, 
                    ll = gpr_hr_ll,
                    ul = gpr_hr_ul),
          bind_cols(Model = "GPR", 
                    estimand = "SPD",
                    value = gpr_spd_mix,
                    ul = gpr_diff_ul,
                    ll = gpr_diff_ll),
          bind_cols(Model = "BART", 
                    estimand = "HR",
                    value = bart_hr_mix, 
                    ll = bart.hr.ll,
                    ul = bart.hr.ul),          
          bind_cols(Model = "BART", 
                    estimand = "SPD",
                    value = bart_spd_mix,
                    ul = bart.spd.ul,
                    ll = bart.spd.ll),
          bind_cols(Model = "BKMR",
                    estimand = "HR",
                    value = bkmr_hr_mix, 
                    ll = bkmr.hr.ll,
                    ul = bkmr.hr.ul),          
          bind_cols(Model = "BKMR", 
                    estimand = "SPD",
                    value = bkmr_spd_mix,
                    ul = bkmr.spd.ul,
                    ll = bkmr.spd.ll)) %>% 
  ggplot(aes(x = Model, y = value, ymin = ll, ymax = ul)) +
  geom_pointrange() +
  facet_grid(~estimand) +
  geom_hline(data = true_estimands, aes(yintercept = Z), linetype = "dashed") + 
  coord_flip() +
  theme_bw()

The true HR and SPD values are indicated by the vertical dashed lines. We see that all methods include the true values in their confidence bands, with varying degrees of closeness of the point estimates to the true values. As expected, the more flexible modeling methods tend to have wider confidence bands due to greater uncertainty; however, many of them more closely mimicked the true exposure-response curve than the more restrictive models. Thus, the choice of modeling method may depend on the ultimate goal of your analysis!

References

Mayer, M. N., Domingo-Relloso, A., Kioumourtzoglou, M. A., Navas-Acien, A., Coull, B., & Valeri, L. (2023). Comparison of methods for analyzing environmental mixtures effects on survival outcomes and application to a population-based cohort study. arXiv preprint arXiv:2311.01484.

Fahrmeir, L. (2014). Discrete survival-time models. Wiley StatsRef: Statistics Reference Online.

Sparapani, R. A., Logan, B. R., McCulloch, R. E., & Laud, P. W. (2016). Nonparametric survival analysis using Bayesian additive regression trees (BART). Statistics in medicine, 35(16), 2741-2753.

Caputo, B., Sim, K., Furesjo, F., & Smola, A. (2002). Appearance-based object recognition using SVMs: which kernel should I use? In Proceedings of the NIPS Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision, Whistler.