--- title: "Simulating a simulation study" author: "Alessandro Gasparini" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Simulating a simulation study} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include = FALSE} options(width = 150) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", fig.height = 6, fig.width = 6, out.width = "75%" ) ``` # Introduction In this vignette, we show how the simulated data included as an example dataset in `simsum` has been generated. # Motivation Say we want to run a simulation study in which we want to compare the sensitivity of parametric and semiparametric survival models on relative risk estimates. # Data generating mechanisms We simulate an hypothetical trial with a binary treatment. We fix the log-treatment effect to $-0.50$, and we generate a treatment indicator variable for each simulated individual via a $Binom(1, 0.5)$ random variable. We simulate two different sample sizes (50 and 250 individuals) and we assume two different baseline hazard functions: exponential with scale parameter $\lambda = 0.5$, and Weibull with scale parameter $\lambda = 0.5$ and shape parameter $\gamma = 1.5$. Finally, we apply administrative censoring at time $t = 5$. ```{r baseline-hazards} exp_basehaz <- function(t, lambda = 0.5) lambda * 1 * t^0 exp_weibull <- function(t, lambda = 0.5, gamma = 1.5) lambda * gamma * t^(gamma - 1) curve(exp_basehaz, from = 0, to = 5, lty = 1, ylim = c(0, 2), ylab = expression(h[0](t)), xlab = "Follow-up time t") curve(exp_weibull, from = 0, to = 5, lty = 2, add = TRUE) legend(x = "topleft", lty = 1:2, legend = c("Exponential baseline hazard", "Weibull baseline hazard"), bty = "n") ``` The survival times are estimated using the approach of Bender _et al_. (2005), based on drawing from a $U(0, 1)$ random variable and applying the following transformations: 1. for an exponential baseline hazard, the survival time $t$ is simulated as: $$t = -\frac{log(U)}{\lambda \exp(\beta ^ T X)}$$ 2. for a Weibull baseline hazard, the survival time $t$ is simulated as: $$t = \left(-\frac{log(U)}{\lambda \exp(\beta ^ T X)}\right) ^ {1 / \gamma}$$ The R function to simulate a dataset for our simulation study is defined as follows: ```{r dgfun} simulate_data <- function(dataset, n, baseline, params = list(), coveff = -0.50) { # Simulate treatment indicator variable x <- rbinom(n = n, size = 1, prob = 0.5) # Draw from a U(0,1) random variable u <- runif(n) # Simulate survival times depending on the baseline hazard if (baseline == "Exponential") { t <- -log(u) / (params$lambda * exp(x * coveff)) } else { t <- (-log(u) / (params$lambda * exp(x * coveff)))^(1 / params$gamma) } # Winsorising tiny values for t (smaller than one day on a yearly-scale, e.g. 1 / 365.242), and adding a tiny amount of white noise not to have too many concurrent values t <- ifelse(t < 1 / 365.242, 1 / 365.242, t) t[t == 1 / 365.242] <- t[t == 1 / 365.242] + rnorm(length(t[t == 1 / 365.242]), mean = 0, sd = 1e-4) # ...and make sure that the resulting value is positive t <- abs(t) # Make event indicator variable applying administrative censoring at t = 5 d <- as.numeric(t < 5) t <- pmin(t, 5) # Return a data.frame data.frame(dataset = dataset, x = x, t = t, d = d, n = n, baseline = baseline, stringsAsFactors = FALSE) } ``` # Methods We compare the Cox model (Cox, 1972) with a fully parametric survival model assuming an exponential baseline hazard and a flexible parametric model with 2 degrees of freedom for modelling the baseline hazard (Royston and Parmar, 2002). The Cox model can be fit via the `coxph` function from the `survival` package, the exponential model can be fit via the `phreg` function from the `eha` package, and the Royston-Parmar model can be fixed via the `stpm2` function from the `rstpm2` package. # Performance measures Say we are interested in the following performance measures: * Bias in the estimated log-treatment effect, and corresponding $95\%$ Monte Carlo confidence intervals * Coverage of confidence intervals for the log-treatment effect, defined as the proportion of simulated data sets for which the true log-treatment effect of $-0.50$ lies within the $95\%$ confidence intervals obtained from the model # Sample size We are primarily interested in bias, and assume that the variance of the estimated log-treatment effect is $0.1$. The Monte Carlo standard error for the bias is: $$\text{MCSE} = \sqrt{\frac{\text{variance}}{\# \text{simulations}}}$$ Aiming for a Monte Carlo standard error of 0.01 on the estimated bias, we would require $1,000$ replications. The Monte Carlo standard error for coverage is: $$\text{MCSE} = \sqrt{\frac{\text{coverage} \times (1 - \text{coverage})}{\# \text{simulations}}}$$ This Monte Carlo standard error is maximised for a coverage = $0.5$. In that setting, the Monte Carlo standard error with $1,000$ replications would be $0.01581139$, which is deemed to be acceptable. Therefore, we should run $1,000$ replications of this simulation study. However, for simplicity, we will run $100$ replications only to speed up the process. # Running the simulation study ## Generate data We generate $100$ datasets for each data-generating mechanism. First, we set a random seed for reproducibility: ```{r set-seed} set.seed(755353002) ``` Then, we simulate the data: ```{r generate-data} reps <- 1:100 data <- list() data[["n = 50, baseline = Exp"]] <- lapply( X = reps, FUN = simulate_data, n = 50, baseline = "Exponential", params = list(lambda = 0.5) ) data[["n = 250, baseline = Exp"]] <- lapply( X = reps, FUN = simulate_data, n = 250, baseline = "Exponential", params = list(lambda = 0.5) ) data[["n = 50, baseline = Wei"]] <- lapply( X = reps, FUN = simulate_data, n = 50, baseline = "Weibull", params = list(lambda = 0.5, gamma = 1.5) ) data[["n = 250, baseline = Wei"]] <- lapply( X = reps, FUN = simulate_data, n = 250, baseline = "Weibull", params = list(lambda = 0.5, gamma = 1.5) ) ``` ## Run models We define a function to fit the models of interest: ```{r fitting-function} library(survival) library(rstpm2) library(eha) fit_models <- function(data, model) { # Fit model if (model == "Cox") { fit <- survival::coxph(Surv(t, d) ~ x, data = data) } else if (model == "RP(2)") { fit <- rstpm2::stpm2(Surv(t, d) ~ x, data = data, df = 2) } else { fit <- eha::phreg(Surv(t, d) ~ x, data = data, dist = "weibull", shape = 1) } # Return relevant coefficients data.frame( dataset = unique(data$dataset), n = unique(data$n), baseline = unique(data$baseline), theta = coef(fit)["x"], se = sqrt(ifelse(model == "Exp", fit$var["x", "x"], vcov(fit)["x", "x"])), model = model, stringsAsFactors = FALSE, row.names = NULL ) } ``` We now run the models for each simulated dataset: ```{r run-models} results <- list() results[["n = 50, baseline = Exp, model = Cox"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 50, baseline = Exp"]], FUN = fit_models, model = "Cox" ) ) results[["n = 250, baseline = Exp, model = Cox"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 250, baseline = Exp"]], FUN = fit_models, model = "Cox" ) ) results[["n = 50, baseline = Wei, model = Cox"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 50, baseline = Wei"]], FUN = fit_models, model = "Cox" ) ) results[["n = 250, baseline = Wei, model = Cox"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 250, baseline = Wei"]], FUN = fit_models, model = "Cox" ) ) results[["n = 50, baseline = Exp, model = Exp"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 50, baseline = Exp"]], FUN = fit_models, model = "Exp" ) ) results[["n = 250, baseline = Exp, model = Exp"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 250, baseline = Exp"]], FUN = fit_models, model = "Exp" ) ) results[["n = 50, baseline = Wei, model = Exp"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 50, baseline = Wei"]], FUN = fit_models, model = "Exp" ) ) results[["n = 250, baseline = Wei, model = Exp"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 250, baseline = Wei"]], FUN = fit_models, model = "Exp" ) ) results[["n = 50, baseline = Exp, model = RP(2)"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 50, baseline = Exp"]], FUN = fit_models, model = "RP(2)" ) ) results[["n = 250, baseline = Exp, model = RP(2)"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 250, baseline = Exp"]], FUN = fit_models, model = "RP(2)" ) ) results[["n = 50, baseline = Wei, model = RP(2)"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 50, baseline = Wei"]], FUN = fit_models, model = "RP(2)" ) ) results[["n = 250, baseline = Wei, model = RP(2)"]] <- do.call( rbind.data.frame, lapply( X = data[["n = 250, baseline = Wei"]], FUN = fit_models, model = "RP(2)" ) ) ``` ## Aggregating results ```{r aggregate-results} relhaz <- do.call( rbind.data.frame, results ) row.names(relhaz) <- NULL ``` We save the final results, that will be included as an example in the R package `rsimsum`. ```{r saving, eval = FALSE} library(usethis) usethis::use_data(relhaz, overwrite = TRUE) ``` ## Summarising results Finally, we obtain summary statistics by calling the `simsum` function: ```{r compute-summaries} library(rsimsum) s <- rsimsum::simsum(data = relhaz, estvarname = "theta", se = "se", true = -0.50, methodvar = "model", ref = "Cox", by = c("n", "baseline")) s ``` ```{r print-summaries} summary(s) ``` # Conclusions With this vignette we showed how to simulate survival data and run a small, simple simulation study. # References * Cox D.R. _Regression models and life-tables_. Journal of the Royal Statistical Society, Series B (Methodological), 1972, 34(2):187-220 * Royston P. and Parmar M.K. _Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects_. Statistics in Medicine, 2002, 21(15):2175-2197 * Bender R., Augustin T., and Blettner M. _Generating survival times to simulate Cox proportional hazards models_. Statistics in Medicine, 2005, 24(11):1713-1723