--- title: "Virtual Twins Examples" author: "Francois Vieille" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{full-example} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Contents 1. Introduction 2. Quick preview 3. Sepsis dataset 4. Create object virtual twins 5. Step 1 : compute probabilities 6. Step 2 : Estimate a decision tree 7. Subgroups and results 8. Questions # Introduction The goal of this vignette is to show most of all possibilies with *aVT* (for *aVirtualTwins* meaning *a*daptation of *Virtual Twins* method) package. *VT* method (Jared Foster and al, 2011) has been created to find subgroup of patients with enhanced treatment effect, if it exists. Theorically, this method can be used for binary and continous outcome. This package only deals with binary outcome in a two arms clinical trial. *VT* method is based on random forests and regression/classification trees. I decided to use a simulated dataset called *sepsis* in order to show how *aVT* package can be used. Type `?sepsis` to know more about this dataset. Anyway, the true subgroup is `PRAPACHE <= 26 & AGE <= 49.80`. **NOTE:** This true subgroup is defined with the *lower* event rate (`survival = 1`) in treatement arm. Therefore in following examples we'll search the subgroup with the *highest* event rate, and we know it is `PRAPACHE > 26 & AGE > 49.80`. ----------- # Quick preview ## Dataset Data used in *VT* are modelized by $\left\{Y, T, X_1, \ldots, X_{p-2}\right\}$. $p$ is the number of variables. * $Y$ is a binary outcome. In R, $Y$ is a `factor`. Second level of this factor will be the desirable event. ($Y=1$) * $T$ is treatment variable, $T=1$ means _active treatement_, $T=0$ means _control treatment_. In R, $T$ is numeric. * $X_i$ is covariables, $X_i$ can be categorical, continous, binary. **NOTE:** if you run *VT* with interactions, categorical covariables must be transformed into binary variables. Type `?formatRCTDataset` for details. Related functions/classes in aVirtualTwins package : `VT.object()`, `vt.data()`, `formatRCTDataset`. ## Method *VT* is a two steps method but with many possibilities let $\hat{P_{1i}} = P(Y_i = 1|T_i = 1, X_i)$ let $\hat{P_{0i}} = P(Y_i = 1|T_i = 0, X_i)$ let $X = \left\{X_1, \ldots, X_{p-2}\right\}$ ### First Step * Grow a random forest with data $\left\{Y, T, X \right\}$. * Grow a random forest with interaction treatement / covariable, i.e. $\left\{Y, T, X, XI(T_i=0), XI(T_i=1)\right\}$ * Grow two random forests, one for each treatement: + The first with data $\left\{Y, X \right\}$ where $T_i = 0$ + The second with data $\left\{Y, X \right\}$ where $T_i = 1$ * Build your own model From one of these methods you can estimate $\hat{P_{1i}}$ and $\hat{P_{0i}}$. Related functions/classes in aVirtualTwins package : `VT.difft()`, `vt.forest()`. ### Second Step Define $Z_i = \hat{P_{1i}} - \hat{P_{0i}}$ * Use regression tree to explain $Z$ by covariables $X$. Then subjects with predicted $Z_i$ greater than some threshold $c$ are considered to define a subgroup. * Use classification tree on new variable $Z^{*}$ defined by $Z^{*}_i=1$ if $Z_i > c$ and $Z^{*}_i=0$ otherwise. The idea is to identify which covariable from $X$ described variation of $Z$. Related function in aVirtualTwins package : `vt.tree()`. ----------- # Sepsis dataset See __Introduction__. *Sepsis* dataset is a simulated clinical trial with two groups treatment about sepsis desease. See details. This dataset is taken from [SIDES method](http://biopharmnet.com/subgroup-analysis-software/) *Sepsis* contains simulated data on 470 subjects with a binary outcome survival, that stores survival status for patient after 28 days of treatment, value of 1 for subjects who died after 28 days and 0 otherwise. There are 11 covariates, listed below, all of which are numerical variables. Note that contrary to the original dataset used in SIDES, missing values have been imputed by random forest `randomForest::rfImpute()`. See file *data-raw/sepsis.R* for more details. True subgroup is `PRAPACHE <= 26 & AGE <= 49.80`. __NOTE:__ This subgroup is defined with the *lower* event rate (survival = 1) in treatement arm. 470 patients and 13 variables: * `survival` : binary outcome * `THERAPY` : 1 for active treatment, 0 for control treatment * `TIMFIRST` : Time from first sepsis-organ fail to start drug * `AGE` : Patient age in years * `BLLPLAT` : Baseline local platelets * `blSOFA` : Sum of baselin sofa (cardiovascular, hematology, hepaticrenal, and respiration scores) * `BLLCREAT` : Base creatinine * `ORGANNUM` : Number of baseline organ failures * `PRAPACHE` : Pre-infusion apache-ii score * `BLGCS` : Base GLASGOW coma scale score * `BLIL6` : Baseline serum IL-6 concentration * `BLADL` : Baseline activity of daily living score * `BLLBILI` : Baseline local bilirubin __Source:__ http://biopharmnet.com/subgroup-analysis-software/ ----------- # Create object VirtualTwins In order to begin the two steps of *VT* method, aVirtualTwins package needs to be initialized with `vt.data()` function. type `?vt.data` for more details. __NOTE:__ if running VT with interactions between $T$ and $X$, set `interactions = TRUE`. Code of `vt.data()` : ```{r, collapse=T, eval = F} vt.data <- function(dataset, outcome.field, treatment.field, interactions = TRUE, ...){ data <- formatRCTDataset(dataset, outcome.field, treatment.field, interactions = TRUE) VT.object(data = data, ...) } ``` __Example with Sepsis__ ```{r, collapse=T} # load library VT library(aVirtualTwins) # load data sepsis data(sepsis) # initialize VT.object vt.o <- vt.data(sepsis, "survival", "THERAPY", TRUE) ``` 1 will be the favorable outcome because 1 is the second level of `"survival"` column. It means that $P(Y=1)$ is the probability of interest. Anyway, it's still possible to compute $P(Y=0)$. __Quick example__ *Sepsis* does not have any categorical variable, following example show how `vt.data` deals with categorical values depending on `interactions` parameter ```{r, collapse=T} # Creation of categorical variable cat.x <- rep(1:5, (nrow(sepsis))/5) cat.x <- as.factor(cat.x) sepsis.tmp <- cbind(sepsis, cat.x) vt.o.tmp <- vt.data(sepsis.tmp, "survival", "THERAPY", TRUE) ``` Dummies variables are created for each category of `cat.x` variable. And `cat.x` is removed from dataset. ```{r, collapse=T, echo = FALSE} rm(vt.o.tmp, cat.x, sepsis.tmp) ``` ----------- # Step 1 : compute $\hat{P_{1i}}$ and $\hat{P_{0i}}$ As described earlier, step 1 can be done via differents ways ## Simple Random Forest Following example used *sepsis* data created in previous part. To perform simple random forest on `VT.object`, `randomForest`, `caret` and `party` package can be used. Class `vt.forest("one", ...)` is used. It takes in arguments : * `forest.type` : you have to set it to `"one"` * `vt.data` : return of `vt.data()` function * `model` : a random forest model * `interactions` : logical, `TRUE` is default value * `...` : options to `randomForest()` function __with `randomForest`__ ```{r, collapse=T} # use randomForest::randomForest() library(randomForest, verbose = F) # Reproducibility set.seed(123) # Fit rf model # default params # set interactions to TRUE if using interaction between T and X model.rf <- randomForest(x = vt.o$getX(interactions = T), y = vt.o$getY(), ntree = 500) # initialize VT.forest.one vt.f.rf <- vt.forest("one", vt.data = vt.o, model = model.rf, interactions = T) ### or you can use randomForest inside vt.forest() vt.f.rf <- vt.forest("one", vt.data = vt.o, interactions = T, ntree = 500) ``` __with `party`__ `cforest()` can be usefull however computing time is really long. I think there is an issue when giving *cforest object* in Reference Class parameter. Need to fix it. ```{r, collapse=T} # # use randomForest::randomForest() # library(party, verbose = F) # # Reproducibility # set.seed(123) # # Fit cforest model # # default params # # set interactions to TRUE if using interaction between T and X # model.cf <- cforest(formula = vt.o$getFormula(), data = vt.o$getData(interactions = T)) # # initialize VT.forest.one # vt.f.cf <- vt.forest("one", vt.data = vt.o, model = model.cf) ``` __with `caret`__ Using `caret` can be usefull to deal with parallel computing for example. __NOTE:__ For `caret` levels of outcome can't be 0, so i'll change levels name into "n"/"y" ```{r, collapse=T} # Copy new object vt.o.tr <- vt.o$copy() # Change levels tmp <- ifelse(vt.o.tr$data$survival == 1, "y", "n") vt.o.tr$data$survival <- as.factor(tmp) rm(tmp) # Check new data to be sure formatRCTDataset(vt.o.tr$data, "survival", "THERAPY") # use caret::train() library(caret, verbose = F) # Reproducibility set.seed(123) # fit train model fitControl <- trainControl(classProbs = T, method = "none") model.tr <- train(x = vt.o.tr$getX(interactions = T), y = vt.o.tr$getY(), method = "rf", tuneGrid = data.frame(mtry = 5), trControl = fitControl) # initialize VT.forest.one vt.f.tr <- vt.forest("one", vt.o.tr, model = model.tr) ``` ## Double Random Forest To perform double random forest on `VT.object`, same packages as simple random forest can be used. Function `vt.forest("double", ...)` is used. It takes in arguments : * `forest.type` : You have to set is to `"double"` * `vt.data` : return of `vt.data()` function * `model_trt1` : a random forest model for $T=1$ (this argument has to be specified) * `model_trt0` : a random forest model for $T=0$ (this argument has to be specified) __NOTE:__ use `trt` parameter in `VT.object::getX()` or `VT.object::getY()` methods to obtain part of data depending on treatment. See following example. __with `randomForest`__ ```{r, collapse=T} # grow RF for T = 1 model.rf.trt1 <- randomForest(x = vt.o$getX(trt = 1), y = vt.o$getY(trt = 1)) # grow RF for T = 0 model.rf.trt0 <- randomForest(x = vt.o$getX(trt = 0), y = vt.o$getY(trt = 0)) # initialize VT.forest.double() vt.doublef.rf <- vt.forest("double", vt.data = vt.o, model_trt1 = model.rf.trt1, model_trt0 = model.rf.trt0) ### Or you can use randomForest() inside vt.doublef.rf <- vt.forest("double", vt.data = vt.o, ntree = 200) ``` Follow the same structure for `caret` or `cforest` models. ## K Fold Random Forest This idea is taken from *method 3* of Jared Foster paper : > A modification of [previous methods] is to obtain $\hat{P_{1i}}$ and $\hat{P_{0i}}$ via cross-validation. In this méthod the specific data for subject $i$ is not used to obtain $\hat{P_{1i}}$ and $\hat{P_{0i}}$. Using k-fold cross-validation, we apply random forest regression approach to $\frac{k-1}{k}$ of the data and use the resulting predictor to obtain estimates of $P_{1i}$ and $P_{0i}$ for the remaining $\frac{1}{k}$ of the observations. This is repeated $k$ times. To use this approach, use `vt.forest("fold", ...)`. This class takes in argument : * `forest.type` : it has to be set to `"fold"` * `vt.data` : return of `vt.data()` function * `fold` : number of fold (e.g. $5$) * `ratio` : Control of sampsize balance. `ratio` of $2$ means that there 2 times le highest level compared to the other. "Highest" means the level with larger observations. It's in test. * `interactions` : Logical. If `TRUE`, interactions between covariables and treatments are used. `FALSE` otherwise. * `...` : `randomForest()` function options __NOTE:__ This function use only `randomForest` package. ```{r, collapse=T, cache=F} # initialize k-fold RF # you can use randomForest options model.fold <- vt.forest("fold", vt.data = vt.o, fold = 5, ratio = 1, interactions = T, ntree = 200) ``` ## Build Your Own Model Random Forests are not the only models you can use to compute $\hat{P_{1i}}$ and $\hat{P_{0i}}$. Any prediction model can be used, as logitic regression, boosting ... Anyway, aVirtualTwins package can be used. To do so, you can use `VT.difft()` class. It is important to note this the parent class of all "forests" classes. It takes in argument : * `vt.object` : return of `vt.data()` function * `twin1` : estimate of $P(Y_{i} = 1 | T = T_{i})$ : meaning response probability under the correct treatment. * `twin2` : estimate of $P(Y_{i} = 1 | T = 1-T_{i})$ : meaning response probability under the other treatment. * `method` : _absolute_ (default), _relative_ or _logit_. See `?VT.difft` for details. ```{r, collapse=T} # you get twin1 and twin2 by your own method # here, i'll use random number between 0 and 1 : twin1_random <- runif(470) twin2_random <- runif(470) # then you can initialize VT.difft class : model.difft <- VT.difft(vt.o, twin1 = twin1_random, twin2 = twin2_random, "absolute") # compute difference of twins : model.difft$computeDifft() # See results head(model.difft$difft) # Graph : # hist(model.difft$difft) ``` __NOTE: Also, you can clone repository, write your own child class of `VT.difft()` AND submit it !__ ------------ # Step 2 : Estimate a Regression or Classification Tree As described in the method, we define $Z_i = \hat{P_{1i}} - \hat{P_{0i}}$. It's the difference in term of response of the active treatments compared to the control treatment. The idea is to try to explain this difference by few covariables. ## Classification We define a new variable $Z^{*}$, $Z^{*}_i=1$ if $Z_i > c$ and $Z^{*}_i=0$ otherwise. Classification tree's goal is to explain the value $Z^*=1$. $c$ is a threshold given by the user. It's the threshold for which the difference is "interesting". One idea is to use quantiles of the *difft* distribution. To compute a classifiction tree, `vt.tree("class", ...)` is used. Internally, `rpart::rpart()` is computed. It takes in argument: * `tree.type` : it has to be set to `"class"` * `vt.difft` : `VT.difft` object (return of `vt.forest()` function) * `sens` : `c(">", "<")`. `sens` corresponds to the way $Z^{*}$ is defined. * `">"` (default) : $Z^{*}$, $Z^{*}_i=1$ if $Z_i > c$ and $Z^{*}_i=0$ otherwise. * `"<"` : $Z^{*}$, $Z^{*}_i=1$ if $Z_i < c$ and $Z^{*}_i=0$ otherwise. * `threshold` : corresponds to $c$, it can be a vector. $seq(.5, .8, .1)$ by default. * `screening` : `NULL` is default value. If `TRUE` only covariables in `varimp` `vt.data` 's field is used. See `?VT.tree` for details. ```{r, collapse=T} # initialize classification tree tr.class <- vt.tree("class", vt.difft = vt.f.rf, sens = ">", threshold = quantile(vt.f.rf$difft, seq(.5, .8, .1)), maxdepth = 3, cp = 0, maxcompete = 2) # tr.class is a list if threshold is a vectoor class(tr.class) # acce trees with treeXX class(tr.class$tree1) ``` ## Regression Use regression tree to explain $Z$ by covariables $X$. Then some leafs have predicted $Z_i$ greater than the threshold $c$ (if $sens$ is ">"), and it defines which covariables explain $Z$. The function to use is `vt.tree("reg", ...)`. It takes same parameters than classification mehod. ```{r, collapse=T} # initialize regression tree tr.reg <- vt.tree("reg", vt.difft = vt.f.rf, sens = ">", threshold = quantile(vt.f.rf$difft, seq(.5, .8, .1))) # tr.class is a list if threshold is a vectoor class(tr.reg) # access trees with treeXX class(tr.reg$tree1) ``` ----------- # Subgroups and results Once trees have been computed, you surely want to see what are the subgroups. This package provides a wrapper function of intern methods of `VT.tree` class : `vt.subgroups()`. This function takes in argument : * `vt.tree` : object or list of class `VT.tree`. Return of the `vt.tree()` function. * `only.leaf` : logical. Set `TRUE` (default) to visualize only terminal nodes. * `only.fav` : logical. Set `TRUE` (default) to visualize only class 1 nodes. ($\hat{A}$) * `tables` : logical. Set `FALSE` (default) to prevent tables of incidences from being printed. * `verbose` : logical. Set `FALSE` (default) to prevent detailed stuffs from being printed. * `compete` : logical. Set `TRUE` to print competitors rules thanks to competitors split. `FALSE` is default value. If `vt.tree` is a list, unique subgroups are printed. ```{r, collapse=T} # use tr.class computed previously vt.sbgrps <- vt.subgroups(tr.class) # print tables with knitr package library(knitr) knitr::kable(vt.sbgrps) ``` You can plot one tree with package `rpart.plot` ```{r, collapse=T, echo=F, fig.align='center', fig.height=4, fig.width=6} library(rpart.plot) rpart.plot(tr.class$tree2$tree, type = 1, extra = 1) ``` If you want to see competitors split : ```{r, collapse=T} tr.class$tree2$createCompetitors() head(tr.class$tree2$competitors) ``` If you want to print incidence of a subgroup : ```{r, collapse=T} vt.o$getIncidences("PRAPACHE >= 26 & AGE >= 52") # or # tr.class$tree2$getIncidences("PRAPACHE >= 26 & AGE >= 52") ``` If you want to get infos about the tree ```{r, collapse=T} tr.class$tree2$getInfos() # access Ahat # tr.class$tree2$Ahat ``` You can re-run rpart computation: ```{r, collapse=T} tr.class$tree2$run(maxdepth = 2) ``` Type `?VT.tree` for details. ----------- # Questions This vignette is a bit messy right now, therefore feel free to ask anything to the repository issue reports : 1. https://github.com/prise6/aVirtualTwins 2. read documentation with `?aVirtualTwins`