diff --git a/inst/doc/full-example.R b/inst/doc/full-example.R index 6260383..332df9a 100644 --- a/inst/doc/full-example.R +++ b/inst/doc/full-example.R @@ -34,6 +34,8 @@ model.rf <- randomForest(x = vt.o$getX(interactions = T), y = vt.o$getY()) # initialize VT.forest.one vt.f.rf <- VT.forest.one(vt.o, model.rf) +# Then, use run() to compute probabilities +vt.f.rf$run() ## ------------------------------------------------------------------------ # # use randomForest::randomForest() @@ -46,6 +48,8 @@ vt.f.rf <- VT.forest.one(vt.o, model.rf) # model.cf <- cforest(formula = vt.o$getFormula(), data = vt.o$getData(interactions = T)) # # initialize VT.forest.one # vt.f.cf <- VT.forest.one(vt.o, model.cf) +# # Then, use run() to compute probabilities +# vt.f.cf$run() ## ------------------------------------------------------------------------ # Copy new object @@ -69,4 +73,52 @@ model.tr <- train(x = vt.o.tr$getX(interactions = T), trControl = fitControl) # initialize VT.forest.one vt.f.tr <- VT.forest.one(vt.o.tr, model.tr) +# Then, use run() to compute probabilities +vt.f.tr$run() + +## ------------------------------------------------------------------------ +# grow RF for T = 1 +model.rf.trt1 <- randomForest(x = vt.o$getX(trt = 1), + y = vt.o$getY(trt = 1)) +# grow RF for T = 0 +model.rf.trt0 <- randomForest(x = vt.o$getX(trt = 0), + y = vt.o$getY(trt = 0)) +# initialize VT.forest.double() +vt.doublef.rf <- VT.forest.double(vt.o, model.rf.trt1, model.rf.trt0) +# Then, use run() to compute probabilities +vt.doublef.rf$run() + +## ---- cache=TRUE--------------------------------------------------------- +# initialize k-fold RF +model.fold <- VirtualTwins:::VT.forest.fold(vt.o, fold = 5, ratio = 1, interactions = T) +# grow RF with randomForest package options +# set do.trace option to see the 5 folds +model.fold$run(ntree = 500, do.trace = 500) + +## ------------------------------------------------------------------------ +# you get twin1 and twin2 by your own method +# here, i'll use random number between 0 and 1 : +twin1_random <- runif(470) +twin2_random <- runif(470) + +# then you can initialize VT.difft class : +model.difft <- VT.difft(vt.o, twin1 = twin1_random, twin2 = twin2_random, "absolute") +# compute difference of twins : +model.difft$computeDifft() +# See results +head(model.difft$difft) +# Graph : +# hist(model.difft$difft) + +## ------------------------------------------------------------------------ +# initialize classification tree +tr.class <- VT.tree.class(vt.f.rf, sens = ">", threshold = quantile(vt.f.rf$difft, 0.7)) +# compute tree with rpart option +tr.class$run(cp = 0, maxdepth = 3, maxcompete = 1) + +## ------------------------------------------------------------------------ +# initialize regression tree +tr.reg <- VT.tree.reg(vt.f.rf, sens = ">", threshold = quantile(vt.f.rf$difft, 0.7)) +# compute tree with rpart option +tr.reg$run(cp = 0, maxdepth = 3, maxcompete = 1) diff --git a/inst/doc/full-example.Rmd b/inst/doc/full-example.Rmd index 266785d..d1b6028 100644 --- a/inst/doc/full-example.Rmd +++ b/inst/doc/full-example.Rmd @@ -5,22 +5,25 @@ date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > - %\VignetteIndexEntry{Vignette Title} + %\VignetteIndexEntry{full-example} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction -The goal of this vignette is to show most of all possibilies with *VT* (for *VirtualTwins*) package. +The goal of this vignette is to show most of all possibilies with *aVT* (for *aVirtualTwins* meaning *a*daptation of *Virtual Twins* method) package. *VT* method (Jared Foster and al, 2011) has been created to find subgroup of patients with enhanced treatment effect, if it exists. Theorically, this method can be used for binary and continous outcome. This package only deals with binary outcome in a two arms clinical trial. *VT* method is based on random forests and regression/classification trees. -I decided to use a simulated dataset called *sepsis* in order to show how *VT* package can be used. Type `?sepsis` to know more about this dataset. Anyway, the true subgroup is `PRAPACHE <= 26 & AGE <= 49.80`. +I decided to use a simulated dataset called *sepsis* in order to show how *aVT* package can be used. Type `?sepsis` to know more about this dataset. Anyway, the true subgroup is `PRAPACHE <= 26 & AGE <= 49.80`. -**NOTE:** This true subgroup is defined with the *lower* event rate (`survival = 1`) in treatement arm. Therefore in following examples we'll search the subgroup with the *highest* event rate, and we know it is `PRAPACHE > 26 & AGE > 49.80`. +**NOTE:** This true subgroup is defined with the *lower* event rate (`survival = 1`) in treatement arm. Therefore in following examples we'll search the subgroup with the *highest* event rate, and we know it is +`PRAPACHE > 26 & AGE > 49.80`. + +----------- # Quick preview @@ -36,7 +39,7 @@ Data used in *VT* are modelized by $\left\{Y, T, X_1, \ldots, X_{p-2}\right\}$. Type `?formatRCTDataset` for details. -Related functions/classes in VirtualTwins package : `VT.object()`, `vt.data()`, `formatRCTDataset`. +Related functions/classes in aVirtualTwins package : `VT.object()`, `vt.data()`, `formatRCTDataset`. ## Method @@ -49,11 +52,14 @@ let $X = \left\{X_1, \ldots, X_{p-2}\right\}$ ### First Step * Grow a random forest with data $\left\{Y, T, X \right\}$. * Grow a random forest with interaction treatement / covariable, i.e. $\left\{Y, T, X, XI(T_i=0), XI(T_i=1)\right\}$ -* Grow two random forests, one for each treatement. +* Grow two random forests, one for each treatement: + + The first with data $\left\{Y, X \right\}$ where $T_i = 0$ + + The second with data $\left\{Y, X \right\}$ where $T_i = 1$ +* Build your own model From one of these methods you can estimate $\hat{P_{1i}}$ and $\hat{P_{0i}}$. -Related functions/classes in VirtualTwins package : `VT.difft()`, `VT.forest()`, `VT.forest.one()`, `VT.forest.double()`, `VT.forest.fold()`. +Related functions/classes in aVirtualTwins package : `VT.difft()`, `VT.forest()`, `VT.forest.one()`, `VT.forest.double()`, `VT.forest.fold()`. ### Second Step @@ -64,17 +70,50 @@ Define $Z_i = \hat{P_{1i}} - \hat{P_{0i}}$ The idea is to identify which covariable from $X$ described variation of $Z$. -Related functions/classes in VirtualTwins package : `VT.tree()`, `VT.tree.class()`, `VT.tree.reg()`. +Related functions/classes in aVirtualTwins package : `VT.tree()`, `VT.tree.class()`, `VT.tree.reg()`. + + +----------- + # Sepsis dataset See __Introduction__. -# Examples +*Sepsis* dataset is a simulated clinical trial with two groups treatment about sepsis desease. See details. +This dataset is taken from [SIDES method](http://biopharmnet.com/wiki/Software_for_subgroup_identification_and_analysis) -## Create object VirtualTwins +*Sepsis* contains simulated data on 470 subjects with a binary outcome survival, that stores survival status for patient after 28 days of treatment, value of 1 for subjects who died after 28 days and 0 otherwise. There are 11 covariates, listed below, all of which are numerical variables. + +Note that contrary to the original dataset used in SIDES, missing values have been imputed by random forest `randomForest::rfImpute()`. See file *data-raw/sepsis.R* for more details. -In order to begin the two steps of *VT* method, VirtualTwins package need to be initialized with `vt.data()` function. +True subgroup is `PRAPACHE <= 26 & AGE <= 49.80`. __NOTE:__ This subgroup is defined with the *lower* event rate (survival = 1) in treatement arm. + +470 patients and 13 variables: + +* `survival` : binary outcome +* `THERAPY` : 1 for active treatment, 0 for control treatment +* `TIMFIRST` : Time from first sepsis-organ fail to start drug +* `AGE` : Patient age in years +* `BLLPLAT` : Baseline local platelets +* `blSOFA` : Sum of baselin sofa (cardiovascular, hematology, hepaticrenal, and respiration scores) +* `BLLCREAT` : Base creatinine +* `ORGANNUM` : Number of baseline organ failures +* `PRAPACHE` : Pre-infusion apache-ii score +* `BLGCS` : Base GLASGOW coma scale score +* `BLIL6` : Baseline serum IL-6 concentration +* `BLADL` : Baseline activity of daily living score +* `BLLBILI` : Baseline local bilirubin + +__Source:__ http://biopharmnet.com/wiki/Software_for_subgroup_identification_and_analysis + + +----------- + + +# Create object VirtualTwins + +In order to begin the two steps of *VT* method, aVirtualTwins package needs to be initialized with `vt.data()` function. type `?vt.data` for more details. __NOTE:__ if running VT with interactions between $T$ and $X$, set `interactions = TRUE`. @@ -116,16 +155,25 @@ rm(vt.o.tmp, cat.x, sepsis.tmp) ``` -## Step 1 : compute $\hat{P_{1i}}$ and $\hat{P_{0i}}$ +----------- + + +# Step 1 : compute $\hat{P_{1i}}$ and $\hat{P_{0i}}$ As described earlier, step 1 can be done via differents ways -### Simple Random Forest +## Simple Random Forest Following example used *sepsis* data created in previous part. To perform simple random forest on `VT.object`, `randomForest`, `caret` and `party` package can be used. +Class `VT.forest.one` is used. It takes in arguments : + +* `vt.object` : return of `vt.data()` function +* `model` : A random forest model +* `interactions` : logical, `TRUE` is default value + __with `randomForest`__ ```{r} # use randomForest::randomForest() @@ -139,6 +187,8 @@ model.rf <- randomForest(x = vt.o$getX(interactions = T), y = vt.o$getY()) # initialize VT.forest.one vt.f.rf <- VT.forest.one(vt.o, model.rf) +# Then, use run() to compute probabilities +vt.f.rf$run() ``` __with `party`__ @@ -156,6 +206,8 @@ __with `party`__ # model.cf <- cforest(formula = vt.o$getFormula(), data = vt.o$getData(interactions = T)) # # initialize VT.forest.one # vt.f.cf <- VT.forest.one(vt.o, model.cf) +# # Then, use run() to compute probabilities +# vt.f.cf$run() ``` __with `caret`__ @@ -186,13 +238,139 @@ model.tr <- train(x = vt.o.tr$getX(interactions = T), trControl = fitControl) # initialize VT.forest.one vt.f.tr <- VT.forest.one(vt.o.tr, model.tr) +# Then, use run() to compute probabilities +vt.f.tr$run() ``` -### Double Random Forest +## Double Random Forest To perform double random forest on `VT.object`, same packages as simple random forest can be used. +Class `VT.forest.double` is used. It takes in arguments : + +* `vt.object` : return of `vt.data()` function +* `model_trt1` : a random forest model for $T=1$ +* `model_trt0` : a random forest model for $T=0$ + + +__NOTE:__ use `trt` parameter in `VT.object::getX()` or `VT.object::getY()` methods to obtain part of data depending on treatment. See following example. + +__with `randomForest`__ +```{r} +# grow RF for T = 1 +model.rf.trt1 <- randomForest(x = vt.o$getX(trt = 1), + y = vt.o$getY(trt = 1)) +# grow RF for T = 0 +model.rf.trt0 <- randomForest(x = vt.o$getX(trt = 0), + y = vt.o$getY(trt = 0)) +# initialize VT.forest.double() +vt.doublef.rf <- VT.forest.double(vt.o, model.rf.trt1, model.rf.trt0) +# Then, use run() to compute probabilities +vt.doublef.rf$run() +``` + +Follow the same structure for `caret` or `cforest` models. + +## K Fold Random Forest + +This idea is taken from *method 3* of Jared Foster paper : + +> A modification of [previous methods] is to obtain $\hat{P_{1i}}$ and $\hat{P_{0i}}$ via cross-validation. In this méthod the specific data for subject $i$ is not used to obtain $\hat{P_{1i}}$ and $\hat{P_{0i}}$. Using k-fold cross-validation, we apply random forest regression approach to $\frac{k-1}{k}$ of the data and use the resulting predictor to obtain estimates of $P_{1i}$ and $P_{0i}$ for the remaining $\frac{1}{k}$ of the observations. This is repeated $k$ times. + +To use this approach, type `VirtualTwins:::VT.forest.fold()`. This class takes in argument : + +* `vt.object` : return of `vt.data()` function +* `fold` : number of fold (e.g. $5$) +* `ratio` : Control of sampsize balance. `ratio` of $2$ means that there 2 times le highest level compared to the other. "Highest" means the level with larger observations. It's in test. +* `interactions` : Logical. If `TRUE`, interactions between covariables and treatments are used. `FALSE` otherwise. + +__NOTE:__ This function use only `randomForest` package. + +```{r, cache=TRUE} +# initialize k-fold RF +model.fold <- VirtualTwins:::VT.forest.fold(vt.o, fold = 5, ratio = 1, interactions = T) +# grow RF with randomForest package options +# set do.trace option to see the 5 folds +model.fold$run(ntree = 500, do.trace = 500) +``` + + +## Build Your Own Model + +Random Forests are not the only models you can use to compute $\hat{P_{1i}}$ and $\hat{P_{0i}}$. Any prediction model can be used, as logitic regression, boosting ... + +Anyway, aVirtualTwins package can be used. To do so, you can use `VT.difft()` class. It is important to note this the parent class of all "forests" classes. It takes in argument : + +* `vt.object` : return of `vt.data()` function +* `twin1` : estimate of $P(Y_{i} = 1 | T = T_{i})$ : meaning response probability under the correct treatment. +* `twin1` : estimate of $P(Y_{i} = 1 | T = 1-T_{i})$ : meaning response probability under the other treatment. +* `method` : _absolute_ (default), _relative_ or _logit_. See `?VT.difft()` for details. + +```{r} +# you get twin1 and twin2 by your own method +# here, i'll use random number between 0 and 1 : +twin1_random <- runif(470) +twin2_random <- runif(470) + +# then you can initialize VT.difft class : +model.difft <- VT.difft(vt.o, twin1 = twin1_random, twin2 = twin2_random, "absolute") +# compute difference of twins : +model.difft$computeDifft() +# See results +head(model.difft$difft) +# Graph : +# hist(model.difft$difft) +``` + +__NOTE: Also, you can clone repository, write your own child class of `VT.difft()` AND submit it !__ + + +------------ + +# Step 2 : Estimate a Regression or Classification Tree + +As described in the method, we define $Z_i = \hat{P_{1i}} - \hat{P_{0i}}$. It's the difference in term of response of the active treatments compared to the control treatment. The idea is to try to explain this difference by few covariables. + +## Classification + +We define a new variable $Z^{*}$, $Z^{*}_i=1$ if $Z_i > c$ and $Z^{*}_i=0$ otherwise. Classification tree's goal is to explain the value $Z^*=1$. +$c$ is a threshold given by the user. It's the threshold for which the difference is "interesting". One idea is to use quantiles of the *difft* distribution. + +To compute a classifiction tree, `VT.tree.class()` is used. Internally, `rpart::rpart()` is computed. It takes in argument: + +* `vt.difft` : `VT.difft` object +* `threshold` : $0.05$ by default, corresponds to $c$ +* `sens` : `c(">", "<")`. `sens` corresponds to the way $Z^{*}$ is defined. + * `">"` (default) : $Z^{*}$, $Z^{*}_i=1$ if $Z_i > c$ and $Z^{*}_i=0$ otherwise. + * `"<"` : $Z^{*}$, $Z^{*}_i=1$ if $Z_i < c$ and $Z^{*}_i=0$ otherwise. +* `screening` : `NULL` is default value. If `TRUE` only covariables in `varimp` `vt.object` 's field is used. + +See `?VT.tree` for details. + +```{r} +# initialize classification tree +tr.class <- VT.tree.class(vt.f.rf, sens = ">", threshold = quantile(vt.f.rf$difft, 0.7)) +# compute tree with rpart option +tr.class$run(cp = 0, maxdepth = 3, maxcompete = 1) +``` + + +## Regression + +Use regression tree to explain $Z$ by covariables $X$. Then some leafs have predicted $Z_i$ greater than the threshold $c$ (if $sens$ is ">"), and it defines which covariables explain $Z$. + +The class to use is `VT.tree.reg()`. It takes same parameters than classification mehod. + +```{r} +# initialize regression tree +tr.reg <- VT.tree.reg(vt.f.rf, sens = ">", threshold = quantile(vt.f.rf$difft, 0.7)) +# compute tree with rpart option +tr.reg$run(cp = 0, maxdepth = 3, maxcompete = 1) +``` + + +## Subgroups and results diff --git a/inst/doc/full-example.html b/inst/doc/full-example.html index 4c40410..5240281 100644 --- a/inst/doc/full-example.html +++ b/inst/doc/full-example.html @@ -10,7 +10,7 @@ - + Virtual Twins Examples @@ -54,17 +54,18 @@ code > span.er { color: #ff0000; font-weight: bold; }

Introduction

-

The goal of this vignette is to show most of all possibilies with VT (for VirtualTwins) package.

+

The goal of this vignette is to show most of all possibilies with aVT (for aVirtualTwins meaning adaptation of Virtual Twins method) package.

VT method (Jared Foster and al, 2011) has been created to find subgroup of patients with enhanced treatment effect, if it exists. Theorically, this method can be used for binary and continous outcome. This package only deals with binary outcome in a two arms clinical trial.

VT method is based on random forests and regression/classification trees.

-

I decided to use a simulated dataset called sepsis in order to show how VT package can be used. Type ?sepsis to know more about this dataset. Anyway, the true subgroup is PRAPACHE <= 26 & AGE <= 49.80.

+

I decided to use a simulated dataset called sepsis in order to show how aVT package can be used. Type ?sepsis to know more about this dataset. Anyway, the true subgroup is PRAPACHE <= 26 & AGE <= 49.80.

NOTE: This true subgroup is defined with the lower event rate (survival = 1) in treatement arm. Therefore in following examples we’ll search the subgroup with the highest event rate, and we know it is PRAPACHE > 26 & AGE > 49.80.

+

Quick preview

@@ -78,7 +79,7 @@ code > span.er { color: #ff0000; font-weight: bold; }

NOTE: if you run VT with interactions, categorical covariables must be transformed into binary variables.

Type ?formatRCTDataset for details.

-

Related functions/classes in VirtualTwins package : VT.object(), vt.data(), formatRCTDataset.

+

Related functions/classes in aVirtualTwins package : VT.object(), vt.data(), formatRCTDataset.

Method

@@ -89,10 +90,15 @@ code > span.er { color: #ff0000; font-weight: bold; }

From one of these methods you can estimate \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\).

-

Related functions/classes in VirtualTwins package : VT.difft(), VT.forest(), VT.forest.one(), VT.forest.double(), VT.forest.fold().

+

Related functions/classes in aVirtualTwins package : VT.difft(), VT.forest(), VT.forest.one(), VT.forest.double(), VT.forest.fold().

Second Step

@@ -102,19 +108,40 @@ code > span.er { color: #ff0000; font-weight: bold; }
  • Use classification tree on new variable \(Z^{*}\) defined by \(Z^{*}_i=1\) if \(Z_i > c\) and \(Z^{*}_i=0\) otherwise.
  • The idea is to identify which covariable from \(X\) described variation of \(Z\).

    -

    Related functions/classes in VirtualTwins package : VT.tree(), VT.tree.class(), VT.tree.reg().

    +

    Related functions/classes in aVirtualTwins package : VT.tree(), VT.tree.class(), VT.tree.reg().

    +

    Sepsis dataset

    See Introduction.

    +

    Sepsis dataset is a simulated clinical trial with two groups treatment about sepsis desease. See details. This dataset is taken from SIDES method

    +

    Sepsis contains simulated data on 470 subjects with a binary outcome survival, that stores survival status for patient after 28 days of treatment, value of 1 for subjects who died after 28 days and 0 otherwise. There are 11 covariates, listed below, all of which are numerical variables.

    +

    Note that contrary to the original dataset used in SIDES, missing values have been imputed by random forest randomForest::rfImpute(). See file data-raw/sepsis.R for more details.

    +

    True subgroup is PRAPACHE <= 26 & AGE <= 49.80. NOTE: This subgroup is defined with the lower event rate (survival = 1) in treatement arm.

    +

    470 patients and 13 variables:

    + +

    Source: http://biopharmnet.com/wiki/Software_for_subgroup_identification_and_analysis

    +
    -
    -

    Examples

    -
    -

    Create object VirtualTwins

    -

    In order to begin the two steps of VT method, VirtualTwins package need to be initialized with vt.data() function. type ?vt.data for more details.

    +
    +

    Create object VirtualTwins

    +

    In order to begin the two steps of VT method, aVirtualTwins package needs to be initialized with vt.data() function. type ?vt.data for more details.

    NOTE: if running VT with interactions between \(T\) and \(X\), set interactions = TRUE.

    Code of vt.data() :

    vt.data <- function(dataset, outcome.field, treatment.field, interactions = TRUE, ...){
    @@ -145,18 +172,27 @@ vt.o.tmp <- vt.data(sepsis.tm
     ## Dummy variable cat.x_4 created 
     ## Dummy variable cat.x_5 created

    Dummies variables are created for each category of cat.x variable. And cat.x is removed from dataset.

    +
    -
    -

    Step 1 : compute \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\)

    +
    +

    Step 1 : compute \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\)

    As described earlier, step 1 can be done via differents ways

    -
    -

    Simple Random Forest

    +
    +

    Simple Random Forest

    Following example used sepsis data created in previous part.

    To perform simple random forest on VT.object, randomForest, caret and party package can be used.

    +

    Class VT.forest.one is used. It takes in arguments :

    +
      +
    • vt.object : return of vt.data() function
    • +
    • model : A random forest model
    • +
    • interactions : logical, TRUE is default value
    • +

    with randomForest

    # use randomForest::randomForest()
    -library(randomForest, verbose = F)
    -# Reproducibility
    +library(randomForest, verbose = F)
    +
    ## randomForest 4.6-10
    +## Type rfNews() to see new features/changes/bug fixes.
    +
    # Reproducibility
     set.seed(123)
     # Fit rf model 
     # default params
    @@ -164,7 +200,9 @@ vt.o.tmp <- vt.data(sepsis.tm
     model.rf <- randomForest(x = vt.o$getX(interactions = T),
                              y = vt.o$getY())
     # initialize VT.forest.one
    -vt.f.rf <- VT.forest.one(vt.o, model.rf)
    +vt.f.rf <- VT.forest.one(vt.o, model.rf) +# Then, use run() to compute probabilities +vt.f.rf$run()

    with party

    cforest() can be usefull however computing time is really long. I think there is an issue when giving cforest object in Reference Class parameter. Need to fix it.

    # # use randomForest::randomForest()
    @@ -176,7 +214,9 @@ vt.f.rf <- VT.forest.one(vt.o
     # # set interactions to TRUE if using interaction between T and X
     # model.cf <- cforest(formula = vt.o$getFormula(), data = vt.o$getData(interactions = T))
     # # initialize VT.forest.one
    -# vt.f.cf <- VT.forest.one(vt.o, model.cf)
    +# vt.f.cf <- VT.forest.one(vt.o, model.cf) +# # Then, use run() to compute probabilities +# vt.f.cf$run()

    with caret

    Using caret can be usefull to deal with parallel computing for example.

    NOTE: For caret levels of outcome can’t be 0, so i’ll change levels name into “n”/“y”

    @@ -190,8 +230,10 @@ vt.o.tr$data$survival <- as.factorformatRCTDataset(vt.o.tr$data, "survival", "THERAPY")
    ## "y" will be the favorable outcome
    # use caret::train()
    -library(caret, verbose = F)
    -# Reproducibility
    +library(caret, verbose = F)
    +
    ## Loading required package: lattice
    +## Loading required package: ggplot2
    +
    # Reproducibility
     set.seed(123)
     # fit train model
     fitControl <- trainControl(classProbs = T, method = "none")
    @@ -201,12 +243,125 @@ model.tr <- train(tuneGrid = data.frame(mtry = 5),
                       trControl = fitControl)
     # initialize VT.forest.one
    -vt.f.tr <- VT.forest.one(vt.o.tr, model.tr)
    +vt.f.tr <- VT.forest.one(vt.o.tr, model.tr) +# Then, use run() to compute probabilities +vt.f.tr$run()
    -
    -

    Double Random Forest

    +
    +

    Double Random Forest

    To perform double random forest on VT.object, same packages as simple random forest can be used.

    +

    Class VT.forest.double is used. It takes in arguments :

    +
      +
    • vt.object : return of vt.data() function
    • +
    • model_trt1 : a random forest model for \(T=1\)
    • +
    • model_trt0 : a random forest model for \(T=0\)
    • +
    +

    NOTE: use trt parameter in VT.object::getX() or VT.object::getY() methods to obtain part of data depending on treatment. See following example.

    +

    with randomForest

    +
    # grow RF for T = 1
    +model.rf.trt1 <- randomForest(x = vt.o$getX(trt = 1),
    +                              y = vt.o$getY(trt = 1))
    +# grow RF for T = 0
    +model.rf.trt0 <- randomForest(x = vt.o$getX(trt = 0),
    +                              y = vt.o$getY(trt = 0))
    +# initialize VT.forest.double()
    +vt.doublef.rf <- VT.forest.double(vt.o, model.rf.trt1, model.rf.trt0)
    +# Then, use run() to compute probabilities
    +vt.doublef.rf$run()
    +

    Follow the same structure for caret or cforest models.

    +
    +

    K Fold Random Forest

    +

    This idea is taken from method 3 of Jared Foster paper :

    +
    +

    A modification of [previous methods] is to obtain \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\) via cross-validation. In this méthod the specific data for subject \(i\) is not used to obtain \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\). Using k-fold cross-validation, we apply random forest regression approach to \(\frac{k-1}{k}\) of the data and use the resulting predictor to obtain estimates of \(P_{1i}\) and \(P_{0i}\) for the remaining \(\frac{1}{k}\) of the observations. This is repeated \(k\) times.

    +
    +

    To use this approach, type VirtualTwins:::VT.forest.fold(). This class takes in argument :

    +
      +
    • vt.object : return of vt.data() function
    • +
    • fold : number of fold (e.g. \(5\))
    • +
    • ratio : Control of sampsize balance. ratio of \(2\) means that there 2 times le highest level compared to the other. “Highest” means the level with larger observations. It’s in test.
    • +
    • interactions : Logical. If TRUE, interactions between covariables and treatments are used. FALSE otherwise.
    • +
    +

    NOTE: This function use only randomForest package.

    +
    # initialize k-fold RF
    +model.fold <- VirtualTwins:::VT.forest.fold(vt.o, fold = 5, ratio = 1, interactions = T)
    +# grow RF with randomForest package options
    +# set do.trace option to see the 5 folds
    +model.fold$run(ntree = 500, do.trace = 500)
    +
    ## ntree      OOB      1      2
    +##   500:  32.38% 16.46% 57.72%
    +## ntree      OOB      1      2
    +##   500:  33.25% 18.50% 55.26%
    +## ntree      OOB      1      2
    +##   500:  30.03% 16.52% 50.34%
    +## ntree      OOB      1      2
    +##   500:  29.26% 13.56% 55.71%
    +## ntree      OOB      1      2
    +##   500:  30.60% 13.79% 59.70%
    +
    +
    +

    Build Your Own Model

    +

    Random Forests are not the only models you can use to compute \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\). Any prediction model can be used, as logitic regression, boosting …

    +

    Anyway, aVirtualTwins package can be used. To do so, you can use VT.difft() class. It is important to note this the parent class of all “forests” classes. It takes in argument :

    +
      +
    • vt.object : return of vt.data() function
    • +
    • twin1 : estimate of \(P(Y_{i} = 1 | T = T_{i})\) : meaning response probability under the correct treatment.
    • +
    • twin1 : estimate of \(P(Y_{i} = 1 | T = 1-T_{i})\) : meaning response probability under the other treatment.
    • +
    • method : absolute (default), relative or logit. See ?VT.difft() for details.
    • +
    +
    # you get twin1 and twin2 by your own method
    +# here, i'll use random number between 0 and 1 :
    +twin1_random <- runif(470)
    +twin2_random <- runif(470)
    +
    +# then you can initialize VT.difft class : 
    +model.difft <- VT.difft(vt.o, twin1 = twin1_random, twin2 = twin2_random, "absolute")
    +# compute difference of twins : 
    +model.difft$computeDifft()
    +# See results
    +head(model.difft$difft)
    +
    ## [1]  0.2507360 -0.3905591  0.3102285  0.3986582  0.3747078 -0.2508240
    +
    # Graph :
    +# hist(model.difft$difft)
    +

    NOTE: Also, you can clone repository, write your own child class of VT.difft() AND submit it !

    +
    +
    +
    +
    +

    Step 2 : Estimate a Regression or Classification Tree

    +

    As described in the method, we define \(Z_i = \hat{P_{1i}} - \hat{P_{0i}}\). It’s the difference in term of response of the active treatments compared to the control treatment. The idea is to try to explain this difference by few covariables.

    +
    +

    Classification

    +

    We define a new variable \(Z^{*}\), \(Z^{*}_i=1\) if \(Z_i > c\) and \(Z^{*}_i=0\) otherwise. Classification tree’s goal is to explain the value \(Z^*=1\). \(c\) is a threshold given by the user. It’s the threshold for which the difference is “interesting”. One idea is to use quantiles of the difft distribution.

    +

    To compute a classifiction tree, VT.tree.class() is used. Internally, rpart::rpart() is computed. It takes in argument:

    +
      +
    • vt.difft : VT.difft object
    • +
    • threshold : \(0.05\) by default, corresponds to \(c\)
    • +
    • sens : c(">", "<"). sens corresponds to the way \(Z^{*}\) is defined. +
        +
      • ">" (default) : \(Z^{*}\), \(Z^{*}_i=1\) if \(Z_i > c\) and \(Z^{*}_i=0\) otherwise.
      • +
      • "<" : \(Z^{*}\), \(Z^{*}_i=1\) if \(Z_i < c\) and \(Z^{*}_i=0\) otherwise.
      • +
    • +
    • screening : NULL is default value. If TRUE only covariables in varimp vt.object ’s field is used.
    • +
    +

    See ?VT.tree for details.

    +
    # initialize classification tree
    +tr.class <- VT.tree.class(vt.f.rf, sens = ">", threshold = quantile(vt.f.rf$difft, 0.7))
    +# compute tree with rpart option
    +tr.class$run(cp = 0, maxdepth = 3, maxcompete = 1)
    +
    +
    +

    Regression

    +

    Use regression tree to explain \(Z\) by covariables \(X\). Then some leafs have predicted \(Z_i\) greater than the threshold \(c\) (if \(sens\) is “>”), and it defines which covariables explain \(Z\).

    +

    The class to use is VT.tree.reg(). It takes same parameters than classification mehod.

    +
    # initialize regression tree
    +tr.reg <- VT.tree.reg(vt.f.rf, sens = ">", threshold = quantile(vt.f.rf$difft, 0.7))
    +# compute tree with rpart option
    +tr.reg$run(cp = 0, maxdepth = 3, maxcompete = 1)
    +
    +
    +

    Subgroups and results