Quick preview

Dataset

Data used in VT are modelized by \(\left\{Y, T, X_1, \ldots, X_{p-2}\right\}\). \(p\) is the number of variables.

\(Y\) is a binary outcome. In R, \(Y\) is a factor. Second level of this factor will be the desirable event. (\(Y=1\))
\(T\) is treatment variable, \(T=1\) means active treatement, \(T=0\) means control treatment. In R, \(T\) is numeric.
\(X_i\) is covariables, \(X_i\) can be categorical, continous, binary.

NOTE: if you run VT with interactions, categorical covariables must be transformed into binary variables.

Type ?formatRCTDataset for details.

Related functions/classes in aVirtualTwins package : VT.object(), vt.data(), formatRCTDataset.

Method

VT is a two steps method but with many possibilities

let \(\hat{P_{1i}} = P(Y_i = 1|T_i = 1, X_i)\)
let \(\hat{P_{0i}} = P(Y_i = 1|T_i = 0, X_i)\)
let \(X = \left\{X_1, \ldots, X_{p-2}\right\}\)

First Step

Grow a random forest with data \(\left\{Y, T, X \right\}\).
Grow a random forest with interaction treatement / covariable, i.e. \(\left\{Y, T, X, XI(T_i=0), XI(T_i=1)\right\}\)
Grow two random forests, one for each treatement:
- The first with data \(\left\{Y, X \right\}\) where \(T_i = 0\)
- The second with data \(\left\{Y, X \right\}\) where \(T_i = 1\)
Build your own model

From one of these methods you can estimate \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\).

Related functions/classes in aVirtualTwins package : VT.difft(), VT.forest(), VT.forest.one(), VT.forest.double(), VT.forest.fold().

Second Step

Define \(Z_i = \hat{P_{1i}} - \hat{P_{0i}}\)

Use regression tree to explain \(Z\) by covariables \(X\). Then subjects with predicted \(Z_i\) greater than some threshold \(c\) are considered to define a subgroup.
Use classification tree on new variable \(Z^{*}\) defined by \(Z^{*}_i=1\) if \(Z_i > c\) and \(Z^{*}_i=0\) otherwise.

The idea is to identify which covariable from \(X\) described variation of \(Z\).

Related functions/classes in aVirtualTwins package : VT.tree(), VT.tree.class(), VT.tree.reg().

Sepsis dataset

See Introduction.

Sepsis dataset is a simulated clinical trial with two groups treatment about sepsis desease. See details. This dataset is taken from SIDES method

Sepsis contains simulated data on 470 subjects with a binary outcome survival, that stores survival status for patient after 28 days of treatment, value of 1 for subjects who died after 28 days and 0 otherwise. There are 11 covariates, listed below, all of which are numerical variables.

Note that contrary to the original dataset used in SIDES, missing values have been imputed by random forest randomForest::rfImpute(). See file data-raw/sepsis.R for more details.

True subgroup is PRAPACHE <= 26 & AGE <= 49.80. NOTE: This subgroup is defined with the lower event rate (survival = 1) in treatement arm.

470 patients and 13 variables:

survival : binary outcome
THERAPY : 1 for active treatment, 0 for control treatment
TIMFIRST : Time from first sepsis-organ fail to start drug
AGE : Patient age in years
BLLPLAT : Baseline local platelets
blSOFA : Sum of baselin sofa (cardiovascular, hematology, hepaticrenal, and respiration scores)
BLLCREAT : Base creatinine
ORGANNUM : Number of baseline organ failures
PRAPACHE : Pre-infusion apache-ii score
BLGCS : Base GLASGOW coma scale score
BLIL6 : Baseline serum IL-6 concentration
BLADL : Baseline activity of daily living score
BLLBILI : Baseline local bilirubin

Source: http://biopharmnet.com/wiki/Software_for_subgroup_identification_and_analysis

Create object VirtualTwins

In order to begin the two steps of VT method, aVirtualTwins package needs to be initialized with vt.data() function. type ?vt.data for more details.

NOTE: if running VT with interactions between \(T\) and \(X\), set interactions = TRUE.

Code of vt.data() :

vt.data <- function(dataset, outcome.field, treatment.field, interactions = TRUE, ...){
  data <- formatRCTDataset(dataset, outcome.field, treatment.field, interactions = TRUE)
  VT.object(data = data, ...)
}

Example with Sepsis

# load library VT
library(aVirtualTwins)
# load data sepsis
data(sepsis)
# initialize VT.object
vt.o <- vt.data(sepsis, "survival", "THERAPY", TRUE)

## "1" will be the favorable outcome

1 will be the favorable outcome because 1 is the second level of "survival" column. It means that \(P(Y=1)\) is the probability of interest. Anyway, it’s still possible to compute \(P(Y=0)\).

Quick example

Sepsis does not have any categorical variable, following example show how vt.data deals with categorical values depending on interactions parameter

# Creation of categorical variable
cat.x <- rep(1:5, (nrow(sepsis))/5)
cat.x <- as.factor(cat.x)
sepsis.tmp <- cbind(sepsis, cat.x)
vt.o.tmp <- vt.data(sepsis.tmp, "survival", "THERAPY", TRUE)

## "1" will be the favorable outcome 
## Creation of dummy variables for cat.x 
## Dummy variable cat.x_1 created 
## Dummy variable cat.x_2 created 
## Dummy variable cat.x_3 created 
## Dummy variable cat.x_4 created 
## Dummy variable cat.x_5 created

Dummies variables are created for each category of cat.x variable. And cat.x is removed from dataset.

Step 1 : compute \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\)

As described earlier, step 1 can be done via differents ways

Simple Random Forest

Following example used sepsis data created in previous part.

To perform simple random forest on VT.object, randomForest, caret and party package can be used.

Class VT.forest.one is used. It takes in arguments :

vt.object : return of vt.data() function
model : A random forest model
interactions : logical, TRUE is default value

with randomForest

# use randomForest::randomForest()
library(randomForest, verbose = F)
# Reproducibility
set.seed(123)
# Fit rf model 
# default params
# set interactions to TRUE if using interaction between T and X
model.rf <- randomForest(x = vt.o$getX(interactions = T),
                         y = vt.o$getY())
# initialize VT.forest.one
vt.f.rf <- VT.forest.one(vt.o, model.rf)
# Then, use run() to compute probabilities
vt.f.rf$run()

with party

cforest() can be usefull however computing time is really long. I think there is an issue when giving cforest object in Reference Class parameter. Need to fix it.

# # use randomForest::randomForest()
# library(party, verbose = F)
# # Reproducibility
# set.seed(123)
# # Fit cforest model 
# # default params
# # set interactions to TRUE if using interaction between T and X
# model.cf <- cforest(formula = vt.o$getFormula(), data = vt.o$getData(interactions = T))
# # initialize VT.forest.one
# vt.f.cf <- VT.forest.one(vt.o, model.cf)
# # Then, use run() to compute probabilities
# vt.f.cf$run()

with caret

Using caret can be usefull to deal with parallel computing for example.

NOTE: For caret levels of outcome can’t be 0, so i’ll change levels name into “n”/“y”

# Copy new object
vt.o.tr <- vt.o$copy()
# Change levels
tmp <- ifelse(vt.o.tr$data$survival == 1, "y", "n")
vt.o.tr$data$survival <- as.factor(tmp)
rm(tmp)
# Check new data to be sure
formatRCTDataset(vt.o.tr$data, "survival", "THERAPY")

## "y" will be the favorable outcome

# use caret::train()
library(caret, verbose = F)
# Reproducibility
set.seed(123)
# fit train model
fitControl <- trainControl(classProbs = T, method = "none")
model.tr <- train(x = vt.o.tr$getX(interactions = T),
                  y = vt.o.tr$getY(),
                  method = "rf",
                  tuneGrid = data.frame(mtry = 5),
                  trControl = fitControl)
# initialize VT.forest.one
vt.f.tr <- VT.forest.one(vt.o.tr, model.tr)
# Then, use run() to compute probabilities
vt.f.tr$run()

Double Random Forest

To perform double random forest on VT.object, same packages as simple random forest can be used.

Class VT.forest.double is used. It takes in arguments :

vt.object : return of vt.data() function
model_trt1 : a random forest model for \(T=1\)
model_trt0 : a random forest model for \(T=0\)

NOTE: use trt parameter in VT.object::getX() or VT.object::getY() methods to obtain part of data depending on treatment. See following example.

with randomForest

# grow RF for T = 1
model.rf.trt1 <- randomForest(x = vt.o$getX(trt = 1),
                              y = vt.o$getY(trt = 1))
# grow RF for T = 0
model.rf.trt0 <- randomForest(x = vt.o$getX(trt = 0),
                              y = vt.o$getY(trt = 0))
# initialize VT.forest.double()
vt.doublef.rf <- VT.forest.double(vt.o, model.rf.trt1, model.rf.trt0)
# Then, use run() to compute probabilities
vt.doublef.rf$run()

Follow the same structure for caret or cforest models.

K Fold Random Forest

This idea is taken from method 3 of Jared Foster paper :

A modification of [previous methods] is to obtain \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\) via cross-validation. In this méthod the specific data for subject \(i\) is not used to obtain \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\). Using k-fold cross-validation, we apply random forest regression approach to \(\frac{k-1}{k}\) of the data and use the resulting predictor to obtain estimates of \(P_{1i}\) and \(P_{0i}\) for the remaining \(\frac{1}{k}\) of the observations. This is repeated \(k\) times.

To use this approach, type aVirtualTwins:::VT.forest.fold(). This class takes in argument :

vt.object : return of vt.data() function
fold : number of fold (e.g. \(5\))
ratio : Control of sampsize balance. ratio of \(2\) means that there 2 times le highest level compared to the other. “Highest” means the level with larger observations. It’s in test.
interactions : Logical. If TRUE, interactions between covariables and treatments are used. FALSE otherwise.

NOTE: This function use only randomForest package.

# initialize k-fold RF
model.fold <- aVirtualTwins:::VT.forest.fold(vt.o, fold = 5, ratio = 1, interactions = T)
# grow RF with randomForest package options
# set do.trace option to see the 5 folds
model.fold$run(ntree = 500, do.trace = 500)

## ntree      OOB      1      2
##   500:  32.38% 16.46% 57.72%
## ntree      OOB      1      2
##   500:  33.25% 18.50% 55.26%
## ntree      OOB      1      2
##   500:  30.03% 16.52% 50.34%
## ntree      OOB      1      2
##   500:  29.26% 13.56% 55.71%
## ntree      OOB      1      2
##   500:  30.60% 13.79% 59.70%

Build Your Own Model

Random Forests are not the only models you can use to compute \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\). Any prediction model can be used, as logitic regression, boosting …

Anyway, aVirtualTwins package can be used. To do so, you can use VT.difft() class. It is important to note this the parent class of all “forests” classes. It takes in argument :

vt.object : return of vt.data() function
twin1 : estimate of \(P(Y_{i} = 1 | T = T_{i})\) : meaning response probability under the correct treatment.
twin1 : estimate of \(P(Y_{i} = 1 | T = 1-T_{i})\) : meaning response probability under the other treatment.
method : absolute (default), relative or logit. See ?VT.difft() for details.

# you get twin1 and twin2 by your own method
# here, i'll use random number between 0 and 1 :
twin1_random <- runif(470)
twin2_random <- runif(470)

# then you can initialize VT.difft class : 
model.difft <- VT.difft(vt.o, twin1 = twin1_random, twin2 = twin2_random, "absolute")
# compute difference of twins : 
model.difft$computeDifft()
# See results
head(model.difft$difft)

## [1]  0.2507360 -0.3905591  0.3102285  0.3986582  0.3747078 -0.2508240

# Graph :
# hist(model.difft$difft)

NOTE: Also, you can clone repository, write your own child class of VT.difft() AND submit it !

Step 2 : Estimate a Regression or Classification Tree

As described in the method, we define \(Z_i = \hat{P_{1i}} - \hat{P_{0i}}\). It’s the difference in term of response of the active treatments compared to the control treatment. The idea is to try to explain this difference by few covariables.

Classification

We define a new variable \(Z^{*}\), \(Z^{*}_i=1\) if \(Z_i > c\) and \(Z^{*}_i=0\) otherwise. Classification tree’s goal is to explain the value \(Z^*=1\). \(c\) is a threshold given by the user. It’s the threshold for which the difference is “interesting”. One idea is to use quantiles of the difft distribution.

To compute a classifiction tree, VT.tree.class() is used. Internally, rpart::rpart() is computed. It takes in argument:

vt.difft : VT.difft object
threshold : \(0.05\) by default, corresponds to \(c\)
sens : c(">", "<"). sens corresponds to the way \(Z^{*}\) is defined.
- ">" (default) : \(Z^{*}\), \(Z^{*}_i=1\) if \(Z_i > c\) and \(Z^{*}_i=0\) otherwise.
- "<" : \(Z^{*}\), \(Z^{*}_i=1\) if \(Z_i < c\) and \(Z^{*}_i=0\) otherwise.
screening : NULL is default value. If TRUE only covariables in varimp vt.object ’s field is used.

See ?VT.tree for details.

# initialize classification tree
tr.class <- VT.tree.class(vt.f.rf, sens = ">", threshold = quantile(vt.f.rf$difft, 0.7))
# compute tree with rpart option
tr.class$run(cp = 0, maxdepth = 3, maxcompete = 1)

Regression

Use regression tree to explain \(Z\) by covariables \(X\). Then some leafs have predicted \(Z_i\) greater than the threshold \(c\) (if \(sens\) is “>”), and it defines which covariables explain \(Z\).

The class to use is VT.tree.reg(). It takes same parameters than classification mehod.

# initialize regression tree
tr.reg <- VT.tree.reg(vt.f.rf, sens = ">", threshold = quantile(vt.f.rf$difft, 0.7))
# compute tree with rpart option
tr.reg$run(cp = 0, maxdepth = 3, maxcompete = 1)

Virtual Twins Examples

Francois Vieille

2015-07-25

Introduction