
Step 1: compute $\hat{P_{1i}}$ and $\hat{P_{0i}}$

As described earlier, step 1 can be done in different ways.

Simple Random Forest

The following examples use the sepsis data created in the previous part.

To perform a simple random forest on a VT.object, the randomForest, caret, or party package can be used.

The function vt.forest("one", ...) is used. It takes the following arguments:

  • forest.type : must be set to "one"
  • vt.data : the object returned by the vt.data() function
  • model : a random forest model
  • interactions : logical, TRUE by default
  • ... : options passed to the randomForest() function

with randomForest

# use randomForest::randomForest()
library(randomForest, verbose = F)
# Reproducibility
set.seed(123)
# Fit rf model 
# default params
# set interactions to TRUE if using interaction between T and X
model.rf <- randomForest(x = vt.o$getX(interactions = T),
                         y = vt.o$getY(),
                         ntree = 500)
# initialize VT.forest.one
vt.f.rf <- vt.forest("one", vt.data = vt.o, model = model.rf, interactions = T)
### or you can use randomForest inside vt.forest()
vt.f.rf <- vt.forest("one", vt.data = vt.o, interactions = T, ntree = 500)

with party

cforest() can be useful; however, its computing time is very long. There also seems to be an issue when passing a cforest object as a Reference Class parameter, which still needs to be fixed, so the example below is left commented out.

# # use party::cforest()
# library(party, verbose = F)
# # Reproducibility
# set.seed(123)
# # Fit cforest model 
# # default params
# # set interactions to TRUE if using interaction between T and X
# model.cf <- cforest(formula = vt.o$getFormula(), data = vt.o$getData(interactions = T))
# # initialize VT.forest.one
# vt.f.cf <- vt.forest("one", vt.data = vt.o, model = model.cf)

with caret

Using caret can be useful, for example to take advantage of parallel computing.

NOTE: With caret, the outcome factor levels cannot be named 0, so the levels are renamed to "n"/"y".

# Copy new object
vt.o.tr <- vt.o$copy()
# Change levels
tmp <- ifelse(vt.o.tr$data$survival == 1, "y", "n")
vt.o.tr$data$survival <- as.factor(tmp)
rm(tmp)
# Check new data to be sure
formatRCTDataset(vt.o.tr$data, "survival", "THERAPY")
## "y" will be the favorable outcome
# use caret::train()
library(caret, verbose = F)
# Reproducibility
set.seed(123)
# fit train model
fitControl <- trainControl(classProbs = T, method = "none")
model.tr <- train(x = vt.o.tr$getX(interactions = T),
                  y = vt.o.tr$getY(),
                  method = "rf",
                  tuneGrid = data.frame(mtry = 5),
                  trControl = fitControl)
# initialize VT.forest.one
vt.f.tr <- vt.forest("one", vt.o.tr, model = model.tr)

Double Random Forest

To perform a double random forest on a VT.object, the same packages as for the simple random forest can be used.

The function vt.forest("double", ...) is used. It takes the following arguments:

  • forest.type : must be set to "double"
  • vt.data : the object returned by the vt.data() function
  • model_trt1 : a random forest model for $T=1$ (this argument has to be specified)
  • model_trt0 : a random forest model for $T=0$ (this argument has to be specified)

NOTE: use the trt parameter of the VT.object::getX() and VT.object::getY() methods to obtain the part of the data corresponding to a given treatment arm. See the following example.

with randomForest

# grow RF for T = 1
model.rf.trt1 <- randomForest(x = vt.o$getX(trt = 1),
                              y = vt.o$getY(trt = 1))
# grow RF for T = 0
model.rf.trt0 <- randomForest(x = vt.o$getX(trt = 0),
                              y = vt.o$getY(trt = 0))
# initialize VT.forest.double()
vt.doublef.rf <- vt.forest("double",
                           vt.data = vt.o, 
                           model_trt1 = model.rf.trt1, 
                           model_trt0 = model.rf.trt0)
### Or you can use randomForest() inside
vt.doublef.rf <- vt.forest("double",
                           vt.data = vt.o,
                           ntree = 200)

Follow the same structure for caret or cforest models.
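
For example, here is a hedged sketch of the double approach with caret, reusing the relabelled vt.o.tr object from the caret example above; the object names and the mtry value are simply copied or made up for illustration, not package defaults.

# Sketch only: double random forest with caret, assuming vt.o.tr from the
# caret example above (outcome levels renamed to "n"/"y")
library(caret, verbose = F)
set.seed(123)
fitControl <- trainControl(classProbs = T, method = "none")
# one model per treatment arm, using the trt parameter to subset the data
model.tr.trt1 <- train(x = vt.o.tr$getX(trt = 1),
                       y = vt.o.tr$getY(trt = 1),
                       method = "rf",
                       tuneGrid = data.frame(mtry = 5),
                       trControl = fitControl)
model.tr.trt0 <- train(x = vt.o.tr$getX(trt = 0),
                       y = vt.o.tr$getY(trt = 0),
                       method = "rf",
                       tuneGrid = data.frame(mtry = 5),
                       trControl = fitControl)
# initialize VT.forest.double() with the two caret models
vt.doublef.tr <- vt.forest("double",
                           vt.data = vt.o.tr,
                           model_trt1 = model.tr.trt1,
                           model_trt0 = model.tr.trt0)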

K-Fold Random Forest

This idea is taken from method 3 of Jared Foster's paper:

A modification of [previous methods] is to obtain $\hat{P_{1i}}$ and $\hat{P_{0i}}$ via cross-validation. In this method the specific data for subject $i$ is not used to obtain $\hat{P_{1i}}$ and $\hat{P_{0i}}$. Using k-fold cross-validation, we apply the random forest regression approach to $\frac{k-1}{k}$ of the data and use the resulting predictor to obtain estimates of $P_{1i}$ and $P_{0i}$ for the remaining $\frac{1}{k}$ of the observations. This is repeated $k$ times.

To use this approach, call vt.forest("fold", ...). It takes the following arguments:

  • forest.type : must be set to "fold"
  • vt.data : the object returned by the vt.data() function
  • fold : number of folds (e.g. $5$)
  • ratio : controls the sampsize balance. A ratio of $2$ means the most frequent outcome level is sampled twice as often as the other. This feature is experimental.
  • interactions : logical. If TRUE, interactions between covariates and treatment are used.
  • ... : options passed to the randomForest() function

NOTE: This function uses only the randomForest package.

# initialize k-fold RF
# you can use randomForest options
model.fold <- vt.forest("fold", vt.data = vt.o, fold = 5, ratio = 1, interactions = T, ntree = 200)
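
For intuition only, here is a minimal hand-rolled sketch of the k-fold idea quoted above, written directly with randomForest. The data frame and the survival and THERAPY columns come from the sepsis data used earlier; everything else (fold assignment, object names) is illustrative and is not the package's internal code, which also handles the sampsize/ratio balancing.

# Illustrative sketch of the k-fold idea (the package does this internally)
# assumes THERAPY is coded as numeric 0/1; adapt if it is a factor
library(randomForest)
set.seed(123)
d <- vt.o$data
k <- 5
folds <- sample(rep(1:k, length.out = nrow(d)))
p1.hat <- p0.hat <- numeric(nrow(d))
for (i in 1:k) {
  held.out <- folds == i
  # fit on (k-1)/k of the data
  rf <- randomForest(as.factor(survival) ~ ., data = d[!held.out, ], ntree = 200)
  # predict the held-out fold twice: once as treated, once as control
  d1 <- d[held.out, ]; d1$THERAPY <- 1
  d0 <- d[held.out, ]; d0$THERAPY <- 0
  # column 2 = probability of the second outcome level (assumed favorable)
  p1.hat[held.out] <- predict(rf, d1, type = "prob")[, 2]
  p0.hat[held.out] <- predict(rf, d0, type = "prob")[, 2]
}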

Build Your Own Model

Random forests are not the only models you can use to compute $\hat{P_{1i}}$ and $\hat{P_{0i}}$. Any prediction model can be used, such as logistic regression, boosting, etc.

The aVirtualTwins package can still be used in that case, through the VT.difft() class. Note that it is the parent class of all the "forest" classes. It takes the following arguments:

  • vt.object : the object returned by the vt.data() function
  • twin1 : estimate of $P(Y_{i} = 1 | T = T_{i})$, i.e. the response probability under the received treatment.
  • twin2 : estimate of $P(Y_{i} = 1 | T = 1 - T_{i})$, i.e. the response probability under the other treatment.
  • method : absolute (default), relative or logit. See ?VT.difft for details.
# obtain twin1 and twin2 with your own method
# here, random numbers between 0 and 1 are used as a placeholder:
twin1_random <- runif(470)
twin2_random <- runif(470)

# then you can initialize VT.difft class : 
model.difft <- VT.difft(vt.o, twin1 = twin1_random, twin2 = twin2_random, "absolute")
# compute difference of twins : 
model.difft$computeDifft()
# See results
head(model.difft$difft)
## [1] -0.03599908 -0.44271883 -0.25458624 -0.64201822  0.29347148 -0.02843780
# Graph :
# hist(model.difft$difft)
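
As a more realistic sketch, the twins can come from a plain logistic regression instead of random numbers. The code below is illustrative only and is not the package's own method: it reuses vt.o$data with the survival and THERAPY columns seen earlier, and the object names (glm.trt1, twin1_glm, ...) are made up for the example.

# Sketch: logistic-regression twins, one model per treatment arm
# (illustrative only; assumes survival is the outcome and THERAPY is coded 0/1)
d <- vt.o$data
glm.trt1 <- glm(survival ~ ., data = subset(d, THERAPY == 1, select = -THERAPY),
                family = binomial)
glm.trt0 <- glm(survival ~ ., data = subset(d, THERAPY == 0, select = -THERAPY),
                family = binomial)
covars <- subset(d, select = -c(survival, THERAPY))
p1 <- predict(glm.trt1, newdata = covars, type = "response") # P(Y = 1 | T = 1)
p0 <- predict(glm.trt0, newdata = covars, type = "response") # P(Y = 1 | T = 0)
# twin1 = probability under the received treatment, twin2 = under the other one
twin1_glm <- ifelse(d$THERAPY == 1, p1, p0)
twin2_glm <- ifelse(d$THERAPY == 1, p0, p1)
# then proceed exactly as above
model.difft.glm <- VT.difft(vt.o, twin1 = twin1_glm, twin2 = twin2_glm, "absolute")
model.difft.glm$computeDifft()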

NOTE: You can also clone the repository, write your own child class of VT.difft(), and submit it!