- Bootstrapping
 - Decision Trees
 - Random Forest
 - Application of Random Forest
 
Kaushik Roy Chowdhury
Is there an alternative?
Given a response variable and a set of predictors, a decision tree is a set of rules that governs the predicted response, depending on how the predictor space is split.
Can be applied to both regression and classification problems.
CART - Classification and Regression Trees
Divide the predictor space into distinct, non-overlapping regions.
For regression, make the same prediction (the mean of the response values) for all observations falling into a region.
For classification, predict by majority vote of the observation classes in the region (a minimal single-tree sketch follows).
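A minimal single-tree sketch, assuming the rpart package is installed; the Boston housing data from MASS serve purely as an illustrative example.
library(rpart)                          # recursive partitioning (CART) trees
library(MASS)                           # Boston housing data
fit <- rpart(medv ~ ., data = Boston)   # regression tree: recursive binary splits
predict(fit, newdata = Boston[1:3, ])   # each leaf predicts the mean response of its region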
 
However, aggregating trees fitted to bootstrapped samples can lead to a significant improvement in prediction accuracy, since averaging reduces the high variance of individual trees.
Take bootstrapped samples from the training data >> randomly sample \(m\) predictors as split candidates at each node >> fit a decision tree to each sample >> average the results of these trees (a hand-rolled sketch follows the rationale below).
Rationale: if there exists one strong predictor along with moderately strong ones, bagged trees all tend to split on the strong predictor first and become highly correlated, and averaging correlated trees does not lead to much reduction in variance. Restricting each split to a random sample of \(m\) predictors decorrelates the trees.
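A hand-rolled sketch of the bagging loop, again assuming rpart and the Boston data for illustration; randomForest() used below performs this internally and additionally samples \(m\) predictor candidates at every split.
library(rpart); library(MASS)
set.seed(1)
B <- 25                                           # number of bootstrapped trees
preds <- replicate(B, {
  idx  <- sample(nrow(Boston), replace = TRUE)    # bootstrap sample of the rows
  tree <- rpart(medv ~ ., data = Boston[idx, ])   # fit a tree to the sample
  predict(tree, newdata = Boston)                 # predict on the full data
})
bagged <- rowMeans(preds)                         # average the B trees (regression)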
The observations left out of the \(b^{th}\) bootstrapped sample are called "out-of-bag" (OOB) observations.
Predict the \(i^{th}\) observation using only the trees for which it was OOB: the average of the predicted responses (for regression) or the majority vote (for classification).
Because those trees never saw the observation during fitting, the OOB error acts as a proxy for test-set validation.
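Both quantities can be read off a fitted forest; a sketch assuming the randomForest object mod fitted in the application below, whose standard components predicted and err.rate hold the OOB predictions and error.
mean(mod$predicted != fgl$type)           # OOB misclassification rate
plot(mod$err.rate[, "OOB"], type = "l",   # OOB error as trees are added
     xlab = "Number of trees", ylab = "OOB error rate")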
Random forests trade a loss of interpretability for higher predictive performance.
library(MASS)   # forensic glass data (fgl): 214 fragments, 9 chemical measurements
data(fgl)
summary(fgl)
       RI                Na              Mg              Al       
 Min.   :-6.8500   Min.   :10.73   Min.   :0.000   Min.   :0.290  
 1st Qu.:-1.4775   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
 Median :-0.3200   Median :13.30   Median :3.480   Median :1.360  
 Mean   : 0.3654   Mean   :13.41   Mean   :2.685   Mean   :1.445  
 3rd Qu.: 1.1575   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
 Max.   :15.9300   Max.   :17.38   Max.   :4.490   Max.   :3.500  
       Si              K                Ca               Ba       
 Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
 1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
 Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
 Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
 3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
 Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
       Fe             type   
 Min.   :0.00000   WinF :70  
 1st Qu.:0.00000   WinNF:76  
 Median :0.00000   Veh  :17  
 Mean   :0.05701   Con  :13  
 3rd Qu.:0.10000   Tabl : 9  
 Max.   :0.51000   Head :29  
library(randomForest)   # Breiman and Cutler's random forest implementation
mod <- randomForest(type ~ ., data = fgl, mtry = 3, importance = TRUE)
print(mod)
Call:
 randomForest(formula = type ~ ., data = fgl, mtry = 3, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3
        OOB estimate of  error rate: 19.63%
Confusion matrix:
      WinF WinNF Veh Con Tabl Head class.error
WinF    62     6   2   0    0    0   0.1142857
WinNF   10    62   1   1    1    1   0.1842105
Veh      7     3   7   0    0    0   0.5882353
Con      0     3   0   9    0    1   0.3076923
Tabl     0     2   0   0    7    0   0.2222222
Head     1     3   0   0    0   25   0.1379310
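New observations are classified by majority vote across the 500 trees; a sketch, reusing the first rows of fgl purely for illustration.
predict(mod, newdata = fgl[1:5, ])   # predicted glass type for five fragments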
varImpPlot(mod)   # dot chart of variable importance measures
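The numeric scores behind the plot come from importance(); since the forest was fitted with importance = TRUE, they include the mean decrease in accuracy and in Gini impurity for each predictor.
importance(mod)   # per-class and overall variable importance measures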