Decision Tree 1 - Regression and Classification Trees
0. Tree-based methods
These methods involve stratifying or segmenting the predictor space into a number of simple regions
- The mean (regression) or mode (classification) of the training observations in a region is used as the prediction for test observations falling in that region
1. Regression Decision Tree
1.1 Motivation
Making Prediction via Stratification of the Feature Space:
- Divide the predictor space – that is, the set of possible values for $X_1,X_2,\dots,X_p$ – into $J$ distinct and non-overlapping regions $R_1,R_2,\dots,R_J$
- Every test observation that falls into region $R_j$ receives the same prediction: the mean of the response $Y$ over the training observations in $R_j$ (a small sketch follows below)
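A minimal NumPy sketch of this prediction rule, with two hand-chosen illustrative regions split on a single predictor at 4.5 (the data and the cutpoint are made up for illustration, not taken from the text):

```python
import numpy as np

# Toy training data: one predictor, four observations (illustrative only)
X_train = np.array([[1.0], [2.0], [6.0], [7.0]])
y_train = np.array([10.0, 12.0, 30.0, 34.0])

# Two regions: R1 = {X | X1 < 4.5}, R2 = {X | X1 >= 4.5}
in_R1 = X_train[:, 0] < 4.5
region_mean = {True: y_train[in_R1].mean(),    # mean response in R1 -> 11.0
               False: y_train[~in_R1].mean()}  # mean response in R2 -> 32.0

# A test observation gets the mean of the region it falls into
x_test = np.array([3.0])
y_hat = region_mean[bool(x_test[0] < 4.5)]     # falls in R1 -> predict 11.0
```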
1.2 Tree Splitting
The goal is to find leaves $R_1,\dots,R_J$ that minimize the RSS, given by: $$\sum_{j=1}^J\sum_{i\in R_j}(y_i-\hat{y}_{R_j})^2$$
- where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j^{th}$ leaf
Problem: it is computationally infeasible to consider every possible partition of the feature space into $J$ leaves
Solution: Recursive Binary Splitting is a top-down, greedy approach:
- top-down because it begins at the top of the tree (all in one region) and then successively splits the predictor space;
- greedy because the best split is made at each particular step, rather than looking ahead and picking a split that would lead to a better tree in some future step (which would be computationally infeasible)
First, for any predictor $X_j$ and cutpoint $s$, we define the pair of half-planes $$R_1(j,s)=\{X \mid X_j< s\} \quad\text{and}\quad R_2(j,s)=\{X \mid X_j\geq s\}$$ and seek the $j$ and $s$ that minimize: $$\sum_{i:\,x_i\in R_1(j,s)}(y_i-\hat{y}_{R_1})^2+\sum_{i:\,x_i\in R_2(j,s)}(y_i-\hat{y}_{R_2})^2$$
Next, repeat the process, looking for the best predictor and cutpoint to split one of the previously identified regions further
- until a stopping criterion is reached:
- e.g. no region contains more than 5 observations.
If the number of features $p$ is not too large, this process can be done quickly
- predict $Y$ for a test observation using the mean of the training observations in the region $R_j$ to which it belongs, as sketched below
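A minimal NumPy sketch of recursive binary splitting as described above; the function names `best_split` and `grow_tree` and the `min_obs` stopping parameter are my own choices, not from the text:

```python
import numpy as np

def best_split(X, y):
    """For every predictor j and cutpoint s, compute the two-region RSS
    and keep the (j, s) pair that minimizes it (the greedy step)."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s

def grow_tree(X, y, min_obs=5):
    """Top-down recursive splitting; stop once a region holds <= min_obs observations."""
    if len(y) <= min_obs:
        return {"prediction": y.mean()}          # leaf: predict the region mean
    j, s = best_split(X, y)
    if j is None:                                # no valid split exists
        return {"prediction": y.mean()}
    mask = X[:, j] < s
    return {"feature": j, "cutpoint": s,
            "left":  grow_tree(X[mask],  y[mask],  min_obs),
            "right": grow_tree(X[~mask], y[~mask], min_obs)}
```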
1.3 Tree Pruning
Problem: a complex tree will overfit the training data (in the extreme, every leaf contains a single observation)
Solution: A smaller tree with fewer splits (fewer $R_j\ $) may lead to lower variance and better interpretation at the cost of a little bias
Method 1 - Threshold
Split only when the decrease in the RSS due to the split exceeds some (high) threshold
- Problem: too short-sighted, since a seemingly worthless split early on may be followed later by a very good split with a large reduction in RSS
Method 2 - Pruning
Grow a very large tree $T_0\ $, then prune it back in order to obtain a subtree
- Goal: select a subtree that leads to the lowest test error rate
Rather than cross-validating every possible subtree, we consider a sequence of subtrees indexed by a non-negative tuning parameter $\alpha$
- Cost Complexity / Weakest Link Pruning
For each value of $\alpha$ there corresponds a subtree $T\subset T_0$ that minimizes: $$\sum_{j=1}^{|T|}\sum_{i: \ x_i\in R_j}(y_i-\hat{y}_{R_j})^2 + \alpha|T|$$
- $|T|$ : number of leaves of the tree $T$
The tuning parameter $\alpha$ controls a trade-off between the subtree’s complexity and its fit to the training data.
- When $\alpha= 0$, $T = T_0$, since the criterion then just measures the training error
- As $\alpha$ increases, there is a growing penalty for subtrees with many leaves
- so branches get pruned from the tree in a nested and predictable fashion,
- then obtaining the whole sequence of subtrees (as a function of $\alpha$ ) is easy
$\alpha$ is similar to $\lambda$ in the lasso, which controls the complexity of a linear model
- $\alpha$ can likewise be selected via CV; we then return to the full training data and obtain the subtree corresponding to the chosen $\alpha$ (see the sketch below)
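A hedged scikit-learn sketch of this procedure: `cost_complexity_pruning_path` returns the sequence of $\alpha$ values that index the nested subtrees, and cross-validation over the `ccp_alpha` parameter picks the final subtree. The synthetic data below is only a stand-in for a real data set:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not a real data set)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 5.0, 1.0) + rng.normal(scale=0.5, size=200)

# The nested sequence of subtrees of the full tree corresponds to a
# sequence of alpha values
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate over alpha (ccp_alpha in scikit-learn) and keep the
# subtree corresponding to the selected value
cv = GridSearchCV(DecisionTreeRegressor(random_state=0),
                  param_grid={"ccp_alpha": path.ccp_alphas},
                  scoring="neg_mean_squared_error", cv=6)
cv.fit(X, y)
pruned_tree = cv.best_estimator_
```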
Example on Baseball Hitters Data
Perform 6-fold CV to estimate the CV MSE of the pruned trees as a function of $\alpha$
- the CV error is minimized at $|T|=3$, so the three-leaf subtree corresponding to the best $\alpha$ is selected
2. Classification Trees
For a classification tree, we predict that a test observation belongs to the most commonly occurring class among the training observations in the region to which it belongs
- RSS cannot be used as a splitting criterion for a classification tree
- two alternative criteria are commonly used to evaluate the quality of a particular split
Gini Index:
A measure of total variance across the $K$ classes:
$$G=\sum_{k=1}^K\hat{p}_{mk}(1-\hat{p}_{mk})$$
- $\hat{p}_{mk}$ represents the proportion of train data in the $m^{th}$ region that are from the $k^{th}$ class
- $G$ is small if all the $\hat{p}_{mk}$ are close to 0 or 1
- Node purity: $G$ takes a small value when a node contains predominantly observations from a single class
Cross-Entropy: $$D=-\sum_{k=1}^K\hat{p}_{mk}\log (\hat{p}_{mk})$$
- Like the Gini index, the entropy is small if the $m^{th}$ node is pure
- both criteria are sensitive to node purity, as the sketch below illustrates
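A small sketch computing both criteria directly from a node's class proportions $\hat{p}_{mk}$ (the function names are my own); a pure node scores near 0, a mixed node scores higher:

```python
import numpy as np

def gini(p):
    """Gini index for a node with class proportions p (which sum to 1)."""
    p = np.asarray(p)
    return np.sum(p * (1 - p))

def entropy(p):
    """Cross-entropy for a node; 0 * log(0) is treated as 0."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(gini([0.9, 0.1]), gini([0.5, 0.5]))        # 0.18   0.5
print(entropy([0.9, 0.1]), entropy([0.5, 0.5]))  # ~0.325 ~0.693
```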
Heart Disease Example:
The splits may yield two leaves with the same predicted value; there are still reasons to keep such splits:
- because it leads to increased node purity
- improves the Gini Index and the Entropy
3. Trees vs Linear Models
If there is a highly non-linear, complex relationship between the features and the response, CARTs may outperform classical approaches such as linear regression
- However, if the true relationship is close to linear, linear regression will likely outperform a tree (see the comparison sketch below)
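A hedged illustration on synthetic data: a step-shaped (highly non-linear) response favours the tree, while a truly linear response favours linear regression; the model settings (e.g. `max_depth=3`) are arbitrary choices for this sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y_step = np.where(X[:, 0] > 0, 2.0, -2.0) + rng.normal(scale=0.3, size=300)  # non-linear
y_lin  = 1.5 * X[:, 0]                    + rng.normal(scale=0.3, size=300)  # linear

for label, y in [("step-shaped response", y_step), ("linear response", y_lin)]:
    for model in (LinearRegression(), DecisionTreeRegressor(max_depth=3)):
        mse = -cross_val_score(model, X, y,
                               scoring="neg_mean_squared_error", cv=5).mean()
        print(label, type(model).__name__, round(mse, 3))
```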
4. Pros & Cons of Trees
Advantages:
- Easier to interpret than Linear Regression
- More closely mirror human decision making
- Trees can be displayed graphically
- Easily handle qualitative predictors without the need to create dummy variables
Limitations:
- Trees can be very non-robust:
- a small change in the data can cause a large change in the final estimated tree
- Trees generally do not have the same level of predictive accuracy as some other regression and classification methods
- Looking ahead: by aggregating many decision trees, bagging, random forests and boosting can improve predictive accuracy, at the expense of some loss of interpretability
5. Reference
James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning, with Applications in R. Springer, 2013.