
IDS 702: Module 8.2

Classification and regression trees

Dr. Olanrewaju Michael Akande

1 / 12

Tree-based methods

  • The regression approaches we have covered so far in this course are all parametric.

  • Parametric means that we need to assume an underlying probability distribution to explain the randomness.

  • For example, for linear regression,

    $$y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_i; \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2),$$

    we assume a normal distribution.

  • For logistic regression,

    $$y_i \mid x_i \sim \text{Bernoulli}(\pi_i); \quad \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_i,$$

    we assume a Bernoulli distribution.

2 / 12

Tree-based methods

  • All the models we have covered require specifying a function for the mean or odds, and specifying a distribution for the randomness.

  • We may not want to run the risk of mis-specifying those.

  • As an alternative, one can turn to nonparametric methods that optimize certain criteria rather than specify a model.

    • Classification and regression trees (CART)

    • Random forests

    • Boosting

    • Other machine learning methods

  • Over the next few modules, we will briefly discuss a few of those methods.

3 / 12

CART

  • Goal: predict outcome variable from several predictors.

  • Can be used for categorical outcomes (classification trees) or continuous outcomes (regression trees).

  • Let Y represent the outcome and X represent the predictors.

  • CART recursively partitions the predictor space in a way that can be effectively represented by a tree structure, with leaves corresponding to the subsets of units.

4 / 12

CART for categorical outcomes

  • Partition the X space so that the subsets of individuals formed by the partitions have relatively homogeneous Y.

  • Partitions are formed from recursive binary splits of X.

  • Grow the tree until it reaches a pre-determined maximum size (minimum number of points in the leaves).

  • Prune the tree based on cross-validation; there are various ways to do this.

  • Making predictions:

    • For any new X, trace down the tree until you reach the appropriate leaf.

    • Use the value of Y that occurs most frequently in the leaf as the prediction (a short rpart sketch follows this slide).

5 / 12
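
To make the recipe above concrete, here is a minimal sketch of a classification tree in R using rpart (one of the packages mentioned later in the deck). The dataset (iris), the control settings, and the cp choice are illustrative assumptions, not part of the original slides.

    library(rpart)

    # Grow a classification tree on the built-in iris data (illustrative example).
    # minbucket controls the minimum number of points per leaf; a small cp lets
    # the tree grow large before pruning.
    fit <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(minbucket = 5, cp = 0.001))

    # rpart stores cross-validation error by complexity; prune back to the
    # complexity parameter with the smallest cross-validated error (xerror).
    printcp(fit)
    best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    pruned  <- prune(fit, cp = best_cp)

    # Prediction for new X: the most frequent class in the corresponding leaf.
    predict(pruned, newdata = head(iris), type = "class")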

CART

[Figure 1: a fictional classification tree splitting on gender and race/ethnicity, with leaves containing the resulting subsets of units.]

6 / 12

CART for categorical outcomes

  • To illustrate, Figure 1 displays a fictional classification tree for

    • an outcome variable.
    • two predictors, gender (male or female) and race/ethnicity (African-American, Caucasian, or Hispanic).
  • To approximate the conditional distribution of Y for a particular gender and race/ethnicity combination, one uses the values in the corresponding leaf.

  • For example, to predict a Y value for female Caucasians, one uses the Y value that occurs most frequently in leaf L3.

7 / 12

CART for continuous outcomes

  • Same idea as for categorical outcomes: grow the tree by recursive partitions on X.

  • Use the variance of the Y values as the splitting criterion: choose the split that makes the sum of the variances of the Y values in the leaves as small as possible (see the sketch after this slide).

  • When making predictions for a new X, use the average value of Y in the leaf for that X.

8 / 12
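
The splitting criterion above can be illustrated with a short base-R sketch. This is not the exact algorithm used by rpart or tree; it only shows, for a single continuous predictor, how one binary split is chosen by minimizing the within-leaf sum of squared deviations (the same idea as minimizing the leaf variances). The function name and simulated data below are made up for illustration.

    # Find the single best binary split of x for predicting y, scoring each
    # candidate split point by the total within-leaf sum of squares.
    best_split <- function(x, y) {
      candidates <- head(sort(unique(x)), -1)   # drop the max so both leaves are non-empty
      sse <- function(v) sum((v - mean(v))^2)   # within-leaf sum of squared deviations
      scores <- sapply(candidates, function(s) sse(y[x <= s]) + sse(y[x > s]))
      list(split = candidates[which.min(scores)], score = min(scores))
    }

    # Example: a step-shaped relationship; the chosen split should be near 0.5,
    # and the prediction in each leaf would be that leaf's average Y.
    set.seed(1)
    x <- runif(100)
    y <- ifelse(x < 0.5, 2, 5) + rnorm(100, sd = 0.3)
    best_split(x, y)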

Model diagnostics

  • Can look at residuals, but...

    • No parametric model, so for continuous outcomes we can’t check for linearity, non-constant variance, normality, etc.

    • Big residuals identify X values for which the predictions are not close to the actual Y values. But... what should we do with them?

    • Could use binned residuals for binary outcomes (as with logistic regression), but they only tell you where the model does not give good predictions.

  • Transforming the X values is irrelevant for trees (as long as the transformation is monotonic, like logs).

  • Can still do model validation, that is, compute and compare RMSEs, AUC, accuracy, and so on (see the sketch after this slide).

9 / 12
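
As a concrete illustration of the validation point, here is a minimal sketch of computing an out-of-sample RMSE for a regression tree in R. The data (iris, predicting Sepal.Length), the 70/30 split, and all settings are assumptions for demonstration only; the same idea applies to accuracy or AUC for classification trees.

    library(rpart)

    # Hold out a test set, fit a regression tree on the training set,
    # and compare predictions (leaf means) to the observed outcomes.
    set.seed(123)
    n         <- nrow(iris)
    train_idx <- sample(seq_len(n), size = floor(0.7 * n))
    train     <- iris[train_idx, ]
    test      <- iris[-train_idx, ]

    fit  <- rpart(Sepal.Length ~ ., data = train, method = "anova")
    pred <- predict(fit, newdata = test)

    # RMSE on the held-out set; compare with, e.g., a linear regression
    # fit to the same training data.
    sqrt(mean((test$Sepal.Length - pred)^2))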

CART vs. parametric regression: benefits

  • No parametric assumptions.

  • Automatic model selection.

  • Multi-collinearity not problematic.

  • Useful exploratory tool to find important interactions.

  • In R, use the tree or rpart packages (see the sketch after this slide).

10 / 12
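
Since the slide names both packages, here is an equally minimal sketch with the tree package (the rpart workflow was sketched earlier). The dataset and settings are again illustrative assumptions.

    library(tree)

    # Fit a classification tree with the tree package and inspect/draw it.
    fit <- tree(Species ~ ., data = iris)
    summary(fit)          # variables used, number of leaves, misclassification rate
    plot(fit); text(fit)  # tree diagram, similar in spirit to Figure 1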

CART vs. parametric regression: limitations

  • Regression predictions are forced into the range of the observed Y values. May or may not be a limitation depending on the context.

  • Bins continuous predictors, so fine-grained relationships are lost.

  • Finds a single tree, making it hard to characterize the chance error for that tree.

  • No obvious ways to assess variable importance.

  • Harder to interpret effects of individual predictors.

Also, one big tree is limiting, but we would need different datasets or different sets of variables to grow more than one tree...

11 / 12

What's next?

Move on to the readings for the next module!

12 / 12
