The regression approaches we have covered so far in this course are all parametric.
Parametric means that we need to assume an underlying probability distribution to explain the randomness.
For example, for linear regression,
$$y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_i; \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2),$$
we assume a normal distribution.
For logistic regression,
$$y_i \mid x_i \sim \text{Bernoulli}(\pi_i); \quad \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i,$$
we assume a Bernoulli distribution.
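For concreteness, both of these parametric fits are single calls in R. The sketch below is purely illustrative: it uses simulated data as a stand-in, so the data frame and variable names are not from these notes.

```r
# Simulated stand-in data (illustrative only)
set.seed(42)
dat <- data.frame(x1 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 + rnorm(100)            # continuous outcome
dat$z <- rbinom(100, 1, plogis(-0.5 + dat$x1))  # binary outcome

# Linear regression: y_i = b0 + b1*x_i1 + eps_i, eps_i ~ N(0, sigma^2)
fit_lm <- lm(y ~ x1, data = dat)

# Logistic regression: z_i | x_i ~ Bernoulli(pi_i), logit(pi_i) = b0 + b1*x_i
fit_glm <- glm(z ~ x1, data = dat, family = binomial)

summary(fit_lm)
summary(fit_glm)
```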
All the models we have covered require specifying a function for the mean or odds and a distribution for the randomness.
We may not want to run the risk of mis-specifying those.
As an alternative, one can turn to nonparametric methods that optimize certain criteria rather than specify a model.
Classification and regression trees (CART)
Random forests
Boosting
Other machine learning methods
Over the next few modules, we will briefly discuss a few of those methods.
Goal: predict outcome variable from several predictors.
Can be used for categorical outcomes (classification trees) or continuous outcomes (regression trees).
Let Y represent the outcome and X represent the predictors.
CART recursively partitions the predictor space in a way that can be effectively represented by a tree structure, with leaves corresponding to the subsets of units.
Partition X space so that subsets of individuals formed by partitions have relatively homogeneous Y.
Partitions from recursive binary splits of X.
Grow the tree until it reaches a pre-determined maximum size (e.g., a minimum number of points in each leaf).
Various ways to prune the tree back based on cross-validation.
Making predictions:
For any new X, trace down tree until you reach the appropriate leaf.
Use the value of Y that occurs most frequently in the leaf as the prediction (see the sketch below).
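As one possible way to carry out these steps in R (not the exact analysis from these notes), the sketch below grows a large classification tree with rpart, prunes it using rpart's built-in cross-validation, and predicts for new data; the built-in iris data are used purely for illustration.

```r
library(rpart)

# Grow a large tree: minbucket sets the minimum number of points per leaf,
# cp = 0 lets the tree reach its maximum size, xval = 10 requests 10-fold CV.
big_tree <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(minbucket = 5, cp = 0, xval = 10))

# Cross-validation results for each subtree size; prune back to the
# complexity value with the smallest cross-validated error (xerror).
printcp(big_tree)
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)

# Prediction for a new X: trace down the tree and report the most
# frequent class in the leaf.
predict(pruned, newdata = iris[1, ], type = "class")
```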
To illustrate, Figure 1 displays a fictional classification tree built from gender and race/ethnicity.
To approximate the conditional distribution of Y for a particular gender and race/ethnicity combination, one uses the values in the corresponding leaf.
For example, to predict a Y value for female Caucasians, one uses the Y value that occurs most frequently in leaf L3 (sketched in code below).
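A minimal sketch of that kind of prediction, using simulated data and hypothetical variable names (gender, race), might look like this:

```r
library(rpart)
set.seed(7)

# Simulated stand-in for the fictional example (variable names hypothetical)
toy <- data.frame(
  gender = sample(c("female", "male"), 300, replace = TRUE),
  race   = sample(c("African American", "Caucasian", "Other"), 300, replace = TRUE)
)
toy$y <- factor(ifelse(runif(300) < ifelse(toy$gender == "female", 0.7, 0.3), "A", "B"))

fit <- rpart(y ~ gender + race, data = toy, method = "class")

# The prediction is the most frequent Y value in the leaf this person falls into
predict(fit, newdata = data.frame(gender = "female", race = "Caucasian"),
        type = "class")
```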
Same idea as for categorical outcomes: grow tree by recursive partitions on X.
Use the variance of the Y values as a splitting criterion: choose the split that makes the sum of the variances of the Y values in the leaves as small as possible.
When making predictions for a new X, use the average value of Y in the leaf for that X (see the sketch below).
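rpart's "anova" method implements essentially this idea: splits are chosen to reduce the within-leaf sum of squares, and predictions are leaf means. A minimal sketch using the built-in mtcars data (an arbitrary illustration, not an example from these notes):

```r
library(rpart)

# Regression tree for a continuous outcome (mpg) on the built-in mtcars data.
# method = "anova" chooses splits that reduce the within-leaf sum of squares.
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                  control = rpart.control(minbucket = 5))

# Predicted value for any X = the average of Y in its leaf.
predict(reg_tree, newdata = mtcars[1:3, ])
```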
Can look at residuals, but...
No parametric model, so for continuous outcomes we can't check for linearity, non-constant variance, normality, etc.
Big residuals identify X values for which the predictions are not close to the actual Y values. But...what should we do with them?
Could use binned residuals for logistic regression, but they only tell you where model does not give good predictions.
Transforming the X values is irrelevant for trees (as long as the transformation is monotonic, like logs).
Can still do model validation, that is, compute and compare RMSEs, AUC, accuracy, and so on (a short sketch follows).
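A minimal validation sketch on simulated data (all variable names hypothetical), computing RMSE, accuracy, and AUC from tree predictions on a held-out set; the pROC package is one of several options for the AUC.

```r
library(rpart)
library(pROC)   # one common choice for computing AUC
set.seed(123)

# Simulated data (illustrative only): continuous outcome y, binary outcome z
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)
dat$z <- factor(rbinom(n, 1, plogis(dat$x1)))

idx   <- sample(n, 350)
train <- dat[idx, ]
test  <- dat[-idx, ]

# RMSE for a regression tree
reg_tree <- rpart(y ~ x1 + x2, data = train, method = "anova")
sqrt(mean((test$y - predict(reg_tree, newdata = test))^2))

# Accuracy for a classification tree
class_tree <- rpart(z ~ x1 + x2, data = train, method = "class")
mean(predict(class_tree, newdata = test, type = "class") == test$z)

# AUC from the predicted class probabilities
auc(roc(test$z, predict(class_tree, newdata = test, type = "prob")[, "1"]))
```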
No parametric assumptions.
Automatic model selection.
Multi-collinearity not problematic.
Useful exploratory tool to find important interactions.
In R, use tree or rpart.
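For example, either package fits and plots a tree with a couple of calls; the sketch below uses the built-in iris data purely for illustration.

```r
library(tree)
library(rpart)

# Same kind of tree fit with the two packages
fit_tree  <- tree(Species ~ ., data = iris)
plot(fit_tree); text(fit_tree)

fit_rpart <- rpart(Species ~ ., data = iris, method = "class")
plot(fit_rpart); text(fit_rpart)

# Reading the plots: splits on one predictor nested inside splits on another
# are a quick visual cue for interactions worth considering in a parametric model.
```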
Regression predictions are forced to the range of observed Y values (illustrated in the sketch after this list). May or may not be a limitation depending on the context.
Bins continuous predictors, so fine-grained relationships are lost.
Finds one tree, making it hard to interpret chance error for that tree.
No obvious ways to assess variable importance.
Harder to interpret effects of individual predictors.
Also, one big tree is limiting, but we need different datasets or variables to grow more than one tree...
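To illustrate the first limitation, the small simulation below (a toy example, not from these notes) fits a regression tree and a linear model to a linear trend and then extrapolates: the tree's prediction cannot exceed the range of the observed Y values, while the linear model follows the trend.

```r
library(rpart)
set.seed(1)

# Simulated linear trend on x in [0, 1]
train <- data.frame(x = runif(200))
train$y <- 2 * train$x + rnorm(200, sd = 0.1)

tree_fit <- rpart(y ~ x, data = train)
lm_fit   <- lm(y ~ x, data = train)

# Extrapolating to x = 2: the tree prediction is the mean of its rightmost
# leaf, so it stays near max(train$y); the linear model extrapolates to ~4.
predict(tree_fit, newdata = data.frame(x = 2))
predict(lm_fit,   newdata = data.frame(x = 2))
```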