class: center, middle, inverse, title-slide

# IDS 702: Module 2.7

## Aggregated outcomes; Probit regression

### Dr. Olanrewaju Michael Akande

---

## Aggregated binary outcomes

- In the datasets we have seen so far under logistic regression, we observe the binary outcome for each observation, that is, each `\(y_i \in \{0,1\}\)`.

--

- This is not always the case. Sometimes we get an aggregated version of the data, with the outcomes summed up over combinations of the other variables.

--

- For example, for individual-level data, suppose we had (only the first 25 of the 100 observations are shown)

<table style="text-align:center"><tr><td colspan="26" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left">response</td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>1</td></tr>
<tr><td style="text-align:left">predictor</td><td>3</td><td>3</td><td>2</td><td>1</td><td>2</td><td>3</td><td>2</td><td>2</td><td>2</td><td>2</td><td>3</td><td>1</td><td>3</td><td>1</td><td>1</td><td>2</td><td>2</td><td>2</td><td>2</td><td>1</td><td>3</td><td>3</td><td>3</td><td>1</td><td>3</td></tr>
<tr><td colspan="26" style="border-bottom: 1px solid black"></td></tr></table>

  where .hlight[predictor] is a factor with 3 levels: 1, 2, 3.

--

- The aggregated version of the same data could look like

<table style="text-align:center"><tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left">predictor</td><td>n</td><td>successes</td></tr>
<tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left">1</td><td>31</td><td>17</td></tr>
<tr><td style="text-align:left">2</td><td>35</td><td>16</td></tr>
<tr><td style="text-align:left">3</td><td>34</td><td>14</td></tr>
<tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr></table>

---

## Aggregated binary outcomes

- Recall that if `\(Y \sim \textrm{Bin}(n,p)\)` (that is, `\(Y\)` is a random variable that follows a binomial distribution with parameters `\(n\)` and `\(p\)`), then `\(Y\)` follows a `\(\textrm{Bernoulli}(p)\)` distribution when `\(n = 1\)`.

--

- Conversely, if `\(Z_1, \ldots, Z_n\)` are iid `\(\textrm{Bernoulli}(p)\)`, then `\(Y = \sum_{i=1}^n Z_i \sim \textrm{Bin}(n,p)\)`.

--

- That is, the sum of `\(n\)` iid `\(\textrm{Bernoulli}(p)\)` random variables gives a random variable with the `\(\textrm{Bin}(n,p)\)` distribution.

--

- The logistic regression model can be used either for Bernoulli data (as we have done so far) or for data summarized as binomial counts (that is, aggregated counts).

--

- In the aggregated form, the model is
.block[
.small[
`$$y_i | x_i \sim \textrm{Bin}(n_i,\pi_i); \ \ \ \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}.$$`
]
]

---

## Bernoulli versus binomial outcomes

Normally, for individual-level data, we would have

.small[
```
##   response predictor
## 1        0         3
## 2        0         3
## 3        1         2
## 4        1         1
## 5        1         2
## 6        0         3
```
]

.small[
```r
M1 <- glm(response~predictor,data=Data,family=binomial)
summary(M1)
```

```
## 
## Call:
## glm(formula = response ~ predictor, family = binomial, data = Data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.261  -1.105  -1.030   1.251   1.332  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.1942     0.3609   0.538    0.591
## predictor2   -0.3660     0.4954  -0.739    0.460
## predictor3   -0.5508     0.5017  -1.098    0.272
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.27  on 99  degrees of freedom
## Residual deviance: 137.02  on 97  degrees of freedom
## AIC: 143.02
## 
## Number of Fisher Scoring iterations: 4
```
]
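---

## Bernoulli versus binomial outcomes

As an aside, the aggregated data can be built directly from the individual-level data. Below is a minimal sketch using .hlight[dplyr] (an assumption, not code from these slides: it takes the individual-level data frame `Data` with `response` coded as numeric 0/1 and produces the `Data_agg` used on the next slide).

.small[
```r
library(dplyr)

# Collapse the individual-level data to one row per predictor level:
# n = number of observations, successes = number of 1s (assumes 0/1 coding)
Data_agg <- Data %>%
  group_by(predictor) %>%
  summarise(n = n(), successes = sum(response))
```
]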
---

## Bernoulli versus binomial outcomes

But we could also do the following with the aggregate-level data instead

.small[
```r
M2 <- glm(cbind(successes,n-successes)~predictor,data=Data_agg,family=binomial)
summary(M2)
```

```
## 
## Call:
## glm(formula = cbind(successes, n - successes) ~ predictor, family = binomial, 
##     data = Data_agg)
## 
## Deviance Residuals: 
## [1]  0  0  0
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.1942     0.3609   0.538    0.591
## predictor2   -0.3660     0.4954  -0.739    0.460
## predictor3   -0.5508     0.5017  -1.098    0.272
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.2524e+00  on 2  degrees of freedom
## Residual deviance: 1.3323e-14  on 0  degrees of freedom
## AIC: 17.868
## 
## Number of Fisher Scoring iterations: 2
```
]

--

Same results overall! The deviance and AIC differ because the two data formats have different likelihood functions.

--

Note that some glm implementations expect the total number of trials .hlight[n] in the formula instead of the number of failures .hlight[n-successes].

---

class: center, middle

# Probit regression

---

## Probit regression

- Recall the "Bernoulli" .hlight[logistic regression model]:
.block[
.small[
`$$y_i | x_i \sim \textrm{Bernoulli}(\pi_i); \ \ \ \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip},$$`
]
]
for `\(i=1,\ldots,n\)`.

--

- Here the link function is the .hlight[logit function], which ensures that the probabilities lie between 0 and 1.

--

- We can also use the .hlight[probit function] `\(\Phi^{-1}\)`, the quantile function associated with the standard normal distribution `\(N(0,1)\)`, as the link.

---

## Probit regression

- Suppose `\(H\)` follows a standard normal distribution, that is, `\(H \sim N(0,1)\)`.

--

- Then `\(\Phi\)` is its CDF, that is, `\(\Pr[H \leq h] = \Phi(h)\)`.

--

- Formally, the .hlight[probit regression model] can be written as
.block[
.small[
`$$y_i | x_i \sim \textrm{Bernoulli}(\pi_i); \ \ \ \Phi^{-1}\left(\pi_i\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}.$$`
]
]

--

- It is then easy to see that
.block[
.small[
$$
`\begin{split}
\Pr[y_i = 1 | x_i] = \pi_i & = \Phi\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}\right)\\
\\
& = \Pr[H \leq \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}].
\end{split}`
$$
]
]

---

## Latent variable representation

- It turns out that we can rewrite the .hlight[probit regression model] as
.block[
.small[
$$
`\begin{split}
y_i & = \mathbb{1}[z_i > 0];\\
z_i & = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \epsilon_i; \ \ \ \epsilon_i \sim N(0,1), \\
\end{split}`
$$
]
]
where `\(y_i = \mathbb{1}[z_i > 0]\)` means `\(y_i = 1\)` if `\(z_i > 0\)` and `\(y_i = 0\)` otherwise.

--

- To see that the two representations are equivalent, note that
.block[
.small[
$$
`\begin{split}
\Pr[y_i = 1 | x_i] & = \Pr[z_i > 0 | x_i] \\
& = \Pr[\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \epsilon_i > 0] \\
& = \Pr[\epsilon_i > -(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip})] \\
& = \Pr[\epsilon_i < (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip})] \ \ \ [\textrm{by symmetry of} \ N(0,1) \ \textrm{about} \ 0] \\
& = \Phi\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}\right) = \pi_i.
\end{split}`
$$
]
]

--

- Clearly, we do not observe `\(Z = (z_1, z_2, \ldots, z_n)\)`; it is thus referred to as a .hlight[latent] (or .hlight[auxiliary]) .hlight[variable]. A quick simulation of this representation is sketched on the next slide.
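---

## Latent variable representation

To make this concrete, here is a minimal simulation sketch of the latent variable representation (the sample size, coefficients, and the name `M3` are all made up for illustration): draw `\(z_i\)` as the linear predictor plus `\(N(0,1)\)` noise, threshold at zero, and the resulting `\(y_i\)` follow the probit model.

.small[
```r
set.seed(42)

n <- 1000
x <- rnorm(n)
beta0 <- 0.5; beta1 <- -1          # made-up "true" coefficients
z <- beta0 + beta1*x + rnorm(n)    # latent variable z_i
y <- as.numeric(z > 0)             # observed outcome y_i = 1[z_i > 0]

# A probit fit should approximately recover beta0 and beta1
M3 <- glm(y ~ x, family = binomial(link = "probit"))
coef(M3)
```
]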
---

## Probit vs logit functions?

- The plot below compares the inverse logit function `\(\pi_i = \dfrac{e^{x}}{1 + e^{x}}\)` and the standard normal CDF (inverse probit) `\(\pi_i = \Phi(x)\)`. (A sketch for reproducing a comparison like this appears in the appendix slide at the end.)

<img src="2-7-aggregated-outcomes_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

--

- Notice that they are similar, but the inverse logit has fatter tails; the standard normal CDF has thinner tails, so the probit curve approaches 0 and 1 more quickly.

---

## Probit or logistic regression?

- In practice, the decision to use one or the other is often based on preference: the overall conclusions from both are usually quite similar.

--

- The results from logistic regression (stated in terms of odds and odds ratios) can be more interpretable than those from probit regression.

--

- In some applications, interpreting the `\(z_i\)`'s may be meaningful, but that is not always the case.

--

- For example, if `\(y_i\)` is a binary variable for whether or not person `\(i\)` chooses to buy the new iPhone, then `\(z_i\)` can be thought of as person `\(i\)`'s "utility" for the purchase.

--

- That interpretation works in this example, but it does not always carry over to other domains.

--

- In .hlight[R], use the `glm` command but set the option `family=binomial(link="probit")` instead of `family=binomial(link="logit")`.

---

class: center, middle

# What's next?

### Move on to the readings for the next module!
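---

## Appendix: probit vs logit curves

A minimal sketch (not the original plotting code, which is not shown in these slides) for reproducing a comparison like the figure on the "Probit vs logit functions?" slide, using base .hlight[R] graphics:

.small[
```r
# Compare the inverse logit with the standard normal CDF (inverse probit)
x <- seq(-5, 5, by = 0.01)
plot(x, exp(x)/(1 + exp(x)), type = "l", col = "blue3",
     ylab = expression(pi[i]), main = "Inverse logit vs. inverse probit")
lines(x, pnorm(x), col = "red3")
legend("topleft", legend = c("inverse logit", "inverse probit"),
       col = c("blue3", "red3"), lty = 1)
```
]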