Context

We examine the predictive capacity of academic and non-academic predictors of the probability of admission to selective higher education in 2022.

Figure 1

Flowchart of the 2022 Admission Process

Empty Models

As we have seen in class, an empty (intercept-only) model can help us check what the model is predicting (i.e., the probability that the outcome equals 1 rather than 0). In this example, we will use three models: logistic, probit, and general linear. The comparison shows that the logistic and probit models use different scales but lead to the same conclusions, and that the predictions of the general linear model may fall outside the limits of the observed variable.

# First, we fit an empty or null model to check what the model is fitting
empty_mod_log <- glm(AdmittedFull ~ 1, family = binomial(link = 'logit'), data = dat_mods)
empty_mod_prob <- glm(AdmittedFull ~ 1, family = binomial(link = 'probit'), data = dat_mods)
empty_mod_lm <- lm(AdmittedFull ~ 1, data = dat_mods)

These are the results of the empty models:

## Logistic Regression
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    1.263      0.009  146.42        0
## Probit Regression
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    0.771      0.005  154.11        0
## General Linear Regression
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     0.78      0.001 525.858        0

The estimated intercept (the overall mean) looks different in each model because the models are on different scales. However, when we apply the inverse link function (to go from the model scale to the data scale), we can see that the models agree:

\[\begin{equation} p = \frac{1}{1 + e^{-\text{logit}}} \end{equation}\]

pred_empty_log <- round(plogis(coef(empty_mod_log)),4)

From our model empty_mod_log, the back-transformed intercept (the predicted probability) is 0.7795.

\[\begin{equation} p = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}} dt \end{equation}\]

pred_empty_prob <- round(pnorm(coef(empty_mod_prob)),4)

From our model empty_mod_prob, the back-transformed intercept (the predicted probability) is 0.7795.

\[\begin{equation} g(\mu) = \mu \end{equation}\]
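# Identity link: the intercept is already on the probability scale (object name assumed, mirroring those above)
pred_empty_lm <- round(coef(empty_mod_lm),4)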

From our model empty_mod_lm, the intercept is already on the probability scale and is equal to 0.7795.
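All three back-transformed intercepts match the observed proportion of admitted students, which is exactly what an empty model estimates. A quick check (a minimal sketch, assuming AdmittedFull is coded 0/1):

# Observed admission rate; should equal the back-transformed intercepts above
round(mean(dat_mods$AdmittedFull), 4)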

Adding Math Scores

We are going to add Math (scores range from 150 to 850) as a predictor to see how the conclusions of the logistic and probit models remain similar. In addition, we will see how the general linear model produces predictions outside the limits of the outcome.

Using Math Centered at 550

# Rescale Math to tens of points and center at 550 (Math = 550 becomes MATH_550 = 0)
dat_mods$MATH_550 <- (dat_mods$MATH/10) - 55

# Simple models with centered Math as the single predictor
simple_mod_log <- glm(AdmittedFull ~ 1 + MATH_550, family = binomial(link = 'logit'), data = dat_mods)
simple_mod_prob <- glm(AdmittedFull ~ 1 + MATH_550, family = binomial(link = 'probit'), data = dat_mods)
simple_mod_lm <- lm(AdmittedFull ~ 1 + MATH_550, data = dat_mods)

These are our model results:

## Logistic Regression
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    1.245      0.009 140.930        0
## MATH_550       0.055      0.001  53.377        0
## Probit Regression
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    0.754      0.005 148.104        0
## MATH_550       0.032      0.001  54.629        0
## General Linear Regression
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.766      0.001 519.500        0
## MATH_550       0.009      0.000  55.567        0
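Before back-transforming, note that the logit and probit slopes differ mainly by a scale factor: logit coefficients are typically about 1.6 to 1.8 times their probit counterparts, because the logistic distribution has a larger standard deviation than the standard normal. A quick check with the fitted objects above (a minimal sketch):

# Ratio of the logit slope to the probit slope; here roughly 1.7
coef(simple_mod_log)["MATH_550"] / coef(simple_mod_prob)["MATH_550"]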

Again, we can go from model scale to data scale via inverse-link functions:

  • Inverse logit function
pred_simple_log <- round(plogis(coef(simple_mod_log)[1]),4)

From our model simple_mod_log, the back-transformed intercept is 0.7765.

  • Inverse of the cumulative normal distribution function
pred_simple_prob <- round(pnorm(coef(simple_mod_prob)[1]),4)

From our model simple_mod_prob, the back-transformed intercept is 0.7746.

  • “Inverse” identity link function
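pred_simple_lm <- round(coef(simple_mod_lm)[1],4)  # identity link: no transformation needed (object name assumed)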

From our model simple_mod_lm, the intercept is already on the probability scale and is equal to 0.7659.

After applying the inverse link functions, the results of our models are very similar, which is good. Note that these intercepts are now the predicted probability of admission for a student with a Math score of 550 (where MATH_550 = 0). To see the problem with the general linear model, however, we only need to move the prediction point away from that center.

Predicting Probability of Admission at Math Score = 850

library(prediction)
prob_logit <- prediction(model = simple_mod_log, type = "response", at = list(MATH_550 = 30))
prob_logit <- round(mean(prob_logit$fitted),4)

At a math score of 850, simple_mod_log predicts \(Adm\) = 0.9483.

prob_probit <- prediction(model = simple_mod_prob, type = "response", at = list(MATH_550 = 30))
prob_probit <- round(mean(prob_probit$fitted),4)

At a math score of 850, simple_mod_prob predicts \(Adm\) = 0.9579.

prob_lm <- prediction(model = simple_mod_lm, type = "response", at = list(MATH_550 = 30))
prob_lm <- round(mean(prob_lm$fitted),4)

At a math score of 850, simple_mod_lm predicts \(Adm\) = 1.0366.

For a student with a math score of 850, the probability of admission predicted by the general linear model is 1.0366, which is impossible for a probability, whereas the logistic and probit models keep their predictions below 1.
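As a check on the prediction package, the same numbers can be reproduced by hand: plug MATH_550 = 30 (i.e., Math = 850) into each fitted equation and apply the inverse link. This is a minimal sketch with temporary objects b_log, b_prob, and b_lm; small differences from the rounded coefficients printed above are expected.

b_log  <- coef(simple_mod_log)
b_prob <- coef(simple_mod_prob)
b_lm   <- coef(simple_mod_lm)

plogis(b_log[1] + 30 * b_log[2])   # logistic: inverse logit of the linear predictor, about 0.948
pnorm(b_prob[1] + 30 * b_prob[2])  # probit: standard normal CDF of the linear predictor, about 0.958
b_lm[1] + 30 * b_lm[2]             # general linear model: identity link, about 1.037, above 1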