In this example, we examine the predictive capacity of academic and non-academic predictors of the probability of admission to selective higher education in 2022.
Figure 1. Flowchart of the 2022 admission process.
As we have seen in class, an empty (intercept-only) model helps us determine what the model is predicting (i.e., the probability of a 0 or a 1). In this example, we fit three models: logistic, probit, and general linear. This shows that the logistic and probit models use different scales but lead to the same conclusions, and that the predictions of the general linear model can fall outside the limits of the observed variable.
# First, we fit an empty or null model to check what the model is fitting
empty_mod_log <- glm(AdmittedFull ~ 1, family = binomial(link = 'logit'), data = dat_mods)
empty_mod_prob <- glm(AdmittedFull ~ 1, family = binomial(link = 'probit'), data = dat_mods)
empty_mod_lm <- lm(AdmittedFull ~ 1, data = dat_mods)
These are the results of the empty models:
## Logistic Regression
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.263 0.009 146.42 0
## Probit Regression
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.771 0.005 154.11 0
## General Linear Regression
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.78 0.001 525.858 0
The predicted intercept (general mean) in each model seems different due to the differences between scales. However, when we apply the inverse link function (to go from the model scale to the data scale), we can see the similarities between models:
\[\begin{equation} p = \frac{1}{1 + e^{-\text{logit}}} \end{equation}\]
pred_empty_log <- round(plogis(coef(empty_mod_log)),4)
From our model empty_mod_log, the intercept corresponds to a predicted probability of 0.7795.
\[\begin{equation} p = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}} dt \end{equation}\]
pred_empty_prob <- round(pnorm(coef(empty_mod_prob)),4)
From our model empty_mod_prob, the intercept corresponds to a predicted probability of 0.7795.
\[\begin{equation} g(\mu) = \mu \end{equation}\]
From our model empty_mod_lm, the link is the identity, so the intercept is already on the probability scale and equals 0.7795.
We are now going to add Math (range between 150 and 850) as a predictor, to show that the logistic and probit models lead to similar conclusions, and that the general linear model produces predictions outside the limits of the outcome.
# Rescale Math to tens of points and center it at 550
dat_mods$MATH_550 = (dat_mods$MATH/10) - 55
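A quick sanity check on the rescaling (illustrative values only, not taken from dat_mods): dividing by 10 and subtracting 55 maps Math = 550 to 0 and Math = 850 to 30.

```r
# Hypothetical raw scores to verify the transformation
math_raw <- c(550, 850)
math_550 <- (math_raw / 10) - 55
math_550
# [1]  0 30
```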
# Simple linear models
simple_mod_log <- glm(AdmittedFull ~ 1 + MATH_550, family = binomial(link = 'logit'), data = dat_mods)
simple_mod_prob <- glm(AdmittedFull ~ 1 + MATH_550, family = binomial(link = 'probit'), data = dat_mods)
simple_mod_lm <- lm(AdmittedFull ~ 1 + MATH_550, data = dat_mods)
These are our model results:
## Logistic Regression
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.245 0.009 140.930 0
## MATH_550 0.055 0.001 53.377 0
## Probit Regression
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.754 0.005 148.104 0
## MATH_550 0.032 0.001 54.629 0
## General Linear Regression
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.766 0.001 519.500 0
## MATH_550 0.009 0.000 55.567 0
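One way to read the logistic slope is on the odds scale. This sketch uses the rounded estimate printed above, so it is approximate:

```r
# The logistic slope is on the log-odds scale; exponentiating gives the
# multiplicative change in the odds of admission per unit of MATH_550
# (i.e., per 10 raw Math points)
b_math <- 0.055
exp(b_math)       # odds ratio per 10 Math points, about 1.06
exp(b_math * 10)  # odds ratio per 100 Math points, about 1.73
```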
Again, we can go from model scale to data scale via inverse-link functions:
pred_simple_log <- round(plogis(coef(simple_mod_log)[1]),4)
From our model simple_mod_log, the intercept corresponds to a predicted probability of 0.7765 (at Math = 550, i.e., MATH_550 = 0).
pred_simple_prob <- round(pnorm(coef(simple_mod_prob)[1]),4)
From our model simple_mod_prob, the intercept corresponds to a predicted probability of 0.7746.
pred_simple_lm <- round(coef(simple_mod_lm)[1],4)
From our model simple_mod_lm, the intercept is already on the probability scale and equals 0.7659.
The results of our models after applying the inverse-link functions are very similar, which is reassuring. However, to see the problem with the general linear model, we only need to move the prediction point further along the predictor.
library(prediction)
prob_logit <- prediction(model = simple_mod_log, type = "response", at = list(MATH_550 = 30))
prob_logit <- round(mean(prob_logit$fitted),4)
At a math score of 850 (MATH_550 = 30), simple_mod_log predicts \(Adm\) = 0.9483.
prob_probit <- prediction(model = simple_mod_prob, type = "response", at = list(MATH_550 = 30))
prob_probit <- round(mean(prob_probit$fitted),4)
At a math score of 850 (MATH_550 = 30), simple_mod_prob predicts \(Adm\) = 0.9579.
prob_lm <- prediction(model = simple_mod_lm, type = "response", at = list(MATH_550 = 30))
prob_lm <- round(mean(prob_lm$fitted),4)
At a math score of 850 (MATH_550 = 30), simple_mod_lm predicts \(Adm\) = 1.0366.
For a student with a math score of 850, the general linear model predicts an admission probability of 1.0366, which is impossible, whereas the logistic and probit models keep their predictions below 1.
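As a check, the same three predictions can be recovered by hand from the rounded coefficients printed above (so the values match the prediction() output only approximately):

```r
# Evaluate each fitted equation at MATH_550 = 30 (Math = 850)
x <- 30
plogis(1.245 + 0.055 * x)  # logistic: about 0.948, inside [0, 1]
pnorm(0.754 + 0.032 * x)   # probit:   about 0.957, inside [0, 1]
0.766 + 0.009 * x          # linear:   1.036, outside [0, 1]
```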