Fitting Statistical Models

Library

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(here)
here() starts at /Users/nathangreenslit/Desktop/UGA/Spring 2023/MADA/nathangreenslit-MADA-portfolio
library(tidymodels)  # for the parsnip package, along with the rest of tidymodels
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.4     ✔ yardstick    1.1.0
✔ recipes      1.0.5     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(parsnip) # Had to load this separately due to an error message

# Helper packages for visualization
library(readr)       # for importing data
library(broom.mixed) # for converting Bayesian models to tidy tibbles
library(dotwhisker)  # for visualizing regression results
library(rpart)

Attaching package: 'rpart'

The following object is masked from 'package:dials':

    prune
# Libraries for model performance comparison
library(performance)

Attaching package: 'performance'

The following objects are masked from 'package:yardstick':

    mae, rmse
library(vip)

Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi

1. Load Data

d <- readRDS(here("fluanalysis", "data", "SypAct_clean.rds")) # load the cleaned flu data

2. Fitting a linear model to the continuous outcome (Body temperature) using only the main predictor of interest (Fever).

Define Linear Regression

linear_reg() %>%
  set_engine("lm")
Linear Regression Model Specification (regression)

Computational engine: lm 
lm_mod <- linear_reg()  # "lm" is the default engine for linear_reg()

Train model to data

lm_fit <- 
  lm_mod %>% 
  fit(BodyTemp ~ SubjectiveFever, data = d)
lm_fit
parsnip model object


Call:
stats::lm(formula = BodyTemp ~ SubjectiveFever, data = data)

Coefficients:
       (Intercept)  SubjectiveFeverYes  
           98.5739              0.5273  

Examine output

tidy(lm_fit)
# A tibble: 2 × 5
  term               estimate std.error statistic      p.value
  <chr>                 <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)          98.6      0.0773   1276.   0           
2 SubjectiveFeverYes    0.527    0.0934      5.65 0.0000000233

Plot

tidy(lm_fit) %>% 
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))

glance(lm_fit)
# A tibble: 1 × 12
  r.squ…¹ adj.r…² sigma stati…³ p.value    df logLik   AIC   BIC devia…⁴ df.re…⁵
    <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>   <int>
1  0.0420  0.0407  1.17    31.9 2.33e-8     1 -1151. 2307. 2321.   1000.     728
# … with 1 more variable: nobs <int>, and abbreviated variable names
#   ¹​r.squared, ²​adj.r.squared, ³​statistic, ⁴​deviance, ⁵​df.residual
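
Beyond the summary statistics from glance(), the fitted parsnip object can also be used to generate predictions, which is what the later performance comparison is based on. A minimal sketch (not part of the original output), assuming the training data d are simply re-used for prediction:

# Predict body temperature on the training data and compute RMSE (illustrative only)
lm_preds <- predict(lm_fit, new_data = d) %>%   # returns a tibble with a .pred column
  bind_cols(d %>% select(BodyTemp))             # attach the observed outcome
lm_preds %>% yardstick::rmse(truth = BodyTemp, estimate = .pred)  # yardstick::, since performance masks rmse()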

3. Fitting another linear model to the continuous outcome (Body Temperature) using all predictors of interest.

Define Linear Regression

lm_mod_all <- linear_reg()

Train model to data

lm_fit_all <- 
  lm_mod_all %>% 
  fit(BodyTemp ~ ., data = d)
lm_fit_all
parsnip model object


Call:
stats::lm(formula = BodyTemp ~ ., data = data)

Coefficients:
           (Intercept)    SwollenLymphNodesYes      ChestCongestionYes  
             97.932520               -0.166066                0.102608  
       ChillsSweatsYes      NasalCongestionYes               SneezeYes  
              0.191826               -0.221568               -0.372709  
            FatigueYes      SubjectiveFeverYes             HeadacheYes  
              0.272509                0.436694                0.004977  
          WeaknessMild        WeaknessModerate          WeaknessSevere  
              0.005517                0.087801                0.351281  
    CoughIntensityMild  CoughIntensityModerate    CoughIntensitySevere  
              0.359527                0.261469                0.287089  
           MyalgiaMild         MyalgiaModerate           MyalgiaSevere  
              0.156342               -0.028450               -0.136091  
          RunnyNoseYes               AbPainYes            ChestPainYes  
             -0.064666                0.015375                0.100308  
           DiarrheaYes                EyePnYes             InsomniaYes  
             -0.152344                0.128786               -0.005824  
           ItchyEyeYes               NauseaYes                EarPnYes  
             -0.003889               -0.033246                0.111125  
        PharyngitisYes           BreathlessYes              ToothPnYes  
              0.315991                0.086379               -0.035497  
              VomitYes               WheezeYes  
              0.160053               -0.034654  

Examine output

tidy(lm_fit_all)
# A tibble: 32 × 5
   term                 estimate std.error statistic   p.value
   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)          97.9        0.304   323.     0        
 2 SwollenLymphNodesYes -0.166      0.0920   -1.81   0.0714   
 3 ChestCongestionYes    0.103      0.0971    1.06   0.291    
 4 ChillsSweatsYes       0.192      0.127     1.51   0.131    
 5 NasalCongestionYes   -0.222      0.113    -1.95   0.0512   
 6 SneezeYes            -0.373      0.0980   -3.80   0.000156 
 7 FatigueYes            0.273      0.161     1.70   0.0901   
 8 SubjectiveFeverYes    0.437      0.103     4.25   0.0000244
 9 HeadacheYes           0.00498    0.125     0.0397 0.968    
10 WeaknessMild          0.00552    0.189     0.0292 0.977    
# … with 22 more rows

Plot

tidy(lm_fit_all) %>% 
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))

4. Compare the model results between the model with just the main predictor and the model with all predictors.

Assess Performance

check_model(lm_fit$fit)

check_model(lm_fit_all$fit)
Variable `Component` is not in your data frame :/

Compare Performance

compare_performance(lm_fit,lm_fit_all)
# Comparison of Model Performance Indices

Name       | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------------------------------------------------
lm_fit     |   _lm | 2307.1 (0.060) | 2307.1 (0.239) | 2320.9 (>.999) | 0.042 |     0.041 | 1.170 | 1.172
lm_fit_all |   _lm | 2301.6 (0.940) | 2304.8 (0.761) | 2453.1 (<.001) | 0.124 |     0.085 | 1.119 | 1.144

There does not seem to be a major difference in model performance between the univariate and multivariable models. Both show a similar observed vs. predicted pattern and comparable normality of residuals in the diagnostic plots. The multivariable model does have a somewhat higher R² (0.12 vs. 0.04) and a slightly lower RMSE (1.12 vs. 1.17), though its BIC is considerably worse because of the many added predictors.
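
Because the single-predictor model is nested within the all-predictors model, a nested-model F test is another way to check whether the extra predictors add explanatory value. A minimal sketch (not part of the original analysis), using the underlying lm objects stored inside the parsnip fits:

# Nested-model F test on the underlying lm objects (illustrative only)
anova(lm_fit$fit, lm_fit_all$fit)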

5. Fitting a logistic model to the categorical outcome (Nausea) using only the main predictor of interest (Vomit).

# Make model
log_mod <- logistic_reg() %>% 
  set_engine("glm")

# Train model
glm_fit <- 
  log_mod %>% 
  fit(Nausea ~ Vomit, data = d)
glm_fit 
parsnip model object


Call:  stats::glm(formula = Nausea ~ Vomit, family = stats::binomial, 
    data = data)

Coefficients:
(Intercept)     VomitYes  
    -0.8811       2.4010  

Degrees of Freedom: 729 Total (i.e. Null);  728 Residual
Null Deviance:      944.7 
Residual Deviance: 862  AIC: 866

Examine model output

tidy(glm_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -0.881    0.0861    -10.2  1.32e-24
2 VomitYes       2.40     0.307       7.81 5.63e-15
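
The estimates above are on the log-odds scale. As an optional extra step (not in the original output), broom can exponentiate them into odds ratios; for example, the VomitYes coefficient of 2.40 corresponds to an odds ratio of roughly exp(2.40) ≈ 11.

# Coefficients as odds ratios with confidence intervals (illustrative only)
tidy(glm_fit, exponentiate = TRUE, conf.int = TRUE)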

Plot

tidy(glm_fit) %>% 
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))

6. Fitting another logistic model to the categorical outcome (Nausea) using all (important) predictors of interest.

# Make model
log_mod_all <- logistic_reg() %>% 
  set_engine("glm")

# Train model
glm_fit_all <- 
  log_mod_all %>% 
  fit(Nausea ~ ., data = d)
glm_fit_all 
parsnip model object


Call:  stats::glm(formula = Nausea ~ ., family = stats::binomial, data = data)

Coefficients:
           (Intercept)    SwollenLymphNodesYes      ChestCongestionYes  
             0.2073331              -0.2468466               0.2652914  
       ChillsSweatsYes      NasalCongestionYes               SneezeYes  
             0.2905732               0.4400651               0.1650192  
            FatigueYes      SubjectiveFeverYes             HeadacheYes  
             0.2309581               0.2550192               0.3374915  
          WeaknessMild        WeaknessModerate          WeaknessSevere  
            -0.1203258               0.3243767               0.8296720  
    CoughIntensityMild  CoughIntensityModerate    CoughIntensitySevere  
            -0.3489315              -0.5080112              -1.0969293  
           MyalgiaMild         MyalgiaModerate           MyalgiaSevere  
            -0.0002988               0.2001406               0.1296120  
          RunnyNoseYes               AbPainYes            ChestPainYes  
             0.0526415               0.9415252               0.0715321  
           DiarrheaYes                EyePnYes             InsomniaYes  
             1.0647853              -0.3277246               0.0841010  
           ItchyEyeYes                EarPnYes          PharyngitisYes  
            -0.0526911              -0.1654360               0.2915358  
         BreathlessYes              ToothPnYes                VomitYes  
             0.5306311               0.4801514               2.4471603  
             WheezeYes                BodyTemp  
            -0.2758144              -0.0313649  

Degrees of Freedom: 729 Total (i.e. Null);  698 Residual
Null Deviance:      944.7 
Residual Deviance: 752.1    AIC: 816.1

Examine Output

tidy(glm_fit_all)
# A tibble: 32 × 5
   term                 estimate std.error statistic p.value
   <chr>                   <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)             0.207     7.81     0.0265  0.979 
 2 SwollenLymphNodesYes   -0.247     0.196   -1.26    0.207 
 3 ChestCongestionYes      0.265     0.210    1.26    0.207 
 4 ChillsSweatsYes         0.291     0.287    1.01    0.311 
 5 NasalCongestionYes      0.440     0.253    1.74    0.0822
 6 SneezeYes               0.165     0.210    0.787   0.431 
 7 FatigueYes              0.231     0.371    0.622   0.534 
 8 SubjectiveFeverYes      0.255     0.223    1.14    0.253 
 9 HeadacheYes             0.337     0.285    1.18    0.236 
10 WeaknessMild           -0.120     0.447   -0.269   0.788 
# … with 22 more rows

Plot

tidy(glm_fit_all) %>% 
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))
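
The fitted classification model can also produce class probabilities, which can be scored with yardstick. A minimal sketch (not part of the original analysis), assuming Nausea is a factor with levels No/Yes so that the event of interest is the second level:

# Predicted probabilities on the training data, scored with ROC AUC (illustrative only)
glm_probs <- predict(glm_fit_all, new_data = d, type = "prob") %>%  # adds .pred_No / .pred_Yes
  bind_cols(d %>% select(Nausea))
glm_probs %>% roc_auc(truth = Nausea, .pred_Yes, event_level = "second")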

7. Compare the model results for the categorical model with just the main predictor and all predictors.

Assess Performance

check_model(glm_fit$fit)

check_model(glm_fit_all$fit)
Variable `Component` is not in your data frame :/

Compare Performance

compare_performance(glm_fit,glm_fit_all)
# Comparison of Model Performance Indices

Name        | Model | AIC (weights) | AICc (weights) | BIC (weights) | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log | Score_spherical |   PCP
-------------------------------------------------------------------------------------------------------------------------------------------------
glm_fit     |  _glm | 866.0 (<.001) |  866.0 (<.001) | 875.2 (>.999) |     0.117 | 0.448 | 1.088 |    0.590 |  -123.912 |           0.019 | 0.599
glm_fit_all |  _glm | 816.1 (>.999) |  819.2 (>.999) | 963.1 (<.001) |     0.246 | 0.415 | 1.038 |    0.515 |      -Inf |           0.002 | 0.657

In this case, it looks like the multivariable model performed better than the univariate one. While both show a good observed vs. predicted line in the diagnostic plots, the former has better-behaved residuals, along with a higher Tjur's R² (0.25 vs. 0.12), a lower RMSE, and a substantially lower AIC in the comparison table.
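
Since the Vomit-only model is nested within the full logistic model, a likelihood-ratio test offers a complementary formal comparison. A minimal sketch (not part of the original analysis), again using the underlying glm objects:

# Likelihood-ratio test between the nested glm fits (illustrative only)
anova(glm_fit$fit, glm_fit_all$fit, test = "Chisq")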