Tidy Tuesday Exercise

Author

Asmith Joseph

Published

April 9, 2025

# Load the necessary libraries
library(here)
Warning: package 'here' was built under R version 4.4.2
here() starts at C:/Users/ajose35/Desktop/MADA-course/AsmithJoseph-MADA-portfolio
library(readr)
Warning: package 'readr' was built under R version 4.4.2
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.4.3
library(knitr)
Warning: package 'knitr' was built under R version 4.4.3
library(maps)
Warning: package 'maps' was built under R version 4.4.3
library(tidyr)
library(tidymodels)
Warning: package 'tidymodels' was built under R version 4.4.2
── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.8     ✔ rsample      1.2.1
✔ dials        1.4.0     ✔ tibble       3.2.1
✔ infer        1.0.7     ✔ tune         1.3.0
✔ modeldata    1.4.0     ✔ workflows    1.2.0
✔ parsnip      1.3.0     ✔ workflowsets 1.1.0
✔ purrr        1.0.4     ✔ yardstick    1.3.2
✔ recipes      1.1.1     
Warning: package 'dials' was built under R version 4.4.2
Warning: package 'scales' was built under R version 4.4.2
Warning: package 'infer' was built under R version 4.4.2
Warning: package 'modeldata' was built under R version 4.4.2
Warning: package 'parsnip' was built under R version 4.4.2
Warning: package 'purrr' was built under R version 4.4.2
Warning: package 'recipes' was built under R version 4.4.2
Warning: package 'rsample' was built under R version 4.4.3
Warning: package 'tune' was built under R version 4.4.2
Warning: package 'workflows' was built under R version 4.4.2
Warning: package 'workflowsets' was built under R version 4.4.2
Warning: package 'yardstick' was built under R version 4.4.3
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard()  masks scales::discard()
✖ dplyr::filter()   masks stats::filter()
✖ dplyr::lag()      masks stats::lag()
✖ purrr::map()      masks maps::map()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
library(vip)
Warning: package 'vip' was built under R version 4.4.3

Attaching package: 'vip'
The following object is masked from 'package:utils':

    vi
library(tidymodels)
library(yardstick)
library(tibble)
library(gt)
Warning: package 'gt' was built under R version 4.4.2

Part 1: Importing & Reading the Dataset

# Defining the raw URL for the CSV file
raw_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-08/care_state.csv"

# Read the CSV file directly from the URL
care_state <- read_csv(raw_url)
Rows: 1232 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): state, condition, measure_id, measure_name, footnote
dbl  (1): score
date (2): start_date, end_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the data
head(care_state)
# A tibble: 6 × 8
  state condition   measure_id measure_name score footnote start_date end_date  
  <chr> <chr>       <chr>      <chr>        <dbl> <chr>    <date>     <date>    
1 AK    Healthcare… HCP_COVID… Percentage …   7.3 <NA>     2024-01-01 2024-03-31
2 AK    Healthcare… IMM_3      Healthcare …  80   <NA>     2023-10-01 2024-03-31
3 AK    Emergency … OP_18b     Average (me… 140   25, 26   2023-04-01 2024-03-31
4 AK    Emergency … OP_18b_HI… Average tim… 157   25, 26   2023-04-01 2024-03-31
5 AK    Emergency … OP_18b_LO… Average tim… 136   25, 26   2023-04-01 2024-03-31
6 AK    Emergency … OP_18b_ME… Average tim… 136   25, 26   2023-04-01 2024-03-31

Part 2: Exploratory Data Analysis

# Define the raw URL for the CSV file
raw_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-08/care_state.csv"

# Read the CSV data directly from the URL
care_state <- read_csv(raw_url)
Rows: 1232 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): state, condition, measure_id, measure_name, footnote
dbl  (1): score
date (2): start_date, end_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the structure and summary of the data
str(care_state)
spc_tbl_ [1,232 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state       : chr [1:1232] "AK" "AK" "AK" "AK" ...
 $ condition   : chr [1:1232] "Healthcare Personnel Vaccination" "Healthcare Personnel Vaccination" "Emergency Department" "Emergency Department" ...
 $ measure_id  : chr [1:1232] "HCP_COVID_19" "IMM_3" "OP_18b" "OP_18b_HIGH_MIN" ...
 $ measure_name: chr [1:1232] "Percentage of healthcare personnel who are up to date with COVID-19 vaccinations" "Healthcare workers given influenza vaccination Higher percentages are better" "Average (median) time patients spent in the emergency department before leaving from the visit A lower number o"| __truncated__ "Average time patients spent in the emergency department before being sent home A lower number of minutes is better (high)" ...
 $ score       : num [1:1232] 7.3 80 140 157 136 136 NA 196 230 182 ...
 $ footnote    : chr [1:1232] NA NA "25, 26" "25, 26" ...
 $ start_date  : Date[1:1232], format: "2024-01-01" "2023-10-01" ...
 $ end_date    : Date[1:1232], format: "2024-03-31" "2024-03-31" ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   condition = col_character(),
  ..   measure_id = col_character(),
  ..   measure_name = col_character(),
  ..   score = col_double(),
  ..   footnote = col_character(),
  ..   start_date = col_date(format = ""),
  ..   end_date = col_date(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 
summary(care_state)
    state            condition          measure_id        measure_name      
 Length:1232        Length:1232        Length:1232        Length:1232       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
     score       footnote           start_date            end_date         
 Min.   :  1   Length:1232        Min.   :2023-01-01   Min.   :2023-12-31  
 1st Qu.: 70   Class :character   1st Qu.:2023-04-01   1st Qu.:2024-03-31  
 Median : 93   Mode  :character   Median :2023-04-01   Median :2024-03-31  
 Mean   :134                      Mean   :2023-04-05   Mean   :2024-03-14  
 3rd Qu.:193                      3rd Qu.:2023-04-01   3rd Qu.:2024-03-31  
 Max.   :730                      Max.   :2024-01-01   Max.   :2024-03-31  
 NA's   :155                                                               

Summary table of missing values for each column

missing_summary <- care_state %>% 
  summarise(across(everything(), ~ sum(is.na(.))))
print(missing_summary)
# A tibble: 1 × 8
  state condition measure_id measure_name score footnote start_date end_date
  <int>     <int>      <int>        <int> <int>    <int>      <int>    <int>
1     0         0          0            0   155      168          0        0

Frequency Table for Categorical Variables

if("state" %in% colnames(care_state)){
  state_freq <- care_state %>%
    count(state) %>%
    arrange(desc(n))
  
  # Print the frequency table directly so it appears in the output
  print(state_freq)
}
# A tibble: 56 × 2
   state     n
   <chr> <int>
 1 AK       22
 2 AL       22
 3 AR       22
 4 AS       22
 5 AZ       22
 6 CA       22
 7 CO       22
 8 CT       22
 9 DC       22
10 DE       22
# ℹ 46 more rows

Visualizing Numeric Variables

numeric_columns <- care_state %>% 
  select(where(is.numeric)) %>% 
  names()

for(col in numeric_columns){
  print(
    ggplot(care_state, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 30, fill = "skyblue", color = "black") +
      labs(title = paste("Histogram of", col),
           x = col, 
           y = "Count") +
      theme_minimal()
  )
}
Warning: Removed 155 rows containing non-finite outside the scale range
(`stat_bin()`).

Observations by State for Each Condition

# Convert state abbreviations to full, lowercase state names
abbr_to_state <- setNames(tolower(state.name), state.abb)
care_state <- care_state %>%
  mutate(state_full = ifelse(state %in% names(abbr_to_state),
                             abbr_to_state[state],
                             tolower(state)))

# Group by the converted state names and condition, then count observations
state_condition_freq <- care_state %>% 
  group_by(state_full, condition) %>% 
  summarize(count = n(), .groups = "drop")

# Get U.S. states map data; note that the 'region' column contains the state full names in lowercase
states_map <- map_data("state")

# Join the map data with the frequency table based on matching state names
map_data_condition <- left_join(states_map, state_condition_freq, 
                                by = c("region" = "state_full"),
                                relationship = "many-to-many")  # each state appears once per condition, so many-to-many is expected
# Produce a faceted choropleth map showing each condition's frequency by state
ggplot(map_data_condition, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = count), color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", na.value = "grey90") +
  coord_fixed(1.3) +
  labs(fill = "Count", title = "Observations by State for Each Condition") +
  facet_wrap(~ condition)

Frequency table showing the count for each condition

# Create a frequency table for all conditions
condition_freq <- care_state %>%
  group_by(condition) %>%
  summarize(frequency = n(), .groups = "drop") %>%
  arrange(desc(frequency))

# Print the frequency table
print(condition_freq)
# A tibble: 6 × 2
  condition                           frequency
  <chr>                                   <int>
1 Emergency Department                      672
2 Sepsis Care                               280
3 Healthcare Personnel Vaccination          112
4 Cataract surgery outcome                   56
5 Colonoscopy care                           56
6 Electronic Clinical Quality Measure        56

Part 3: Full Analysis

Formulate a Research Question

Hypothesis: States with higher adherence to septic shock care bundles (both the 3-Hour and 6-Hour Bundles) exhibit better overall sepsis care quality.

Research Question: Do states with higher adherence to septic shock care protocols (3-Hour and 6-Hour bundles) have better sepsis care quality, as reflected in a synthetic quality score?

Outcome (Dependent Variable): Synthetic Sepsis Quality Score, created by combining the adherence counts for both bundles and adding random variation to simulate an overall quality measure.

Predictors (Independent Variables): the count of Septic Shock 3-Hour Bundle adherence and the count of Septic Shock 6-Hour Bundle adherence.

Create Synthetic state_summary

# creating a synthetic state_summary.
# This synthetic data assumes one observation per state (50 states) with two predictors:
#   - bundle_3hr: count for "Septic Shock 3-Hour Bundle"
#   - bundle_6hr: count for "Septic Shock 6-Hour Bundle"
# And a synthetic outcome: quality_score

set.seed(123)
state_summary <- data.frame(
  state_full = tolower(state.name),
  bundle_3hr = sample(10:50, 50, replace = TRUE),
  bundle_6hr = sample(5:25, 50, replace = TRUE)
) %>%
  mutate(quality_score = (bundle_3hr + bundle_6hr) * 0.3 + rnorm(50, mean = 0, sd = 5))

# Print the synthetic state_summary
print(state_summary)
       state_full bundle_3hr bundle_6hr quality_score
1         alabama         40         19     14.271466
2          alaska         24         13      9.703332
3         arizona         23         19      8.686349
4        arkansas         12         20      5.705014
5      california         46         24     19.126000
6        colorado         23         10      8.303031
7     connecticut         34         15     15.122719
8        delaware         35         12     10.257632
9         florida         36         11     10.970445
10        georgia         14         20      5.695646
11         hawaii         36         21     20.418643
12          idaho         37         22     19.201396
13       illinois         18         21     12.074284
14        indiana         38          6     14.231863
15           iowa         44          8     13.155386
16         kansas         17         17      7.060242
17       kentucky         35          9     12.965416
18      louisiana         16         23     12.513091
19          maine         18         24     19.061530
20       maryland         28         18     11.482217
21  massachusetts         45          7     17.127316
22       michigan         23         12     10.080056
23      minnesota         26         20     15.851817
24    mississippi         48         16     20.118391
25       missouri         21         18     20.593708
26        montana         24          7      9.488414
27       nebraska         41         18     23.581101
28         nevada         16         11      5.307321
29  new hampshire         18          7      2.771910
30     new jersey         50         19     17.374057
31     new mexico         19         25     15.460151
32       new york         32          9     14.934278
33 north carolina         36         12     13.248689
34   north dakota         16         23     18.687133
35           ohio         36         14     23.818265
36       oklahoma         41         22     21.328007
37         oregon         47         14     16.971305
38   pennsylvania         34         16     15.758057
39   rhode island         43          6     21.583049
40 south carolina         38         14     14.698028
41   south dakota         14         16      1.161624
42      tennessee         17         24     10.996371
43          texas         21         18     16.509052
44           utah         22         21     17.169477
45        vermont         27         18     15.593984
46       virginia         42          7     16.399783
47     washington         36         12     17.382125
48  west virginia         34         18     24.957090
49      wisconsin         47         23     24.014352
50        wyoming         30         19     10.869156

Pre-process and Clean the Data

state_clean <- state_summary %>% 
  drop_na()


state_clean <- state_clean %>%
  mutate(across(c(bundle_3hr, bundle_6hr, quality_score), scale))

# Print a summary and the cleaned dataset to verify changes.
summary(state_clean)
  state_full           bundle_3hr.V1        bundle_6hr.V1    
 Length:50          Min.   :-1.6766699   Min.   :-1.7801476  
 Class :character   1st Qu.:-0.8547729   1st Qu.:-0.7077695  
 Mode  :character   Median : 0.2410898   Median : 0.2752437  
                    Mean   : 0.0000000   Mean   : 0.0000000  
                    3rd Qu.: 0.6976992   3rd Qu.: 0.7220679  
                    Max.   : 1.7935619   Max.   : 1.6157163  
   quality_score.V1  
 Min.   :-2.3807240  
 1st Qu.:-0.6405106  
 Median : 0.0986433  
 Mean   : 0.0000000  
 3rd Qu.: 0.6944662  
 Max.   : 1.8738541  
print(state_clean)
       state_full  bundle_3hr   bundle_6hr quality_score
1         alabama  0.88034301  0.543338236   -0.03671249
2          alaska -0.58080721 -0.529039862   -0.85348492
3         arizona -0.67212910  0.543338236   -1.03531943
4        arkansas -1.67666988  0.722067919   -1.56837568
5      california  1.42827434  1.436986651    0.83126770
6        colorado -0.67212910 -1.065228910   -1.10385580
7     connecticut  0.33241168 -0.171580496    0.11548972
8        delaware  0.42373356 -0.707769545   -0.75437734
9         florida  0.51505545 -0.886499228   -0.62692783
10        georgia -1.49402610  0.722067919   -1.57005068
11         hawaii  0.51505545  0.900797602    1.06238964
12          idaho  0.60637734  1.079527285    0.84474836
13       illinois -1.12873855  0.900797602   -0.42956385
14        indiana  0.69769923 -1.780147642   -0.04379330
15           iowa  1.24563056 -1.422688276   -0.23626519
16         kansas -1.22006043  0.185878870   -1.32606385
17       kentucky  0.42373356 -1.243958593   -0.27023132
18      louisiana -1.31138232  1.258256968   -0.35110620
19          maine -1.12873855  1.436986651    0.81974061
20       maryland -0.21551966  0.364608553   -0.53542409
21  massachusetts  1.33695245 -1.601417959    0.47390741
22       michigan -0.67212910 -0.707769545   -0.78612747
23      minnesota -0.39816344  0.722067919    0.24585095
24    mississippi  1.61091812  0.007149187    1.00870520
25       missouri -0.85477288  0.364608553    1.09369084
26        montana -0.58080721 -1.601417959   -0.89191182
27       nebraska  0.97166490  0.364608553    1.62783020
28         nevada -1.31138232 -0.886499228   -1.63948231
29  new hampshire -1.12873855 -1.601417959   -2.09280828
30     new jersey  1.79356190  0.543338236    0.51802411
31     new mexico -1.03741666  1.615716334    0.17582186
32       new york  0.14976790 -1.243958593    0.08179697
33 north carolina  0.51505545 -0.707769545   -0.21958281
34   north dakota -1.31138232  1.258256968    0.75279936
35           ohio  0.51505545 -0.350310179    1.67023471
36       oklahoma  0.97166490  1.079527285    1.22498186
37         oregon  1.51959623 -0.350310179    0.44601301
38   pennsylvania  0.33241168  0.007149187    0.22908680
39   rhode island  1.15430867 -1.780147642    1.27058288
40 south carolina  0.69769923 -0.350310179    0.03955599
41   south dakota -1.49402610  0.007149187   -2.38072395
42      tennessee -1.22006043  1.436986651   -0.62229248
43          texas -0.85477288  0.364608553    0.36336307
44           utah -0.76345099  0.900797602    0.48144574
45        vermont -0.30684155  0.364608553    0.19975083
46       virginia  1.06298679 -1.601417959    0.34382597
47     washington  0.51505545 -0.707769545    0.51946674
48  west virginia  0.33241168  0.364608553    1.87385415
49      wisconsin  1.51959623  1.258256968    1.70529459
50        wyoming -0.03287588  0.543338236   -0.64503817

Split into Training and Testing Sets

# Set seed for reproducibility
set.seed(123)

# Calculate the sample size for the training set (70% of the data)
train_size <- floor(0.7 * nrow(state_clean))

# Randomly sample indices for the training set
train_indices <- sample(seq_len(nrow(state_clean)), size = train_size)

# Create the training and testing datasets using the sampled indices
train_data <- state_clean[train_indices, ]
test_data  <- state_clean[-train_indices, ]

# Display the dimensions of the training and testing sets
cat("Training set dimensions:", dim(train_data), "\n")
Training set dimensions: 35 4 
cat("Testing set dimensions:", dim(test_data), "\n")
Testing set dimensions: 15 4 

In this part of my analysis, I set out to examine whether states with higher adherence to septic shock care bundles exhibit better sepsis care quality. To do this, I formulated the research question: “Do states with higher adherence to septic shock care protocols (3-Hour and 6-Hour bundles) have better sepsis care quality as reflected in a synthetic quality score?” I then created a synthetic dataset, state_summary, representing 50 states with counts for the Septic Shock 3-Hour and 6-Hour Bundles and derived a synthetic quality score as my outcome. After generating the data, I pre-processed it by removing any missing values and standardizing the predictors and outcome. Finally, I split the cleaned dataset into training (70%) and testing (30%) sets to prepare for further modeling. This end-to-end process allowed me to simulate a complete analysis workflow from data creation, through cleaning, to model preparation.
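
For reference, the same 70/30 split can also be done with rsample's initial_split(), which is already loaded via tidymodels. The following is a minimal sketch under the assumption that the same seed and proportion are used; because its sampling mechanism differs from the manual sample() call above, the resulting partition will not be identical, so it is shown only as an alternative rather than the split used in the rest of the analysis.

# Alternative 70/30 split using rsample (sketch; not the split used below)
set.seed(123)
state_split <- initial_split(state_clean, prop = 0.7)
train_alt   <- training(state_split)
test_alt    <- testing(state_split)
cat("Alternative training set dimensions:", dim(train_alt), "\n")
cat("Alternative testing set dimensions:",  dim(test_alt), "\n")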

Part 3a: Three different model types

Setup (Load Libraries and Create CV Folds)

set.seed(123)

# Create 5-fold cross-validation folds from the training data.
cv_folds <- vfold_cv(train_data, v = 5)

Model 1 – Linear Regression

# Specify the linear regression model using the lm engine.
lin_reg_spec <- linear_reg(mode = "regression") %>%
  set_engine("lm")

# Define the workflow: outcome quality_score is predicted by bundle_3hr and bundle_6hr.
lin_reg_workflow <- workflow() %>%
  add_model(lin_reg_spec) %>%
  add_formula(quality_score ~ bundle_3hr + bundle_6hr)

# Fit the linear regression using resampling (CV)
lin_reg_res <- fit_resamples(
  lin_reg_workflow,
  resamples = cv_folds,
  metrics = metric_set(rmse, rsq, mae)
)

# Collect and print performance metrics
lin_reg_metrics <- collect_metrics(lin_reg_res)
print(lin_reg_metrics)
# A tibble: 3 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard   0.577     5  0.0547 Preprocessor1_Model1
2 rmse    standard   0.724     5  0.0393 Preprocessor1_Model1
3 rsq     standard   0.629     5  0.112  Preprocessor1_Model1

The cross-validation results indicate that the model has a mean MAE of approximately 0.577 and an RMSE of around 0.724, suggesting that, on average, our model’s predictions deviate from the actual values by roughly 0.58–0.72 units. The R² of about 0.629 means that the model is able to explain nearly 63% of the variability in the outcome. The standard errors—roughly 0.055 for MAE, 0.039 for RMSE, and 0.112 for R²—are relatively low, indicating consistent performance across the 5-fold cross-validation.
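
To see the fold-to-fold spread behind these averages rather than just the standard errors, collect_metrics() can also return the unsummarized per-fold values. A minimal sketch using the lin_reg_res object created above:

# Per-fold metrics instead of the averaged summary (sketch)
lin_reg_fold_metrics <- collect_metrics(lin_reg_res, summarize = FALSE)
print(lin_reg_fold_metrics)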

# Fit the final linear regression model on the training dataset
final_lin_reg <- fit(lin_reg_workflow, data = train_data)

# Generate predictions on the training set and combine them with the actual outcomes
lin_reg_preds <- predict(final_lin_reg, train_data) %>%
  bind_cols(train_data)

# Create the observed vs predicted plot
ggplot(lin_reg_preds, aes(x = quality_score, y = .pred)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Observed vs Predicted: Linear Regression Model",
       x = "Observed Quality Score",
       y = "Predicted Quality Score") +
  theme_minimal()

Most of the predictions are close to the actual values, as shown by the points clustering around the diagonal line. However, there is still some spread, which means the model sometimes predicts a bit too high or too low. Overall, when the actual scores are higher, the predictions tend to be higher too, but there’s still room to improve the model’s accuracy.
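
Because this is a plain linear model, the fitted coefficients are also easy to inspect; the sketch below uses broom's tidy() (loaded with tidymodels) on the fitted workflow from above:

# Inspect the estimated coefficients of the fitted linear regression (sketch)
final_lin_reg %>%
  extract_fit_parsnip() %>%
  tidy()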

Model 2 – Random Forest

# Specify the random forest model. Here, mtry is set to 2 and trees to 500.
rf_spec <- rand_forest(mode = "regression", mtry = 2, trees = 500) %>%
  set_engine("ranger", importance = "permutation")

# Define the workflow for random forest.
rf_workflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(quality_score ~ bundle_3hr + bundle_6hr)

# Fit the random forest with CV.
rf_res <- fit_resamples(
  rf_workflow,
  resamples = cv_folds,
  metrics = metric_set(rmse, rsq, mae)
)
Warning: package 'ranger' was built under R version 4.4.3
# Collect and print performance metrics for random forest.
rf_metrics <- collect_metrics(rf_res)
print(rf_metrics)
# A tibble: 3 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard   0.532     5  0.0684 Preprocessor1_Model1
2 rmse    standard   0.669     5  0.0958 Preprocessor1_Model1
3 rsq     standard   0.614     5  0.126  Preprocessor1_Model1

Model 2 (random forest) shows an average MAE of 0.532 (SE = 0.068), RMSE of 0.669 (SE = 0.096), and R² of 0.614 (SE = 0.126). This indicates the model’s predictions deviate by about 0.53 units on average, with low error variability and approximately 61% of the outcome variation explained, demonstrating stable performance across 5-fold CV.

# Fit the final random forest model on the training dataset
final_rf_model <- fit(rf_workflow, data = train_data)

# Generate predictions on the training set and combine them with the actual outcomes
rf_preds <- predict(final_rf_model, train_data) %>%
  bind_cols(train_data)

# Create the observed vs predicted plot for the random forest model
ggplot(rf_preds, aes(x = quality_score, y = .pred)) +
  geom_point(color = "darkgreen", alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Observed vs Predicted: Random Forest Model",
       x = "Observed Quality Score",
       y = "Predicted Quality Score") +
  theme_minimal()

The random forest model performs well overall. Most of its predictions are close to the ideal line where observed and predicted values match, showing good agreement. However, some predictions still deviate, indicating that the model occasionally overestimates or underestimates the actual values. This suggests that while the model captures the overall trend, there is still potential for improvement in its accuracy.

Model 3 – Boosted Trees

# Specify the boosted tree model with 500 trees, a tree depth of 3, and a learning rate of 0.1.
boost_spec <- boost_tree(
  mode = "regression",
  trees = 500,
  tree_depth = 3,
  learn_rate = 0.1
) %>%
  set_engine("xgboost")

# Define the workflow for boosted trees.
boost_workflow <- workflow() %>%
  add_model(boost_spec) %>%
  add_formula(quality_score ~ bundle_3hr + bundle_6hr)

# Fit the boosted tree model with CV.
boost_res <- fit_resamples(
  boost_workflow,
  resamples = cv_folds,
  metrics = metric_set(rmse, rsq, mae)
)
Warning: package 'xgboost' was built under R version 4.4.3
# Collect and print performance metrics.
boost_metrics <- collect_metrics(boost_res)
print(boost_metrics)
# A tibble: 3 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard   0.690     5  0.0370 Preprocessor1_Model1
2 rmse    standard   0.815     5  0.0388 Preprocessor1_Model1
3 rsq     standard   0.540     5  0.121  Preprocessor1_Model1

Model 3 (boosted trees) shows an average MAE of 0.690 (SE = 0.037), RMSE of 0.815 (SE = 0.039), and R² of 0.540 (SE = 0.121). This means that, on average, the predictions deviate by about 0.69 units, and roughly 54% of the outcome’s variance is explained by the model, with low standard errors indicating stable performance across 5-fold CV.

# Fit the final boosted trees model on the training dataset
final_boost <- fit(boost_workflow, data = train_data)

# Generate predictions on the training set and combine them with the actual outcomes
boost_preds <- predict(final_boost, train_data) %>%
  bind_cols(train_data)

# Create the observed vs predicted plot for the boosted trees model
ggplot(boost_preds, aes(x = quality_score, y = .pred)) +
  geom_point(color = "purple", alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Observed vs Predicted: Boosted Trees Model",
       x = "Observed Quality Score",
       y = "Predicted Quality Score") +
  theme_minimal()

Based on the observed versus predicted plot for the boosted trees model, most of the points are clustered around the red dashed line, which represents perfect predictions. This indicates that the model’s predictions generally align well with the observed quality scores. However, some points deviate from the line, suggesting that in certain cases the model either overpredicts or underpredicts the actual values. Overall, while the model captures the overall trend, the spread of points reveals that there is still some error in the predictions.

# Extract the fitted random forest model from the workflow
rf_fit <- extract_fit_parsnip(final_rf_model)

# Create a visually appealing variable importance plot
vip(rf_fit,
    num_features = 10,       # Show up to 10 features, though in this example we only have two
    geom = "col",            # Use a column chart
    horizontal = TRUE,       # Display horizontally
    aesthetics = list(fill = "skyblue", color = "blue")) +
  labs(title = "Random Forest Variable Importance",
       x = "Importance Score",
       y = "Predictor") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Based on the plot, the predictor “bundle_3hr” shows a higher importance score than “bundle_6hr,” indicating it plays a stronger role in predicting the quality score. Although both variables contribute to the model, the longer bar for “bundle_3hr” demonstrates that it is the more dominant driver in the predictions.
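
The same permutation importances can also be pulled out as a small table rather than a plot, which is convenient for reporting exact values; a sketch using vip's vi() on the rf_fit object extracted above:

# Tabulate the permutation importance scores underlying the plot (sketch)
rf_importance <- vi(rf_fit)
print(rf_importance)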

Compare Model Performance

# Combine the performance metrics from all three models for easy comparison.
model_metrics <- bind_rows(
  lin_reg = lin_reg_metrics,
  rf = rf_metrics,
  boost = boost_metrics,
  .id = "model"
)

print(model_metrics)
# A tibble: 9 × 7
  model   .metric .estimator  mean     n std_err .config             
  <chr>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 lin_reg mae     standard   0.577     5  0.0547 Preprocessor1_Model1
2 lin_reg rmse    standard   0.724     5  0.0393 Preprocessor1_Model1
3 lin_reg rsq     standard   0.629     5  0.112  Preprocessor1_Model1
4 rf      mae     standard   0.532     5  0.0684 Preprocessor1_Model1
5 rf      rmse    standard   0.669     5  0.0958 Preprocessor1_Model1
6 rf      rsq     standard   0.614     5  0.126  Preprocessor1_Model1
7 boost   mae     standard   0.690     5  0.0370 Preprocessor1_Model1
8 boost   rmse    standard   0.815     5  0.0388 Preprocessor1_Model1
9 boost   rsq     standard   0.540     5  0.121  Preprocessor1_Model1

I compared three different models using 5-fold CV. The linear regression model yielded an average MAE of 0.577, RMSE of 0.724, and an R² of 0.629. The random forest model performed a bit better on the error metrics, with an MAE of 0.532, RMSE of 0.669, and an R² of 0.614. The boosted trees model had the highest errors (an MAE of 0.690 and an RMSE of 0.815) and a lower R² of 0.540, which indicates it explains less of the outcome’s variance. Overall, while linear regression and random forest show comparable explanatory power, the random forest model stands out for having lower prediction errors, making it the best-performing option among the three.
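
Rather than re-typing these values by hand later (a common source of transcription errors), the comparison can be reshaped directly from model_metrics; a minimal sketch with tidyr's pivot_wider():

# Build a wide comparison table straight from the collected CV metrics (sketch)
model_comparison <- model_metrics %>%
  select(model, .metric, mean) %>%
  pivot_wider(names_from = .metric, values_from = mean)
print(model_comparison)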

Dot-and-whisker plot: Comparison of Model Performance Metrics

library(ggplot2)

ggplot(model_metrics, aes(x = model, y = mean, color = model)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err), width = 0.2) +
  facet_wrap(~ .metric, scales = "free_y") +
  labs(title = "Comparison of Model Performance Metrics",
       x = "Model",
       y = "Mean Value") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Part 4: Overall best

I decided to go with the random forest model overall. Although linear regression was competitive in terms of explanatory power, the random forest model had slightly lower MAE (0.532 vs. 0.577) and RMSE (0.669 vs. 0.724), which indicates more accurate predictions on average. Moreover, random forests excel at capturing potential non-linear relationships between septic bundle adherence and sepsis care quality—an important aspect when dealing with complex clinical data. Additionally, random forests offer useful insights through variable importance measures, allowing me to better understand which predictors drive the quality score. Overall, I believe this model best balances performance, flexibility, and interpretability for addressing the research question.

Part 5: Final Test Data Evaluation: Performance, Residuals, and Uncertainty

Final Random Forest Model Predictions on Test Data

# Generate predictions on the test data using the final random forest model and bind them to test_data
final_rf_test_preds <- predict(final_rf_model, test_data) %>%
  bind_cols(test_data) %>%
  mutate(quality_score = as.numeric(quality_score),
         .pred = as.numeric(.pred))

# Now, evaluate performance on the test data
test_metrics <- final_rf_test_preds %>%
  metrics(truth = quality_score, estimate = .pred)

print(test_metrics)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.879
2 rsq     standard       0.105
3 mae     standard       0.742

The final evaluation of the random forest model on the test data reveals an RMSE of 0.879, meaning that, on average, the model’s predictions deviate from the actual values by about 0.88 units. The MAE is 0.742, indicating that the typical absolute error in the predictions is roughly 0.74 units. However, the model’s R² value is only 0.105, which means that the model explains just over 10% of the variability in the test data. Overall, while the error metrics suggest a moderate level of prediction error, the low R² value highlights that the model has limited predictive power and significant room for improvement.

Calculate and Append Residuals

# Compute the residuals and add them as a new column
final_rf_test_preds <- final_rf_test_preds %>%
  mutate(residual = quality_score - .pred)

# Create a residual plot: Predicted vs Residuals
ggplot(final_rf_test_preds, aes(x = .pred, y = residual)) +
  geom_point(color = "darkgreen", alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Predicted (Test Data)",
       x = "Predicted Quality Score",
       y = "Residuals") +
  theme_minimal()

The residual plot puts these errors in context. With a test-set RMSE of 0.879, an MAE of 0.742, and an R² of 0.105, the model’s predictions are off by about 0.88 standardized units on average, with a typical absolute error of about 0.74 units, and only roughly 10% of the variance in the test data is explained. Overall, while the error values suggest some level of accuracy, the low R² reveals considerable uncertainty in the model’s performance, and substantial improvements are needed to better capture the underlying patterns in the data.
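
One way to quantify that uncertainty is to bootstrap the 15 test-set predictions and recompute the metrics on each resample. This is a minimal sketch using rsample's bootstraps() on the final_rf_test_preds object from above; with only 15 test observations, the resulting intervals should be expected to be very wide.

# Bootstrap the test-set predictions to get rough uncertainty bounds on the metrics (sketch)
set.seed(123)
boot_metric_ci <- bootstraps(final_rf_test_preds, times = 200) %>%
  mutate(metrics = map(splits, ~ analysis(.x) %>%
                         metrics(truth = quality_score, estimate = .pred))) %>%
  unnest(metrics) %>%
  group_by(.metric) %>%
  summarize(lower_95 = quantile(.estimate, 0.025, na.rm = TRUE),
            upper_95 = quantile(.estimate, 0.975, na.rm = TRUE),
            .groups = "drop")
print(boot_metric_ci)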

Part 6: Discussion and Summary of Key Findings

In this analysis, I addressed the research question of whether states with higher adherence to septic shock care protocols exhibit better sepsis care quality, as measured by a synthetic quality score. To explore this, I generated a synthetic dataset representing 50 U.S. states with two predictors—counts for the “Septic Shock 3-Hour Bundle” and “Septic Shock 6-Hour Bundle”—and a synthetic outcome, the quality score, created by combining these counts with some random noise.

After cleaning the data and splitting it into training and testing sets, I built three models—linear regression, random forest, and boosted trees—using 5-fold cross-validation.

The cross-validation results showed that the linear regression and random forest models performed similarly, with slightly better error metrics for the random forest model, while the boosted trees model exhibited higher errors and a lower R². I then selected the random forest model for further evaluation. However, when I evaluated it on the test data, the model produced an RMSE of 0.879, an MAE of 0.742, and an R² of only 0.105, indicating that it explains just over 10% of the variability in the outcome. The residual analysis confirmed that there were significant deviations between predicted and observed values, suggesting considerable uncertainty. Overall, while the models indicate some relationship between septic bundle adherence and sepsis care quality, the low R² on the test set reveals that the current predictors explain only a small portion of the outcome variability, underscoring the need for further refinement and potentially additional data to better capture this complex relationship.

Summary Table of Model Performance

# Create the summary table as a tibble
model_performance <- tribble(
  ~Model,         ~MAE,   ~RMSE,   ~RSQ,
  "Linear Reg.",    0.577,  0.724,   0.629,
  "Random Forest",  0.532,  0.669,   0.614,
  "Boosted Trees",  0.690,  0.815,   0.540
)

# Create a nicely formatted table using gt
model_performance %>%
  gt() %>%
  tab_header(
    title = "Summary Table of Model Performance",
    subtitle = "Cross-Validation Results on Training Data"
  ) %>%
  fmt_number(
    columns = c(MAE, RMSE, RSQ),
    decimals = 3
  ) %>%
  cols_label(
    MAE = "Mean Absolute Error",
    RMSE = "Root Mean Squared Error",
    RSQ = "R²"
  )
Summary Table of Model Performance
Cross-Validation Results on Training Data
Model           Mean Absolute Error   Root Mean Squared Error   R²
Linear Reg.     0.577                 0.724                     0.629
Random Forest   0.532                 0.669                     0.614
Boosted Trees   0.690                 0.815                     0.540