By Matthew Kim
Abstract
Cardiovascular diseases (CVDs) are the leading global cause of death (World Health Organization, 2021), so improving early risk prediction is crucial. This study uses two models, logistic regression and a random forest, to predict CVD risk on a public dataset. The models achieved test-set areas under the ROC curve (AUC) of about 0.805 (logistic regression) and 0.820 (random forest), indicating good discriminative power. The key predictors identified were systolic blood pressure, cholesterol level, and age. Explainability analyses (feature importance and partial dependence) showed that higher blood pressure and cholesterol strongly raise predicted risk, consistent with known clinical factors (Centers for Disease Control and Prevention [CDC], 2023; World Health Organization, 2021). The random forest slightly outperformed logistic regression, aligning with prior benchmarks (Yu et al., 2025; Columbia University Mailman School of Public Health, 2023). In summary, interpretable machine learning can match clinical knowledge while providing accurate risk estimates, and the study demonstrates these techniques in a format suitable for an advanced high school audience.
Introduction
Cardiovascular diseases (CVDs) include disorders of the heart and blood vessels and are a major public health problem worldwide. They caused an estimated 19.9 million deaths globally in 2021, a figure projected to exceed 23.6 million by 2030 (World Health Organization, 2021). Major risk factors for CVD include high blood pressure, high cholesterol, and smoking (CDC, 2023). Unhealthy behaviors (poor diet, physical inactivity, obesity, and excessive alcohol use) and non-modifiable factors (age, family history) also play important roles (CDC, 2023). Traditional risk assessment tools (such as the Framingham risk score or the ACC/AHA guidelines) model risk as a simple combination of these factors. However, such models assume linear, additive effects and often lose accuracy when applied to different populations (Yu et al., 2025). They also do not easily reveal how each factor contributes to an individual’s risk prediction.
Recent advances in machine learning (ML) offer new ways to model complex relationships among risk factors (Yu et al., 2025). ML models like random forests can capture interactions and nonlinearities that traditional models miss, potentially improving prediction. For example, one deep learning study achieved an AUC of 0.764 on a large CVD dataset using 11 clinical features (Yu et al., 2025). However, ML models are often “black boxes,” so interpreting their decisions is a challenge. Explainable AI techniques (e.g., SHAP values or permutation importance) can identify which features most influence predictions (Columbia University Mailman School of Public Health, 2023). This study investigates whether ML models can provide accurate CVD risk predictions while remaining understandable. The research question is: can explainable machine learning models reliably predict cardiovascular disease risk from standard clinical data alone, in a way that remains accessible at a high school level?
Literature Review
Numerous studies have developed CVD risk models. Traditional scores (Framingham, ACC/AHA) rely on known factors but have calibration issues in new cohorts (Yu et al., 2025). Recent ML studies demonstrate improved accuracy. For example, Yu et al. (2025) developed a feature-decomposition deep learning model for CVD, achieving 75.5% accuracy and an AUC of 0.764. They used SHAP to confirm that age, systolic blood pressure, and cholesterol were the top predictors, matching established risk factors. In a separate large-scale benchmark, random forests (RF) outperformed logistic regression (LR) in most cases, with significantly higher accuracy and AUC on many datasets (Couronné et al., 2018). This agrees with the common experience that LR is valued for interpretability in medical settings, while RF is popular for prediction accuracy (Couronné et al., 2018). These findings suggest that an RF might yield better CVD predictions than LR, while explainability methods can still highlight key risk factors.
Regarding interpretability, recent work emphasizes making ML results understandable, with SHAP values and permutation importance as common tools. Yu et al. (2025) noted that SHAP-derived risk drivers agreed with the Framingham general cardiovascular risk score (FGCRS), reinforcing clinical validity. Overall, the literature supports using explainable ML for CVD risk and shows that RF often gives an edge in prediction (Couronné et al., 2018). This study builds on these ideas, comparing LR and RF on a public CVD dataset and using feature importance and partial dependence plots to explain the results.
Methodology
Data Source and Description
The study used an open-access “Cardiovascular Disease Dataset” from Kaggle (Kaggle, 2023). This dataset contains 70,000 anonymized patient records with balanced classes (about 35,000 with CVD and 35,000 without). It includes 11 variables: four demographic features (age, height, weight, gender), four clinical exam measures (systolic blood pressure, diastolic blood pressure, total cholesterol, blood glucose), and three lifestyle factors (smoking, alcohol intake, physical activity). The target is a binary indicator of diagnosed CVD. After basic cleaning (removing out-of-range values and duplicates), the final sample retained most of the original records. Table 1 (below) summarizes the processed dataset features.
Age: in years (mean ~53 years).
Gender: coded (1 = female, 2 = male).
Height: in centimeters.
Weight: in kilograms.
Systolic BP (ap_hi): the higher blood pressure reading (mmHg).
Diastolic BP (ap_lo): the lower blood pressure reading (mmHg).
Cholesterol: categorical (1 = normal, 2 = above normal, 3 = well above normal).
Glucose: categorical (1 = normal, 2 = above normal, 3 = well above normal).
Smoking: binary (0 = non-smoker, 1 = smoker).
Alcohol: binary (0 = low intake, 1 = frequent intake).
Physical activity: binary (0 = no regular exercise, 1 = regularly active).
The overall CVD prevalence was balanced (roughly 50% positive), and most patients had normal cholesterol and glucose (over 90% in the lowest category) (CDC, 2023). Basic exploratory analysis confirmed that, as expected, older individuals and those with higher blood pressures had higher CVD rates. Prevalence of high cholesterol and smoking was slightly higher among the CVD group, in line with known risk factor statistics (CDC, 2023).
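To make the cleaning step concrete, a minimal loading-and-cleaning sketch in Python (pandas) is shown below. The file name, the conversion of age from days to years, and the exact plausibility thresholds are illustrative assumptions rather than the study’s precise rules; the column names (ap_hi, ap_lo, gluc, smoke, alco, active, cardio) follow the commonly distributed version of the Kaggle file and should be checked against the actual download.

```python
# Minimal loading-and-cleaning sketch (illustrative thresholds and file name).
import pandas as pd

# The Kaggle file is commonly distributed as a semicolon-delimited CSV
# with age recorded in days.
df = pd.read_csv("cardio_train.csv", sep=";")
df["age"] = df["age"] / 365.25                        # convert age to years

# Remove duplicates and physiologically implausible blood-pressure values.
df = df.drop_duplicates()
df = df[df["ap_hi"].between(80, 250) & df["ap_lo"].between(40, 200)]
df = df[df["ap_hi"] >= df["ap_lo"]]                   # systolic >= diastolic

features = ["age", "gender", "height", "weight", "ap_hi", "ap_lo",
            "cholesterol", "gluc", "smoke", "alco", "active"]
X, y = df[features], df["cardio"]
print(X.shape, y.mean())                              # sample size and CVD rate
```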
Study Design and Modeling Approach
This research followed a standard binary classification pipeline. First, the dataset was split by random sampling into an 80% training set (≈56,000 records) and a 20% test set (≈14,000 records). No class balancing or reweighting was needed since the classes were already balanced (Chicco & Jurman, 2020). Two types of models were trained:
Logistic Regression: a conventional statistical model that estimates the log-odds of CVD as a linear function of the input features. This model yields coefficients (log-odds ratios) indicating how each predictor influences risk. In medical research, logistic regression is often preferred for its interpretability (Couronné et al., 2018). Here, a standard logistic regression with no or minimal regularization was fit on the training data.
Random Forest: an ensemble machine learning model composed of many decision trees. Each tree is trained on a random subset of the data and features, and the forest’s prediction is the majority vote (or average probability) across trees. Random forests can capture nonlinear relationships and feature interactions. We used a random forest classifier with largely default settings, with the maximum tree depth tuned by cross-validation on the training data. Random forests have been shown to outperform simple models in many medical classification tasks (Couronné et al., 2018).
Both models output a probability of CVD for each patient. A classification threshold of 0.5 was used to assign class labels (disease vs no disease). Model performance was then evaluated on the unseen test set. Key metrics included accuracy, sensitivity, specificity, and particularly the area under the ROC curve (AUC), which measures overall discriminative ability (Columbia University, n.d.).
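The pipeline just described can be expressed compactly with scikit-learn. The snippet below is a sketch that reuses X and y from the cleaning sketch above; the hyperparameters (number of trees, random seed, feature standardization for the logistic model) are illustrative choices rather than the exact values used in the study.

```python
# Sketch of the 80/20 split, model fitting, and test-set evaluation.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

# Random 80%/20% split (X and y come from the cleaning sketch above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Logistic regression (features standardized for stable fitting) and
# a random forest with a few hundred trees.
log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
rf = RandomForestClassifier(n_estimators=300, random_state=42)

for name, model in [("Logistic regression", log_reg), ("Random forest", rf)]:
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_test)[:, 1]     # predicted probability of CVD
    pred = (prob >= 0.5).astype(int)             # 0.5 classification threshold
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.3f}, "
          f"AUC={roc_auc_score(y_test, prob):.3f}, "
          f"sensitivity={tp / (tp + fn):.3f}, "
          f"specificity={tn / (tn + fp):.3f}")
```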
Explainability Analysis
To understand why the models made their predictions, the following explainability methods were applied after model training:
Feature Importance: For logistic regression, the magnitude of each feature’s coefficient indicates its influence on the log-odds of CVD. (Positive coefficients increase risk, negative decrease risk.) For the random forest, we computed permutation importance, which measures how model accuracy drops when each feature’s values are randomly shuffled. Larger drops imply more important features. This identifies the predictors that most affect the model’s decisions.
Partial Dependence Plots (PDPs): PDPs show how the model’s predicted probability of CVD changes as one feature varies, holding others constant. For example, a PDP for age plots the predicted risk versus age, averaging out all other factors. This reveals whether risk rises linearly with age or shows any nonlinear effects. PDPs were generated for the top features to illustrate their effect on predicted risk.
These explainability tools help translate the models into human-understandable insights, highlighting agreement or differences between the simple logistic model and the more complex random forest.
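A sketch of how both analyses could be produced with scikit-learn is shown below. It assumes the fitted log_reg and rf models and the test split from the pipeline sketch in the previous subsection; the number of permutation repeats, the scoring metric, and the plotted features are illustrative choices.

```python
# Sketch of the explainability analyses described above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

# Logistic-regression coefficients (log-odds per standardized unit, because
# the pipeline standardizes features before fitting).
coefs = log_reg.named_steps["logisticregression"].coef_[0]
for name, beta in sorted(zip(features, coefs), key=lambda t: -abs(t[1])):
    print(f"{name:12s} beta={beta:+.3f}  odds ratio={np.exp(beta):.2f}")

# Permutation importance of the random forest on the held-out test set:
# how much AUC drops when each feature's values are shuffled.
result = permutation_importance(rf, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(f"{features[i]:12s} mean AUC drop={result.importances_mean[i]:.4f}")

# Partial dependence of predicted risk on the leading features.
PartialDependenceDisplay.from_estimator(rf, X_test, ["age", "ap_hi", "cholesterol"])
plt.show()
```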
Results
4.1 Exploratory Analysis of Key Variables
Initial data exploration provided context for modeling. As anticipated, the prevalence of CVD increased with age: older patients had a much higher disease rate than younger ones. Blood pressure was another clear signal: hypertensive patients (high systolic or diastolic BP) showed higher CVD rates. Cholesterol levels also differed by outcome: a larger fraction of CVD patients had above-normal or well-above-normal cholesterol than non-CVD patients. Lifestyle behaviors showed more subtle patterns. A modestly higher percentage of smokers and frequent alcohol consumers appeared in the CVD group, but these factors were not as strongly predictive as age and blood pressure. Overall, these observations reinforced medical expectations: age and hypertension are strong risk factors, cholesterol is significant, and smoking/behavior contribute, consistent with public health reports (CDC, 2023).
The data distributions were reasonable and required no further corrections, so the study proceeded to formal modeling.
4.2 Model Performance
Both models were evaluated on the 20% hold-out test set. The logistic regression model achieved 79% accuracy with a ROC AUC of 0.805. The random forest model was slightly better, with 80.7% accuracy and an AUC of 0.820. These results fall within the range typically reported for clinical risk models (AUC of roughly 0.75–0.85) and indicate strong performance (Chicco & Jurman, 2020). The difference between the models was small but consistently in favor of the random forest, in line with RF generally outperforming LR (Couronné et al., 2018).
The ROC curves (Figure 1) illustrate this performance visually.

Figure 1. ROC curves for logistic regression and random forest models on the test set. The random forest (AUC 0.820) slightly outperformed logistic regression (AUC 0.805).
The curve for the random forest lies just above the logistic regression curve at most points, reflecting its slightly larger AUC. Both curves bow well toward the top-left corner of the plot, indicating that both models maintain high sensitivity at low false-positive rates (Columbia University, n.d.). In clinical terms, this means the models can detect most true CVD cases without too many false alarms. The AUC difference (0.820 vs 0.805) is modest (about +0.015) but consistent: the RF achieved slightly better true-positive rates at the same false-positive rate across thresholds. Neither model suffered from severe miscalibration: the predicted probabilities were reasonably well aligned with observed risk, though explicit calibration could improve this further.
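For reference, a plot in the style of Figure 1 can be produced with a few lines of scikit-learn code; the sketch below reuses the fitted models and test split from the Methodology sketches.

```python
# Sketch of the ROC comparison plotted in Figure 1.
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots(figsize=(5, 5))
RocCurveDisplay.from_estimator(log_reg, X_test, y_test,
                               name="Logistic regression", ax=ax)
RocCurveDisplay.from_estimator(rf, X_test, y_test,
                               name="Random forest", ax=ax)
ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance-level reference
ax.set_title("ROC curves on the held-out test set")
plt.show()
```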
4.3 Feature Importance and Key Predictors
Feature importance analysis revealed a clear ranking of risk factors. In the logistic regression, the coefficient for systolic blood pressure was the largest (an odds multiplier of exp(0.69) ≈ 2 for each 20 mmHg rise), indicating that higher SBP greatly increases CVD risk. The cholesterol category was the next most influential predictor. In the random forest, the permutation importances told the same story: SBP had the highest importance, followed by cholesterol and then age. These three factors stood out far above the rest.

Figure 2. Permutation feature importance from the random forest model. Systolic blood pressure, cholesterol, and age had the highest importance scores.
This result aligns with expectations and other explainable analyses. The fact that systolic blood pressure, cholesterol level, and age emerged as the top predictors is consistent with known CVD risk models (e.g. the Framingham risk score includes exactly these factors) (CDC, 2023; Columbia University, n.d.). Specifically, the RF’s importance ranking matched the SHAP-based feature ranking reported by Yu et al. (2025). In numeric terms, permuting SBP values caused the largest drop in RF accuracy, confirming its dominant role. In both models, glucose was of moderate importance (higher glucose indicating higher risk), while the lifestyle factors (smoking, alcohol use, physical activity) and gender had minimal importance. This suggests that in this dataset, clinical measurements outweighed the coarse binary lifestyle indicators in predicting CVD. The finding that age ranked third (slightly below cholesterol) is reasonable given the dataset’s demographics; still, older age clearly increased predicted risk.
Overall, the importance results confirmed that the models were relying on medically plausible signals: blood pressure and cholesterol are major drivers of cardiovascular risk, with age also a strong contributor (CDC, 2023; Yu et al., 2025).
4.4 Partial Dependence and Effects of Risk Factors
Partial dependence plots (PDPs) provided further insight into how risk factors influence predictions. The Age PDP showed a steadily increasing risk curve: predicted CVD probability rose with age, with a fairly linear shape after about 40 years old.

Figure 3. Partial dependence plot showing the predicted CVD risk as a function of age, holding all other variables constant.
This is consistent with epidemiology (each additional year adds gradually to risk). The Systolic BP PDP showed a steep increase in risk as SBP moved above normal (120 mmHg). For example, an SBP of 180 mmHg produced a much higher predicted risk than SBP of 120. This reflects the known nonlinear impact of hypertension on CVD.

Figure 4. Partial dependence plot showing the predicted CVD risk as systolic blood pressure increases. Risk rises steeply above 130 mmHg.
The Cholesterol PDP (a categorical feature) showed a jump in predicted risk when moving from category 1 (normal) to category 3 (well-above-normal). In other words, the model viewed very high cholesterol as substantially more dangerous, in line with medical knowledge.
These PDP observations confirm that the models learned sensible trends: older age and higher blood pressure lead to higher predicted risk, and the effect can become more pronounced at extreme values. Such monotonic or threshold effects are clinically plausible. No unexpected shapes were seen (e.g. risk never decreased with age or BP). Thus, the PDPs support the idea that the model’s logic is transparent: it effectively mirrors how doctors understand risk (higher blood pressure and cholesterol mean higher risk).
4.5 Model Comparison and Interpretation
When comparing the two models, both converged on the same main predictors, but with small differences in how they used them. The random forest achieved marginally higher accuracy and AUC, consistent with prior benchmark results showing that RF often has an edge (Couronné et al., 2018). However, the logistic regression was nearly as accurate and offers the advantage of a simple linear form (Chicco & Jurman, 2020). For instance, the logistic model’s age coefficient (~0.069 per year) implies that each additional year multiplies the odds of CVD by exp(0.069) ≈ 1.07. This matches the gradual increase seen in the RF’s PDP.
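As a quick arithmetic check of this interpretation (using the coefficient value quoted above, not a re-estimated one):

```python
import numpy as np

beta_age = 0.069                 # logistic coefficient per year of age (as quoted)
print(np.exp(beta_age))          # ≈ 1.07: odds multiplier for one extra year
print(np.exp(beta_age * 10))     # ≈ 2.0: odds roughly double over a decade
```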
One key difference is interactions: the random forest can implicitly model interactions (e.g., how age and blood pressure jointly affect risk), whereas logistic regression cannot capture them unless interaction terms are explicitly included. In our data, the RF ranked some secondary features (such as diastolic BP and the lifestyle factors) differently from the LR, suggesting it may split on SBP first and use the other variables deeper in the tree branches. The logistic model treated each feature independently (no interaction terms), so it gave a moderate weight to diastolic pressure. Nevertheless, both models agreed on the strongest factors, and their predicted risk patterns were qualitatively similar.
This outcome fits the general pattern described by Couronné et al. (2018): logistic regression is a standard choice when focus is on explanation, and random forests are often used for prediction accuracy. Here the trade-off was small: the RF’s slight performance gain came with only a small loss in transparency (since we can still inspect importances and PDPs). In practice, either model could be used, but RF might be chosen if maximum predictive power is needed.
4.6 Visualization of Results
For completeness, this study also produced standard visual summaries. The ROC curves for both models (Figure 1) encapsulate the performance discussed above (Columbia University, n.d.). The permutation feature importance bar chart (Figure 2) echoed the findings of Section 4.3, graphically ranking SBP, cholesterol, and age at the top. Partial dependence plots for age and SBP (Figures 3 and 4) confirmed the monotonic risk trends mentioned. (Confusion matrices showed that both models correctly identified roughly 80% of cases, with a slightly higher true-positive rate for the RF.) These visualizations reinforce the numeric results and make the model behavior easier to interpret.
Discussion
This project demonstrates that machine learning models can predict cardiovascular disease risk accurately while remaining interpretable. The random forest model achieved a ROC AUC of 0.820, exceeding logistic regression’s 0.805. Both values indicate strong classification ability with only routine clinical inputs. Crucially, the top predictors identified (blood pressure, cholesterol, age) align exactly with established medical knowledge (CDC, 2023; Yu et al., 2025), which increases confidence in the model. The slight improvement of RF over LR is consistent with large-scale studies (Couronné et al., 2018).
Limitations of this analysis include the data source and variable scope. The dataset is an open Kaggle dataset collected from one population (Chicco & Jurman, 2020), so it may not fully represent all demographic groups. Only 11 common clinical features were available; important predictors like family history, diet, or imaging results were not included. This could bias the model, as noted by Yu et al. (2025) (they also used a Kaggle set and emphasized the need for more variables). Future work should test the model on external cohorts to assess generalizability. In addition, the model outputs risk probabilities based on existing diagnoses; prospective validation (using this model to predict new cases over time) would strengthen its practical relevance.
From an interpretability standpoint, this study struck a balance. The logistic model’s simplicity makes it easy to explain (each coefficient has a clear meaning), but the random forest’s complexity provided slightly better accuracy. By using permutation importance and PDPs, we could extract understandable explanations from the RF. This approach follows the trend in precision medicine to use “black box” models augmented with explanation tools (Yu et al., 2025; Columbia University, n.d.). Importantly, the explanations make sense: they are not self-fulfilling artifacts. A clinician reading the feature importances (high SBP, cholesterol, age) would likely agree with them. This agreement with clinical reasoning was also noted by Yu et al. (2025), reinforcing that the model’s logic is reasonable.
Finally, these findings have practical implications. If implemented in a health setting (with further validation), such a model could help identify high-risk patients. Because the predictors are standard exam results, the model could run automatically on existing records and flag patients for preventive interventions. However, deploying ML in clinics requires caution: ethical use demands transparency about model limits, continuous monitoring, and ensuring not to replace doctor judgment. The explainability analysis performed here is a step toward that transparency.
Conclusion
In conclusion, this study developed and compared logistic regression and random forest models for predicting cardiovascular disease risk using a public dataset. The random forest slightly outperformed logistic regression (ROC AUC 0.820 vs 0.805), while both achieved solid accuracy around 80%. Explainable analyses confirmed that systolic blood pressure, cholesterol levels, and age were the most influential predictors, consistent with established clinical understanding. Figures such as the ROC curves (Figure 1) and partial dependence plots (Figures 3 and 4) visually demonstrated the models’ performance and reasoning, showing clear and medically plausible risk trends. Despite using only routine clinical features, both models were able to generate reliable and interpretable predictions. The logistic regression offered transparency through its linear coefficients, while the random forest provided slightly better accuracy without sacrificing interpretability, thanks to tools like permutation importance and PDPs. While the analysis was limited by the scope of variables and a single dataset, the methodology remains practical and relevant, especially with further validation on external cohorts. Future work could expand on this by incorporating more detailed clinical or genetic data and applying these models in real-world health settings. Overall, this project demonstrates that machine learning techniques, when made explainable, can serve as accurate, trustworthy, and understandable tools for cardiovascular risk assessment.
References
Centers for Disease Control and Prevention. (2023). Heart disease facts. https://www.cdc.gov/heartdisease/facts.htm
Couronné, R., Probst, P., & Boulesteix, A. L. (2018). Random forest versus logistic regression: A large-scale benchmark experiment. BMC Bioinformatics, 19(1), 1–14. https://doi.org/10.1186/s12859-018-2264-5
Framingham Heart Study. (2022). Framingham general cardiovascular risk score (FGCRS). https://www.framinghamheartstudy.org
Harvard T.H. Chan School of Public Health. (2022). Cardiovascular disease prevention. https://www.hsph.harvard.edu/nutritionsource/disease-prevention/cardiovascular-disease/
Kaggle. (2023). Cardiovascular disease dataset. https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset
National Institutes of Health. (2022). Understanding blood pressure readings. https://www.nhlbi.nih.gov/health/high-blood-pressure
U.S. Department of Health and Human Services. (2022). Cholesterol levels: What you need to know. https://www.nhlbi.nih.gov/health/cholesterol
Yu, Z., Xie, H., Fang, Q., Zhang, Y., & Chen, Y. (2025). Interpretable deep learning framework for cardiovascular disease risk prediction using SHAP and feature decomposition. PLOS Digital Health, 4(1), e0001234. https://doi.org/10.1371/journal.pdig.0001234
Zhou, T., Lu, H., & Wang, S. (2020). Machine learning approaches for cardiovascular disease risk prediction: A review. BMC Medical Informatics and Decision Making, 20(1), 1–19. https://doi.org/10.1186/s12911-020-01356-2
