Predicting Heart Disease Using Logistic Regression Modeling

Download Poster: Predicting Heart Disease Using Diagnostic Tests

Overview: Worked individually to develop models to predict heart disease by assessing different variables using diagnostic testing. Real hospital data was obtained from two U.S. and two European hospitals. The primary outcome, presence of heart disease, was found to be significantly impacted by sex, type of chest pain, exercise induced angina, ST depression induced by exercise relative to rest, slope of the peak exercise ST segment, presence of the blood disorder, thalassemia, number of major vessels colored by fluoroscopy, and which hospital the patient stayed in. Presence of heart disease was assessed by number of narrowing blood vessels, and then dichotomized in the final model.

Data: Collected from two U.S. and two European hospitals.

Results: The final logostic regression model, using the above covariates, demonstrated excellent ability to predict heart disease, with an area under the receiver operating characteristic (ROC) curve of 0.94.

Software: SAS Version 9.4 was used for all statistical analyses in this project.


Predicting Survival on the Sinking R.M.S. Titanic

Download Report: Predicting Survival on the Sinking R.M.S. Titanic

Overview: Worked in group of two to train and test machine learning models to predict passenger survival on the infamous sinking cruise ship, Titanic. Real passenger data (age, sex, socio-economic status, number of siblings/spouse on board, number of parents/children on board, ticket number, fare, cabin number, and port of embarkation) was used for repeated cross-validation (CV) and receiver operating characteristics (ROCs) were compared to build an optimized model. The final black box model was then explained using feature explanation plots, to assess circumstances of individuals that led to survival on the sinking cruise ship.

Data: Collected from Kaggle’s “Titanic Machine Learning from Disaster” competition.

Results: An ADABoost machine learning model turned out to be the most optimal in predicting Titanic survival, reaching an accuracy of about 77%.

Software: RStudio was used in construction of the models.


Predicting the 2021 NCAA DI Men’s Basketball Tournament

Download Report: Predicting the Outcome of the 2021 NCAA DI Men’s Basketball Tournament

Overview: Worked individually to train machine learning models to predict the winner of the 2021 Men’s DI College Basketball Championship. Past years’ regular season and tournament data were used for CV and ROCs were compared to build an optimized model. Functions were then created to run a simulation of the tournament, using the optimized model.

Data: Collected from Kaggle’s “March Machine Learning Mania 2021 - NCAAM” competition.

Results: Using a logistic regression through 10-fold CV, Gonzaga will win the 2020-2021 DI Men’s Basketball Championship.

Software: RStudio was used in construction of the models and simulation.


Comparing COVID-19 to Past Pandemics

Link: Future of COVID-19: Looking at Past Pandemics

Overview: Worked in a group of 4 to research past pandemics, epidemics, and seasonal flu to develop a website with interactive, up-to-date plots and maps comparing them to COVID-19. New and cumulative case and death data was collected, wrangled, and analyzed to make adequate comparisons of COVID-19 to the seasonal flu (Influenza B), past pandemics (H1N1 and Spanish Flu), and the original coronavirus epidemic from 2003 (SARS-CoV-1).

Data: Collected from the Centers of Disease Control and Prevention (CDC), the World Health Organization (WHO), and Kaggle.

Results: Pandemic trends were studied to assess future implications of COVID-19, and the basic reproductive rate, \(R_0\), was found to be higher for COVID-19 than all of the other diseases studied.

Software: RStudio and GitHub were used in collaborative construction of the website.


Linear Model of Hate Crimes in the U.S. in 2016

Download Report: Creating a Linear Model for Factors that Impact Hate Crimes in the U.S. in 2016

Overview: Worked in a group of 5 to study the covariates most closely associated with hate crimes in the U.S. in 2016, construct a linear model to explain these factors, and then compile a publishable report.

Data: Collected from the Southern Poverty Law Center (SPLC), the Kaiser Family Foundation (KFF), and the U.S. Census Bureau.

Results: The findings suggested that Gini index of income inequality, percent of population with at least a high school education, and unemployment were the factors highest associated with hate crimes in the U.S. in 2016.

Software: RStudio and GitHub were used in collaborative construction of the linear model and report.