Projects

A selection of my work across data engineering, machine learning, and analytics.

California Early Warning System

Built a school-level Early Warning System (EWS) that identifies California public high schools at risk of low graduation outcomes, using only public, non-PII datasets aligned with the ABC framework (Attendance, Behavior, Course performance).

Education, Policy, Analytics, Public Data
View on GitHub →

Flight Delay & Cancellation Prediction (SAN / KSAN)

Predicts delays and cancellations for flights departing San Diego International Airport by integrating two years of BTS on-time performance data with NOAA weather observations (NCEI ISD) from KSAN.

Python, AWS, ML, Weather
View on GitHub →
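
As a rough illustration of the integration step, the sketch below attaches to each flight the most recent weather observation taken at or before its scheduled departure. The field names (`sched_dep`, `time`, `wind_kt`) and the sample rows are hypothetical and made up for the example; they are not the BTS or ISD schemas, and the project itself may join the data differently.

```python
from bisect import bisect_right
from datetime import datetime


def attach_weather(flights, observations):
    """Pair each flight with the latest observation at or before its departure.

    Field names are illustrative assumptions, not the real BTS/ISD columns.
    """
    obs = sorted(observations, key=lambda o: o["time"])
    times = [o["time"] for o in obs]
    joined = []
    for flight in flights:
        # Index of the last observation whose time is <= scheduled departure.
        i = bisect_right(times, flight["sched_dep"]) - 1
        weather = obs[i] if i >= 0 else None  # None when nothing precedes the flight
        joined.append({**flight, "weather": weather})
    return joined


# Made-up sample rows, for illustration only.
flights = [
    {"flight": "XX100", "sched_dep": datetime(2023, 1, 1, 8, 40)},
    {"flight": "XX200", "sched_dep": datetime(2023, 1, 1, 6, 15)},
]
observations = [
    {"time": datetime(2023, 1, 1, 7, 51), "wind_kt": 6},
    {"time": datetime(2023, 1, 1, 8, 51), "wind_kt": 12},
]
joined = attach_weather(flights, observations)
```

Looking only backward in time keeps weather recorded after departure out of the features, which would otherwise leak information the model would not have at prediction time.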

School Sentiment NLP

Analyzes how people talk about schools in high-performing vs. low-performing districts using sentiment analysis and topic modeling on Reddit discussions to compare themes and perceptions.

NLP, Topic Modeling, Sentiment, Python
View on GitHub →

Cervical Cancer Risk Prediction

Modeled cervical cancer risk using the Cervical Cancer (Risk Factors) dataset (858 records, 36 variables) with mixed binary/categorical/numerical predictors. Compared multiple models and selected the final model based on sensitivity and clinical relevance.

Python, ML, Healthcare, Classification
View on GitHub →

Seattle Airbnb ETL Pipeline

End-to-end ETL pipeline that integrates Seattle Airbnb listings, Seattle weather, and booking trends, using MySQL and Jupyter Notebook to support reporting and analysis.

Python, ETL, MySQL, Jupyter, BI
View on GitHub →

Bike-Sharing Demand Forecasting (Time Series)

Time series forecasting in R to model bike-sharing rental demand for operational planning (redistribution, staffing, maintenance). Includes cleaning, exploratory time series analysis, feature engineering, model building, and forecast evaluation.

R, Time Series, Forecasting, EDA
View on GitHub →

Bank Term Deposit Conversion Prediction

Predicts which customers will subscribe to a term deposit, to help target telemarketing efforts (Bank of Portugal dataset from Kaggle). Built Random Forest, Logistic Regression, and KNN models and applied SMOTE to address class imbalance. Selected Logistic Regression for its highest recall and balanced performance.

Python, Classification, SMOTE, Marketing
View on GitHub →
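
To make the recall-based comparison concrete, here is a minimal sketch of computing recall for several candidate models and picking the highest. The labels, predictions, and the names `model_a`, `model_b`, and `model_c` are placeholders invented for the example; they are not results from the project.

```python
def recall(y_true, y_pred):
    """Share of actual positives that the model predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Placeholder labels: 1 = subscribed, 0 = did not subscribe.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]

# Placeholder predictions from three hypothetical candidate models.
predictions = {
    "model_a": [1, 0, 0, 1, 0, 0, 0, 0],
    "model_b": [1, 1, 1, 1, 0, 1, 1, 0],
    "model_c": [1, 0, 1, 0, 0, 0, 1, 0],
}

scores = {name: recall(y_true, y_pred) for name, y_pred in predictions.items()}
best = max(scores, key=scores.get)  # candidate with the highest recall
```

Recall on its own rewards a model that predicts "subscribe" too freely (here `model_b` also flags two non-subscribers), which is why a recall-first choice is usually checked against overall balance as well.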