Samiya Islam | Brandeis University, M.S. Business Analytics | May 2026
samiyanurislam.com · LinkedIn · GitHub
This project builds an end-to-end customer churn prediction pipeline on a 5,630-customer e-commerce dataset. The core question: which customers are at risk of churning, which marketing signals predict it, and how should retention spend be allocated across customer segments?
This work addresses multi-channel marketing optimization: predicting and attributing customer outcomes across email, push notification, social advertising, and retargeting channels.
A realistic synthetic e-commerce customer dataset (5,630 rows, 26 features) modeled on the Kaggle E-Commerce Customer Churn benchmark. Key feature groups:
| Category | Features |
|---|---|
| Behavioral | Tenure, OrderCount, DaySinceLastOrder, CouponUsed, CashbackAmount |
| Satisfaction | SatisfactionScore, Complain |
| Demographics | Gender, MaritalStatus, CityTier, NumberOfAddress |
| Marketing channels | EmailOpensLast30Days, EmailClicksLast30Days, PushNotifClicked, SocialAdClicked, RetargetingExposed, AcquisitionChannel |
| Engineered | EmailEngagementRate, MultiChannelEngagement |
Churn rate: ~12.3% (class imbalance handled via class_weight='balanced' + PR-curve threshold tuning)
- Churn rate by acquisition channel (Paid Search and Social highest risk)
- Distribution comparisons across marketing and behavioral features
- Correlation analysis of engagement signals vs. churn
- Label encoding for 6 categorical features
- Feature engineering: EmailEngagementRate (clicks/opens), MultiChannelEngagement (composite score)
- 80/20 stratified train-test split
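
The feature engineering and split steps above could look roughly like the sketch below. It runs on synthetic stand-in data, not the project dataset. Two choices are assumptions rather than details from the notebook: customers with zero opens get an engagement rate of 0, and MultiChannelEngagement is taken to be a simple count of channels with any engagement (the README only calls it a composite score).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; only the column names come from the feature table.
rng = np.random.default_rng(0)
n = 200
opens = rng.integers(0, 15, n)
df = pd.DataFrame({
    "EmailOpensLast30Days": opens,
    "EmailClicksLast30Days": rng.integers(0, opens + 1),  # clicks never exceed opens
    "PushNotifClicked": rng.integers(0, 2, n),
    "SocialAdClicked": rng.integers(0, 2, n),
    "RetargetingExposed": rng.integers(0, 2, n),
    "Churn": np.r_[np.ones(25, dtype=int), np.zeros(175, dtype=int)],  # 12.5% churn
})

# EmailEngagementRate = clicks / opens, with 0 for customers who opened nothing
# (the zero-opens handling is an assumption).
safe_opens = df["EmailOpensLast30Days"].replace(0, np.nan)
df["EmailEngagementRate"] = (df["EmailClicksLast30Days"] / safe_opens).fillna(0.0)

# MultiChannelEngagement: hypothetical composite, counting channels with any engagement.
df["MultiChannelEngagement"] = (
    (df["EmailClicksLast30Days"] > 0).astype(int)
    + df["PushNotifClicked"]
    + df["SocialAdClicked"]
    + df["RetargetingExposed"]
)

# 80/20 split, stratified so both sides keep the same churn rate.
X = df.drop(columns="Churn")
y = df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"train churn rate: {y_train.mean():.3f}, test churn rate: {y_test.mean():.3f}")
```

Stratifying matters here because, at a churn rate near 12%, an unstratified 20% sample can drift noticeably from the population rate.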
| Model | AUC | F1 | Precision | Recall |
|---|---|---|---|---|
| Logistic Regression | 0.739 | 0.369 | 0.260 | 0.633 |
| Random Forest | 0.708 | 0.334 | 0.212 | 0.791 |
| CatBoost | 0.702 | 0.349 | 0.245 | 0.604 |
| XGBoost | 0.628 | 0.288 | 0.195 | 0.547 |
Thresholds tuned via Precision-Recall curve (not defaulted to 0.5).
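
A minimal sketch of PR-curve threshold tuning with a class-weighted model, on synthetic data from `make_classification`. The selection rule (maximize F1) and the use of a held-out validation split are assumptions; the README does not state which criterion or split the notebook used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 12% positives), standing in for the churn dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; drop the final point.
precision, recall, thresholds = precision_recall_curve(y_val, proba)
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)

# Pick the threshold with the highest F1 (an assumed criterion).
best_threshold = float(thresholds[np.argmax(f1)])

tuned_f1 = f1_score(y_val, (proba >= best_threshold).astype(int))
default_f1 = f1_score(y_val, (proba >= 0.5).astype(int))
print(f"threshold={best_threshold:.3f}  tuned F1={tuned_f1:.3f}  F1 at 0.5={default_f1:.3f}")
```

Choosing the threshold on a validation split rather than the test set keeps the reported test metrics from being optimistically biased.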
SHAP TreeExplainer applied to CatBoost model. Top predictors:
- Tenure -- newer customers are dramatically higher risk
- DaySinceLastOrder -- recency is a strong leading indicator
- SatisfactionScore -- low scores predict churn before cancellation
- EmailClicksLast30Days -- email engagement is protective
- Complain -- complaint history raises churn probability materially
Four actionable customer personas identified, each with a distinct retention strategy (win-back, re-engagement, email nurture, loyalty program).
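
One way segments like these could be derived is KMeans on standardized features, sketched below on synthetic data. The choice of k=4 mirrors the four personas, but the clustering features, the scaling step, and the mapping from clusters to named personas are all assumptions; the README does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the feature selection here is hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Tenure": rng.integers(0, 36, n),
    "DaySinceLastOrder": rng.integers(0, 45, n),
    "SatisfactionScore": rng.integers(1, 6, n),
    "EmailClicksLast30Days": rng.integers(0, 10, n),
})

# Standardize so no single feature dominates the Euclidean distance.
X = StandardScaler().fit_transform(df)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
df["Segment"] = kmeans.fit_predict(X)

# Per-segment feature means plus segment size, similar in spirit to a segment summary table.
summary = df.groupby("Segment").mean().round(2)
summary["Customers"] = df["Segment"].value_counts().sort_index()
print(summary)
```

Reading the per-segment means is what turns cluster IDs into personas: for example, a cluster with high DaySinceLastOrder would be a natural candidate for a win-back strategy.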
- Front-load retention spend on customers in their first 6 months (Tenure is the #1 churn driver)
- Automate inactivity triggers: email at Day 14, push at Day 21 of no order activity
- Complaint → retention flag: route complaining customers to priority handling
- Email engagement is measurable and actionable: A/B testing subject lines and optimizing send times are direct levers on an engagement signal the model associates with lower churn probability
- Deploy Logistic Regression for production batch scoring (highest AUC, interpretable coefficients, low maintenance)
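
The inactivity and complaint rules in the list above can be sketched as a small function. The function name, action labels, and the assumption that a daily batch job evaluates each customer once per day (so each trigger fires on the exact day) are hypothetical and not taken from the notebook.

```python
from typing import List

EMAIL_TRIGGER_DAY = 14  # email after 14 days with no order
PUSH_TRIGGER_DAY = 21   # push notification after 21 days with no order


def retention_actions(days_since_last_order: int, has_complained: bool) -> List[str]:
    """Return the retention actions due today for one customer.

    Assumes a daily batch run, so each inactivity trigger fires once,
    on the exact day the threshold is reached.
    """
    actions: List[str] = []
    if days_since_last_order == EMAIL_TRIGGER_DAY:
        actions.append("send_email")
    if days_since_last_order == PUSH_TRIGGER_DAY:
        actions.append("send_push")
    if has_complained:
        # Complaint -> retention flag: route to priority handling.
        actions.append("priority_handling")
    return actions


print(retention_actions(14, has_complained=False))
print(retention_actions(21, has_complained=True))
```

If the job might skip days, a more robust variant would compare against `>=` thresholds and track which messages have already been sent per customer.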
├── customer_churn_marketing_analytics.ipynb # Main analysis notebook
├── ecommerce_churn.csv # Dataset
├── model_metrics.csv # Model comparison table
├── segment_summary.csv # Segment profiles
├── figures/
│ ├── 01_eda.png # Exploratory analysis
│ ├── 02_roc.png # ROC curves
│ ├── 03_model_comparison.png # Performance comparison
│ ├── 04_shap.png # SHAP feature importance
│ └── 05_segments.png # Customer segment profiles
└── README.md
Python · scikit-learn · CatBoost · XGBoost · SHAP · pandas · matplotlib · seaborn · KMeans
Built to demonstrate marketing data science capabilities aligned with multi-channel customer analytics -- specifically: clustering/segmentation, boosted tree modeling, cross-channel outcome attribution, and communicating complex model outputs to non-technical stakeholders.