jojo142/EcommerceCustomerChurn

Customer Churn Prediction — Multi-Channel Marketing Analytics

Samiya Islam | Brandeis University, M.S. Business Analytics | May 2026
samiyanurislam.com · LinkedIn · GitHub


Overview

This project builds an end-to-end customer churn prediction pipeline on a 5,630-customer e-commerce dataset. It addresses three core questions: which customers are at risk of churning, which marketing signals predict churn, and how should retention spend be allocated across customer segments?

The work targets the multi-channel marketing optimization problem space: predicting and attributing customer outcomes across email, push notification, social advertising, and retargeting channels.


Dataset

A realistic synthetic e-commerce customer dataset (5,630 rows, 26 features) modeled on the Kaggle E-Commerce Customer Churn benchmark. Key feature groups:

| Category | Features |
| --- | --- |
| Behavioral | Tenure, OrderCount, DaySinceLastOrder, CouponUsed, CashbackAmount |
| Satisfaction | SatisfactionScore, Complain |
| Demographics | Gender, MaritalStatus, CityTier, NumberOfAddress |
| Marketing channels | EmailOpensLast30Days, EmailClicksLast30Days, PushNotifClicked, SocialAdClicked, RetargetingExposed, AcquisitionChannel |
| Engineered | EmailEngagementRate, MultiChannelEngagement |

Churn rate: ~12.3% (class imbalance handled via class_weight='balanced' + PR-curve threshold tuning)
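
For context, scikit-learn's `class_weight='balanced'` heuristic gives each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that arithmetic. The count of 692 churners is an assumption inferred from the ~12.3% rate, not a number read from the dataset, and `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(class_counts):
    """Replicate the 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# ASSUMPTION: ~12.3% of 5,630 customers is about 692 churners (illustrative only).
counts = {0: 5630 - 692, 1: 692}
weights = balanced_class_weights(counts)
print({label: round(weight, 3) for label, weight in weights.items()})
```

Under these assumed counts, each churner carries roughly seven times the weight of a retained customer, so both classes contribute equally to the training loss.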


Methodology

1. Exploratory Data Analysis

  • Churn rate by acquisition channel (Paid Search and Social show the highest churn)
  • Distribution comparisons across marketing and behavioral features
  • Correlation analysis of engagement signals vs. churn

2. Preprocessing

  • Label encoding for 6 categorical features
  • Feature engineering: EmailEngagementRate (clicks/opens), MultiChannelEngagement (composite score)
  • 80/20 stratified train-test split
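
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's code. Column names follow the feature table, but the values are randomly generated, the "Email" and "Organic" channel labels are invented, and the `MultiChannelEngagement` formula (a count of channels with any engagement) is an assumption, since the README only describes it as a composite score.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the real CSV (values are made up).
rng = np.random.default_rng(0)
n = 200
opens = rng.integers(0, 15, size=n)
df = pd.DataFrame({
    "EmailOpensLast30Days": opens,
    "EmailClicksLast30Days": rng.binomial(opens, 0.3),  # clicks never exceed opens
    "PushNotifClicked": rng.integers(0, 2, size=n),
    "SocialAdClicked": rng.integers(0, 2, size=n),
    "RetargetingExposed": rng.integers(0, 2, size=n),
    "AcquisitionChannel": rng.choice(
        ["Paid Search", "Social", "Email", "Organic"], size=n
    ),
    "Churn": np.repeat([1, 0], [25, 175]),  # 12.5% positives, near the reported rate
})

# Label-encode a categorical column (the README encodes six; one is shown here).
df["AcquisitionChannel"] = LabelEncoder().fit_transform(df["AcquisitionChannel"])

# EmailEngagementRate = clicks / opens, defined as 0 when nothing was opened.
safe_opens = df["EmailOpensLast30Days"].replace(0, np.nan)
df["EmailEngagementRate"] = (df["EmailClicksLast30Days"] / safe_opens).fillna(0.0)

# ASSUMPTION: composite score = number of channels with any engagement signal.
df["MultiChannelEngagement"] = (
    (df["EmailClicksLast30Days"] > 0).astype(int)
    + df["PushNotifClicked"]
    + df["SocialAdClicked"]
    + df["RetargetingExposed"]
)

# 80/20 split that preserves the churn rate in both partitions.
X, y = df.drop(columns="Churn"), df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```

Guarding the division matters because customers with zero opens would otherwise produce NaN or infinite rates that most estimators reject.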

3. Model Training (class-imbalance corrected)

| Model | AUC | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| Logistic Regression | 0.739 | 0.369 | 0.260 | 0.633 |
| Random Forest | 0.708 | 0.334 | 0.212 | 0.791 |
| CatBoost | 0.702 | 0.349 | 0.245 | 0.604 |
| XGBoost | 0.628 | 0.288 | 0.195 | 0.547 |

Thresholds tuned via Precision-Recall curve (not defaulted to 0.5).
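
One common way to tune a threshold from the Precision-Recall curve is to pick the point that maximizes F1. The README does not state which criterion was used, so the F1 objective here is an assumption, and the data is a synthetic stand-in generated with `make_classification`, not the project's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 12% positives (not the project's data).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.88], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, scores)
# The final (precision=1, recall=0) point has no threshold, so drop it.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)
best_threshold = float(thresholds[np.argmax(f1)])

# Compare against the default 0.5 cutoff on the same held-out scores.
f1_default = f1_score(y_val, (scores >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (scores >= best_threshold).astype(int))
print(f"threshold={best_threshold:.3f}  F1@0.5={f1_default:.3f}  F1@tuned={f1_tuned:.3f}")
```

In practice the threshold would be chosen on a validation fold and then reported on a separate test set, so the tuned metric is not optimistically biased.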

4. Interpretability (SHAP)

SHAP TreeExplainer was applied to the CatBoost model. Top predictors by SHAP importance:

  1. Tenure -- newer customers are dramatically higher risk
  2. DaySinceLastOrder -- recency is a strong leading indicator
  3. SatisfactionScore -- low scores predict churn before cancellation
  4. EmailClicksLast30Days -- email engagement is protective
  5. Complain -- complaint history raises churn probability materially

5. Customer Segmentation (K-Means, k=4)

Four actionable customer personas identified, each with a distinct retention strategy (win-back, re-engagement, email nurture, loyalty program).
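
A segmentation of this kind might look like the sketch below. The features fed to K-Means are not listed in the README, so the four columns chosen here are assumptions, and the data is randomly generated for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; feature choice and value ranges are assumptions.
rng = np.random.default_rng(0)
n = 400
customers = pd.DataFrame({
    "Tenure": rng.integers(0, 61, size=n),
    "DaySinceLastOrder": rng.integers(0, 46, size=n),
    "SatisfactionScore": rng.integers(1, 6, size=n),
    "EmailClicksLast30Days": rng.integers(0, 10, size=n),
})

# Scale first so no single feature dominates the Euclidean distance.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=42),
)
customers["Segment"] = pipeline.fit_predict(customers)

# Per-segment means are the starting point for naming personas.
segment_summary = customers.groupby("Segment").mean().round(2)
segment_summary["Customers"] = customers["Segment"].value_counts().sort_index()
print(segment_summary)
```

A summary table like this is what would be inspected to map each cluster to a strategy such as win-back or loyalty program.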


Key Business Recommendations

  • Front-load retention spend on customers in their first 6 months (Tenure is the #1 churn driver)
  • Automate inactivity triggers: email at Day 14, push at Day 21 of no order activity
  • Complaint → retention flag: route complaining customers to priority handling
  • Email engagement is measurable and actionable: A/B test subject lines and optimize send times, since higher email click activity is associated with lower predicted churn
  • Deploy Logistic Regression for production batch scoring (highest AUC, interpretable coefficients, low maintenance)
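
The inactivity triggers and complaint flag above could be expressed as a simple rule function. This is a sketch under stated assumptions: it presumes a batch job that runs once per day (so each customer hits Day 14 and Day 21 exactly once), and the names `INACTIVITY_TRIGGERS`, `retention_actions`, and `priority_handling` are hypothetical.

```python
# Day thresholds come from the recommendations; all names are illustrative.
INACTIVITY_TRIGGERS = {14: "email", 21: "push"}


def retention_actions(days_since_last_order, has_complained):
    """Return the retention actions due today for one customer."""
    actions = []
    trigger = INACTIVITY_TRIGGERS.get(days_since_last_order)
    if trigger is not None:
        actions.append(trigger)
    if has_complained:
        # Complaint history routes the customer to priority handling.
        actions.append("priority_handling")
    return actions


for days, complained in [(3, False), (14, False), (21, True)]:
    print(days, complained, retention_actions(days, complained))
```

If the job cannot be guaranteed to run daily, the exact-day match would need to be replaced with a range check plus a record of which messages were already sent.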

Files

├── customer_churn_marketing_analytics.ipynb   # Main analysis notebook
├── ecommerce_churn.csv                        # Dataset
├── model_metrics.csv                          # Model comparison table
├── segment_summary.csv                        # Segment profiles
├── figures/
│   ├── 01_eda.png                             # Exploratory analysis
│   ├── 02_roc.png                             # ROC curves
│   ├── 03_model_comparison.png                # Performance comparison
│   ├── 04_shap.png                            # SHAP feature importance
│   └── 05_segments.png                        # Customer segment profiles
└── README.md

Tech Stack

Python · scikit-learn (including KMeans) · CatBoost · XGBoost · SHAP · pandas · matplotlib · seaborn


About This Project

Built to demonstrate marketing data science capabilities in multi-channel customer analytics: clustering and segmentation, boosted-tree modeling, cross-channel outcome attribution, and communicating complex model outputs to non-technical stakeholders.
