Customer Churn Prediction Using Machine Learning

This project aims to predict customer churn for a telecommunications company using machine learning techniques. By analyzing a customer dataset, we apply data preprocessing, exploratory data analysis (EDA), and machine learning models to predict whether a customer will churn.

📝 Project Overview

Customer churn prediction is a critical problem for businesses, as retaining customers is more cost-effective than acquiring new ones. This project leverages various machine learning algorithms such as Decision Trees, Random Forest, and XGBoost to predict customer churn based on demographic and account information.

The workflow covers:

  • Data Preprocessing: Handling missing values, encoding categorical variables, and addressing class imbalance.
  • Exploratory Data Analysis (EDA): Visualizing and analyzing patterns and correlations in the data.
  • Model Training: Implementing various machine learning algorithms and optimizing them for better performance.
  • Model Evaluation: Using accuracy, confusion matrix, and classification report to evaluate model performance.
  • Deployment: Saving the trained model and encoding tools, which can later be used for prediction.
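The Model Evaluation item above can be illustrated with a minimal sketch. The labels in `y_true` and `y_pred` are made up for the example and are not results from this project; only the three scikit-learn metric functions come from the workflow described here.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels for ten customers: 1 = Churn, 0 = No Churn.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_pred)

# Per-class precision, recall, and F1-score as a formatted text table.
report = classification_report(y_true, y_pred, target_names=["No Churn", "Churn"])

print(f"Accuracy: {accuracy:.2f}")
print(matrix)
print(report)
```

In this toy case the accuracy is 0.70, yet only two of the four churners are caught. That gap is why the confusion matrix and the per-class report are worth reading alongside accuracy on an imbalanced target.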

📊 Dataset

The dataset used is the Telco Customer Churn dataset (WA_Fn-UseC_-Telco-Customer-Churn.csv). Its columns include:

  • Demographics: gender, SeniorCitizen, Partner, Dependents
  • Account Info: tenure, MonthlyCharges, TotalCharges
  • Churn: Target variable indicating whether the customer has churned (Yes/No)

Dataset Source: The dataset is publicly available on Kaggle.
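A minimal loading-and-cleaning sketch follows. It assumes, as is the case in the commonly distributed copy of this CSV, that TotalCharges is read as text because a few rows hold blank strings. The small in-memory frame stands in for the real file, its values are invented, and filling missing totals with 0.0 is one possible policy rather than the notebook's confirmed choice.

```python
import pandas as pd

# Small stand-in for the real CSV; the values are invented for illustration.
df = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Male"],
        "SeniorCitizen": [0, 0, 1],
        "tenure": [1, 0, 34],
        "MonthlyCharges": [29.85, 52.55, 56.95],
        "TotalCharges": ["29.85", " ", "1889.5"],  # blank string = missing value
        "Churn": ["No", "No", "Yes"],
    }
)

# Convert to numbers; anything unparseable (such as " ") becomes NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
missing_rows = int(df["TotalCharges"].isna().sum())

# Assumed policy: treat a missing total as 0.0 (the customer has not been billed yet).
df["TotalCharges"] = df["TotalCharges"].fillna(0.0)

# Map the Yes/No target to integers so that models can consume it.
df["Churn"] = df["Churn"].map({"No": 0, "Yes": 1})

print(missing_rows)
print(df.dtypes)
```

With the real file, the frame would instead come from `pd.read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")`, using the path listed in the project structure.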

🛠️ Key Technologies

  • Data Preprocessing: pandas, numpy, SMOTE (imbalanced-learn), LabelEncoder (sklearn.preprocessing)
  • Data Visualization: matplotlib, seaborn
  • Machine Learning: scikit-learn (RandomForestClassifier, DecisionTreeClassifier), xgboost (XGBClassifier)
  • Model Persistence: pickle
  • Model Evaluation: Accuracy, Precision, Recall, F1-Score

🔑 Key Steps

  1. Data Loading: Loading the raw dataset into a Pandas DataFrame and inspecting its contents.
  2. Data Preprocessing: Cleaning and preparing the data, including handling missing values, encoding categorical features, and balancing the dataset using SMOTE.
  3. Exploratory Data Analysis (EDA): Visualizing distributions and relationships between variables to gain insights into the data.
  4. Model Training & Evaluation: Training various machine learning models, evaluating their performance using cross-validation, and selecting the best model (Random Forest).
  5. Model Deployment: Saving the trained model and the label encoders using pickle for future use.
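The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the notebook's code: the data is synthetic, the churn rule is invented, the SMOTE balancing step is omitted, and bundling the model and encoders in one dictionary is a hypothetical layout (the repository stores customer_churn_model.pkl and encoders.pkl as separate files). In-memory bytes replace files so the sketch leaves nothing on disk.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in data (an assumption); the real notebook uses the Telco CSV.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame(
    {
        "gender": rng.choice(["Female", "Male"], size=n),
        "Partner": rng.choice(["Yes", "No"], size=n),
        "tenure": rng.integers(0, 72, size=n),
        "MonthlyCharges": rng.uniform(20, 120, size=n).round(2),
    }
)
# Invented rule so the target carries signal: short tenure plus high charges.
df["Churn"] = ((df["tenure"] < 12) & (df["MonthlyCharges"] > 60)).astype(int)

# Fit one LabelEncoder per text column and keep it for prediction time.
categorical_columns = ["gender", "Partner"]
encoders = {}
for column in categorical_columns:
    encoder = LabelEncoder()
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder

X = df.drop(columns="Churn")
y = df["Churn"]

# Estimate performance with 5-fold cross-validation, then fit on all rows.
model = RandomForestClassifier(n_estimators=50, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
model.fit(X, y)

# Serialize and restore the model together with its encoders.
payload = pickle.dumps({"model": model, "encoders": encoders})
restored = pickle.loads(payload)

# Score a new customer: reuse the saved encoders, never refit them.
new_customer = pd.DataFrame(
    {"gender": ["Male"], "Partner": ["No"], "tenure": [3], "MonthlyCharges": [95.0]}
)
for column, encoder in restored["encoders"].items():
    new_customer[column] = encoder.transform(new_customer[column])
prediction = restored["model"].predict(new_customer)[0]

print(cv_scores.mean(), prediction)
```

Saving the encoders alongside the model matters because the integer codes assigned during training must be reproduced exactly when new records are scored.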

📂 Project Structure

/customer-churn-prediction
├── data/                # Dataset file(s)
│   └── WA_Fn-UseC_-Telco-Customer-Churn.csv
│
├── models/              # Saved models and encoders
│   ├── customer_churn_model.pkl
│   └── encoders.pkl
│
├── notebooks/           # Jupyter Notebook(s) with analysis and modeling
│   └── main.ipynb
│
└── README.md            # Project documentation

📦 How to Use

  1. Clone the repository:
    git clone <repository-url>
    cd customer-churn-prediction
    
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Run the Jupyter Notebook:
    jupyter notebook notebooks/main.ipynb
    

📊 Model Results

Two models, Random Forest and XGBoost, were trained on the customer churn dataset and asked to score the same input record. Their predictions differed, reflecting how differently the two models learn.

🔹 Random Forest Model:

  • Prediction: No Churn
  • Prediction Probability: [[0.83 0.17]] (83% likelihood of No Churn)

🔹 XGBoost Model:

  • Prediction: Churn
  • Prediction Probability: [[0.35321957 0.64678043]] (64.7% likelihood of Churn)

Given the same input, the two models reached different conclusions because they learn and make predictions in different ways.
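The relationship between the probability arrays and the reported labels can be made explicit with a short sketch. It assumes the class order is [No Churn, Churn], which matches the interpretation given above; the helper `to_label` is a hypothetical name introduced only for this example.

```python
import numpy as np

# Assumed class order, matching the interpretation of the arrays above.
class_names = ["No Churn", "Churn"]

# Probabilities reported above for the same input record.
rf_proba = np.array([[0.83, 0.17]])
xgb_proba = np.array([[0.35321957, 0.64678043]])


def to_label(proba):
    """Return the class name with the highest probability for a single row."""
    return class_names[int(np.argmax(proba, axis=1)[0])]


rf_label = to_label(rf_proba)
xgb_label = to_label(xgb_proba)

# Difference between the two models' churn probabilities for this record.
churn_gap = xgb_proba[0, 1] - rf_proba[0, 1]

print(rf_label, xgb_label, round(churn_gap, 3))
```

The two models disagree by roughly 0.48 in churn probability for this record, so the differing labels are not a borderline effect of the 0.5 cut-off.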


🧐 Conclusion

🔑 Key Insights:

  • Different Models, Different Results:

    • Random Forest averages over many decision trees, which makes it robust to noise but can make its predictions conservative.
    • XGBoost is a gradient boosting model in which each new tree focuses on the errors of the trees before it. This makes it more sensitive to specific patterns and often gives it an edge on complex datasets.
  • Why the Difference?:

    • Random Forest trains its trees independently and combines them by voting; scikit-learn's implementation averages the trees' predicted class probabilities rather than counting hard votes.
    • XGBoost, on the other hand, builds trees sequentially, with each tree aiming to correct the previous ones, which can lead to more aggressive predictions, especially for minority classes (e.g., Churn).
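The Random Forest half of this comparison can be checked directly. The sketch below uses synthetic data from `make_classification` (an assumption, not the Telco dataset) and shows that, in scikit-learn, the forest's predicted probabilities equal the mean of the probabilities predicted by its individual trees. The sequential boosting side is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced data (an assumption): roughly 25% positive class.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes).
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average across trees and compare with the forest's own output.
manual_average = per_tree.mean(axis=0)
forest_proba = forest.predict_proba(X)
matches = np.allclose(manual_average, forest_proba)

print(per_tree.shape, matches)
```

Because every tree contributes equally to the average, one tree that strongly signals churn is diluted by the others, which is consistent with the more conservative probabilities seen from Random Forest.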

📉 Prediction Differences:

  • For this input, Random Forest predicted No Churn (17% churn probability), while XGBoost predicted Churn (64.7% churn probability). This suggests that XGBoost may be more sensitive to patterns associated with Churn, whereas Random Forest may be more conservative. A single prediction cannot show which model is more accurate overall.

🚀 Future Steps

  1. Hyperparameter Tuning 🔧:

    • Implement GridSearchCV or RandomizedSearchCV to find the best set of hyperparameters for both models (Random Forest and XGBoost). Hyperparameters like max_depth, n_estimators, learning_rate, and subsample can significantly impact model performance.
  2. Model Selection 🔍:

    • Test additional models such as Logistic Regression, Support Vector Machines, or K-Nearest Neighbors. Evaluate each model's performance using metrics like accuracy, precision, recall, and F1-score to choose the best model for churn prediction.
  3. Downsampling ⚖️:

    • Try downsampling the majority class (No Churn) to balance the dataset, which may improve the model's performance on the minority class (Churn), reducing the impact of class imbalance.
  4. Address Overfitting 🛠️:

    • Try various techniques to mitigate overfitting:
      • Constrain or prune the decision trees in Random Forest (for example with max_depth, min_samples_leaf, or ccp_alpha).
      • Use early stopping in XGBoost to stop training once performance on a validation set stops improving.
      • Apply XGBoost's L2 and L1 regularization via the lambda and alpha parameters (reg_lambda and reg_alpha in XGBClassifier).
  5. Stratified K-Fold Cross Validation 🔄:

    • Implement Stratified K-Fold Cross Validation to ensure that each fold has the same proportion of Churn and No Churn cases, especially when dealing with imbalanced datasets.
    • This will provide a more reliable estimate of model performance across different data splits.
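Several of these future steps can be combined in one sketch: a stratified 5-fold split, a small grid search scored by F1, and random downsampling of the majority class. Everything here is an assumption for illustration: the data is synthetic, the grid is deliberately tiny, and only Random Forest is tuned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (an assumption): about 20% of rows are positive.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Every fold keeps roughly the same class ratio as the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

# A deliberately tiny, illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",  # F1 on the positive class instead of plain accuracy
    cv=cv,
)
search.fit(X, y)

# Random downsampling: keep as many majority rows as there are minority rows.
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)
kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
balanced_idx = np.concatenate([minority_idx, kept_majority])
X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]

print(search.best_params_, round(search.best_score_, 3))
print(fold_positive_rates, y_balanced.mean())
```

If downsampling is adopted, it should be applied only to the training portion of each split; evaluating on a rebalanced test set would overstate performance on the real class distribution.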