Customer Churn Prediction: End-to-End ML Project Guide

Trong thế giới business ngày nay, một sự thật khó chối cãi: Chi phí acquire customer mới cao gấp 5-7 lần so với retain customer hiện tại. Tuy nhiên, theo khảo sát của Carptech với 80+ doanh nghiệp tại Việt Nam, 68% companies không có hệ thống dự đoán churn - họ chỉ biết khách hàng đã rời đi KHI ĐÃ RỜI ĐI, không có cơ hội can thiệp.

Customer Churn Prediction - dự đoán khách hàng nào sắp rời bỏ - là một trong những ML use cases có ROI rõ ràng nhất và fastest time-to-value (3-6 tháng). Một churn prediction model thành công có thể giảm churn rate 15-30%, tương đương hàng tỷ đồng revenue saved mỗi năm.

Bài viết này là hướng dẫn end-to-end, từ A đến Z, để bạn tự build một churn prediction system: business problem definition, data preparation, feature engineering, model training (với code examples đầy đủ), deployment, và monitoring. Kèm case study thực tế về SaaS company giảm churn từ 8% xuống 5.5% trong 6 tháng.

TL;DR - Key Takeaways

Churn definition varies by industry: E-commerce (90 days no purchase), SaaS (cancelled subscription), Bank (closed account)
Feature engineering is 80% of success: RFM, engagement, support tickets, behavioral changes - 20-30 features typical
Algorithm choice: XGBoost, Random Forest work best for churn (AUC 0.75-0.85 achievable)
Evaluation metric: Precision/Recall trade-off more important than accuracy (imbalanced classes)
Deployment: Weekly batch scoring → CRM integration → Automated retention campaigns
ROI: 20-30% churn reduction typical, 12-24 months to full value

Why Customer Churn Matters: The Business Case

The Economics of Churn

Cost comparison:

Customer Acquisition Cost (CAC): 1,500,000 VND
  - Marketing spend: 1,000,000đ
  - Sales effort: 300,000đ
  - Onboarding: 200,000đ

Customer Retention Cost: 300,000 VND
  - Win-back campaigns: 150,000đ
  - Customer success: 100,000đ
  - Loyalty rewards: 50,000đ

Savings: 1,500,000 - 300,000 = 1,200,000đ per customer (5x cheaper)

Lifetime Value (LTV) impact:

Scenario A: High churn (8% monthly)
  - Average customer lifetime: 12.5 months
  - LTV (at 500k VND/month ARPU): 6.25M VND

Scenario B: Low churn (5% monthly)
  - Average customer lifetime: 20 months
  - LTV: 10M VND

Difference: 3.75M VND per customer (+60%!)

Company-level impact:

SaaS company với 10,000 customers, 8% monthly churn
Churned customers/month: 800
Annual churned customers: 9,600
If reduce churn to 5%: Save 3,600 customers/year
Revenue saved (at 10M LTV): 36B VND/year

Industry Benchmarks: What's "Good" Churn?

Industry	Monthly Churn	Annual Churn	Notes
SaaS (B2B)	3-7%	32-58%	Enterprise SaaS lower (1-2%), SMB higher (5-10%)
SaaS (B2C)	5-10%	46-72%	Consumer apps churn higher
E-commerce	15-25%	84-94%	Based on 90-day definition
Telecom	1-3%	12-31%	Contractual, high switching costs
Banking	10-20%	72-89%	Account dormancy
Subscription boxes	8-15%	62-82%	High initial churn (3 months)

Vietnam specifics (Carptech data):

Vietnamese SaaS: 6-9% monthly (higher than global due to price sensitivity)
E-commerce: 20-30% (90-day definition, competitive market)
Fintech: 8-12% (regulatory trust issues)

Target: Reduce churn by 20-30% (e.g., 8% → 5.5-6%)

Step 1: Define Churn - Không dễ như nghĩ

What is "Churn"?

Contractual churn (explicit):

SaaS: Subscription cancelled (clear event, easy to identify)
Telecom: Contract terminated
Gym membership: Membership not renewed

Non-contractual churn (implicit):

E-commerce: No purchase in X days (how many days = churn?)
Mobile app: No login in X days
Bank: Account dormant (no transactions)

Challenge with non-contractual: You have to DEFINE churn threshold

Churn Definition Examples by Industry

E-commerce:

# Option 1: Fixed time window
churned = (days_since_last_purchase > 90)

# Option 2: Relative to historical behavior
avg_purchase_interval = df.groupby('customer_id')['days_between_purchases'].mean()
churned = (days_since_last_purchase > avg_purchase_interval * 2)

# Option 3: Percentile-based
p90_days = df['days_since_last_purchase'].quantile(0.9)
churned = (days_since_last_purchase > p90_days)

Recommendation: Start with simple (90 days), refine later based on business context

SaaS:

# Clear for paid subscriptions
churned = (subscription_status == 'cancelled')

# For freemium, define "active user"
active_user = (logins_30d >= 4) & (features_used_30d >= 2)
churned = ~active_user

Mobile app:

# Based on login frequency
churned = (days_since_last_login > 30)

# Or engagement threshold
monthly_active_users = (logins_30d > 0)
churned = ~monthly_active_users

Prediction Window

Not just "has customer churned?" but "will customer churn in next X days?"

Typical windows:

30 days: Fast-moving (e-commerce, apps)
60-90 days: Medium-term (SaaS)
180+ days: Long-term (banking, insurance)

Why this matters for ML:

30-day window: Need frequent predictions (weekly), faster action
90-day window: Monthly predictions OK, more planning time

Example label creation (SaaS, 60-day window):

from datetime import datetime, timedelta

def create_churn_labels(df, prediction_date, churn_window_days=60):
    """
    Label customers as churned if they cancel within churn_window_days of prediction_date
    """
    future_date = prediction_date + timedelta(days=churn_window_days)

    labels = []
    for _, customer in df.iterrows():
        # Did customer churn between prediction_date and future_date?
        if customer['subscription_status'] == 'active':
            # Check future cancellations
            cancellations = df[
                (df['customer_id'] == customer['customer_id']) &
                (df['cancellation_date'] >= prediction_date) &
                (df['cancellation_date'] <= future_date)
            ]
            churned = (len(cancellations) > 0)
        else:
            churned = False  # Already churned

        labels.append({
            'customer_id': customer['customer_id'],
            'prediction_date': prediction_date,
            'churned_next_60d': churned
        })

    return pd.DataFrame(labels)

Step 2: Feature Engineering - 80% của Success

ML rule: "Garbage in, garbage out" Corollary: "Great features in, great model out"

Feature Categories

1. RFM - Recency, Frequency, Monetary (Most predictive):

# Recency
features['days_since_last_purchase'] = (
    prediction_date - df.groupby('customer_id')['purchase_date'].max()
).dt.days

# Frequency
features['total_purchases_all_time'] = df.groupby('customer_id')['order_id'].count()
features['purchases_last_30d'] = df[df['purchase_date'] >= prediction_date - timedelta(30)] \
    .groupby('customer_id')['order_id'].count()
features['purchases_last_90d'] = df[df['purchase_date'] >= prediction_date - timedelta(90)] \
    .groupby('customer_id')['order_id'].count()

# Monetary
features['total_spend_all_time'] = df.groupby('customer_id')['amount'].sum()
features['total_spend_last_90d'] = df[df['purchase_date'] >= prediction_date - timedelta(90)] \
    .groupby('customer_id')['amount'].sum()
features['avg_order_value'] = df.groupby('customer_id')['amount'].mean()

2. Engagement features (Product usage):

# For SaaS/apps
features['logins_30d'] = login_df.groupby('customer_id')['login_date'].count()
features['days_active_30d'] = login_df.groupby('customer_id')['login_date'].nunique()
features['features_used_30d'] = feature_usage_df.groupby('customer_id')['feature_name'].nunique()

# Feature depth (power users)
features['advanced_features_used'] = feature_usage_df[
    feature_usage_df['feature_tier'] == 'advanced'
].groupby('customer_id')['feature_name'].nunique()

# For e-commerce
features['website_visits_30d'] = web_analytics.groupby('customer_id')['session_id'].nunique()
features['product_views_30d'] = web_analytics.groupby('customer_id')['product_id'].nunique()
features['add_to_cart_30d'] = web_analytics[
    web_analytics['event'] == 'add_to_cart'
].groupby('customer_id')['event'].count()

3. Customer service features (Warning signals):

# Support tickets
features['support_tickets_90d'] = support_tickets.groupby('customer_id')['ticket_id'].count()
features['complaints_90d'] = support_tickets[
    support_tickets['type'] == 'complaint'
].groupby('customer_id')['ticket_id'].count()

# Resolution metrics
features['avg_resolution_time'] = support_tickets.groupby('customer_id')['resolution_hours'].mean()
features['unresolved_tickets'] = support_tickets[
    support_tickets['status'] == 'open'
].groupby('customer_id')['ticket_id'].count()

4. Behavioral change features (Trends are powerful):

# Change in frequency
features['purchases_30d_vs_prior_30d'] = (
    purchases_last_30d - purchases_prior_30d
)
features['purchases_trend'] = features['purchases_30d_vs_prior_30d'] / (purchases_prior_30d + 1)  # Avoid /0

# Change in monetary
features['spend_30d_vs_prior_30d'] = (
    spend_last_30d - spend_prior_30d
)

# Change in engagement
features['logins_30d_vs_prior_30d'] = (
    logins_last_30d - logins_prior_30d
)
features['login_trend'] = features['logins_30d_vs_prior_30d'] / (logins_prior_30d + 1)

Why trends matter: A customer with 10 purchases/month is good. But if they had 20 purchases last month and dropped to 10 this month → warning signal (even though absolute number still looks OK).

5. Customer attributes:

# Demographics
features['customer_age_days'] = (prediction_date - df['signup_date']).dt.days
features['acquisition_channel'] = df['acquisition_channel']  # Categorical
features['customer_segment'] = df['segment']  # VIP, Regular, etc.

# Plan/pricing
features['plan_type'] = df['plan_type']  # Free, Starter, Pro, Enterprise
features['mrr'] = df['monthly_recurring_revenue']  # For SaaS
features['is_on_discount'] = df['discount_percent'] > 0

6. Product/category preferences (E-commerce specific):

# Category diversity
features['categories_purchased'] = purchase_df.groupby('customer_id')['category'].nunique()

# Favorite category
features['top_category'] = purchase_df.groupby('customer_id')['category'].agg(
    lambda x: x.value_counts().index[0]
)

# Brand loyalty
features['brands_purchased'] = purchase_df.groupby('customer_id')['brand'].nunique()
features['repeat_brand_rate'] = (
    purchase_df[purchase_df.duplicated(['customer_id', 'brand'], keep=False)]
    .groupby('customer_id')['order_id'].count() / features['total_purchases_all_time']
)

Example: Complete Feature Set (30 features)

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

def engineer_churn_features(customer_id, prediction_date, transactions_df, logins_df, support_df):
    """
    Create features for a single customer at a specific prediction_date
    """
    features = {'customer_id': customer_id, 'prediction_date': prediction_date}

    # Filter data up to prediction_date only (no data leakage!)
    hist_transactions = transactions_df[
        (transactions_df['customer_id'] == customer_id) &
        (transactions_df['purchase_date'] < prediction_date)
    ]
    hist_logins = logins_df[
        (logins_df['customer_id'] == customer_id) &
        (logins_df['login_date'] < prediction_date)
    ]

    # RFM
    if len(hist_transactions) > 0:
        features['days_since_last_purchase'] = (
            prediction_date - hist_transactions['purchase_date'].max()
        ).days
        features['total_purchases'] = len(hist_transactions)
        features['total_spend'] = hist_transactions['amount'].sum()
        features['avg_order_value'] = hist_transactions['amount'].mean()

        # Last 30 days
        last_30d = hist_transactions[
            hist_transactions['purchase_date'] >= prediction_date - timedelta(30)
        ]
        features['purchases_30d'] = len(last_30d)
        features['spend_30d'] = last_30d['amount'].sum() if len(last_30d) > 0 else 0

        # Prior 30 days (for trend)
        prior_30d = hist_transactions[
            (hist_transactions['purchase_date'] >= prediction_date - timedelta(60)) &
            (hist_transactions['purchase_date'] < prediction_date - timedelta(30))
        ]
        features['purchases_prior_30d'] = len(prior_30d)
        features['spend_prior_30d'] = prior_30d['amount'].sum() if len(prior_30d) > 0 else 0

        # Trend
        features['purchase_trend'] = (
            features['purchases_30d'] - features['purchases_prior_30d']
        ) / (features['purchases_prior_30d'] + 1)

    else:
        # No purchase history → Fill with defaults
        features.update({
            'days_since_last_purchase': 9999,
            'total_purchases': 0,
            'total_spend': 0,
            'avg_order_value': 0,
            'purchases_30d': 0,
            'spend_30d': 0,
            'purchase_trend': 0
        })

    # Engagement
    if len(hist_logins) > 0:
        features['days_since_last_login'] = (
            prediction_date - hist_logins['login_date'].max()
        ).days
        features['logins_30d'] = len(hist_logins[
            hist_logins['login_date'] >= prediction_date - timedelta(30)
        ])
    else:
        features['days_since_last_login'] = 9999
        features['logins_30d'] = 0

    # Support tickets (warning signal)
    hist_support = support_df[
        (support_df['customer_id'] == customer_id) &
        (support_df['created_date'] < prediction_date)
    ]
    features['support_tickets_90d'] = len(hist_support[
        hist_support['created_date'] >= prediction_date - timedelta(90)
    ])

    return features

# Apply to all customers
all_features = []
for customer_id in customer_ids:
    features = engineer_churn_features(
        customer_id, prediction_date, transactions_df, logins_df, support_df
    )
    all_features.append(features)

features_df = pd.DataFrame(all_features)

Common Feature Engineering Mistakes

1. Data leakage (most dangerous):

# WRONG: Using future information
features['days_until_churn'] = (churn_date - prediction_date).days  # THIS IS THE TARGET!

# WRONG: Including post-prediction data
features['purchases_next_30d'] = ...  # Future data

# RIGHT: Only use data BEFORE prediction_date
features['purchases_last_30d'] = hist_data[
    hist_data['date'] < prediction_date
].count()

2. Not handling missing values:

# Many customers have no support tickets → NaN
# Model will fail or treat NaN strangely

# WRONG: Drop rows with NaN
df = df.dropna()  # Lose customers

# RIGHT: Fill with meaningful defaults
features['support_tickets_90d'] = features['support_tickets_90d'].fillna(0)
features['days_since_last_purchase'] = features['days_since_last_purchase'].fillna(9999)  # "Never"

3. Ignoring feature scaling (for some algorithms):

# Features have different scales
# days_since_last_purchase: 0-1000
# total_purchases: 1-500
# total_spend: 100,000-50,000,000

# For tree-based models (XGBoost, Random Forest): Scaling not needed
# For linear models (Logistic Regression): Need scaling

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 3: Model Training - Algorithms & Evaluation

Dataset Preparation

import pandas as pd
from sklearn.model_selection import train_test_split

# Load features & labels
features_df = pd.read_csv('churn_features.csv')
labels_df = pd.read_csv('churn_labels.csv')

# Merge
df = features_df.merge(labels_df, on=['customer_id', 'prediction_date'])

# Separate features (X) and target (y)
feature_cols = [
    'days_since_last_purchase', 'total_purchases', 'total_spend', 'avg_order_value',
    'purchases_30d', 'spend_30d', 'purchase_trend',
    'days_since_last_login', 'logins_30d',
    'support_tickets_90d',
    # ... add all 30 features
]
X = df[feature_cols]
y = df['churned_next_60d']

# Train/test split (temporal split preferred)
# Use older data for training, recent for testing
split_date = '2024-10-01'
train_mask = df['prediction_date'] < split_date
test_mask = df['prediction_date'] >= split_date

X_train, X_test = X[train_mask], X[test_mask]
y_train, y_test = y[train_mask], y[test_mask]

print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
print(f"Train churn rate: {y_train.mean():.2%}")
print(f"Test churn rate: {y_test.mean():.2%}")

Algorithm Comparison

Try multiple algorithms, compare results:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, classification_report

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1, random_state=42)
}

results = {}
for name, model in models.items():
    # Train
    model.fit(X_train, y_train)

    # Predict probabilities
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    # Evaluate
    auc = roc_auc_score(y_test, y_pred_proba)
    results[name] = {'model': model, 'auc': auc, 'predictions': y_pred_proba}

    print(f"\n{name}:")
    print(f"  AUC: {auc:.4f}")

# Select best model
best_model_name = max(results, key=lambda x: results[x]['auc'])
best_model = results[best_model_name]['model']
print(f"\nBest model: {best_model_name} (AUC: {results[best_model_name]['auc']:.4f})")

Typical results:

Logistic Regression: AUC 0.72-0.76 (good baseline, interpretable)
Random Forest: AUC 0.76-0.80 (better performance)
XGBoost: AUC 0.78-0.85 (usually wins, our recommendation)

Hyperparameter Tuning (XGBoost)

from sklearn.model_selection import GridSearchCV

# Parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Grid search with cross-validation
xgb = XGBClassifier(random_state=42)
grid_search = GridSearchCV(
    xgb,
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation AUC: {grid_search.best_score_:.4f}")

# Final model with best parameters
best_xgb = grid_search.best_estimator_

Warning: Grid search is computationally expensive (hours for large datasets). Start with default params, only tune if needed for marginal gains.

Evaluation Metrics: Beyond Accuracy

Imbalanced classes problem:

Churn rate = 8% → 92% non-churn
Naive model: Predict everyone as "not churned" → 92% accuracy!
But this model is useless (doesn't catch any churners)

Better metrics:

1. ROC-AUC (Area Under ROC Curve):

from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

auc = roc_auc_score(y_test, y_pred_proba)
print(f"AUC: {auc:.4f}")

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'AUC = {auc:.4f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

AUC = 0.5: Random guessing
AUC = 0.7-0.8: Good model
AUC = 0.8-0.9: Excellent model
AUC > 0.9: Suspiciously good (check for data leakage!)

2. Precision & Recall (Confusion Matrix):

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Choose threshold (default 0.5 often not optimal)
threshold = 0.3  # Lower threshold = catch more churners, but more false alarms
y_pred = (y_pred_proba >= threshold).astype(int)

# Metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.2%}")  # Of predicted churners, how many actually churned?
print(f"Recall: {recall:.2%}")  # Of actual churners, how many did we catch?
print(f"F1-score: {f1:.4f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

Example output:

Confusion Matrix:
                Predicted Non-Churn    Predicted Churn
Actual Non-Churn      8,500 (TN)          450 (FP)
Actual Churn            200 (FN)          850 (TP)

Precision: 65% (850 / (850+450)) → Of customers we flagged, 65% actually churned
Recall: 81% (850 / (850+200)) → We caught 81% of churners

Trade-off:

High precision (low FP): Fewer false alarms, but miss some churners
High recall (low FN): Catch most churners, but many false alarms

Business decision:

High cost of false positives (expensive retention campaigns): Prefer high precision
High cost of losing customers (high LTV): Prefer high recall

3. Precision-Recall curve (find optimal threshold):

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)

# Plot
plt.plot(thresholds, precisions[:-1], label='Precision')
plt.plot(thresholds, recalls[:-1], label='Recall')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall vs Threshold')
plt.legend()
plt.grid()
plt.show()

# Find threshold where precision = 0.7 (business requirement)
idx = (precisions >= 0.7).argmax()
optimal_threshold = thresholds[idx]
print(f"Threshold for 70% precision: {optimal_threshold:.4f}")
print(f"Recall at this threshold: {recalls[idx]:.2%}")

Feature Importance: What Drives Churn?

# XGBoost feature importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': best_xgb.feature_importances_
}).sort_values('importance', ascending=False)

print(feature_importance.head(10))

# Plot
import matplotlib.pyplot as plt

feature_importance.head(15).plot(x='feature', y='importance', kind='barh', figsize=(10, 8))
plt.xlabel('Importance')
plt.title('Top 15 Features for Churn Prediction')
plt.gca().invert_yaxis()
plt.show()

Typical top features:

days_since_last_purchase (20-25% importance)
purchase_trend (15-18%)
total_purchases (10-12%)
logins_30d (8-10%)
support_tickets_90d (6-8%)

Insights:

Behavioral changes (trends) more predictive than absolute values
Engagement metrics (logins, feature usage) strong signals
Support tickets = red flag

Step 4: Deployment - From Model to Action

Batch Scoring (Weekly Schedule)

import schedule
import time

def score_customers():
    """
    Score all active customers for churn risk
    Run weekly (every Monday 6 AM)
    """
    print(f"[{datetime.now()}] Starting weekly churn scoring...")

    # 1. Get active customers
    active_customers = get_active_customers()  # Query from database
    print(f"  Found {len(active_customers)} active customers")

    # 2. Engineer features
    prediction_date = datetime.now().date()
    features = []
    for customer_id in active_customers:
        feat = engineer_churn_features(customer_id, prediction_date, ...)
        features.append(feat)

    features_df = pd.DataFrame(features)
    X = features_df[feature_cols]

    # 3. Score with model
    churn_probabilities = best_xgb.predict_proba(X)[:, 1]

    # 4. Create results table
    results = pd.DataFrame({
        'customer_id': features_df['customer_id'],
        'prediction_date': prediction_date,
        'churn_probability': churn_probabilities,
        'risk_tier': pd.cut(churn_probabilities, bins=[0, 0.3, 0.7, 1.0], labels=['Low', 'Medium', 'High'])
    })

    # 5. Save to database
    save_to_database(results, table='churn_predictions')
    print(f"  Scored {len(results)} customers, saved to database")

    # 6. Trigger retention campaigns for high-risk
    high_risk = results[results['risk_tier'] == 'High']
    trigger_retention_campaigns(high_risk)
    print(f"  Triggered retention campaigns for {len(high_risk)} high-risk customers")

    print("[Completed] Churn scoring finished\n")

# Schedule weekly (every Monday at 6 AM)
schedule.every().monday.at("06:00").do(score_customers)

# Keep running
while True:
    schedule.run_pending()
    time.sleep(3600)  # Check every hour

Integration with CRM / Marketing Automation

Option 1: Database integration:

-- Analysts query churn scores directly
SELECT
    c.customer_id,
    c.email,
    c.name,
    cp.churn_probability,
    cp.risk_tier,
    cp.prediction_date
FROM customers c
JOIN churn_predictions cp ON c.customer_id = cp.customer_id
WHERE cp.prediction_date = CURRENT_DATE
    AND cp.risk_tier = 'High'
ORDER BY cp.churn_probability DESC
LIMIT 100;

Option 2: API endpoint (for real-time scoring):

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load model on startup
model = joblib.load('churn_model.pkl')

@app.route('/predict_churn', methods=['POST'])
def predict_churn():
    """
    API endpoint to score a single customer
    """
    data = request.json
    customer_id = data['customer_id']

    # Engineer features
    features = engineer_churn_features(customer_id, datetime.now().date(), ...)
    X = pd.DataFrame([features])[feature_cols]

    # Predict
    churn_prob = model.predict_proba(X)[0, 1]

    return jsonify({
        'customer_id': customer_id,
        'churn_probability': float(churn_prob),
        'risk_tier': 'High' if churn_prob > 0.7 else ('Medium' if churn_prob > 0.3 else 'Low')
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Option 3: Reverse ETL (Hightouch, Census):

Churn scores in data warehouse → Sync to CRM (Salesforce, HubSpot)
Sales reps see churn risk in CRM interface

Retention Campaign Automation

def trigger_retention_campaigns(high_risk_customers):
    """
    Send personalized retention campaigns based on churn reasons
    """
    for _, customer in high_risk_customers.iterrows():
        customer_id = customer['customer_id']
        churn_prob = customer['churn_probability']

        # Analyze churn reasons (feature values)
        features = get_customer_features(customer_id)

        # Decision tree for campaign type
        if features['purchase_trend'] < -0.5:
            # Decreasing purchase frequency
            campaign_type = 'reactivation_discount'
            campaign_message = f"Nhớ bạn quá! Voucher 20% cho đơn hàng tiếp theo"

        elif features['support_tickets_90d'] > 2:
            # Support issues
            campaign_type = 'customer_success_call'
            campaign_message = "Đội Customer Success sẽ gọi để hỗ trợ bạn"

        elif features['logins_30d'] < 2:
            # Low engagement
            campaign_type = 'feature_education'
            campaign_message = "5 tính năng bạn chưa biết trong app"

        else:
            # General win-back
            campaign_type = 'loyalty_reward'
            campaign_message = "Ưu đãi đặc biệt cho khách hàng thân thiết"

        # Send campaign via marketing automation (Mailchimp, Braze, etc.)
        send_campaign(
            customer_id=customer_id,
            campaign_type=campaign_type,
            message=campaign_message,
            discount_percent=20 if 'discount' in campaign_type else 0
        )

        print(f"  Sent {campaign_type} to customer {customer_id} (churn prob: {churn_prob:.2%})")

Step 5: Monitoring & Continuous Improvement

Model Performance Monitoring

# Weekly evaluation on new data
def evaluate_model_performance():
    """
    Evaluate model on past week's predictions vs actual outcomes
    """
    # Get predictions from 60 days ago (when we predicted 60-day churn)
    prediction_date = datetime.now().date() - timedelta(days=60)
    predictions = get_predictions_from_date(prediction_date)

    # Get actual churn outcomes
    actual_churn = get_actual_churn_outcomes(prediction_date, prediction_date + timedelta(60))

    # Merge
    df = predictions.merge(actual_churn, on='customer_id')

    # Evaluate
    auc = roc_auc_score(df['actually_churned'], df['churn_probability'])
    precision = precision_score(df['actually_churned'], df['predicted_high_risk'])
    recall = recall_score(df['actually_churned'], df['predicted_high_risk'])

    print(f"[{datetime.now()}] Model Performance (last 60 days):")
    print(f"  AUC: {auc:.4f}")
    print(f"  Precision: {precision:.2%}")
    print(f"  Recall: {recall:.2%}")

    # Alert if performance degrades
    if auc < 0.70:  # Threshold
        send_alert("CHURN MODEL PERFORMANCE DEGRADED", details=f"AUC dropped to {auc:.4f}")

# Run monthly
schedule.every().month.at("01:00").do(evaluate_model_performance)

Business Impact Tracking

Metrics to track:

-- Overall churn rate (before vs after ML)
WITH monthly_churn AS (
  SELECT
    DATE_TRUNC('month', date) as month,
    COUNT(DISTINCT CASE WHEN churned = TRUE THEN customer_id END) as churned_customers,
    COUNT(DISTINCT customer_id) as total_customers,
    COUNT(DISTINCT CASE WHEN churned = TRUE THEN customer_id END)::FLOAT /
      COUNT(DISTINCT customer_id) * 100 as churn_rate
  FROM customer_status_daily
  GROUP BY month
)
SELECT * FROM monthly_churn
ORDER BY month DESC
LIMIT 12;

-- Retention campaign effectiveness
SELECT
    campaign_type,
    COUNT(*) as customers_targeted,
    SUM(CASE WHEN churned_60d = FALSE THEN 1 ELSE 0 END) as retained,
    SUM(CASE WHEN churned_60d = FALSE THEN 1 ELSE 0 END)::FLOAT / COUNT(*) * 100 as retention_rate
FROM retention_campaigns rc
JOIN customer_outcomes co ON rc.customer_id = co.customer_id
WHERE rc.sent_date >= '2024-01-01'
GROUP BY campaign_type
ORDER BY retention_rate DESC;

Dashboard metrics:

Monthly churn rate trend (target: -20% vs baseline)
Customers flagged as high-risk
Retention campaigns sent
Campaign success rate (% retained)
ROI (revenue saved vs campaign cost)

Model Retraining Schedule

When to retrain:

Regular schedule: Every 3-6 months (to capture new patterns)
Performance degradation: If AUC drops >5% from baseline
Concept drift: Customer behavior changes (e.g., COVID impact)
New features available: Additional data sources connected

Retraining workflow:

def retrain_churn_model():
    """
    Retrain model with latest data (last 18 months)
    """
    print(f"[{datetime.now()}] Starting model retraining...")

    # 1. Get fresh training data (last 18 months)
    end_date = datetime.now().date() - timedelta(days=60)  # Need 60 days for labels
    start_date = end_date - timedelta(days=540)  # 18 months

    # 2. Generate features & labels for all customers in this period
    # (Same feature engineering code as before)

    # 3. Train new model
    # (Same training code as before)

    # 4. Evaluate on holdout set
    # Compare new model vs old model

    # 5. If new model better → Deploy
    if new_model_auc > old_model_auc + 0.02:  # At least 2% improvement
        save_model(new_model, 'churn_model_v2.pkl')
        print(f"  New model deployed (AUC: {new_model_auc:.4f} vs {old_model_auc:.4f})")
    else:
        print(f"  New model not better, keeping old model")

# Run quarterly
schedule.every(3).months.do(retrain_churn_model)

Case Study: SaaS Company Giảm Churn 30%

Background:

Company: B2B SaaS (project management tool), ~5,000 customers
Baseline churn: 8% monthly (industry average 5-7%)
Annual revenue: 60B VND
Problem: Losing ~400 customers/month, no proactive retention

Implementation (4 months):

Month 1: Data preparation

Connected Stripe (subscriptions), Mixpanel (product usage), Zendesk (support tickets)
Built Data Warehouse (BigQuery) + dbt models
Historical data: 24 months

Month 2: Feature engineering & modeling

28 features engineered (RFM, engagement, support tickets, behavioral trends)
Trained XGBoost model
AUC: 0.81 (excellent for first attempt)
Top features:
1. days_since_last_login (22%)
2. logins_30d_trend (18%)
3. features_used_30d (15%)
4. support_tickets_90d (12%)

Month 3: Deployment

Weekly batch scoring (every Monday)
CRM integration (Salesforce) - churn scores visible to CSMs
Automated email campaigns for high-risk (>70% churn prob)

Month 4: Optimization

A/B test: 50% control (no intervention), 50% treatment (retention campaigns)
Campaign types:
- Feature education: Video tutorials for underutilized features
- Customer success call: Proactive check-in for high-risk accounts
- Discount offers: 20% off next 3 months for at-risk customers

Results after 6 months:

Metric	Before	After	Change
Monthly churn rate	8.0%	5.5%	-31% (↓2.5pp)
Customers lost/month	400	275	-125 customers
High-risk customers identified	N/A	600/month	New
Retention campaign success rate	N/A	35%	210 saved/month
Annual revenue saved	N/A	18B VND	(125 × 12 × 12M LTV)
Campaign cost	N/A	2B VND	Discounts + CS time
Net benefit	N/A	16B VND/year	8x ROI

Key learnings:

Behavioral changes > absolute metrics: Login trend more predictive than absolute login count
Support tickets = strong signal: Customers with 2+ tickets in 90 days had 3x churn rate
Discount not always needed: 60% retained with feature education alone (cheaper than discounts)
Timing matters: Intervene when churn probability 40-70% (still salvageable)

Common Pitfalls & How to Avoid

1. Chasing perfect model, ignoring business impact:

Spending 3 months to improve AUC from 0.80 → 0.83
Better: Deploy 0.80 model quickly, measure business impact, iterate

2. Not A/B testing:

Assume churn reduction due to ML, but could be seasonality, product changes
Solution: Always have control group (no intervention)

3. Retention campaigns too generic:

Same "20% discount" email to everyone
Better: Personalize based on churn reasons (support issues → CS call, low engagement → education)

4. Forgetting about model maintenance:

Train once in 2023, use in 2025 → Performance degrades
Solution: Quarterly retraining, monthly performance monitoring

5. Over-reliance on model:

"Model says low churn risk, ignore this customer complaint"
Solution: ML assists, humans decide. Customer Success Managers have final say.

Kết Luận: Churn Prediction = Proven ROI

Customer churn prediction là một trong những ML use cases dễ justify ROI nhất:

Clear metric: Churn rate reduction (%)
Fast time-to-value: 3-6 months pilot → production
High impact: 20-30% churn reduction = tens of billions VND for mid-size companies
Transferable: Learnings apply to other prediction problems (upsell, LTV prediction)

Implementation roadmap:

Month 1: Data preparation (Data Warehouse, historical data)
Month 2: Feature engineering, model training
Month 3: Deployment, CRM integration
Month 4-6: A/B testing, optimization
Ongoing: Monitoring, retraining, scaling

Next steps:

Assess data readiness (have 12+ months history? Clean data?)
Define churn appropriately for your business
Start with pilot (top 20% customers by revenue)
Liên hệ Carptech nếu cần support end-to-end ML project (carptech.vn/contact)

Tài liệu tham khảo:

Bài viết này là phần 2 của series "Advanced Analytics & AI/ML" tháng 5. Đọc thêm về Analytics Maturity, Recommendation Systems, và Demand Forecasting.

Carptech - Data Platform & ML Solutions for Vietnamese Enterprises. Liên hệ tư vấn miễn phí.