MLOps: Production ML at Enterprise Scale

TL;DR

MLOps (Machine Learning Operations) is a discipline that combines ML, DevOps, and Data Engineering to take ML models to production reliably and at scale. According to VentureBeat, 85% of ML models never make it to production. MLOps is the practice aimed at closing that gap.

Core components of MLOps:

Experiment Tracking: MLflow, Weights & Biases to track experiments
Feature Store: Feast, Tecton để reuse features across models
Model Registry: Versioning, staging, metadata management
Training Pipeline: Automated, scheduled retraining
Deployment: REST APIs, batch predictions, real-time serving
Monitoring: Model performance, data drift, concept drift detection
CI/CD: Automated testing, deployment, rollback

Case study, Vietnamese fintech: rolling out MLOps for a fraud detection model:

Before: 2 weeks to deploy a model update, manual retraining every month
After: Hourly deployment of updates, automated daily retraining on new fraud patterns
Result: Fraud detected 70% faster, adapting to new attack patterns within 24 hours

This article walks you through the 4 MLOps maturity levels and how to build an MVP MLOps system with open-source tools.

1. Why MLOps? The Production Gap

1.1. Jupyter Notebook → Production: The Valley of Death

A familiar scenario at Vietnamese companies:

Weeks 1-4: A Data Scientist builds a model in a Jupyter notebook

Accuracy: 92% on the test set
Leadership is excited: "Deploy it right away!"

Weeks 5-8: The engineering team tries to productionize it

The notebook code does not run on the server
Hardcoded paths: /Users/datascientist/Downloads/data.csv
Library conflicts: the notebook uses pandas 1.5, the server has 1.3
The model file is 2GB and nobody knows how to deploy it

Weeks 9-12: After a lot of debugging

It finally gets deployed... but accuracy drops to 75%
Because the training data is already 3 months old
There is no monitoring, so nobody knows how the model is performing

Week 13+: The model is "forgotten"

No retraining schedule
Performance degradation goes undetected
6 months later, the model's predictions are completely wrong

85% of ML projects end here (VentureBeat, 2019).

1.2. How Is MLOps Different from DevOps?

MLOps = DevOps + Data + Models

Aspect	Traditional DevOps	MLOps
Code	Version control (Git)	✅ Same
Data	N/A	✅ Data versioning (DVC)
Models	N/A	✅ Model versioning
Testing	Unit tests, integration tests	✅ Same + data validation + model tests
Deployment	Blue-green, canary	✅ Same + A/B testing models
Monitoring	Server metrics, logs	✅ Same + model performance + drift
Dependencies	requirements.txt, Docker	✅ Same + data pipelines

Key difference: ML systems have three moving parts (code, data, model) instead of just code.
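To make the "three moving parts" concrete, here is a minimal sketch of a run manifest that pins all three together, so a training run can be reproduced only when the code commit, the data hash, and the model hash all match. The function names and manifest fields are illustrative assumptions. In practice, tools such as DVC (data) and MLflow (runs, models) record this information for you.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Content hash used as an immutable version identifier."""
    return hashlib.sha256(data).hexdigest()


def build_run_manifest(git_commit: str, data_bytes: bytes, model_bytes: bytes) -> dict:
    """Pin the three moving parts of an ML system for one training run.

    A run is reproducible only if all three match: the code (git commit),
    the exact training data, and the resulting model artifact.
    """
    return {
        "code_version": git_commit,
        "data_version": sha256_of(data_bytes),
        "model_version": sha256_of(model_bytes),
    }


if __name__ == "__main__":
    manifest = build_run_manifest("a1b2c3d", b"customer_id,recency\n1,15\n", b"<model bytes>")
    print(json.dumps(manifest, indent=2))
```

Changing any one input, for example a single row of training data, produces a different manifest. That is exactly the signal Git alone cannot give you for data and models.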

1.3. Business Impact of MLOps

1. Time to Production

Without MLOps: 3-6 months to deploy one model
With MLOps: 1-2 weeks

2. Model Performance

Without monitoring: Model degradation 10-30% per year
With MLOps: Degradation is detected and the model is retrained in time

3. Cost Efficiency

Manual operations: Data Scientists spend 60% of their time on deployment
Automated MLOps: Data Scientists spend 80% of their time on model improvement

4. Scalability

Manual: Maximum 5-10 models in production
MLOps: 50-500+ models

Case study - Vietnamese E-commerce (500M GMV/month):

Deployed 12 ML models with MLOps:
- Product recommendations (3 models)
- Churn prediction
- Demand forecasting (per category)
- Fraud detection
- Price optimization
- Customer segmentation
Before MLOps: 1 model in production (recommendations), updated quarterly
After MLOps: 12 models, automated retraining weekly, hourly deployments
Impact: 25% increase in conversion rate, $2M annual revenue increase

2. MLOps Maturity Levels: Your Roadmap

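Google's MLOps guidance defines Levels 0-2 (manual process, ML pipeline automation, CI/CD pipeline automation). Level 3 below extends that model with fully monitoring-driven automation: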

Level 0: Manual Process (90% of Vietnamese companies are here)

Characteristics:

Data Scientists work in notebooks
Manual data collection and preprocessing
Train model locally, save pickle file
Deploy = copy file to server, write Flask API manually
No monitoring; manual retraining whenever someone remembers

Tools: Jupyter, pandas, scikit-learn, Flask

Pros: Fast for POCs and experiments
Cons: Not reproducible, not scalable, high failure rate

When acceptable: Research, one-off analysis, POCs

Level 1: ML Pipeline Automation

Characteristics:

Automated training pipeline: scheduled retraining
Data validation (check schema, distributions)
Feature engineering pipeline
Model versioning
Automated deployment of new model versions

Tools:

Pipeline orchestration: Apache Airflow, Prefect, Kubeflow
Model registry: MLflow, custom

Improvement: Reproducible training, scheduled updates

Example pipeline (Apache Airflow):

# airflow_ml_pipeline.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-science',
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'churn_prediction_training',
    default_args=default_args,
    schedule='0 2 * * 0',  # Weekly at 2 AM Sunday (schedule_interval is deprecated since Airflow 2.4)
    start_date=datetime(2025, 5, 1),
    catchup=False
)

def extract_data():
    """Extract customer data from BigQuery"""
    from google.cloud import bigquery

    client = bigquery.Client()
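    query = """
        -- Features use only events BEFORE the cutoff (90 days ago); the label
        -- checks for orders AFTER it. Computing both from the same window would
        -- leak the target (churned would simply equal recency > 90).
        SELECT
            e.customer_id,
            -- RFM features (as of the cutoff date)
            DATE_DIFF(DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY), MAX(e.order_date), DAY) as recency,
            COUNT(DISTINCT e.order_id) as frequency,
            SUM(e.order_total) as monetary,
            -- Engagement features
            COUNT(DISTINCT e.login_date) as login_days,
            AVG(e.session_duration) as avg_session,
            -- Target: no order in the 90 days after the cutoff
            IF(MAX(r.customer_id) IS NULL, 1, 0) as churned
        FROM `project.dataset.customer_events` e
        LEFT JOIN (
            SELECT DISTINCT customer_id
            FROM `project.dataset.customer_events`
            WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
        ) r ON e.customer_id = r.customer_id
        WHERE e.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR)
          AND e.order_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
        GROUP BY e.customer_id
    """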

    df = client.query(query).to_dataframe()
    df.to_parquet('/data/raw/churn_data.parquet')
    print(f"Extracted {len(df)} customers")

def validate_data():
    """Validate data quality"""
    import pandas as pd
    import great_expectations as ge  # legacy API: ge.from_pandas exists only before GX 1.0

    df = pd.read_parquet('/data/raw/churn_data.parquet')

    # Convert to Great Expectations dataset
    ge_df = ge.from_pandas(df)

    # Define expectations
    ge_df.expect_column_values_to_not_be_null('customer_id')
    ge_df.expect_column_values_to_be_between('recency', 0, 730)
    ge_df.expect_column_values_to_be_between('frequency', 1, 1000)
    ge_df.expect_column_mean_to_be_between('churned', 0.05, 0.30)

    # Validate
    results = ge_df.validate()

    if not results.success:
        raise ValueError(f"Data validation failed: {results}")

    print("✅ Data validation passed")

def train_model():
    """Train churn prediction model"""
    import pandas as pd
    import mlflow
    import mlflow.sklearn
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score, precision_score, recall_score

    # Load data
    df = pd.read_parquet('/data/raw/churn_data.parquet')

    # Split features and target
    feature_cols = ['recency', 'frequency', 'monetary', 'login_days', 'avg_session']
    X = df[feature_cols]
    y = df['churned']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Start MLflow run
    mlflow.set_experiment('churn-prediction')

    with mlflow.start_run():
        # Train model
        model = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            min_samples_split=100,
            random_state=42
        )
        model.fit(X_train, y_train)

        # Evaluate
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]

        auc = roc_auc_score(y_test, y_pred_proba)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)

        # Log metrics
        mlflow.log_metric('auc', auc)
        mlflow.log_metric('precision', precision)
        mlflow.log_metric('recall', recall)

        # Log parameters
        mlflow.log_param('n_estimators', 100)
        mlflow.log_param('max_depth', 10)
        mlflow.log_param('min_samples_split', 100)

        # Log model
        mlflow.sklearn.log_model(model, 'model')

        print(f"✅ Model trained - AUC: {auc:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}")

        # Save model version
        run_id = mlflow.active_run().info.run_id
        with open('/data/models/latest_run_id.txt', 'w') as f:
            f.write(run_id)

def deploy_model():
    """Deploy model to production"""
    import mlflow

    # Load latest run ID
    with open('/data/models/latest_run_id.txt', 'r') as f:
        run_id = f.read().strip()

    # Register the run's model; register_model returns the new ModelVersion,
    # so there is no need to search for the "latest" version afterwards
    model_uri = f'runs:/{run_id}/model'
    model_version = mlflow.register_model(model_uri, 'churn-prediction')

    # Promote to Production (stages are deprecated in recent MLflow 2.x
    # releases in favor of model aliases, but still work)
    from mlflow.tracking import MlflowClient
    client = MlflowClient()

    client.transition_model_version_stage(
        name='churn-prediction',
        version=model_version.version,
        stage='Production',
        archive_existing_versions=True  # move the previous Production version to Archived
    )

    print(f"✅ Model version {model_version.version} deployed to Production")

# Define tasks
task_extract = PythonOperator(
    task_id='extract_data',
    python_callable=extract_data,
    dag=dag
)

task_validate = PythonOperator(
    task_id='validate_data',
    python_callable=validate_data,
    dag=dag
)

task_train = PythonOperator(
    task_id='train_model',
    python_callable=train_model,
    dag=dag
)

task_deploy = PythonOperator(
    task_id='deploy_model',
    python_callable=deploy_model,
    dag=dag
)

# Define dependencies
task_extract >> task_validate >> task_train >> task_deploy

Result: Automated weekly retraining and a reproducible pipeline.

Level 2: CI/CD Pipeline Automation

Characteristics:

Continuous Integration: Automated testing for code, data, models
Continuous Delivery: Automated deployment with approval gates
Feature store: Centralized, reusable features
Model versioning and staging (Dev → Staging → Production)
Automated rollback if model performance drops
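The approval gate and automated rollback listed above come down to a champion/challenger comparison: promote the new model only if it clears an absolute quality floor and does not regress against the current Production model. This is a minimal sketch that assumes AUC is the deciding metric. The names `ModelMetrics` and `choose_production_model` and the thresholds `min_auc` and `max_regression` are hypothetical and should come from your own requirements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelMetrics:
    version: str
    auc: float


def choose_production_model(
    candidate: ModelMetrics,
    current: Optional[ModelMetrics],
    min_auc: float = 0.75,
    max_regression: float = 0.01,
) -> Optional[str]:
    """Return the model version that should serve traffic (None if nothing qualifies)."""
    if candidate.auc < min_auc:
        # Candidate fails the absolute floor: keep whatever is serving now
        return current.version if current else None
    if current is None or candidate.auc >= current.auc - max_regression:
        # No incumbent, or the candidate is at least as good within tolerance
        return candidate.version
    # Candidate regressed too much versus Production: keep (roll back to) the incumbent
    return current.version


prod = ModelMetrics("v7", 0.86)
print(choose_production_model(ModelMetrics("v8", 0.87), prod))  # v8 is promoted
print(choose_production_model(ModelMetrics("v8", 0.80), prod))  # v7 stays in Production
```

The small `max_regression` tolerance is a design choice. It avoids blocking retrains whose metric differs only by evaluation noise, while still catching real regressions.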

Tools:

CI/CD: GitHub Actions, GitLab CI, Jenkins
Feature store: Feast, Tecton
Model registry: MLflow, Vertex AI Model Registry

Example CI/CD (GitHub Actions):

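# .github/workflows/ml-pipeline.yml
name: ML Training Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 0'  # Weekly at 2 AM Sunday (UTC)
  workflow_dispatch:  # Manual trigger

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      # GOOGLE_APPLICATION_CREDENTIALS must point to a key *file*, not the key
      # contents; the auth action writes the file and exports the variable.
      - name: Authenticate to GCP
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Validate data quality
        run: |
          python scripts/validate_data.py

  model-training:
    needs: data-validation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Authenticate to GCP
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Train model
        run: |
          python scripts/train_model.py
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}

      - name: Evaluate model
        id: evaluate
        run: |
          # evaluate_model.py is assumed to print one line of JSON,
          # e.g. {"auc": 0.85, "precision": 0.78}; expose it as a step output
          echo "metrics=$(python scripts/evaluate_model.py)" >> "$GITHUB_OUTPUT"
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}

      - name: Check performance threshold
        env:
          METRICS: ${{ steps.evaluate.outputs.metrics }}
        run: |
          python -c "
          import json, os, sys
          metrics = json.loads(os.environ['METRICS'])
          if metrics['auc'] < 0.75:
              print('❌ Model AUC below threshold')
              sys.exit(1)
          print('✅ Model meets performance threshold')
          "

  model-deployment:
    needs: model-training
    runs-on: ubuntu-latest
    environment: production  # Requires approval (required reviewers on the environment)
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Authenticate to GCP
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Deploy to Vertex AI
        run: |
          python scripts/deploy_vertex_ai.py

      - name: Run smoke tests
        run: |
          python scripts/test_deployed_model.py

  monitoring:
    needs: model-deployment
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Setup monitoring
        run: |
          python scripts/setup_monitoring.py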

Level 3: Automated MLOps (The Goal)

Characteristics:

Full automation: Training, deployment, monitoring, retraining
Monitoring-driven retraining: Trigger retraining when performance degradation is detected
Data drift detection: Automatic alerts
Concept drift detection: Model performance monitoring
Auto-rollback: Revert to previous version if new model underperforms
Multi-model management: Serve 10s-100s models
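Monitoring-driven retraining combines several triggers: a schedule, detected data drift, a performance floor, and a manual request. This is a minimal sketch of that decision as a pure function, so it can be unit-tested apart from the orchestrator. The function name, the parameter names, and the default thresholds are hypothetical. A `drift_threshold` of 0.2 matches a common rule of thumb for PSI-style drift scores, not a standard.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def retraining_reasons(
    now: datetime,
    last_trained: datetime,
    drift_score: float,
    recent_auc: Optional[float],
    manual_request: bool = False,
    max_age: timedelta = timedelta(days=7),
    drift_threshold: float = 0.2,
    auc_floor: float = 0.75,
) -> List[str]:
    """Return the reasons (possibly none) to trigger the training pipeline."""
    reasons = []
    if now - last_trained >= max_age:
        reasons.append("scheduled")
    if drift_score > drift_threshold:
        reasons.append("data_drift")
    # recent_auc is None while ground-truth labels have not arrived yet
    if recent_auc is not None and recent_auc < auc_floor:
        reasons.append("performance")
    if manual_request:
        reasons.append("manual")
    return reasons


now = datetime(2025, 6, 1)
print(retraining_reasons(now, now - timedelta(days=1), drift_score=0.3, recent_auc=0.70))
```

Returning the list of reasons, rather than a bare boolean, makes the resulting alert or pipeline run self-explanatory.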

Tools:

Integrated platforms: Databricks ML, AWS SageMaker, GCP Vertex AI
Monitoring: Evidently AI, WhyLabs, Arize AI

Architecture diagram:

┌─────────────────┐
│  Data Sources   │
│  (BigQuery, S3) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│ Feature Store   │◄─────┤ Feature Eng  │
│ (Feast, Tecton) │      │  Pipeline    │
└────────┬────────┘      └──────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│ Training        │◄─────┤ Trigger:     │
│ Pipeline        │      │ - Scheduled  │
│ (Kubeflow)      │      │ - Drift      │
└────────┬────────┘      │ - Manual     │
         │               └──────────────┘
         ▼
┌─────────────────┐
│ Model Registry  │
│ (MLflow)        │
│ - Dev           │
│ - Staging       │
│ - Production    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│ Model Serving   │─────►│ Predictions  │
│ (Vertex AI)     │      │ API          │
└────────┬────────┘      └──────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│ Monitoring      │─────►│ Alerts &     │
│ - Performance   │      │ Retraining   │
│ - Data drift    │      │ Triggers     │
│ - Concept drift │      └──────────────┘
└─────────────────┘

3. Core Components of the MLOps Stack

3.1. Experiment Tracking (MLflow)

Problem: Data Scientists run 100+ experiments → "experiment 47 had the highest accuracy, but nobody remembers its hyperparameters"

Solution: MLflow Tracking

Setup MLflow:

# Install
pip install mlflow

# Start MLflow server
mlflow server \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlflow-artifacts \
    --host 0.0.0.0 \
    --port 5000

Example usage:

import mlflow
import mlflow.sklearn
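from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Assumes X_train, X_test, y_train, y_test already exist
# (e.g. from the train_test_split in the Airflow train_model step)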

# Set experiment
mlflow.set_experiment('churn-prediction-tuning')

# Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_split': [50, 100, 200]
}

# Grid search
for n_est in param_grid['n_estimators']:
    for depth in param_grid['max_depth']:
        for min_samples in param_grid['min_samples_split']:

            with mlflow.start_run():
                # Log parameters
                mlflow.log_param('n_estimators', n_est)
                mlflow.log_param('max_depth', depth)
                mlflow.log_param('min_samples_split', min_samples)

                # Train model
                model = RandomForestClassifier(
                    n_estimators=n_est,
                    max_depth=depth,
                    min_samples_split=min_samples,
                    random_state=42
                )
                model.fit(X_train, y_train)

                # Evaluate
                y_pred_proba = model.predict_proba(X_test)[:, 1]
                auc = roc_auc_score(y_test, y_pred_proba)

                # Log metrics
                mlflow.log_metric('auc', auc)

                # Log model
                mlflow.sklearn.log_model(model, 'model')

                print(f"n_est={n_est}, depth={depth}, min_samples={min_samples} → AUC={auc:.3f}")

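# Find best run (look up the experiment ID by name instead of hardcoding '1')
experiment = mlflow.get_experiment_by_name('churn-prediction-tuning')
best_run = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=['metrics.auc DESC'],
    max_results=1
)

print(f"Best run: {best_run[['params.n_estimators', 'params.max_depth', 'metrics.auc']]}")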

MLflow UI: http://localhost:5000

3.2. Feature Store (Feast)

Problem:

10 models all use the "customer lifetime value" feature → it gets computed 10 times
Training computes features one way, production computes them another way → training-serving skew
Features không reusable across teams

Solution: Centralized Feature Store

Feast example:

# feature_repo/features.py
# Current Feast API (Field + feast.types); the older Feature/ValueType
# classes and event_timestamp_column have been removed from Feast
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Define entity (join key: customer_id)
customer = Entity(
    name='customer',
    join_keys=['customer_id'],
    description='Customer ID'
)

# Define data source
customer_features_source = FileSource(
    path='/data/customer_features.parquet',
    timestamp_field='event_timestamp'
)

# Define feature view
customer_features = FeatureView(
    name='customer_features',
    entities=[customer],
    ttl=timedelta(days=7),
    schema=[
        Field(name='recency', dtype=Int64),
        Field(name='frequency', dtype=Int64),
        Field(name='monetary', dtype=Float32),
        Field(name='clv', dtype=Float32),
        Field(name='churn_score', dtype=Float32)
    ],
    source=customer_features_source
)

Feature retrieval (training):

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path='feature_repo/')

# Entity dataframe
entity_df = pd.DataFrame({
    'customer_id': [1001, 1002, 1003],
    'event_timestamp': pd.to_datetime(['2025-05-27'] * 3)
})

# Get historical features (point-in-time correct)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        'customer_features:recency',
        'customer_features:frequency',
        'customer_features:monetary',
        'customer_features:clv'
    ]
).to_df()

print(training_df)

Feature retrieval (production):

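# Get online features (low latency). Requires the features to be materialized
# into the online store first, e.g. with `feast materialize-incremental <end-date>`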
online_features = store.get_online_features(
    features=[
        'customer_features:recency',
        'customer_features:frequency',
        'customer_features:monetary',
        'customer_features:clv'
    ],
    entity_rows=[{'customer_id': 1001}]
).to_dict()

print(online_features)
# Output: {'customer_id': [1001], 'recency': [15], 'frequency': [12], ...}

Benefits:

Reusability: Compute once, use in many models
Consistency: Training and serving use the same features
Point-in-time correctness: Avoid data leakage
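Point-in-time correctness means each training row sees only feature values that already existed at that row's timestamp. The idea behind `get_historical_features` can be illustrated with pandas `merge_asof`. This is a simplified sketch, not how Feast implements it, and it ignores the feature view's TTL. The column names mirror the Feast example, and the values are made up.

```python
import pandas as pd

# Feature snapshots: the value of `clv` as it was known at each timestamp
features = pd.DataFrame({
    "customer_id": [1001, 1001, 1002],
    "event_timestamp": pd.to_datetime(["2025-05-01", "2025-05-20", "2025-05-10"]),
    "clv": [120.0, 180.0, 75.0],
})

# Labeled training events: we want features *as of* each event time
entity_df = pd.DataFrame({
    "customer_id": [1001, 1001, 1002],
    "event_timestamp": pd.to_datetime(["2025-05-15", "2025-05-27", "2025-05-05"]),
})

# For each entity row, take the latest feature row at or before its timestamp.
# A plain join on customer_id would leak future values (clv=180 into May 15).
training_df = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customer_id",
    direction="backward",
)
print(training_df)
```

Customer 1002's row on May 5 gets no `clv`, because that customer's first snapshot is dated May 10. That is the correct outcome, since the value did not exist yet.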

3.3. Model Monitoring & Drift Detection

Two types of drift:

1. Data Drift: The input data distribution changes

Example: COVID-19 → customer behavior changed completely

2. Concept Drift: The relationship between X and y changes

Example: Fraud patterns evolve → the old fraud detection model is no longer effective
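Drift tools compute per-feature statistics for you. To show what a data-drift score actually measures, here is a minimal sketch of the Population Stability Index (PSI), a common drift metric: PSI = Σ (cur_i - ref_i) · ln(cur_i / ref_i) over bins. The bin edges come from quantiles of the reference (training) data. The `eps` smoothing and the usual reading (below 0.1 stable, above 0.2 significant drift) are conventions, not a standard. Note that PSI only catches data drift. Concept drift needs labels and a performance metric.

```python
import numpy as np


def population_stability_index(reference, current, n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a reference (training) sample and a current (production) sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges from reference quantiles; the outer bins are open-ended,
    # so values outside the training range still land in the first/last bin
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(edges, reference, side="right"), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(edges, current, side="right"), minlength=n_bins)

    # Convert counts to proportions; eps avoids log(0) for empty bins
    ref_pct = ref_counts / ref_counts.sum() + eps
    cur_pct = cur_counts / cur_counts.sum() + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
print(population_stability_index(train, rng.normal(0, 1, 10_000)))  # same distribution: near 0
print(population_stability_index(train, rng.normal(1, 1, 10_000)))  # mean shifted by 1 sd: large
```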

Detecting drift with Evidently AI:

# Evidently 0.4.x-style API; newer Evidently releases reorganized these imports
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
import pandas as pd

# Reference data (training data)
reference_data = pd.read_parquet('/data/training/churn_features.parquet')

# Current data (production data)
current_data = pd.read_parquet('/data/production/latest_week.parquet')

# Create drift report
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset()
])

report.run(reference_data=reference_data, current_data=current_data)

# Save report
report.save_html('/reports/drift_report.html')

# Get drift score
drift_metrics = report.as_dict()
drift_detected = drift_metrics['metrics'][0]['result']['dataset_drift']

if drift_detected:
    print("⚠️ Data drift detected! Consider retraining model")
else:
    print("✅ No significant drift detected")

Production monitoring dashboard:

# monitoring/dashboard.py
import streamlit as st
import pandas as pd
from sklearn.metrics import precision_score, recall_score, roc_auc_score

st.title("ML Model Monitoring Dashboard")

# Load production predictions
predictions_df = pd.read_gbq("SELECT * FROM `project.dataset.predictions` WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)")

# Model performance over time
st.header("Model Performance Over Time")

daily_metrics = predictions_df.groupby('date').apply(lambda x: {
    'precision': precision_score(x['actual'], x['predicted']),
    'recall': recall_score(x['actual'], x['predicted']),
    'auc': roc_auc_score(x['actual'], x['predicted_proba'])
}).apply(pd.Series)

st.line_chart(daily_metrics)

# Alert threshold
auc_threshold = 0.75
if daily_metrics['auc'].tail(7).mean() < auc_threshold:
    st.error(f"⚠️ ALERT: 7-day average AUC ({daily_metrics['auc'].tail(7).mean():.3f}) below threshold ({auc_threshold})")
    st.info("🔄 Triggering automated retraining pipeline...")
    # Placeholder: call your orchestrator here (e.g. trigger the training DAG)

3.4. Model Deployment Options

Option 1: REST API (Flask/FastAPI)

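# serve.py
from fastapi import FastAPI
from feast import FeatureStore
import mlflow.sklearn
import pandas as pd

app = FastAPI()

# Load the model once at startup. mlflow.pyfunc's predict() on a scikit-learn
# classifier returns class labels (0/1), so load the sklearn flavor instead
# to get probabilities from predict_proba().
model = mlflow.sklearn.load_model('models:/churn-prediction/Production')

# Create the Feature Store client once rather than on every request
store = FeatureStore(repo_path='feature_repo/')

# Must match the columns (and order) the Production model was trained on
FEATURES = ['recency', 'frequency', 'monetary']

@app.post('/predict')
def predict(customer_id: int):
    # Fetch features from the online store
    features = store.get_online_features(
        features=[f'customer_features:{name}' for name in FEATURES],
        entity_rows=[{'customer_id': customer_id}]
    ).to_dict()

    # to_dict() returns {column: [values]}: build the DataFrame directly
    # and keep only the model's feature columns (drops customer_id)
    feature_df = pd.DataFrame(features)[FEATURES]

    # Probability of the positive (churn) class
    churn_probability = model.predict_proba(feature_df)[0, 1]

    return {
        'customer_id': customer_id,
        'churn_probability': float(churn_probability),
        'churn_risk': 'HIGH' if churn_probability > 0.7 else 'MEDIUM' if churn_probability > 0.4 else 'LOW'
    }

# Run: uvicorn serve:app --host 0.0.0.0 --port 8000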

Option 2: Batch Predictions (Cloud Storage)

# batch_predict.py
import mlflow.sklearn
import pandas as pd
from google.cloud import bigquery

# Load model (sklearn flavor, so predict_proba is available)
model = mlflow.sklearn.load_model('models:/churn-prediction/Production')

# Load customers to score
client = bigquery.Client()
query = """
    SELECT customer_id, recency, frequency, monetary
    FROM `project.dataset.customer_features`
    WHERE last_order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 180 DAY)
"""
customers_df = client.query(query).to_dataframe()

# Batch predict: probability of the positive (churn) class
customers_df['churn_probability'] = model.predict_proba(customers_df[['recency', 'frequency', 'monetary']])[:, 1]

# Save results
customers_df[['customer_id', 'churn_probability']].to_gbq(
    'project.dataset.churn_predictions',
    if_exists='replace'
)

print(f"✅ Scored {len(customers_df)} customers")

Option 3: Real-time Serving (Vertex AI)

# deploy_vertex_ai.py
from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

# Upload model to Vertex AI Model Registry
model = aiplatform.Model.upload(
    display_name='churn-prediction',
    artifact_uri='gs://my-bucket/models/churn/v2',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest'
)

# Deploy to endpoint
endpoint = model.deploy(
    machine_type='n1-standard-4',
    min_replica_count=1,
    max_replica_count=10,  # Auto-scaling
    traffic_percentage=100
)

print(f"✅ Model deployed to: {endpoint.resource_name}")

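# Prediction: the prebuilt scikit-learn container expects each instance as a
# list of feature values in training-column order (recency, frequency, monetary)
prediction = endpoint.predict(instances=[[45, 8, 2500]])

# The prebuilt container returns the output of the model's predict(), i.e. a
# class label; serving probabilities needs a custom prediction routine
print(f"Churn prediction: {prediction.predictions[0]}")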

4. Case Study: Vietnamese Fintech - Fraud Detection MLOps

4.1. Context

Company: Vietnamese digital lending platform (3M customers, 500K loans/month)

Challenge:

Fraud patterns evolve rapidly (new attack methods weekly)
Manual model updates took 2 weeks from training to deployment
Fraud detection model accuracy degraded 15% over 3 months → undetected
Data Scientists spent 60% of their time on deployment instead of improving models

Goal: Build an MLOps system to deploy fraud detection updates hourly and auto-retrain daily

4.2. Architecture

Before MLOps:

Data Scientist → Jupyter Notebook → pickle file
→ Email to Engineering → Manual deployment (2 weeks)
→ No monitoring

After MLOps:

BigQuery (transaction data)
    ↓
Feast Feature Store
    ↓
Kubeflow Pipeline (daily training)
    ↓
MLflow Model Registry
    ↓
Vertex AI Endpoint (auto-deploy)
    ↓
Evidently Monitoring → Alerts → Auto-retrain

4.3. Implementation

Step 1: Feature Store

Centralize 50+ fraud detection features:

# Features
- transaction_velocity_1h: Number of transactions in the last hour
- amount_deviation_30d: Deviation from the 30-day average amount
- device_fingerprint_new: Is this a new, unseen device?
- ip_country_mismatch: IP country differs from the registered country
- merchant_risk_score: Historical fraud rate of the merchant
- user_behavior_anomaly_score: ML-based anomaly score
...
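Two of the features above are time-window aggregates. This is a minimal batch sketch of how `transaction_velocity_1h` and `amount_deviation_30d` could be computed per user with pandas time-based rolling windows. The input columns (`user_id`, `ts`, `amount`) and the exact definitions are assumptions: velocity counts the current transaction, and deviation is the amount minus the user's mean over the previous 30 days, excluding the current row. In production these would typically be computed in the streaming/feature pipeline rather than in pandas.

```python
import pandas as pd


def add_fraud_features(tx: pd.DataFrame) -> pd.DataFrame:
    """Add velocity and amount-deviation features per user.

    Expects columns: user_id, ts (datetime64), amount.
    """
    frames = []
    # Sort by time first; groupby keeps that order inside each user's group
    for _, g in tx.sort_values("ts").groupby("user_id", sort=False):
        g = g.set_index("ts")
        # Transactions by this user in the window (ts - 1h, ts], including this one
        g["transaction_velocity_1h"] = g["amount"].rolling("1h").count()
        # Mean amount over [ts - 30 days, ts), excluding the current transaction
        mean_30d = g["amount"].rolling("30D", closed="left").mean()
        g["amount_deviation_30d"] = g["amount"] - mean_30d  # NaN for a user's first transaction
        frames.append(g.reset_index())
    return pd.concat(frames, ignore_index=True)


tx = pd.DataFrame({
    "user_id": [1, 1, 2, 1],
    "ts": pd.to_datetime(["2025-01-01 10:00", "2025-01-01 10:30",
                          "2025-01-01 10:15", "2025-01-01 12:00"]),
    "amount": [100.0, 100.0, 50.0, 400.0],
})
print(add_fraud_features(tx))
```

Using `closed="left"` for the 30-day mean keeps the current transaction out of its own baseline. Otherwise a single very large transaction would partly hide its own deviation.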

Step 2: Automated Daily Training

The Kubeflow pipeline runs daily at 3 AM:

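from kfp import dsl

# extract_labeled_data, train_xgboost_model, evaluate_model and deploy_to_vertex
# are pipeline components assumed to be defined elsewhere (e.g. with @dsl.component)
@dsl.pipeline(name='fraud-detection-training')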
def fraud_pipeline():
    # Extract yesterday's transactions + labels
    extract_op = extract_labeled_data(
        query="""
            SELECT * FROM transactions
            WHERE date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
            AND fraud_label IS NOT NULL
        """
    )

    # Train model
    train_op = train_xgboost_model(
        data=extract_op.output,
        params={'max_depth': 6, 'learning_rate': 0.1}
    )

    # Evaluate on holdout set
    eval_op = evaluate_model(train_op.output)

    # Deploy if AUC > 0.90 (threshold)
    with dsl.Condition(eval_op.outputs['auc'] > 0.90):
        deploy_op = deploy_to_vertex(train_op.output)

Step 3: Real-time Serving

The API endpoint responds in under 100 ms:

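# Assumes app (FastAPI), feature_store (feast.FeatureStore), model (a trained
# classifier), a pydantic Transaction model and `import pandas as pd` exist above
@app.post('/score-transaction')
def score_transaction(transaction: Transaction):
    # Get features from Feast
    features = feature_store.get_online_features(
        entity_rows=[{'transaction_id': transaction.id}],
        features=[
            'fraud_features:transaction_velocity_1h',
            'fraud_features:amount_deviation_30d',
            ...
        ]
    ).to_dict()

    # to_dict() returns {column: [values]}; drop the entity key before predicting
    feature_df = pd.DataFrame(features).drop(columns=['transaction_id'])

    # Predict: probability of the fraud class
    fraud_score = model.predict_proba(feature_df)[0][1]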

    # Decision rules
    if fraud_score > 0.95:
        return {'decision': 'BLOCK', 'score': fraud_score}
    elif fraud_score > 0.75:
        return {'decision': 'REVIEW', 'score': fraud_score}
    else:
        return {'decision': 'APPROVE', 'score': fraud_score}

Step 4: Monitoring & Auto-Retraining

Monitor performance hourly:

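# Check performance every hour. current_hour_precision, trigger_kubeflow_pipeline
# and send_slack_alert stand in for your own metric query and integrations.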
if current_hour_precision < 0.85:
    # Trigger immediate retraining
    trigger_kubeflow_pipeline('fraud-detection-training')
    send_slack_alert("🚨 Fraud model performance dropped. Retraining triggered.")
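The `current_hour_precision` value in the check above has to be computed from logged predictions joined with fraud labels, which often arrive late. This is a minimal sketch that assumes precision is measured on BLOCK decisions and that the prediction log has the hypothetical columns `scored_at`, `decision`, and `is_fraud` (missing while a label is pending). Rows without a label are excluded rather than counted as correct.

```python
import pandas as pd


def hourly_precision(log: pd.DataFrame) -> pd.Series:
    """Precision of BLOCK decisions per hour.

    Expects columns: scored_at (datetime64), decision ('BLOCK'/'REVIEW'/'APPROVE'),
    is_fraud (True/False, or None while the label has not arrived yet).
    """
    blocked = log[(log["decision"] == "BLOCK") & log["is_fraud"].notna()]
    # Precision = share of blocked transactions that really were fraud
    return (
        blocked.set_index("scored_at")["is_fraud"]
        .astype(float)
        .resample("h")
        .mean()
    )


log = pd.DataFrame({
    "scored_at": pd.to_datetime(["2025-06-01 10:05", "2025-06-01 10:40", "2025-06-01 10:50",
                                 "2025-06-01 11:10", "2025-06-01 11:20"]),
    "decision": ["BLOCK", "BLOCK", "APPROVE", "BLOCK", "BLOCK"],
    "is_fraud": [True, False, False, True, None],
})
print(hourly_precision(log))
```

The last value of this series would play the role of `current_hour_precision`. Because labels lag, it is worth also requiring a minimum number of labeled rows before acting on a low value.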

4.4. Results

Deployment Speed:

Before: 2 weeks per update
After: Hourly deployments (automated)

Model Freshness:

Before: Quarterly updates
After: Daily retraining on the previous day's data

Detection Performance:

Before: 78% precision, 65% recall (degrading)
After: 92% precision, 88% recall (consistently)

Fraud Adaptation:

Before: New fraud patterns detected after 2-4 weeks
After: Detected within 24 hours (daily retraining catches new patterns)

Cost Savings:

Prevented fraud: $8M/year (an additional $3M from faster detection)
Reduced false positives: 40% fewer legitimate transactions blocked → better UX

Data Science Productivity:

Before: 60% of time on deployment
After: 90% of time on model improvement → shipped 3 new fraud models in 6 months

5. Tools Landscape: Build vs Buy

5.1. All-in-One Platforms

Google Cloud Vertex AI

✅ Fully managed: Training, serving, monitoring
✅ AutoML for non-experts
✅ Feature Store built-in
✅ Tight integration with BigQuery, GCS
❌ Vendor lock-in
❌ Higher cost ($$$)
Best for: GCP-native companies, enterprises that need support

AWS SageMaker

✅ Comprehensive MLOps suite
✅ SageMaker Pipelines for orchestration
✅ Model Monitor for drift detection
❌ Complex setup
❌ AWS-only
Best for: AWS-heavy companies

Databricks ML

✅ End-to-end: data prep → training → serving
✅ Unity Catalog for governance
✅ Great for Spark workloads
❌ Expensive
Best for: Big data + ML workloads

5.2. Best-of-Breed (Open Source)

Recommended MVP Stack:

Component	Tool	Why
Experiment Tracking	MLflow	Industry standard, easy setup
Pipeline Orchestration	Apache Airflow	Flexible, Python-based, proven
Feature Store	Feast	Open-source, cloud-agnostic
Model Serving	FastAPI + Docker	Lightweight, full control
Monitoring	Evidently AI	Open source, drift detection
Infrastructure	Kubernetes	Scalable, portable

Setup cost: $0 (open source) + infrastructure cost
Time to MVP: 2-4 weeks
Best for: Startups, cost-conscious teams, teams that need flexibility

5.3. Decision Framework

Use an All-in-One Platform if:

✅ Budget > $50K/year for MLOps tools
✅ Need enterprise support
✅ Already committed to a cloud provider (GCP/AWS)
✅ Prefer less maintenance

Use Best-of-Breed if:

✅ Budget-constrained
✅ Want flexibility and no vendor lock-in
✅ Have engineering resources to maintain it
✅ Multi-cloud strategy

Vietnamese startup reality: Most should start with an open-source MVP and migrate to a managed platform as they scale.

6. Getting Started: Your MLOps MVP in 4 Weeks

Week 1: Experiment Tracking

Goal: Stop losing experiments

Tasks:

Setup MLflow server (Docker)
Migrate one model training script to log to MLflow
Train 10 experiments and compare them in the MLflow UI

Deliverable: All team members track experiments in MLflow

Week 2: Model Registry & Versioning

Goal: Reproducible model deployments

Tasks:

Setup MLflow Model Registry
Register models with stages (Dev/Staging/Production)
Deploy 1 model to production via registry

Deliverable: Production model served from registry, not local pickle files

Week 3: Automated Training Pipeline

Goal: Scheduled retraining

Tasks:

Setup Apache Airflow (or Prefect)
Convert training script to Airflow DAG
Schedule weekly training

Deliverable: Model auto-retrains every week, auto-registers in MLflow

Week 4: Basic Monitoring

Goal: Detect when model degrades

Tasks:

Setup Evidently AI monitoring
Track predictions in BigQuery/database
Daily drift report

Deliverable: Daily email report on model performance + drift

After 4 weeks: You have Level 1 MLOps, which is enough to deploy models reliably.

7. Common Pitfalls & Best Practices

❌ Pitfall 1: Boil the Ocean

Trying to implement every component at once → overwhelmed → fail

✅ Best Practice: Start small

Week 1-4: Experiment tracking
Week 5-8: Model registry
Week 9-12: Automated pipeline
Iterate from there

❌ Pitfall 2: Tools Before Process

Buying Databricks/SageMaker while the team still works manually

✅ Best Practice: Document process first

How do we train models?
How do we deploy?
How do we monitor?
Then automate that process

❌ Pitfall 3: Over-Engineering

Building a Kubernetes cluster for 2 models in production

✅ Best Practice: Match complexity to scale

1-5 models: FastAPI + Docker on a single server
5-20 models: Managed service (Vertex AI)
20+ models: Kubernetes + full MLOps

❌ Pitfall 4: No Monitoring

Deploying a model and forgetting it → performance degrades 30% and nobody notices

✅ Best Practice: Monitoring is non-negotiable

Minimum: Track daily prediction accuracy
Better: Automated drift detection
Best: Real-time performance dashboards

❌ Pitfall 5: Training-Serving Skew

Training uses pandas, production uses SQL → the features differ

✅ Best Practice: Feature Store

Single source of truth for features
Same code for training & serving
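Even before adopting a feature store, you can get "same code for training & serving" by defining each feature transformation once and calling it from both the batch training job and the online API. This is a minimal sketch. The function name `compute_rfm` and the input columns are hypothetical. The `as_of` cutoff lets the same function build historical training snapshots and score a live customer.

```python
from datetime import date

import pandas as pd


def compute_rfm(orders: pd.DataFrame, as_of: date) -> pd.DataFrame:
    """Single definition of RFM features, shared by training and serving.

    `orders` has columns customer_id, order_date (datetime64), order_total.
    Only orders strictly before `as_of` are used.
    """
    as_of_ts = pd.Timestamp(as_of)
    past = orders[orders["order_date"] < as_of_ts]
    grouped = past.groupby("customer_id")
    return pd.DataFrame({
        "recency": (as_of_ts - grouped["order_date"].max()).dt.days,
        "frequency": grouped.size(),
        "monetary": grouped["order_total"].sum(),
    })


orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2025-05-01", "2025-05-20", "2025-04-01"]),
    "order_total": [100.0, 50.0, 30.0],
})

# Training: all customers at a historical cutoff
train_features = compute_rfm(orders, date(2025, 6, 1))
# Serving: one customer, same function, so the same logic
serve_features = compute_rfm(orders[orders["customer_id"] == 1], date(2025, 6, 1))
print(train_features)
```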

8. ROI & Business Case for MLOps

8.1. The Cost of NOT Having MLOps

Scenario: A company with 5 ML models in production

Manual operations cost (per year):

Data Scientist time on deployment: 60% × $80K × 3 DS = $144K
Failed deployments: 2 per quarter × 4 quarters × $20K impact = $160K
Model performance degradation: 15% revenue impact = $450K (if ML contributes $3M in revenue)
Total cost: $754K/year

With MLOps ($100K investment):

Data Scientist focus on models: +40% productivity = $200K value
Prevent failed deployments: $160K saved
Maintain model performance: $450K saved
Total benefit: $810K/year

ROI = ($810K - $100K) / $100K = 710%
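Recomputing each line from its stated inputs is a quick sanity check for a business case like this one. The snippet below uses only the scenario's assumptions and takes the $200K productivity value as given rather than deriving it. Note that 15% of $3M comes to $450K.

```python
# Assumptions copied from the scenario (USD per year); integer math avoids float rounding.
ds_salary, ds_count, ds_deploy_share_pct = 80_000, 3, 60
failed_per_quarter, failed_impact = 2, 20_000
ml_revenue, degradation_pct = 3_000_000, 15
investment = 100_000
productivity_value = 200_000  # the scenario's own estimate, not derived here

deploy_time_cost = ds_deploy_share_pct * ds_salary * ds_count // 100
failed_cost = failed_per_quarter * 4 * failed_impact
degradation_cost = degradation_pct * ml_revenue // 100

total_cost = deploy_time_cost + failed_cost + degradation_cost
total_benefit = productivity_value + failed_cost + degradation_cost
roi_pct = (total_benefit - investment) * 100 // investment

print(deploy_time_cost, failed_cost, degradation_cost)  # 144000 160000 450000
print(total_cost, total_benefit, roi_pct)               # 754000 810000 710
```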

8.2. Metrics to Track

Business metrics:

Time to production: Weeks → Days → Hours
Model refresh rate: Quarterly → Weekly → Daily
Number of models in production: 1-5 → 10-50+
Data Scientist productivity: % time on model improvement (should increase)

Technical metrics:

Deployment success rate: Target > 95%
Model performance stability: AUC varies by < 5% (relative) across retraining runs
Incident response time: Hours → Minutes
Training pipeline uptime: Target > 99%
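Two of these technical targets can be computed from data the pipeline already has: a log of deployment outcomes and the AUC of each retraining run. "AUC variance < 5%" is ambiguous. The sketch below reads it as the relative spread across runs (max minus min, divided by the mean), which is an assumption rather than a standard definition. The deployment log and AUC values are toy data.

```python
from statistics import mean


def deployment_success_rate(outcomes):
    """outcomes: one boolean per deployment attempt (True = succeeded without rollback)."""
    return sum(outcomes) / len(outcomes)


def auc_relative_spread(aucs):
    """(max - min) / mean AUC across retraining runs: one reading of 'AUC variance < 5%'."""
    return (max(aucs) - min(aucs)) / mean(aucs)


deploys = [True] * 39 + [False]      # toy log: 1 failure in 40 deployments
run_aucs = [0.91, 0.92, 0.90, 0.93]  # toy AUCs from four weekly retrains

print(deployment_success_rate(deploys) > 0.95)  # True (0.975)
print(auc_relative_spread(run_aucs) < 0.05)     # True (about 0.033)
```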

9. The Future of MLOps: Trends

9.1. AutoML + MLOps

Automated model selection and hyperparameter tuning → MLOps pipelines that optimize themselves

Tools: H2O AutoML, Google Vertex AI AutoML, DataRobot

9.2. LLMOps

MLOps for Large Language Models:

Prompt versioning
Fine-tuning pipelines
LLM monitoring (hallucination detection, toxicity)

Tools: LangChain, PromptLayer, Weights & Biases for LLMs
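Prompt versioning can borrow the idea behind content-addressed storage. Hash the template together with its parameters: identical prompts then always get the same ID, and any edit produces a new one. The `PromptStore` class below is a hypothetical in-memory sketch of that idea. It is not the API of PromptLayer, LangChain or W&B.

```python
import hashlib
import json


class PromptStore:
    """Hypothetical content-addressed prompt registry (not a real library's API)."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of (version_id, template, params)

    def save(self, name, template, params=None):
        params = params or {}
        payload = json.dumps({"template": template, "params": params}, sort_keys=True)
        # Same template + params always hash to the same ID; any edit yields a new one.
        version_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        history = self._versions.setdefault(name, [])
        if all(vid != version_id for vid, _, _ in history):
            history.append((version_id, template, params))
        return version_id

    def get(self, name, version_id=None):
        """Return (version_id, template, params); the most recently added one by default."""
        history = self._versions[name]
        if version_id is None:
            return history[-1]
        for entry in history:
            if entry[0] == version_id:
                return entry
        raise KeyError(f"{name}@{version_id} not found")


store = PromptStore()
v1 = store.save("fraud_explainer", "Explain why transaction {txn_id} looks risky.", {"temperature": 0.2})
v2 = store.save("fraud_explainer", "Explain briefly why transaction {txn_id} looks risky.", {"temperature": 0.2})
print(v1 != v2)                              # True: the edit created a new version
print(store.get("fraud_explainer", v1)[1])   # pin an exact version in production
```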

9.3. Real-time ML

Shift from batch to streaming ML:

Online learning (model updates in real time)
Feature computation in the stream (Kafka, Flink)

Use cases: Fraud detection, recommendation systems, dynamic pricing
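Online learning means the model's parameters are updated after each incoming event, not in a periodic batch retrain. Below is a minimal sketch using logistic regression trained by per-example stochastic gradient descent. Streaming libraries such as River expose a similar `learn_one` style interface, but the class below is hand-written, and the toy two-feature stream is an assumption.

```python
import math
import random


class OnlineLogisticRegression:
    """Logistic regression updated one event at a time with per-example SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        # Numerically stable sigmoid.
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)

    def learn_one(self, x, y):
        # Gradient of the log-loss for a single example is (p - y) * x.
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


# Toy stream: two scaled features per event, label 1 when their sum exceeds 1.2.
random.seed(42)
model = OnlineLogisticRegression(n_features=2, lr=0.5)
for _ in range(2000):
    x = [random.random(), random.random()]
    y = 1 if x[0] + x[1] > 1.2 else 0
    model.learn_one(x, y)  # update immediately, no batch retrain

print(model.predict_proba([0.9, 0.9]), model.predict_proba([0.1, 0.1]))
```

The trade-off is that every event changes the model in production. Real deployments usually pair online updates with the same monitoring and rollback controls used for batch retraining.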

9.4. ML Governance & Compliance

Especially relevant for Fintech and Healthcare:

Model explainability (SHAP, LIME)
Bias detection
Audit trails
Regulatory compliance (GDPR, State Bank of Vietnam (SBV) regulations)

Tools: IBM AI Fairness 360, Microsoft Fairlearn
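Demographic parity is one common bias check. It compares the rate of positive predictions (for example, loan approvals) across groups of a sensitive attribute. Fairlearn ships a `demographic_parity_difference` metric for this. The dependency-free sketch below computes the same quantity by hand, as the largest minus the smallest group selection rate. The data is toy data.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Share of positive predictions (e.g. 'loan approved') within each group."""
    positives, totals = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(y_pred, groups):
    """Largest minus smallest group selection rate; 0 means parity."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


y_pred = [1, 1, 0, 1, 0, 0, 1, 0]                  # toy approval decisions
region = ["A", "A", "A", "A", "B", "B", "B", "B"]  # toy sensitive attribute
print(selection_rates(y_pred, region))                # {'A': 0.75, 'B': 0.25}
print(demographic_parity_difference(y_pred, region))  # 0.5
```

A gap this large would normally trigger a review before deployment. Which threshold counts as acceptable is a policy decision, not something the metric decides.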

Conclusion

MLOps is not a luxury; it is a necessity for scaling ML in production.

Key takeaways:

85% of ML models fail not because the models are bad, but because of a lack of MLOps
Start small: Experiment tracking (Week 1) → Registry (Week 2) → Pipeline (Week 3) → Monitoring (Week 4)
Match tools to scale: Open-source MVP → Managed platform as you scale
Monitoring is critical: Model performance degrades over time, so you need to detect it and retrain
Automation is key: Manual operations do not scale beyond 5-10 models

Next steps:

✅ Read Customer Churn Prediction to understand an end-to-end ML project
✅ Read From BI to AI to assess your analytics maturity
✅ Set up MLflow experiment tracking this week
✅ Document current deployment process → identify automation opportunities

Need help? Carptech has implemented MLOps for 10+ Vietnamese companies (Fintech, E-commerce, Logistics). Book a free consultation to discuss an MLOps roadmap for your company.

Related Posts:

Feature Store: The Foundation for Scaling Machine Learning

Case Study: A Manufacturer Saves $5 Million with Predictive Maintenance

Customer Segmentation: Advanced Techniques for Personalization & Targeting
