TL;DR
MLOps (Machine Learning Operations) is the discipline that combines ML, DevOps, and Data Engineering to take ML models to production reliably and at scale. According to VentureBeat, 85% of ML models never make it to production, and MLOps exists to close that gap.
Core components of MLOps:
- Experiment Tracking: MLflow, Weights & Biases to track experiments
- Feature Store: Feast, Tecton to reuse features across models
- Model Registry: Versioning, staging, metadata management
- Training Pipeline: Automated, scheduled retraining
- Deployment: REST APIs, batch predictions, real-time serving
- Monitoring: Model performance, data drift, concept drift detection
- CI/CD: Automated testing, deployment, rollback
Case study, Vietnamese fintech: rolling out MLOps for a fraud detection model:
- Before: 2 weeks to deploy a model update, manual retraining every month
- After: Hourly deployment of updates, automatic daily retraining on new fraud patterns
- Result: Fraud detected 70% faster; the model adapts to new attack patterns within a day
This article walks you through the 4 MLOps maturity levels and how to build an MVP MLOps system with open-source tools.
1. Why MLOps? The Production Gap
1.1. Jupyter Notebook → Production: The Valley of Death
A familiar scenario at Vietnamese companies:
Week 1-4: A Data Scientist builds a model in a Jupyter notebook
- Accuracy: 92% on the test set
- Leadership is excited: "Ship it now!"
Week 5-8: The engineering team tries to productionize it
- The notebook code doesn't run on the server
- Hardcoded paths: /Users/datascientist/Downloads/data.csv
- Library conflicts: the notebook uses pandas 1.5, the server has 1.3
- The model file is 2GB, and nobody knows how to deploy it
Week 9-12: After a lot of debugging
- It finally gets deployed... but accuracy drops to 75%
- Because the training data is already 3 months old
- No monitoring → nobody knows how the model is performing
Week 13+: The model is "forgotten"
- No retraining schedule
- Performance degradation goes undetected
- 6 months later, the model's predictions are completely wrong
85% of ML projects end up here (VentureBeat, 2019).
1.2. How Is MLOps Different from DevOps?
MLOps = DevOps + Data + Models
| Aspect | Traditional DevOps | MLOps |
|---|---|---|
| Code | Version control (Git) | ✅ Same |
| Data | N/A | ✅ Data versioning (DVC) |
| Models | N/A | ✅ Model versioning |
| Testing | Unit tests, integration tests | ✅ Same + data validation + model tests |
| Deployment | Blue-green, canary | ✅ Same + A/B testing models |
| Monitoring | Server metrics, logs | ✅ Same + model performance + drift |
| Dependencies | requirements.txt, Docker | ✅ Same + data pipelines |
Key difference: ML systems have three moving parts (code, data, model) instead of just code.
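Because a trained model is a function of all three parts, one way to make it traceable is to derive its version ID from all of them. The sketch below is a minimal illustration, assuming the code version is a Git commit hash and the training data is a single file on disk; `model_fingerprint` is a hypothetical helper, and in practice tools like DVC and MLflow record this lineage for you.

```python
import hashlib
import json
from pathlib import Path


def model_fingerprint(code_commit: str, data_path: str, params: dict) -> str:
    """Short ID that changes whenever the code, the data, or the hyperparameters change."""
    parts = [
        code_commit.encode(),                            # code: e.g. a Git commit hash
        Path(data_path).read_bytes(),                    # data: the exact training file
        json.dumps(params, sort_keys=True).encode(),     # model config, key-order independent
    ]
    # Hash each part separately so boundaries between parts cannot be confused
    combined = b"".join(hashlib.sha256(p).digest() for p in parts)
    return hashlib.sha256(combined).hexdigest()[:12]
```

Logging this ID next to the model (for example as an MLflow tag) lets you answer "which code and which data produced this model?" months later.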
1.3. Business Impact of MLOps
1. Time to Production
- Without MLOps: 3-6 months to deploy 1 model
- With MLOps: 1-2 weeks
2. Model Performance
- Without monitoring: Model degradation 10-30% per year
- With MLOps: Degradation is detected and the model is retrained in time
3. Cost Efficiency
- Manual operations: Data Scientists spend 60% of their time on deployment
- Automated MLOps: 80% of their time goes to model improvement
4. Scalability
- Manual: Maximum 5-10 models in production
- MLOps: 50-500+ models
Case study: Vietnamese e-commerce company (500M GMV/month):
- Deployed 12 ML models with MLOps:
  - Product recommendations (3 models)
  - Churn prediction
  - Demand forecasting (per category)
  - Fraud detection
  - Price optimization
  - Customer segmentation
- Before MLOps: 1 model in production (recommendations), updated quarterly
- After MLOps: 12 models, automated retraining weekly, hourly deployments
- Impact: 25% increase in conversion rate, $2M annual revenue increase
2. MLOps Maturity Levels: Your Roadmap
Google's MLOps guidance describes maturity levels 0 to 2; this article adds a Level 3 for fully automated, monitoring-driven MLOps:
Level 0: Manual Process (90% of Vietnamese companies are here)
Characteristics:
- Data Scientists work in notebooks
- Manual data collection and preprocessing
- Train model locally, save pickle file
- Deploy = copy file to server, write Flask API manually
- No monitoring; manual retraining whenever someone remembers
Tools: Jupyter, pandas, scikit-learn, Flask
Pros: Fast for POCs and experiments
Cons: Not reproducible, not scalable, high failure rate
When acceptable: Research, one-off analysis, POCs
Level 1: ML Pipeline Automation
Characteristics:
- Automated training pipeline: scheduled retraining
- Data validation (check schema, distributions)
- Feature engineering pipeline
- Model versioning
- Automated deployment of new model versions
Tools:
- Pipeline orchestration: Apache Airflow, Prefect, Kubeflow
- Model registry: MLflow, custom
Improvement: Reproducible training, scheduled updates
Example pipeline (Apache Airflow):
```python
# airflow_ml_pipeline.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'data-science',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'churn_prediction_training',
    default_args=default_args,
    schedule_interval='0 2 * * 0',  # Weekly at 2 AM Sunday (Airflow 2.4+ also accepts `schedule=`)
    start_date=datetime(2025, 5, 1),
    catchup=False,
)

# NOTE: tasks hand data to each other through local files under /data, which assumes
# all tasks run on the same worker (or /data is a shared volume). Use GCS/S3 otherwise.


def extract_data():
    """Extract customer data from BigQuery."""
    from datetime import date
    from google.cloud import bigquery

    client = bigquery.Client()
    # Features only use events BEFORE the cutoff; the label looks at the 90 days AFTER it.
    # (If both came from the same window, churned would simply equal recency > 90.)
    cutoff = date.today() - timedelta(days=90)
    query = """
    SELECT
      customer_id,
      -- RFM features (before cutoff)
      DATE_DIFF(@cutoff, MAX(IF(order_date < @cutoff, order_date, NULL)), DAY) AS recency,
      COUNT(DISTINCT IF(order_date < @cutoff, order_id, NULL)) AS frequency,
      SUM(IF(order_date < @cutoff, order_total, 0)) AS monetary,
      -- Engagement features (before cutoff)
      COUNT(DISTINCT IF(login_date < @cutoff, login_date, NULL)) AS login_days,
      AVG(IF(login_date < @cutoff, session_duration, NULL)) AS avg_session,
      -- Target: no order in the 90 days after the cutoff
      IF(COUNTIF(order_date >= @cutoff) = 0, 1, 0) AS churned
    FROM `project.dataset.customer_events`
    WHERE order_date >= DATE_SUB(@cutoff, INTERVAL 2 YEAR)
    GROUP BY customer_id
    HAVING COUNTIF(order_date < @cutoff) > 0  -- only customers active before the cutoff
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter('cutoff', 'DATE', cutoff)]
    )
    df = client.query(query, job_config=job_config).to_dataframe()
    df.to_parquet('/data/raw/churn_data.parquet')
    print(f"Extracted {len(df)} customers")


def validate_data():
    """Validate data quality (legacy Great Expectations API, great_expectations < 1.0)."""
    import pandas as pd
    import great_expectations as ge

    df = pd.read_parquet('/data/raw/churn_data.parquet')
    # Wrap the DataFrame so expectations can be declared on it
    ge_df = ge.from_pandas(df)

    # Define expectations
    ge_df.expect_column_values_to_not_be_null('customer_id')
    ge_df.expect_column_values_to_be_between('recency', 0, 730)
    ge_df.expect_column_values_to_be_between('frequency', 1, 1000)
    ge_df.expect_column_mean_to_be_between('churned', 0.05, 0.30)

    # Validate all declared expectations; raising fails the task and stops the DAG run
    results = ge_df.validate()
    if not results.success:
        raise ValueError(f"Data validation failed: {results}")
    print("✅ Data validation passed")


def train_model():
    """Train churn prediction model."""
    import pandas as pd
    import mlflow
    import mlflow.sklearn
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score, precision_score, recall_score

    # Load data
    df = pd.read_parquet('/data/raw/churn_data.parquet')

    # Split features and target
    feature_cols = ['recency', 'frequency', 'monetary', 'login_days', 'avg_session']
    X = df[feature_cols].fillna(0)  # customers without logins have NULL avg_session
    y = df['churned']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    params = {'n_estimators': 100, 'max_depth': 10, 'min_samples_split': 100}

    # Start MLflow run
    mlflow.set_experiment('churn-prediction')
    with mlflow.start_run() as run:
        # Train model
        model = RandomForestClassifier(**params, random_state=42)
        model.fit(X_train, y_train)

        # Evaluate
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_pred_proba)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)

        # Log parameters and metrics
        mlflow.log_params(params)
        mlflow.log_metric('auc', auc)
        mlflow.log_metric('precision', precision)
        mlflow.log_metric('recall', recall)

        # Log model
        mlflow.sklearn.log_model(model, 'model')
        print(f"✅ Model trained - AUC: {auc:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}")

        # Capture the run ID while the run is active
        # (mlflow.active_run() returns None once the `with` block has exited)
        run_id = run.info.run_id

    # Save the run ID for the deploy task
    with open('/data/models/latest_run_id.txt', 'w') as f:
        f.write(run_id)


def deploy_model():
    """Register the new model and promote it to Production."""
    import mlflow
    from mlflow.tracking import MlflowClient

    # Load latest run ID
    with open('/data/models/latest_run_id.txt', 'r') as f:
        run_id = f.read().strip()

    # Register the run's model in the Model Registry (creates a new version)
    model_uri = f'runs:/{run_id}/model'
    model_version = mlflow.register_model(model_uri, 'churn-prediction')

    # Promote exactly the version registered above. Stages are deprecated in recent
    # MLflow 2.x releases in favor of aliases, but they still work and match the
    # 'models:/churn-prediction/Production' URIs used later in this article.
    client = MlflowClient()
    client.transition_model_version_stage(
        name='churn-prediction',
        version=model_version.version,
        stage='Production',
        archive_existing_versions=True,  # keep a single Production version
    )
    print(f"✅ Model version {model_version.version} deployed to Production")


# Define tasks
task_extract = PythonOperator(
    task_id='extract_data',
    python_callable=extract_data,
    dag=dag,
)
task_validate = PythonOperator(
    task_id='validate_data',
    python_callable=validate_data,
    dag=dag,
)
task_train = PythonOperator(
    task_id='train_model',
    python_callable=train_model,
    dag=dag,
)
task_deploy = PythonOperator(
    task_id='deploy_model',
    python_callable=deploy_model,
    dag=dag,
)

# Define dependencies
task_extract >> task_validate >> task_train >> task_deploy
```
Result: Automated weekly retraining and a reproducible pipeline.
Level 2: CI/CD Pipeline Automation
Characteristics:
- Continuous Integration: Automated testing for code, data, models
- Continuous Delivery: Automated deployment with approval gates
- Feature store: Centralized, reusable features
- Model versioning and staging (Dev → Staging → Production)
- Automated rollback if model performance drops
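Approval gates and automated rollback both come down to comparing the candidate model with the one currently in production. The sketch below shows that decision, assuming both models are scored on the same holdout set and AUC is the deciding metric; `EvalResult`, `should_promote`, `pick_serving_version`, and the `min_auc`/`min_gain` thresholds are hypothetical names and values, not part of any tool listed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    version: str
    auc: float  # measured on the same holdout set for every model


def should_promote(candidate: EvalResult, production: Optional[EvalResult],
                   min_auc: float = 0.75, min_gain: float = 0.005) -> bool:
    """Gate: may the candidate replace the current production model?"""
    if candidate.auc < min_auc:       # absolute quality floor
        return False
    if production is None:            # first deployment, nothing to beat
        return True
    # Demand a small margin so evaluation noise alone doesn't swap models back and forth
    return candidate.auc >= production.auc + min_gain


def pick_serving_version(candidate: EvalResult,
                         production: Optional[EvalResult]) -> Optional[str]:
    """Serve the candidate if it passes the gate, otherwise stay on (roll back to) production."""
    if should_promote(candidate, production):
        return candidate.version
    return production.version if production else None
```

In a CI/CD pipeline this function would run after evaluation; its result decides whether the registry is updated or the previous version keeps serving.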
Tools:
- CI/CD: GitHub Actions, GitLab CI, Jenkins
- Feature store: Feast, Tecton
- Model registry: MLflow, Vertex AI Model Registry
Example CI/CD (GitHub Actions):
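```yaml
# .github/workflows/ml-pipeline.yml
name: ML Training Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 0'  # Weekly at 2 AM Sunday (UTC)
  workflow_dispatch:  # Manual trigger

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      # Writes the service-account key to a file and sets GOOGLE_APPLICATION_CREDENTIALS
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - name: Validate data quality
        run: python scripts/validate_data.py

  model-training:
    needs: data-validation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - name: Train model
        run: python scripts/train_model.py
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}
      - name: Evaluate model
        id: evaluate
        # evaluate_model.py must print one JSON line, e.g. {"auc": 0.85, "precision": 0.78}
        run: echo "metrics=$(python scripts/evaluate_model.py)" >> "$GITHUB_OUTPUT"
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}
      - name: Check performance threshold
        env:
          METRICS: ${{ steps.evaluate.outputs.metrics }}
        run: |
          python -c "
          import json, os, sys
          metrics = json.loads(os.environ['METRICS'])
          if metrics['auc'] < 0.75:
              print('❌ Model AUC below threshold')
              sys.exit(1)
          print('✅ Model meets performance threshold')
          "

  model-deployment:
    needs: model-training
    runs-on: ubuntu-latest
    environment: production  # Requires approval (configure required reviewers on the environment)
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - name: Deploy to Vertex AI
        run: python scripts/deploy_vertex_ai.py
      - name: Run smoke tests
        run: python scripts/test_deployed_model.py

  monitoring:
    needs: model-deployment
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Setup monitoring
        run: python scripts/setup_monitoring.py
```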
Level 3: Automated MLOps (The Goal)
Characteristics:
- Full automation: Training, deployment, monitoring, retraining
- Monitoring-driven retraining: Trigger retraining when performance degradation is detected
- Data drift detection: Automatic alerts
- Concept drift detection: Model performance monitoring
- Auto-rollback: Revert to previous version if new model underperforms
- Multi-model management: Serve 10s-100s models
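Monitoring-driven retraining combines the triggers in the architecture diagram below (scheduled, drift, manual) with a performance floor. The sketch below shows that decision, assuming the monitoring job already produces a data-drift flag and a recent AUC; the function name, the 7-day schedule, and the 0.75 floor are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional


def retrain_reason(now: datetime, last_trained: datetime, drift_detected: bool,
                   recent_auc: Optional[float], manual_request: bool = False,
                   max_age: timedelta = timedelta(days=7),
                   auc_floor: float = 0.75) -> Optional[str]:
    """Return why retraining should start, or None if it is not needed."""
    if manual_request:
        return "manual"
    # Concept drift shows up as falling performance once labels arrive
    if recent_auc is not None and recent_auc < auc_floor:
        return "performance"
    # Data drift can be detected before labels arrive
    if drift_detected:
        return "data_drift"
    # Fall back to the regular schedule
    if now - last_trained >= max_age:
        return "scheduled"
    return None
```

Returning a reason instead of a bare boolean makes it easy to tag the triggered pipeline run, so you can later see why each model version was trained.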
Tools:
- Integrated platforms: Databricks ML, AWS SageMaker, GCP Vertex AI
- Monitoring: Evidently AI, WhyLabs, Arize AI
Architecture diagram:
```text
┌─────────────────┐
│  Data Sources   │
│ (BigQuery, S3)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│  Feature Store  │◄─────┤ Feature Eng  │
│ (Feast, Tecton) │      │   Pipeline   │
└────────┬────────┘      └──────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│    Training     │◄─────┤ Trigger:     │
│    Pipeline     │      │ - Scheduled  │
│   (Kubeflow)    │      │ - Drift      │
└────────┬────────┘      │ - Manual     │
         │               └──────────────┘
         ▼
┌─────────────────┐
│ Model Registry  │
│    (MLflow)     │
│  - Dev          │
│  - Staging      │
│  - Production   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│  Model Serving  │─────►│ Predictions  │
│   (Vertex AI)   │      │     API      │
└────────┬────────┘      └──────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│   Monitoring    │─────►│  Alerts &    │
│ - Performance   │      │  Retraining  │
│ - Data drift    │      │  Triggers    │
│ - Concept drift │      └──────────────┘
└─────────────────┘
```
3. Core Components of the MLOps Stack
3.1. Experiment Tracking (MLflow)
Problem: Data Scientists run 100+ experiments → "experiment 47 had the highest accuracy, but nobody remembers which hyperparameters it used"
Solution: MLflow Tracking
Setup MLflow:
```bash
# Install
pip install mlflow

# Start MLflow server
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000
```
Example usage:
```python
import itertools

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Log to the MLflow server started above (otherwise runs go to a local ./mlruns folder)
mlflow.set_tracking_uri('http://localhost:5000')
# Set experiment
mlflow.set_experiment('churn-prediction-tuning')

# Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_split': [50, 100, 200],
}

# Grid search: one MLflow run per combination (27 runs).
# X_train, X_test, y_train, y_test come from a train_test_split as in the Airflow example.
for n_est, depth, min_samples in itertools.product(*param_grid.values()):
    with mlflow.start_run():
        # Log parameters
        mlflow.log_params({
            'n_estimators': n_est,
            'max_depth': depth,
            'min_samples_split': min_samples,
        })

        # Train model
        model = RandomForestClassifier(
            n_estimators=n_est,
            max_depth=depth,
            min_samples_split=min_samples,
            random_state=42,
        )
        model.fit(X_train, y_train)

        # Evaluate
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_pred_proba)

        # Log metrics and model
        mlflow.log_metric('auc', auc)
        mlflow.sklearn.log_model(model, 'model')
        print(f"n_est={n_est}, depth={depth}, min_samples={min_samples} → AUC={auc:.3f}")

# Find best run (look up the experiment by name instead of hardcoding its ID)
best_run = mlflow.search_runs(
    experiment_names=['churn-prediction-tuning'],
    order_by=['metrics.auc DESC'],
    max_results=1,
)
print(f"Best run: {best_run[['params.n_estimators', 'params.max_depth', 'metrics.auc']]}")
```
MLflow UI: http://localhost:5000
3.2. Feature Store (Feast)
Problem:
- 10 models all use the "customer lifetime value" feature → it gets computed 10 times
- Training computes features one way, production computes them another way → training-serving skew
- Features không reusable across teams
Solution: Centralized Feature Store
Feast example:
```python
# feature_repo/features.py
# Written against the current Feast API (Field + feast.types); older releases used
# Feature/ValueType and `event_timestamp_column` instead.
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Define entity (join key: customer_id)
customer = Entity(
    name='customer',
    join_keys=['customer_id'],
    description='Customer ID',
)

# Define data source
customer_features_source = FileSource(
    path='/data/customer_features.parquet',
    timestamp_field='event_timestamp',
)

# Define feature view
customer_features = FeatureView(
    name='customer_features',
    entities=[customer],
    ttl=timedelta(days=7),  # how far back Feast looks for a value when joining
    schema=[
        Field(name='recency', dtype=Int64),
        Field(name='frequency', dtype=Int64),
        Field(name='monetary', dtype=Float32),
        Field(name='clv', dtype=Float32),
        Field(name='churn_score', dtype=Float32),
    ],
    source=customer_features_source,
)
```
Feature retrieval (training):
```python
from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path='feature_repo/')  # after running `feast apply` in feature_repo/

# Entity dataframe: which customers, and "as of" which timestamp
entity_df = pd.DataFrame({
    'customer_id': [1001, 1002, 1003],
    'event_timestamp': pd.to_datetime(['2025-05-27'] * 3),
})

# Get historical features (point-in-time correct)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        'customer_features:recency',
        'customer_features:frequency',
        'customer_features:monetary',
        'customer_features:clv',
    ],
).to_df()
print(training_df)
```
Feature retrieval (production):
```python
# Get online features (low latency).
# The online store must be populated first (`feast materialize` / `feast materialize-incremental`).
online_features = store.get_online_features(
    features=[
        'customer_features:recency',
        'customer_features:frequency',
        'customer_features:monetary',
        'customer_features:clv',
    ],
    entity_rows=[{'customer_id': 1001}],
).to_dict()
print(online_features)
# Example output: {'customer_id': [1001], 'recency': [15], 'frequency': [12], ...}
```
Benefits:
- Reusability: Compute once, reuse across many models
- Consistency: Training and serving use the same features
- Point-in-time correctness: Avoid data leakage
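Point-in-time correctness means each training row only sees feature values that were already known at that row's timestamp. Feast handles this inside `get_historical_features`. The sketch below shows the same idea with `pandas.merge_asof`, a swapped-in technique for illustration rather than how Feast is implemented, using made-up values.

```python
import pandas as pd

# Feature snapshots: the value of `clv` as computed on each date
features = pd.DataFrame({
    "customer_id": [1001, 1001, 1002],
    "feature_ts": pd.to_datetime(["2025-05-01", "2025-05-20", "2025-05-10"]),
    "clv": [100.0, 250.0, 80.0],
})

# Training events, with the time each prediction would have been made
events = pd.DataFrame({
    "customer_id": [1001, 1001, 1002],
    "event_timestamp": pd.to_datetime(["2025-05-15", "2025-05-27", "2025-05-05"]),
})

# For each event, take the latest feature value at or before event_timestamp.
# A plain join on customer_id would leak the 2025-05-20 value into the 2025-05-15 row.
training_df = pd.merge_asof(
    events.sort_values("event_timestamp"),
    features.sort_values("feature_ts"),
    left_on="event_timestamp",
    right_on="feature_ts",
    by="customer_id",
    direction="backward",
)
print(training_df[["customer_id", "event_timestamp", "clv"]])
```

Customer 1002's event on 2025-05-05 gets no `clv`, because that value was only computed on 2025-05-10. Feast also ignores values older than the view's `ttl`; `merge_asof(..., tolerance=...)` would mimic that.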
3.3. Model Monitoring & Drift Detection
Two types of drift:
1. Data Drift: The distribution of the input data changes
   - Example: COVID-19 → customer behavior changed completely
2. Concept Drift: The relationship between X and y changes
   - Example: Fraud patterns evolve → the old fraud detection model stops working well
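One simple way to quantify data drift on a single numeric feature is the Population Stability Index (PSI). Evidently runs its own per-column statistical tests, so this is a swapped-in illustration of the idea, not what `DataDriftPreset` computes. The sketch assumes quantile bins taken from the reference (training) data. The often-quoted thresholds (below 0.1 stable, above 0.25 significant shift) are a rule of thumb, not a standard.

```python
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index: sum over bins of (cur% - ref%) * ln(cur% / ref%)."""
    # Interior bin edges at reference quantiles, so each bin holds ~equal reference mass
    interior = np.unique(np.quantile(reference, np.linspace(0, 1, bins + 1)[1:-1]))
    n_bins = len(interior) + 1
    # searchsorted maps values below/above the training range into the outer bins
    ref_pct = np.bincount(np.searchsorted(interior, reference, side="right"),
                          minlength=n_bins) / len(reference)
    cur_pct = np.bincount(np.searchsorted(interior, current, side="right"),
                          minlength=n_bins) / len(current)
    # Avoid log(0) for empty bins
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(0)
train_recency = rng.normal(30, 10, 10_000)      # synthetic "training" distribution
prod_recency = rng.normal(40, 10, 10_000)       # synthetic shifted production data
print(f"PSI = {psi(train_recency, prod_recency):.3f}")
```

PSI only detects data drift. Concept drift needs labels, so it shows up as falling precision, recall, or AUC on recent traffic.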
Detect drift with Evidently AI:
```python
# Uses Evidently's legacy Report API (e.g. 0.4.x); newer releases reorganized these imports.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
import pandas as pd

# Reference data (training data)
reference_data = pd.read_parquet('/data/training/churn_features.parquet')
# Current data (production data)
current_data = pd.read_parquet('/data/production/latest_week.parquet')

# Create drift report
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
])
report.run(reference_data=reference_data, current_data=current_data)

# Save report
report.save_html('/reports/drift_report.html')

# Get drift result: the first metric of DataDriftPreset is the dataset-level drift summary
drift_metrics = report.as_dict()
drift_detected = drift_metrics['metrics'][0]['result']['dataset_drift']

if drift_detected:
    print("⚠️ Data drift detected! Consider retraining model")
else:
    print("✅ No significant drift detected")
```
Production monitoring dashboard:
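```python
# monitoring/dashboard.py
import pandas as pd
import streamlit as st
from google.cloud import bigquery
from sklearn.metrics import precision_score, recall_score, roc_auc_score

st.title("ML Model Monitoring Dashboard")

# Load the last 30 days of production predictions (joined with actual outcomes)
client = bigquery.Client()
predictions_df = client.query("""
    SELECT * FROM `project.dataset.predictions`
    WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
""").to_dataframe()


def daily_scores(day: pd.DataFrame) -> pd.Series:
    """Precision, recall and AUC for one day of predictions."""
    return pd.Series({
        'precision': precision_score(day['actual'], day['predicted'], zero_division=0),
        'recall': recall_score(day['actual'], day['predicted'], zero_division=0),
        # AUC is undefined when a day contains only one class
        'auc': (roc_auc_score(day['actual'], day['predicted_proba'])
                if day['actual'].nunique() > 1 else float('nan')),
    })


# Model performance over time
st.header("Model Performance Over Time")
daily_metrics = predictions_df.groupby('date').apply(daily_scores)
st.line_chart(daily_metrics)

# Alert threshold (mean() skips days with undefined AUC)
auc_threshold = 0.75
recent_auc = daily_metrics['auc'].tail(7).mean()
if recent_auc < auc_threshold:
    st.error(f"⚠️ ALERT: 7-day average AUC ({recent_auc:.3f}) below threshold ({auc_threshold})")
    # The dashboard only surfaces the alert: call your pipeline trigger here, or better,
    # from a scheduled job so retraining doesn't depend on someone opening the page
    st.info("🔄 Triggering automated retraining pipeline...")
```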
3.4. Model Deployment Options
Option 1: REST API (Flask/FastAPI)
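```python
# serve.py
import mlflow.sklearn
import pandas as pd
from fastapi import FastAPI
from feast import FeatureStore

# Must match the columns (and order) the registered model was trained on.
# NOTE: the Airflow example above trained on five features; retrain on these three
# (or also fetch login_days and avg_session) before serving with this list.
FEATURES = ['recency', 'frequency', 'monetary']

app = FastAPI()

# Load model and feature store once at startup, not on every request.
# mlflow.sklearn (not pyfunc) so predict_proba is available; pyfunc.predict returns class labels.
model = mlflow.sklearn.load_model('models:/churn-prediction/Production')
store = FeatureStore(repo_path='feature_repo/')


@app.post('/predict')
def predict(customer_id: int):  # query parameter: POST /predict?customer_id=1001
    # Fetch features from Feature Store
    features = store.get_online_features(
        features=[f'customer_features:{name}' for name in FEATURES],
        entity_rows=[{'customer_id': customer_id}],
    ).to_dict()

    # to_dict() returns {column: [values]}; build a one-row frame with the model's columns only
    feature_df = pd.DataFrame(features)[FEATURES]

    # Probability of the positive (churn) class
    churn_probability = float(model.predict_proba(feature_df)[0, 1])

    return {
        'customer_id': customer_id,
        'churn_probability': churn_probability,
        'churn_risk': 'HIGH' if churn_probability > 0.7 else 'MEDIUM' if churn_probability > 0.4 else 'LOW',
    }

# Run: uvicorn serve:app --host 0.0.0.0 --port 8000
```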
Option 2: Batch Predictions (BigQuery)
```python
# batch_predict.py
import mlflow.sklearn
from google.cloud import bigquery

FEATURES = ['recency', 'frequency', 'monetary']

# Load model (sklearn flavor, so predict_proba is available)
model = mlflow.sklearn.load_model('models:/churn-prediction/Production')

# Load customers to score
client = bigquery.Client()
query = """
SELECT customer_id, recency, frequency, monetary
FROM `project.dataset.customer_features`
WHERE last_order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 180 DAY)
"""
customers_df = client.query(query).to_dataframe()

# Batch predict: probability of churn (class 1)
customers_df['churn_probability'] = model.predict_proba(customers_df[FEATURES])[:, 1]

# Save results, overwriting the previous run's table
job = client.load_table_from_dataframe(
    customers_df[['customer_id', 'churn_probability']],
    'project.dataset.churn_predictions',
    job_config=bigquery.LoadJobConfig(write_disposition='WRITE_TRUNCATE'),
)
job.result()  # wait for the load job to finish

print(f"✅ Scored {len(customers_df)} customers")
```
Option 3: Real-time Serving (Vertex AI)
```python
# deploy_vertex_ai.py
from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

# Upload model to Vertex AI Model Registry.
# The prebuilt scikit-learn container expects a model.joblib / model.pkl file in artifact_uri.
model = aiplatform.Model.upload(
    display_name='churn-prediction',
    artifact_uri='gs://my-bucket/models/churn/v2',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest',
)

# Deploy to endpoint
endpoint = model.deploy(
    machine_type='n1-standard-4',
    min_replica_count=1,
    max_replica_count=10,  # Auto-scaling
    traffic_percentage=100,
)
print(f"✅ Model deployed to: {endpoint.resource_name}")

# Prediction: the prebuilt sklearn container takes each instance as a list of
# feature values in training-column order (recency, frequency, monetary)
prediction = endpoint.predict(instances=[[45, 8, 2500]])
# The prebuilt container calls model.predict(), so this is a class label;
# serving probabilities needs a custom container or custom prediction routine.
print(f"Prediction: {prediction.predictions[0]}")
```
4. Case Study: Vietnamese Fintech - Fraud Detection MLOps
4.1. Context
Company: Vietnamese digital lending platform (3M customers, 500K loans/month)
Challenge:
- Fraud patterns evolve rapidly (new attack methods weekly)
- Manual model updates took 2 weeks from training to deployment
- Fraud detection model accuracy degraded 15% over 3 months, and nobody noticed
- Data Scientists spent 60% of their time on deployment instead of improving models
Goal: Build an MLOps system to deploy fraud detection updates hourly and auto-retrain daily
4.2. Architecture
Before MLOps:
```text
Data Scientist → Jupyter Notebook → pickle file
  → Email to Engineering → Manual deployment (2 weeks)
  → No monitoring
```
After MLOps:
```text
BigQuery (transaction data)
  ↓
Feast Feature Store
  ↓
Kubeflow Pipeline (daily training)
  ↓
MLflow Model Registry
  ↓
Vertex AI Endpoint (auto-deploy)
  ↓
Evidently Monitoring → Alerts → Auto-retrain
```
4.3. Implementation
Step 1: Feature Store
Centralize 50+ fraud detection features, for example:
- transaction_velocity_1h: Number of transactions in the past hour
- amount_deviation_30d: Deviation of the amount from the 30-day average
- device_fingerprint_new: Is this a new, unfamiliar device?
- ip_country_mismatch: IP country differs from the registered country
- merchant_risk_score: Historical fraud rate of the merchant
- user_behavior_anomaly_score: ML-based anomaly score
- ...
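The two velocity-style features above can be computed from a raw transaction log with pandas time-based rolling windows. The sketch below assumes a table with `user_id`, `ts`, and `amount` columns (hypothetical names) and reads "deviation" as the ratio of the amount to the user's average over the previous 30 days; the article does not define either feature precisely. In production this logic would run in the feature pipeline that writes to Feast.

```python
import pandas as pd


def add_fraud_features(tx: pd.DataFrame) -> pd.DataFrame:
    """Add transaction_velocity_1h and amount_deviation_30d per user (illustrative definitions)."""
    tx = tx.sort_values(["user_id", "ts"])
    out = []
    for _, g in tx.groupby("user_id", sort=False):
        g = g.copy()
        s = g.set_index("ts")["amount"]  # time-indexed series so windows are in wall-clock time
        # Number of this user's transactions in the last hour, including the current one
        g["transaction_velocity_1h"] = s.rolling(pd.Timedelta(hours=1)).count().to_numpy()
        # Average amount over the PREVIOUS 30 days (closed="left" excludes the current row)
        avg_30d = s.rolling(pd.Timedelta(days=30), closed="left").mean().to_numpy()
        g["amount_deviation_30d"] = g["amount"].to_numpy() / avg_30d  # NaN for a user's first tx
        out.append(g)
    return pd.concat(out)


tx = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2025-05-01 10:00", "2025-05-01 10:20",
                          "2025-05-01 10:50", "2025-05-01 10:30"]),
    "amount": [100.0, 100.0, 400.0, 50.0],
})
print(add_fraud_features(tx))
```

Excluding the current transaction from the 30-day average matters: otherwise a single large fraudulent amount would inflate its own baseline and look less anomalous.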
Step 2: Automated Daily Training
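A Kubeflow pipeline runs every day at 3 AM:
```python
from kfp import dsl

# extract_labeled_data, train_xgboost_model, evaluate_model and deploy_to_vertex are
# @dsl.component functions defined elsewhere (not shown); their argument names are
# illustrative, and evaluate_model is assumed to expose an 'auc' output.


@dsl.pipeline(name='fraud-detection-training')
def fraud_pipeline():
    # Extract yesterday's transactions + labels
    extract_op = extract_labeled_data(
        query="""
        SELECT * FROM transactions
        WHERE date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
          AND fraud_label IS NOT NULL
        """
    )

    # Train model
    train_op = train_xgboost_model(
        data=extract_op.output,
        params={'max_depth': 6, 'learning_rate': 0.1},
    )

    # Evaluate on holdout set (KFP v2 components take keyword arguments)
    eval_op = evaluate_model(model=train_op.output)

    # Deploy only if AUC > 0.90 (threshold); newer KFP versions also offer dsl.If
    with dsl.Condition(eval_op.outputs['auc'] > 0.90):
        deploy_to_vertex(model=train_op.output)
```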
Step 3: Real-time Serving
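The API endpoint responds in under 100ms:
```python
# Assumes `app` (FastAPI), `feature_store` (feast.FeatureStore), `model` (a classifier with
# predict_proba) and a pydantic `Transaction` model with an `id` field are set up at startup.
import pandas as pd


@app.post('/score-transaction')
def score_transaction(transaction: Transaction):
    # Get features from Feast
    features = feature_store.get_online_features(
        entity_rows=[{'transaction_id': transaction.id}],
        features=[
            'fraud_features:transaction_velocity_1h',
            'fraud_features:amount_deviation_30d',
            ...
        ],
    ).to_dict()

    # to_dict() gives {column: [values]}; keep only the model's feature columns
    feature_df = pd.DataFrame(features).drop(columns=['transaction_id'])

    # Probability of fraud (class 1), as a plain float so it serializes to JSON
    fraud_score = float(model.predict_proba(feature_df)[0, 1])

    # Decision rules
    if fraud_score > 0.95:
        return {'decision': 'BLOCK', 'score': fraud_score}
    elif fraud_score > 0.75:
        return {'decision': 'REVIEW', 'score': fraud_score}
    else:
        return {'decision': 'APPROVE', 'score': fraud_score}
```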
Step 4: Monitoring & Auto-Retraining
Monitor performance hourly:
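```python
# Check performance every hour (scheduled job).
# current_hour_precision, trigger_kubeflow_pipeline and send_slack_alert are placeholders
# for your own metric query, pipeline trigger and alerting helpers.
if current_hour_precision < 0.85:
    # Trigger immediate retraining
    trigger_kubeflow_pipeline('fraud-detection-training')
    send_slack_alert("🚨 Fraud model performance dropped. Retraining triggered.")
```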
4.4. Results
Deployment Speed:
- Before: 2 weeks per update
- After: Hourly deployments (automated)
Model Freshness:
- Before: Quarterly updates
- After: Daily retraining on the previous day's data
Detection Performance:
- Before: 78% precision, 65% recall (degrading)
- After: 92% precision, 88% recall (consistently)
Fraud Adaptation:
- Before: New fraud patterns detected after 2-4 weeks
- After: Detected within 24 hours (daily retraining catches new patterns)
Cost Savings:
- Prevented fraud: $8M/year (an additional $3M from faster detection)
- Reduced false positives: 40% fewer legitimate transactions blocked → better UX
Data Science Productivity:
- Before: 60% of time on deployment
- After: 90% of time on model improvement → shipped 3 new fraud models in 6 months
5. Tools Landscape: Build vs Buy
5.1. All-in-One Platforms
Google Cloud Vertex AI
- ✅ Fully managed: Training, serving, monitoring
- ✅ AutoML for non-experts
- ✅ Feature Store built-in
- ✅ Tight integration with BigQuery, GCS
- ❌ Vendor lock-in
- ❌ Higher cost ($$$)
- Best for: GCP-native companies, enterprises that need vendor support
AWS SageMaker
- ✅ Comprehensive MLOps suite
- ✅ SageMaker Pipelines for orchestration
- ✅ Model Monitor for drift detection
- ❌ Complex setup
- ❌ AWS-only
- Best for: AWS-heavy companies
Databricks ML
- ✅ End-to-end: data prep → training → serving
- ✅ Unity Catalog for governance
- ✅ Great for Spark workloads
- ❌ Expensive
- Best for: Big data + ML workloads
5.2. Best-of-Breed (Open Source)
Recommended MVP Stack:
| Component | Tool | Why |
|---|---|---|
| Experiment Tracking | MLflow | Industry standard, easy setup |
| Pipeline Orchestration | Apache Airflow | Flexible, Python-based, proven |
| Feature Store | Feast | Open-source, cloud-agnostic |
| Model Serving | FastAPI + Docker | Lightweight, full control |
| Monitoring | Evidently AI | Free tier, drift detection |
| Infrastructure | Kubernetes | Scalable, portable |
Setup cost: $0 (open-source) + infrastructure cost
Time to MVP: 2-4 weeks
Best for: Startups, cost-conscious teams, teams that need flexibility
5.3. Decision Framework
Use an All-in-One Platform if you:
- ✅ Have a budget > $50K/year for MLOps tools
- ✅ Need enterprise support
- ✅ Are already committed to a cloud provider (GCP/AWS)
- ✅ Prefer less maintenance
Use Best-of-Breed if you:
- ✅ Are budget-constrained
- ✅ Want flexibility and no vendor lock-in
- ✅ Have the engineering resources to maintain it
- ✅ Follow a multi-cloud strategy
Vietnamese startup reality: Most should start with an open-source MVP and migrate to a managed platform as they scale.
6. Getting Started: Your MLOps MVP in 4 Weeks
Week 1: Experiment Tracking
Goal: Stop losing experiments
Tasks:
- Setup MLflow server (Docker)
- Migrate 1 model training script to log with MLflow
- Train 10 experiments, compare them in the UI
Deliverable: All team members track experiments in MLflow
Week 2: Model Registry & Versioning
Goal: Reproducible model deployments
Tasks:
- Setup MLflow Model Registry
- Register models with stages (Dev/Staging/Production)
- Deploy 1 model to production via registry
Deliverable: Production model served from registry, not local pickle files
Week 3: Automated Training Pipeline
Goal: Scheduled retraining
Tasks:
- Setup Apache Airflow (or Prefect)
- Convert training script to Airflow DAG
- Schedule weekly training
Deliverable: Model auto-retrains every week, auto-registers in MLflow
Week 4: Basic Monitoring
Goal: Detect when model degrades
Tasks:
- Setup Evidently AI monitoring
- Track predictions in BigQuery/database
- Daily drift report
Deliverable: Daily email report on model performance + drift
After 4 weeks: You have Level 1 MLOps, which is enough to deploy models reliably.
7. Common Pitfalls & Best Practices
❌ Pitfall 1: Boil the Ocean
Trying to implement every component at once → overwhelmed → fail
✅ Best Practice: Start small
- Week 1-4: Experiment tracking
- Week 5-8: Model registry
- Week 9-12: Automated pipeline
- Iterate from there
❌ Pitfall 2: Tools Before Process
Buying Databricks/SageMaker while the team still works manually
✅ Best Practice: Document process first
- How to train models?
- How to deploy?
- How to monitor?
Then automate that process
❌ Pitfall 3: Over-Engineering
Building a Kubernetes cluster for 2 models in production
✅ Best Practice: Match complexity to scale
- 1-5 models: FastAPI + Docker on a single server
- 5-20 models: Managed service (Vertex AI)
- 20+ models: Kubernetes + full MLOps
❌ Pitfall 4: No Monitoring
Deploying a model and then forgetting it → performance degrades 30% and nobody notices
✅ Best Practice: Monitoring is non-negotiable
- Minimum: Track daily prediction accuracy
- Better: Automated drift detection
- Best: Real-time performance dashboards
❌ Pitfall 5: Training-Serving Skew
Training uses pandas, production uses SQL → the features end up different
✅ Best Practice: Feature Store
- Single source of truth for features
- Same code for training & serving
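The "same code for training & serving" rule also works without a feature store: put the transformation in one importable function and call it from both the training job and the API. The sketch below assumes an orders table with `customer_id`, `order_id`, `order_date`, and `order_total` columns; `build_features` is a hypothetical helper echoing the RFM features used earlier in this article.

```python
from datetime import date

import pandas as pd

FEATURE_COLS = ["recency", "frequency", "monetary"]


def build_features(orders: pd.DataFrame, as_of: date) -> pd.DataFrame:
    """RFM features per customer, using only orders strictly before `as_of`.

    Called by BOTH the training job and the serving path, so the logic cannot diverge.
    """
    cutoff = pd.Timestamp(as_of)
    past = orders[orders["order_date"] < cutoff]
    grouped = past.groupby("customer_id")
    feats = pd.DataFrame({
        "recency": (cutoff - grouped["order_date"].max()).dt.days,
        "frequency": grouped["order_id"].nunique(),
        "monetary": grouped["order_total"].sum(),
    })
    return feats[FEATURE_COLS]  # fixed column order, matching what the model was trained on


orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "order_id": [10, 11, 20, 21],
    "order_date": pd.to_datetime(["2025-05-01", "2025-05-20", "2025-04-01", "2025-06-01"]),
    "order_total": [50.0, 70.0, 30.0, 999.0],
})
print(build_features(orders, date(2025, 5, 27)))
```

In training you would call it with each label's cutoff date; in the API, with today's date on the customer's recent orders.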
8. ROI & Business Case for MLOps
8.1. The Cost of NOT Having MLOps
Scenario: A company with 5 ML models in production
Manual operations cost (per year):
- Data Scientist time on deployment: 60% × $80K × 3 DS = $144K
- Failed deployments: 2 per quarter (8/year) × $20K impact = $160K
- Model performance degradation: 15% revenue impact × $3M ML-driven revenue = $450K
- Total cost: $754K/year
With MLOps ($100K investment):
- Data Scientists focus on models: +40% productivity = $200K value
- Prevent failed deployments: $160K saved
- Maintain model performance: $450K saved
- Total benefit: $810K/year
ROI = ($810K - $100K) / $100K = 710%
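The cost/benefit arithmetic in this section is easy to rerun with your own numbers. The sketch below encodes the same formulas; every input is a scenario assumption, not a benchmark, and `mlops_roi` is a hypothetical helper.

```python
def mlops_roi(ds_count: int, ds_salary: float, deploy_time_share: float,
              failed_deploys_per_year: int, cost_per_failure: float,
              ml_revenue: float, degradation_rate: float,
              productivity_value: float, investment: float) -> dict:
    """Yearly cost without MLOps, benefit with MLOps, and ROI in percent."""
    deploy_cost = deploy_time_share * ds_salary * ds_count      # DS time lost to deployment
    failure_cost = failed_deploys_per_year * cost_per_failure   # impact of failed deployments
    degradation_cost = degradation_rate * ml_revenue            # revenue lost to model decay
    cost_without = deploy_cost + failure_cost + degradation_cost
    # Benefit = productivity gain + avoided failures + avoided degradation
    benefit = productivity_value + failure_cost + degradation_cost
    roi_pct = (benefit - investment) / investment * 100
    return {"cost_without_mlops": cost_without, "benefit": benefit, "roi_pct": roi_pct}


result = mlops_roi(ds_count=3, ds_salary=80_000, deploy_time_share=0.6,
                   failed_deploys_per_year=8, cost_per_failure=20_000,
                   ml_revenue=3_000_000, degradation_rate=0.15,
                   productivity_value=200_000, investment=100_000)
print(result)
```

The degradation term dominates both sides of the calculation, so it deserves the most scrutiny when you substitute your own revenue figures.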
8.2. Metrics to Track
Business metrics:
- Time to production: Weeks → Days → Hours
- Model refresh rate: Quarterly → Weekly → Daily
- Number of models in production: 1-5 → 10-50+
- Data Scientist productivity: % time on model improvement (should increase)
Technical metrics:
- Deployment success rate: Target > 95%
- Model performance stability: AUC variance < 5%
- Incident response time: Hours → Minutes
- Training pipeline uptime: Target > 99%
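Two of these technical metrics can be computed directly from deployment logs and evaluation history. The sketch below uses illustrative values. "AUC variance < 5%" is read here as the standard deviation of AUC divided by its mean, which is one possible interpretation, not a definition taken from this article.

```python
import statistics
from typing import Sequence


def deployment_success_rate(outcomes: Sequence[bool]) -> float:
    """Share of deployments that succeeded (True) out of all attempts."""
    return sum(outcomes) / len(outcomes)


def auc_relative_std(auc_history: Sequence[float]) -> float:
    """Std of AUC divided by its mean; one possible reading of 'AUC variance < 5%'."""
    return statistics.pstdev(auc_history) / statistics.mean(auc_history)


deploys = [True] * 58 + [False] * 2           # illustrative: 60 deployments in a quarter
aucs = [0.84, 0.86, 0.85, 0.83, 0.87, 0.85]   # illustrative: weekly AUC readings
print(f"deployment success: {deployment_success_rate(deploys):.1%}")  # target > 95%
print(f"AUC relative std:   {auc_relative_std(aucs):.1%}")            # target < 5%
```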
9. The Future of MLOps: Trends
9.1. AutoML + MLOps
Automated model selection and hyperparameter tuning → MLOps pipelines that optimize themselves
Tools: H2O AutoML, Google Vertex AI AutoML, DataRobot
9.2. LLMOps
MLOps cho Large Language Models:
- Prompt versioning
- Fine-tuning pipelines
- LLM monitoring (hallucination detection, toxicity)
Tools: LangChain, PromptLayer, Weights & Biases for LLMs
9.3. Real-time ML
Shift from batch → streaming ML:
- Online learning (model updates in real time)
- Feature computation in the stream (Kafka, Flink)
Use cases: Fraud detection, recommendation systems, dynamic pricing
9.4. ML Governance & Compliance
Especially relevant for Fintech and Healthcare:
- Model explainability (SHAP, LIME)
- Bias detection
- Audit trails
- Regulatory compliance (GDPR, State Bank of Vietnam (SBV) regulations)
Tools: IBM AI Fairness 360, Microsoft Fairlearn
Conclusion
MLOps is not a luxury; it is a necessity for scaling ML in production.
Key takeaways:
- 85% of ML models never reach production, usually not because the model is bad but because MLOps is missing
- Start small: Experiment tracking (Week 1) → Registry (Week 2) → Pipeline (Week 3) → Monitoring (Week 4)
- Match tools to scale: Open-source MVP → Managed platform as you scale
- Monitoring is critical: Model performance degrades over time, and you need to detect that and retrain
- Automation is key: Manual operations don't scale beyond 5-10 models
Next steps:
- ✅ Read Customer Churn Prediction to understand an end-to-end ML project
- ✅ Read From BI to AI to assess your analytics maturity
- ✅ Set up MLflow experiment tracking this week
- ✅ Document your current deployment process → identify automation opportunities
Need help? Carptech has implemented MLOps for 10+ Vietnamese companies (Fintech, E-commerce, Logistics). Book a free consultation to discuss an MLOps roadmap for your company.