Fintech is one of the few industries where a Data Platform is not only a competitive advantage but also a compliance requirement. Under Circular 47/2020 of the State Bank of Vietnam (SBV), financial institutions must ensure auditability, traceability, and data residency - requirements that cannot be met with spreadsheets or scattered dashboards.
In practice, from working with 10+ fintech clients in Vietnam, Carptech has found that 85% of Series A fintechs hit compliance and fraud detection bottlenecks as they scale: they have a good product and clear product-market fit, but lack the data infrastructure to meet regulatory requirements and manage risk effectively.
This article walks through how to build a Data Platform for fintech, from a security-first, compliance-by-design architecture to real-time fraud detection and credit scoring. It includes a real-world case study of a lending fintech that cut approval time from 3 days to 2 hours and reduced its default rate by 15%.
TL;DR - Key Takeaways
- Fintech cần real-time + compliance: 99.99% uptime, data residency in Vietnam, immutable audit logs
- Security architecture: Encryption at rest & in transit, RBAC, data masking, zero-trust network
- Regulatory requirements: SBV Circular 47/2020, PDPA, AML/KYC audit trails
- Real-time use cases: Fraud detection (<100ms latency), transaction monitoring, credit scoring
- Tech stack: Vietnam cloud providers (Viettel IDC, VNPT, FPT), Apache Kafka, Postgres/TimescaleDB, dbt
- ROI: 50-70% reduction in approval time, 10-20% lower default rates, 90%+ compliance automation
Fintech Unique Requirements: Why Is It Different?
1. Compliance is Non-Negotiable
Circular 47/2020/TT-NHNN (December 2020) - SBV regulations on information system security:
Data residency:
- Core financial data must be stored in Vietnam
- Customer data (KYC, transactions) cannot leave Vietnam without approval
- Cloud providers: Must use Vietnamese providers or international clouds with Vietnam regions
Audit requirements:
- Immutable logs: All transactions, data changes must be logged and cannot be modified
- Retention: Minimum 5 years for transactional data, 10 years for audit logs
- Traceability: Ability to reconstruct any transaction or decision
- Reporting: Quarterly reports to SBV on security incidents, data breaches
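The retention minimums above translate into a simple purge-eligibility check. The following is a minimal sketch, assuming the 5-year and 10-year figures stated in the list; the function names and record kinds are hypothetical, and a record created on 29 February falls back to 28 February in a non-leap target year.

```python
from datetime import date

# Minimum retention in years, taken from the list above.
RETENTION_YEARS = {"transaction": 5, "audit_log": 10}

def earliest_purge_date(created: date, kind: str) -> date:
    """Return the first date on which a record of this kind may be purged."""
    years = RETENTION_YEARS[kind]
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # created is 29 February and the target year is not a leap year
        return created.replace(year=created.year + years, day=28)

def can_purge(created: date, kind: str, today: date) -> bool:
    """True once the minimum retention period has fully elapsed."""
    return today >= earliest_purge_date(created, kind)
```

A deletion job would call `can_purge` before removing anything, so that retention is enforced in code rather than by convention.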
PDPA (Personal Data Protection Act):
- Consent management: Track customer consent for data usage
- Right to be forgotten: Ability to delete customer data on request (challenge: conflicts with retention!)
- Data minimization: Only collect necessary data
- Breach notification: Within 72 hours to authorities
2. Real-time is Mission-Critical
Fraud detection:
- Latency requirement: <100ms from transaction initiation to fraud score
- Accuracy: >99% (false positives costly, false negatives catastrophic)
- Scale: Process 1000-10,000 transactions/second (peak times)
Transaction monitoring:
- AML (Anti-Money Laundering) rules: Detect suspicious patterns real-time
- Large transaction alerts: >50M VND require additional verification
- Velocity checks: Too many transactions in short time = red flag
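A velocity check is typically a sliding-window counter per user. The sketch below is one possible in-memory version; the class name is hypothetical, the thresholds are illustrative, and a production system would more likely keep this state in a shared store such as Redis.

```python
from collections import deque

class VelocityChecker:
    """Flags a user who exceeds max_events within a sliding time window."""

    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = {}  # user_id -> deque of event timestamps (seconds)

    def record(self, user_id: str, ts: float) -> bool:
        """Record a transaction at time ts; return True if it is a red flag."""
        q = self._events.setdefault(user_id, deque())
        q.append(ts)
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= ts - self.window_seconds:
            q.popleft()
        return len(q) > self.max_events
```

For example, `VelocityChecker(max_events=5, window_seconds=60)` flags the sixth transaction a user makes within one minute.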
Credit decisions:
- Traditional banks: 3-7 days approval time
- Digital lenders: Target <24 hours, best-in-class <2 hours
- Real-time components: Credit scoring, income verification, fraud check
3. Security is Paramount
Data sensitivity levels:
- Tier 1 - Critical: Card numbers, CVV, account passwords → Encrypt, never log
- Tier 2 - High: PII (name, ID, phone, address), account balances → Encrypt, audit all access
- Tier 3 - Medium: Transaction history, credit scores → Access control, masked in non-prod
- Tier 4 - Low: Aggregated analytics, anonymized data → Standard controls
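The "never log" rule for Tier 1 data is easiest to enforce with a single sanitizing function that every log call goes through. This is a minimal sketch: the field names are hypothetical, and masking Tier 2 values in logs is an assumption layered on top of the tiers above, not something the tiers themselves prescribe.

```python
# Hypothetical field names grouped by the tiers above.
TIER1_NEVER_LOG = {"card_number", "cvv", "password"}
TIER2_MASK = {"name", "id_number", "phone", "address", "balance"}

def sanitize_for_log(record: dict) -> dict:
    """Return a copy of record that is safe to write to application logs."""
    safe = {}
    for key, value in record.items():
        if key in TIER1_NEVER_LOG:
            continue  # Tier 1: never log, not even masked
        if key in TIER2_MASK:
            safe[key] = "***"  # Tier 2: keep the key, hide the value
        else:
            safe[key] = value
    return safe
```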
Attack vectors unique to fintech:
- Account takeover: Stolen credentials → multi-factor auth, behavioral biometrics
- Transaction fraud: Stolen cards, synthetic identities → ML fraud models
- Social engineering: Phishing, scams → Customer education, transaction confirmations
- Internal fraud: Rogue employees → Strict RBAC, audit all data access
4. High Availability Requirements
Uptime SLA:
- 99.99% uptime = 52 minutes downtime/year
- Maintenance windows: Off-peak only (2-5 AM)
- Disaster recovery: RTO <4 hours, RPO <15 minutes
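The downtime figure follows directly from the SLA percentage. A short sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_budget_minutes(uptime_percent: float) -> float:
    """Maximum downtime per year, in minutes, allowed by an uptime SLA."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)
```

`downtime_budget_minutes(99.99)` gives roughly 52.6 minutes. Note that a single outage lasting the full 4-hour RTO (240 minutes) would already exceed that yearly budget, so the RTO is best read as a worst-case ceiling rather than an expected recovery time.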
Implications for Data Platform:
- Multi-region deployment (Hà Nội + HCM data centers)
- Active-active or active-passive replication
- Automated failover
- Regular DR drills
Data Sources trong Fintech Ecosystem
Core Banking System / Ledger
Traditional banks: Core banking platforms (Temenos, Oracle Flexcube, local systems)
Digital banks/fintechs: Modern alternatives
- Mambu: Cloud-native core banking (SaaS)
- Custom built: Node.js/Java + PostgreSQL with double-entry accounting
Data:
- Accounts: account_id, customer_id, account_type (savings, loan), balance, status
- Transactions: transaction_id, from_account, to_account, amount, timestamp, type
- Ledger entries: Debit/credit entries with references
Integration:
- Real-time: Event streaming (Kafka) for every transaction
- Batch: Nightly dump for reconciliation, reporting
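In a double-entry ledger like the one mentioned above, the invariant checked during reconciliation is that the debits and credits of each transaction net to zero. A minimal sketch of that check, assuming a hypothetical `(account_id, side, amount)` tuple layout and amounts passed as strings to avoid floating-point rounding:

```python
from decimal import Decimal

def is_balanced(entries: list) -> bool:
    """Check that total debits equal total credits for one transaction.

    Each entry is an (account_id, side, amount) tuple, side being "debit" or "credit".
    """
    total = Decimal("0")
    for _account_id, side, amount in entries:
        if side == "debit":
            total += Decimal(amount)
        elif side == "credit":
            total -= Decimal(amount)
        else:
            raise ValueError(f"unknown side: {side}")
    return total == 0
```

Running this over the nightly batch dump is a cheap way to catch events that were dropped or duplicated on the streaming path.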
Payment Gateways
Domestic:
- NAPAS: Interbank transfers, ATM network
- VietQR: QR code payments
- Bank APIs: Individual bank integrations (Vietcombank, Techcombank, etc.)
International:
- Visa/Mastercard: Card payments
- PayPal, Stripe: For international transactions (if licensed)
Data:
- Payment status: initiated, processing, success, failed, reversed
- Metadata: payment method, device info, IP address, location
- Fees: transaction fees, currency conversion
KYC/AML Vendors
eKYC providers:
- VNPT eKYC: ID card scanning, face matching
- FPT.AI: Similar services
- Trusting Social: Digital footprint for credit
Data:
- ID verification: ID number, name, DOB, address (from government database)
- Face matching score: Selfie vs ID photo
- Liveness detection: Anti-spoofing (video, 3D face)
AML screening:
- Dow Jones Risk & Compliance: Watchlist screening (PEP, sanctions)
- Local databases: SBV blacklists
Credit Bureaus
Vietnam Credit Information Center (CIC) - SBV's credit bureau:
- Credit history: Loans, payment history, defaults
- Credit score: CIC score (not as sophisticated as FICO yet)
- Lag: Updated monthly (not real-time)
Alternative data providers:
- Telecom data: Call/SMS patterns, top-up history (proxy for stability)
- E-commerce: Shopee/Lazada purchase history (spending patterns)
- Social: Facebook, Zalo (controversial, privacy concerns)
App Analytics & Behavioral Data
- Firebase Analytics: App usage, screens, events
- Mixpanel/Amplitude: User funnels, retention
- Behavioral signals: Device fingerprinting, typing patterns, location patterns
Fraud detection signals:
- New device from different location = higher risk
- Copy-paste password (vs typing) = credential stuffing attack
- Fast form completion = bot
Architecture: Security & Compliance First
Reference Architecture
APPLICATION LAYER (Mobile app, Web app, APIs)
- TLS 1.3 encryption
- OAuth 2.0 + JWT tokens
↓
API GATEWAY + WAF
- Rate limiting (1000 req/min per user)
- DDoS protection
- Request logging (audit trail)
↓ (to both of the following)
KAFKA CLUSTER (Event Stream)
- 3 brokers
- Replication=3
OPERATIONAL DB (PostgreSQL)
- Encrypted
- Multi-AZ
↓ (from the Kafka cluster, to both of the following)
STREAM PROCESSING (Kafka Streams/Flink)
- Fraud detection
- Real-time agg
- Alerting
DATA WAREHOUSE (PostgreSQL + TimescaleDB)
- Column encryption
- Row-level security
- Audit logs
↓ (from the data warehouse)
ANALYTICS LAYER (dbt + Metabase)
- Role-based access
- Data masking
- Query audit logs
Cloud Provider Choice: Vietnam Requirements
Option 1: Vietnamese cloud providers (Recommended for compliance)
- Viettel IDC (Viettel Cloud)
- VNPT Cloud
- FPT Cloud
- CMC Cloud
Pros:
- ✅ Data residency compliance by default
- ✅ Vietnamese support, contract in VND
- ✅ Government relationships (easier licensing)
Cons:
- ❌ Limited services vs AWS/GCP/Azure
- ❌ Smaller ecosystem, fewer integrations
- ❌ Potentially higher costs for compute/storage
Option 2: International clouds with Vietnam regions
- AWS: ap-southeast-1 (Singapore) - Planned Hanoi region
- GCP: asia-southeast1 (Singapore) - No Vietnam region yet
- Azure: Southeast Asia (Singapore) - No Vietnam region yet
Hybrid approach (Most common):
- Tier 1 data (transactions, PII): Vietnamese cloud
- Tier 4 data (aggregated analytics): International cloud
- Replication: Encrypted backup to international cloud for DR
Encryption Strategy
Data at rest:
- Database encryption: Encrypted volumes, or a PostgreSQL distribution that offers TDE (Transparent Data Encryption); community PostgreSQL does not ship built-in TDE
- Field-level encryption: For highly sensitive fields (card numbers, IDs)
- Algorithm: AES-256-GCM
- Key management: AWS KMS, HashiCorp Vault, or Vietnamese HSM providers
- Key rotation: Every 90 days
Data in transit:
- TLS 1.3: All communications (app ↔ API ↔ database)
- Certificate pinning: Mobile apps to prevent MITM attacks
- VPN/Private networks: Between cloud regions
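Example: Field-level encryption in PostgreSQL
-- pgp_sym_encrypt / pgp_sym_decrypt are provided by the pgcrypto extension
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- Encrypt ID number before storing
-- (app.encryption_key is a custom session setting the application must set, e.g. from Vault/KMS)
INSERT INTO customers (customer_id, id_number_encrypted, name)
VALUES (
    gen_random_uuid(),
    pgp_sym_encrypt('001234567890', current_setting('app.encryption_key')),
    'Nguyen Van A'
);
-- Decrypt when authorized
-- (current_user_has_permission() is an application-defined helper, not a PostgreSQL built-in)
SELECT
    customer_id,
    name,
    pgp_sym_decrypt(id_number_encrypted::bytea, current_setting('app.encryption_key')) AS id_number
FROM customers
WHERE customer_id = 'xxx' -- placeholder for a real UUID
  AND current_user_has_permission('view_pii') = true;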
Access Control: RBAC + Attribute-Based
Roles hierarchy:
- Admin: Full access (CTO, Security team) - Audit all actions
- Data Analyst: Read-only, masked PII
- Risk Officer: Read transaction data, write to risk tables
- Customer Support: Read customer info (masked), cannot see balances/transactions
- External Auditor: Read-only, specific schemas, time-limited access
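Row-level security example:
-- Policies only take effect once row-level security is enabled on the table
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;
-- Customer support can only see customers they're assigned to
-- (current_user_id() is an application-defined helper, not a PostgreSQL built-in)
CREATE POLICY customer_support_policy ON customers
    FOR SELECT
    TO customer_support_role
    USING (assigned_support_id = current_user_id());
-- Analysts can see all customers but PII is masked.
-- A SELECT policy cannot carry WITH CHECK; writes are blocked by
-- not granting INSERT/UPDATE/DELETE to analyst_role.
CREATE POLICY analyst_policy ON customers
    FOR SELECT
    TO analyst_role
    USING (true);
-- Create masked view for analysts
CREATE VIEW customers_masked AS
SELECT
    customer_id,
    CONCAT(LEFT(name, 2), '***') AS name,
    CONCAT('***', RIGHT(phone, 4)) AS phone,
    DATE_TRUNC('month', created_at) AS created_month, -- Hide exact date
    customer_segment
FROM customers;
-- Grant analysts the masked view rather than the base table
GRANT SELECT ON customers_masked TO analyst_role;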
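The same masking rules are often needed outside the database, for example when preparing non-production fixtures or sanitizing API responses. The sketch below mirrors the `LEFT`/`RIGHT` expressions of the view in Python; the function names are illustrative.

```python
def mask_name(name: str) -> str:
    """Keep the first two characters, like CONCAT(LEFT(name, 2), '***')."""
    return name[:2] + "***"

def mask_phone(phone: str) -> str:
    """Keep the last four characters, like CONCAT('***', RIGHT(phone, 4))."""
    return "***" + phone[-4:]
```

Keeping the application-side rules identical to the view avoids the situation where the same customer appears with two different masked forms in different tools.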
Audit Logging: Immutable & Complete
What to log:
- Data access: Who queried what data, when
- Data changes: All INSERT/UPDATE/DELETE with before/after values
- Authentication: Login attempts (success/fail), session duration
- Admin actions: Configuration changes, permission grants
- API calls: All requests/responses (sanitized)
Implementation with PostgreSQL + Audit trigger:
-- Audit log table
CREATE TABLE audit_log (
    log_id BIGSERIAL PRIMARY KEY,
    table_name TEXT NOT NULL,
    operation TEXT NOT NULL, -- INSERT, UPDATE, DELETE
    old_values JSONB,
    new_values JSONB,
    user_id UUID,
    username TEXT,
    ip_address INET,
    timestamp TIMESTAMPTZ DEFAULT NOW()
);
-- Make it append-only for application roles (cannot UPDATE, DELETE or TRUNCATE).
-- REVOKE does not restrict the table owner or superusers, so the owning role
-- should be separate from application roles and tightly controlled.
REVOKE UPDATE, DELETE, TRUNCATE ON audit_log FROM PUBLIC;
-- Audit trigger function
CREATE OR REPLACE FUNCTION audit_trigger_function()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO audit_log (table_name, operation, old_values, new_values, user_id, username, ip_address)
    VALUES (
        TG_TABLE_NAME,
        TG_OP,
        CASE WHEN TG_OP IN ('UPDATE', 'DELETE') THEN to_jsonb(OLD) END,
        CASE WHEN TG_OP IN ('INSERT', 'UPDATE') THEN to_jsonb(NEW) END,
        -- app.current_user_id is a custom setting the application sets per session;
        -- NULLIF guards against an empty string, which cannot be cast to UUID
        NULLIF(current_setting('app.current_user_id', true), '')::UUID,
        current_user,
        inet_client_addr()
    );
    RETURN NULL; -- the return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;
-- Attach to tables
CREATE TRIGGER audit_customers
    AFTER INSERT OR UPDATE OR DELETE ON customers
    FOR EACH ROW EXECUTE FUNCTION audit_trigger_function();
Retention: Store audit logs for 10 years, archive to cold storage after 2 years
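Revoking privileges prevents accidental edits, but it does not by itself prove to an auditor that archived logs were never altered. One complementary technique, not part of the setup above, is to hash-chain entries so that any later modification becomes detectable. A minimal sketch, with hypothetical function names and an in-memory list standing in for the archive:

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # fixed starting value for the first entry

def _entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always produces the same hash.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_entry(chain: list, payload: dict) -> None:
    """Append payload, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({"payload": payload, "prev_hash": prev_hash,
                  "hash": _entry_hash(prev_hash, payload)})

def verify_chain(chain: list) -> bool:
    """Recompute every hash; False means an entry was changed or removed."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Periodically publishing the latest hash to a separate system (or to the auditor) makes it impractical to rewrite history silently, even for someone with database owner access.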
Real-time Fraud Detection: <100ms Latency
Fraud Detection Pipeline
Transaction initiated
↓
Feature extraction (10ms)
- Device fingerprint
- Location (IP geolocation)
- Transaction amount, recipient
- User behavior (time since last login, typing speed)
↓
Rule-based checks (20ms)
- Blacklist check (stolen cards, banned users)
- Velocity rules (>5 transactions in 1 minute)
- Amount limits (>50M VND)
↓
ML model scoring (50ms)
- Fraud probability: 0.0-1.0
- Model: XGBoost, LightGBM
↓
Decision (10ms)
- Score <0.1: Approve (95% of transactions)
- Score 0.1-0.5: Additional verification (4%)
- Score >0.5: Decline (1%)
↓
Log & feedback loop (10ms)
- Store decision, send to analytics
- Update user profile
Total latency: ~100ms
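The decision step and the per-stage latency budget can be written down explicitly. In this sketch the stage budgets and score bands come from the pipeline above; how the exact boundary values 0.1 and 0.5 are assigned is an assumption, and the names are illustrative.

```python
# Per-stage latency budget in milliseconds, from the pipeline above.
STAGE_BUDGET_MS = {
    "feature_extraction": 10,
    "rule_checks": 20,
    "ml_scoring": 50,
    "decision": 10,
    "log_and_feedback": 10,
}

def fraud_decision(score: float) -> str:
    """Map a fraud probability to an action using the bands above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < 0.1:
        return "APPROVE"
    if score <= 0.5:
        return "VERIFY"  # additional verification
    return "DECLINE"
```

The budgets sum to exactly 100 ms, which leaves no headroom for network hops; moving the logging step off the synchronous path is one way to stay under the target.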
Feature Engineering for Fraud Detection
User features:
{
"user_id": "uuid",
"account_age_days": 45,
"total_transactions": 12,
"total_volume_30d": 15000000, # VND
"avg_transaction_amount": 1250000,
"kyc_verified": true,
"kyc_score": 0.95,
"devices_count": 2,
"login_locations_count": 3 # Distinct cities
}
Transaction features:
{
"amount": 5000000,
"amount_vs_avg_ratio": 4.0, # 4x higher than average
"recipient_new": true, # Never sent to this recipient before
"recipient_high_risk": false, # Not in fraud database
"time_since_last_transaction_minutes": 2,
"transactions_last_hour": 3, # Velocity
"is_round_number": true, # 5,000,000 is round, 5,123,456 is not (round = suspicious)
"weekend_transaction": true,
"late_night_transaction": false
}
Device features:
{
"device_id": "fingerprint_hash",
"device_new": false,
"device_os": "Android 13",
"ip_address": "14.xxx.xxx.xxx",
"ip_location": "Hanoi",
"ip_location_mismatch": false, # vs registered address
"vpn_detected": false,
"emulator_detected": false
}
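Several of the transaction features above are cheap to derive at request time. The sketch below computes a few of them; the function name is hypothetical, and the definitions of "round" (a multiple of 1,000,000 VND) and "late night" (23:00-04:59) are assumptions chosen for illustration.

```python
from datetime import datetime

def transaction_features(amount: int, avg_amount: float, when: datetime) -> dict:
    """Derive a few of the transaction features listed above."""
    return {
        "amount": amount,
        "amount_vs_avg_ratio": round(amount / avg_amount, 2) if avg_amount else None,
        # "Round" is assumed here to mean a multiple of 1,000,000 VND.
        "is_round_number": amount % 1_000_000 == 0,
        # weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        "weekend_transaction": when.weekday() >= 5,
        # "Late night" is assumed here to mean 23:00-04:59.
        "late_night_transaction": when.hour >= 23 or when.hour < 5,
    }
```

Features that depend on history (velocity, new recipient, new device) need a lookup against precomputed user state and are usually served from a low-latency store.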
ML Model Training
Dataset:
- Positive class (fraud): 0.1-1% of transactions
- Negative class (legitimate): 99-99.9%
- Class imbalance: Use SMOTE, class weights, or undersampling
Labels:
- Confirmed fraud: Chargebacks, user reports
- Suspected fraud: Declined transactions, manual review
- Time delay: True labels emerge 30-90 days later (chargebacks)
Model training pipeline:
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, precision_recall_curve
# Load training data (last 6 months)
# load_transactions_with_labels is a project-specific loader (not shown) returning a
# DataFrame with the feature columns, 'is_fraud', and a datetime 'transaction_date'
df = load_transactions_with_labels(start_date='2024-10-01')
# Feature engineering
features = [
    'account_age_days', 'total_transactions', 'amount', 'amount_vs_avg_ratio',
    'recipient_new', 'time_since_last_transaction_minutes', 'transactions_last_hour',
    'device_new', 'ip_location_mismatch', 'vpn_detected'
]
X = df[features]
y = df['is_fraud'].astype(int)
# Train/test split (temporal - no leakage)
split_date = '2025-03-01'
is_train = df['transaction_date'] < split_date
X_train, y_train = X[is_train], y[is_train]
X_test, y_test = X[~is_train], y[~is_train]
# Train LightGBM model
model = LGBMClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    # Handle imbalance: ratio of legitimate to fraud rows (valid for 0/1 labels)
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    random_state=42
)
model.fit(X_train, y_train)
# Evaluate
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f"AUC: {auc:.4f}") # Target: >0.95
# Find operating threshold (balance precision/recall)
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
# Choose the lowest threshold with precision >= 0.9 (at most 10% of flagged
# transactions are false positives) and report the recall achieved there
meets_target = precision[:-1] >= 0.9
if meets_target.any():
    idx = meets_target.argmax()
    print(f"Threshold: {thresholds[idx]:.3f}, recall: {recall[idx]:.3f}")
# Save model
model.booster_.save_model('fraud_model_v2.txt')
Model deployment:
- Containerize: Docker image with the model file
- Deploy: Kubernetes cluster (auto-scaling)
- A/B testing: Shadow mode → 10% traffic → 100% rollout
- Monitoring: Latency, prediction distribution, false positive rate
- Retraining: Monthly with new fraud labels
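For the staged rollout, each user should land on the same side of the split on every request, otherwise a user could see inconsistent decisions. A common way to get this is deterministic hash bucketing; the sketch below assumes SHA-256 over the user ID, and the function names are illustrative.

```python
import hashlib

def rollout_bucket(user_id: str) -> int:
    """Deterministically map a user to a bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def serves_new_model(user_id: str, rollout_percent: int) -> bool:
    """True if this user's decisions should come from the new model."""
    return rollout_bucket(user_id) < rollout_percent
```

Shadow mode corresponds to `rollout_percent=0` with the new model still being scored and logged for comparison; raising the value to 10 and then 100 only ever adds users to the new model, it never moves anyone back.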
Rule-based Layer: Fast & Interpretable
ML models are powerful but act as a "black box". Combine them with rule-based checks for transparency and regulatory compliance.
Example rules:
# Placeholder: in production this set is loaded from the fraud database
BLACKLIST = set()

def rule_based_fraud_check(transaction, user):
    # Rule 1: Blacklist
    if transaction['recipient_id'] in BLACKLIST:
        return {"decision": "DECLINE", "reason": "Recipient blacklisted"}
    # Rule 2: Velocity - too many transactions
    # (checked before the REVIEW rules so a hard decline is never downgraded)
    if user['transactions_last_hour'] > 5:
        return {"decision": "DECLINE", "reason": "Velocity exceeded"}
    # Rule 3: Large transactions
    if transaction['amount'] > 50_000_000:  # >50M VND
        return {"decision": "REVIEW", "reason": "Large transaction"}
    # Rule 4: New device + large amount
    if transaction['device_new'] and transaction['amount'] > 10_000_000:
        return {"decision": "REVIEW", "reason": "New device, large amount"}
    # Rule 5: Location mismatch
    if transaction['ip_location'] != user['registered_city']:
        if transaction['amount'] > 5_000_000:
            return {"decision": "REVIEW", "reason": "Location mismatch"}
    return {"decision": "PASS", "reason": None}
Benefit: Explainability for customers ("We declined because ...") and auditors.
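The rule outcome and the ML score still have to be merged into a single decision. The precedence used below (hard declines first, then reviews, then approval) and the 0.1/0.5 score bands are assumptions for illustration; the input dict follows the `{"decision", "reason"}` shape of the rule example, and the function name is hypothetical.

```python
def combine_decisions(rule_result: dict, ml_score: float) -> dict:
    """Merge the rule outcome with the ML fraud score into one auditable decision."""
    if rule_result["decision"] == "DECLINE":
        return {"decision": "DECLINE", "source": "rule", "reason": rule_result["reason"]}
    if ml_score > 0.5:
        return {"decision": "DECLINE", "source": "model", "reason": f"Fraud score {ml_score:.2f}"}
    if rule_result["decision"] == "REVIEW":
        return {"decision": "REVIEW", "source": "rule", "reason": rule_result["reason"]}
    if ml_score >= 0.1:
        return {"decision": "REVIEW", "source": "model", "reason": f"Fraud score {ml_score:.2f}"}
    return {"decision": "APPROVE", "source": "model", "reason": None}
```

Recording the `source` alongside the reason makes it possible to tell auditors whether a given decline came from an explicit rule or from the model.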
Credit Scoring: Alternative Data + ML
Traditional Credit Scoring (CIC Score)
Vietnam Credit Information Center (CIC) score:
- Range: Not publicly disclosed (unlike FICO's 300-850)
- Based on: Loan history, payment behavior, defaults
- Limitations:
- Thin-file problem: 60-70% of adults have no credit history
- Monthly update lag: Not real-time
- Limited coverage: Formal banking only
Alternative Data Approach
Telecom data (with customer consent):
- Call/SMS patterns: Stable pattern = stable life
- Top-up frequency: Regular top-ups = income regularity
- Network quality: Premium number = higher income segment
E-commerce data:
- Purchase history: Categories (essentials vs luxury)
- Delivery addresses: Stability (same address for 6 months)
- Payment method: Credit card vs COD (credit card users lower default)
App behavioral data:
- Form completion time: Too fast = potential fraud
- Midnight applications: Higher risk (desperation signal)
- Loan amount requested: Exactly round numbers = less thought, higher risk
Credit Scoring Model
Target variable:
- Binary: Default (yes/no) - default = >60 days overdue
- Multi-class: Risk tiers (A, B, C, D, E)
Features (200+ features):
{
# Demographics
"age": 32,
"gender": "M",
"marital_status": "married",
"education": "university",
"city": "Hanoi",
# Employment
"employment_type": "full_time",
"company_type": "private",
"months_at_job": 24,
"monthly_income": 15000000, # Self-declared
# Loan request
"loan_amount": 30000000,
"loan_term_months": 12,
"loan_purpose": "motorbike",
"amount_to_income_ratio": 2.0,
# CIC data (if available)
"cic_score": 650,
"existing_loans": 1,
"total_debt": 50000000,
"debt_to_income": 3.3,
"late_payments_12m": 0,
# Alternative data - Telecom
"telecom_tenure_months": 36,
"avg_monthly_topup": 200000,
"topup_regularity": 0.85, # % of months with topup
# Alternative data - E-commerce
"ecommerce_orders_6m": 8,
"ecommerce_spend_6m": 5000000,
"delivery_address_changes": 0,
# Behavioral
"application_hour": 14, # 2 PM
"application_completion_time_seconds": 420,
"device_type": "Android",
"form_fields_edited": 3 # How many times user changed answers
}
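The two ratio features in the example appear to be computed against monthly income (30M / 15M = 2.0 and 50M / 15M ≈ 3.3); that reading is an inference from the sample values rather than a stated definition. Under that assumption, a sketch of the calculation with an illustrative function name:

```python
def affordability_ratios(monthly_income: float, loan_amount: float, total_debt: float) -> dict:
    """Compute loan-to-income and debt-to-income against monthly income."""
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    return {
        "amount_to_income_ratio": round(loan_amount / monthly_income, 1),
        "debt_to_income": round(total_debt / monthly_income, 1),
    }
```

Because income is self-declared, these ratios inherit its uncertainty; recomputing them after income verification is a sensible safeguard.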
Model:
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
# Training data: historical loans with outcomes
# (df_train and features are assumed to be prepared upstream; categorical columns
# such as gender or city must be numerically encoded before fitting)
X_train = df_train[features]
y_train = df_train['defaulted'] # 0 or 1
# Train model
model = GradientBoostingClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.1,
    random_state=42
)
model.fit(X_train, y_train)
# Cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# Target: >0.75 (with alternative data, 0.72-0.78 is typical)
# Feature importance
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print(feature_importance.head(10))
Typical important features:
- CIC score (if available) - 25%
- Debt-to-income ratio - 12%
- Months at job - 8%
- Telecom tenure - 7%
- Age - 6%
- Loan amount to income - 5%
Credit Decision Workflow
Application submitted
↓
Fraud check (ML model)
↓ (Pass)
eKYC verification
↓ (Success)
Income verification
- Bank statement upload
- Or telecom data proxy
↓
Credit scoring (ML model)
- Score: 0.0 (high risk) to 1.0 (low risk)
↓
Decision rules
- Score >0.7 → Approve automatically (40-50% of applications)
- Score 0.4-0.7 → Manual review (30-40%)
- Score <0.4 → Decline (10-20%)
↓
Loan offer
- Approved amount (may be lower than requested)
- Interest rate (risk-based pricing)
- Term options
Approval time:
- Automated approvals: <2 hours (eKYC + scoring + bank account creation)
- Manual review: 4-24 hours
- Average: ~6-8 hours (vs 3-7 days traditional banks)
Case Study: Lending Fintech - 2-Hour Approval, 15% Lower Default
Background:
- Company: Digital lending platform, focus on SME loans
- AUM: ~200B VND
- Problem:
- Manual credit assessment: 3 days average
- Default rate: 12% (industry ~8-10%)
- Scalability bottleneck: Credit team can only process 50 loans/day
Pain points:
- Excel-based credit scoring: Inconsistent, not scalable
- Fragmented data: CIC report (PDF), bank statements (PDF), application form (Google Forms)
- No fraud detection: Losing 5-8% to first-party fraud (intentional defaults)
- Compliance burden: Manual preparation of SBV reports (2 days/quarter)
Solution: Data Platform + ML Decisioning (12 weeks implementation)
Phase 1: Data Infrastructure (Weeks 1-4)
Cloud setup:
- Viettel Cloud (VTC Cloud): Compute, storage (compliance requirement)
- Kubernetes: Container orchestration
- PostgreSQL: Operational database (TimescaleDB extension for time-series)
- Apache Kafka: Event streaming
Data integrations:
- Core banking API: Real-time loan creation, status updates
- eKYC API (VNPT): ID verification, face matching
- Bank statement OCR: Extract transactions from PDFs using ML (Tesseract + custom model)
- CIC API: Fetch credit reports
- Telecom data (partnership with MobiFone): Top-up, usage patterns
Data warehouse:
-- Loans table
CREATE TABLE loans (
loan_id UUID PRIMARY KEY,
customer_id UUID NOT NULL,
application_date TIMESTAMPTZ,
approved_date TIMESTAMPTZ,
amount NUMERIC(15,2),
term_months INT,
interest_rate NUMERIC(5,2),
status TEXT, -- pending, approved, disbursed, closed, defaulted
default_date TIMESTAMPTZ,
ml_credit_score NUMERIC(3,2),
approval_method TEXT -- auto, manual
);
-- Customer features (for ML)
CREATE TABLE customer_features (
    customer_id UUID NOT NULL,
    snapshot_date DATE NOT NULL,
    age INT,
    monthly_income NUMERIC(15,2),
    debt_to_income NUMERIC(5,2),
    cic_score INT,
    telecom_tenure_months INT,
    -- ... 200+ features
    PRIMARY KEY (customer_id, snapshot_date) -- one row per customer per snapshot
);
-- Audit log (compliance)
CREATE TABLE loan_decisions_audit (
    decision_id UUID PRIMARY KEY,
    loan_id UUID,
    timestamp TIMESTAMPTZ DEFAULT NOW(),
    decision TEXT, -- approve, decline, review
    credit_score NUMERIC(3,2),
    decision_factors JSONB, -- Explainability
    user_id UUID -- Who made decision (ML or manual)
);
-- PostgreSQL has no IMMUTABLE table option; make the table append-only via privileges
REVOKE UPDATE, DELETE, TRUNCATE ON loan_decisions_audit FROM PUBLIC;
Phase 2: ML Model Development (Weeks 5-8)
Credit scoring model:
- Training data: 5,000 historical loans (18 months)
- Features: 180 features (demo, CIC, telecom, behavioral)
- Algorithm: LightGBM (best performance after testing XGBoost, Random Forest, Logistic Regression)
- Performance:
- AUC: 0.78 (up from 0.65 with CIC score alone)
- Precision@90% recall: 0.72 (acceptable trade-off)
Feature importance insights:
- CIC score: 28% (still most important if available)
- Debt-to-income: 15%
- Telecom tenure: 11% (strong signal!)
- Age: 8%
- Business type (for SME): 7%
Explainability (regulatory requirement):
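import shap
# SHAP values for interpretability
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
expected_value = explainer.expected_value
# Depending on the SHAP version, a binary classifier may yield one array per class;
# in that case keep the values for the positive class
if isinstance(shap_values, list):
    shap_values = shap_values[1]
    expected_value = expected_value[1]
# For a specific loan application
loan_application = X_test.iloc[0]
shap.force_plot(expected_value, shap_values[0], loan_application)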
# Export explanation for audit
explanation = {
"loan_id": "abc-123",
"credit_score": 0.72,
"top_positive_factors": [
{"feature": "cic_score", "value": 680, "impact": +0.15},
{"feature": "telecom_tenure_months", "value": 36, "impact": +0.08},
{"feature": "age", "value": 35, "impact": +0.05}
],
"top_negative_factors": [
{"feature": "debt_to_income", "value": 4.5, "impact": -0.12},
{"feature": "late_payments_12m", "value": 2, "impact": -0.08}
],
"decision": "MANUAL_REVIEW"
}
Fraud detection model:
- Separate model for first-party fraud detection
- Features: Application patterns, device fingerprinting, duplicate detection
- Reduced fraud rate from 8% to 2%
Phase 3: Automation & Deployment (Weeks 9-12)
Automated decisioning workflow:
- Customer submits application (mobile app)
- Fraud check (100ms): Pass/Review/Decline
- eKYC (30 seconds): Face + ID verification
- Data collection (2 minutes):
- Fetch CIC report (API)
- Bank statement upload → OCR → Extract income
- Fetch telecom data
- Feature calculation (10 seconds): 180 features
- ML credit scoring (100ms): Score 0.0-1.0
- Decision rules:
- Score >0.7 → Auto-approve (45% of applications)
- Score 0.4-0.7 → Manual review queue (40%)
- Score <0.4 → Auto-decline (15%)
- Loan offer (if approved): Amount, rate, terms
- Disbursement (customer accepts): Transfer to bank account
Total time for auto-approved loans: ~2 hours (mostly waiting for customer to upload documents)
Manual review queue:
- Credit officers review 40% of applications
- See ML score, explanation, all data in one dashboard
- Decision time: 2-4 hours (vs 3 days before)
Results After 6 Months
| Metric | Before | After | Change |
|---|---|---|---|
| Avg approval time | 3 days | 2 hours (auto) / 6 hours (manual) | -92% / -83% |
| Applications processed/day | 50 | 200 | +300% |
| Auto-approval rate | 0% | 45% | New |
| Default rate | 12% | 10.2% | -15% |
| Cost per loan processed | 450k VND | 120k VND | -73% |
| Customer satisfaction | 3.2/5 | 4.6/5 | +44% |
| SBV reporting time | 2 days/quarter | 2 hours (automated) | -95% |
Financial impact (annual):
- Operational savings: 2,400 loans/month × 330k saved × 12 = ~9.5B VND
- Revenue increase (4x throughput): New AUM 800B VND, additional interest ~40B VND/year
- Default reduction (12% → 10.2%): Saved ~14B VND in potential losses
ROI: Data Platform investment ~1.5B VND → Payback in <2 months
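The payback claim can be checked with the figures listed above. A short sketch of the arithmetic, using only those numbers (the function name is illustrative):

```python
def payback_months(investment: float, annual_benefit: float) -> float:
    """Months needed for the monthly benefit to cover the investment."""
    return investment / (annual_benefit / 12)

operational_savings = 2_400 * 330_000 * 12  # 9,504,000,000 VND per year
additional_interest = 40e9                  # VND per year
default_savings = 14e9                      # VND per year
investment = 1.5e9                          # VND

total_benefit = operational_savings + additional_interest + default_savings
savings_only_payback = payback_months(investment, operational_savings)
full_payback = payback_months(investment, total_benefit)
```

Counting operational savings alone, payback takes about 1.9 months, which matches the "<2 months" figure; including the revenue and default effects brings it well under one month.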
Key Learnings
What worked:
- Alternative data (telecom) is a game-changer for thin-file customers
- ML explainability is non-negotiable for regulated industries
- Automated decisioning frees up credit team for high-value, complex cases
- Real-time fraud detection prevents majority of fraud attempts
Challenges:
- Data quality: Bank statement OCR accuracy only 85% → Manual review needed
- CIC API downtime: 5-10% of queries fail → Need retry logic, fallback
- Model drift: Performance degraded 3 months after launch → Monthly retraining implemented
- Customer trust: Some customers uncomfortable with "algorithm decides" → Education, transparency helped
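The retry-and-fallback handling mentioned for CIC API failures can be as simple as exponential backoff around the call. A minimal sketch: `fetch` stands in for whatever function calls the bureau API, all names and delays are hypothetical, and production code should catch only the specific network or timeout errors rather than every exception.

```python
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=0.5, fallback=None, sleep=time.sleep):
    """Call fetch() up to max_attempts times with exponential backoff.

    Returns fallback if every attempt raises.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:  # narrow this to network/timeout errors in production
            if attempt == max_attempts - 1:
                break
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return fallback
```

When the fallback is returned, the application can be scored without bureau features or routed to manual review, so a bureau outage slows decisions instead of blocking them.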
Compliance Checklist: SBV, PDPA, AML
Data residency (Circular 47/2020):
- Core data stored in Vietnam cloud providers
- Or international cloud with Vietnam region + Vietnamese entity contract
- Disaster recovery backups can be offshore (encrypted)
Encryption:
- TLS 1.3 for all communications
- Database encryption at rest (AES-256)
- Field-level encryption for PII (ID numbers, account numbers)
- Key rotation policy (every 90 days)
Access control:
- Role-based access control (RBAC) implemented
- Row-level security for sensitive tables
- Data masking for non-production environments
- Multi-factor authentication for all staff
Audit trails:
- Immutable audit logs (append-only tables)
- All data access logged (who, what, when)
- All data changes logged (before/after values)
- 10-year retention for audit logs
PDPA compliance:
- Consent management system
- Right to access implementation (customer can download their data)
- Right to be forgotten (delete customer data on request)
- Breach notification process (<72 hours)
AML/KYC:
- KYC verification for all customers
- PEP (Politically Exposed Person) screening
- Sanctions list screening (OFAC, UN, etc.)
- Transaction monitoring rules (large transactions, suspicious patterns)
- SAR (Suspicious Activity Report) filing process
Business continuity:
- Multi-region deployment (Hà Nội + HCM)
- Automated failover (<5 minutes RTO)
- Daily backups with point-in-time recovery
- Quarterly disaster recovery drills
- Incident response plan
Reporting:
- Quarterly reports to SBV (automated)
- Security incident reporting process
- Internal audit schedule (bi-annual)
- External audit (annual for licensed entities)
Tech Stack Recommendations
Cloud infrastructure:
- Tier 1: Viettel IDC, VNPT Cloud, FPT Cloud (data residency)
- Hybrid: AWS Singapore with a VPN to the Vietnam cloud
Databases:
- Operational: PostgreSQL 15+ (proven, reliable, feature-rich)
- Time-series: TimescaleDB extension (for transaction history, metrics)
- Caching: Redis (session management, rate limiting)
Streaming:
- Apache Kafka: Event streaming, real-time pipelines
- Debezium: CDC (Change Data Capture) from PostgreSQL → Kafka
Processing:
- Batch: dbt (SQL transformations), Apache Airflow (orchestration)
- Stream: Kafka Streams, Apache Flink
ML/AI:
- Training: Python (scikit-learn, LightGBM, XGBoost)
- Serving: Docker containers, Kubernetes
- Monitoring: MLflow, Prometheus + Grafana
BI & Visualization:
- Self-hosted: Metabase, Apache Superset (cost-effective)
- Cloud: Looker, Tableau (richer features, higher cost)
Security:
- Secrets: HashiCorp Vault
- API Gateway: Kong, AWS API Gateway
- WAF: Cloudflare, AWS WAF
Conclusion: Compliance & Performance Can Coexist
Many fintechs worry that compliance requirements will slow down innovation. The case study above shows the opposite: a well-designed Data Platform both meets regulatory requirements and dramatically improves performance.
Key principles:
- Security first, not security later: Design encryption and access control in from the start
- Automate compliance: Automated audit logs and reporting → Little ongoing manual effort
- Real-time + ML = Competitive advantage: Faster, more accurate fraud detection and credit scoring
- Explainability matters: Both regulators and customer trust require transparency
- Leverage alternative data: Expand the addressable market to thin-file customers
Next steps:
- Audit your current data infrastructure: Where are the compliance gaps?
- Identify top 3 use cases: Fraud detection, credit scoring, reporting?
- Start small: 1 use case, 12-week implementation
- Contact Carptech if you need guidance (carptech.vn/contact)
References:
- Circular 47/2020/TT-NHNN - SBV Information Security
- Vietnam Personal Data Protection Decree 13/2023
- PostgreSQL Row-Level Security
- Apache Kafka for Financial Services
This article is part of the "Data Platform for Industries" series. Read more about E-commerce, Retail, and Manufacturing.