Data Platform for Fintech: Compliance, Real-time, and Risk Management

Fintech is one of the few industries where a Data Platform is not only a competitive advantage but also a compliance requirement. Under Circular 47/2020 of the State Bank of Vietnam (SBV), financial institutions must ensure auditability, traceability, and data residency, requirements that cannot be met with spreadsheets or scattered dashboards.

In practice, from working with 10+ fintech clients in Vietnam, Carptech has found that 85% of Series A fintechs hit compliance and fraud detection bottlenecks as they scale: they have a good product and clear product-market fit, but lack the data infrastructure to meet regulatory requirements and manage risk effectively.

This article walks through how to build a Data Platform for fintech, from a security-first, compliance-by-design architecture to real-time fraud detection and credit scoring. It includes a real-world case study of a lending fintech that cut approval time from 3 days to 2 hours and reduced its default rate by 15%.

TL;DR - Key Takeaways

Fintech cần real-time + compliance: 99.99% uptime, data residency in Vietnam, immutable audit logs
Security architecture: Encryption at rest & in transit, RBAC, data masking, zero-trust network
Regulatory requirements: SBV Circular 47/2020, PDPA, AML/KYC audit trails
Real-time use cases: Fraud detection (<100ms latency), transaction monitoring, credit scoring
Tech stack: Vietnam cloud providers (Viettel IDC, VNPT, FPT), Apache Kafka, Postgres/TimescaleDB, dbt
ROI: 50-70% reduction in approval time, 10-20% lower default rates, 90%+ compliance automation

Fintech Unique Requirements: Why Is It Different?

1. Compliance is Non-Negotiable

Circular 47/2020/TT-NHNN (December 2020) - SBV regulations on information system security:

Data residency:

Core financial data must be stored in Vietnam
Customer data (KYC, transactions) cannot leave Vietnam without approval
Cloud providers: Must use Vietnamese providers or international clouds with Vietnam regions

Audit requirements:

Immutable logs: All transactions, data changes must be logged and cannot be modified
Retention: Minimum 5 years for transactional data, 10 years for audit logs
Traceability: Ability to reconstruct any transaction or decision
Reporting: Quarterly reports to SBV on security incidents, data breaches

PDPA (Personal Data Protection Act):

Consent management: Track customer consent for data usage
Right to be forgotten: Ability to delete customer data on request (challenge: conflicts with retention!)
Data minimization: Only collect necessary data
Breach notification: Within 72 hours to authorities

2. Real-time is Mission-Critical

Fraud detection:

Latency requirement: <100ms from transaction initiation to fraud score
Accuracy: >99% (false positives costly, false negatives catastrophic)
Scale: Process 1000-10,000 transactions/second (peak times)

Transaction monitoring:

AML (Anti-Money Laundering) rules: Detect suspicious patterns real-time
Large transaction alerts: >50M VND require additional verification
Velocity checks: Too many transactions in short time = red flag
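A velocity check like the one above can be sketched as a sliding window over each user's recent transaction timestamps. This is a minimal in-memory illustration: the class name, the limit of 5 events, and the 60-second window are assumptions, and a production system would more likely keep this state in a stream processor or a cache.

```python
from collections import deque


class VelocityChecker:
    """Flags a user who exceeds max_events within window_seconds (sliding window)."""

    def __init__(self, max_events: int = 5, window_seconds: float = 60.0):
        self.max_events = max_events          # assumed limit, tune per product
        self.window_seconds = window_seconds  # assumed window length
        self._events: dict[str, deque] = {}

    def record(self, user_id: str, timestamp: float) -> bool:
        """Record one transaction; return True if the user is now over the limit."""
        window = self._events.setdefault(user_id, deque())
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_events
```

Calling `record` once per transaction keeps memory bounded per user, because old timestamps are evicted on every call.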

Credit decisions:

Traditional banks: 3-7 days approval time
Digital lenders: Target <24 hours, best-in-class <2 hours
Real-time components: Credit scoring, income verification, fraud check

3. Security is Paramount

Data sensitivity levels:

Tier 1 - Critical: Card numbers, CVV, account passwords → Encrypt, never log
Tier 2 - High: PII (name, ID, phone, address), account balances → Encrypt, audit all access
Tier 3 - Medium: Transaction history, credit scores → Access control, masked in non-prod
Tier 4 - Low: Aggregated analytics, anonymized data → Standard controls

Attack vectors unique to fintech:

Account takeover: Stolen credentials → multi-factor auth, behavioral biometrics
Transaction fraud: Stolen cards, synthetic identities → ML fraud models
Social engineering: Phishing, scams → Customer education, transaction confirmations
Internal fraud: Rogue employees → Strict RBAC, audit all data access

4. High Availability Requirements

Uptime SLA:

99.99% uptime = 52 minutes downtime/year
Maintenance windows: Off-peak only (2-5 AM)
Disaster recovery: RTO <4 hours, RPO <15 minutes
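The downtime figure follows directly from the SLA percentage. A small sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(uptime_sla: float) -> float:
    """Allowed downtime per year, in minutes, for an uptime SLA such as 0.9999."""
    return (1.0 - uptime_sla) * MINUTES_PER_YEAR


# 99.99% uptime leaves roughly 52.56 minutes of downtime per year.
budget = downtime_budget_minutes(0.9999)
```

The same function shows why each extra "nine" matters: 99.9% allows about ten times more downtime than 99.99%.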

Implications for Data Platform:

Multi-region deployment (Hà Nội + HCM data centers)
Active-active or active-passive replication
Automated failover
Regular DR drills

Data Sources in the Fintech Ecosystem

Core Banking System / Ledger

Traditional banks: Core banking platforms (Temenos, Oracle Flexcube, local systems)
Digital banks/fintechs: Modern alternatives

Mambu: Cloud-native core banking (SaaS)
Custom built: Node.js/Java + PostgreSQL with double-entry accounting

Data:

Accounts: account_id, customer_id, account_type (savings, loan), balance, status
Transactions: transaction_id, from_account, to_account, amount, timestamp, type
Ledger entries: Debit/credit entries with references
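The core invariant of a double-entry ledger is that the debits and credits of each transaction sum to the same amount. A minimal sketch of that check, assuming hypothetical entry fields `side` and `amount` (the article does not specify the ledger schema):

```python
from decimal import Decimal


def is_balanced(entries: list[dict]) -> bool:
    """True if total debits equal total credits for one transaction's ledger entries."""
    debits = sum((e["amount"] for e in entries if e["side"] == "debit"), Decimal("0"))
    credits = sum((e["amount"] for e in entries if e["side"] == "credit"), Decimal("0"))
    return debits == credits


# A 5,000,000 VND movement recorded as two matching entries.
transfer = [
    {"account_id": "acc-A", "side": "debit", "amount": Decimal("5000000")},
    {"account_id": "acc-B", "side": "credit", "amount": Decimal("5000000")},
]
```

`Decimal` is used instead of `float` so that monetary sums compare exactly.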

Integration:

Real-time: Event streaming (Kafka) for every transaction
Batch: Nightly dump for reconciliation, reporting

Payment Gateways

Domestic:

NAPAS: Interbank transfers, ATM network
VietQR: QR code payments
Bank APIs: Individual bank integrations (Vietcombank, Techcombank, etc.)

International:

Visa/Mastercard: Card payments
PayPal, Stripe: For international transactions (if licensed)

Data:

Payment status: initiated, processing, success, failed, reversed
Metadata: payment method, device info, IP address, location
Fees: transaction fees, currency conversion

KYC/AML Vendors

eKYC providers:

VNPT eKYC: ID card scanning, face matching
FPT.AI: Similar services
Trusting Social: Digital footprint for credit

Data:

ID verification: ID number, name, DOB, address (from government database)
Face matching score: Selfie vs ID photo
Liveness detection: Anti-spoofing (video, 3D face)

AML screening:

Dow Jones Risk & Compliance: Watchlist screening (PEP, sanctions)
Local databases: SBV blacklists

Credit Bureaus

Vietnam Credit Information Center (CIC) - SBV's credit bureau:

Credit history: Loans, payment history, defaults
Credit score: CIC score (not as sophisticated as FICO yet)
Lag: Updated monthly (not real-time)

Alternative data providers:

Telecom data: Call/SMS patterns, top-up history (proxy for stability)
E-commerce: Shopee/Lazada purchase history (spending patterns)
Social: Facebook, Zalo (controversial, privacy concerns)

App Analytics & Behavioral Data

Firebase Analytics: App usage, screens, events
Mixpanel/Amplitude: User funnels, retention
Behavioral signals: Device fingerprinting, typing patterns, location patterns

Fraud detection signals:

New device from different location = higher risk
Copy-paste password (vs typing) = credential stuffing attack
Fast form completion = bot

Architecture: Security & Compliance First

Reference Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          APPLICATION LAYER                          │
│  (Mobile app, Web app, APIs)                                       │
│  - TLS 1.3 encryption                                              │
│  - OAuth 2.0 + JWT tokens                                          │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      API GATEWAY + WAF                              │
│  - Rate limiting (1000 req/min per user)                           │
│  - DDoS protection                                                 │
│  - Request logging (audit trail)                                   │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                    ┌────────┴────────┐
                    ▼                 ▼
         ┌──────────────────┐  ┌──────────────────┐
         │   KAFKA CLUSTER  │  │  OPERATIONAL DB  │
         │  (Event Stream)  │  │   (PostgreSQL)   │
         │  - 3 brokers     │  │  - Encrypted     │
         │  - Replication=3 │  │  - Multi-AZ      │
         └────────┬─────────┘  └──────────────────┘
                  │
         ┌────────┴────────────────────┐
         ▼                             ▼
┌────────────────────┐      ┌────────────────────────┐
│  STREAM PROCESSING │      │   DATA WAREHOUSE       │
│  (Kafka Streams/   │      │   (PostgreSQL +        │
│   Flink)           │      │    TimescaleDB)        │
│  - Fraud detection │      │  - Column encryption   │
│  - Real-time agg   │      │  - Row-level security  │
│  - Alerting        │      │  - Audit logs          │
└────────────────────┘      └───────────┬────────────┘
                                        │
                                        ▼
                            ┌────────────────────────┐
                            │   ANALYTICS LAYER      │
                            │   (dbt + Metabase)     │
                            │  - Role-based access   │
                            │  - Data masking        │
                            │  - Query audit logs    │
                            └────────────────────────┘

Cloud Provider Choice: Vietnam Requirements

Option 1: Vietnamese cloud providers (Recommended for compliance)

Viettel IDC (Viettel Cloud)
VNPT Cloud
FPT Cloud
CMC Cloud

Pros:

✅ Data residency compliance by default
✅ Vietnamese support, contract in VND
✅ Government relationships (easier licensing)

Cons:

❌ Limited services vs AWS/GCP/Azure
❌ Smaller ecosystem, fewer integrations
❌ Potentially higher costs for compute/storage

Option 2: International clouds (no in-country region yet)

AWS: ap-southeast-1 (Singapore) - Planned Hanoi region
GCP: asia-southeast1 (Singapore) - No Vietnam region yet
Azure: Southeast Asia (Singapore) - No Vietnam region yet

Hybrid approach (Most common):

Tier 1 data (transactions, PII): Vietnamese cloud
Tier 4 data (aggregated analytics): International cloud
Replication: Encrypted backup to international cloud for DR

Encryption Strategy

Data at rest:

Database encryption: PostgreSQL TDE (Transparent Data Encryption)
Field-level encryption: For highly sensitive fields (card numbers, IDs)
- Algorithm: AES-256-GCM
- Key management: AWS KMS, HashiCorp Vault, or Vietnamese HSM providers
- Key rotation: Every 90 days

Data in transit:

TLS 1.3: All communications (app ↔ API ↔ database)
Certificate pinning: Mobile apps to prevent MITM attacks
VPN/Private networks: Between cloud regions

Example: Field-level encryption in PostgreSQL

-- Requires the pgcrypto extension: CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- Encrypt ID number before storing
INSERT INTO customers (customer_id, id_number_encrypted, name)
VALUES (
  gen_random_uuid(),
  pgp_sym_encrypt('001234567890', current_setting('app.encryption_key')),
  'Nguyen Van A'
);

-- Decrypt when authorized
SELECT
  customer_id,
  name,
  pgp_sym_decrypt(id_number_encrypted::bytea, current_setting('app.encryption_key')) as id_number
FROM customers
WHERE customer_id = 'xxx'
-- current_user_has_permission() is an application-defined function, not a PostgreSQL built-in
AND current_user_has_permission('view_pii') = true;

Access Control: RBAC + Attribute-Based

Roles hierarchy:

Admin: Full access (CTO, Security team) - Audit all actions
Data Analyst: Read-only, masked PII
Risk Officer: Read transaction data, write to risk tables
Customer Support: Read customer info (masked), cannot see balances/transactions
External Auditor: Read-only, specific schemas, time-limited access

Row-level security example:

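-- Policies only take effect once row-level security is enabled on the table
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

-- Customer support can only see customers they're assigned to
-- (current_user_id() is an application-defined function, not a PostgreSQL built-in)
CREATE POLICY customer_support_policy ON customers
FOR SELECT
TO customer_support_role
USING (assigned_support_id = current_user_id());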

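-- Analysts can see all customers but PII is masked
-- (WITH CHECK is not allowed on FOR SELECT policies; with row-level security enabled
-- and no INSERT/UPDATE/DELETE policy for analyst_role, writes are denied by default)
CREATE POLICY analyst_policy ON customers
FOR SELECT
TO analyst_role
USING (true);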

-- Create masked view for analysts
CREATE VIEW customers_masked AS
SELECT
  customer_id,
  CONCAT(LEFT(name, 2), '***') as name,
  CONCAT('***', RIGHT(phone, 4)) as phone,
  DATE_TRUNC('month', created_at) as created_month,  -- Hide exact date
  customer_segment
FROM customers;

GRANT SELECT ON customers_masked TO analyst_role;
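The same masking rules as the view above can be applied in application code, for example when exporting data to non-production environments. A minimal sketch (the function names are illustrative):

```python
def mask_name(name: str) -> str:
    """Keep the first two characters, mirroring CONCAT(LEFT(name, 2), '***')."""
    return name[:2] + "***"


def mask_phone(phone: str) -> str:
    """Keep the last four characters, mirroring CONCAT('***', RIGHT(phone, 4))."""
    return "***" + phone[-4:]
```

Keeping the Python and SQL rules identical avoids a situation where one path leaks more characters than the other.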

Audit Logging: Immutable & Complete

What to log:

Data access: Who queried what data, when
Data changes: All INSERT/UPDATE/DELETE with before/after values
Authentication: Login attempts (success/fail), session duration
Admin actions: Configuration changes, permission grants
API calls: All requests/responses (sanitized)

Implementation with PostgreSQL + Audit trigger:

-- Audit log table
CREATE TABLE audit_log (
  log_id BIGSERIAL PRIMARY KEY,
  table_name TEXT NOT NULL,
  operation TEXT NOT NULL,  -- INSERT, UPDATE, DELETE
  old_values JSONB,
  new_values JSONB,
  user_id UUID,
  username TEXT,
  ip_address INET,
  timestamp TIMESTAMPTZ DEFAULT NOW()
);

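-- Make it append-only for regular roles (cannot UPDATE or DELETE).
-- Note: this REVOKE does not restrict the table owner or superusers, and any role
-- that was granted these privileges directly must have them revoked as well.
REVOKE UPDATE, DELETE ON audit_log FROM PUBLIC;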

-- Audit trigger function
CREATE OR REPLACE FUNCTION audit_trigger_function()
RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO audit_log (table_name, operation, old_values, new_values, user_id, username, ip_address)
  VALUES (
    TG_TABLE_NAME,
    TG_OP,
    CASE WHEN TG_OP IN ('DELETE', 'UPDATE') THEN to_jsonb(OLD) ELSE NULL END,
    CASE WHEN TG_OP IN ('INSERT', 'UPDATE') THEN to_jsonb(NEW) ELSE NULL END,
    NULLIF(current_setting('app.current_user_id', true), '')::UUID,
    current_user,
    inet_client_addr()
  );
  RETURN NULL;  -- Return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

-- Attach to tables
CREATE TRIGGER audit_customers
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION audit_trigger_function();

Retention: Store audit logs for 10 years, archive to cold storage after 2 years
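The retention rule above can be expressed as a small tiering function that an archival job might call per log partition. This is a sketch under stated assumptions: the tier names are illustrative, and a year is approximated as 365 days.

```python
from datetime import date

HOT_YEARS = 2         # keep in the primary database for the first 2 years
RETENTION_YEARS = 10  # total retention required for audit logs


def audit_log_tier(logged_on: date, today: date) -> str:
    """Classify an audit log entry as 'hot', 'cold', or 'expired' by its age."""
    age_days = (today - logged_on).days
    if age_days < HOT_YEARS * 365:
        return "hot"
    if age_days < RETENTION_YEARS * 365:
        return "cold"
    return "expired"
```

Entries classified as "expired" are only candidates for deletion; whether they may actually be removed depends on the applicable regulation.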

Real-time Fraud Detection: <100ms Latency

Fraud Detection Pipeline

Transaction initiated
  ↓
Feature extraction (10ms)
  - Device fingerprint
  - Location (IP geolocation)
  - Transaction amount, recipient
  - User behavior (time since last login, typing speed)
  ↓
Rule-based checks (20ms)
  - Blacklist check (stolen cards, banned users)
  - Velocity rules (>5 transactions in 1 minute)
  - Amount limits (>50M VND)
  ↓
ML model scoring (50ms)
  - Fraud probability: 0.0-1.0
  - Model: XGBoost, LightGBM
  ↓
Decision (10ms)
  - Score <0.1: Approve (95% of transactions)
  - Score 0.1-0.5: Additional verification (4%)
  - Score >0.5: Decline (1%)
  ↓
Log & feedback loop (10ms)
  - Store decision, send to analytics
  - Update user profile

Total latency: ~100ms
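The per-stage timings above add up to the 100ms budget, which makes them usable as a monitoring contract. A minimal sketch that encodes the budget and reports which stages exceeded it (the stage keys and function name are illustrative):

```python
# Per-stage latency budget in milliseconds, taken from the pipeline above.
STAGE_BUDGET_MS = {
    "feature_extraction": 10,
    "rule_checks": 20,
    "ml_scoring": 50,
    "decision": 10,
    "logging": 10,
}
TOTAL_BUDGET_MS = sum(STAGE_BUDGET_MS.values())


def stages_over_budget(measured_ms: dict) -> list:
    """Return the stages whose measured latency exceeds their budget."""
    return [
        stage
        for stage, budget in STAGE_BUDGET_MS.items()
        if measured_ms.get(stage, 0) > budget
    ]
```

Tracking the budget per stage rather than only end to end shows where a regression comes from, for example a slower model after retraining.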

Feature Engineering for Fraud Detection

User features:

{
  "user_id": "uuid",
  "account_age_days": 45,
  "total_transactions": 12,
  "total_volume_30d": 15000000,  # VND
  "avg_transaction_amount": 1250000,
  "kyc_verified": True,
  "kyc_score": 0.95,
  "devices_count": 2,
  "login_locations_count": 3  # Distinct cities
}

Transaction features:

{
  "amount": 5000000,
  "amount_vs_avg_ratio": 4.0,  # 4x higher than average
  "recipient_new": True,  # Never sent to this recipient before
  "recipient_high_risk": False,  # Not in fraud database
  "time_since_last_transaction_minutes": 2,
  "transactions_last_hour": 3,  # Velocity
  "is_round_number": True,  # 5,000,000 is round; 5,123,456 is not (round = suspicious)
  "weekend_transaction": True,
  "late_night_transaction": False
}
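Several of these transaction features are simple derivations from the raw transaction and the user's history. A minimal sketch, assuming that "round" means a whole multiple of 1,000,000 VND (the function name and that definition are assumptions, not the article's):

```python
def transaction_features(
    amount: int,
    avg_amount: float,
    known_recipients: set,
    recipient_id: str,
) -> dict:
    """Derive a few fraud features from one transaction and the user's history."""
    return {
        "amount": amount,
        # How many times larger this transaction is than the user's average.
        "amount_vs_avg_ratio": round(amount / avg_amount, 2) if avg_amount else None,
        # True if the user has never sent money to this recipient before.
        "recipient_new": recipient_id not in known_recipients,
        # Assumed definition: whole multiples of 1,000,000 VND count as round.
        "is_round_number": amount % 1_000_000 == 0,
    }
```

For example, a 5,000,000 VND transfer from a user whose average is 1,250,000 VND yields a ratio of 4.0.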

Device features:

{
  "device_id": "fingerprint_hash",
  "device_new": False,
  "device_os": "Android 13",
  "ip_address": "14.xxx.xxx.xxx",
  "ip_location": "Hanoi",
  "ip_location_mismatch": False,  # vs registered address
  "vpn_detected": False,
  "emulator_detected": False
}

ML Model Training

Dataset:

Positive class (fraud): 0.1-1% of transactions
Negative class (legitimate): 99-99.9%
Class imbalance: Use SMOTE, class weights, or undersampling
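Of the options listed, class weighting is the simplest to reason about: the positive class is weighted by the ratio of negatives to positives. A minimal sketch of that ratio in plain Python (the function name is illustrative; it mirrors the value commonly passed as `scale_pos_weight` to gradient boosting libraries):

```python
def scale_pos_weight(labels: list[int]) -> float:
    """Ratio of negative to positive examples, for labels encoded as 0 and 1."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive (fraud) examples in labels")
    return negatives / positives


# With a 0.5% fraud rate, each fraud example is weighted 199x.
weight = scale_pos_weight([1] * 5 + [0] * 995)
```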

Labels:

Confirmed fraud: Chargebacks, user reports
Suspected fraud: Declined transactions, manual review
Time delay: True labels emerge 30-90 days later (chargebacks)

Model training pipeline:

import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, precision_recall_curve

# Load training data (last 6 months).
# load_transactions_with_labels is a project-specific helper, not a library function.
df = load_transactions_with_labels(start_date='2024-10-01')

# Feature engineering
features = [
  'account_age_days', 'total_transactions', 'amount', 'amount_vs_avg_ratio',
  'recipient_new', 'time_since_last_transaction_minutes', 'transactions_last_hour',
  'device_new', 'ip_location_mismatch', 'vpn_detected'
]

X = df[features]
y = df['is_fraud']

# Train/test split (temporal - no leakage)
split_date = '2025-03-01'
X_train = X[df['transaction_date'] < split_date]
y_train = y[df['transaction_date'] < split_date]
X_test = X[df['transaction_date'] >= split_date]
y_test = y[df['transaction_date'] >= split_date]

# Train LightGBM model
model = LGBMClassifier(
  n_estimators=500,
  max_depth=6,
  learning_rate=0.05,
  scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),  # Handle imbalance: negatives / positives
  random_state=42
)

model.fit(X_train, y_train)

# Evaluate
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f"AUC: {auc:.4f}")  # Target: >0.95

# Find optimal threshold (balance precision/recall)
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
# Choose the lowest threshold where precision >= 0.9, i.e. at most 10% of flagged
# transactions are false positives, then read off the recall achieved at that threshold

# Save model
model.booster_.save_model('fraud_model_v2.txt')
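The threshold selection step is only described in a comment in the pipeline above. One way to make it concrete, shown here in plain Python so the logic is visible, is to scan candidate thresholds from lowest to highest and stop at the first one that meets the precision target. The function name and the 0.9 target are assumptions for illustration.

```python
def pick_threshold(y_true, y_score, min_precision=0.9):
    """Return (threshold, precision, recall) for the lowest threshold whose
    precision is at least min_precision, or None if no threshold qualifies.

    Recall never increases as the threshold rises, so the lowest qualifying
    threshold is also the one with the highest recall.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        return None
    for t in sorted(set(y_score)):
        # Labels of all transactions that would be flagged at threshold t.
        flagged = [y for y, s in zip(y_true, y_score) if s >= t]
        precision = sum(flagged) / len(flagged)
        if precision >= min_precision:
            return t, precision, sum(flagged) / total_pos
    return None
```

This scan is quadratic in the number of examples, which is fine for a sketch; on real data the arrays returned by `precision_recall_curve` give the same information more efficiently.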

Model deployment:

Containerize: Docker image with the model file
Deploy: Kubernetes cluster (auto-scaling)
A/B testing: Shadow mode → 10% traffic → 100% rollout
Monitoring: Latency, prediction distribution, false positive rate
Retraining: Monthly with new fraud labels
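The staged rollout (shadow mode, then 10% of traffic, then 100%) needs a stable way to decide which users see the new model. A common approach, sketched here under the assumption that routing is done per user, is to hash the user ID into one of 100 buckets (the function names are illustrative):

```python
import hashlib


def rollout_bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def use_new_model(user_id: str, rollout_percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return rollout_bucket(user_id) < rollout_percent
```

Because the bucket depends only on the user ID, a user who is in the 10% cohort stays in the cohort when the rollout widens to 100%.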

Rule-based Layer: Fast & Interpretable

ML models are powerful but "black box". Combine them with rule-based checks for transparency and regulatory compliance.

Example rules:

def rule_based_fraud_check(transaction, user):
    # Rule 1: Blacklist
    if transaction['recipient_id'] in BLACKLIST:
        return {"decision": "DECLINE", "reason": "Recipient blacklisted"}

    # Rule 2: Large transactions
    if transaction['amount'] > 50_000_000:  # >50M VND
        return {"decision": "REVIEW", "reason": "Large transaction"}

    # Rule 3: Velocity - too many transactions
    if user['transactions_last_hour'] > 5:
        return {"decision": "DECLINE", "reason": "Velocity exceeded"}

    # Rule 4: New device + large amount
    if transaction['device_new'] and transaction['amount'] > 10_000_000:
        return {"decision": "REVIEW", "reason": "New device, large amount"}

    # Rule 5: Location mismatch
    if transaction['ip_location'] != user['registered_city']:
        if transaction['amount'] > 5_000_000:
            return {"decision": "REVIEW", "reason": "Location mismatch"}

    return {"decision": "PASS", "reason": None}

Benefit: Explainability for customers ("We declined because ...") and auditors.
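The article does not spell out how the rule result and the ML score are merged. One plausible policy, sketched here as an assumption, is to take the stricter of the two decisions and keep the rule's reason whenever the rule is at least as strict. The score cutoffs reuse the 0.1 and 0.5 thresholds from the pipeline above.

```python
SEVERITY = {"PASS": 0, "REVIEW": 1, "DECLINE": 2}


def combine(rule_result: dict, fraud_score: float) -> dict:
    """Merge a rule-based result with an ML fraud score, keeping the stricter one."""
    if fraud_score > 0.5:
        ml_decision = "DECLINE"
    elif fraud_score >= 0.1:
        ml_decision = "REVIEW"
    else:
        ml_decision = "PASS"
    # Prefer the rule's result on ties so the reason stays human-readable.
    if SEVERITY[rule_result["decision"]] >= SEVERITY[ml_decision]:
        return rule_result
    return {"decision": ml_decision, "reason": f"ML fraud score {fraud_score:.2f}"}
```

Preferring the rule's reason on ties keeps customer-facing explanations tied to a named rule rather than to a model score.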

Credit Scoring: Alternative Data + ML

Traditional Credit Scoring (CIC Score)

Vietnam Credit Information Center (CIC) score:

Range: Not publicly disclosed (unlike FICO's 300-850)
Based on: Loan history, payment behavior, defaults
Limitations:
- Thin-file problem: 60-70% of adults have no credit history
- Monthly update lag: Not real-time
- Limited coverage: Formal banking only

Alternative Data Approach

Telecom data (with customer consent):

Call/SMS patterns: Stable pattern = stable life
Top-up frequency: Regular top-ups = income regularity
Network quality: Premium number = higher income segment

E-commerce data:

Purchase history: Categories (essentials vs luxury)
Delivery addresses: Stability (same address for 6 months)
Payment method: Credit card vs COD (credit card users lower default)

App behavioral data:

Form completion time: Too fast = potential fraud
Midnight applications: Higher risk (desperation signal)
Loan amount requested: Exactly round numbers = less thought, higher risk

Credit Scoring Model

Target variable:

Binary: Default (yes/no) - default = >60 days overdue
Multi-class: Risk tiers (A, B, C, D, E)

Features (200+ features):

{
  # Demographics
  "age": 32,
  "gender": "M",
  "marital_status": "married",
  "education": "university",
  "city": "Hanoi",

  # Employment
  "employment_type": "full_time",
  "company_type": "private",
  "months_at_job": 24,
  "monthly_income": 15000000,  # Self-declared

  # Loan request
  "loan_amount": 30000000,
  "loan_term_months": 12,
  "loan_purpose": "motorbike",
  "amount_to_income_ratio": 2.0,

  # CIC data (if available)
  "cic_score": 650,
  "existing_loans": 1,
  "total_debt": 50000000,
  "debt_to_income": 3.3,
  "late_payments_12m": 0,

  # Alternative data - Telecom
  "telecom_tenure_months": 36,
  "avg_monthly_topup": 200000,
  "topup_regularity": 0.85,  # % of months with topup

  # Alternative data - E-commerce
  "ecommerce_orders_6m": 8,
  "ecommerce_spend_6m": 5000000,
  "delivery_address_changes": 0,

  # Behavioral
  "application_hour": 14,  # 2 PM
  "application_completion_time_seconds": 420,
  "device_type": "Android",
  "form_fields_edited": 3  # How many times user changed answers
}

Model:

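import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Training data: historical loans with outcomes
# (df_train and features are assumed to be prepared upstream; categorical columns
# such as gender or city must be numerically encoded first)
X_train = df_train[features]
y_train = df_train['defaulted']  # 0 or 1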

# Train model
model = GradientBoostingClassifier(
  n_estimators=300,
  max_depth=5,
  learning_rate=0.1,
  random_state=42
)

model.fit(X_train, y_train)

# Cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# Target: >0.75 (with alternative data, 0.72-0.78 is typical)

# Feature importance
feature_importance = pd.DataFrame({
  'feature': features,
  'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print(feature_importance.head(10))

Typical important features:

CIC score (if available) - 25%
Debt-to-income ratio - 12%
Months at job - 8%
Telecom tenure - 7%
Age - 6%
Loan amount to income - 5%

Credit Decision Workflow

Application submitted
  ↓
Fraud check (ML model)
  ↓ (Pass)
eKYC verification
  ↓ (Success)
Income verification
  - Bank statement upload
  - Or telecom data proxy
  ↓
Credit scoring (ML model)
  - Score: 0.0 (high risk) to 1.0 (low risk)
  ↓
Decision rules
  - Score >0.7 → Approve automatically (40-50% of applications)
  - Score 0.4-0.7 → Manual review (30-40%)
  - Score <0.4 → Decline (10-20%)
  ↓
Loan offer
  - Approved amount (may be lower than requested)
  - Interest rate (risk-based pricing)
  - Term options
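The decision rules in this workflow map a score to one of three outcomes. A minimal sketch follows; it assumes the score is derived as one minus the model's predicted default probability, since the model is trained on `defaulted` while the workflow treats 1.0 as low risk. That conversion is an assumption and is not stated in the article.

```python
def score_from_default_probability(p_default: float) -> float:
    """Convert a predicted default probability into a score where 1.0 is low risk."""
    return 1.0 - p_default


def credit_decision(score: float) -> str:
    """Apply the workflow's thresholds: >0.7 approve, 0.4-0.7 review, <0.4 decline."""
    if score > 0.7:
        return "AUTO_APPROVE"
    if score >= 0.4:
        return "MANUAL_REVIEW"
    return "DECLINE"
```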

Approval time:

Automated approvals: <2 hours (eKYC + scoring + bank account creation)
Manual review: 4-24 hours
Average: ~6-8 hours (vs 3-7 days traditional banks)

Case Study: Lending Fintech - 2-Hour Approval, 15% Lower Default

Background:

Company: Digital lending platform, focus on SME loans
AUM: ~200B VND
Problem:
- Manual credit assessment: 3 days average
- Default rate: 12% (industry ~8-10%)
- Scalability bottleneck: Credit team can only process 50 loans/day

Pain points:

Excel-based credit scoring: Inconsistent, not scalable
Fragmented data: CIC report (PDF), bank statements (PDF), application form (Google Forms)
No fraud detection: Losing 5-8% to first-party fraud (intentional defaults)
Compliance burden: Manual preparation of SBV reports (2 days/quarter)

Solution: Data Platform + ML Decisioning (12 weeks implementation)

Phase 1: Data Infrastructure (Weeks 1-4)

Cloud setup:

Viettel Cloud (VTC Cloud): Compute, storage (compliance requirement)
Kubernetes: Container orchestration
PostgreSQL: Operational database (TimescaleDB extension for time-series)
Apache Kafka: Event streaming

Data integrations:

Core banking API: Real-time loan creation, status updates
eKYC API (VNPT): ID verification, face matching
Bank statement OCR: Extract transactions from PDFs using ML (Tesseract + custom model)
CIC API: Fetch credit reports
Telecom data (partnership with MobiFone): Top-up, usage patterns

Data warehouse:

-- Loans table
CREATE TABLE loans (
  loan_id UUID PRIMARY KEY,
  customer_id UUID NOT NULL,
  application_date TIMESTAMPTZ,
  approved_date TIMESTAMPTZ,
  amount NUMERIC(15,2),
  term_months INT,
  interest_rate NUMERIC(5,2),
  status TEXT,  -- pending, approved, disbursed, closed, defaulted
  default_date TIMESTAMPTZ,
  ml_credit_score NUMERIC(3,2),
  approval_method TEXT  -- auto, manual
);

-- Customer features (for ML)
CREATE TABLE customer_features (
  customer_id UUID NOT NULL,
  snapshot_date DATE NOT NULL,
  age INT,
  monthly_income NUMERIC(15,2),
  debt_to_income NUMERIC(5,2),
  cic_score INT,
  telecom_tenure_months INT,
  -- ... 200+ features
  PRIMARY KEY (customer_id, snapshot_date)  -- One row per customer per snapshot
);

-- Audit log (compliance)
CREATE TABLE loan_decisions_audit (
  decision_id UUID PRIMARY KEY,
  loan_id UUID,
  timestamp TIMESTAMPTZ DEFAULT NOW(),
  decision TEXT,  -- approve, decline, review
  credit_score NUMERIC(3,2),
  decision_factors JSONB,  -- Explainability
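  user_id UUID  -- Who made decision (ML or manual)
);

-- Make the table append-only: PostgreSQL has no IMMUTABLE table constraint,
-- so revoke UPDATE/DELETE instead
REVOKE UPDATE, DELETE ON loan_decisions_audit FROM PUBLIC;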

Phase 2: ML Model Development (Weeks 5-8)

Credit scoring model:

Training data: 5,000 historical loans (18 months)
Features: 180 features (demo, CIC, telecom, behavioral)
Algorithm: LightGBM (best performance after testing XGBoost, Random Forest, Logistic Regression)
Performance:
- AUC: 0.78 (up from 0.65 with CIC score alone)
- Precision@90% recall: 0.72 (acceptable trade-off)

Feature importance insights:

CIC score: 28% (still most important if available)
Debt-to-income: 15%
Telecom tenure: 11% (strong signal!)
Age: 8%
Business type (for SME): 7%

Explainability (regulatory requirement):

import shap

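# SHAP values for interpretability
explainer = shap.TreeExplainer(model)
# Depending on the SHAP version, shap_values for a binary classifier may be a single
# array or a list with one array per class; adjust the indexing below accordingly
shap_values = explainer.shap_values(X_test)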

# For a specific loan application
loan_application = X_test.iloc[0]
shap.force_plot(explainer.expected_value, shap_values[0], loan_application)

# Export explanation for audit
explanation = {
  "loan_id": "abc-123",
  "credit_score": 0.72,
  "top_positive_factors": [
    {"feature": "cic_score", "value": 680, "impact": +0.15},
    {"feature": "telecom_tenure_months", "value": 36, "impact": +0.08},
    {"feature": "age", "value": 35, "impact": +0.05}
  ],
  "top_negative_factors": [
    {"feature": "debt_to_income", "value": 4.5, "impact": -0.12},
    {"feature": "late_payments_12m", "value": 2, "impact": -0.08}
  ],
  "decision": "MANUAL_REVIEW"
}
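An audit record in this shape can be assembled from per-feature contributions by sorting them into positive and negative groups. A minimal sketch, assuming the contributions are already available as a feature-to-impact mapping (for example from SHAP values); the function name is illustrative:

```python
def build_explanation(contributions: dict, top_n: int = 3) -> dict:
    """Split per-feature contributions into the strongest positive and negative factors."""
    positive = sorted(
        (kv for kv in contributions.items() if kv[1] > 0), key=lambda kv: -kv[1]
    )
    negative = sorted(
        (kv for kv in contributions.items() if kv[1] < 0), key=lambda kv: kv[1]
    )
    return {
        "top_positive_factors": [{"feature": f, "impact": v} for f, v in positive[:top_n]],
        "top_negative_factors": [{"feature": f, "impact": v} for f, v in negative[:top_n]],
    }
```

Storing this structure in the `decision_factors` column gives auditors the same view of a decision that the credit officer saw.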

Fraud detection model:

Separate model for first-party fraud detection
Features: Application patterns, device fingerprinting, duplicate detection
Reduced fraud rate from 8% to 2%

Phase 3: Automation & Deployment (Weeks 9-12)

Automated decisioning workflow:

Customer submits application (mobile app)
Fraud check (100ms): Pass/Review/Decline
eKYC (30 seconds): Face + ID verification
Data collection (2 minutes):
- Fetch CIC report (API)
- Bank statement upload → OCR → Extract income
- Fetch telecom data
Feature calculation (10 seconds): 180 features
ML credit scoring (100ms): Score 0.0-1.0
Decision rules:
- Score >0.7 → Auto-approve (45% of applications)
- Score 0.4-0.7 → Manual review queue (40%)
- Score <0.4 → Auto-decline (15%)
Loan offer (if approved): Amount, rate, terms
Disbursement (customer accepts): Transfer to bank account

Total time for auto-approved loans: ~2 hours (mostly waiting for customer to upload documents)

Manual review queue:

Credit officers review 40% of applications
See ML score, explanation, all data in one dashboard
Decision time: 2-4 hours (vs 3 days before)

Results After 6 Months

Metric	Before	After	Change
Avg approval time	3 days	2 hours (auto) / 6 hours (manual)	-97% / -92%
Applications processed/day	50	200	+300%
Auto-approval rate	0%	45%	New
Default rate	12%	10.2%	-15%
Cost per loan processed	450k VND	120k VND	-73%
Customer satisfaction	3.2/5	4.6/5	+44%
SBV reporting time	2 days/quarter	2 hours (automated)	-95%

Financial impact (annual):

Operational savings: 2,400 loans/month × 330k saved × 12 = ~9.5B VND
Revenue increase (4x throughput): New AUM 800B VND, additional interest ~40B VND/year
Default reduction (12% → 10.2%): Saved ~14B VND in potential losses

ROI: Data Platform investment ~1.5B VND → Payback in <2 months
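The financial impact figures can be checked with a few lines of arithmetic. The inputs below are the numbers quoted above; only the variable names are new.

```python
# Operational savings: loans per month x saving per loan x 12 months.
loans_per_month = 2_400
saving_per_loan = 450_000 - 120_000  # cost per loan before minus after, in VND
operational_savings = loans_per_month * saving_per_loan * 12

# Other annual benefits quoted in the case study, in VND.
additional_interest = 40_000_000_000
avoided_losses = 14_000_000_000
investment = 1_500_000_000

annual_benefit = operational_savings + additional_interest + avoided_losses
payback_months = investment / (annual_benefit / 12)
```

On these figures the operational savings come to about 9.5B VND and the payback period is well under the stated 2 months.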

Key Learnings

What worked:

Alternative data (telecom) is a game-changer for thin-file customers
ML explainability is non-negotiable for regulated industries
Automated decisioning frees up credit team for high-value, complex cases
Real-time fraud detection prevents majority of fraud attempts

Challenges:

Data quality: Bank statement OCR accuracy only 85% → Manual review needed
CIC API downtime: 5-10% of queries fail → Need retry logic, fallback
Model drift: Performance degraded 3 months after launch → Monthly retraining implemented
Customer trust: Some customers uncomfortable with "algorithm decides" → Education, transparency helped
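The retry-and-fallback handling for the CIC API could look roughly like the sketch below. It assumes failures surface as `ConnectionError`, and the function and parameter names are illustrative; the real client and its error types are not described in the article.

```python
import time


def fetch_with_retry(fetch, retries: int = 3, base_delay: float = 0.5,
                     fallback=None, sleep=time.sleep):
    """Call fetch() up to `retries` times with exponential backoff.

    Returns the fallback value if every attempt raises ConnectionError.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except ConnectionError:
            # Wait 0.5s, 1s, 2s, ... between attempts, but not after the last one.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback
```

A natural fallback is to score the application without CIC features, or route it to manual review, rather than reject it because of an upstream outage.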

Compliance Checklist: SBV, PDPA, AML

Data residency (Circular 47/2020):

Core data stored in Vietnam cloud providers
Or international cloud with Vietnam region + Vietnamese entity contract
Disaster recovery backups can be offshore (encrypted)

Encryption:

TLS 1.3 for all communications
Database encryption at rest (AES-256)
Field-level encryption for PII (ID numbers, account numbers)
Key rotation policy (every 90 days)

Access control:

Role-based access control (RBAC) implemented
Row-level security for sensitive tables
Data masking for non-production environments
Multi-factor authentication for all staff

Audit trails:

Immutable audit logs (append-only tables)
All data access logged (who, what, when)
All data changes logged (before/after values)
10-year retention for audit logs

PDPA compliance:

Consent management system
Right to access implementation (customer can download their data)
Right to be forgotten (delete customer data on request)
Breach notification process (<72 hours)

AML/KYC:

KYC verification for all customers
PEP (Politically Exposed Person) screening
Sanctions list screening (OFAC, UN, etc.)
Transaction monitoring rules (large transactions, suspicious patterns)
SAR (Suspicious Activity Report) filing process

Business continuity:

Multi-region deployment (Hà Nội + HCM)
Automated failover (<5 minutes RTO)
Daily backups with point-in-time recovery
Quarterly disaster recovery drills
Incident response plan

Reporting:

Quarterly reports to SBV (automated)
Security incident reporting process
Internal audit schedule (bi-annual)
External audit (annual for licensed entities)

Tech Stack Recommendations

Cloud infrastructure:

Tier 1: Viettel IDC, VNPT Cloud, FPT Cloud (data residency)
Hybrid: AWS Singapore with VPN to Vietnam cloud

Databases:

Operational: PostgreSQL 15+ (proven, reliable, feature-rich)
Time-series: TimescaleDB extension (for transaction history, metrics)
Caching: Redis (session management, rate limiting)

Streaming:

Apache Kafka: Event streaming, real-time pipelines
Debezium: CDC (Change Data Capture) from PostgreSQL → Kafka

Processing:

Batch: dbt (SQL transformations), Apache Airflow (orchestration)
Stream: Kafka Streams, Apache Flink

ML/AI:

Training: Python (scikit-learn, LightGBM, XGBoost)
Serving: Docker containers, Kubernetes
Monitoring: MLflow, Prometheus + Grafana

BI & Visualization:

Self-hosted: Metabase, Apache Superset (cost-effective)
Cloud: Looker, Tableau (richer features, higher cost)

Security:

Secrets: HashiCorp Vault
API Gateway: Kong, AWS API Gateway
WAF: Cloudflare, AWS WAF

Conclusion: Compliance & Performance Can Coexist

Many fintechs worry that compliance requirements will slow down innovation. The case study above shows the opposite: a well-designed Data Platform both meets regulatory requirements and dramatically improves performance.

Key principles:

Security first, not security later: Design encryption and access control in from the start
Automate compliance: Automated audit logs and reporting → No ongoing manual effort
Real-time + ML = Competitive advantage: Faster, more accurate fraud detection and credit scoring
Explainability matters: Regulators and customer trust both require transparency
Leverage alternative data: Expand the addressable market to thin-file customers

Next steps:

Audit your current data infrastructure: Where are the compliance gaps?
Identify top 3 use cases: Fraud detection, credit scoring, reporting?
Start small: 1 use case, 12-week implementation
Contact Carptech if you need guidance (carptech.vn/contact)

References:

This article is part of the "Data Platform for Industries" series. Read more about E-commerce, Retail, and Manufacturing.

Carptech - Data Platform Solutions for Vietnamese Enterprises. Contact us for a free consultation.
