Manufacturing in Vietnam is transitioning from Industry 3.0 (basic automation) to Industry 4.0 (smart factories with IoT, AI, and data-driven decisions). However, according to a Carptech survey of 30+ factories (electronics, automotive parts, F&B, textiles), only 15% can calculate OEE in real time, and 72% still run maintenance on a fixed schedule rather than predictively - resulting in 30-40% unplanned downtime, high maintenance costs, and unstable quality.
This article is a detailed guide to building a Data Platform for manufacturing, covering IoT integration architecture, real-time OEE calculation, predictive maintenance models, quality prediction, and supply chain optimization. It includes a case study of an electronics manufacturer that cut unplanned downtime by 40% and saved 2B VND/year.
TL;DR - Key Takeaways
- Industry 4.0 data sources: IoT sensors, MES, ERP, SCADA, QMS - integrated via real-time + batch pipelines
- OEE (Overall Equipment Effectiveness): Availability × Performance × Quality - the most important KPI in manufacturing
- Predictive maintenance: ML models predict failures 3-7 days in advance → Reduce unplanned downtime 30-50%
- Architecture: Edge computing + time-series DB (InfluxDB, TimescaleDB) + cloud warehouse
- ROI: 15% OEE improvement = $2M/year for a medium factory, 25-30% maintenance cost reduction
- Implementation: Start with 1 production line (pilot), 12-16 weeks, scale to plant-wide
Manufacturing Data Landscape: IT/OT Convergence
1. IoT Sensors & SCADA - Operational Technology (OT)
SCADA (Supervisory Control and Data Acquisition):
- Real-time monitoring & control của production equipment
- Data: Temperature, pressure, RPM, vibration, current draw
- Protocols: Modbus, OPC UA, MQTT
IoT sensor types:
| Sensor Type | Measures | Use Case | Frequency |
|---|---|---|---|
| Temperature | °C | Overheat detection, process control | 1-10 sec |
| Vibration | Frequency, amplitude | Bearing wear, imbalance detection | 0.1-1 sec |
| Pressure | PSI, bar | Hydraulic/pneumatic systems | 1-10 sec |
| Current | Amperage | Motor load, power consumption | 1 sec |
| Proximity | On/off | Part presence detection | Event-based |
| Vision | Image/video | Quality inspection, defect detection | Per unit |
Data volume:
- 1 production line với 20 machines, 100 sensors total
- Sample rate: 1 Hz (1 sample/second)
- Data points: 100 sensors × 3600 seconds × 24 hours = 8.6M data points/day
- Storage: ~150-200 MB/day uncompressed, ~20-30 MB compressed
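A quick back-of-the-envelope check of those figures (the ~20 bytes/point and 10:1 compression ratio are assumptions for illustration):
sensors = 100
samples_per_second = 1            # 1 Hz
seconds_per_day = 3600 * 24
points_per_day = sensors * samples_per_second * seconds_per_day
print(f"{points_per_day:,} data points/day")   # 8,640,000
# Rough storage estimate: ~20 bytes per raw point (timestamp, machine/sensor ids, value)
# and ~10:1 compression in the time-series DB -- both figures are assumptions
raw_mb = points_per_day * 20 / 1e6
print(f"~{raw_mb:.0f} MB/day uncompressed, ~{raw_mb / 10:.0f} MB/day compressed")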
Challenge: Legacy machines không có sensors
- Brownfield factories: 50-80% machines là older models (10-20 years)
- Solutions:
- Retrofit sensors: Add vibration, temperature sensors externally
- Smart plugs: Monitor power consumption (proxy for machine state)
- Manual logging: Operators enter downtime reasons via tablet
2. MES (Manufacturing Execution System)
MES functions:
- Production scheduling: What to produce, when, on which line
- Work order tracking: Status of each job (queued, in-progress, completed)
- Material consumption: Raw materials used per batch
- Operator assignments: Who operated which machine, shift info
- Downtime logging: When machine stopped, reason, duration
MES platforms: SAP MES, Siemens Opcenter, AVEVA MES, local Vietnamese systems
Data structure:
-- Work orders
CREATE TABLE work_orders (
work_order_id VARCHAR(50) PRIMARY KEY,
product_id VARCHAR(50),
quantity_planned INT,
quantity_produced INT,
production_line_id VARCHAR(20),
scheduled_start TIMESTAMPTZ,
actual_start TIMESTAMPTZ,
actual_end TIMESTAMPTZ,
status VARCHAR(20) -- queued, in_progress, completed, cancelled
);
-- Downtime events
CREATE TABLE downtime_events (
downtime_id SERIAL PRIMARY KEY,
machine_id VARCHAR(50),
start_time TIMESTAMPTZ,
end_time TIMESTAMPTZ,
duration_minutes INT,
downtime_category VARCHAR(50), -- planned, unplanned
downtime_reason VARCHAR(100), -- breakdown, changeover, no_material, no_operator
operator_id VARCHAR(50)
);
3. ERP (Enterprise Resource Planning)
ERP modules relevant for manufacturing:
- Materials management: Inventory levels, procurement
- Production planning: Master production schedule (MPS)
- Quality management: Inspection results, non-conformance reports
- Maintenance management: Work orders, spare parts inventory
ERP platforms: SAP, Oracle, Microsoft Dynamics, Odoo
Integration with Data Platform: Batch sync daily/hourly (ERP not real-time)
4. QMS (Quality Management System)
Quality data:
- Inspection results: First-piece inspection, in-process, final inspection
- Defect tracking: Defect types, counts, root causes
- Statistical Process Control (SPC): Control charts, Cpk values
- Customer returns: RMA (Return Merchandise Authorization), failure analysis
Defect categorization:
CREATE TABLE quality_inspections (
inspection_id SERIAL PRIMARY KEY,
work_order_id VARCHAR(50),
inspection_type VARCHAR(50), -- first_piece, in_process, final
timestamp TIMESTAMPTZ,
inspector_id VARCHAR(50),
sample_size INT,
defects_found INT,
defect_types JSONB, -- {"scratch": 2, "dimension_out_of_spec": 1}
pass_fail VARCHAR(10)
);
5. Supply Chain Data
Inbound:
- Supplier deliveries: On-time delivery rate, lead times, quality issues
- Raw material inventory: Stock levels, consumption rates
Outbound:
- Finished goods inventory: Warehouse stock
- Shipments: Order fulfillment, delivery performance
Metrics:
- Supplier OTD (On-Time Delivery): % deliveries arriving on scheduled date
- Lead time: Days from PO to goods receipt
- Inventory days: (Inventory value) / (Daily consumption cost)
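A minimal pandas sketch of these three metrics, assuming a deliveries DataFrame with po_date, promised_date, and actual_date columns (all names hypothetical):
import pandas as pd

def supply_chain_metrics(deliveries: pd.DataFrame,
                         inventory_value: float,
                         daily_consumption_cost: float) -> dict:
    """deliveries: one row per PO, with po_date, promised_date, actual_date (datetime)."""
    on_time_rate = (deliveries['actual_date'] <= deliveries['promised_date']).mean()
    avg_lead_time = (deliveries['actual_date'] - deliveries['po_date']).dt.days.mean()
    return {
        'supplier_otd_pct': round(on_time_rate * 100, 1),
        'avg_lead_time_days': round(avg_lead_time, 1),
        'inventory_days': round(inventory_value / daily_consumption_cost, 1),
    }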
Architecture: Edge-to-Cloud for Manufacturing
Edge Layer: Real-time Processing tại Factory Floor
Why edge computing?
- Latency: Real-time control requires <10ms response (cloud round-trip ~100-500ms)
- Reliability: Factory must operate even if internet connection down
- Bandwidth: Sending 8M+ data points/day to cloud = expensive, slow
- Security: Sensitive production data stays on-premise
Edge architecture:
Production Machines
↓ (Modbus, OPC UA)
Edge Gateway (Industrial PC)
- Collect sensor data
- Filter, aggregate (only send anomalies/summaries to cloud)
- Local dashboards (OEE, line status)
- Temporary storage (24-hour buffer)
↓ (MQTT, Kafka)
Local Time-Series Database
- InfluxDB or TimescaleDB
- Store high-frequency data (1 week retention)
- Used for real-time analytics
Edge gateway hardware:
- Raspberry Pi 4 (low-cost, $100): Good for pilot projects
- Industrial PCs (Advantech, Siemens): Ruggedized, $500-2000
- Siemens IoT2050: Designed for factory environments, ~$300
Software stack:
- Node-RED: Visual programming for IoT flows (easy for OT teams)
- Telegraf: Data collection agent (plugins for Modbus, OPC UA)
- InfluxDB: Time-series database
- Grafana: Real-time dashboards
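In production this collect-aggregate-forward loop typically lives in Node-RED or Telegraf, but a minimal Python sketch makes the pattern concrete. The REST endpoint, payload shape, and 60-second aggregation window below are assumptions, not a specific product API:
import random
import statistics
import time
import requests  # assumes the cloud ingestion layer also exposes a simple REST endpoint

CLOUD_INGEST_URL = "https://ingest.example.com/v1/sensor-aggregates"  # placeholder URL
WINDOW_SECONDS = 60

def read_sensor() -> float:
    """Stand-in for a Modbus/OPC UA read -- swap in the real driver call."""
    return 60.0 + random.random()   # e.g. a temperature reading in °C

def run_gateway_loop(machine_id: str, sensor_id: str) -> None:
    while True:
        window = []
        for _ in range(WINDOW_SECONDS):
            window.append(read_sensor())
            time.sleep(1)                     # ~1 Hz sampling
        payload = {
            "machine_id": machine_id,
            "sensor_id": sensor_id,
            "window_end": time.time(),
            "avg": statistics.mean(window),
            "max": max(window),
            "stddev": statistics.pstdev(window),
        }
        try:
            requests.post(CLOUD_INGEST_URL, json=payload, timeout=5)
        except requests.RequestException:
            pass  # in a real gateway: buffer locally and retry (omitted here)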
Cloud Layer: Long-term Storage & Advanced Analytics
What goes to cloud:
- Aggregated metrics: OEE by hour, downtime events, quality metrics
- Anomalies: Sensor readings outside normal range
- Contextual data: Work orders, product specs, maintenance history
- Not sent to cloud: Raw high-frequency sensor data (stays at edge for 7 days, then discarded)
Cloud architecture:
Edge Gateway
↓ (Kafka / REST API)
Cloud Data Platform
┌─────────────────────────────────────┐
│ Ingestion Layer │
│ - Kafka: Streaming from edge │
│ - Airbyte: Batch from ERP, MES │
└──────────────┬──────────────────────┘
▼
┌─────────────────────────────────────┐
│ Data Warehouse (BigQuery/Snowflake) │
│ │
│ - Production metrics (hourly agg) │
│ - Downtime events │
│ - Quality inspections │
│ - Maintenance logs │
│ - ML training datasets │
└──────────────┬──────────────────────┘
▼
┌─────────────────────────────────────┐
│ Analytics & ML Layer │
│ - dbt: Transformations │
│ - Python: ML models (predictive │
│ maintenance, quality prediction) │
│ - Airflow: Orchestration │
└──────────────┬──────────────────────┘
▼
┌─────────────────────────────────────┐
│ BI Layer (Looker, Tableau) │
│ - Executive dashboards │
│ - Plant manager dashboards │
│ - Maintenance team dashboards │
└─────────────────────────────────────┘
Time-Series Database Choice
InfluxDB:
- Pros: Purpose-built for time-series, excellent compression, InfluxQL query language
- Cons: Clustering complex in open-source version
- Use case: Edge deployments, short-term storage
TimescaleDB (PostgreSQL extension):
- Pros: SQL compatible, integrates with existing tools, mature ecosystem
- Cons: Slightly lower performance than InfluxDB for pure time-series
- Use case: Cloud warehouse, long-term storage, complex queries
Example schema (TimescaleDB):
-- Sensor readings (hypertable for automatic partitioning)
CREATE TABLE sensor_readings (
timestamp TIMESTAMPTZ NOT NULL,
machine_id VARCHAR(50),
sensor_id VARCHAR(50),
value NUMERIC(10,4),
unit VARCHAR(20)
);
-- Convert to hypertable (automatic time-based partitioning)
SELECT create_hypertable('sensor_readings', 'timestamp');
-- Create indexes
CREATE INDEX idx_sensor_readings_machine ON sensor_readings (machine_id, timestamp DESC);
-- Continuous aggregates (pre-computed rollups for faster queries)
CREATE MATERIALIZED VIEW sensor_readings_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', timestamp) AS hour,
machine_id,
sensor_id,
AVG(value) as avg_value,
MIN(value) as min_value,
MAX(value) as max_value,
STDDEV(value) as stddev_value
FROM sensor_readings
GROUP BY hour, machine_id, sensor_id;
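Dashboards and notebooks can then read the hourly rollup instead of the raw table. A small sketch with pandas and psycopg2 (connection details and the machine_id value are placeholders):
import pandas as pd
import psycopg2

# Placeholder connection details -- point these at your TimescaleDB instance
conn = psycopg2.connect(host="timescale-host", dbname="factory",
                        user="dashboard_reader", password="change-me")
df = pd.read_sql(
    """
    SELECT hour, sensor_id, avg_value, max_value, stddev_value
    FROM sensor_readings_hourly
    WHERE machine_id = %(machine_id)s
      AND hour >= NOW() - INTERVAL '24 hours'
    ORDER BY hour
    """,
    conn,
    params={"machine_id": "LINE01-M03"},   # illustrative machine id
)
print(df.head())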
OEE (Overall Equipment Effectiveness): The North Star Metric
OEE Definition
OEE = Availability × Performance × Quality
Where:
- Availability: % of scheduled time that machine is running
(Planned production time - Downtime) / Planned production time
- Performance: % of maximum speed that machine is running
(Actual output / Theoretical max output) × 100
- Quality: % of produced units that are good (no defects)
Good units / Total units produced
World-class OEE:
- 85%+: World-class manufacturing
- 60-85%: Typical manufacturing
- <60%: Needs improvement
Example calculation:
- Shift duration: 8 hours = 480 minutes
- Planned downtime: 30 minutes (lunch break, changeover)
- Unplanned downtime: 60 minutes (breakdown)
- Production time: 480 - 30 - 60 = 390 minutes
- Ideal cycle time: 1 minute/unit
- Total units produced: 350 units
- Defective units: 20 units
Calculations:
Availability = (480 - 30 - 60) / (480 - 30) = 390 / 450 = 86.7%
Performance = (350 units × 1 min) / 390 min = 89.7%
Quality = (350 - 20) / 350 = 94.3%
OEE = 0.867 × 0.897 × 0.943 = 73.3%
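The same worked example as a small helper function (a sketch; the argument names mirror the numbers above):
def calculate_oee(shift_minutes, planned_downtime, unplanned_downtime,
                  units_produced, defective_units, ideal_cycle_time_min=1.0):
    """Return (availability, performance, quality, oee) as fractions."""
    available_time = shift_minutes - planned_downtime                   # 450 min
    run_time = available_time - unplanned_downtime                      # 390 min
    availability = run_time / available_time                            # 0.867
    performance = (units_produced * ideal_cycle_time_min) / run_time    # 0.897
    quality = (units_produced - defective_units) / units_produced       # 0.943
    return availability, performance, quality, availability * performance * quality

a, p, q, oee = calculate_oee(480, 30, 60, 350, 20)
print(f"Availability {a:.1%}, Performance {p:.1%}, Quality {q:.1%}, OEE {oee:.1%}")
# Availability 86.7%, Performance 89.7%, Quality 94.3%, OEE 73.3%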
Six Big Losses (OEE loss categories):
- Breakdowns (Availability) - Equipment failure
- Setup/Changeover (Availability) - Switching between products
- Small stops (Performance) - Idling, minor stoppages
- Reduced speed (Performance) - Running below max speed
- Startup rejects (Quality) - Scrap during startup
- Production rejects (Quality) - Defects during normal production
Real-time OEE Calculation
Data inputs:
- Machine state: Running, idle, down (from sensors or operator input)
- Production count: Good units, defective units (from vision system or manual entry)
- Downtime reasons: Breakdown, changeover, no material, etc. (operator logging)
SQL implementation (dbt model):
-- Calculate OEE per shift per machine
WITH shift_schedule AS (
SELECT
machine_id,
shift_date,
shift_name, -- A, B, C (morning, afternoon, night)
shift_start,
shift_end,
EXTRACT(EPOCH FROM (shift_end - shift_start)) / 60 as planned_production_time_minutes,
planned_downtime_minutes -- Lunch, meetings
FROM production_shifts
),
downtime_summary AS (
SELECT
machine_id,
DATE(start_time) as shift_date,
shift_name, -- assumed to be stamped on each downtime event (or derived from start_time)
SUM(duration_minutes) as total_downtime_minutes
FROM downtime_events
WHERE downtime_category = 'unplanned'
GROUP BY machine_id, shift_date, shift_name
),
production_summary AS (
SELECT
machine_id,
DATE(timestamp) as shift_date,
shift_name,
SUM(quantity_produced) as total_units,
SUM(quantity_defective) as defective_units
FROM production_logs
GROUP BY machine_id, shift_date, shift_name
),
machine_specs AS (
SELECT
machine_id,
ideal_cycle_time_seconds -- Theoretical fastest production rate
FROM machines
)
SELECT
s.machine_id,
s.shift_date,
s.shift_name,
-- Availability
(s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0)) as actual_runtime_minutes,
(s.planned_production_time_minutes - s.planned_downtime_minutes) as available_time_minutes,
(s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0)) /
NULLIF(s.planned_production_time_minutes - s.planned_downtime_minutes, 0) as availability,
-- Performance
COALESCE(p.total_units, 0) as total_units,
(COALESCE(p.total_units, 0) * m.ideal_cycle_time_seconds / 60.0) as theoretical_runtime_minutes,
(COALESCE(p.total_units, 0) * m.ideal_cycle_time_seconds / 60.0) /
NULLIF(s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0), 0) as performance,
-- Quality
COALESCE(p.total_units, 0) - COALESCE(p.defective_units, 0) as good_units,
1.0 * (COALESCE(p.total_units, 0) - COALESCE(p.defective_units, 0)) /
NULLIF(COALESCE(p.total_units, 0), 0) as quality, -- 1.0 * avoids integer division
-- OEE
(
(s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0)) /
NULLIF(s.planned_production_time_minutes - s.planned_downtime_minutes, 0)
) *
(
(COALESCE(p.total_units, 0) * m.ideal_cycle_time_seconds / 60.0) /
NULLIF(s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0), 0)
) *
(
1.0 * (COALESCE(p.total_units, 0) - COALESCE(p.defective_units, 0)) /
NULLIF(COALESCE(p.total_units, 0), 0)
) as oee
FROM shift_schedule s
LEFT JOIN downtime_summary d USING (machine_id, shift_date, shift_name)
LEFT JOIN production_summary p USING (machine_id, shift_date, shift_name)
JOIN machine_specs m USING (machine_id);
Real-time OEE dashboard (updated every minute):
- Current shift OEE: 73.3% (target: 80%)
- Availability: 86.7% ⚠️ (target: 90%)
- Performance: 89.7% ✓
- Quality: 94.3% ✓
Alerts:
- OEE below target for 2 consecutive hours → Notify production manager
- Availability drop (machine down) → Immediate SMS to maintenance team
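A sketch of how the "OEE below target for 2 consecutive hours" rule could be evaluated against hourly OEE rows (the target value, loader, and notification hook are assumptions):
import pandas as pd

OEE_TARGET = 0.80

def oee_alert_needed(hourly_oee: pd.DataFrame) -> bool:
    """hourly_oee: columns ['hour', 'oee'] for a single machine.
    Returns True when the two most recent hours are both below target."""
    recent = hourly_oee.sort_values('hour').tail(2)
    return len(recent) == 2 and (recent['oee'] < OEE_TARGET).all()

# Hypothetical usage inside an hourly scheduled job:
# if oee_alert_needed(load_hourly_oee(machine_id)):
#     notify_production_manager(machine_id)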
OEE Improvement Initiatives
Availability improvement:
- Reduce breakdowns: Predictive maintenance (see next section)
- Reduce changeover time: SMED (Single-Minute Exchange of Die) methodology
- Example: Reduce changeover from 60 min → 15 min = 45 min saved per changeover
- If 2 changeovers/day: 90 min × 250 days = 22,500 min/year = 375 hours/year
Performance improvement:
- Eliminate small stops: Root cause analysis (jams, misfeeds)
- Speed optimization: Gradually increase speed while monitoring quality
Quality improvement:
- Process control: SPC charts, real-time monitoring
- First-piece inspection: Catch setup errors early
- Poka-yoke: Error-proofing devices
Predictive Maintenance: From Reactive to Proactive
Maintenance Strategies Comparison
| Strategy | Trigger | Cost | Downtime | Suitability |
|---|---|---|---|---|
| Reactive | Run to failure | Low upfront, high emergency cost | High (unplanned) | Non-critical equipment |
| Preventive | Time-based schedule | Medium | Medium (planned) | Most equipment |
| Predictive | Condition-based | High (sensors, ML) | Low (planned at optimal time) | Critical equipment |
| Prescriptive | AI recommends action | Highest | Lowest | Future state |
Example: Motor bearing maintenance
Reactive: Wait until bearing fails → Unplanned downtime 8 hours, emergency parts order (3x cost), possible collateral damage
Preventive: Replace bearing every 5,000 hours → Some bearings replaced too early (wasted), some fail before schedule
Predictive: Monitor vibration, temperature → Replace when condition degrades → Optimal timing, planned downtime 2 hours, no emergency costs
Predictive Maintenance ML Model
Use case: Predict bearing failure 3-7 days in advance
Sensor data:
- Vibration: Frequency spectrum (FFT analysis)
- Bearing defect frequencies: BPFO, BPFI, BSF, FTF
- Increased vibration at specific frequencies = bearing wear
- Temperature: Bearing housing temperature
- Normal: 40-60°C, Warning: >70°C, Critical: >85°C
- Acoustic emission: Ultrasonic sound
- Early indicator of cracks, lubrication issues
Feature engineering:
import numpy as np
import pandas as pd
from scipy import signal
from scipy.fft import fft
def extract_vibration_features(vibration_signal, sampling_rate=10000):
"""
Extract features from vibration sensor signal
vibration_signal: array of acceleration values (g)
"""
features = {}
# Time domain features
features['rms'] = np.sqrt(np.mean(vibration_signal**2)) # Root mean square
features['peak'] = np.max(np.abs(vibration_signal))
features['crest_factor'] = features['peak'] / features['rms']
features['kurtosis'] = np.mean((vibration_signal - np.mean(vibration_signal))**4) / (np.std(vibration_signal)**4)
# Frequency domain features (FFT)
fft_values = np.abs(fft(vibration_signal))
frequencies = np.fft.fftfreq(len(vibration_signal), 1/sampling_rate)
# Energy in specific frequency bands (bearing defect frequencies)
# For example, BPFO (Ball Pass Frequency Outer race) for this bearing = 156 Hz
bpfo_band = (150, 162) # Hz
bpfo_energy = np.sum(fft_values[(frequencies >= bpfo_band[0]) & (frequencies <= bpfo_band[1])]**2)
features['bpfo_energy'] = bpfo_energy
# Similar for BPFI, BSF, etc.
return features
# Apply to historical data
df['vibration_features'] = df['vibration_signal'].apply(extract_vibration_features)
Labeling data (supervised learning):
- Failure events: Known bearing failures (from maintenance logs)
- Time window: Label last 7 days before failure as "degraded"
- Healthy: All other periods
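A sketch of how these labels could be built from the maintenance log, using pandas merge_asof to find each reading's next recorded failure (the column names and the failure_log frame are assumptions):
import pandas as pd

def label_needs_maintenance(readings: pd.DataFrame, failures: pd.DataFrame,
                            window_days: int = 7) -> pd.Series:
    """readings: one feature row per (machine_id, timestamp).
    failures: one row per (machine_id, failure_time) from maintenance logs.
    Returns 1 when a failure occurs within the next window_days, else 0."""
    readings_sorted = readings.sort_values('timestamp')
    failures_sorted = (failures.sort_values('failure_time')
                               .rename(columns={'failure_time': 'next_failure'}))
    merged = pd.merge_asof(
        readings_sorted, failures_sorted[['machine_id', 'next_failure']],
        left_on='timestamp', right_on='next_failure',
        by='machine_id', direction='forward')
    labels = ((merged['next_failure'] - merged['timestamp'])
              <= pd.Timedelta(days=window_days)).astype(int)
    labels.index = readings_sorted.index   # align back to the caller's row labels
    return labels

# df['needs_maintenance_7d'] = label_needs_maintenance(df, failure_log)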
Training data structure:
# Features (X)
features = [
'vibration_rms', 'vibration_peak', 'vibration_kurtosis', 'bpfo_energy',
'temperature', 'temperature_change_rate',
'hours_since_last_maintenance', 'machine_age_days',
# Rolling statistics (trends)
'vibration_rms_7d_mean', 'vibration_rms_7d_trend',
'temperature_7d_mean', 'temperature_7d_max'
]
X = df[features]
# Labels (y)
# 0 = healthy, 1 = needs maintenance soon (within 7 days)
y = df['needs_maintenance_7d']
Model training (Random Forest or XGBoost):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import precision_recall_curve, roc_auc_score
# Time-series split (avoid data leakage)
tscv = TimeSeriesSplit(n_splits=5)
model = RandomForestClassifier(
n_estimators=300,
max_depth=10,
min_samples_split=20,
class_weight='balanced', # Handle imbalanced classes
random_state=42
)
# Cross-validation
for train_idx, test_idx in tscv.split(X):
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f"AUC: {auc:.4f}") # Target: >0.85
# Feature importance
feature_importance = pd.DataFrame({
'feature': features,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print(feature_importance.head(10))
Typical important features:
- vibration_rms (25%) - Most predictive
- temperature (18%)
- bpfo_energy (15%) - Bearing-specific frequency
- vibration_kurtosis (12%)
- hours_since_last_maintenance (10%)
Model deployment:
# Real-time scoring (runs every hour)
def predict_maintenance_need(machine_id, model):
# Fetch latest sensor data (last 24 hours)
sensor_data = fetch_sensor_data(machine_id, hours=24)
# Extract features
features = extract_vibration_features(sensor_data['vibration'])
features['temperature'] = sensor_data['temperature'].mean()
# ... other features
# Predict
features_df = pd.DataFrame([features])
failure_probability = model.predict_proba(features_df)[0, 1]
# Decision rules
if failure_probability > 0.7:
return {"action": "URGENT", "message": "Schedule maintenance within 3 days"}
elif failure_probability > 0.4:
return {"action": "PLAN", "message": "Monitor closely, plan maintenance within 7 days"}
else:
return {"action": "OK", "message": "No action needed"}
# Run for all critical machines
for machine in critical_machines:
prediction = predict_maintenance_need(machine['id'], trained_model)
if prediction['action'] in ['URGENT', 'PLAN']:
create_maintenance_work_order(machine['id'], prediction)
notify_maintenance_team(machine['id'], prediction)
ROI of Predictive Maintenance
Case study: Electronics manufacturer, 50 critical machines
Before (Reactive + Preventive):
- Unplanned downtime: 8% of production time
- 50 machines × 8% × 8,000 hours/year = 32,000 hours lost
- Cost: 32,000 hours × $200/hour (lost production) = $6.4M/year
- Maintenance cost: $800K/year
- Emergency repairs (50% premium): $300K
- Planned preventive: $500K
After (Predictive):
- Unplanned downtime: 3% (reduced 62%)
- Saved: 5% × 50 machines × 8,000 hours × $200 = $4M/year
- Maintenance cost: $600K/year (25% reduction)
- No emergency premiums: Save $150K
- Optimize PM schedule: Save $50K
- System cost: $200K (sensors, software, ML models)
Net benefit Year 1: $4M + $200K - $200K = $4M. ROI: 2,000%
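A small calculator version of this estimate, so you can plug in your own machine count, hourly cost of lost production, and expected downtime reduction (the defaults reproduce the figures above):
def predictive_maintenance_roi(machines=50, hours_per_year=8_000,
                               downtime_before=0.08, downtime_after=0.03,
                               cost_per_lost_hour=200,
                               maintenance_savings=200_000,
                               system_cost=200_000):
    """Monetary values in USD; downtime figures are fractions of production time."""
    downtime_savings = ((downtime_before - downtime_after)
                        * machines * hours_per_year * cost_per_lost_hour)
    net_benefit = downtime_savings + maintenance_savings - system_cost
    return downtime_savings, net_benefit, net_benefit / system_cost

savings, net, roi = predictive_maintenance_roi()
print(f"Downtime savings ${savings/1e6:.1f}M, net benefit ${net/1e6:.1f}M, ROI {roi:.0%}")
# Downtime savings $4.0M, net benefit $4.0M, ROI 2000%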
Quality Prediction: Real-time Defect Prevention
SPC (Statistical Process Control) Automation
Control charts: Monitor process parameters, detect when going out of control
Example: PCB soldering temperature
- Target: 250°C ± 5°C
- UCL (Upper Control Limit): 255°C
- LCL (Lower Control Limit): 245°C
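Here the limits come straight from the process spec (250°C ± 5°C). When limits are instead estimated from data, a common X-bar convention is mean ± 3 standard errors of the subgroup mean; a sketch under that assumption:
import numpy as np

def xbar_control_limits(in_control_readings: np.ndarray, subgroup_size: int = 5):
    """Classic X-bar chart limits: center line ± 3 * sigma / sqrt(n).
    Use readings from a period when the process was known to be in control."""
    center = in_control_readings.mean()
    sigma = in_control_readings.std(ddof=1)
    margin = 3 * sigma / np.sqrt(subgroup_size)
    return center - margin, center, center + margin

readings = np.random.normal(loc=250, scale=1.5, size=500)  # simulated soldering temps
lcl, cl, ucl = xbar_control_limits(readings)
print(f"LCL {lcl:.1f}°C | CL {cl:.1f}°C | UCL {ucl:.1f}°C")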
Automated SPC:
-- Real-time monitoring (query runs every minute)
WITH latest_readings AS (
SELECT
machine_id,
AVG(value) as avg_temperature,
STDDEV(value) as stddev_temperature,
COUNT(*) as sample_size
FROM sensor_readings
WHERE sensor_id = 'TEMP_SOLDERING'
AND timestamp >= NOW() - INTERVAL '10 minutes'
GROUP BY machine_id
),
control_limits AS (
SELECT
machine_id,
target_value,
ucl, -- Upper control limit
lcl -- Lower control limit
FROM process_specifications
WHERE parameter = 'soldering_temperature'
)
SELECT
lr.machine_id,
lr.avg_temperature,
cl.target_value,
cl.ucl,
cl.lcl,
CASE
WHEN lr.avg_temperature > cl.ucl THEN 'OUT_OF_CONTROL_HIGH'
WHEN lr.avg_temperature < cl.lcl THEN 'OUT_OF_CONTROL_LOW'
ELSE 'IN_CONTROL'
END as status,
CASE
WHEN lr.avg_temperature > cl.ucl OR lr.avg_temperature < cl.lcl
THEN 'ALERT: Adjust soldering temperature immediately'
ELSE NULL
END as alert_message
FROM latest_readings lr
JOIN control_limits cl USING (machine_id)
-- Note: the status alias can't be referenced in WHERE, so repeat the condition
WHERE lr.avg_temperature > cl.ucl OR lr.avg_temperature < cl.lcl;
Alert actions:
- Out of control: Stop production, notify operator, adjust settings
- Trending towards limits: Warning, investigate root cause
Defect Prediction Model
Use case: Predict defect probability before final inspection
Input features:
- Process parameters: Temperature, pressure, speed, humidity
- Material batch: Supplier, batch number, incoming inspection results
- Machine condition: Hours since maintenance, previous defect rates
- Operator: Experience level, shift (fatigue factor)
Training data:
# Historical production data with defect labels
df = pd.read_sql("""
SELECT
p.work_order_id,
p.machine_id,
p.operator_id,
p.material_batch_id,
-- Process parameters (average during production)
AVG(s.temperature) as avg_temperature,
AVG(s.pressure) as avg_pressure,
AVG(s.speed) as avg_speed,
-- Material quality
m.incoming_inspection_score,
-- Machine condition
mc.hours_since_maintenance,
mc.recent_defect_rate_7d,
-- Operator
o.experience_years,
-- Target
CASE WHEN qi.defects_found > 0 THEN 1 ELSE 0 END as has_defects
FROM production_logs p
JOIN sensor_readings s USING (machine_id, timestamp)
JOIN materials m ON p.material_batch_id = m.batch_id
JOIN machine_condition mc USING (machine_id)
JOIN operators o USING (operator_id)
LEFT JOIN quality_inspections qi USING (work_order_id)
GROUP BY
    p.work_order_id, p.machine_id, p.operator_id, p.material_batch_id,
    m.incoming_inspection_score, mc.hours_since_maintenance,
    mc.recent_defect_rate_7d, o.experience_years, qi.defects_found
""", con=db_connection)
Model:
from xgboost import XGBClassifier
features = [
'avg_temperature', 'avg_pressure', 'avg_speed',
'incoming_inspection_score', 'hours_since_maintenance',
'recent_defect_rate_7d', 'experience_years'
]
X = df[features]
y = df['has_defects']
model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X, y)
# Predict defect probability in real-time
def predict_defect_risk(current_work_order):
features = extract_features(current_work_order)
defect_probability = model.predict_proba([features])[0, 1]
if defect_probability > 0.3: # High risk
alert = f"High defect risk ({defect_probability:.1%}). Recommend: Increase inspection frequency"
send_alert_to_quality_team(alert)
return defect_probability
Closed-loop control:
- High defect risk detected → Automatically adjust process parameters (e.g., reduce speed, increase temperature) → Re-predict → Iterate until risk acceptable
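A sketch of that loop with a deliberately simple, hypothetical adjustment policy; predict_risk is any callable mapping process parameters to a defect probability, e.g. a thin wrapper around the scoring logic shown above:
MAX_ITERATIONS = 5
RISK_THRESHOLD = 0.30   # same 30% threshold used for quality alerts above

def adjust_parameters(params: dict) -> dict:
    """Hypothetical adjustment policy: slow the line by 5%, raise temperature 2°C."""
    adjusted = dict(params)
    adjusted['speed'] *= 0.95
    adjusted['temperature'] += 2.0
    return adjusted

def closed_loop_control(params: dict, predict_risk) -> dict:
    """Re-predict after each adjustment until risk is acceptable or we give up."""
    for _ in range(MAX_ITERATIONS):
        risk = predict_risk(params)
        if risk <= RISK_THRESHOLD:
            return {"status": "OK", "risk": risk, "params": params}
        params = adjust_parameters(params)
    return {"status": "ESCALATE_TO_ENGINEER", "risk": risk, "params": params}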
ROI:
- Reduce defect rate 5% → 2% (60% improvement)
- For factory with 1M units/year, defect cost $10/unit:
- Before: 1M × 5% × $10 = $500K/year
- After: 1M × 2% × $10 = $200K/year
- Savings: $300K/year
Supply Chain Optimization: Demand-Supply Matching
Demand Forecasting for Production Planning
Inputs:
- Historical sales: Customer orders (SKU level)
- Seasonality: Monthly, quarterly patterns
- Customer signals: Forecasts from key customers, RFQs (Request for Quotation)
- Market trends: Industry growth, competitor actions
Forecasting model (Prophet or ARIMA):
from prophet import Prophet
# Historical sales data
df_sales = pd.read_sql("""
SELECT
DATE_TRUNC('week', order_date) as ds,
SUM(quantity) as y
FROM customer_orders
WHERE product_id = 'SKU-12345'
AND order_date >= '2023-01-01'
GROUP BY ds
ORDER BY ds
""", con=db_connection)
# Train Prophet model
model = Prophet(yearly_seasonality=True, weekly_seasonality=False)
model.fit(df_sales)
# Forecast next 12 weeks
future = model.make_future_dataframe(periods=12, freq='W')
forecast = model.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(12))
Production plan:
- Forecast demand: 10,000 units/week (next 12 weeks)
- Lead time: 2 weeks (raw materials to finished goods)
- Safety stock: 2 weeks (buffer for forecast error)
- Production schedule: 10,000 units/week, maintain 20,000 units safety stock
Inventory Optimization: Raw Materials
Challenge: Balance inventory holding cost vs stockout risk
Multi-echelon inventory model:
Suppliers → Raw materials warehouse → Production lines → Finished goods warehouse → Customers
Optimal reorder point & quantity:
import numpy as np
def calculate_reorder_point(
avg_daily_demand,
lead_time_days,
demand_std,
service_level=0.95
):
"""
Reorder point = (Avg daily demand × Lead time) + Safety stock
Safety stock = Z-score × Std(demand) × √(lead time)
"""
from scipy.stats import norm
z_score = norm.ppf(service_level) # 0.95 → 1.645
safety_stock = z_score * demand_std * np.sqrt(lead_time_days)
reorder_point = (avg_daily_demand * lead_time_days) + safety_stock
return {
'reorder_point': reorder_point,
'safety_stock': safety_stock
}
# Example: Steel sheets for stamping
avg_daily_demand = 500 # sheets/day
lead_time_days = 14 # Supplier lead time
demand_std = 100 # Standard deviation
result = calculate_reorder_point(avg_daily_demand, lead_time_days, demand_std, service_level=0.98)
print(f"Reorder point: {result['reorder_point']:.0f} sheets")
print(f"Safety stock: {result['safety_stock']:.0f} sheets")
# Reorder point ≈ 7,000 + 768 = 7,768 sheets (z ≈ 2.05 at 98% service level)
# When inventory drops below ~7,768 sheets → Place order
Economic Order Quantity (EOQ):
def calculate_eoq(annual_demand, ordering_cost, holding_cost_per_unit):
"""
EOQ = √(2 × D × S / H)
D = Annual demand
S = Ordering cost per order
H = Holding cost per unit per year
"""
eoq = np.sqrt((2 * annual_demand * ordering_cost) / holding_cost_per_unit)
return eoq
annual_demand = 500 * 250 # 500/day × 250 working days = 125,000
ordering_cost = 5_000_000 # VND per order (admin, shipping, receiving)
holding_cost = 50_000 # VND per sheet per year (storage, insurance, obsolescence)
eoq = calculate_eoq(annual_demand, ordering_cost, holding_cost)
print(f"EOQ: {eoq:.0f} sheets per order")
# EOQ ≈ 5,000 sheets
# Order 5,000 sheets each time, 25 times per year
Supplier Performance Tracking
Metrics:
- OTD (On-Time Delivery): % of orders delivered on promised date
- Quality: % defect rate in incoming materials
- Lead time: Average days from PO to delivery
- Responsiveness: Time to respond to inquiries
Scorecard:
SELECT
supplier_id,
supplier_name,
COUNT(*) as total_orders,
-- On-time delivery
SUM(CASE WHEN actual_delivery_date <= promised_delivery_date THEN 1 ELSE 0 END) /
COUNT(*) * 100 as otd_percentage,
-- Quality
AVG(incoming_inspection_pass_rate) as avg_quality_score,
-- Lead time
AVG(DATE_DIFF(actual_delivery_date, po_date, DAY)) as avg_lead_time_days,
-- Overall score (weighted average)
(
(SUM(CASE WHEN actual_delivery_date <= promised_delivery_date THEN 1 ELSE 0 END) / COUNT(*)) * 0.4 +
AVG(incoming_inspection_pass_rate) * 0.4 +
(1 - (AVG(DATE_DIFF(actual_delivery_date, po_date, DAY)) / 30)) * 0.2 -- Normalize lead time
) * 100 as supplier_score
FROM purchase_orders po
JOIN suppliers s USING (supplier_id)
WHERE po_date >= '2024-01-01'
GROUP BY supplier_id, supplier_name
ORDER BY supplier_score DESC;
Actions:
- Top suppliers (score >90): Preferred, increase share
- Medium (70-90): Acceptable, monitor
- Low (<70): Improvement plan or replace
Case Study: Electronics Manufacturer - Cutting Downtime 40%, Saving 2B VND/Year
Background:
- Company: Contract manufacturer (PCB assembly), 2 factories
- Production: 5M units/year, 20 SMT lines
- Revenue: ~500B VND/year
Problems:
- High unplanned downtime: 12% (industry average ~6-8%)
- Frequent machine breakdowns
- No early warning system
- Manual OEE tracking: Operators fill Excel sheets end-of-shift
- Data incomplete, inaccurate
- No real-time visibility for managers
- Quality issues: 3% defect rate (target <1%)
- Reactive: Find defects at final inspection, rework costly
Solution: Smart Factory Platform (20 weeks implementation)
Phase 1: IoT Infrastructure (Weeks 1-8)
Sensor deployment:
- 20 SMT lines, each with:
- Vibration sensors: 4 per line (critical motors)
- Temperature sensors: 6 per line (reflow oven zones)
- Vision systems: 2 per line (component placement accuracy, solder quality)
- Current sensors: 10 per line (motor loads)
- Total: 440 sensors
Edge gateways:
- 20 industrial PCs (one per line): Collect sensor data, local processing
- Node-RED flows: Data filtering, aggregation
- Local InfluxDB: 7-day retention, real-time dashboards
Network:
- Factory WiFi upgrade (industrial-grade APs)
- Redundant internet (2 ISPs for reliability)
- VPN to cloud
Integration with existing systems:
- MES (Siemens Opcenter): Work orders, production counts
- ERP (SAP): Materials, maintenance logs
- APIs + Airbyte connectors
Cost: ~800M VND (sensors, edge devices, installation labor)
Phase 2: Data Platform & Analytics (Weeks 9-16)
Cloud setup (Google Cloud Platform):
- BigQuery: Data warehouse
- Cloud Functions: Serverless processing
- Cloud Composer (Airflow): Orchestration
- Vertex AI: ML model training & deployment
Data pipelines:
- Real-time: Sensor data (anomalies only) → Pub/Sub → BigQuery
- Batch: MES, ERP → Airbyte → BigQuery (hourly)
dbt models:
-- OEE calculation (per line, per shift)
{{ config(materialized='incremental') }}
SELECT
production_line_id,
shift_date,
shift_name,
availability,
performance,
quality,
availability * performance * quality as oee
FROM {{ ref('shift_production_summary') }}
WHERE shift_date >= '2024-01-01'
ML models:
- Predictive maintenance (vibration + temperature → bearing failure prediction)
  - Training data: 18 months historical + 50 labeled failures
  - Model: Random Forest, AUC 0.88
  - Deployment: Hourly scoring, alerts if failure probability >50%
- Quality prediction (process params → defect probability)
  - Training data: 2M units with inspection results
  - Model: XGBoost, AUC 0.82
  - Deployment: Real-time scoring per unit
Cost: 400M VND (GCP credits, software licenses, ML development)
Phase 3: Dashboards & Alerts (Weeks 17-20)
Real-time dashboards (Grafana):
- Line overview: 20 lines, status (running/idle/down), current OEE
- OEE details: Availability, Performance, Quality breakdown
- Alerts: Downtime events, predictive maintenance warnings
BI dashboards (Looker):
- Executive: Overall OEE trend, top loss categories, plant comparison
- Production manager: Line performance ranking, shift comparisons
- Maintenance team: Equipment health, upcoming maintenance, spare parts inventory
Mobile app (for factory floor):
- Operators: Log downtime reasons, view line status
- Maintenance: Work orders, equipment health alerts
Automated alerts:
- Downtime >10 min: SMS to production manager
- Predictive maintenance (failure probability >70%): Email to maintenance team, create work order
- Quality risk (defect probability >30%): Alert operator, suggest parameter adjustments
Cost: 200M VND (dashboard development, mobile app)
Results After 12 Months
| Metric | Before | After | Change |
|---|---|---|---|
| OEE | 65% | 78% | +13 pp (+20%) |
| Availability | 82% | 91% | +9 pp |
| Unplanned downtime | 12% | 7% | -42% (↓5 pp) |
| Defect rate | 3.0% | 1.2% | -60% (↓1.8 pp) |
| Maintenance cost | 20B VND | 15B VND | -25% |
| Emergency repairs | 40% of maintenance | 10% | -75% |
| Output (units/year) | 5M | 5.8M | +16% (same capacity) |
Financial impact (annual):
Revenue increase (higher output):
- Additional 800K units × 100,000 VND/unit = 80B VND
Cost savings:
- Reduced downtime: 5% capacity increase = 25B VND (opportunity cost)
- Reduced defects: 1.8% × 5.8M units × 50,000 VND rework cost = 5.2B VND
- Maintenance optimization: 5B VND
Total benefit: 80B + 25B + 5.2B + 5B ≈ 115B VND. Investment: 1.4B VND. ROI Year 1: ~8,100% (payback in about 5 days).
Intangible benefits:
- Customer satisfaction ↑ (fewer delays, better quality)
- Employee morale ↑ (less firefighting, more proactive work)
- Competitive advantage (can bid for higher-quality contracts)
Key Learnings
What worked:
- Start with pilot: 1 production line first (2 weeks), prove value, then scale
- OT-IT collaboration: Involve maintenance team early, they know the machines
- Quick wins: Real-time OEE dashboards had immediate impact, built buy-in
Challenges:
- Sensor installation: Downtime for retrofitting (scheduled during maintenance windows)
- Data quality: Initial sensor readings noisy (vibration sensors placed wrong) → Recalibration needed
- Change management: Operators initially skeptical of "robots watching them" → Training, emphasize helping not policing
Recommendations:
- Budget 20-30% buffer for unexpected issues (sensor failures, network problems)
- Plan for ongoing ML model maintenance (retrain quarterly)
- Invest in training (operators, maintenance team need to understand the system)
Implementation Roadmap: 16 Weeks to Smart Factory (Pilot Line)
Phase 1: Assessment & Design (Weeks 1-2)
- Identify pilot production line (high-value, representative)
- Map current state: Machines, sensors, data systems
- Define use cases: OEE tracking, predictive maintenance, quality prediction
- Architecture design: Edge + cloud components
- Vendor selection: Sensors, edge hardware, cloud platform
Phase 2: Infrastructure Setup (Weeks 3-6)
- Procure hardware: Sensors, edge gateways, network equipment
- Install sensors on pilot line (plan downtime)
- Setup edge gateway: Node-RED, InfluxDB
- Test sensor connectivity, data collection
- Setup cloud Data Warehouse (BigQuery/Snowflake)
Phase 3: Data Integration (Weeks 7-10)
- Integrate MES, ERP data sources (Airbyte connectors)
- Build data pipelines: Edge → Cloud, batch syncs
- Data modeling: OEE tables, sensor aggregations
- dbt transformations: Core metrics calculations
- Data quality checks, validation
Phase 4: Analytics & ML (Weeks 11-14)
- Build OEE dashboard (real-time, Grafana)
- Train predictive maintenance model (if sufficient historical data)
- Deploy ML models (Cloud Functions, Vertex AI)
- Setup alerting rules (downtime, anomalies)
- BI dashboards for management (Looker/Tableau)
Phase 5: Testing & Rollout (Weeks 15-16)
- UAT with operators, maintenance team
- Refine dashboards based on feedback
- Training sessions (operators, managers)
- Documentation (SOPs, troubleshooting guides)
- Go-live: Switch from old system to new
Phase 6: Scale (Post Week 16)
- Monitor pilot line for 4-8 weeks, iterate
- Calculate ROI, build business case for scaling
- Rollout to remaining production lines (4-8 weeks per line)
- Expand use cases: Supply chain, energy optimization, etc.
Checklist: Data Platform for Manufacturing
Data sources:
- IoT sensors deployed (vibration, temperature, current)
- SCADA/PLC integration (Modbus, OPC UA)
- MES integration (work orders, downtime logging)
- ERP integration (materials, maintenance)
- QMS integration (inspection results, defects)
Infrastructure:
- Edge gateways installed (Node-RED, Telegraf)
- Time-series database (InfluxDB or TimescaleDB)
- Cloud Data Warehouse (BigQuery, Snowflake, Redshift)
- Network (WiFi, VPN, redundancy)
Data models:
- OEE calculation (availability, performance, quality)
- Downtime event tracking (with categorization)
- Production metrics (output, cycle times)
- Equipment health metrics (vibration, temperature trends)
- Quality metrics (defect rates, Cpk)
Analytics:
- Real-time OEE dashboard (per line, per shift)
- Predictive maintenance models deployed
- Quality prediction models deployed
- Root cause analysis tools (Pareto charts, fishbone diagrams)
Alerts & actions:
- Downtime alerts (immediate notification)
- Predictive maintenance alerts (create work orders)
- Quality alerts (out-of-spec warnings)
- Automated reporting (daily OEE summary, weekly review)
Conclusion: Industry 4.0 Is ROI-Positive from Day 1
Manufacturers cannot compete on low labor costs forever. Smart factories built on a Data Platform deliver a durable competitive advantage: higher quality, faster production, lower costs, and the flexibility to adapt.
Key takeaways:
- OEE improvement 10-15% = $1-2M/year for a medium factory
- Predictive maintenance reduces downtime 30-50%, maintenance costs 25-30%
- Quality prediction prevents defects, reduces rework/scrap 60-80%
- Payback period: 6-18 months (depending on factory size, use cases)
Next steps:
- Assess current state (do you have sensors? Can you access MES/ERP data?)
- Identify top pain point (downtime? quality? inventory?)
- Start with 1 production line pilot (16 weeks, prove value)
- Contact Carptech if you need consulting support (carptech.vn/contact)
References:
- OEE Calculation Guide - OEE.com
- Predictive Maintenance with Machine Learning - AWS
- TimescaleDB for Industrial IoT
- Industry 4.0 Maturity Index - Acatech
This article is part of the "Data Platform for Industries" series. Read more about E-commerce, Fintech, and Retail.
Carptech - Data Platform Solutions for Vietnamese Enterprises. Contact us for a free consultation.




