Quay lại Blog
Data PlatformCập nhật: 22 tháng 4, 202527 phút đọc

Data Platform cho Manufacturing: OEE, Predictive Maintenance, Supply Chain

Hướng dẫn xây dựng Data Platform cho nhà máy sản xuất, từ thu thập IoT sensor data, tính toán OEE real-time, predictive maintenance models, đến supply chain optimization. Industry 4.0 implementation roadmap cho Việt Nam.

Vũ Đức Trung

Vũ Đức Trung

Senior Data Engineer (Chuyên gia Streaming)

Biểu đồ smart factory với IoT sensors, máy móc, predictive maintenance alerts và OEE dashboards
#Manufacturing#Industry 4.0#OEE#Predictive Maintenance#IoT#Supply Chain#Smart Factory

Manufacturing tại Việt Nam đang trong giai đoạn chuyển đổi từ Industry 3.0 (automation cơ bản) sang Industry 4.0 (smart factory với IoT, AI, data-driven decisions). Tuy nhiên, theo khảo sát của Carptech với 30+ nhà máy sản xuất (electronics, automotive parts, F&B, textiles), chỉ 15% có khả năng tính OEE real-time, và 72% vẫn đang làm maintenance theo lịch (scheduled) thay vì predictive - dẫn đến 30-40% unplanned downtime, chi phí bảo trì cao và chất lượng không ổn định.

Bài viết này sẽ hướng dẫn chi tiết cách xây dựng Data Platform cho manufacturing, từ kiến trúc IoT integration, tính toán OEE real-time, predictive maintenance models, quality prediction, đến supply chain optimization. Kèm case study về electronics manufacturer giảm unplanned downtime 40%, tiết kiệm 2B VND/năm.

TL;DR - Key Takeaways

  • Industry 4.0 data sources: IoT sensors, MES, ERP, SCADA, QMS - tích hợp real-time + batch
  • OEE (Overall Equipment Effectiveness): Availability × Performance × Quality - KPI quan trọng nhất trong manufacturing
  • Predictive maintenance: ML models dự đoán failure trước 3-7 days → Reduce unplanned downtime 30-50%
  • Architecture: Edge computing + time-series DB (InfluxDB, TimescaleDB) + cloud warehouse
  • ROI: 15% OEE improvement = $2M/year cho medium factory, 25-30% maintenance cost reduction
  • Implementation: Start with 1 production line (pilot), 12-16 weeks, scale to plant-wide

Manufacturing Data Landscape: IT/OT Convergence

1. IoT Sensors & SCADA - Operational Technology (OT)

SCADA (Supervisory Control and Data Acquisition):

  • Real-time monitoring & control của production equipment
  • Data: Temperature, pressure, RPM, vibration, current draw
  • Protocols: Modbus, OPC UA, MQTT

IoT sensor types:

Sensor TypeMeasuresUse CaseFrequency
Temperature°COverheat detection, process control1-10 sec
VibrationFrequency, amplitudeBearing wear, imbalance detection0.1-1 sec
PressurePSI, barHydraulic/pneumatic systems1-10 sec
CurrentAmperageMotor load, power consumption1 sec
ProximityOn/offPart presence detectionEvent-based
VisionImage/videoQuality inspection, defect detectionPer unit

Data volume:

  • 1 production line với 20 machines, 100 sensors total
  • Sample rate: 1 Hz (1 sample/second)
  • Data points: 100 sensors × 3600 seconds × 24 hours = 8.6M data points/day
  • Storage: ~150-200 MB/day uncompressed, ~20-30 MB compressed

Challenge: Legacy machines không có sensors

  • Brownfield factories: 50-80% machines là older models (10-20 years)
  • Solutions:
    • Retrofit sensors: Add vibration, temperature sensors externally
    • Smart plugs: Monitor power consumption (proxy for machine state)
    • Manual logging: Operators enter downtime reasons via tablet

2. MES (Manufacturing Execution System)

MES functions:

  • Production scheduling: What to produce, when, on which line
  • Work order tracking: Status of each job (queued, in-progress, completed)
  • Material consumption: Raw materials used per batch
  • Operator assignments: Who operated which machine, shift info
  • Downtime logging: When machine stopped, reason, duration

MES platforms: SAP MES, Siemens Opcenter, AVEVA MES, local Vietnamese systems

Data structure:

-- Work orders
CREATE TABLE work_orders (
  work_order_id VARCHAR(50) PRIMARY KEY,
  product_id VARCHAR(50),
  quantity_planned INT,
  quantity_produced INT,
  production_line_id VARCHAR(20),
  scheduled_start TIMESTAMPTZ,
  actual_start TIMESTAMPTZ,
  actual_end TIMESTAMPTZ,
  status VARCHAR(20)  -- queued, in_progress, completed, cancelled
);

-- Downtime events
CREATE TABLE downtime_events (
  downtime_id SERIAL PRIMARY KEY,
  machine_id VARCHAR(50),
  start_time TIMESTAMPTZ,
  end_time TIMESTAMPTZ,
  duration_minutes INT,
  downtime_category VARCHAR(50),  -- planned, unplanned
  downtime_reason VARCHAR(100),  -- breakdown, changeover, no_material, no_operator
  operator_id VARCHAR(50)
);

3. ERP (Enterprise Resource Planning)

ERP modules relevant for manufacturing:

  • Materials management: Inventory levels, procurement
  • Production planning: Master production schedule (MPS)
  • Quality management: Inspection results, non-conformance reports
  • Maintenance management: Work orders, spare parts inventory

ERP platforms: SAP, Oracle, Microsoft Dynamics, Odoo

Integration with Data Platform: Batch sync daily/hourly (ERP not real-time)

4. QMS (Quality Management System)

Quality data:

  • Inspection results: First-piece inspection, in-process, final inspection
  • Defect tracking: Defect types, counts, root causes
  • Statistical Process Control (SPC): Control charts, Cpk values
  • Customer returns: RMA (Return Merchandise Authorization), failure analysis

Defect categorization:

CREATE TABLE quality_inspections (
  inspection_id SERIAL PRIMARY KEY,
  work_order_id VARCHAR(50),
  inspection_type VARCHAR(50),  -- first_piece, in_process, final
  timestamp TIMESTAMPTZ,
  inspector_id VARCHAR(50),
  sample_size INT,
  defects_found INT,
  defect_types JSONB,  -- {"scratch": 2, "dimension_out_of_spec": 1}
  pass_fail VARCHAR(10)
);

5. Supply Chain Data

Inbound:

  • Supplier deliveries: On-time delivery rate, lead times, quality issues
  • Raw material inventory: Stock levels, consumption rates

Outbound:

  • Finished goods inventory: Warehouse stock
  • Shipments: Order fulfillment, delivery performance

Metrics:

  • Supplier OTD (On-Time Delivery): % deliveries arriving on scheduled date
  • Lead time: Days from PO to goods receipt
  • Inventory days: (Inventory value) / (Daily consumption cost)

Architecture: Edge-to-Cloud for Manufacturing

Edge Layer: Real-time Processing tại Factory Floor

Why edge computing?

  • Latency: Real-time control requires <10ms response (cloud round-trip ~100-500ms)
  • Reliability: Factory must operate even if internet connection down
  • Bandwidth: Sending 8M+ data points/day to cloud = expensive, slow
  • Security: Sensitive production data stays on-premise

Edge architecture:

Production Machines
    ↓ (Modbus, OPC UA)
Edge Gateway (Industrial PC)
  - Collect sensor data
  - Filter, aggregate (only send anomalies/summaries to cloud)
  - Local dashboards (OEE, line status)
  - Temporary storage (24-hour buffer)
    ↓ (MQTT, Kafka)
Local Time-Series Database
  - InfluxDB or TimescaleDB
  - Store high-frequency data (1 week retention)
  - Used for real-time analytics

Edge gateway hardware:

  • Raspberry Pi 4 (low-cost, $100): Good for pilot projects
  • Industrial PCs (Advantech, Siemens): Ruggedized, $500-2000
  • Siemens IoT2050: Designed for factory environments, ~$300

Software stack:

  • Node-RED: Visual programming for IoT flows (easy for OT teams)
  • Telegraf: Data collection agent (plugins for Modbus, OPC UA)
  • InfluxDB: Time-series database
  • Grafana: Real-time dashboards

Cloud Layer: Long-term Storage & Advanced Analytics

What goes to cloud:

  • Aggregated metrics: OEE by hour, downtime events, quality metrics
  • Anomalies: Sensor readings outside normal range
  • Contextual data: Work orders, product specs, maintenance history
  • Not sent to cloud: Raw high-frequency sensor data (stays at edge for 7 days, then discarded)

Cloud architecture:

Edge Gateway
    ↓ (Kafka / REST API)
Cloud Data Platform
  ┌─────────────────────────────────────┐
  │ Ingestion Layer                     │
  │  - Kafka: Streaming from edge       │
  │  - Airbyte: Batch from ERP, MES     │
  └──────────────┬──────────────────────┘
                 ▼
  ┌─────────────────────────────────────┐
  │ Data Warehouse (BigQuery/Snowflake) │
  │                                     │
  │  - Production metrics (hourly agg)  │
  │  - Downtime events                  │
  │  - Quality inspections              │
  │  - Maintenance logs                 │
  │  - ML training datasets             │
  └──────────────┬──────────────────────┘
                 ▼
  ┌─────────────────────────────────────┐
  │ Analytics & ML Layer                │
  │  - dbt: Transformations             │
  │  - Python: ML models (predictive    │
  │    maintenance, quality prediction) │
  │  - Airflow: Orchestration           │
  └──────────────┬──────────────────────┘
                 ▼
  ┌─────────────────────────────────────┐
  │ BI Layer (Looker, Tableau)          │
  │  - Executive dashboards             │
  │  - Plant manager dashboards         │
  │  - Maintenance team dashboards      │
  └─────────────────────────────────────┘

Time-Series Database Choice

InfluxDB:

  • Pros: Purpose-built for time-series, excellent compression, InfluxQL query language
  • Cons: Clustering complex in open-source version
  • Use case: Edge deployments, short-term storage

TimescaleDB (PostgreSQL extension):

  • Pros: SQL compatible, integrates with existing tools, mature ecosystem
  • Cons: Slightly lower performance than InfluxDB for pure time-series
  • Use case: Cloud warehouse, long-term storage, complex queries

Example schema (TimescaleDB):

-- Sensor readings (hypertable for automatic partitioning)
CREATE TABLE sensor_readings (
  timestamp TIMESTAMPTZ NOT NULL,
  machine_id VARCHAR(50),
  sensor_id VARCHAR(50),
  value NUMERIC(10,4),
  unit VARCHAR(20)
);

-- Convert to hypertable (automatic time-based partitioning)
SELECT create_hypertable('sensor_readings', 'timestamp');

-- Create indexes
CREATE INDEX idx_sensor_readings_machine ON sensor_readings (machine_id, timestamp DESC);

-- Continuous aggregates (pre-computed rollups for faster queries)
CREATE MATERIALIZED VIEW sensor_readings_hourly
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', timestamp) AS hour,
  machine_id,
  sensor_id,
  AVG(value) as avg_value,
  MIN(value) as min_value,
  MAX(value) as max_value,
  STDDEV(value) as stddev_value
FROM sensor_readings
GROUP BY hour, machine_id, sensor_id;

OEE (Overall Equipment Effectiveness): The North Star Metric

OEE Definition

OEE = Availability × Performance × Quality

Where:

  • Availability: % of scheduled time that machine is running
    • (Planned production time - Downtime) / Planned production time
  • Performance: % of maximum speed that machine is running
    • (Actual output / Theoretical max output) × 100
  • Quality: % of produced units that are good (no defects)
    • Good units / Total units produced

World-class OEE:

  • 85%+: World-class manufacturing
  • 60-85%: Typical manufacturing
  • <60%: Needs improvement

Example calculation:

  • Shift duration: 8 hours = 480 minutes

  • Planned downtime: 30 minutes (lunch break, changeover)

  • Unplanned downtime: 60 minutes (breakdown)

  • Production time: 480 - 30 - 60 = 390 minutes

  • Ideal cycle time: 1 minute/unit

  • Total units produced: 350 units

  • Defective units: 20 units

Calculations:

Availability = (480 - 30 - 60) / (480 - 30) = 390 / 450 = 86.7%
Performance = (350 units × 1 min) / 390 min = 89.7%
Quality = (350 - 20) / 350 = 94.3%

OEE = 0.867 × 0.897 × 0.943 = 73.3%

Six Big Losses (OEE loss categories):

  1. Breakdowns (Availability) - Equipment failure
  2. Setup/Changeover (Availability) - Switching between products
  3. Small stops (Performance) - Idling, minor stoppages
  4. Reduced speed (Performance) - Running below max speed
  5. Startup rejects (Quality) - Scrap during startup
  6. Production rejects (Quality) - Defects during normal production

Real-time OEE Calculation

Data inputs:

  • Machine state: Running, idle, down (from sensors or operator input)
  • Production count: Good units, defective units (from vision system or manual entry)
  • Downtime reasons: Breakdown, changeover, no material, etc. (operator logging)

SQL implementation (dbt model):

-- Calculate OEE per shift per machine
WITH shift_schedule AS (
  SELECT
    machine_id,
    shift_date,
    shift_name,  -- A, B, C (morning, afternoon, night)
    shift_start,
    shift_end,
    EXTRACT(EPOCH FROM (shift_end - shift_start)) / 60 as planned_production_time_minutes,
    planned_downtime_minutes  -- Lunch, meetings
  FROM production_shifts
),

downtime_summary AS (
  SELECT
    machine_id,
    DATE(start_time) as shift_date,
    shift_name,
    SUM(duration_minutes) as total_downtime_minutes
  FROM downtime_events
  WHERE downtime_category = 'unplanned'
  GROUP BY machine_id, shift_date, shift_name
),

production_summary AS (
  SELECT
    machine_id,
    DATE(timestamp) as shift_date,
    shift_name,
    SUM(quantity_produced) as total_units,
    SUM(quantity_defective) as defective_units
  FROM production_logs
  GROUP BY machine_id, shift_date, shift_name
),

machine_specs AS (
  SELECT
    machine_id,
    ideal_cycle_time_seconds  -- Theoretical fastest production rate
  FROM machines
)

SELECT
  s.machine_id,
  s.shift_date,
  s.shift_name,

  -- Availability
  (s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0)) as actual_runtime_minutes,
  (s.planned_production_time_minutes - s.planned_downtime_minutes) as available_time_minutes,
  (s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0)) /
    NULLIF(s.planned_production_time_minutes - s.planned_downtime_minutes, 0) as availability,

  -- Performance
  COALESCE(p.total_units, 0) as total_units,
  (COALESCE(p.total_units, 0) * m.ideal_cycle_time_seconds / 60.0) as theoretical_runtime_minutes,
  (s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0)) as actual_runtime_minutes,
  (COALESCE(p.total_units, 0) * m.ideal_cycle_time_seconds / 60.0) /
    NULLIF(s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0), 0) as performance,

  -- Quality
  COALESCE(p.total_units, 0) - COALESCE(p.defective_units, 0) as good_units,
  (COALESCE(p.total_units, 0) - COALESCE(p.defective_units, 0)) /
    NULLIF(COALESCE(p.total_units, 0), 0) as quality,

  -- OEE
  (
    (s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0)) /
    NULLIF(s.planned_production_time_minutes - s.planned_downtime_minutes, 0)
  ) *
  (
    (COALESCE(p.total_units, 0) * m.ideal_cycle_time_seconds / 60.0) /
    NULLIF(s.planned_production_time_minutes - s.planned_downtime_minutes - COALESCE(d.total_downtime_minutes, 0), 0)
  ) *
  (
    (COALESCE(p.total_units, 0) - COALESCE(p.defective_units, 0)) /
    NULLIF(COALESCE(p.total_units, 0), 0)
  ) as oee

FROM shift_schedule s
LEFT JOIN downtime_summary d USING (machine_id, shift_date, shift_name)
LEFT JOIN production_summary p USING (machine_id, shift_date, shift_name)
JOIN machine_specs m USING (machine_id);

Real-time OEE dashboard (updated every minute):

  • Current shift OEE: 73.3% (target: 80%)
  • Availability: 86.7% ⚠️ (target: 90%)
  • Performance: 89.7% ✓
  • Quality: 94.3% ✓

Alerts:

  • OEE below target for 2 consecutive hours → Notify production manager
  • Availability drop (machine down) → Immediate SMS to maintenance team

OEE Improvement Initiatives

Availability improvement:

  • Reduce breakdowns: Predictive maintenance (see next section)
  • Reduce changeover time: SMED (Single-Minute Exchange of Die) methodology
    • Example: Reduce changeover from 60 min → 15 min = 45 min saved per changeover
    • If 2 changeovers/day: 90 min × 250 days = 22,500 min/year = 375 hours/year

Performance improvement:

  • Eliminate small stops: Root cause analysis (jams, misfeeds)
  • Speed optimization: Gradually increase speed while monitoring quality

Quality improvement:

  • Process control: SPC charts, real-time monitoring
  • First-piece inspection: Catch setup errors early
  • Poka-yoke: Error-proofing devices

Predictive Maintenance: Từ Reactive → Proactive

Maintenance Strategies Comparison

StrategyTriggerCostDowntimeSuitability
ReactiveRun to failureLow upfront, high emergency costHigh (unplanned)Non-critical equipment
PreventiveTime-based scheduleMediumMedium (planned)Most equipment
PredictiveCondition-basedHigh (sensors, ML)Low (planned at optimal time)Critical equipment
PrescriptiveAI recommends actionHighestLowestFuture state

Example: Motor bearing maintenance

Reactive: Wait until bearing fails → Unplanned downtime 8 hours, emergency parts order (3x cost), possible collateral damage

Preventive: Replace bearing every 5,000 hours → Some bearings replaced too early (wasted), some fail before schedule

Predictive: Monitor vibration, temperature → Replace when condition degrades → Optimal timing, planned downtime 2 hours, no emergency costs

Predictive Maintenance ML Model

Use case: Predict bearing failure 3-7 days in advance

Sensor data:

  • Vibration: Frequency spectrum (FFT analysis)
    • Bearing defect frequencies: BPFO, BPFI, BSF, FTF
    • Increased vibration at specific frequencies = bearing wear
  • Temperature: Bearing housing temperature
    • Normal: 40-60°C, Warning: >70°C, Critical: >85°C
  • Acoustic emission: Ultrasonic sound
    • Early indicator of cracks, lubrication issues

Feature engineering:

import numpy as np
from scipy import signal
from scipy.fft import fft

def extract_vibration_features(vibration_signal, sampling_rate=10000):
    """
    Extract features from vibration sensor signal
    vibration_signal: array of acceleration values (g)
    """
    features = {}

    # Time domain features
    features['rms'] = np.sqrt(np.mean(vibration_signal**2))  # Root mean square
    features['peak'] = np.max(np.abs(vibration_signal))
    features['crest_factor'] = features['peak'] / features['rms']
    features['kurtosis'] = np.mean((vibration_signal - np.mean(vibration_signal))**4) / (np.std(vibration_signal)**4)

    # Frequency domain features (FFT)
    fft_values = np.abs(fft(vibration_signal))
    frequencies = np.fft.fftfreq(len(vibration_signal), 1/sampling_rate)

    # Energy in specific frequency bands (bearing defect frequencies)
    # For example, BPFO (Ball Pass Frequency Outer race) for this bearing = 156 Hz
    bpfo_band = (150, 162)  # Hz
    bpfo_energy = np.sum(fft_values[(frequencies >= bpfo_band[0]) & (frequencies <= bpfo_band[1])]**2)
    features['bpfo_energy'] = bpfo_energy

    # Similar for BPFI, BSF, etc.

    return features

# Apply to historical data
df['vibration_features'] = df['vibration_signal'].apply(extract_vibration_features)

Labeling data (supervised learning):

  • Failure events: Known bearing failures (from maintenance logs)
  • Time window: Label last 7 days before failure as "degraded"
  • Healthy: All other periods

Training data structure:

# Features (X)
features = [
  'vibration_rms', 'vibration_peak', 'vibration_kurtosis', 'bpfo_energy',
  'temperature', 'temperature_change_rate',
  'hours_since_last_maintenance', 'machine_age_days',
  # Rolling statistics (trends)
  'vibration_rms_7d_mean', 'vibration_rms_7d_trend',
  'temperature_7d_mean', 'temperature_7d_max'
]

X = df[features]

# Labels (y)
# 0 = healthy, 1 = needs maintenance soon (within 7 days)
y = df['needs_maintenance_7d']

Model training (Random Forest or XGBoost):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import precision_recall_curve, roc_auc_score

# Time-series split (avoid data leakage)
tscv = TimeSeriesSplit(n_splits=5)

model = RandomForestClassifier(
  n_estimators=300,
  max_depth=10,
  min_samples_split=20,
  class_weight='balanced',  # Handle imbalanced classes
  random_state=42
)

# Cross-validation
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    model.fit(X_train, y_train)
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    auc = roc_auc_score(y_test, y_pred_proba)
    print(f"AUC: {auc:.4f}")  # Target: >0.85

# Feature importance
feature_importance = pd.DataFrame({
  'feature': features,
  'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print(feature_importance.head(10))

Typical important features:

  1. vibration_rms (25%) - Most predictive
  2. temperature (18%)
  3. bpfo_energy (15%) - Bearing-specific frequency
  4. vibration_kurtosis (12%)
  5. hours_since_last_maintenance (10%)

Model deployment:

# Real-time scoring (runs every hour)
def predict_maintenance_need(machine_id, model):
    # Fetch latest sensor data (last 24 hours)
    sensor_data = fetch_sensor_data(machine_id, hours=24)

    # Extract features
    features = extract_vibration_features(sensor_data['vibration'])
    features['temperature'] = sensor_data['temperature'].mean()
    # ... other features

    # Predict
    features_df = pd.DataFrame([features])
    failure_probability = model.predict_proba(features_df)[0, 1]

    # Decision rules
    if failure_probability > 0.7:
        return {"action": "URGENT", "message": "Schedule maintenance within 3 days"}
    elif failure_probability > 0.4:
        return {"action": "PLAN", "message": "Monitor closely, plan maintenance within 7 days"}
    else:
        return {"action": "OK", "message": "No action needed"}

# Run for all critical machines
for machine in critical_machines:
    prediction = predict_maintenance_need(machine['id'], trained_model)
    if prediction['action'] in ['URGENT', 'PLAN']:
        create_maintenance_work_order(machine['id'], prediction)
        notify_maintenance_team(machine['id'], prediction)

ROI of Predictive Maintenance

Case study: Electronics manufacturer, 50 critical machines

Before (Reactive + Preventive):

  • Unplanned downtime: 8% of production time
    • 50 machines × 8% × 8,000 hours/year = 32,000 hours lost
    • Cost: 32,000 hours × $200/hour (lost production) = $6.4M/year
  • Maintenance cost: $800K/year
    • Emergency repairs (50% premium): $300K
    • Planned preventive: $500K

After (Predictive):

  • Unplanned downtime: 3% (reduced 62%)
    • Saved: 5% × 50 machines × 8,000 hours × $200 = $4M/year
  • Maintenance cost: $600K/year (25% reduction)
    • No emergency premiums: Save $150K
    • Optimize PM schedule: Save $50K
  • System cost: $200K (sensors, software, ML models)

Net benefit Year 1: $4M + $200K - $200K = $4M ROI: 2000%

Quality Prediction: Real-time Defect Prevention

SPC (Statistical Process Control) Automation

Control charts: Monitor process parameters, detect when going out of control

Example: PCB soldering temperature

  • Target: 250°C ± 5°C
  • UCL (Upper Control Limit): 255°C
  • LCL (Lower Control Limit): 245°C

Automated SPC:

-- Real-time monitoring (query runs every minute)
WITH latest_readings AS (
  SELECT
    machine_id,
    AVG(value) as avg_temperature,
    STDDEV(value) as stddev_temperature,
    COUNT(*) as sample_size
  FROM sensor_readings
  WHERE sensor_id = 'TEMP_SOLDERING'
    AND timestamp >= NOW() - INTERVAL '10 minutes'
  GROUP BY machine_id
),

control_limits AS (
  SELECT
    machine_id,
    target_value,
    ucl,  -- Upper control limit
    lcl   -- Lower control limit
  FROM process_specifications
  WHERE parameter = 'soldering_temperature'
)

SELECT
  lr.machine_id,
  lr.avg_temperature,
  cl.target_value,
  cl.ucl,
  cl.lcl,
  CASE
    WHEN lr.avg_temperature > cl.ucl THEN 'OUT_OF_CONTROL_HIGH'
    WHEN lr.avg_temperature < cl.lcl THEN 'OUT_OF_CONTROL_LOW'
    ELSE 'IN_CONTROL'
  END as status,
  CASE
    WHEN lr.avg_temperature > cl.ucl OR lr.avg_temperature < cl.lcl
      THEN 'ALERT: Adjust soldering temperature immediately'
    ELSE NULL
  END as alert_message
FROM latest_readings lr
JOIN control_limits cl USING (machine_id)
WHERE status != 'IN_CONTROL';

Alert actions:

  • Out of control: Stop production, notify operator, adjust settings
  • Trending towards limits: Warning, investigate root cause

Defect Prediction Model

Use case: Predict defect probability before final inspection

Input features:

  • Process parameters: Temperature, pressure, speed, humidity
  • Material batch: Supplier, batch number, incoming inspection results
  • Machine condition: Hours since maintenance, previous defect rates
  • Operator: Experience level, shift (fatigue factor)

Training data:

# Historical production data with defect labels
df = pd.read_sql("""
  SELECT
    p.work_order_id,
    p.machine_id,
    p.operator_id,
    p.material_batch_id,
    -- Process parameters (average during production)
    AVG(s.temperature) as avg_temperature,
    AVG(s.pressure) as avg_pressure,
    AVG(s.speed) as avg_speed,
    -- Material quality
    m.incoming_inspection_score,
    -- Machine condition
    mc.hours_since_maintenance,
    mc.recent_defect_rate_7d,
    -- Operator
    o.experience_years,
    -- Target
    CASE WHEN qi.defects_found > 0 THEN 1 ELSE 0 END as has_defects
  FROM production_logs p
  JOIN sensor_readings s USING (machine_id, timestamp)
  JOIN materials m ON p.material_batch_id = m.batch_id
  JOIN machine_condition mc USING (machine_id)
  JOIN operators o USING (operator_id)
  LEFT JOIN quality_inspections qi USING (work_order_id)
""", con=db_connection)

Model:

from xgboost import XGBClassifier

features = [
  'avg_temperature', 'avg_pressure', 'avg_speed',
  'incoming_inspection_score', 'hours_since_maintenance',
  'recent_defect_rate_7d', 'experience_years'
]

X = df[features]
y = df['has_defects']

model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X, y)

# Predict defect probability in real-time
def predict_defect_risk(current_work_order):
    features = extract_features(current_work_order)
    defect_probability = model.predict_proba([features])[0, 1]

    if defect_probability > 0.3:  # High risk
        alert = f"High defect risk ({defect_probability:.1%}). Recommend: Increase inspection frequency"
        send_alert_to_quality_team(alert)

    return defect_probability

Closed-loop control:

  • High defect risk detected → Automatically adjust process parameters (e.g., reduce speed, increase temperature) → Re-predict → Iterate until risk acceptable

ROI:

  • Reduce defect rate 5% → 2% (60% improvement)
  • For factory with 1M units/year, defect cost $10/unit:
    • Before: 1M × 5% × $10 = $500K/year
    • After: 1M × 2% × $10 = $200K/year
    • Savings: $300K/year

Supply Chain Optimization: Demand-Supply Matching

Demand Forecasting for Production Planning

Inputs:

  • Historical sales: Customer orders (SKU level)
  • Seasonality: Monthly, quarterly patterns
  • Customer signals: Forecasts from key customers, RFQs (Request for Quotation)
  • Market trends: Industry growth, competitor actions

Forecasting model (Prophet or ARIMA):

from prophet import Prophet

# Historical sales data
df_sales = pd.read_sql("""
  SELECT
    DATE_TRUNC('week', order_date) as ds,
    SUM(quantity) as y
  FROM customer_orders
  WHERE product_id = 'SKU-12345'
    AND order_date >= '2023-01-01'
  GROUP BY ds
  ORDER BY ds
""", con=db_connection)

# Train Prophet model
model = Prophet(yearly_seasonality=True, weekly_seasonality=False)
model.fit(df_sales)

# Forecast next 12 weeks
future = model.make_future_dataframe(periods=12, freq='W')
forecast = model.predict(future)

print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(12))

Production plan:

  • Forecast demand: 10,000 units/week (next 12 weeks)
  • Lead time: 2 weeks (raw materials to finished goods)
  • Safety stock: 2 weeks (buffer for forecast error)
  • Production schedule: 10,000 units/week, maintain 20,000 units safety stock

Inventory Optimization: Raw Materials

Challenge: Balance inventory holding cost vs stockout risk

Multi-echelon inventory model:

Suppliers → Raw materials warehouse → Production lines → Finished goods warehouse → Customers

Optimal reorder point & quantity:

import numpy as np

def calculate_reorder_point(
    avg_daily_demand,
    lead_time_days,
    demand_std,
    service_level=0.95
):
    """
    Reorder point = (Avg daily demand × Lead time) + Safety stock
    Safety stock = Z-score × Std(demand) × √(lead time)
    """
    from scipy.stats import norm

    z_score = norm.ppf(service_level)  # 0.95 → 1.645
    safety_stock = z_score * demand_std * np.sqrt(lead_time_days)
    reorder_point = (avg_daily_demand * lead_time_days) + safety_stock

    return {
        'reorder_point': reorder_point,
        'safety_stock': safety_stock
    }

# Example: Steel sheets for stamping
avg_daily_demand = 500  # sheets/day
lead_time_days = 14  # Supplier lead time
demand_std = 100  # Standard deviation

result = calculate_reorder_point(avg_daily_demand, lead_time_days, demand_std, service_level=0.98)
print(f"Reorder point: {result['reorder_point']:.0f} sheets")
print(f"Safety stock: {result['safety_stock']:.0f} sheets")

# Reorder point: 7,000 + 331 = 7,331 sheets
# When inventory drops below 7,331 sheets → Place order

Economic Order Quantity (EOQ):

def calculate_eoq(annual_demand, ordering_cost, holding_cost_per_unit):
    """
    EOQ = √(2 × D × S / H)
    D = Annual demand
    S = Ordering cost per order
    H = Holding cost per unit per year
    """
    eoq = np.sqrt((2 * annual_demand * ordering_cost) / holding_cost_per_unit)
    return eoq

annual_demand = 500 * 250  # 500/day × 250 working days = 125,000
ordering_cost = 5_000_000  # VND per order (admin, shipping, receiving)
holding_cost = 50_000  # VND per sheet per year (storage, insurance, obsolescence)

eoq = calculate_eoq(annual_demand, ordering_cost, holding_cost)
print(f"EOQ: {eoq:.0f} sheets per order")
# EOQ ≈ 5,000 sheets

# Order 5,000 sheets each time, 25 times per year

Supplier Performance Tracking

Metrics:

  • OTD (On-Time Delivery): % of orders delivered on promised date
  • Quality: % defect rate in incoming materials
  • Lead time: Average days from PO to delivery
  • Responsiveness: Time to respond to inquiries

Scorecard:

SELECT
  supplier_id,
  supplier_name,
  COUNT(*) as total_orders,

  -- On-time delivery
  SUM(CASE WHEN actual_delivery_date <= promised_delivery_date THEN 1 ELSE 0 END) /
    COUNT(*) * 100 as otd_percentage,

  -- Quality
  AVG(incoming_inspection_pass_rate) as avg_quality_score,

  -- Lead time
  AVG(DATE_DIFF(actual_delivery_date, po_date, DAY)) as avg_lead_time_days,

  -- Overall score (weighted average)
  (
    (SUM(CASE WHEN actual_delivery_date <= promised_delivery_date THEN 1 ELSE 0 END) / COUNT(*)) * 0.4 +
    AVG(incoming_inspection_pass_rate) * 0.4 +
    (1 - (AVG(DATE_DIFF(actual_delivery_date, po_date, DAY)) / 30)) * 0.2  -- Normalize lead time
  ) * 100 as supplier_score

FROM purchase_orders po
JOIN suppliers s USING (supplier_id)
WHERE po_date >= '2024-01-01'
GROUP BY supplier_id, supplier_name
ORDER BY supplier_score DESC;

Actions:

  • Top suppliers (score >90): Preferred, increase share
  • Medium (70-90): Acceptable, monitor
  • Low (<70): Improvement plan or replace

Case Study: Electronics Manufacturer - Giảm Downtime 40%, Tiết Kiệm 2B VND/Năm

Background:

  • Company: Contract manufacturer (PCB assembly), 2 factories
  • Production: 5M units/year, 20 SMT lines
  • Revenue: ~500B VND/year

Problems:

  1. High unplanned downtime: 12% (industry average ~6-8%)
    • Frequent machine breakdowns
    • No early warning system
  2. Manual OEE tracking: Operators fill Excel sheets end-of-shift
    • Data incomplete, inaccurate
    • No real-time visibility for managers
  3. Quality issues: 3% defect rate (target <1%)
    • Reactive: Find defects at final inspection, rework costly

Solution: Smart Factory Platform (20 weeks implementation)

Phase 1: IoT Infrastructure (Weeks 1-8)

Sensor deployment:

  • 20 SMT lines, each with:
    • Vibration sensors: 4 per line (critical motors)
    • Temperature sensors: 6 per line (reflow oven zones)
    • Vision systems: 2 per line (component placement accuracy, solder quality)
    • Current sensors: 10 per line (motor loads)
  • Total: 440 sensors

Edge gateways:

  • 20 industrial PCs (one per line): Collect sensor data, local processing
  • Node-RED flows: Data filtering, aggregation
  • Local InfluxDB: 7-day retention, real-time dashboards

Network:

  • Factory WiFi upgrade (industrial-grade APs)
  • Redundant internet (2 ISPs for reliability)
  • VPN to cloud

Integration with existing systems:

  • MES (Siemens Opcenter): Work orders, production counts
  • ERP (SAP): Materials, maintenance logs
  • APIs + Airbyte connectors

Cost: 800M VND (sensors $100K, edge devices $50K, installation labor $50K)

Phase 2: Data Platform & Analytics (Weeks 9-16)

Cloud setup (Google Cloud Platform):

  • BigQuery: Data warehouse
  • Cloud Functions: Serverless processing
  • Cloud Composer (Airflow): Orchestration
  • Vertex AI: ML model training & deployment

Data pipelines:

  • Real-time: Sensor data (anomalies only) → Pub/Sub → BigQuery
  • Batch: MES, ERP → Airbyte → BigQuery (hourly)

dbt models:

-- OEE calculation (per line, per shift)
{{ config(materialized='incremental') }}

SELECT
  production_line_id,
  shift_date,
  shift_name,
  availability,
  performance,
  quality,
  availability * performance * quality as oee
FROM {{ ref('shift_production_summary') }}
WHERE shift_date >= '2024-01-01'

ML models:

  1. Predictive maintenance (vibration + temperature → bearing failure prediction)

    • Training data: 18 months historical + 50 labeled failures
    • Model: Random Forest, AUC 0.88
    • Deployment: Hourly scoring, alerts if failure probability >50%
  2. Quality prediction (process params → defect probability)

    • Training data: 2M units with inspection results
    • Model: XGBoost, AUC 0.82
    • Deployment: Real-time scoring per unit

Cost: 400M VND (GCP credits, software licenses, ML development)

Phase 3: Dashboards & Alerts (Weeks 17-20)

Real-time dashboards (Grafana):

  • Line overview: 20 lines, status (running/idle/down), current OEE
  • OEE details: Availability, Performance, Quality breakdown
  • Alerts: Downtime events, predictive maintenance warnings

BI dashboards (Looker):

  • Executive: Overall OEE trend, top loss categories, plant comparison
  • Production manager: Line performance ranking, shift comparisons
  • Maintenance team: Equipment health, upcoming maintenance, spare parts inventory

Mobile app (for factory floor):

  • Operators: Log downtime reasons, view line status
  • Maintenance: Work orders, equipment health alerts

Automated alerts:

  • Downtime >10 min: SMS to production manager
  • Predictive maintenance (failure probability >70%): Email to maintenance team, create work order
  • Quality risk (defect probability >30%): Alert operator, suggest parameter adjustments

Cost: 200M VND (dashboard development, mobile app)

Results After 12 Months

MetricBeforeAfterChange
OEE65%78%+13 pp (+20%)
Availability82%91%+9 pp
Unplanned downtime12%7%-42% (↓5 pp)
Defect rate3.0%1.2%-60% (↓1.8 pp)
Maintenance cost20B VND15B VND-25%
Emergency repairs40% of maintenance10%-75%
Output (units/year)5M5.8M+16% (same capacity)

Financial impact (annual):

Revenue increase (higher output):

  • Additional 800K units × 100,000 VND/unit = 80B VND

Cost savings:

  • Reduced downtime: 5% capacity increase = 25B VND (opportunity cost)
  • Reduced defects: 1.8% × 5.8M units × 50,000 VND rework cost = 5.2B VND
  • Maintenance optimization: 5B VND

Total benefit: 80B + 25B + 5.2B + 5B = 115B VND Investment: 1.4B VND ROI Year 1: 8,100% (payback in 5 days!)

Intangible benefits:

  • Customer satisfaction ↑ (fewer delays, better quality)
  • Employee morale ↑ (less firefighting, more proactive work)
  • Competitive advantage (can bid for higher-quality contracts)

Key Learnings

What worked:

  • Start with pilot: 1 production line first (2 weeks), prove value, then scale
  • OT-IT collaboration: Involve maintenance team early, they know the machines
  • Quick wins: Real-time OEE dashboards had immediate impact, built buy-in

Challenges:

  • Sensor installation: Downtime for retrofitting (scheduled during maintenance windows)
  • Data quality: Initial sensor readings noisy (vibration sensors placed wrong) → Recalibration needed
  • Change management: Operators initially skeptical of "robots watching them" → Training, emphasize helping not policing

Recommendations:

  • Budget 20-30% buffer for unexpected issues (sensor failures, network problems)
  • Plan for ongoing ML model maintenance (retrain quarterly)
  • Invest in training (operators, maintenance team need to understand the system)

Implementation Roadmap: 16 Weeks to Smart Factory (Pilot Line)

Phase 1: Assessment & Design (Weeks 1-2)

  • Identify pilot production line (high-value, representative)
  • Map current state: Machines, sensors, data systems
  • Define use cases: OEE tracking, predictive maintenance, quality prediction
  • Architecture design: Edge + cloud components
  • Vendor selection: Sensors, edge hardware, cloud platform

Phase 2: Infrastructure Setup (Weeks 3-6)

  • Procure hardware: Sensors, edge gateways, network equipment
  • Install sensors on pilot line (plan downtime)
  • Setup edge gateway: Node-RED, InfluxDB
  • Test sensor connectivity, data collection
  • Setup cloud Data Warehouse (BigQuery/Snowflake)

Phase 3: Data Integration (Weeks 7-10)

  • Integrate MES, ERP data sources (Airbyte connectors)
  • Build data pipelines: Edge → Cloud, batch syncs
  • Data modeling: OEE tables, sensor aggregations
  • dbt transformations: Core metrics calculations
  • Data quality checks, validation

Phase 4: Analytics & ML (Weeks 11-14)

  • Build OEE dashboard (real-time, Grafana)
  • Train predictive maintenance model (if sufficient historical data)
  • Deploy ML models (Cloud Functions, Vertex AI)
  • Setup alerting rules (downtime, anomalies)
  • BI dashboards for management (Looker/Tableau)

Phase 5: Testing & Rollout (Weeks 15-16)

  • UAT with operators, maintenance team
  • Refine dashboards based on feedback
  • Training sessions (operators, managers)
  • Documentation (SOPs, troubleshooting guides)
  • Go-live: Switch from old system to new

Phase 6: Scale (Post Week 16)

  • Monitor pilot line for 4-8 weeks, iterate
  • Calculate ROI, build business case for scaling
  • Rollout to remaining production lines (4-8 weeks per line)
  • Expand use cases: Supply chain, energy optimization, etc.

Checklist: Data Platform for Manufacturing

Data sources:

  • IoT sensors deployed (vibration, temperature, current)
  • SCADA/PLC integration (Modbus, OPC UA)
  • MES integration (work orders, downtime logging)
  • ERP integration (materials, maintenance)
  • QMS integration (inspection results, defects)

Infrastructure:

  • Edge gateways installed (Node-RED, Telegraf)
  • Time-series database (InfluxDB or TimescaleDB)
  • Cloud Data Warehouse (BigQuery, Snowflake, Redshift)
  • Network (WiFi, VPN, redundancy)

Data models:

  • OEE calculation (availability, performance, quality)
  • Downtime event tracking (with categorization)
  • Production metrics (output, cycle times)
  • Equipment health metrics (vibration, temperature trends)
  • Quality metrics (defect rates, Cpk)

Analytics:

  • Real-time OEE dashboard (per line, per shift)
  • Predictive maintenance models deployed
  • Quality prediction models deployed
  • Root cause analysis tools (Pareto charts, fishbone diagrams)

Alerts & actions:

  • Downtime alerts (immediate notification)
  • Predictive maintenance alerts (create work orders)
  • Quality alerts (out-of-spec warnings)
  • Automated reporting (daily OEE summary, weekly review)

Kết Luận: Industry 4.0 is ROI-Positive from Day 1

Manufacturing không thể cạnh tranh bằng labor cost thấp mãi mãi. Smart factories với Data Platform mang lại competitive advantage bền vững: higher quality, faster production, lower costs, và flexibility to adapt.

Key takeaways:

  • OEE improvement 10-15% = $1-2M/year cho medium factory
  • Predictive maintenance reduces downtime 30-50%, maintenance costs 25-30%
  • Quality prediction prevents defects, reduces rework/scrap 60-80%
  • Payback period: 6-18 months (depending on factory size, use cases)

Next steps:

  • Assess current state (do you have sensors? Can you access MES/ERP data?)
  • Identify top pain point (downtime? quality? inventory?)
  • Start with 1 production line pilot (16 weeks, prove value)
  • Liên hệ Carptech nếu cần tư vấn (carptech.vn/contact)

Tài liệu tham khảo:


Bài viết này là phần của series "Data Platform for Industries". Đọc thêm về E-commerce, Fintech, và Retail.

Carptech - Data Platform Solutions for Vietnamese Enterprises. Liên hệ tư vấn miễn phí.

Có câu hỏi về Data Platform?

Đội ngũ chuyên gia của Carptech sẵn sàng tư vấn miễn phí về giải pháp phù hợp nhất cho doanh nghiệp của bạn. Đặt lịch tư vấn 60 phút qua Microsoft Teams hoặc gửi form liên hệ.

✓ Miễn phí 100% • ✓ Microsoft Teams • ✓ Không cam kết dài hạn