Data Modeling 101: Star Schema vs Snowflake Schema vs Data Vault

Bạn có một Data Warehouse với hàng trăm bảng, nhưng analysts vẫn phàn nàn: “Không tìm ra data cần”, “Query chạy mãi không ra”, “Số liệu không khớp với nhau”.

Vấn đề không phải là data. Vấn đề là data modeling.

Data modeling là nền tảng của mọi Data Warehouse và Data Platform thành công. Modeling sai từ đầu = analytics chậm, query phức tạp, ác mộng bảo trì và cuối cùng là dự án thất bại.

Theo Gartner, 60% data warehouse projects fail - và majority trong số đó fail vì poor data modeling decisions, không phải thiếu data hay technology.

Nhưng data modeling không khó. Có 4 hướng chính, mỗi hướng có use case riêng:

Star Schema: đơn giản, nhanh, thân thiện BI (Kimball methodology)
Snowflake Schema: chuẩn hoá, giảm dư thừa
Data Vault 2.0: chuẩn enterprise, dễ audit, linh hoạt
One Big Table (OBT): tiếp cận hiện đại với dbt

Trong bài này, bạn sẽ nắm:

Vì sao data modeling quan trọng (performance, clarity, scalability)
Chi tiết từng approach: ưu/nhược điểm và khi nào dùng
Code sample (dbt) cho mỗi pattern
Decision framework: chọn mô hình cho use case của bạn
Các lỗi phổ biến và cách tránh
Ví dụ thực tế từ e-commerce, finance, SaaS

Sau bài này, bạn sẽ tự tin design data models scale được, perform tốt, và dễ maintain.

Tại sao Data Modeling Matters?

Trước khi đi vào các approaches, hãy hiểu tại sao data modeling lại quan trọng.

Problem: Data Warehouse Without Good Modeling

Tình huống thực tế:

Bạn có một e-commerce Data Warehouse với 200 tables:

orders, order_items, order_payments, order_shipments
customers, customer_addresses, customer_segments
products, product_categories, product_variants
marketing_campaigns, campaign_performance
và 180+ tables khác...

Analyst muốn answer simple question: "Revenue by product category, by month, last 12 months"

Khi không có data modeling bài bản:

-- Analyst phải join 8 tables!
SELECT
  pc.category_name,
  DATE_TRUNC('month', o.order_date) as month,
  SUM(oi.quantity * oi.unit_price) as revenue
FROM orders o
  JOIN order_items oi ON o.order_id = oi.order_id
  JOIN order_payments op ON o.order_id = op.order_id
  JOIN products p ON oi.product_id = p.product_id
  JOIN product_variants pv ON oi.variant_id = pv.variant_id
  JOIN product_categories pc ON p.category_id = pc.category_id
WHERE op.payment_status = 'completed'
  AND o.order_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY 1, 2
ORDER BY 2, 1;

Vấn đề:

❌ Query phức tạp, nhiều joins → chậm (5-10 giây)
❌ Analyst phải hiểu cực rõ schema
❌ Logic khác nhau → kết quả không nhất quán
❌ Mỗi analyst tự viết → khó tái sử dụng

Khi áp dụng Star Schema đúng cách:

-- Simple, fast, clear
SELECT
  product_category,
  month,
  SUM(revenue) as revenue
FROM fct_sales
WHERE month >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY 1, 2
ORDER BY 2, 1;

Lợi ích:

✅ Chỉ cần 1 bảng, không join
✅ Query < 1 giây
✅ Bất kỳ ai cũng viết được
✅ Logic thống nhất

→ Đó là sức mạnh của data modeling chuẩn chỉnh.

3 lợi ích chính của Data Modeling

1. Performance (hiệu năng)

Dữ liệu được pre-join → ít join khi query
Metrics pre-aggregate → dashboard lên tức thì
Indexing/partitioning hợp lý
Tác động: Query nhanh hơn 10-100 lần

2. Clarity & Usability (rõ ràng & dễ dùng)

Schema tự giải thích
Dùng thuật ngữ business, không jargon kỹ thuật
Analyst tự phục vụ
Tác động: Giảm 80% tin nhắn kiểu “Data X ở đâu?”

3. Scalability (khả năng mở rộng)

Phân tầng rõ: raw → staged → marts
Dễ thêm nguồn dữ liệu mới
Duy trì ổn định nhiều năm
Tác động: Onboard data source mới trong vài ngày, không phải vài tháng

Star Schema: “Xương sống” của BI

Star Schema là approach phổ biến nhất trong data modeling, được Ralph Kimball phổ biến từ những năm 1990.

Star Schema Structure

                  ┌─────────────────┐
                  │  dim_customer   │
                  │─────────────────│
                  │ customer_key    │
                  │ customer_name   │
                  │ email           │
                  │ segment         │
                  └────────┬────────┘
                           │
                           │
     ┌─────────────────┐   │   ┌─────────────────┐
     │  dim_product    │   │   │   dim_time      │
     │─────────────────│   │   │─────────────────│
     │ product_key     │   │   │ date_key        │
     │ product_name    │   │   │ date            │
     │ category        │   │   │ year            │
     │ brand           │   │   │ quarter         │
     └────────┬────────┘   │   │ month           │
              │            │   └────────┬────────┘
              │            │            │
              │      ┌─────▼─────┐      │
              └─────►│ fct_sales │◄─────┘
                     │───────────│
                     │ sales_key │
                     │ date_key  │
                     │ customer_key
                     │ product_key
                     │ quantity  │
                     │ revenue   │
                     │ profit    │
                     └───────────┘

2 loại bảng chính:

1. Fact Tables (bảng sự kiện)

Trung tâm của star
Chứa metrics/measures: revenue, quantity, profit
Chứa foreign key tới các dimension
Thường rất lớn (hàng triệu → tỷ dòng)
Ví dụ: fct_sales, fct_orders, fct_web_sessions

2. Dimension Tables (bảng chiều)

Các điểm xung quanh star
Chứa thuộc tính mô tả: tên, danh mục, ngày tháng
Denormalized (cấu trúc phẳng)
Nhỏ hơn (vài nghìn → vài triệu dòng)
Ví dụ: dim_customer, dim_product, dim_time

Star Schema: Ưu và nhược điểm

✅ Ưu điểm:

1. Query performance

Join tối thiểu (fact + dimension)
BI tools tối ưu cho star schema
Có thể pre-aggregate metrics
Query điển hình: < 1 giây

2. Đơn giản

Dễ hiểu
Business user tự viết query được
Schema tự giải thích
Onboarding nhanh

3. Thân thiện với BI

Tableau, Looker, Power BI “ưa thích” star schema
Tự nhận biết quan hệ
Drag-and-drop analytics
Tối ưu hiệu năng

❌ Nhược điểm:

1. Dư thừa dữ liệu

Dimension denormalized → dữ liệu lặp
Ví dụ: tên category lặp hàng triệu lần trong dim_product
Tác động: Tăng storage chút ít (nhưng storage giờ rẻ)

2. Update phức tạp

Update thuộc tính dimension → phải cập nhật nhiều dòng
Ví dụ: khách đổi segment → update dim_customer + dữ liệu lịch sử?
Giải pháp: dùng Slowly Changing Dimension (SCD Type 1/2/3)

3. Không thân thiện audit

Khó track lineage
Không có historization built-in
Muốn compliance/audit trail phải làm thêm

Khi nào nên dùng Star Schema

✅ Phù hợp khi:

Business Intelligence & Reporting
Dashboards có drill-down
Self-service analytics
OLAP workload
Khi performance quan trọng hơn chi phí storage

❌ Không nên dùng khi:

Ngành yêu cầu compliance cao (cân nhắc Data Vault)
Dimension thay đổi liên tục
Quan hệ many-to-many phức tạp
Cần audit trail đầy đủ

Ví dụ Star Schema: E-commerce

Tình huống: Công ty e-commerce muốn phân tích hiệu suất bán hàng.

Triển khai bằng dbt:

Fact Table:

-- models/marts/fct_sales.sql
{{
  config(
    materialized='incremental',
    unique_key='sales_key',
    partition_by={
      'field': 'order_date',
      'data_type': 'date',
      'granularity': 'day'
    }
  )
}}

SELECT
  {{ dbt_utils.generate_surrogate_key(['order_id', 'order_item_id']) }} as sales_key,
  o.order_date,
  dc.customer_key,
  dp.product_key,
  dst.store_key,

  -- Metrics
  oi.quantity,
  oi.unit_price,
  oi.quantity * oi.unit_price as revenue,
  oi.quantity * oi.unit_price - oi.quantity * p.cost_price as profit,
  oi.discount_amount,

  -- Flags
  CASE WHEN o.is_first_order THEN 1 ELSE 0 END as is_first_order_flag

FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_order_items') }} oi ON o.order_id = oi.order_id
JOIN {{ ref('dim_customer') }} dc ON o.customer_id = dc.customer_id
JOIN {{ ref('dim_product') }} dp ON oi.product_id = dp.product_id
JOIN {{ ref('dim_store') }} dst ON o.store_id = dst.store_id
JOIN {{ ref('stg_products') }} p ON oi.product_id = p.product_id

{% if is_incremental() %}
  WHERE o.order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}

Dimension Table - Customer:

-- models/marts/dim_customer.sql
{{
  config(
    materialized='table'
  )
}}

SELECT
  {{ dbt_utils.generate_surrogate_key(['customer_id']) }} as customer_key,
  customer_id,

  -- Customer attributes
  first_name,
  last_name,
  first_name || ' ' || last_name as full_name,
  email,
  phone,

  -- Demographics
  birth_date,
  EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM birth_date) as age,
  CASE
    WHEN age < 25 THEN '18-24'
    WHEN age < 35 THEN '25-34'
    WHEN age < 45 THEN '35-44'
    WHEN age < 55 THEN '45-54'
    ELSE '55+'
  END as age_group,
  gender,

  -- Location (denormalized)
  city,
  state,
  country,
  postal_code,

  -- Segmentation
  customer_segment,  -- VIP, Regular, New
  lifetime_value_bucket,  -- High, Medium, Low

  -- Metadata
  first_order_date,
  last_order_date,
  total_orders,
  total_revenue,

  created_at,
  updated_at

FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('customer_metrics') }} cm ON c.customer_id = cm.customer_id

Dimension Table - Product:

-- models/marts/dim_product.sql
SELECT
  {{ dbt_utils.generate_surrogate_key(['product_id']) }} as product_key,
  product_id,

  -- Product details
  product_name,
  product_description,
  sku,

  -- Category hierarchy (denormalized)
  category_l1,  -- Electronics
  category_l2,  -- Laptops
  category_l3,  -- Gaming Laptops
  category_path,  -- Electronics > Laptops > Gaming Laptops

  -- Brand
  brand_name,
  brand_country,

  -- Attributes
  color,
  size,
  weight,

  -- Pricing
  list_price,
  cost_price,
  list_price - cost_price as margin,
  (list_price - cost_price) / NULLIF(list_price, 0) as margin_pct,

  -- Status
  is_active,
  is_featured,
  stock_status,

  created_at,
  updated_at

FROM {{ ref('stg_products') }} p
JOIN {{ ref('stg_categories') }} c ON p.category_id = c.category_id
JOIN {{ ref('stg_brands') }} b ON p.brand_id = b.brand_id

Simple Business Query:

-- Revenue by product category, by month
SELECT
  dp.category_l1 as category,
  DATE_TRUNC('month', fs.order_date) as month,
  SUM(fs.revenue) as total_revenue,
  SUM(fs.profit) as total_profit,
  SUM(fs.quantity) as units_sold,
  COUNT(DISTINCT fs.customer_key) as unique_customers

FROM {{ ref('fct_sales') }} fs
JOIN {{ ref('dim_product') }} dp ON fs.product_key = dp.product_key

WHERE fs.order_date >= '2024-01-01'

GROUP BY 1, 2
ORDER BY 2 DESC, 3 DESC;

Query performance: < 1 second, dù fact table có 100M+ rows.

Snowflake Schema: Normalized Dimensions

Snowflake Schema là variation của Star Schema với normalized dimensions.

Snowflake Schema Structure

┌──────────────┐
│ dim_country  │
└──────┬───────┘
       │
       │
┌──────▼───────┐     ┌─────────────────┐
│  dim_state   │     │  dim_category   │
└──────┬───────┘     └────────┬────────┘
       │                      │
       │                      │
┌──────▼───────┐     ┌────────▼────────┐
│  dim_city    │     │ dim_subcategory │
└──────┬───────┘     └────────┬────────┘
       │                      │
       │      ┌───────────┐   │
       └─────►│ fct_sales │◄──┘
              │───────────│
              │ date_key  │
              │ city_key  │
              │ product_key
              │ revenue   │
              └───────────┘
                    ▲
                    │
              ┌─────┴──────┐
              │ dim_brand  │
              └────────────┘

Key Difference vs Star:

Dimensions normalized thành multiple tables
Reduces data redundancy
More tables, more joins

Snowflake Schema: Ưu & nhược điểm

✅ Ưu điểm:

1. Less Redundancy

No duplicate data
Example: Country name stored once, not millions times
Storage savings (marginal trong cloud era)

2. Easier Updates

Update dimension attribute → update 1 row
Example: Rename category → 1 row update, not millions

3. Data Integrity

Foreign keys enforce referential integrity
Harder to get inconsistent data

❌ Nhược điểm:

1. Query Performance

More joins → slower queries
BI tools struggle with deep hierarchies
Typical query: 2-5x slower than Star

2. Complexity

Harder to understand
More tables to navigate
Onboarding takes longer

3. BI Tool Limitations

Many BI tools not optimized for snowflake schema
Auto-relationships break
Performance issues

Khi nào nên dùng Snowflake Schema

✅ Dùng khi:

Chi phí lưu trữ cực cao (hiếm trong thời cloud)
Dimension update rất thường xuyên
Cần đảm bảo integrity tuyệt đối
Dimension có hierarchy sâu

❌ Tránh khi:

Ưu tiên performance
Phụ thuộc BI tool nhiều
Team muốn sự đơn giản
Chi phí storage không đáng kể (cloud hiện đại)

Thực tế: Snowflake Schema nổi bật thời storage đắt (1990s-2000s). Với Snowflake/BigQuery ngày nay, storage rẻ và compute mạnh → Star Schema vẫn thắng đa số trường hợp.

Data Vault 2.0: Enterprise Data Warehouse

Data Vault là methodology được Dan Linstedt phát triển cho enterprise data warehouses với strict audit, compliance, và agility requirements.

Data Vault Structure

3 loại bảng:

1. Hubs – Business Keys

Danh sách duy nhất các thực thể business
Chỉ chứa business key + metadata
Ví dụ: hub_customer, hub_product, hub_order

2. Links – Relationships

Kết nối các hub
Thể hiện quan hệ many-to-many
Ví dụ: link_order_customer, link_order_product

3. Satellites – Thuộc tính & lịch sử

Chứa thuộc tính mô tả
Ghi nhận toàn bộ lịch sử thay đổi
Ví dụ: sat_customer_details, sat_product_pricing

┌──────────────┐
│ hub_customer │
│──────────────│
│ customer_hk  │  (hash key)
│ customer_id  │  (business key)
│ load_date    │
│ record_source│
└──────┬───────┘
       │
       │
       ├──────────────────────────┐
       │                          │
┌──────▼─────────────┐   ┌────────▼──────────┐
│sat_customer_details│   │sat_customer_status│
│────────────────────│   │───────────────────│
│customer_hk         │   │customer_hk        │
│first_name          │   │status             │
│last_name           │   │segment            │
│email               │   │load_date          │
│load_date           │   │load_end_date      │
│load_end_date       │   └───────────────────┘
└────────────────────┘
       │
       │
       ▼
┌──────────────────┐
│link_order_customer│
│──────────────────│
│link_hk           │
│order_hk          │
│customer_hk       │
│load_date         │
└──────────────────┘

Data Vault: Ưu & nhược điểm

✅ Ưu điểm:

1. Complete Audit Trail

Every change tracked với timestamp
Know exactly what data looked like at any point in time
Perfect cho compliance (SOX, GDPR, HIPAA)

2. Agility

Add new data sources easily
No need to redesign existing structures
Parallel development possible

3. Historization

Built-in slowly changing dimensions
No data loss
Time travel queries

4. Scalability

Handles complex relationships well
Insert-only (no updates/deletes) → fast loads
Partition-friendly

❌ Nhược điểm:

1. Complexity

3 table types to understand
Many more tables than Star Schema
Steep learning curve

2. Not Query-Friendly

Can't query Data Vault directly
Need to build Business Vault or Information Marts on top
Extra ETL/ELT layer required

3. Implementation Cost

More tables → more infrastructure
Requires specialized knowledge
Higher initial development time

4. Query Performance

Many joins to reconstruct business entities
Slower than Star Schema for analytics

Khi nào nên dùng Data Vault

✅ Thích hợp khi:

Ngành được quản lý chặt: Finance, Healthcare, Government
Enterprise data warehouse với 100+ nguồn dữ liệu
Yêu cầu audit: cần lineage & lịch sử đầy đủ
Môi trường agile: yêu cầu thay đổi liên tục
Dự án dài hạn: 5+ năm, cần scale lớn

❌ Quá nặng khi:

SMB hoặc mid-market
Dự án < 10 nguồn dữ liệu
Không có yêu cầu compliance nghiêm ngặt
Chỉ phục vụ BI/reporting (dùng Star Schema)

Data Vault Example: Banking

Hubs:

-- models/vault/hub_customer.sql
{{
  config(
    materialized='incremental',
    unique_key='customer_hk'
  )
}}

SELECT DISTINCT
  {{ dbt_utils.generate_surrogate_key(['customer_id', 'source_system']) }} as customer_hk,
  customer_id as customer_business_key,
  source_system as record_source,
  CURRENT_TIMESTAMP as load_date,
  'dbt' as loaded_by

FROM {{ ref('stg_customers') }}

{% if is_incremental() %}
  WHERE load_date > (SELECT MAX(load_date) FROM {{ this }})
{% endif %}

Links:

-- models/vault/link_account_customer.sql
{{
  config(
    materialized='incremental',
    unique_key='link_hk'
  )
}}

SELECT
  {{ dbt_utils.generate_surrogate_key(['account_hk', 'customer_hk']) }} as link_hk,
  ha.account_hk,
  hc.customer_hk,
  CURRENT_TIMESTAMP as load_date,
  'core_banking' as record_source

FROM {{ ref('stg_accounts') }} a
JOIN {{ ref('hub_account') }} ha ON a.account_id = ha.account_business_key
JOIN {{ ref('hub_customer') }} hc ON a.customer_id = hc.customer_business_key

{% if is_incremental() %}
  WHERE load_date > (SELECT MAX(load_date) FROM {{ this }})
{% endif %}

Satellites:

-- models/vault/sat_customer_details.sql
{{
  config(
    materialized='incremental',
    unique_key=['customer_hk', 'load_date']
  )
}}

SELECT
  hc.customer_hk,

  -- Attributes
  c.first_name,
  c.last_name,
  c.email,
  c.phone,
  c.date_of_birth,
  c.address,
  c.city,
  c.state,
  c.postal_code,

  -- Metadata
  CURRENT_TIMESTAMP as load_date,
  CAST(NULL AS TIMESTAMP) as load_end_date,  -- Updated when new version arrives
  {{ dbt_utils.generate_surrogate_key([
    'c.first_name', 'c.last_name', 'c.email',
    'c.phone', 'c.address'
  ]) }} as hash_diff,  -- Detect changes
  'core_banking' as record_source

FROM {{ ref('stg_customers') }} c
JOIN {{ ref('hub_customer') }} hc ON c.customer_id = hc.customer_business_key

{% if is_incremental() %}
  -- Only load changed records
  WHERE hash_diff NOT IN (
    SELECT hash_diff FROM {{ this }}
    WHERE load_end_date IS NULL
  )
{% endif %}

Business Vault (Analytics Layer):

-- models/marts/dim_customer_dv.sql
-- Reconstruct customer dimension from Data Vault

SELECT
  hc.customer_hk as customer_key,
  hc.customer_business_key as customer_id,

  -- Latest attributes from satellite
  scd.first_name,
  scd.last_name,
  scd.email,
  scd.phone,
  scd.address,
  scd.city,
  scd.state,

  -- Status from another satellite
  scs.customer_status,
  scs.segment,
  scs.risk_rating

FROM {{ ref('hub_customer') }} hc
LEFT JOIN {{ ref('sat_customer_details') }} scd
  ON hc.customer_hk = scd.customer_hk
  AND scd.load_end_date IS NULL  -- Latest version
LEFT JOIN {{ ref('sat_customer_status') }} scs
  ON hc.customer_hk = scs.customer_hk
  AND scs.load_end_date IS NULL

One Big Table (OBT): Modern dbt Approach

One Big Table là modern approach popularized bởi dbt community.

OBT Concept

Thay vì force-fit vào Star/Snowflake/Data Vault, create denormalized tables for specific use cases.

Philosophy:

Storage is cheap
Compute is powerful
Simplicity > optimization
Use case driven

OBT Structure

Raw Layer (Bronze)
↓
Staging Layer (Silver) - cleaned, typed
↓
Intermediate Layer - business logic
↓
Marts Layer (Gold) - One Big Table per use case
  ├── sales_mart (all sales data denormalized)
  ├── marketing_mart (all marketing data denormalized)
  └── customer_360_mart (all customer data denormalized)

OBT Example: E-commerce

-- models/marts/sales_mart.sql
-- One Big Table cho sales analytics

{{
  config(
    materialized='table',
    partition_by={
      'field': 'order_date',
      'data_type': 'date'
    }
  )
}}

SELECT
  -- Order info
  o.order_id,
  o.order_date,
  o.order_timestamp,
  DATE_TRUNC('month', o.order_date) as order_month,
  DATE_TRUNC('year', o.order_date) as order_year,
  o.order_status,

  -- Customer info (denormalized)
  c.customer_id,
  c.customer_name,
  c.customer_email,
  c.customer_segment,
  c.customer_lifetime_value,
  c.first_order_date,
  CASE WHEN o.order_date = c.first_order_date THEN TRUE ELSE FALSE END as is_first_order,

  -- Customer location
  c.city as customer_city,
  c.state as customer_state,
  c.country as customer_country,

  -- Product info (denormalized)
  p.product_id,
  p.product_name,
  p.sku,
  p.category_l1,
  p.category_l2,
  p.category_l3,
  p.brand_name,
  p.brand_country,

  -- Order item metrics
  oi.quantity,
  oi.unit_price,
  oi.discount_amount,
  oi.quantity * oi.unit_price as revenue,
  oi.quantity * oi.unit_price - oi.discount_amount as net_revenue,
  oi.quantity * p.cost_price as cost,
  oi.quantity * oi.unit_price - oi.quantity * p.cost_price as profit,

  -- Marketing attribution (denormalized)
  m.campaign_name,
  m.campaign_type,
  m.channel,
  m.medium,
  m.source,

  -- Store info
  s.store_id,
  s.store_name,
  s.store_type,
  s.store_city,
  s.store_region

FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_order_items') }} oi ON o.order_id = oi.order_id
JOIN {{ ref('dim_customer') }} c ON o.customer_id = c.customer_id
JOIN {{ ref('dim_product') }} p ON oi.product_id = p.product_id
LEFT JOIN {{ ref('stg_marketing_attribution') }} m ON o.order_id = m.order_id
LEFT JOIN {{ ref('dim_store') }} s ON o.store_id = s.store_id

WHERE o.order_status != 'cancelled'

Usage - Super Simple:

-- Revenue by brand, by month
SELECT
  brand_name,
  order_month,
  SUM(revenue) as total_revenue,
  SUM(profit) as total_profit
FROM sales_mart
WHERE order_year = 2024
GROUP BY 1, 2
ORDER BY 2, 3 DESC;

-- First-time customer revenue
SELECT
  order_month,
  COUNT(DISTINCT CASE WHEN is_first_order THEN customer_id END) as new_customers,
  SUM(CASE WHEN is_first_order THEN revenue ELSE 0 END) as new_customer_revenue
FROM sales_mart
WHERE order_year = 2024
GROUP BY 1;

-- Marketing channel performance
SELECT
  channel,
  COUNT(DISTINCT order_id) as orders,
  SUM(revenue) as revenue,
  SUM(profit) as profit,
  profit / NULLIF(revenue, 0) as profit_margin
FROM sales_mart
WHERE campaign_name IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC;

OBT: Ưu & nhược điểm

✅ Ưu điểm:

1. Extreme Simplicity

1 table per use case
No joins needed
Anyone có thể query

2. Performance

Pre-joined, pre-calculated
Queries < 1 second
Perfect for dashboards

3. Use Case Driven

Build exactly what business needs
No over-engineering
Fast time to value

❌ Nhược điểm:

1. Redundancy

Same data trong multiple OBTs
Higher storage cost
Mitigation: Storage is cheap trong cloud

2. Maintenance

Multiple tables to keep in sync
Updates propagate to nhiều tables
Mitigation: dbt tests, CI/CD

3. Not Normalized

Data integrity relies on upstream models
Can have inconsistencies nếu not careful

Khi nào nên dùng OBT

✅ Lý tưởng khi:

Dùng cloud DW hiện đại (Snowflake, BigQuery)
Tập trung self-service analytics
Làm dashboard/report nhanh
Môi trường startup, cần iterate nhanh
Workflow dựa trên dbt

❌ Tránh khi:

Yêu cầu audit/compliance chặt chẽ (dùng Data Vault)
Storage cost là ưu tiên (hiếm)
Cần chuẩn hoá sâu

Decision Framework: Nên chọn mô hình nào?

START: What's your primary use case?

├─ BI & Dashboards
│  ├─ Team size < 20 people → ⭐ One Big Table (OBT)
│  └─ Team size > 20 people → ⭐ Star Schema
│
├─ Regulatory/Compliance (Finance, Healthcare, Government)
│  └─ → ⭐ Data Vault 2.0
│
├─ Ad-hoc Analytics & Exploration
│  └─ → ⭐ One Big Table (OBT) or Star Schema
│
├─ Enterprise DW (100+ sources, long-term, agile)
│  └─ → ⭐ Data Vault 2.0
│
└─ Storage Cost Critical (unlikely in cloud era)
   └─ → ❓ Snowflake Schema (but reconsider - storage is cheap!)

Quick Reference Table:

Criteria	Star Schema	Snowflake Schema	Data Vault	OBT
Query Performance	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐	⭐⭐⭐⭐⭐
Simplicity	⭐⭐⭐⭐	⭐⭐	⭐	⭐⭐⭐⭐⭐
Storage Efficiency	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐	⭐⭐
Audit Trail	⭐⭐	⭐⭐	⭐⭐⭐⭐⭐	⭐
Agility	⭐⭐⭐	⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐
BI Friendly	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐	⭐⭐⭐⭐⭐
Maintenance	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐	⭐⭐⭐
Learning Curve	⭐⭐⭐⭐	⭐⭐⭐	⭐	⭐⭐⭐⭐⭐

Carptech Recommendation by Company Type:

Startup/SMB (< 100 employees): → One Big Table với dbt

Fast, simple, iterate quickly
Low maintenance
Perfect cho Snowflake/BigQuery

Mid-Market (100-1000 employees): → Star Schema với dbt

Balance giữa simplicity và structure
BI tools work great
Scalable

Enterprise (1000+ employees): → Data Vault 2.0 for core DW + Star Schema marts for consumption

Data Vault: raw vault, audit trail
Star Schema: BI layer on top
Best of both worlds

Regulated Industries (any size): → Data Vault 2.0

Compliance requirements
Complete audit trail
Historization

Common Data Modeling Mistakes

Mistake 1: Over-Normalization

Problem: Normalize everything cho đến khi có 50 tiny tables, every query cần 10+ joins.

Example:

-- Bad: Over-normalized
SELECT
  c.first_name,
  c.last_name,
  ct.city_name,
  s.state_name,
  co.country_name,
  cat1.name as category_l1,
  cat2.name as category_l2,
  cat3.name as category_l3
FROM customers c
JOIN cities ct ON c.city_id = ct.city_id
JOIN states s ON ct.state_id = s.state_id
JOIN countries co ON s.country_id = co.country_id
JOIN products p ON ...
JOIN categories cat3 ON p.category_id = cat3.category_id
JOIN categories cat2 ON cat3.parent_id = cat2.category_id
JOIN categories cat1 ON cat2.parent_id = cat1.category_id
-- 10 joins for simple query!

Solution: Denormalize dimensions. Storage is cheap, analyst time is expensive.

-- Good: Denormalized dimension
SELECT
  full_name,
  city,
  state,
  country,
  category_l1,
  category_l2,
  category_l3
FROM dim_customer c
JOIN dim_product p ON ...
-- 2 joins only

Mistake 2: Under-Normalization (trong Fact Tables)

Problem: Chứa descriptive attributes trong fact table.

Example:

-- Bad: Descriptive attributes trong fact
CREATE TABLE fct_sales (
  order_id INT,
  customer_name VARCHAR,  -- ❌ Dimension attribute!
  customer_email VARCHAR, -- ❌ Dimension attribute!
  product_name VARCHAR,   -- ❌ Dimension attribute!
  product_category VARCHAR, -- ❌ Dimension attribute!
  revenue DECIMAL
);

Solution: Facts chỉ chứa keys + metrics.

-- Good: Facts = keys + metrics only
CREATE TABLE fct_sales (
  order_id INT,
  customer_key INT,  -- ✅ Foreign key to dim_customer
  product_key INT,   -- ✅ Foreign key to dim_product
  date_key INT,      -- ✅ Foreign key to dim_date
  revenue DECIMAL,   -- ✅ Metric
  quantity INT       -- ✅ Metric
);

Mistake 3: Not Handling Slowly Changing Dimensions

Problem: Customer changes address. Update dim_customer. Historical sales reports now show wrong city.

Example:

-- Customer lived in Hanoi, bought product Jan 1
-- Customer moved to HCMC, updated record Feb 1
-- Query sales Jan 1: shows HCMC (wrong!)

Solution: Implement SCD Type 2: keep historical versions.

-- dim_customer with SCD Type 2
CREATE TABLE dim_customer (
  customer_key INT,  -- Surrogate key (changes every version)
  customer_id INT,   -- Business key (stays same)
  full_name VARCHAR,
  city VARCHAR,
  valid_from DATE,
  valid_to DATE,
  is_current BOOLEAN
);

-- Example data:
-- customer_key | customer_id | city | valid_from | valid_to | is_current
-- 1            | 100         | HN   | 2024-01-01 | 2024-02-01 | FALSE
-- 2            | 100         | HCMC | 2024-02-01 | 9999-12-31 | TRUE

dbt implementation:

-- Use dbt snapshots
{% snapshot dim_customer_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='check',
    check_cols=['city', 'state', 'email']
  )
}}

SELECT * FROM {{ ref('stg_customers') }}

{% endsnapshot %}

Mistake 4: Mixing Grain trong Fact Tables

Problem: Fact table có mixed granularity.

Example:

-- Bad: Mixed grain
SELECT * FROM fct_sales;
-- order_id | product_id | revenue
-- 1        | 101        | 100     -- Order line level
-- 1        | NULL       | 500     -- Order level ❌ WRONG!
-- 2        | 201        | 50      -- Order line level

Solution: 1 fact table = 1 grain. Tạo separate fact tables cho different grains.

-- Good: Separate tables
-- fct_order_items (order line grain)
-- fct_orders (order grain)
-- fct_daily_sales (daily grain)

Mistake 5: No Documentation

Problem: Không ai biết dim_customer_v2 khác dim_customer như thế nào.

Solution: Use dbt docs + schema.yml.

# schema.yml
models:
  - name: dim_customer
    description: "Customer dimension with SCD Type 2 for address changes"
    columns:
      - name: customer_key
        description: "Surrogate key, changes for each version"
        tests:
          - unique
          - not_null

      - name: customer_id
        description: "Business key, natural customer ID from source system"
        tests:
          - not_null

      - name: city
        description: "Customer's city. Tracked for historical changes (SCD Type 2)"

      - name: is_current
        description: "TRUE for current version, FALSE for historical"

Kết luận

Data modeling không phải rocket science, nhưng lại là nền tảng của mọi Data Warehouse thành công.

Những điểm cần nhớ

Chọn approach phù hợp:
- OBT: Startup, modern cloud DW, cần tốc độ
- Star Schema: BI-heavy, mid-market, cân bằng
- Data Vault: Enterprise, compliance, audit
- Snowflake Schema: Hiếm khi cần (storage giờ rẻ)
Bắt đầu đơn giản, nâng cấp dần:
- Đừng over-engineer ngay từ ngày đầu
- Có thể đi theo lộ trình OBT → Star Schema → Data Vault
Denormalize dimension, normalize fact:
- Dimension: phẳng, mô tả
- Fact: chỉ key + metric
Xử lý Slowly Changing Dimensions:
- SCD Type 1: overwrite (đơn giản)
- SCD Type 2: lưu lịch sử (phổ biến)
- Dùng dbt snapshots để quản lý dễ dàng
Document đầy đủ:
- schema.yml trong dbt
- Auto-generate docs: dbt docs generate
- Giúp dữ liệu tự phục vụ

Bạn nên làm gì tiếp?

Nếu bạn đang:

✅ Bắt đầu DW mới: Chọn Star Schema hoặc OBT với dbt
✅ Có schema normalized cũ: Cân nhắc denormalize sang Star/OBT
✅ Enterprise có compliance: Đánh giá Data Vault 2.0
✅ Analyst than query chậm: Denormalize và tạo OBT

Action plan gợi ý:

Đánh giá hiện trạng modeling
Xác định use case chính: BI, compliance, ad-hoc?
Chọn approach: Star/OBT/Data Vault
Bắt đầu với một subject area (ví dụ Sales)
Lặp lại, đo hiệu năng query
Mở rộng sang các domain khác

Bạn cần help với Data Modeling?

Carptech đã thiết kế data model cho 20+ doanh nghiệp e-commerce, finance, SaaS, logistics tại Việt Nam và Đông Nam Á.

Chúng tôi có thể giúp:

✅ Đánh giá data model hiện tại, tìm vấn đề
✅ Thiết kế kiến trúc Star Schema / Data Vault
✅ Triển khai bằng dbt
✅ Migration từ mô hình yếu sang mô hình tối ưu
✅ Đào tạo team về best practice data modeling

Typical Results:

10-50x faster queries
80% reduction trong "where is data?" questions
Self-service analytics enabled

Đặt lịch data modeling assessment miễn phí →

Hoặc nếu bạn muốn tìm hiểu thêm về Data Engineering:

Bài viết được viết bởi Carptech Team - chuyên gia về Data Modeling và Data Warehouse Architecture. Nếu có câu hỏi về Star Schema, Data Vault, hoặc dimensional modeling, hãy liên hệ với chúng tôi.

Bước tiếp theo

Làm Data Maturity Assessment → — Đánh giá mức độ trưởng thành dữ liệu trên 6 dimensions
Tính ROI Data Platform → — Ước tính chi phí và lợi ích đầu tư data platform
Đặt lịch tư vấn miễn phí → — 60 phút cùng chuyên gia Carptech

Data Modeling 101: Star Schema vs Snowflake Schema vs Data Vault

Tại sao Data Modeling Matters?

Problem: Data Warehouse Without Good Modeling

3 lợi ích chính của Data Modeling

Star Schema: “Xương sống” của BI

Star Schema Structure

Star Schema: Ưu và nhược điểm

Khi nào nên dùng Star Schema

Ví dụ Star Schema: E-commerce

Snowflake Schema: Normalized Dimensions

Snowflake Schema Structure

Snowflake Schema: Ưu & nhược điểm

Khi nào nên dùng Snowflake Schema

Data Vault 2.0: Enterprise Data Warehouse

Data Vault Structure

Data Vault: Ưu & nhược điểm

Khi nào nên dùng Data Vault

Data Vault Example: Banking

One Big Table (OBT): Modern dbt Approach

OBT Concept

OBT Structure

OBT Example: E-commerce

OBT: Ưu & nhược điểm

Khi nào nên dùng OBT

Decision Framework: Nên chọn mô hình nào?

Common Data Modeling Mistakes

Mistake 1: Over-Normalization

Mistake 2: Under-Normalization (trong Fact Tables)

Mistake 3: Not Handling Slowly Changing Dimensions

Mistake 4: Mixing Grain trong Fact Tables

Mistake 5: No Documentation

Kết luận

Những điểm cần nhớ

Bạn nên làm gì tiếp?

Bạn cần help với Data Modeling?

Bước tiếp theo

Đăng ký nhận bài viết mới

Có câu hỏi về Data Platform?

Bài viết liên quan

Infrastructure as Code cho Data Platform: Terraform + dbt + Git

dbt (data build tool) 101: Hướng dẫn cho người mới bắt đầu

Real-time analytics: từ batch đến phân tích dưới giây với stream processing

Dịch Vụ

Công Ty

Tài Nguyên

Pháp Lý