Data Quality: Framework và tools để đảm bảo dữ liệu đáng tin

"Dashboard says revenue $5M. Finance says $4.8M. Sales says $5.2M. Ai đúng?"

Nếu bạn là Data Engineer và nghe câu này, bạn có data quality problem.

"Garbage in, garbage out" là nguyên lý cơ bản của data. Nếu data quality kém:

Dashboards sai → Bad decisions → Mất tiền
Reports inconsistent → Mất trust → Không ai dùng data
ML models unreliable → Predictions sai → Waste effort

Theo Gartner, poor data quality costs organizations $12.9 million per year on average. Harvard Business Review nói: "Only 3% of companies' data meets basic quality standards."

Data quality không phải "nice to have" - đó là foundation của mọi Data Platform, cùng với data governance tạo thành nền tảng quản trị dữ liệu vững chắc.

Trong bài này, bạn sẽ học:

6 Dimensions of Data Quality: Framework để measure data quality
Data Quality Framework: Define, Measure, Monitor, Remediate
Tools & Implementation: dbt tests, Great Expectations, Monte Carlo
Real examples: E-commerce, finance, marketing data quality rules
Data Quality Scorecard: Track metrics over time
Incident Response: What to do when quality fails
Cost of bad data: Case studies với quantified losses

Sau bài này, bạn sẽ có framework cụ thể để implement data quality trong organization của bạn.

6 Dimensions of Data Quality

Data quality không phải yes/no. Có 6 dimensions để evaluate:

1. Accuracy (Độ chính xác)

Definition: Data đúng với reality.

Examples:

✅ Good: Customer email là email thật, có thể send được
❌ Bad: Email = "test@test.com", "abc@xyz", null
✅ Good: Order total = sum of line items
❌ Bad: Order total = $100, nhưng line items sum = $95

How to measure:

-- E-commerce: Order total accuracy
SELECT
  order_id,
  order_total,
  SUM(line_item_price * quantity) as calculated_total,
  ABS(order_total - calculated_total) as difference
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY order_id, order_total
HAVING ABS(difference) > 0.01;  -- More than 1 cent off = inaccurate

dbt test:

# schema.yml
models:
  - name: orders
    columns:
      - name: order_total
        tests:
          - dbt_utils.expression_is_true:
              expression: "ABS(order_total - line_items_sum) < 0.01"

2. Completeness (Độ đầy đủ)

Definition: Không missing critical data.

Examples:

✅ Good: Mọi customer có email (required field)
❌ Bad: 20% customers không có email → Can't send marketing
✅ Good: Mọi order có payment method
❌ Bad: 5% orders missing payment_method → Can't analyze payment trends

How to measure:

-- Completeness rate
SELECT
  COUNT(*) as total_customers,
  COUNT(email) as customers_with_email,
  COUNT(email) * 100.0 / COUNT(*) as email_completeness_pct
FROM customers;

-- Should be > 95%

dbt test:

models:
  - name: customers
    columns:
      - name: email
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_not_be_null:
              row_condition: "customer_status = 'active'"  -- Active customers must have email

3. Consistency (Tính nhất quán)

Definition: Same data across systems match.

Examples:

✅ Good: Customer name trong CRM = name trong Data Warehouse
❌ Bad: CRM says "John Smith", DW says "SMITH, JOHN"
✅ Good: Revenue trong sales table = revenue trong finance table
❌ Bad: Sales table: $5M, Finance table: $4.8M

How to measure:

-- Compare revenue across systems
WITH sales_revenue AS (
  SELECT SUM(revenue) as total_revenue
  FROM sales_transactions
  WHERE transaction_date = '2024-01-15'
),
finance_revenue AS (
  SELECT SUM(amount) as total_revenue
  FROM finance_ledger
  WHERE ledger_date = '2024-01-15'
  AND account_type = 'revenue'
)
SELECT
  s.total_revenue as sales_revenue,
  f.total_revenue as finance_revenue,
  ABS(s.total_revenue - f.total_revenue) as difference,
  ABS(s.total_revenue - f.total_revenue) / s.total_revenue * 100 as difference_pct
FROM sales_revenue s, finance_revenue f;

-- Should be < 1% difference

dbt test:

models:
  - name: daily_revenue_reconciliation
    tests:
      - dbt_utils.expression_is_true:
          expression: "ABS(sales_revenue - finance_revenue) / sales_revenue < 0.01"
          name: revenue_reconciliation_within_1_percent

4. Timeliness (Tính kịp thời)

Definition: Data available khi cần.

Examples:

✅ Good: Yesterday's orders available at 6AM today
❌ Bad: Yesterday's orders chưa available at 9AM (people waiting)
✅ Good: Customer profile updated trong 1 hour sau khi change
❌ Bad: Customer change address, 24 hours sau vẫn show old address

How to measure:

-- Data freshness
SELECT
  MAX(order_date) as latest_order_date,
  CURRENT_TIMESTAMP as now,
  DATEDIFF(hour, MAX(order_date), CURRENT_TIMESTAMP) as hours_delay
FROM orders;

-- Should be < 6 hours for daily batch, < 1 hour for real-time

dbt test:

models:
  - name: orders
    tests:
      - dbt_utils.recency:
          datepart: hour
          field: loaded_at
          interval: 6  -- Should be loaded within 6 hours

5. Validity (Tính hợp lệ)

Definition: Data conform to business rules.

Examples:

✅ Good: Order status trong ['pending', 'completed', 'cancelled', 'refunded']
❌ Bad: Order status = 'DONE', 'ok', 'finished' (không valid values)
✅ Good: Email format: xxx@xxx.xxx
❌ Bad: Email = 'notanemail', '12345'

How to measure:

-- Invalid email formats
SELECT
  COUNT(*) as total_customers,
  COUNT(CASE WHEN email NOT LIKE '%@%.%' THEN 1 END) as invalid_emails,
  COUNT(CASE WHEN email NOT LIKE '%@%.%' THEN 1 END) * 100.0 / COUNT(*) as invalid_pct
FROM customers;

-- Should be 0%

dbt test:

models:
  - name: orders
    columns:
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'processing', 'completed', 'cancelled', 'refunded']

  - name: customers
    columns:
      - name: email
        tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

6. Uniqueness (Tính duy nhất)

Definition: No duplicates where there shouldn't be.

Examples:

✅ Good: Mỗi order_id unique
❌ Bad: Duplicate order_id (order appears twice)
✅ Good: Mỗi customer_email unique
❌ Bad: 2 customers có cùng email

How to measure:

-- Find duplicates
SELECT
  order_id,
  COUNT(*) as count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Should return 0 rows

dbt test:

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

  - name: customers
    columns:
      - name: email
        tests:
          - unique:
              where: "email IS NOT NULL"  -- Only check non-null emails

Bước tiếp theo

Làm Data Maturity Assessment → — Đánh giá mức độ trưởng thành dữ liệu trên 6 dimensions
Tính ROI Data Platform → — Ước tính chi phí và lợi ích đầu tư data platform
Đặt lịch tư vấn miễn phí → — 60 phút cùng chuyên gia Carptech

Data Quality: Framework và tools để đảm bảo dữ liệu đáng tin

6 Dimensions of Data Quality

1. Accuracy (Độ chính xác)

2. Completeness (Độ đầy đủ)

3. Consistency (Tính nhất quán)

4. Timeliness (Tính kịp thời)

5. Validity (Tính hợp lệ)

6. Uniqueness (Tính duy nhất)

Bước tiếp theo

Đăng ký nhận bài viết mới

Có câu hỏi về Data Platform?

Bài viết liên quan

Real-time analytics: từ batch đến phân tích dưới giây với stream processing

Infrastructure as Code cho Data Platform: Terraform + dbt + Git

Real-time Data Pipeline: Kafka, CDC, và Streaming Architecture

Dịch Vụ

Công Ty

Tài Nguyên

Pháp Lý