Quay lại Blog
Data EngineeringCập nhật: 4 tháng 3, 20255 phút đọc

Data Quality: Framework và Tools để Đảm Bảo Dữ Liệu Đáng Tin

Garbage in, garbage out. Data quality là vấn đề #1 của Data Platforms. Hướng dẫn chi tiết về 6 dimensions of data quality, framework triển khai, tools (dbt tests, Great Expectations, Monte Carlo), và cách build data quality scorecard.

Trần Thị Mai Linh

Trần Thị Mai Linh

Head of Data Engineering

Data quality framework showing validation, testing, and monitoring processes ensuring high-quality data
#Data Quality#dbt tests#Great Expectations#Data Observability#Data Validation#Data Testing#Monte Carlo#Data Governance

"Dashboard says revenue $5M. Finance says $4.8M. Sales says $5.2M. Ai đúng?"

Nếu bạn là Data Engineer và nghe câu này, bạn có data quality problem.

"Garbage in, garbage out" là nguyên lý cơ bản của data. Nếu data quality kém:

  • Dashboards sai → Bad decisions → Mất tiền
  • Reports inconsistent → Mất trust → Không ai dùng data
  • ML models unreliable → Predictions sai → Waste effort

Theo Gartner, poor data quality costs organizations $12.9 million per year on average. Harvard Business Review nói: "Only 3% of companies' data meets basic quality standards."

Data quality không phải "nice to have" - đó là foundation của mọi Data Platform.

Trong bài này, bạn sẽ học:

  • 6 Dimensions of Data Quality: Framework để measure data quality
  • Data Quality Framework: Define, Measure, Monitor, Remediate
  • Tools & Implementation: dbt tests, Great Expectations, Monte Carlo
  • Real examples: E-commerce, finance, marketing data quality rules
  • Data Quality Scorecard: Track metrics over time
  • Incident Response: What to do when quality fails
  • Cost of bad data: Case studies với quantified losses

Sau bài này, bạn sẽ có framework cụ thể để implement data quality trong organization của bạn.

6 Dimensions of Data Quality

Data quality không phải yes/no. Có 6 dimensions để evaluate:

1. Accuracy (Độ chính xác)

Definition: Data đúng với reality.

Examples:

  • ✅ Good: Customer email là email thật, có thể send được

  • ❌ Bad: Email = "test@test.com", "abc@xyz", null

  • ✅ Good: Order total = sum of line items

  • ❌ Bad: Order total = $100, nhưng line items sum = $95

How to measure:

-- E-commerce: Order total accuracy
SELECT
  order_id,
  order_total,
  SUM(line_item_price * quantity) as calculated_total,
  ABS(order_total - calculated_total) as difference
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY order_id, order_total
HAVING ABS(difference) > 0.01;  -- More than 1 cent off = inaccurate

dbt test:

# schema.yml
models:
  - name: orders
    columns:
      - name: order_total
        tests:
          - dbt_utils.expression_is_true:
              expression: "ABS(order_total - line_items_sum) < 0.01"

2. Completeness (Độ đầy đủ)

Definition: Không missing critical data.

Examples:

  • ✅ Good: Mọi customer có email (required field)

  • ❌ Bad: 20% customers không có email → Can't send marketing

  • ✅ Good: Mọi order có payment method

  • ❌ Bad: 5% orders missing payment_method → Can't analyze payment trends

How to measure:

-- Completeness rate
SELECT
  COUNT(*) as total_customers,
  COUNT(email) as customers_with_email,
  COUNT(email) * 100.0 / COUNT(*) as email_completeness_pct
FROM customers;

-- Should be > 95%

dbt test:

models:
  - name: customers
    columns:
      - name: email
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_not_be_null:
              row_condition: "customer_status = 'active'"  -- Active customers must have email

3. Consistency (Tính nhất quán)

Definition: Same data across systems match.

Examples:

  • ✅ Good: Customer name trong CRM = name trong Data Warehouse

  • ❌ Bad: CRM says "John Smith", DW says "SMITH, JOHN"

  • ✅ Good: Revenue trong sales table = revenue trong finance table

  • ❌ Bad: Sales table: $5M, Finance table: $4.8M

How to measure:

-- Compare revenue across systems
WITH sales_revenue AS (
  SELECT SUM(revenue) as total_revenue
  FROM sales_transactions
  WHERE transaction_date = '2024-01-15'
),
finance_revenue AS (
  SELECT SUM(amount) as total_revenue
  FROM finance_ledger
  WHERE ledger_date = '2024-01-15'
  AND account_type = 'revenue'
)
SELECT
  s.total_revenue as sales_revenue,
  f.total_revenue as finance_revenue,
  ABS(s.total_revenue - f.total_revenue) as difference,
  ABS(s.total_revenue - f.total_revenue) / s.total_revenue * 100 as difference_pct
FROM sales_revenue s, finance_revenue f;

-- Should be < 1% difference

dbt test:

models:
  - name: daily_revenue_reconciliation
    tests:
      - dbt_utils.expression_is_true:
          expression: "ABS(sales_revenue - finance_revenue) / sales_revenue < 0.01"
          name: revenue_reconciliation_within_1_percent

4. Timeliness (Tính kịp thời)

Definition: Data available khi cần.

Examples:

  • ✅ Good: Yesterday's orders available at 6AM today

  • ❌ Bad: Yesterday's orders chưa available at 9AM (people waiting)

  • ✅ Good: Customer profile updated trong 1 hour sau khi change

  • ❌ Bad: Customer change address, 24 hours sau vẫn show old address

How to measure:

-- Data freshness
SELECT
  MAX(order_date) as latest_order_date,
  CURRENT_TIMESTAMP as now,
  DATEDIFF(hour, MAX(order_date), CURRENT_TIMESTAMP) as hours_delay
FROM orders;

-- Should be < 6 hours for daily batch, < 1 hour for real-time

dbt test:

models:
  - name: orders
    tests:
      - dbt_utils.recency:
          datepart: hour
          field: loaded_at
          interval: 6  -- Should be loaded within 6 hours

5. Validity (Tính hợp lệ)

Definition: Data conform to business rules.

Examples:

  • ✅ Good: Order status trong ['pending', 'completed', 'cancelled', 'refunded']

  • ❌ Bad: Order status = 'DONE', 'ok', 'finished' (không valid values)

  • ✅ Good: Email format: xxx@xxx.xxx

  • ❌ Bad: Email = 'notanemail', '12345'

How to measure:

-- Invalid email formats
SELECT
  COUNT(*) as total_customers,
  COUNT(CASE WHEN email NOT LIKE '%@%.%' THEN 1 END) as invalid_emails,
  COUNT(CASE WHEN email NOT LIKE '%@%.%' THEN 1 END) * 100.0 / COUNT(*) as invalid_pct
FROM customers;

-- Should be 0%

dbt test:

models:
  - name: orders
    columns:
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'processing', 'completed', 'cancelled', 'refunded']

  - name: customers
    columns:
      - name: email
        tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

6. Uniqueness (Tính duy nhất)

Definition: No duplicates where there shouldn't be.

Examples:

  • ✅ Good: Mỗi order_id unique

  • ❌ Bad: Duplicate order_id (order appears twice)

  • ✅ Good: Mỗi customer_email unique

  • ❌ Bad: 2 customers có cùng email

How to measure:

-- Find duplicates
SELECT
  order_id,
  COUNT(*) as count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Should return 0 rows

dbt test:

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

  - name: customers
    columns:
      - name: email
        tests:
          - unique:
              where: "email IS NOT NULL"  -- Only check non-null emails

Có câu hỏi về Data Platform?

Đội ngũ chuyên gia của Carptech sẵn sàng tư vấn miễn phí về giải pháp phù hợp nhất cho doanh nghiệp của bạn. Đặt lịch tư vấn 60 phút qua Microsoft Teams hoặc gửi form liên hệ.

✓ Miễn phí 100% • ✓ Microsoft Teams • ✓ Không cam kết dài hạn