"Dashboard says revenue $5M. Finance says $4.8M. Sales says $5.2M. Ai đúng?"
Nếu bạn là Data Engineer và nghe câu này, bạn có data quality problem.
"Garbage in, garbage out" là nguyên lý cơ bản của data. Nếu data quality kém:
- Dashboards sai → Bad decisions → Mất tiền
- Reports inconsistent → Mất trust → Không ai dùng data
- ML models unreliable → Predictions sai → Waste effort
Theo Gartner, poor data quality costs organizations $12.9 million per year on average. Harvard Business Review nói: "Only 3% of companies' data meets basic quality standards."
Data quality không phải "nice to have" - đó là foundation của mọi Data Platform.
Trong bài này, bạn sẽ học:
- 6 Dimensions of Data Quality: Framework để measure data quality
- Data Quality Framework: Define, Measure, Monitor, Remediate
- Tools & Implementation: dbt tests, Great Expectations, Monte Carlo
- Real examples: E-commerce, finance, marketing data quality rules
- Data Quality Scorecard: Track metrics over time
- Incident Response: What to do when quality fails
- Cost of bad data: Case studies với quantified losses
Sau bài này, bạn sẽ có framework cụ thể để implement data quality trong organization của bạn.
6 Dimensions of Data Quality
Data quality không phải yes/no. Có 6 dimensions để evaluate:
1. Accuracy (Độ chính xác)
Definition: Data đúng với reality.
Examples:
-
✅ Good: Customer email là email thật, có thể send được
-
❌ Bad: Email = "test@test.com", "abc@xyz", null
-
✅ Good: Order total = sum of line items
-
❌ Bad: Order total = $100, nhưng line items sum = $95
How to measure:
-- E-commerce: Order total accuracy
SELECT
order_id,
order_total,
SUM(line_item_price * quantity) as calculated_total,
ABS(order_total - calculated_total) as difference
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY order_id, order_total
HAVING ABS(difference) > 0.01; -- More than 1 cent off = inaccurate
dbt test:
# schema.yml
models:
- name: orders
columns:
- name: order_total
tests:
- dbt_utils.expression_is_true:
expression: "ABS(order_total - line_items_sum) < 0.01"
2. Completeness (Độ đầy đủ)
Definition: Không missing critical data.
Examples:
-
✅ Good: Mọi customer có email (required field)
-
❌ Bad: 20% customers không có email → Can't send marketing
-
✅ Good: Mọi order có payment method
-
❌ Bad: 5% orders missing payment_method → Can't analyze payment trends
How to measure:
-- Completeness rate
SELECT
COUNT(*) as total_customers,
COUNT(email) as customers_with_email,
COUNT(email) * 100.0 / COUNT(*) as email_completeness_pct
FROM customers;
-- Should be > 95%
dbt test:
models:
- name: customers
columns:
- name: email
tests:
- not_null
- dbt_expectations.expect_column_values_to_not_be_null:
row_condition: "customer_status = 'active'" -- Active customers must have email
3. Consistency (Tính nhất quán)
Definition: Same data across systems match.
Examples:
-
✅ Good: Customer name trong CRM = name trong Data Warehouse
-
❌ Bad: CRM says "John Smith", DW says "SMITH, JOHN"
-
✅ Good: Revenue trong sales table = revenue trong finance table
-
❌ Bad: Sales table: $5M, Finance table: $4.8M
How to measure:
-- Compare revenue across systems
WITH sales_revenue AS (
SELECT SUM(revenue) as total_revenue
FROM sales_transactions
WHERE transaction_date = '2024-01-15'
),
finance_revenue AS (
SELECT SUM(amount) as total_revenue
FROM finance_ledger
WHERE ledger_date = '2024-01-15'
AND account_type = 'revenue'
)
SELECT
s.total_revenue as sales_revenue,
f.total_revenue as finance_revenue,
ABS(s.total_revenue - f.total_revenue) as difference,
ABS(s.total_revenue - f.total_revenue) / s.total_revenue * 100 as difference_pct
FROM sales_revenue s, finance_revenue f;
-- Should be < 1% difference
dbt test:
models:
- name: daily_revenue_reconciliation
tests:
- dbt_utils.expression_is_true:
expression: "ABS(sales_revenue - finance_revenue) / sales_revenue < 0.01"
name: revenue_reconciliation_within_1_percent
4. Timeliness (Tính kịp thời)
Definition: Data available khi cần.
Examples:
-
✅ Good: Yesterday's orders available at 6AM today
-
❌ Bad: Yesterday's orders chưa available at 9AM (people waiting)
-
✅ Good: Customer profile updated trong 1 hour sau khi change
-
❌ Bad: Customer change address, 24 hours sau vẫn show old address
How to measure:
-- Data freshness
SELECT
MAX(order_date) as latest_order_date,
CURRENT_TIMESTAMP as now,
DATEDIFF(hour, MAX(order_date), CURRENT_TIMESTAMP) as hours_delay
FROM orders;
-- Should be < 6 hours for daily batch, < 1 hour for real-time
dbt test:
models:
- name: orders
tests:
- dbt_utils.recency:
datepart: hour
field: loaded_at
interval: 6 -- Should be loaded within 6 hours
5. Validity (Tính hợp lệ)
Definition: Data conform to business rules.
Examples:
-
✅ Good: Order status trong ['pending', 'completed', 'cancelled', 'refunded']
-
❌ Bad: Order status = 'DONE', 'ok', 'finished' (không valid values)
-
✅ Good: Email format:
xxx@xxx.xxx -
❌ Bad: Email = 'notanemail', '12345'
How to measure:
-- Invalid email formats
SELECT
COUNT(*) as total_customers,
COUNT(CASE WHEN email NOT LIKE '%@%.%' THEN 1 END) as invalid_emails,
COUNT(CASE WHEN email NOT LIKE '%@%.%' THEN 1 END) * 100.0 / COUNT(*) as invalid_pct
FROM customers;
-- Should be 0%
dbt test:
models:
- name: orders
columns:
- name: order_status
tests:
- accepted_values:
values: ['pending', 'processing', 'completed', 'cancelled', 'refunded']
- name: customers
columns:
- name: email
tests:
- dbt_expectations.expect_column_values_to_match_regex:
regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
6. Uniqueness (Tính duy nhất)
Definition: No duplicates where there shouldn't be.
Examples:
-
✅ Good: Mỗi order_id unique
-
❌ Bad: Duplicate order_id (order appears twice)
-
✅ Good: Mỗi customer_email unique
-
❌ Bad: 2 customers có cùng email
How to measure:
-- Find duplicates
SELECT
order_id,
COUNT(*) as count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Should return 0 rows
dbt test:
models:
- name: orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: customers
columns:
- name: email
tests:
- unique:
where: "email IS NOT NULL" -- Only check non-null emails




