TL;DR
Data Lineage = Map of data flow showing:
- Source: Where data originates (APIs, databases, files)
- Transformations: How data is processed (SQL, Python, dbt)
- Destination: Where data ends up (dashboards, ML models, reports)
Why Critical:
- Impact Analysis: "If I change this table, what breaks?" → Know before deploying
- Root Cause Analysis: "Dashboard shows wrong numbers" → Trace back to find issue
- Compliance: GDPR/PDPA require knowing where PII flows
- Trust: "How is this metric calculated?" → See full calculation chain
Types:
- Table-level: Dataset → dataset dependencies
- Column-level: Field → field traceability (more granular, harder to maintain)
How Captured:
- Automatic: Parse SQL/dbt → extract dependencies (best)
- Metadata APIs: Query warehouse metadata
- Manual: Developers document (error-prone, don't do this)
Tools:
- dbt: Automatic lineage for dbt models (free, built-in)
- Data Catalogs: Atlan, Alation, DataHub integrate lineage
- Cloud-native: BigQuery, Snowflake have basic lineage
- Standalone: Manta, Octopai (enterprise)
Use Case Example: Dashboard shows revenue down 20%
- Trace lineage: Dashboard → fact_revenue table → stg_orders table
- Find: stg_orders pipeline failed yesterday, data stale
- Fix: Rerun pipeline → dashboard correct
- Time saved: 2 hours debugging → 5 minutes with lineage
Implementation: Start with dbt (if using), expand to catalog for full coverage
This post guides you through a complete lineage implementation, from the basics to advanced column-level tracking.
1. What is Data Lineage?
1.1. Definition
Data Lineage = Visual map showing:
- Upstream: Where data comes from
- Transformations: How data changes
- Downstream: Where data goes
Example:
E-commerce Order Flow:
Shopify API
│
▼
stg_shopify_orders (Raw staging)
│
▼
int_orders_enriched (Add customer info)
│
├─────────────┬─────────────┐
▼ ▼ ▼
fact_orders dim_customers revenue_metrics
│ │ │
├─────────────┴─────────────┤
▼ ▼
Executive Dashboard ML Model (Churn)
Key info per node:
- Table/dataset name
- Schema (columns)
- Update frequency
- Owner
- Quality score
- Row count
Key info per edge (arrows):
- Transformation logic (SQL, Python)
- When it runs
- Dependencies (runs after X completes)
1.2. Table-Level vs Column-Level
Table-Level Lineage:
orders_table → fact_orders_table
- Shows: Dataset depends on dataset
- Granularity: Coarse
- Effort: Low (easy to extract)
Column-Level Lineage:
orders_table.order_id → fact_orders.order_key
orders_table.total_amount → fact_orders.revenue_usd (converted from VND)
customers_table.email → fact_orders.customer_email
- Shows: Field depends on field(s)
- Granularity: Fine
- Effort: High (complex to extract, especially with UNIONs and JOINs)
When to use:
- Table-level: Sufficient for 80% use cases (impact analysis, root cause)
- Column-level: Advanced compliance (GDPR: where does email flow?), sensitive data tracking
Vietnamese market: Most companies start table-level, then add column-level for PII compliance (PDPA).
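To make the distinction concrete, the two granularities can be represented as plain records; a minimal sketch (the dict layout below is hypothetical, not any tool's format):
# Table-level: one edge per dataset dependency
table_level = {
    "target": "fact_orders",
    "sources": ["orders_table"],
}

# Column-level: one edge per field, optionally with the transformation applied
column_level = [
    {"target": "fact_orders.order_key",
     "sources": ["orders_table.order_id"]},
    {"target": "fact_orders.revenue_usd",
     "sources": ["orders_table.total_amount"],
     "transformation": "currency conversion from VND"},
]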
1.3. Lineage vs Data Flow vs Pipeline
Confused terms:
| Term | Definition | Example |
|---|---|---|
| Data Lineage | Static map of dependencies | "Table A depends on B and C" |
| Data Flow | Runtime execution path | "Data flows: API → S3 → BigQuery" |
| Data Pipeline | Orchestrated jobs | "Airflow DAG runs every hour" |
Lineage = blueprint, Flow = water moving, Pipeline = plumbing system
2. Why Data Lineage Matters
2.1. Impact Analysis
Scenario: Need to change customers table schema (add column, rename field)
Question: "What will break?"
Without lineage:
Engineer: "Hmm, let me search codebase for 'customers'..."
*3 hours later*
"Found 47 SQL files referencing it. Not sure if I got them all.
Let's deploy and see what breaks 🤞"
*Next day*
10 dashboards broken
5 ML models failing
CEO dashboard showing nulls
With lineage:
Click 'customers' table → View downstream dependencies
Downstream (Direct):
- int_customer_360 (used by Marketing dashboard)
- fact_orders (JOIN on customer_id)
- churn_prediction_model (feature: customer_age)
Downstream (Indirect):
- Executive Dashboard (via fact_orders)
- Email Campaign Tool (via int_customer_360)
- Revenue Forecast Model (via fact_orders)
Total impact: 3 direct, 8 indirect dependencies
Action: Notify owners, update transformations, test end-to-end
Time saved: 3 hours debugging → 5 minutes impact analysis
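Programmatically, impact analysis is just a graph traversal. A minimal sketch, assuming table-level lineage sits in a plain adjacency dict (the LINEAGE structure and names below are illustrative):
from collections import deque

# Hypothetical table-level lineage: table -> direct downstream consumers
LINEAGE = {
    "customers": ["int_customer_360", "fact_orders", "churn_prediction_model"],
    "fact_orders": ["executive_dashboard", "revenue_forecast_model"],
    "int_customer_360": ["email_campaign_tool"],
}

def downstream_impact(table):
    """Return (direct, indirect) downstream dependencies via BFS."""
    direct = LINEAGE.get(table, [])
    seen, queue = set(direct), deque(direct)
    while queue:
        for dep in LINEAGE.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return direct, sorted(seen - set(direct))

direct, indirect = downstream_impact("customers")
print(f"Direct: {direct}")
print(f"Indirect: {indirect}")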
2.2. Root Cause Analysis
Scenario: Dashboard shows revenue down 20%
Without lineage:
Analyst: "Dashboard wrong? Let me check..."
→ Check dashboard query: Looks correct
→ Check fact_revenue table: Numbers look off
→ Ask Data Engineer: "Did something change?"
→ Engineer: "Not sure, let me check logs..."
→ *2 hours later* Found: ETL pipeline failed yesterday
Total time: 2+ hours
With lineage:
Click dashboard → View lineage → Trace upstream
Dashboard "Revenue Trends"
↑ uses
fact_revenue
↑ built from
stg_shopify_orders ⚠️ (STALE - last updated 36 hours ago)
↑ ingested from
Shopify API
Alert: stg_shopify_orders failed to update (API rate limit)
Root cause found in 2 minutes ✅
Fix: Increase rate limit, rerun pipeline
Time saved: 2 hours → 2 minutes
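The same idea works in the upstream direction: walk the graph and flag anything violating a freshness SLA. A sketch, with UPSTREAM and LAST_UPDATED as hypothetical stand-ins for your metadata store:
from datetime import datetime, timedelta, timezone

# Hypothetical metadata: asset -> upstream assets, plus last successful update
UPSTREAM = {
    "revenue_dashboard": ["fact_revenue"],
    "fact_revenue": ["stg_shopify_orders"],
    "stg_shopify_orders": [],
}
LAST_UPDATED = {
    "fact_revenue": datetime.now(timezone.utc) - timedelta(hours=2),
    "stg_shopify_orders": datetime.now(timezone.utc) - timedelta(hours=36),
}

def find_stale_upstream(asset, max_age=timedelta(hours=24)):
    """Walk upstream from an asset and return anything older than the SLA."""
    stale, stack, seen = [], [asset], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        updated = LAST_UPDATED.get(node)
        if updated and datetime.now(timezone.utc) - updated > max_age:
            stale.append(node)
        stack.extend(UPSTREAM.get(node, []))
    return stale

print(find_stale_upstream("revenue_dashboard"))  # ['stg_shopify_orders']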
2.3. Compliance (GDPR/PDPA)
Requirement: "Show me where customer email is used and shared"
Column-level lineage:
customers.email (Source)
│
├─────────────────┬─────────────────┐
▼ ▼ ▼
fact_orders.email marketing_list external_crm
│ │ │
▼ ▼ ▼
Order Dashboard Mailchimp API Salesforce
(3rd party) (3rd party)
Compliance questions answered:
- ✅ Where is PII stored? (3 tables)
- ✅ Where is it used? (1 dashboard)
- ✅ Who has access? (Check permissions per table)
- ✅ Is it shared externally? (Yes, Mailchimp + Salesforce)
- ✅ How long retained? (Check retention policies)
Use case: User requests "Delete my data" (PDPA Right to be Forgotten)
- Lineage shows all locations with user's email
- Automated deletion script targets all nodes
- Audit log proves complete deletion
2.4. Trust & Transparency
Scenario: Marketing and Finance report different "Monthly Active Users"
Without lineage:
Marketing: "We have 50K MAU"
Finance: "We have 48K MAU"
CEO: "Which is correct??"
*Investigation*
Turns out:
- Marketing uses dim_users table (includes test accounts)
- Finance uses analytics.active_users (excludes test)
Resolution: 3 days of meetings
With lineage:
Click "MAU" metric
Marketing Dashboard MAU:
← dim_users (includes is_test = true/false)
Definition: COUNT(DISTINCT user_id) WHERE last_active_date >= 30 days
Finance Dashboard MAU:
← analytics.active_users (WHERE is_test = false)
Definition: COUNT(DISTINCT user_id) WHERE last_active_date >= 30 days AND is_test = false
Difference: Test accounts (2K)
Resolution: 5 minutes ✅
Decision: Use Finance definition (exclude test)
Trust: Users can see exactly how metrics are calculated → no black box
3. How Lineage is Captured
3.1. Automatic SQL Parsing (Best Method)
Parse SQL queries → extract table dependencies
Example:
-- dbt model: fact_orders.sql
SELECT
o.order_id,
o.order_date,
c.customer_name,
o.total_amount * 23000 AS total_vnd -- USD to VND
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('dim_customers') }} c
ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
Lineage extracted:
fact_orders:
  depends_on:
    - stg_orders (columns: order_id, order_date, customer_id, total_amount)
    - dim_customers (columns: customer_id, customer_name)
  produces:
    - columns: [order_id, order_date, customer_name, total_vnd]
Tools that do this:
- dbt: Automatic (best for dbt users)
- SQLFluff: Open-source SQL parser
- sqllineage: Python library for lineage extraction
- Atlan, Alation: Commercial catalogs with SQL parsing
Example (sqllineage library):
from sqllineage.runner import LineageRunner
sql = """
INSERT INTO analytics.fact_orders
SELECT
o.order_id,
c.customer_name,
o.total_amount
FROM staging.orders o
JOIN staging.customers c ON o.customer_id = c.customer_id
"""
result = LineageRunner(sql)
print("Source tables:", result.source_tables)
# [Table: staging.orders, Table: staging.customers]
print("Target tables:", result.target_tables)
# [Table: analytics.fact_orders]
# Build lineage graph
lineage = {
    'sources': [str(t) for t in result.source_tables],
    'targets': [str(t) for t in result.target_tables]
}
Pros:
- ✅ Accurate
- ✅ Automated (no manual work)
- ✅ Always up-to-date (parses live queries)
Cons:
- ❌ Complex SQL hard to parse (dynamic SQL, CTEs, UNIONs)
- ❌ Requires access to SQL code
3.2. Metadata APIs
Query warehouse metadata to find dependencies
BigQuery Example:
-- Find all views/tables that reference 'customers' table
SELECT
t.table_catalog,
t.table_schema,
t.table_name,
t.table_type,
t.ddl -- CREATE TABLE/VIEW statement
FROM `project.region-us.INFORMATION_SCHEMA.TABLES` t
WHERE t.ddl LIKE '%customers%'
Cloud SQL / PostgreSQL:
-- Find dependencies using pg_depend
SELECT DISTINCT
dependent_ns.nspname AS dependent_schema,
dependent_view.relname AS dependent_view,
source_ns.nspname AS source_schema,
source_table.relname AS source_table
FROM pg_depend
JOIN pg_rewrite ON pg_depend.objid = pg_rewrite.oid
JOIN pg_class AS dependent_view ON pg_rewrite.ev_class = dependent_view.oid
JOIN pg_class AS source_table ON pg_depend.refobjid = source_table.oid
JOIN pg_namespace AS dependent_ns ON dependent_view.relnamespace = dependent_ns.oid
JOIN pg_namespace AS source_ns ON source_table.relnamespace = source_ns.oid
WHERE source_table.relname = 'customers'
AND source_table.relkind = 'r'; -- 'r' = regular table
Pros:
- ✅ Works for native views (warehouse tracks dependencies)
Cons:
- ❌ Only captures database-level dependencies (not external transformations)
- ❌ Limited to views (not table-to-table in ETL)
3.3. dbt Integration (Recommended for dbt Users)
dbt automatically generates lineage from ref() and source() functions
dbt project:
# models/schema.yml
sources:
  - name: shopify
    tables:
      - name: orders
      - name: customers

models:
  - name: stg_orders
    description: "Staging orders from Shopify"
  - name: fact_orders
    description: "Order facts for analytics"
-- models/staging/stg_orders.sql
SELECT * FROM {{ source('shopify', 'orders') }}
-- models/marts/fact_orders.sql
SELECT
o.order_id,
c.customer_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c ON o.customer_id = c.customer_id
dbt generates:
// target/manifest.json
{
  "nodes": {
    "model.my_project.fact_orders": {
      "depends_on": {
        "nodes": [
          "model.my_project.stg_orders",
          "model.my_project.dim_customers"
        ]
      }
    },
    "model.my_project.stg_orders": {
      "depends_on": {
        "nodes": ["source.my_project.shopify.orders"]
      }
    }
  }
}
Visualize:
# Generate docs with lineage
dbt docs generate
# Serve docs
dbt docs serve
# Open http://localhost:8080
# Click any model → See lineage graph
Lineage graph (interactive):
shopify.orders (Source)
│
▼
stg_orders (Staging)
│
▼
fact_orders (Mart)
│
▼
[Downstream models...]
Pros:
- ✅ 100% accurate (dbt knows exact dependencies)
- ✅ Zero manual work
- ✅ Free
- ✅ Always up-to-date
Cons:
- ❌ Only captures dbt models (not Python, Airflow, dashboards)
3.4. Manual Documentation (❌ Don't Do This)
Example:
# lineage.yml (manually maintained)
fact_orders:
  upstream:
    - stg_orders
    - dim_customers
  downstream:
    - executive_dashboard
    - revenue_model
Problems:
- ❌ Quickly outdated (developers forget to update)
- ❌ Error-prone (typos, missing dependencies)
- ❌ Not scalable
Rule: Automate lineage, never manual
4. Lineage Visualization: DAGs
4.1. DAG = Directed Acyclic Graph
Properties:
- Directed: Arrows show flow direction (A → B)
- Acyclic: No loops (can't have A → B → A)
Example DAG:
orders_api
│
▼
stg_orders ──────┐
│ │
▼ │
int_orders_clean │
│ │
├───────────┤
▼ ▼
fact_orders revenue_metrics
│ │
└─────┬─────┘
▼
executive_dashboard
Node (table) details:
┌─────────────────────────────────┐
│ fact_orders │
├─────────────────────────────────┤
│ Type: Table │
│ Rows: 1.2M │
│ Updated: 2 hours ago │
│ Quality: 98% ✅ │
│ Owner: @data-team │
│ │
│ Upstream (2): │
│ - stg_orders │
│ - int_orders_clean │
│ │
│ Downstream (1): │
│ - executive_dashboard │
└─────────────────────────────────┘
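Lineage DAGs map naturally onto a graph library. A sketch of the example DAG above using networkx (one reading of the diagram; edges point from upstream to downstream):
import networkx as nx

G = nx.DiGraph([
    ("orders_api", "stg_orders"),
    ("stg_orders", "int_orders_clean"),
    ("stg_orders", "revenue_metrics"),
    ("int_orders_clean", "fact_orders"),
    ("int_orders_clean", "revenue_metrics"),
    ("fact_orders", "executive_dashboard"),
    ("revenue_metrics", "executive_dashboard"),
])

assert nx.is_directed_acyclic_graph(G)  # lineage must never contain cycles

print(nx.descendants(G, "stg_orders"))         # everything downstream
print(nx.ancestors(G, "executive_dashboard"))  # everything upstream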
4.2. Interactive Features
Good lineage tools allow:
1. Click to expand:
fact_orders
│
▼
[Click "Show downstream"] → Expands to show all 15 dependencies
2. Filter by distance:
Show:
[ ] 1 level upstream/downstream
[x] 2 levels
[ ] All levels (can be messy)
3. Highlight paths:
Click "customers" table → Highlight path to "Executive Dashboard"
Shows: customers → dim_customers → fact_orders → dashboard
4. Impact analysis:
Right-click table → "What breaks if I change this?"
Highlights all downstream in red/orange
5. Column-level drill-down:
Click table → View columns
Click column → See which downstream columns depend on it
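Features 2 and 4 fall out of the same graph structure. Continuing from the networkx sketch above (descendants_at_distance requires networkx ≥ 2.4):
import networkx as nx

# Filter by distance: show only assets exactly 1 or 2 hops downstream
level_1 = nx.descendants_at_distance(G, "stg_orders", 1)
level_2 = nx.descendants_at_distance(G, "stg_orders", 2)

# Impact analysis: everything that could break if stg_orders changes
blast_radius = nx.descendants(G, "stg_orders")
print(level_1, level_2, blast_radius)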
4.3. Lineage in dbt Docs
Example (dbt docs UI):
Top nav: [Project] [Sources] [Exposures]
Left sidebar:
📁 staging
📄 stg_orders
📄 stg_customers
📁 intermediate
📄 int_orders_enriched
📁 marts
📄 fact_orders
📄 dim_customers
Center: Lineage Graph
[Zoom In] [Zoom Out] [Center] [Download]
Visual DAG showing all models connected
Right sidebar (when model clicked):
Model: fact_orders
Description: "Order facts for analytics"
Columns (5):
- order_id
- order_date
- customer_name
- total_vnd
- profit_margin
Depends On (2):
- stg_orders
- dim_customers
Used By (3):
- revenue_dashboard
- churn_model
- cohort_analysis
5. Use Cases: Real-World Applications
5.1. Schema Change Impact
Scenario: Rename column customer_id → customer_key in customers table
Steps with lineage:
- Identify impact:
customers.customer_id is used by:
- dim_customers (JOIN key)
- fact_orders (FK reference)
- marketing_campaigns (target list)
- ml_features.customer_cohorts (feature)
- Notify owners:
# Auto-generate notifications
affected_assets = lineage.get_downstream('customers.customer_id')
for asset in affected_assets:
    notify(
        owner=asset.owner,
        message="⚠️ Breaking change: customers.customer_id is being renamed to customer_key",
        impact=asset.name,
        action_required="Update JOIN/WHERE clauses before 2025-07-15"
    )
- Update in parallel:
- Data Engineer: Update all SQL queries
- Analytics Engineer: Update dbt models
- Data Scientist: Update feature pipelines
- Deploy safely:
- Test all downstream assets
- Deploy during low-traffic window
- Monitor for errors
Without lineage: 50% chance of missing something → production breakage
5.2. Data Quality Investigation
Scenario: Dashboard shows spike in revenue (too good to be true)
Investigation with lineage:
Click dashboard → Lineage → Trace upstream
revenue_dashboard
↑ uses
fact_revenue (updated 1 hour ago)
↑ built from
stg_shopify_orders (updated 1 hour ago)
↑ ingested from
Shopify API
Check quality scores:
- stg_shopify_orders: 65% ⚠️ (usually 98%)
Failed checks:
❌ order_total > 0 (5% violations - negative values!)
❌ unique order_id (2% duplicates!)
Root cause: Shopify API bug returned duplicate orders with negative refunds
Lineage helped:
- Quickly trace to source of bad data
- Identify quality issue
- Fix at source (contact Shopify)
Time: 10 minutes (vs hours of manual investigation)
5.3. GDPR/PDPA Data Deletion
Scenario: User requests deletion (Right to be Forgotten)
Column-level lineage query:
-- Find all tables/columns derived from the user email field
SELECT DISTINCT table_name, column_name
FROM lineage_metadata
WHERE source_column = 'users.email';
Results:
┌─────────────────┬─────────────────┐
│ Table           │ Column          │
├─────────────────┼─────────────────┤
│ users           │ email           │
│ orders          │ customer_email  │
│ marketing_list  │ email           │
│ support_tickets │ requester_email │
│ audit_log       │ user_email      │
└─────────────────┴─────────────────┘
Automated deletion:
from datetime import datetime, timezone

def delete_user_data(user_id):
    # Get all tables with user data (from lineage)
    affected_tables = lineage.get_tables_with_column('users.email')
    for table in affected_tables:
        # Delete from each table (table names come from trusted lineage metadata)
        db.execute(f"""
            DELETE FROM {table.name}
            WHERE user_id = %s
        """, [user_id])
        # Log deletion (compliance audit)
        audit_log.record(
            action='gdpr_deletion',
            user_id=user_id,
            table=table.name,
            timestamp=datetime.now(timezone.utc)
        )
    return {
        'deleted_from': [t.name for t in affected_tables],
        'timestamp': datetime.now(timezone.utc)
    }
Compliance proof: Lineage shows all locations checked ✅
5.4. Debugging ML Model
Scenario: Churn prediction model accuracy dropped from 85% → 70%
Investigation:
churn_model
↑ trained on
ml_features.customer_features
↑ built from
├── dim_customers
├── fact_orders
└── web_events ⚠️ (STALE - 7 days old)
Root cause: web_events pipeline broken → missing features → model degraded
Fix: Repair web_events pipeline, retrain model
Lineage prevented: Weeks of debugging why model suddenly worse
6. Implementation Guide
6.1. Step 1: Start with dbt (If Applicable)
If you use dbt → Free lineage out of the box
# Already have dbt project?
dbt docs generate
dbt docs serve
# Open http://localhost:8080 → Click any model → See lineage
Export lineage to catalog:
# Extract lineage from dbt manifest.json
import json

with open('target/manifest.json') as f:
    manifest = json.load(f)

# Build lineage graph
lineage_graph = {}
for node_id, node in manifest['nodes'].items():
    if node['resource_type'] == 'model':
        lineage_graph[node['name']] = {
            'upstream': [
                manifest['nodes'][dep]['name']
                for dep in node['depends_on']['nodes']
                if dep in manifest['nodes']
            ],
            'downstream': []  # Computed by reversing the graph (see below)
        }

# Upload to data catalog
catalog.ingest_lineage(lineage_graph)
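The downstream lists left empty above can be filled in by reversing the upstream edges:
# Invert upstream edges to populate each model's downstream list
for model, info in lineage_graph.items():
    for parent in info['upstream']:
        if parent in lineage_graph:
            lineage_graph[parent]['downstream'].append(model)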
6.2. Step 2: Add Query Log Parsing
For non-dbt transformations (Python, Airflow, raw SQL)
BigQuery: Query audit logs
-- Find all queries that reference the 'customers' table
SELECT
  user_email,
  query,
  referenced_tables,
  destination_table,
  creation_time
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE state = 'DONE'
  AND DATE(creation_time) >= CURRENT_DATE() - 7
  -- referenced_tables is an ARRAY<STRUCT>, so match on its fields
  AND EXISTS (
    SELECT 1
    FROM UNNEST(referenced_tables) rt
    WHERE rt.dataset_id = 'dataset'
      AND rt.table_id = 'customers'
  )
ORDER BY creation_time DESC
LIMIT 100;
Parse queries to extract lineage:
from google.cloud import bigquery
from sqllineage.runner import LineageRunner

bigquery_client = bigquery.Client()

def extract_lineage_from_logs():
    # Get recent queries from the audit log.
    # destination_table is a STRUCT, so flatten it to a qualified name.
    queries = bigquery_client.query("""
        SELECT DISTINCT
          query,
          FORMAT('%s.%s.%s',
                 destination_table.project_id,
                 destination_table.dataset_id,
                 destination_table.table_id) AS destination_table
        FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
        WHERE DATE(creation_time) >= CURRENT_DATE() - 1
          AND destination_table IS NOT NULL
    """).to_dataframe()

    lineage_edges = []
    for _, row in queries.iterrows():
        result = LineageRunner(row['query'])
        for source in result.source_tables:
            lineage_edges.append({
                'source': str(source),
                'target': row['destination_table'],
                'transformation': row['query'][:500]  # First 500 chars
            })
    return lineage_edges
# Run daily to update lineage
lineage = extract_lineage_from_logs()
catalog.update_lineage(lineage)
6.3. Step 3: Integrate Dashboards
Track dashboard → table dependencies
Looker Example:
# Use the Looker API to get dashboard dependencies
from looker_sdk import init40

sdk = init40()

def get_dashboard_lineage(dashboard_id):
    dashboard = sdk.dashboard(dashboard_id)
    # Each tile (element) with a query points at one model/explore pair
    dependencies = []
    for element in dashboard.dashboard_elements:
        if element.query:
            query = sdk.query(element.query.id)
            dependencies.append({
                'dashboard': dashboard.title,
                'dashboard_id': dashboard_id,
                'explore': f"{query.model}.{query.view}"
            })
    return dependencies

# Index all dashboards
for dashboard in sdk.all_dashboards():
    lineage = get_dashboard_lineage(dashboard.id)
    catalog.add_dashboard_lineage(lineage)
6.4. Step 4: Column-Level Lineage (Advanced)
More complex: Parse SQL to track column transformations
Example:
SELECT
o.order_id,
o.order_date,
c.email AS customer_email, -- customers.email → fact_orders.customer_email
o.total * 23000 AS total_vnd -- orders.total → fact_orders.total_vnd (with transformation)
FROM orders o
JOIN customers c ON o.customer_id = c.id
Column lineage:
fact_orders.order_id:
  source: orders.order_id
  transformation: direct
fact_orders.order_date:
  source: orders.order_date
  transformation: direct
fact_orders.customer_email:
  source: customers.email
  transformation: rename
fact_orders.total_vnd:
  source: orders.total
  transformation: "multiply by 23000"
Tools:
- Alation, Atlan: Built-in column lineage
- Manta: Specialized lineage tool
- Custom: Parse SQL AST (Abstract Syntax Tree)
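The sqllineage library from section 3.1 also offers column-level extraction; a minimal sketch, assuming sqllineage ≥ 1.4 (which accepts a dialect argument):
from sqllineage.runner import LineageRunner

sql = """
INSERT INTO fact_orders
SELECT
    o.order_id,
    c.email AS customer_email,
    o.total * 23000 AS total_vnd
FROM orders o
JOIN customers c ON o.customer_id = c.id
"""

runner = LineageRunner(sql, dialect="ansi")
for path in runner.get_column_lineage():
    # Each path is a tuple of Column objects, source first, target last
    print(" -> ".join(str(col) for col in path))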
7. Best Practices
7.1. Automate Everything
❌ Don't:
- Manually document lineage in spreadsheets
- Ask developers to update lineage diagrams
✅ Do:
- Parse code automatically (dbt, SQL logs)
- Use lineage tools (catalog, dbt docs)
- Regenerate lineage daily/hourly
7.2. Keep Lineage Fresh
Stale lineage is worse than no lineage: it gives false confidence.
Implementation:
import time
from datetime import datetime, timezone

import schedule

# Daily refresh job
def refresh_lineage():
    # Extract from dbt
    dbt_lineage = extract_dbt_lineage()
    # Extract from query logs
    query_lineage = extract_query_lineage()
    # Merge
    full_lineage = merge_lineage(dbt_lineage, query_lineage)
    # Upload to catalog, stamped with the refresh time
    catalog.replace_lineage(full_lineage, timestamp=datetime.now(timezone.utc))
    print(f"✅ Lineage refreshed: {len(full_lineage)} edges")

# Run daily at 3 AM (schedule needs a loop to fire pending jobs)
schedule.every().day.at("03:00").do(refresh_lineage)
while True:
    schedule.run_pending()
    time.sleep(60)
7.3. Start Table-Level, Add Column-Level Later
Phased approach:
Phase 1 (Month 1-2): Table-level lineage
- Covers 80% of use cases
- Easier to implement
- Faster time-to-value
Phase 2 (Month 3-6): Column-level for PII
- Track sensitive columns (email, phone, etc.)
- Compliance focus (GDPR/PDPA)
Phase 3 (Month 6+): Full column-level
- All columns tracked
- Advanced impact analysis
7.4. Integrate with Data Catalog
Lineage + Catalog = Powerful combo
User searches: "customer revenue"
Result:
┌─────────────────────────────────────┐
│ fact_customer_revenue │
│ Quality: 98% | Updated: 2h ago │
│ │
│ [Schema] [Lineage] [Quality] [...]│
│ │
│ Click [Lineage] → │
│ Upstream (3): │
│ - stg_orders │
│ - dim_customers │
│ - exchange_rates │
│ │
│ Downstream (5): │
│ - revenue_dashboard ← You! │
│ - cohort_analysis │
│ - forecast_model │
│ ... │
└─────────────────────────────────────┘
Benefits:
- One interface for discovery + lineage
- Searchable lineage (find all downstream of X)
- Quality + lineage combined (see if upstream has quality issues)
8. Tools Comparison
8.1. Built-in Tools
dbt docs (Free)
- ✅ Automatic lineage for dbt models
- ✅ Beautiful interactive graph
- ✅ Zero cost
- ❌ dbt-only (no Python, dashboards)
- ❌ No column-level
BigQuery (Built-in)
- ✅ Basic table lineage for views
- ✅ Free
- ❌ Limited to views
- ❌ No cross-project lineage
Snowflake (Built-in)
- ✅ Object dependencies
- ✅ Access history (queries)
- ❌ Basic visualization
- ❌ No column-level
8.2. Data Catalogs
Atlan ($20K-$100K/year)
- ✅ Lineage integrated with catalog
- ✅ Auto-parsing dbt, SQL, Python
- ✅ Column-level lineage
- ✅ Impact analysis UI
Alation ($50K-$200K/year)
- ✅ Advanced lineage engine
- ✅ Column-level
- ✅ Cross-platform (Snowflake, BigQuery, etc.)
DataHub (Open-source, Free)
- ✅ Lineage via metadata events
- ✅ Extensible
- ❌ Requires setup effort
8.3. Specialized Lineage Tools
Manta (Enterprise, $$$)
- ✅ Enterprise-grade column lineage
- ✅ Supports 80+ data sources
- ✅ Complex SQL parsing (even procedural code)
- ❌ Very expensive
- ❌ Overkill for most companies
Octopai (Enterprise, $$$)
- ✅ BI-focused lineage (Tableau, Power BI)
- ✅ Cross-platform
- ❌ Expensive
8.4. Recommendation
| Company Size | Recommendation |
|---|---|
| < 50 employees | dbt docs (if dbt) or basic catalog |
| 50-200 | Atlan or DataHub (open-source) |
| 200+ | Alation or Manta (if heavy compliance) |
Conclusion
Data Lineage is not optional for mature data teams: it's the foundation for reliability, compliance, and trust.
Key Takeaways:
- Automate lineage extraction: Parse code (dbt, SQL), never manual
- Start table-level: 80% of the value for 20% of the effort
- Integrate with catalog: Lineage + discovery = powerful
- Use for impact analysis: Know what breaks before deploying
- Track PII flows: Critical for PDPA/GDPR compliance
- Keep fresh: Stale lineage = false confidence
- dbt is best starting point: Free, automatic, accurate
Implementation timeline:
- Week 1-2: Setup dbt docs (if applicable) or choose catalog
- Week 3-4: Extract lineage from SQL logs
- Month 2: Add dashboard lineage
- Month 3+: Column-level for PII (if needed)
ROI:
- Time savings: 2 hours debugging → 5 minutes with lineage
- Risk reduction: Prevent breaking production with impact analysis
- Compliance: GDPR/PDPA audit trails
- Trust: Transparency into data transformations
Next Steps:
- ✅ If using dbt: Run dbt docs generate right now
- ✅ Read Data Catalog to integrate lineage
- ✅ Read Data Governance for the broader framework
- ✅ Assess your team's lineage maturity
- ✅ Plan lineage implementation (start small!)
Need help? Carptech implements lineage solutions (dbt, Atlan, DataHub) and provides training. Book consultation to discuss your data traceability needs.
Related Posts:
- Data Catalog: Democratizing Data Discovery - Lineage in catalog context
- Data Governance 101: A Framework for Enterprises - Governance foundation
- dbt Best Practices: Transform Data Like a Pro - dbt lineage deep-dive
- PDPA Compliance: Protecting Personal Data - Compliance use case