TL;DR
Data Lineage = Map of data flow showing:
- Source: Where data originates (APIs, databases, files)
- Transformations: How data is processed (SQL, Python, dbt)
- Destination: Where data ends up (dashboards, ML models, reports)
Why Critical:
- Impact Analysis: "If I change this table, what breaks?" → Know before deploying
- Root Cause Analysis: "Dashboard shows wrong numbers" → Trace back to find issue
- Compliance: GDPR/PDPA require knowing where PII flows
- Trust: "How is this metric calculated?" → See full calculation chain
Types:
- Table-level: Dataset → dataset dependencies
- Column-level: Field → field traceability (more granular, harder to maintain)
How Captured:
- Automatic: Parse SQL/dbt → extract dependencies (best)
- Metadata APIs: Query warehouse metadata
- Manual: Developers document (error-prone, don't do this)
Tools:
- dbt: Automatic lineage for dbt models (free, built-in)
- Data Catalogs: Atlan, Alation, DataHub integrate lineage
- Cloud-native: BigQuery, Snowflake have basic lineage
- Standalone: Manta, Octopai (enterprise)
Use Case Example: Dashboard shows revenue down 20%
- Trace lineage: Dashboard → fact_revenue table → stg_orders table
- Find: stg_orders pipeline failed yesterday, data stale
- Fix: Rerun pipeline → dashboard correct
- Time saved: 2 hours debugging → 5 minutes with lineage
Implementation: Start with dbt (if using), expand to catalog for full coverage
This post guides you through a complete lineage implementation, from the basics to advanced column-level tracking.
1. What is Data Lineage?
1.1. Definition
Data Lineage = Visual map showing:
- Upstream: Where data comes from
- Transformations: How data changes
- Downstream: Where data goes
Example:
E-commerce Order Flow:
Shopify API
│
▼
stg_shopify_orders (Raw staging)
│
▼
int_orders_enriched (Add customer info)
│
├─────────────┬─────────────┐
▼ ▼ ▼
fact_orders dim_customers revenue_metrics
│ │ │
├─────────────┴─────────────┤
▼ ▼
Executive Dashboard ML Model (Churn)
Key info per node:
- Table/dataset name
- Schema (columns)
- Update frequency
- Owner
- Quality score
- Row count
Key info per edge (arrows):
- Transformation logic (SQL, Python)
- When it runs
- Dependencies (runs after X completes)
1.2. Table-Level vs Column-Level
Table-Level Lineage:
orders_table → fact_orders_table
- Shows: Dataset depends on dataset
- Granularity: Coarse
- Effort: Low (easy to extract)
Column-Level Lineage:
orders_table.order_id → fact_orders.order_key
orders_table.total_amount → fact_orders.revenue_usd (converted from VND)
customers_table.email → fact_orders.customer_email
- Shows: Field depends on field(s)
- Granularity: Fine
- Effort: High (complex to extract, especially with UNIONs and JOINs)
When to use:
- Table-level: Sufficient for 80% use cases (impact analysis, root cause)
- Column-level: Advanced compliance (GDPR: where does email flow?), sensitive data tracking
Vietnamese market: Most companies start table-level, then add column-level for PII compliance (PDPA).
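To make the distinction concrete, the two granularities can be represented as plain records; a minimal sketch (the dict layout below is hypothetical, not any tool's format):
# Table-level: one edge per dataset dependency
table_level = {
    "target": "fact_orders",
    "sources": ["orders_table"],
}

# Column-level: one edge per field, optionally with the transformation applied
column_level = [
    {"target": "fact_orders.order_key",
     "sources": ["orders_table.order_id"]},
    {"target": "fact_orders.revenue_usd",
     "sources": ["orders_table.total_amount"],
     "transformation": "currency conversion from VND"},
]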
1.3. Lineage vs Data Flow vs Pipeline
Confused terms:
| Term | Definition | Example |
|---|---|---|
| Data Lineage | Static map of dependencies | "Table A depends on B and C" |
| Data Flow | Runtime execution path | "Data flows: API → S3 → BigQuery" |
| Data Pipeline | Orchestrated jobs | "Airflow DAG runs every hour" |
Lineage = blueprint, Flow = water moving, Pipeline = plumbing system
2. Why Data Lineage Matters
2.1. Impact Analysis
Scenario: Need to change customers table schema (add column, rename field)
Question: "What will break?"
Without lineage:
Engineer: "Hmm, let me search codebase for 'customers'..."
*3 hours later*
"Found 47 SQL files referencing it. Not sure if I got them all.
Let's deploy and see what breaks 🤞"
*Next day*
10 dashboards broken
5 ML models failing
CEO dashboard showing nulls
With lineage:
Click 'customers' table → View downstream dependencies
Downstream (Direct):
- int_customer_360 (used by Marketing dashboard)
- fact_orders (JOIN on customer_id)
- churn_prediction_model (feature: customer_age)
Downstream (Indirect):
- Executive Dashboard (via fact_orders)
- Email Campaign Tool (via int_customer_360)
- Revenue Forecast Model (via fact_orders)
Total impact: 3 direct, 8 indirect dependencies
Action: Notify owners, update transformations, test end-to-end
Time saved: 3 hours debugging → 5 minutes impact analysis
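Programmatically, impact analysis is just a graph traversal. A minimal sketch, assuming table-level lineage sits in a plain adjacency dict (the LINEAGE structure and names below are illustrative):
from collections import deque

# Hypothetical table-level lineage: table -> direct downstream consumers
LINEAGE = {
    "customers": ["int_customer_360", "fact_orders", "churn_prediction_model"],
    "fact_orders": ["executive_dashboard", "revenue_forecast_model"],
    "int_customer_360": ["email_campaign_tool"],
}

def downstream_impact(table):
    """Return (direct, indirect) downstream dependencies via BFS."""
    direct = LINEAGE.get(table, [])
    seen, queue = set(direct), deque(direct)
    while queue:
        for dep in LINEAGE.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return direct, sorted(seen - set(direct))

direct, indirect = downstream_impact("customers")
print(f"Direct: {direct}")
print(f"Indirect: {indirect}")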
2.2. Root Cause Analysis
Scenario: Dashboard shows revenue down 20%
Without lineage:
Analyst: "Dashboard wrong? Let me check..."
→ Check dashboard query: Looks correct
→ Check fact_revenue table: Numbers look off
→ Ask Data Engineer: "Did something change?"
→ Engineer: "Not sure, let me check logs..."
→ *2 hours later* Found: ETL pipeline failed yesterday
Total time: 2+ hours
With lineage:
Click dashboard → View lineage → Trace upstream
Dashboard "Revenue Trends"
↑ uses
fact_revenue
↑ built from
stg_shopify_orders ⚠️ (STALE - last updated 36 hours ago)
↑ ingested from
Shopify API
Alert: stg_shopify_orders failed to update (API rate limit)
Root cause found in 2 minutes ✅
Fix: Increase rate limit, rerun pipeline
Time saved: 2 hours → 2 minutes
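The same idea works in the upstream direction: walk the graph and flag anything violating a freshness SLA. A sketch, with UPSTREAM and LAST_UPDATED as hypothetical stand-ins for your metadata store:
from datetime import datetime, timedelta, timezone

# Hypothetical metadata: asset -> upstream assets, plus last successful update
UPSTREAM = {
    "revenue_dashboard": ["fact_revenue"],
    "fact_revenue": ["stg_shopify_orders"],
    "stg_shopify_orders": [],
}
LAST_UPDATED = {
    "fact_revenue": datetime.now(timezone.utc) - timedelta(hours=2),
    "stg_shopify_orders": datetime.now(timezone.utc) - timedelta(hours=36),
}

def find_stale_upstream(asset, max_age=timedelta(hours=24)):
    """Walk upstream from an asset and return anything older than the SLA."""
    stale, stack, seen = [], [asset], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        updated = LAST_UPDATED.get(node)
        if updated and datetime.now(timezone.utc) - updated > max_age:
            stale.append(node)
        stack.extend(UPSTREAM.get(node, []))
    return stale

print(find_stale_upstream("revenue_dashboard"))  # ['stg_shopify_orders']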
2.3. Compliance (GDPR/PDPA)
Requirement: "Show me where customer email is used and shared"
Column-level lineage:
customers.email (Source)
│
├─────────────────┬─────────────────┐
▼ ▼ ▼
fact_orders.email marketing_list external_crm
│ │ │
▼ ▼ ▼
Order Dashboard Mailchimp API Salesforce
(3rd party) (3rd party)
Compliance questions answered:
- ✅ Where is PII stored? (3 tables)
- ✅ Where is it used? (1 dashboard)
- ✅ Who has access? (Check permissions per table)
- ✅ Is it shared externally? (Yes, Mailchimp + Salesforce)
- ✅ How long retained? (Check retention policies)
Use case: User requests "Delete my data" (PDPA Right to be Forgotten)
- Lineage shows all locations with user's email
- Automated deletion script targets all nodes
- Audit log proves complete deletion
2.4. Trust & Transparency
Scenario: Marketing and Finance report different "Monthly Active Users"
Without lineage:
Marketing: "We have 50K MAU"
Finance: "We have 48K MAU"
CEO: "Which is correct??"
*Investigation*
Turns out:
- Marketing uses dim_users table (includes test accounts)
- Finance uses analytics.active_users (excludes test)
Resolution: 3 days of meetings
With lineage:
Click "MAU" metric
Marketing Dashboard MAU:
← dim_users (includes is_test = true/false)
Definition: COUNT(DISTINCT user_id) WHERE last_active_date >= 30 days
Finance Dashboard MAU:
← analytics.active_users (WHERE is_test = false)
Definition: COUNT(DISTINCT user_id) WHERE last_active_date >= 30 days AND is_test = false
Difference: Test accounts (2K)
Resolution: 5 minutes ✅
Decision: Use Finance definition (exclude test)
Trust: Users can see exactly how metrics are calculated → no black box
3. How Lineage is Captured
3.1. Automatic SQL Parsing (Best Method)
Parse SQL queries → extract table dependencies
Example:
-- dbt model: fact_orders.sql
SELECT
o.order_id,
o.order_date,
c.customer_name,
o.total_amount * 23000 AS total_vnd -- USD to VND
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('dim_customers') }} c
ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
Lineage extracted:
fact_orders:
  depends_on:
    - stg_orders (columns: order_id, order_date, customer_id, total_amount)
    - dim_customers (columns: customer_id, customer_name)
  produces:
    - columns: [order_id, order_date, customer_name, total_vnd]
Tools that do this:
- dbt: Automatic (best for dbt users)
- SQLFluff: Open-source SQL parser
- sqllineage: Python library for lineage extraction
- Atlan, Alation: Commercial catalogs with SQL parsing
Example (sqllineage library):
from sqllineage.runner import LineageRunner
sql = """
INSERT INTO analytics.fact_orders
SELECT
o.order_id,
c.customer_name,
o.total_amount
FROM staging.orders o
JOIN staging.customers c ON o.customer_id = c.customer_id
"""
result = LineageRunner(sql)
print("Source tables:", result.source_tables)
# [Table: staging.orders, Table: staging.customers]
print("Target tables:", result.target_tables)
# [Table: analytics.fact_orders]
# Build lineage graph
lineage = {
    'sources': [str(t) for t in result.source_tables],
    'targets': [str(t) for t in result.target_tables]
}
Pros:
- ✅ Accurate
- ✅ Automated (no manual work)
- ✅ Always up-to-date (parses live queries)
Cons:
- ❌ Complex SQL hard to parse (dynamic SQL, CTEs, UNIONs)
- ❌ Requires access to SQL code
3.2. Metadata APIs
Query warehouse metadata to find dependencies
BigQuery Example:
-- Find all views/tables that reference 'customers' table
SELECT
t.table_catalog,
t.table_schema,
t.table_name,
t.table_type,
t.ddl -- CREATE TABLE/VIEW statement
FROM `project.region-us.INFORMATION_SCHEMA.TABLES` t
WHERE t.ddl LIKE '%customers%'
Cloud SQL / PostgreSQL:
-- Find dependencies using pg_depend
SELECT DISTINCT
dependent_ns.nspname AS dependent_schema,
dependent_view.relname AS dependent_view,
source_ns.nspname AS source_schema,
source_table.relname AS source_table
FROM pg_depend
JOIN pg_rewrite ON pg_depend.objid = pg_rewrite.oid
JOIN pg_class AS dependent_view ON pg_rewrite.ev_class = dependent_view.oid
JOIN pg_class AS source_table ON pg_depend.refobjid = source_table.oid
JOIN pg_namespace AS dependent_ns ON dependent_view.relnamespace = dependent_ns.oid
JOIN pg_namespace AS source_ns ON source_table.relnamespace = source_ns.oid
WHERE source_table.relname = 'customers'
AND source_table.relkind = 'r'; -- 'r' = regular table
Pros:
- ✅ Works for native views (warehouse tracks dependencies)
Cons:
- ❌ Only captures database-level dependencies (not external transformations)
- ❌ Limited to views (not table-to-table in ETL)
3.3. dbt Integration (Recommended for dbt Users)
dbt automatically generates lineage from ref() and source() functions
dbt project:
# models/schema.yml
sources:
  - name: shopify
    tables:
      - name: orders
      - name: customers

models:
  - name: stg_orders
    description: "Staging orders from Shopify"
  - name: fact_orders
    description: "Order facts for analytics"
-- models/staging/stg_orders.sql
SELECT * FROM {{ source('shopify', 'orders') }}
-- models/marts/fact_orders.sql
SELECT
o.order_id,
c.customer_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c ON o.customer_id = c.customer_id
dbt generates:
// target/manifest.json
{
  "nodes": {
    "model.my_project.fact_orders": {
      "depends_on": {
        "nodes": [
          "model.my_project.stg_orders",
          "model.my_project.dim_customers"
        ]
      }
    },
    "model.my_project.stg_orders": {
      "depends_on": {
        "nodes": ["source.my_project.shopify.orders"]
      }
    }
  }
}
Visualize:
# Generate docs with lineage
dbt docs generate
# Serve docs
dbt docs serve
# Open http://localhost:8080
# Click any model → See lineage graph
Lineage graph (interactive):
shopify.orders (Source)
│
▼
stg_orders (Staging)
│
▼
fact_orders (Mart)
│
▼
[Downstream models...]
Pros:
- ✅ 100% accurate (dbt knows exact dependencies)
- ✅ Zero manual work
- ✅ Free
- ✅ Always up-to-date
Cons:
- ❌ Only captures dbt models (not Python, Airflow, dashboards)
3.4. Manual Documentation (❌ Don't Do This)
Example:
# lineage.yml (manually maintained)
fact_orders:
  upstream:
    - stg_orders
    - dim_customers
  downstream:
    - executive_dashboard
    - revenue_model
Problems:
- ❌ Quickly outdated (developers forget to update)
- ❌ Error-prone (typos, missing dependencies)
- ❌ Not scalable
Rule: Automate lineage, never manual
4. Lineage Visualization: DAGs
4.1. DAG = Directed Acyclic Graph
Properties:
- Directed: Arrows show flow direction (A → B)
- Acyclic: No loops (can't have A → B → A)
Example DAG:
orders_api
│
▼
stg_orders ──────┐
│ │
▼ │
int_orders_clean │
│ │
├───────────┤
▼ ▼
fact_orders revenue_metrics
│ │
└─────┬─────┘
▼
executive_dashboard
Node (table) details:
┌─────────────────────────────────┐
│ fact_orders │
├─────────────────────────────────┤
│ Type: Table │
│ Rows: 1.2M │
│ Updated: 2 hours ago │
│ Quality: 98% ✅ │
│ Owner: @data-team │
│ │
│ Upstream (2): │
│ - stg_orders │
│ - int_orders_clean │
│ │
│ Downstream (1): │
│ - executive_dashboard │
└─────────────────────────────────┘
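Lineage DAGs map naturally onto a graph library. A sketch of the example DAG above using networkx (one reading of the diagram; edges point from upstream to downstream):
import networkx as nx

G = nx.DiGraph([
    ("orders_api", "stg_orders"),
    ("stg_orders", "int_orders_clean"),
    ("stg_orders", "revenue_metrics"),
    ("int_orders_clean", "fact_orders"),
    ("int_orders_clean", "revenue_metrics"),
    ("fact_orders", "executive_dashboard"),
    ("revenue_metrics", "executive_dashboard"),
])

assert nx.is_directed_acyclic_graph(G)  # lineage must never contain cycles

print(nx.descendants(G, "stg_orders"))         # everything downstream
print(nx.ancestors(G, "executive_dashboard"))  # everything upstream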
4.2. Interactive Features
Good lineage tools allow:
1. Click to expand:
fact_orders
│
▼
[Click "Show downstream"] → Expands to show all 15 dependencies
2. Filter by distance:
Show:
[ ] 1 level upstream/downstream
[x] 2 levels
[ ] All levels (can be messy)
3. Highlight paths:
Click "customers" table → Highlight path to "Executive Dashboard"
Shows: customers → dim_customers → fact_orders → dashboard
4. Impact analysis:
Right-click table → "What breaks if I change this?"
Highlights all downstream in red/orange
5. Column-level drill-down:
Click table → View columns
Click column → See which downstream columns depend on it
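Features 2 and 4 fall out of the same graph structure. Continuing from the networkx sketch above (descendants_at_distance requires networkx ≥ 2.4):
import networkx as nx

# Filter by distance: show only assets exactly 1 or 2 hops downstream
level_1 = nx.descendants_at_distance(G, "stg_orders", 1)
level_2 = nx.descendants_at_distance(G, "stg_orders", 2)

# Impact analysis: everything that could break if stg_orders changes
blast_radius = nx.descendants(G, "stg_orders")
print(level_1, level_2, blast_radius)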
4.3. Lineage in dbt Docs
Example (dbt docs UI):
Top nav: [Project] [Sources] [Exposures]
Left sidebar:
📁 staging
📄 stg_orders
📄 stg_customers
📁 intermediate
📄 int_orders_enriched
📁 marts
📄 fact_orders
📄 dim_customers
Center: Lineage Graph
[Zoom In] [Zoom Out] [Center] [Download]
Visual DAG showing all models connected
Right sidebar (when model clicked):
Model: fact_orders
Description: "Order facts for analytics"
Columns (5):
- order_id
- order_date
- customer_name
- total_vnd
- profit_margin
Depends On (2):
- stg_orders
- dim_customers
Used By (3):
- revenue_dashboard
- churn_model
- cohort_analysis
5. Use Cases: Real-World Applications
5.1. Schema Change Impact
Scenario: Rename column customer_id → customer_key in customers table
Steps with lineage:
- Identify impact:
customers.customer_id is used by:
- dim_customers (JOIN key)
- fact_orders (FK reference)
- marketing_campaigns (target list)
- ml_features.customer_cohorts (feature)
- Notify owners:
# Auto-generate notifications
affected_assets = lineage.get_downstream('customers.customer_id')
for asset in affected_assets:
    notify(
        owner=asset.owner,
        message="⚠️ Breaking change: customers.customer_id is being renamed to customer_key",
        impact=asset.name,
        action_required="Update JOIN/WHERE clauses before 2025-07-15"
    )
- Update in parallel:
- Data Engineer: Update all SQL queries
- Analytics Engineer: Update dbt models
- Data Scientist: Update feature pipelines
- Deploy safely:
- Test all downstream assets
- Deploy during low-traffic window
- Monitor for errors
Without lineage: 50% chance of missing something → production breakage
5.2. Data Quality Investigation
Scenario: Dashboard shows spike in revenue (too good to be true)
Investigation with lineage:
Click dashboard → Lineage → Trace upstream
revenue_dashboard
↑ uses
fact_revenue (updated 1 hour ago)
↑ built from
stg_shopify_orders (updated 1 hour ago)
↑ ingested from
Shopify API
Check quality scores:
- stg_shopify_orders: 65% ⚠️ (usually 98%)
Failed checks:
❌ order_total > 0 (5% violations - negative values!)
❌ unique order_id (2% duplicates!)
Root cause: Shopify API bug returned duplicate orders with negative refunds
Lineage helped:
- Quickly trace to source of bad data
- Identify quality issue
- Fix at source (contact Shopify)
Time: 10 minutes (vs hours of manual investigation)
5.3. GDPR/PDPA Data Deletion
Scenario: User requests deletion (Right to be Forgotten)
Column-level lineage query:
-- Find all tables/columns derived from the user email field
SELECT DISTINCT table_name, column_name
FROM lineage_metadata
WHERE source_column = 'users.email';
Results:
┌─────────────────┬─────────────────┐
│ Table           │ Column          │
├─────────────────┼─────────────────┤
│ users           │ email           │
│ orders          │ customer_email  │
│ marketing_list  │ email           │
│ support_tickets │ requester_email │
│ audit_log       │ user_email      │
└─────────────────┴─────────────────┘
Automated deletion:
from datetime import datetime, timezone

def delete_user_data(user_id):
    # Get all tables with user data (from lineage)
    affected_tables = lineage.get_tables_with_column('users.email')
    for table in affected_tables:
        # Delete from each table (table names come from trusted lineage metadata)
        db.execute(f"""
            DELETE FROM {table.name}
            WHERE user_id = %s
        """, [user_id])
        # Log deletion (compliance audit)
        audit_log.record(
            action='gdpr_deletion',
            user_id=user_id,
            table=table.name,
            timestamp=datetime.now(timezone.utc)
        )
    return {
        'deleted_from': [t.name for t in affected_tables],
        'timestamp': datetime.now(timezone.utc)
    }
Compliance proof: Lineage shows all locations checked ✅
5.4. Debugging ML Model
Scenario: Churn prediction model accuracy dropped from 85% → 70%
Investigation:
churn_model
↑ trained on
ml_features.customer_features
↑ built from
├── dim_customers
├── fact_orders
└── web_events ⚠️ (STALE - 7 days old)
Root cause: web_events pipeline broken → missing features → model degraded
Fix: Repair web_events pipeline, retrain model
Lineage prevented: Weeks of debugging why model suddenly worse
6. Implementation Guide
6.1. Step 1: Start with dbt (If Applicable)
If you use dbt → Free lineage out of the box
# Already have dbt project?
dbt docs generate
dbt docs serve
# Open http://localhost:8080 → Click any model → See lineage
Export lineage to catalog:
# Extract lineage from dbt manifest.json
import json

with open('target/manifest.json') as f:
    manifest = json.load(f)

# Build lineage graph
lineage_graph = {}
for node_id, node in manifest['nodes'].items():
    if node['resource_type'] == 'model':
        lineage_graph[node['name']] = {
            'upstream': [
                manifest['nodes'][dep]['name']
                for dep in node['depends_on']['nodes']
                if dep in manifest['nodes']
            ],
            'downstream': []  # Computed by reversing the graph (see below)
        }

# Upload to data catalog
catalog.ingest_lineage(lineage_graph)
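The downstream lists left empty above can be filled in by reversing the upstream edges:
# Invert upstream edges to populate each model's downstream list
for model, info in lineage_graph.items():
    for parent in info['upstream']:
        if parent in lineage_graph:
            lineage_graph[parent]['downstream'].append(model)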
6.2. Step 2: Add Query Log Parsing
For non-dbt transformations (Python, Airflow, raw SQL)
BigQuery: Query audit logs
-- Find all queries that reference the 'customers' table
SELECT
  user_email,
  query,
  referenced_tables,
  destination_table,
  creation_time
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE state = 'DONE'
  AND DATE(creation_time) >= CURRENT_DATE() - 7
  -- referenced_tables is an ARRAY<STRUCT>, so match on its fields
  AND EXISTS (
    SELECT 1
    FROM UNNEST(referenced_tables) rt
    WHERE rt.dataset_id = 'dataset'
      AND rt.table_id = 'customers'
  )
ORDER BY creation_time DESC
LIMIT 100;
Parse queries to extract lineage:
from google.cloud import bigquery
from sqllineage.runner import LineageRunner

bigquery_client = bigquery.Client()

def extract_lineage_from_logs():
    # Get recent queries from the audit log.
    # destination_table is a STRUCT, so flatten it to a qualified name.
    queries = bigquery_client.query("""
        SELECT DISTINCT
          query,
          FORMAT('%s.%s.%s',
                 destination_table.project_id,
                 destination_table.dataset_id,
                 destination_table.table_id) AS destination_table
        FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
        WHERE DATE(creation_time) >= CURRENT_DATE() - 1
          AND destination_table IS NOT NULL
    """).to_dataframe()

    lineage_edges = []
    for _, row in queries.iterrows():
        result = LineageRunner(row['query'])
        for source in result.source_tables:
            lineage_edges.append({
                'source': str(source),
                'target': row['destination_table'],
                'transformation': row['query'][:500]  # First 500 chars
            })
    return lineage_edges
# Run daily to update lineage
lineage = extract_lineage_from_logs()
catalog.update_lineage(lineage)
6.3. Step 3: Integrate Dashboards
Track dashboard → table dependencies
Looker Example:
# Use the Looker API to get dashboard dependencies
from looker_sdk import init40

sdk = init40()

def get_dashboard_lineage(dashboard_id):
    dashboard = sdk.dashboard(dashboard_id)
    # Each tile (element) with a query points at one model/explore pair
    dependencies = []
    for element in dashboard.dashboard_elements:
        if element.query:
            query = sdk.query(element.query.id)
            dependencies.append({
                'dashboard': dashboard.title,
                'dashboard_id': dashboard_id,
                'explore': f"{query.model}.{query.view}"
            })
    return dependencies

# Index all dashboards
for dashboard in sdk.all_dashboards():
    lineage = get_dashboard_lineage(dashboard.id)
    catalog.add_dashboard_lineage(lineage)
6.4. Step 4: Column-Level Lineage (Advanced)
More complex: Parse SQL to track column transformations
Example:
SELECT
o.order_id,
o.order_date,
c.email AS customer_email, -- customers.email → fact_orders.customer_email
o.total * 23000 AS total_vnd -- orders.total → fact_orders.total_vnd (with transformation)
FROM orders o
JOIN customers c ON o.customer_id = c.id
Column lineage:
fact_orders.order_id:
  source: orders.order_id
  transformation: direct
fact_orders.order_date:
  source: orders.order_date
  transformation: direct
fact_orders.customer_email:
  source: customers.email
  transformation: rename
fact_orders.total_vnd:
  source: orders.total
  transformation: "multiply by 23000"
Tools:
- Alation, Atlan: Built-in column lineage
- Manta: Specialized lineage tool
- Custom: Parse SQL AST (Abstract Syntax Tree)
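The sqllineage library from section 3.1 also offers column-level extraction; a minimal sketch, assuming sqllineage ≥ 1.4 (which accepts a dialect argument):
from sqllineage.runner import LineageRunner

sql = """
INSERT INTO fact_orders
SELECT
    o.order_id,
    c.email AS customer_email,
    o.total * 23000 AS total_vnd
FROM orders o
JOIN customers c ON o.customer_id = c.id
"""

runner = LineageRunner(sql, dialect="ansi")
for path in runner.get_column_lineage():
    # Each path is a tuple of Column objects, source first, target last
    print(" -> ".join(str(col) for col in path))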
7. Best Practices
7.1. Automate Everything
❌ Don't:
- Manually document lineage in spreadsheets
- Ask developers to update lineage diagrams
✅ Do:
- Parse code automatically (dbt, SQL logs)
- Use lineage tools (catalog, dbt docs)
- Regenerate lineage daily/hourly
7.2. Keep Lineage Fresh
Stale lineage is worse than no lineage: it gives false confidence.
Implementation:
import time
from datetime import datetime, timezone

import schedule

# Daily refresh job
def refresh_lineage():
    # Extract from dbt
    dbt_lineage = extract_dbt_lineage()
    # Extract from query logs
    query_lineage = extract_query_lineage()
    # Merge
    full_lineage = merge_lineage(dbt_lineage, query_lineage)
    # Upload to catalog, stamped with the refresh time
    catalog.replace_lineage(full_lineage, timestamp=datetime.now(timezone.utc))
    print(f"✅ Lineage refreshed: {len(full_lineage)} edges")

# Run daily at 3 AM (schedule needs a loop to fire pending jobs)
schedule.every().day.at("03:00").do(refresh_lineage)
while True:
    schedule.run_pending()
    time.sleep(60)
7.3. Start Table-Level, Add Column-Level Later
Phased approach:
Phase 1 (Month 1-2): Table-level lineage
- Covers 80% of use cases
- Easier to implement
- Faster time-to-value
Phase 2 (Month 3-6): Column-level for PII
- Track sensitive columns (email, phone, etc.)
- Compliance focus (GDPR/PDPA)
Phase 3 (Month 6+): Full column-level
- All columns tracked
- Advanced impact analysis
7.4. Integrate with Data Catalog
Lineage + Catalog = Powerful combo
User searches: "customer revenue"
Result:
┌─────────────────────────────────────┐
│ fact_customer_revenue │
│ Quality: 98% | Updated: 2h ago │
│ │
│ [Schema] [Lineage] [Quality] [...]│
│ │
│ Click [Lineage] → │
│ Upstream (3): │
│ - stg_orders │
│ - dim_customers │
│ - exchange_rates │
│ │
│ Downstream (5): │
│ - revenue_dashboard ← You! │
│ - cohort_analysis │
│ - forecast_model │
│ ... │
└─────────────────────────────────────┘
Benefits:
- One interface for discovery + lineage
- Searchable lineage (find all downstream of X)
- Quality + lineage combined (see if upstream has quality issues)
8. Tools Comparison
8.1. Built-in Tools
dbt docs (Free)
- ✅ Automatic lineage for dbt models
- ✅ Beautiful interactive graph
- ✅ Zero cost
- ❌ dbt-only (no Python, dashboards)
- ❌ No column-level
BigQuery (Built-in)
- ✅ Basic table lineage for views
- ✅ Free
- ❌ Limited to views
- ❌ No cross-project lineage
Snowflake (Built-in)
- ✅ Object dependencies
- ✅ Access history (queries)
- ❌ Basic visualization
- ❌ No column-level
8.2. Data Catalogs
Atlan ($20K-$100K/year)
- ✅ Lineage integrated with catalog
- ✅ Auto-parsing dbt, SQL, Python
- ✅ Column-level lineage
- ✅ Impact analysis UI
Alation ($50K-$200K/year)
- ✅ Advanced lineage engine
- ✅ Column-level
- ✅ Cross-platform (Snowflake, BigQuery, etc.)
DataHub (Open-source, Free)
- ✅ Lineage via metadata events
- ✅ Extensible
- ❌ Requires setup effort
8.3. Specialized Lineage Tools
Manta (Enterprise, $$$)
- ✅ Enterprise-grade column lineage
- ✅ Supports 80+ data sources
- ✅ Complex SQL parsing (even procedural code)
- ❌ Very expensive
- ❌ Overkill for most companies
Octopai (Enterprise, $$$)
- ✅ BI-focused lineage (Tableau, Power BI)
- ✅ Cross-platform
- ❌ Expensive
8.4. Recommendation
| Company Size | Recommendation |
|---|---|
| < 50 employees | dbt docs (if dbt) or basic catalog |
| 50-200 | Atlan or DataHub (open-source) |
| 200+ | Alation or Manta (if heavy compliance) |
Conclusion
Data Lineage is not optional for mature data teams: it's the foundation for reliability, compliance, and trust.
Key Takeaways:
- Automate lineage extraction: Parse code (dbt, SQL), never manual
- Start table-level: 80% of the value for 20% of the effort
- Integrate with catalog: Lineage + discovery = powerful
- Use for impact analysis: Know what breaks before deploying
- Track PII flows: Critical for PDPA/GDPR compliance
- Keep fresh: Stale lineage = false confidence
- dbt is best starting point: Free, automatic, accurate
Implementation timeline:
- Week 1-2: Setup dbt docs (if applicable) or choose catalog
- Week 3-4: Extract lineage from SQL logs
- Month 2: Add dashboard lineage
- Month 3+: Column-level for PII (if needed)
ROI:
- Time savings: 2 hours debugging → 5 minutes with lineage
- Risk reduction: Prevent breaking production with impact analysis
- Compliance: GDPR/PDPA audit trails
- Trust: Transparency into data transformations
Next Steps:
- ✅ If using dbt: Run dbt docs generate right now
- ✅ Read Data Catalog to integrate lineage
- ✅ Read Data Governance for the broader framework
- ✅ Assess your team's lineage maturity
- ✅ Plan lineage implementation (start small!)
Need help? Carptech implements lineage solutions (dbt, Atlan, DataHub) and provides training. Book consultation to discuss your data traceability needs.
Related Posts:
- Data Catalog: Democratizing Data Discovery - Lineage in catalog context
- Data Governance 101: A Framework for Enterprises - Governance foundation
- dbt Best Practices: Transform Data Like a Pro - dbt lineage deep-dive
- PDPA Compliance: Protecting Personal Data - Compliance use case