Data Governance | Updated: June 30, 2025 | 19 min read

Data Lineage: Traceability từ Source đến Dashboard

A comprehensive guide to Data Lineage - mapping data flow from source through transformations to destination. Explore table-level vs column-level lineage, automatic parsing tools (dbt, SQL), DAG visualization, use cases (impact analysis, root cause, compliance), and implementation best practices.

Ngô Thanh Thảo

Data Governance & Security Lead

[Image: Data lineage DAG visualization showing data flowing from source systems through transformation layers (staging, intermediate, marts) to final dashboards with dependency arrows and impact analysis highlighting]
#Data Lineage | #Data Traceability | #Impact Analysis | #Root Cause Analysis | #dbt | #Data Governance | #Metadata Management | #Data Quality | #Compliance

TL;DR

Data Lineage = Map of data flow showing:

  • Source: Where data originates (APIs, databases, files)
  • Transformations: How data is processed (SQL, Python, dbt)
  • Destination: Where data ends up (dashboards, ML models, reports)

Why Critical:

  1. Impact Analysis: "If I change this table, what breaks?" → Know before deploying
  2. Root Cause Analysis: "Dashboard shows wrong numbers" → Trace back to find issue
  3. Compliance: GDPR/PDPA require knowing where PII flows
  4. Trust: "How is this metric calculated?" → See full calculation chain

Types:

  • Table-level: Dataset → dataset dependencies
  • Column-level: Field → field traceability (more granular, harder to maintain)

How Captured:

  • Automatic: Parse SQL/dbt → extract dependencies (best)
  • Metadata APIs: Query warehouse metadata
  • Manual: Developers document (error-prone, don't do this)

Tools:

  • dbt: Automatic lineage for dbt models (free, built-in)
  • Data Catalogs: Atlan, Alation, DataHub integrate lineage
  • Cloud-native: BigQuery, Snowflake have basic lineage
  • Standalone: Manta, Octopai (enterprise)

Use Case Example: Dashboard shows revenue down 20%

  • Trace lineage: Dashboard → fact_revenue table → stg_orders table
  • Find: stg_orders pipeline failed yesterday, data stale
  • Fix: Rerun pipeline → dashboard correct
  • Time saved: 2 hours debugging → 5 minutes with lineage

Implementation: Start with dbt (if using), expand to catalog for full coverage

This post guides you through a complete lineage implementation, from the basics to advanced column-level tracking.


1. What is Data Lineage?

1.1. Definition

Data Lineage = Visual map showing:

  • Upstream: Where data comes from
  • Transformations: How data changes
  • Downstream: Where data goes

Example:

E-commerce Order Flow:

Shopify API
    │
    ▼
stg_shopify_orders (Raw staging)
    │
    ▼
int_orders_enriched (Add customer info)
    │
    ├─────────────┬─────────────┐
    ▼             ▼             ▼
fact_orders  dim_customers  revenue_metrics
    │             │             │
    ├─────────────┴─────────────┤
    ▼                           ▼
Executive Dashboard        ML Model (Churn)

Key info per node:

  • Table/dataset name
  • Schema (columns)
  • Update frequency
  • Owner
  • Quality score
  • Row count

Key info per edge (arrows):

  • Transformation logic (SQL, Python)
  • When it runs
  • Dependencies (runs after X completes)
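
A lineage graph is just these nodes and edges with metadata attached. A minimal in-memory sketch (all field names here are illustrative, not any specific tool's schema):

# Minimal lineage graph: nodes carry table metadata, edges carry
# transformation metadata. All field names are illustrative.
lineage = {
    "nodes": {
        "stg_shopify_orders": {"owner": "@data-team", "update_freq": "hourly"},
        "int_orders_enriched": {"owner": "@data-team", "update_freq": "hourly"},
        "fact_orders": {"owner": "@analytics", "update_freq": "daily"},
    },
    "edges": [
        {
            "source": "stg_shopify_orders",
            "target": "int_orders_enriched",
            "transformation": "SQL: add customer info",
            "runs_after": ["stg_shopify_orders"],
        },
        {
            "source": "int_orders_enriched",
            "target": "fact_orders",
            "transformation": "SQL: build order facts",
            "runs_after": ["int_orders_enriched"],
        },
    ],
}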

1.2. Table-Level vs Column-Level

Table-Level Lineage:

orders_table → fact_orders_table
  • Shows: Dataset depends on dataset
  • Granularity: Coarse
  • Effort: Low (easy to extract)

Column-Level Lineage:

orders_table.order_id → fact_orders.order_key
orders_table.total_amount → fact_orders.revenue_usd (converted from VND)
customers_table.email → fact_orders.customer_email
  • Shows: Field depends on field(s)
  • Granularity: Fine
  • Effort: High (complex to extract, especially with UNIONs and JOINs)

When to use:

  • Table-level: Sufficient for 80% use cases (impact analysis, root cause)
  • Column-level: Advanced compliance (GDPR - where does email flow?), sensitive data tracking

Vietnamese market: Most companies start with table-level lineage, then add column-level for PII compliance (PDPA).

1.3. Lineage vs Data Flow vs Pipeline

Confused terms:

┌───────────────┬────────────────────────────┬───────────────────────────────────┐
│ Term          │ Definition                 │ Example                           │
├───────────────┼────────────────────────────┼───────────────────────────────────┤
│ Data Lineage  │ Static map of dependencies │ "Table A depends on B and C"      │
│ Data Flow     │ Runtime execution path     │ "Data flows: API → S3 → BigQuery" │
│ Data Pipeline │ Orchestrated jobs          │ "Airflow DAG runs every hour"     │
└───────────────┴────────────────────────────┴───────────────────────────────────┘

Lineage = blueprint, Flow = water moving, Pipeline = plumbing system


2. Why Data Lineage Matters

2.1. Impact Analysis

Scenario: Need to change customers table schema (add column, rename field)

Question: "What will break?"

Without lineage:

Engineer: "Hmm, let me search codebase for 'customers'..."
*3 hours later*
"Found 47 SQL files referencing it. Not sure if I got them all.
Let's deploy and see what breaks 🤞"

*Next day*
10 dashboards broken
5 ML models failing
CEO dashboard showing nulls

With lineage:

Click 'customers' table → View downstream dependencies

Downstream (Direct):
  - int_customer_360 (used by Marketing dashboard)
  - fact_orders (JOIN on customer_id)
  - churn_prediction_model (feature: customer_age)

Downstream (Indirect):
  - Executive Dashboard (via fact_orders)
  - Email Campaign Tool (via int_customer_360)
  - Revenue Forecast Model (via fact_orders)

Total impact: 3 direct, 8 indirect dependencies

Action: Notify owners, update transformations, test end-to-end

Time saved: 3 hours debugging → 5 minutes impact analysis
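
Mechanically, impact analysis is a downstream traversal of the lineage graph. A minimal sketch with networkx, using the tables from this example (`nx.descendants` returns every transitively reachable node):

import networkx as nx

# Edge A -> B means "B is built from A"
g = nx.DiGraph([
    ("customers", "int_customer_360"),
    ("customers", "fact_orders"),
    ("customers", "churn_prediction_model"),
    ("fact_orders", "executive_dashboard"),
    ("fact_orders", "revenue_forecast_model"),
    ("int_customer_360", "email_campaign_tool"),
])

# Direct impact: immediate downstream dependencies
direct = set(g.successors("customers"))

# Full impact: everything transitively downstream
indirect = nx.descendants(g, "customers") - direct

print("Direct:", sorted(direct))
print("Indirect:", sorted(indirect))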

2.2. Root Cause Analysis

Scenario: Dashboard shows revenue down 20%

Without lineage:

Analyst: "Dashboard wrong? Let me check..."
  → Check dashboard query: Looks correct
  → Check fact_revenue table: Numbers look off
  → Ask Data Engineer: "Did something change?"
  → Engineer: "Not sure, let me check logs..."
  → *2 hours later* Found: ETL pipeline failed yesterday

Total time: 2+ hours

With lineage:

Click dashboard → View lineage → Trace upstream

Dashboard "Revenue Trends"
    ↑ uses
fact_revenue
    ↑ built from
stg_shopify_orders ⚠️ (STALE - last updated 36 hours ago)
    ↑ ingested from
Shopify API

Alert: stg_shopify_orders failed to update (API rate limit)

Root cause found in 2 minutes ✅
Fix: Increase rate limit, rerun pipeline

Time saved: 2 hours → 2 minutes
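
Root cause analysis is the same traversal in reverse: walk upstream with `nx.ancestors` and flag anything stale. A sketch, assuming each node stores a `last_updated` timestamp (hypothetical metadata, not a real tool's API):

from datetime import datetime, timedelta, timezone

import networkx as nx

g = nx.DiGraph([
    ("stg_shopify_orders", "fact_revenue"),
    ("fact_revenue", "revenue_dashboard"),
])

# Hypothetical freshness metadata per node
now = datetime.now(timezone.utc)
g.nodes["stg_shopify_orders"]["last_updated"] = now - timedelta(hours=36)
g.nodes["fact_revenue"]["last_updated"] = now - timedelta(hours=2)

def find_stale_upstream(graph, node, max_age=timedelta(hours=24)):
    """Return upstream tables whose data is older than max_age."""
    stale = []
    for upstream in nx.ancestors(graph, node):
        updated = graph.nodes[upstream].get("last_updated")
        if updated and now - updated > max_age:
            stale.append(upstream)
    return stale

print(find_stale_upstream(g, "revenue_dashboard"))
# ['stg_shopify_orders']  <- the 36-hour-old staging table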

2.3. Compliance (GDPR/PDPA)

Requirement: "Show me where customer email is used and shared"

Column-level lineage:

customers.email (Source)
    │
    ├─────────────────┬─────────────────┐
    ▼                 ▼                 ▼
fact_orders.email  marketing_list  external_crm
    │                 │                 │
    ▼                 ▼                 ▼
Order Dashboard   Mailchimp API   Salesforce
                  (3rd party)     (3rd party)

Compliance questions answered:

  • ✅ Where is PII stored? (3 tables)
  • ✅ Where is it used? (1 dashboard)
  • ✅ Who has access? (Check permissions per table)
  • ✅ Is it shared externally? (Yes, Mailchimp + Salesforce)
  • ✅ How long retained? (Check retention policies)

Use case: User requests "Delete my data" (PDPA Right to be Forgotten)

  • Lineage shows all locations with user's email
  • Automated deletion script targets all nodes
  • Audit log proves complete deletion

2.4. Trust & Transparency

Scenario: Marketing and Finance report different "Monthly Active Users"

Without lineage:

Marketing: "We have 50K MAU"
Finance: "We have 48K MAU"
CEO: "Which is correct??"

*Investigation*
Turns out:
- Marketing uses dim_users table (includes test accounts)
- Finance uses analytics.active_users (excludes test)

Resolution: 3 days of meetings

With lineage:

Click "MAU" metric

Marketing Dashboard MAU:
  ← dim_users (includes is_test = true/false)
  Definition: COUNT(DISTINCT user_id) WHERE last_active_date >= 30 days

Finance Dashboard MAU:
  ← analytics.active_users (WHERE is_test = false)
  Definition: COUNT(DISTINCT user_id) WHERE last_active_date >= 30 days AND is_test = false

Difference: Test accounts (2K)

Resolution: 5 minutes ✅
Decision: Use Finance definition (exclude test)

Trust: Users can see exactly how metrics are calculated → no black box


3. How Lineage is Captured

3.1. Automatic SQL Parsing (Best Method)

Parse SQL queries → extract table dependencies

Example:

-- dbt model: fact_orders.sql
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    o.total_amount * 23000 AS total_vnd  -- USD to VND
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('dim_customers') }} c
    ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'

Lineage extracted:

fact_orders:
  depends_on:
    - stg_orders (columns: order_id, order_date, customer_id, total_amount)
    - dim_customers (columns: customer_id, customer_name)

  produces:
    - columns: [order_id, order_date, customer_name, total_vnd]

Tools that do this:

  • dbt: Automatic (best for dbt users)
  • SQLFluff: Open-source SQL parser
  • sqllineage: Python library for lineage extraction
  • Atlan, Alation: Commercial catalogs with SQL parsing

Example (sqllineage library):

from sqllineage.runner import LineageRunner

sql = """
INSERT INTO analytics.fact_orders
SELECT
    o.order_id,
    c.customer_name,
    o.total_amount
FROM staging.orders o
JOIN staging.customers c ON o.customer_id = c.customer_id
"""

result = LineageRunner(sql)

print("Source tables:", result.source_tables())
# [Table: staging.orders, Table: staging.customers]

print("Target tables:", result.target_tables())
# [Table: analytics.fact_orders]

# Build lineage graph edges from the parsed result
lineage = {
    'sources': [str(t) for t in result.source_tables()],
    'targets': [str(t) for t in result.target_tables()]
}

Pros:

  • ✅ Accurate
  • ✅ Automated (no manual work)
  • ✅ Always up-to-date (parses live queries)

Cons:

  • ❌ Complex SQL hard to parse (dynamic SQL, CTEs, UNIONs)
  • ❌ Requires access to SQL code

3.2. Metadata APIs

Query warehouse metadata to find dependencies

BigQuery Example:

-- Find all views/tables that reference 'customers' table
SELECT
    t.table_catalog,
    t.table_schema,
    t.table_name,
    t.table_type,
    t.ddl  -- CREATE TABLE/VIEW statement
FROM `project.region-us.INFORMATION_SCHEMA.TABLES` t
WHERE t.ddl LIKE '%customers%'

Cloud SQL / PostgreSQL:

-- Find dependencies using pg_depend
SELECT DISTINCT
    dependent_ns.nspname AS dependent_schema,
    dependent_view.relname AS dependent_view,
    source_ns.nspname AS source_schema,
    source_table.relname AS source_table
FROM pg_depend
JOIN pg_rewrite ON pg_depend.objid = pg_rewrite.oid
JOIN pg_class AS dependent_view ON pg_rewrite.ev_class = dependent_view.oid
JOIN pg_class AS source_table ON pg_depend.refobjid = source_table.oid
JOIN pg_namespace AS dependent_ns ON dependent_view.relnamespace = dependent_ns.oid
JOIN pg_namespace AS source_ns ON source_table.relnamespace = source_ns.oid
WHERE source_table.relname = 'customers'
  AND source_table.relkind = 'r';  -- 'r' = regular table

Pros:

  • ✅ Works for native views (warehouse tracks dependencies)

Cons:

  • ❌ Only captures database-level dependencies (not external transformations)
  • ❌ Limited to views (not table-to-table in ETL)

3.3. dbt Integration (Recommended for dbt Users)

dbt automatically generates lineage from ref() and source() functions

dbt project:

# models/schema.yml
sources:
  - name: shopify
    tables:
      - name: orders
      - name: customers

models:
  - name: stg_orders
    description: "Staging orders from Shopify"

  - name: fact_orders
    description: "Order facts for analytics"

-- models/staging/stg_orders.sql
SELECT * FROM {{ source('shopify', 'orders') }}

-- models/marts/fact_orders.sql
SELECT
    o.order_id,
    c.customer_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c ON o.customer_id = c.customer_id

dbt generates:

// target/manifest.json
{
  "nodes": {
    "model.my_project.fact_orders": {
      "depends_on": {
        "nodes": [
          "model.my_project.stg_orders",
          "model.my_project.dim_customers"
        ]
      }
    },
    "model.my_project.stg_orders": {
      "depends_on": {
        "nodes": ["source.my_project.shopify.orders"]
      }
    }
  }
}

Visualize:

# Generate docs with lineage
dbt docs generate

# Serve docs
dbt docs serve

# Open http://localhost:8080
# Click any model → See lineage graph

Lineage graph (interactive):

shopify.orders (Source)
    │
    ▼
stg_orders (Staging)
    │
    ▼
fact_orders (Mart)
    │
    ▼
[Downstream models...]

Pros:

  • ✅ 100% accurate (dbt knows exact dependencies)
  • ✅ Zero manual work
  • ✅ Free
  • ✅ Always up-to-date

Cons:

  • ❌ Only captures dbt models (not Python, Airflow, dashboards)

3.4. Manual Documentation (❌ Don't Do This)

Example:

# lineage.yml (manually maintained)
fact_orders:
  upstream:
    - stg_orders
    - dim_customers
  downstream:
    - executive_dashboard
    - revenue_model

Problems:

  • ❌ Quickly outdated (developers forget to update)
  • ❌ Error-prone (typos, missing dependencies)
  • ❌ Not scalable

Rule: Automate lineage, never manual


4. Lineage Visualization: DAGs

4.1. DAG = Directed Acyclic Graph

Properties:

  • Directed: Arrows show flow direction (A → B)
  • Acyclic: No loops (can't have A → B → A)

Example DAG:

        orders_api
            │
            ▼
       stg_orders ──────┐
            │           │
            ▼           │
    int_orders_clean    │
            │           │
            ├───────────┤
            ▼           ▼
       fact_orders  revenue_metrics
            │           │
            └─────┬─────┘
                  ▼
          executive_dashboard

Node (table) details:

┌─────────────────────────────────┐
│  fact_orders                    │
├─────────────────────────────────┤
│ Type: Table                     │
│ Rows: 1.2M                      │
│ Updated: 2 hours ago            │
│ Quality: 98% ✅                 │
│ Owner: @data-team               │
│                                 │
│ Upstream (2):                   │
│   - stg_orders                  │
│   - int_orders_clean            │
│                                 │
│ Downstream (1):                 │
│   - executive_dashboard         │
└─────────────────────────────────┘
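
Acyclicity is what makes the graph orderable: a topological sort yields a valid build order for the whole pipeline. A quick sketch with networkx, using the edge set read off the diagram above:

import networkx as nx

# Edge A -> B means "B is built from A"
g = nx.DiGraph([
    ("orders_api", "stg_orders"),
    ("stg_orders", "int_orders_clean"),
    ("stg_orders", "revenue_metrics"),
    ("int_orders_clean", "fact_orders"),
    ("fact_orders", "executive_dashboard"),
    ("revenue_metrics", "executive_dashboard"),
])

assert nx.is_directed_acyclic_graph(g)  # no cycles -> a build order exists

# One valid execution order for the pipeline
print(list(nx.topological_sort(g)))
# e.g. ['orders_api', 'stg_orders', 'int_orders_clean',
#       'revenue_metrics', 'fact_orders', 'executive_dashboard']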

4.2. Interactive Features

Good lineage tools allow:

1. Click to expand:

fact_orders
    │
    ▼
[Click "Show downstream"] → Expands to show all 15 dependencies

2. Filter by distance:

Show:
  [ ] 1 level upstream/downstream
  [x] 2 levels
  [ ] All levels (can be messy)

3. Highlight paths:

Click "customers" table → Highlight path to "Executive Dashboard"
Shows: customers → dim_customers → fact_orders → dashboard

4. Impact analysis:

Right-click table → "What breaks if I change this?"
Highlights all downstream in red/orange

5. Column-level drill-down:

Click table → View columns
Click column → See which downstream columns depend on it

4.3. Lineage in dbt Docs

Example (dbt docs UI):

Top nav: [Project] [Sources] [Exposures]

Left sidebar:
  📁 staging
    📄 stg_orders
    📄 stg_customers
  📁 intermediate
    📄 int_orders_enriched
  📁 marts
    📄 fact_orders
    📄 dim_customers

Center: Lineage Graph
  [Zoom In] [Zoom Out] [Center] [Download]

  Visual DAG showing all models connected

Right sidebar (when model clicked):
  Model: fact_orders
  Description: "Order facts for analytics"

  Columns (5):
    - order_id
    - order_date
    - customer_name
    - total_vnd
    - profit_margin

  Depends On (2):
    - stg_orders
    - dim_customers

  Used By (3):
    - revenue_dashboard
    - churn_model
    - cohort_analysis

5. Use Cases: Real-World Applications

5.1. Schema Change Impact

Scenario: Rename column customer_id → customer_key in customers table

Steps with lineage:

  1. Identify impact:

customers.customer_id is used by:
  - dim_customers (JOIN key)
  - fact_orders (FK reference)
  - marketing_campaigns (target list)
  - ml_features.customer_cohorts (feature)

  2. Notify owners:

# Auto-generate notifications from lineage
affected_assets = lineage.get_downstream('customers.customer_id')

for asset in affected_assets:
    notify(
        owner=asset.owner,
        message="⚠️ Breaking change: customers.customer_id renaming to customer_key",
        impact=asset.name,
        action_required="Update JOIN/WHERE clauses before 2025-07-15"
    )

  3. Update in parallel:
  • Data Engineer: Update all SQL queries
  • Analytics Engineer: Update dbt models
  • Data Scientist: Update feature pipelines

  4. Deploy safely:
  • Test all downstream assets
  • Deploy during low-traffic window
  • Monitor for errors
Without lineage: 50% chance of missing something → production breakage

5.2. Data Quality Investigation

Scenario: Dashboard shows spike in revenue (too good to be true)

Investigation with lineage:

Click dashboard → Lineage → Trace upstream

revenue_dashboard
    ↑ uses
fact_revenue (updated 1 hour ago)
    ↑ built from
stg_shopify_orders (updated 1 hour ago)
    ↑ ingested from
Shopify API

Check quality scores:
  - stg_shopify_orders: 65% ⚠️ (usually 98%)
    Failed checks:
      ❌ order_total > 0 (5% violations - negative values!)
      ❌ unique order_id (2% duplicates!)

Root cause: Shopify API bug returned duplicate orders with negative refunds

Lineage helped:

  • Quickly trace to source of bad data
  • Identify quality issue
  • Fix at source (contact Shopify)

Time: 10 minutes (vs hours of manual investigation)

5.3. GDPR/PDPA Data Deletion

Scenario: User requests deletion (Right to be Forgotten)

Column-level lineage query:

-- Find all tables/columns derived from users.email
SELECT DISTINCT table_name, column_name
FROM lineage_metadata
WHERE source_column = 'users.email';

Results:
┌─────────────────┬─────────────────┐
│ Table           │ Column          │
├─────────────────┼─────────────────┤
│ users           │ email           │
│ orders          │ customer_email  │
│ marketing_list  │ email           │
│ support_tickets │ requester_email │
│ audit_log       │ user_email      │
└─────────────────┴─────────────────┘

Automated deletion:

from datetime import datetime, timezone

def delete_user_data(user_id):
    # Get all tables containing user data (from column-level lineage)
    affected_tables = lineage.get_tables_with_column('users.email')

    for table in affected_tables:
        # Delete the user's rows from each table
        db.execute(f"""
            DELETE FROM {table.name}
            WHERE user_id = %s
        """, [user_id])

        # Log each deletion for the compliance audit trail
        audit_log.record(
            action='gdpr_deletion',
            user_id=user_id,
            table=table.name,
            timestamp=datetime.now(timezone.utc)
        )

    return {
        'deleted_from': [t.name for t in affected_tables],
        'timestamp': datetime.now(timezone.utc)
    }

Compliance proof: Lineage shows all locations checked ✅

5.4. Debugging ML Model

Scenario: Churn prediction model accuracy dropped from 85% → 70%

Investigation:

churn_model
    ↑ trained on
ml_features.customer_features
    ↑ built from
    ├── dim_customers
    ├── fact_orders
    └── web_events ⚠️ (STALE - 7 days old)

Root cause: web_events pipeline broken → missing features → model degraded

Fix: Repair web_events pipeline, retrain model

Lineage prevented: Weeks of debugging why model suddenly worse


6. Implementation Guide

6.1. Step 1: Start with dbt (If Applicable)

If you use dbt → Free lineage out of the box

# Already have dbt project?
dbt docs generate
dbt docs serve

# Open http://localhost:8080 → Click any model → See lineage

Export lineage to catalog:

# Extract lineage from dbt manifest.json
import json

with open('target/manifest.json') as f:
    manifest = json.load(f)

# Build lineage graph
lineage_graph = {}

for node_id, node in manifest['nodes'].items():
    if node['resource_type'] == 'model':
        lineage_graph[node['name']] = {
            'upstream': [
                manifest['nodes'][dep]['name']
                for dep in node['depends_on']['nodes']
                if dep in manifest['nodes']
            ],
            'downstream': []  # Compute by reversing graph
        }

# Upload to data catalog
catalog.ingest_lineage(lineage_graph)
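
The downstream lists left empty above can be filled in by inverting the upstream edges before upload, e.g.:

# Fill in downstream by reversing the upstream mapping
for model, info in lineage_graph.items():
    for upstream in info['upstream']:
        if upstream in lineage_graph:
            lineage_graph[upstream]['downstream'].append(model)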

6.2. Step 2: Add Query Log Parsing

For non-dbt transformations (Python, Airflow, raw SQL)

BigQuery: Query audit logs

-- Find all queries that reference the 'customers' table
-- (referenced_tables is an ARRAY of STRUCTs, so unnest it to filter)
SELECT
    user_email,
    query,
    referenced_tables,
    destination_table,
    creation_time
FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE state = 'DONE'
  AND DATE(creation_time) >= CURRENT_DATE() - 7
  AND EXISTS (
    SELECT 1
    FROM UNNEST(referenced_tables) AS rt
    WHERE rt.dataset_id = 'dataset'
      AND rt.table_id = 'customers'
  )
ORDER BY creation_time DESC
LIMIT 100;

Parse queries to extract lineage:

from google.cloud import bigquery
from sqllineage.runner import LineageRunner

bigquery_client = bigquery.Client()

def extract_lineage_from_logs():
    # Get recent queries (with a destination table) from the audit log
    queries = bigquery_client.query("""
        SELECT DISTINCT
            query,
            CONCAT(destination_table.project_id, '.',
                   destination_table.dataset_id, '.',
                   destination_table.table_id) AS destination_table
        FROM `project.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
        WHERE DATE(creation_time) >= CURRENT_DATE() - 1
          AND destination_table IS NOT NULL
    """).to_dataframe()

    lineage_edges = []

    for _, row in queries.iterrows():
        result = LineageRunner(row['query'])

        for source in result.source_tables():
            lineage_edges.append({
                'source': str(source),
                'target': row['destination_table'],
                'transformation': row['query'][:500]  # First 500 chars
            })

    return lineage_edges

# Run daily to update lineage
lineage = extract_lineage_from_logs()
catalog.update_lineage(lineage)

6.3. Step 3: Integrate Dashboards

Track dashboard → table dependencies

Looker Example:

# Use Looker API to get dashboard dependencies
from looker_sdk import init40

sdk = init40()

def get_dashboard_lineage(dashboard_id):
    dashboard = sdk.dashboard(dashboard_id)

    # Get the model/explore behind each tile's query
    dependencies = []
    for element in dashboard.dashboard_elements:
        if element.query:
            query = element.query

            # Query.model / Query.view name the LookML model and explore
            dependencies.append({
                'dashboard': dashboard.title,
                'dashboard_id': dashboard_id,
                'explore': f"{query.model}.{query.view}"
            })

    return dependencies

# Index all dashboards
for dashboard in sdk.all_dashboards():
    lineage = get_dashboard_lineage(dashboard.id)
    catalog.add_dashboard_lineage(lineage)

6.4. Step 4: Column-Level Lineage (Advanced)

More complex: Parse SQL to track column transformations

Example:

SELECT
    o.order_id,
    o.order_date,
    c.email AS customer_email,  -- customers.email → fact_orders.customer_email
    o.total * 23000 AS total_vnd  -- orders.total → fact_orders.total_vnd (with transformation)
FROM orders o
JOIN customers c ON o.customer_id = c.id

Column lineage:

fact_orders.order_id:
  source: orders.order_id
  transformation: direct

fact_orders.order_date:
  source: orders.order_date
  transformation: direct

fact_orders.customer_email:
  source: customers.email
  transformation: rename

fact_orders.total_vnd:
  source: orders.total
  transformation: "multiply by 23000"

Tools:

  • Alation, Atlan: Built-in column lineage
  • Manta: Specialized lineage tool
  • Custom: Parse SQL AST (Abstract Syntax Tree)
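
If you already use sqllineage for table-level extraction, recent versions also expose column-level lineage via `get_column_lineage()`. A sketch (the dialect argument and exact output format may vary by version):

from sqllineage.runner import LineageRunner

sql = """
INSERT INTO analytics.fact_orders
SELECT
    o.order_id,
    c.email AS customer_email,
    o.total * 23000 AS total_vnd
FROM staging.orders o
JOIN staging.customers c ON o.customer_id = c.id
"""

result = LineageRunner(sql, dialect="ansi")

# Each entry is a path from source column to target column
for path in result.get_column_lineage():
    print(" -> ".join(str(col) for col in path))
# staging.customers.email -> analytics.fact_orders.customer_email
# staging.orders.order_id -> analytics.fact_orders.order_id
# staging.orders.total -> analytics.fact_orders.total_vnd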

7. Best Practices

7.1. Automate Everything

❌ Don't:

  • Manually document lineage in spreadsheets
  • Ask developers to update lineage diagrams

✅ Do:

  • Parse code automatically (dbt, SQL logs)
  • Use lineage tools (catalog, dbt docs)
  • Regenerate lineage daily/hourly

7.2. Keep Lineage Fresh

Stale lineage is worse than no lineage (it gives false confidence).

Implementation:

import schedule
from datetime import datetime, timezone

# Daily cron job
def refresh_lineage():
    # Extract from dbt
    dbt_lineage = extract_dbt_lineage()

    # Extract from query logs
    query_lineage = extract_query_lineage()

    # Merge
    full_lineage = merge_lineage(dbt_lineage, query_lineage)

    # Upload to catalog
    catalog.replace_lineage(full_lineage, timestamp=datetime.now(timezone.utc))

    print(f"✅ Lineage refreshed: {len(full_lineage)} edges")

# Run daily at 3 AM
schedule.every().day.at("03:00").do(refresh_lineage)

7.3. Start Table-Level, Add Column-Level Later

Phased approach:

Phase 1 (Month 1-2): Table-level lineage

  • Covers 80% of use cases
  • Easier to implement
  • Faster time-to-value

Phase 2 (Month 3-6): Column-level for PII

  • Track sensitive columns (email, phone, etc.)
  • Compliance focus (GDPR/PDPA)

Phase 3 (Month 6+): Full column-level

  • All columns tracked
  • Advanced impact analysis

7.4. Integrate with Data Catalog

Lineage + Catalog = Powerful combo

User searches: "customer revenue"

Result:
┌─────────────────────────────────────┐
│ fact_customer_revenue               │
│ Quality: 98% | Updated: 2h ago      │
│                                     │
│ [Schema] [Lineage] [Quality] [...]  │
│                                     │
│ Click [Lineage] →                   │
│   Upstream (3):                     │
│     - stg_orders                    │
│     - dim_customers                 │
│     - exchange_rates                │
│                                     │
│   Downstream (5):                   │
│     - revenue_dashboard ← You!      │
│     - cohort_analysis               │
│     - forecast_model                │
│     ...                             │
└─────────────────────────────────────┘

Benefits:

  • One interface for discovery + lineage
  • Searchable lineage (find all downstream of X)
  • Quality + lineage combined (see if upstream has quality issues)

8. Tools Comparison

8.1. Built-in Tools

dbt docs (Free)

  • ✅ Automatic lineage for dbt models
  • ✅ Beautiful interactive graph
  • ✅ Zero cost
  • ❌ dbt-only (no Python, dashboards)
  • ❌ No column-level

BigQuery (Built-in)

  • ✅ Basic table lineage for views
  • ✅ Free
  • ❌ Limited to views
  • ❌ No cross-project lineage

Snowflake (Built-in)

  • ✅ Object dependencies
  • ✅ Access history (queries)
  • ❌ Basic visualization
  • ❌ No column-level

8.2. Data Catalogs

Atlan ($20K-$100K/year)

  • ✅ Lineage integrated with catalog
  • ✅ Auto-parsing dbt, SQL, Python
  • ✅ Column-level lineage
  • ✅ Impact analysis UI

Alation ($50K-$200K/year)

  • ✅ Advanced lineage engine
  • ✅ Column-level
  • ✅ Cross-platform (Snowflake, BigQuery, etc.)

DataHub (Open-source, Free)

  • ✅ Lineage via metadata events
  • ✅ Extensible
  • ❌ Requires setup effort

8.3. Specialized Lineage Tools

Manta (Enterprise, $$$)

  • ✅ Enterprise-grade column lineage
  • ✅ Supports 80+ data sources
  • ✅ Complex SQL parsing (even procedural code)
  • ❌ Very expensive
  • ❌ Overkill for most companies

Octopai (Enterprise, $$$)

  • ✅ BI-focused lineage (Tableau, Power BI)
  • ✅ Cross-platform
  • ❌ Expensive

8.4. Recommendation

┌────────────────┬─────────────────────────────────────────┐
│ Company Size   │ Recommendation                          │
├────────────────┼─────────────────────────────────────────┤
│ < 50 employees │ dbt docs (if dbt) or basic catalog      │
│ 50-200         │ Atlan or DataHub (open-source)          │
│ 200+           │ Alation or Manta (if heavy compliance)  │
└────────────────┴─────────────────────────────────────────┘

Conclusion

Data Lineage is not optional for mature data teams - it's the foundation for reliability, compliance, and trust.

Key Takeaways:

  1. Automate lineage extraction: Parse code (dbt, SQL), never manual
  2. Start table-level: 80% value with 20% effort
  3. Integrate with catalog: Lineage + discovery = powerful
  4. Use for impact analysis: Know what breaks before deploying
  5. Track PII flows: Critical for PDPA/GDPR compliance
  6. Keep fresh: Stale lineage = false confidence
  7. dbt is best starting point: Free, automatic, accurate

Implementation timeline:

  • Week 1-2: Setup dbt docs (if applicable) or choose catalog
  • Week 3-4: Extract lineage from SQL logs
  • Month 2: Add dashboard lineage
  • Month 3+: Column-level for PII (if needed)

ROI:

  • Time savings: 2 hours debugging → 5 minutes with lineage
  • Risk reduction: Prevent breaking production with impact analysis
  • Compliance: GDPR/PDPA audit trails
  • Trust: Transparency into data transformations

Next Steps:

  • ✅ If using dbt: Run dbt docs generate right now
  • ✅ Read the Data Catalog post to integrate lineage
  • ✅ Read the Data Governance post for the broader framework
  • ✅ Assess your team's lineage maturity
  • ✅ Plan lineage implementation (start small!)

Need help? Carptech implements lineage solutions (dbt, Atlan, DataHub) and provides training. Book consultation to discuss your data traceability needs.


