Data Governance · Updated: June 24, 2025 · 22 min read

Data Catalog: Democratizing Data Discovery với Metadata Management

A comprehensive guide to the Data Catalog - "Google for your data". Covers automated metadata harvesting, business glossary, data lineage, quality scores, and collaboration features, plus a tool comparison (Atlan, Alation, Collibra, DataHub), an implementation roadmap, and adoption strategies.

Ngô Thanh Thảo

Data Governance & Security Lead

[Image: Data catalog visualization showing search interface, metadata management, business glossary, data lineage graphs, quality scores, and collaboration features enabling self-service data discovery across an enterprise]
#Data Catalog · #Metadata Management · #Data Discovery · #Business Glossary · #Data Lineage · #Atlan · #Alation · #DataHub · #Data Governance · #Self-Service Analytics

TL;DR

Data Catalog = "Google for your data" - helps users find, understand, trust, and access data across the organization.

The Problem:

  • Analyst: "Where is customer churn data?"
  • 2 days searching Slack, emailing teams, digging through databases
  • Finally finds it, but unsure if it's correct/up-to-date
  • Forrester: Employees waste 30% of time searching for data

The Solution: Data Catalog

Key Features:

  1. Search: Find datasets by keyword (like Google)
  2. Understand: See schema, descriptions, business context
  3. Trust: Quality scores, certifications, usage stats
  4. Access: Self-service access requests (no IT tickets)
  5. Collaborate: Comments, ratings, Q&A
  6. Lineage: Trace data from source to dashboard

Business Impact:

  • Time savings: 2 days → 15 minutes to find data (case study)
  • Productivity: Data teams spend more time analyzing, less searching
  • Trust: Users confident they're using correct data
  • Governance: Centralized access control + audit trails
  • ROI: $500K/year productivity savings (enterprise with 5000+ tables)

Tools Comparison:

Tool       Type          Price              Best For
Atlan      Commercial    $20K-$100K/year    Modern UI, collaboration
Alation    Commercial    $50K-$200K/year    Enterprise, powerful search
Collibra   Commercial    $100K-$500K/year   Full governance suite
DataHub    Open-source   Free               Tech-savvy teams, DIY
Amundsen   Open-source   Free               Lyft-style, metadata-first
dbt docs   Free          Free               dbt users only

Case study (Vietnamese e-commerce): implemented Atlan for 200+ datasets

  • Before: 2 days average to find data
  • After: 15 minutes average
  • Result: 3 analysts each save 60 hours/month (180 hours total) → ~$158K/year in value

This post walks you through the complete implementation roadmap, from tool selection to adoption.


1. The Data Discovery Problem

1.1. Real-World Scenario

Monday morning, 9 AM:

Marketing Analyst (Slack):
"Hi team, where can I find customer churn data?
Need to analyze churn by segment for exec presentation tomorrow."

Data Engineer (11 AM):
"Try the analytics database. Not sure which table though."

Analyst (2 PM):
"Found 3 tables: customer_churn, churn_predictions, churn_analysis
Which one is correct? What's the difference?"

Data Scientist (5 PM):
"Use churn_predictions - it's the ML model output.
But check with @data-team if it's still running."

Analyst (Next day, 10 AM):
"Model hasn't run in 3 months 😢 Need fresh data.
Can someone help?"

Total time wasted: 2 days (analyst) + 1 hour (engineer) + 30 min (scientist) = 17+ hours

Cost: ~$500 (salary) + missed deadline + poor decision based on stale data

1.2. Symptoms: You Need a Data Catalog If...

"I don't know what data we have"

  • 1000+ tables, nobody knows what's in them
  • Tribal knowledge (only 1 person knows where X is)

"I can't find the data I need"

  • Hours/days searching Slack, emailing teams
  • Multiple sources, don't know which is correct

"I don't trust this data"

  • No quality indicators
  • Don't know when it was last updated
  • Different teams get different results

"I can't understand this data"

  • No documentation
  • Cryptic column names (col_a, field_123)
  • Business context missing

"I can't access the data"

  • Don't know who to ask for permission
  • IT tickets take 1 week+

"We have duplicate/conflicting data"

  • 5 customer tables, all slightly different
  • Which is "source of truth"?

If you nodded to 3 or more of these → you urgently need a catalog

1.3. The Cost of Poor Data Discovery

Forrester Research:

  • Employees spend 30% of their time searching for and verifying data
  • For data team of 10: 3 FTE-equivalents wasted on search

Example calculation (Vietnamese startup, 10-person data team):

Avg salary: $30K/year
30% wasted time = 3 FTE × $30K = $90K/year wasted

Plus:

  • Missed opportunities (slow insights)
  • Wrong decisions (using stale/incorrect data)
  • Duplicate work (rebuilding data that exists)
  • Shadow IT (teams build own data silos)

ROI of catalog: Break-even within 6-12 months


2. What is a Data Catalog?

2.1. Core Concept

Data Catalog = Centralized inventory of data assets with:

  • Metadata: Schema, descriptions, owners, tags
  • Search: Find datasets like Google
  • Lineage: Where data comes from, where it goes
  • Quality: Scores, freshness, certifications
  • Access: Request permissions, track usage
  • Collaboration: Comments, ratings, Q&A

Analogy: Library catalog for books

  • Without catalog: Wander aisles randomly, ask librarian
  • With catalog: Search by title/author/topic, see availability, reserve

2.2. Key Features

1. Search & Discovery

User searches: "customer revenue"

Results:
┌─────────────────────────────────────────────────────┐
│ 📊 fact_customer_revenue (Certified ✅)             │
│ Analytics | Updated 2 hours ago | Quality: 98%     │
│ "Daily customer revenue aggregated by segment"      │
│ Owner: @data-team | 45 users this week             │
│ [View] [Request Access] [Save to Favorites]        │
├─────────────────────────────────────────────────────┤
│ 📊 customer_lifetime_value                          │
│ Analytics | Updated 1 day ago | Quality: 85%       │
│ "LTV calculated using 24-month lookback"           │
│ Owner: @growth-team | 12 users this week           │
├─────────────────────────────────────────────────────┤
│ 📊 stg_customer_orders (Staging)                    │
│ Staging | Updated 6 hours ago | Quality: 92%       │
│ "Raw customer orders from Shopify"                 │
│ Owner: @data-eng | 8 users this week               │
└─────────────────────────────────────────────────────┘

Ranking factors:

  • Relevance: Keyword match in name, description, columns
  • Popularity: Usage frequency
  • Certification: Curated "trusted" datasets
  • Recency: Recently updated
  • Quality: High quality score
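
A minimal sketch of how these signals could be combined into a single ranking score. The weights, field names, and caps here are illustrative, not any vendor's actual algorithm:

from dataclasses import dataclass

@dataclass
class CatalogEntry:
    name: str
    description: str
    is_certified: bool
    quality_score: float      # 0.0 - 1.0
    weekly_users: int
    days_since_update: int

def rank_results(query: str, entries: list[CatalogEntry]) -> list[CatalogEntry]:
    """Score each entry against the query and sort descending (illustrative weights)."""
    def score(e: CatalogEntry) -> float:
        text = f"{e.name} {e.description}".lower()
        relevance = sum(1 for word in query.lower().split() if word in text)
        popularity = min(e.weekly_users / 50, 1.0)     # cap the popularity boost
        recency = 1.0 / (1 + e.days_since_update)      # fresher = higher
        certified = 1.0 if e.is_certified else 0.0
        return 3 * relevance + 2 * certified + 1.5 * popularity + e.quality_score + recency

    return sorted(entries, key=score, reverse=True)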

2. Business Glossary

Define business terms once, link to datasets:

glossary:
  - term: "Monthly Recurring Revenue (MRR)"
    definition: "Sum of all subscription revenue recognized in a given month"
    owner: CFO
    formula: "SUM(subscription_revenue) WHERE billing_period = 'monthly'"
    related_data:
      - analytics.fact_revenue.mrr
      - dashboards.executive_dashboard.mrr_chart
    synonyms: ["Monthly Revenue", "Recurring Revenue"]

  - term: "Active Customer"
    definition: "Customer with at least 1 transaction in last 90 days"
    owner: VP Marketing
    formula: "MAX(order_date) >= CURRENT_DATE - 90"
    related_data:
      - analytics.dim_customers.is_active
      - metrics.active_customers_count

Why critical: Prevents the "Marketing's dashboard doesn't match Finance's" problem

3. Schema & Column Documentation

Table: dim_customers
Description: Customer dimension table with demographic and behavioral attributes

Columns:
┌──────────────────┬──────────┬─────────────────────────────────────┐
│ Column           │ Type     │ Description                         │
├──────────────────┼──────────┼─────────────────────────────────────┤
│ customer_id      │ INT64    │ Unique customer identifier (PK)     │
│ email            │ STRING   │ Customer email (PII - masked)       │
│ first_order_date │ DATE     │ Date of first purchase              │
│ lifetime_value   │ FLOAT64  │ Total revenue from customer (USD)   │
│ churn_score      │ FLOAT64  │ ML churn probability (0-1)          │
│                  │          │ Updated daily by churn model        │
│ segment          │ STRING   │ Customer segment (VIP/Regular/New)  │
│                  │          │ Business logic: See glossary        │
└──────────────────┴──────────┴─────────────────────────────────────┘

Tags: #customer #analytics #production #certified
Owner: @data-team
Steward: @marketing-vp

4. Data Quality Indicators

Quality Score: 95% ✅

Checks (last 24 hours):
  ✅ No null values in customer_id (100% pass)
  ✅ Email format valid (99.8% pass)
  ⚠️ 2% of phone numbers invalid format
  ✅ Lifetime_value >= 0 (100% pass)
  ✅ Updated within 24 hours (last update: 2 hours ago)

Freshness:
  Expected: Daily by 6 AM
  Actual: Daily at 4:30 AM ✅
  SLA: Met 99.5% of time (last 30 days)

Data Quality Trends:
  [Chart showing quality score over time]

5. Data Lineage

Visual graph showing data flow:

Shopify Orders API
    │
    ▼
stg_shopify_orders (Staging)
    │
    ▼
int_orders_with_customers (Intermediate)
    │
    ├─────────────┬─────────────┐
    ▼             ▼             ▼
fact_orders  dim_customers  metrics.daily_revenue
    │             │             │
    ▼             ▼             ▼
Executive Dashboard (Looker)

Click any node → see details, query logs, dependencies
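
Under the hood, the lineage view is a directed graph, and "what breaks downstream if this table changes?" is a simple traversal. A minimal sketch using a plain adjacency dict whose edges mirror the diagram above (a real catalog would query its lineage store instead):

from collections import deque

# Edges: upstream asset -> list of direct downstream assets (mirrors the diagram above)
lineage = {
    "stg_shopify_orders": ["int_orders_with_customers"],
    "int_orders_with_customers": ["fact_orders", "dim_customers", "metrics.daily_revenue"],
    "fact_orders": ["executive_dashboard"],
    "dim_customers": ["executive_dashboard"],
    "metrics.daily_revenue": ["executive_dashboard"],
}

def downstream_impact(start: str) -> set[str]:
    """Breadth-first search over the lineage graph: everything affected by `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(downstream_impact("stg_shopify_orders"))
# {'int_orders_with_customers', 'fact_orders', 'dim_customers',
#  'metrics.daily_revenue', 'executive_dashboard'}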

6. Collaboration

Comments (3):

@analyst_alice (2 days ago):
"Is this table updated real-time or batch?"

  └─ @data_engineer_bob (2 days ago):
     "Batch, updates every hour via dbt pipeline.
      See lineage for details."

@growth_manager (1 week ago):
"Can we add 'lead_source' column? Would help with attribution analysis."

  └─ @data_team (1 week ago):
     "Good idea! Added to Q3 roadmap. Tracking in Jira: DATA-234"

@new_analyst (3 hours ago):
"What's the difference between this and customer_summary table?"

  └─ [Pending response]

Plus: Ratings ⭐⭐⭐⭐⭐ (4.5/5), usage stats, saved queries

7. Access Requests

Self-Service Access Request:

Dataset: analytics.fact_orders
Your current access: None

Request access:
  [ ] Read access
  [ ] Write access (requires approval)

Purpose: [Dropdown]
  - Ad-hoc analysis
  - Building dashboard
  - ML model training
  - Other: ________________

Justification: "Need to analyze order trends for Q2 marketing campaign"

[Submit Request]

─────────────────────────────────────────────────────
Expected approval time: < 4 hours
Approver: @data-steward-marketing

Workflow:

  1. User submits request
  2. Data Steward receives notification
  3. Approve/reject (1 click)
  4. Access automatically provisioned (via IAM integration)
  5. User notified

3. Architecture: How Data Catalogs Work

3.1. Components

┌─────────────────────────────────────────────────────┐
│              Data Catalog System                    │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌──────────────┐  ┌──────────────┐               │
│  │ UI Layer     │  │ API Layer    │               │
│  │ - Search     │  │ - REST APIs  │               │
│  │ - Browse     │  │ - GraphQL    │               │
│  │ - Lineage    │  └──────────────┘               │
│  └──────────────┘                                  │
│         │                    │                      │
│         ▼                    ▼                      │
│  ┌─────────────────────────────────────┐           │
│  │     Catalog Metadata Store          │           │
│  │  - Tables, columns, descriptions    │           │
│  │  - Lineage graph                    │           │
│  │  - Quality scores                   │           │
│  │  - User annotations                 │           │
│  └─────────────────────────────────────┘           │
│         ▲                    ▲                      │
│         │                    │                      │
│  ┌──────────────┐  ┌──────────────┐               │
│  │ Metadata     │  │ Lineage      │               │
│  │ Harvesters   │  │ Parsers      │               │
│  │ (Crawlers)   │  │ (SQL, dbt)   │               │
│  └──────────────┘  └──────────────┘               │
│         │                    │                      │
└─────────┼────────────────────┼─────────────────────┘
          │                    │
          ▼                    ▼
┌──────────────────────────────────────────┐
│         Data Sources                     │
│  - BigQuery, Snowflake, PostgreSQL      │
│  - dbt, Airflow (lineage)               │
│  - Looker, Tableau (dashboards)         │
│  - Git (code documentation)             │
└──────────────────────────────────────────┘

3.2. Metadata Harvesting (Automated)

Crawlers scan data sources and extract metadata:

# Example: BigQuery metadata harvester
from google.cloud import bigquery

def harvest_bigquery_metadata(project_id):
    client = bigquery.Client(project=project_id)

    catalog_entries = []

    # Iterate all datasets
    for dataset in client.list_datasets():
        dataset_id = dataset.dataset_id

        # Iterate all tables
        for table_ref in client.list_tables(dataset_id):
            table = client.get_table(table_ref)

            # Extract metadata
            metadata = {
                'name': f"{project_id}.{dataset_id}.{table.table_id}",
                'type': 'table',
                'schema': [
                    {
                        'name': field.name,
                        'type': field.field_type,
                        'mode': field.mode,
                        'description': field.description or ''
                    }
                    for field in table.schema
                ],
                'row_count': table.num_rows,
                'size_bytes': table.num_bytes,
                'created': table.created.isoformat(),
                'modified': table.modified.isoformat(),
                'description': table.description or '',
                'labels': table.labels or {}
            }

            catalog_entries.append(metadata)

            # Get column stats
            if table.num_rows > 0:
                metadata['column_stats'] = get_column_stats(table_ref)

    return catalog_entries

def get_column_stats(table_ref):
    # Query column statistics (illustrative: these column names fit an
    # orders-style table; a real harvester would build the query per schema)
    query = f"""
        SELECT
            COUNT(*) as total_rows,
            COUNT(DISTINCT customer_id) as unique_customers,
            MIN(order_date) as earliest_order,
            MAX(order_date) as latest_order
        FROM `{table_ref}`
    """
    # Execute and return stats
    ...

# Send to catalog (`catalog` stands in for your catalog tool's ingestion client)
for entry in harvest_bigquery_metadata('my-project'):
    catalog.ingest_metadata(entry)

Harvesting schedule: Hourly/daily (configurable)
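
One way to run that schedule is a small Airflow DAG wrapping the harvester above; a sketch assuming the `harvest_bigquery_metadata` function and the hypothetical `catalog` client from the previous snippet:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_harvest():
    # harvest_bigquery_metadata and catalog are the objects defined above (illustrative)
    for entry in harvest_bigquery_metadata("my-project"):
        catalog.ingest_metadata(entry)

with DAG(
    dag_id="catalog_metadata_harvest",
    start_date=datetime(2025, 6, 1),
    schedule_interval="@hourly",   # or "@daily", per your freshness needs
    catchup=False,
) as dag:
    PythonOperator(task_id="harvest_bigquery", python_callable=run_harvest)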

3.3. Lineage Extraction

Method 1: SQL Parsing

Parse SQL queries to extract dependencies:

import sqlparse

def extract_lineage_from_sql(sql):
    """
    Extract table dependencies from a SQL query.
    Simplified sketch: assumes a single statement with no CTEs or subqueries.
    """
    parsed = sqlparse.parse(sql)[0]

    tables_read = []     # Source tables (FROM, JOIN)
    tables_written = []  # Target tables (INSERT INTO, CREATE TABLE)

    # Simplified extraction logic: the table name follows these keywords.
    # get_next_name() is an illustrative helper that returns the identifier
    # immediately after the given keyword token.
    for token in parsed.tokens:
        if token.ttype is sqlparse.tokens.Keyword:
            if token.value.upper() in ('FROM', 'JOIN'):
                tables_read.append(get_next_name(parsed, token))
            elif token.value.upper() in ('INTO', 'TABLE'):
                # Follows INSERT INTO ... / CREATE TABLE ...
                tables_written.append(get_next_name(parsed, token))

    return {
        'sources': tables_read,
        'targets': tables_written
    }

# Example
sql = """
INSERT INTO analytics.fact_orders
SELECT
    o.order_id,
    c.customer_id,
    o.order_total
FROM staging.orders o
JOIN staging.customers c ON o.customer_id = c.customer_id
"""

lineage = extract_lineage_from_sql(sql)
# {
#   'sources': ['staging.orders', 'staging.customers'],
#   'targets': ['analytics.fact_orders']
# }

Method 2: dbt Integration

dbt automatically generates lineage:

# dbt project
models:
  - name: fact_orders
    description: "Order facts table"
    config:
      meta:
        catalog:
          certified: true
          owner: "@data-team"

    columns:
      - name: order_id
        description: "Unique order identifier"
      - name: customer_id
        description: "Foreign key to dim_customers"

# dbt generates target/manifest.json, which includes the dependency graph
{
  "nodes": {
    "model.my_project.fact_orders": {
      "depends_on": {
        "nodes": [
          "source.my_project.staging.orders",
          "source.my_project.staging.customers"
        ]
      }
    }
  }
}

Catalog ingests manifest.json → builds the lineage graph
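
A minimal sketch of that ingestion step, reading dbt's manifest and emitting one directed edge per dependency; `catalog.add_lineage_edge` is a hypothetical client method:

import json

def ingest_dbt_lineage(manifest_path: str = "target/manifest.json") -> list[tuple[str, str]]:
    """Return (upstream, downstream) edges from dbt's manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        for upstream in node.get("depends_on", {}).get("nodes", []):
            edges.append((upstream, node_id))
    return edges

for upstream, downstream in ingest_dbt_lineage():
    catalog.add_lineage_edge(source=upstream, target=downstream)  # hypothetical catalog client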


4. Tools Comparison: Build vs Buy

4.1. Commercial Tools

Atlan ($20K-$100K/year)

Pros:

  • ✅ Modern UI (Slack-like collaboration)
  • ✅ Easy setup (cloud-native)
  • ✅ Active development (monthly releases)
  • ✅ Good for mid-size companies (50-500 employees)
  • ✅ Embedded lineage, quality, collaboration

Cons:

  • ❌ Expensive for startups
  • ❌ Less customizable than open-source

Best for: Fast-growing startups/scale-ups, modern data stack (dbt, Fivetran, cloud warehouse)


Alation ($50K-$200K/year)

Pros:

  • ✅ Powerful search (best-in-class NLP)
  • ✅ Enterprise features (SSO, audit, governance)
  • ✅ Strong lineage engine
  • ✅ Large customer base (proven at scale)

Cons:

  • ❌ Expensive
  • ❌ Complex setup
  • ❌ UI feels dated (compared to Atlan)

Best for: Large enterprises (500+ employees), complex data environments, strict compliance needs


Collibra ($100K-$500K/year)

Pros:

  • ✅ Comprehensive governance suite (catalog + workflow + privacy)
  • ✅ Strong compliance features (GDPR, PDPA)
  • ✅ Workflow automation (data requests, approvals)

Cons:

  • ❌ Very expensive
  • ❌ Heavy (long implementation: 6-12 months)
  • ❌ Overkill for most companies

Best for: Highly regulated industries (banking, healthcare), enterprises with dedicated governance teams

4.2. Open-Source Tools

DataHub (LinkedIn, Free)

Pros:

  • ✅ Free (open-source)
  • ✅ Active community
  • ✅ Modern architecture (Kafka-based event streaming)
  • ✅ Good lineage support
  • ✅ Cloud-agnostic

Cons:

  • ❌ Requires engineering resources to maintain
  • ❌ Setup complexity (Kubernetes, Kafka, PostgreSQL, Elasticsearch)
  • ❌ UI less polished than commercial tools

Best for: Tech-savvy teams, companies with engineering bandwidth, budget-constrained

Setup example (Docker Compose):

# docker-compose.yml (simplified; a full DataHub deployment also needs Kafka
# and a schema registry - see the DataHub docs for the complete stack)
version: '3'
services:
  datahub-gms:
    image: linkedin/datahub-gms:latest
    ports:
      - "8080:8080"
    environment:
      - EBEAN_DATASOURCE_URL=jdbc:postgresql://postgres:5432/datahub

  datahub-frontend:
    image: linkedin/datahub-frontend-react:latest
    ports:
      - "9002:9002"
    environment:
      - DATAHUB_GMS_HOST=datahub-gms
      - DATAHUB_GMS_PORT=8080

  postgres:
    image: postgres:12
    environment:
      POSTGRES_DB: datahub
      POSTGRES_PASSWORD: datahub

  elasticsearch:
    image: elasticsearch:7.10.1
    environment:
      - discovery.type=single-node

Amundsen (Lyft, Free)

Pros:

  • ✅ Free
  • ✅ Metadata-first approach
  • ✅ Good search
  • ✅ Simpler than DataHub

Cons:

  • ❌ Less active development (slower updates)
  • ❌ Limited lineage
  • ❌ Setup still requires ops

Best for: Companies wanting simpler open-source option


dbt docs (Free, built-in with dbt)

Pros:

  • ✅ 100% free
  • ✅ Zero setup (auto-generated by dbt)
  • ✅ Lineage graph built-in
  • ✅ Column-level docs

Cons:

  • ❌ Only catalogs dbt models (not sources, dashboards, etc.)
  • ❌ No search, collaboration, quality features
  • ❌ Static site (not dynamic database)

Best for: dbt-only shops, MVP catalog before investing in full solution

Example:

# Generate docs
dbt docs generate

# Serve locally
dbt docs serve --port 8080

# Access at http://localhost:8080

4.3. Decision Matrix

Company Size     Budget        Recommendation
< 50 employees   < $10K        dbt docs → DataHub (open-source)
50-200           $10K-$50K     Atlan
200-500          $50K-$150K    Atlan or Alation
500+             $100K+        Alation or Collibra

Vietnamese market: most startups → dbt docs or DataHub; scale-ups → Atlan.


5. Implementation Roadmap (3 Months)

Month 1: Setup & Core Metadata

Week 1: Tool Selection

  • Evaluate 2-3 tools
  • Run POCs with sample datasets
  • Select tool

Week 2: Install & Configure

  • Deploy catalog (cloud or self-hosted)
  • Integrate with data sources (BigQuery, Snowflake, etc.)
  • Setup authentication (SSO)

Week 3: Metadata Harvesting

  • Configure crawlers for all databases
  • Run initial metadata extraction
  • Review results

Week 4: Document Top 20 Datasets

  • Identify most-used tables
  • Add descriptions (table + column level)
  • Assign owners

Deliverables:

  • ✅ Catalog deployed
  • ✅ 100+ datasets ingested
  • ✅ Top 20 documented

Month 2: Enrich & Enable

Week 5: Business Glossary

  • Define 20 critical business terms
  • Link to datasets
  • Publish glossary

Week 6: Data Quality Integration

  • Integrate with dbt tests
  • Display quality scores in the catalog
  • Setup freshness monitoring

Week 7: Lineage

  • Extract lineage from dbt
  • Parse SQL logs for lineage
  • Build lineage graphs

Week 8: Access Control

  • Integrate with IAM (BigQuery, Snowflake)
  • Setup access request workflow
  • Test approval flow

Deliverables:

  • ✅ Glossary published (20 terms)
  • ✅ Quality scores visible
  • ✅ Lineage graphs live
  • ✅ Access requests enabled

Month 3: Adoption & Scale

Week 9: Training

  • Train all data users (1-hour session)
  • Create user guides
  • Office hours for questions

Week 10: Expand Coverage

  • Document more datasets (target: 80% coverage)
  • Add dashboards to catalog (Looker, Tableau)
  • Integrate Airflow (pipeline metadata)

Week 11: Adoption Campaigns

  • Mandate: All new datasets must be documented
  • Incentivize: Leaderboard for most documented datasets
  • Showcase: Share success stories

Week 12: Measure & Iterate

  • Track metrics (searches, active users, time-to-find)
  • Survey users (satisfaction, pain points)
  • Plan improvements

Deliverables:

  • ✅ 80% datasets documented
  • ✅ 60% active users weekly
  • ✅ < 30 min average time to find data

6. Adoption Strategies: Getting Users to Use It

6.1. The Challenge

Common failure: Build catalog → nobody uses it

Why:

  • Old habits (ask colleagues on Slack)
  • Not aware catalog exists
  • Doesn't have data they need
  • Too complicated

6.2. Adoption Tactics

1. Executive Mandate

CEO/CDO announcement:

From: CEO
Subject: New Data Catalog - Mandatory for All Data Work

Team,

Starting next week, all data discovery must go through our new Data Catalog
(catalog.company.com).

This will:
- Save 30% of time previously spent searching for data
- Ensure we're using correct, trusted data
- Improve compliance

Expectation: All new datasets documented within 48 hours of creation.

Training sessions this week (sign up: link).

- CEO

2. Make It the Path of Least Resistance

  • Slack integration: /catalog search customer revenue → results appear directly in Slack
  • Browser extension: Highlight table names in queries → link to catalog
  • IDE plugin: Auto-complete with catalog suggestions
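
A minimal sketch of the /catalog slash command using Slack's Bolt for Python; the catalog search endpoint and result fields are placeholders for whatever your tool actually exposes:

import os
import requests
from slack_bolt import App

app = App(token=os.environ["SLACK_BOT_TOKEN"], signing_secret=os.environ["SLACK_SIGNING_SECRET"])
CATALOG_API = "https://catalog.company.com/api/search"  # placeholder endpoint

@app.command("/catalog")
def handle_catalog_search(ack, respond, command):
    ack()                     # acknowledge within 3 seconds, as Slack requires
    query = command["text"]   # e.g. "search customer revenue"
    results = requests.get(CATALOG_API, params={"q": query}, timeout=10).json()

    lines = [
        f"• *{r['name']}* — quality {r['quality_score']}% — owner {r['owner']}"
        for r in results[:5]
    ]
    respond(f"Top results for `{query}`:\n" + "\n".join(lines))

if __name__ == "__main__":
    app.start(port=3000)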

3. Showcase Quick Wins

Weekly newsletter:

📊 Data Catalog Success Story

This week, @analyst_alice found customer segmentation data in 5 minutes
(used to take 2 days!).

She used it to build dashboard for exec meeting.

Result: CEO loved it, approved $500K marketing campaign.

👉 Start using catalog: catalog.company.com

4. Gamification

Leaderboard:

🏆 Top Data Documenters (This Month)

1. @bob_engineer - 45 tables documented 🥇
2. @alice_analyst - 32 tables documented 🥈
3. @charlie_scientist - 28 tables documented 🥉

Prize: Winner gets $100 Amazon gift card + recognition in all-hands

5. Block Old Paths

  • Disable direct database access (force via catalog + access requests)
  • Auto-reject Slack questions "Where is X data?" → point to catalog

6. Continuous Training

  • Onboarding: All new hires trained on catalog (Day 1)
  • Monthly office hours (Q&A)
  • Video tutorials (< 3 min each)

6.3. Metrics to Track

Usage Metrics:

  • Daily/weekly active users: Target > 60% of data team
  • Searches per day: Trending up = good
  • Click-through rate: Search → view dataset → request access

Coverage Metrics:

  • % datasets documented: Target > 80%
  • % datasets with quality scores: Target > 70%
  • % columns with descriptions: Target > 60%

Value Metrics:

  • Time to find data: Survey users monthly (target < 30 min)
  • Support tickets: Decrease in "where is X?" tickets
  • Data errors: Decrease in incidents from using wrong data

Engagement Metrics:

  • Comments/ratings: Active collaboration = good
  • Saved searches/favorites: Users finding value
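
Most of these metrics roll up from the catalog's own event log; a small illustrative sketch computing two of them from hypothetical search events:

from datetime import datetime, timedelta
from statistics import median

# Hypothetical search-event export from the catalog
events = [
    {"user": "alice", "ts": datetime(2025, 6, 23, 9, 15), "minutes_to_find": 12},
    {"user": "bob",   "ts": datetime(2025, 6, 24, 14, 2), "minutes_to_find": 25},
    {"user": "alice", "ts": datetime(2025, 6, 25, 10, 40), "minutes_to_find": 8},
]

def weekly_active_users(events, as_of=datetime(2025, 6, 26)):
    cutoff = as_of - timedelta(days=7)
    return len({e["user"] for e in events if e["ts"] >= cutoff})

def median_time_to_find(events):
    return median(e["minutes_to_find"] for e in events)

print(weekly_active_users(events), "active users;", median_time_to_find(events), "min median")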

Dashboard example:

Catalog Adoption Dashboard (June 2025)

Users:
  Active users (7 days): 45/60 (75%) ✅
  New users this week: 8
  Power users (10+ searches/week): 12

Coverage:
  Datasets cataloged: 850/1000 (85%) ✅
  Quality scores: 650/850 (76%) ✅
  Documented columns: 4,200/8,000 (53%) ⚠️

Value:
  Avg time to find data: 18 minutes ✅ (down from 2 days)
  Support tickets (data discovery): 3 ↓ (was 25/week)
  User satisfaction: 4.2/5 ⭐

Top Searches (This Week):
  1. customer revenue (125 searches)
  2. churn prediction (87 searches)
  3. marketing attribution (64 searches)

7. Case Study: Vietnamese E-commerce - Data Catalog ROI

7.1. Company Profile

Company: Top 20 e-commerce platform

  • 2M customers, 50K orders/month
  • Data team: 15 people (5 engineers, 10 analysts)
  • Data assets: 200+ tables, 15 dashboards

7.2. Pre-Catalog Pain

Symptoms:

  • Analysts spend 2 days average finding data
  • 30% of time wasted on data discovery
  • Frequent errors (using wrong/stale data)
  • Duplicate work (rebuilding datasets that exist)
  • IT tickets backlog (50+ access requests)

Incident (March 2025):

  • Marketing built campaign based on "active_customers" table
  • Turns out table hasn't updated in 3 months (nobody knew)
  • Campaign targeted churned customers
  • Result: $50K wasted spend + brand damage

Trigger: CEO mandated "fix data chaos within 3 months"

7.3. Implementation (3 Months)

Month 1: Deploy Atlan

  • Chose Atlan (modern UI, good for scale-ups)
  • Pricing: $30K/year (15 users)
  • Setup: 2 weeks (cloud deployment)
  • Integrated BigQuery, PostgreSQL, Looker

Month 2: Document & Enrich

  • Documented top 50 datasets (most-used)
  • Added quality scores from dbt tests
  • Built business glossary (25 terms)
  • Configured lineage extraction

Month 3: Train & Adopt

  • Trained all 15 data team members
  • Mandated: All new datasets must be documented
  • Slack integration: /catalog search X
  • Measured adoption weekly

7.4. Results (After 6 Months)

Time Savings:

  • 2 days → 15 minutes average to find data
  • 30% time wasted → 5% on discovery
  • 3 analysts × 60 hours/month saved = 180 hours/month

Value:

  • Analyst salary: ~$3K/month
  • 180 hours saved = $9K/month = $108K/year

Quality:

  • 0 incidents from using wrong data (was 2-3/month)
  • Prevented: ~$50K/year in errors

Productivity:

  • Analysts build 2x more dashboards (freed time)
  • Faster insights → faster decisions

Access Control:

  • IT tickets: 50 → 5/month (90% reduction)
  • Self-service access requests: 95% approved within 4 hours

ROI:

Cost:
  - Atlan license: $30K/year
  - Implementation time: 1 engineer × 1 month = $5K
  - Training: $2K
  Total: $37K

Benefit:
  - Time savings: $108K/year
  - Prevented errors: $50K/year
  Total: $158K/year

ROI: ($158K - $37K) / $37K = 327%
Payback period: 3 months

CTO Quote:

"Catalog transformed how we work. Analysts now spend time analyzing, not searching. Best $30K we've spent. ROI paid back trong Q1."

7.5. Adoption Stats

Month 3 (Post-Launch):

  • Active users: 8/15 (53%)
  • Searches/day: 25
  • Documented datasets: 80/200 (40%)

Month 6 (Mature):

  • Active users: 14/15 (93%) ✅
  • Searches/day: 120
  • Documented datasets: 180/200 (90%)
  • User satisfaction: 4.5/5 ⭐

Success factors:

  1. Executive sponsorship: CEO championed it
  2. Quick wins: Top 50 datasets documented fast
  3. Training: Hands-on workshops
  4. Enforcement: Mandatory for new datasets
  5. Integration: Slack, IDE plugins

8. Best Practices

8.1. Documentation

Do:

  • ✅ Write for business users (not just technical)
  • ✅ Explain "why" not just "what" (business context)
  • ✅ Link to business glossary terms
  • ✅ Include examples (sample queries)

Don't:

  • ❌ Copy-paste SQL comments (often outdated/wrong)
  • ❌ Use jargon without explaining
  • ❌ Leave columns undocumented

Example:

# Bad
table: fact_orders
description: "Orders table"
columns:
  - name: amt
    type: FLOAT
    description: "amount"

# Good
table: fact_orders
description: |
  Daily order facts for revenue analysis.
  Grain: One row per order.
  Updated: Hourly via ETL pipeline (stg_orders → fact_orders).

  Use for:
  - Revenue reporting
  - Customer analytics
  - Marketing attribution

  DO NOT use for:
  - Real-time dashboards (1 hour lag)
  - Fraud detection (use real-time stream)

columns:
  - name: order_total_usd
    type: FLOAT
    description: |
      Total order value in USD (including tax, shipping, discounts).
      Converted from VND using daily exchange rate.
      See glossary: "Order Total"

      Example: order_total_usd = (subtotal + tax + shipping) - discounts

8.2. Ownership

Rule: Every dataset must have:

  • Owner: Technical person (Data Engineer) - maintains pipeline
  • Steward: Business person (Marketing VP) - defines business rules

Example:

table: dim_customers
owner: "@data-eng-team"  # Technical owner
steward: "@marketing-vp"  # Business owner

responsibilities:
  owner:
    - Maintain pipeline
    - Fix data quality issues
    - Respond to technical questions

  steward:
    - Define business logic
    - Approve access requests
    - Certify data accuracy

8.3. Certification

Problem: 5 customer tables, which is correct?

Solution: Certify "golden" datasets

dim_customers ✅ CERTIFIED
  - Reviewed by Data Governance Council
  - Quality score > 95%
  - Meets all business requirements
  - Official source of truth for customer data

customer_backup ⚠️ DEPRECATED
  - Old table, no longer maintained
  - Use dim_customers instead

customer_sandbox 🧪 EXPERIMENTAL
  - Testing new enrichment logic
  - DO NOT use for production

Certification process:

  1. Owner nominates dataset
  2. Governance Council reviews (quality, documentation, business value)
  3. If approved → Certified badge
  4. Re-review annually

8.4. Quality Monitoring

Integrate catalog với quality tools:

# dbt test results → catalog
def sync_quality_to_catalog(dbt_test_results):
    for test in dbt_test_results:
        catalog.update_quality_score(
            dataset=test['model'],
            score=test['pass_rate'],
            checks=[
                {
                    'name': test['test_name'],
                    'status': 'passed' if test['passed'] else 'failed',
                    'details': test['message']
                }
            ]
        )

# Run after `dbt test --store-failures` has written target/run_results.json
sync_quality_to_catalog(parse_dbt_results())
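
The snippet above assumes a `parse_dbt_results()` helper. A minimal sketch of it, reading dbt's target/run_results.json and mapping each test result into the shape used above (per-test pass/fail only; aggregating a per-model pass rate and resolving tests to models via manifest.json is omitted for brevity):

import json

def parse_dbt_results(path="target/run_results.json"):
    """Map dbt test results into the shape expected by sync_quality_to_catalog."""
    with open(path) as f:
        run_results = json.load(f)

    parsed = []
    for result in run_results["results"]:
        passed = result["status"] == "pass"
        parsed.append({
            "model": result["unique_id"],           # e.g. "test.shop.not_null_orders_order_id"
            "test_name": result["unique_id"].split(".")[-1],
            "passed": passed,
            "pass_rate": 100 if passed else 0,      # aggregate per model upstream if needed
            "message": result.get("message") or "",
        })
    return parsed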

Displayed in the catalog:

Quality Score: 98% ✅

Recent Checks:
  ✅ unique_customer_id (passed)
  ✅ not_null_email (passed)
  ⚠️ email_format (97% passed, 3% failed)

Last Updated: 2 hours ago

Conclusion

A data catalog is not optional - it's the foundation of a data-driven culture.

Key Takeaways:

  1. Time savings are real: 2 days → 15 minutes to find data
  2. Start small: Top 20 datasets, expand gradually
  3. Adoption is critical: The best catalog is useless if nobody uses it
  4. Documentation quality > quantity: 50 well-documented tables > 500 poorly documented
  5. Integrate with workflows: Slack, IDE, approval flows
  6. Measure ROI: Time savings, prevented errors, productivity gains
  7. Open-source viable: DataHub good for budget-constrained teams

Next Steps:

  • ✅ Assess current data discovery pain (survey team)
  • ✅ Evaluate 2-3 tools (Atlan, DataHub, dbt docs)
  • ✅ Start MVP: Document top 20 datasets (even in spreadsheet!)
  • ✅ Read Data Governance for the foundation
  • ✅ Read Data Lineage for a deep dive (upcoming)

Need help? Carptech implements data catalogs (Atlan, DataHub) and provides training. Book a consultation to discuss your data discovery challenges.

