TL;DR
Data Catalog = "Google for data" - helps users find, understand, trust, and access data across the organization.
The Problem:
- Analyst: "Where is customer churn data?"
- 2 days searching Slack, emailing teams, digging through databases
- Finally finds it, but unsure if it's correct/up-to-date
- Forrester: Employees waste 30% of time searching for data
The Solution: Data Catalog
Key Features:
- Search: Find datasets by keyword (like Google)
- Understand: See schema, descriptions, business context
- Trust: Quality scores, certifications, usage stats
- Access: Self-service access requests (no IT tickets)
- Collaborate: Comments, ratings, Q&A
- Lineage: Trace data from source to dashboard
Business Impact:
- Time savings: 2 days → 15 minutes to find data (case study)
- Productivity: Data teams spend more time analyzing, less searching
- Trust: Users confident they're using correct data
- Governance: Centralized access control + audit trails
- ROI: $500K/year productivity savings (enterprise with 5000+ tables)
Tools Comparison:
| Tool | Type | Price | Best For |
|---|---|---|---|
| Atlan | Commercial | $20K-$100K/year | Modern UI, collaboration |
| Alation | Commercial | $50K-$200K/year | Enterprise, powerful search |
| Collibra | Commercial | $100K-$500K/year | Full governance suite |
| DataHub | Open-source | Free | Tech-savvy teams, DIY |
| Amundsen | Open-source | Free | Lyft-style, metadata-first |
| dbt docs | Open-source | Free | dbt users only |
Case study Vietnamese e-commerce: Implemented Atlan for 200+ datasets
- Before: 2 days average to find data
- After: 15 minutes average
- Result: 3 analysts save 60 hours/month → $108K/year value
This post walks you through a complete implementation roadmap, from tool selection to adoption.
1. The Data Discovery Problem
1.1. Real-World Scenario
Monday morning, 9 AM:
Marketing Analyst (Slack):
"Hi team, where can I find customer churn data?
Need to analyze churn by segment for exec presentation tomorrow."
Data Engineer (11 AM):
"Try the analytics database. Not sure which table though."
Analyst (2 PM):
"Found 3 tables: customer_churn, churn_predictions, churn_analysis
Which one is correct? What's the difference?"
Data Scientist (5 PM):
"Use churn_predictions - it's the ML model output.
But check with @data-team if it's still running."
Analyst (Next day, 10 AM):
"Model hasn't run in 3 months 😢 Need fresh data.
Can someone help?"
Total time wasted: 2 days (analyst) + 1 hour (engineer) + 30 min (scientist) = 17+ hours
Cost: ~$500 (salary) + missed deadline + poor decision based on stale data
1.2. Symptoms: You Need a Data Catalog If...
✅ "I don't know what data we have"
- 1000+ tables, nobody knows what's in them
- Tribal knowledge (only 1 person knows where X is)
✅ "I can't find the data I need"
- Hours/days searching Slack, emailing teams
- Multiple sources, don't know which is correct
✅ "I don't trust this data"
- No quality indicators
- Don't know when it was last updated
- Different teams get different results
✅ "I can't understand this data"
- No documentation
- Cryptic column names (`col_a`, `field_123`)
- Business context missing
✅ "I can't access the data"
- Don't know who to ask for permission
- IT tickets take 1 week+
✅ "We have duplicate/conflicting data"
- 5 customer tables, all slightly different
- Which is "source of truth"?
If you nodded to 3 or more of these → you urgently need a catalog
1.3. The Cost of Poor Data Discovery
Forrester Research:
- Employees spend 30% of time searching for and verifying data
- For data team of 10: 3 FTE-equivalents wasted on search
Example calculation (Vietnamese startup, 10-person data team):
Avg salary: $30K/year
30% wasted time = 3 FTE × $30K = $90K/year wasted
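The calculation above can be generalized to any team; a minimal sketch (the 30% default comes from the Forrester estimate, everything else is an input):

```python
def wasted_discovery_cost(team_size, avg_salary_usd, wasted_fraction=0.30):
    """Annual cost of time lost to data discovery.

    wasted_fraction defaults to Forrester's ~30% estimate.
    """
    fte_equivalents = team_size * wasted_fraction
    return fte_equivalents * avg_salary_usd

# 10-person team at $30K average salary
print(wasted_discovery_cost(10, 30_000))  # 90000.0
```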
Plus:
- Missed opportunities (slow insights)
- Wrong decisions (using stale/incorrect data)
- Duplicate work (rebuilding data that exists)
- Shadow IT (teams build own data silos)
ROI of catalog: break-even within 6-12 months
2. What is a Data Catalog?
2.1. Core Concept
Data Catalog = Centralized inventory of data assets, with:
- Metadata: Schema, descriptions, owners, tags
- Search: Find datasets like Google
- Lineage: Where data comes from, where it goes
- Quality: Scores, freshness, certifications
- Access: Request permissions, track usage
- Collaboration: Comments, ratings, Q&A
Analogy: Library catalog for books
- Without catalog: Wander aisles randomly, ask librarian
- With catalog: Search by title/author/topic, see availability, reserve
2.2. Key Features
1. Search & Discovery
User searches: "customer revenue"
Results:
┌─────────────────────────────────────────────────────┐
│ 📊 fact_customer_revenue (Certified ✅) │
│ Analytics | Updated 2 hours ago | Quality: 98% │
│ "Daily customer revenue aggregated by segment" │
│ Owner: @data-team | 45 users this week │
│ [View] [Request Access] [Save to Favorites] │
├─────────────────────────────────────────────────────┤
│ 📊 customer_lifetime_value │
│ Analytics | Updated 1 day ago | Quality: 85% │
│ "LTV calculated using 24-month lookback" │
│ Owner: @growth-team | 12 users this week │
├─────────────────────────────────────────────────────┤
│ 📊 stg_customer_orders (Staging) │
│ Staging | Updated 6 hours ago | Quality: 92% │
│ "Raw customer orders from Shopify" │
│ Owner: @data-eng | 8 users this week │
└─────────────────────────────────────────────────────┘
Ranking factors:
- Relevance: Keyword match in name, description, columns
- Popularity: Usage frequency
- Certification: Curated "trusted" datasets
- Recency: Recently updated
- Quality: High quality score
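As a sketch, the five ranking factors above might blend into a single score like this (the weights and field names are illustrative, not any vendor's actual formula):

```python
def rank_score(dataset, query_terms):
    """Blend the five ranking factors into one score (weights illustrative)."""
    text = f"{dataset['name']} {dataset['description']}".lower()
    relevance = sum(term.lower() in text for term in query_terms) / len(query_terms)
    popularity = min(dataset['weekly_users'] / 50, 1.0)  # cap at 50 users/week
    certified = 1.0 if dataset['certified'] else 0.0
    recency = 1.0 if dataset['hours_since_update'] <= 24 else 0.5
    quality = dataset['quality_score'] / 100
    return (0.40 * relevance + 0.20 * popularity + 0.15 * certified
            + 0.10 * recency + 0.15 * quality)

certified_table = {'name': 'fact_customer_revenue', 'description': 'Daily customer revenue',
                   'weekly_users': 45, 'certified': True,
                   'hours_since_update': 2, 'quality_score': 98}
staging_table = {'name': 'stg_customer_orders', 'description': 'Raw customer orders',
                 'weekly_users': 8, 'certified': False,
                 'hours_since_update': 6, 'quality_score': 92}

# The certified, popular table outranks the staging table for "customer revenue"
assert rank_score(certified_table, ['customer', 'revenue']) > \
       rank_score(staging_table, ['customer', 'revenue'])
```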
2. Business Glossary
Define business terms once, link to datasets:
```yaml
glossary:
  - term: "Monthly Recurring Revenue (MRR)"
    definition: "Sum of all subscription revenue recognized in a given month"
    owner: CFO
    formula: "SUM(subscription_revenue) WHERE billing_period = 'monthly'"
    related_data:
      - analytics.fact_revenue.mrr
      - dashboards.executive_dashboard.mrr_chart
    synonyms: ["Monthly Revenue", "Recurring Revenue"]
  - term: "Active Customer"
    definition: "Customer with at least 1 transaction in last 90 days"
    owner: VP Marketing
    formula: "MAX(order_date) >= CURRENT_DATE - 90"
    related_data:
      - analytics.dim_customers.is_active
      - metrics.active_customers_count
```
Why critical: prevents "Marketing's dashboard disagrees with Finance's" problems
3. Schema & Column Documentation
Table: dim_customers
Description: Customer dimension table with demographic and behavioral attributes
Columns:
┌──────────────────┬──────────┬─────────────────────────────────────┐
│ Column │ Type │ Description │
├──────────────────┼──────────┼─────────────────────────────────────┤
│ customer_id │ INT64 │ Unique customer identifier (PK) │
│ email │ STRING │ Customer email (PII - masked) │
│ first_order_date │ DATE │ Date of first purchase │
│ lifetime_value │ FLOAT64 │ Total revenue from customer (USD) │
│ churn_score │ FLOAT64 │ ML churn probability (0-1) │
│ │ │ Updated daily by churn model │
│ segment │ STRING │ Customer segment (VIP/Regular/New) │
│ │ │ Business logic: See glossary │
└──────────────────┴──────────┴─────────────────────────────────────┘
Tags: #customer #analytics #production #certified
Owner: @data-team
Steward: @marketing-vp
4. Data Quality Indicators
Quality Score: 95% ✅
Checks (last 24 hours):
✅ No null values in customer_id (100% pass)
✅ Email format valid (99.8% pass)
⚠️ 2% of phone numbers invalid format
✅ Lifetime_value >= 0 (100% pass)
✅ Updated within 24 hours (last update: 2 hours ago)
Freshness:
Expected: Daily by 6 AM
Actual: Daily at 4:30 AM ✅
SLA: Met 99.5% of time (last 30 days)
Data Quality Trends:
[Chart showing quality score over time]
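The freshness check behind that SLA indicator is simple to sketch (function and field names are hypothetical):

```python
from datetime import datetime

def freshness_check(last_update, now, sla_hours=24):
    """Compare a table's last update time against its freshness SLA."""
    age_hours = (now - last_update).total_seconds() / 3600
    return {'age_hours': round(age_hours, 1), 'within_sla': age_hours <= sla_hours}

# Expected daily by 6 AM; last run landed at 4:30 AM, checked at 6:30 AM
status = freshness_check(datetime(2025, 6, 1, 4, 30), datetime(2025, 6, 1, 6, 30))
print(status)  # {'age_hours': 2.0, 'within_sla': True}
```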
5. Data Lineage
Visual graph showing data flow:
```
Shopify Orders API
        │
        ▼
stg_shopify_orders (Staging)
        │
        ▼
int_orders_with_customers (Intermediate)
        │
    ┌───┴─────────────┬─────────────────────┐
    ▼                 ▼                     ▼
fact_orders     dim_customers     metrics.daily_revenue
    │                 │                     │
    └─────────────────┴─────────┬───────────┘
                                ▼
                 Executive Dashboard (Looker)
```
Click any node → see details, query logs, dependencies
6. Collaboration
Comments (3):
@analyst_alice (2 days ago):
"Is this table updated real-time or batch?"
└─ @data_engineer_bob (2 days ago):
"Batch, updates every hour via dbt pipeline.
See lineage for details."
@growth_manager (1 week ago):
"Can we add 'lead_source' column? Would help with attribution analysis."
└─ @data_team (1 week ago):
"Good idea! Added to Q3 roadmap. Tracking in Jira: DATA-234"
@new_analyst (3 hours ago):
"What's the difference between this and customer_summary table?"
└─ [Pending response]
Plus: Ratings ⭐⭐⭐⭐⭐ (4.5/5), usage stats, saved queries
7. Access Requests
Self-Service Access Request:
Dataset: analytics.fact_orders
Your current access: None
Request access:
[ ] Read access
[ ] Write access (requires approval)
Purpose: [Dropdown]
- Ad-hoc analysis
- Building dashboard
- ML model training
- Other: ________________
Justification: "Need to analyze order trends for Q2 marketing campaign"
[Submit Request]
─────────────────────────────────────────────────────
Expected approval time: < 4 hours
Approver: @data-steward-marketing
Workflow:
- User submits request
- Data Steward receives notification
- Approve/reject (1 click)
- Access automatically provisioned (via IAM integration)
- User notified
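The five workflow steps above can be sketched as a single state transition (all names here are hypothetical; a real catalog would call its IAM integration at the provisioning step):

```python
def process_access_request(request, decision):
    """Walk one request through submit → review → provision → notify."""
    log = [f"submitted: {request['user']} wants {request['level']} on {request['dataset']}",
           f"notified approver: {request['approver']}"]
    if decision == 'approve':
        request['status'] = 'approved'
        # A real system would invoke the IAM integration here
        log.append(f"provisioned {request['level']} access on {request['dataset']}")
    else:
        request['status'] = 'rejected'
    log.append(f"notified user: {request['user']} ({request['status']})")
    return request['status'], log

status, log = process_access_request(
    {'user': 'alice', 'dataset': 'analytics.fact_orders',
     'level': 'read', 'approver': 'data-steward-marketing'},
    'approve')
print(status)  # approved
```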
3. Architecture: How Data Catalogs Work
3.1. Components
┌─────────────────────────────────────────────────────┐
│ Data Catalog System │
├─────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ UI Layer │ │ API Layer │ │
│ │ - Search │ │ - REST APIs │ │
│ │ - Browse │ │ - GraphQL │ │
│ │ - Lineage │ └──────────────┘ │
│ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Catalog Metadata Store │ │
│ │ - Tables, columns, descriptions │ │
│ │ - Lineage graph │ │
│ │ - Quality scores │ │
│ │ - User annotations │ │
│ └─────────────────────────────────────┘ │
│ ▲ ▲ │
│ │ │ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Metadata │ │ Lineage │ │
│ │ Harvesters │ │ Parsers │ │
│ │ (Crawlers) │ │ (SQL, dbt) │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
└─────────┼────────────────────┼─────────────────────┘
│ │
▼ ▼
┌──────────────────────────────────────────┐
│ Data Sources │
│ - BigQuery, Snowflake, PostgreSQL │
│ - dbt, Airflow (lineage) │
│ - Looker, Tableau (dashboards) │
│ - Git (code documentation) │
└──────────────────────────────────────────┘
3.2. Metadata Harvesting (Automated)
Crawlers scan data sources and extract metadata:
```python
# Example: BigQuery metadata harvester
from google.cloud import bigquery

def harvest_bigquery_metadata(project_id):
    client = bigquery.Client(project=project_id)
    catalog_entries = []

    # Iterate all datasets
    for dataset in client.list_datasets():
        dataset_id = dataset.dataset_id

        # Iterate all tables in the dataset
        for table_ref in client.list_tables(dataset_id):
            table = client.get_table(table_ref)

            # Extract metadata
            metadata = {
                'name': f"{project_id}.{dataset_id}.{table.table_id}",
                'type': 'table',
                'schema': [
                    {
                        'name': field.name,
                        'type': field.field_type,
                        'mode': field.mode,
                        'description': field.description or ''
                    }
                    for field in table.schema
                ],
                'row_count': table.num_rows,
                'size_bytes': table.num_bytes,
                'created': table.created.isoformat(),
                'modified': table.modified.isoformat(),
                'description': table.description or '',
                'labels': table.labels or {}
            }

            # Profile non-empty tables before registering them
            if table.num_rows > 0:
                metadata['column_stats'] = get_column_stats(table_ref)

            catalog_entries.append(metadata)

    return catalog_entries

def get_column_stats(table_ref):
    # Query column statistics (columns here are illustrative)
    query = f"""
        SELECT
            COUNT(*) as total_rows,
            COUNT(DISTINCT customer_id) as unique_customers,
            MIN(order_date) as earliest_order,
            MAX(order_date) as latest_order
        FROM `{table_ref}`
    """
    # Execute and return stats
    ...

# Send to catalog (catalog client assumed)
for entry in harvest_bigquery_metadata('my-project'):
    catalog.ingest_metadata(entry)
```
Harvesting schedule: Hourly/daily (configurable)
3.3. Lineage Extraction
Method 1: SQL Parsing
Parse SQL queries to extract dependencies:
```python
import sqlparse

def extract_lineage_from_sql(sql):
    """Extract table dependencies from a SQL query."""
    parsed = sqlparse.parse(sql)[0]
    tables_read = []     # Source tables (FROM, JOIN)
    tables_written = []  # Target tables (INSERT, CREATE)

    # Simplified extraction logic; get_next_name resolves the identifier
    # that follows a keyword (a real parser also handles CTEs, subqueries)
    for token in parsed.tokens:
        if token.ttype is sqlparse.tokens.Keyword:
            if token.value.upper() in ('FROM', 'JOIN'):
                tables_read.append(get_next_name(parsed, token))
            elif token.value.upper() in ('INSERT INTO', 'CREATE TABLE'):
                tables_written.append(get_next_name(parsed, token))

    return {
        'sources': tables_read,
        'targets': tables_written
    }

# Example
sql = """
INSERT INTO analytics.fact_orders
SELECT
    o.order_id,
    c.customer_id,
    o.order_total
FROM staging.orders o
JOIN staging.customers c ON o.customer_id = c.customer_id
"""

lineage = extract_lineage_from_sql(sql)
# {
#   'sources': ['staging.orders', 'staging.customers'],
#   'targets': ['analytics.fact_orders']
# }
```
Method 2: dbt Integration
dbt automatically generates lineage:
```yaml
# dbt project
models:
  - name: fact_orders
    description: "Order facts table"
    config:
      meta:
        catalog:
          certified: true
          owner: "@data-team"
    columns:
      - name: order_id
        description: "Unique order identifier"
      - name: customer_id
        description: "Foreign key to dim_customers"
```

dbt records each model's dependencies in its generated `manifest.json`:

```json
{
  "nodes": {
    "model.my_project.fact_orders": {
      "depends_on": {
        "nodes": [
          "source.my_project.staging.orders",
          "source.my_project.staging.customers"
        ]
      }
    }
  }
}
```

Catalog ingests `manifest.json` → builds the lineage graph
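Ingesting that dependency file reduces to flattening each node's `depends_on` list into graph edges; a sketch over the structure shown above:

```python
import json

def lineage_edges(manifest):
    """Flatten dbt's depends_on structure into (upstream, downstream) edges."""
    edges = []
    for node_id, node in manifest.get('nodes', {}).items():
        for upstream in node.get('depends_on', {}).get('nodes', []):
            edges.append((upstream, node_id))
    return edges

manifest = json.loads('''{
  "nodes": {
    "model.my_project.fact_orders": {
      "depends_on": {"nodes": ["source.my_project.staging.orders",
                               "source.my_project.staging.customers"]}
    }
  }
}''')
for edge in lineage_edges(manifest):
    print(edge)
```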
4. Tools Comparison: Build vs Buy
4.1. Commercial Tools
Atlan ($20K-$100K/year)
Pros:
- ✅ Modern UI (Slack-like collaboration)
- ✅ Easy setup (cloud-native)
- ✅ Active development (monthly releases)
- ✅ Good for mid-size companies (50-500 employees)
- ✅ Embedded lineage, quality, collaboration
Cons:
- ❌ Expensive for startups
- ❌ Less customizable than open-source
Best for: Fast-growing startups/scale-ups, modern data stack (dbt, Fivetran, cloud warehouse)
Alation ($50K-$200K/year)
Pros:
- ✅ Powerful search (best-in-class NLP)
- ✅ Enterprise features (SSO, audit, governance)
- ✅ Strong lineage engine
- ✅ Large customer base (proven at scale)
Cons:
- ❌ Expensive
- ❌ Complex setup
- ❌ UI feels dated (compared to Atlan)
Best for: Large enterprises (500+ employees), complex data environments, strict compliance needs
Collibra ($100K-$500K/year)
Pros:
- ✅ Comprehensive governance suite (catalog + workflow + privacy)
- ✅ Strong compliance features (GDPR, PDPA)
- ✅ Workflow automation (data requests, approvals)
Cons:
- ❌ Very expensive
- ❌ Heavy (long implementation: 6-12 months)
- ❌ Overkill for most companies
Best for: Highly regulated industries (banking, healthcare), enterprises with dedicated governance teams
4.2. Open-Source Tools
DataHub (LinkedIn, Free)
Pros:
- ✅ Free (open-source)
- ✅ Active community
- ✅ Modern architecture (Kafka-based event streaming)
- ✅ Good lineage support
- ✅ Cloud-agnostic
Cons:
- ❌ Requires engineering resources to maintain
- ❌ Setup complexity (Kubernetes, Kafka, PostgreSQL, Elasticsearch)
- ❌ UI less polished than commercial tools
Best for: Tech-savvy teams, companies with engineering bandwidth, budget-constrained
Setup example (Docker Compose):
```yaml
# docker-compose.yml
version: '3'
services:
  datahub-gms:
    image: linkedin/datahub-gms:latest
    ports:
      - "8080:8080"
    environment:
      - EBEAN_DATASOURCE_URL=jdbc:postgresql://postgres:5432/datahub
  datahub-frontend:
    image: linkedin/datahub-frontend-react:latest
    ports:
      - "9002:9002"
    environment:
      - DATAHUB_GMS_HOST=datahub-gms
      - DATAHUB_GMS_PORT=8080
  postgres:
    image: postgres:12
    environment:
      POSTGRES_DB: datahub
      POSTGRES_PASSWORD: datahub
  elasticsearch:
    image: elasticsearch:7.10.1
    environment:
      - discovery.type=single-node
```
Amundsen (Lyft, Free)
Pros:
- ✅ Free
- ✅ Metadata-first approach
- ✅ Good search
- ✅ Simpler than DataHub
Cons:
- ❌ Less active development (slower updates)
- ❌ Limited lineage
- ❌ Setup still requires ops
Best for: Companies wanting simpler open-source option
dbt docs (free, built into dbt)
Pros:
- ✅ 100% free
- ✅ Zero setup (auto-generated by dbt)
- ✅ Lineage graph built-in
- ✅ Column-level docs
Cons:
- ❌ Only catalogs dbt models (not sources, dashboards, etc.)
- ❌ No search, collaboration, quality features
- ❌ Static site (not dynamic database)
Best for: dbt-only shops, MVP catalog before investing in full solution
Example:
```bash
# Generate docs
dbt docs generate

# Serve locally
dbt docs serve --port 8080

# Access at http://localhost:8080
```
4.3. Decision Matrix
| Company Size | Budget | Recommendation |
|---|---|---|
| < 50 employees | < $10K | dbt docs → DataHub (open-source) |
| 50-200 | $10K-$50K | Atlan |
| 200-500 | $50K-$150K | Atlan or Alation |
| 500+ | $100K+ | Alation or Collibra |
Vietnamese market: most startups → dbt docs or DataHub. Scale-ups → Atlan.
5. Implementation Roadmap (3 Months)
Month 1: Setup & Core Metadata
Week 1: Tool Selection
- Evaluate 2-3 tools
- Run POCs with sample datasets
- Select tool
Week 2: Install & Configure
- Deploy catalog (cloud or self-hosted)
- Integrate with data sources (BigQuery, Snowflake, etc.)
- Setup authentication (SSO)
Week 3: Metadata Harvesting
- Configure crawlers for all databases
- Run initial metadata extraction
- Review results
Week 4: Document Top 20 Datasets
- Identify most-used tables
- Add descriptions (table + column level)
- Assign owners
Deliverables:
- ✅ Catalog deployed
- ✅ 100+ datasets ingested
- ✅ Top 20 documented
Month 2: Enrich & Enable
Week 5: Business Glossary
- Define 20 critical business terms
- Link to datasets
- Publish glossary
Week 6: Data Quality Integration
- Integrate with dbt tests
- Display quality scores in the catalog
- Setup freshness monitoring
Week 7: Lineage
- Extract lineage from dbt
- Parse SQL logs for lineage
- Build lineage graphs
Week 8: Access Control
- Integrate with IAM (BigQuery, Snowflake)
- Setup access request workflow
- Test approval flow
Deliverables:
- ✅ Glossary published (20 terms)
- ✅ Quality scores visible
- ✅ Lineage graphs live
- ✅ Access requests enabled
Month 3: Adoption & Scale
Week 9: Training
- Train all data users (1-hour session)
- Create user guides
- Office hours for questions
Week 10: Expand Coverage
- Document more datasets (target: 80% coverage)
- Add dashboards to catalog (Looker, Tableau)
- Integrate Airflow (pipeline metadata)
Week 11: Adoption Campaigns
- Mandate: All new datasets must be documented
- Incentivize: Leaderboard for most documented datasets
- Showcase: Share success stories
Week 12: Measure & Iterate
- Track metrics (searches, active users, time-to-find)
- Survey users (satisfaction, pain points)
- Plan improvements
Deliverables:
- ✅ 80% datasets documented
- ✅ 60% active users weekly
- ✅ < 30 min average time to find data
6. Adoption Strategies: Getting Users to Use It
6.1. The Challenge
Common failure: Build catalog → nobody uses it
Why:
- Old habits (ask colleagues on Slack)
- Not aware catalog exists
- Doesn't have data they need
- Too complicated
6.2. Adoption Tactics
1. Executive Mandate
CEO/CDO announcement:
From: CEO
Subject: New Data Catalog - Mandatory for All Data Work
Team,
Starting next week, all data discovery must go through our new Data Catalog
(catalog.company.com).
This will:
- Save 30% of time previously spent searching for data
- Ensure we're using correct, trusted data
- Improve compliance
Expectation: All new datasets documented within 48 hours of creation.
Training sessions this week (sign up: link).
- CEO
2. Make It the Path of Least Resistance
- Slack integration: `/catalog search customer revenue` → results delivered in Slack
- Browser extension: Highlight table names in queries → link to catalog
- IDE plugin: Auto-complete with catalog suggestions
3. Showcase Quick Wins
Weekly newsletter:
📊 Data Catalog Success Story
This week, @analyst_alice found customer segmentation data in 5 minutes
(used to take 2 days!).
She used it to build dashboard for exec meeting.
Result: CEO loved it, approved $500K marketing campaign.
👉 Start using catalog: catalog.company.com
4. Gamification
Leaderboard:
🏆 Top Data Documenters (This Month)
1. @bob_engineer - 45 tables documented 🥇
2. @alice_analyst - 32 tables documented 🥈
3. @charlie_scientist - 28 tables documented 🥉
Prize: Winner gets $100 Amazon gift card + recognition in all-hands
5. Block Old Paths
- Disable direct database access (force via catalog + access requests)
- Auto-reject Slack questions "Where is X data?" → point to catalog
6. Continuous Training
- Onboarding: All new hires trained on catalog (Day 1)
- Monthly office hours (Q&A)
- Video tutorials (< 3 min each)
6.3. Metrics to Track
Usage Metrics:
- Daily/weekly active users: Target > 60% of data team
- Searches per day: Trending up = good
- Click-through rate: Search → view dataset → request access
Coverage Metrics:
- % datasets documented: Target > 80%
- % datasets with quality scores: Target > 70%
- % columns with descriptions: Target > 60%
Value Metrics:
- Time to find data: Survey users monthly (target < 30 min)
- Support tickets: Decrease in "where is X?" tickets
- Data errors: Decrease in incidents from using wrong data
Engagement Metrics:
- Comments/ratings: Active collaboration = good
- Saved searches/favorites: Users finding value
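Most of these metrics fall out of simple aggregations over the catalog's event log; a sketch with an assumed event shape:

```python
def adoption_metrics(search_events, team_size, documented, total_datasets):
    """Compute the headline usage and coverage numbers from raw counts."""
    active_users = {event['user'] for event in search_events}
    return {
        'active_user_pct': round(100 * len(active_users) / team_size),
        'searches': len(search_events),
        'coverage_pct': round(100 * documented / total_datasets),
    }

events = [{'user': 'alice'}, {'user': 'bob'}, {'user': 'alice'}]
print(adoption_metrics(events, team_size=4, documented=850, total_datasets=1000))
# {'active_user_pct': 50, 'searches': 3, 'coverage_pct': 85}
```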
Dashboard example:
Catalog Adoption Dashboard (June 2025)
Users:
Active users (7 days): 45/60 (75%) ✅
New users this week: 8
Power users (10+ searches/week): 12
Coverage:
Datasets cataloged: 850/1000 (85%) ✅
Quality scores: 650/850 (76%) ✅
Documented columns: 4,200/8,000 (53%) ⚠️
Value:
Avg time to find data: 18 minutes ✅ (down from 2 days)
Support tickets (data discovery): 3 ↓ (was 25/week)
User satisfaction: 4.2/5 ⭐
Top Searches (This Week):
1. customer revenue (125 searches)
2. churn prediction (87 searches)
3. marketing attribution (64 searches)
7. Case Study: Vietnamese E-commerce - Data Catalog ROI
7.1. Company Profile
Company: Top 20 e-commerce platform
- 2M customers, 50K orders/month
- Data team: 15 people (5 engineers, 10 analysts)
- Data assets: 200+ tables, 15 dashboards
7.2. Pre-Catalog Pain
Symptoms:
- Analysts spend 2 days average finding data
- 30% of time wasted on data discovery
- Frequent errors (using wrong/stale data)
- Duplicate work (rebuilding datasets that exist)
- IT tickets backlog (50+ access requests)
Incident (March 2025):
- Marketing built campaign based on "active_customers" table
- Turns out table hasn't updated in 3 months (nobody knew)
- Campaign targeted churned customers
- Result: $50K wasted spend + brand damage
Trigger: CEO mandated "fix data chaos within 3 months"
7.3. Implementation (3 Months)
Month 1: Deploy Atlan
- Chose Atlan (modern UI, good for scale-ups)
- Pricing: $30K/year (15 users)
- Setup: 2 weeks (cloud deployment)
- Integrated BigQuery, PostgreSQL, Looker
Month 2: Document & Enrich
- Documented top 50 datasets (most-used)
- Added quality scores from dbt tests
- Built business glossary (25 terms)
- Configured lineage extraction
Month 3: Train & Adopt
- Trained all 15 data team members
- Mandated: All new datasets must be documented
- Slack integration: `/catalog search X`
- Measured adoption weekly
7.4. Results (After 6 Months)
Time Savings:
- 2 days → 15 minutes average to find data
- 30% time wasted → 5% on discovery
- 3 analysts × 60 hours/month saved = 180 hours/month
Value:
- Analyst salary: ~$3K/month
- 180 hours saved = $9K/month = $108K/year
Quality:
- 0 incidents from using wrong data (was 2-3/month)
- Prevented: ~$50K/year in errors
Productivity:
- Analysts build 2x more dashboards (freed time)
- Faster insights → faster decisions
Access Control:
- IT tickets: 50 → 5/month (90% reduction)
- Self-service access requests: 95% approved within 4 hours
ROI:
Cost:
- Atlan license: $30K/year
- Implementation time: 1 engineer × 1 month = $5K
- Training: $2K
Total: $37K
Benefit:
- Time savings: $108K/year
- Prevented errors: $50K/year
Total: $158K/year
ROI: ($158K - $37K) / $37K = 327%
Payback period: 3 months
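The ROI arithmetic above can be reproduced directly:

```python
def catalog_roi(annual_cost, annual_benefit):
    """ROI percentage and payback period in months."""
    roi_pct = (annual_benefit - annual_cost) / annual_cost * 100
    payback_months = annual_cost / (annual_benefit / 12)
    return round(roi_pct), round(payback_months, 1)

roi, payback = catalog_roi(37_000, 158_000)
print(roi, payback)  # 327 2.8
```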
CTO Quote:
"Catalog transformed how we work. Analysts now spend time analyzing, not searching. Best $30K we've spent. ROI paid back within Q1."
7.5. Adoption Stats
Month 3 (Post-Launch):
- Active users: 8/15 (53%)
- Searches/day: 25
- Documented datasets: 80/200 (40%)
Month 6 (Mature):
- Active users: 14/15 (93%) ✅
- Searches/day: 120
- Documented datasets: 180/200 (90%)
- User satisfaction: 4.5/5 ⭐
Success factors:
- Executive sponsorship: CEO championed it
- Quick wins: Top 50 datasets documented fast
- Training: Hands-on workshops
- Enforcement: Mandatory for new datasets
- Integration: Slack, IDE plugins
8. Best Practices
8.1. Documentation
Do:
- ✅ Write for business users (not just technical)
- ✅ Explain "why" not just "what" (business context)
- ✅ Link to business glossary terms
- ✅ Include examples (sample queries)
Don't:
- ❌ Copy-paste SQL comments (often outdated/wrong)
- ❌ Use jargon without explaining
- ❌ Leave columns undocumented
Example:
```yaml
# Bad
table: fact_orders
description: "Orders table"
columns:
  - name: amt
    type: FLOAT
    description: "amount"
```

```yaml
# Good
table: fact_orders
description: |
  Daily order facts for revenue analysis.
  Grain: One row per order.
  Updated: Hourly via ETL pipeline (stg_orders → fact_orders).

  Use for:
  - Revenue reporting
  - Customer analytics
  - Marketing attribution

  DO NOT use for:
  - Real-time dashboards (1 hour lag)
  - Fraud detection (use real-time stream)
columns:
  - name: order_total_usd
    type: FLOAT
    description: |
      Total order value in USD (including tax, shipping, discounts).
      Converted from VND using daily exchange rate.
      See glossary: "Order Total"
      Example: order_total_usd = (subtotal + tax + shipping) - discounts
```
8.2. Ownership
Rule: Every dataset must have:
- Owner: Technical person (Data Engineer) - maintains pipeline
- Steward: Business person (Marketing VP) - defines business rules
```yaml
table: dim_customers
owner: "@data-eng-team"     # Technical owner
steward: "@marketing-vp"    # Business owner

responsibilities:
  owner:
    - Maintain pipeline
    - Fix data quality issues
    - Respond to technical questions
  steward:
    - Define business logic
    - Approve access requests
    - Certify data accuracy
```
8.3. Certification
Problem: 5 customer tables, which is correct?
Solution: Certify "golden" datasets
dim_customers ✅ CERTIFIED
- Reviewed by Data Governance Council
- Quality score > 95%
- Meets all business requirements
- Official source of truth for customer data
customer_backup ⚠️ DEPRECATED
- Old table, no longer maintained
- Use dim_customers instead
customer_sandbox 🧪 EXPERIMENTAL
- Testing new enrichment logic
- DO NOT use for production
Certification process:
- Owner nominates dataset
- Governance Council reviews (quality, documentation, business value)
- If approved → Certified badge
- Re-review annually
8.4. Quality Monitoring
Integrate the catalog with quality tools:
```python
# dbt test results → catalog (catalog client and result fields assumed)
def sync_quality_to_catalog(dbt_test_results):
    for test in dbt_test_results:
        catalog.update_quality_score(
            dataset=test['model'],
            score=test['pass_rate'],
            checks=[
                {
                    'name': test['test_name'],
                    'status': 'passed' if test['passed'] else 'failed',
                    'details': test['message']
                }
            ]
        )

# Run after `dbt test --store-failures`
sync_quality_to_catalog(parse_dbt_results())
```
Displayed in the catalog:
Quality Score: 98% ✅
Recent Checks:
✅ unique_customer_id (passed)
✅ not_null_email (passed)
⚠️ email_format (97% passed, 3% failed)
Last Updated: 2 hours ago
Conclusion
A data catalog is not optional - it's the foundation of a data-driven culture.
Key Takeaways:
- Time savings are real: 2 days → 15 minutes to find data
- Start small: Top 20 datasets, expand gradually
- Adoption is critical: the best catalog is useless if nobody uses it
- Documentation quality > quantity: 50 well-documented tables > 500 poorly documented
- Integrate with workflows: Slack, IDE, approval flows
- Measure ROI: Time savings, prevented errors, productivity gains
- Open-source viable: DataHub good for budget-constrained teams
Next Steps:
- ✅ Assess current data discovery pain (survey team)
- ✅ Evaluate 2-3 tools (Atlan, DataHub, dbt docs)
- ✅ Start MVP: Document top 20 datasets (even in spreadsheet!)
- ✅ Read Data Governance for the governance foundation
- ✅ Read Data Lineage for a deep-dive (upcoming)
Need help? Carptech implements data catalogs (Atlan, DataHub) and provides training. Book consultation to discuss your data discovery challenges.
Related Posts:
- Data Governance 101: Framework cho Doanh Nghiệp
- Data Lineage: Traceability từ Source đến Dashboard - Deep-dive into lineage (upcoming)
- dbt Best Practices: Transform Data Like a Pro - dbt documentation practices
- From BI to AI: Analytics Maturity Evolution - Self-service analytics