TL;DR
- Serverless Data Platform = không server để manage, tự scale, pay-per-use
- Stack: Airbyte Cloud (ingestion) + S3/GCS (storage) + BigQuery/Snowflake (warehouse) + dbt Cloud (transformation) + Metabase/Looker (BI)
- Advantages: Zero ops, auto-scale, cost-efficient (chỉ pay khi dùng), fast setup
- Disadvantages: Less control, vendor lock-in, cost unpredictable nếu không careful
- Best for: Startups, unpredictable workloads, proof-of-concepts, small teams
- Not for: Custom infrastructure needs, ultra-low latency, very high-volume (sometimes cheaper to self-manage)
- Example: E-commerce serverless platform: $800/month, handles 50M events/month, zero DevOps time
- Savings: 40-60% cheaper than self-managed cho unpredictable workloads
Giới Thiệu: Vấn Đề với Traditional Infrastructure
Traditional Data Platform setup:
Infrastructure to manage:
- ✅ EC2/VMs for Airflow orchestration
- ✅ EMR/Dataproc clusters for Spark jobs
- ✅ Redshift/self-hosted Postgres cluster
- ✅ Load balancers, monitoring, networking
- ✅ Security patches, upgrades, scaling
Team needed:
- 2 Data Engineers (pipelines)
- 1 DevOps/Platform Engineer (infrastructure)
- 1 week setup time
- Monthly cost: $5K-$10K (infrastructure + team time)
Vấn đề:
- ❌ Ops overhead: Spend 30% time managing infrastructure
- ❌ Scaling complexity: Phải provision cho peak load, lãng phí 70% capacity off-peak
- ❌ Slow setup: 1-2 weeks để setup infrastructure
- ❌ Fixed costs: Pay 24/7 even khi không dùng
Serverless approach giải quyết:
Infrastructure to manage: NONE
- ❌ No servers
- ❌ No scaling config
- ❌ No patching
Team needed:
- 1 Data Engineer (chỉ lo pipelines)
- Setup time: 1 day
- Monthly cost: $800-$2K (pay per use)
Serverless Principles
1. No Servers to Manage
Traditional:
# Provision EC2 for Airflow
aws ec2 run-instances --instance-type m5.large ...
# Install Airflow, configure, monitor, patch
Serverless:
# dbt Cloud: No infrastructure
# Just write SQL, schedule runs
dbt run --select model
2. Auto-Scale
Traditional:
# Phải configure auto-scaling group
auto_scaling_config = {
'min_size': 2,
'max_size': 10,
'target_cpu': 70
}
Serverless:
# BigQuery tự scale từ 0 → 1000s workers
SELECT * FROM events WHERE date = '2025-01-15'
# BigQuery tự phân bổ workers, không cần config
3. Pay-per-Use
Traditional:
- Fixed cost: $5K/month cho Redshift cluster (chạy 24/7)
- Utilization: 20% (lãng phí 80%)
Serverless:
- BigQuery: $5/TB scanned
- Chỉ trả tiền khi query
- Ví dụ: 10TB queried/month = $50 (100x cheaper)
4. Built-in High Availability
Traditional:
- Phải configure redundancy, backups, disaster recovery
Serverless:
- Cloud provider handles HA automatically
- SLA: 99.9-99.99%
Serverless Data Stack
Architecture Overview
Serverless Data Platform:
[Data Sources]
↓
[Airbyte Cloud / Fivetran] ← Ingestion (serverless)
↓
[S3 / GCS] ← Storage (serverless)
↓
[BigQuery / Snowflake] ← Warehouse (serverless compute)
↓
[dbt Cloud] ← Transformation (serverless)
↓
[Metabase / Looker] ← BI (hosted/serverless)
1. Ingestion: Airbyte Cloud / Fivetran
Airbyte Cloud:
# airbyte-config.yaml
source: postgres
host: db.company.com
database: production
username: readonly_user
destination: bigquery
project_id: my-project
dataset_id: raw_data
schedule: "0 */6 * * *" # Every 6 hours
Pricing:
- Airbyte Cloud: $2.50 per credit (1 credit = 1 million rows)
- Fivetran: $1-$2 per million monthly active rows (MAR)
Benefit: No infrastructure, connectors maintained by vendor.
2. Storage: S3 / GCS
# S3 pricing
- Standard: $0.023/GB/month
- Infrequent Access: $0.0125/GB/month
- Glacier (archive): $0.004/GB/month
# GCS pricing similar
Example: 10TB data = $230/month (Standard) or $40/month (Archive).
3. Warehouse: BigQuery / Snowflake
BigQuery:
-- Pay per query (on-demand)
SELECT user_id, COUNT(*) as events
FROM events
WHERE event_date >= '2025-01-01'
GROUP BY user_id
-- Cost: $5/TB scanned
BigQuery pricing:
- On-demand: $5/TB scanned (first 1TB free/month)
- Flat-rate: $2,000/month for 100 slots (reserved capacity)
- Storage: $0.02/GB/month (active), $0.01/GB/month (long-term)
Snowflake:
-- Serverless compute (virtual warehouses)
-- Auto-suspend when idle, auto-resume on query
Snowflake pricing:
- Compute: $2-$4/credit (1 credit = 1 hour of X-Small warehouse)
- Storage: $40/TB/month (compressed)
- Serverless features: Auto-scaling, auto-suspend
Comparison:
| Feature | BigQuery | Snowflake |
|---|---|---|
| Pricing model | Pay-per-query | Pay-per-compute-hour |
| Auto-suspend | N/A (fully serverless) | Yes (configurable) |
| Best for | Unpredictable queries | Predictable workloads |
4. Transformation: dbt Cloud
# dbt Cloud: Serverless dbt runs
# No Airflow, no EC2
# dbt project
models/
staging/
stg_orders.sql
marts/
fct_sales.sql
dbt Cloud pricing:
- Developer: $50/user/month
- Team: $100/user/month
- Includes: Scheduler, IDE, CI/CD, logs
Benefit: No need to run Airflow, dbt Cloud handles scheduling.
5. Orchestration: Cloud Functions / Lambda
For simple workflows:
# Cloud Function (GCP)
import functions_framework
from google.cloud import bigquery
@functions_framework.http
def trigger_dbt_run(request):
# Trigger dbt Cloud API
dbt_cloud_api.run_job(job_id=12345)
return 'Success', 200
# Trigger: Cloud Scheduler (cron)
# No server, no Airflow
AWS Lambda similar.
Pricing: Free tier covers most use cases (1M requests/month free).
6. BI: Metabase Cloud / Looker
Metabase Cloud:
- Hosted Metabase
- Pricing: $85/month for 5 users
Looker (Google Cloud):
- Fully managed
- Pricing: $3,000-$5,000/month (enterprise)
Alternatives: Preset (hosted Superset), Mode Analytics.
Advantages of Serverless
✅ 1. Zero Ops
Traditional:
# Weekly tasks
- Patch OS on 10 EC2 instances
- Upgrade Airflow version
- Monitor disk space on Redshift
- Scale up EMR cluster for peak load
Serverless:
- Cloud provider handles all of this
- Data team focuses 100% on data pipelines
Time savings: 10-20 hours/month (DevOps time) = $2K-$4K/month saved.
✅ 2. Auto-Scale
Example: Black Friday traffic spike
Traditional:
- Provision cluster for peak: 10 nodes
- Off-peak utilization: 20% (lãng phí 80%)
Serverless:
- BigQuery scales from 0 → 1000s slots automatically
- Pay only for what you use
✅ 3. Cost-Efficient (for Unpredictable Workloads)
Startup use case: Analytics dashboard, 10 users, 100 queries/day
Traditional:
- Redshift dc2.large: $180/month (24/7)
- Airflow on EC2 t3.medium: $30/month
Serverless:
- BigQuery: 1TB scanned/month = $5/month
- dbt Cloud: $50/month
- Total: $55/month (vs $210)
Savings: 74% cheaper.
✅ 4. Fast Setup
Traditional: 1-2 weeks (provision servers, configure networking, setup monitoring)
Serverless: 1 day
# Day 1: Setup serverless stack
1. Create BigQuery project (5 min)
2. Setup Airbyte Cloud connections (30 min)
3. Create dbt project, deploy to dbt Cloud (2 hours)
4. Connect Metabase to BigQuery (30 min)
5. Done! Data flowing, dashboards live.
✅ 5. Built-in Features
Serverless tools come with:
- Monitoring (BigQuery query logs, dbt Cloud logs)
- Security (IAM, encryption at rest/in transit)
- Disaster recovery (automated backups)
- Compliance (SOC 2, GDPR-ready)
Disadvantages of Serverless
❌ 1. Less Control
Traditional: Full control over infrastructure
# Customize Spark config
spark.executor.memory = 8g
spark.executor.cores = 4
Serverless: Limited customization
-- BigQuery: Cannot control worker memory, cores
-- Must work within BigQuery limits (e.g., 6-hour query timeout)
❌ 2. Vendor Lock-In
BigQuery SQL khác Snowflake SQL (dialect differences):
-- BigQuery
SELECT DATE_TRUNC(order_date, MONTH) as month
-- Snowflake
SELECT DATE_TRUNC('MONTH', order_date) as month
Migration cost: Rewrite queries khi switch vendors.
❌ 3. Cold Starts (Minor)
Snowflake: Virtual warehouse tắt khi idle → First query sau đó mất 10-20s để start.
Workaround: Keep warehouse running during business hours (trade-off cost vs latency).
❌ 4. Cost Unpredictability
Nguy hiểm: Poorly optimized query có thể scan 100TB = $500 cost.
Example:
-- BAD: Scan toàn bộ table
SELECT * FROM events WHERE user_id = 'user123'
-- Cost: $50 (scan 10TB)
-- GOOD: Partition pruning
SELECT * FROM events
WHERE event_date = '2025-01-15' -- Partition column
AND user_id = 'user123'
-- Cost: $0.05 (scan 10GB)
Mitigation: Set budget alerts, use quotas.
When to Use Serverless
✅ Best For
1. Startups với Small Teams
Team: 1 Data Engineer
Budget: $1K-$5K/month
Workload: Unpredictable (growing fast)
→ Serverless perfect fit
2. Unpredictable Workloads
Example: Marketing analytics
- Off-season: 10 queries/day
- Campaign season: 1,000 queries/day
Serverless scales automatically, pay per use.
3. Proof-of-Concepts
# Need analytics platform in 2 weeks for demo
# No time to setup infrastructure
→ Serverless setup in 1 day
4. Low-Volume Workloads
Example: Internal analytics, 50 queries/day, 500GB scanned/month
BigQuery cost: $2.50/month
vs Redshift: $180/month
→ Serverless 70x cheaper
❌ Not For
1. Custom Infrastructure Needs
Example: Custom ML framework cần specific GPU config
→ Self-managed EC2/GCP Compute Engine better.
2. Ultra-Low Latency
Example: Real-time bidding (need < 50ms response)
→ Dedicated infrastructure với warm caches better.
3. Very High-Volume (Sometimes)
Example: 1PB scanned/month on BigQuery = $5,000
Redshift reserved instance: $2,000/month (cheaper at this scale).
Rule: Nếu workload predictable + high-volume → Reserved capacity cheaper.
Example Architecture: E-commerce Serverless Platform
Requirements
- Data sources: Shopify (orders), Google Analytics (web events), Facebook Ads
- Volume: 50M events/month, 100GB new data/day
- Users: 10 analysts, 50 dashboard viewers
- Budget: < $1,000/month
Architecture
[Shopify API] [GA4] [Facebook Ads API]
↓ ↓ ↓
[Airbyte Cloud: Ingestion] ← $150/month
↓
[GCS Bucket: Raw Data] ← $20/month (1TB)
↓
[BigQuery: Warehouse] ← $300/month (60TB scanned)
↓
[dbt Cloud: Transformations] ← $50/month (1 developer)
↓
[Metabase Cloud: BI] ← $85/month (10 users)
Total: $605/month
Zero DevOps time
Implementation
Step 1: Setup BigQuery
# Create project
gcloud projects create ecommerce-analytics
# Create dataset
bq mk --dataset ecommerce-analytics:raw_data
bq mk --dataset ecommerce-analytics:analytics
Step 2: Setup Airbyte Cloud
# Airbyte connections
connections:
- name: shopify_to_bigquery
source: shopify
destination: bigquery.raw_data
schedule: "0 */1 * * *" # Hourly
- name: ga4_to_bigquery
source: google_analytics_4
destination: bigquery.raw_data
schedule: "0 2 * * *" # Daily at 2 AM
Step 3: dbt Models
-- models/staging/stg_orders.sql
WITH source AS (
SELECT * FROM {{ source('raw_data', 'shopify_orders') }}
)
SELECT
id AS order_id,
customer_id,
CAST(created_at AS TIMESTAMP) AS order_date,
total_price,
currency
FROM source
-- models/marts/fct_daily_sales.sql
SELECT
DATE(order_date) AS sale_date,
COUNT(DISTINCT order_id) AS orders,
SUM(total_price) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY 1
Step 4: Deploy to dbt Cloud
# dbt Cloud automatically runs models on schedule
# No Airflow needed
Step 5: Connect Metabase
# Metabase Cloud
# Add BigQuery connection
# Create dashboards
# Share with team
Results
- ✅ Setup time: 1 day
- ✅ Cost: $605/month (vs $3K for self-managed)
- ✅ Ops time: 0 hours/month
- ✅ Handles 50M events/month without scaling issues
- ✅ Query latency: < 3s for 95th percentile
Cost Comparison: Serverless vs Self-Managed
Scenario: Mid-Size Company
Workload:
- 5TB new data/month
- 20TB queried/month
- 5 data engineers
- 100 dashboard users
Option 1: Self-Managed
Infrastructure:
- Redshift dc2.large (2 nodes): $360/month
- EC2 for Airflow (t3.medium): $30/month
- EC2 for dbt (t3.small): $15/month
- RDS for Airflow metadata (t3.small): $25/month
- Load balancer: $20/month
Team time:
- Platform maintenance: 20 hours/month
- @ $100/hour = $2,000/month
Total: $2,450/month
Option 2: Serverless
Services:
- Airbyte Cloud: $300/month (5TB)
- BigQuery storage (10TB): $200/month
- BigQuery compute (20TB scanned): $100/month
- dbt Cloud (5 developers): $500/month
- Metabase Cloud (100 users): $200/month
Team time:
- Platform maintenance: 0 hours/month
Total: $1,300/month
Savings: $1,150/month (47% cheaper) + zero ops.
Getting Started: Step-by-Step
Week 1: Setup
Day 1-2: Choose Stack
Decision matrix:
| Component | Tool | Why |
|---|---|---|
| Warehouse | BigQuery | Fully serverless, cheap for low-medium volume |
| Ingestion | Airbyte Cloud | 300+ connectors, serverless |
| Transformation | dbt Cloud | Industry standard, serverless scheduler |
| BI | Metabase Cloud | Affordable, easy to use |
Day 3-4: Setup Connections
# BigQuery
1. Create GCP project
2. Enable BigQuery API
3. Create datasets (raw, staging, analytics)
# Airbyte Cloud
1. Signup airbyte.com/cloud
2. Add sources (databases, APIs, SaaS)
3. Configure destinations (BigQuery)
4. Test connections
# dbt Cloud
1. Signup cloud.getdbt.com
2. Connect to BigQuery
3. Create dbt project (git repo)
4. Deploy
Day 5: First Data Pipeline
-- dbt models
-- models/staging/stg_users.sql
SELECT * FROM {{ source('raw', 'users') }}
-- models/marts/fct_user_signups.sql
SELECT
DATE(created_at) AS signup_date,
COUNT(*) AS signups
FROM {{ ref('stg_users') }}
GROUP BY 1
Week 2: Optimize & Scale
Cost optimization:
-- Partition tables
CREATE TABLE analytics.events
PARTITION BY DATE(event_date)
CLUSTER BY user_id
AS SELECT * FROM raw.events
-- Query optimization
SELECT * FROM analytics.events
WHERE event_date = '2025-01-15' -- Partition filter (cheap)
AND user_id = 'user123' -- Cluster filter (fast)
Set budget alerts:
# BigQuery budget
gcloud billing budgets create \
--billing-account=XXXXX \
--display-name="BigQuery Budget" \
--budget-amount=500USD \
--threshold-rule=percent=80
Monitor costs:
-- BigQuery cost by user
SELECT
user_email,
SUM(total_bytes_processed) / 1e12 * 5 AS cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1
ORDER BY 2 DESC
Best Practices
1. Optimize for Cost
Partition tables:
-- Partition by date (most common filter)
CREATE TABLE events
PARTITION BY DATE(event_date)
CLUSTER BY user_id, event_type
AS SELECT * FROM raw_events
Benefit: Query only scan relevant partitions (10x cheaper).
Use views sparingly:
-- BAD: View scans entire table every time
CREATE VIEW active_users AS
SELECT * FROM users WHERE last_login > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
-- GOOD: Materialized table (scheduled refresh)
CREATE TABLE active_users AS
SELECT * FROM users WHERE last_login > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
2. Set Quotas
BigQuery custom quotas:
# Limit per-user query cost
gcloud services quotas update \
--service=bigquery.googleapis.com \
--consumer=projects/my-project \
--metric=bigquery.googleapis.com/quota/query/usage \
--unit=1/d/{user} \
--value=1000 # 1000 queries/day per user
3. Monitor Performance
Query performance:
-- Find slow queries
SELECT
job_id,
user_email,
total_slot_ms / 1000 / 60 AS slot_minutes,
total_bytes_processed / 1e9 AS gb_processed,
TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY slot_minutes DESC
LIMIT 20
Optimize slow queries: Add partition filters, use clustering.
4. Design for Serverless
Batch processing (not streaming):
# Good for serverless: Hourly batch
@dbt_cloud_schedule(cron="0 * * * *")
def hourly_aggregations():
dbt.run(models="fct_hourly_sales")
# Bad for serverless: Streaming (use Dataflow/Flink for streaming)
Serverless best for: Batch ELT, ad-hoc analytics, dashboards.
Case Studies
Case Study 1: SaaS Startup (100 customers)
Before (Self-Managed):
- Postgres RDS: $200/month
- Metabase on EC2: $30/month
- Manual exports to S3, ad-hoc queries
- Pain: Slow queries, DevOps overhead
After (Serverless):
- Airbyte Cloud (Postgres → BigQuery): $50/month
- BigQuery: $20/month (4TB scanned)
- Metabase Cloud: $85/month
- dbt Cloud: $50/month
- Total: $205/month
- Benefits:
- ✅ Queries 10x faster (BigQuery vs Postgres)
- ✅ Zero DevOps
- ✅ Scales automatically as customers grow
Case Study 2: E-commerce (10K orders/day)
Before (Self-Managed):
- Redshift: $360/month
- Airflow on EC2: $50/month
- DevOps time: $1,000/month
- Total: $1,410/month
After (Serverless):
- Airbyte Cloud: $200/month
- BigQuery: $150/month (30TB scanned)
- dbt Cloud: $100/month (2 developers)
- Metabase Cloud: $85/month
- Total: $535/month
- Savings: $875/month (62% reduction)
- Benefits:
- ✅ Setup time: 2 weeks → 2 days
- ✅ Zero maintenance
- ✅ Auto-scales for Black Friday traffic
Case Study 3: Marketing Agency (20 clients)
Before:
- Manual exports from Google Ads, Facebook Ads
- Google Sheets for reporting
- 10 hours/week manual work
After (Serverless):
- Airbyte Cloud (Google Ads + Facebook Ads → BigQuery): $150/month
- BigQuery: $30/month
- Looker Studio (free): $0
- Total: $180/month
- Benefits:
- ✅ Automated reporting (vs 10 hours/week manual)
- ✅ Time savings = $2,000/month (@$50/hour)
- ✅ ROI: 11x
Kết Luận
Key Takeaways
✅ Serverless Data Platform = zero ops, auto-scale, pay-per-use ✅ Stack: Airbyte/Fivetran + S3/GCS + BigQuery/Snowflake + dbt Cloud + Metabase/Looker ✅ Best for: Startups, unpredictable workloads, small teams, proof-of-concepts ✅ Savings: 40-60% cheaper than self-managed cho low-medium volume ✅ Trade-offs: Less control, vendor lock-in, cost unpredictability nếu không optimize
Recommendations
Cho startups:
- Start serverless (BigQuery + dbt Cloud + Metabase)
- Optimize as you grow
- Switch to reserved capacity khi workload predictable
Cho enterprises:
- Serverless cho:
- ✅ Proof-of-concepts
- ✅ Departmental analytics (unpredictable usage)
- Self-managed cho:
- ✅ Mission-critical workloads (need control)
- ✅ Very high-volume (cheaper at scale)
Migration path:
- Start với 1 pilot use case (e.g., marketing analytics)
- Measure cost savings & time savings
- Expand to other use cases
- Keep some workloads self-managed nếu cần
Next Steps
Muốn setup serverless Data Platform cho doanh nghiệp của bạn?
Carptech giúp bạn:
- ✅ Assessment & stack recommendations
- ✅ Setup serverless platform (1-2 weeks)
- ✅ Cost optimization (reduce 40-60% cloud spend)
- ✅ Training cho team
📞 Liên hệ Carptech: carptech.vn
Related Posts:




