Cloud & Infrastructure · Updated: July 29, 2025 · 15 min read

Serverless Data Architecture: Scale to Zero, Pay per Query

A guide to building a fully serverless Data Platform: no servers, auto-scaling, pay-per-use. Optimized for startups and businesses looking to reduce ops overhead.

Nguyễn Minh Tuấn


Principal Data Architect

#Serverless · #Data Architecture · #Cloud · #BigQuery · #Snowflake · #Cost Optimization · #Startups

TL;DR

  • Serverless Data Platform = no servers to manage, auto-scaling, pay-per-use
  • Stack: Airbyte Cloud (ingestion) + S3/GCS (storage) + BigQuery/Snowflake (warehouse) + dbt Cloud (transformation) + Metabase/Looker (BI)
  • Advantages: Zero ops, auto-scale, cost-efficient (pay only for what you use), fast setup
  • Disadvantages: Less control, vendor lock-in, unpredictable costs if you are not careful
  • Best for: Startups, unpredictable workloads, proof-of-concepts, small teams
  • Not for: Custom infrastructure needs, ultra-low latency, very high volume (sometimes cheaper to self-manage)
  • Example: E-commerce serverless platform: $605/month, handles 50M events/month, zero DevOps time
  • Savings: 40-60% cheaper than self-managed for unpredictable workloads

Introduction: The Problem with Traditional Infrastructure

Traditional Data Platform setup:

Infrastructure to manage:
- ✅ EC2/VMs for Airflow orchestration
- ✅ EMR/Dataproc clusters for Spark jobs
- ✅ Redshift/self-hosted Postgres cluster
- ✅ Load balancers, monitoring, networking
- ✅ Security patches, upgrades, scaling

Team needed:
- 2 Data Engineers (pipelines)
- 1 DevOps/Platform Engineer (infrastructure)
- 1 week setup time
- Monthly cost: $5K-$10K (infrastructure + team time)

Problems:

  • Ops overhead: 30% of time spent managing infrastructure
  • Scaling complexity: Must provision for peak load, wasting 70% of capacity off-peak
  • Slow setup: 1-2 weeks to stand up infrastructure
  • Fixed costs: Pay 24/7 even when idle

The serverless approach solves this:

Infrastructure to manage: NONE
- ❌ No servers
- ❌ No scaling config
- ❌ No patching

Team needed:
- 1 Data Engineer (focused solely on pipelines)
- Setup time: 1 day
- Monthly cost: $800-$2K (pay per use)

Serverless Principles

1. No Servers to Manage

Traditional:

# Provision EC2 for Airflow
aws ec2 run-instances --instance-type m5.large ...
# Install Airflow, configure, monitor, patch

Serverless:

# dbt Cloud: No infrastructure
# Just write SQL, schedule runs
dbt run --select model

2. Auto-Scale

Traditional:

# Must configure an auto-scaling group
auto_scaling_config = {
    'min_size': 2,
    'max_size': 10,
    'target_cpu': 70
}

Serverless:

# BigQuery scales itself from 0 to 1000s of workers
SELECT * FROM events WHERE date = '2025-01-15'
# BigQuery allocates workers automatically, no configuration needed

3. Pay-per-Use

Traditional:

  • Fixed cost: $5K/month for a Redshift cluster (running 24/7)
  • Utilization: 20% (80% wasted)

Serverless:

  • BigQuery: $5/TB scanned
  • You pay only when you query
  • Example: 10TB queried/month = $50 (100x cheaper)
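The arithmetic above is easy to sketch. A minimal comparison, assuming this post's illustrative figures ($5/TB on-demand, $5K/month for a fixed cluster); current list prices may differ:

```python
# Fixed-cost vs pay-per-use warehouse pricing.
# Rates are the illustrative figures used in this post, not current list prices.
ON_DEMAND_PER_TB = 5.0   # BigQuery on-demand, $/TB scanned
FIXED_CLUSTER = 5000.0   # Redshift-style fixed cluster, $/month

def on_demand_cost(tb_scanned_per_month: float) -> float:
    """Monthly cost when you pay only for data scanned."""
    return tb_scanned_per_month * ON_DEMAND_PER_TB

# 10 TB/month: $50 on-demand vs $5,000 fixed
print(on_demand_cost(10))                  # 50.0
print(FIXED_CLUSTER / on_demand_cost(10))  # 100.0 -> "100x cheaper"
```

The crossover comes only at very high scan volumes, which is why pay-per-use wins for low and unpredictable workloads.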

4. Built-in High Availability

Traditional:

  • Must configure redundancy, backups, and disaster recovery yourself

Serverless:

  • Cloud provider handles HA automatically
  • SLA: 99.9-99.99%

Serverless Data Stack

Architecture Overview

Serverless Data Platform:

[Data Sources]
     ↓
[Airbyte Cloud / Fivetran]  ← Ingestion (serverless)
     ↓
[S3 / GCS]                   ← Storage (serverless)
     ↓
[BigQuery / Snowflake]       ← Warehouse (serverless compute)
     ↓
[dbt Cloud]                  ← Transformation (serverless)
     ↓
[Metabase / Looker]          ← BI (hosted/serverless)

1. Ingestion: Airbyte Cloud / Fivetran

Airbyte Cloud:

# airbyte-config.yaml (illustrative sketch, not the exact Airbyte schema)
source:
  type: postgres
  host: db.company.com
  database: production
  username: readonly_user

destination:
  type: bigquery
  project_id: my-project
  dataset_id: raw_data

schedule: "0 */6 * * *"  # Every 6 hours

Pricing:

  • Airbyte Cloud: $2.50 per credit (1 credit = 1 million rows)
  • Fivetran: $1-$2 per million monthly active rows (MAR)

Benefit: No infrastructure, connectors maintained by vendor.
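The row-based rates above translate directly into a budget estimate. A sketch using the per-credit and per-MAR figures quoted here (check current vendor pricing pages before relying on them):

```python
# Rough ingestion cost estimates from the row-based rates quoted above.
# Rates are illustrative; vendor pricing changes over time.
def airbyte_cost(rows: int, per_credit: float = 2.50) -> float:
    """Airbyte Cloud: 1 credit per 1M rows, at the quoted $/credit."""
    credits = rows / 1_000_000
    return credits * per_credit

def fivetran_cost(mar: int, per_million: float = 1.5) -> float:
    """Fivetran: $1-$2 per million MAR; using a $1.50 midpoint."""
    return mar / 1_000_000 * per_million

print(airbyte_cost(50_000_000))   # 125.0 -> 50M rows/month ~ $125
print(fivetran_cost(50_000_000))  # 75.0
```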

2. Storage: S3 / GCS

# S3 pricing
- Standard: $0.023/GB/month
- Infrequent Access: $0.0125/GB/month
- Glacier (archive): $0.004/GB/month

# GCS pricing similar

Example: 10TB data = $230/month (Standard) or $40/month (Archive).
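The tier math above, as a small helper (per-GB rates from this post, with 1 TB taken as 1,000 GB):

```python
# Monthly S3 storage cost by tier, using this post's per-GB rates
# expressed per TB (1 TB = 1,000 GB): 0.023, 0.0125, 0.004 $/GB.
S3_PER_TB = {"standard": 23.0, "infrequent": 12.5, "glacier": 4.0}

def storage_cost(tb: float, tier: str) -> float:
    """Monthly cost in USD for `tb` terabytes in the given tier."""
    return tb * S3_PER_TB[tier]

print(storage_cost(10, "standard"))  # 230.0
print(storage_cost(10, "glacier"))   # 40.0
```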

3. Warehouse: BigQuery / Snowflake

BigQuery:

-- Pay per query (on-demand)
SELECT user_id, COUNT(*) as events
FROM events
WHERE event_date >= '2025-01-01'
GROUP BY user_id

-- Cost: $5/TB scanned

BigQuery pricing:

  • On-demand: $5/TB scanned (first 1TB free/month)
  • Flat-rate: $2,000/month for 100 slots (reserved capacity)
  • Storage: $0.02/GB/month (active), $0.01/GB/month (long-term)
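At what volume does flat-rate beat on-demand? With the figures quoted above ($5/TB on-demand, $2,000/month flat-rate), the break-even is a one-liner:

```python
# Break-even between BigQuery on-demand and flat-rate pricing,
# using the illustrative rates quoted above.
ON_DEMAND_PER_TB = 5.0   # $/TB scanned
FLAT_RATE = 2000.0       # $/month for reserved slots

def breakeven_tb() -> float:
    """TB scanned per month above which flat-rate is cheaper."""
    return FLAT_RATE / ON_DEMAND_PER_TB

print(breakeven_tb())  # 400.0 -> above ~400 TB/month, consider flat-rate
```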

Snowflake:

-- Serverless compute (virtual warehouses)
-- Auto-suspend when idle, auto-resume on query

Snowflake pricing:

  • Compute: $2-$4/credit (1 credit = 1 hour of X-Small warehouse)
  • Storage: $40/TB/month (compressed)
  • Serverless features: Auto-scaling, auto-suspend

Comparison:

Feature         BigQuery                 Snowflake
Pricing model   Pay-per-query            Pay-per-compute-hour
Auto-suspend    N/A (fully serverless)   Yes (configurable)
Best for        Unpredictable queries    Predictable workloads

4. Transformation: dbt Cloud

# dbt Cloud: Serverless dbt runs
# No Airflow, no EC2

# dbt project
models/
  staging/
    stg_orders.sql
  marts/
    fct_sales.sql

dbt Cloud pricing:

  • Developer: $50/user/month
  • Team: $100/user/month
  • Includes: Scheduler, IDE, CI/CD, logs

Benefit: No need to run Airflow; dbt Cloud handles scheduling.

5. Orchestration: Cloud Functions / Lambda

For simple workflows:

# Cloud Function (GCP): trigger a dbt Cloud job over HTTP
import functions_framework
import requests

# Placeholders - substitute your own account ID, job ID, and API token
DBT_CLOUD_ACCOUNT_ID = 1111
DBT_CLOUD_JOB_ID = 12345
DBT_CLOUD_API_TOKEN = "..."

@functions_framework.http
def trigger_dbt_run(request):
    # dbt Cloud API v2: trigger a run of the job
    resp = requests.post(
        f"https://cloud.getdbt.com/api/v2/accounts/{DBT_CLOUD_ACCOUNT_ID}"
        f"/jobs/{DBT_CLOUD_JOB_ID}/run/",
        headers={"Authorization": f"Token {DBT_CLOUD_API_TOKEN}"},
        json={"cause": "Triggered by Cloud Scheduler"},
    )
    resp.raise_for_status()
    return 'Success', 200

# Trigger: Cloud Scheduler (cron)
# No server, no Airflow

AWS Lambda similar.

Pricing: Free tier covers most use cases (1M requests/month free).

6. BI: Metabase Cloud / Looker

Metabase Cloud:

  • Hosted Metabase
  • Pricing: $85/month for 5 users

Looker (Google Cloud):

  • Fully managed
  • Pricing: $3,000-$5,000/month (enterprise)

Alternatives: Preset (hosted Superset), Mode Analytics.


Advantages of Serverless

✅ 1. Zero Ops

Traditional:

# Weekly tasks
- Patch OS on 10 EC2 instances
- Upgrade Airflow version
- Monitor disk space on Redshift
- Scale up EMR cluster for peak load

Serverless:

  • Cloud provider handles all of this
  • Data team focuses 100% on data pipelines

Time savings: 10-20 hours/month of DevOps time = $1K-$2K/month saved (at $100/hour).

✅ 2. Auto-Scale

Example: Black Friday traffic spike

Traditional:

  • Provision cluster for peak: 10 nodes
  • Off-peak utilization: 20% (80% wasted)

Serverless:

  • BigQuery scales from 0 → 1000s slots automatically
  • Pay only for what you use

✅ 3. Cost-Efficient (for Unpredictable Workloads)

Startup use case: Analytics dashboard, 10 users, 100 queries/day

Traditional:

  • Redshift dc2.large: $180/month (24/7)
  • Airflow on EC2 t3.medium: $30/month

Serverless:

  • BigQuery: 1TB scanned/month = $5/month
  • dbt Cloud: $50/month
  • Total: $55/month (vs $210)

Savings: 74% cheaper.

✅ 4. Fast Setup

Traditional: 1-2 weeks (provision servers, configure networking, setup monitoring)

Serverless: 1 day

# Day 1: Setup serverless stack
1. Create BigQuery project (5 min)
2. Setup Airbyte Cloud connections (30 min)
3. Create dbt project, deploy to dbt Cloud (2 hours)
4. Connect Metabase to BigQuery (30 min)
5. Done! Data flowing, dashboards live.

✅ 5. Built-in Features

Serverless tools come with:

  • Monitoring (BigQuery query logs, dbt Cloud logs)
  • Security (IAM, encryption at rest/in transit)
  • Disaster recovery (automated backups)
  • Compliance (SOC 2, GDPR-ready)

Disadvantages of Serverless

❌ 1. Less Control

Traditional: Full control over infrastructure

# Customize Spark config
spark.executor.memory = 8g
spark.executor.cores = 4

Serverless: Limited customization

-- BigQuery: Cannot control worker memory, cores
-- Must work within BigQuery limits (e.g., 6-hour query timeout)

❌ 2. Vendor Lock-In

BigQuery SQL differs from Snowflake SQL (dialect differences):

-- BigQuery
SELECT DATE_TRUNC(order_date, MONTH) as month

-- Snowflake
SELECT DATE_TRUNC('MONTH', order_date) as month

Migration cost: queries must be rewritten when switching vendors.

❌ 3. Cold Starts (Minor)

Snowflake: The virtual warehouse suspends when idle, so the first query afterwards takes 10-20s while it resumes.

Workaround: Keep warehouse running during business hours (trade-off cost vs latency).

❌ 4. Cost Unpredictability

Danger: a poorly optimized query can scan 100TB = $500 in cost.

Example:

-- BAD: scans the entire table
SELECT * FROM events WHERE user_id = 'user123'
-- Cost: $50 (scan 10TB)

-- GOOD: Partition pruning
SELECT * FROM events
WHERE event_date = '2025-01-15'  -- Partition column
  AND user_id = 'user123'
-- Cost: $0.05 (scan 10GB)

Mitigation: Set budget alerts, use quotas.
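Alongside alerts and quotas, a pre-flight check in pipeline code can catch runaway queries. A hypothetical sketch: in practice the byte count would come from a BigQuery dry run; here it is passed in directly, with cost computed at this post's $5/TB rate:

```python
# Pre-flight budget guard: estimate query cost from bytes to be scanned
# and refuse to run anything over budget. The byte count would normally
# come from a BigQuery dry run; this helper is an illustrative sketch.
ON_DEMAND_PER_TB = 5.0  # $/TB scanned, the figure used throughout this post

def estimated_cost(bytes_scanned: int) -> float:
    """Estimated on-demand cost in USD (1 TB = 1e12 bytes here)."""
    return bytes_scanned / 1e12 * ON_DEMAND_PER_TB

def check_budget(bytes_scanned: int, max_usd: float = 5.0) -> float:
    cost = estimated_cost(bytes_scanned)
    if cost > max_usd:
        raise ValueError(f"query would cost ${cost:.2f}, over ${max_usd:.2f} budget")
    return cost

print(estimated_cost(10 * 10**12))  # 50.0 - the full 10TB scan above
print(check_budget(10 * 10**9))     # ~0.05 - 10GB after partition pruning
```

The full-table scan from the BAD example above would be rejected, while the partition-pruned version passes.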


When to Use Serverless

✅ Best For

1. Startups with Small Teams

Team: 1 Data Engineer
Budget: $1K-$5K/month
Workload: Unpredictable (growing fast)

→ Serverless perfect fit

2. Unpredictable Workloads

Example: Marketing analytics

  • Off-season: 10 queries/day
  • Campaign season: 1,000 queries/day

Serverless scales automatically, pay per use.

3. Proof-of-Concepts

# Need analytics platform in 2 weeks for demo
# No time to setup infrastructure

→ Serverless setup in 1 day

4. Low-Volume Workloads

Example: Internal analytics, 50 queries/day, 500GB scanned/month

BigQuery cost: $2.50/month
vs Redshift: $180/month

→ Serverless 70x cheaper

❌ Not For

1. Custom Infrastructure Needs

Example: A custom ML framework needs a specific GPU config

→ Self-managed EC2/GCP Compute Engine is a better fit.

2. Ultra-Low Latency

Example: Real-time bidding (needs < 50ms response)

→ Dedicated infrastructure with warm caches is a better fit.

3. Very High-Volume (Sometimes)

Example: 1PB scanned/month on BigQuery = $5,000

Redshift reserved instance: $2,000/month (cheaper at this scale).

Rule: if the workload is predictable and high-volume, reserved capacity is cheaper.


Example Architecture: E-commerce Serverless Platform

Requirements

  • Data sources: Shopify (orders), Google Analytics (web events), Facebook Ads
  • Volume: 50M events/month, 100GB new data/day
  • Users: 10 analysts, 50 dashboard viewers
  • Budget: < $1,000/month

Architecture

[Shopify API]       [GA4]         [Facebook Ads API]
       ↓              ↓                   ↓
    [Airbyte Cloud: Ingestion] ← $150/month
                  ↓
           [GCS Bucket: Raw Data] ← $20/month (1TB)
                  ↓
         [BigQuery: Warehouse] ← $300/month (60TB scanned)
                  ↓
           [dbt Cloud: Transformations] ← $50/month (1 developer)
                  ↓
         [Metabase Cloud: BI] ← $85/month (10 users)

Total: $605/month
Zero DevOps time

Implementation

Step 1: Setup BigQuery

# Create project
gcloud projects create ecommerce-analytics

# Create dataset
bq mk --dataset ecommerce-analytics:raw_data
bq mk --dataset ecommerce-analytics:analytics

Step 2: Setup Airbyte Cloud

# Airbyte connections
connections:
  - name: shopify_to_bigquery
    source: shopify
    destination: bigquery.raw_data
    schedule: "0 */1 * * *"  # Hourly

  - name: ga4_to_bigquery
    source: google_analytics_4
    destination: bigquery.raw_data
    schedule: "0 2 * * *"  # Daily at 2 AM

Step 3: dbt Models

-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw_data', 'shopify_orders') }}
)

SELECT
    id AS order_id,
    customer_id,
    CAST(created_at AS TIMESTAMP) AS order_date,
    total_price,
    currency
FROM source

-- models/marts/fct_daily_sales.sql
SELECT
    DATE(order_date) AS sale_date,
    COUNT(DISTINCT order_id) AS orders,
    SUM(total_price) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY 1

Step 4: Deploy to dbt Cloud

# dbt Cloud automatically runs models on schedule
# No Airflow needed

Step 5: Connect Metabase

# Metabase Cloud
# Add BigQuery connection
# Create dashboards
# Share with team

Results

  • Setup time: 1 day
  • Cost: $605/month (vs $3K for self-managed)
  • Ops time: 0 hours/month
  • Handles 50M events/month without scaling issues
  • Query latency: < 3s for 95th percentile

Cost Comparison: Serverless vs Self-Managed

Scenario: Mid-Size Company

Workload:

  • 5TB new data/month
  • 20TB queried/month
  • 5 data engineers
  • 100 dashboard users

Option 1: Self-Managed

Infrastructure:
- Redshift dc2.large (2 nodes): $360/month
- EC2 for Airflow (t3.medium): $30/month
- EC2 for dbt (t3.small): $15/month
- RDS for Airflow metadata (t3.small): $25/month
- Load balancer: $20/month

Team time:
- Platform maintenance: 20 hours/month
- @ $100/hour = $2,000/month

Total: $2,450/month

Option 2: Serverless

Services:
- Airbyte Cloud: $300/month (5TB)
- BigQuery storage (10TB): $200/month
- BigQuery compute (20TB scanned): $100/month
- dbt Cloud (5 developers): $500/month
- Metabase Cloud (100 users): $200/month

Team time:
- Platform maintenance: 0 hours/month

Total: $1,300/month

Savings: $1,150/month (47% cheaper) + zero ops.


Getting Started: Step-by-Step

Week 1: Setup

Day 1-2: Choose Stack

Decision matrix:

Component       Tool            Why
Warehouse       BigQuery        Fully serverless, cheap for low-medium volume
Ingestion       Airbyte Cloud   300+ connectors, serverless
Transformation  dbt Cloud       Industry standard, serverless scheduler
BI              Metabase Cloud  Affordable, easy to use

Day 3-4: Setup Connections

# BigQuery
1. Create GCP project
2. Enable BigQuery API
3. Create datasets (raw, staging, analytics)

# Airbyte Cloud
1. Sign up at airbyte.com/cloud
2. Add sources (databases, APIs, SaaS)
3. Configure destinations (BigQuery)
4. Test connections

# dbt Cloud
1. Sign up at cloud.getdbt.com
2. Connect to BigQuery
3. Create dbt project (git repo)
4. Deploy

Day 5: First Data Pipeline

-- dbt models
-- models/staging/stg_users.sql
SELECT * FROM {{ source('raw', 'users') }}

-- models/marts/fct_user_signups.sql
SELECT
    DATE(created_at) AS signup_date,
    COUNT(*) AS signups
FROM {{ ref('stg_users') }}
GROUP BY 1

Week 2: Optimize & Scale

Cost optimization:

-- Partition tables
CREATE TABLE analytics.events
PARTITION BY DATE(event_date)
CLUSTER BY user_id
AS SELECT * FROM raw.events

-- Query optimization
SELECT * FROM analytics.events
WHERE event_date = '2025-01-15'  -- Partition filter (cheap)
  AND user_id = 'user123'         -- Cluster filter (fast)

Set budget alerts:

# BigQuery budget
gcloud billing budgets create \
  --billing-account=XXXXX \
  --display-name="BigQuery Budget" \
  --budget-amount=500USD \
  --threshold-rule=percent=80

Monitor costs:

-- BigQuery cost by user
SELECT
    user_email,
    SUM(total_bytes_processed) / 1e12 * 5 AS cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1
ORDER BY 2 DESC

Best Practices

1. Optimize for Cost

Partition tables:

-- Partition by date (most common filter)
CREATE TABLE events
PARTITION BY DATE(event_date)
CLUSTER BY user_id, event_type
AS SELECT * FROM raw_events

Benefit: queries scan only the relevant partitions (often 10x cheaper).

Use views sparingly:

-- BAD: View scans entire table every time
CREATE VIEW active_users AS
SELECT * FROM users WHERE last_login > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)

-- GOOD: Materialized table (scheduled refresh)
CREATE TABLE active_users AS
SELECT * FROM users WHERE last_login > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)

2. Set Quotas

BigQuery custom quotas:

# Limit per-user query cost
gcloud services quotas update \
  --service=bigquery.googleapis.com \
  --consumer=projects/my-project \
  --metric=bigquery.googleapis.com/quota/query/usage \
  --unit=1/d/{user} \
  --value=1000  # 1000 queries/day per user

3. Monitor Performance

Query performance:

-- Find slow queries
SELECT
    job_id,
    user_email,
    total_slot_ms / 1000 / 60 AS slot_minutes,
    total_bytes_processed / 1e9 AS gb_processed,
    TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY slot_minutes DESC
LIMIT 20

Optimize slow queries: Add partition filters, use clustering.

4. Design for Serverless

Batch processing (not streaming):

# Good for serverless: hourly batch runs
# (illustrative pseudocode - in practice you configure a cron
#  schedule like "0 * * * *" on the job inside dbt Cloud)
@dbt_cloud_schedule(cron="0 * * * *")
def hourly_aggregations():
    dbt.run(models="fct_hourly_sales")

# Bad for serverless: streaming (use Dataflow/Flink for streaming)

Serverless best for: Batch ELT, ad-hoc analytics, dashboards.


Case Studies

Case Study 1: SaaS Startup (100 customers)

Before (Self-Managed):

  • Postgres RDS: $200/month
  • Metabase on EC2: $30/month
  • Manual exports to S3, ad-hoc queries
  • Pain: Slow queries, DevOps overhead

After (Serverless):

  • Airbyte Cloud (Postgres → BigQuery): $50/month
  • BigQuery: $20/month (4TB scanned)
  • Metabase Cloud: $85/month
  • dbt Cloud: $50/month
  • Total: $205/month
  • Benefits:
    • ✅ Queries 10x faster (BigQuery vs Postgres)
    • ✅ Zero DevOps
    • ✅ Scales automatically as customers grow

Case Study 2: E-commerce (10K orders/day)

Before (Self-Managed):

  • Redshift: $360/month
  • Airflow on EC2: $50/month
  • DevOps time: $1,000/month
  • Total: $1,410/month

After (Serverless):

  • Airbyte Cloud: $200/month
  • BigQuery: $150/month (30TB scanned)
  • dbt Cloud: $100/month (2 developers)
  • Metabase Cloud: $85/month
  • Total: $535/month
  • Savings: $875/month (62% reduction)
  • Benefits:
    • ✅ Setup time: 2 weeks → 2 days
    • ✅ Zero maintenance
    • ✅ Auto-scales for Black Friday traffic

Case Study 3: Marketing Agency (20 clients)

Before:

  • Manual exports from Google Ads, Facebook Ads
  • Google Sheets for reporting
  • 10 hours/week manual work

After (Serverless):

  • Airbyte Cloud (Google Ads + Facebook Ads → BigQuery): $150/month
  • BigQuery: $30/month
  • Looker Studio (free): $0
  • Total: $180/month
  • Benefits:
    • ✅ Automated reporting (vs 10 hours/week manual)
    • ✅ Time savings = $2,000/month (@$50/hour)
    • ✅ ROI: 11x

Conclusion

Key Takeaways

✅ Serverless Data Platform = zero ops, auto-scale, pay-per-use
✅ Stack: Airbyte/Fivetran + S3/GCS + BigQuery/Snowflake + dbt Cloud + Metabase/Looker
✅ Best for: Startups, unpredictable workloads, small teams, proof-of-concepts
✅ Savings: 40-60% cheaper than self-managed for low-medium volume
✅ Trade-offs: Less control, vendor lock-in, cost unpredictability if you don't optimize

Recommendations

For startups:

  • Start serverless (BigQuery + dbt Cloud + Metabase)
  • Optimize as you grow
  • Switch to reserved capacity once the workload becomes predictable

For enterprises:

  • Serverless for:
    • ✅ Proof-of-concepts
    • ✅ Departmental analytics (unpredictable usage)
  • Self-managed for:
    • ✅ Mission-critical workloads (where control is needed)
    • ✅ Very high volume (cheaper at scale)

Migration path:

  1. Start with one pilot use case (e.g., marketing analytics)
  2. Measure cost savings & time savings
  3. Expand to other use cases
  4. Keep some workloads self-managed where needed

Next Steps

Want to set up a serverless Data Platform for your business?

Carptech can help you with:

  • ✅ Assessment & stack recommendations
  • ✅ Setup serverless platform (1-2 weeks)
  • ✅ Cost optimization (reduce 40-60% cloud spend)
  • ✅ Training for your team

📞 Contact Carptech: carptech.vn



Have questions about your Data Platform?

Carptech's team of experts is happy to offer a free consultation on the solution that best fits your business. Book a 60-minute session via Microsoft Teams or send us a contact form.

✓ 100% free • ✓ Microsoft Teams • ✓ No long-term commitment