Cloud & Infrastructure · Updated: July 29, 2025 · 15 min read

Serverless Data Architecture: Scale to Zero, Pay per Query

A guide to building a fully serverless Data Platform: no servers, auto-scaling, pay-per-use. Optimized for startups and businesses looking to reduce ops overhead.

Nguyễn Minh Tuấn


Principal Data Architect

#Serverless · #Data Architecture · #Cloud · #BigQuery · #Snowflake · #Cost Optimization · #Startups

TL;DR

  • Serverless Data Platform = no servers to manage, auto-scaling, pay-per-use
  • Stack: Airbyte Cloud (ingestion) + S3/GCS (storage) + BigQuery/Snowflake (warehouse) + dbt Cloud (transformation) + Metabase/Looker (BI)
  • Advantages: Zero ops, auto-scale, cost-efficient (pay only for what you use), fast setup
  • Disadvantages: Less control, vendor lock-in, unpredictable costs if you are not careful
  • Best for: Startups, unpredictable workloads, proof-of-concepts, small teams
  • Not for: Custom infrastructure needs, ultra-low latency, very high volume (sometimes cheaper to self-manage)
  • Example: E-commerce serverless platform: $605/month, handles 50M events/month, zero DevOps time
  • Savings: 40-60% cheaper than self-managed for unpredictable workloads

Introduction: The Problem with Traditional Infrastructure

Traditional Data Platform setup:

Infrastructure to manage:
- ✅ EC2/VMs for Airflow orchestration
- ✅ EMR/Dataproc clusters for Spark jobs
- ✅ Redshift/self-hosted Postgres cluster
- ✅ Load balancers, monitoring, networking
- ✅ Security patches, upgrades, scaling

Team needed:
- 2 Data Engineers (pipelines)
- 1 DevOps/Platform Engineer (infrastructure)
- 1 week setup time
- Monthly cost: $5K-$10K (infrastructure + team time)

Problems:

  • Ops overhead: 30% of time spent managing infrastructure
  • Scaling complexity: Must provision for peak load, wasting 70% of capacity off-peak
  • Slow setup: 1-2 weeks to stand up infrastructure
  • Fixed costs: Pay 24/7 even when idle

The serverless approach solves this:

Infrastructure to manage: NONE
- ❌ No servers
- ❌ No scaling config
- ❌ No patching

Team needed:
- 1 Data Engineer (focused solely on pipelines)
- Setup time: 1 day
- Monthly cost: $800-$2K (pay per use)

Serverless Principles

1. No Servers to Manage

Traditional:

# Provision EC2 for Airflow
aws ec2 run-instances --instance-type m5.large ...
# Install Airflow, configure, monitor, patch

Serverless:

# dbt Cloud: No infrastructure
# Just write SQL, schedule runs
dbt run --select model

2. Auto-Scale

Traditional:

# Must configure an auto-scaling group
auto_scaling_config = {
    'min_size': 2,
    'max_size': 10,
    'target_cpu': 70
}

Serverless:

# BigQuery scales itself from 0 to 1000s of workers
SELECT * FROM events WHERE date = '2025-01-15'
# BigQuery allocates workers automatically, no configuration needed

3. Pay-per-Use

Traditional:

  • Fixed cost: $5K/month for a Redshift cluster (running 24/7)
  • Utilization: 20% (80% wasted)

Serverless:

  • BigQuery: $5/TB scanned
  • You pay only when you query
  • Example: 10TB queried/month = $50 (100x cheaper)
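The arithmetic above is easy to sketch. A minimal comparison, assuming this post's illustrative figures ($5/TB on-demand, $5K/month for a fixed cluster); current list prices may differ:

```python
# Fixed-cost vs pay-per-use warehouse pricing.
# Rates are the illustrative figures used in this post, not current list prices.
ON_DEMAND_PER_TB = 5.0   # BigQuery on-demand, $/TB scanned
FIXED_CLUSTER = 5000.0   # Redshift-style fixed cluster, $/month

def on_demand_cost(tb_scanned_per_month: float) -> float:
    """Monthly cost when you pay only for data scanned."""
    return tb_scanned_per_month * ON_DEMAND_PER_TB

# 10 TB/month: $50 on-demand vs $5,000 fixed
print(on_demand_cost(10))                  # 50.0
print(FIXED_CLUSTER / on_demand_cost(10))  # 100.0 -> "100x cheaper"
```

The crossover comes only at very high scan volumes, which is why pay-per-use wins for low and unpredictable workloads.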

4. Built-in High Availability

Traditional:

  • Must configure redundancy, backups, and disaster recovery yourself

Serverless:

  • Cloud provider handles HA automatically
  • SLA: 99.9-99.99%

Serverless Data Stack

Architecture Overview

Serverless Data Platform:

[Data Sources]
     ↓
[Airbyte Cloud / Fivetran]  ← Ingestion (serverless)
     ↓
[S3 / GCS]                   ← Storage (serverless)
     ↓
[BigQuery / Snowflake]       ← Warehouse (serverless compute)
     ↓
[dbt Cloud]                  ← Transformation (serverless)
     ↓
[Metabase / Looker]          ← BI (hosted/serverless)

1. Ingestion: Airbyte Cloud / Fivetran

Airbyte Cloud:

# airbyte-config.yaml (illustrative sketch, not the exact Airbyte schema)
source:
  type: postgres
  host: db.company.com
  database: production
  username: readonly_user

destination:
  type: bigquery
  project_id: my-project
  dataset_id: raw_data

schedule: "0 */6 * * *"  # Every 6 hours

Pricing:

  • Airbyte Cloud: $2.50 per credit (1 credit = 1 million rows)
  • Fivetran: $1-$2 per million monthly active rows (MAR)

Benefit: No infrastructure, connectors maintained by vendor.
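The row-based rates above translate directly into a budget estimate. A sketch using the per-credit and per-MAR figures quoted here (check current vendor pricing pages before relying on them):

```python
# Rough ingestion cost estimates from the row-based rates quoted above.
# Rates are illustrative; vendor pricing changes over time.
def airbyte_cost(rows: int, per_credit: float = 2.50) -> float:
    """Airbyte Cloud: 1 credit per 1M rows, at the quoted $/credit."""
    credits = rows / 1_000_000
    return credits * per_credit

def fivetran_cost(mar: int, per_million: float = 1.5) -> float:
    """Fivetran: $1-$2 per million MAR; using a $1.50 midpoint."""
    return mar / 1_000_000 * per_million

print(airbyte_cost(50_000_000))   # 125.0 -> 50M rows/month ~ $125
print(fivetran_cost(50_000_000))  # 75.0
```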

2. Storage: S3 / GCS

# S3 pricing
- Standard: $0.023/GB/month
- Infrequent Access: $0.0125/GB/month
- Glacier (archive): $0.004/GB/month

# GCS pricing similar

Example: 10TB data = $230/month (Standard) or $40/month (Archive).
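The tier math above, as a small helper (per-GB rates from this post, with 1 TB taken as 1,000 GB):

```python
# Monthly S3 storage cost by tier, using this post's per-GB rates
# expressed per TB (1 TB = 1,000 GB): 0.023, 0.0125, 0.004 $/GB.
S3_PER_TB = {"standard": 23.0, "infrequent": 12.5, "glacier": 4.0}

def storage_cost(tb: float, tier: str) -> float:
    """Monthly cost in USD for `tb` terabytes in the given tier."""
    return tb * S3_PER_TB[tier]

print(storage_cost(10, "standard"))  # 230.0
print(storage_cost(10, "glacier"))   # 40.0
```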

3. Warehouse: BigQuery / Snowflake

BigQuery:

-- Pay per query (on-demand)
SELECT user_id, COUNT(*) as events
FROM events
WHERE event_date >= '2025-01-01'
GROUP BY user_id

-- Cost: $5/TB scanned

BigQuery pricing:

  • On-demand: $5/TB scanned (first 1TB free/month)
  • Flat-rate: $2,000/month for 100 slots (reserved capacity)
  • Storage: $0.02/GB/month (active), $0.01/GB/month (long-term)
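At what volume does flat-rate beat on-demand? With the figures quoted above ($5/TB on-demand, $2,000/month flat-rate), the break-even is a one-liner:

```python
# Break-even between BigQuery on-demand and flat-rate pricing,
# using the illustrative rates quoted above.
ON_DEMAND_PER_TB = 5.0   # $/TB scanned
FLAT_RATE = 2000.0       # $/month for reserved slots

def breakeven_tb() -> float:
    """TB scanned per month above which flat-rate is cheaper."""
    return FLAT_RATE / ON_DEMAND_PER_TB

print(breakeven_tb())  # 400.0 -> above ~400 TB/month, consider flat-rate
```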

Snowflake:

-- Serverless compute (virtual warehouses)
-- Auto-suspend when idle, auto-resume on query

Snowflake pricing:

  • Compute: $2-$4/credit (1 credit = 1 hour of X-Small warehouse)
  • Storage: $40/TB/month (compressed)
  • Serverless features: Auto-scaling, auto-suspend

Comparison:

Feature         BigQuery                 Snowflake
Pricing model   Pay-per-query            Pay-per-compute-hour
Auto-suspend    N/A (fully serverless)   Yes (configurable)
Best for        Unpredictable queries    Predictable workloads

4. Transformation: dbt Cloud

# dbt Cloud: Serverless dbt runs
# No Airflow, no EC2

# dbt project
models/
  staging/
    stg_orders.sql
  marts/
    fct_sales.sql

dbt Cloud pricing:

  • Developer: $50/user/month
  • Team: $100/user/month
  • Includes: Scheduler, IDE, CI/CD, logs

Benefit: No need to run Airflow; dbt Cloud handles scheduling.

5. Orchestration: Cloud Functions / Lambda

For simple workflows:

# Cloud Function (GCP): trigger a dbt Cloud job over HTTP
import functions_framework
import requests

# Placeholders - substitute your own account ID, job ID, and API token
DBT_CLOUD_ACCOUNT_ID = 1111
DBT_CLOUD_JOB_ID = 12345
DBT_CLOUD_API_TOKEN = "..."

@functions_framework.http
def trigger_dbt_run(request):
    # dbt Cloud API v2: trigger a run of the job
    resp = requests.post(
        f"https://cloud.getdbt.com/api/v2/accounts/{DBT_CLOUD_ACCOUNT_ID}"
        f"/jobs/{DBT_CLOUD_JOB_ID}/run/",
        headers={"Authorization": f"Token {DBT_CLOUD_API_TOKEN}"},
        json={"cause": "Triggered by Cloud Scheduler"},
    )
    resp.raise_for_status()
    return 'Success', 200

# Trigger: Cloud Scheduler (cron)
# No server, no Airflow

AWS Lambda similar.

Pricing: Free tier covers most use cases (1M requests/month free).

6. BI: Metabase Cloud / Looker

Metabase Cloud:

  • Hosted Metabase
  • Pricing: $85/month for 5 users

Looker (Google Cloud):

  • Fully managed
  • Pricing: $3,000-$5,000/month (enterprise)

Alternatives: Preset (hosted Superset), Mode Analytics.


Advantages of Serverless

✅ 1. Zero Ops

Traditional:

# Weekly tasks
- Patch OS on 10 EC2 instances
- Upgrade Airflow version
- Monitor disk space on Redshift
- Scale up EMR cluster for peak load

Serverless:

  • Cloud provider handles all of this
  • Data team focuses 100% on data pipelines

Time savings: 10-20 hours/month of DevOps time = $1K-$2K/month saved (at $100/hour).

✅ 2. Auto-Scale

Example: Black Friday traffic spike

Traditional:

  • Provision cluster for peak: 10 nodes
  • Off-peak utilization: 20% (80% wasted)

Serverless:

  • BigQuery scales from 0 → 1000s slots automatically
  • Pay only for what you use

✅ 3. Cost-Efficient (for Unpredictable Workloads)

Startup use case: Analytics dashboard, 10 users, 100 queries/day

Traditional:

  • Redshift dc2.large: $180/month (24/7)
  • Airflow on EC2 t3.medium: $30/month

Serverless:

  • BigQuery: 1TB scanned/month = $5/month
  • dbt Cloud: $50/month
  • Total: $55/month (vs $210)

Savings: 74% cheaper.

✅ 4. Fast Setup

Traditional: 1-2 weeks (provision servers, configure networking, setup monitoring)

Serverless: 1 day

# Day 1: Setup serverless stack
1. Create BigQuery project (5 min)
2. Setup Airbyte Cloud connections (30 min)
3. Create dbt project, deploy to dbt Cloud (2 hours)
4. Connect Metabase to BigQuery (30 min)
5. Done! Data flowing, dashboards live.

✅ 5. Built-in Features

Serverless tools come with:

  • Monitoring (BigQuery query logs, dbt Cloud logs)
  • Security (IAM, encryption at rest/in transit)
  • Disaster recovery (automated backups)
  • Compliance (SOC 2, GDPR-ready)

Disadvantages of Serverless

❌ 1. Less Control

Traditional: Full control over infrastructure

# Customize Spark config
spark.executor.memory = 8g
spark.executor.cores = 4

Serverless: Limited customization

-- BigQuery: Cannot control worker memory, cores
-- Must work within BigQuery limits (e.g., 6-hour query timeout)

❌ 2. Vendor Lock-In

BigQuery SQL differs from Snowflake SQL (dialect differences):

-- BigQuery
SELECT DATE_TRUNC(order_date, MONTH) as month

-- Snowflake
SELECT DATE_TRUNC('MONTH', order_date) as month

Migration cost: queries must be rewritten when switching vendors.

❌ 3. Cold Starts (Minor)

Snowflake: The virtual warehouse suspends when idle, so the first query afterwards takes 10-20s while it resumes.

Workaround: Keep warehouse running during business hours (trade-off cost vs latency).

❌ 4. Cost Unpredictability

Danger: a poorly optimized query can scan 100TB = $500 in cost.

Example:

-- BAD: scans the entire table
SELECT * FROM events WHERE user_id = 'user123'
-- Cost: $50 (scan 10TB)

-- GOOD: Partition pruning
SELECT * FROM events
WHERE event_date = '2025-01-15'  -- Partition column
  AND user_id = 'user123'
-- Cost: $0.05 (scan 10GB)

Mitigation: Set budget alerts, use quotas.
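Alongside alerts and quotas, a pre-flight check in pipeline code can catch runaway queries. A hypothetical sketch: in practice the byte count would come from a BigQuery dry run; here it is passed in directly, with cost computed at this post's $5/TB rate:

```python
# Pre-flight budget guard: estimate query cost from bytes to be scanned
# and refuse to run anything over budget. The byte count would normally
# come from a BigQuery dry run; this helper is an illustrative sketch.
ON_DEMAND_PER_TB = 5.0  # $/TB scanned, the figure used throughout this post

def estimated_cost(bytes_scanned: int) -> float:
    """Estimated on-demand cost in USD (1 TB = 1e12 bytes here)."""
    return bytes_scanned / 1e12 * ON_DEMAND_PER_TB

def check_budget(bytes_scanned: int, max_usd: float = 5.0) -> float:
    cost = estimated_cost(bytes_scanned)
    if cost > max_usd:
        raise ValueError(f"query would cost ${cost:.2f}, over ${max_usd:.2f} budget")
    return cost

print(estimated_cost(10 * 10**12))  # 50.0 - the full 10TB scan above
print(check_budget(10 * 10**9))     # ~0.05 - 10GB after partition pruning
```

The full-table scan from the BAD example above would be rejected, while the partition-pruned version passes.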


When to Use Serverless

✅ Best For

1. Startups with Small Teams

Team: 1 Data Engineer
Budget: $1K-$5K/month
Workload: Unpredictable (growing fast)

→ Serverless perfect fit

2. Unpredictable Workloads

Example: Marketing analytics

  • Off-season: 10 queries/day
  • Campaign season: 1,000 queries/day

Serverless scales automatically, pay per use.

3. Proof-of-Concepts

# Need analytics platform in 2 weeks for demo
# No time to setup infrastructure

→ Serverless setup in 1 day

4. Low-Volume Workloads

Example: Internal analytics, 50 queries/day, 500GB scanned/month

BigQuery cost: $2.50/month
vs Redshift: $180/month

→ Serverless 70x cheaper

❌ Not For

1. Custom Infrastructure Needs

Example: A custom ML framework needs a specific GPU config

→ Self-managed EC2/GCP Compute Engine is a better fit.

2. Ultra-Low Latency

Example: Real-time bidding (needs < 50ms response)

→ Dedicated infrastructure with warm caches is a better fit.

3. Very High-Volume (Sometimes)

Example: 1PB scanned/month on BigQuery = $5,000

Redshift reserved instance: $2,000/month (cheaper at this scale).

Rule: if the workload is predictable and high-volume, reserved capacity is cheaper.


Example Architecture: E-commerce Serverless Platform

Requirements

  • Data sources: Shopify (orders), Google Analytics (web events), Facebook Ads
  • Volume: 50M events/month, 100GB new data/day
  • Users: 10 analysts, 50 dashboard viewers
  • Budget: < $1,000/month

Architecture

[Shopify API]       [GA4]         [Facebook Ads API]
       ↓              ↓                   ↓
    [Airbyte Cloud: Ingestion] ← $150/month
                  ↓
           [GCS Bucket: Raw Data] ← $20/month (1TB)
                  ↓
         [BigQuery: Warehouse] ← $300/month (60TB scanned)
                  ↓
           [dbt Cloud: Transformations] ← $50/month (1 developer)
                  ↓
         [Metabase Cloud: BI] ← $85/month (10 users)

Total: $605/month
Zero DevOps time

Implementation

Step 1: Setup BigQuery

# Create project
gcloud projects create ecommerce-analytics

# Create dataset
bq mk --dataset ecommerce-analytics:raw_data
bq mk --dataset ecommerce-analytics:analytics

Step 2: Setup Airbyte Cloud

# Airbyte connections
connections:
  - name: shopify_to_bigquery
    source: shopify
    destination: bigquery.raw_data
    schedule: "0 */1 * * *"  # Hourly

  - name: ga4_to_bigquery
    source: google_analytics_4
    destination: bigquery.raw_data
    schedule: "0 2 * * *"  # Daily at 2 AM

Step 3: dbt Models

-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw_data', 'shopify_orders') }}
)

SELECT
    id AS order_id,
    customer_id,
    CAST(created_at AS TIMESTAMP) AS order_date,
    total_price,
    currency
FROM source

-- models/marts/fct_daily_sales.sql
SELECT
    DATE(order_date) AS sale_date,
    COUNT(DISTINCT order_id) AS orders,
    SUM(total_price) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY 1

Step 4: Deploy to dbt Cloud

# dbt Cloud automatically runs models on schedule
# No Airflow needed

Step 5: Connect Metabase

# Metabase Cloud
# Add BigQuery connection
# Create dashboards
# Share with team

Results

  • Setup time: 1 day
  • Cost: $605/month (vs $3K for self-managed)
  • Ops time: 0 hours/month
  • Handles 50M events/month without scaling issues
  • Query latency: < 3s for 95th percentile

Cost Comparison: Serverless vs Self-Managed

Scenario: Mid-Size Company

Workload:

  • 5TB new data/month
  • 20TB queried/month
  • 5 data engineers
  • 100 dashboard users

Option 1: Self-Managed

Infrastructure:
- Redshift dc2.large (2 nodes): $360/month
- EC2 for Airflow (t3.medium): $30/month
- EC2 for dbt (t3.small): $15/month
- RDS for Airflow metadata (t3.small): $25/month
- Load balancer: $20/month

Team time:
- Platform maintenance: 20 hours/month
- @ $100/hour = $2,000/month

Total: $2,450/month

Option 2: Serverless

Services:
- Airbyte Cloud: $300/month (5TB)
- BigQuery storage (10TB): $200/month
- BigQuery compute (20TB scanned): $100/month
- dbt Cloud (5 developers): $500/month
- Metabase Cloud (100 users): $200/month

Team time:
- Platform maintenance: 0 hours/month

Total: $1,300/month

Savings: $1,150/month (47% cheaper) + zero ops.


Getting Started: Step-by-Step

Week 1: Setup

Day 1-2: Choose Stack

Decision matrix:

Component       Tool            Why
Warehouse       BigQuery        Fully serverless, cheap for low-medium volume
Ingestion       Airbyte Cloud   300+ connectors, serverless
Transformation  dbt Cloud       Industry standard, serverless scheduler
BI              Metabase Cloud  Affordable, easy to use

Day 3-4: Setup Connections

# BigQuery
1. Create GCP project
2. Enable BigQuery API
3. Create datasets (raw, staging, analytics)

# Airbyte Cloud
1. Sign up at airbyte.com/cloud
2. Add sources (databases, APIs, SaaS)
3. Configure destinations (BigQuery)
4. Test connections

# dbt Cloud
1. Sign up at cloud.getdbt.com
2. Connect to BigQuery
3. Create dbt project (git repo)
4. Deploy

Day 5: First Data Pipeline

-- dbt models
-- models/staging/stg_users.sql
SELECT * FROM {{ source('raw', 'users') }}

-- models/marts/fct_user_signups.sql
SELECT
    DATE(created_at) AS signup_date,
    COUNT(*) AS signups
FROM {{ ref('stg_users') }}
GROUP BY 1

Week 2: Optimize & Scale

Cost optimization:

-- Partition tables
CREATE TABLE analytics.events
PARTITION BY DATE(event_date)
CLUSTER BY user_id
AS SELECT * FROM raw.events

-- Query optimization
SELECT * FROM analytics.events
WHERE event_date = '2025-01-15'  -- Partition filter (cheap)
  AND user_id = 'user123'         -- Cluster filter (fast)

Set budget alerts:

# BigQuery budget
gcloud billing budgets create \
  --billing-account=XXXXX \
  --display-name="BigQuery Budget" \
  --budget-amount=500USD \
  --threshold-rule=percent=80

Monitor costs:

-- BigQuery cost by user
SELECT
    user_email,
    SUM(total_bytes_processed) / 1e12 * 5 AS cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1
ORDER BY 2 DESC

Best Practices

1. Optimize for Cost

Partition tables:

-- Partition by date (most common filter)
CREATE TABLE events
PARTITION BY DATE(event_date)
CLUSTER BY user_id, event_type
AS SELECT * FROM raw_events

Benefit: queries scan only the relevant partitions (often 10x cheaper).

Use views sparingly:

-- BAD: View scans entire table every time
CREATE VIEW active_users AS
SELECT * FROM users WHERE last_login > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)

-- GOOD: Materialized table (scheduled refresh)
CREATE TABLE active_users AS
SELECT * FROM users WHERE last_login > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)

2. Set Quotas

BigQuery custom quotas:

# Limit per-user query cost
gcloud services quotas update \
  --service=bigquery.googleapis.com \
  --consumer=projects/my-project \
  --metric=bigquery.googleapis.com/quota/query/usage \
  --unit=1/d/{user} \
  --value=1000  # 1000 queries/day per user

3. Monitor Performance

Query performance:

-- Find slow queries
SELECT
    job_id,
    user_email,
    total_slot_ms / 1000 / 60 AS slot_minutes,
    total_bytes_processed / 1e9 AS gb_processed,
    TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY slot_minutes DESC
LIMIT 20

Optimize slow queries: Add partition filters, use clustering.

4. Design for Serverless

Batch processing (not streaming):

# Good for serverless: hourly batch runs
# (illustrative pseudocode - in practice you configure a cron
#  schedule like "0 * * * *" on the job inside dbt Cloud)
@dbt_cloud_schedule(cron="0 * * * *")
def hourly_aggregations():
    dbt.run(models="fct_hourly_sales")

# Bad for serverless: streaming (use Dataflow/Flink for streaming)

Serverless best for: Batch ELT, ad-hoc analytics, dashboards.


Case Studies

Case Study 1: SaaS Startup (100 customers)

Before (Self-Managed):

  • Postgres RDS: $200/month
  • Metabase on EC2: $30/month
  • Manual exports to S3, ad-hoc queries
  • Pain: Slow queries, DevOps overhead

After (Serverless):

  • Airbyte Cloud (Postgres → BigQuery): $50/month
  • BigQuery: $20/month (4TB scanned)
  • Metabase Cloud: $85/month
  • dbt Cloud: $50/month
  • Total: $205/month
  • Benefits:
    • ✅ Queries 10x faster (BigQuery vs Postgres)
    • ✅ Zero DevOps
    • ✅ Scales automatically as customers grow

Case Study 2: E-commerce (10K orders/day)

Before (Self-Managed):

  • Redshift: $360/month
  • Airflow on EC2: $50/month
  • DevOps time: $1,000/month
  • Total: $1,410/month

After (Serverless):

  • Airbyte Cloud: $200/month
  • BigQuery: $150/month (30TB scanned)
  • dbt Cloud: $100/month (2 developers)
  • Metabase Cloud: $85/month
  • Total: $535/month
  • Savings: $875/month (62% reduction)
  • Benefits:
    • ✅ Setup time: 2 weeks → 2 days
    • ✅ Zero maintenance
    • ✅ Auto-scales for Black Friday traffic

Case Study 3: Marketing Agency (20 clients)

Before:

  • Manual exports from Google Ads, Facebook Ads
  • Google Sheets for reporting
  • 10 hours/week manual work

After (Serverless):

  • Airbyte Cloud (Google Ads + Facebook Ads → BigQuery): $150/month
  • BigQuery: $30/month
  • Looker Studio (free): $0
  • Total: $180/month
  • Benefits:
    • ✅ Automated reporting (vs 10 hours/week manual)
    • ✅ Time savings = $2,000/month (@$50/hour)
    • ✅ ROI: 11x

Conclusion

Key Takeaways

✅ Serverless Data Platform = zero ops, auto-scale, pay-per-use
✅ Stack: Airbyte/Fivetran + S3/GCS + BigQuery/Snowflake + dbt Cloud + Metabase/Looker
✅ Best for: Startups, unpredictable workloads, small teams, proof-of-concepts
✅ Savings: 40-60% cheaper than self-managed for low-medium volume
✅ Trade-offs: Less control, vendor lock-in, cost unpredictability if you don't optimize

Recommendations

For startups:

  • Start serverless (BigQuery + dbt Cloud + Metabase)
  • Optimize as you grow
  • Switch to reserved capacity once the workload becomes predictable

For enterprises:

  • Serverless for:
    • ✅ Proof-of-concepts
    • ✅ Departmental analytics (unpredictable usage)
  • Self-managed for:
    • ✅ Mission-critical workloads (where control is needed)
    • ✅ Very high volume (cheaper at scale)

Migration path:

  1. Start with one pilot use case (e.g., marketing analytics)
  2. Measure cost savings & time savings
  3. Expand to other use cases
  4. Keep some workloads self-managed where needed

Next Steps

Want to set up a serverless Data Platform for your business?

Carptech can help you with:

  • ✅ Assessment & stack recommendations
  • ✅ Setup serverless platform (1-2 weeks)
  • ✅ Cost optimization (reduce 40-60% cloud spend)
  • ✅ Training for your team

📞 Contact Carptech: carptech.vn



Have questions about your Data Platform?

Carptech's team of experts is happy to offer a free consultation on the solution that best fits your business. Book a 60-minute session via Microsoft Teams or send us a contact form.

✓ 100% free • ✓ Microsoft Teams • ✓ No long-term commitment