Cloud & Infrastructure · Updated: July 15, 2025 · 19 min read

Cost Optimization for Data Platforms: Cut Cloud Bills by 40-60%

A comprehensive guide to cloud cost optimization for Data Platforms. Explore 5 optimization strategies (storage tiers, compute right-sizing, reserved capacity, data lifecycle, monitoring), query optimization techniques, FinOps culture, and case studies that cut cloud costs by 40-60%.

Hoàng Quốc Việt

Senior Data Platform Engineer

[Image: Cloud cost optimization dashboard showing cost breakdown, optimization opportunities, budget alerts, and savings achieved through storage tiers, compute optimization, and reserved capacity]
Tags: Cost Optimization, FinOps, Cloud Costs, BigQuery, AWS, GCP, Azure, Data Platform, Budget Management, Query Optimization

TL;DR

Cloud costs spiraling out of control is a common problem. According to Flexera, 30-40% of enterprises overspend on cloud due to waste.

Data Platform cost breakdown:

  • Compute: 60-70% (biggest cost) - Queries, transformations, Spark jobs
  • Storage: 10-20% (cheapest) - Data lake, warehouse storage
  • Networking: 10-15% - Data transfer, egress
  • Licenses: 5-10% - Third-party tools (BI, ETL)

5 Optimization Strategies:

  1. Storage Optimization (Save 40-60%)

    • Use tiered storage (hot/cool/archive)
    • Lifecycle policies (auto-move old data)
    • Compression (Parquet vs CSV = 10x smaller)
    • Deduplication
  2. Compute Optimization (Save 50-80%)

    • Right-sizing (don't over-provision)
    • Spot/Preemptible instances (60-80% cheaper for batch)
    • Auto-scaling (scale down when idle)
    • Query optimization (avoid SELECT *, use partitions)
    • Scheduled shutdowns (dev/staging off nights/weekends)
  3. Reserved Capacity (Save 30-50%)

    • Commit 1-3 years for discounts
    • Good for predictable workloads
  4. Data Lifecycle (Save 20-40%)

    • Delete unnecessary data (raw after transformed, old logs)
    • Retention policies
  5. Monitoring & Alerts (Prevent waste)

    • Set budgets, anomaly detection
    • Showback/chargeback (teams accountable)

Case Studies:

  • Startup: BigQuery $15K → $6K/month (60% reduction via query optimization)
  • Enterprise: AWS Data Platform $120K → $56K/month (53% reduction via reserved instances + right-sizing)

ROI: Cost optimization effort pays back in 1-2 months

This post walks you through 30+ tactics to reduce cloud data platform costs by 40-60%.


1. The Cloud Cost Problem

1.1. Why Cloud Costs Spiral

A familiar scenario:

Month 1: "Cloud is cheap! $500/month for unlimited scale"
Month 3: "$2,000/month, but we're growing fast"
Month 6: "$8,000/month... wait, what happened?"
Month 12: "$25,000/month 😱 CFO demanding explanations"

Root causes:

1. No visibility

  • Developers spin up resources freely
  • No cost attribution (which team/project?)
  • Discover waste months later (too late)

2. Over-provisioning

  • "Better safe than sorry" → provision 10x needed
  • Example: 16-core instance when 4-core sufficient

3. Zombie resources

  • Dev/test environments running 24/7
  • Unused snapshots, old backups
  • Forgotten resources after project ends

4. Inefficient queries

  • SELECT * FROM large_table (scan entire table)
  • Missing partitions
  • Repeated expensive queries (should be cached/materialized)

5. Poor data hygiene

  • Keep all data forever (no lifecycle)
  • Duplicate data across systems
  • Storing unnecessary raw data

1.2. Real Cost Example

Vietnamese startup (before optimization):

BigQuery:
  Query costs: $12,000/month (2.4PB scanned)
  Storage: $2,000/month (100TB active)

Cloud Storage:
  Storage: $2,300/month (100TB standard tier)

Dataflow:
  ETL jobs: $3,500/month

Cloud Composer:
  Airflow: $500/month

Total: $20,300/month = 466M VND/month 💸

After 3 months of optimization: $8,200/month (60% reduction!)


2. Cost Breakdown: Know Where Money Goes

2.1. Typical Data Platform Costs

Percentage breakdown:

┌────────────────────────────────────┐
│ Compute (Queries, ETL):      60%   │ ← Biggest cost
│ Storage (Lake, Warehouse):   20%   │
│ Networking (Egress):         10%   │
│ Licenses (BI tools, etc):    10%   │
└────────────────────────────────────┘

Dollar breakdown (example $10K/month platform):

Compute: $6,000
  - BigQuery queries: $4,000
  - Dataflow jobs: $1,500
  - Cloud Functions: $500

Storage: $2,000
  - BigQuery storage: $800
  - Cloud Storage (data lake): $1,000
  - Backups: $200

Networking: $1,000
  - Data egress to internet: $600
  - Cross-region transfer: $400

Licenses: $1,000
  - Looker (5 users): $500
  - Fivetran: $500

Optimization priority: Focus on Compute first (60% of costs, biggest savings potential).


3. Strategy 1: Storage Optimization

3.1. Storage Tiering

Problem: All data stored in expensive "hot" tier

Solution: Use appropriate tier per access frequency

GCP Cloud Storage tiers:

Tier       Use Case                          Price/GB/month   Example
Standard   Frequent access (< 30 days)       $0.020           Active analytics data
Nearline   Monthly access (30-90 days)       $0.010           Recent backups
Coldline   Quarterly access (90-365 days)    $0.004           Compliance archives
Archive    Yearly access (365+ days)         $0.0012          Long-term retention

Savings example (100TB data):

Before (all Standard):
  100TB × $0.020 = $2,000/month

After (tiered):
  20TB Standard (active): 20TB × $0.020 = $400
  30TB Nearline (30d): 30TB × $0.010 = $300
  50TB Coldline (90d+): 50TB × $0.004 = $200
  Total: $900/month

Savings: $1,100/month (55%) 🎉

Implementation (lifecycle policy):

# GCS lifecycle policy (age 2555 days ≈ 7 years before deletion)
cat > lifecycle.json <<EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 30}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 90}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
        "condition": {"age": 365}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 2555}  # 7 years
      }
    ]
  }
}
EOF

gsutil lifecycle set lifecycle.json gs://my-data-bucket/

AWS S3 equivalent:

Tier                       Price/GB/month
S3 Standard                $0.023
S3 Infrequent Access       $0.0125
S3 Glacier                 $0.004
S3 Glacier Deep Archive    $0.00099
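
The same tiering can be automated on AWS with a bucket lifecycle configuration. A minimal boto3 sketch mirroring the GCS policy above (the bucket name and day thresholds are illustrative placeholders, not from this article):

# Mirror of the GCS lifecycle policy above, applied to an S3 bucket via boto3.
# Bucket name and thresholds are placeholders; adjust to your data.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 2555},  # delete after ~7 years
            }
        ]
    },
)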

3.2. Compression

Problem: Storing uncompressed data (CSV, JSON)

Solution: Use columnar compressed formats (Parquet, ORC)

Comparison:

Same 1TB dataset:

CSV (uncompressed):
  Size: 1,000GB
  Storage cost: $20/month
  Query scan: Full 1TB

Parquet (compressed):
  Size: 100GB (10x smaller!)
  Storage cost: $2/month
  Query scan: 100GB (columnar → read only needed columns)

Savings: $18/month storage + 90% query cost reduction

Implementation:

# Convert CSV to Parquet
import pandas as pd

# Read CSV
df = pd.read_csv('large_data.csv')

# Write Parquet (auto-compressed with Snappy)
df.to_parquet(
    'large_data.parquet',
    engine='pyarrow',
    compression='snappy',
    index=False
)

# Upload to cloud
# gsutil cp large_data.parquet gs://bucket/

BigQuery example:

-- Export table to Parquet (compressed)
EXPORT DATA OPTIONS(
  uri='gs://bucket/data-*.parquet',
  format='PARQUET',
  compression='SNAPPY'
) AS
SELECT * FROM `project.dataset.large_table`;

3.3. Deduplication

Problem: Duplicate data across systems

Example:

  • Raw data in S3: 50TB
  • Staged data in warehouse: 50TB (duplicate!)
  • Aggregated marts: 10TB
  • Total: 110TB (should be 60TB)

Solution: Delete raw data after transformation (if not needed)

-- Identify duplicates
SELECT
  data_source,
  data_date,
  COUNT(*) as copies,
  SUM(size_gb) as total_gb
FROM data_inventory
GROUP BY 1, 2
HAVING COUNT(*) > 1
ORDER BY total_gb DESC;

-- Delete old raw data (after confirmed in warehouse)
DELETE FROM raw_data
WHERE load_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
  AND status = 'processed';

4. Strategy 2: Compute Optimization

4.1. Right-Sizing

Problem: Over-provisioned instances

Example:

Dataflow job:
  Current: 20 workers × n1-standard-8 (8 vCPUs each)
  Cost: $3.20/hour × 20 = $64/hour

  Actual usage: 30% CPU utilization

Right-sized: 8 workers × n1-standard-4
  Cost: $1.60/hour × 8 = $12.80/hour

Savings: $51/hour = 80% reduction 🎉

How to right-size:

  1. Monitor actual usage (CPU, memory, disk)
  2. Start small, scale up if needed (easier than down)
  3. Use auto-scaling (cloud adjusts automatically)

GCP Compute Engine right-sizing recommendations:

# Get recommendations
gcloud compute instances list --format="table(
  name,
  machineType,
  status
)"

# GCP console shows right-sizing recommendations
# Example: "Switch from n1-standard-8 to n1-standard-4 (save $1,200/month)"

4.2. Spot/Preemptible Instances

Concept: Cloud providers sell spare capacity at a 60-80% discount

Trade-off: Instances can be reclaimed at any time (GCP preemptible VMs get ~30 seconds' notice, AWS Spot about 2 minutes)

Perfect for: Batch jobs, ETL, ML training (can be retried)

Not for: Real-time APIs, databases (need reliability)

Savings:

Regular Dataflow job:
  10 workers × $0.40/hour = $4/hour
  Daily (4 hours): $16/day = $480/month

Preemptible Dataflow:
  10 workers × $0.10/hour = $1/hour
  Daily: $4/day = $120/month

Savings: $360/month (75%) 🎉

GCP Dataflow example:

# Launch a Dataflow batch job with FlexRS (Flexible Resource Scheduling),
# which runs workers on discounted, preemptible-backed capacity
gcloud dataflow jobs run my-etl-job \
    --gcs-location gs://templates/etl-template \
    --region us-central1 \
    --num-workers 10 \
    --max-workers 20 \
    --worker-machine-type n1-standard-4 \
    --additional-user-labels=cost-center=analytics \
    --flexrs-goal=COST_OPTIMIZED  # ← Enable FlexRS (preemptible pricing)

# Note: FlexRS is batch-only and may delay job start while capacity is found

AWS Spot Instances: Similar concept, 50-90% discount
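
On AWS, a common pattern is running transient EMR clusters with core nodes on Spot. A minimal boto3 sketch; the cluster name, region, instance types, and IAM role names below are assumptions for illustration, not from this article:

# Hypothetical example: transient EMR cluster with CORE nodes on Spot instances.
# Names, region, instance types, and IAM roles are assumptions; adjust to your setup.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-etl-spot",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core-spot", "InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        # Auto-terminate when there are no steps left (add Steps=[...] with your Spark jobs)
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])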

4.3. Auto-Scaling

Problem: Resources running 24/7 even when idle

Solution: Auto-scale down during low usage

Example - BigQuery:

BigQuery is already fully serverless (auto-scales automatically) ✅

Example - Redshift:

# Pause Redshift cluster during off-hours
import boto3
from datetime import datetime

redshift = boto3.client('redshift')

def pause_cluster_if_off_hours():
    hour = datetime.now().hour

    # Pause 10 PM - 6 AM
    if hour >= 22 or hour < 6:
        redshift.pause_cluster(ClusterIdentifier='my-cluster')
        print("⏸️  Cluster paused (off-hours)")
    else:
        # Resume if paused
        response = redshift.describe_clusters(ClusterIdentifier='my-cluster')
        if response['Clusters'][0]['ClusterStatus'] == 'paused':
            redshift.resume_cluster(ClusterIdentifier='my-cluster')
            print("▶️  Cluster resumed")

# Run via CloudWatch Events (cron)
pause_cluster_if_off_hours()

Savings:

Redshift cluster: $5,000/month (24/7)

With pause (8 hours/day off):
  16 hours/day usage = 67% uptime
  Cost: $5,000 × 0.67 = $3,350/month

Savings: $1,650/month (33%)

4.4. Query Optimization ⭐ Biggest Impact

Bad query:

-- 💸 Expensive: unpartitioned table + SELECT * → scans the entire 10TB table
SELECT *
FROM `project.dataset.large_table`
WHERE DATE(timestamp) = '2025-07-15'
LIMIT 100;

-- BigQuery scans: 10TB × $5/TB = $50 per query!

Optimized query:

-- ✅ Cheap: Uses partition, selects only needed columns
SELECT
  user_id,
  event_name,
  timestamp
FROM `project.dataset.large_table`
WHERE DATE(timestamp) = '2025-07-15'  -- Partition pruning
LIMIT 100;

-- BigQuery scans: 30GB (1 day partition) × $5/TB = $0.15
-- Savings: $49.85 per query (99.7% reduction!)

Optimization tactics:

1. Avoid SELECT *

-- ❌ Bad
SELECT * FROM orders;  -- Scans all 50 columns

-- ✅ Good
SELECT order_id, customer_id, total FROM orders;  -- Scans 3 columns only

2. Use Partitioning

-- Create partitioned table (BigQuery)
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(timestamp)
AS SELECT * FROM ...;

-- Queries auto-prune partitions
SELECT * FROM events
WHERE DATE(timestamp) = '2025-07-15'  -- Scans 1 day only, not entire table

3. Use Clustering

-- Create clustered table (BigQuery)
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(timestamp)
CLUSTER BY user_id, event_name  -- Sort by these columns
AS SELECT * FROM ...;

-- Queries skip irrelevant data blocks
SELECT * FROM events
WHERE user_id = 12345  -- Scans only blocks with user_id=12345

4. Materialize Expensive Queries

-- ❌ Expensive query (runs daily, scans 5TB each time)
SELECT
  DATE(timestamp) as date,
  COUNT(DISTINCT user_id) as dau
FROM events
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1;

-- Cost: 5TB × $5 × 30 days = $750/month

-- ✅ Materialize as table (run once, query table after)
CREATE TABLE `analytics.daily_active_users` AS
SELECT
  DATE(timestamp) as date,
  COUNT(DISTINCT user_id) as dau
FROM events
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1;

-- Future queries: scan tiny table (1KB), not 5TB
SELECT * FROM `analytics.daily_active_users`;

-- Cost: ~$25 one-time materialization (scans 5TB once) + negligible cost per read
-- Savings: ~$725/month (≈97%)

5. Use Query Results Caching

BigQuery automatically caches query results for 24 hours (free!):

-- First run: Scans 1TB, costs $5
SELECT COUNT(*) FROM large_table WHERE date = '2025-07-15';

-- Second run (within 24h): Uses cache, costs $0 ✅
SELECT COUNT(*) FROM large_table WHERE date = '2025-07-15';
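
If you run queries from Python, you can confirm whether a job was served from cache and how many bytes were billed. A minimal sketch with the google-cloud-bigquery client (the table name is a placeholder):

# Check cache hits and billed bytes with the BigQuery Python client.
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT COUNT(*) FROM `project.dataset.large_table` WHERE date = '2025-07-15'"

job = client.query(sql, job_config=bigquery.QueryJobConfig(use_query_cache=True))
job.result()  # wait for the query to finish

print("Cache hit:", job.cache_hit)              # True on a repeat run within 24h
print("Bytes billed:", job.total_bytes_billed)  # 0 when served from cache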

4.5. Scheduled Shutdowns

Problem: Dev/test/staging environments running 24/7

Solution: Auto-shutdown nights + weekends

# Cloud Function: Shutdown dev resources after hours
import googleapiclient.discovery

def shutdown_dev_resources(event, context):
    from datetime import datetime

    now = datetime.now()
    hour = now.hour
    is_weekend = now.weekday() >= 5  # Saturday=5, Sunday=6

    # Shutdown conditions
    if hour >= 19 or hour < 8 or is_weekend:
        # Stop dev instances
        compute = googleapiclient.discovery.build('compute', 'v1')

        project = 'my-project'
        zone = 'us-central1-a'
        instances = ['dev-server-1', 'dev-server-2', 'staging-db']

        for instance in instances:
            compute.instances().stop(
                project=project,
                zone=zone,
                instance=instance
            ).execute()

        print(f"⏹️  Stopped {len(instances)} dev instances (off-hours)")

# Schedule: Cloud Scheduler runs hourly

Savings:

Dev environment: $1,500/month (24/7)

With shutdown:
  Weekdays: 9 AM - 7 PM = 10 hours (42% uptime)
  Weekends: Off
  Monthly uptime: ~30%

Cost: $1,500 × 0.30 = $450/month

Savings: $1,050/month (70%) 🎉

5. Strategy 3: Reserved Capacity

5.1. Concept

Pay upfront for 1-3 years → Get 30-50% discount

Good for: Predictable, steady workloads

Not for: Unpredictable/spiky workloads

5.2. BigQuery Flat-Rate Pricing

On-Demand (default):

  • $5 per TB scanned
  • Unpredictable monthly cost

Flat-Rate (reserved):

  • $2,000/month = 100 slots (fixed cost)
  • Unlimited queries

Break-even: 400TB scanned/month

Example:

Current usage: 600TB scanned/month
On-demand cost: 600TB × $5 = $3,000/month

Flat-rate: $2,000/month (100 slots)

Savings: $1,000/month (33%)

When to use:

  • ✅ Query > 400TB/month consistently
  • ✅ Predictable workloads
  • ❌ Variable workloads (some months 100TB, some 500TB)
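
A quick way to decide is to compare your monthly scan volume against the break-even point. A minimal sketch using the example prices above ($5/TB on-demand, $2,000/month for 100 slots); plug in your own pricing:

# Break-even check: on-demand vs flat-rate, using the example prices above.
ON_DEMAND_PER_TB = 5.0       # USD per TB scanned (on-demand)
FLAT_RATE_MONTHLY = 2000.0   # USD per month for 100 slots (example figure)

def cheaper_option(tb_scanned_per_month: float) -> str:
    on_demand = tb_scanned_per_month * ON_DEMAND_PER_TB
    if on_demand > FLAT_RATE_MONTHLY:
        return f"Flat-rate: saves ${on_demand - FLAT_RATE_MONTHLY:,.0f}/month"
    return f"On-demand: saves ${FLAT_RATE_MONTHLY - on_demand:,.0f}/month"

print(cheaper_option(600))  # Flat-rate: saves $1,000/month
print(cheaper_option(300))  # On-demand: saves $500/month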

5.3. AWS Reserved Instances

Redshift Reserved Nodes:

On-Demand ra3.4xlarge:
  $3.26/hour = $2,347/month

Reserved (1-year, no upfront):
  $2.27/hour = $1,634/month
  Savings: $713/month (30%)

Reserved (3-year, all upfront):
  $1.53/hour = $1,102/month
  Savings: $1,245/month (53%)

Risk: Locked in for 1-3 years (can't cancel)


6. Strategy 4: Data Lifecycle Management

6.1. Delete Unnecessary Data

Common waste:

❌ Raw data kept forever (after transformed to warehouse)
❌ Old logs (> 1 year, rarely accessed)
❌ Test data in production
❌ Duplicate backups
❌ Abandoned datasets (project ended)

Lifecycle policy example:

# Data retention policy
datasets:
  - name: raw_events
    retention: 90_days  # Delete after transformed
    reason: "Processed into analytics.events"

  - name: application_logs
    retention: 1_year
    archive_after: 90_days  # Move to archive tier
    reason: "Compliance requires 1 year"

  - name: analytics_*
    retention: 7_years  # Tax compliance
    archive_after: 2_years

  - name: ml_experiments_*
    retention: 90_days
    reason: "Temporary experiments, keep best models only"

Implementation (automated cleanup):

-- BigQuery: Delete old raw data
DELETE FROM `project.raw.events`
WHERE DATE(ingested_at) < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);

-- Delete old logs
DELETE FROM `project.logs.application_logs`
WHERE DATE(timestamp) < DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY);
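
A minimal sketch of how a policy file like the YAML above could drive the cleanup automatically. The file name, the retention-to-days mapping, the dataset-to-table mapping, and the ingested_at column are assumptions for illustration:

# Hypothetical automation: read retention_policy.yaml (the YAML above) and apply deletes.
import yaml
from google.cloud import bigquery

RETENTION_DAYS = {"90_days": 90, "1_year": 365, "7_years": 2555}  # assumed mapping

client = bigquery.Client()

with open("retention_policy.yaml") as f:
    policy = yaml.safe_load(f)

for ds in policy["datasets"]:
    if "*" in ds["name"]:
        continue  # wildcard entries need table enumeration; skipped in this sketch
    days = RETENTION_DAYS[ds["retention"]]
    table = f"project.raw.{ds['name']}"  # hypothetical name -> table mapping
    client.query(
        f"DELETE FROM `{table}` "
        f"WHERE DATE(ingested_at) < DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)"
    ).result()
    print(f"Applied {days}-day retention to {table}")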

Savings example:

Before: 200TB total storage
  - 50TB raw (should be deleted after 90 days)
  - 100TB analytics (needed)
  - 50TB old logs (> 1 year)

After cleanup: 100TB
  Storage cost: 200TB × $20/TB = $4,000/month
                → 100TB × $20/TB = $2,000/month

Savings: $2,000/month (50%)

6.2. Backup Optimization

Problem: Daily full backups forever

Solution: Incremental backups + retention policy

Backup strategy:
  - Daily incremental: 7 days (delete after)
  - Weekly full: 4 weeks
  - Monthly full: 12 months
  - Yearly full: 7 years (compliance)

Before (daily full forever):
  365 days × 5TB = 1,825TB backups
  Cost: $36,500/month

After (optimized):
  Daily (7): 7 × 0.5TB (incremental) = 3.5TB
  Weekly (4): 4 × 5TB = 20TB
  Monthly (12): 12 × 5TB = 60TB
  Yearly (7): 7 × 5TB = 35TB
  Total: 118.5TB

Cost: $2,370/month

Savings: $34,130/month (93%!) 🎉
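
The arithmetic behind those numbers, as a small sketch you can adapt (assumes a 5TB full backup, ~0.5TB incrementals, and $20/TB-month storage, the figures used above):

# Estimate retained backup volume and cost under the schedule above.
FULL_TB, INCR_TB, PRICE_PER_TB = 5.0, 0.5, 20.0

retained = {
    "daily incrementals (7)": 7 * INCR_TB,
    "weekly fulls (4)":       4 * FULL_TB,
    "monthly fulls (12)":    12 * FULL_TB,
    "yearly fulls (7)":       7 * FULL_TB,
}

total_tb = sum(retained.values())
print(f"Total retained: {total_tb} TB")                    # 118.5 TB
print(f"Monthly cost:   ${total_tb * PRICE_PER_TB:,.0f}")  # $2,370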

7. Strategy 5: Monitoring & Alerting

7.1. Set Budgets

Prevent surprise bills:

Monthly budget: $10,000

Alerts:
  - 50% spent ($5,000): Email to team
  - 75% spent ($7,500): Email to manager + Slack alert
  - 90% spent ($9,000): Email to CTO + block non-critical spending
  - 100% spent: Hard stop (prevent overages)

GCP Budget Alert (setup):

gcloud billing budgets create \
    --billing-account=BILLING_ACCOUNT_ID \
    --display-name="Data Platform Budget" \
    --budget-amount=10000 \
    --threshold-rule=percent=0.5 \
    --threshold-rule=percent=0.75 \
    --threshold-rule=percent=0.9 \
    --threshold-rule=percent=1.0,basis=forecasted-spend

7.2. Anomaly Detection

Detect cost spikes before they become problems:

# Daily cost monitoring
import pandas as pd
from google.cloud import bigquery

def alert(message: str) -> None:
    """Placeholder notifier: replace with your Slack/email integration."""
    print(message)

def detect_cost_anomalies():
    client = bigquery.Client()

    # Get daily costs (last 30 days)
    query = """
        SELECT
            DATE(usage_start_time) as date,
            SUM(cost) as daily_cost
        FROM `project.billing.gcp_billing_export`
        WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        GROUP BY 1
        ORDER BY 1
    """

    df = client.query(query).to_dataframe()

    # Calculate moving average
    df['avg_cost'] = df['daily_cost'].rolling(window=7).mean()
    df['std_cost'] = df['daily_cost'].rolling(window=7).std()

    # Detect anomalies (> 2 standard deviations)
    latest = df.iloc[-1]
    threshold = latest['avg_cost'] + 2 * latest['std_cost']

    if latest['daily_cost'] > threshold:
        alert(f"""
        🚨 Cost Anomaly Detected!

        Today's cost: ${latest['daily_cost']:.2f}
        7-day average: ${latest['avg_cost']:.2f}
        Threshold: ${threshold:.2f}

        Spike: {((latest['daily_cost'] / latest['avg_cost']) - 1) * 100:.0f}% above normal

        Action: Investigate immediately!
        """)

# Run daily via Cloud Scheduler
detect_cost_anomalies()

7.3. Showback/Chargeback

Make teams accountable for their costs:

Showback: Show each team their usage (informational)

Chargeback: Charge each team's budget (enforced)

Implementation:

-- Tag all resources with labels
-- Example: BigQuery table labels
ALTER TABLE `project.dataset.analytics_table`
SET OPTIONS (
  labels=[("team", "marketing"), ("project", "campaign-analysis")]
);

-- Monthly cost report per team
SELECT
  label.value AS team,
  SUM(cost) AS monthly_cost,
  SUM(cost) / (
    SELECT SUM(cost)
    FROM `project.billing.gcp_billing_export`
    WHERE DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
  ) * 100 AS percent_of_total
FROM `project.billing.gcp_billing_export`,
UNNEST(labels) AS label
WHERE label.key = 'team'
  AND DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
GROUP BY 1
ORDER BY 2 DESC;

-- Output:
-- team          monthly_cost   percent
-- marketing     $4,500         45%
-- data-science  $3,000         30%
-- analytics     $2,500         25%

Result: Teams become cost-conscious when they see their own bills.


8. Case Studies

8.1. Vietnamese Startup: BigQuery Cost Cut 60%

Company: SaaS platform (10 employees, Series A)

Before:

BigQuery costs: $15,000/month
  - 3PB scanned/month (analysts running exploratory queries)
  - No partitions, no clustering
  - SELECT * everywhere
  - Dashboard queries running every 5 minutes (uncached)

Problems identified:

  1. No partitioning: Every query scans full table
  2. SELECT * everywhere: Scanning 50 columns when only 5 are needed
  3. Dashboard over-querying: Same query 12 times/hour
  4. No materialization: Complex aggregations recomputed each time

Actions (3-month optimization project):

Month 1: Partition + Cluster

-- Recreate tables with partitions
CREATE TABLE `analytics.events_optimized`
PARTITION BY DATE(timestamp)
CLUSTER BY user_id, event_name
AS SELECT * FROM `analytics.events`;

-- Replace old table
DROP TABLE `analytics.events`;
ALTER TABLE `analytics.events_optimized` RENAME TO events;

Result: Query costs → $10,000/month (33% reduction)

Month 2: Query Optimization

  • Replaced SELECT * with specific columns
  • Added WHERE clauses to prune partitions
  • Used LIMIT for exploratory queries

Result: Query costs → $7,000/month (additional 30%)

Month 3: Materialization + Caching

  • Materialized daily/weekly aggregates
  • Increased dashboard cache TTL to 1 hour (from 5 min)
  • Scheduled expensive reports (vs on-demand)

Result: Query costs → $6,000/month

Final savings: $15,000 → $6,000 = 60% reduction ($9,000/month = $108K/year)

Effort: 1 data engineer, 25% time over 3 months

ROI: $108K savings / $15K effort = 7x return

8.2. Enterprise: AWS Data Platform 53% Reduction

Company: Manufacturing (500 employees, $100M revenue)

Before:

AWS Data Platform: $120,000/month breakdown:
  - Redshift (24/7, 10-node ra3.4xlarge): $23,470/month
  - S3 (500TB, all standard tier): $11,500/month
  - Glue ETL (on-demand): $8,500/month
  - EMR (3 clusters, 24/7): $42,000/month
  - RDS (over-provisioned): $18,000/month
  - Data transfer: $10,000/month
  - Other: $6,530/month

Optimization (6-month project):

1. Reserved Instances (Month 1-2)

  • Redshift: 3-year reserved nodes → $15,500/month (save $7,970)
  • RDS: 1-year reserved → $10,800/month (save $7,200)

2. Redshift Pause/Resume (Month 2)

  • Pause 10 PM - 6 AM (8 hours) = 67% uptime
  • Cost: $15,500 × 0.67 = $10,385/month (save $5,115)

3. S3 Lifecycle (Month 3)

  • 200TB → Glacier after 90 days
  • 150TB → Deep Archive after 1 year
  • Storage: $11,500 → $4,200/month (save $7,300)

4. EMR Right-Sizing + Spot (Month 4-5)

  • Reduced to 1 cluster (consolidated workloads)
  • 80% spot instances
  • Auto-shutdown off-hours
  • Cost: $42,000 → $12,000/month (save $30,000)

5. Data Transfer Optimization (Month 6)

  • Use VPC endpoints (avoid internet egress)
  • Consolidate cross-region data
  • Cost: $10,000 → $4,000/month (save $6,000)

Final costs: $56,415/month

Total savings: $120,000 → $56,415 = 53% reduction ($63,585/month ≈ $763K/year!)

Effort: 2 engineers, 6 months

ROI: $763K savings / $180K effort ≈ 4.2x return


9. Cost Optimization Checklist (30 Items)

Storage (8 items)

  • Use tiered storage (hot/cool/archive) based on access patterns
  • Implement lifecycle policies (auto-move old data)
  • Compress data (Parquet/ORC instead of CSV/JSON)
  • Partition large tables by date
  • Delete unnecessary raw data after transformation
  • Deduplicate data across systems
  • Optimize backup retention (not forever)
  • Use incremental backups (not daily full)

Compute (12 items)

  • Right-size instances (monitor actual usage)
  • Use spot/preemptible instances for batch jobs
  • Auto-scale down during off-hours
  • Pause dev/staging environments nights + weekends
  • Avoid SELECT * (query only needed columns)
  • Use partitions and clustering
  • Materialize expensive aggregations
  • Leverage query result caching
  • Set query timeouts (kill runaway queries)
  • Limit exploratory queries (use LIMIT)
  • Scheduled reports (not on-demand)
  • Review and kill zombie jobs

Reserved Capacity (3 items)

  • Analyze usage patterns (predictable workloads?)
  • Purchase reserved instances for steady workloads
  • Consider flat-rate pricing for high query volumes

Data Lifecycle (3 items)

  • Define retention policies per dataset
  • Automate data deletion (don't keep forever)
  • Archive cold data to cheaper tiers

Monitoring (4 items)

  • Set budgets with alerts (50%, 75%, 90%, 100%)
  • Implement anomaly detection (catch spikes early)
  • Tag all resources (team, project, environment)
  • Showback/chargeback to teams (accountability)

10. Tools for Cost Management

10.1. Cloud-Native Tools

GCP:

  • Billing Reports: Visualize costs by service, project, label
  • Budgets & Alerts: Email/Slack when threshold reached
  • Recommender: AI suggests optimizations (right-sizing, etc.)
  • Committed Use Discounts: Reserved capacity

AWS:

  • Cost Explorer: Analyze historical costs, forecast
  • Budgets: Set budget alerts
  • Trusted Advisor: Optimization recommendations
  • Reserved Instance Recommendations: Purchase suggestions

Azure:

  • Cost Management + Billing: Dashboards, alerts
  • Advisor: Cost optimization suggestions
  • Reservations: Reserved capacity discounts

10.2. Third-Party Tools

CloudHealth (VMware) - $$$

  • Multi-cloud cost management
  • Detailed analytics, showback/chargeback
  • Policy enforcement
  • Enterprise-focused

Cloudability (Apptio) - $$$

  • FinOps platform
  • Cost allocation, forecasting
  • Anomaly detection
  • Good for large enterprises

Kubecost (Open-source, free for single cluster)

  • Kubernetes cost monitoring
  • Per-namespace, per-pod costs
  • Good for containerized data platforms

Conclusion

Cloud cost optimization is not a one-time exercise - it's a continuous process.

Key Takeaways:

  1. Compute is biggest cost (60-70%) → Focus query optimization first
  2. Storage tiering saves 40-60% (lifecycle policies)
  3. Right-sizing saves 50-80% (monitor actual usage)
  4. Spot instances save 60-80% for batch jobs
  5. Reserved capacity saves 30-50% for predictable workloads
  6. Query optimization = biggest impact (SELECT *, partitions, materialization)
  7. Data lifecycle = delete unnecessary data (50% storage savings)
  8. Monitoring = prevent waste (budgets, anomaly detection, showback)

Implementation Priority:

Week 1-2: Quick wins

  • Set budgets + alerts
  • Identify top cost drivers
  • Delete obvious waste (old backups, zombie resources)

Month 1: Storage optimization

  • Implement lifecycle policies
  • Compress data (Parquet)
  • Delete unnecessary data

Month 2-3: Compute optimization

  • Query optimization (partitions, SELECT specific columns)
  • Materialize expensive queries
  • Right-size instances

Month 4+: Advanced

  • Reserved capacity (if predictable workloads)
  • Spot instances for batch
  • Showback/chargeback culture

Expected savings: 40-60% reduction in 3-6 months

ROI: Optimization effort pays back in 1-2 months

Next Steps:

  • ✅ Run cost analysis (where is money going?)
  • ✅ Set budget alerts TODAY (prevent surprise bills)
  • ✅ Implement 5 quick wins from checklist
  • ✅ Read Cloud Architecture for platform selection
  • ✅ Read Query Optimization for BigQuery-specific tactics

Need help? Carptech provides Cloud FinOps consulting - we've helped 10+ Vietnamese companies reduce cloud costs by 40-60%. Book a consultation to discuss your cost optimization strategy.

