Cloud & Infrastructure · Updated: July 15, 2025 · 19 min read

Cost Optimization for Data Platforms: Cut Cloud Bills by 40-60%

A comprehensive guide to cloud cost optimization for Data Platforms. Explore 5 optimization strategies (storage tiers, compute right-sizing, reserved capacity, data lifecycle, monitoring), query optimization techniques, FinOps culture, and case studies that cut cloud costs by 40-60%.

Hoàng Quốc Việt

Senior Data Platform Engineer

[Image: Cloud cost optimization dashboard showing cost breakdown, optimization opportunities, budget alerts, and savings achieved through storage tiers, compute optimization, and reserved capacity]
Tags: Cost Optimization, FinOps, Cloud Costs, BigQuery, AWS, GCP, Azure, Data Platform, Budget Management, Query Optimization

TL;DR

Cloud costs spiraling out of control is a common problem. According to Flexera, 30-40% of enterprises overspend on cloud due to waste.

Data Platform cost breakdown:

  • Compute: 60-70% (biggest cost) - Queries, transformations, Spark jobs
  • Storage: 10-20% (cheapest) - Data lake, warehouse storage
  • Networking: 10-15% - Data transfer, egress
  • Licenses: 5-10% - Third-party tools (BI, ETL)

5 Optimization Strategies:

  1. Storage Optimization (Save 40-60%)

    • Use tiered storage (hot/cool/archive)
    • Lifecycle policies (auto-move old data)
    • Compression (Parquet vs CSV = 10x smaller)
    • Deduplication
  2. Compute Optimization (Save 50-80%)

    • Right-sizing (don't over-provision)
    • Spot/Preemptible instances (60-80% cheaper for batch)
    • Auto-scaling (scale down when idle)
    • Query optimization (avoid SELECT *, use partitions)
    • Scheduled shutdowns (dev/staging off nights/weekends)
  3. Reserved Capacity (Save 30-50%)

    • Commit 1-3 years for discounts
    • Good for predictable workloads
  4. Data Lifecycle (Save 20-40%)

    • Delete unnecessary data (raw after transformed, old logs)
    • Retention policies
  5. Monitoring & Alerts (Prevent waste)

    • Set budgets, anomaly detection
    • Showback/chargeback (teams accountable)

Case Studies:

  • Startup: BigQuery $15K → $6K/month (60% reduction via query optimization)
  • Enterprise: AWS Data Platform $120K → $56K/month (53% reduction via reserved instances + right-sizing)

ROI: Cost optimization effort pays back in 1-2 months

This post walks you through 30+ tactics to reduce cloud data platform costs by 40-60%.


1. The Cloud Cost Problem

1.1. Why Cloud Costs Spiral

A familiar scenario:

Month 1: "Cloud is cheap! $500/month for unlimited scale"
Month 3: "$2,000/month, but we're growing fast"
Month 6: "$8,000/month... wait, what happened?"
Month 12: "$25,000/month 😱 CFO demanding explanations"

Root causes:

1. No visibility

  • Developers spin up resources freely
  • No cost attribution (which team/project?)
  • Discover waste months later (too late)

2. Over-provisioning

  • "Better safe than sorry" → provision 10x needed
  • Example: 16-core instance when 4-core sufficient

3. Zombie resources

  • Dev/test environments running 24/7
  • Unused snapshots, old backups
  • Forgotten resources after project ends

4. Inefficient queries

  • SELECT * FROM large_table (scan entire table)
  • Missing partitions
  • Repeated expensive queries (should be cached/materialized)

5. Poor data hygiene

  • Keep all data forever (no lifecycle)
  • Duplicate data across systems
  • Storing unnecessary raw data

1.2. Real Cost Example

Vietnamese startup (before optimization):

BigQuery:
  Query costs: $12,000/month (2.4PB scanned)
  Storage: $2,000/month (100TB active)

Cloud Storage:
  Storage: $2,300/month (100TB standard tier)

Dataflow:
  ETL jobs: $3,500/month

Cloud Composer:
  Airflow: $500/month

Total: $20,300/month = 466M VND/month 💸

After 3 months of optimization: $8,200/month (60% reduction!)


2. Cost Breakdown: Know Where Money Goes

2.1. Typical Data Platform Costs

Percentage breakdown:

┌────────────────────────────────────┐
│ Compute (Queries, ETL):      60%   │ ← Biggest cost
│ Storage (Lake, Warehouse):   20%   │
│ Networking (Egress):         10%   │
│ Licenses (BI tools, etc):    10%   │
└────────────────────────────────────┘

Dollar breakdown (example $10K/month platform):

Compute: $6,000
  - BigQuery queries: $4,000
  - Dataflow jobs: $1,500
  - Cloud Functions: $500

Storage: $2,000
  - BigQuery storage: $800
  - Cloud Storage (data lake): $1,000
  - Backups: $200

Networking: $1,000
  - Data egress to internet: $600
  - Cross-region transfer: $400

Licenses: $1,000
  - Looker (5 users): $500
  - Fivetran: $500

Optimization priority: Focus on Compute first (60% of costs, biggest savings potential).


3. Strategy 1: Storage Optimization

3.1. Storage Tiering

Problem: All data stored in expensive "hot" tier

Solution: Use appropriate tier per access frequency

GCP Cloud Storage tiers:

Tier       Use Case                          Price/GB/month   Example
Standard   Frequent access (< 30 days)       $0.020           Active analytics data
Nearline   Monthly access (30-90 days)       $0.010           Recent backups
Coldline   Quarterly access (90-365 days)    $0.004           Compliance archives
Archive    Yearly access (365+ days)         $0.0012          Long-term retention

Savings example (100TB data):

Before (all Standard):
  100TB × $0.020 = $2,000/month

After (tiered):
  20TB Standard (active): 20TB × $0.020 = $400
  30TB Nearline (30d): 30TB × $0.010 = $300
  50TB Coldline (90d+): 50TB × $0.004 = $200
  Total: $900/month

Savings: $1,100/month (55%) 🎉

Implementation (lifecycle policy):

# GCS lifecycle policy (age 2555 days ≈ 7 years before deletion)
cat > lifecycle.json <<EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 30}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 90}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
        "condition": {"age": 365}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 2555}  # 7 years
      }
    ]
  }
}
EOF

gsutil lifecycle set lifecycle.json gs://my-data-bucket/

AWS S3 equivalent:

Tier                       Price/GB/month
S3 Standard                $0.023
S3 Infrequent Access       $0.0125
S3 Glacier                 $0.004
S3 Glacier Deep Archive    $0.00099
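
The same tiering can be automated on AWS with a bucket lifecycle configuration. A minimal boto3 sketch mirroring the GCS policy above (the bucket name and day thresholds are illustrative placeholders, not from this article):

# Mirror of the GCS lifecycle policy above, applied to an S3 bucket via boto3.
# Bucket name and thresholds are placeholders; adjust to your data.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 2555},  # delete after ~7 years
            }
        ]
    },
)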

3.2. Compression

Problem: Storing uncompressed data (CSV, JSON)

Solution: Use columnar compressed formats (Parquet, ORC)

Comparison:

Same 1TB dataset:

CSV (uncompressed):
  Size: 1,000GB
  Storage cost: $20/month
  Query scan: Full 1TB

Parquet (compressed):
  Size: 100GB (10x smaller!)
  Storage cost: $2/month
  Query scan: 100GB (columnar → read only needed columns)

Savings: $18/month storage + 90% query cost reduction

Implementation:

# Convert CSV to Parquet
import pandas as pd

# Read CSV
df = pd.read_csv('large_data.csv')

# Write Parquet (auto-compressed with Snappy)
df.to_parquet(
    'large_data.parquet',
    engine='pyarrow',
    compression='snappy',
    index=False
)

# Upload to cloud
# gsutil cp large_data.parquet gs://bucket/

BigQuery example:

-- Export table to Parquet (compressed)
EXPORT DATA OPTIONS(
  uri='gs://bucket/data-*.parquet',
  format='PARQUET',
  compression='SNAPPY'
) AS
SELECT * FROM `project.dataset.large_table`;

3.3. Deduplication

Problem: Duplicate data across systems

Example:

  • Raw data in S3: 50TB
  • Staged data in warehouse: 50TB (duplicate!)
  • Aggregated marts: 10TB
  • Total: 110TB (should be 60TB)

Solution: Delete raw data after transformation (if not needed)

-- Identify duplicates
SELECT
  data_source,
  data_date,
  COUNT(*) as copies,
  SUM(size_gb) as total_gb
FROM data_inventory
GROUP BY 1, 2
HAVING COUNT(*) > 1
ORDER BY total_gb DESC;

-- Delete old raw data (after confirmed in warehouse)
DELETE FROM raw_data
WHERE load_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
  AND status = 'processed';

4. Strategy 2: Compute Optimization

4.1. Right-Sizing

Problem: Over-provisioned instances

Example:

Dataflow job:
  Current: 20 workers × n1-standard-8 (8 vCPUs each)
  Cost: $3.20/hour × 20 = $64/hour

  Actual usage: 30% CPU utilization

Right-sized: 8 workers × n1-standard-4
  Cost: $1.60/hour × 8 = $12.80/hour

Savings: $51/hour = 80% reduction 🎉

How to right-size:

  1. Monitor actual usage (CPU, memory, disk)
  2. Start small, scale up if needed (easier than down)
  3. Use auto-scaling (cloud adjusts automatically)

GCP Compute Engine right-sizing recommendations:

# Get recommendations
gcloud compute instances list --format="table(
  name,
  machineType,
  status
)"

# GCP console shows right-sizing recommendations
# Example: "Switch from n1-standard-8 to n1-standard-4 (save $1,200/month)"

4.2. Spot/Preemptible Instances

Concept: Cloud providers sell spare capacity at a 60-80% discount

Trade-off: Instances can be reclaimed at any time (GCP preemptible VMs get ~30 seconds' notice, AWS Spot about 2 minutes)

Perfect for: Batch jobs, ETL, ML training (can be retried)

Not for: Real-time APIs, databases (need reliability)

Savings:

Regular Dataflow job:
  10 workers × $0.40/hour = $4/hour
  Daily (4 hours): $16/day = $480/month

Preemptible Dataflow:
  10 workers × $0.10/hour = $1/hour
  Daily: $4/day = $120/month

Savings: $360/month (75%) 🎉

GCP Dataflow example:

# Launch a Dataflow batch job with FlexRS (Flexible Resource Scheduling),
# which runs workers on discounted, preemptible-backed capacity
gcloud dataflow jobs run my-etl-job \
    --gcs-location gs://templates/etl-template \
    --region us-central1 \
    --num-workers 10 \
    --max-workers 20 \
    --worker-machine-type n1-standard-4 \
    --additional-user-labels=cost-center=analytics \
    --flexrs-goal=COST_OPTIMIZED  # ← Enable FlexRS (preemptible pricing)

# Note: FlexRS is batch-only and may delay job start while capacity is found

AWS Spot Instances: Similar concept, 50-90% discount
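
On AWS, a common pattern is running transient EMR clusters with core nodes on Spot. A minimal boto3 sketch; the cluster name, region, instance types, and IAM role names below are assumptions for illustration, not from this article:

# Hypothetical example: transient EMR cluster with CORE nodes on Spot instances.
# Names, region, instance types, and IAM roles are assumptions; adjust to your setup.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-etl-spot",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core-spot", "InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        # Auto-terminate when there are no steps left (add Steps=[...] with your Spark jobs)
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])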

4.3. Auto-Scaling

Problem: Resources running 24/7 even when idle

Solution: Auto-scale down during low usage

Example - BigQuery:

BigQuery is already fully serverless (auto-scales automatically) ✅

Example - Redshift:

# Pause Redshift cluster during off-hours
import boto3
from datetime import datetime

redshift = boto3.client('redshift')

def pause_cluster_if_off_hours():
    hour = datetime.now().hour

    # Pause 10 PM - 6 AM
    if hour >= 22 or hour < 6:
        redshift.pause_cluster(ClusterIdentifier='my-cluster')
        print("⏸️  Cluster paused (off-hours)")
    else:
        # Resume if paused
        response = redshift.describe_clusters(ClusterIdentifier='my-cluster')
        if response['Clusters'][0]['ClusterStatus'] == 'paused':
            redshift.resume_cluster(ClusterIdentifier='my-cluster')
            print("▶️  Cluster resumed")

# Run via CloudWatch Events (cron)
pause_cluster_if_off_hours()

Savings:

Redshift cluster: $5,000/month (24/7)

With pause (8 hours/day off):
  16 hours/day usage = 67% uptime
  Cost: $5,000 × 0.67 = $3,350/month

Savings: $1,650/month (33%)

4.4. Query Optimization ⭐ Biggest Impact

Bad query:

-- 💸 Expensive: unpartitioned table + SELECT * → scans the entire 10TB table
SELECT *
FROM `project.dataset.large_table`
WHERE DATE(timestamp) = '2025-07-15'
LIMIT 100;

-- BigQuery scans: 10TB × $5/TB = $50 per query!

Optimized query:

-- ✅ Cheap: Uses partition, selects only needed columns
SELECT
  user_id,
  event_name,
  timestamp
FROM `project.dataset.large_table`
WHERE DATE(timestamp) = '2025-07-15'  -- Partition pruning
LIMIT 100;

-- BigQuery scans: 30GB (1 day partition) × $5/TB = $0.15
-- Savings: $49.85 per query (99.7% reduction!)

Optimization tactics:

1. Avoid SELECT *

-- ❌ Bad
SELECT * FROM orders;  -- Scans all 50 columns

-- ✅ Good
SELECT order_id, customer_id, total FROM orders;  -- Scans 3 columns only

2. Use Partitioning

-- Create partitioned table (BigQuery)
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(timestamp)
AS SELECT * FROM ...;

-- Queries auto-prune partitions
SELECT * FROM events
WHERE DATE(timestamp) = '2025-07-15'  -- Scans 1 day only, not entire table

3. Use Clustering

-- Create clustered table (BigQuery)
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(timestamp)
CLUSTER BY user_id, event_name  -- Sort by these columns
AS SELECT * FROM ...;

-- Queries skip irrelevant data blocks
SELECT * FROM events
WHERE user_id = 12345  -- Scans only blocks with user_id=12345

4. Materialize Expensive Queries

-- ❌ Expensive query (runs daily, scans 5TB each time)
SELECT
  DATE(timestamp) as date,
  COUNT(DISTINCT user_id) as dau
FROM events
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1;

-- Cost: 5TB × $5 × 30 days = $750/month

-- ✅ Materialize as table (run once, query table after)
CREATE TABLE `analytics.daily_active_users` AS
SELECT
  DATE(timestamp) as date,
  COUNT(DISTINCT user_id) as dau
FROM events
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1;

-- Future queries: scan tiny table (1KB), not 5TB
SELECT * FROM `analytics.daily_active_users`;

-- Cost: ~$25 one-time materialization (scans 5TB once) + negligible cost per read
-- Savings: ~$725/month (≈97%)

5. Use Query Results Caching

BigQuery automatically caches query results for 24 hours (free!):

-- First run: Scans 1TB, costs $5
SELECT COUNT(*) FROM large_table WHERE date = '2025-07-15';

-- Second run (within 24h): Uses cache, costs $0 ✅
SELECT COUNT(*) FROM large_table WHERE date = '2025-07-15';
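
If you run queries from Python, you can confirm whether a job was served from cache and how many bytes were billed. A minimal sketch with the google-cloud-bigquery client (the table name is a placeholder):

# Check cache hits and billed bytes with the BigQuery Python client.
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT COUNT(*) FROM `project.dataset.large_table` WHERE date = '2025-07-15'"

job = client.query(sql, job_config=bigquery.QueryJobConfig(use_query_cache=True))
job.result()  # wait for the query to finish

print("Cache hit:", job.cache_hit)              # True on a repeat run within 24h
print("Bytes billed:", job.total_bytes_billed)  # 0 when served from cache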

4.5. Scheduled Shutdowns

Problem: Dev/test/staging environments running 24/7

Solution: Auto-shutdown nights + weekends

# Cloud Function: Shutdown dev resources after hours
import googleapiclient.discovery

def shutdown_dev_resources(event, context):
    from datetime import datetime

    now = datetime.now()
    hour = now.hour
    is_weekend = now.weekday() >= 5  # Saturday=5, Sunday=6

    # Shutdown conditions
    if hour >= 19 or hour < 8 or is_weekend:
        # Stop dev instances
        compute = googleapiclient.discovery.build('compute', 'v1')

        project = 'my-project'
        zone = 'us-central1-a'
        instances = ['dev-server-1', 'dev-server-2', 'staging-db']

        for instance in instances:
            compute.instances().stop(
                project=project,
                zone=zone,
                instance=instance
            ).execute()

        print(f"⏹️  Stopped {len(instances)} dev instances (off-hours)")

# Schedule: Cloud Scheduler runs hourly

Savings:

Dev environment: $1,500/month (24/7)

With shutdown:
  Weekdays: 9 AM - 7 PM = 10 hours (42% uptime)
  Weekends: Off
  Monthly uptime: ~30%

Cost: $1,500 × 0.30 = $450/month

Savings: $1,050/month (70%) 🎉

5. Strategy 3: Reserved Capacity

5.1. Concept

Pay upfront for 1-3 years → Get 30-50% discount

Good for: Predictable, steady workloads

Not for: Unpredictable/spiky workloads

5.2. BigQuery Flat-Rate Pricing

On-Demand (default):

  • $5 per TB scanned
  • Unpredictable monthly cost

Flat-Rate (reserved):

  • $2,000/month = 100 slots (fixed cost)
  • Unlimited queries

Break-even: 400TB scanned/month

Example:

Current usage: 600TB scanned/month
On-demand cost: 600TB × $5 = $3,000/month

Flat-rate: $2,000/month (100 slots)

Savings: $1,000/month (33%)

When to use:

  • ✅ Query > 400TB/month consistently
  • ✅ Predictable workloads
  • ❌ Variable workloads (some months 100TB, some 500TB)
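
A quick way to decide is to compare your monthly scan volume against the break-even point. A minimal sketch using the example prices above ($5/TB on-demand, $2,000/month for 100 slots); plug in your own pricing:

# Break-even check: on-demand vs flat-rate, using the example prices above.
ON_DEMAND_PER_TB = 5.0       # USD per TB scanned (on-demand)
FLAT_RATE_MONTHLY = 2000.0   # USD per month for 100 slots (example figure)

def cheaper_option(tb_scanned_per_month: float) -> str:
    on_demand = tb_scanned_per_month * ON_DEMAND_PER_TB
    if on_demand > FLAT_RATE_MONTHLY:
        return f"Flat-rate: saves ${on_demand - FLAT_RATE_MONTHLY:,.0f}/month"
    return f"On-demand: saves ${FLAT_RATE_MONTHLY - on_demand:,.0f}/month"

print(cheaper_option(600))  # Flat-rate: saves $1,000/month
print(cheaper_option(300))  # On-demand: saves $500/month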

5.3. AWS Reserved Instances

Redshift Reserved Nodes:

On-Demand ra3.4xlarge:
  $3.26/hour = $2,347/month

Reserved (1-year, no upfront):
  $2.27/hour = $1,634/month
  Savings: $713/month (30%)

Reserved (3-year, all upfront):
  $1.53/hour = $1,102/month
  Savings: $1,245/month (53%)

Risk: Locked in for 1-3 years (can't cancel)


6. Strategy 4: Data Lifecycle Management

6.1. Delete Unnecessary Data

Common waste:

❌ Raw data kept forever (after transformed to warehouse)
❌ Old logs (> 1 year, rarely accessed)
❌ Test data in production
❌ Duplicate backups
❌ Abandoned datasets (project ended)

Lifecycle policy example:

# Data retention policy
datasets:
  - name: raw_events
    retention: 90_days  # Delete after transformed
    reason: "Processed into analytics.events"

  - name: application_logs
    retention: 1_year
    archive_after: 90_days  # Move to archive tier
    reason: "Compliance requires 1 year"

  - name: analytics_*
    retention: 7_years  # Tax compliance
    archive_after: 2_years

  - name: ml_experiments_*
    retention: 90_days
    reason: "Temporary experiments, keep best models only"

Implementation (automated cleanup):

-- BigQuery: Delete old raw data
DELETE FROM `project.raw.events`
WHERE DATE(ingested_at) < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);

-- Delete old logs
DELETE FROM `project.logs.application_logs`
WHERE DATE(timestamp) < DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY);
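
A minimal sketch of how a policy file like the YAML above could drive the cleanup automatically. The file name, the retention-to-days mapping, the dataset-to-table mapping, and the ingested_at column are assumptions for illustration:

# Hypothetical automation: read retention_policy.yaml (the YAML above) and apply deletes.
import yaml
from google.cloud import bigquery

RETENTION_DAYS = {"90_days": 90, "1_year": 365, "7_years": 2555}  # assumed mapping

client = bigquery.Client()

with open("retention_policy.yaml") as f:
    policy = yaml.safe_load(f)

for ds in policy["datasets"]:
    if "*" in ds["name"]:
        continue  # wildcard entries need table enumeration; skipped in this sketch
    days = RETENTION_DAYS[ds["retention"]]
    table = f"project.raw.{ds['name']}"  # hypothetical name -> table mapping
    client.query(
        f"DELETE FROM `{table}` "
        f"WHERE DATE(ingested_at) < DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)"
    ).result()
    print(f"Applied {days}-day retention to {table}")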

Savings example:

Before: 200TB total storage
  - 50TB raw (should be deleted after 90 days)
  - 100TB analytics (needed)
  - 50TB old logs (> 1 year)

After cleanup: 100TB
  Storage cost: 200TB × $20/TB = $4,000/month
                → 100TB × $20/TB = $2,000/month

Savings: $2,000/month (50%)

6.2. Backup Optimization

Problem: Daily full backups forever

Solution: Incremental backups + retention policy

Backup strategy:
  - Daily incremental: 7 days (delete after)
  - Weekly full: 4 weeks
  - Monthly full: 12 months
  - Yearly full: 7 years (compliance)

Before (daily full forever):
  365 days × 5TB = 1,825TB backups
  Cost: $36,500/month

After (optimized):
  Daily (7): 7 × 0.5TB (incremental) = 3.5TB
  Weekly (4): 4 × 5TB = 20TB
  Monthly (12): 12 × 5TB = 60TB
  Yearly (7): 7 × 5TB = 35TB
  Total: 118.5TB

Cost: $2,370/month

Savings: $34,130/month (93%!) 🎉
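
The arithmetic behind those numbers, as a small sketch you can adapt (assumes a 5TB full backup, ~0.5TB incrementals, and $20/TB-month storage, the figures used above):

# Estimate retained backup volume and cost under the schedule above.
FULL_TB, INCR_TB, PRICE_PER_TB = 5.0, 0.5, 20.0

retained = {
    "daily incrementals (7)": 7 * INCR_TB,
    "weekly fulls (4)":       4 * FULL_TB,
    "monthly fulls (12)":    12 * FULL_TB,
    "yearly fulls (7)":       7 * FULL_TB,
}

total_tb = sum(retained.values())
print(f"Total retained: {total_tb} TB")                    # 118.5 TB
print(f"Monthly cost:   ${total_tb * PRICE_PER_TB:,.0f}")  # $2,370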

7. Strategy 5: Monitoring & Alerting

7.1. Set Budgets

Prevent surprise bills:

Monthly budget: $10,000

Alerts:
  - 50% spent ($5,000): Email to team
  - 75% spent ($7,500): Email to manager + Slack alert
  - 90% spent ($9,000): Email to CTO + block non-critical spending
  - 100% spent: Hard stop (prevent overages)

GCP Budget Alert (setup):

gcloud billing budgets create \
    --billing-account=BILLING_ACCOUNT_ID \
    --display-name="Data Platform Budget" \
    --budget-amount=10000 \
    --threshold-rule=percent=0.5 \
    --threshold-rule=percent=0.75 \
    --threshold-rule=percent=0.9 \
    --threshold-rule=percent=1.0,basis=forecasted-spend

7.2. Anomaly Detection

Detect cost spikes before they become problems:

# Daily cost monitoring
import pandas as pd
from google.cloud import bigquery

def alert(message: str) -> None:
    """Placeholder notifier: replace with your Slack/email integration."""
    print(message)

def detect_cost_anomalies():
    client = bigquery.Client()

    # Get daily costs (last 30 days)
    query = """
        SELECT
            DATE(usage_start_time) as date,
            SUM(cost) as daily_cost
        FROM `project.billing.gcp_billing_export`
        WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        GROUP BY 1
        ORDER BY 1
    """

    df = client.query(query).to_dataframe()

    # Calculate moving average
    df['avg_cost'] = df['daily_cost'].rolling(window=7).mean()
    df['std_cost'] = df['daily_cost'].rolling(window=7).std()

    # Detect anomalies (> 2 standard deviations)
    latest = df.iloc[-1]
    threshold = latest['avg_cost'] + 2 * latest['std_cost']

    if latest['daily_cost'] > threshold:
        alert(f"""
        🚨 Cost Anomaly Detected!

        Today's cost: ${latest['daily_cost']:.2f}
        7-day average: ${latest['avg_cost']:.2f}
        Threshold: ${threshold:.2f}

        Spike: {((latest['daily_cost'] / latest['avg_cost']) - 1) * 100:.0f}% above normal

        Action: Investigate immediately!
        """)

# Run daily via Cloud Scheduler
detect_cost_anomalies()

7.3. Showback/Chargeback

Make teams accountable for their costs:

Showback: Show each team their usage (informational)

Chargeback: Charge each team's budget (enforced)

Implementation:

-- Tag all resources with labels
-- Example: BigQuery table labels
ALTER TABLE `project.dataset.analytics_table`
SET OPTIONS (
  labels=[("team", "marketing"), ("project", "campaign-analysis")]
);

-- Monthly cost report per team
SELECT
  label.value AS team,
  SUM(cost) AS monthly_cost,
  SUM(cost) / (
    SELECT SUM(cost)
    FROM `project.billing.gcp_billing_export`
    WHERE DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
  ) * 100 AS percent_of_total
FROM `project.billing.gcp_billing_export`,
UNNEST(labels) AS label
WHERE label.key = 'team'
  AND DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
GROUP BY 1
ORDER BY 2 DESC;

-- Output:
-- team          monthly_cost   percent
-- marketing     $4,500         45%
-- data-science  $3,000         30%
-- analytics     $2,500         25%

Result: Teams become cost-conscious when they see their own bills.


8. Case Studies

8.1. Vietnamese Startup: BigQuery Cost Cut 60%

Company: SaaS platform (10 employees, Series A)

Before:

BigQuery costs: $15,000/month
  - 3PB scanned/month (analysts running exploratory queries)
  - No partitions, no clustering
  - SELECT * everywhere
  - Dashboard queries running every 5 minutes (uncached)

Problems identified:

  1. No partitioning: Every query scans full table
  2. SELECT * everywhere: Scanning 50 columns when only 5 are needed
  3. Dashboard over-querying: Same query 12 times/hour
  4. No materialization: Complex aggregations recomputed each time

Actions (3-month optimization project):

Month 1: Partition + Cluster

-- Recreate tables with partitions
CREATE TABLE `analytics.events_optimized`
PARTITION BY DATE(timestamp)
CLUSTER BY user_id, event_name
AS SELECT * FROM `analytics.events`;

-- Replace old table
DROP TABLE `analytics.events`;
ALTER TABLE `analytics.events_optimized` RENAME TO events;

Result: Query costs → $10,000/month (33% reduction)

Month 2: Query Optimization

  • Replaced SELECT * with specific columns
  • Added WHERE clauses to prune partitions
  • Used LIMIT for exploratory queries

Result: Query costs → $7,000/month (additional 30%)

Month 3: Materialization + Caching

  • Materialized daily/weekly aggregates
  • Increased dashboard cache TTL to 1 hour (from 5 min)
  • Scheduled expensive reports (vs on-demand)

Result: Query costs → $6,000/month

Final savings: $15,000 → $6,000 = 60% reduction ($9,000/month = $108K/year)

Effort: 1 data engineer, 25% time over 3 months

ROI: $108K savings / $15K effort = 7x return

8.2. Enterprise: AWS Data Platform 53% Reduction

Company: Manufacturing (500 employees, $100M revenue)

Before:

AWS Data Platform: $120,000/month breakdown:
  - Redshift (24/7, 10-node ra3.4xlarge): $23,470/month
  - S3 (500TB, all standard tier): $11,500/month
  - Glue ETL (on-demand): $8,500/month
  - EMR (3 clusters, 24/7): $42,000/month
  - RDS (over-provisioned): $18,000/month
  - Data transfer: $10,000/month
  - Other: $6,530/month

Optimization (6-month project):

1. Reserved Instances (Month 1-2)

  • Redshift: 3-year reserved nodes → $15,500/month (save $7,970)
  • RDS: 1-year reserved → $10,800/month (save $7,200)

2. Redshift Pause/Resume (Month 2)

  • Pause 10 PM - 6 AM (8 hours) = 67% uptime
  • Cost: $15,500 × 0.67 = $10,385/month (save $5,115)

3. S3 Lifecycle (Month 3)

  • 200TB → Glacier after 90 days
  • 150TB → Deep Archive after 1 year
  • Storage: $11,500 → $4,200/month (save $7,300)

4. EMR Right-Sizing + Spot (Month 4-5)

  • Reduced to 1 cluster (consolidated workloads)
  • 80% spot instances
  • Auto-shutdown off-hours
  • Cost: $42,000 → $12,000/month (save $30,000)

5. Data Transfer Optimization (Month 6)

  • Use VPC endpoints (avoid internet egress)
  • Consolidate cross-region data
  • Cost: $10,000 → $4,000/month (save $6,000)

Final costs: $56,415/month

Total savings: $120,000 → $56,415 = 53% reduction ($63,585/month ≈ $763K/year!)

Effort: 2 engineers, 6 months

ROI: $763K savings / $180K effort ≈ 4.2x return


9. Cost Optimization Checklist (30 Items)

Storage (8 items)

  • Use tiered storage (hot/cool/archive) based on access patterns
  • Implement lifecycle policies (auto-move old data)
  • Compress data (Parquet/ORC instead of CSV/JSON)
  • Partition large tables by date
  • Delete unnecessary raw data after transformation
  • Deduplicate data across systems
  • Optimize backup retention (not forever)
  • Use incremental backups (not daily full)

Compute (12 items)

  • Right-size instances (monitor actual usage)
  • Use spot/preemptible instances for batch jobs
  • Auto-scale down during off-hours
  • Pause dev/staging environments nights + weekends
  • Avoid SELECT * (query only needed columns)
  • Use partitions and clustering
  • Materialize expensive aggregations
  • Leverage query result caching
  • Set query timeouts (kill runaway queries)
  • Limit exploratory queries (use LIMIT)
  • Scheduled reports (not on-demand)
  • Review and kill zombie jobs

Reserved Capacity (3 items)

  • Analyze usage patterns (predictable workloads?)
  • Purchase reserved instances for steady workloads
  • Consider flat-rate pricing for high query volumes

Data Lifecycle (3 items)

  • Define retention policies per dataset
  • Automate data deletion (don't keep forever)
  • Archive cold data to cheaper tiers

Monitoring (4 items)

  • Set budgets with alerts (50%, 75%, 90%, 100%)
  • Implement anomaly detection (catch spikes early)
  • Tag all resources (team, project, environment)
  • Showback/chargeback to teams (accountability)

10. Tools for Cost Management

10.1. Cloud-Native Tools

GCP:

  • Billing Reports: Visualize costs by service, project, label
  • Budgets & Alerts: Email/Slack when threshold reached
  • Recommender: AI suggests optimizations (right-sizing, etc.)
  • Committed Use Discounts: Reserved capacity

AWS:

  • Cost Explorer: Analyze historical costs, forecast
  • Budgets: Set budget alerts
  • Trusted Advisor: Optimization recommendations
  • Reserved Instance Recommendations: Purchase suggestions

Azure:

  • Cost Management + Billing: Dashboards, alerts
  • Advisor: Cost optimization suggestions
  • Reservations: Reserved capacity discounts

10.2. Third-Party Tools

CloudHealth (VMware) - $$$

  • Multi-cloud cost management
  • Detailed analytics, showback/chargeback
  • Policy enforcement
  • Enterprise-focused

Cloudability (Apptio) - $$$

  • FinOps platform
  • Cost allocation, forecasting
  • Anomaly detection
  • Good for large enterprises

Kubecost (Open-source, free for single cluster)

  • Kubernetes cost monitoring
  • Per-namespace, per-pod costs
  • Good for containerized data platforms

Conclusion

Cloud cost optimization is not a one-time exercise - it's a continuous process.

Key Takeaways:

  1. Compute is biggest cost (60-70%) → Focus query optimization first
  2. Storage tiering saves 40-60% (lifecycle policies)
  3. Right-sizing saves 50-80% (monitor actual usage)
  4. Spot instances save 60-80% for batch jobs
  5. Reserved capacity saves 30-50% for predictable workloads
  6. Query optimization = biggest impact (SELECT *, partitions, materialization)
  7. Data lifecycle = delete unnecessary data (50% storage savings)
  8. Monitoring = prevent waste (budgets, anomaly detection, showback)

Implementation Priority:

Week 1-2: Quick wins

  • Set budgets + alerts
  • Identify top cost drivers
  • Delete obvious waste (old backups, zombie resources)

Month 1: Storage optimization

  • Implement lifecycle policies
  • Compress data (Parquet)
  • Delete unnecessary data

Month 2-3: Compute optimization

  • Query optimization (partitions, SELECT specific columns)
  • Materialize expensive queries
  • Right-size instances

Month 4+: Advanced

  • Reserved capacity (if predictable workloads)
  • Spot instances for batch
  • Showback/chargeback culture

Expected savings: 40-60% reduction in 3-6 months

ROI: Optimization effort pays back in 1-2 months

Next Steps:

  • ✅ Run cost analysis (where is money going?)
  • ✅ Set budget alerts TODAY (prevent surprise bills)
  • ✅ Implement 5 quick wins from checklist
  • ✅ Read Cloud Architecture for platform selection
  • ✅ Read Query Optimization for BigQuery-specific tactics

Need help? Carptech provides Cloud FinOps consulting - we've helped 10+ Vietnamese companies reduce cloud costs by 40-60%. Book a consultation to discuss your cost optimization strategy.

