TL;DR
Cloud costs spiraling out of control is a common problem: Flexera reports that enterprises waste an estimated 30-40% of their cloud spend.
Data Platform cost breakdown:
- Compute: 60-70% (biggest cost) - Queries, transformations, Spark jobs
- Storage: 10-20% (cheapest) - Data lake, warehouse storage
- Networking: 10-15% - Data transfer, egress
- Licenses: 5-10% - Third-party tools (BI, ETL)
5 Optimization Strategies:
1. Storage Optimization (Save 40-60%)
   - Use tiered storage (hot/cool/archive)
   - Lifecycle policies (auto-move old data)
   - Compression (Parquet vs CSV = 10x smaller)
   - Deduplication
2. Compute Optimization (Save 50-80%)
   - Right-sizing (don't over-provision)
   - Spot/Preemptible instances (60-80% cheaper for batch)
   - Auto-scaling (scale down when idle)
   - Query optimization (avoid SELECT *, use partitions)
   - Scheduled shutdowns (dev/staging off nights/weekends)
3. Reserved Capacity (Save 30-50%)
   - Commit 1-3 years for discounts
   - Good for predictable workloads
4. Data Lifecycle (Save 20-40%)
   - Delete unnecessary data (raw after transformed, old logs)
   - Retention policies
5. Monitoring & Alerts (Prevent waste)
   - Set budgets, anomaly detection
   - Showback/chargeback (teams accountable)
Case Studies:
- Startup: BigQuery $15K → $6K/month (60% reduction via query optimization)
- Enterprise: AWS Data Platform $120K → $50K/month (58% reduction via reserved instances + right-sizing)
ROI: Cost optimization effort pays for itself within 1-2 months
This post walks you through 30+ tactics to reduce cloud data platform costs by 40-60%.
1. The Cloud Cost Problem
1.1. Why Cloud Costs Spiral
A familiar scenario:
Month 1: "Cloud is cheap! $500/month for unlimited scale"
Month 3: "$2,000/month, but we're growing fast"
Month 6: "$8,000/month... wait, what happened?"
Month 12: "$25,000/month 😱 CFO demanding explanations"
Root causes:
1. No visibility
- Developers spin up resources freely
- No cost attribution (which team/project?)
- Discover waste months later (too late)
2. Over-provisioning
- "Better safe than sorry" → provision 10x needed
- Example: 16-core instance when 4-core sufficient
3. Zombie resources
- Dev/test environments running 24/7
- Unused snapshots, old backups
- Forgotten resources after project ends
4. Inefficient queries
- SELECT * FROM large_table (scan entire table)
- Missing partitions
- Repeated expensive queries (should be cached/materialized)
5. Poor data hygiene
- Keep all data forever (no lifecycle)
- Duplicate data across systems
- Storing unnecessary raw data
1.2. Real Cost Example
Vietnamese startup (before optimization):
BigQuery:
Query costs: $12,000/month (1.2PB scanned)
Storage: $2,000/month (100TB active)
Cloud Storage:
Storage: $2,300/month (100TB standard tier)
Dataflow:
ETL jobs: $3,500/month
Cloud Composer:
Airflow: $500/month
Total: $20,300/month = 466M VND/month 💸
After 3 months optimization: $8,200/month (60% reduction!)
2. Cost Breakdown: Know Where Money Goes
2.1. Typical Data Platform Costs
Percentage breakdown:
┌────────────────────────────────────┐
│ Compute (Queries, ETL): 60% │ ← Biggest cost
│ Storage (Lake, Warehouse): 20% │
│ Networking (Egress): 10% │
│ Licenses (BI tools, etc): 10% │
└────────────────────────────────────┘
Dollar breakdown (example $10K/month platform):
Compute: $6,000
- BigQuery queries: $4,000
- Dataflow jobs: $1,500
- Cloud Functions: $500
Storage: $2,000
- BigQuery storage: $800
- Cloud Storage (data lake): $1,000
- Backups: $200
Networking: $1,000
- Data egress to internet: $600
- Cross-region transfer: $400
Licenses: $1,000
- Looker (5 users): $500
- Fivetran: $500
Optimization priority: Focus on Compute first (60% of costs, biggest savings potential).
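To get the same breakdown for your own platform, group the billing export by service. A minimal sketch using the BigQuery Python client, assuming billing export is enabled and lands in the project.billing.gcp_billing_export table referenced later in this post:
# Month-to-date cost per service from the GCP billing export (illustrative table name)
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      service.description AS service,
      ROUND(SUM(cost), 2) AS monthly_cost
    FROM `project.billing.gcp_billing_export`
    WHERE DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
    GROUP BY 1
    ORDER BY 2 DESC
"""
for row in client.query(query):
    print(f"{row.service:<30} ${row.monthly_cost:,.2f}")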
3. Strategy 1: Storage Optimization
3.1. Storage Tiering
Problem: All data stored in expensive "hot" tier
Solution: Use appropriate tier per access frequency
GCP Cloud Storage tiers:
| Tier | Use Case | Price/GB/month | Example |
|---|---|---|---|
| Standard | Frequent access (< 30 days) | $0.020 | Active analytics data |
| Nearline | Monthly access (30-90 days) | $0.010 | Recent backups |
| Coldline | Quarterly access (90-365 days) | $0.004 | Compliance archives |
| Archive | Yearly access (365+ days) | $0.0012 | Long-term retention |
Savings example (100TB data):
Before (all Standard):
100TB × $0.020 = $2,000/month
After (tiered):
20TB Standard (active): 20TB × $0.020 = $400
30TB Nearline (30d): 30TB × $0.010 = $300
50TB Coldline (90d+): 50TB × $0.004 = $200
Total: $900/month
Savings: $1,100/month (55%) 🎉
Implementation (lifecycle policy):
# GCS lifecycle policy (final Delete rule: 2,555 days ≈ 7 years)
cat > lifecycle.json <<EOF
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 30}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 90}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
        "condition": {"age": 365}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 2555}
      }
    ]
  }
}
EOF

gsutil lifecycle set lifecycle.json gs://my-data-bucket/
AWS S3 equivalent:
| Tier | Price/GB/month |
|---|---|
| S3 Standard | $0.023 |
| S3 Infrequent Access | $0.0125 |
| S3 Glacier | $0.004 |
| S3 Glacier Deep Archive | $0.00099 |
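On AWS, the equivalent tiering is applied with an S3 lifecycle configuration. A minimal sketch using boto3; the bucket name and day thresholds are illustrative and mirror the GCS policy above:
# Apply an S3 lifecycle policy: Standard -> IA -> Glacier -> Deep Archive -> delete
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 2555},  # ~7 years, same as the GCS policy
            }
        ]
    },
)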
3.2. Compression
Problem: Storing uncompressed data (CSV, JSON)
Solution: Use columnar compressed formats (Parquet, ORC)
Comparison:
Same 1TB dataset:
CSV (uncompressed):
Size: 1,000GB
Storage cost: $20/month
Query scan: Full 1TB
Parquet (compressed):
Size: 100GB (10x smaller!)
Storage cost: $2/month
Query scan: 100GB (columnar → read only needed columns)
Savings: $18/month storage + 90% query cost reduction
Implementation:
# Convert CSV to Parquet
import pandas as pd

# Read CSV
df = pd.read_csv('large_data.csv')

# Write Parquet (auto-compressed with Snappy)
df.to_parquet(
    'large_data.parquet',
    engine='pyarrow',
    compression='snappy',
    index=False
)

# Upload to cloud
# gsutil cp large_data.parquet gs://bucket/
BigQuery example:
-- Export table to Parquet (compressed)
EXPORT DATA OPTIONS(
uri='gs://bucket/data-*.parquet',
format='PARQUET',
compression='SNAPPY'
) AS
SELECT * FROM `project.dataset.large_table`;
3.3. Deduplication
Problem: Duplicate data across systems
Example:
- Raw data in S3: 50TB
- Staged data in warehouse: 50TB (duplicate!)
- Aggregated marts: 10TB
- Total: 110TB (should be 60TB)
Solution: Delete raw data after transformation (if not needed)
-- Identify duplicates
SELECT
data_source,
data_date,
COUNT(*) as copies,
SUM(size_gb) as total_gb
FROM data_inventory
GROUP BY 1, 2
HAVING COUNT(*) > 1
ORDER BY total_gb DESC;
-- Delete old raw data (after confirmed in warehouse)
DELETE FROM raw_data
WHERE load_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
AND status = 'processed';
4. Strategy 2: Compute Optimization
4.1. Right-Sizing
Problem: Over-provisioned instances
Example:
Dataflow job:
Current: 20 workers × n1-standard-8 (8 vCPUs each)
Cost: $3.20/hour × 20 = $64/hour
Actual usage: 30% CPU utilization
Right-sized: 8 workers × n1-standard-4
Cost: $1.60/hour × 8 = $12.80/hour
Savings: $51/hour = 80% reduction 🎉
How to right-size:
- Monitor actual usage (CPU, memory, disk)
- Start small, scale up if needed (easier than down)
- Use auto-scaling (cloud adjusts automatically)
GCP Compute Engine right-sizing recommendations:
# List current instances and their machine types
gcloud compute instances list --format="table(name, machineType, status)"

# Ask the machine-type recommender for right-sizing suggestions
gcloud recommender recommendations list \
  --project=my-project \
  --location=us-central1-a \
  --recommender=google.compute.instance.MachineTypeRecommender

# Example: "Switch from n1-standard-8 to n1-standard-4 (save $1,200/month)"
# The same recommendations appear in the GCP console on the VM instances page
4.2. Spot/Preemptible Instances
Concept: Cloud sells spare capacity at 60-80% discount
Trade-off: Can be terminated at any time (GCP preemptible VMs get about 30 seconds' notice; AWS Spot gives a 2-minute warning)
Perfect for: Batch jobs, ETL, ML training (can be retried)
Not for: Real-time APIs, databases (need reliability)
Savings:
Regular Dataflow job:
10 workers × $0.40/hour = $4/hour
Daily (4 hours): $16/day = $480/month
Preemptible Dataflow:
10 workers × $0.10/hour = $1/hour
Daily: $4/day = $120/month
Savings: $360/month (75%) 🎉
GCP Dataflow example:
# Launch Dataflow template job
gcloud dataflow jobs run my-etl-job \
  --gcs-location gs://templates/etl-template \
  --region us-central1 \
  --num-workers 10 \
  --max-workers 20 \
  --worker-machine-type n1-standard-4 \
  --worker-zone us-central1-a \
  --additional-experiments=use_runner_v2 \
  --additional-user-labels=cost-center=analytics

# Dataflow doesn't expose preemptible VMs as a direct flag; its discounted path is
# FlexRS (Flexible Resource Scheduling), which runs batch work on a mix of
# preemptible and regular workers. Enable it when building the pipeline, e.g. the
# Python SDK option --flexrs_goal=COST_OPTIMIZED (Java: --flexRSGoal=COST_OPTIMIZED)
AWS Spot Instances: Similar concept, 50-90% discount
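As an AWS sketch: an EMR cluster for a batch Spark job can run its core nodes on Spot capacity while keeping the master on-demand. Cluster name, release label, instance types, and IAM roles below are illustrative defaults:
# Launch a transient EMR cluster with Spot core nodes for a batch Spark job
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-etl-spot",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master",
                "Market": "ON_DEMAND",   # keep the master reliable
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "Market": "SPOT",        # 50-90% cheaper, may be reclaimed
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 4,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the job finishes
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster:", response["JobFlowId"])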
4.3. Auto-Scaling
Problem: Resources running 24/7 even when idle
Solution: Auto-scale down during low usage
Example - BigQuery:
BigQuery is already fully serverless (auto-scales automatically) ✅
Example - Redshift:
# Pause Redshift cluster during off-hours
import boto3
from datetime import datetime

redshift = boto3.client('redshift')

def pause_cluster_if_off_hours():
    hour = datetime.now().hour

    # Pause 10 PM - 6 AM
    if hour >= 22 or hour < 6:
        redshift.pause_cluster(ClusterIdentifier='my-cluster')
        print("⏸️ Cluster paused (off-hours)")
    else:
        # Resume if currently paused
        response = redshift.describe_clusters(ClusterIdentifier='my-cluster')
        if response['Clusters'][0]['ClusterStatus'] == 'paused':
            redshift.resume_cluster(ClusterIdentifier='my-cluster')
            print("▶️ Cluster resumed")

# Run via CloudWatch Events (cron)
pause_cluster_if_off_hours()
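The function above still needs a trigger. A sketch that creates an hourly EventBridge (CloudWatch Events) rule pointing at a Lambda that runs it; the Lambda ARN is a placeholder:
# Schedule the pause/resume Lambda to run hourly via EventBridge
import boto3

events = boto3.client("events")

events.put_rule(
    Name="redshift-off-hours-check",
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)
events.put_targets(
    Rule="redshift-off-hours-check",
    Targets=[{
        "Id": "pause-resume-lambda",
        # Placeholder ARN: point this at the Lambda wrapping pause_cluster_if_off_hours()
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:pause_cluster_if_off_hours",
    }],
)
# Note: the Lambda also needs a resource policy allowing events.amazonaws.com
# to invoke it (aws lambda add-permission)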
Savings:
Redshift cluster: $5,000/month (24/7)
With pause (8 hours/day off):
16 hours/day usage = 67% uptime
Cost: $5,000 × 0.67 = $3,350/month
Savings: $1,650/month (33%)
4.4. Query Optimization ⭐ Biggest Impact
Bad query:
-- 💸 Expensive: Scans entire 10TB table
SELECT *
FROM `project.dataset.large_table`
WHERE DATE(timestamp) = '2025-07-15'
LIMIT 100;
-- BigQuery scans: 10TB × $5/TB = $50 per query!
Optimized query:
-- ✅ Cheap: Uses partition, selects only needed columns
SELECT
user_id,
event_name,
timestamp
FROM `project.dataset.large_table`
WHERE DATE(timestamp) = '2025-07-15' -- Partition pruning
LIMIT 100;
-- BigQuery scans: 30GB (1 day partition) × $5/TB = $0.15
-- Savings: $49.85 per query (99.7% reduction!)
Optimization tactics:
1. Avoid SELECT *
-- ❌ Bad
SELECT * FROM orders; -- Scans all 50 columns
-- ✅ Good
SELECT order_id, customer_id, total FROM orders; -- Scans 3 columns only
2. Use Partitioning
-- Create partitioned table (BigQuery)
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(timestamp)
AS SELECT * FROM ...;
-- Queries auto-prune partitions
SELECT * FROM events
WHERE DATE(timestamp) = '2025-07-15' -- Scans 1 day only, not entire table
3. Use Clustering
-- Create clustered table (BigQuery)
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(timestamp)
CLUSTER BY user_id, event_name -- Sort by these columns
AS SELECT * FROM ...;
-- Queries skip irrelevant data blocks
SELECT * FROM events
WHERE user_id = 12345 -- Scans only blocks with user_id=12345
4. Materialize Expensive Queries
-- ❌ Expensive query (runs daily, scans 5TB each time)
SELECT
DATE(timestamp) as date,
COUNT(DISTINCT user_id) as dau
FROM events
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1;
-- Cost: 5TB × $5 × 30 days = $750/month
-- ✅ Materialize as table (run once, query table after)
CREATE TABLE `analytics.daily_active_users` AS
SELECT
DATE(timestamp) as date,
COUNT(DISTINCT user_id) as dau
FROM events
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1;
-- Future queries: scan tiny table (1KB), not 5TB
SELECT * FROM `analytics.daily_active_users`;
-- Cost: ~$25 one-time materialization (5TB scan) + fractions of a cent per query
-- Savings: ~$725/month (~97%)
5. Use Query Results Caching
BigQuery automatically caches query results for 24 hours (free!):
-- First run: Scans 1TB, costs $5
SELECT COUNT(*) FROM large_table WHERE date = '2025-07-15';
-- Second run (within 24h): Uses cache, costs $0 ✅
SELECT COUNT(*) FROM large_table WHERE date = '2025-07-15';
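One more cheap guardrail (it also shows up in the checklist below): cap how much a single query may scan, so a runaway SELECT * fails fast instead of billing $50. A minimal sketch with the BigQuery Python client; the 100GB cap is an arbitrary example:
# Fail any query that would scan more than 100GB instead of paying for it
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    maximum_bytes_billed=100 * 1024**3,  # 100GB cap; the job errors if exceeded
    use_query_cache=True,                # reuse the free 24h result cache
)

query = """
    SELECT user_id, event_name, timestamp
    FROM `project.dataset.large_table`
    WHERE DATE(timestamp) = '2025-07-15'
    LIMIT 100
"""
rows = client.query(query, job_config=job_config).result()
print(f"Returned {rows.total_rows} rows")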
4.5. Scheduled Shutdowns
Problem: Dev/test/staging environments running 24/7
Solution: Auto-shutdown nights + weekends
# Cloud Function: shut down dev resources after hours
import googleapiclient.discovery
from datetime import datetime

def shutdown_dev_resources(event, context):
    now = datetime.now()
    hour = now.hour
    is_weekend = now.weekday() >= 5  # Saturday=5, Sunday=6

    # Shutdown conditions: evenings, early mornings, weekends
    if hour >= 19 or hour < 8 or is_weekend:
        # Stop dev instances
        compute = googleapiclient.discovery.build('compute', 'v1')
        project = 'my-project'
        zone = 'us-central1-a'
        instances = ['dev-server-1', 'dev-server-2', 'staging-db']

        for instance in instances:
            compute.instances().stop(
                project=project,
                zone=zone,
                instance=instance
            ).execute()

        print(f"⏹️ Stopped {len(instances)} dev instances (off-hours)")

# Schedule: Cloud Scheduler runs this hourly
Savings:
Dev environment: $1,500/month (24/7)
With shutdown:
Weekdays: 9 AM - 7 PM = 10 hours (42% uptime)
Weekends: Off
Monthly uptime: ~30%
Cost: $1,500 × 0.30 = $450/month
Savings: $1,050/month (70%) 🎉
5. Strategy 3: Reserved Capacity
5.1. Concept
Pay upfront for 1-3 years → Get 30-50% discount
Good for: Predictable, steady workloads
Not for: Unpredictable/spiky workloads
5.2. BigQuery Flat-Rate Pricing
On-Demand (default):
- $5 per TB scanned
- Unpredictable monthly cost
Flat-Rate (reserved):
- $2,000/month = 100 slots (fixed cost)
- Unlimited queries
Break-even: 400TB scanned/month
Example:
Current usage: 600TB scanned/month
On-demand cost: 600TB × $5 = $3,000/month
Flat-rate: $2,000/month (100 slots)
Savings: $1,000/month (33%)
When to use:
- ✅ Query > 400TB/month consistently
- ✅ Predictable workloads
- ❌ Variable workloads (some months 100TB, some 500TB)
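To sanity-check the decision for your own volumes, here is a tiny calculator using the article's assumed prices ($5/TB on-demand, $2,000 per 100 slots); swap in current list prices before deciding:
# Compare BigQuery on-demand vs flat-rate for a given monthly scan volume
ON_DEMAND_PER_TB = 5.0        # $/TB scanned (article's assumption)
FLAT_RATE_100_SLOTS = 2000.0  # $/month for 100 slots (article's assumption)

def compare(tb_scanned_per_month: float) -> None:
    on_demand = tb_scanned_per_month * ON_DEMAND_PER_TB
    breakeven_tb = FLAT_RATE_100_SLOTS / ON_DEMAND_PER_TB
    cheaper = "flat-rate" if on_demand > FLAT_RATE_100_SLOTS else "on-demand"
    print(f"{tb_scanned_per_month:.0f} TB/month: on-demand ${on_demand:,.0f} "
          f"vs flat-rate ${FLAT_RATE_100_SLOTS:,.0f} "
          f"(break-even {breakeven_tb:.0f} TB) -> {cheaper} is cheaper")

compare(600)  # the example above: on-demand $3,000 vs flat-rate $2,000
compare(150)  # a lighter month: on-demand $750 -> stay on-demand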
5.3. AWS Reserved Instances
Redshift Reserved Nodes:
On-Demand ra3.4xlarge:
$3.26/hour = $2,347/month
Reserved (1-year, no upfront):
$2.27/hour = $1,634/month
Savings: $713/month (30%)
Reserved (3-year, all upfront):
$1.53/hour = $1,102/month
Savings: $1,245/month (53%)
Risk: Locked in for 1-3 years (can't cancel)
6. Strategy 4: Data Lifecycle Management
6.1. Delete Unnecessary Data
Common waste:
❌ Raw data kept forever (after transformed to warehouse)
❌ Old logs (> 1 year, rarely accessed)
❌ Test data in production
❌ Duplicate backups
❌ Abandoned datasets (project ended)
Lifecycle policy example:
# Data retention policy
datasets:
  - name: raw_events
    retention: 90_days        # Delete after transformed
    reason: "Processed into analytics.events"

  - name: application_logs
    retention: 1_year
    archive_after: 90_days    # Move to archive tier
    reason: "Compliance requires 1 year"

  - name: analytics_*
    retention: 7_years        # Tax compliance
    archive_after: 2_years

  - name: ml_experiments_*
    retention: 90_days
    reason: "Temporary experiments, keep best models only"
Implementation (automated cleanup):
-- BigQuery: Delete old raw data
DELETE FROM `project.raw.events`
WHERE DATE(ingested_at) < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);
-- Delete old logs
DELETE FROM `project.logs.application_logs`
WHERE DATE(timestamp) < DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY);
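Instead of scheduling DELETE statements, partitioned BigQuery tables can expire old partitions automatically. A sketch that sets a 90-day partition expiration via DDL from the Python client, reusing the table names above:
# Set a 90-day partition expiration so old raw partitions drop themselves
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    ALTER TABLE `project.raw.events`
    SET OPTIONS (partition_expiration_days = 90)
""").result()
print("Partition expiration set: raw partitions older than 90 days will be deleted")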
Savings example:
Before: 200TB total storage
- 50TB raw (should be deleted after 90 days)
- 100TB analytics (needed)
- 50TB old logs (> 1 year)
After cleanup: 100TB
Storage cost: 200TB × $20/TB = $4,000/month
→ 100TB × $20/TB = $2,000/month
Savings: $2,000/month (50%)
6.2. Backup Optimization
Problem: Daily full backups forever
Solution: Incremental backups + retention policy
Backup strategy:
- Daily incremental: 7 days (delete after)
- Weekly full: 4 weeks
- Monthly full: 12 months
- Yearly full: 7 years (compliance)
Before (daily full forever):
365 days × 5TB = 1,825TB backups
Cost: $36,500/month
After (optimized):
Daily (7): 7 × 0.5TB (incremental) = 3.5TB
Weekly (4): 4 × 5TB = 20TB
Monthly (12): 12 × 5TB = 60TB
Yearly (7): 7 × 5TB = 35TB
Total: 118.5TB
Cost: $2,370/month
Savings: $34,130/month (93%!) 🎉
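Retention only saves money if old copies are actually deleted. A sketch that removes self-owned EBS snapshots older than the 7-day daily window; it assumes daily snapshots carry an illustrative backup-tier=daily tag so weekly/monthly fulls are untouched:
# Delete self-owned daily snapshots older than 7 days
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

paginator = ec2.get_paginator("describe_snapshots")
pages = paginator.paginate(
    OwnerIds=["self"],
    # Illustrative tag convention: only touch the daily-incremental tier
    Filters=[{"Name": "tag:backup-tier", "Values": ["daily"]}],
)
for page in pages:
    for snap in page["Snapshots"]:
        if snap["StartTime"] < cutoff:
            print(f"Deleting {snap['SnapshotId']} ({snap['StartTime']:%Y-%m-%d})")
            ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])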
7. Strategy 5: Monitoring & Alerting
7.1. Set Budgets
Prevent surprise bills:
Monthly budget: $10,000
Alerts:
- 50% spent ($5,000): Email to team
- 75% spent ($7,500): Email to manager + Slack alert
- 90% spent ($9,000): Email to CTO + block non-critical spending
- 100% spent: Hard stop (prevent overages)
GCP Budget Alert (setup):
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="Data Platform Budget" \
  --budget-amount=10000USD \
  --threshold-rule=percent=0.50 \
  --threshold-rule=percent=0.75 \
  --threshold-rule=percent=0.90 \
  --threshold-rule=percent=1.0,basis=forecasted-spend
7.2. Anomaly Detection
Detect cost spikes before they become problems:
# Daily cost monitoring
import pandas as pd
from google.cloud import bigquery

def alert(message):
    # Placeholder: wire this to your email/Slack/PagerDuty notification hook
    print(message)

def detect_cost_anomalies():
    client = bigquery.Client()

    # Get daily costs (last 30 days) from the billing export table
    query = """
        SELECT
          DATE(usage_start_time) AS date,
          SUM(cost) AS daily_cost
        FROM `project.billing.gcp_billing_export`
        WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        GROUP BY 1
        ORDER BY 1
    """
    df = client.query(query).to_dataframe()

    # Calculate 7-day moving average and standard deviation
    df['avg_cost'] = df['daily_cost'].rolling(window=7).mean()
    df['std_cost'] = df['daily_cost'].rolling(window=7).std()

    # Detect anomalies (> 2 standard deviations above the moving average)
    latest = df.iloc[-1]
    threshold = latest['avg_cost'] + 2 * latest['std_cost']

    if latest['daily_cost'] > threshold:
        alert(f"""
        🚨 Cost Anomaly Detected!
        Today's cost: ${latest['daily_cost']:.2f}
        7-day average: ${latest['avg_cost']:.2f}
        Threshold: ${threshold:.2f}
        Spike: {((latest['daily_cost'] / latest['avg_cost']) - 1) * 100:.0f}% above normal
        Action: Investigate immediately!
        """)

# Run daily via Cloud Scheduler
detect_cost_anomalies()
7.3. Showback/Chargeback
Make teams accountable for their costs:
Showback: Show each team their usage (informational)
Chargeback: Charge each team's budget (enforced)
Implementation:
-- Tag all resources with labels
-- Example: BigQuery table labels
ALTER TABLE `project.dataset.analytics_table`
SET OPTIONS (
labels=[("team", "marketing"), ("project", "campaign-analysis")]
);
-- Monthly cost report per team
SELECT
  labels.value AS team,
  SUM(cost) AS monthly_cost,
  SUM(cost) / (
    SELECT SUM(cost)
    FROM `project.billing.gcp_billing_export`
    WHERE DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
  ) * 100 AS percent_of_total
FROM `project.billing.gcp_billing_export`,
  UNNEST(labels) AS labels
WHERE labels.key = 'team'
  AND DATE(usage_start_time) >= DATE_TRUNC(CURRENT_DATE(), MONTH)
GROUP BY 1
ORDER BY 2 DESC;
-- Output:
-- team monthly_cost percent
-- marketing $4,500 45%
-- data-science $3,000 30%
-- analytics $2,500 25%
Result: Teams become cost-conscious when they see their own bills.
8. Case Studies
8.1. Vietnamese Startup: BigQuery Cost Cut 60%
Company: SaaS platform (10 employees, Series A)
Before:
BigQuery costs: $15,000/month
- 3PB scanned/month (analysts running exploratory queries)
- No partitions, no clustering
- SELECT * everywhere
- Dashboard queries running every 5 minutes (uncached)
Problems identified:
- No partitioning: Every query scans full table
- SELECT * everywhere: scanning 50 columns when only 5 are needed
- Dashboard over-querying: Same query 12 times/hour
- No materialization: Complex aggregations recomputed each time
Actions (3-month optimization project):
Month 1: Partition + Cluster
-- Recreate tables with partitions
CREATE TABLE `analytics.events_optimized`
PARTITION BY DATE(timestamp)
CLUSTER BY user_id, event_name
AS SELECT * FROM `analytics.events`;
-- Replace old table
DROP TABLE `analytics.events`;
ALTER TABLE `analytics.events_optimized` RENAME TO events;
Result: Query costs → $10,000/month (33% reduction)
Month 2: Query Optimization
- Replaced SELECT * with specific columns
- Added WHERE clauses to prune partitions
- Used LIMIT for exploratory queries
Result: Query costs → $7,000/month (additional 30%)
Month 3: Materialization + Caching
- Materialized daily/weekly aggregates
- Increased dashboard cache TTL to 1 hour (from 5 min)
- Scheduled expensive reports (vs on-demand)
Result: Query costs → $6,000/month
Final savings: $15,000 → $6,000 = 60% reduction ($9,000/month = $108K/year)
Effort: 1 data engineer, 25% time over 3 months
ROI: $108K savings / $15K effort = 7x return
8.2. Enterprise: AWS Data Platform 58% Reduction
Company: Manufacturing (500 employees, $100M revenue)
Before:
AWS Data Platform: $120,000/month breakdown:
- Redshift (24/7, 10-node ra3.4xlarge): $23,470/month
- S3 (500TB, all standard tier): $11,500/month
- Glue ETL (on-demand): $8,500/month
- EMR (3 clusters, 24/7): $42,000/month
- RDS (over-provisioned): $18,000/month
- Data transfer: $10,000/month
- Other: $6,530/month
Optimization (6-month project):
1. Reserved Instances (Month 1-2)
- Redshift: 3-year reserved nodes → $15,500/month (save $7,970)
- RDS: 1-year reserved → $10,800/month (save $7,200)
2. Redshift Pause/Resume (Month 2)
- Pause 10 PM - 6 AM (8 hours) = 67% uptime
- Cost: $15,500 × 0.67 = $10,385/month (save $5,115)
3. S3 Lifecycle (Month 3)
- 200TB → Glacier after 90 days
- 150TB → Deep Archive after 1 year
- Storage: $11,500 → $4,200/month (save $7,300)
4. EMR Right-Sizing + Spot (Month 4-5)
- Reduced to 1 cluster (consolidated workloads)
- 80% spot instances
- Auto-shutdown off-hours
- Cost: $42,000 → $12,000/month (save $30,000)
5. Data Transfer Optimization (Month 6)
- Use VPC endpoints (avoid internet egress)
- Consolidate cross-region data
- Cost: $10,000 → $4,000/month (save $6,000)
Final costs: $50,585/month
Total savings: $120,000 → $50,585 = 58% reduction ($69,415/month = $833K/year!)
Effort: 2 engineers, 6 months
ROI: $833K savings / $180K effort = 4.6x return
9. Cost Optimization Checklist (30 Items)
Storage (8 items)
- Use tiered storage (hot/cool/archive) based on access patterns
- Implement lifecycle policies (auto-move old data)
- Compress data (Parquet/ORC instead of CSV/JSON)
- Partition large tables by date
- Delete unnecessary raw data after transformation
- Deduplicate data across systems
- Optimize backup retention (not forever)
- Use incremental backups (not daily full)
Compute (12 items)
- Right-size instances (monitor actual usage)
- Use spot/preemptible instances for batch jobs
- Auto-scale down during off-hours
- Pause dev/staging environments nights + weekends
- Avoid SELECT * (query only needed columns)
- Use partitions and clustering
- Materialize expensive aggregations
- Leverage query result caching
- Set query timeouts (kill runaway queries)
- Limit exploratory queries (use LIMIT)
- Scheduled reports (not on-demand)
- Review and kill zombie jobs
Reserved Capacity (3 items)
- Analyze usage patterns (predictable workloads?)
- Purchase reserved instances for steady workloads
- Consider flat-rate pricing for high query volumes
Data Lifecycle (3 items)
- Define retention policies per dataset
- Automate data deletion (don't keep forever)
- Archive cold data to cheaper tiers
Monitoring (4 items)
- Set budgets with alerts (50%, 75%, 90%, 100%)
- Implement anomaly detection (catch spikes early)
- Tag all resources (team, project, environment)
- Showback/chargeback to teams (accountability)
10. Tools for Cost Management
10.1. Cloud-Native Tools
GCP:
- Billing Reports: Visualize costs by service, project, label
- Budgets & Alerts: Email/Slack when threshold reached
- Recommender: AI suggests optimizations (right-sizing, etc.)
- Committed Use Discounts: Reserved capacity
AWS:
- Cost Explorer: Analyze historical costs, forecast
- Budgets: Set budget alerts
- Trusted Advisor: Optimization recommendations
- Reserved Instance Recommendations: Purchase suggestions
Azure:
- Cost Management + Billing: Dashboards, alerts
- Advisor: Cost optimization suggestions
- Reservations: Reserved capacity discounts
10.2. Third-Party Tools
CloudHealth (VMware) - $$$
- Multi-cloud cost management
- Detailed analytics, showback/chargeback
- Policy enforcement
- Enterprise-focused
Cloudability (Apptio) - $$$
- FinOps platform
- Cost allocation, forecasting
- Anomaly detection
- Good for large enterprises
Kubecost (Open-source, free for single cluster)
- Kubernetes cost monitoring
- Per-namespace, per-pod costs
- Good for containerized data platforms
Conclusion
Cloud cost optimization is not a one-time exercise; it's a continuous process.
Key Takeaways:
- Compute is the biggest cost (60-70%) → focus on query optimization first
- Storage tiering saves 40-60% (lifecycle policies)
- Right-sizing saves 50-80% (monitor actual usage)
- Spot instances save 60-80% for batch jobs
- Reserved capacity saves 30-50% for predictable workloads
- Query optimization = biggest impact (SELECT *, partitions, materialization)
- Data lifecycle = delete unnecessary data (50% storage savings)
- Monitoring = prevent waste (budgets, anomaly detection, showback)
Implementation Priority:
Week 1-2: Quick wins
- Set budgets + alerts
- Identify top cost drivers
- Delete obvious waste (old backups, zombie resources)
Month 1: Storage optimization
- Implement lifecycle policies
- Compress data (Parquet)
- Delete unnecessary data
Month 2-3: Compute optimization
- Query optimization (partitions, SELECT specific columns)
- Materialize expensive queries
- Right-size instances
Month 4+: Advanced
- Reserved capacity (if predictable workloads)
- Spot instances for batch
- Showback/chargeback culture
Expected savings: 40-60% reduction within 3-6 months
ROI: Optimization effort pays for itself within 1-2 months
Next Steps:
- ✅ Run cost analysis (where is money going?)
- ✅ Set budget alerts TODAY (prevent surprise bills)
- ✅ Implement 5 quick wins from checklist
- ✅ Read Cloud Architecture for platform selection
- ✅ Read Query Optimization for BigQuery-specific tactics
Need help? Carptech provides Cloud FinOps consulting - we've helped 10+ Vietnamese companies reduce cloud costs 40-60%. Book consultation to discuss your cost optimization strategy.
Related Posts:
- Cloud Data Platform Architecture: AWS vs GCP vs Azure
- Coming: Lakehouse Architecture (Post 33)
- Coming: Serverless Data Architecture (Post 34)
- Coming: Cloud Migration Guide (Post 35)