TL;DR
- Data Lakehouse is an architecture that combines the strengths of the Data Warehouse (ACID, performance) and the Data Lake (flexibility, cost)
- Table formats (Delta Lake, Iceberg, Hudi) fix the limitations of plain Parquet files: no ACID, painful updates/deletes, no schema enforcement
- Delta Lake (Databricks): the most widely adopted, deeply integrated with Databricks, great for Spark workloads
- Apache Iceberg (Netflix): engine-agnostic, smart partition evolution, great for multi-engine environments
- Apache Hudi (Uber): optimized for streaming, CDC, and near real-time ingestion
- How to choose: Delta if you are on Databricks, Iceberg if you want an open multi-engine stack, Hudi if your use case is streaming-heavy
- Value: 3-5x faster queries, 20-40% less storage, ACID on top of your data lake
Introduction: The Problem with Plain Data Lakes
Are you running a Data Lake on S3/GCS that stores Parquet files? Then you have probably hit these problems:
❌ No ACID transactions: two pipelines writing at the same time → corruption
❌ No updates/deletes: an entire partition must be rewritten to fix one record
❌ Slow queries: a query has to scan thousands of Parquet files
❌ Painful schema evolution: adding a new column means migrating hundreds of TB of data
❌ No time travel: no way to roll back to an older version
A real example from a Vietnamese e-commerce company:
Data Lake structure:
s3://data-lake/
orders/
year=2025/
month=01/
day=01/
part-00001.parquet (500 files per day)
part-00002.parquet
...
Problems they ran into:
- A customer requests data deletion (GDPR/PDPA) → hundreds of Parquet files have to be found and rewritten (sketched below)
- An order status update (e.g., "delivered") → a new record is appended, followed by complex dedup logic
- A "last 7 days of orders" query scans 3,500 files → slow and expensive
Table formats (Delta Lake, Iceberg, Hudi) solve these problems. They are a metadata layer on top of Parquet/ORC files that provides:
✅ ACID transactions
✅ Efficient updates/deletes
✅ Schema evolution
✅ Time travel
✅ Better query performance
In this post we compare the three leading table formats in detail so you can pick the right one for your use case.
Lakehouse Architecture Recap
Before diving into the table formats, a quick recap of what a Lakehouse is:
Data Warehouse:
- ✅ ACID transactions, data quality
- ✅ Fast queries
- ❌ Expensive (Snowflake, BigQuery billing)
- ❌ Inflexible (structured data only)
Data Lake:
- ✅ Cheap (S3 is only $0.023/GB/month)
- ✅ Flexible (structured, semi-structured, unstructured)
- ❌ No ACID
- ❌ Slow queries
Lakehouse = Warehouse performance + Lake cost:
Lakehouse Architecture:
┌─────────────────────────────────────────┐
│ Query Engines: Spark, Trino, Flink │
├─────────────────────────────────────────┤
│ Table Format: Delta/Iceberg/Hudi │ ← Metadata layer
├─────────────────────────────────────────┤
│ Storage: Parquet/ORC files on S3/GCS │
└─────────────────────────────────────────┘
What the table format layer does (a toy illustration follows this list):
- Track which files belong to table
- Store schema, partition info
- Provide ACID via optimistic concurrency control
- Enable time travel via versioning
- Optimize queries via file pruning
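To make the idea concrete, here is a deliberately simplified toy in plain Python (no real table format involved): a manifest records which files belong to the table together with per-file min/max statistics, and the reader prunes files before scanning. Real formats do the same job at far larger scale and add transactions and versioning on top.
# Toy "manifest": which files belong to the table, plus min/max stats per file.
# A real table format keeps this in its own metadata / transaction log.
manifest = {
    "schema": ["order_id", "order_date", "amount"],
    "files": [
        {"path": "part-00000.parquet", "order_date_min": "2025-01-01", "order_date_max": "2025-01-07"},
        {"path": "part-00001.parquet", "order_date_min": "2025-01-08", "order_date_max": "2025-01-14"},
        {"path": "part-00002.parquet", "order_date_min": "2025-01-15", "order_date_max": "2025-01-21"},
    ],
}

def prune_files(manifest, wanted_date):
    """Return only the files whose statistics say they could contain wanted_date."""
    return [
        f["path"]
        for f in manifest["files"]
        if f["order_date_min"] <= wanted_date <= f["order_date_max"]
    ]

# A query with WHERE order_date = '2025-01-15' only needs to scan one file.
print(prune_files(manifest, "2025-01-15"))  # ['part-00002.parquet']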
Now let's compare the three most popular table formats.
Delta Lake (Databricks)
Overview
Origin: Developed by Databricks, open-sourced in 2019 (Linux Foundation)
License: Apache 2.0
Best for: Databricks users, Spark-centric workloads
Architecture
Delta Lake Table:
data/
_delta_log/
00000000000000000000.json ← Transaction log (ACID)
00000000000000000001.json
00000000000000000010.checkpoint.parquet
part-00000-*.snappy.parquet ← Data files
part-00001-*.snappy.parquet
Delta Log (the transaction log):
- JSON files that record every change (add file, remove file, metadata change)
- Each transaction creates a new commit
- Checkpoint files (Parquet) every 10 commits to speed up reads (see the short example after this list)
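Because the log is plain JSON, you can inspect it directly. A minimal sketch, assuming an illustrative local path; each line of a commit file is one action such as add, remove, metaData, or commitInfo:
import json
import os

# Illustrative local path to a Delta table's transaction log.
delta_log_dir = "/data/orders/_delta_log"

for name in sorted(os.listdir(delta_log_dir)):
    if not name.endswith(".json"):
        continue  # skip checkpoint .parquet files and _last_checkpoint
    with open(os.path.join(delta_log_dir, name)) as f:
        for line in f:
            action = json.loads(line)
            if "add" in action:
                print(name, "ADD   ", action["add"]["path"])
            elif "remove" in action:
                print(name, "REMOVE", action["remove"]["path"])
            elif "metaData" in action:
                print(name, "SCHEMA / METADATA CHANGE")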
Key Features
1. ACID Transactions
# Concurrent writes do not conflict
# Writer 1
df1.write.format("delta").mode("append").save("/data/orders")
# Writer 2 (at the same time)
df2.write.format("delta").mode("append").save("/data/orders")
# Delta Lake ensures both succeed or both fail, no partial writes
2. Time Travel
# Read an older version
df = spark.read.format("delta").option("versionAsOf", 5).load("/data/orders")
# Or by timestamp
df = spark.read.format("delta").option("timestampAsOf", "2025-01-01").load("/data/orders")
# Restore to an older version
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/data/orders")
deltaTable.restoreToVersion(5)
Use case: roll back after loading bad data, audit historical changes.
3. Upserts (MERGE)
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/data/customers")
# Upsert: update existing, insert new
deltaTable.alias("target").merge(
updates.alias("source"),
"target.customer_id = source.customer_id"
).whenMatchedUpdate(set = {
"email": "source.email",
"updated_at": "source.updated_at"
}).whenNotMatchedInsert(values = {
"customer_id": "source.customer_id",
"email": "source.email",
"created_at": "source.created_at"
}).execute()
Performance: only the files that contain matched records are rewritten, not the whole table.
4. Schema Evolution
# Add a new column
df_with_new_column.write.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save("/data/orders")
# Delta Lake merges the schema automatically
5. Optimize & Z-Ordering
# Compact small files into larger ones
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/data/events")
deltaTable.optimize().executeCompaction()
# Z-Ordering: cluster data by the columns you filter on most
deltaTable.optimize().executeZOrderBy("user_id", "event_date")
Result: a query with "WHERE user_id = X" can run up to 10x faster.
6. Vacuum (Cleanup Old Files)
# Delete files older than 7 days that have been replaced by newer versions
deltaTable.vacuum(168)  # retention is specified in hours (168 = 7 days)
Pros & Cons
Pros ✅:
- Mature, production-proven (thousands of companies)
- Tight integration with Databricks (optimized)
- Rich features (Z-Order, Bloom filters, CDF)
- Strong community support
- Easy to get started
Cons ❌:
- Spark-centric (Trino and Flink support is still limited)
- Databricks-optimized (some features are only available on Databricks)
- Partitioning is not as smart as Iceberg's
When to Use Delta Lake
✅ Use it when:
- You are already on Databricks
- Your workloads are mostly Spark
- You need a mature ecosystem
- Your team is comfortable with the Spark APIs
❌ Avoid it when:
- You want multi-engine access (Trino, Flink, Presto)
- You want to avoid vendor lock-in (Databricks)
- You need advanced partitioning (hidden partitions)
Apache Iceberg (Netflix)
Overview
Origin: Developed by Netflix, donated to the Apache Foundation in 2018
License: Apache 2.0
Best for: Multi-engine environments, large-scale analytics
Architecture
Iceberg Table:
data/
metadata/
v1.metadata.json ← Current metadata
v2.metadata.json
snap-123.avro ← Snapshot files
manifest-list-456.avro ← Manifest lists
manifest-789.avro ← Manifests (file lists)
data/
part-00000.parquet ← Data files
part-00001.parquet
Three levels of metadata:
- Metadata file: Schema, partition spec, snapshots
- Manifest list: List of manifest files for a snapshot
- Manifest: List of data files with statistics
Why three levels? The design is optimized for very large tables (millions of files).
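You can look at these layers from Spark through Iceberg's metadata tables. A minimal sketch, assuming a catalog named iceberg_catalog and a db.orders table (names are illustrative):
# Snapshots: one row per committed table version.
spark.sql("SELECT snapshot_id, committed_at, operation FROM iceberg_catalog.db.orders.snapshots").show()
# Manifests: the manifest files behind the current snapshot.
spark.sql("SELECT path, added_data_files_count FROM iceberg_catalog.db.orders.manifests").show()
# Files: the data files themselves, with per-file statistics used for pruning.
spark.sql("SELECT file_path, record_count, file_size_in_bytes FROM iceberg_catalog.db.orders.files").show()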
Key Features
1. Hidden Partitioning
The problem with Hive-style partitioning:
-- Users have to know the partition structure
SELECT * FROM orders
WHERE year = 2025 AND month = 1 AND day = 15
AND order_date = '2025-01-15' -- Redundant!
Iceberg's hidden partitioning:
# Define partition transform
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestampType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform
schema = Schema(
NestedField(1, "order_id", StringType()),
NestedField(2, "order_date", TimestampType()),
)
# Partition by day(order_date) - hidden from users
partition_spec = PartitionSpec(
PartitionField(source_id=2, field_id=1000, transform=DayTransform(), name="order_day")
)
Query:
-- Users just filter on order_date; Iceberg maps the filter to the partitions
SELECT * FROM orders WHERE order_date = '2025-01-15'
Benefit: users don't need to know the partition structure, which avoids a whole class of errors.
2. Partition Evolution
Example: you partition a table by day(order_date); two years later the data has grown and you want to switch to month(order_date).
With Hive: you would have to rewrite the entire table (hundreds of TB).
With Iceberg:
from pyiceberg.transforms import MonthTransform
# Update the partition spec - no data rewrite needed!
table.update_spec() \
    .add_field("order_date", MonthTransform(), "order_month") \
    .commit()
Result: old data stays partitioned by day, new data is partitioned by month, and Iceberg's query planner knows how to read both.
3. Schema Evolution
from pyiceberg.table import Table
# Add column
table.update_schema() \
.add_column("customer_tier", StringType()) \
.commit()
# Rename column
table.update_schema() \
.rename_column("old_name", "new_name") \
.commit()
# Delete column
table.update_schema() \
.delete_column("unused_column") \
.commit()
Backward compatibility: old Parquet files that lack the new column simply return null from Iceberg.
4. Time Travel
# Read a specific snapshot by ID
df = spark.read.format("iceberg") \
.option("snapshot-id", 1234567890) \
.load("warehouse.db.orders")
# Or by timestamp (epoch milliseconds)
df = spark.read.format("iceberg") \
.option("as-of-timestamp", "1609459200000") \
.load("warehouse.db.orders")
5. Engine-Agnostic
Iceberg works with many engines:
# Spark (PySpark)
df = spark.read.format("iceberg").load("db.table")
-- Trino
SELECT * FROM iceberg.db.table;
-- Flink SQL
SELECT * FROM iceberg_catalog.db.table;
-- Hive (via the Iceberg storage handler)
SELECT * FROM db.table;
Benefit: you are not locked into a single engine.
Pros & Cons
Pros ✅:
- Truly open: not tied to a single vendor
- Multi-engine: Spark, Trino, Flink, Hive, Presto
- Advanced partitioning: Hidden partitions, partition evolution
- Scalable: Designed for petabyte-scale
- Active development: Apple, Netflix, LinkedIn contribute
Cons ❌:
- Newer than Delta (smaller ecosystem)
- More complex setup than Delta (requires a catalog such as Hive Metastore or Nessie)
- Fewer built-in features than Delta (no Z-Ordering or CDF out of the box)
When to Use Iceberg
✅ Use it when:
- You run a multi-engine environment (Spark + Trino + Flink)
- You want to avoid vendor lock-in
- You need partition evolution
- You have very large tables (billions of rows)
❌ Avoid it when:
- Your workload is Spark-only (Delta is simpler)
- You need advanced Spark optimizations (Z-Order)
- You have a small team and want an easy setup
Apache Hudi (Uber)
Overview
Origin: Developed by Uber, donated to the Apache Foundation in 2019
License: Apache 2.0
Best for: Streaming ingestion, CDC, near real-time analytics
Architecture
Hudi has two storage types:
1. Copy-on-Write (CoW):
- Data is stored in Parquet files
- Updates rewrite the entire Parquet file
- Fast reads (Parquet only)
- Slower writes (rewrite overhead)
2. Merge-on-Read (MoR):
- Base files (Parquet) + delta files (Avro logs)
- Updates append to delta logs
- Fast writes
- Reads must merge base + delta (slower; see the sketch after the layout below)
Hudi MoR Table:
data/
.hoodie/
hoodie.properties
00001.commit ← Timeline (commits log)
00001.inflight
partition1/
file1_base.parquet ← Base files
file1_delta.log ← Delta logs (updates)
partition2/
file2_base.parquet
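The CoW/MoR choice shows up directly in code. A minimal sketch with illustrative table name, keys, and paths: the table type is set at write time, and for MoR tables readers can choose between a snapshot view (base + logs, freshest) and a read-optimized view (base files only, faster).
# Write: pick CoW or MoR per table.
hudi_options = {
    'hoodie.table.name': 'orders',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',  # or 'COPY_ON_WRITE'
    'hoodie.datasource.write.operation': 'upsert',
}
df.write.format("hudi").options(**hudi_options).mode("append").save("/data/orders")

# Read a MoR table two ways:
# "snapshot" merges base files + delta logs (freshest data, slower reads),
snapshot_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "snapshot") \
    .load("/data/orders")
# "read_optimized" reads only the compacted base files (faster, slightly stale).
ro_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "read_optimized") \
    .load("/data/orders")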
Key Features
1. Incremental Queries
Example: orders are ingested from the database every 5 minutes, and only new/updated records should be processed.
# Read only changes since last checkpoint
incremental_df = spark.read.format("hudi") \
.option("hoodie.datasource.query.type", "incremental") \
.option("hoodie.datasource.read.begin.instanttime", last_checkpoint) \
.load("/data/orders")
Benefit: instead of reprocessing 10TB of data you process only the 10GB that changed → up to 1000x faster.
2. Upserts (Record-Level Updates)
from pyspark.sql import SparkSession
# Hudi upsert
hudi_options = {
'hoodie.table.name': 'customers',
'hoodie.datasource.write.recordkey.field': 'customer_id',
'hoodie.datasource.write.precombine.field': 'updated_at',
'hoodie.datasource.write.operation': 'upsert'
}
df.write.format("hudi") \
.options(**hudi_options) \
.mode("append") \
.save("/data/customers")
Performance (MoR): updates are appended to log files instead of rewriting Parquet → very fast.
3. CDC (Change Data Capture)
Example: stream changes from MySQL into Hudi.
# Debezium → Kafka → Spark Streaming → Hudi
# Kafka source requires bootstrap servers ("broker:9092" is a placeholder)
stream_df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "mysql.orders") \
    .load()
# Parse CDC events
from pyspark.sql.functions import col, from_json
parsed_df = stream_df.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")
# Write to Hudi (MoR for fast ingestion)
parsed_df.writeStream.format("hudi") \
.options(**hudi_options) \
.option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
.trigger(processingTime='1 minute') \
.start("/data/orders_cdc")
Latency: near real-time (1-5 minutes from a database change to queryable data).
4. Timeline & Time Travel
# Hudi timeline: sequence of commits
# Read the table at a specific commit
df = spark.read.format("hudi") \
.option("as.of.instant", "20250115120000") \
.load("/data/orders")
5. Compaction (MoR Tables)
MoR tables accumulate delta logs, so reads gradually slow down. Compaction merges the logs back into the base files.
# Inline compaction (automatic)
hudi_options['hoodie.compact.inline'] = 'true'
hudi_options['hoodie.compact.inline.max.delta.commits'] = '5'
# Async compaction (separate job)
spark.read.format("hudi") \
.load("/data/orders") \
.write.format("hudi") \
.option("hoodie.datasource.compaction.async.enable", "true") \
.save()
Pros & Cons
Pros ✅:
- Best for streaming: MoR tables are very fast to ingest into
- Incremental processing: only changed data is processed
- CDC-friendly: near real-time pipelines
- Record-level updates: efficient upserts
Cons ❌:
- Complexity: the two storage types (CoW vs MoR) are confusing
- Slower reads (MoR): base and delta files must be merged
- Smaller ecosystem: less widespread than Delta/Iceberg
- Spark-centric: Trino support is limited
When to Use Hudi
✅ Use it when:
- You have streaming use cases (Kafka → Data Lake)
- You build CDC pipelines (database replication)
- You need incremental processing
- You need near real-time analytics
❌ Avoid it when:
- Your workloads are batch-only (Delta/Iceberg are simpler)
- You need multi-engine support (Iceberg is better)
- Your team is not familiar with streaming
Detailed Comparison: Feature Matrix
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| ACID Transactions | ✅ Yes | ✅ Yes | ✅ Yes |
| Time Travel | ✅ Yes | ✅ Yes | ✅ Yes (Timeline) |
| Schema Evolution | ✅ Yes | ✅ Yes | ✅ Yes |
| Upserts/Deletes | ✅ MERGE | ✅ MERGE | ✅ Upsert (native) |
| Partition Evolution | ❌ No | ✅ Yes | ⚠️ Limited |
| Hidden Partitioning | ❌ No | ✅ Yes | ❌ No |
| Incremental Queries | ⚠️ CDF (Databricks) | ⚠️ Via snapshots | ✅ Native |
| Streaming Ingestion | ⚠️ OK | ⚠️ OK | ✅ Excellent (MoR) |
| Query Engines | Spark (primary) | Spark, Trino, Flink, Hive | Spark (primary) |
| Read Performance | ⚠️ Good | ⚠️ Good | ❌ Slower (MoR) |
| Write Performance | ⚠️ Good | ⚠️ Good | ✅ Excellent (MoR) |
| Small Files Problem | ✅ OPTIMIZE | ✅ Compaction | ✅ Compaction |
| Z-Ordering | ✅ Yes | ❌ No | ❌ No |
| Bloom Filters | ✅ Yes (Databricks) | ⚠️ Limited | ⚠️ Limited |
| File Format | Parquet (default) | Parquet, ORC, Avro | Parquet, Avro (MoR) |
| Maturity | ⭐⭐⭐⭐⭐ Mature | ⭐⭐⭐⭐ Growing | ⭐⭐⭐ Niche |
| Ecosystem | ⭐⭐⭐⭐⭐ Rich | ⭐⭐⭐⭐ Growing | ⭐⭐⭐ Limited |
| Community | Very active | Very active | Active |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
Performance Benchmarks
Read Performance
Test: query a 1TB table (100M rows), filtering down to 1% of the data.
| Format | Query Time | Files Scanned |
|---|---|---|
| Plain Parquet | 45s | 1,000 files |
| Delta Lake | 12s | 150 files (optimized) |
| Iceberg | 14s | 180 files |
| Hudi (CoW) | 13s | 160 files |
| Hudi (MoR) | 22s | 160 base + 40 delta |
Insight: Delta, Iceberg, and Hudi CoW perform similarly. Hudi MoR is slower because of merge overhead.
Write Performance (Upserts)
Test: update 1% of a 1TB table (1M records).
| Format | Update Time | Approach |
|---|---|---|
| Plain Parquet | 180s | Rewrite 10GB partition |
| Delta Lake | 35s | Rewrite only affected files |
| Iceberg | 38s | Rewrite only affected files |
| Hudi (CoW) | 40s | Rewrite affected files |
| Hudi (MoR) | 8s | Append to delta logs |
Insight: Hudi MoR is the fastest for writes (about 5x). Delta and Iceberg are roughly on par.
Streaming Ingestion
Test: ingest 100K records/second from Kafka.
| Format | Ingestion Lag | Queryable Latency |
|---|---|---|
| Plain Parquet | 60s | 60s |
| Delta Lake | 20s | 20s |
| Iceberg | 25s | 25s |
| Hudi (MoR) | 10s | 15s (after compaction) |
Insight: Hudi MoR is the best fit for streaming (lowest latency).
Ecosystem Support
Query Engines
| Engine | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Apache Spark | ✅ Native | ✅ Native | ✅ Native |
| Trino/Presto | ⚠️ Basic | ✅ Full support | ⚠️ Basic |
| Apache Flink | ❌ No | ✅ Full support | ⚠️ Limited |
| Hive | ❌ No | ✅ Via connector | ⚠️ Limited |
| Dremio | ✅ Yes | ✅ Yes | ❌ No |
| Athena | ✅ Yes | ✅ Yes | ❌ No |
Insight: Iceberg has the best multi-engine support.
Cloud Platforms
| Platform | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Databricks | ✅ Native | ⚠️ Preview | ❌ No |
| AWS EMR | ✅ Yes | ✅ Yes | ✅ Yes |
| AWS Glue | ⚠️ Limited | ✅ Yes | ✅ Yes |
| GCP Dataproc | ✅ Yes | ✅ Yes | ✅ Yes |
| Azure Synapse | ✅ Preview | ⚠️ Limited | ❌ No |
| Snowflake | ⚠️ Can read | ✅ Iceberg Tables | ❌ No |
Insight: Delta is best on Databricks; Iceberg is best across platforms.
How to Choose: Decision Framework
┌─────────────────────────────────────────────────────────┐
│ START: Choose a Table Format for the Lakehouse │
└────────────────┬────────────────────────────────────────┘
│
▼
┌───────────────────┐
│ On the Databricks │
│ platform? │
└────┬─────────┬─────┘
│ │
YES│ │NO
│ │
▼ ▼
┌──────────┐ ┌────────────────────┐
│ DELTA │ │ Workload nào chính?│
│ LAKE │ └───┬──────────┬─────┘
└──────────┘ │ │
Streaming Batch/Ad-hoc
│ │
▼ ▼
┌──────────┐ ┌────────────────┐
│ HUDI │ │ Multi-engine? │
│ (MoR) │ └───┬────────┬───┘
└──────────┘ │ │
YES NO
│ │
▼ ▼
┌─────────┐ ┌──────┐
│ ICEBERG │ │ DELTA│
└─────────┘ └──────┘
Decision Criteria
Choose Delta Lake if:
- ✅ You use Databricks (best integration)
- ✅ Your workloads are Spark-centric
- ✅ You need a mature ecosystem
- ✅ Your team knows the Spark APIs
- ✅ You need advanced features (Z-Order, CDF)
Choose Apache Iceberg if:
- ✅ You run a multi-engine environment (Spark + Trino + Flink)
- ✅ You want an open standard and no lock-in
- ✅ You need partition evolution
- ✅ You have large-scale tables (billions of rows)
- ✅ You query from Snowflake (Iceberg Tables)
Choose Apache Hudi if:
- ✅ Your use cases are streaming-heavy (Kafka, CDC)
- ✅ You need near real-time analytics (< 5 min latency)
- ✅ You need incremental processing (process only changes)
- ✅ Your workload is write-heavy (MoR)
- ✅ You want an Uber-style data platform
Migration Guide: Parquet → Table Formats
Step 1: Assessment
# Scan existing Parquet data
parquet_path = "s3://data-lake/orders/"
df = spark.read.parquet(parquet_path)
# Check size and partitions
print(f"Rows: {df.count()}")
# Rough size estimate from serialized rows; for large tables, prefer the object
# store's own metrics (e.g., bucket size) over a full scan like this.
print(f"Approx size: {df.rdd.map(lambda x: len(str(x))).sum() / 1e9} GB")
print(f"Partitions: {df.rdd.getNumPartitions()}")
Step 2: Convert to Delta Lake
# Option 1: CONVERT TO DELTA (in place, no data duplication)
from delta.tables import DeltaTable
DeltaTable.convertToDelta(
spark,
f"parquet.`{parquet_path}`",
"order_date DATE, customer_id STRING" # Partition schema
)
# Option 2: CREATE TABLE (explicit)
df = spark.read.parquet(parquet_path)
df.write.format("delta") \
.partitionBy("order_date") \
.save("s3://lakehouse/orders_delta")
Step 3: Convert to Iceberg
from pyspark.sql import SparkSession
# Create Iceberg table
spark.sql(f"""
CREATE TABLE iceberg_catalog.db.orders
USING iceberg
PARTITIONED BY (days(order_date))
AS SELECT * FROM parquet.`{parquet_path}`
""")
# Or migrate an existing DataFrame
df = spark.read.parquet(parquet_path)
df.writeTo("iceberg_catalog.db.orders").create()
Step 4: Convert to Hudi
hudi_options = {
'hoodie.table.name': 'orders',
'hoodie.datasource.write.recordkey.field': 'order_id',
'hoodie.datasource.write.precombine.field': 'order_date',
'hoodie.datasource.write.partitionpath.field': 'order_date',
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE', # or MERGE_ON_READ
'hoodie.datasource.write.operation': 'bulk_insert'
}
df = spark.read.parquet(parquet_path)
df.write.format("hudi") \
.options(**hudi_options) \
.mode("overwrite") \
.save("s3://lakehouse/orders_hudi")
Step 5: Validate & Cutover
# Validate row counts match
parquet_count = spark.read.parquet(parquet_path).count()
delta_count = spark.read.format("delta").load("s3://lakehouse/orders_delta").count()
assert parquet_count == delta_count, "Row count mismatch!"
# Update pipelines to write to new format
# Cutover queries to read from new format
# Monitor for 1 week
# Decommission old Parquet files
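Row counts catch missing data but not corrupted values. A useful extra check, sketched below under the assumption that both tables share the same columns, is an order-independent fingerprint such as a summed CRC32 over concatenated columns (the column list is illustrative):
from pyspark.sql import functions as F

def table_fingerprint(df, columns):
    """Order-independent fingerprint: sum of per-row CRC32 over selected columns."""
    row_hash = F.crc32(F.concat_ws("|", *[F.col(c).cast("string") for c in columns]))
    return df.select(F.sum(row_hash).alias("fp")).collect()[0]["fp"]

check_cols = ["order_id", "customer_id", "order_date", "amount"]  # illustrative
parquet_fp = table_fingerprint(spark.read.parquet(parquet_path), check_cols)
delta_fp = table_fingerprint(spark.read.format("delta").load("s3://lakehouse/orders_delta"), check_cols)
assert parquet_fp == delta_fp, "Fingerprint mismatch!"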
Code Examples: Common Operations
Create Table
Delta Lake:
df.write.format("delta") \
.mode("overwrite") \
.partitionBy("date") \
.save("/data/events")
Iceberg:
df.writeTo("iceberg_catalog.db.events") \
.partitionedBy("date") \
.create()
Hudi:
hudi_options = {
'hoodie.table.name': 'events',
'hoodie.datasource.write.recordkey.field': 'event_id',
'hoodie.datasource.write.precombine.field': 'timestamp',
}
df.write.format("hudi") \
.options(**hudi_options) \
.save("/data/events")
Read Table
Delta Lake:
df = spark.read.format("delta").load("/data/events")
Iceberg:
df = spark.read.format("iceberg").load("iceberg_catalog.db.events")
Hudi:
df = spark.read.format("hudi").load("/data/events")
Update Records
Delta Lake:
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/data/customers")
deltaTable.update(
condition = "status = 'inactive'",
set = {"status": "'archived'"}
)
Iceberg:
spark.sql("""
UPDATE iceberg_catalog.db.customers
SET status = 'archived'
WHERE status = 'inactive'
""")
Hudi:
# Hudi performs updates via upsert
updates_df = spark.createDataFrame([
("cust123", "archived")
], ["customer_id", "status"])
# The table name and record key options (as shown under Create Table) must also be set.
updates_df.write.format("hudi") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .mode("append") \
    .save("/data/customers")
Delete Records
Delta Lake:
deltaTable.delete("order_date < '2020-01-01'")
Iceberg:
spark.sql("""
DELETE FROM iceberg_catalog.db.orders
WHERE order_date < '2020-01-01'
""")
Hudi:
deletes_df = spark.sql("SELECT order_id FROM orders WHERE order_date < '2020-01-01'")
# The table name and record key options (as shown under Create Table) must also be set.
deletes_df.write.format("hudi") \
    .option("hoodie.datasource.write.operation", "delete") \
    .mode("append") \
    .save("/data/orders")
Case Studies: Real-World Usage
Case Study 1: E-commerce - Delta Lake
Company: Online retailer, 5M customers, 100M orders
Challenge: Update order status in near real time, run analytics on the full history
Solution: Delta Lake on Databricks
Architecture:
MySQL (orders) → Debezium → Kafka → Spark Streaming → Delta Lake
↓
BI Tools (Tableau)
Implementation:
# Streaming write to Delta
# Kafka source requires bootstrap servers ("broker:9092" is a placeholder)
stream_df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "order_updates") \
    .load()
stream_df.writeStream.format("delta") \
.option("checkpointLocation", "/checkpoints/orders") \
.outputMode("append") \
.start("/delta/orders")
Results:
- ✅ Latency: 2-5 minutes (CDC → queryable)
- ✅ MERGE operations handle 100K updates/hour
- ✅ Analysts query with sub-second response time
- ✅ Cost: $3K/month (compute + storage)
Case Study 2: Fintech - Apache Iceberg
Company: Payment processor, 10TB of transaction data
Challenge: Multi-engine access (Spark for ETL, Trino for ad-hoc queries, Flink for streaming)
Solution: Apache Iceberg on AWS S3
Architecture:
Kafka → Flink (streaming) → Iceberg Tables on S3
↓
Spark (batch ETL)
Trino (ad-hoc SQL)
Implementation:
// Flink streaming write (Java)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.executeSql("""
CREATE CATALOG iceberg WITH (
'type' = 'iceberg',
'catalog-type' = 'hadoop',
'warehouse' = 's3://data-lakehouse/'
)
""");
tableEnv.executeSql("""
INSERT INTO iceberg.db.transactions
SELECT * FROM kafka_source
""");
Results:
- ✅ Multi-engine: Flink, Spark, Trino work seamlessly
- ✅ Partition evolution: Changed monthly → daily partitions without rewrite
- ✅ Query performance: 3x faster than plain Parquet
- ✅ No vendor lock-in
Case Study 3: Ride-sharing - Apache Hudi
Company: Ride-sharing platform (similar to Grab)
Challenge: Near real-time driver/rider metrics, CDC from PostgreSQL
Solution: Apache Hudi (MoR) on AWS EMR
Architecture:
PostgreSQL → Debezium → Kafka → Spark Streaming → Hudi (MoR)
↓
Presto (queries)
Implementation:
hudi_options = {
'hoodie.table.name': 'trips',
'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.compact.inline': 'false', # Async compaction
}
stream_df.writeStream.format("hudi") \
.options(**hudi_options) \
.trigger(processingTime='1 minute') \
.start("/hudi/trips")
Results:
- ✅ Latency: < 2 minutes (PostgreSQL → queryable)
- ✅ Incremental processing: only changed records are processed (10x faster)
- ✅ Storage savings: 30% vs plain Parquet (compression + deduplication)
- ✅ Handles 1M trip updates/day
Future Outlook: Convergence?
Current Trends
1. Format Convergence:
- Delta, Iceberg, and Hudi are converging on the same feature set
- Delta adding multi-engine support (UniForm)
- Iceberg adding Z-Ordering
- Standards emerging (open table formats)
2. Cloud Platform Support:
- AWS Athena/Glue supports all 3
- Snowflake adding Iceberg Tables (2024)
- Databricks supports Delta + Iceberg (preview)
3. Open Standards:
- Lakehouse open standard (Linux Foundation)
- Interoperability improving
Predictions (2025-2026)
- Delta Lake: will remain dominant in the Databricks ecosystem and expand its multi-engine support
- Iceberg: will become the default open standard for cross-platform lakehouses
- Hudi: will focus on the streaming niche and integrate further with the Flink ecosystem
Advice: choose the format that fits your current platform, but keep an eye on interoperability so a later migration stays easy.
Best Practices
1. Start Small
# Pilot with a single table first
pilot_table = "orders"  # Most important, high-volume table
# Migrate, test, validate
# If successful → roll out to other tables
2. Optimize Regularly
Delta Lake:
# Schedule weekly optimization
deltaTable.optimize().executeCompaction()
deltaTable.optimize().executeZOrderBy("customer_id", "order_date")
deltaTable.vacuum(168)  # Clean up old files (retention is in hours; 168 = 7 days)
Iceberg:
# Expire old snapshots (keep only the last 7 days for time travel)
spark.sql("""
CALL iceberg_catalog.system.expire_snapshots(
table => 'db.orders',
older_than => TIMESTAMP '2025-01-15 00:00:00'
)
""")
Hudi:
# Schedule compaction for MoR tables
spark.read.format("hudi").load("/data/orders") \
.write.format("hudi") \
.option("hoodie.datasource.compaction.async.enable", "true") \
.save()
3. Monitor Performance
Metrics to track:
- Query latency (p50, p95, p99)
- Storage size (growth rate)
- Small-file count (aim for an average file size of roughly 128MB or more; see the Delta example below)
- Compaction lag (Hudi MoR)
- Snapshot count (Iceberg/Delta)
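As one example of tracking the small-files metric, Delta exposes file counts and total size through DESCRIBE DETAIL; a minimal sketch (the table path is illustrative):
# numFiles and sizeInBytes come from DESCRIBE DETAIL on a Delta table.
detail = spark.sql("DESCRIBE DETAIL delta.`/data/orders`").collect()[0]
avg_file_mb = detail["sizeInBytes"] / detail["numFiles"] / (1024 * 1024)
print(f"{detail['numFiles']} files, avg {avg_file_mb:.0f} MB per file")
# If the average drops well below ~128 MB, it is time for an OPTIMIZE run.
if avg_file_mb < 128:
    spark.sql("OPTIMIZE delta.`/data/orders`")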
4. Schema Evolution
# Add columns (non-breaking)
df_with_new_column.write.format("delta") \
.option("mergeSchema", "true") \
.mode("append") \
.save("/data/orders")
# Rename/delete columns (breaking) → communicate to users first
Conclusion
Key Takeaways
✅ Table formats (Delta, Iceberg, Hudi) bring ACID and performance to Data Lakes
✅ Delta Lake: best for Databricks and Spark-centric workloads
✅ Apache Iceberg: best for multi-engine, open ecosystems
✅ Apache Hudi: best for streaming, CDC, and near real-time
✅ Value: 3-5x faster queries, 20-40% storage savings, plus the ability to upsert/delete
Recommendations
For startups/SMEs:
- Start with Iceberg if you want flexibility
- Or Delta if you are on Databricks
- Avoid Hudi unless you are streaming-heavy
For enterprises:
- Multi-team → Iceberg (multi-engine support)
- Databricks-centric → Delta Lake
- Real-time analytics → Hudi for streaming tables
Migration path:
- Pilot with one high-value table
- Measure the performance improvements
- Roll out to the remaining tables incrementally
- Deprecate the plain Parquet copies
Next Steps
Want to implement a Lakehouse architecture for your organization?
Carptech can help you with:
- ✅ Assessment & recommendations (Delta vs Iceberg vs Hudi)
- ✅ Migration from plain Parquet/Hive to table formats
- ✅ Performance optimization (3-5x faster queries)
- ✅ Team training
📞 Contact Carptech: carptech.vn