Cloud & Infrastructure · Updated: July 22, 2025 · 21 min read

Lakehouse Architecture: Delta Lake, Iceberg, and Hudi Compared

A detailed comparison of the three leading table formats for the Data Lakehouse - Delta Lake, Apache Iceberg, and Apache Hudi: their characteristics, strengths and weaknesses, and how to choose the right one for your business.

Sơn Nguyễn

Data Platform Architect

Lakehouse Architecture comparison - Delta Lake vs Apache Iceberg vs Apache Hudi
#Lakehouse · #Delta Lake · #Apache Iceberg · #Apache Hudi · #Data Architecture · #Cloud Storage

TL;DR

  • A Data Lakehouse combines the strengths of the Data Warehouse (ACID, performance) and the Data Lake (flexibility, cost)
  • Table formats (Delta Lake, Iceberg, Hudi) fix the limitations of plain Parquet files: no ACID, painful updates/deletes, no schema enforcement
  • Delta Lake (Databricks): the most widely used, deeply integrated with Databricks, great for Spark workloads
  • Apache Iceberg (Netflix): engine-agnostic, smart partition evolution, great for multi-engine environments
  • Apache Hudi (Uber): optimized for streaming, CDC, and near real-time ingestion
  • Choosing: Delta if you are on Databricks, Iceberg if you want an open, multi-engine stack, Hudi if your use cases are streaming-heavy
  • Payoff: 3-5x faster queries, 20-40% less storage, ACID on the data lake

Introduction: The Problem with Plain Data Lakes

Running a Data Lake on S3/GCS full of Parquet files? You have probably hit these problems:

❌ No ACID transactions: two pipelines writing at the same time → corruption
❌ No updates/deletes: fixing a single record means rewriting the whole partition
❌ Slow queries: every query has to scan thousands of Parquet files
❌ Painful schema evolution: adding a column means migrating hundreds of TB of data
❌ No time travel: there is no way to roll back to an older version

A real-world example from a Vietnamese e-commerce company:

Data Lake structure:
s3://data-lake/
  orders/
    year=2025/
      month=01/
        day=01/
          part-00001.parquet (500 files per day)
          part-00002.parquet
          ...

Problems they ran into:

  • A customer requests data deletion (GDPR/PDPA) → hundreds of Parquet files have to be found and rewritten
  • An order status changes (e.g., "delivered") → a new record is appended, requiring complex dedup logic downstream
  • A "last 7 days of orders" query scans 3,500 files → slow and expensive

Table formats (Delta Lake, Iceberg, Hudi) solve these problems. They are a metadata layer on top of Parquet/ORC files that provides:

✅ ACID transactions ✅ Efficient updates/deletes ✅ Schema evolution ✅ Time travel ✅ Better query performance

In this post, we compare the three leading table formats in detail so you can pick the right one for your use case.


Lakehouse Architecture Recap

Before diving into table formats, a quick refresher on what a Lakehouse is:

Data Warehouse:

  • ✅ ACID transactions, data quality
  • ✅ Fast queries
  • ❌ Expensive (Snowflake, BigQuery billing)
  • ❌ Inflexible (structured data only)

Data Lake:

  • ✅ Cheap (S3 is only ~$0.023/GB/month)
  • ✅ Flexible (structured, semi-structured, unstructured)
  • ❌ No ACID
  • ❌ Slow queries

Lakehouse = Warehouse performance + Lake cost:

Lakehouse Architecture:

┌─────────────────────────────────────────┐
│ Query Engines: Spark, Trino, Flink     │
├─────────────────────────────────────────┤
│ Table Format: Delta/Iceberg/Hudi       │ ← Metadata layer
├─────────────────────────────────────────┤
│ Storage: Parquet/ORC files on S3/GCS  │
└─────────────────────────────────────────┘

What a table format does (a quick sketch of this metadata layer in action follows the list):

  • Track which files belong to table
  • Store schema, partition info
  • Provide ACID via optimistic concurrency control
  • Enable time travel via versioning
  • Optimize queries via file pruning
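
To make the metadata layer concrete, here is a minimal sketch, assuming a Delta Lake table at the hypothetical local path /data/orders: every commit in _delta_log/ is a JSON file listing the data files it added or removed.

# Minimal sketch: read the Delta transaction log directly and print what each commit did.
# Assumes a Delta table at the hypothetical path /data/orders.
import glob
import json

DELTA_LOG = "/data/orders/_delta_log"

for commit_file in sorted(glob.glob(f"{DELTA_LOG}/*.json")):
    with open(commit_file) as f:
        for line in f:
            action = json.loads(line)
            if "add" in action:
                print(commit_file, "ADD", action["add"]["path"])
            elif "remove" in action:
                print(commit_file, "REMOVE", action["remove"]["path"])
            elif "metaData" in action:
                print(commit_file, "SCHEMA", action["metaData"]["schemaString"][:80])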

Now let's compare the three most popular table formats.


Delta Lake (Databricks)

Overview

Origin: Developed by Databricks, open-sourced in 2019 (Linux Foundation)
License: Apache 2.0
Best for: Databricks users, Spark-centric workloads

Architecture

Delta Lake Table:

data/
  _delta_log/
    00000000000000000000.json  ← Transaction log (ACID)
    00000000000000000001.json
    00000000000000000010.checkpoint.parquet
  part-00000-*.snappy.parquet  ← Data files
  part-00001-*.snappy.parquet

Delta Log (transaction log):

  • JSON files record every change (add file, remove file, metadata change)
  • Every transaction creates a new commit
  • Checkpoint files (Parquet) every 10 commits speed up log reads (a quick way to inspect the log from Spark is shown below)
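
A small sketch of inspecting this log from Spark, assuming the table at /data/orders already exists and a SparkSession named spark is available:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/data/orders")

# One row per commit: version, timestamp, operation (WRITE, MERGE, OPTIMIZE, ...)
deltaTable.history().select("version", "timestamp", "operation", "operationMetrics") \
    .show(truncate=False)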

Key Features

1. ACID Transactions

# Concurrent writes do not conflict
# Writer 1
df1.write.format("delta").mode("append").save("/data/orders")

# Writer 2 (at the same time)
df2.write.format("delta").mode("append").save("/data/orders")

# Delta Lake ensures both succeed or both fail, no partial writes

2. Time Travel

# Read an older version
df = spark.read.format("delta").option("versionAsOf", 5).load("/data/orders")

# Or by timestamp
df = spark.read.format("delta").option("timestampAsOf", "2025-01-01").load("/data/orders")

# Restore to an older version
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/data/orders")
deltaTable.restoreToVersion(5)

Use case: rolling back after loading bad data, auditing historical changes.

3. Upserts (MERGE)

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/data/customers")

# Upsert: update existing, insert new
deltaTable.alias("target").merge(
    updates.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(set = {
    "email": "source.email",
    "updated_at": "source.updated_at"
}).whenNotMatchedInsert(values = {
    "customer_id": "source.customer_id",
    "email": "source.email",
    "created_at": "source.created_at"
}).execute()

Performance: only the files containing matched records are rewritten, not the whole table.

4. Schema Evolution

# Add a new column
df_with_new_column.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/data/orders")

# Delta Lake merges the schema automatically

5. Optimize & Z-Ordering

# Compact small files into larger files
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/data/events")
deltaTable.optimize().executeCompaction()

# Z-Ordering: cluster data by the columns you filter on most often
deltaTable.optimize().executeZOrderBy("user_id", "event_date")

Result: queries like "WHERE user_id = X" run up to 10x faster.

6. Vacuum (Cleanup Old Files)

# Delete files that were replaced by newer versions and are older than the retention window
deltaTable.vacuum(168)  # retention is given in hours; 168h = 7 days

Pros & Cons

Pros ✅:

  • Mature, production-proven (thousands of companies)
  • Tight integration with Databricks (optimized)
  • Rich features (Z-Order, Bloom filters, CDF)
  • Strong community support
  • Easy to get started

Cons ❌:

  • Spark-centric (Trino and Flink support is still limited)
  • Databricks-optimized (some features are only available on Databricks)
  • Partitioning is not as smart as Iceberg's

When to Use Delta Lake

Use it when:

  • You are already on Databricks
  • Your workloads are mostly Spark
  • You need a mature ecosystem
  • Your team is familiar with the Spark APIs

Avoid it when:

  • You want multi-engine access (Trino, Flink, Presto)
  • You want to avoid vendor lock-in (Databricks)
  • You need advanced partitioning (hidden partitions)

Apache Iceberg (Netflix)

Overview

Origin: Developed by Netflix, donated to the Apache Software Foundation in 2018
License: Apache 2.0
Best for: Multi-engine environments, large-scale analytics

Architecture

Iceberg Table:

data/
  metadata/
    v1.metadata.json           ← Current metadata
    v2.metadata.json
    snap-123.avro              ← Snapshot files
    manifest-list-456.avro     ← Manifest lists
    manifest-789.avro          ← Manifests (file lists)
  data/
    part-00000.parquet         ← Data files
    part-00001.parquet

3-level metadata:

  1. Metadata file: Schema, partition spec, snapshots
  2. Manifest list: List of manifest files for a snapshot
  3. Manifest: List of data files with statistics

Why three levels? It keeps metadata operations fast for very large tables (millions of files). A short PyIceberg sketch of walking these levels follows.
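
A minimal sketch of walking the three metadata levels with PyIceberg, assuming a catalog named "default" is configured and a table db.orders exists:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")          # assumed catalog name
table = catalog.load_table("db.orders")    # assumed table identifier

# Level 1: metadata file - schema, partition spec, snapshot history
print(table.schema())
print(table.spec())

# Levels 2-3: each snapshot points at a manifest list, which points at manifests of data files
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms, snapshot.manifest_list)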

Key Features

1. Hidden Partitioning

The problem with Hive-style partitioning:

-- Users have to know the partition structure
SELECT * FROM orders
WHERE year = 2025 AND month = 1 AND day = 15
  AND order_date = '2025-01-15'  -- Redundant!

Iceberg hidden partitioning:

# Define partition transform
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestampType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform

schema = Schema(
    NestedField(1, "order_id", StringType()),
    NestedField(2, "order_date", TimestampType()),
)

# Partition by day(order_date) - hidden from users
partition_spec = PartitionSpec(
    PartitionField(source_id=2, field_id=1000, transform=DayTransform(), name="order_day")
)

Query:

-- Users only filter on order_date; Iceberg maps it to the right partitions automatically
SELECT * FROM orders WHERE order_date = '2025-01-15'

Benefit: users never need to know the partition structure, which avoids a whole class of errors.

2. Partition Evolution

Example: you partition a table by day(order_date); two years later the data has grown and you want to switch to month(order_date).

With Hive: you would have to rewrite the entire table (hundreds of TB).

With Iceberg:

from pyiceberg.transforms import MonthTransform

# Update the partition spec - no data rewrite needed!
table.update_spec() \
    .add_field("order_date", MonthTransform(), "order_month") \
    .commit()

Result: old data stays partitioned by day, new data is partitioned by month, and the Iceberg query planner knows how to read both.

3. Schema Evolution

from pyiceberg.types import StringType

# Add column
table.update_schema() \
    .add_column("customer_tier", StringType()) \
    .commit()

# Rename column
table.update_schema() \
    .rename_column("old_name", "new_name") \
    .commit()

# Delete column
table.update_schema() \
    .delete_column("unused_column") \
    .commit()

Backward compatibility: old Parquet files that lack the new column → Iceberg returns null for it.

4. Time Travel

# Read a specific snapshot by ID
df = spark.read.format("iceberg") \
    .option("snapshot-id", 1234567890) \
    .load("warehouse.db.orders")

# Or as of a timestamp (epoch millis)
df = spark.read.format("iceberg") \
    .option("as-of-timestamp", "1609459200000") \
    .load("warehouse.db.orders")

5. Engine-Agnostic

Iceberg works with many engines:

# Spark
df = spark.read.format("iceberg").load("db.table")

# Trino
trino> SELECT * FROM iceberg.db.table;

# Flink (streaming)
tableEnv.executeSql("SELECT * FROM iceberg_catalog.db.table")

# Hive
hive> SELECT * FROM db.table;

Benefit: you are not locked into a single engine.

Pros & Cons

Pros ✅:

  • Truly open: not tied to any vendor
  • Multi-engine: Spark, Trino, Flink, Hive, Presto
  • Advanced partitioning: hidden partitions, partition evolution
  • Scalable: designed for petabyte scale
  • Active development: Apple, Netflix, LinkedIn contribute

Cons ❌:

  • Newer than Delta (smaller ecosystem)
  • More complex setup than Delta (needs a catalog such as Hive Metastore or Nessie)
  • Fewer built-in features than Delta (no Z-Ordering or Change Data Feed out of the box)

When to Use Iceberg

Use it when:

  • You run a multi-engine environment (Spark + Trino + Flink)
  • You want to avoid vendor lock-in
  • You need partition evolution
  • You run very large tables (billions of rows)

Avoid it when:

  • Your workload is Spark-only (Delta is simpler)
  • You need advanced Spark optimizations (Z-Order)
  • You have a small team and want the easiest setup

Apache Hudi (Uber)

Overview

Origin: Developed by Uber, donated to the Apache Software Foundation in 2019
License: Apache 2.0
Best for: Streaming ingestion, CDC, near real-time analytics

Architecture

Hudi has two storage types:

1. Copy-on-Write (CoW):

  • Data is stored in Parquet files
  • Updates rewrite the entire Parquet file
  • Fast reads (Parquet only)
  • Slow writes (rewrite overhead)

2. Merge-on-Read (MoR):

  • Base files (Parquet) + delta files (Avro logs)
  • Updates are appended to delta logs
  • Fast writes
  • Reads have to merge base + delta (slower)

The table type is chosen at write time; a short sketch follows the layout below.

Hudi MoR Table:

data/
  .hoodie/
    hoodie.properties
    00001.commit               ← Timeline (commits log)
    00001.inflight
  partition1/
    file1_base.parquet         ← Base files
    file1_delta.log            ← Delta logs (updates)
  partition2/
    file2_base.parquet
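
A short sketch of picking the storage type at write time, assuming an input DataFrame df with order_id/updated_at columns and hypothetical S3 paths:

# Same write API; only the table type option changes
cow_options = {
    'hoodie.table.name': 'orders_cow',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',    # read-optimized
}

mor_options = dict(cow_options)
mor_options['hoodie.table.name'] = 'orders_mor'
mor_options['hoodie.datasource.write.table.type'] = 'MERGE_ON_READ'  # write-optimized

df.write.format("hudi").options(**cow_options).mode("append").save("s3://lakehouse/orders_cow")
df.write.format("hudi").options(**mor_options).mode("append").save("s3://lakehouse/orders_mor")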

Key Features

1. Incremental Queries

Example: ingest orders from a database every 5 minutes and process only new/updated records.

# Read only changes since last checkpoint
incremental_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", last_checkpoint) \
    .load("/data/orders")

Benefit: instead of reprocessing 10TB of data, you process only the 10GB that changed → up to 1,000x faster.

2. Upserts (Record-Level Updates)

from pyspark.sql import SparkSession

# Hudi upsert
hudi_options = {
    'hoodie.table.name': 'customers',
    'hoodie.datasource.write.recordkey.field': 'customer_id',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.operation': 'upsert'
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("/data/customers")

Performance (MoR): updates are appended to log files rather than rewriting Parquet → very fast writes.

3. CDC (Change Data Capture)

Example: streaming changes from MySQL into Hudi.

# Debezium → Kafka → Spark Streaming → Hudi
from pyspark.sql.functions import from_json, col

stream_df = spark.readStream.format("kafka") \
    .option("subscribe", "mysql.orders") \
    .load()

# Parse CDC events (schema = StructType of the Debezium payload, defined elsewhere)
parsed_df = stream_df.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")

# Write to Hudi (MoR for fast ingestion)
parsed_df.writeStream.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .option("checkpointLocation", "/checkpoints/orders_cdc") \
    .trigger(processingTime='1 minute') \
    .start("/data/orders_cdc")

Latency: near real-time (1-5 minutes from database change → queryable).

4. Timeline & Time Travel

# Hudi timeline: sequence of commits
# Read the table as of a specific commit (instant)
df = spark.read.format("hudi") \
    .option("as.of.instant", "20250115120000") \
    .load("/data/orders")

5. Compaction (MoR Tables)

MoR tables accumulate delta logs, so reads get slower over time. Compaction merges the logs back into the base files.

# Inline compaction (automatic)
hudi_options['hoodie.compact.inline'] = 'true'
hudi_options['hoodie.compact.inline.max.delta.commits'] = '5'

# Async compaction (run as a separate job)
spark.read.format("hudi") \
    .load("/data/orders") \
    .write.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.compaction.async.enable", "true") \
    .mode("append") \
    .save("/data/orders")

Pros & Cons

Pros ✅:

  • Best for streaming: MoR tables are very fast for ingestion
  • Incremental processing: process only changed data
  • CDC-friendly: near real-time pipelines
  • Record-level updates: efficient upserts

Cons ❌:

  • Complexity: the two storage types (CoW vs MoR) can be confusing
  • Slower reads (MoR): base and delta files have to be merged
  • Smaller ecosystem: not as widespread as Delta/Iceberg
  • Spark-centric: Trino support is limited

When to Use Hudi

Use it when:

  • You have streaming use cases (Kafka → Data Lake)
  • You run CDC pipelines (database replication)
  • You need incremental processing
  • You need near real-time analytics

Avoid it when:

  • Your workloads are batch-only (Delta/Iceberg are simpler)
  • You need broad multi-engine support (Iceberg is better)
  • Your team is not familiar with streaming

Detailed Comparison: Feature Matrix

Feature              | Delta Lake           | Apache Iceberg            | Apache Hudi
ACID Transactions    | ✅ Yes               | ✅ Yes                    | ✅ Yes
Time Travel          | ✅ Yes               | ✅ Yes                    | ✅ Yes (Timeline)
Schema Evolution     | ✅ Yes               | ✅ Yes                    | ✅ Yes
Upserts/Deletes      | ✅ MERGE             | ✅ MERGE                  | ✅ Upsert (native)
Partition Evolution  | ❌ No                | ✅ Yes                    | ⚠️ Limited
Hidden Partitioning  | ❌ No                | ✅ Yes                    | ❌ No
Incremental Queries  | ⚠️ CDF (Databricks)  | ⚠️ Via snapshots          | ✅ Native
Streaming Ingestion  | ⚠️ OK                | ⚠️ OK                     | ✅ Excellent (MoR)
Query Engines        | Spark (primary)      | Spark, Trino, Flink, Hive | Spark (primary)
Read Performance     | ⚠️ Good              | ⚠️ Good                   | ❌ Slower (MoR)
Write Performance    | ⚠️ Good              | ⚠️ Good                   | ✅ Excellent (MoR)
Small Files Problem  | ✅ OPTIMIZE          | ✅ Compaction             | ✅ Compaction
Z-Ordering           | ✅ Yes               | ❌ No                     | ❌ No
Bloom Filters        | ✅ Yes (Databricks)  | ⚠️ Limited                | ⚠️ Limited
File Format          | Parquet (default)    | Parquet, ORC, Avro        | Parquet, Avro (MoR)
Maturity             | ⭐⭐⭐⭐⭐ Mature     | ⭐⭐⭐⭐ Growing           | ⭐⭐⭐ Niche
Ecosystem            | ⭐⭐⭐⭐⭐ Rich       | ⭐⭐⭐⭐ Growing           | ⭐⭐⭐ Limited
Community            | Very active          | Very active               | Active
License              | Apache 2.0           | Apache 2.0                | Apache 2.0

Performance Benchmarks

Read Performance

Test: query a 1TB table (100M rows), filtering down to 1% of the data.

Format        | Query Time | Files Scanned
Plain Parquet | 45s        | 1,000 files
Delta Lake    | 12s        | 150 files (optimized)
Iceberg       | 14s        | 180 files
Hudi (CoW)    | 13s        | 160 files
Hudi (MoR)    | 22s        | 160 base + 40 delta

Insight: Delta, Iceberg, and Hudi CoW perform similarly; Hudi MoR is slower because of the merge overhead.

Write Performance (Upserts)

Test: update 1% of a 1TB table (1M records).

Format        | Update Time | Approach
Plain Parquet | 180s        | Rewrite 10GB partition
Delta Lake    | 35s         | Rewrite only affected files
Iceberg       | 38s         | Rewrite only affected files
Hudi (CoW)    | 40s         | Rewrite affected files
Hudi (MoR)    | 8s          | Append to delta logs

Insight: Hudi MoR is the fastest for writes (roughly 5x faster); Delta and Iceberg are on par with each other.

Streaming Ingestion

Test: ingest 100K records/second from Kafka.

Format        | Ingestion Lag | Queryable Latency
Plain Parquet | 60s           | 60s
Delta Lake    | 20s           | 20s
Iceberg       | 25s           | 25s
Hudi (MoR)    | 10s           | 15s (after compaction)

Insight: Hudi MoR is the best fit for streaming (lowest latency).


Ecosystem Support

Query Engines

Engine        | Delta Lake | Iceberg          | Hudi
Apache Spark  | ✅ Native  | ✅ Native        | ✅ Native
Trino/Presto  | ⚠️ Basic   | ✅ Full support  | ⚠️ Basic
Apache Flink  | ❌ No      | ✅ Full support  | ⚠️ Limited
Hive          | ❌ No      | ✅ Via connector | ⚠️ Limited
Dremio        | ✅ Yes     | ✅ Yes           | ❌ No
Athena        | ✅ Yes     | ✅ Yes           | ❌ No

Insight: Iceberg has the best multi-engine support.

Cloud Platforms

Platform      | Delta Lake  | Iceberg           | Hudi
Databricks    | ✅ Native   | ⚠️ Preview        | ❌ No
AWS EMR       | ✅ Yes      | ✅ Yes            | ✅ Yes
AWS Glue      | ⚠️ Limited  | ✅ Yes            | ✅ Yes
GCP Dataproc  | ✅ Yes      | ✅ Yes            | ✅ Yes
Azure Synapse | ✅ Preview  | ⚠️ Limited        | ❌ No
Snowflake     | ⚠️ Can read | ✅ Iceberg Tables | ❌ No

Insight: Delta is best on Databricks; Iceberg is best cross-platform.


How to Choose: Decision Framework

┌─────────────────────────────────────────────────────────┐
│ START: Choosing a Table Format for your Lakehouse      │
└────────────────┬────────────────────────────────────────┘
                 │
                 ▼
         ┌───────────────────┐
         │ Are you using the │
         │ Databricks        │
         │ platform?         │
         └────┬─────────┬─────┘
              │         │
           YES│         │NO
              │         │
              ▼         ▼
       ┌──────────┐   ┌────────────────────┐
       │ DELTA    │   │ Primary workload?  │
       │ LAKE     │   └───┬──────────┬─────┘
       └──────────┘       │          │
                    Streaming        Batch/Ad-hoc
                          │          │
                          ▼          ▼
                   ┌──────────┐  ┌────────────────┐
                   │ HUDI     │  │ Multi-engine?  │
                   │ (MoR)    │  └───┬────────┬───┘
                   └──────────┘      │        │
                                   YES       NO
                                     │        │
                                     ▼        ▼
                              ┌─────────┐ ┌──────┐
                              │ ICEBERG │ │ DELTA│
                              └─────────┘ └──────┘

Decision Criteria

Choose Delta Lake if:

  • ✅ You use Databricks (best integration)
  • ✅ Spark-centric workloads
  • ✅ You need a mature ecosystem
  • ✅ Your team knows the Spark APIs
  • ✅ You need advanced features (Z-Order, CDF)

Choose Apache Iceberg if:

  • ✅ Multi-engine environment (Spark + Trino + Flink)
  • ✅ You want an open standard and to avoid lock-in
  • ✅ You need partition evolution
  • ✅ Large-scale tables (billions of rows)
  • ✅ You query from Snowflake (Iceberg Tables)

Choose Apache Hudi if:

  • ✅ Streaming-heavy use cases (Kafka, CDC)
  • ✅ Near real-time analytics (< 5 min latency)
  • ✅ Incremental processing (process only changes)
  • ✅ Write-heavy workloads (MoR)
  • ✅ An Uber-style data platform
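
The decision tree above can also be written as a tiny helper - purely illustrative; the function name and flags are made up for this sketch:

def choose_table_format(on_databricks: bool, streaming_heavy: bool, multi_engine: bool) -> str:
    """Encodes the decision diagram above as code (illustrative only)."""
    if on_databricks:
        return "Delta Lake"
    if streaming_heavy:
        return "Apache Hudi (MoR)"
    return "Apache Iceberg" if multi_engine else "Delta Lake"

print(choose_table_format(on_databricks=False, streaming_heavy=True, multi_engine=False))
# -> Apache Hudi (MoR)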

Migration Guide: Parquet → Table Formats

Step 1: Assessment

# Scan existing Parquet data
parquet_path = "s3://data-lake/orders/"
df = spark.read.parquet(parquet_path)

# Check size, partitions
print(f"Rows: {df.count()}")
print(f"Size: {df.rdd.map(lambda x: len(str(x))).sum() / 1e9} GB")
print(f"Partitions: {df.rdd.getNumPartitions()}")

Step 2: Convert to Delta Lake

# Option 1: CONVERT TO DELTA (in-place, no data duplication)
from delta.tables import DeltaTable

DeltaTable.convertToDelta(
    spark,
    f"parquet.`{parquet_path}`",
    "order_date DATE, customer_id STRING"  # Partition schema
)

# Option 2: Rewrite into a new Delta table (explicit copy)
df = spark.read.parquet(parquet_path)
df.write.format("delta") \
    .partitionBy("order_date") \
    .save("s3://lakehouse/orders_delta")

Step 3: Convert to Iceberg

from pyspark.sql import SparkSession

# Create Iceberg table
spark.sql(f"""
    CREATE TABLE iceberg_catalog.db.orders
    USING iceberg
    PARTITIONED BY (days(order_date))
    AS SELECT * FROM parquet.`{parquet_path}`
""")

# Or rewrite the existing data with the DataFrame API
df = spark.read.parquet(parquet_path)
df.writeTo("iceberg_catalog.db.orders").create()

Step 4: Convert to Hudi

hudi_options = {
    'hoodie.table.name': 'orders',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.precombine.field': 'order_date',
    'hoodie.datasource.write.partitionpath.field': 'order_date',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',  # or MERGE_ON_READ
    'hoodie.datasource.write.operation': 'bulk_insert'
}

df = spark.read.parquet(parquet_path)
df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save("s3://lakehouse/orders_hudi")

Step 5: Validate & Cutover

# Validate row counts match
parquet_count = spark.read.parquet(parquet_path).count()
delta_count = spark.read.format("delta").load("s3://lakehouse/orders_delta").count()

assert parquet_count == delta_count, "Row count mismatch!"

# Update pipelines to write to new format
# Cutover queries to read from new format
# Monitor for 1 week
# Decommission old Parquet files
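
Beyond row counts, a sketch of comparing per-day aggregates between the old Parquet data and the new Delta table before cutover (the amount column is an assumption; parquet_path is reused from Step 1):

from pyspark.sql import functions as F

old = spark.read.parquet(parquet_path) \
    .groupBy("order_date") \
    .agg(F.count("*").alias("rows"), F.sum("amount").alias("total"))

new = spark.read.format("delta").load("s3://lakehouse/orders_delta") \
    .groupBy("order_date") \
    .agg(F.count("*").alias("rows"), F.sum("amount").alias("total"))

# Days where the two copies disagree (empty result = safe to cut over)
mismatches = old.exceptAll(new).union(new.exceptAll(old))
assert mismatches.count() == 0, "Aggregate mismatch between Parquet and Delta!"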

Code Examples: Common Operations

Create Table

Delta Lake:

df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("date") \
    .save("/data/events")

Iceberg:

from pyspark.sql.functions import col

df.writeTo("iceberg_catalog.db.events") \
    .partitionedBy(col("date")) \
    .create()

Hudi:

hudi_options = {
    'hoodie.table.name': 'events',
    'hoodie.datasource.write.recordkey.field': 'event_id',
    'hoodie.datasource.write.precombine.field': 'timestamp',
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .save("/data/events")

Read Table

Delta Lake:

df = spark.read.format("delta").load("/data/events")

Iceberg:

df = spark.read.format("iceberg").load("iceberg_catalog.db.events")

Hudi:

df = spark.read.format("hudi").load("/data/events")

Update Records

Delta Lake:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/data/customers")
deltaTable.update(
    condition = "status = 'inactive'",
    set = {"status": "'archived'"}
)

Iceberg:

spark.sql("""
    UPDATE iceberg_catalog.db.customers
    SET status = 'archived'
    WHERE status = 'inactive'
""")

Hudi:

# Hudi updates are expressed as upserts keyed on the record key
updates_df = spark.createDataFrame([
    ("cust123", "archived", "2025-01-15 00:00:00")
], ["customer_id", "status", "updated_at"])

updates_df.write.format("hudi") \
    .option("hoodie.table.name", "customers") \
    .option("hoodie.datasource.write.recordkey.field", "customer_id") \
    .option("hoodie.datasource.write.precombine.field", "updated_at") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .mode("append") \
    .save("/data/customers")

Delete Records

Delta Lake:

deltaTable.delete("order_date < '2020-01-01'")

Iceberg:

spark.sql("""
    DELETE FROM iceberg_catalog.db.orders
    WHERE order_date < '2020-01-01'
""")

Hudi:

deletes_df = spark.sql("SELECT order_id FROM orders WHERE order_date < '2020-01-01'")

deletes_df.write.format("hudi") \
    .option("hoodie.table.name", "orders") \
    .option("hoodie.datasource.write.recordkey.field", "order_id") \
    .option("hoodie.datasource.write.operation", "delete") \
    .mode("append") \
    .save("/data/orders")

Case Studies: Real-World Usage

Case Study 1: E-commerce - Delta Lake

Company: Online retailer, 5M customers, 100M orders
Challenge: Update order status in near real time, analytics on the full history
Solution: Delta Lake on Databricks

Architecture:

MySQL (orders) → Debezium → Kafka → Spark Streaming → Delta Lake
                                                         ↓
                                                    BI Tools (Tableau)

Implementation:

# Streaming write to Delta
stream_df = spark.readStream.format("kafka") \
    .option("subscribe", "order_updates") \
    .load()

stream_df.writeStream.format("delta") \
    .option("checkpointLocation", "/checkpoints/orders") \
    .outputMode("append") \
    .start("/delta/orders")
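
The snippet above lands the raw Kafka payload; below is a hedged sketch of how parsed order updates could then be merged into the Delta table with foreachBatch (the updates_stream DataFrame and its order_id column are assumptions):

from delta.tables import DeltaTable

def upsert_orders(batch_df, batch_id):
    # Merge each micro-batch of parsed updates into the target Delta table
    target = DeltaTable.forPath(spark, "/delta/orders")
    target.alias("t").merge(
        batch_df.alias("s"),
        "t.order_id = s.order_id"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()

updates_stream.writeStream \
    .foreachBatch(upsert_orders) \
    .option("checkpointLocation", "/checkpoints/orders_merge") \
    .start()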

Results:

  • ✅ Latency: 2-5 minutes (CDC → queryable)
  • ✅ MERGE operations handle 100K updates/hour
  • ✅ Analysts query with sub-second response time
  • ✅ Cost: $3K/month (compute + storage)

Case Study 2: Fintech - Apache Iceberg

Company: Payment processor, 10TB of transaction data
Challenge: Multi-engine access (Spark for ETL, Trino for ad-hoc SQL, Flink for streaming)
Solution: Apache Iceberg on AWS S3

Architecture:

Kafka → Flink (streaming) → Iceberg Tables on S3
                              ↓
                         Spark (batch ETL)
                         Trino (ad-hoc SQL)

Implementation:

// Flink streaming write (Java)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TableEnvironment tableEnv = StreamTableEnvironment.create(env);

tableEnv.executeSql("""
    CREATE CATALOG iceberg WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 's3://data-lakehouse/'
    )
""");

tableEnv.executeSql("""
    INSERT INTO iceberg.db.transactions
    SELECT * FROM kafka_source
""");

Results:

  • ✅ Multi-engine: Flink, Spark, Trino work seamlessly
  • ✅ Partition evolution: Changed monthly → daily partitions without rewrite
  • ✅ Query performance: 3x faster than plain Parquet
  • ✅ No vendor lock-in

Case Study 3: Ride-sharing - Apache Hudi

Company: Ride-sharing platform (similar to Grab)
Challenge: Near real-time driver/rider metrics, CDC from PostgreSQL
Solution: Apache Hudi (MoR) on AWS EMR

Architecture:

PostgreSQL → Debezium → Kafka → Spark Streaming → Hudi (MoR)
                                                      ↓
                                                  Presto (queries)

Implementation:

hudi_options = {
    'hoodie.table.name': 'trips',
    'hoodie.datasource.write.recordkey.field': 'trip_id',      # assumed record key
    'hoodie.datasource.write.precombine.field': 'updated_at',  # assumed ordering column
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.compact.inline': 'false',  # async compaction instead
}

stream_df.writeStream.format("hudi") \
    .options(**hudi_options) \
    .option("checkpointLocation", "/checkpoints/trips") \
    .trigger(processingTime='1 minute') \
    .start("/hudi/trips")

Results:

  • ✅ Latency: < 2 minutes (PostgreSQL → queryable)
  • ✅ Incremental processing: only changed records are processed (10x faster)
  • ✅ Storage savings: 30% vs plain Parquet (compression + deduplication)
  • ✅ Handles 1M trip updates/day

Future Outlook: Convergence?

Current Trends

1. Format Convergence:

  • Delta, Iceberg, and Hudi are converging on a similar feature set
  • Delta is adding multi-engine support (UniForm; see the sketch below)
  • Iceberg is adding Z-Ordering
  • Open table format standards are emerging
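
As an example of that convergence, Delta UniForm is turned on through table properties so that Iceberg-compatible readers can consume a Delta table. A hedged sketch - the property names follow the Delta Lake 3.x documentation and the table name is hypothetical, so verify them against your Delta version:

spark.sql("""
    ALTER TABLE lakehouse.orders SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")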

2. Cloud Platform Support:

  • AWS Athena/Glue supports all 3
  • Snowflake adding Iceberg Tables (2024)
  • Databricks supports Delta + Iceberg (preview)

3. Open Standards:

  • Lakehouse open standard (Linux Foundation)
  • Interoperability improving

Predictions (2025-2026)

  • Delta Lake: will stay dominant in the Databricks ecosystem and expand multi-engine support
  • Iceberg: will become the de facto open standard for cross-platform lakehouses
  • Hudi: will focus on the streaming niche and integrate further with the Flink ecosystem

Advice: pick the format that fits your current platform, but keep an eye on interoperability so a later migration stays easy.


Best Practices

1. Start Small

# Pilot with one table first
pilot_table = "orders"  # the most important, highest-volume table

# Migrate, test, validate
# If successful → roll out to the other tables

2. Optimize Regularly

Delta Lake:

# Schedule weekly optimization
deltaTable.optimize().executeCompaction()
deltaTable.optimize().executeZOrderBy("customer_id", "order_date")
deltaTable.vacuum(168)  # clean up old files (retention is in hours; 168h = 7 days)

Iceberg:

# Expire old snapshots (keep only the last 7 days for time travel)
spark.sql("""
    CALL iceberg_catalog.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2025-01-15 00:00:00'
    )
""")

Hudi:

# Schedule compaction for MoR tables (run as a separate job)
spark.read.format("hudi").load("/data/orders") \
    .write.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.compaction.async.enable", "true") \
    .mode("append") \
    .save("/data/orders")

3. Monitor Performance

# Metrics to track
- Query latency (p50, p95, p99)
- Storage size (growth rate)
- Small files count (target < 128MB avg)
- Compaction lag (Hudi MoR)
- Snapshot count (Iceberg/Delta)
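
For the small-files metric, a small sketch using Delta's DESCRIBE DETAIL (the /data/orders path is illustrative):

detail = spark.sql("DESCRIBE DETAIL delta.`/data/orders`") \
    .select("numFiles", "sizeInBytes").first()

avg_file_mb = detail["sizeInBytes"] / detail["numFiles"] / 1024 / 1024
print(f"{detail['numFiles']} files, avg {avg_file_mb:.1f} MB per file")

if avg_file_mb < 128:
    print("Average file size below the 128MB target - schedule OPTIMIZE / compaction")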

4. Schema Evolution

# Add columns (non-breaking)
df_with_new_column.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/data/orders")

# Rename/delete columns (breaking) → communicate to users first

Conclusion

Key Takeaways

✅ Table formats (Delta, Iceberg, Hudi) bring ACID and performance to Data Lakes
✅ Delta Lake: best for Databricks and Spark-centric workloads
✅ Apache Iceberg: best for multi-engine, open ecosystems
✅ Apache Hudi: best for streaming, CDC, and near real-time workloads
✅ Payoff: 3-5x faster queries, 20-40% storage savings, and upserts/deletes on the lake

Recommendations

For startups/SMEs:

  • Start with Iceberg if you want flexibility
  • Or Delta if you are on Databricks
  • Avoid Hudi unless you are streaming-heavy

For enterprises:

  • Multi-team → Iceberg (multi-engine support)
  • Databricks-centric → Delta Lake
  • Real-time analytics → Hudi for the streaming tables

Migration path:

  1. Pilot with one high-value table
  2. Measure the performance improvements
  3. Roll out to other tables incrementally
  4. Deprecate the plain Parquet copies

Next Steps

Want to implement a Lakehouse architecture in your organization?

Carptech can help you with:

  • ✅ Assessment & recommendations (Delta vs Iceberg vs Hudi)
  • ✅ Migration from plain Parquet/Hive to table formats
  • ✅ Performance optimization (3-5x faster queries)
  • ✅ Team training

📞 Contact Carptech: carptech.vn

