Cloud & Infrastructure · Updated: July 22, 2025 · 21 min read

Lakehouse Architecture: Delta Lake, Iceberg, and Hudi Compared

A detailed comparison of the three leading table formats for the Data Lakehouse - Delta Lake, Apache Iceberg, and Apache Hudi: their characteristics, strengths and weaknesses, and how to choose the right one for your business.

Sơn Nguyễn

Data Platform Architect

Lakehouse Architecture comparison - Delta Lake vs Apache Iceberg vs Apache Hudi
#Lakehouse · #Delta Lake · #Apache Iceberg · #Apache Hudi · #Data Architecture · #Cloud Storage

TL;DR

  • A Data Lakehouse combines the strengths of the Data Warehouse (ACID, performance) and the Data Lake (flexibility, cost)
  • Table formats (Delta Lake, Iceberg, Hudi) fix the limitations of plain Parquet files: no ACID, painful updates/deletes, no schema enforcement
  • Delta Lake (Databricks): the most widely used, deeply integrated with Databricks, great for Spark workloads
  • Apache Iceberg (Netflix): engine-agnostic, smart partition evolution, great for multi-engine environments
  • Apache Hudi (Uber): optimized for streaming, CDC, and near real-time ingestion
  • Choosing: Delta if you are on Databricks, Iceberg if you want an open, multi-engine stack, Hudi if your use cases are streaming-heavy
  • Payoff: 3-5x faster queries, 20-40% less storage, ACID on the data lake

Introduction: The Problem with Plain Data Lakes

Running a Data Lake on S3/GCS full of Parquet files? You have probably hit these problems:

❌ No ACID transactions: two pipelines writing at the same time → corruption
❌ No updates/deletes: fixing a single record means rewriting the whole partition
❌ Slow queries: every query has to scan thousands of Parquet files
❌ Painful schema evolution: adding a column means migrating hundreds of TB of data
❌ No time travel: there is no way to roll back to an older version

A real-world example from a Vietnamese e-commerce company:

Data Lake structure:
s3://data-lake/
  orders/
    year=2025/
      month=01/
        day=01/
          part-00001.parquet (500 files per day)
          part-00002.parquet
          ...

Problems they ran into:

  • A customer requests data deletion (GDPR/PDPA) → hundreds of Parquet files have to be found and rewritten
  • An order status changes (e.g., "delivered") → a new record is appended, requiring complex dedup logic downstream
  • A "last 7 days of orders" query scans 3,500 files → slow and expensive

Table formats (Delta Lake, Iceberg, Hudi) solve these problems. They are a metadata layer on top of Parquet/ORC files that provides:

✅ ACID transactions ✅ Efficient updates/deletes ✅ Schema evolution ✅ Time travel ✅ Better query performance

In this post, we compare the three leading table formats in detail so you can pick the right one for your use case.


Lakehouse Architecture Recap

Before diving into table formats, a quick refresher on what a Lakehouse is:

Data Warehouse:

  • ✅ ACID transactions, data quality
  • ✅ Fast queries
  • ❌ Expensive (Snowflake, BigQuery billing)
  • ❌ Inflexible (structured data only)

Data Lake:

  • ✅ Cheap (S3 is only ~$0.023/GB/month)
  • ✅ Flexible (structured, semi-structured, unstructured)
  • ❌ No ACID
  • ❌ Slow queries

Lakehouse = Warehouse performance + Lake cost:

Lakehouse Architecture:

┌─────────────────────────────────────────┐
│ Query Engines: Spark, Trino, Flink     │
├─────────────────────────────────────────┤
│ Table Format: Delta/Iceberg/Hudi       │ ← Metadata layer
├─────────────────────────────────────────┤
│ Storage: Parquet/ORC files on S3/GCS  │
└─────────────────────────────────────────┘

What a table format does (a quick sketch of this metadata layer in action follows the list):

  • Track which files belong to table
  • Store schema, partition info
  • Provide ACID via optimistic concurrency control
  • Enable time travel via versioning
  • Optimize queries via file pruning
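
To make the metadata layer concrete, here is a minimal sketch, assuming a Delta Lake table at the hypothetical local path /data/orders: every commit in _delta_log/ is a JSON file listing the data files it added or removed.

# Minimal sketch: read the Delta transaction log directly and print what each commit did.
# Assumes a Delta table at the hypothetical path /data/orders.
import glob
import json

DELTA_LOG = "/data/orders/_delta_log"

for commit_file in sorted(glob.glob(f"{DELTA_LOG}/*.json")):
    with open(commit_file) as f:
        for line in f:
            action = json.loads(line)
            if "add" in action:
                print(commit_file, "ADD", action["add"]["path"])
            elif "remove" in action:
                print(commit_file, "REMOVE", action["remove"]["path"])
            elif "metaData" in action:
                print(commit_file, "SCHEMA", action["metaData"]["schemaString"][:80])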

Now let's compare the three most popular table formats.


Delta Lake (Databricks)

Overview

Origin: Developed by Databricks, open-sourced in 2019 (Linux Foundation)
License: Apache 2.0
Best for: Databricks users, Spark-centric workloads

Architecture

Delta Lake Table:

data/
  _delta_log/
    00000000000000000000.json  ← Transaction log (ACID)
    00000000000000000001.json
    00000000000000000010.checkpoint.parquet
  part-00000-*.snappy.parquet  ← Data files
  part-00001-*.snappy.parquet

Delta Log (transaction log):

  • JSON files record every change (add file, remove file, metadata change)
  • Every transaction creates a new commit
  • Checkpoint files (Parquet) every 10 commits speed up log reads (a quick way to inspect the log from Spark is shown below)
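
A small sketch of inspecting this log from Spark, assuming the table at /data/orders already exists and a SparkSession named spark is available:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/data/orders")

# One row per commit: version, timestamp, operation (WRITE, MERGE, OPTIMIZE, ...)
deltaTable.history().select("version", "timestamp", "operation", "operationMetrics") \
    .show(truncate=False)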

Key Features

1. ACID Transactions

# Concurrent writes do not conflict
# Writer 1
df1.write.format("delta").mode("append").save("/data/orders")

# Writer 2 (at the same time)
df2.write.format("delta").mode("append").save("/data/orders")

# Delta Lake ensures both succeed or both fail, no partial writes

2. Time Travel

# Read an older version
df = spark.read.format("delta").option("versionAsOf", 5).load("/data/orders")

# Or by timestamp
df = spark.read.format("delta").option("timestampAsOf", "2025-01-01").load("/data/orders")

# Restore to an older version
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/data/orders")
deltaTable.restoreToVersion(5)

Use case: rolling back after loading bad data, auditing historical changes.

3. Upserts (MERGE)

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/data/customers")

# Upsert: update existing, insert new
deltaTable.alias("target").merge(
    updates.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(set = {
    "email": "source.email",
    "updated_at": "source.updated_at"
}).whenNotMatchedInsert(values = {
    "customer_id": "source.customer_id",
    "email": "source.email",
    "created_at": "source.created_at"
}).execute()

Performance: only the files containing matched records are rewritten, not the whole table.

4. Schema Evolution

# Add a new column
df_with_new_column.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/data/orders")

# Delta Lake merges the schema automatically

5. Optimize & Z-Ordering

# Compact small files into larger files
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/data/events")
deltaTable.optimize().executeCompaction()

# Z-Ordering: cluster data by the columns you filter on most often
deltaTable.optimize().executeZOrderBy("user_id", "event_date")

Result: queries like "WHERE user_id = X" run up to 10x faster.

6. Vacuum (Cleanup Old Files)

# Delete files that were replaced by newer versions and are older than the retention window
deltaTable.vacuum(168)  # retention is given in hours; 168h = 7 days

Pros & Cons

Pros ✅:

  • Mature, production-proven (thousands of companies)
  • Tight integration with Databricks (optimized)
  • Rich features (Z-Order, Bloom filters, CDF)
  • Strong community support
  • Easy to get started

Cons ❌:

  • Spark-centric (Trino and Flink support is still limited)
  • Databricks-optimized (some features are only available on Databricks)
  • Partitioning is not as smart as Iceberg's

When to Use Delta Lake

Use it when:

  • You are already on Databricks
  • Your workloads are mostly Spark
  • You need a mature ecosystem
  • Your team is familiar with the Spark APIs

Avoid it when:

  • You want multi-engine access (Trino, Flink, Presto)
  • You want to avoid vendor lock-in (Databricks)
  • You need advanced partitioning (hidden partitions)

Apache Iceberg (Netflix)

Overview

Origin: Developed by Netflix, donated to the Apache Software Foundation in 2018
License: Apache 2.0
Best for: Multi-engine environments, large-scale analytics

Architecture

Iceberg Table:

data/
  metadata/
    v1.metadata.json           ← Current metadata
    v2.metadata.json
    snap-123.avro              ← Snapshot files
    manifest-list-456.avro     ← Manifest lists
    manifest-789.avro          ← Manifests (file lists)
  data/
    part-00000.parquet         ← Data files
    part-00001.parquet

3-level metadata:

  1. Metadata file: Schema, partition spec, snapshots
  2. Manifest list: List of manifest files for a snapshot
  3. Manifest: List of data files with statistics

Why three levels? It keeps metadata operations fast for very large tables (millions of files). A short PyIceberg sketch of walking these levels follows.
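
A minimal sketch of walking the three metadata levels with PyIceberg, assuming a catalog named "default" is configured and a table db.orders exists:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")          # assumed catalog name
table = catalog.load_table("db.orders")    # assumed table identifier

# Level 1: metadata file - schema, partition spec, snapshot history
print(table.schema())
print(table.spec())

# Levels 2-3: each snapshot points at a manifest list, which points at manifests of data files
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms, snapshot.manifest_list)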

Key Features

1. Hidden Partitioning

The problem with Hive-style partitioning:

-- Users have to know the partition structure
SELECT * FROM orders
WHERE year = 2025 AND month = 1 AND day = 15
  AND order_date = '2025-01-15'  -- Redundant!

Iceberg hidden partitioning:

# Define partition transform
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestampType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform

schema = Schema(
    NestedField(1, "order_id", StringType()),
    NestedField(2, "order_date", TimestampType()),
)

# Partition by day(order_date) - hidden from users
partition_spec = PartitionSpec(
    PartitionField(source_id=2, field_id=1000, transform=DayTransform(), name="order_day")
)

Query:

-- Users only filter on order_date; Iceberg maps it to the right partitions automatically
SELECT * FROM orders WHERE order_date = '2025-01-15'

Benefit: users never need to know the partition structure, which avoids a whole class of errors.

2. Partition Evolution

Example: you partition a table by day(order_date); two years later the data has grown and you want to switch to month(order_date).

With Hive: you would have to rewrite the entire table (hundreds of TB).

With Iceberg:

from pyiceberg.transforms import MonthTransform

# Update the partition spec - no data rewrite needed!
table.update_spec() \
    .add_field("order_date", MonthTransform(), "order_month") \
    .commit()

Result: old data stays partitioned by day, new data is partitioned by month, and the Iceberg query planner knows how to read both.

3. Schema Evolution

from pyiceberg.types import StringType

# Add column
table.update_schema() \
    .add_column("customer_tier", StringType()) \
    .commit()

# Rename column
table.update_schema() \
    .rename_column("old_name", "new_name") \
    .commit()

# Delete column
table.update_schema() \
    .delete_column("unused_column") \
    .commit()

Backward compatibility: old Parquet files that lack the new column → Iceberg returns null for it.

4. Time Travel

# Read a specific snapshot by ID
df = spark.read.format("iceberg") \
    .option("snapshot-id", 1234567890) \
    .load("warehouse.db.orders")

# Or as of a timestamp (epoch millis)
df = spark.read.format("iceberg") \
    .option("as-of-timestamp", "1609459200000") \
    .load("warehouse.db.orders")

5. Engine-Agnostic

Iceberg works with many engines:

# Spark
df = spark.read.format("iceberg").load("db.table")

# Trino
trino> SELECT * FROM iceberg.db.table;

# Flink (streaming)
tableEnv.executeSql("SELECT * FROM iceberg_catalog.db.table")

# Hive
hive> SELECT * FROM db.table;

Benefit: you are not locked into a single engine.

Pros & Cons

Pros ✅:

  • Truly open: not tied to any vendor
  • Multi-engine: Spark, Trino, Flink, Hive, Presto
  • Advanced partitioning: hidden partitions, partition evolution
  • Scalable: designed for petabyte scale
  • Active development: Apple, Netflix, LinkedIn contribute

Cons ❌:

  • Newer than Delta (smaller ecosystem)
  • More complex setup than Delta (needs a catalog such as Hive Metastore or Nessie)
  • Fewer built-in features than Delta (no Z-Ordering or Change Data Feed out of the box)

When to Use Iceberg

Use it when:

  • You run a multi-engine environment (Spark + Trino + Flink)
  • You want to avoid vendor lock-in
  • You need partition evolution
  • You run very large tables (billions of rows)

Avoid it when:

  • Your workload is Spark-only (Delta is simpler)
  • You need advanced Spark optimizations (Z-Order)
  • You have a small team and want the easiest setup

Apache Hudi (Uber)

Overview

Origin: Developed by Uber, donated to the Apache Software Foundation in 2019
License: Apache 2.0
Best for: Streaming ingestion, CDC, near real-time analytics

Architecture

Hudi has two storage types:

1. Copy-on-Write (CoW):

  • Data is stored in Parquet files
  • Updates rewrite the entire Parquet file
  • Fast reads (Parquet only)
  • Slow writes (rewrite overhead)

2. Merge-on-Read (MoR):

  • Base files (Parquet) + delta files (Avro logs)
  • Updates are appended to delta logs
  • Fast writes
  • Reads have to merge base + delta (slower)

The table type is chosen at write time; a short sketch follows the layout below.

Hudi MoR Table:

data/
  .hoodie/
    hoodie.properties
    00001.commit               ← Timeline (commits log)
    00001.inflight
  partition1/
    file1_base.parquet         ← Base files
    file1_delta.log            ← Delta logs (updates)
  partition2/
    file2_base.parquet
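
A short sketch of picking the storage type at write time, assuming an input DataFrame df with order_id/updated_at columns and hypothetical S3 paths:

# Same write API; only the table type option changes
cow_options = {
    'hoodie.table.name': 'orders_cow',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',    # read-optimized
}

mor_options = dict(cow_options)
mor_options['hoodie.table.name'] = 'orders_mor'
mor_options['hoodie.datasource.write.table.type'] = 'MERGE_ON_READ'  # write-optimized

df.write.format("hudi").options(**cow_options).mode("append").save("s3://lakehouse/orders_cow")
df.write.format("hudi").options(**mor_options).mode("append").save("s3://lakehouse/orders_mor")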

Key Features

1. Incremental Queries

Example: ingest orders from a database every 5 minutes and process only new/updated records.

# Read only changes since last checkpoint
incremental_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", last_checkpoint) \
    .load("/data/orders")

Benefit: instead of reprocessing 10TB of data, you process only the 10GB that changed → up to 1,000x faster.

2. Upserts (Record-Level Updates)

from pyspark.sql import SparkSession

# Hudi upsert
hudi_options = {
    'hoodie.table.name': 'customers',
    'hoodie.datasource.write.recordkey.field': 'customer_id',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.operation': 'upsert'
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("/data/customers")

Performance (MoR): updates are appended to log files rather than rewriting Parquet → very fast writes.

3. CDC (Change Data Capture)

Example: streaming changes from MySQL into Hudi.

# Debezium → Kafka → Spark Streaming → Hudi
from pyspark.sql.functions import from_json, col

stream_df = spark.readStream.format("kafka") \
    .option("subscribe", "mysql.orders") \
    .load()

# Parse CDC events (schema = StructType of the Debezium payload, defined elsewhere)
parsed_df = stream_df.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")

# Write to Hudi (MoR for fast ingestion)
parsed_df.writeStream.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .option("checkpointLocation", "/checkpoints/orders_cdc") \
    .trigger(processingTime='1 minute') \
    .start("/data/orders_cdc")

Latency: near real-time (1-5 minutes from database change → queryable).

4. Timeline & Time Travel

# Hudi timeline: sequence of commits
# Read the table as of a specific commit (instant)
df = spark.read.format("hudi") \
    .option("as.of.instant", "20250115120000") \
    .load("/data/orders")

5. Compaction (MoR Tables)

MoR tables accumulate delta logs, so reads get slower over time. Compaction merges the logs back into the base files.

# Inline compaction (automatic)
hudi_options['hoodie.compact.inline'] = 'true'
hudi_options['hoodie.compact.inline.max.delta.commits'] = '5'

# Async compaction (run as a separate job)
spark.read.format("hudi") \
    .load("/data/orders") \
    .write.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.compaction.async.enable", "true") \
    .mode("append") \
    .save("/data/orders")

Pros & Cons

Pros ✅:

  • Best for streaming: MoR tables are very fast for ingestion
  • Incremental processing: process only changed data
  • CDC-friendly: near real-time pipelines
  • Record-level updates: efficient upserts

Cons ❌:

  • Complexity: the two storage types (CoW vs MoR) can be confusing
  • Slower reads (MoR): base and delta files have to be merged
  • Smaller ecosystem: not as widespread as Delta/Iceberg
  • Spark-centric: Trino support is limited

When to Use Hudi

Use it when:

  • You have streaming use cases (Kafka → Data Lake)
  • You run CDC pipelines (database replication)
  • You need incremental processing
  • You need near real-time analytics

Avoid it when:

  • Your workloads are batch-only (Delta/Iceberg are simpler)
  • You need broad multi-engine support (Iceberg is better)
  • Your team is not familiar with streaming

Detailed Comparison: Feature Matrix

Feature              | Delta Lake           | Apache Iceberg            | Apache Hudi
ACID Transactions    | ✅ Yes               | ✅ Yes                    | ✅ Yes
Time Travel          | ✅ Yes               | ✅ Yes                    | ✅ Yes (Timeline)
Schema Evolution     | ✅ Yes               | ✅ Yes                    | ✅ Yes
Upserts/Deletes      | ✅ MERGE             | ✅ MERGE                  | ✅ Upsert (native)
Partition Evolution  | ❌ No                | ✅ Yes                    | ⚠️ Limited
Hidden Partitioning  | ❌ No                | ✅ Yes                    | ❌ No
Incremental Queries  | ⚠️ CDF (Databricks)  | ⚠️ Via snapshots          | ✅ Native
Streaming Ingestion  | ⚠️ OK                | ⚠️ OK                     | ✅ Excellent (MoR)
Query Engines        | Spark (primary)      | Spark, Trino, Flink, Hive | Spark (primary)
Read Performance     | ⚠️ Good              | ⚠️ Good                   | ❌ Slower (MoR)
Write Performance    | ⚠️ Good              | ⚠️ Good                   | ✅ Excellent (MoR)
Small Files Problem  | ✅ OPTIMIZE          | ✅ Compaction             | ✅ Compaction
Z-Ordering           | ✅ Yes               | ❌ No                     | ❌ No
Bloom Filters        | ✅ Yes (Databricks)  | ⚠️ Limited                | ⚠️ Limited
File Format          | Parquet (default)    | Parquet, ORC, Avro        | Parquet, Avro (MoR)
Maturity             | ⭐⭐⭐⭐⭐ Mature     | ⭐⭐⭐⭐ Growing           | ⭐⭐⭐ Niche
Ecosystem            | ⭐⭐⭐⭐⭐ Rich       | ⭐⭐⭐⭐ Growing           | ⭐⭐⭐ Limited
Community            | Very active          | Very active               | Active
License              | Apache 2.0           | Apache 2.0                | Apache 2.0

Performance Benchmarks

Read Performance

Test: query a 1TB table (100M rows), filtering down to 1% of the data.

Format        | Query Time | Files Scanned
Plain Parquet | 45s        | 1,000 files
Delta Lake    | 12s        | 150 files (optimized)
Iceberg       | 14s        | 180 files
Hudi (CoW)    | 13s        | 160 files
Hudi (MoR)    | 22s        | 160 base + 40 delta

Insight: Delta, Iceberg, and Hudi CoW perform similarly; Hudi MoR is slower because of the merge overhead.

Write Performance (Upserts)

Test: update 1% of a 1TB table (1M records).

Format        | Update Time | Approach
Plain Parquet | 180s        | Rewrite 10GB partition
Delta Lake    | 35s         | Rewrite only affected files
Iceberg       | 38s         | Rewrite only affected files
Hudi (CoW)    | 40s         | Rewrite affected files
Hudi (MoR)    | 8s          | Append to delta logs

Insight: Hudi MoR is the fastest for writes (roughly 5x faster); Delta and Iceberg are on par with each other.

Streaming Ingestion

Test: ingest 100K records/second from Kafka.

Format        | Ingestion Lag | Queryable Latency
Plain Parquet | 60s           | 60s
Delta Lake    | 20s           | 20s
Iceberg       | 25s           | 25s
Hudi (MoR)    | 10s           | 15s (after compaction)

Insight: Hudi MoR is the best fit for streaming (lowest latency).


Ecosystem Support

Query Engines

Engine        | Delta Lake | Iceberg          | Hudi
Apache Spark  | ✅ Native  | ✅ Native        | ✅ Native
Trino/Presto  | ⚠️ Basic   | ✅ Full support  | ⚠️ Basic
Apache Flink  | ❌ No      | ✅ Full support  | ⚠️ Limited
Hive          | ❌ No      | ✅ Via connector | ⚠️ Limited
Dremio        | ✅ Yes     | ✅ Yes           | ❌ No
Athena        | ✅ Yes     | ✅ Yes           | ❌ No

Insight: Iceberg has the best multi-engine support.

Cloud Platforms

Platform      | Delta Lake  | Iceberg           | Hudi
Databricks    | ✅ Native   | ⚠️ Preview        | ❌ No
AWS EMR       | ✅ Yes      | ✅ Yes            | ✅ Yes
AWS Glue      | ⚠️ Limited  | ✅ Yes            | ✅ Yes
GCP Dataproc  | ✅ Yes      | ✅ Yes            | ✅ Yes
Azure Synapse | ✅ Preview  | ⚠️ Limited        | ❌ No
Snowflake     | ⚠️ Can read | ✅ Iceberg Tables | ❌ No

Insight: Delta is best on Databricks; Iceberg is best cross-platform.


How to Choose: Decision Framework

┌─────────────────────────────────────────────────────────┐
│ START: Choosing a Table Format for your Lakehouse      │
└────────────────┬────────────────────────────────────────┘
                 │
                 ▼
         ┌───────────────────┐
         │ Are you using the │
         │ Databricks        │
         │ platform?         │
         └────┬─────────┬─────┘
              │         │
           YES│         │NO
              │         │
              ▼         ▼
       ┌──────────┐   ┌────────────────────┐
       │ DELTA    │   │ Primary workload?  │
       │ LAKE     │   └───┬──────────┬─────┘
       └──────────┘       │          │
                    Streaming        Batch/Ad-hoc
                          │          │
                          ▼          ▼
                   ┌──────────┐  ┌────────────────┐
                   │ HUDI     │  │ Multi-engine?  │
                   │ (MoR)    │  └───┬────────┬───┘
                   └──────────┘      │        │
                                   YES       NO
                                     │        │
                                     ▼        ▼
                              ┌─────────┐ ┌──────┐
                              │ ICEBERG │ │ DELTA│
                              └─────────┘ └──────┘

Decision Criteria

Choose Delta Lake if:

  • ✅ You use Databricks (best integration)
  • ✅ Spark-centric workloads
  • ✅ You need a mature ecosystem
  • ✅ Your team knows the Spark APIs
  • ✅ You need advanced features (Z-Order, CDF)

Choose Apache Iceberg if:

  • ✅ Multi-engine environment (Spark + Trino + Flink)
  • ✅ You want an open standard and to avoid lock-in
  • ✅ You need partition evolution
  • ✅ Large-scale tables (billions of rows)
  • ✅ You query from Snowflake (Iceberg Tables)

Choose Apache Hudi if:

  • ✅ Streaming-heavy use cases (Kafka, CDC)
  • ✅ Near real-time analytics (< 5 min latency)
  • ✅ Incremental processing (process only changes)
  • ✅ Write-heavy workloads (MoR)
  • ✅ An Uber-style data platform
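
The decision tree above can also be written as a tiny helper - purely illustrative; the function name and flags are made up for this sketch:

def choose_table_format(on_databricks: bool, streaming_heavy: bool, multi_engine: bool) -> str:
    """Encodes the decision diagram above as code (illustrative only)."""
    if on_databricks:
        return "Delta Lake"
    if streaming_heavy:
        return "Apache Hudi (MoR)"
    return "Apache Iceberg" if multi_engine else "Delta Lake"

print(choose_table_format(on_databricks=False, streaming_heavy=True, multi_engine=False))
# -> Apache Hudi (MoR)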

Migration Guide: Parquet → Table Formats

Step 1: Assessment

# Scan existing Parquet data
parquet_path = "s3://data-lake/orders/"
df = spark.read.parquet(parquet_path)

# Check size, partitions
print(f"Rows: {df.count()}")
print(f"Size: {df.rdd.map(lambda x: len(str(x))).sum() / 1e9} GB")
print(f"Partitions: {df.rdd.getNumPartitions()}")

Step 2: Convert to Delta Lake

# Option 1: CONVERT TO DELTA (in-place, no data duplication)
from delta.tables import DeltaTable

DeltaTable.convertToDelta(
    spark,
    f"parquet.`{parquet_path}`",
    "order_date DATE, customer_id STRING"  # Partition schema
)

# Option 2: Rewrite into a new Delta table (explicit copy)
df = spark.read.parquet(parquet_path)
df.write.format("delta") \
    .partitionBy("order_date") \
    .save("s3://lakehouse/orders_delta")

Step 3: Convert to Iceberg

from pyspark.sql import SparkSession

# Create Iceberg table
spark.sql(f"""
    CREATE TABLE iceberg_catalog.db.orders
    USING iceberg
    PARTITIONED BY (days(order_date))
    AS SELECT * FROM parquet.`{parquet_path}`
""")

# Or rewrite the existing data with the DataFrame API
df = spark.read.parquet(parquet_path)
df.writeTo("iceberg_catalog.db.orders").create()

Step 4: Convert to Hudi

hudi_options = {
    'hoodie.table.name': 'orders',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.precombine.field': 'order_date',
    'hoodie.datasource.write.partitionpath.field': 'order_date',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',  # or MERGE_ON_READ
    'hoodie.datasource.write.operation': 'bulk_insert'
}

df = spark.read.parquet(parquet_path)
df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save("s3://lakehouse/orders_hudi")

Step 5: Validate & Cutover

# Validate row counts match
parquet_count = spark.read.parquet(parquet_path).count()
delta_count = spark.read.format("delta").load("s3://lakehouse/orders_delta").count()

assert parquet_count == delta_count, "Row count mismatch!"

# Update pipelines to write to new format
# Cutover queries to read from new format
# Monitor for 1 week
# Decommission old Parquet files
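
Beyond row counts, a sketch of comparing per-day aggregates between the old Parquet data and the new Delta table before cutover (the amount column is an assumption; parquet_path is reused from Step 1):

from pyspark.sql import functions as F

old = spark.read.parquet(parquet_path) \
    .groupBy("order_date") \
    .agg(F.count("*").alias("rows"), F.sum("amount").alias("total"))

new = spark.read.format("delta").load("s3://lakehouse/orders_delta") \
    .groupBy("order_date") \
    .agg(F.count("*").alias("rows"), F.sum("amount").alias("total"))

# Days where the two copies disagree (empty result = safe to cut over)
mismatches = old.exceptAll(new).union(new.exceptAll(old))
assert mismatches.count() == 0, "Aggregate mismatch between Parquet and Delta!"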

Code Examples: Common Operations

Create Table

Delta Lake:

df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("date") \
    .save("/data/events")

Iceberg:

from pyspark.sql.functions import col

df.writeTo("iceberg_catalog.db.events") \
    .partitionedBy(col("date")) \
    .create()

Hudi:

hudi_options = {
    'hoodie.table.name': 'events',
    'hoodie.datasource.write.recordkey.field': 'event_id',
    'hoodie.datasource.write.precombine.field': 'timestamp',
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .save("/data/events")

Read Table

Delta Lake:

df = spark.read.format("delta").load("/data/events")

Iceberg:

df = spark.read.format("iceberg").load("iceberg_catalog.db.events")

Hudi:

df = spark.read.format("hudi").load("/data/events")

Update Records

Delta Lake:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/data/customers")
deltaTable.update(
    condition = "status = 'inactive'",
    set = {"status": "'archived'"}
)

Iceberg:

spark.sql("""
    UPDATE iceberg_catalog.db.customers
    SET status = 'archived'
    WHERE status = 'inactive'
""")

Hudi:

# Hudi updates are expressed as upserts keyed on the record key
updates_df = spark.createDataFrame([
    ("cust123", "archived", "2025-01-15 00:00:00")
], ["customer_id", "status", "updated_at"])

updates_df.write.format("hudi") \
    .option("hoodie.table.name", "customers") \
    .option("hoodie.datasource.write.recordkey.field", "customer_id") \
    .option("hoodie.datasource.write.precombine.field", "updated_at") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .mode("append") \
    .save("/data/customers")

Delete Records

Delta Lake:

deltaTable.delete("order_date < '2020-01-01'")

Iceberg:

spark.sql("""
    DELETE FROM iceberg_catalog.db.orders
    WHERE order_date < '2020-01-01'
""")

Hudi:

deletes_df = spark.sql("SELECT order_id FROM orders WHERE order_date < '2020-01-01'")

deletes_df.write.format("hudi") \
    .option("hoodie.table.name", "orders") \
    .option("hoodie.datasource.write.recordkey.field", "order_id") \
    .option("hoodie.datasource.write.operation", "delete") \
    .mode("append") \
    .save("/data/orders")

Case Studies: Real-World Usage

Case Study 1: E-commerce - Delta Lake

Company: Online retailer, 5M customers, 100M orders
Challenge: Update order status in near real time, analytics on the full history
Solution: Delta Lake on Databricks

Architecture:

MySQL (orders) → Debezium → Kafka → Spark Streaming → Delta Lake
                                                         ↓
                                                    BI Tools (Tableau)

Implementation:

# Streaming write to Delta
stream_df = spark.readStream.format("kafka") \
    .option("subscribe", "order_updates") \
    .load()

stream_df.writeStream.format("delta") \
    .option("checkpointLocation", "/checkpoints/orders") \
    .outputMode("append") \
    .start("/delta/orders")
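
The snippet above lands the raw Kafka payload; below is a hedged sketch of how parsed order updates could then be merged into the Delta table with foreachBatch (the updates_stream DataFrame and its order_id column are assumptions):

from delta.tables import DeltaTable

def upsert_orders(batch_df, batch_id):
    # Merge each micro-batch of parsed updates into the target Delta table
    target = DeltaTable.forPath(spark, "/delta/orders")
    target.alias("t").merge(
        batch_df.alias("s"),
        "t.order_id = s.order_id"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()

updates_stream.writeStream \
    .foreachBatch(upsert_orders) \
    .option("checkpointLocation", "/checkpoints/orders_merge") \
    .start()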

Results:

  • ✅ Latency: 2-5 minutes (CDC → queryable)
  • ✅ MERGE operations handle 100K updates/hour
  • ✅ Analysts query with sub-second response time
  • ✅ Cost: $3K/month (compute + storage)

Case Study 2: Fintech - Apache Iceberg

Company: Payment processor, 10TB of transaction data
Challenge: Multi-engine access (Spark for ETL, Trino for ad-hoc SQL, Flink for streaming)
Solution: Apache Iceberg on AWS S3

Architecture:

Kafka → Flink (streaming) → Iceberg Tables on S3
                              ↓
                         Spark (batch ETL)
                         Trino (ad-hoc SQL)

Implementation:

// Flink streaming write (Java)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TableEnvironment tableEnv = StreamTableEnvironment.create(env);

tableEnv.executeSql("""
    CREATE CATALOG iceberg WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 's3://data-lakehouse/'
    )
""");

tableEnv.executeSql("""
    INSERT INTO iceberg.db.transactions
    SELECT * FROM kafka_source
""");

Results:

  • ✅ Multi-engine: Flink, Spark, Trino work seamlessly
  • ✅ Partition evolution: Changed monthly → daily partitions without rewrite
  • ✅ Query performance: 3x faster than plain Parquet
  • ✅ No vendor lock-in

Case Study 3: Ride-sharing - Apache Hudi

Company: Ride-sharing platform (similar to Grab)
Challenge: Near real-time driver/rider metrics, CDC from PostgreSQL
Solution: Apache Hudi (MoR) on AWS EMR

Architecture:

PostgreSQL → Debezium → Kafka → Spark Streaming → Hudi (MoR)
                                                      ↓
                                                  Presto (queries)

Implementation:

hudi_options = {
    'hoodie.table.name': 'trips',
    'hoodie.datasource.write.recordkey.field': 'trip_id',      # assumed record key
    'hoodie.datasource.write.precombine.field': 'updated_at',  # assumed ordering column
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.compact.inline': 'false',  # async compaction instead
}

stream_df.writeStream.format("hudi") \
    .options(**hudi_options) \
    .option("checkpointLocation", "/checkpoints/trips") \
    .trigger(processingTime='1 minute') \
    .start("/hudi/trips")

Results:

  • ✅ Latency: < 2 minutes (PostgreSQL → queryable)
  • ✅ Incremental processing: only changed records are processed (10x faster)
  • ✅ Storage savings: 30% vs plain Parquet (compression + deduplication)
  • ✅ Handles 1M trip updates/day

Future Outlook: Convergence?

Current Trends

1. Format Convergence:

  • Delta, Iceberg, and Hudi are converging on a similar feature set
  • Delta is adding multi-engine support (UniForm; see the sketch below)
  • Iceberg is adding Z-Ordering
  • Open table format standards are emerging
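
As an example of that convergence, Delta UniForm is turned on through table properties so that Iceberg-compatible readers can consume a Delta table. A hedged sketch - the property names follow the Delta Lake 3.x documentation and the table name is hypothetical, so verify them against your Delta version:

spark.sql("""
    ALTER TABLE lakehouse.orders SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")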

2. Cloud Platform Support:

  • AWS Athena/Glue supports all 3
  • Snowflake adding Iceberg Tables (2024)
  • Databricks supports Delta + Iceberg (preview)

3. Open Standards:

  • Lakehouse open standard (Linux Foundation)
  • Interoperability improving

Predictions (2025-2026)

  • Delta Lake: will stay dominant in the Databricks ecosystem and expand multi-engine support
  • Iceberg: will become the de facto open standard for cross-platform lakehouses
  • Hudi: will focus on the streaming niche and integrate further with the Flink ecosystem

Advice: pick the format that fits your current platform, but keep an eye on interoperability so a later migration stays easy.


Best Practices

1. Start Small

# Pilot with one table first
pilot_table = "orders"  # the most important, highest-volume table

# Migrate, test, validate
# If successful → roll out to the other tables

2. Optimize Regularly

Delta Lake:

# Schedule weekly optimization
deltaTable.optimize().executeCompaction()
deltaTable.optimize().executeZOrderBy("customer_id", "order_date")
deltaTable.vacuum(168)  # clean up old files (retention is in hours; 168h = 7 days)

Iceberg:

# Expire old snapshots (keep only the last 7 days for time travel)
spark.sql("""
    CALL iceberg_catalog.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2025-01-15 00:00:00'
    )
""")

Hudi:

# Schedule compaction for MoR tables (run as a separate job)
spark.read.format("hudi").load("/data/orders") \
    .write.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.compaction.async.enable", "true") \
    .mode("append") \
    .save("/data/orders")

3. Monitor Performance

# Metrics to track
- Query latency (p50, p95, p99)
- Storage size (growth rate)
- Small files count (target < 128MB avg)
- Compaction lag (Hudi MoR)
- Snapshot count (Iceberg/Delta)
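
For the small-files metric, a small sketch using Delta's DESCRIBE DETAIL (the /data/orders path is illustrative):

detail = spark.sql("DESCRIBE DETAIL delta.`/data/orders`") \
    .select("numFiles", "sizeInBytes").first()

avg_file_mb = detail["sizeInBytes"] / detail["numFiles"] / 1024 / 1024
print(f"{detail['numFiles']} files, avg {avg_file_mb:.1f} MB per file")

if avg_file_mb < 128:
    print("Average file size below the 128MB target - schedule OPTIMIZE / compaction")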

4. Schema Evolution

# Add columns (non-breaking)
df_with_new_column.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/data/orders")

# Rename/delete columns (breaking) → communicate to users first

Conclusion

Key Takeaways

✅ Table formats (Delta, Iceberg, Hudi) bring ACID and performance to Data Lakes
✅ Delta Lake: best for Databricks and Spark-centric workloads
✅ Apache Iceberg: best for multi-engine, open ecosystems
✅ Apache Hudi: best for streaming, CDC, and near real-time workloads
✅ Payoff: 3-5x faster queries, 20-40% storage savings, and upserts/deletes on the lake

Recommendations

For startups/SMEs:

  • Start with Iceberg if you want flexibility
  • Or Delta if you are on Databricks
  • Avoid Hudi unless you are streaming-heavy

For enterprises:

  • Multi-team → Iceberg (multi-engine support)
  • Databricks-centric → Delta Lake
  • Real-time analytics → Hudi for the streaming tables

Migration path:

  1. Pilot with one high-value table
  2. Measure the performance improvements
  3. Roll out to other tables incrementally
  4. Deprecate the plain Parquet copies

Next Steps

Want to implement a Lakehouse architecture in your organization?

Carptech can help you with:

  • ✅ Assessment & recommendations (Delta vs Iceberg vs Hudi)
  • ✅ Migration from plain Parquet/Hive to table formats
  • ✅ Performance optimization (3-5x faster queries)
  • ✅ Team training

📞 Contact Carptech: carptech.vn

