A/B Testing Best Practices: A Comprehensive Guide for Data Teams

TL;DR

A/B testing (also called experimentation) is a scientific method for comparing two versions (A and B) to determine, from data, which one performs better.

Standard process:

Hypothesis: "Changing the button color from blue to red will increase CTR"
Design: Control (blue) vs. Variant (red), 50-50 split
Sample Size: Calculate in advance how many users you need
Run: Collect data for X days
Analyze: Check statistical significance (p-value < 0.05)
Decision: Ship the winner or iterate

Common Pitfalls:

❌ Peeking: Checking results mid-test and acting on them → inflates the false positive rate
❌ Multiple Testing: Running 20 tests at once → 64% chance of at least one false positive
❌ Too Small Sample: Drawing conclusions before enough data has been collected
❌ Novelty Effect: Users like whatever is new → the effect can fade after about 2 weeks

Key Metrics:

p-value: < 0.05 = significant (if there were truly no difference, a result at least this extreme would occur less than 5% of the time)
Confidence Level: 95% (standard), 99% (conservative)
Statistical Power: 80% (the probability of detecting an effect if it really exists)
Minimum Detectable Effect (MDE): The smallest effect size the test can reliably detect

Example: Booking.com runs 1,000+ A/B tests per year, increasing conversion by 10-25% per year.

What Is A/B Testing?

Definition

A/B testing is an experimental method that compares two versions (A and B) to determine which one performs better on a specific metric.

Basic structure:

Control (A): The current version, the baseline
Variant (B): The new version you want to test
Random Assignment: Users are randomly assigned to A or B (usually 50-50)
Metric: The measured outcome (CTR, conversion rate, revenue, ...)

Why Do You Need A/B Testing?

The problem with "Ship and See":

January: Launch a new feature
February: Conversion increases 5%

→ Did the new feature work?

Not necessarily! The increase could be due to:

Seasonality (Tet, holidays, ...)
A marketing campaign running at the same time
A competitor having problems
Random chance

How A/B testing solves this:

January: 50% of users see the new feature (B), 50% see the old version (A)
Results:
- Group A (Control): 10% conversion
- Group B (Variant): 10.5% conversion
- p-value = 0.03 (< 0.05) → Significant!

→ Both groups ran at the same time under the same conditions, so the 5% relative lift (10% → 10.5%) can be attributed to the new feature

When to use A/B Testing

Use A/B testing when:

✅ You have enough traffic (>1,000 users/variant/week)
✅ The metric is clear and quantifiable
✅ Users can be randomly assigned
✅ The decision matters (large effort, high risk)

You do NOT need A/B testing when:

❌ Traffic is too low (<500 users/week)
❌ Bug fixes, legal requirements
❌ The current UX is so poor that the improvement is obvious
❌ Exploratory research (use user interviews instead)

Statistical Foundations

Hypothesis Testing 101

Null Hypothesis (H0): There is no difference between A and B

Alternative Hypothesis (H1): There is a difference

Example:

H0: Conversion rate of A = Conversion rate of B
H1: Conversion rate of A ≠ Conversion rate of B

Goal: Find enough evidence to reject H0 (that is, to conclude that a difference exists).

p-value

Definition: The probability of observing a result like this one (or a more extreme one) if H0 is true (that is, if there is no real difference).

Example:

p-value = 0.03 (3%)

Meaning: If A and B were truly the same, there would be only a 3% chance
of observing a difference this large.

→ 3% < 5% (threshold) → Reject H0 → the difference is statistically significant (here, in favor of B)

Common thresholds:

p < 0.05 (5%): Standard, corresponds to a 95% confidence level
p < 0.01 (1%): Conservative, corresponds to a 99% confidence level
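
To make the definition concrete, here is a minimal sketch of how a two-sided p-value can be computed for two conversion rates with a pooled two-proportion z-test. The function name and the counts are hypothetical, chosen only for illustration; a statistics library would normally do this for you.

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both groups share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under H0 both groups share the pooled rate
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (rate_b - rate_a) / std_err
    # Probability of a |z| at least this large when H0 is true
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 10.0% vs 11.0% conversion on 10,000 users each
p = two_proportion_p_value(conv_a=1000, n_a=10000, conv_b=1100, n_b=10000)
print(f"p-value: {p:.4f}")
```

The smaller the p-value, the harder it is to explain the observed gap by chance alone; identical rates in both groups give a p-value of 1.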

Type I & Type II Errors

Type I Error (False Positive): Concluding that B is better than A when it actually is not.

Probability = α (alpha) = 0.05 (commonly used)
Consequence: Shipping a feature that does not work

Type II Error (False Negative): Concluding that B is not better than A when it actually is.

Probability = β (beta) = 0.20 (commonly used)
Consequence: Missing an improvement

Statistical Power: 1 - β = 0.80 (the probability of detecting an effect if it really exists)
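
The relationship between sample size and power can be sketched with the normal approximation for a two-sided two-proportion z-test. The function below is an illustrative helper, not a library API, and it assumes equal group sizes.

```python
from statistics import NormalDist

def power_two_proportions(p_control, p_variant, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    std_err = (p_control * (1 - p_control) / n_per_group
               + p_variant * (1 - p_variant) / n_per_group) ** 0.5
    z_effect = abs(p_variant - p_control) / std_err
    # Ignores the negligible chance of rejecting in the wrong direction
    return NormalDist().cdf(z_effect - z_alpha)

# Power to detect 10% -> 10.5% at several sample sizes per group
for n in (5_000, 20_000, 60_000):
    print(f"n={n:,}: power={power_two_proportions(0.10, 0.105, n):.2f}")
```

With a small true effect, power stays far below the 80% target until the groups become large, which is why the sample size has to be planned before the test starts.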

Sample Size Calculation

Formula (for a proportion test):

from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize

# Parameters
baseline_conversion = 0.10  # Current conversion rate: 10%
desired_lift = 0.05         # Relative improvement to detect: 5%
                            # = 0.10 * 1.05 = 0.105 (absolute)
alpha = 0.05               # 5% false positive rate
power = 0.80               # 80% statistical power

# Calculate effect size (Cohen's h); variant rate first so that h is positive
effect_size = proportion_effectsize(
    baseline_conversion * (1 + desired_lift),
    baseline_conversion
)

# Calculate required sample size per group
sample_size = zt_ind_solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative='two-sided'
)

print(f"Required sample size per variant: {int(sample_size):,}")
# Output: roughly 57,800 users per variant

Example:

Baseline conversion: 10%
Goal: detect a 5% relative lift = 10% → 10.5%
Required: ~57,800 users/variant ≈ 115,600 users total

Duration Calculation:

daily_users = 5000
sample_size_total = 115600

duration_days = sample_size_total / daily_users
print(f"Test duration: {duration_days:.1f} days")
# Output: Test duration: 23.1 days

Minimum Detectable Effect (MDE)

MDE: The smallest effect size your test can detect with adequate statistical power.

Trade-off:

Small MDE (detect small changes) → Requires a LARGE sample size
Large MDE (detect only big changes) → Smaller sample size

Example (10% baseline conversion, α = 0.05, power = 80%):

MDE	Sample Size/Variant	Duration (5K users/day)
2% relative lift	~356,000	~143 days
5% relative lift	~57,800	~23 days
10% relative lift	~14,750	~6 days

Recommendation: Set the MDE to the smallest effect that matters for the business.
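
The trade-off can be explored with the standard normal-approximation formula for two proportions. This is a minimal sketch: the helper name and the 5,000 users/day figure are assumptions for illustration, and other calculators may give slightly different numbers depending on the approximation they use.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Users per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

DAILY_USERS = 5_000  # assumed total traffic, split across both variants

for mde in (0.02, 0.05, 0.10):
    n = sample_size_per_variant(0.10, mde)
    days = ceil(2 * n / DAILY_USERS)
    print(f"{mde:.0%} relative lift: {n:,} per variant, about {days} days")
```

Because the required sample grows with the inverse square of the effect, halving the MDE roughly quadruples the traffic you need.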

A/B Testing Workflow

Step 1: Hypothesis & Metric

Bad Hypothesis:

"Let's try changing the button color and see what happens"

Good Hypothesis:

"Changing the CTA button from blue to red will increase CTR by 10%
because red stands out more against a white background"

Metric: CTR (clicks / impressions)

Criteria for a good metric:

✅ Measurable: It can be measured
✅ Sensitive: It moves when the variant has an effect
✅ Timely: Data arrives quickly (no 6-month wait)
✅ Aligned with Business Goal: Tied to revenue/growth

Primary vs Secondary Metrics:

Primary: The main metric that drives the decision (e.g., conversion rate)
Secondary: Metrics that add deeper understanding (e.g., time on page, bounce rate)

Step 2: Design Experiment

Randomization:

import hashlib

def assign_variant(user_id, experiment_name, salt=""):
    """
    Hash-based random assignment (consistent across sessions)
    """
    hash_input = f"{user_id}:{experiment_name}:{salt}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    return 'A' if hash_value % 2 == 0 else 'B'

# Example
user_id = "user_12345"
variant = assign_variant(user_id, "button_color_test")
print(f"User {user_id} assigned to variant: {variant}")

Traffic Split:

50-50: Standard, maximum statistical power
90-10: Conservative (90% control, 10% variant); use when risk is a concern
Multi-armed Bandit: Dynamic; shifts traffic toward the winner over time
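
The cost of an unequal split can be estimated with a short calculation. This sketch assumes both arms have roughly the same variance, so the variance of the measured difference is proportional to 1/n_control + 1/n_variant; the helper name is hypothetical.

```python
def relative_total_sample(variant_share):
    """Total sample needed relative to a 50-50 split, for equal power."""
    control_share = 1 - variant_share
    # Variance of the difference is proportional to 1/n_control + 1/n_variant
    inflation = 1 / control_share + 1 / variant_share
    return inflation / 4  # 4 is the value of the sum at a 50-50 split

for share in (0.5, 0.3, 0.1):
    print(f"{share:.0%} in variant: {relative_total_sample(share):.2f}x the traffic")
```

Under these assumptions a 90-10 split needs close to three times the total traffic of a 50-50 split to reach the same power, which is the price paid for limiting exposure.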

Step 3: Sample Size & Duration

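Calculate the sample size (as in the previous section):

from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize

def calculate_sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Users per variant needed to detect a relative lift `mde` over `baseline`."""
    effect_size = proportion_effectsize(baseline * (1 + mde), baseline)
    return zt_ind_solve_power(effect_size=effect_size, alpha=alpha, power=power)

# Use sample size calculator
sample_size_per_variant = calculate_sample_size(
    baseline=0.10,
    mde=0.05,
    alpha=0.05,
    power=0.80
)

Duration recommendations:

Minimum: 1 week (captures weekday/weekend patterns)
Ideal: 2-4 weeks
Maximum: 6 weeks (diminishing returns beyond that)

Common mistake:

❌ Running until the result is significant, then stopping
✅ Calculating the duration in advance, running the full duration, and only then analyzing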
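
One way to turn a sample size into a committed duration is to round up to whole weeks, so every weekday is covered equally. The helper below is a hypothetical sketch and the traffic numbers are placeholders.

```python
from math import ceil

def planned_duration_days(total_sample, daily_users, min_weeks=1):
    """Days to run, rounded up to whole weeks to cover weekly cycles."""
    raw_days = ceil(total_sample / daily_users)
    weeks = max(min_weeks, ceil(raw_days / 7))
    return weeks * 7

# Placeholder inputs: 100,000 users needed in total, 5,000 eligible users per day
print(planned_duration_days(total_sample=100_000, daily_users=5_000))  # 21
```

Fixing this number before launch, and writing it into the experiment plan, removes the temptation to stop the moment a dashboard shows p < 0.05.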

Step 4: Run Experiment & Monitor

Logging events:

-- Experiment assignments table
CREATE TABLE experiment_assignments (
  user_id VARCHAR(255),
  experiment_name VARCHAR(255),
  variant VARCHAR(10),
  assigned_at TIMESTAMP,
  PRIMARY KEY (user_id, experiment_name)
);

-- Events table
CREATE TABLE events (
  event_id VARCHAR(255) PRIMARY KEY,
  user_id VARCHAR(255),
  event_type VARCHAR(100),
  event_timestamp TIMESTAMP,
  event_properties JSON
);

Monitoring checklist:

✅ Sample Ratio Mismatch (SRM): Is the variant split actually 50-50? (If not, there is a bug!)
✅ Outliers: Are there extreme users (e.g., one user with $100K revenue)?
✅ Data quality: Missing data, null values?

Step 5: Analyze Results

SQL to calculate the conversion rate:

WITH experiment_users AS (
  -- Users in experiment
  SELECT
    user_id,
    variant
  FROM experiment_assignments
  WHERE experiment_name = 'button_color_test'
),

conversions AS (
  -- Conversion events
  SELECT
    user_id,
    COUNT(*) AS conversion_count
  FROM events
  WHERE event_type = 'purchase'
    AND event_timestamp >= '2025-09-01'
    AND event_timestamp < '2025-09-15'
  GROUP BY user_id
)

-- Calculate conversion rate per variant
SELECT
  eu.variant,
  COUNT(DISTINCT eu.user_id) AS total_users,
  COUNT(DISTINCT c.user_id) AS converted_users,
  ROUND(100.0 * COUNT(DISTINCT c.user_id) / COUNT(DISTINCT eu.user_id), 2) AS conversion_rate_pct
FROM experiment_users eu
LEFT JOIN conversions c ON eu.user_id = c.user_id
GROUP BY eu.variant;

Output:

variant | total_users | converted_users | conversion_rate_pct
--------|-------------|-----------------|--------------------
A       | 31,000      | 3,100           | 10.00
B       | 31,000      | 3,255           | 10.50

Python statistical test:

from scipy.stats import chi2_contingency
import numpy as np

# Data from SQL
control_converted = 3100
control_total = 31000
variant_converted = 3255
variant_total = 31000

# Create contingency table
contingency_table = np.array([
    [control_converted, control_total - control_converted],
    [variant_converted, variant_total - variant_converted]
])

# Chi-square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Control conversion rate: {control_converted/control_total:.2%}")
print(f"Variant conversion rate: {variant_converted/variant_total:.2%}")
print(f"Relative lift: {(variant_converted/variant_total - control_converted/control_total) / (control_converted/control_total):.2%}")
print(f"p-value: {p_value:.4f}")
print(f"Significant? {p_value < 0.05}")

# Output:
# Control conversion rate: 10.00%
# Variant conversion rate: 10.50%
# Relative lift: 5.00%
# p-value: ≈ 0.041 (chi2_contingency applies Yates' continuity correction to 2x2 tables by default)
# Significant? True

Confidence Interval:

from statsmodels.stats.proportion import confint_proportions_2indep

# Calculate 95% confidence interval for difference
ci_low, ci_high = confint_proportions_2indep(
    variant_converted, variant_total,
    control_converted, control_total,
    method='wald'
)

print(f"95% CI for difference: [{ci_low:.4f}, {ci_high:.4f}]")
# Output: 95% CI for difference: [0.0002, 0.0098]
# The interval lies entirely above 0 → Variant is better

Step 6: Decision

Decision Matrix:

p-value	Confidence Interval	Decision
< 0.05	Fully above 0	✅ Ship variant
< 0.05	Fully below 0	❌ Keep control
>= 0.05	Includes 0	⚠️ Inconclusive (run longer or iterate)
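
The matrix above can be written as a small function. This is an illustrative sketch only: the function name is hypothetical, the confidence interval is assumed to be for (variant - control), and a real decision should also weigh the considerations that follow.

```python
def decide(p_value, ci_low, ci_high, alpha=0.05):
    """Map a p-value and a CI for (variant - control) to a decision."""
    if p_value < alpha and ci_low > 0:
        return "ship variant"
    if p_value < alpha and ci_high < 0:
        return "keep control"
    return "inconclusive"

print(decide(0.03, 0.001, 0.009))    # ship variant
print(decide(0.01, -0.012, -0.002))  # keep control
print(decide(0.20, -0.003, 0.008))   # inconclusive
```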

Considerations beyond statistics:

Effect size: Significant, but the lift is only 0.5% → Is it worth shipping?
Cost: Is the variant more complex, with a higher maintenance cost?
User Experience: Did secondary metrics (bounce rate, time on page) get worse?

Common Pitfalls & How to Avoid

Pitfall 1: Peeking (Early Stopping)

The problem:

Day 1: Check results → p = 0.08 (not significant)
Day 3: Check again → p = 0.04 (significant!) → Ship!
Day 7: Check again → p = 0.12 (not significant anymore)

Why this is wrong: Each peek is another hypothesis test, and stopping at the first significant one inflates the false positive rate.

Simulation:

import numpy as np
from scipy.stats import ttest_ind

np.random.seed(42)
n_simulations = 10000
false_positives = 0

for _ in range(n_simulations):
    # H0 is true: A and B are identical (same mean)
    control = np.random.normal(0, 1, 1000)
    variant = np.random.normal(0, 1, 1000)

    # Peek 5 times
    peeks = [200, 400, 600, 800, 1000]
    for peek in peeks:
        _, p = ttest_ind(control[:peek], variant[:peek])
        if p < 0.05:
            false_positives += 1
            break  # Stop early!

print(f"False positive rate with peeking: {false_positives/n_simulations:.2%}")
# Output: ~14% (instead of 5%!)

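Solutions:

Sequential testing: Use methods that adjust for repeated looks (e.g., group sequential boundaries such as Pocock or O'Brien-Fleming; a Bonferroni correction across looks also works but is conservative)
Fixed horizon: Commit to the duration up front and do not peek
Always Valid p-values: Use methods such as mSPRT (mixture Sequential Probability Ratio Test)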
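
As a rough illustration of the first option, the sketch below reruns an A/A simulation with the standard library only and compares a naive 0.05 threshold at every look against a Bonferroni-style threshold of 0.05 / 5. The simulation sizes, the seed, and the known-variance z-test are simplifying assumptions; production systems typically use proper alpha-spending boundaries instead.

```python
import random
from statistics import NormalDist

def false_positive_rate(alpha_per_look, n_sims=2000,
                        looks=(100, 200, 300, 400, 500), seed=0):
    """Share of A/A simulations that reject H0 at any interim look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha_per_look / 2)
    rejections = 0
    for _ in range(n_sims):
        diff_sum = 0.0  # running sum of (variant - control); H0 is true
        n = 0
        for look in looks:
            while n < look:
                # Difference of two unit-variance observations ~ N(0, 2)
                diff_sum += rng.gauss(0, 2 ** 0.5)
                n += 1
            z = diff_sum / (2 * n) ** 0.5  # the sum has variance 2n
            if abs(z) > z_crit:
                rejections += 1
                break  # stop at the first "significant" look
    return rejections / n_sims

naive = false_positive_rate(alpha_per_look=0.05)
corrected = false_positive_rate(alpha_per_look=0.05 / 5)
print(f"naive: {naive:.1%}, Bonferroni across 5 looks: {corrected:.1%}")
```

The corrected run keeps the overall false positive rate under 5%, at the cost of needing stronger evidence at each look.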

Pitfall 2: Multiple Testing Problem

The problem: Running many tests at the same time → a higher chance of false positives.

Example:

20 A/B tests, each with α = 0.05

Probability of at least 1 false positive (assuming independent tests and no real effects):
1 - (1 - 0.05)^20 = 64%!

Solutions:

1. Bonferroni Correction:

n_tests = 20
alpha_individual = 0.05 / n_tests  # 0.0025

print(f"Adjusted alpha: {alpha_individual:.4f}")
# Each test needs p < 0.0025 instead of 0.05

2. False Discovery Rate (FDR) - Benjamini-Hochberg:

import numpy as np
from statsmodels.stats.multitest import multipletests

# p-values from your tests (only the first 5 of the 20 are shown here;
# pass the full list in practice)
p_values = [0.01, 0.03, 0.04, 0.06, 0.08]

# FDR correction
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

print(f"Original significant: {sum(np.array(p_values) < 0.05)}")
print(f"After FDR correction: {sum(reject)}")
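
To show what the Benjamini-Hochberg procedure does under the hood, here is a minimal standard-library sketch of the step-up rule. The function name and the p-values are made up for illustration; `multipletests` remains the practical choice.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject

# Made-up p-values from 10 hypothetical tests
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(sum(p < 0.05 for p in example), "significant before correction")
print(sum(benjamini_hochberg(example)), "significant after BH")
```

Five of these tests pass the naive 0.05 threshold, but only the two smallest p-values survive the correction.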

Pitfall 3: Simpson's Paradox

The problem: The overall result differs from the result within each segment.

Example:

Overall:
- Control: 6% conversion
- Variant: 12.1% conversion
→ Variant wins!

But when split by segment:

Mobile:
- Control: 15% (1,000 users)
- Variant: 13% (9,000 users)

Desktop:
- Control: 5% (9,000 users)
- Variant: 4% (1,000 users)

→ Variant LOSES in both segments!

Cause: The variant group contains far more mobile users (the segment with the higher conversion rate), which creates the illusion of a win. With proper randomization such an imbalance should not occur, so it also signals an assignment problem.

Solutions:

Segment analysis: Always analyze by the important segments
Stratified sampling: Balance segments between A and B
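
The reversal is easy to reproduce from raw counts. The sketch below uses illustrative (users, conversions) pairs in which the variant is heavily skewed toward mobile; the numbers are invented to demonstrate the effect.

```python
# (users, conversions) per variant and segment; illustrative counts
data = {
    "control": {"mobile": (1_000, 150), "desktop": (9_000, 450)},
    "variant": {"mobile": (9_000, 1_170), "desktop": (1_000, 40)},
}

def rate(users, conversions):
    return conversions / users

def overall_rate(segments):
    users = sum(u for u, _ in segments.values())
    conversions = sum(c for _, c in segments.values())
    return conversions / users

for name, segments in data.items():
    per_segment = {seg: f"{rate(*uc):.1%}" for seg, uc in segments.items()}
    print(name, f"overall={overall_rate(segments):.1%}", per_segment)
```

The variant looks better overall even though it converts worse on both mobile and desktop, purely because of how the segments are weighted in each group.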

Pitfall 4: Novelty Effect

The problem: Users like whatever is new → the variant wins in the short term, but the effect fades after 2-4 weeks.

Example:

Week 1: Variant +15% conversion
Week 2: Variant +8% conversion
Week 3: Variant +2% conversion
Week 4: Variant -1% conversion (no meaningful difference)

Solutions:

Run longer: At least 2-3 weeks
Cohort analysis: Compare new users vs. returning users
Holdout group: Keep 10% of users on control permanently for long-term comparison

Pitfall 5: Sample Ratio Mismatch (SRM)

The problem: The observed variant split does not match the designed split (for example, not 50-50 when the design is 50-50).

Example:

Expected: 50,000 control, 50,000 variant
Actual:   48,000 control, 52,000 variant

→ Likely a bug in randomization or logging!

Detection:

from scipy.stats import chisquare

observed = [48000, 52000]
expected = [50000, 50000]

chi2, p_value = chisquare(observed, expected)
print(f"p-value: {p_value:.4f}")
# p < 0.05 → SRM detected! Investigate!

Common causes:

Bot traffic that is not split correctly
Cache issues
Redirect bugs

Advanced Topics

Multi-Armed Bandit (MAB)

Compared with A/B testing:

A/B: Fixed 50-50 split; run to completion, then ship the winner
MAB: Dynamically shifts traffic toward the better-performing variant

When to use MAB:

High traffic (>100K users/week)
You need to optimize quickly (email subject lines, ad creatives)
You want to limit "regret" (the share of traffic wasted on the losing variant)

Thompson Sampling (example):

import numpy as np

class ThompsonSampling:
    def __init__(self, n_variants):
        self.successes = np.ones(n_variants)  # Beta prior
        self.failures = np.ones(n_variants)

    def select_variant(self):
        # Sample from Beta distribution
        samples = np.random.beta(self.successes, self.failures)
        return np.argmax(samples)

    def update(self, variant, reward):
        if reward == 1:
            self.successes[variant] += 1
        else:
            self.failures[variant] += 1

# Simulate
bandit = ThompsonSampling(n_variants=2)
for i in range(1000):
    variant = bandit.select_variant()
    # Variant 0: 10% conversion, Variant 1: 12% conversion
    conversion = np.random.random() < (0.10 if variant == 0 else 0.12)
    bandit.update(variant, int(conversion))

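# Users actually served per variant (subtract the Beta(1, 1) prior pseudo-counts)
pulls = bandit.successes + bandit.failures - 2
print(f"Traffic allocation: {pulls / pulls.sum()}")
# Variant 1 usually ends up with the larger share, but with only 1,000 users
# and a 10% vs 12% gap the exact split varies a lot from run to run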

Multi-Variant Testing (A/B/C/D/...)

Trade-off: More variants → a LARGER sample size is needed.

Sample size adjustment:

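# 2 variants (A/B): ~57.8K per variant ≈ 116K total
# 4 variants (A/B/C/D): ~77K per variant ≈ 308K total (+167%!)
#   (assumes each of the 3 treatments is compared with control at a Bonferroni-adjusted alpha)

# Rule of thumb: sample size per variant grows by roughly 20-35% each time the number of variants doubles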
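
The adjustment can be sketched as follows, assuming each treatment is compared only against control and the significance level is Bonferroni-adjusted across those comparisons. The helper is illustrative and uses the normal approximation; other multiple-comparison schemes would give different numbers.

```python
from math import ceil
from statistics import NormalDist

def per_variant_sample(p1, p2, n_variants, alpha=0.05, power=0.80):
    """Sample size per arm when each of (n_variants - 1) treatments is
    compared against control at a Bonferroni-adjusted alpha."""
    comparisons = n_variants - 1
    z_alpha = NormalDist().inv_cdf(1 - alpha / comparisons / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Baseline 10%, target 10.5%, for 2, 4, and 8 arms in total
for k in (2, 4, 8):
    n = per_variant_sample(0.10, 0.105, k)
    print(f"{k} variants: {n:,} per variant, {k * n:,} total")
```

The per-variant requirement grows slowly, but the total grows quickly because every extra arm needs its own full sample.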

Recommendations:

A/B: Standard
A/B/C: OK if you have enough traffic
A/B/C/D/E/F...: Consider MAB instead of A/B

Case Studies

Case Study 1: Booking.com - Experimentation at Scale

Context:

Booking.com runs 1,000+ A/B tests per year
Every change must go through an A/B test

Example Tests:

Test 1: Urgency Messaging

Control: "10 rooms left"
Variant: "10 rooms left, booked 50 times in last 24 hours"
Result: +2.3% conversion, p < 0.001
Impact: $15M revenue/year

Test 2: Price Display

Control: Show the price per night
Variant: Show the total price (all nights)
Result: -5% conversion (people scared by big number)
Decision: Do not ship

Test 3: Photo Size

Control: Small thumbnails
Variant A: Medium photos
Variant B: Large photos
Result: Variant A +3% CTR, Variant B +2% CTR (but slower page load → -1% conversion)
Decision: Ship Variant A

Key Learnings:

There is no such thing as a "small" change → test everything
1/3 of tests win, 1/3 are neutral, 1/3 lose
Cumulative effect: +10-25% conversion improvement per year

Case Study 2: Netflix - Artwork Personalization

Context:

Netflix tested personalized artwork (thumbnails) for each title
Hypothesis: Artwork matched to the user's tastes → higher playback rate

Experiment Design:

Control: Default artwork (same for everyone)
Variant: Personalized artwork (ML model chọn artwork based on viewing history)

Example:

User who likes romantic movies:
- Control: "Stranger Things" with monster artwork
- Variant: "Stranger Things" with Eleven & Mike artwork

User who likes action:
- Control: Same monster artwork
- Variant: "Stranger Things" with action-scene artwork

Results:

Playback rate: +5% (significant, p < 0.001)
Session duration: +2%
Secondary: No negative impact on completion rate

Decision: Ship

Impact: Applied across the entire catalog → an estimated +$1B in retention value.

Case Study 3: Startup Vietnam - Pricing Page Test

Context:

A Vietnamese SaaS startup
Wanted to test the pricing page layout

Variants:

Control: 3 tiers horizontal (Basic, Pro, Enterprise)
Variant: 2 tiers + "Contact Sales" for Enterprise

Hypothesis: Simplification → higher conversion

Results (after 3 weeks, 15K visitors):

Metric	Control	Variant	Lift	p-value
Signup rate	8.2%	9.5%	+15.9%	0.02
Pro tier %	30%	42%	+40%	0.01
Avg revenue/user	$45	$52	+15.6%	0.04

Segment Analysis:

Small companies (<10 employees):
- Control: 9% signup
- Variant: 11% signup (+22%)

Large companies (>50 employees):
- Control: 5% signup
- Variant: 4% signup (-20%, but not significant p=0.15)

Decision: Ship variant

Follow-up: A separate A/B test for large companies (adding the "Enterprise" tier back).

Tools & Infrastructure

Experimentation Platforms

1. Google Optimize (Free, sunset 2023)

Replaced by: Google Analytics 4 + Firebase A/B Testing

2. Optimizely (Enterprise, $50K+/year)

Feature flags + A/B testing
Visual editor (no-code)
Strong stats engine

3. VWO (Visual Website Optimizer, $200+/month)

Heatmaps + A/B testing
Easy to use for non-technical users

4. LaunchDarkly (Feature flags, $1,000+/month)

Feature flags with built-in A/B testing
Complex targeting rules

5. Statsig (Free tier, $1,000+/month for paid)

Modern experimentation platform
Pulse analysis (auto metrics)
Free for <1M events/month

Build Your Own (for Startups)

Components:

Feature Flag System:

# feature_flags.py
import hashlib

def get_variant(user_id, experiment_id):
    """Simple hash-based assignment"""
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    return 'B' if hash_value % 100 < 50 else 'A'  # 50-50 split

# Usage in application
if get_variant(current_user.id, 'button_color_test') == 'B':
    button_color = 'red'
else:
    button_color = 'blue'

Event Logging:

// Track experiment exposure
analytics.track('Experiment Viewed', {
  experiment_id: 'button_color_test',
  variant: variant,
  user_id: userId
});

// Track conversion
analytics.track('Purchase', {
  revenue: 99.99,
  user_id: userId
});

Analysis Pipeline (dbt):

-- models/experiments/button_color_test_results.sql
{{ config(materialized='table') }}

WITH experiment_users AS (
  SELECT DISTINCT user_id, variant
  FROM {{ ref('experiment_events') }}
  WHERE experiment_id = 'button_color_test'
),

conversions AS (
  SELECT user_id, SUM(revenue) as revenue
  FROM {{ ref('purchase_events') }}
  WHERE created_at >= '2025-09-01'
  GROUP BY user_id
)

SELECT
  eu.variant,
  COUNT(DISTINCT eu.user_id) as users,
  COUNT(DISTINCT c.user_id) as converted_users,
  SAFE_DIVIDE(COUNT(DISTINCT c.user_id), COUNT(DISTINCT eu.user_id)) as conversion_rate,
  SUM(COALESCE(c.revenue, 0)) as total_revenue
FROM experiment_users eu
LEFT JOIN conversions c ON eu.user_id = c.user_id
GROUP BY eu.variant

Stats Testing (Python notebook):

# analysis.ipynb
import pandas as pd
from scipy.stats import chi2_contingency

# Load from data warehouse
df = pd.read_sql("SELECT * FROM button_color_test_results", conn)

# Chi-square test
contingency = [[df.loc[0, 'converted_users'], df.loc[0, 'users'] - df.loc[0, 'converted_users']],
               [df.loc[1, 'converted_users'], df.loc[1, 'users'] - df.loc[1, 'converted_users']]]

chi2, p_value, _, _ = chi2_contingency(contingency)
print(f"p-value: {p_value:.4f}")

Total Cost: ~$0 in tooling (only data warehouse and analyst time)

Best Practices Checklist

Pre-Launch Checklist

Hypothesis documented: Clear and measurable
Primary metric defined: Aligned with business goal
Sample size calculated: Enough power to detect the MDE
Duration planned: At least 1 week, covering weekdays and weekends
Randomization tested: SRM check, AA test passed
Guardrail metrics set: Metrics that must not get worse (e.g., revenue, engagement)
Logging verified: Events are tracked correctly
Stakeholders aligned: PM, Eng, and Design agree on the design

During Experiment

Monitor SRM: Is the variant split correct?
Check data quality: Missing data? Outliers?
Resist peeking: Check monitoring only; make no early decisions
Document issues: Bugs, downtime, external events (marketing campaigns)

Post-Experiment

Statistical test passed: p < 0.05 (or your threshold)
Effect size meaningful: Is the lift large enough to be worth shipping?
Segment analysis done: Consistent across segments?
Guardrail metrics OK: No other metrics made worse?
Decision documented: Ship/Don't ship + reasoning
Learnings shared: Write-up for the team

Conclusion

A/B testing is one of the most powerful tools for making decisions based on data rather than opinions. However, it must be done correctly to avoid false positives and bad decisions.

Key Takeaways

Always calculate the sample size first → Do not run until significant
Avoid peeking → Commit to the duration from the start
Watch out for multiple testing → Adjust alpha when running many tests
Segment analysis → Detects Simpson's Paradox
Consider effect size, not just the p-value → Significant ≠ Meaningful
Run long enough → Avoid the novelty effect (at least 2 weeks)

Antipatterns to Avoid

❌ "Ship and see" without control group
❌ Stopping a test early as soon as it is significant
❌ Running 20 tests and reporting only the "wins"
❌ Ignoring the sample size calculation
❌ Trusting a single test → Always validate with follow-up tests

Next Steps

After mastering A/B testing, you should learn:

Attribution Modeling: Understanding multi-touch attribution
Cohort Analysis: Analyzing the behavior of user groups
Customer Segmentation: Advanced segmentation techniques

Carptech - Experimentation Solutions for Vietnamese Businesses

At Carptech, we help Vietnamese businesses build an experimentation culture:

Our Services

Experimentation Platform Setup: Feature flags, logging, analysis pipeline
Statistical Consulting: Sample size calculation, test design, analysis
Training & Workshops: Training teams on A/B testing best practices
Embedded Analytics Engineer: Support ongoing experiments

Case Studies

E-commerce: Set up an experimentation platform, running 50+ tests/year → Conversion +18%
SaaS: A/B testing of the pricing page and onboarding flow → MRR +35%
Fintech: Feature rollouts with A/B tests → 60% fewer bugs, higher confidence

Contact: https://carptech.vn

Written by the Carptech Team, experts in Data Platform & Analytics in Vietnam.
