Inference Crisis: Why Serving LLMs 'Burns Money' Faster Than Training

Last month I got the hosting bill for a fairly small RAG system: 3 million requests/month, averaging 500 tokens/request. Total cost: $4,200. Training the model in the first place? Zero (it runs on an API). Serving it every month is nowhere near zero.

A recent paper from Google (co-authored by a Turing Award winner) gives the problem a blunt name: the Inference Crisis. The core argument: today's GPU/TPU hardware was designed for training, which means dense matrix multiplication, large batches, and predictable memory access patterns. Inference is the exact opposite: requests trickle in, context length varies, and memory bandwidth, not compute, is the bottleneck.

Why Inference Is Technically Different From Training

When you train a model, you:

Process large batches at once (high GPU utilization)
Know the sequence length up front (predictable memory)
Optimize for throughput (samples/second)

When you serve inference:

Requests arrive sporadically and unevenly
Context length differs from request to request
You have to optimize for latency (time to first token) AND throughput

The real problem is the KV cache. During transformer inference you cache key-value pairs so you don't recompute attention for tokens that were already processed. This cache grows linearly with context length and eats memory fast:

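Memory for KV Cache = 2 × num_layers × num_kv_heads × head_dim × seq_len × dtype_bytes

Example with Llama 3.1 70B (80 layers, 64 attention heads sharing 8 KV heads via grouped-query attention), sequence length 32K:
= 2 × 80 × 8 × 128 × 32768 × 2 (float16)
≈ 10.7 GB

Without grouped-query attention (all 64 heads cached), the same request would need ≈ 86 GB.

That is about 10.7 GB of KV cache for a single long-context request, on top of roughly 140 GB of FP16 weights. A single A100 80GB cannot even hold the weights, and on a multi-GPU node every long-context request claims a double-digit-GB slice of whatever memory is left. This is why serving is expensive.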
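A quick way to sanity-check the formula is to plug the numbers in. The sketch below assumes the Llama 3 70B shape (80 layers, 64 query heads, 8 KV heads under grouped-query attention, head_dim 128). Only the KV heads are cached, so the head count you plug in changes the answer by 8x. The function name is mine, not from any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size in bytes for one sequence (the factor 2 covers keys + values)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed Llama-3-70B-style shape: 80 layers, head_dim 128, FP16 (2 bytes per value)
gqa = kv_cache_bytes(80, 8, 128, 32768)   # 8 KV heads (grouped-query attention)
mha = kv_cache_bytes(80, 64, 128, 32768)  # hypothetical: all 64 heads cached

print(f"GQA: {gqa / 1e9:.1f} GB, MHA: {mha / 1e9:.1f} GB")
# prints "GQA: 10.7 GB, MHA: 85.9 GB"
```

The 8x gap between the two numbers is exactly why grouped-query attention matters for serving cost.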

The Techniques That Are Tackling This Problem

PagedAttention and vLLM

vLLM tackles KV cache memory fragmentation by borrowing an idea from OS virtual memory: paged allocation for the KV cache.

from vllm import LLM, SamplingParams

# vLLM manages KV cache memory automatically through PagedAttention
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,  # Spread across 4 GPUs
    gpu_memory_utilization=0.90,  # Use 90% GPU memory
    max_num_seqs=256,  # Max concurrent sequences
    # PagedAttention enabled by default
)

# Example prompts (placeholders; substitute your own)
prompts = [
    "Explain the KV cache in one paragraph.",
    "What is continuous batching?",
]

# Continuous batching: the engine schedules sequences itself, no static batch size
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
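To make the paging idea concrete, here is a toy allocator. It is a sketch of the concept only, not vLLM's actual implementation: the class name, the block size of 16, and the data structures are all illustrative assumptions. Each sequence gets fixed-size blocks on demand, so the most it can waste is one partially filled block.

```python
class PagedKVAllocator:
    """Toy paged KV cache: sequences receive fixed-size blocks on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists yet)
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        # A finished sequence returns its blocks to the pool immediately
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    allocator.append_token("a")  # 20 tokens -> 2 blocks
for _ in range(5):
    allocator.append_token("b")  # 5 tokens -> 1 block
print(len(allocator.block_tables["a"]), len(allocator.free_blocks))
```

Compare this with reserving a contiguous max-length region per request up front: there, a request that stops after 20 tokens still holds memory sized for the full context window.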

Before vLLM, serving systems typically used static batching: you picked a batch size at startup, and a new batch was accepted only after every request in the current batch had finished. Throughput was low and the GPU sat idle a lot. vLLM uses continuous batching: when one sequence finishes, a new one takes its slot immediately, without waiting for the rest of the batch.
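The difference is easy to see in a small simulation. This sketch counts decode steps for a list of made-up output lengths, assuming every sequence costs one step per token and ignoring prefill and scheduling overhead. The function names and the numbers are illustrative.

```python
import heapq

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must fully finish before the next one starts."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    finish_times = []  # min-heap of the step at which each busy slot frees up
    for n in lengths:
        # Use an empty slot if there is one, else wait for the earliest finisher
        start = 0 if len(finish_times) < batch_size else heapq.heappop(finish_times)
        heapq.heappush(finish_times, start + n)
    return max(finish_times) if finish_times else 0

lengths = [10, 200, 15, 20, 180, 12, 25, 30]  # output tokens per request (made up)
print(static_batching_steps(lengths, 4))      # 380
print(continuous_batching_steps(lengths, 4))  # 200
```

With static batching, the short requests in each batch sit idle behind the 200-token and 180-token ones. With continuous batching, their slots are reused while the long requests keep decoding.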

In practice I have seen vLLM deliver 3-5x the throughput of naive Hugging Face serving at comparable latency.

Speculative Decoding

This technique is quite elegant: a small draft model predicts the next K tokens, and the large main model then verifies all of them in a single forward pass.

Draft model (small, fast) generates: "The quick brown fox"
Main model verifies in 1 forward pass: Accept "The quick brown" → Reject "fox" → Generate "cat"
Result: 4 tokens ("The quick brown cat") for the cost of 1 main model forward pass instead of 4
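The accept/reject step can be sketched in a few lines. This assumes greedy decoding, where a draft token is accepted only if it equals the main model's own top choice at that position. Production implementations use a probabilistic acceptance rule instead. `verify_draft` and its inputs are hypothetical names, and the token lists stand in for real model outputs.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of a drafted continuation.

    target_tokens[i] is the main model's top choice after the prefix plus
    draft_tokens[:i]; one forward pass over the draft yields all of them,
    so it has len(draft_tokens) + 1 entries.
    """
    output = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            output.append(wanted)  # first mismatch: take the main model's token and stop
            return output
        output.append(drafted)
    # Every draft token matched: the same pass also gives one bonus token
    output.append(target_tokens[len(draft_tokens)])
    return output

print(verify_draft(
    ["The", "quick", "brown", "fox"],
    ["The", "quick", "brown", "cat", "sat"],
))
# prints ['The', 'quick', 'brown', 'cat']
```

Each verification pass emits the accepted prefix plus one token from the main model itself, so it always makes progress even when the very first draft token is rejected.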

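When the draft model is right (typically ~70-80% of the time with a model from the same family), you save significant compute. When it is wrong, you lose the draft passes spent on the rejected tokens, but the main model's pass still yields one valid token. Expected speedup: a 2-3x latency improvement.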
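For a rough feel of how acceptance rate turns into tokens per main-model pass, here is a back-of-the-envelope estimate. It assumes each draft token is accepted independently with the same probability, which is a simplification, and it counts tokens per verification pass rather than wall-clock speedup, since the draft model's own passes are not free.

```python
def expected_tokens_per_pass(acceptance_rate, num_draft_tokens):
    """Expected tokens emitted per main-model pass: sum of a**i for i in 0..K."""
    a, k = acceptance_rate, num_draft_tokens
    if a >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1 - a ** (k + 1)) / (1 - a)

for rate in (0.5, 0.7, 0.8):
    print(f"acceptance {rate:.0%}, K=5: {expected_tokens_per_pass(rate, 5):.2f} tokens/pass")
```

At 80% acceptance with K=5 this gives about 3.7 tokens per pass. Once draft-model time is subtracted, that is in the same ballpark as the 2-3x figure.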

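# Speculative decoding with vLLM
# Note: these keyword arguments follow older vLLM releases; newer releases
# take a speculative_config dict instead, so check the docs for your version.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # Draft model
    num_speculative_tokens=5,  # Predict 5 tokens ahead
    use_v2_block_manager=True,
)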

Quantization: The Practical Trade-off

INT4 quantization shrinks the weights 4x: Llama 70B goes from 140GB (FP16) down to ~35GB, which fits on 2× A100 with room to spare for KV cache, instead of 4×. The cost savings are obvious.
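The arithmetic behind those sizes is just parameters times bits per weight. The helper below is a minimal sketch with a made-up name. Real quantized checkpoints come out slightly larger because they also store scales and zero-points.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
```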

The gotcha: not every workload tolerates the quantization loss.

# Benchmark the same tasks at different quantization levels.
# load_model() and evaluate() are placeholders: plug in your own model loader
# and a task-specific scoring function.

test_cases = [
    {"type": "factual_qa", "input": "What is the capital of France?"},
    {"type": "code_generation", "input": "Write a binary search function"},
    {"type": "reasoning", "input": "If all A are B, and some B are C, what can we conclude?"},
    {"type": "long_context", "input": "..." * 8000 + "Summarize the above"},
]

results = {}
for quantization in ["fp16", "int8", "int4"]:
    model = load_model(quantization=quantization)
    results[quantization] = {
        case["type"]: evaluate(model, case["input"])
        for case in test_cases
    }

From my own hands-on experience:

Factual QA: INT4 is fine, degradation is small
Code generation: INT8 is acceptable, INT4 can produce subtle bugs
Complex reasoning: stick with FP16 or INT8
Long context (>16K tokens): INT4 usually degrades noticeably
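Where does the loss come from? A round trip through a low-bit grid shows it directly. This is a deliberately naive sketch: symmetric per-tensor quantization on synthetic values. Real INT4 schemes typically use per-group scales and calibration, so treat the numbers as directional only.

```python
import math

def quantize_roundtrip(values, bits):
    """Symmetric per-tensor quantization to signed integers, then back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, bits):
    restored = quantize_roundtrip(values, bits)
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)

# Synthetic stand-in for a weight tensor
weights = [0.1 * math.sin(i) for i in range(1000)]
for bits in (8, 4):
    print(f"INT{bits}: mean abs error {mean_abs_error(weights, bits):.6f}")
```

INT4 has only 15 usable levels per scale against 255 for INT8, so each weight lands much further from its original value. Whether that error matters depends on the task, which is why benchmarking per workload is worth the effort.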

Practical Cost Optimization in Production

Once the inference pipeline itself is optimized, these are the things that actually save money:

1. Request Batching at the Application Layer

Don't send requests one at a time; batch them when the use case allows.

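import asyncio

class RequestBatcher:
    """Groups prompts and flushes when the batch is full or max_wait_ms has elapsed."""

    def __init__(self, generate_batch, max_batch_size=16, max_wait_ms=50):
        # generate_batch: async callable, list[str] -> list[str]. It stands in for
        # whatever batch call your serving client exposes (an assumption here).
        self.generate_batch = generate_batch
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.pending = []
        self._lock = asyncio.Lock()
        self._timer = None

    async def add_request(self, prompt: str) -> str:
        future = asyncio.get_running_loop().create_future()
        batch = []
        async with self._lock:
            self.pending.append((prompt, future))
            if len(self.pending) >= self.max_batch_size:
                batch = self._take_batch()
            elif self._timer is None:
                # First request of a new batch: start the max-wait timer
                self._timer = asyncio.create_task(self._flush_after_wait())
        # Run the model call outside the lock so new requests can keep queuing
        await self._run(batch)
        # Wait for result
        return await future

    def _take_batch(self):
        batch = self.pending[:self.max_batch_size]
        self.pending = self.pending[self.max_batch_size:]
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        return batch

    async def _flush_after_wait(self):
        await asyncio.sleep(self.max_wait_ms / 1000)
        async with self._lock:
            self._timer = None  # clear first so _take_batch does not cancel this task
            batch = self._take_batch()
        await self._run(batch)

    async def _run(self, batch):
        if not batch:
            return
        try:
            results = await self.generate_batch([p for p, _ in batch])
        except Exception as exc:
            # Propagate the failure to every caller waiting on this batch
            for _, future in batch:
                future.set_exception(exc)
            return
        for (_, future), result in zip(batch, results):
            future.set_result(result)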

2. Semantic Caching

Cache responses by semantic similarity rather than exact prompt match. The requests “What is the capital of France?” and “Which city is the capital of France?” can share one cached response.

from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        # Small, primarily English embedding model; consider a multilingual
        # model if your traffic is not in English
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = []  # (embedding, response)
        self.threshold = similarity_threshold

    def get(self, query: str):
        if not self.cache:
            return None
        query_emb = self.encoder.encode(query)
        # Linear scan: fine for a small cache, use a vector index as it grows
        for cached_emb, response in self.cache:
            # Cosine similarity between the query and the cached prompt
            similarity = np.dot(query_emb, cached_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)
            )
            if similarity > self.threshold:
                return response
        return None

    def set(self, query: str, response: str):
        emb = self.encoder.encode(query)
        self.cache.append((emb, response))

In a production system I built, the semantic cache hit rate was ~35% for typical user patterns. Against the $4,200/month baseline, that is roughly $1,470 saved, assuming cache hits cost about as much as the average request.
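The savings estimate is simple enough to keep as a small helper when you want to try other hit rates. The optional per-lookup overhead (embedding the query and searching the cache) is my own addition and defaults to zero. The function and parameter names are hypothetical.

```python
def monthly_cache_savings(baseline_cost, hit_rate,
                          cost_per_lookup=0.0, requests_per_month=0):
    """Estimated monthly savings, assuming hits cost the same as an average request."""
    avoided = baseline_cost * hit_rate
    lookup_overhead = cost_per_lookup * requests_per_month  # paid on every request
    return avoided - lookup_overhead

print(f"${monthly_cache_savings(4200, 0.35):,.0f} saved per month")
```

If cached queries skew toward short, cheap prompts, real savings will come in below this estimate, so it is worth measuring cost per request on hits and misses separately.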

3. Model Routing Based on Complexity

Not every query needs a frontier model. Route based on complexity:

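class ModelRouter:
    def route(self, query: str) -> str:
        # Simple queries → cheap model
        if len(query.split()) < 20 and not self._needs_reasoning(query):
            return "claude-haiku-4-5"

        # Medium complexity
        if not self._needs_deep_reasoning(query):
            return "claude-sonnet-4-6"

        # Complex reasoning, code generation
        return "claude-opus-4-6"

    def _needs_reasoning(self, query: str) -> bool:
        reasoning_signals = ["tại sao", "giải thích", "so sánh", "phân tích", "why", "explain", "compare"]
        return any(s in query.lower() for s in reasoning_signals)

    def _needs_deep_reasoning(self, query: str) -> bool:
        # Placeholder heuristic (an assumption, not a tuned rule): very long
        # queries or explicit code/proof requests go to the largest model
        deep_signals = ["prove", "debug", "refactor", "step by step"]
        return len(query.split()) > 200 or any(s in query.lower() for s in deep_signals)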

In most applications, around 80% of queries can be handled by the cheaper models without hurting user experience.

What Nobody Tells You

Inference cost does not scale linearly with load; it scales with context length × number of concurrent requests. A 32K-token request can cost 32x as much as a 1K-token request, not because of compute, but because KV cache memory pressure forces you to run fewer requests concurrently.
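A rough capacity calculation shows the effect. The sketch assumes a Llama-3-70B-style model (327,680 bytes of FP16 KV cache per token, from 2 × 80 layers × 8 KV heads × 128 head_dim × 2 bytes) on 4× 80 GB GPUs at 90% utilization with 140 GB of weights. It ignores activations and fragmentation, so real limits are lower.

```python
def max_concurrent_requests(free_kv_bytes, context_len, kv_bytes_per_token=327_680):
    """How many requests of a given context length fit in the KV cache budget."""
    return int(free_kv_bytes // (context_len * kv_bytes_per_token))

# Assumed budget: 4 x 80 GB x 0.90 utilization - 140 GB of FP16 weights = 148 GB
free_kv_bytes = 148e9

for context_len in (1024, 32768):
    slots = max_concurrent_requests(free_kv_bytes, context_len)
    print(f"{context_len:>6} tokens: {slots} concurrent requests")
```

The same hardware that holds 441 short requests holds only 13 long ones, so each long request is effectively paying for about 32 short-request slots.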

Batching has diminishing returns past a certain point. Going from batch size 1 to 16 usually gives a near-linear speedup. From 16 to 64, far less. From 64 to 256 it is nearly flat: you have hit the memory bandwidth bottleneck, since every extra sequence adds its own KV cache reads.
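A toy roofline-style model reproduces the shape of that curve. Everything in it is an assumption for illustration: approximate A100-class bandwidth and FLOPS figures, a 70B FP16 model, 1K tokens of context per sequence, and no scheduling or communication overhead. It is not a benchmark, and real systems tend to flatten out sooner than this idealized estimate.

```python
def decode_tokens_per_second(batch_size, context_len=1024,
                             weight_bytes=140e9,          # 70B params in FP16
                             kv_bytes_per_token=327_680,  # 2 x 80 layers x 8 KV heads x 128 x 2 bytes
                             mem_bandwidth=4 * 2.0e12,    # assumed: 4 GPUs at ~2 TB/s each
                             peak_flops=4 * 312e12,       # assumed: 4 GPUs at ~312 TFLOPS FP16 each
                             num_params=70e9):
    """Toy estimate: one decode step takes max(memory time, compute time)."""
    # Each step reads all weights once plus every sequence's KV cache
    bytes_read = weight_bytes + batch_size * context_len * kv_bytes_per_token
    memory_time = bytes_read / mem_bandwidth
    # Roughly 2 FLOPs per parameter per generated token
    compute_time = batch_size * 2 * num_params / peak_flops
    return batch_size / max(memory_time, compute_time)

previous = None
for batch in (1, 16, 64, 256):
    tps = decode_tokens_per_second(batch)
    gain = "" if previous is None else f" ({tps / previous:.1f}x vs previous)"
    print(f"batch {batch:>3}: {tps:,.0f} tokens/s{gain}")
    previous = tps
```

At small batch sizes the weight reads dominate and are shared by every sequence, so throughput grows almost for free. As the batch grows, per-sequence KV cache reads and compute take over, and each extra sequence buys less.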

Speculative decoding works best when the output is predictable. For code generation, document summarization, and translation, acceptance rates are high and the speedup is clear. For creative writing or open-ended QA, the draft model misses more often and the overhead may not be worth it.

Conclusion

The inference crisis is real, but there are practical answers:

Use vLLM with continuous batching for self-hosted inference
Speculative decoding for latency-sensitive workloads with predictable outputs
INT8 quantization for most use cases, INT4 only after careful benchmarking
Semantic caching for workloads with repetitive patterns
Model routing so you don't burn frontier-model budget on simple queries

Inference cost will only matter more as agent systems scale, because an agent does not call the model once; it calls it many times within a single workflow. Optimizing inference is no longer a “nice to have”. It is a precondition for AI systems that are economically sustainable.
