AI Agents Lên Production: Những Cạm Bẫy Không Ai Nói Trước

Tuần trước tôi debug một agent LangGraph chạy production. Agent cứ loop vô tận — không error, không timeout, chỉ đơn giản là không dừng. Log đẹp, trace sạch, nhưng bill Anthropic API cuối tháng tăng gấp đôi. Sau khi đào sâu, tôi phát hiện ra một điều: production agents thất bại theo những cách hoàn toàn khác với prototype agents.

Theo survey mới nhất của LangChain (State of Agent Engineering 2026), 57% tổ chức đã có agents chạy production. Nhưng chưa đến 25% scale được thành công. Gap này không phải do model chưa đủ tốt — mà do chúng ta đang áp dụng mental model sai.

Stack của Agent khác Stack của Chatbot

Đây là điều đầu tiên cần internalize. Một chatbot cần: inference + RAG là xong. Nhưng một production agent cần:

State management qua multi-step execution (không phải stateless request-response)
Tool access với permission boundaries rõ ràng
Memory persist across sessions (không phải context window tạm thời)
Guardrails kiểm soát hành động thực tế, không chỉ text output
Evaluation continuous, không phải one-time testing

Tôi đã thấy team deploy agent bằng cách lấy chatbot code và add thêm tool calls. Họ gặp đủ thứ: race condition trong state updates, tool calls không có idempotency, memory leak vì session không được cleanup. Đây không phải vấn đề model — đây là vấn đề kiến trúc.

Những Cạm Bẫy Thực Tế

1. LangGraph Loop Vô Tận

# Code ngây thơ — không có termination guard
def build_agent():
    graph = StateGraph(AgentState)
    graph.add_node("agent", call_model)
    graph.add_node("tools", tool_node)
    graph.add_conditional_edges(
        "agent",
        should_continue,  # Cái này có thể không bao giờ return False
    )
    return graph.compile()

Vấn đề: should_continue logic phụ thuộc vào model output. Nếu model bị stuck ở một reasoning loop — ví dụ cố gắng call một tool bị rate-limited rồi retry mãi — graph sẽ không có cơ chế thoát.

# Production-safe: thêm iteration guard
class AgentState(TypedDict):
    messages: list
    iteration_count: int  # Thêm field này

def should_continue(state: AgentState):
    if state["iteration_count"] > 20:  # Hard limit
        return "end"
    if len(state["messages"]) == 0:
        return "end"
    last_message = state["messages"][-1]
    if hasattr(last_message, 'tool_calls') and last_message.tool_calls:
        return "tools"
    return "end"

def call_model(state: AgentState):
    # Increment counter mỗi iteration
    return {
        **state,
        "iteration_count": state.get("iteration_count", 0) + 1,
        "messages": [...]
    }

Tôi đặt hard limit ở 20 iterations cho most workflows, 50 cho research agents có nhiều tool calls. Chưa bao giờ có workflow hợp lệ nào cần hơn con số đó.

2. CrewAI Output Schema Không Nhất Quán

CrewAI dễ prototype nhưng output format là ác mộng ở production. Agent trả về free-form text thay vì structured data, và bạn chỉ phát hiện khi downstream system crash.

# Nguy hiểm — output là string tùy ý
researcher_task = Task(
    description="Analyze this document and extract key metrics",
    agent=researcher,
    expected_output="A list of metrics with values"
)

# Production-safe — enforce schema
from pydantic import BaseModel
from typing import List

class MetricResult(BaseModel):
    name: str
    value: float
    unit: str
    confidence: float

class AnalysisOutput(BaseModel):
    metrics: List[MetricResult]
    summary: str
    data_quality_score: float

researcher_task = Task(
    description="Analyze this document and extract key metrics",
    agent=researcher,
    expected_output="Structured metrics data",
    output_pydantic=AnalysisOutput  # Enforce này
)

Khi output không match schema, CrewAI sẽ retry tự động (tốn thêm token). Nếu sau 3 lần vẫn không match, task fail với error rõ ràng — tốt hơn là silently trả về garbage data.

3. Evaluation Gap Là Rủi Ro Lớn Nhất

32% teams trong survey nói quality là production killer số 1. Và nguyên nhân chính: most prototypes have zero eval.

Tôi từng tự tin deploy một document processing agent vì nó hoạt động tốt trên 20 test cases. Production users tìm ra failure cases trong ngày đầu tiên. Vấn đề là tôi đang eval trên distribution quá hẹp.

Framework eval tôi hiện đang dùng:

# Eval framework đơn giản nhưng effective
import json
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    input: dict
    expected_output: dict
    category: str  # "happy_path", "edge_case", "adversarial"

def run_eval(
    agent_fn: Callable,
    eval_cases: List[EvalCase],
    scorer: Callable
) -> dict:
    results = {"passed": 0, "failed": 0, "by_category": {}}

    for case in eval_cases:
        actual = agent_fn(case.input)
        score = scorer(actual, case.expected_output)

        if score > 0.8:
            results["passed"] += 1
        else:
            results["failed"] += 1

        cat = case.category
        if cat not in results["by_category"]:
            results["by_category"][cat] = {"passed": 0, "failed": 0}

        if score > 0.8:
            results["by_category"][cat]["passed"] += 1
        else:
            results["by_category"][cat]["failed"] += 1

    return results

Quan trọng là categorize eval cases. Edge cases và adversarial inputs thường reveal vấn đề mà happy path tests che giấu.

4. Guardrails Cần Được Rethink Hoàn Toàn

Năm 2024, guardrails là input/output filter. Năm 2026, agent của bạn call tools, spend tiền, và thực hiện hành động. Guardrails bây giờ phải cover:

Authorization: Agent có quyền call tool này không, trong context này không?
Rate limiting: Bao nhiêu API calls trong 1 phút? Tổng spend limit/ngày?
Action validation: Trước khi commit một action irreversible, có human-in-loop không?
Audit trail: Log đủ chi tiết để reproduce và debug bất kỳ decision nào

class AgentGuardrails:
    def __init__(self, config: dict):
        self.max_api_calls_per_minute = config.get("rate_limit", 60)
        self.max_daily_spend_usd = config.get("spend_limit", 10.0)
        self.requires_human_approval = config.get("human_approval_tools", [])
        self._call_count = 0
        self._daily_spend = 0.0

    def before_tool_call(self, tool_name: str, args: dict) -> bool:
        """Returns True if call should proceed"""
        if self._call_count >= self.max_api_calls_per_minute:
            raise RateLimitExceeded(f"Agent exceeded rate limit")

        if tool_name in self.requires_human_approval:
            approved = self._request_human_approval(tool_name, args)
            if not approved:
                return False

        self._call_count += 1
        return True

    def after_tool_call(self, tool_name: str, cost_usd: float):
        self._daily_spend += cost_usd
        if self._daily_spend > self.max_daily_spend_usd:
            raise SpendLimitExceeded(f"Daily spend limit reached: ${self._daily_spend:.2f}")

Điều Thực Sự Quan Trọng Ở Production

Sau khi debug đủ thứ, tôi đúc kết ra một nguyên tắc: Production agent không cần phải perfect — nó cần phải fail safely và be observable.

Cụ thể:

Blast radius control: Limit what agent can do. Nếu agent bị compromised hoặc loop, thiệt hại tối đa là bao nhiêu?
Detect uncertainty early: Khi agent không confident, route về human hoặc fail explicitly — đừng để nó “guess and hope”
Continuous eval: Set up automated regression tests. Sau mỗi model update hoặc prompt change, chạy eval suite trước khi deploy

Khó nhất trong 3 cái này là “detect uncertainty early” — model thường không biết nó không biết gì. Trick tôi hay dùng là thêm một “confidence scoring” step sau reasoning, nơi model tự đánh giá confidence level của output trước khi commit action.

Kết Luận

Nếu bạn đang build agent production:

Dùng LangGraph: Powerful nhưng phải add iteration guards và explicit termination conditions
Dùng Pydantic models cho mọi agent output — đừng trust free-form text
Build eval pipeline trước khi deploy, không phải sau khi user complain
Redesign workflow thay vì layer agent lên legacy process

Agents không thất bại vì model kém — chúng thất bại vì chúng ta mang assumptions từ chatbot world vào agent world. Đó là hai vấn đề kỹ thuật khác nhau hoàn toàn.

Xuất nội dung

AI Agents Lên Production: Những Cạm Bẫy Không Ai Nói Trước

Stack của Agent khác Stack của Chatbot

Những Cạm Bẫy Thực Tế

1. LangGraph Loop Vô Tận

2. CrewAI Output Schema Không Nhất Quán

3. Evaluation Gap Là Rủi Ro Lớn Nhất

4. Guardrails Cần Được Rethink Hoàn Toàn

Điều Thực Sự Quan Trọng Ở Production

Kết Luận

Bình luận

Nội dung chính

AI Agents Lên Production: Những Cạm Bẫy Không Ai Nói Trước