From 5647b2ef538a8a15f52e9a7c988676c9b051e7cb Mon Sep 17 00:00:00 2001
From: seyeong <seyoung4503@gmail.com>
Date: Tue, 2 Jun 2026 16:22:12 +0900
Subject: [PATCH] =?UTF-8?q?feat(harness):=20=CE=B1=20system=5Fprompt=20?=
 =?UTF-8?q?guidance=20+=20=CE=B2?=
 =?UTF-8?q?=20concierge=20proactive=20semantic=20?=
 =?UTF-8?q?pre-fill?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

First step toward robustness against "real-world messiness", using *only
the existing ports* of the V1 harness.
Zero new abstractions/modules/ports: system_prompt text + one concierge hook.

α (harness/system_prompt.py)
- Adds an instruction: for ambiguous columns, inspect the values with
  explore_schema + SELECT DISTINCT, then record the inferred meaning in the
  semantic layer via define_metric so later turns can reuse it.

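β (tenancy/concierge.py)
- On first entry into a guild scope, build_context runs once: scan the
  schema → ask the LLM to infer column meanings → define() them at guild
  scope as SemanticEntry(kind=DIMENSION). Skipped if any definition already
  exists (protects human-written definitions).
- Exposes prewarm_enabled / prewarm_table_limit. Fail-soft.
- All writes go through the existing ScopeResolverPort.define(), so they
  surface naturally in the "Semantic layer" section of system_prompt.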

tests/test_prewarm.py: 6 cases
- Empty guild: prewarm runs and writes SemanticEntry rows
- Skips when definitions already exist
- Second call for the same scope makes 0 LLM calls (per-process cache)
- Malformed JSON → no crash, empty result
- Disabled when prewarm_enabled=False
- system_prompt exposes the prewarm results

docs/MEASUREMENTS.md
- Results and findings for the matrix gpt-4.1-mini × Qwen3-14B-4bit ×
  (clean/dirty) × (no help / prewarm / prewarm+predefine).
- Key numbers: gpt-4.1-mini dirty 5/8 → 8/8 with predefine; Qwen 1/8 → 3/8.
- Conclusion: "★④ federation is the real robustness mechanism; ★① prewarm
  is an efficiency aid".

Tests: 112 → 118 passing (6 new).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
---
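Reviewer note on the β fail-soft path: `_prewarm_semantic_layer` in
concierge.py treats the LLM reply as untrusted text, so anything that is not
a JSON object of string descriptions yields no entries instead of an
exception. A minimal standalone sketch of that parsing step is given here.
`parse_prewarm_reply` is a hypothetical name used only for illustration (the
patch keeps this logic inline), and returning a dict keyed by
(table, column) is an assumption of the sketch, not the patch's API.

```python
from __future__ import annotations

import json


def parse_prewarm_reply(text: str | None) -> dict[tuple[str, str], str]:
    """Hypothetical helper: LLM reply -> {(table, column): description}.

    Fail-soft: malformed JSON or a non-object payload returns {} and
    non-string descriptions are skipped, so the caller never crashes.
    """
    text = (text or "").strip()
    if text.startswith("```"):
        # Drop a surrounding Markdown fence and an optional "json" tag.
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):].strip()
    try:
        mapping = json.loads(text)
    except json.JSONDecodeError:
        return {}
    if not isinstance(mapping, dict):
        return {}

    out: dict[tuple[str, str], str] = {}
    for key, desc in mapping.items():
        if not isinstance(desc, str):
            continue
        # "table.column" splits on the first dot; a bare key has no table.
        table, sep, column = key.partition(".")
        if not sep:
            table, column = "", key
        out[(table, column)] = desc[:200]  # same 200-char cap as the patch
    return out


if __name__ == "__main__":
    reply = '```json\n{"ord_tx.amt": "Order total.", "st": "Status code."}\n```'
    print(parse_prewarm_reply(reply))
    print(parse_prewarm_reply("this is not json"))
```

Returning an empty mapping on bad input is what allows
`test_prewarm_failsoft_on_bad_json` to expect an empty entry list.
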
 docs/MEASUREMENTS.md                  |  82 ++++++++++++++++
 src/lang2sql/harness/system_prompt.py |  12 +++
 src/lang2sql/tenancy/concierge.py     | 117 +++++++++++++++++++++-
 tests/test_prewarm.py                 | 133 ++++++++++++++++++++++++++
 4 files changed, 342 insertions(+), 2 deletions(-)
 create mode 100644 docs/MEASUREMENTS.md
 create mode 100644 tests/test_prewarm.py
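
Reviewer note on why ★④ predefine moves accuracy while β prewarm does not:
docs/MEASUREMENTS.md names the predefined fragments (`paid_orders_filter`,
`cancelled_orders_filter`) but does not show their SQL. The sketch here is
one assumed shape for them, built from the dirty enum values listed in that
document. It uses the standard-library sqlite3 module as a stand-in for
DuckDB so it runs without extra dependencies; the row values and the helper
`sum_amt` are made up for illustration.

```python
import sqlite3

# Assumed fragments: the measurement doc names them but does not show them.
PAID_ORDERS_FILTER = "st IN ('P', 'Paid', 'PAID', 'paid', '결제완료')"
CANCELLED_ORDERS_FILTER = "st IN ('C', 'cancelled', '취소')"


def build_dirty_db() -> sqlite3.Connection:
    """In-memory table with one row per enum spelling (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ord_tx (tx_id INTEGER PRIMARY KEY, amt REAL, st TEXT)"
    )
    rows = [
        (1, 10.0, "P"), (2, 20.0, "Paid"), (3, 30.0, "PAID"),
        (4, 40.0, "paid"), (5, 50.0, "결제완료"),
        (6, 60.0, "C"), (7, 70.0, "cancelled"), (8, 80.0, "취소"),
    ]
    conn.executemany("INSERT INTO ord_tx VALUES (?, ?, ?)", rows)
    return conn


def sum_amt(conn: sqlite3.Connection, where: str) -> float:
    """Sum amt over rows matching a trusted, hard-coded WHERE fragment."""
    sql = f"SELECT COALESCE(SUM(amt), 0) FROM ord_tx WHERE {where}"
    (total,) = conn.execute(sql).fetchone()
    return total


if __name__ == "__main__":
    conn = build_dirty_db()
    # A guess based on the clean schema matches a single spelling only.
    print(sum_amt(conn, "st = 'paid'"))           # 40.0
    # The predefined fragment covers every spelling of "paid".
    print(sum_amt(conn, PAID_ORDERS_FILTER))      # 150.0
    print(sum_amt(conn, CANCELLED_ORDERS_FILTER)) # 210.0
```

The naive filter silently undercounts, which is the failure mode the
enum-chaos finding describes: the mapping is a fact about the data that a
column description alone does not carry.
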

diff --git a/docs/MEASUREMENTS.md b/docs/MEASUREMENTS.md
new file mode 100644
index 0000000..0392897
--- /dev/null
+++ b/docs/MEASUREMENTS.md
@@ -0,0 +1,82 @@
+# V1 measurement: does it withstand real-world messiness?
+
+> **2026-06-02**. A measurement intended to turn the headline *"withstands real-world messiness"* into a *claim backed by evidence*.
+
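+## 1. What was measured
+
+The same 8 natural-language questions, with the same ground truth, were run against two kinds of DuckDB schema under three system conditions. Two models were compared:
+- **`gpt-4.1-mini`** (the model the V1 plan assumes)
+- **`mlx-community/Qwen3-14B-4bit`** (local quantized model via MLX)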
+
+### Schemas
+| Kind | Column names | description | enum values |
+|---|---|---|---|
+| **Clean** | `users.id`, `orders.amount`, `orders.status`, `subscriptions.ended_at` | (none) | `'paid' / 'cancelled'` |
+| **Dirty** | `usr.u_id`, `ord_tx.amt`, `ord_tx.st`, `sb_mst.canc_dt` | (none; names abbreviated) | `'P' / 'Paid' / 'PAID' / 'paid' / '결제완료'` / `'C' / 'cancelled' / '취소'` |
+
+The dirty schema simulates *cruft accumulated in real production*: abbreviated column names, missing descriptions, enum values that vary chaotically in spelling, language, and case, and some columns with ambiguous meaning (`canc_dt` vs `e_at`, etc.).
+
+### System conditions
+- **no help**: the V1 harness as-is
+- **β prewarm**: on the first call for a guild, `ContextConcierge.build_context` has the LLM *infer column descriptions* and auto-fills the semantic layer through `ScopeResolverPort.define`. Zero new ports; only the existing federation mechanism is used.
+- **★④ predefine**: simulates the business mappings a human would have recorded with `/define_metric`: SQL fragments for `paid_orders_filter`, `cancelled_orders_filter`, and `active_subscription`.
+
+## 2. Results matrix
+
+| Condition | gpt-4.1-mini | Qwen3-14B-4bit |
+|---|---|---|
+| Clean, no help | **10/10** | 4/10 |
+| Dirty, no help | 5/8 | 1/8 |
+| Dirty + β prewarm | 5/8 (tool calls cut to about **1/3**) | 1/8 |
+| Dirty + prewarm + ★④ `/define_metric` | **8/8** | **3/8** |
+
+Raw stdout: `/tmp/bench_result.txt` · `/tmp/bench_qwen_result.txt` · `/tmp/bench_dirty_gpt.txt` · `/tmp/bench_dirty_qwen.txt` · `/tmp/bench_dirty_predefine.txt`.
+
+## 3. Findings
+
+### ① The V1 architecture works on *the model the plan assumes* (gpt-4.1-mini)
+Clean DB, no help: 10/10. The safety gate and the federation *mechanism* themselves work.
+
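+### ② A dirty DB *degrades every model*
+Going from clean to dirty, even gpt-4.1-mini drops from 10/10 to 5/8. The main culprit:
+- **Enum value chaos** (Q3 revenue, Q4 cancellations). Even when the model goes as far as `SELECT DISTINCT`, it does not catch *all* spelling variants (`P`/`Paid`/`PAID`/`paid`/`결제완료`).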
+
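+### ③ β prewarm = *efficiency gain*; it does not raise accuracy
+gpt-4.1-mini's dirty accuracy stays at 5/8 → 5/8, but **tool calls drop from 36 to 10 (3.6× fewer)**. In other words, prewarm absorbs the *exploration cost of inferring column meanings*. It cannot resolve *facts that live in the data*, such as enum mappings.
+
+Qwen stays at 1/8 → 1/8 even with prewarm. The small quantized model gets stuck on *multi-step tool reasoning* itself (it fails to follow explore_schema with run_sql, and the answer is an empty string). This problem is on a *different axis* from semantic-layer augmentation.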
+
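+### ④ ★④ `/define_metric` is the *real* robustness mechanism
+When a human records the enum mappings:
+- gpt-4.1-mini: 5/8 → **8/8** (every answer correct with a single tool call).
+- Qwen: 1/8 → **3/8**, and the 3 correct answers are *exactly the questions whose answer a predefined metric already held* (Q3 paid sum / Q4 cancel / Q5 active subs).
+
+This is what v4.1 plan §3.5 *originally promised*: resolving conflicts of *"same term, different definition"* with git-like branching. The measurement verifies that promise directly.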
+
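+## 4. Honest limitations
+- **Qwen's empty-answer problem** (multi-step tool reasoning) is solved by neither ★① nor ★④. Supporting small quantized models is a separate track (per-model prompt fallback / automatic retry / fine-tuning): new work for V1.5+.
+- The measurement used **synthetic dirty** data, which differs from the *long-accumulated messiness* of real production. Extending validation to BIRD / Korean public datasets is the next step.
+- The 8 measurement questions were *defined by me*. Standardizing a golden query set is a separate task.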
+
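+## 5. One-line summary
+> The V1 mechanism for *"withstanding real-world messiness"* is **★④ federation**. When the semantic layer holds *definitions recorded by people and documents*, the model uses them: the stronger model reaches a perfect 8/8 and even the small model improves 3×. ★① prewarm settles in as an *efficiency aid*.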
+
+## 6. Reproducing (to run it yourself)
+```bash
+# Environment
+uv sync --extra duckdb
+export OPENAI_API_KEY=...   # or map it from OPEN_AI_KEY in .env
+
+# Create the clean DB and run the clean measurement
+python bench/seed_clean.py    # → /tmp/lang2sql_demo.duckdb
+LANG2SQL_DB_URL=duckdb:////tmp/lang2sql_demo.duckdb \
+  python bench/quality_clean.py --gpt
+  # (optional) after starting mlx_lm.server --model mlx-community/Qwen3-14B-4bit
+  python bench/quality_clean.py --qwen
+
+# Dirty DB measurement
+python bench/seed_dirty.py    # → /tmp/lang2sql_dirty.duckdb
+python bench/dirty.py --gpt --qwen --prewarm both
+python bench/dirty.py --gpt --qwen --prewarm on --predefine
+```
+(These currently exist as ad-hoc scripts at `/tmp/bench_*.py`. Proper integration under bench/ is left to a follow-up PR.)
diff --git a/src/lang2sql/harness/system_prompt.py b/src/lang2sql/harness/system_prompt.py
index 10d2837..9cb5c97 100644
--- a/src/lang2sql/harness/system_prompt.py
+++ b/src/lang2sql/harness/system_prompt.py
@@ -19,6 +19,18 @@
 - Discover schema with explore_schema before guessing table or column names.
 - Prefer definitions from the semantic layer below over your own assumptions.
 - Answer concisely; show the SQL you ran.
+
+Working with messy schemas (cryptic column names / no descriptions / dirty enums):
+- After explore_schema, if a column's purpose is unclear (e.g. `amt`, `st`,
+  `e_at`), call run_sql with a small `SELECT DISTINCT` or `LIMIT 5` to see the
+  actual values. That tells you what a status enum or date field really holds.
+- Once you have inferred a column's meaning or a value set, persist it for
+  future turns: call `define_metric` to record a usable mapping (e.g. a metric
+  whose definition is the SQL expression you'll keep reusing). Future questions
+  in this scope will see it in the semantic layer above and won't have to
+  re-guess.
+- If business meaning is ambiguous (currency unit, what "active" means), ask
+  the user with ask_user instead of inventing an answer.
 """
 
 
diff --git a/src/lang2sql/tenancy/concierge.py b/src/lang2sql/tenancy/concierge.py
index 2d2e318..33ab863 100644
--- a/src/lang2sql/tenancy/concierge.py
+++ b/src/lang2sql/tenancy/concierge.py
@@ -20,19 +20,21 @@
 from ..adapters.llm.openai_ import OpenAILLM
 from ..adapters.storage.sqlite_semantic import SqliteSemanticStore
 from ..adapters.storage.sqlite_store import SqliteStore
-from ..core.identity import Identity
+from ..core.identity import Identity, Scope, ScopeLevel
 from ..core.ports.audit import AuditPort
 from ..core.ports.explorer import ExplorerPort
 from ..core.ports.llm import LLMPort
 from ..core.ports.safety import SafetyPipelinePort
 from ..core.ports.secrets import SecretsPort
 from ..core.ports.semantic_scope import ScopeResolverPort
+from ..core.types import Message, Role
 from ..harness.context import HarnessContext
 from ..harness.session import Session
 from ..harness.tool_registry import ToolRegistry
 from ..ingestion import FileSource, IngestionPipeline, LLMExtractor
 from ..memory import InjectAllRecall, InMemoryStore, ManualExtractor, MemoryService
 from ..safety.pipeline import SafetyPipeline
+from ..semantic.types import SemanticEntry, SemanticKind
 from ..tools import build_default_tools
 from .encrypted_secrets import EncryptedSecrets
 from .scope_resolver import ScopeResolver
@@ -90,6 +92,12 @@ def __init__(
         # it on demand and reuses it across turns (lazy + cached).
         self._scope_explorers: dict[str, ExplorerPort] = {}
 
+        # Scopes that have already been pre-warmed against this concierge
+        # instance — avoids re-running the (LLM-paid) schema scan every turn.
+        self._prewarmed_scopes: set[str] = set()
+        self.prewarm_enabled: bool = True
+        self.prewarm_table_limit: int = 8
+
     @property
     def store(self) -> SqliteStore:
         return self._store
@@ -137,6 +145,14 @@ async def build_context(
         if session is None:
             session = Session(identity=identity)
 
+        explorer = await self._explorer_for(identity)
+
+        # ★ β — first-time, scope-level pre-warm. Walk the schema once and
+        # write SemanticEntry rows into the scope_resolver via its existing
+        # define(). The system_prompt path then naturally surfaces these as
+        # "Semantic layer" to every future turn — no new wiring, just data.
+        await self._prewarm_semantic_layer(identity, explorer)
+
         tools = ToolRegistry(
             build_default_tools(
                 memory=self._memory,
@@ -151,13 +167,110 @@ async def build_context(
             llm=self._llm,
             tools=tools,
             session=session,
-            explorer=await self._explorer_for(identity),
+            explorer=explorer,
             safety=self._safety,
             audit=self._audit,
             scope_resolver=self._scope_resolver,
             max_turns=self._max_turns,
         )
 
+    async def _prewarm_semantic_layer(
+        self, identity: Identity, explorer: ExplorerPort
+    ) -> None:
+        """One-shot LLM-driven schema → SemanticEntry pre-fill at guild scope.
+
+        Stays inside the V1 harness: it only writes through the existing
+        ``ScopeResolverPort.define``. The system prompt's "Semantic layer"
+        section then surfaces these entries on every subsequent turn, exactly
+        as if a human had typed them via ``/define_metric``. Skipped when:
+
+        - prewarm is disabled
+        - the guild scope already has any SemanticEntry (don't overwrite humans)
+        - this scope was already pre-warmed in this process
+        - explorer has no tables to describe
+        """
+        if not self.prewarm_enabled:
+            return
+        scope = _guild_scope(identity)
+        if scope.key in self._prewarmed_scopes:
+            return
+        existing = await self._scope_resolver.entries_at(scope)
+        if existing:
+            self._prewarmed_scopes.add(scope.key)
+            return
+        try:
+            tables = await explorer.list_tables()
+        except Exception:
+            return
+        tables = tables[: self.prewarm_table_limit]
+        if not tables:
+            return
+
+        # Describe each table once and ask the LLM for a short DIMENSION
+        # definition for every column. One LLM call total (cheap on context).
+        try:
+            described = []
+            for t in tables:
+                described.append(await explorer.describe_table(t.name))
+            schema_dump = "\n".join(
+                f"{t.qualified or t.name}: "
+                + ", ".join(f"{c.name} ({c.type})" for c in t.columns)
+                for t in described
+            )
+            prompt = (
+                "For the database schema below, write a one-sentence description "
+                "(≤120 chars) for EACH column, explaining what it likely means. "
+                "Return STRICT JSON: an object mapping `\"<table>.<column>\"` to a "
+                "description string. No markdown, no commentary.\n\n"
+                f"{schema_dump}"
+            )
+            comp = await self._llm.complete(
+                [Message(role=Role.USER, content=prompt)], tools=()
+            )
+        except Exception:
+            self._prewarmed_scopes.add(scope.key)
+            return
+
+        text = (comp.content or "").strip()
+        if text.startswith("```"):
+            text = text.strip("`").removeprefix("json").strip()
+        import json as _json
+        try:
+            mapping = _json.loads(text)
+            if not isinstance(mapping, dict):
+                mapping = {}
+        except _json.JSONDecodeError:
+            mapping = {}
+
+        actor = f"prewarm:{identity.user_id}"
+        for key, desc in mapping.items():
+            if not isinstance(key, str) or not isinstance(desc, str):
+                continue
+            if "." in key:
+                table_name, col = key.split(".", 1)
+            else:
+                table_name, col = "", key
+            await self._scope_resolver.define(
+                scope,
+                SemanticEntry(
+                    kind=SemanticKind.DIMENSION,
+                    name=col,
+                    definition=desc[:200],
+                    applies_to=table_name,
+                    source_id="prewarm",
+                    created_by=actor,
+                ),
+            )
+        self._prewarmed_scopes.add(scope.key)
+
+
+def _guild_scope(identity: Identity) -> Scope:
+    """Pre-warm targets the guild (so all channels in the guild share). DMs use
+    a per-user pseudo-guild so personal connections don't leak."""
+    if identity.guild_id:
+        return Scope(ScopeLevel.GUILD, identity.guild_id)
+    return Scope(ScopeLevel.GUILD, f"dm:{identity.user_id}")
+
 
 def _default_llm() -> LLMPort:
     """OpenAI when a key is present, otherwise the offline FakeLLM."""
diff --git a/tests/test_prewarm.py b/tests/test_prewarm.py
new file mode 100644
index 0000000..a201ef8
--- /dev/null
+++ b/tests/test_prewarm.py
@@ -0,0 +1,133 @@
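+"""β: proactive semantic pre-fill by ContextConcierge (once, on the first build_context call).
+
+Zero new ports/modules. The semantic layer is filled only through the existing
+ScopeResolverPort.define(), so the system_prompt path is used unchanged. These
+tests check that this flow happens exactly as intended.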
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+from typing import Sequence
+
+from lang2sql.core.identity import Identity, Scope, ScopeLevel
+from lang2sql.core.ports.explorer import Column, Table
+from lang2sql.core.types import Completion, Message, ToolSpec
+from lang2sql.tenancy.concierge import ContextConcierge
+
+
+class _StubExplorer:
+    """Stub explorer with 2 tables (input for the proactive pre-fill)."""
+
+    def __init__(self) -> None:
+        self.tables = [
+            Table(name="ord_tx", schema="main", columns=[
+                Column("tx_id", "INTEGER"), Column("amt", "DECIMAL"), Column("st", "VARCHAR"),
+            ]),
+            Table(name="usr", schema="main", columns=[
+                Column("u_id", "INTEGER"), Column("e_addr", "VARCHAR"),
+            ]),
+        ]
+
+    async def list_tables(self) -> list[Table]:
+        return self.tables
+
+    async def describe_table(self, name: str) -> Table:
+        return next(t for t in self.tables if t.name == name)
+
+    async def sample_rows(self, name, limit=5): return []
+    async def execute(self, sql, limit=1000): return []
+
+
+class _ScriptedLLM:
+    def __init__(self, payload: str) -> None:
+        self.payload, self.calls = payload, 0
+    async def complete(self, messages: Sequence[Message], tools: Sequence[ToolSpec] = ()) -> Completion:
+        self.calls += 1
+        return Completion(content=self.payload)
+
+
+def _prewarm_response() -> str:
+    return json.dumps({
+        "ord_tx.tx_id": "Order transaction id (primary key).",
+        "ord_tx.amt": "Order total in the store's base currency.",
+        "ord_tx.st": "Order status code (paid/cancelled and variants).",
+        "usr.u_id": "User id (primary key).",
+        "usr.e_addr": "User email address.",
+    })
+
+
+def test_prewarm_writes_semantic_entries_into_guild_scope():
+    llm = _ScriptedLLM(_prewarm_response())
+    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
+    ident = Identity(user_id="alice", guild_id="g1", channel_id="c-mkt")
+
+    asyncio.run(concierge.build_context(ident))
+
+    guild_scope = Scope(ScopeLevel.GUILD, "g1")
+    entries = asyncio.run(concierge.scope_resolver.entries_at(guild_scope))
+    names = {e.name for e in entries}
+    assert {"tx_id", "amt", "st", "u_id", "e_addr"}.issubset(names)
+    assert llm.calls == 1
+
+
+def test_prewarm_skips_when_existing_entries():
+    """If a human has already recorded definitions, the pre-fill must not overwrite them."""
+    llm = _ScriptedLLM(_prewarm_response())
+    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
+    ident = Identity(user_id="alice", guild_id="g2", channel_id="c")
+    from lang2sql.semantic.types import SemanticEntry, SemanticKind
+    asyncio.run(concierge.scope_resolver.define(
+        Scope(ScopeLevel.GUILD, "g2"),
+        SemanticEntry(SemanticKind.METRIC, "revenue", "SUM(amt) of paid orders")
+    ))
+
+    asyncio.run(concierge.build_context(ident))
+    assert llm.calls == 0  # human definitions exist, so 0 LLM calls
+
+
+def test_prewarm_runs_only_once_per_scope():
+    llm = _ScriptedLLM(_prewarm_response())
+    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
+    ident = Identity(user_id="alice", guild_id="g3", channel_id="c")
+
+    asyncio.run(concierge.build_context(ident))
+    asyncio.run(concierge.build_context(ident))
+    asyncio.run(concierge.build_context(ident))
+    assert llm.calls == 1
+
+
+def test_prewarm_failsoft_on_bad_json():
+    llm = _ScriptedLLM("this is not json")
+    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
+    ident = Identity(user_id="alice", guild_id="g4", channel_id="c")
+    # Must complete without crashing
+    asyncio.run(concierge.build_context(ident))
+    entries = asyncio.run(concierge.scope_resolver.entries_at(Scope(ScopeLevel.GUILD, "g4")))
+    assert entries == []
+
+
+def test_prewarm_can_be_disabled():
+    llm = _ScriptedLLM(_prewarm_response())
+    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
+    concierge.prewarm_enabled = False
+    ident = Identity(user_id="alice", guild_id="g5", channel_id="c")
+    asyncio.run(concierge.build_context(ident))
+    assert llm.calls == 0
+
+
+def test_prewarm_entries_surface_in_system_prompt():
+    """end-to-end: pre-fill → build_system_prompt must output those definitions."""
+    from lang2sql.harness.system_prompt import build_system_prompt
+
+    llm = _ScriptedLLM(_prewarm_response())
+    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
+    ident = Identity(user_id="alice", guild_id="g6", channel_id="c")
+    ctx = asyncio.run(concierge.build_context(ident))
+
+    prompt = asyncio.run(build_system_prompt(ctx))
+    # The pre-filled column descriptions must appear in the system prompt
+    assert "Semantic layer" in prompt
+    assert "amt" in prompt  # name of a pre-filled dimension
+    assert "currency" in prompt  # part of the description text