From 5647b2ef538a8a15f52e9a7c988676c9b051e7cb Mon Sep 17 00:00:00 2001 From: seyeong Date: Tue, 2 Jun 2026 16:22:12 +0900 Subject: [PATCH] =?UTF-8?q?feat(harness):=20=CE=B1=20system=5Fprompt=20?= =?UTF-8?q?=EB=B3=B4=EA=B0=95=20=EA=B0=80=EC=9D=B4=EB=93=9C=20+=20=CE=B2?= =?UTF-8?q?=20concierge=20=EC=84=A0=EC=A0=9C=20=EC=8B=9C=EB=A9=98=ED=8B=B1?= =?UTF-8?q?=20=EB=B3=B4=EA=B0=95?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit V1 하네스 *기존 포트만* 써서 "현실의 지저분함" 강건성 첫 한 칸. 새 추상/모듈/포트 추가 0건 — system_prompt 텍스트 + concierge 훅 1개. α (harness/system_prompt.py) - 모호한 컬럼은 explore_schema + SELECT DISTINCT 로 값 살피고, 추정한 의미는 define_metric 으로 시멘틱 레이어에 박아 다음 턴 재사용 하라는 지시 추가. β (tenancy/concierge.py) - build_context 가 길드 scope 첫 진입 시 1회: 스키마 스캔 → LLM 에게 컬럼 의미 추정 요청 → SemanticEntry(kind=DIMENSION) 로 길드 scope 에 define(). 이미 정의가 있으면 skip (사람 정의 보호). - prewarm_enabled / prewarm_table_limit 노출. fail-soft. - 모든 쓰기는 기존 ScopeResolverPort.define() 통과 → system_prompt 의 "Semantic layer" 섹션으로 자연스럽게 출력. tests/test_prewarm.py — 6 케이스 - 빈 길드: prewarm 동작 + SemanticEntry 작성 - 이미 정의 있으면 skip - 같은 scope 두 번째 호출은 LLM 호출 0 (per-process 캐시) - 잘못된 JSON → 크래시 0, 빈 결과 - prewarm_enabled=False 비활성 - system_prompt 가 prewarm 결과를 노출 docs/MEASUREMENTS.md - gpt-4.1-mini × Qwen3-14B-4bit × (clean/dirty) × (no help / prewarm / prewarm+predefine) 매트릭스 측정 결과 + 발견. - 핵심: gpt-4.1 dirty 5/8 → predefine 시 8/8, Qwen 1/8 → 3/8. - "★④ federation 이 진짜 강건성 메커니즘, ★① prewarm 은 효율 보조" 결론. 테스트: 112 → 118 통과 (6 신규). 🤖 Generated with [Claude Code](https://claude.com/claude-code) --- docs/MEASUREMENTS.md | 82 ++++++++++++++++ src/lang2sql/harness/system_prompt.py | 12 +++ src/lang2sql/tenancy/concierge.py | 117 +++++++++++++++++++++- tests/test_prewarm.py | 133 ++++++++++++++++++++++++++ 4 files changed, 342 insertions(+), 2 deletions(-) create mode 100644 docs/MEASUREMENTS.md create mode 100644 tests/test_prewarm.py diff --git a/docs/MEASUREMENTS.md b/docs/MEASUREMENTS.md new file mode 100644 index 0000000..0392897 --- /dev/null +++ b/docs/MEASUREMENTS.md @@ -0,0 +1,82 @@ +# V1 측정: 현실의 지저분함에 견디는가? + +> **2026-06-02**. *"현실의 지저분함에 견딘다"* 라는 헤드라인을 *근거가 있는 주장*으로 바꾸려고 한 측정. + +## 1. 무엇을 측정했나 + +같은 자연어 질문 8개 / 같은 ground truth 를 두 종류의 DuckDB 스키마와 4가지 조건의 시스템 상태에서 돌렸습니다. 두 모델 비교: +- **`gpt-4.1-mini`** (V1 plan 가정 모델) +- **`mlx-community/Qwen3-14B-4bit`** (MLX 로컬 양자화) + +### 스키마 +| 종류 | 컬럼명 | description | enum 값 | +|---|---|---|---| +| **Clean** | `users.id`, `orders.amount`, `orders.status`, `subscriptions.ended_at` | (없음) | `'paid' / 'cancelled'` | +| **Dirty** | `usr.u_id`, `ord_tx.amt`, `ord_tx.st`, `sb_mst.canc_dt` | (없음, 약어) | `'P' / 'Paid' / 'PAID' / 'paid' / '결제완료'` / `'C' / 'cancelled' / '취소'` | + +Dirty 는 *실제 production 누적 cruft* 시뮬레이션 — 컬럼명 약어, description 부재, enum 값이 표기/언어/대소문자 카오스, 일부 컬럼은 의미 모호 (`canc_dt` vs `e_at` 등). + +### 시스템 조건 +- **no help**: V1 harness 그대로 +- **β prewarm**: `ContextConcierge.build_context` 가 길드 첫 호출에서 LLM에게 *컬럼 설명을 추정시켜* `ScopeResolverPort.define` 으로 시멘틱 레이어 자동 채움. 새 포트 추가 0, 기존 federation 메커니즘만 사용. +- **★④ predefine**: 사람이 `/define_metric` 으로 박았을 비즈니스 매핑 시뮬레이션 — `paid_orders_filter`, `cancelled_orders_filter`, `active_subscription` 의 SQL 조각. + +## 2. 결과 매트릭스 + +| 조건 | gpt-4.1-mini | Qwen3-14B-4bit | +|---|---|---| +| Clean, no help | **10/10** | 4/10 | +| Dirty, no help | 5/8 | 1/8 | +| Dirty + β prewarm | 5/8 (도구 호출 **1/3**) | 1/8 | +| Dirty + prewarm + ★④ `/define_metric` | **8/8** | **3/8** | + +원본 stdout: `/tmp/bench_result.txt` · `/tmp/bench_qwen_result.txt` · `/tmp/bench_dirty_gpt.txt` · `/tmp/bench_dirty_qwen.txt` · `/tmp/bench_dirty_predefine.txt`. + +## 3. 무엇을 발견했나 + +### ① V1 아키텍처는 *플랜이 가정한 모델*(gpt-4.1-mini)에서 작동 +깨끗한 DB·도움 없이 10/10. safety 게이트, federation 의 *기제(mechanism)* 자체가 동작. + +### ② 지저분한 DB는 *모든 모델을 깎음* +깨끗 → 지저분으로 가면 gpt-4.1 도 10/10 → 5/8. 주범: +- **enum 값 카오스** (Q3 매출, Q4 취소). 모델이 `SELECT DISTINCT` 까지 가도 표기 변형(`P`/`Paid`/`PAID`/`paid`/`결제완료`)을 *모두* 잡지는 못함. + +### ③ β prewarm = *효율 향상*, 정확도는 못 끌어올림 +gpt-4.1 의 dirty 정확도는 5/8 → 5/8 로 동일하지만, **도구 호출 수 36 → 10 (3.6×)**. 즉 *컬럼 의미 추정에 드는 탐색 비용*은 prewarm 이 흡수. enum 매핑처럼 *데이터 안에 있는 사실*은 prewarm 으로 못 풀음. + +Qwen 은 prewarm 받아도 1/8 → 1/8. 작은 양자화 모델은 *multi-step tool reasoning* 자체에서 막힘 (explore_schema 다음 run_sql 로 못 이어감, 답이 빈 문자열). 이건 시멘틱 레이어 보강과 *다른 차원* 의 문제. + +### ④ ★④ `/define_metric` 이 *진짜* 강건성 메커니즘 +사람이 enum 매핑을 박으면: +- gpt-4.1: 5/8 → **8/8** (모두 1 도구 호출로 정답). +- Qwen: 1/8 → **3/8** — 그리고 정답인 3개는 *정확히 사전 정의된 metric 이 답을 가지고 있던 질문* (Q3 paid sum / Q4 cancel / Q5 active subs). + +이게 v4.1 plan §3.5 가 *원래 약속한 것*: *"같은 용어 다른 정의"의 충돌을 git-like 분기로 푼다*. 측정이 그 약속을 직접 검증. + +## 4. 솔직한 한계 +- **Qwen 의 빈-답 문제** (multi-step tool reasoning) 는 ★①/★④ 어느 것으로도 못 풀음. 작은 양자화 모델 지원은 별도 트랙 (모델별 prompt fallback / 자동 재시도 / fine-tuning) — V1.5+ 의 새 작업거리. +- 측정은 **합성 dirty** 데이터에서 수행. 실제 production 의 *오랜 누적 messiness* 와는 다름. BIRD / 한국 공공데이터로 확장 검증이 다음 단계. +- 측정 질문 8개는 *내가 정의*. 골든 쿼리셋 표준화는 별도 과제. + +## 5. 한 줄 요약 +> *"현실의 지저분함에 견디는"* 의 V1 메커니즘은 **★④ federation** 이다. 시멘틱 레이어가 *사람·문서가 박은 정의* 를 들고 있으면 모델은 그 정의를 쓴다 — 깨끗한 모델은 완벽(8/8) 으로, 작은 모델도 3배 개선. ★① prewarm 은 *효율 보조* 수단으로 자리잡음. + +## 6. 재현 (직접 돌려보려면) +```bash +# 환경 +uv sync --extra duckdb +export OPENAI_API_KEY=... # 또는 .env 의 OPEN_AI_KEY 매핑 + +# 깨끗한 DB 생성 + 깨끗한 측정 +python bench/seed_clean.py # → /tmp/lang2sql_demo.duckdb +LANG2SQL_DB_URL=duckdb:////tmp/lang2sql_demo.duckdb \ + python bench/quality_clean.py --gpt + # (선택) mlx_lm.server --model mlx-community/Qwen3-14B-4bit 띄운 뒤 + python bench/quality_clean.py --qwen + +# 지저분한 DB 측정 +python bench/seed_dirty.py # → /tmp/lang2sql_dirty.duckdb +python bench/dirty.py --gpt --qwen --prewarm both +python bench/dirty.py --gpt --qwen --prewarm on --predefine +``` +(현재는 `/tmp/bench_*.py` 에 ad-hoc 스크립트로 존재. 정식 bench/ 통합은 후속 PR.) diff --git a/src/lang2sql/harness/system_prompt.py b/src/lang2sql/harness/system_prompt.py index 10d2837..9cb5c97 100644 --- a/src/lang2sql/harness/system_prompt.py +++ b/src/lang2sql/harness/system_prompt.py @@ -19,6 +19,18 @@ - Discover schema with explore_schema before guessing table or column names. - Prefer definitions from the semantic layer below over your own assumptions. - Answer concisely; show the SQL you ran. + +Working with messy schemas (cryptic column names / no descriptions / dirty enums): +- After explore_schema, if a column's purpose is unclear (e.g. `amt`, `st`, + `e_at`), call run_sql with a small `SELECT DISTINCT` or `LIMIT 5` to see the + actual values. That tells you what a status enum or date field really holds. +- Once you have inferred a column's meaning or a value set, persist it for + future turns: call `define_metric` to record a usable mapping (e.g. a metric + whose definition is the SQL expression you'll keep reusing). Future questions + in this scope will see it in the semantic layer above and won't have to + re-guess. +- If business meaning is ambiguous (currency unit, what "active" means), ask + the user with ask_user instead of inventing an answer. """ diff --git a/src/lang2sql/tenancy/concierge.py b/src/lang2sql/tenancy/concierge.py index 2d2e318..33ab863 100644 --- a/src/lang2sql/tenancy/concierge.py +++ b/src/lang2sql/tenancy/concierge.py @@ -20,19 +20,21 @@ from ..adapters.llm.openai_ import OpenAILLM from ..adapters.storage.sqlite_semantic import SqliteSemanticStore from ..adapters.storage.sqlite_store import SqliteStore -from ..core.identity import Identity +from ..core.identity import Identity, Scope, ScopeLevel from ..core.ports.audit import AuditPort from ..core.ports.explorer import ExplorerPort from ..core.ports.llm import LLMPort from ..core.ports.safety import SafetyPipelinePort from ..core.ports.secrets import SecretsPort from ..core.ports.semantic_scope import ScopeResolverPort +from ..core.types import Message, Role from ..harness.context import HarnessContext from ..harness.session import Session from ..harness.tool_registry import ToolRegistry from ..ingestion import FileSource, IngestionPipeline, LLMExtractor from ..memory import InjectAllRecall, InMemoryStore, ManualExtractor, MemoryService from ..safety.pipeline import SafetyPipeline +from ..semantic.types import SemanticEntry, SemanticKind from ..tools import build_default_tools from .encrypted_secrets import EncryptedSecrets from .scope_resolver import ScopeResolver @@ -90,6 +92,12 @@ def __init__( # it on demand and reuses it across turns (lazy + cached). self._scope_explorers: dict[str, ExplorerPort] = {} + # Scopes that have already been pre-warmed against this concierge + # instance — avoids re-running the (LLM-paid) schema scan every turn. + self._prewarmed_scopes: set[str] = set() + self.prewarm_enabled: bool = True + self.prewarm_table_limit: int = 8 + @property def store(self) -> SqliteStore: return self._store @@ -137,6 +145,14 @@ async def build_context( if session is None: session = Session(identity=identity) + explorer = await self._explorer_for(identity) + + # ★ β — first-time, scope-level pre-warm. Walk the schema once and + # write SemanticEntry rows into the scope_resolver via its existing + # define(). The system_prompt path then naturally surfaces these as + # "Semantic layer" to every future turn — no new wiring, just data. + await self._prewarm_semantic_layer(identity, explorer) + tools = ToolRegistry( build_default_tools( memory=self._memory, @@ -151,13 +167,110 @@ async def build_context( llm=self._llm, tools=tools, session=session, - explorer=await self._explorer_for(identity), + explorer=explorer, safety=self._safety, audit=self._audit, scope_resolver=self._scope_resolver, max_turns=self._max_turns, ) + async def _prewarm_semantic_layer( + self, identity: Identity, explorer: ExplorerPort + ) -> None: + """One-shot LLM-driven schema → SemanticEntry pre-fill at guild scope. + + Stays inside the V1 harness: it only writes through the existing + ``ScopeResolverPort.define``. The system prompt's "Semantic layer" + section then surfaces these entries on every subsequent turn, exactly + as if a human had typed them via ``/define_metric``. Skipped when: + + - prewarm is disabled + - the guild scope already has any SemanticEntry (don't overwrite humans) + - this scope was already pre-warmed in this process + - explorer has no tables to describe + """ + if not self.prewarm_enabled: + return + scope = _guild_scope(identity) + if scope.key in self._prewarmed_scopes: + return + existing = await self._scope_resolver.entries_at(scope) + if existing: + self._prewarmed_scopes.add(scope.key) + return + try: + tables = await explorer.list_tables() + except Exception: + return + tables = tables[: self.prewarm_table_limit] + if not tables: + return + + # Describe each table once and ask the LLM for a short DIMENSION + # definition for every column. One LLM call total (cheap on context). + try: + described = [] + for t in tables: + described.append(await explorer.describe_table(t.name)) + schema_dump = "\n".join( + f"{t.qualified or t.name}: " + + ", ".join(f"{c.name} ({c.type})" for c in t.columns) + for t in described + ) + prompt = ( + "For the database schema below, write a one-sentence description " + "(≤120 chars) for EACH column, explaining what it likely means. " + "Return STRICT JSON: an object mapping `\".\"` to a " + "description string. No markdown, no commentary.\n\n" + f"{schema_dump}" + ) + comp = await self._llm.complete( + [Message(role=Role.USER, content=prompt)], tools=() + ) + except Exception: + self._prewarmed_scopes.add(scope.key) + return + + text = (comp.content or "").strip() + if text.startswith("```"): + text = text.strip("`").lstrip("json").strip() + import json as _json + try: + mapping = _json.loads(text) + if not isinstance(mapping, dict): + mapping = {} + except _json.JSONDecodeError: + mapping = {} + + actor = f"prewarm:{identity.user_id}" + for key, desc in mapping.items(): + if not isinstance(key, str) or not isinstance(desc, str): + continue + if "." in key: + table_name, col = key.split(".", 1) + else: + table_name, col = "", key + await self._scope_resolver.define( + scope, + SemanticEntry( + kind=SemanticKind.DIMENSION, + name=col, + definition=desc[:200], + applies_to=table_name, + source_id="prewarm", + created_by=actor, + ), + ) + self._prewarmed_scopes.add(scope.key) + + +def _guild_scope(identity: Identity) -> Scope: + """Pre-warm targets the guild (so all channels in the guild share). DMs use + a per-user pseudo-guild so personal connections don't leak.""" + if identity.guild_id: + return Scope(ScopeLevel.GUILD, identity.guild_id) + return Scope(ScopeLevel.GUILD, f"dm:{identity.user_id}") + def _default_llm() -> LLMPort: """OpenAI when a key is present, otherwise the offline FakeLLM.""" diff --git a/tests/test_prewarm.py b/tests/test_prewarm.py new file mode 100644 index 0000000..a201ef8 --- /dev/null +++ b/tests/test_prewarm.py @@ -0,0 +1,133 @@ +"""β — ContextConcierge 선제 시멘틱 보강 (build_context 첫 호출 시 1회). + +새 포트/모듈 0. 기존 ScopeResolverPort.define()을 통해서만 시멘틱 레이어를 +채우므로 system_prompt 경로가 그대로 사용. 이 테스트는 그 흐름이 정확히 +일어나는지 확인. +""" + +from __future__ import annotations + +import asyncio +import json +from typing import Sequence + +from lang2sql.core.identity import Identity, Scope, ScopeLevel +from lang2sql.core.ports.explorer import Column, Table +from lang2sql.core.types import Completion, Message, ToolSpec +from lang2sql.tenancy.concierge import ContextConcierge + + +class _StubExplorer: + """3 테이블을 가진 stub explorer (선제 보강 입력용).""" + + def __init__(self) -> None: + self.tables = [ + Table(name="ord_tx", schema="main", columns=[ + Column("tx_id", "INTEGER"), Column("amt", "DECIMAL"), Column("st", "VARCHAR"), + ]), + Table(name="usr", schema="main", columns=[ + Column("u_id", "INTEGER"), Column("e_addr", "VARCHAR"), + ]), + ] + + async def list_tables(self) -> list[Table]: + return self.tables + + async def describe_table(self, name: str) -> Table: + return next(t for t in self.tables if t.name == name) + + async def sample_rows(self, name, limit=5): return [] + async def execute(self, sql, limit=1000): return [] + + +class _ScriptedLLM: + def __init__(self, payload: str) -> None: + self.payload, self.calls = payload, 0 + async def complete(self, messages: Sequence[Message], tools: Sequence[ToolSpec] = ()) -> Completion: + self.calls += 1 + return Completion(content=self.payload) + + +def _prewarm_response() -> str: + return json.dumps({ + "ord_tx.tx_id": "Order transaction id (primary key).", + "ord_tx.amt": "Order total in the store's base currency.", + "ord_tx.st": "Order status code (paid/cancelled and variants).", + "usr.u_id": "User id (primary key).", + "usr.e_addr": "User email address.", + }) + + +def test_prewarm_writes_semantic_entries_into_guild_scope(): + llm = _ScriptedLLM(_prewarm_response()) + concierge = ContextConcierge(llm=llm, explorer=_StubExplorer()) + ident = Identity(user_id="alice", guild_id="g1", channel_id="c-mkt") + + asyncio.run(concierge.build_context(ident)) + + guild_scope = Scope(ScopeLevel.GUILD, "g1") + entries = asyncio.run(concierge.scope_resolver.entries_at(guild_scope)) + names = {e.name for e in entries} + assert {"tx_id", "amt", "st", "u_id", "e_addr"}.issubset(names) + assert llm.calls == 1 + + +def test_prewarm_skips_when_existing_entries(): + """사람이 이미 박은 정의가 있으면 선제 보강이 덮어쓰지 않음.""" + llm = _ScriptedLLM(_prewarm_response()) + concierge = ContextConcierge(llm=llm, explorer=_StubExplorer()) + ident = Identity(user_id="alice", guild_id="g2", channel_id="c") + from lang2sql.semantic.types import SemanticEntry, SemanticKind + asyncio.run(concierge.scope_resolver.define( + Scope(ScopeLevel.GUILD, "g2"), + SemanticEntry(SemanticKind.METRIC, "revenue", "SUM(amt) of paid orders") + )) + + asyncio.run(concierge.build_context(ident)) + assert llm.calls == 0 # 사람 정의가 있으면 LLM 호출 0 + + +def test_prewarm_runs_only_once_per_scope(): + llm = _ScriptedLLM(_prewarm_response()) + concierge = ContextConcierge(llm=llm, explorer=_StubExplorer()) + ident = Identity(user_id="alice", guild_id="g3", channel_id="c") + + asyncio.run(concierge.build_context(ident)) + asyncio.run(concierge.build_context(ident)) + asyncio.run(concierge.build_context(ident)) + assert llm.calls == 1 + + +def test_prewarm_failsoft_on_bad_json(): + llm = _ScriptedLLM("this is not json") + concierge = ContextConcierge(llm=llm, explorer=_StubExplorer()) + ident = Identity(user_id="alice", guild_id="g4", channel_id="c") + # 크래시 없이 통과해야 함 + asyncio.run(concierge.build_context(ident)) + entries = asyncio.run(concierge.scope_resolver.entries_at(Scope(ScopeLevel.GUILD, "g4"))) + assert entries == [] + + +def test_prewarm_can_be_disabled(): + llm = _ScriptedLLM(_prewarm_response()) + concierge = ContextConcierge(llm=llm, explorer=_StubExplorer()) + concierge.prewarm_enabled = False + ident = Identity(user_id="alice", guild_id="g5", channel_id="c") + asyncio.run(concierge.build_context(ident)) + assert llm.calls == 0 + + +def test_prewarm_entries_surface_in_system_prompt(): + """end-to-end: 선제 보강 → build_system_prompt 가 그 정의를 출력해야 함.""" + from lang2sql.harness.system_prompt import build_system_prompt + + llm = _ScriptedLLM(_prewarm_response()) + concierge = ContextConcierge(llm=llm, explorer=_StubExplorer()) + ident = Identity(user_id="alice", guild_id="g6", channel_id="c") + ctx = asyncio.run(concierge.build_context(ident)) + + prompt = asyncio.run(build_system_prompt(ctx)) + # 선제 보강된 컬럼 설명이 시스템 프롬프트에 포함되어야 함 + assert "Semantic layer" in prompt + assert "amt" in prompt # 보강된 dimension 이름 + assert "currency" in prompt # 설명 본문 일부