82 changes: 82 additions & 0 deletions docs/MEASUREMENTS.md
@@ -0,0 +1,82 @@
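# V1 measurement: does it hold up against real-world messiness?

> **2026-06-02**. A measurement intended to turn the headline *"holds up against real-world messiness"* into an *evidence-backed claim*.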

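## 1. What was measured

The same 8 natural-language questions, with the same ground truth, were run against two kinds of DuckDB schema under four conditions (the rows of the results matrix in section 2; note that the Clean row there is scored out of 10, not 8). Two models were compared:
- **`gpt-4.1-mini`** (the model the V1 plan assumes)
- **`mlx-community/Qwen3-14B-4bit`** (quantized model run locally via MLX)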

### Schemas
| Kind | Column names | description | enum values |
|---|---|---|---|
| **Clean** | `users.id`, `orders.amount`, `orders.status`, `subscriptions.ended_at` | (none) | `'paid' / 'cancelled'` |
| **Dirty** | `usr.u_id`, `ord_tx.amt`, `ord_tx.st`, `sb_mst.canc_dt` | (none; abbreviated names) | `'P' / 'Paid' / 'PAID' / 'paid' / '결제완료'` / `'C' / 'cancelled' / '취소'` |

Dirty simulates *cruft accumulated in real production*: abbreviated column names, missing descriptions, enum values that are chaotic in spelling, language, and case, and some columns whose meaning is ambiguous (`canc_dt` vs `e_at`, etc.).

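### System conditions
- **no help**: the V1 harness as-is
- **β prewarm**: on the first call for a guild, `ContextConcierge.build_context` has the LLM *infer column descriptions* and fills the semantic layer automatically through `ScopeResolverPort.define`. No new ports are added; only the existing federation mechanism is used.
- **★④ predefine**: simulates the business mappings a human would have pinned with `/define_metric`: SQL fragments for `paid_orders_filter`, `cancelled_orders_filter`, and `active_subscription`.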
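
To make the ★④ condition concrete, here is a minimal sketch of what two of the predefined SQL fragments could look like. The metric names and enum variants come from this document; the `in_filter` helper and the exact fragment shape are assumptions, since the actual `/define_metric` payloads are not shown here. `active_subscription` is left out because its definition is not given.

```python
# Enum variants copied from the Dirty row of the schema table.
PAID_VARIANTS = ["P", "Paid", "PAID", "paid", "결제완료"]
CANCELLED_VARIANTS = ["C", "cancelled", "취소"]


def in_filter(column: str, values: list[str]) -> str:
    """Build a SQL IN (...) predicate, doubling single quotes inside literals."""
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"{column} IN ({quoted})"


# Hypothetical fragment shapes (assumption): one reusable WHERE predicate per metric.
PREDEFINED_FILTERS = {
    "paid_orders_filter": in_filter("ord_tx.st", PAID_VARIANTS),
    "cancelled_orders_filter": in_filter("ord_tx.st", CANCELLED_VARIANTS),
}

for name, fragment in PREDEFINED_FILTERS.items():
    print(f"{name}: {fragment}")
```

The point of such a fragment is that the model no longer has to rediscover every spelling of a status; it only has to reuse the predicate.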

## 2. Results matrix

| Condition | gpt-4.1-mini | Qwen3-14B-4bit |
|---|---|---|
| Clean, no help | **10/10** | 4/10 |
| Dirty, no help | 5/8 | 1/8 |
| Dirty + β prewarm | 5/8 (tool calls cut to about **1/3**) | 1/8 |
| Dirty + prewarm + ★④ `/define_metric` | **8/8** | **3/8** |

Raw stdout: `/tmp/bench_result.txt` · `/tmp/bench_qwen_result.txt` · `/tmp/bench_dirty_gpt.txt` · `/tmp/bench_dirty_qwen.txt` · `/tmp/bench_dirty_predefine.txt`.

## 3. Findings

### ① The V1 architecture works on *the model the plan assumes* (gpt-4.1-mini)
10/10 on the clean DB with no help. The safety gate and the federation *mechanism* themselves work.

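### ② A dirty DB *degrades every model*
Going from clean to dirty, even gpt-4.1-mini drops from 10/10 to 5/8. The main culprit:
- **enum value chaos** (Q3 revenue, Q4 cancellations). Even when the model goes as far as running `SELECT DISTINCT`, it does not catch *all* of the spelling variants (`P`/`Paid`/`PAID`/`paid`/`결제완료`).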
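
The undercounting effect can be reproduced on toy data. The sketch below uses an in-memory SQLite database as a stand-in for DuckDB, and the rows are made up for illustration (they are not the benchmark data); only the table and column names and the enum variants follow the Dirty schema.

```python
import sqlite3

# In-memory SQLite stands in for DuckDB; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ord_tx (tx_id INTEGER, amt REAL, st TEXT)")
rows = [
    (1, 10.0, "P"), (2, 20.0, "Paid"), (3, 30.0, "PAID"),
    (4, 40.0, "paid"), (5, 50.0, "결제완료"), (6, 60.0, "C"),
]
conn.executemany("INSERT INTO ord_tx VALUES (?, ?, ?)", rows)

# What a SELECT DISTINCT probe reveals about the status column.
distinct_statuses = sorted(r[0] for r in conn.execute("SELECT DISTINCT st FROM ord_tx"))

# Naive filter: matches one spelling only.
naive_total = conn.execute(
    "SELECT COALESCE(SUM(amt), 0) FROM ord_tx WHERE st = 'paid'"
).fetchone()[0]

# Full filter: every paid variant listed in the schema table.
paid_variants = ("P", "Paid", "PAID", "paid", "결제완료")
placeholders = ", ".join("?" for _ in paid_variants)
full_total = conn.execute(
    f"SELECT COALESCE(SUM(amt), 0) FROM ord_tx WHERE st IN ({placeholders})",
    paid_variants,
).fetchone()[0]

print(distinct_statuses)
print(naive_total, full_total)
```

Seeing the distinct values is not enough on its own: the model still has to decide which of them mean "paid", which is exactly the business knowledge a predefined metric supplies.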

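### ③ β prewarm = *better efficiency*, but no gain in accuracy
gpt-4.1-mini's dirty accuracy stays at 5/8 → 5/8, but **tool calls drop from 36 to 10 (3.6× fewer)**. In other words, prewarm absorbs *the exploration cost of inferring column meaning*. It cannot resolve *facts that live in the data*, such as enum mappings.

Qwen stays at 1/8 → 1/8 even with prewarm. The small quantized model gets stuck at *multi-step tool reasoning* itself (it fails to follow explore_schema with run_sql, and the answer comes back as an empty string). This is a problem on a *different axis* from enriching the semantic layer.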
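
The efficiency figures above reduce to simple arithmetic. The sketch below assumes that 36 and 10 are total tool calls across the 8 dirty questions; the document does not state the denominator, so the per-question numbers are an illustration under that assumption.

```python
from fractions import Fraction

# Figures quoted in this section for gpt-4.1-mini on the dirty schema.
accuracy_before, accuracy_after = Fraction(5, 8), Fraction(5, 8)
tool_calls_before, tool_calls_after = 36, 10
questions = 8  # assumption: the call counts are totals over the 8 dirty questions

reduction_factor = tool_calls_before / tool_calls_after
calls_per_question_before = tool_calls_before / questions
calls_per_question_after = tool_calls_after / questions

print(f"accuracy delta: {accuracy_after - accuracy_before}")
print(f"tool-call reduction: {reduction_factor:.1f}x")
print(f"calls per question: {calls_per_question_before} -> {calls_per_question_after}")
```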

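### ④ ★④ `/define_metric` is the *real* robustness mechanism
When a human pins the enum mappings:
- gpt-4.1-mini: 5/8 → **8/8** (every answer correct with a single tool call).
- Qwen: 1/8 → **3/8**, and the 3 correct answers are *exactly the questions whose answer was carried by a predefined metric* (Q3 paid sum / Q4 cancel / Q5 active subs).

This is what v4.1 plan §3.5 *originally promised*: *resolving "same term, different definition" conflicts with git-like branching*. The measurement verifies that promise directly.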

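## 4. Honest limitations
- **Qwen's empty-answer problem** (multi-step tool reasoning) is solved by neither ★① nor ★④. Support for small quantized models is a separate track (per-model prompt fallback / automatic retry / fine-tuning) and is new work for V1.5+.
- The measurement was run on **synthetic dirty** data, which differs from the *long-accumulated messiness* of real production. Extending validation to BIRD and Korean public datasets is the next step.
- The 8 measurement questions were *defined by me*. Standardizing a golden query set is a separate task.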

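## 5. One-line summary
> The V1 mechanism behind *"holds up against real-world messiness"* is **★④ federation**. When the semantic layer holds *definitions pinned by people and documents*, the model uses them: the strong model (gpt-4.1-mini) reaches a perfect 8/8, and even the small model improves 3× (1/8 → 3/8). ★① prewarm settles in as an *efficiency aid*.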

## 6. Reproducing (to run it yourself)
```bash
# Environment
uv sync --extra duckdb
export OPENAI_API_KEY=... # or the OPEN_AI_KEY mapping in .env

# Create the clean DB + run the clean measurement
python bench/seed_clean.py # → /tmp/lang2sql_demo.duckdb
LANG2SQL_DB_URL=duckdb:////tmp/lang2sql_demo.duckdb \
python bench/quality_clean.py --gpt
# (optional) after starting mlx_lm.server --model mlx-community/Qwen3-14B-4bit
python bench/quality_clean.py --qwen

# Dirty DB measurement
python bench/seed_dirty.py # → /tmp/lang2sql_dirty.duckdb
python bench/dirty.py --gpt --qwen --prewarm both
python bench/dirty.py --gpt --qwen --prewarm on --predefine
```
(For now these exist only as ad-hoc scripts at `/tmp/bench_*.py`. Formal integration into bench/ is left to a follow-up PR.)
12 changes: 12 additions & 0 deletions src/lang2sql/harness/system_prompt.py
@@ -19,6 +19,18 @@
- Discover schema with explore_schema before guessing table or column names.
- Prefer definitions from the semantic layer below over your own assumptions.
- Answer concisely; show the SQL you ran.

Working with messy schemas (cryptic column names / no descriptions / dirty enums):
- After explore_schema, if a column's purpose is unclear (e.g. `amt`, `st`,
  `e_at`), call run_sql with a small `SELECT DISTINCT` or `LIMIT 5` to see the
  actual values. That tells you what a status enum or date field really holds.
- Once you have inferred a column's meaning or a value set, persist it for
  future turns: call `define_metric` to record a usable mapping (e.g. a metric
  whose definition is the SQL expression you'll keep reusing). Future questions
  in this scope will see it in the semantic layer above and won't have to
  re-guess.
- If business meaning is ambiguous (currency unit, what "active" means), ask
  the user with ask_user instead of inventing an answer.
"""


117 changes: 115 additions & 2 deletions src/lang2sql/tenancy/concierge.py
@@ -20,19 +20,21 @@
from ..adapters.llm.openai_ import OpenAILLM
from ..adapters.storage.sqlite_semantic import SqliteSemanticStore
from ..adapters.storage.sqlite_store import SqliteStore
-from ..core.identity import Identity
+from ..core.identity import Identity, Scope, ScopeLevel
from ..core.ports.audit import AuditPort
from ..core.ports.explorer import ExplorerPort
from ..core.ports.llm import LLMPort
from ..core.ports.safety import SafetyPipelinePort
from ..core.ports.secrets import SecretsPort
from ..core.ports.semantic_scope import ScopeResolverPort
from ..core.types import Message, Role
from ..harness.context import HarnessContext
from ..harness.session import Session
from ..harness.tool_registry import ToolRegistry
from ..ingestion import FileSource, IngestionPipeline, LLMExtractor
from ..memory import InjectAllRecall, InMemoryStore, ManualExtractor, MemoryService
from ..safety.pipeline import SafetyPipeline
from ..semantic.types import SemanticEntry, SemanticKind
from ..tools import build_default_tools
from .encrypted_secrets import EncryptedSecrets
from .scope_resolver import ScopeResolver
@@ -90,6 +92,12 @@ def __init__(
        # it on demand and reuses it across turns (lazy + cached).
        self._scope_explorers: dict[str, ExplorerPort] = {}

        # Scopes that have already been pre-warmed against this concierge
        # instance; avoids re-running the (LLM-paid) schema scan every turn.
        self._prewarmed_scopes: set[str] = set()
        self.prewarm_enabled: bool = True
        self.prewarm_table_limit: int = 8

    @property
    def store(self) -> SqliteStore:
        return self._store
@@ -137,6 +145,14 @@ async def build_context(
        if session is None:
            session = Session(identity=identity)

        explorer = await self._explorer_for(identity)

        # ★ β: first-time, scope-level pre-warm. Walk the schema once and
        # write SemanticEntry rows into the scope_resolver via its existing
        # define(). The system_prompt path then naturally surfaces these as
        # "Semantic layer" to every future turn; no new wiring, just data.
        await self._prewarm_semantic_layer(identity, explorer)

        tools = ToolRegistry(
            build_default_tools(
                memory=self._memory,
@@ -151,13 +167,110 @@ async def build_context(
             llm=self._llm,
             tools=tools,
             session=session,
-            explorer=await self._explorer_for(identity),
+            explorer=explorer,
             safety=self._safety,
             audit=self._audit,
             scope_resolver=self._scope_resolver,
             max_turns=self._max_turns,
         )

    async def _prewarm_semantic_layer(
        self, identity: Identity, explorer: ExplorerPort
    ) -> None:
        """One-shot LLM-driven schema → SemanticEntry pre-fill at guild scope.

        Stays inside the V1 harness: it only writes through the existing
        ``ScopeResolverPort.define``. The system prompt's "Semantic layer"
        section then surfaces these entries on every subsequent turn, exactly
        as if a human had typed them via ``/define_metric``. Skipped when:

        - prewarm is disabled
        - the guild scope already has any SemanticEntry (don't overwrite humans)
        - this scope was already pre-warmed in this process
        - explorer has no tables to describe
        """
        if not self.prewarm_enabled:
            return
        scope = _guild_scope(identity)
        if scope.key in self._prewarmed_scopes:
            return
        existing = await self._scope_resolver.entries_at(scope)
        if existing:
            self._prewarmed_scopes.add(scope.key)
            return
        try:
            tables = await explorer.list_tables()
        except Exception:
            return
        tables = tables[: self.prewarm_table_limit]
        if not tables:
            return

        # Describe each table once and ask the LLM for a short DIMENSION
        # definition for every column. One LLM call total (cheap on context).
        try:
            described = []
            for t in tables:
                described.append(await explorer.describe_table(t.name))
            schema_dump = "\n".join(
                f"{t.qualified or t.name}: "
                + ", ".join(f"{c.name} ({c.type})" for c in t.columns)
                for t in described
            )
            prompt = (
                "For the database schema below, write a one-sentence description "
                "(≤120 chars) for EACH column, explaining what it likely means. "
                "Return STRICT JSON: an object mapping `\"<table>.<column>\"` to a "
                "description string. No markdown, no commentary.\n\n"
                f"{schema_dump}"
            )
            comp = await self._llm.complete(
                [Message(role=Role.USER, content=prompt)], tools=()
            )
        except Exception:
            self._prewarmed_scopes.add(scope.key)
            return

        text = (comp.content or "").strip()
        if text.startswith("```"):
            text = text.strip("`").lstrip("json").strip()
        import json as _json
        try:
            mapping = _json.loads(text)
            if not isinstance(mapping, dict):
                mapping = {}
        except _json.JSONDecodeError:
            mapping = {}

        actor = f"prewarm:{identity.user_id}"
        for key, desc in mapping.items():
            if not isinstance(key, str) or not isinstance(desc, str):
                continue
            if "." in key:
                table_name, col = key.split(".", 1)
            else:
                table_name, col = "", key
            await self._scope_resolver.define(
                scope,
                SemanticEntry(
                    kind=SemanticKind.DIMENSION,
                    name=col,
                    definition=desc[:200],
                    applies_to=table_name,
                    source_id="prewarm",
                    created_by=actor,
                ),
            )
        self._prewarmed_scopes.add(scope.key)


def _guild_scope(identity: Identity) -> Scope:
    """Pre-warm targets the guild (so all channels in the guild share). DMs use
    a per-user pseudo-guild so personal connections don't leak."""
    if identity.guild_id:
        return Scope(ScopeLevel.GUILD, identity.guild_id)
    return Scope(ScopeLevel.GUILD, f"dm:{identity.user_id}")
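
One note on the fence handling in `_prewarm_semantic_layer`: `str.lstrip("json")` strips any leading run of the characters `j`, `s`, `o`, `n`, not the literal prefix `json`. That happens to be harmless when the payload starts with `{`, but it is easy to misread. A minimal standalone sketch of a more explicit fail-soft parser follows; `parse_json_object` is a hypothetical helper name and is not part of this PR.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_object(raw: str) -> dict:
    """Parse an LLM reply into a dict, tolerating a Markdown code fence.

    Returns {} for anything that is not a JSON object (fail-soft).
    """
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the whole opening fence line (with or without a language tag).
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present.
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return {}
    return value if isinstance(value, dict) else {}


reply = FENCE + 'json\n{"ord_tx.amt": "Order total."}\n' + FENCE
print(parse_json_object(reply))
print(parse_json_object("this is not json"))
```

Removing the opening fence line as a unit avoids any dependence on which characters the language tag contains.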


def _default_llm() -> LLMPort:
"""OpenAI when a key is present, otherwise the offline FakeLLM."""
133 changes: 133 additions & 0 deletions tests/test_prewarm.py
@@ -0,0 +1,133 @@
"""β: proactive semantic pre-warm in ContextConcierge (once, on the first build_context call).

Adds no new ports or modules. The semantic layer is filled only through the
existing ScopeResolverPort.define(), so the system_prompt path is used as-is.
These tests check that this flow happens exactly as described.
"""

from __future__ import annotations

import asyncio
import json
from typing import Sequence

from lang2sql.core.identity import Identity, Scope, ScopeLevel
from lang2sql.core.ports.explorer import Column, Table
from lang2sql.core.types import Completion, Message, ToolSpec
from lang2sql.tenancy.concierge import ContextConcierge


class _StubExplorer:
    """Stub explorer with two tables (input for the pre-warm)."""

    def __init__(self) -> None:
        self.tables = [
            Table(name="ord_tx", schema="main", columns=[
                Column("tx_id", "INTEGER"), Column("amt", "DECIMAL"), Column("st", "VARCHAR"),
            ]),
            Table(name="usr", schema="main", columns=[
                Column("u_id", "INTEGER"), Column("e_addr", "VARCHAR"),
            ]),
        ]

    async def list_tables(self) -> list[Table]:
        return self.tables

    async def describe_table(self, name: str) -> Table:
        return next(t for t in self.tables if t.name == name)

    async def sample_rows(self, name, limit=5): return []
    async def execute(self, sql, limit=1000): return []


class _ScriptedLLM:
    def __init__(self, payload: str) -> None:
        self.payload, self.calls = payload, 0
    async def complete(self, messages: Sequence[Message], tools: Sequence[ToolSpec] = ()) -> Completion:
        self.calls += 1
        return Completion(content=self.payload)


def _prewarm_response() -> str:
    return json.dumps({
        "ord_tx.tx_id": "Order transaction id (primary key).",
        "ord_tx.amt": "Order total in the store's base currency.",
        "ord_tx.st": "Order status code (paid/cancelled and variants).",
        "usr.u_id": "User id (primary key).",
        "usr.e_addr": "User email address.",
    })


def test_prewarm_writes_semantic_entries_into_guild_scope():
    llm = _ScriptedLLM(_prewarm_response())
    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
    ident = Identity(user_id="alice", guild_id="g1", channel_id="c-mkt")

    asyncio.run(concierge.build_context(ident))

    guild_scope = Scope(ScopeLevel.GUILD, "g1")
    entries = asyncio.run(concierge.scope_resolver.entries_at(guild_scope))
    names = {e.name for e in entries}
    assert {"tx_id", "amt", "st", "u_id", "e_addr"}.issubset(names)
    assert llm.calls == 1


def test_prewarm_skips_when_existing_entries():
    """If a human has already pinned definitions, the pre-warm does not overwrite them."""
    llm = _ScriptedLLM(_prewarm_response())
    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
    ident = Identity(user_id="alice", guild_id="g2", channel_id="c")
    from lang2sql.semantic.types import SemanticEntry, SemanticKind
    asyncio.run(concierge.scope_resolver.define(
        Scope(ScopeLevel.GUILD, "g2"),
        SemanticEntry(SemanticKind.METRIC, "revenue", "SUM(amt) of paid orders")
    ))

    asyncio.run(concierge.build_context(ident))
    assert llm.calls == 0  # human definitions exist, so zero LLM calls


def test_prewarm_runs_only_once_per_scope():
    llm = _ScriptedLLM(_prewarm_response())
    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
    ident = Identity(user_id="alice", guild_id="g3", channel_id="c")

    asyncio.run(concierge.build_context(ident))
    asyncio.run(concierge.build_context(ident))
    asyncio.run(concierge.build_context(ident))
    assert llm.calls == 1


def test_prewarm_failsoft_on_bad_json():
    llm = _ScriptedLLM("this is not json")
    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
    ident = Identity(user_id="alice", guild_id="g4", channel_id="c")
    # Must pass without crashing.
    asyncio.run(concierge.build_context(ident))
    entries = asyncio.run(concierge.scope_resolver.entries_at(Scope(ScopeLevel.GUILD, "g4")))
    assert entries == []


def test_prewarm_can_be_disabled():
    llm = _ScriptedLLM(_prewarm_response())
    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
    concierge.prewarm_enabled = False
    ident = Identity(user_id="alice", guild_id="g5", channel_id="c")
    asyncio.run(concierge.build_context(ident))
    assert llm.calls == 0


def test_prewarm_entries_surface_in_system_prompt():
    """End-to-end: after the pre-warm, build_system_prompt must output those definitions."""
    from lang2sql.harness.system_prompt import build_system_prompt

    llm = _ScriptedLLM(_prewarm_response())
    concierge = ContextConcierge(llm=llm, explorer=_StubExplorer())
    ident = Identity(user_id="alice", guild_id="g6", channel_id="c")
    ctx = asyncio.run(concierge.build_context(ident))

    prompt = asyncio.run(build_system_prompt(ctx))
    # The pre-warmed column descriptions must appear in the system prompt.
    assert "Semantic layer" in prompt
    assert "amt" in prompt  # name of a pre-warmed dimension
    assert "currency" in prompt  # part of the description text