Semantic data analysis agent built on the OSI specification. Translates business questions into executable queries — not by guessing table schemas, but by reasoning within a controlled semantic framework.
LLMs can write SQL, but the SQL they write is brittle. Ask "revenue by customer" and the model has no idea that your data spans orders_old (column amt) and orders_new (column total_amount), or that a performance summary table only covers the last 90 days. It does not think in business concepts — it guesses column names.
The idea: define "revenue" once — which tables, which columns, which aggregation — and let everyone (LLM included) use that definition. Queries pass through rule-based validation first, then translate to physical SQL. No more guessing.
This approach isn't new (Looker's LookML, dbt's Semantic Layer), but the OSI spec is an open-source attempt at standardizing it. DataSemAgent is my exploration of that spec — a working prototype built to see how far the idea can go.
- 📐 Model — Define business terms (metrics, dimensions) in an OSI semantic model
- 🔍 Index — Build entity graphs, metric lineage, and reasoning chains from your database
- ⚡ Ask — Ask questions in natural language; the agent writes logical SQL, validates, translates, and executes it
The agent writes logical SQL using business terms, then validates and translates it to physical SQL.
Logical Physical
────────────────────────── ──────────────────────────────
SELECT customer_id, revenue SELECT stg_orders.customer_id,
FROM orders SUM(stg_orders.total_amount)
GROUP BY customer_id FROM stg_orders
GROUP BY stg_orders.customer_id
✅ Validation rules (each an independent Rule class on top of sqlglot's AST + Scope analysis):
- Metrics cannot be re-aggregated
- Dimensions must appear in
GROUP BY - Cross-dataset metrics require proper
JOINs - Filter metrics in
HAVING, notWHERE - Metrics cannot appear in
GROUP BY
LLM-driven pipeline turns every database row into a graph node and every foreign key into an edge.
┌──────────┐ purchased_by ┌──────────┐
│ store_ │ ──────────────►│ customer │
│ sales │ └──────────┘
│ │ includes ┌──────────┐
│ ss_item │ ──────────────►│ item │
│ _sk: 6 │ │ i_brand │
│ ss_price │ occurred_at │ :BrandA │
│ : 303.0 │ ──────────────►└──────────┘
└──────────┘ ┌──────────┐
│ store │
└──────────┘
🔁 Pipeline: Sampler → Schema Agent → Validator → Mapping Agent (incremental React loop) → Validator → Compiler
Mapping builds step by step with inline validation on each call. Failed validation injects structured feedback into the next LLM call for automatic retry.
Built deterministically from the OSI model — no LLM calls needed.
┌──────────┐ AGGREGATES_FROM ┌────────────────┐
│ Metric │ ─────────────────►│ PhysicalField │
│ revenue │ │ ss_ext_sales │
│ │ SLICES_BY │ _price │
│ │ ─────────────────►│ │
│ │ │ Dimension │
│ │ │ customer_id │
└──────────┘ └────────────────┘
💡 Discover available metrics, their physical sources, and valid dimension slices — before writing SQL.
Background reflection extracts reusable reasoning patterns from conversations — not specific numbers ("profit was $1.2M"), but general approaches ("period-over-period comparison reveals seasonal volatility"). An ontology hierarchy organizes concepts; child facts inherit parent chains.
Reflection fires automatically every N turns (configurable via config/agent.toml). Use /reflect in ask mode to trigger manually. The agent is prompted to check the reasoning graph first when facing complex analytical problems — past experience saves time.
🔁 Pipeline: Reflector (builds prompt from last N rounds) → Agent (incremental EmitReasoningTool) → Synonym Merger → Builder → KuzuDB
git clone https://github.com/VilTea/DataSemAgent.git && cd DataSemAgent
uv sync
cp config/llm/config.toml.demo config/llm/config.toml # add your API key
uv run python run.py --lang zh| Command | Description |
|---|---|
[1] init |
Build entity graph & metric lineage |
[2] ask |
Multi-turn agent Q&A |
[3] exit |
Quit |
[8] inspect |
Interactive reasoning graph browser |
[9] build-db |
Generate test database from OSI model |
uv run python run.py --lang zh # Chinese UI
uv run python run.py --lang en # English UI
uv run python run.py --debug # Verbose hook trigger logging
uv run python run.py --model path/to/model.yaml # Custom OSI model path| Command | Description |
|---|---|
/reflect |
Manually trigger reasoning chain reflection |
/exit, /q, /quit |
Return to main menu |
Build a SQLite database from the OSI semantic model for development and testing:
# Via CLI menu: select [9] build-db (shows model/DB path confirmation)
# Or directly:
uv run python tests/build_tpcds_test_data.py --db data/test.dbCreates physical tables and populates them with realistic sample data — 3 years of dates, 10 customers/items, 5 stores, 500 transactions.
| Tool | Description |
|---|---|
sql_exec |
Logical SQL with pre-aggregated metrics |
entity_graph |
Cypher queries over the entity property graph |
metric_lineage |
Discover metrics, dimensions, and their sources |
reasoning_graph |
Query accumulated reusable reasoning patterns |
todo_write |
Track multi-step task progress |
| Component | File |
|---|---|
🧠 LLM (openai / anthropic) |
config/llm/config.toml |
| 🤖 Agent behavior | config/agent.toml |
| 🗄️ SQL Database | config/database.toml |
| 🔗 Graph Database | config/graph_database.toml |
| 📦 OSI Model | config/config.toml → [paths] |
| 🔌 MCP Servers | config/mcp/servers.yaml |
[default]
reflection_interval = 5 # trigger reasoning reflection every N turns run.py
app/
├── semantics/sql/ # SQL translation (classifier, expander, translator, validator)
├── semantics/graph/ # Entity + metric + reasoning graphs
│ ├── entity/ # LLM-driven entity graph pipeline
│ ├── metric/ # Deterministic metric lineage
│ └── reasoning/ # Background reflection chain graph
├── node/ # Agent orchestration (PocketFlow: AgentNode, ToolNode)
├── tool/ # sql_exec, entity_graph, metric_graph, reasoning_graph, todo_write
├── llm.py # OpenAI / Anthropic client
├── hook/ # Lifecycle hooks (node/tool/flow granularity, priority ordering)
├── pipeline/ # Streaming output (EventConsumer / Consumable abstraction)
└── cli/ # Typer + Rich terminal UI
config/ # Config + i18n (zh/en)
tests/
Evaluation pipeline — Non-intrusive conversation trace collection for model evaluation. EvalCollector uses consumer hooks to capture complete LLM I/O, tool calls, and tool results as JSONL, with configurable sensitive data redaction. See config/eval.toml.
Tool schema injection — All four tools (sql_exec, entity_graph, metric_lineage, reasoning_graph) now inject schema definitions into the system prompt via <tag> blocks. The agent no longer needs to probe the graph to discover labels, properties, or relationships — everything is available upfront.
Consumer hook system — EventConsumer implementations can use @hook annotations to observe agent lifecycle events (tool calls, LLM execution, flow boundaries). Hooks are observation-only (forced on_error="log", priority 200) and auto-registered via async with pipeline.bind(ctx).
SQL alias pushdown — Fixed multiple alias precedence issues across CTEs, subqueries, plain columns, dimensions, and metrics. SQL aliases are now consistently preserved instead of being replaced by physical table names.
This is a personal validation-stage project, not a product:
- Tested on TPC-DS datasets only; real-world schema coverage is unknown
- Entity graph quality depends on the underlying LLM; complex schemas haven't been thoroughly evaluated
- No performance optimization, no production deployment plan, rough error handling
It is a working prototype that explores what OSI-based semantic analysis can look like — not a turnkey tool.
- Feedback. Solo projects drift. If you work on semantic layers, Text-to-SQL, or LLM agents, I'd love to hear what you think.
- OSI spec exploration. Semantic layer standardization matters for reusable data components. This project is my way of stress-testing the spec.
- Code that might help. The PocketFlow orchestration, structured retry feedback, sqlglot AST patterns, and hook system could save someone some research time.
📄 Apache 2.0 · Copyright 2026 VILTEA