🤖 DataSemAgent

中文版本

Semantic data analysis agent built on the OSI specification. Translates business questions into executable queries — not by guessing table schemas, but by reasoning within a controlled semantic framework.

💡 Why

LLMs can write SQL, but the SQL they write is brittle. Ask "revenue by customer" and the model has no idea that your data spans orders_old (column amt) and orders_new (column total_amount), or that a performance summary table only covers the last 90 days. It does not think in business concepts — it guesses column names.

The idea: define "revenue" once — which tables, which columns, which aggregation — and let everyone (LLM included) use that definition. Queries pass through rule-based validation first, then translate to physical SQL. No more guessing.

This approach isn't new (Looker's LookML, dbt's Semantic Layer), but the OSI spec is an open-source attempt at standardizing it. DataSemAgent is my exploration of that spec — a working prototype built to see how far the idea can go.

🧭 How it works

📐 Model — Define business terms (metrics, dimensions) in an OSI semantic model
🔍 Index — Build entity graphs, metric lineage, and reasoning chains from your database
⚡ Ask — Ask questions in natural language; the agent writes logical SQL, validates, translates, and executes it

⛓️ Pipelines

🧱 Semantic SQL

The agent writes logical SQL using business terms, then validates and translates it to physical SQL.

Logical                               Physical
──────────────────────────            ──────────────────────────────
SELECT customer_id, revenue           SELECT stg_orders.customer_id,
FROM orders                                  SUM(stg_orders.total_amount)
GROUP BY customer_id                  FROM stg_orders
                                      GROUP BY stg_orders.customer_id

✅ Validation rules (each an independent Rule class on top of sqlglot's AST + Scope analysis):

Metrics cannot be re-aggregated
Dimensions must appear in GROUP BY
Cross-dataset metrics require proper JOINs
Filter metrics in HAVING, not WHERE
Metrics cannot appear in GROUP BY

🕸️ Entity Graph

LLM-driven pipeline turns every database row into a graph node and every foreign key into an edge.

┌──────────┐  purchased_by  ┌──────────┐
│ store_   │ ──────────────►│ customer │
│ sales    │                └──────────┘
│          │  includes       ┌──────────┐
│ ss_item  │ ──────────────►│ item     │
│ _sk: 6   │                │ i_brand  │
│ ss_price │  occurred_at   │ :BrandA  │
│ : 303.0  │ ──────────────►└──────────┘
└──────────┘                ┌──────────┐
                            │ store    │
                            └──────────┘

🔁 Pipeline: Sampler → Schema Agent → Validator → Mapping Agent (incremental React loop) → Validator → Compiler
Mapping builds step by step with inline validation on each call. Failed validation injects structured feedback into the next LLM call for automatic retry.

🌳 Metric Lineage

Built deterministically from the OSI model — no LLM calls needed.

┌──────────┐  AGGREGATES_FROM  ┌────────────────┐
│ Metric   │ ─────────────────►│ PhysicalField   │
│ revenue  │                   │ ss_ext_sales    │
│          │  SLICES_BY        │ _price          │
│          │ ─────────────────►│                 │
│          │                   │ Dimension       │
│          │                   │ customer_id     │
└──────────┘                   └────────────────┘

💡 Discover available metrics, their physical sources, and valid dimension slices — before writing SQL.

🧠 Reasoning Chain Graph

Background reflection extracts reusable reasoning patterns from conversations — not specific numbers ("profit was $1.2M"), but general approaches ("period-over-period comparison reveals seasonal volatility"). An ontology hierarchy organizes concepts; child facts inherit parent chains.

Reflection fires automatically every N turns (configurable via config/agent.toml). Use /reflect in ask mode to trigger manually. The agent is prompted to check the reasoning graph first when facing complex analytical problems — past experience saves time.

🔁 Pipeline: Reflector (builds prompt from last N rounds) → Agent (incremental EmitReasoningTool) → Synonym Merger → Builder → KuzuDB

🚀 Quick Start

git clone https://github.com/VilTea/DataSemAgent.git && cd DataSemAgent
uv sync
cp config/llm/config.toml.demo config/llm/config.toml   # add your API key
uv run python run.py --lang zh

CLI Menu

Command	Description
`[1]` init	Build entity graph & metric lineage
`[2]` ask	Multi-turn agent Q&A
`[3]` exit	Quit
`[8]` inspect	Interactive reasoning graph browser
`[9]` build-db	Generate test database from OSI model

CLI Options

uv run python run.py --lang zh          # Chinese UI
uv run python run.py --lang en          # English UI
uv run python run.py --debug            # Verbose hook trigger logging
uv run python run.py --model path/to/model.yaml  # Custom OSI model path

In-Ask Commands

Command	Description
`/reflect`	Manually trigger reasoning chain reflection
`/exit`, `/q`, `/quit`	Return to main menu

Build Test Database

Build a SQLite database from the OSI semantic model for development and testing:

# Via CLI menu: select [9] build-db (shows model/DB path confirmation)
# Or directly:
uv run python tests/build_tpcds_test_data.py --db data/test.db

Creates physical tables and populates them with realistic sample data — 3 years of dates, 10 customers/items, 5 stores, 500 transactions.

🧰 Agent Tools

Tool	Description
`sql_exec`	Logical SQL with pre-aggregated metrics
`entity_graph`	Cypher queries over the entity property graph
`metric_lineage`	Discover metrics, dimensions, and their sources
`reasoning_graph`	Query accumulated reusable reasoning patterns
`todo_write`	Track multi-step task progress

⚙️ Configuration

Component	File
🧠 LLM (`openai` / `anthropic`)	`config/llm/config.toml`
🤖 Agent behavior	`config/agent.toml`
🗄️ SQL Database	`config/database.toml`
🔗 Graph Database	`config/graph_database.toml`
📦 OSI Model	`config/config.toml` → `[paths]`
🔌 MCP Servers	`config/mcp/servers.yaml`

Agent Settings (`config/agent.toml`)

[default]
reflection_interval = 5   # trigger reasoning reflection every N turns

📁 Project Structure

  run.py
  app/
  ├── semantics/sql/          # SQL translation (classifier, expander, translator, validator)
  ├── semantics/graph/        # Entity + metric + reasoning graphs
  │   ├── entity/             #   LLM-driven entity graph pipeline
  │   ├── metric/             #   Deterministic metric lineage
  │   └── reasoning/          #   Background reflection chain graph
  ├── node/                   # Agent orchestration (PocketFlow: AgentNode, ToolNode)
  ├── tool/                   # sql_exec, entity_graph, metric_graph, reasoning_graph, todo_write
  ├── llm.py                  # OpenAI / Anthropic client
  ├── hook/                   # Lifecycle hooks (node/tool/flow granularity, priority ordering)
  ├── pipeline/               # Streaming output (EventConsumer / Consumable abstraction)
  └── cli/                    # Typer + Rich terminal UI
  config/                     # Config + i18n (zh/en)
  tests/

📢 Recent Updates

Evaluation pipeline — Non-intrusive conversation trace collection for model evaluation. EvalCollector uses consumer hooks to capture complete LLM I/O, tool calls, and tool results as JSONL, with configurable sensitive data redaction. See config/eval.toml.

Tool schema injection — All four tools (sql_exec, entity_graph, metric_lineage, reasoning_graph) now inject schema definitions into the system prompt via <tag> blocks. The agent no longer needs to probe the graph to discover labels, properties, or relationships — everything is available upfront.

Consumer hook system — EventConsumer implementations can use @hook annotations to observe agent lifecycle events (tool calls, LLM execution, flow boundaries). Hooks are observation-only (forced on_error="log", priority 200) and auto-registered via async with pipeline.bind(ctx).

SQL alias pushdown — Fixed multiple alias precedence issues across CTEs, subqueries, plain columns, dimensions, and metrics. SQL aliases are now consistently preserved instead of being replaced by physical table names.

📍 Status

This is a personal validation-stage project, not a product:

Tested on TPC-DS datasets only; real-world schema coverage is unknown
Entity graph quality depends on the underlying LLM; complex schemas haven't been thoroughly evaluated
No performance optimization, no production deployment plan, rough error handling

It is a working prototype that explores what OSI-based semantic analysis can look like — not a turnkey tool.

Why open source

Feedback. Solo projects drift. If you work on semantic layers, Text-to-SQL, or LLM agents, I'd love to hear what you think.
OSI spec exploration. Semantic layer standardization matters for reusable data components. This project is my way of stress-testing the spec.
Code that might help. The PocketFlow orchestration, structured retry feedback, sqlglot AST patterns, and hook system could save someone some research time.

Name		Name	Last commit message	Last commit date
Latest commit History 82 Commits
app		app
config		config
tests		tests
LICENSE		LICENSE
README.md		README.md
README_zh.md		README_zh.md
pyproject.toml		pyproject.toml
run.py		run.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🤖 DataSemAgent

💡 Why

🧭 How it works

⛓️ Pipelines

🧱 Semantic SQL

🕸️ Entity Graph

🌳 Metric Lineage

🧠 Reasoning Chain Graph

🚀 Quick Start

CLI Menu

CLI Options

In-Ask Commands

Build Test Database

🧰 Agent Tools

⚙️ Configuration

Agent Settings (`config/agent.toml`)

📁 Project Structure

📢 Recent Updates

📍 Status

Why open source

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🤖 DataSemAgent

💡 Why

🧭 How it works

⛓️ Pipelines

🧱 Semantic SQL

🕸️ Entity Graph

🌳 Metric Lineage

🧠 Reasoning Chain Graph

🚀 Quick Start

CLI Menu

CLI Options

In-Ask Commands

Build Test Database

🧰 Agent Tools

⚙️ Configuration

Agent Settings (config/agent.toml)

📁 Project Structure

📢 Recent Updates

📍 Status

Why open source

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Agent Settings (`config/agent.toml`)

Packages