arango-solutions/hardware-design-knowledge-graph

Integrated Circuit Design Knowledge Graph

A multi-repo temporal knowledge graph that harmonizes structured integrated circuit (IC) hardware design (RTL/Verilog), temporal version history (Git), and unstructured technical specifications (GraphRAG) into a single, queryable ArangoDB graph. The current implementation covers four open-source OpenRISC/RISC-V processors — OR1200, IBEX, MOR1KX, and Marocchino — with cross-repo similarity detection, design epoch analysis, and semantic bridging across all repositories.

[Schema diagram — generated by running the ETL pipeline and opening the graph in ArangoDB Visualizer]

Research Foundations

This project is a modern implementation of the principles established in the Design Knowledge Management System (DKMS) research program co-authored for the Air Force Materiel Command (1989-1992). It realizes the vision of a "Semantic Bridge" between design intent and implementation that was pioneered in these foundational reports.

For details on the theoretical foundations, see docs/research/DKMS_Foundations.md.

Visualizing the Knowledge Graph

Global Schema

The knowledge graph harmonizes three disparate data silos: RTL code structure, Git history, and technical specifications.

[Schema diagram: rendered from the Mermaid diagram in docs/project/SCHEMA.md, or viewable in ArangoDB Visualizer]

The Semantic Bridge

The core value of this project is the Semantic Bridge, which connects unstructured documentation (GraphRAG) to structured hardware implementation (RTL). Below is a visualization from the ArangoDB Graph Visualizer showing a Documentation Entity (center) resolved to multiple RTL Modules (the "Flip-Flop" logic block hierarchy).

[Semantic Bridge visualization — generated by opening IC_Temporal_Knowledge_Graph in ArangoDB Visualizer and running the "Show Entity Resolutions" canvas action]
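To make the idea of lexical resolution concrete, here is a minimal, self-contained sketch that links RTL module names to documentation entity names by token overlap. It is an illustration only: the function names, the Jaccard score, and the 0.5 threshold are assumptions made for this example, not the project's method. The actual bridge lives in src/rtl_semantic_bridge.py and relies on arango-entity-resolution and AQL.

```python
import re
from collections import defaultdict


def tokens(name):
    """Split an identifier or phrase into lowercase tokens (snake_case, CamelCase, spaces)."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return frozenset(t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t)


def resolve(modules, entities, threshold=0.5):
    """Return (module, entity, score) triples whose token sets overlap enough.

    An inverted index from token to entities keeps the comparison set-based:
    a module is only scored against entities that share at least one token.
    """
    index = defaultdict(set)
    for entity in entities:
        for tok in tokens(entity):
            index[tok].add(entity)

    edges = []
    for module in modules:
        module_tokens = tokens(module)
        candidates = set().union(*(index[t] for t in module_tokens if t in index))
        for entity in sorted(candidates):
            entity_tokens = tokens(entity)
            # Jaccard similarity: shared tokens divided by all distinct tokens.
            score = len(module_tokens & entity_tokens) / len(module_tokens | entity_tokens)
            if score >= threshold:
                edges.append((module, entity, round(score, 2)))
    return edges


if __name__ == "__main__":
    # Hypothetical names, chosen only to exercise the function.
    print(resolve(["or1200_alu", "ibex_decoder"], ["ALU", "Instruction Decoder", "Flip-Flop"]))
```

Each returned triple plays the role of a candidate RESOLVED_TO edge. A production bridge would also weigh descriptions and types, which this sketch ignores.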

Key Features

  • Multi-Repo Temporal Analysis: Ingests four processor repositories, tracking ~6,400 modules, ~3,800 commits, and 381 design epochs across their full Git histories.
  • Semantic Bridge: Automatically links Verilog modules, ports, and signals to entities referenced in corresponding documentation sections using lexical analysis. 193 RESOLVED_TO edges span all repositories.
  • Cross-Repo Similarity & Evolution: Detects structurally similar modules across repositories (61 CROSS_REPO_SIMILAR_TO edges) and tracks how designs co-evolve.
  • Design Epoch & Situation Detection: Groups commits into temporal epochs (DesignEpoch) and identifies 721 design situations (DesignSituation) — refactors, interface changes, complexity shifts — across all repos.
  • High-Performance Consolidation: Uses set-based AQL operations to resolve entities across thousands of documentation nodes in under a second.
  • Author Expertise Mapping: First-class contributor vertices enable knowledge transfer, collaboration analysis, and bus factor assessment across all ingested repositories.
  • Granular RTL Graph: Decomposes monolithic Verilog files into a rich graph of Module, Port, Signal, and LogicChunk nodes.
  • GraphRAG Augmented: Integrated with entity and community extraction via a local GraphRAG pipeline (src/local_graphrag/) or the Arango AI team's AMP-hosted pipeline.
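As an illustration of how commits might be grouped into design epochs, the sketch below starts a new epoch whenever the gap between consecutive commits exceeds a threshold. This README does not state how DesignEpoch boundaries are actually chosen, so the gap-based rule, the 30-day default, and the function name are all assumptions. See scripts/temporal/ for the real pipeline.

```python
from datetime import datetime, timedelta


def group_into_epochs(commit_times, max_gap=timedelta(days=30)):
    """Group commit timestamps into epochs separated by quiet periods.

    Assumption for this sketch: a gap longer than max_gap between two
    consecutive commits marks the start of a new epoch.
    """
    epochs = []
    for ts in sorted(commit_times):
        if epochs and ts - epochs[-1][-1] <= max_gap:
            epochs[-1].append(ts)  # continues the current epoch
        else:
            epochs.append([ts])  # first commit, or the gap was too long
    return epochs


if __name__ == "__main__":
    # Hypothetical commit dates.
    times = [
        datetime(2020, 1, 1),
        datetime(2020, 1, 10),
        datetime(2020, 6, 1),
        datetime(2020, 6, 2),
    ]
    for number, epoch in enumerate(group_into_epochs(times), start=1):
        print(number, epoch[0].date(), "to", epoch[-1].date(), len(epoch), "commits")
```

A gap-based rule is easy to reason about, but it cannot tell a quiet maintenance period from a genuine change of design direction. That is presumably why the project pairs epochs with separate DesignSituation detection.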

GraphRAG Status

GraphRAG entity and community extraction is available through two paths:

  1. Local GraphRAG pipeline (src/local_graphrag/) — runs locally without cloud dependencies, suitable for development and demos.
  2. ArangoDB AMP (cloud) — requires the GenAI services feature; used for large-scale or production imports via src/etl_graphrag.py.

GraphRAG collections use per-repo prefixes: OR1200_Entities, IBEX_Entities, MOR1KX_Entities, MAROCCHINO_Entities (and corresponding *_Golden_Entities, *_Relations, etc.). These are present in the demo database for all four processors.
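The naming convention above can be captured in a small helper. This is a sketch rather than project code: the helper name is hypothetical, and the suffix tuple contains only the three suffixes named in this README, so it is incomplete wherever the "etc." applies.

```python
# Prefixes listed in this README, one per ingested processor repository.
REPO_PREFIXES = ("OR1200_", "IBEX_", "MOR1KX_", "MAROCCHINO_")

# Only the suffixes named in this README; the real set may be larger.
KNOWN_SUFFIXES = ("Entities", "Golden_Entities", "Relations")


def graphrag_collections(prefix):
    """Return the known GraphRAG collection names for one repo prefix."""
    if prefix not in REPO_PREFIXES:
        raise ValueError(f"unknown GraphRAG prefix: {prefix!r}")
    return [f"{prefix}{suffix}" for suffix in KNOWN_SUFFIXES]


if __name__ == "__main__":
    for name in graphrag_collections("IBEX_"):
        print(name)
```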

What works without GraphRAG:

  • Full RTL parsing and graph construction across all repos
  • Git history ingestion and author expertise mapping
  • Temporal epoch and situation detection
  • Semantic bridging between RTL elements and documentation entities (reads existing *_Golden_Entities)
  • Cross-repo similarity detection
  • All AQL queries and visualizations in the demo

What requires ArangoDB AMP + GraphRAG:

  • Re-importing or refreshing document entities from PDFs (src/etl_graphrag.py)
  • Running the Importer/Retriever services via the GenAI API

See GRAPHRAG_STATUS.md for a detailed description of the integration, known issues, and instructions for attempting a fresh import.

Project Structure

  • src/: Core ETL, bridging, and analysis scripts.
    • local_graphrag/: Local GraphRAG entity/community extraction pipeline.
  • scripts/multi_repo/: Multi-repo ingestion (ingest_repo.py) and registry (repo_registry.yaml).
  • scripts/temporal/: Temporal ETL pipeline (epochs, situations, evolution edges).
  • scripts/setup/: Database creation, visualizer theme, and demo query installation.
  • data/temporal/: Temporal data artifacts.
  • docs/: Comprehensive documentation (see docs/README.md)
    • project/: Core project docs (Walkthrough, Schema, TEMPORAL_IMPLEMENTATION.md)
    • reference/: Technical references
  • tests/: 213 unit tests for parsing, normalization, and pipeline logic.
  • validation/: Ground truth datasets and validation scripts.

Setup & Usage

1. Prerequisites

  • Python 3.10+
  • ArangoDB instance (local Docker or remote)
  • Cluster users: if collection shards are spread across many DB-Servers (one shard per collection, each with a different leader), graph-heavy queries pay extra network cost. docs/arangodb-cluster-sharding.md compares OneShard and SmartGraph. Use scripts/setup/create_oneshard_database.py to create a new OneShard database, or scripts/setup/migrate_to_oneshard.sh to migrate an existing one (dump → drop → OneShard → restore).

2. Environment Configuration

Copy env.template to .env in the root directory and configure your settings:

cp env.template .env

Then edit .env with your specific values:

# Choose LOCAL or REMOTE mode
ARANGO_MODE=LOCAL

# For REMOTE mode, configure these:
ARANGO_ENDPOINT=https://your-instance.arango.ai
ARANGO_USERNAME=root
ARANGO_PASSWORD=your_password
ARANGO_DATABASE=ic-knowledge-graph-temporal

# For LOCAL mode (Docker), configure these:
LOCAL_ARANGO_ENDPOINT=http://localhost:8530
LOCAL_ARANGO_USERNAME=root
LOCAL_ARANGO_PASSWORD=
LOCAL_ARANGO_DATABASE=ic-knowledge-graph-temporal

# GraphRAG prefix for collection names (per-repo)
# OR1200_, IBEX_, MOR1KX_, MAROCCHINO_ — set to match the target repo
GRAPHRAG_PREFIX=OR1200_
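The sketch below shows one way a script could pick the right connection settings from these variables based on ARANGO_MODE. It assumes the .env values are already in the process environment (for example, loaded by a dotenv loader). The function name and the returned dictionary shape are hypothetical and are not taken from the project's code.

```python
import os


def resolve_connection(env=None):
    """Select LOCAL_ARANGO_* or ARANGO_* settings according to ARANGO_MODE."""
    env = os.environ if env is None else env
    mode = env.get("ARANGO_MODE", "").upper()
    if mode not in ("LOCAL", "REMOTE"):
        raise ValueError("ARANGO_MODE must be LOCAL or REMOTE")

    # LOCAL mode reads the LOCAL_ARANGO_* variables, REMOTE mode the ARANGO_* ones.
    prefix = "LOCAL_ARANGO_" if mode == "LOCAL" else "ARANGO_"
    return {
        "mode": mode,
        "endpoint": env[f"{prefix}ENDPOINT"],
        "username": env[f"{prefix}USERNAME"],
        "password": env.get(f"{prefix}PASSWORD", ""),  # may be empty for local Docker
        "database": env[f"{prefix}DATABASE"],
    }


if __name__ == "__main__":
    sample = {
        "ARANGO_MODE": "LOCAL",
        "LOCAL_ARANGO_ENDPOINT": "http://localhost:8530",
        "LOCAL_ARANGO_USERNAME": "root",
        "LOCAL_ARANGO_PASSWORD": "",
        "LOCAL_ARANGO_DATABASE": "ic-knowledge-graph-temporal",
    }
    print(resolve_connection(sample))
```

Passing the environment as a plain mapping keeps the function easy to test without touching real credentials.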

3. Install Dependencies

pip install -r requirements-core.txt

Key Dependencies:

  • arango-entity-resolution==3.1.0 - Official PyPI package for entity resolution
    • Provides WeightedFieldSimilarity for multi-field scoring (name + description)
    • Lazy loading ensures fast startup times
    • No manual configuration required

Optional (GraphRAG/document processing):

pip install -r requirements.txt

3b. Install agentic graph analytics (required for analytics reports)

This repo runs analytics via the agentic-graph-analytics project. Install from source (editable):

cd ~/code/agentic-graph-analytics
git pull origin main
pip install -e .

Ensure .env has valid ArangoDB credentials. The workflow authenticates to GRAL with a JWT; tokens can expire during long runs and are refreshed automatically using ARANGO_ENDPOINT, ARANGO_USER (or ARANGO_USERNAME), and ARANGO_PASSWORD.

4. Running the Pipeline

Full rebuild (recommended):

./scripts/rebuild_database.sh

Or step-by-step:

python scripts/multi_repo/ingest_repo.py             # Ingest all four repos (default)
python scripts/temporal/create_temporal_graph.py     # Create named graph (28 edge definitions)
python src/situation_detector.py --all               # Detect design situations
python src/rtl_semantic_bridge.py --all              # Build RESOLVED_TO edges
python src/cross_repo_bridge.py --all                # Build cross-repo similarity edges
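For anyone who prefers to drive the step-by-step sequence from Python, the following sketch runs the same five commands in order and stops at the first failure. The script paths and the --all flags are copied from the commands above. The wrapper itself, including its name and the dry_run option, is hypothetical and is not a substitute for scripts/rebuild_database.sh.

```python
import subprocess
import sys

# Same scripts and flags as the step-by-step commands, in the same order.
PIPELINE_STEPS = [
    ["scripts/multi_repo/ingest_repo.py"],
    ["scripts/temporal/create_temporal_graph.py"],
    ["src/situation_detector.py", "--all"],
    ["src/rtl_semantic_bridge.py", "--all"],
    ["src/cross_repo_bridge.py", "--all"],
]


def run_pipeline(dry_run=False):
    """Run each step with the current interpreter; return the commands used."""
    commands = [[sys.executable, *step] for step in PIPELINE_STEPS]
    for command in commands:
        if not dry_run:
            # check=True raises CalledProcessError, so later steps never run
            # against a half-built graph.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands without executing anything.
    for command in run_pipeline(dry_run=True):
        print(" ".join(command))
```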

Author Expertise Mapping (included in rebuild):

  • Extracts contributor expertise from Git history across all ingested repositories
  • Creates AUTHORED edges (author -> commit)
  • Creates MAINTAINS edges (author -> module) based on commit frequency
  • Enables expertise queries, bus factor analysis, and collaboration networks
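The frequency-based idea behind MAINTAINS edges can be sketched in a few lines. Everything specific here is an assumption made for illustration: the input shape, the threshold of three commits, and the simplified bus factor (the number of maintainers per module). The project's own rules may differ.

```python
from collections import Counter, defaultdict


def maintains_edges(touches, min_commits=3):
    """Derive (author, module, commit_count) edges from commit activity.

    touches: iterable of (author, module) pairs, one per commit that
    modified the module.
    """
    counts = Counter(touches)
    return sorted(
        (author, module, n) for (author, module), n in counts.items() if n >= min_commits
    )


def bus_factor(edges):
    """Simplified bus factor: how many maintainers each module has."""
    maintainers = defaultdict(set)
    for author, module, _count in edges:
        maintainers[module].add(author)
    return {module: len(authors) for module, authors in maintainers.items()}


if __name__ == "__main__":
    # Hypothetical commit activity.
    touches = (
        [("alice", "alu")] * 4
        + [("bob", "alu")] * 3
        + [("bob", "decoder")] * 5
        + [("carol", "decoder")]
    )
    edges = maintains_edges(touches)
    print(edges)
    print(bus_factor(edges))
```

A module with only one maintainer is a knowledge-transfer risk, which is the kind of finding the expertise queries are meant to surface.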

5. Verification

Run the test suite to ensure the environment is correctly configured:

pytest tests/

Customer hands-on workflow (numbered databases)

Customers can explore the preloaded demo database ic-knowledge-graph-temporal in read-only mode, then create their own numbered sandbox databases (ic-knowledge-graph-1, ic-knowledge-graph-2, …) for hands-on exercises.

See docs/CUSTOMER_EXERCISE_WORKFLOW.md for the step-by-step process (UI-primary DB creation, GraphRAG UI import, and one-command setup).

Agentic analytics (reports)

Once your ArangoDB database is populated (pipeline above), run:

python run_ic_analysis.py

Reports are written to ic_analysis_output/ as both Markdown and interactive HTML.

Visualization

The "Semantic Bridge" can be explored visually via the ArangoDB Dashboard:

  1. Go to Graphs -> IC_Temporal_Knowledge_Graph.
  2. Identify cross-model links: (RTL_Module) -[RESOLVED_TO]-> (*_Golden_Entities) and (RTL_Module) -[CROSS_REPO_SIMILAR_TO]-> (RTL_Module).
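The same cross-model links can also be listed programmatically. The sketch below wraps an AQL traversal in Python and assumes that RTL_Module and RESOLVED_TO are the actual collection names, as the notation above suggests; check docs/project/SCHEMA.md before relying on them. The db argument is expected to be a python-arango database handle, and the function name is hypothetical.

```python
# AQL: follow RESOLVED_TO edges outward from every RTL module.
RESOLVED_TO_QUERY = """
FOR module IN RTL_Module
  FOR entity IN 1..1 OUTBOUND module RESOLVED_TO
    LIMIT @limit
    RETURN { module: module._id, entity: entity._id }
"""


def fetch_resolutions(db, limit=25):
    """Return up to `limit` module-to-entity links as a list of dicts.

    db is any object exposing db.aql.execute(query, bind_vars=...), such as
    the handle returned by python-arango's ArangoClient(...).db(...).
    """
    cursor = db.aql.execute(RESOLVED_TO_QUERY, bind_vars={"limit": limit})
    return list(cursor)


# Example connection (not executed here; requires `pip install python-arango`
# and a populated database):
#
#   from arango import ArangoClient
#   client = ArangoClient(hosts="http://localhost:8530")
#   db = client.db("ic-knowledge-graph-temporal", username="root", password="")
#   for link in fetch_resolutions(db):
#       print(link["module"], "->", link["entity"])
```

Swapping RESOLVED_TO for CROSS_REPO_SIMILAR_TO in the query should list module-to-module similarity links in the same way, under the same naming assumption.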

Demo Materials

Complete demonstration materials are available:

  1. Full Setup: Run ./scripts/rebuild_database.sh to create the database and ingest all repos
  2. Quick Start: Read docs/DEMO_EXECUTIVE_SUMMARY.md (5-minute overview)
  3. Setup Theme: Run python scripts/setup/install_ic_theme.py to install the 'hardware-design' visualization theme
  4. Setup Queries: Run python scripts/setup/install_demo_setup.py to install 24 saved queries and canvas actions
  5. Demo Guide: Follow docs/TEMPORAL_DEMO_SCRIPT.md for a comprehensive demonstration
  6. Preparation: Use docs/DEMO_README.md for setup checklist and troubleshooting

The demo showcases:

  • Multi-repo semantic bridges (spec -> code across four processors)
  • Temporal design audit (epoch-based time-travel queries)
  • Cross-repo similarity and evolution detection
  • Design situation analysis (refactors, interface changes, complexity shifts)
  • Type-safe entity resolution via arango-entity-resolution
  • Sub-200ms graph traversals
  • Agent integration for 10x token savings

For technical details, see the Project Walkthrough.
