Automated Tekton-orchestrated pipeline on OpenShift for evaluating AI artifacts:
- Skills — Measures skill efficacy by comparing agent performance with and without skills (A/B "gap" testing)
- MCP Servers — Validates MCP server implementations via task-based verification
- Agents — Evaluates full agent behavior using Harbor (general agents) or A2A protocol (A2A-compliant agents)
Produces statistical reports with pass rates, uplift metrics, significance tests, and a unified scorecard.
ABEvalFlow provides two pipeline variants:
| Pipeline | Purpose | Key Differences |
|---|---|---|
| CI Pipeline | Full evaluation for new submissions | Includes security scan, quality review, artifact generation |
| Monitoring Pipeline | Regression detection for deployed artifacts | Includes degradation check against historical baseline, Slack alerts |
The pipeline executes in six main stages, with engine-specific steps within each:
- Clone submission repository
- Validate structure and metadata.yaml schema
- AI-assisted generation of missing test artifacts (optional):
  - Harbor/A2A: generates instruction.md and test_outputs.py from SKILL.md
  - ASE: generates evals.json from SKILL.md
- Quality Review — AI-powered review of skill/test coherence (advisory)
- Security Scan — Cisco AI Defense scan for prompt injection, data exfiltration risks
Four evaluation engines, each suited for different artifact types:
| Engine | Evaluates | Comparison Mode | Container Isolation |
|---|---|---|---|
| Harbor | Skills, general agents | A/B (treatment vs control) | Yes |
| ASE | Skills only | A/B (treatment vs control) | No |
| A2A | A2A-protocol agents | A/B (treatment vs control) | Yes |
| MCPChecker | MCP servers | Single-agent task verification | No |
- Compute pass rates, uplift (gap), statistical significance (p-value)
- Generate report.json and report.md
- Aggregate gate results into unified scorecard.json
- Monitoring only: Check for degradation against historical baseline
- Upload reports and artifacts to MinIO
- Record results to PostgreSQL for historical analysis
- Remove temporary workspaces and artifacts
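
The pass rates, uplift, and p-value from the analysis step can be illustrated with a small sketch. This document does not say which significance test the pipeline applies, so the pooled two-proportion z-test below is an assumption, as are the function name `ab_summary` and the returned keys.

```python
import math


def ab_summary(treatment_passes: int, control_passes: int, n_trials: int) -> dict:
    """Summarize an A/B run with n_trials per arm.

    Assumption: significance comes from a pooled two-proportion z-test;
    the pipeline's actual test is not specified in this document.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")

    treatment_rate = treatment_passes / n_trials
    control_rate = control_passes / n_trials
    uplift = treatment_rate - control_rate  # the "gap"

    # Pooled pass rate across both arms, used for the standard error.
    pooled = (treatment_passes + control_passes) / (2 * n_trials)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n_trials)

    if std_err == 0:
        # Both arms passed everything or failed everything: no evidence of a difference.
        p_value = 1.0
    else:
        z = uplift / std_err
        p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

    return {
        "treatment_pass_rate": treatment_rate,
        "control_pass_rate": control_rate,
        "uplift": uplift,
        "p_value": p_value,
    }


summary = ab_summary(treatment_passes=15, control_passes=8, n_trials=20)
print(summary)
```

With small `n_trials`, an exact test (for example Fisher's) may be a better fit than the normal approximation used here.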
All flow configuration is defined in metadata.yaml within each submission:
name: my-submission
eval_engine: harbor        # harbor, ase, a2a, or mcpchecker
persona: general           # Agent persona for Harbor/A2A
experiment:
  n_trials: 20             # Number of evaluation attempts
gate_policy:
  default_mode: warn
  combination: all_pass
  gates:
    evaluation:
      mode: block
      threshold: 0.0
    security:
      mode: warn

See Gate Policy Configuration for full options.
ABEvalFlow/
├── Docs/                    # ADR, implementation plan, guides
├── pipeline/
│   ├── pipeline.yaml        # Main pipeline definition
│   ├── triggers/            # EventListener, TriggerTemplate, TriggerBinding
│   └── tasks/
│       ├── validate.yaml
│       ├── generate_tests.yaml
│       ├── test-quality-review.yaml
│       ├── security-scan.yaml
│       ├── scaffold.yaml
│       ├── build-push.yaml
│       ├── harbor-eval.yaml
│       ├── analyze-report.yaml
│       └── publish-store.yaml
├── templates/               # Jinja2 templates (Dockerfiles, test.sh, task.toml)
├── scripts/                 # Python scripts invoked by pipeline tasks
├── config/                  # K8s manifests (RBAC, PostgreSQL, LiteLLM)
└── tests/                   # Unit and integration tests
| Repository | Purpose |
|---|---|
| skill-submissions | Submission intake — users push skills, MCP evals, and agent evals here |
| skills_eval_corrections | Harbor fork with OpenShift backend for ABEvalFlow |
| All-Hands-AI/openhands-agent-monitor | Harbor upstream — agent evaluation framework |
| cisco-ai-defense/skill-scanner | Security scanner for prompt injection and data exfiltration detection |
The pipeline supports four evaluation engines, each suited for different artifact types:
| Engine | Artifact Type | Use Case | Comparison | Container Isolation |
|---|---|---|---|---|
| Harbor | Skills, Agents | Full evaluation with real tool execution | A/B (with vs without skill) | Yes |
| ASE | Skills only | Lightweight LLM-as-judge assertions | A/B (with vs without skill) | No |
| A2A | A2A Agents | A2A-protocol compliant agent evaluation | A/B (treatment vs control) | Yes |
| MCPChecker | MCP Servers | MCP server/tool verification | Single-agent task verification | No |
Engines are implemented in abevalflow/engines/ using a registry pattern:
abevalflow/engines/
├── __init__.py # Engine registry and factory
├── base.py # EvalEngine abstract base class
├── harbor.py # Harbor A/B evaluation
├── ase.py # ASE LLM-as-judge evaluation
├── a2a.py # A2A protocol evaluation
└── mcpchecker.py # MCPChecker task verification
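
A registry pattern of this kind is commonly built from a decorator that records each engine class under its name, plus a factory that looks the name up. The sketch below is a simplified stand-in, not the project's code: the `describe` method, the `_REGISTRY` dict, and `get_engine` are hypothetical names, and the real `EvalEngine` interface may differ.

```python
from abc import ABC, abstractmethod


class EvalEngine(ABC):
    """Simplified stand-in for the abstract base class in base.py."""

    name: str

    @abstractmethod
    def describe(self) -> str:
        """Return a short description (hypothetical method for this sketch)."""


# Maps engine name -> engine class; filled in by the decorator below.
_REGISTRY: dict[str, type[EvalEngine]] = {}


def register_engine(name: str):
    """Class decorator that records an engine class under `name`."""

    def decorator(cls: type[EvalEngine]) -> type[EvalEngine]:
        if name in _REGISTRY:
            raise ValueError(f"engine {name!r} is already registered")
        _REGISTRY[name] = cls
        return cls

    return decorator


def get_engine(name: str) -> EvalEngine:
    """Factory: instantiate the engine registered under `name`."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown eval_engine {name!r}; known: {sorted(_REGISTRY)}")
    return _REGISTRY[name]()


@register_engine("harbor")
class HarborEngine(EvalEngine):
    name = "harbor"

    def describe(self) -> str:
        return "Harbor A/B evaluation"


print(get_engine("harbor").describe())
```

Because registration happens when the module is imported, each engine module has to be imported somewhere (typically the package's `__init__.py`) before the factory can find it.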
Gates are evaluation checkpoints that produce standardized results. The unified scorecard aggregates all gate results to produce a final recommendation.
| Category | Policy Key | Purpose | Implementation |
|---|---|---|---|
| evaluation | evaluation | Results from the selected eval engine | Harbor, ASE, A2A, or MCPChecker |
| security | security | Security scanning results | Cisco AI Defense scanner |
| quality | quality | Quality review results | LLM-powered review |
Each gate operates in one of three modes:
| Mode | Behavior |
|---|---|
| disabled | Gate is skipped entirely |
| warn | Gate runs; failures produce warnings but don't block |
| block | Gate runs; failures cause the scorecard to fail |
All gates produce a standardized GateResult:
class GateResult:
    gate_type: GateType        # engine, security, or quality
    gate_name: str             # Category name: "evaluation", "security", or "quality"
    policy_key: str            # Implementation: "harbor", "cisco", "llm-review", etc.
    passed: bool               # Whether the gate passed
    score: float               # Normalized score (0.0 to 1.0)
    mode: GateMode             # Mode that was applied (disabled/warn/block)
    threshold: float | None    # Threshold used for pass/fail
    findings: list[Finding]    # Issues discovered (security/quality gates)
    details: dict              # Implementation-specific data (e.g., {"engine": "harbor"})
    message: str               # Human-readable summary

The gate_name is the category used in policy configuration, while policy_key identifies the specific implementation.
The primary gate that wraps the selected evaluation engine's results.
- Location: abevalflow/engines/*.py (each engine produces evaluation gate results)
- Input: Engine-specific report from reports/{submission}/
- Engines: Harbor, ASE, A2A, MCPChecker (selected via eval_engine in metadata.yaml)
- Pass criteria:
  - Harbor/ASE/A2A: treatment_score - control_score >= threshold (default threshold: 0.0)
  - MCPChecker: All tasks pass verification
- Score: Mean reward or pass rate depending on engine
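
The two pass criteria can be written as small predicates. This is a minimal sketch with hypothetical function names; treating an empty MCPChecker task list as a failure is an assumption, since the document does not cover that case.

```python
def ab_gate_passes(treatment_score: float, control_score: float,
                   threshold: float = 0.0) -> bool:
    """Harbor/ASE/A2A: pass when the gap (treatment - control) reaches the threshold."""
    return (treatment_score - control_score) >= threshold


def mcp_gate_passes(task_results: list[bool]) -> bool:
    """MCPChecker: pass only when every task passed verification.

    Assumption: an empty task list counts as a failure.
    """
    return bool(task_results) and all(task_results)


print(ab_gate_passes(0.85, 0.70))          # gap is positive, default threshold 0.0
print(mcp_gate_passes([True, True, False]))
```

Note that with the default threshold of 0.0 a gap of exactly zero still passes, because the comparison is `>=`.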
Reads security-scan.json produced by the Cisco AI Defense scanner.
- Location: abevalflow/gates/security/cisco.py
- Input: reports/{submission}/security-scan.json
- Scanner: Cisco AI Defense
- Pass criteria:
  - warn mode: Always passes (findings are advisory)
  - block mode: Fails if any HIGH or CRITICAL findings exist
- Score: Weighted average based on finding severities
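
The mode-dependent pass rule for the security gate might look like the sketch below. The function name and the representation of findings as a list of severity strings are assumptions; the actual layout of security-scan.json is not described here.

```python
# Severities that fail the gate in block mode, per the pass criteria above.
BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def security_gate_passes(severities: list[str], mode: str) -> bool:
    """Apply the warn/block pass rule to a list of finding severities."""
    if mode == "warn":
        return True  # findings are advisory only
    if mode == "block":
        return not any(s.upper() in BLOCKING_SEVERITIES for s in severities)
    # A disabled gate is skipped entirely, so it should never reach this check.
    raise ValueError(f"unsupported mode for evaluation: {mode!r}")


print(security_gate_passes(["LOW", "HIGH"], mode="warn"))
print(security_gate_passes(["LOW", "HIGH"], mode="block"))
```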
Reads _ai_review.json produced by the AI quality reviewer.
- Location: abevalflow/gates/quality/llm_review.py
- Input: {workspace}/_ai_review.json
- Reviewer: LLM-powered quality review
- Dimensions evaluated: coherence, coverage, clarity, feasibility, robustness
- Pass criteria:
  - warn mode: Passes unless recommendation is "fail"
  - block mode: Passes only if overall_score >= threshold
- Default threshold: 0.6
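
A sketch of the quality gate's pass rule follows. The `overall_score` field is named in the criteria above; the `recommendation` key, the function name, and treating a missing score as 0.0 are assumptions about how _ai_review.json is structured.

```python
def quality_gate_passes(review: dict, mode: str, threshold: float = 0.6) -> bool:
    """Apply the warn/block pass rule to a parsed _ai_review.json document."""
    if mode == "warn":
        # Advisory mode: only an explicit "fail" recommendation fails the gate.
        return review.get("recommendation") != "fail"
    if mode == "block":
        # Assumption: a missing overall_score is treated as 0.0.
        return review.get("overall_score", 0.0) >= threshold
    raise ValueError(f"unsupported mode for evaluation: {mode!r}")


review = {"overall_score": 0.55, "recommendation": "warn"}
print(quality_gate_passes(review, mode="warn"))
print(quality_gate_passes(review, mode="block"))
```

The same review can pass in warn mode and fail in block mode, which is the case shown: a score of 0.55 is below the default threshold of 0.6, but the recommendation is not "fail".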
The scorecard is the single source of truth for submission evaluation, aggregating all gate results with configurable policy.
class Scorecard:
    submission_name: str              # Name of the evaluated submission
    pipeline_run_id: str              # Tekton PipelineRun ID
    eval_engine: str                  # Primary evaluation engine used
    gates: list[GateResult]           # All gate results
    policy: GatePolicy                # Policy that was applied
    recommendation: Recommendation    # pass, warn, or fail
    recommendation_reason: str        # Human-readable explanation
    gates_passed: int                 # Count of passed gates
    gates_failed: int                 # Count of failed gates
    blocking_gates_passed: int        # Count of passed blocking gates
    blocking_gates_failed: int        # Count of failed blocking gates

The scorecard supports three modes for combining gate results:
| Mode | Logic |
|---|---|
| all_pass | All blocking gates must pass; failing warn gates produce warnings |
| any_pass | At least one blocking gate must pass |
| weighted | Weighted average of gate scores determines outcome |
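
The three combination modes can be sketched as a single function that maps gate results to a recommendation. This is an illustration under stated assumptions, not the project's aggregation logic: the `Gate` dataclass is a reduced stand-in for GateResult, the 0.5 cutoff for weighted mode is invented for the example, and the handling of "no blocking gates" in any_pass is a guess.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    """Reduced stand-in for GateResult (hypothetical, for this sketch only)."""
    name: str
    passed: bool
    score: float
    mode: str            # "disabled", "warn", or "block"
    weight: float = 1.0


def recommend(gates: list[Gate], combination: str = "all_pass",
              weighted_threshold: float = 0.5) -> str:
    """Return "pass", "warn", or "fail" for a list of gate results."""
    active = [g for g in gates if g.mode != "disabled"]
    blocking = [g for g in active if g.mode == "block"]
    warn_failed = any(not g.passed for g in active if g.mode == "warn")

    if combination == "all_pass":
        ok = all(g.passed for g in blocking)
    elif combination == "any_pass":
        # Assumption: with no blocking gates there is nothing to fail on.
        ok = any(g.passed for g in blocking) if blocking else True
    elif combination == "weighted":
        total = sum(g.weight for g in active)
        # Assumption: the cutoff for the weighted average is weighted_threshold.
        ok = total == 0 or sum(g.score * g.weight for g in active) / total >= weighted_threshold
    else:
        raise ValueError(f"unknown combination mode: {combination!r}")

    if not ok:
        return "fail"
    return "warn" if warn_failed else "pass"


gates = [
    Gate("evaluation", passed=True, score=0.85, mode="block"),
    Gate("security", passed=False, score=0.40, mode="warn"),
]
print(recommend(gates, "all_pass"))
```

In the usage above the blocking evaluation gate passes, so the scorecard does not fail, but the failed warn-mode security gate downgrades the recommendation from "pass" to "warn".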
The scorecard is written to reports/{submission}/scorecard.json and includes:
- All gate results with scores and findings
- Final recommendation with reasoning
- Provenance metadata (commit SHA, branch, pipeline run ID)
Gate policies are configured in metadata.yaml under the gate_policy key:
# metadata.yaml
name: my-skill
eval_engine: harbor
gate_policy:
  default_mode: warn        # Default mode for all gates
  combination: all_pass     # How to combine gate results
  gates:
    # Security gate configuration
    security:
      mode: block           # Fail the scorecard on security issues
      threshold: 0.8        # Minimum score to pass
    # Quality gate configuration
    quality:
      mode: warn            # Advisory only
      threshold: 0.6        # Threshold for pass/fail
    # Engine gate configuration (uses eval_engine automatically)
    evaluation:
      mode: block
      threshold: 0.0        # Any non-negative uplift passes

| Field | Type | Default | Description |
|---|---|---|---|
| mode | disabled/warn/block | warn | Enforcement mode |
| threshold | float | Gate-specific | Score threshold for pass/fail |
| weight | float | 1.0 | Weight for weighted combination mode |
The pipeline can push gate results to Red Hat Compass as Soundcheck facts for visibility in the developer portal.
Enable fact pushing in metadata.yaml:
gate_policy:
  push_facts:
    endpoint: https://compass.redhat.com/api/soundcheck/facts/
    entity_ref: component:default/my-component

Each gate result is pushed as a separate fact. The fact reference includes both the category and implementation:
{
  "facts": [
    {
      "factRef": "catalog:default/abevalflow_evaluation_harbor",
      "entityRef": "component:default/my-component",
      "data": {
        "gate_name": "evaluation",
        "passed": true,
        "score": 0.85,
        "mode": "block",
        "message": "Harbor A/B: gap=0.15 >= threshold=0.0 -> PASS",
        "evaluated_at": "2026-06-21T10:35:53Z"
      }
    }
  ]
}

The Compass API token is stored in a Kubernetes secret:

oc create secret generic compass-facts-api --from-literal=token=<your-token>

Reports and artifacts are uploaded to MinIO under a timestamped prefix:
s3://ab-eval-reports/YYYYMMDD_hhmmss_{submission}_{run-id}/
├── report.json              # Main evaluation report
├── report.md                # Human-readable report
├── scorecard.json           # Unified scorecard
├── security_scans/          # Security scan results
│   └── security-scan.json
├── generated/               # AI-generated artifacts
│   ├── instruction.md
│   └── test_outputs.py
├── scaffolded/              # Scaffolded configs and review
│   └── _ai_review.json
└── trials/                  # Per-trial artifacts (Harbor)
    ├── trial_001/
    │   ├── agent/
    │   └── verifier/
    └── ...
Evaluation results are persisted for historical analysis and monitoring:
- Script: scripts/store_results.py
- Data stored:
  - Submission metadata
  - Per-trial results (Harbor/ASE)
  - Security scan findings
  - Aggregate statistics
  - Scorecard recommendation
- Create a new file in abevalflow/engines/:

# abevalflow/engines/my_engine.py
import json
from pathlib import Path

from abevalflow.engines import register_engine
from abevalflow.engines.base import EvalEngine
from abevalflow.gates.base import GateResult, GateType
# GatePolicy must be imported as well; its module is not shown in this guide.


@register_engine("my-engine")
class MyEngine(EvalEngine):
    name = "my-engine"

    def read_result(self, reports_dir: Path) -> dict | None:
        """Read engine results from reports directory."""
        result_path = reports_dir / "my-engine-report.json"
        if not result_path.exists():
            return None
        return json.loads(result_path.read_text())

    def to_gate_result(self, raw_result: dict, policy: GatePolicy) -> GateResult:
        """Convert engine result to standardized GateResult."""
        gate_policy = policy.get_gate_policy(self.name)  # look up once, reuse below
        score = raw_result.get("score", 0.0)
        threshold = gate_policy.threshold or 0.0
        return GateResult(
            gate_type=GateType.ENGINE,
            gate_name="evaluation",
            policy_key=self.name,
            passed=score >= threshold,
            score=score,
            mode=gate_policy.mode,
            message=f"MyEngine: score={score:.2f}",
        )

- Import in abevalflow/engines/__init__.py:

from abevalflow.engines.my_engine import MyEngine

- Create a new file in abevalflow/gates/security/:

# abevalflow/gates/security/snyk.py
from pathlib import Path

from abevalflow.gates.security import register_security_gate
from abevalflow.gates.security.base import SecurityGate
from abevalflow.gates.base import GateResult, GateType
# GatePolicy must be imported as well; its module is not shown in this guide.


@register_security_gate("snyk")
class SnykGate(SecurityGate):
    name = "snyk"

    def evaluate(self, reports_dir: Path, policy: GatePolicy) -> GateResult:
        """Evaluate Snyk security scan results."""
        # Read snyk-report.json and produce GateResult
        ...

- Import in abevalflow/gates/security/__init__.py:

from abevalflow.gates.security.snyk import SnykGate

- Create a new file in abevalflow/gates/quality/:

# abevalflow/gates/quality/custom_review.py
from pathlib import Path

from abevalflow.gates.quality import register_quality_gate
from abevalflow.gates.quality.base import QualityGate
from abevalflow.gates.base import GateResult, GateType
# GatePolicy must be imported as well; its module is not shown in this guide.


@register_quality_gate("custom-review")
class CustomReviewGate(QualityGate):
    name = "custom-review"

    def evaluate(self, workspace_root: Path, policy: GatePolicy) -> GateResult:
        """Evaluate custom quality review results."""
        # Read review artifacts and produce GateResult
        ...

- Import in abevalflow/gates/quality/__init__.py:

from abevalflow.gates.quality.custom_review import CustomReviewGate

To add an entirely new gate category (e.g., "compliance", "performance"):
- Add the new value to the GateType enum in abevalflow/gates/base.py:

class GateType(str, Enum):
    ENGINE = "engine"
    SECURITY = "security"
    QUALITY = "quality"
    COMPLIANCE = "compliance"  # New category

- Create the gate directory at abevalflow/gates/compliance/:

abevalflow/gates/compliance/
├── __init__.py    # Registry and exports
├── base.py        # ComplianceGate base class
└── my_checker.py  # First implementation

- Create the base class in abevalflow/gates/compliance/base.py:

from abc import ABC, abstractmethod
from pathlib import Path

from abevalflow.gates.base import GateResult, GateType
# GatePolicy must be imported as well; its module is not shown in this guide.


class ComplianceGate(ABC):  # ABC is required for @abstractmethod to be enforced
    name: str

    @abstractmethod
    def evaluate(self, reports_dir: Path, policy: GatePolicy) -> GateResult:
        """Evaluate compliance and return standardized GateResult."""

- Update the scorecard aggregation in scripts/aggregate_scorecard.py:

from abevalflow.gates.compliance import get_all_compliance_gates

# In aggregate_scorecard():
for compliance_gate in get_all_compliance_gates():
    if not policy.is_enabled(compliance_gate.name):
        continue
    gate_result = compliance_gate.evaluate(reports_dir, policy)
    gates.append(gate_result)

- Add the category to the policy schema in abevalflow/schemas.py (documentation only; the schema is flexible)
For full agent evaluation with container isolation and A/B comparison:
my-skill/
├── instruction.md # Task description (required, or generated from SKILL.md)
├── skills/
│ └── SKILL.md # Skill definition (required)
├── tests/
│ ├── test_outputs.py # Verification tests (required, or generated)
│ └── llm_judge.py # LLM-based judge (optional)
├── docs/ # Reference documentation (optional)
├── supportive/ # Mock MCPs, data files (optional, <50MB)
└── metadata.yaml # eval_engine: harbor (required)
For lightweight LLM-as-judge evaluation without containers:
my-skill/
├── skills/
│ └── SKILL.md # Skill definition (required)
├── evals/
│ ├── evals.json # Evaluation prompts and assertions (optional, generated if missing)
│ └── files/ # Test data files (optional)
└── metadata.yaml # eval_engine: ase (required)
For validating MCP server implementations:
my-mcp-server-eval/
├── metadata.yaml            # eval_engine: mcpchecker (required)
├── mcp-config.yaml          # MCP server connection settings (required)
│                            #   - url: MCP server endpoint
│                            #   - auth: authentication config (if needed)
└── tasks/
    ├── task-1.yaml          # Task definition with expected tool calls
    └── task-2.yaml          # Each task tests specific MCP functionality
MCPChecker validates that the MCP server correctly handles tool invocations and returns expected results.
For evaluating agents that implement the A2A (Agent-to-Agent) protocol:
my-a2a-agent-eval/
├── metadata.yaml            # eval_engine: a2a (required)
├── agent-config.yaml        # Agent endpoint and auth config (required)
│                            #   - endpoint: http://agent-service:8000
│                            #   - auth: bearer token or API key
└── tasks/
    ├── instruction.md       # Task description
    ├── tests/
    │   └── test_outputs.py
    └── task.toml            # Task configuration
A2A evaluation connects to a deployed agent via the A2A protocol and runs evaluation tasks against it.
For evaluating general agents (non-A2A) with full container isolation:
my-agent-eval/
├── metadata.yaml # eval_engine: harbor, persona: agent (required)
├── instruction.md # Task description (required)
├── tests/
│ ├── test_outputs.py # Verification tests (required)
│ └── llm_judge.py # LLM-based judge (optional)
└── supportive/ # Environment files, data (optional)
Harbor creates treatment/control container variants and runs A/B comparison.
See Trigger Guide for detailed submission and trigger instructions.
The pipeline is LLM-agnostic. Three modes are supported:
| Mode | Proxy Required? |
|---|---|
| Direct API key (Anthropic, OpenAI, etc.) | No |
| opencode + self-hosted model (vLLM, Ollama) | No |
| Google Vertex AI + LiteLLM proxy | Yes |
- OpenShift cluster with Pipelines operator (Tekton)
- Container registry (Quay.io) with push credentials
- Harbor fork with OpenShift backend
- LLM access (one of the three modes above)
- Python 3.11+
- Trigger Guide — How to submit skills, configure gate policies, and interpret scorecard results
- ADR: Skill Evaluation Pipeline
Apache License 2.0