research: add autoconfig POC with QNN NPU catalog sweep by DingmaomaoBJTU · Pull Request #891 · microsoft/winml-cli

DingmaomaoBJTU · 2026-06-15T02:30:48Z

What this PR adds

research/autoconfig/ — an automated config search POC that sweeps opset versions (17–21), execution providers, and graph optimizations to find the best winml-cli build config for a given model on Windows hardware.

Key findings from QNN NPU catalog sweep (8 models, Snapdragon X Elite)

npu-001: opset21 gives +24–31% on DINOv2 family — NOT a general ViT property

Rigorously validated with fresh quantized.onnx builds, 3×500-iter sessions:

| Model | opset17 | opset21 | Gain |
|---|---|---|---|
| facebook/dinov2-small | 7.18 ms | 4.98 ms | +30.6% ✅ |
| facebook/dinov2-base | 34.56 ms | 26.23 ms | +24.1% ✅ |
| facebook/dino-vitb16 | 19.92 ms | 20.07 ms | -0.7% (NEUTRAL) ← critical control |
| microsoft/rad-dino | 274.98 ms | 275.36 ms | -0.1% (CPU-bound) |

Key discriminant: dino-vitb16 is the same ViT-B size as dinov2-base, but gets zero benefit from opset21. The speedup is therefore specific to the DINOv2 architecture, and its mechanism is still TBD. The working hypothesis is DINOv2-specific op patterns in the opset21 ONNX export; the original kMaxSupportedOpset bypass explanation does not apply to ORT 1.24.x.

npu-006: conv fusions cause catastrophic regression on Conv-dominant models only

| Model | No fusions | With fusions | Latency change |
|---|---|---|---|
| microsoft/resnet-18 | ~1–4 ms | ~132–135 ms | +4900% 🔥 |
| facebook/dinov2-base | 34.56 ms | 25.92 ms | -25% (FASTER) |
| facebook/dino-vitb16 | 19.92 ms | 20.12 ms | +1% (neutral) |

Hazard is proportional to Conv op density. Attention-dominant models are safe or slightly benefit.

npu-007: DVFS thermal noise requires session-level averaging

The coefficient of variation (CV) on QNN NPU is always in the 0.1–2.0+ range. Use 3×500-iter sessions with a 30 s cool-down between sessions, and trust only gains above 10%.
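
The session-level protocol can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `bench_once` is a hypothetical callable that returns per-iteration latencies in milliseconds, all function names are assumptions, and averaging the per-session p50 values with a mean (rather than a median) is also an assumption.

```python
import statistics
import time
from typing import Callable, List, Sequence


def run_sessions(
    bench_once: Callable[[int], Sequence[float]],
    sessions: int = 3,
    iters: int = 500,
    cool_down_s: float = 30.0,
) -> List[float]:
    """Run independent sessions and return one p50 latency (ms) per session."""
    p50s = []
    for i in range(sessions):
        latencies = bench_once(iters)  # hypothetical: per-iteration latencies in ms
        p50s.append(statistics.median(latencies))
        if i < sessions - 1:
            time.sleep(cool_down_s)  # cool-down between sessions, not after the last
    return p50s


def session_gain_pct(baseline_p50s: Sequence[float], candidate_p50s: Sequence[float]) -> float:
    """Gain in percent, computed on session-level averages (positive = faster)."""
    base = statistics.mean(baseline_p50s)
    cand = statistics.mean(candidate_p50s)
    return (base - cand) / base * 100.0


def is_trustworthy(gain_pct: float, min_gain_pct: float = 10.0) -> bool:
    """Apply the 'trust only gains above 10%' rule."""
    return gain_pct > min_gain_pct
```

Applied to the dinov2-base figures above (34.56 ms and 26.23 ms), `session_gain_pct` gives about 24.1%, which clears the 10% bar.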

Included files

Core scripts

autoconfig.py — main search loop (ConvNext CPU baseline)
catalog_qnn_sweep.py — 8-model QNN NPU catalog sweep
analyze_graph.py — ONNX graph analysis helper
validation_sweep.py — focused npu-001/npu-006 validation sweep (NEW)
gen_report_v3.py, autoconfig_diagram.html

Knowledge base (`ep_knowledge/`)

qnn_npu.json — 7 findings (npu-001 through npu-007), continuously updated with validation data
cpu.json, dml.json, qnn_gpu.json

Benchmark results (`catalog-qnn-sweep/`)

SUMMARY.md — original 8-model sweep results
VALIDATION_SUMMARY.md — 3-model validation sweep with full per-session data and cross-model comparison table
Per-model results.json and results_v2.json for dinov2-base, rad-dino, dino-vitb16

Design docs (`docs/`)

agent-design.md — winml-cli agent layer design (Diagnostic / Decision / Cross-Device / Regression / Recommendation agents)
skills-design.md — WinML CLI Skills Design (11 skills, competitive analysis, feature gaps)
ep-knowledge-review.md — statistical audit of ep_knowledge findings

Feature gaps identified

FusedConv detection in analyze_graph.py — needed to gate npu-006 rule automatically
DVFS-aware perf protocol — current winml perf doesn't expose session-level averaging
Budget-aware sweep — skip expensive hypotheses when time budget exhausted
Mechanism investigation for npu-001 — graph dump comparing Transpose counts at opset17 vs opset21

Status: Research POC — not production code. Scripts run standalone; not integrated into the winml CLI yet.

Adds research/autoconfig/ — an automated config search POC that sweeps opset versions (17-21), execution providers, and graph optimizations to find the best winml-cli build config for a given model on Windows hardware.

Key findings from 8-model QNN NPU catalog sweep:
- npu-001: opset 21 bypass gives +25-31% on Conv+residual models (MobileViT, DINOv2)
- npu-006: conv fusions (conv-bn/add/activation) cause 4900% regression on ResNet-18 QNN NPU
- npu-007: DVFS thermal noise requires session-level averaging (3x500 iters) for reliable results

Includes ep_knowledge/ KB with confirmed findings per EP, and catalog-qnn-sweep/ with per-model benchmark results and cross-model pattern analysis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds research/autoconfig/docs/agent-design.md — strategic design for the agent layer of winml-cli, covering: - winml-cli vs Olive distinction (UX + Windows-first + explainability) - Why autoconfig search is a sub-tool, not the agent entry point - 5 agent types: Diagnostic, Decision Guidance, Cross-Device Confidence, Regression Detection, Model Recommendation - Autoconfig's role within the agent framework - Key concerns and open questions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds research/autoconfig/docs/skills-design.md — full design doc for the winml-cli skills/agent layer, including: - 11 skill designs (use-winml-cli, optimize-for-device, ep-compatibility-check, debug-accuracy-drop, and others) - Competitive analysis (Apple coremltools, ExecuTorch, AI Hub, NVIDIA ModelOpt, OpenVINO, Olive) - Top 5 feature gaps - Validation confidence levels (L1-L5) - Structured output requirements - QNN NPU catalog sweep findings (npu-001/006/007) - FusedConv unfuse feature request Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

+import json
+
+
+results = json.load(open(r"ablation-search\results.json"))


…ping skills - Split skill catalog into two ranked categories by the 'does it touch code?' discriminator: User (config-only) and Contributor (code changes) - Merge overlapping skills (12 -> 9): - check-model-feasibility = find-a-model + ep-compatibility-check - ship-to-winapp = validate-before-ship + prepare-for-winapp - autoconfig absorbs optimize-for-device as its manual mode - Add self-contained HTML render of the design doc for easier reading

xieofxie · 2026-06-16T01:14:34Z

+
+    {
+      "id": "cpu-005",
+      "title": "Baseline (no extra flags) is the optimal config for ConvNext CPU",


xieofxie · 2026-06-16T01:15:04Z

+
+    {
+      "id": "cpu-001",
+      "title": "opset 19+ causes severe regression on CPU EP (3-4x slowdown)",


Critical issues found and corrected:

npu-001 (opset 21 speedup):
- mechanism_confirmed changed TRUE → FALSE. The kMaxSupportedOpset bypass requires ORT < 1.18; the sweep used onnxruntime-windowsml 1.24.5 where kMaxSupportedOpset >= 22. The bypass mechanism does not apply. The speedup for DINOv2/MobileViT is empirically real but the WHY is now unknown.
- ResNet-18 removed from 'benefits' list — sub-ms model, 3-session ranges span 4x for the same config (pure DVFS noise). Reported +20.2% was noise.
- MobileViT magnitude corrected: h1 had DVFS spike inflating median to 11.72ms; actual gain is ~20-26% not 26.5%.
- DINOv2 finding kept: 3-session data shows non-overlapping distributions.
- Added per-session raw data analysis and required follow-up experiments.

npu-002 / npu-003 (W8A16 speedup, compile speedup):
- scope changed from 'General / all vision models' to 'ConvNext only' (both findings from 1 model; magnitude claims not transferable)
- confidence reduced from 'high' to 'medium'

npu-004 (W8A8 accuracy collapse):
- confidence changed from 'medium' to 'very_low / anecdote'
- Finding has NO recorded data (experiment 'aborted early, numbers not saved'). Cannot be treated as a KB finding until re-run with recorded numbers.

npu-005 (QNN Hub comparison):
- Added fairness caveat: comparing qairt-stack model on ORT QNN EP is not a valid comparison. Finding is trivially true (use right tool for right stack) but not informative.

npu-006 (conv fusions catastrophic):
- No confidence change — this is the most statistically solid finding.
- Added session-level evidence note: h4 CV=0.016 (extremely stable, unusual for QNN NPU), consistent with deterministic CPU fallback hypothesis.

search_space_rules:
- opset recommendation changed from 'Conv+residual' to 'Conv+attention hybrid' to reflect actual validated models (DINOv2 is attention-dominant, not Conv+residual in the traditional sense)

New file: docs/ep-knowledge-review.md
- Full statistical analysis of per-session data
- ORT version dependency explained
- Additional models needed for validation
- Minimum experiment protocol

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…eneral ViT

Run validation_sweep.py across 3 new models to rigorously test npu-001 (opset21 speedup) and npu-006 (conv fusion regression) hypotheses.

KEY FINDINGS:

npu-001 (opset21 speedup):
- facebook/dinov2-base: +24.1% (opset17 34.56ms -> opset21 26.23ms); 3-session full bench, fresh quantized.onnx builds, very stable
- microsoft/rad-dino: -0.1% NEUTRAL -- model runs on CPU (~275ms), QNN NPU cannot accelerate ViT-L; opset irrelevant when CPU-bound
- facebook/dino-vitb16: -0.7% NEUTRAL -- critical control proving the speedup is NOT a general ViT property; DINOv2-specific op patterns must explain the difference

Combined with original catalog data: dinov2-small +30.6%, dinov2-base +24.1% (both confirmed), dino-vitb16 NEUTRAL (confirmed control) -> scope is DINOv2 family

npu-006 (conv fusions):
- dinov2-base: fusions -25% (faster) -- attention-dominant, benign
- dino-vitb16: fusions +1% (neutral) -- no meaningful Conv ops to fuse

Combined with original resnet-18 +4900% -> hazard is conv-density-gated

Script fixes in validation_sweep.py:
- bench_screen parsed d.get('p50_ms') instead of d['latency_ms']['p50']
- Reuse check accepted any .onnx (including truncated export.onnx)
- Model selection preferred optimized.onnx over quantized.onnx

Updated files:
- ep_knowledge/qnn_npu.json: npu-001 scope narrowed to DINOv2-family, validated_models expanded with dino-vitb16 (negative control) and dinov2-base (positive), rad-dino (CPU-bound); npu-006 scope updated
- catalog-qnn-sweep/VALIDATION_SUMMARY.md: full cross-model results table
- catalog-qnn-sweep/{dinov2-base,rad-dino,dino-vitb16}/results_v2.json
- catalog-qnn-sweep/.gitignore: exclude val_h*/ build artifact dirs

+        if complete_models:
+            print(f"  [reuse] existing build in {hyp_dir.name}", flush=True)
+            ok = True
+            build_out = "(reused)"


+            p50 = lat.get("p50") if isinstance(lat, dict) else None
+            if p50:
+                p50s.append(round(p50, 3))
+        except Exception:


…nism invalidated, confidence calibrated Merge structural improvements from local review into KB (smart merge, preserving validation sweep data from 2026-06-16): npu-001: - Add mechanism_invalidation field (explicit statement of INVALIDATION with cause: ORT 1.24.5 kMaxSupportedOpset>=22, bypass does not apply) - Add critical_caveats array (4 caveats incl. DINOv2-specific scope note) - Downgrade confidence to 'medium-high on empirical / low on mechanism' (was 'high' which was overclaiming given unknown mechanism) npu-002/003: - Add follow_up_required fields (FP32 baselines on MobileViT/DINOv2/ResNet) npu-004: - Update action_for_autoconfig: 'Do NOT use to skip W8A8 without running eval first' (was 'Treat as potentially risky' which was still prescriptive without data) search_space_rules: - Rename recommended_order_conv_attention_hybrid -> recommended_order_conv_residual to match local review terminology NOTE: Validation sweep data (dinov2-base +24.1%, dino-vitb16 NEUTRAL, rad-dino CPU-bound) from 2026-06-16 is preserved — not overwritten.

…d NOT Transpose elimination

Task 3 investigation: loaded dinov2-small opset17 (h0) and opset21 (h3) optimized.onnx and quantized.onnx from catalog_qnn_sweep builds; counted op types with onnx.load().

Key finding: Transpose count is IDENTICAL (49 nodes) in both opsets.
- opset17 optimized: 391 total, 49 Transpose, 121 Reshape
- opset21 optimized: 439 total, 49 Transpose, 169 Reshape (+48)
- opset17 quantized: 1398 total, 49 Transpose, 615 DQ, 392 Q
- opset21 quantized: 1542 total, 49 Transpose, 663 DQ, 440 Q (+48 QDQ pairs)

Rules out: NHWC Transpose-elimination as speedup cause, fewer-ops as explanation.
Consistent with: QNN EP scheduling/partitioning difference triggered by +48 Reshape nodes.

Also: kMaxSupportedOpset confirmed >= 23 in ORT 1.24.4 (C:\\tmp env), reaffirming that the original bypass mechanism does NOT apply.

Updated npu-001 critical_caveats, follow_up_required, and added transpose_analysis_2026_06_16 section with raw op counts.
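
An op-type comparison of the kind described in this commit could look roughly like the sketch below. It is an illustration under assumptions, not the script used in the PR: the helper names are hypothetical, and only `onnx.load(...).graph.node` and `node.op_type` come from the onnx package itself. The counting helper accepts any objects with an `op_type` attribute, so it can be exercised without a model file.

```python
from collections import Counter
from typing import Dict, Iterable


def count_op_types(nodes: Iterable) -> Counter:
    """Count nodes per op type; each node needs an `op_type` attribute."""
    return Counter(node.op_type for node in nodes)


def count_ops_in_file(path: str) -> Counter:
    """Load an ONNX model from disk and count its top-level graph nodes."""
    import onnx  # imported lazily so the pure helpers work without onnx installed

    return count_op_types(onnx.load(path).graph.node)


def diff_counts(before: Counter, after: Counter) -> Dict[str, int]:
    """Return {op_type: after - before} for every op type whose count changed."""
    ops = sorted(set(before) | set(after))
    return {op: after[op] - before[op] for op in ops if after[op] != before[op]}
```

Calling `diff_counts(count_ops_in_file(a), count_ops_in_file(b))` on two builds lists only the op types that differ. Note that this counts top-level graph nodes only; nodes inside subgraphs would need a recursive walk.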

…DINOv2-specific

New benchmark results (2026-06-17, QNN NPU Snapdragon X Elite, 3x500-iter W8A16):

BAAI/bge-small-en-v1.5 (BERT/sentence-similarity):
- h0=10.617ms [10.52, 10.32, 11.01]
- h3=9.840ms [10.25, 9.33, 9.94]
- opset21 gain +7.3% -- MARGINAL / INCONCLUSIVE (CV=0.3, ranges barely non-overlapping)
- Unusual vs all other NLP models (distilbert -0.1%, MiniLM -0.7%, roberta +0.1%)
- Needs 5+ sessions to differentiate from DVFS noise.

rizvandwiki/gender-classification (plain ViT):
- h0=14.326ms [14.15, 14.94, 13.89]
- h3=13.830ms [13.70, 13.92, 13.87]
- opset21 gain +3.5% -- NEUTRAL (ranges overlap 13.89/13.92ms, CV=0.35)

CRITICAL FINDING: this ViT model has IDENTICAL op counts to DINOv2-small (49 Transpose, 121 Reshape, ~72 Gemm) yet shows NO benefit. Confirms npu-001 is not explainable by op-count profiles or general ViT architecture.

Combined with Transpose analysis (Task 3): opset17 and opset21 DINOv2-small have identical Transpose node counts (49). The speedup mechanism is NOT Transpose elimination. The effect is specific to DINOv2 family at a level below op-count visibility -- possibly quantization behavior, tensor layout, or QNN EP partitioning.

Also updated: models_tested list (+5 entries), validated_models sections, scope and confidence statements, task completion notes in follow_up_required.
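
The per-session numbers in this commit can be rechecked with a short script. This is a sketch under assumptions: the headline h0/h3 values are taken to be means of the three session p50s (which matches the figures quoted), the range test is read as "every candidate session beats every baseline session", and the function bodies are illustrative rather than the PR's code.

```python
from statistics import mean


def gain_pct(baseline_p50s, candidate_p50s):
    """Gain in percent from session-level means (positive = candidate is faster)."""
    base = mean(baseline_p50s)
    return (base - mean(candidate_p50s)) / base * 100.0


def ranges_non_overlapping(baseline_p50s, candidate_p50s):
    """True when the slowest candidate session is still faster than the fastest baseline."""
    return max(candidate_p50s) < min(baseline_p50s)


# Per-session p50 values (ms) quoted in the commit message above.
bge_h0, bge_h3 = [10.52, 10.32, 11.01], [10.25, 9.33, 9.94]
vit_h0, vit_h3 = [14.15, 14.94, 13.89], [13.70, 13.92, 13.87]

for name, h0, h3 in [("bge-small", bge_h0, bge_h3), ("gender-classification", vit_h0, vit_h3)]:
    print(name, round(gain_pct(h0, h3), 1), ranges_non_overlapping(h0, h3))
```

For BGE-small the ranges are separated by only 0.07 ms (10.25 vs 10.32), which is why the result is labelled marginal; for the plain ViT they overlap (13.92 vs 13.89).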

…ndings, fix mechanism claims

cpu.json:
- cpu-001: mechanism_confirmed true->false. Data is real (opset 17 best) but the kMaxSupportedOpset gate hypothesis doesn't explain the non-monotonic pattern (opset22=85ms partial recovery while 19/20/21 all ~150-170ms). Two separate kMaxSupportedOpset constants exist (NHWC gate vs Transpose Optimizer gate); the CPU one is unverified. Added note on this distinction.
- cpu-006: mechanism_confirmed true->false (derived from cpu-001). Meta-rule (EP isolation) remains valid. Added note that NPU/CPU experiments used different models (DINOv2 vs ConvNext) -- comparison is directional only.

dml.json:
- dml-001: INVALIDATED as 'DML is faster'. DML p50=16.9ms vs QNN GPU p50=17.7ms: diff = 0.8ms = 0.82 sigma of GPU measurement -- distributions OVERLAP. Retained: DML IS more stable (std 0.52 vs 0.97), that difference is real.
- dml-002: HEADLINE CORRECTED. p50 with NHWC is marginally BETTER (16.5 vs 16.9ms), not worse. The actual finding is NHWC increases tail latency (p90 +19%) and variance (std 3.6x worse). Action unchanged (avoid NHWC) but for stability reasons, not p50.

qnn_gpu.json:
- gpu-003: Downgraded from medium to low confidence. Single experiment, 34% gap is above noise level but needs replication before citing as 'NEVER use compile'.
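
The "0.82 sigma" figure for dml-001 is the p50 gap expressed in units of the QNN GPU standard deviation. A minimal sketch of that arithmetic, assuming the function name and the choice of reference std (the GPU one, as the commit states):

```python
def diff_in_sigma(p50_a_ms: float, p50_b_ms: float, std_ref_ms: float) -> float:
    """Absolute p50 difference expressed in units of a reference standard deviation."""
    return abs(p50_a_ms - p50_b_ms) / std_ref_ms


# DML p50 = 16.9 ms, QNN GPU p50 = 17.7 ms, QNN GPU std = 0.97 ms (values from the commit).
print(round(diff_in_sigma(16.9, 17.7, 0.97), 2))
```

A gap of less than one standard deviation is what the commit means by overlapping distributions.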

Key corrections: - Bench protocol: QNN NPU CV 0.10-1.2 is normal (DVFS); never reject on CV. Protocol is 3x500-iter always, not gated on CV. - Phase 4 conv fusions: add npu-006 hard gate — FusedConv not supported by QNN EP -> CPU fallback -> +4900% regression on Conv-dense models. Rule: skip all conv-*-fusion if Conv% of total ops > 20%. - Diagnosis table: add npu-006 catastrophic regression row. - Gate 2 lesson: DINOv2 opset21 +24-31% is real but mechanism UNKNOWN. Two hypotheses ruled out: kMaxSupportedOpset bypass (ORT>=23), Transpose elimination (count identical opset17/21). +48 Reshape nodes only diff found. ViT models with identical op counts see no benefit -- effect below topology. - DML vs QNN GPU: correct 'consistently faster' claim -- 0.8ms diff = 0.82sigma, distributions overlap. Real finding: DML is more stable (std 0.52 vs 0.97). - EP table: update QNN NPU to 'architecture-dependent', add conv-fusion caveat; DML note corrected; CPU note: mechanism uncertain (two kMaxSupportedOpset). - Actionable findings: replace 'mechanism CONFIRMED' with full invalidation log.

… loop

Phase 0 — new analyze step sets 3 EP-specific flags before any experiment:
- conv_fusions_blocked: QNN NPU + Conv% > 20% -> skip all conv-*-fusion
- nhwc_blocked: QNN GPU / DML -> skip nhwc-transformer (dml-002)
- opset_sweep_blocked: CPU EP -> never sweep opset (cpu-001, fixed at 17)
- bench_protocol: 'npu' if QNN NPU -> always 3-session, no CV gate

Phase 1 skip_set — 3 new hard blocks wired from Phase 0 flags:
- conv fusions blocked when npu-006 risk detected
- nhwc-transformer blocked for GPU/DML EPs
- opset sweep blocked for CPU EP

Conv bottleneck queue respects conv_fusions_blocked flag

Phase 2 loop:
- Hypothesis rule 2a: start with W8A16 (not W8A8); W8A8 is high-risk for LN/GELU
- W8A8 early exit: if top-1 <= 15% on first W8A8 attempt -> skip all W8A8 variants
- PERF step: full EP-aware bench protocol with 3-session NPU path, CV gate for CPU/GPU, s0 JIT exclusion rule, and non-overlapping range requirement for KEEP
- Post-convergence: mandatory compile for QNN NPU (+1.7x validated), explicit compile-skip guard for GPU/DML (compile regresses on Adreno X1-85).
- Hypothesis generation: opset sweep is now EP-qualified — CPU always blocked, GPU/DML not validated (skip), QNN NPU full sweep 17-21 with scope note.
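
The Phase 0 flags described in this commit can be pictured as a small pure function of the execution provider and the Conv share of the graph. The sketch below is illustrative only: the EP identifier strings (`"qnn_npu"`, `"qnn_gpu"`, `"dml"`, `"cpu"`), the dataclass, and the `"default"` protocol value for non-NPU EPs are assumptions, not names taken from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase0Flags:
    conv_fusions_blocked: bool
    nhwc_blocked: bool
    opset_sweep_blocked: bool
    bench_protocol: str


def phase0_flags(ep: str, conv_pct: float) -> Phase0Flags:
    """Derive the EP-specific flags before any experiment runs."""
    ep = ep.lower()  # hypothetical EP identifiers, see lead-in
    return Phase0Flags(
        # npu-006: Conv-dense models on QNN NPU must skip all conv-*-fusion passes
        conv_fusions_blocked=(ep == "qnn_npu" and conv_pct > 20.0),
        # dml-002: skip nhwc-transformer on QNN GPU and DML
        nhwc_blocked=ep in {"qnn_gpu", "dml"},
        # cpu-001: never sweep opset on the CPU EP
        opset_sweep_blocked=(ep == "cpu"),
        # QNN NPU always uses the 3-session protocol with no CV gate
        bench_protocol="npu" if ep == "qnn_npu" else "default",
    )
```

Keeping the flags in one frozen object makes it straightforward for a later phase to build its skip set from them without re-deriving the rules.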

…p script

catalog_qnn_sweep.py:
- Add NPU006_CONV_PCT_THRESHOLD constant (20%) -- npu-006 guard
- Add _count_conv_pct(): after h0 builds, count Conv ops via onnx library to assess whether h4/h5 conv fusions are safe or will catastrophically regress
- In hypothesis loop: after h0 succeeds, analyze model.onnx Conv%. If Conv% > 20%: print [npu-006] WARNING before running h4/h5. Annotate h4/h5 bench result with npu006_expected_regression=True/False.
- results dict: add conv_pct, npu006_risk, npu006_regression, npu001_ranges_non_overlapping fields
- _compute_summary: improve npu001_generalized with range-overlap check (max(h3_p50s) < min(h1_p50s)) alongside median test. DVFS-noisy NPU results where ranges overlap are reported as 'median_only' (marginal), not True -- prevents false positives like BGE-small (+7.3%, overlapping).
- _compute_summary: add npu-006 catastrophic regression detector (h4/h5 median >= 5x baseline = CPU fallback confirmed)
- write_summary: SUMMARY.md now includes Conv% column, npu-006 regression column, and range-overlap note in npu001 column. Bench protocol header updated to note DVFS expectation.
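
The npu-006 guard and regression detector described above reduce to two comparisons. The following is a sketch, assuming Conv% means the share of nodes whose op type is exactly `Conv`; apart from the `NPU006_CONV_PCT_THRESHOLD` constant named in the commit, the function names are hypothetical.

```python
from typing import Iterable

NPU006_CONV_PCT_THRESHOLD = 20.0  # percent, as stated in the commit message


def conv_pct(op_types: Iterable[str]) -> float:
    """Percentage of graph nodes that are Conv ops (0.0 for an empty graph)."""
    ops = list(op_types)
    if not ops:
        return 0.0
    return 100.0 * sum(1 for op in ops if op == "Conv") / len(ops)


def npu006_risk(pct: float) -> bool:
    """Conv fusions are considered hazardous above the threshold."""
    return pct > NPU006_CONV_PCT_THRESHOLD


def npu006_regression(baseline_ms: float, fused_ms: float, factor: float = 5.0) -> bool:
    """Flag a catastrophic regression: fused latency at least `factor` times the baseline."""
    return fused_ms >= factor * baseline_ms
```

With the figures reported in the PR description, a ResNet-18 run at roughly 132 ms against a baseline of a few milliseconds trips the detector, while dinov2-base (34.56 ms to 25.92 ms) does not.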

Bugs fixed (from code-review + rubber-duck analysis):

1. [CRITICAL] autoconfig.py hypothesis optim keys were kebab-case ('conv-bn-fusion') but build_config() in pipes/graph.py looks up cap.python_name (snake_case). All h1-h5 were silently benchmarking the baseline config. Fix: rename all optim keys to snake_case ('conv_bn_fusion', 'gelu_fusion', etc.)
2. [HIGH] autoconfig.py hypothesis accumulation: h2-h5 used {**cfg['optim'], ...} but each hypothesis starts from a fresh BASELINE copy where optim={}. Refactored to explicit isolated mode — each hypothesis is independent. Labels updated to remove misleading '+' prefix. Behavior now matches intent.
3. [HIGH] autoconfig.py baseline_p50 only set when i==0 AND bench passes. If iter 0 was KB-skipped, baseline_p50 stayed None forever and the perf gate never fired. Fix: set baseline_p50 on the first successful Phase B bench regardless of iteration index.
4. [HIGH] catalog_qnn_sweep.py MODEL_TIMEOUT_S=20*60 (20 min) caused all hypotheses after h0 to time out. A single hypothesis takes ~30 min minimum. Fix: raise to 180 min (3 hours for 6 hypotheses).
5. [MEDIUM] catalog_qnn_sweep.py _count_conv_pct() used a catch-all except that masked ImportError. When onnx is missing, conv_pct returns 0.0 which evaluates as 'no risk' — silently disabling the npu-006 guard. Fix: split ImportError (loud warning + treat as UNKNOWN/HIGH risk) from other exceptions (parse errors, silent fallback).

Additional fixes:
- validation_sweep.py npu-007 bug: bench_screen failure gated Phase B for QNN NPU. For QNN NPU, only non-NPU EPs should gate Phase B on screen fail.
- autoconfig.py: replace 'Likely DVFS noise' CV message with EP-aware text
- autoconfig.py: median_p50 local variable shadowed imported function — renamed to med_p50 to prevent confusion
- autoconfig.py: remove duplicate code section left by earlier refactor
- bench_utils.py: new shared module with run_cmd, bench_screen, bench_full, ScreenResult, count_conv_pct, ranges_non_overlapping, median_p50, etc. bench_full now accepts warmup/iters/cool_down_s overrides for CPU protocol
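
Bug 5 above is a general pattern: a catch-all `except` that turns "dependency missing" into a safe-looking default. One way to separate the two failure modes is sketched below. It is an assumption-laden illustration, not the PR's fix: `counter` is a hypothetical injected callable that computes Conv% for a model path, `None` stands in for the UNKNOWN state, and returning 0.0 on other errors is one reading of the commit's "silent fallback".

```python
import warnings
from typing import Callable, Optional


def safe_conv_pct(path: str, counter: Callable[[str], float]) -> Optional[float]:
    """Return Conv% for a model, or None when it cannot be determined at all."""
    try:
        return counter(path)
    except ImportError:
        # Missing dependency: be loud and report UNKNOWN instead of "0% Conv".
        warnings.warn("onnx is not installed; Conv% is unknown, treating npu-006 risk as HIGH")
        return None
    except Exception:
        # Parse errors and similar: keep the quiet fallback (assumed to be 0.0).
        return 0.0


def npu006_guard_active(pct: Optional[float], threshold: float = 20.0) -> bool:
    """Unknown Conv% keeps the guard on; otherwise compare against the threshold."""
    return pct is None or pct > threshold
```

The key design point is that the UNKNOWN state is distinct from 0.0, so a missing package can no longer disable the guard silently.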

…ume (AgenticGPUOptimizer V2)

Three improvements borrowed from AgenticGPUOptimizer V2 patterns:

1. ThroughputOnly verdict policy (bench_utils.py)
   - improvement must exceed max(1% floor, 2x screen-CV)
   - noise-level deltas (delta < stat_bar * CV) are DISCARD, not KEEP
   - marks marginal KEEPs (1x < delta < 1.5x threshold) as MARGINAL_KEEP
2. Screen phase early exit (autoconfig.py)
   - if screen improvement < 1%, skip 3x full-bench entirely
   - saves ~25-90 min per rejected hypothesis on first run
   - applied only when baseline_p50 is known (not first iter)
3. Crash-resume via SessionManager (bench_utils.py)
   - session.json written atomically after each experiment
   - on restart, completed iters are loaded and skipped
   - state includes baseline_p50, best_p50/label, consecutive_discards

Also extracts _run_phase_b() helper to reduce main() nesting depth.
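
The ThroughputOnly rule described in this commit can be read as a three-way threshold test. The sketch below is one interpretation, not the class from bench_utils.py: the function name and signature are hypothetical, the improvement is expressed as a fraction of the baseline p50, and the boundary handling (a delta equal to the threshold counts as DISCARD) is an assumption.

```python
def throughput_verdict(
    baseline_p50: float,
    candidate_p50: float,
    screen_cv: float,
    floor: float = 0.01,
    stat_bar_multiplier: float = 2.0,
) -> str:
    """Classify a candidate as KEEP, MARGINAL_KEEP, or DISCARD."""
    delta = (baseline_p50 - candidate_p50) / baseline_p50  # fractional improvement
    threshold = max(floor, stat_bar_multiplier * screen_cv)
    if delta <= threshold:
        return "DISCARD"  # within noise, or a regression
    if delta < 1.5 * threshold:
        return "MARGINAL_KEEP"  # clears the bar, but only narrowly
    return "KEEP"
```

For example, with a screen CV of 0.02 the bar is 4%: a 5% improvement is a MARGINAL_KEEP and a 10% improvement is a KEEP. Tying the bar to the measured CV means noisier devices automatically demand larger gains.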

+from bench_utils import (
+    FULL_ITERS,
+    FULL_SESSIONS,
+    SCREEN_CV_MAX_STD,
+    SCREEN_ITERS,
+    SessionManager,
+    ThroughputOnly,
+    VerdictInput,
+    bench_full,
+    bench_screen,
+    median_p50,
+    run_cmd,
+)


+            try:
+                baseline_p50 = float(exp_info["median_p50"])
+                exp_info["baseline_p50"] = f"{baseline_p50:.1f}"
+            except (ValueError, TypeError):


+        self.stat_bar_multiplier = stat_bar_multiplier
+
+    @abstractmethod
+    def evaluate(self, inp: VerdictInput) -> VerdictOutput: ...


…summary.html autoconfig_diagram.html (v3): - Phase 2 Optimizer: screen early exit box (skip full bench when screen delta < 1%) - Phase 2 Reviewer: ThroughputOnly verdict policy with KEEP/MARGINAL_KEEP/DISCARD/EARLY pills - Phase 2: crash-resume session.json box (new teal row) - Phase 0: session.json load on startup (crash-resume) - Phase 1 skip_set: updated with empirical KB rules (npu-006 Conv% gate, cpu-002, gpu-004, etc.) - Side panel: session.json added alongside results.tsv and ep_knowledge/ - Footnote: v3 change summary + pending features with issue references agent-design.md: - New Section 2.1: improved loop V3 (what it does well) vs remaining agent gaps - Section 2.2: corrected framing (original was wrong; V3 fixes the computation layer; agent gaps are explanation/architecture-awareness/cross-device/KB self-update) - Date updated to 2026-06-17 docs/ep-findings-summary.html (new): - 17 findings across QNN NPU / CPU / DML / QNN GPU, only confirmed/valid - Color-coded by EP, confidence badges (HIGH/MEDIUM/LOW) - Per-finding: observation data, scope, autoconfig action - 7 feature requests table with issue IDs (#155, #158, #443, #867, #868) + 2 not-yet-filed gaps (FusedConv detect, DML analyze rules)

…dings-summary.html 11 findings (npu-002/003/004, cpu-001/002/005, dml-001/002/003, gpu-001/002/003/005) are hidden by default because they derive from only 1 model (convnext-tiny-224). 6 multi-model / universal findings remain visible: npu-001 (14 models), npu-006 (4 models), npu-007 (8 models), cpu-006 (meta EP-isolation rule), dml-004 (all DML models), gpu-004 (QNN SDK limitation). A toggle button lets readers expand hidden findings on demand. sm-divider rows summarize how many are hidden per EP section.

…s-summary.html The .finding { grid rule lost its selector, breaking the 4-column layout for every finding row. Restored selector and fixed grid-template-columns to explicit 28px 70px 1fr 220px (was auto, caused action column collapse).

…rt + KB draft Phase 1 — analyze_insight.py (new module): - run_graph_analysis(): op counts, Conv%, GELU variant, dynamic axes from ONNX proto - run_winml_analyze(): calls winml analyze --ep <ep> -o json, parses partial/unsupported ops - build_insight(): fuses 3 signals (graph + analyze + KB) into skip_set + priority_boosts - skip_set: npu-006 Conv%>20% block, cpu-001 opset deprioritise, gpu-004 quant skip, dml-002/gpu-002 nhwc-transformer skip - priority_boosts: npu-001 DINOv2 heuristic (+10), GELU-decomposed (+3), high-Gemm% (+2) Phase 2 integration in autoconfig.py: - Calls build_insight() after KB load (graceful fallback if baseline ONNX not yet built) - Sorts HYPOTHESES by priority_boost (highest first) - Checks insight.skip_set before each iteration (in addition to KB skip_passes) Phase 3 — report_gen.py (new module): - generate_report(): reads results.tsv, writes report.html - Champion config box (best KEEP verdict) - Benchmark bar chart (CSS bars, colour-coded by status) - Full experiment table - Phase 1 Insight Engine notes section Phase 3 — KB draft auto-write in autoconfig.py: - write_kb_draft(): on KEEP verdict with improvement > 10%, appends status=draft entry to ep_knowledge/<ep>.json - Draft has mechanism_confirmed=false; human must Gate-2 validate before promoting

…winml optimize flags analyze_insight.py: - Added FusionCandidate dataclass: flag, count, evidence - _detect_fusion_candidates(): 30+ patterns mapped to winml optimize flags GELU (erf/tanh/quick), LayerNorm variants, attention, MatMul patterns, Conv patterns, Gemm patterns, eliminations, layout transforms - build_insight(): log-scaled priority boosts from fusion_candidates Validated on 5 real sweep ONNX (optimized.onnx): dinov2: 49x transpose_optimizer, 24x matmul_transpose_fusion, 12x attention_fusion roberta: +12x bias_softmax_fusion, 12x matmul_add_fusion resnet: 11x conv_add_fusion, 11x conv_add_activation_fusion mobilevit: 36x matmul_transpose_fusion, 12x highdimRTR_lowdimRTR bge-small: 12x matmul_add_fusion, 12x matmul_transpose_fusion catalog_qnn_sweep.py — hypothesis matrix expanded h6-h10: h6: opset21 + matmul_transpose_fusion (24-36x in all transformers) h7: opset21 + bias_softmax_fusion (12x BERT-family Add->Softmax) h8: opset21 + attention_fusion (9-12x Softmax nodes) h9: opset21 + highdimRTR_lowdimRTR (12x RTR chains on MobileViT) h10: opset17 + conv_add_fusion only (11x Conv->Add on ResNet, safe subset) ep_knowledge/qnn_npu.json: npu-008 rad-dino BUILD_FAIL (rc=0xC0000005)

… for delta sweeps Enables incremental sweeps without re-running all hypotheses: --only-hypotheses h6,h7,h8 run only specified IDs (skip others) --reuse-h0-config load base config from existing h0/build_config.json When --only-hypotheses is set: - loads existing results.json and preserves prior hypothesis data - skips winml config call if --reuse-h0-config + h0/build_config.json exists - writes updated results.json merging old + new entries Allows targeted delta sweeps, e.g. testing only new h6-h10 on models that already have h0-h5 data from a previous full run.

QNN GPU sweep differs from NPU:
- No quantization (gpu-004: QDQ hangs on GPU EP)
- No compile (gpu-003: EPContext regresses ~34% on GPU)
- No nhwc-transformer (gpu-002: Adreno X1-85 does not benefit)
- CV gating IS reliable (no DVFS noise unlike NPU)
- opset 21 previously untested — explicitly validated via h3 (gpu-006)

Hypothesis matrix (13 total, h0-h12):
- h0-h3: opset 17/17-explicit/19/21 baselines (FP32, no quant)
- h4-h8: targeted fusions from graph analysis: matmul_transpose, attention, bias_softmax, layer_norm, skip_layer_norm
- h9-h10: bundled combinations (opset21+attention, ln+skip_ln+matmul_tp)
- h11: gelu_fusion explicit (tests gpu-005 stability on non-ConvNext)
- h12: transpose_optimizer

Models: 8 catalog + 3 recipe (rad-dino, tinyroberta-squad2, bge-small)

Sweep is queued to auto-start after NPU h6-h10 finishes (run_gpu_sweep.bat polls h6h10_sweep.log for completion marker)

Supports --only-hypotheses and --reuse-h0-config for delta sweeps.

+            try:
+                base_config = json.loads(h0_cfg.read_text(encoding="utf-8"))
+                print("  [reuse] h0 config loaded", flush=True)
+            except Exception:


Both sweeps now run additional confirmation sessions for any KEEP-level result to reduce false positives from thermal/DVFS noise:

QNN GPU (catalog_gpu_sweep.py):
- Phase B: 2 → 3 sessions × 300 iters (baseline)
- Phase C: KEEP candidates get 2 extra confirmation sessions
- All 5 sessions above MIN_IMPROVEMENT_PCT → KEEP_CONFIRMED
- Fewer → MARGINAL_UNCONFIRMED (downgraded, not dropped)

QNN NPU (catalog_qnn_sweep.py):
- Phase B unchanged: 3 sessions × 500 iters (already robust)
- Phase C: best hypothesis (gain ≥ 5%) gets 2 extra confirmation sessions
- Strict criterion: max(all 5 p50s) < min(baseline p50s) → CONFIRMED
- Otherwise → MARGINAL_UNCONFIRMED (ranges overlap = DVFS noise)

Motivation: avoid publishing false conclusions from single-run noise. GPU is more stable (CV gating already helps) but confirmation pass gives rigour before updating ep_knowledge KB with a new finding.
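
The two Phase C confirmation rules can be sketched as a pair of small functions. This is an illustration under assumptions: the function names are hypothetical, the GPU rule is read as a per-session gain compared against `MIN_IMPROVEMENT_PCT` (whose value is not given in the commit, so it is a required argument here), and the NPU rule follows the stated strict range criterion.

```python
from typing import Sequence


def confirm_gpu(session_gains_pct: Sequence[float], min_improvement_pct: float) -> str:
    """GPU rule: every session's gain must exceed the minimum improvement."""
    if all(g > min_improvement_pct for g in session_gains_pct):
        return "KEEP_CONFIRMED"
    return "MARGINAL_UNCONFIRMED"  # downgraded, not dropped


def confirm_npu(baseline_p50s: Sequence[float], candidate_p50s: Sequence[float]) -> str:
    """NPU rule: the slowest candidate session must beat the fastest baseline session."""
    if max(candidate_p50s) < min(baseline_p50s):
        return "CONFIRMED"
    return "MARGINAL_UNCONFIRMED"  # overlapping ranges are treated as DVFS noise
```

The NPU criterion is deliberately stricter than a comparison of averages: a single slow candidate session is enough to withhold confirmation.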

- Replace --output-json with --output (correct winml perf flag) - Fix _get_p50/_get_cv to read latency_ms.p50/std keys (winml perf JSON nests metrics under 'latency_ms', not top-level)
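
Reading the nested metrics defensively might look like the sketch below. Only the `latency_ms.p50` and `latency_ms.std` keys come from the commit message; the function names, the `None` fallbacks, and the definition of CV as std divided by p50 are assumptions made for illustration.

```python
import json
from typing import Optional


def get_p50(perf: dict) -> Optional[float]:
    """Return latency_ms.p50, or None if the key path is missing or malformed."""
    lat = perf.get("latency_ms")
    p50 = lat.get("p50") if isinstance(lat, dict) else None
    return float(p50) if isinstance(p50, (int, float)) else None


def get_cv(perf: dict) -> Optional[float]:
    """Return std / p50 as a rough coefficient of variation, or None if unavailable."""
    lat = perf.get("latency_ms")
    if not isinstance(lat, dict):
        return None
    p50, std = lat.get("p50"), lat.get("std")
    if not isinstance(p50, (int, float)) or not isinstance(std, (int, float)) or p50 == 0:
        return None
    return std / p50


# Illustrative payload shaped like the nesting described above (values are made up).
perf = json.loads('{"latency_ms": {"p50": 16.9, "std": 0.52}}')
print(get_p50(perf), get_cv(perf))
```

Returning `None` rather than 0.0 for a missing key keeps a parsing mistake from being mistaken for a very fast run.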

Without --rebuild, winml build fails when a partial export.onnx exists in the output directory (optimize step exits rc=1). --rebuild forces a clean pipeline run, which succeeds consistently.

DingmaomaoBJTU requested a review from a team as a code owner June 15, 2026 02:30

github-actions Bot and others added 2 commits June 15, 2026 10:32

github-advanced-security AI found potential problems Jun 15, 2026


Comment thread research/autoconfig/gen_report_v3.py

import json

results = json.load(open(r"ablation-search\results.json"))

xieofxie reviewed Jun 16, 2026


github-actions Bot and others added 2 commits June 16, 2026 14:33

github-advanced-security AI found potential problems Jun 16, 2026


github-actions Bot added 8 commits June 16, 2026 19:12

github-advanced-security AI found potential problems Jun 17, 2026


Comment thread research/autoconfig/autoconfig.py Fixed

github-advanced-security AI found potential problems Jun 17, 2026


github-actions Bot added 8 commits June 17, 2026 16:29

research(autoconfig): rename --report to --json, drop dml-004 FR row

2c38721

research(autoconfig): trim autoconfig_diagram.html — shorter bullets,…

5c5684a

… condensed footnote

research(autoconfig): add Pending Features badge to Phase 3 in diagram

e18e066

research(autoconfig): add local PyTorch reference FR; clarify correct…

526fcd0

…ness contract

research(autoconfig): fix Phase 0 layout — nowrap, 3 equal-width boxes

9d5148e

github-actions Bot added 11 commits June 17, 2026 17:07

research(autoconfig): pending features badge — outcome-focused descri…

408c647

…ptions

research(autoconfig): Phase 3 -> Outcome; Feature Requirements badge …

d761741

…with sample issue

research(autoconfig): align Feature Requirements badge style with oth…

e9f1cf4

…er Phase 3 badges

research(autoconfig): simplify Feature Requirements badge to issue nu…

d242fdd

…mbers only

research(autoconfig): feature requirements badge — issue numbers in c…

0940ef5

…ode bar only

research(autoconfig): feature requirements badge — show issue templat…

eeef0b9

…e format

research(autoconfig): simplify Insight Engine boxes to concept + exam…

75bea5a

…ple only

github-advanced-security AI found potential problems Jun 17, 2026


Comment thread research/autoconfig/catalog_gpu_sweep.py

try:
    base_config = json.loads(h0_cfg.read_text(encoding="utf-8"))
    print("  [reuse] h0 config loaded", flush=True)
except Exception:

github-actions Bot added 3 commits June 17, 2026 20:39

fix(autoconfig): fix GPU sweep perf flag and JSON parsing

cc25e31

- Replace --output-json with --output (correct winml perf flag) - Fix _get_p50/_get_cv to read latency_ms.p50/std keys (winml perf JSON nests metrics under 'latency_ms', not top-level)

fix(autoconfig): add --rebuild to GPU sweep build step

7b30db9

Without --rebuild, winml build fails when a partial export.onnx exists in the output directory (optimize step exits rc=1). --rebuild forces a clean pipeline run, which succeeds consistently.

DingmaomaoBJTU wants to merge 37 commits into main from dingmaomaobjtu/research-autoconfig-poc

3 participants
