
Commit `f7ad732` (parent `c7a5129`)

> docs: update AGENTS.md with omni_harness.sh combined mode and current DQE state

1 file changed: `AGENTS.md` (121 additions, 69 deletions)
````diff
@@ -40,9 +40,9 @@ SQL query → TransportTrinoSqlAction (coordinator)
 1. Code change (edit Java files)
 2. Compile: `./gradlew :dqe:compileJava`
-3. Reload plugin — see Long-Running Task Rules
-4. Correctness gate — MUST be >= 38/43. If regression, STOP and fix.
-5. Benchmark target queries — see Long-Running Task Rules
+3. Restart OpenSearch: `./gradlew :opensearch-sql-plugin:run -x test -x integTest` (32GB heap)
+4. Benchmark target queries: `bash benchmark_dqe_iceberg/omni_harness.sh --query N --timeout 300`
+5. Full correctness+perf gate: `bash benchmark_dqe_iceberg/omni_harness.sh --timeout 300`
 
 All steps 3-5 MUST follow the async execution pattern in Long-Running Task Rules.
````

````diff
@@ -55,55 +55,40 @@ Any command that may take longer than 2 minutes MUST be run asynchronously. This
 1. **NEVER run long-running commands synchronously** — always background and poll.
 2. **Launch in a subshell** so the parent shell returns immediately:
    ```bash
-   nohup bash -c 'cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash run/run_all.sh reload-plugin > /tmp/reload.log 2>&1' &>/dev/null &
+   nohup bash -c 'cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash benchmark_dqe_iceberg/omni_harness.sh --query 33 --timeout 300 > /tmp/q33.log 2>&1' &>/dev/null &
    echo "launched"
    ```
    **CRITICAL**: Plain `nohup cmd &` or `(cmd &)` does NOT work — the shell hangs waiting for the background process. You MUST use `nohup bash -c '...' &>/dev/null &`.
 3. **Poll for completion** — check output tail for success/failure:
    ```bash
-   tail -5 /tmp/reload.log
+   tail -5 /tmp/q33.log
    ```
-4. **Poll interval**: every 10-30s for benchmarks, every 30-60s for builds.
+4. **Poll interval**: every 60-120s for single-query benchmarks, every 30s for builds.
 5. **Analyze each poll result** — if ERROR/FAILURE appears in output, stop and diagnose immediately.
 6. **Monitoring IS the task** — never launch a long-running command and then do something else.
 
 ### Common Long-Running Commands
 
-| Command | Est. Time | Output File | Completion Marker | Error Marker |
-|---------|-----------|-------------|-------------------|--------------|
-| `./gradlew :dqe:compileJava` | ~5s | `/tmp/compile.log` | `BUILD SUCCESSFUL` | `BUILD FAILED` |
-| `run_all.sh reload-plugin` | 2-3 min | `/tmp/reload.log` | `reloaded successfully` | `FAILED` or `Error` |
-| `run_all.sh correctness` | ~2 min | `/tmp/correctness.log` | `Summary:` | `Error` |
-| `run_opensearch.sh --query N` | ~1 min | `/tmp/bench-qN.log` | `Results written` | `Error` or `failed` |
-| `run_opensearch.sh` (full suite) | 5-15 min | `/tmp/bench-full.log` | `Results written` | `Error` or `failed` |
-
-### Multi-Query Benchmark with Monitoring
-
-```bash
-# Benchmark multiple queries sequentially, monitoring each
-for Q in 31 32 38 41; do
-  LOG=/tmp/bench-q${Q}.log
-  nohup bash -c "cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash run/run_opensearch.sh --warmup 1 --num-tries 3 --query $Q --output-dir /tmp/q${Q} > $LOG 2>&1" &>/dev/null &
-  PID=$!
-  while kill -0 $PID 2>/dev/null; do sleep 3; tail -1 $LOG 2>/dev/null; done
-  echo "=== Q${Q} ==="
-  grep -E "Q[0-9]+ run" $LOG
-done
-```
+| Command | Est. Time | Completion Marker | Error Marker |
+|---------|-----------|-------------------|--------------|
+| `./gradlew :dqe:compileJava` | ~5s | `BUILD SUCCESSFUL` | `BUILD FAILED` |
+| `./gradlew :opensearch-sql-plugin:run -x test -x integTest` | ~45s startup | `started` in log | `BUILD FAILED` |
+| `omni_harness.sh --query N` | 30s-5min per query | `Done.` | `ERROR` |
+| `omni_harness.sh` (full 43 queries) | 20-60 min | `Done.` | `ERROR` |
````
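The marker columns above can drive a simple poll loop. A minimal sketch, assuming the markers listed in the table; `poll_log` is a hypothetical helper, not part of the repo:

```shell
# Hypothetical helper: poll a log until the completion or error marker appears.
# Usage: poll_log LOGFILE OK_MARKER ERR_MARKER [TIMEOUT_S] [INTERVAL_S]
poll_log() {
  local log="$1" ok="$2" err="$3" timeout="${4:-3600}" interval="${5:-30}" waited=0
  while (( waited < timeout )); do
    # Error marker wins: diagnose immediately instead of waiting out the timeout.
    if grep -q "$err" "$log" 2>/dev/null; then echo "FAILED"; return 1; fi
    if grep -q "$ok" "$log" 2>/dev/null; then echo "DONE"; return 0; fi
    sleep "$interval"; (( waited += interval ))
  done
  echo "TIMEOUT"; return 2
}
```

For a single-query harness run this would be called as, e.g., `poll_log /tmp/q33.log 'Done.' 'ERROR' 1800 60`.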

````diff
 ### Kill All Benchmarks
 
 ```bash
-pkill -f "run_opensearch.sh"; pkill -f "run_all.sh"
+pkill -f "omni_harness"; pkill -f "run_opensearch.sh"; pkill -f "run_all.sh"
 ```
 
 ## Query Numbering (CRITICAL)
 
 | Context | Indexing | "Q17" means |
 |---------|----------|-------------|
-| `--query N` in scripts | 1-based | `--query 18` for Q17 |
-| `queries_trino.sql` line | 1-based | line 18 for Q17 |
-| JSON `result[N]` | 0-based | `result[17]` for Q17 |
+| `--query N` in scripts | 1-based | `--query 17` for Q17 |
+| `queries_iceberg.sql` line | 1-based | line 17 for Q17 |
+| JSON `result[N]` | 0-based | `result[16]` for Q17 |
 
 **Mnemonic**: scripts and SQL are 1-based, JSON is 0-based.
````
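Per the updated table, the only off-by-one is the JSON index. A tiny hypothetical helper (not in the repo) makes the mapping explicit:

```shell
# Hypothetical helper: map ClickBench query Qn to each index space.
# Scripts and SQL-file lines are 1-based; the JSON result[] array is 0-based.
q_index() {
  local n="$1"
  printf '%s\n' "--query $n, SQL line $n, result[$(( n - 1 ))]"
}
```

For example, `q_index 17` prints `--query 17, SQL line 17, result[16]`.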

````diff
@@ -121,12 +106,10 @@ SQL query → TransportTrinoSqlAction (coordinator)
 
 Key difference from OpenSearch path: no DSL filters, no DocValues — reads Parquet directly via `ParquetPageSource`. Predicate pushdown is NOT applied (Iceberg uses `optimizeForIceberg()` which skips DSL conversion). AVG is decomposed into SUM+COUNT via `decomposeAvgInPlanTree()`.
````
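The AVG decomposition exists because partial averages do not merge correctly across splits, while partial sums and counts do. A minimal sketch of the idea (not the actual `decomposeAvgInPlanTree()` code; `merge_avg` is hypothetical):

```shell
# Sketch: why AVG decomposes into SUM+COUNT. Each worker reports
# "partial_sum:partial_count"; the coordinator adds both and divides once.
merge_avg() {
  local sum=0 count=0 pair
  for pair in "$@"; do
    sum=$(( sum + ${pair%%:*} ))      # sum of partial sums
    count=$(( count + ${pair##*:} ))  # sum of partial counts
  done
  echo $(( sum / count ))             # integer division, for the sketch
}
```

E.g. `merge_avg 10:2 20:3` computes (10+20)/(2+3) = 6, whereas averaging the per-split averages (5 and ~6.7) would give the wrong answer.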

````diff
-### Benchmark Suite
-
-Location: `benchmarks/clickbench/benchmark_dqe_iceberg/`
+### Benchmark Suite Layout
 
 ```
-benchmark_dqe_iceberg/
+benchmarks/clickbench/benchmark_dqe_iceberg/
 ├── golden/          # 43 expected results (100M dataset, TSV)
 │   ├── q01.expected ... q43.expected
 ├── baseline/
````
````diff
@@ -135,53 +118,95 @@ benchmark_dqe_iceberg/
 │   ├── correctness_report.txt
 │   ├── <instance>.json
 │   └── comparison.txt
-└── omni_harness.sh  # Main harness script
+└── omni_harness.sh  # Main harness script — USE THIS FOR EVERYTHING
 ```
 
-### Harness Usage
+### omni_harness.sh — The One Script
+
+**Always use `omni_harness.sh` for correctness checks, performance benchmarks, and comparisons.** Do NOT write custom curl loops or one-off scripts.
+
+#### Options
+
+| Flag | Description | Default |
+|------|-------------|---------|
+| `--query N` | Run only query N (1-based) | all 43 |
+| `--tries N` | Runs per query for timing | 3 |
+| `--timeout N` | Per-query timeout in seconds | 120 |
+| `--skip-correctness` | Skip correctness, perf only | false |
+| `--correctness-only` | Correctness only, skip perf | false |
+
+#### Modes
+
+| Mode | What It Does | When to Use |
+|------|-------------|-------------|
+| Default (no flags) | Combined: correctness on run 1 + timing on all N runs, then Trino comparison | **Standard dev loop — use this** |
+| `--correctness-only` | Single run, check against golden files | Quick correctness gate |
+| `--skip-correctness` | N timing runs only, no correctness check | Re-benchmarking known-correct queries |
+
+**The default combined mode is the most efficient** — it checks correctness on the first run's response and records timing for all runs in a single pass. No redundant query execution.
````
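The single-pass combined mode can be sketched as follows. This is a hypothetical outline of the idea, not the harness's actual code; `combined_run` and its arguments are assumptions:

```shell
# Sketch of combined mode: N timed runs, with correctness checked only on
# run 1's output, so the correctness gate costs no extra query execution.
combined_run() {
  local tries="$1" expected="$2"; shift 2
  local i out t0 t1
  for (( i = 1; i <= tries; i++ )); do
    t0=$SECONDS
    out=$("$@")                 # the query command under test
    t1=$SECONDS
    echo "run $i: $(( t1 - t0 ))s"
    if (( i == 1 )); then       # reuse run 1's response for the correctness check
      [ "$out" = "$expected" ] && echo "PASS" || echo "FAIL"
    fi
  done
}
```

E.g. `combined_run 3 "$(cat golden/q33.expected)" <query command>` would emit three timings plus one PASS/FAIL line.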
````diff
+#### Examples
 
 ```bash
 cd benchmarks/clickbench
 
-# Full run: correctness → perf (3 runs) → Trino comparison
-bash benchmark_dqe_iceberg/omni_harness.sh
+# Single query: correctness + 3 perf runs + Trino comparison
+bash benchmark_dqe_iceberg/omni_harness.sh --query 33 --timeout 300
 
-# Correctness only
-bash benchmark_dqe_iceberg/omni_harness.sh --correctness-only
+# Full suite: all 43 queries, correctness + perf
+bash benchmark_dqe_iceberg/omni_harness.sh --timeout 300
 
-# Single query
-bash benchmark_dqe_iceberg/omni_harness.sh --query 3
+# Quick correctness check on one query
+bash benchmark_dqe_iceberg/omni_harness.sh --query 33 --correctness-only --timeout 300
 
-# Perf only, 5 runs, 180s timeout
-bash benchmark_dqe_iceberg/omni_harness.sh --skip-correctness --tries 5 --timeout 180
-```
+# Perf only, 5 runs, for stable timing
+bash benchmark_dqe_iceberg/omni_harness.sh --query 33 --skip-correctness --tries 5 --timeout 300
 
-### Harness Long-Running Commands
+# Range of queries (run sequentially via loop)
+for q in 32 33 34 35 36; do
+  bash benchmark_dqe_iceberg/omni_harness.sh --query $q --timeout 300
+done
+```
 
-| Command | Est. Time | Output File | Completion Marker |
-|---------|-----------|-------------|-------------------|
-| `omni_harness.sh --correctness-only` | 5-15 min | stdout | `Correctness:` |
-| `omni_harness.sh --skip-correctness` | 15-30 min | stdout | `Results written` |
-| `omni_harness.sh` (full) | 20-45 min | stdout | `Done.` |
+#### Async Execution (MANDATORY for agents)
 
-All harness runs MUST follow the async execution pattern:
 ```bash
-nohup bash -c 'cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash benchmark_dqe_iceberg/omni_harness.sh > /tmp/harness.log 2>&1' &>/dev/null &
+# Launch in background
+nohup bash -c 'cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash benchmark_dqe_iceberg/omni_harness.sh --query 33 --timeout 300 > /tmp/q33.log 2>&1' &>/dev/null &
 echo "launched"
-# Poll:
-tail -5 /tmp/harness.log
+
+# Poll for results
+tail -5 /tmp/q33.log
+
+# Expected output when done:
+# [HH:MM:SS] Q33 run 1: 183.787s
+# [HH:MM:SS] Q33: PASS
+# [HH:MM:SS] Q33 run 2: 175.234s
+# [HH:MM:SS] Q33 run 3: 171.890s
+# [HH:MM:SS] Correctness: 1 pass, 0 fail, 0 error, 42 skip (out of 43)
+# [HH:MM:SS] Results written to results/r5.4xlarge.json
+# [HH:MM:SS] Done.
 ```
 
+#### Outputs
+
+| File | Content |
+|------|---------|
+| `results/correctness_report.txt` | Per-query PASS/FAIL/ERROR |
+| `results/<instance>.json` | Timing results in ClickBench JSON format |
+| `results/comparison.txt` | Side-by-side DQE vs Trino table |
+| `results/diff_qN.txt` | Diff for failed queries |
 
````
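To pull timings out of the results JSON, something like this works, assuming the standard ClickBench layout of `{"result": [[t1, t2, t3], ...]}` with `null` entries for failed runs; `best_times` is a hypothetical helper, not part of the harness:

```shell
# Hypothetical helper: print the best (min) time per query from a
# ClickBench-format results JSON, e.g. results/<instance>.json.
best_times() {
  python3 -c '
import json, sys
d = json.load(open(sys.argv[1]))
for i, runs in enumerate(d.get("result", []), start=1):
    ok = [t for t in runs if t is not None]
    print(f"Q{i}: {min(ok):.3f}s" if ok else f"Q{i}: FAILED")
' "$1"
}
```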
````diff
 ### Iceberg Table Setup
 
 ```bash
-# Create optimized Iceberg table (sorted by CounterID, EventDate, IsRefresh)
 python3 benchmarks/clickbench/data/create_iceberg_table_optimized.py \
   --data-dir benchmarks/clickbench/data/parquet \
   --warehouse /tmp/iceberg-warehouse
 ```
 
 Table properties: 128MB target file size, 32MB row groups, 64KB pages, ZSTD, 4MB dict.
+Data: sorted by (CounterID, EventDate, IsRefresh), 125 files, 10.1GB.
 
 ### Trino Baseline (Docker)
 
````
````diff
@@ -202,25 +227,52 @@ bash benchmarks/clickbench/run/run_trino_baseline.sh 3
 - Uses `iceberg.default.hits` catalog, DATE/TIMESTAMP types (not raw int)
 - EventDate is DATE type, EventTime is TIMESTAMP — converted during Iceberg table creation
 
+## Starting OpenSearch for Benchmarks
+
+**Always use gradle run for 32GB heap:**
+
+```bash
+# Kill any existing instance
+pkill -f "org.opensearch.bootstrap" 2>/dev/null || true; sleep 3
+
+# Start with 32GB heap
+nohup bash -c 'cd /local/home/penghuo/oss/os-sql && ./gradlew :opensearch-sql-plugin:run -x test -x integTest > /tmp/os-gradle-run.log 2>&1' &>/dev/null &
+
+# Wait for cluster green (~45s)
+for i in $(seq 1 24); do
+  sleep 5
+  STATUS=$(curl -s --max-time 3 "http://localhost:9200/_cluster/health" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin)['status'])" 2>/dev/null)
+  if [ "$STATUS" = "green" ]; then echo "Cluster green"; break; fi
+done
+
+# Verify heap
+curl -s "http://localhost:9200/_nodes/stats/jvm" | python3 -c "
+import json,sys
+d=json.load(sys.stdin)
+for nid,n in d.get('nodes',{}).items():
+    jvm=n.get('jvm',{}).get('mem',{})
+    print(f'heap: {jvm.get(\"heap_used_in_bytes\",0)/1024/1024/1024:.1f}GB / {jvm.get(\"heap_max_in_bytes\",0)/1024/1024/1024:.1f}GB')
+"
+```
+
+**Do NOT use `/tmp/os-cluster/node1/bin/opensearch` directly** — that starts with 1GB heap and will OOM on 100M dataset queries.
+
 ## Pitfalls
 
-- **NEVER** run `reload-plugin` while a benchmark is running
-- OpenSearch benchmark: 100M (`hits`), correctness on 1M (`hits_1m`)
+- **NEVER** run benchmarks while another benchmark is running
+- **NEVER** use 1GB heap for Iceberg benchmarks — always gradle run (32GB)
 - Iceberg benchmark: always 100M (`iceberg.default.hits`)
-- OpenSearch baseline: `results/performance/clickhouse_parquet_official/c6a.4xlarge.json`
 - Iceberg baseline: `benchmark_dqe_iceberg/baseline/trino_r5.4xlarge.json`
 - OpenSearch endpoint: `http://localhost:9200`, DQE: `POST /_plugins/_trino_sql`
+- `run_all.sh reload-plugin` requires `OS_INSTALL_DIR=/opt/opensearch` which may not exist — use gradle run instead
 
-## Current State (2026-04-08)
-
-### OpenSearch DQE
-- Correctness: 29/43 on 1M
-- Within 2x of CH-Parquet: 19/43 on r5.4xlarge
-- Hybrid bitset/collector optimization deployed
+## Current State (2026-04-09)
 
 ### DQE Iceberg
-- Correctness: 20/43 on 100M (Q1-Q18, Q20 pass; Q19 type error; Q21+ OOM on high-cardinality GROUP BY)
+- Correctness: ~20/43 on 100M
+- Parallel bucket aggregation deployed (CompletableFuture)
+- Inflated Sort+Limit (500K min) for ORDER BY agg LIMIT N
+- Double-dispatch eliminated in multi-pass aggregation fallback
 - Baseline: Trino 442 on Hive Parquet — 43/43, 363.2s total on r5.4xlarge
-- Key fixes applied: `optimizeForIceberg()` (skip predicate pushdown), AVG decomposition
-- Known issues: OOM on high-cardinality GROUP BY (Q19 extract(minute) type, Q21+ coordinator merge)
+- Known slow: Q33 (~184s vs Trino 26.8s), Q35 (>180s vs Trino 21.1s) — high-cardinality GROUP BY
 - Data: sorted by (CounterID, EventDate, IsRefresh), 125 files, 10.1GB, 32MB row groups
````
