@@ -40,9 +40,9 @@ SQL query → TransportTrinoSqlAction (coordinator)

1. Code change (edit Java files)
2. Compile: `./gradlew :dqe:compileJava`
- 3. Reload plugin — see Long-Running Task Rules
- 4. Correctness gate — MUST be >= 38/43. If regression, STOP and fix.
- 5. Benchmark target queries — see Long-Running Task Rules
+ 3. Restart OpenSearch: `./gradlew :opensearch-sql-plugin:run -x test -x integTest` (32GB heap)
+ 4. Benchmark target queries: `bash benchmark_dqe_iceberg/omni_harness.sh --query N --timeout 300`
+ 5. Full correctness+perf gate: `bash benchmark_dqe_iceberg/omni_harness.sh --timeout 300`

All steps 3-5 MUST follow the async execution pattern in Long-Running Task Rules.

@@ -55,55 +55,40 @@ Any command that may take longer than 2 minutes MUST be run asynchronously. This
1. **NEVER run long-running commands synchronously** — always background and poll.
2. **Launch in a subshell** so the parent shell returns immediately:
   ```bash
- nohup bash -c 'cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash run/run_all.sh reload-plugin > /tmp/reload.log 2>&1' &>/dev/null &
+ nohup bash -c 'cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash benchmark_dqe_iceberg/omni_harness.sh --query 33 --timeout 300 > /tmp/q33.log 2>&1' &>/dev/null &
   echo "launched"
   ```
**CRITICAL**: Plain `nohup cmd &` or `(cmd &)` does NOT work — the shell hangs waiting for the background process. You MUST use `nohup bash -c '...' &>/dev/null &`.
3. **Poll for completion** — check output tail for success/failure:
   ```bash
- tail -5 /tmp/reload.log
+ tail -5 /tmp/q33.log
   ```
- 4. **Poll interval**: every 10-30s for benchmarks, every 30-60s for builds.
+ 4. **Poll interval**: every 60-120s for single-query benchmarks, every 30s for builds.
5. **Analyze each poll result** — if ERROR/FAILURE appears in output, stop and diagnose immediately.
6. **Monitoring IS the task** — never launch a long-running command and then do something else.

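The poll step can be scripted. A minimal Python sketch (illustrative, not part of the repo) that classifies a background job by the tail of its log, using the completion and error markers listed in the command table (`Done.`/`ERROR` for the harness, `BUILD SUCCESSFUL`/`BUILD FAILED` for gradle):

```python
from pathlib import Path

# Markers emitted by omni_harness.sh and gradle (see the command table).
DONE_MARKERS = ("Done.", "BUILD SUCCESSFUL")
ERROR_MARKERS = ("ERROR", "BUILD FAILED")

def poll_log(path: str, tail_lines: int = 5) -> str:
    """Return 'done', 'error', or 'running' based on the log tail."""
    p = Path(path)
    if not p.exists():
        return "running"  # job has not started writing yet
    tail = p.read_text(errors="replace").splitlines()[-tail_lines:]
    if any(m in line for line in tail for m in ERROR_MARKERS):
        return "error"
    if any(m in line for line in tail for m in DONE_MARKERS):
        return "done"
    return "running"
```

An agent loop would call `poll_log` at the intervals above and stop on anything other than `running`.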
### Common Long-Running Commands

- | Command | Est. Time | Output File | Completion Marker | Error Marker |
- |---------|-----------|-------------|-------------------|--------------|
- | `./gradlew :dqe:compileJava` | ~5s | `/tmp/compile.log` | `BUILD SUCCESSFUL` | `BUILD FAILED` |
- | `run_all.sh reload-plugin` | 2-3 min | `/tmp/reload.log` | `reloaded successfully` | `FAILED` or `Error` |
- | `run_all.sh correctness` | ~2 min | `/tmp/correctness.log` | `Summary:` | `Error` |
- | `run_opensearch.sh --query N` | ~1 min | `/tmp/bench-qN.log` | `Results written` | `Error` or `failed` |
- | `run_opensearch.sh` (full suite) | 5-15 min | `/tmp/bench-full.log` | `Results written` | `Error` or `failed` |
-
- ### Multi-Query Benchmark with Monitoring
-
- ```bash
- # Benchmark multiple queries sequentially, monitoring each
- for Q in 31 32 38 41; do
-   LOG=/tmp/bench-q${Q}.log
-   nohup bash -c "cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash run/run_opensearch.sh --warmup 1 --num-tries 3 --query $Q --output-dir /tmp/q${Q} > $LOG 2>&1" &>/dev/null &
-   PID=$!
-   while kill -0 $PID 2>/dev/null; do sleep 3; tail -1 $LOG 2>/dev/null; done
-   echo "=== Q${Q} ==="
-   grep -E "Q[0-9]+ run" $LOG
- done
- ```
+ | Command | Est. Time | Completion Marker | Error Marker |
+ |---------|-----------|-------------------|--------------|
+ | `./gradlew :dqe:compileJava` | ~5s | `BUILD SUCCESSFUL` | `BUILD FAILED` |
+ | `./gradlew :opensearch-sql-plugin:run -x test -x integTest` | ~45s startup | `started` in log | `BUILD FAILED` |
+ | `omni_harness.sh --query N` | 30s-5min per query | `Done.` | `ERROR` |
+ | `omni_harness.sh` (full 43 queries) | 20-60 min | `Done.` | `ERROR` |

### Kill All Benchmarks

```bash
- pkill -f "run_opensearch.sh"; pkill -f "run_all.sh"
+ pkill -f "omni_harness"; pkill -f "run_opensearch.sh"; pkill -f "run_all.sh"
```

## Query Numbering (CRITICAL)

| Context | Indexing | "Q17" means |
|---------|----------|-------------|
- | `--query N` in scripts | 1-based | `--query 18` for Q17 |
- | `queries_trino.sql` line | 1-based | line 18 for Q17 |
- | JSON `result[N]` | 0-based | `result[17]` for Q17 |
+ | `--query N` in scripts | 1-based | `--query 17` for Q17 |
+ | `queries_iceberg.sql` line | 1-based | line 17 for Q17 |
+ | JSON `result[N]` | 0-based | `result[16]` for Q17 |

**Mnemonic**: scripts and SQL are 1-based, JSON is 0-based.

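The mnemonic can be made executable. A small illustrative helper (not part of the repo) mapping a ClickBench query number to the index each context expects:

```python
def q_indices(q: int) -> dict:
    """Map query Q<q> (1..43) to the index each context expects."""
    if not 1 <= q <= 43:
        raise ValueError("ClickBench queries are Q1..Q43")
    return {
        "script_flag": f"--query {q}",  # scripts: 1-based
        "sql_line": q,                  # queries file: 1-based
        "json_index": q - 1,            # JSON result[N]: 0-based
    }
```

For Q17 this yields `--query 17`, SQL line 17, and `result[16]`.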
@@ -121,12 +106,10 @@ SQL query → TransportTrinoSqlAction (coordinator)

Key difference from OpenSearch path: no DSL filters, no DocValues — reads Parquet directly via `ParquetPageSource`. Predicate pushdown is NOT applied (Iceberg uses `optimizeForIceberg()` which skips DSL conversion). AVG is decomposed into SUM+COUNT via `decomposeAvgInPlanTree()`.

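The reason AVG must be decomposed: AVG is not mergeable across splits, while SUM and COUNT are. A plain-Python illustration of the idea (not the actual `decomposeAvgInPlanTree()` code):

```python
# Three workers each hold one split of the data.
splits = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Each split reports a partial (SUM, COUNT) instead of a local AVG.
partials = [(sum(s), len(s)) for s in splits]
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
global_avg = total_sum / total_count  # 21.0 / 6 = 3.5, the correct AVG

# Merging local averages instead would be wrong:
wrong = sum(sum(s) / len(s) for s in splits) / len(splits)  # (2.0 + 4.5 + 6.0) / 3
```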
- ### Benchmark Suite
-
- Location: `benchmarks/clickbench/benchmark_dqe_iceberg/`
+ ### Benchmark Suite Layout

```
- benchmark_dqe_iceberg/
+ benchmarks/clickbench/benchmark_dqe_iceberg/
├── golden/                  # 43 expected results (100M dataset, TSV)
│   ├── q01.expected ... q43.expected
├── baseline/
@@ -135,53 +118,95 @@ benchmark_dqe_iceberg/
│   ├── correctness_report.txt
│   ├── <instance>.json
│   └── comparison.txt
- └── omni_harness.sh          # Main harness script
+ └── omni_harness.sh          # Main harness script — USE THIS FOR EVERYTHING
```

- ### Harness Usage
+ ### omni_harness.sh — The One Script
+
+ **Always use `omni_harness.sh` for correctness checks, performance benchmarks, and comparisons.** Do NOT write custom curl loops or one-off scripts.
+
+ #### Options
+
+ | Flag | Description | Default |
+ |------|-------------|---------|
+ | `--query N` | Run only query N (1-based) | all 43 |
+ | `--tries N` | Runs per query for timing | 3 |
+ | `--timeout N` | Per-query timeout in seconds | 120 |
+ | `--skip-correctness` | Skip correctness, perf only | false |
+ | `--correctness-only` | Correctness only, skip perf | false |
+
+ #### Modes
+
+ | Mode | What It Does | When to Use |
+ |------|--------------|-------------|
+ | Default (no flags) | Combined: correctness on run 1 + timing on all N runs, then Trino comparison | **Standard dev loop — use this** |
+ | `--correctness-only` | Single run, check against golden files | Quick correctness gate |
+ | `--skip-correctness` | N timing runs only, no correctness check | Re-benchmarking known-correct queries |
+
+ **The default combined mode is the most efficient** — it checks correctness on the first run's response and records timing for all runs in a single pass. No redundant query execution.
+
+ #### Examples

```bash
cd benchmarks/clickbench

- # Full run: correctness → perf (3 runs) → Trino comparison
- bash benchmark_dqe_iceberg/omni_harness.sh
+ # Single query: correctness + 3 perf runs + Trino comparison
+ bash benchmark_dqe_iceberg/omni_harness.sh --query 33 --timeout 300

- # Correctness only
- bash benchmark_dqe_iceberg/omni_harness.sh --correctness-only
+ # Full suite: all 43 queries, correctness + perf
+ bash benchmark_dqe_iceberg/omni_harness.sh --timeout 300

- # Single query
- bash benchmark_dqe_iceberg/omni_harness.sh --query 3
+ # Quick correctness check on one query
+ bash benchmark_dqe_iceberg/omni_harness.sh --query 33 --correctness-only --timeout 300

- # Perf only, 5 runs, 180s timeout
- bash benchmark_dqe_iceberg/omni_harness.sh --skip-correctness --tries 5 --timeout 180
- ```
+ # Perf only, 5 runs, for stable timing
+ bash benchmark_dqe_iceberg/omni_harness.sh --query 33 --skip-correctness --tries 5 --timeout 300

- ### Harness Long-Running Commands
+ # Range of queries (run sequentially via loop)
+ for q in 32 33 34 35 36; do
+   bash benchmark_dqe_iceberg/omni_harness.sh --query $q --timeout 300
+ done
+ ```

- | Command | Est. Time | Output File | Completion Marker |
- |---------|-----------|-------------|-------------------|
- | `omni_harness.sh --correctness-only` | 5-15 min | stdout | `Correctness:` |
- | `omni_harness.sh --skip-correctness` | 15-30 min | stdout | `Results written` |
- | `omni_harness.sh` (full) | 20-45 min | stdout | `Done.` |
+ #### Async Execution (MANDATORY for agents)

- All harness runs MUST follow the async execution pattern:
```bash
- nohup bash -c 'cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash benchmark_dqe_iceberg/omni_harness.sh > /tmp/harness.log 2>&1' &>/dev/null &
+ # Launch in background
+ nohup bash -c 'cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash benchmark_dqe_iceberg/omni_harness.sh --query 33 --timeout 300 > /tmp/q33.log 2>&1' &>/dev/null &
echo "launched"
- # Poll:
- tail -5 /tmp/harness.log
+
+ # Poll for results
+ tail -5 /tmp/q33.log
+
+ # Expected output when done:
+ # [HH:MM:SS] Q33 run 1: 183.787s
+ # [HH:MM:SS] Q33: PASS
+ # [HH:MM:SS] Q33 run 2: 175.234s
+ # [HH:MM:SS] Q33 run 3: 171.890s
+ # [HH:MM:SS] Correctness: 1 pass, 0 fail, 0 error, 42 skip (out of 43)
+ # [HH:MM:SS] Results written to results/r5.4xlarge.json
+ # [HH:MM:SS] Done.
```

+ #### Outputs
+
+ | File | Content |
+ |------|---------|
+ | `results/correctness_report.txt` | Per-query PASS/FAIL/ERROR |
+ | `results/<instance>.json` | Timing results in ClickBench JSON format |
+ | `results/comparison.txt` | Side-by-side DQE vs Trino table |
+ | `results/diff_qN.txt` | Diff for failed queries |
+
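The timing JSON can then be read programmatically. Assuming the standard ClickBench result shape (a `result` array of per-query run-time lists, `null` for failed runs), a sketch with illustrative numbers:

```python
import json

def best_times(text: str) -> list:
    """Best run time per query from ClickBench-format JSON; None where all runs failed."""
    data = json.loads(text)
    out = []
    for runs in data["result"]:  # result[N] is 0-based: result[16] is Q17
        ok = [t for t in runs if t is not None]
        out.append(min(ok) if ok else None)
    return out

# Illustrative payload, not real benchmark numbers.
sample = '{"system": "DQE Iceberg", "result": [[1.2, 1.0, 1.1], [null, null, null]]}'
```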
### Iceberg Table Setup

```bash
- # Create optimized Iceberg table (sorted by CounterID, EventDate, IsRefresh)
python3 benchmarks/clickbench/data/create_iceberg_table_optimized.py \
    --data-dir benchmarks/clickbench/data/parquet \
    --warehouse /tmp/iceberg-warehouse
```

Table properties: 128MB target file size, 32MB row groups, 64KB pages, ZSTD, 4MB dict.
+ Data: sorted by (CounterID, EventDate, IsRefresh), 125 files, 10.1GB.

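For reference, the tuning above presumably corresponds to Iceberg's documented table property keys; the key names below are the standard Iceberg ones, but the mapping to this particular script is an assumption:

```python
# Assumed Iceberg write properties matching "128MB target file size,
# 32MB row groups, 64KB pages, ZSTD, 4MB dict" (key names from Iceberg docs).
ICEBERG_WRITE_PROPERTIES = {
    "write.target-file-size-bytes": 128 * 1024 * 1024,
    "write.parquet.row-group-size-bytes": 32 * 1024 * 1024,
    "write.parquet.page-size-bytes": 64 * 1024,
    "write.parquet.compression-codec": "zstd",
    "write.parquet.dict-size-bytes": 4 * 1024 * 1024,
}
```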
### Trino Baseline (Docker)

@@ -202,25 +227,52 @@ bash benchmarks/clickbench/run/run_trino_baseline.sh 3
- Uses `iceberg.default.hits` catalog, DATE/TIMESTAMP types (not raw int)
- EventDate is DATE type, EventTime is TIMESTAMP — converted during Iceberg table creation

+ ## Starting OpenSearch for Benchmarks
+
+ **Always use gradle run for 32GB heap:**
+
+ ```bash
+ # Kill any existing instance
+ pkill -f "org.opensearch.bootstrap" 2>/dev/null || true; sleep 3
+
+ # Start with 32GB heap
+ nohup bash -c 'cd /local/home/penghuo/oss/os-sql && ./gradlew :opensearch-sql-plugin:run -x test -x integTest > /tmp/os-gradle-run.log 2>&1' &>/dev/null &
+
+ # Wait for cluster green (~45s)
+ for i in $(seq 1 24); do
+   sleep 5
+   STATUS=$(curl -s --max-time 3 "http://localhost:9200/_cluster/health" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin)['status'])" 2>/dev/null)
+   if [ "$STATUS" = "green" ]; then echo "Cluster green"; break; fi
+ done
+
+ # Verify heap
+ curl -s "http://localhost:9200/_nodes/stats/jvm" | python3 -c "
+ import json,sys
+ d=json.load(sys.stdin)
+ for nid,n in d.get('nodes',{}).items():
+     jvm=n.get('jvm',{}).get('mem',{})
+     print(f'heap: {jvm.get(\"heap_used_in_bytes\",0)/1024/1024/1024:.1f}GB / {jvm.get(\"heap_max_in_bytes\",0)/1024/1024/1024:.1f}GB')
+ "
+ ```
+
+ **Do NOT use `/tmp/os-cluster/node1/bin/opensearch` directly** — that starts with 1GB heap and will OOM on 100M dataset queries.
+
## Pitfalls

- - **NEVER** run `reload-plugin` while a benchmark is running
- - OpenSearch benchmark: 100M (`hits`), correctness on 1M (`hits_1m`)
+ - **NEVER** run benchmarks while another benchmark is running
+ - **NEVER** use 1GB heap for Iceberg benchmarks — always gradle run (32GB)
- Iceberg benchmark: always 100M (`iceberg.default.hits`)
- - OpenSearch baseline: `results/performance/clickhouse_parquet_official/c6a.4xlarge.json`
- Iceberg baseline: `benchmark_dqe_iceberg/baseline/trino_r5.4xlarge.json`
- OpenSearch endpoint: `http://localhost:9200`, DQE: `POST /_plugins/_trino_sql`
+ - `run_all.sh reload-plugin` requires `OS_INSTALL_DIR=/opt/opensearch` which may not exist — use gradle run instead

- ## Current State (2026-04-08)
-
- ### OpenSearch DQE
- - Correctness: 29/43 on 1M
- - Within 2x of CH-Parquet: 19/43 on r5.4xlarge
- - Hybrid bitset/collector optimization deployed
+ ## Current State (2026-04-09)

### DQE Iceberg
- - Correctness: 20/43 on 100M (Q1-Q18, Q20 pass; Q19 type error; Q21+ OOM on high-cardinality GROUP BY)
+ - Correctness: ~20/43 on 100M
+ - Parallel bucket aggregation deployed (CompletableFuture)
+ - Inflated Sort+Limit (500K min) for ORDER BY agg LIMIT N
+ - Double-dispatch eliminated in multi-pass aggregation fallback
- Baseline: Trino 442 on Hive Parquet — 43/43, 363.2s total on r5.4xlarge
- - Key fixes applied: `optimizeForIceberg()` (skip predicate pushdown), AVG decomposition
- - Known issues: OOM on high-cardinality GROUP BY (Q19 extract(minute) type, Q21+ coordinator merge)
+ - Known slow: Q33 (~184s vs Trino 26.8s), Q35 (>180s vs Trino 21.1s) — high-cardinality GROUP BY
- Data: sorted by (CounterID, EventDate, IsRefresh), 125 files, 10.1GB, 32MB row groups