feat: SoA binary streaming output for high-throughput pipelines by filiprumenovski · Pull Request #206 · CompOmics/ThermoRawFileParser

filiprumenovski · 2026-05-10T10:08:18Z

What

Adds --format BinarySoa (4) — a binary output that emits each scan as [128-byte header][filter string][f64 mz array][f32 intensity array][optional trailer dump]. Format is documented in BINARY_SOA_FORMAT.md.

Both --stdout and --output produce the same byte format.

Why

For pipelines that consume TRFP output and don't need mzML's portability, the XML round-trip is meaningful overhead. This format skips it — downstream consumers cast the bytes into native arrays directly.
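As a consumer-side illustration of "casting the bytes into native arrays", here is a minimal Python sketch. It assumes the peak count is already known from the scan header (its field offset is defined in BINARY_SOA_FORMAT.md and not reproduced here), that both arrays are little-endian, and that no padding sits between them. `decode_peaks` is a hypothetical helper name, not part of this PR.

```python
import struct
import sys
from array import array


def decode_peaks(buf: bytes, offset: int, n_peaks: int):
    """Decode n_peaks f64 m/z values followed by n_peaks f32 intensities.

    Returns (mz, intensity, end_offset). Assumes little-endian data and
    no padding between the two arrays (both are assumptions of this sketch).
    """
    mz_end = offset + 8 * n_peaks
    int_end = mz_end + 4 * n_peaks
    if int_end > len(buf):
        raise ValueError("buffer too short for the requested peak count")

    mz = array("d")
    mz.frombytes(buf[offset:mz_end])
    intensity = array("f")
    intensity.frombytes(buf[mz_end:int_end])

    # array() uses native byte order, so swap on big-endian hosts.
    if sys.byteorder == "big":
        mz.byteswap()
        intensity.byteswap()
    return mz, intensity, int_end


# Synthetic payload standing in for the peak section of one scan record.
payload = struct.pack("<3d", 100.5, 200.25, 300.125) + struct.pack("<3f", 1.0, 2.5, 4.0)
mz, intensity, end = decode_peaks(payload, 0, 3)
```

No per-value text parsing is involved: each array is filled with a single bulk copy.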

Implementation notes

- Output wrapped in a 1 MB BufferedStream. On a 3.7 GB Orbitrap DDA file this dropped sys time from ~72s to ~3s vs unbuffered.
- mz array goes out as a span over the existing double[], no copy.
- Intensity narrowing (f64→f32) uses ArrayPool<float>.Shared and a tight loop the JIT vectorizes.
- 128-byte fixed scalar header per scan is built into a reusable buffer with inline little-endian writers.
- The optional trailer dump captures every key/value pair from ILogEntryAccess verbatim, so consumers can pick out instrument-specific fields (lock-mass calibration, AGC, conversion params, etc.) without us having to enumerate them.
- EThcD detection: uses FindLastReaction + SupplementalActivation == TriState.On to tag the spectrum as 5 (EThcD) rather than the supplemental HCD/CID's type.
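The writer-side ideas above (a large output buffer, a reusable fixed-size header, f64→f32 narrowing) can be sketched in Python as follows. This is not the C# implementation from the PR: the header field offsets are hypothetical placeholders, the filter string and trailer dump are omitted, and `struct.pack` copies the m/z values rather than writing them zero-copy.

```python
import io
import struct

HEADER_SIZE = 128                 # fixed per-scan header size stated in the PR
_ZERO = bytes(HEADER_SIZE)
_header = bytearray(HEADER_SIZE)  # allocated once, reused for every scan


def write_scan(out, scan_number, rt, mz, intensity):
    """Write one simplified record: header, f64 m/z array, f32 intensity array."""
    _header[:] = _ZERO  # clear fields left over from the previous scan
    # Hypothetical layout: u32 scan number @0, u32 peak count @4, f64 rt @8.
    struct.pack_into("<IId", _header, 0, scan_number, len(mz), rt)
    out.write(_header)
    out.write(struct.pack(f"<{len(mz)}d", *mz))
    # The "f" format narrows each Python float (f64) to f32.
    out.write(struct.pack(f"<{len(intensity)}f", *intensity))


sink = io.BytesIO()
# A 1 MB buffer coalesces many small writes into few large ones.
out = io.BufferedWriter(sink, buffer_size=1 << 20)
write_scan(out, 1, 0.5, [100.5, 200.25], [10.0, 20.0])
out.flush()
record = sink.getvalue()
```

Reusing one header buffer avoids a per-scan allocation, and the buffered writer plays the role that the 1 MB BufferedStream plays in the PR.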

Compatibility

Strictly additive:

- New OutputFormat.BinarySoa = 4 enum value before None. Existing format paths untouched.
- SpectrumWriter.ConfigureWriter adds a branch for binary destinations (raw FileStream, no text-encoding wrapper). --gzip works on file output.
- CLI -f/--format help updated. Numeric and case-insensitive name parsing inherits the existing ParseToEnum helper, so --format 4, --format BinarySoa, --format binarysoa all work.
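To make the accepted spellings concrete, here is a small Python sketch of numeric plus case-insensitive name parsing. It only mirrors the behaviour described above and is not the ParseToEnum helper itself. The numeric values follow the updated help text in this PR; the lowercase name keys are assumptions about the enum member names.

```python
# Values per the updated CLI help text; name spellings are assumed.
FORMATS = {
    "mgf": 0,
    "mzml": 1,
    "indexmzml": 2,
    "parquet": 3,
    "binarysoa": 4,
    "none": 5,
}


def parse_format(text: str) -> int:
    """Accept either a numeric value or a case-insensitive format name."""
    token = text.strip().lower()
    if token.isdecimal():
        value = int(token)
        if value in FORMATS.values():
            return value
        raise ValueError(f"unknown format number: {text!r}")
    try:
        return FORMATS[token]
    except KeyError:
        raise ValueError(f"unknown format name: {text!r}") from None
```

With this table, `parse_format("4")`, `parse_format("BinarySoa")` and `parse_format("binarysoa")` all resolve to the same value.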

Tested

Built clean on .NET 8 (macOS arm64 dev with Rosetta'd x64 dotnet for the Thermo DLL; CI Ubuntu should be unaffected).

Smoke-tested on:

- bundled Data/small.RAW (48 spectra, FTMS+ITMS hybrid)
- bundled Data/small2.RAW (95 spectra, ETD-capable instrument with reagent-ion trailer fields)
- a 143,136-spectrum / ~60M-peak Orbitrap DDA file from PXD028735 (0 errors, 0 warnings)

--stdout and --output produce byte-exact identical streams in all three cases.
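One way to reproduce a byte-exact check like this is to hash both streams in chunks and compare the digests. The sketch below assumes the --stdout stream was captured to a file beforehand. `stream_digest` is a hypothetical helper and the in-memory streams are stand-ins for the two captured outputs.

```python
import hashlib
import io


def stream_digest(stream, chunk_size=1 << 20) -> str:
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


# Stand-ins for open(path, "rb") on the --output file and the captured --stdout stream.
from_file = io.BytesIO(b"RCIASTR1" + bytes(24))
from_pipe = io.BytesIO(b"RCIASTR1" + bytes(24))
identical = stream_digest(from_file) == stream_digest(from_pipe)
```

Chunked hashing keeps memory use flat even for multi-gigabyte outputs.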

Happy to add a WriterTests entry in the same shape as the MzML/Parquet ones if you'd prefer the test be in-tree.

…mers

Adds a new --format=BinarySoa (4) output that emits each scan as a self-describing binary record laid out as Structure-of-Arrays (mz f64, intensity f32). Designed for downstream pipelines (Rust engines, GPU rescorers, columnar database loaders) that prefer zero-copy ingestion over portable XML.

The format is fully documented in BINARY_SOA_FORMAT.md and consists of:
- 32-byte file header (magic "RCIASTR1", format_version, flags)
- per-spectrum records with a 128-byte fixed scalar header capturing every commonly-needed field (rt, precursor mz, isolation window, collision energy, FAIMS CV, ion injection time, base peak, TIC, low/high mass, charge, master scan, activation type, ...) with graceful nullability via NaN floats and -1 int sentinels
- an optional verbatim trailer key/value dump preserving every per-scan vendor-reported field (AGC target, conversion parameters, lock-mass calibration, etc.) without selective filtering
- SoA peak arrays (f64 mz, then f32 intensity), naturally aligned
- u32 = 0 EOF marker

Both --stdout and --output produce the identical byte format, so a file written with --output can be played back through the same downstream consumer that reads from a streaming pipe.

Performance notes from a 3.7 GB Orbitrap DDA benchmark (143k spectra, 60M peaks):
- Output is wrapped in a 1 MB BufferedStream to coalesce small writes into few large pipe syscalls (sys time dropped ~22x in measurement vs naive per-element writes)
- mz array emitted via zero-copy MemoryMarshal.AsBytes over the existing double[]
- intensity narrowing (f64->f32) uses ArrayPool<float>.Shared and a tight loop the JIT auto-vectorizes
- per-spectrum header is built into a reusable 128-byte buffer with inline little-endian writers (no BinaryWriter virtual calls)
- metadata block built into a reusable MemoryStream that's reset (not freed) between scans

Activation type encoding handles EThcD correctly: when the instrument reports SupplementalActivation == TriState.On AND the primary reaction is ETD/ECD followed by HCD/CID, the encoded byte is 5 (EThcD) rather than the supplemental's HCD/CID value.

Compatibility:
- Additive: new OutputFormat.BinarySoa = 4 enum value, existing formats (MGF, mzML, IndexMzML, Parquet) untouched
- SpectrumWriter.ConfigureWriter handles BinarySoa as a binary destination (no text-encoded StreamWriter wrapper, optional gzip via --gzip)
- CLI help text updated to document the new format
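For a consumer, two pieces of the description above translate directly into code: checking the file magic and mapping the sentinel values back to "missing". The Python sketch below assumes the magic occupies the first 8 bytes of the 32-byte file header. The widths and offsets of format_version and flags are defined in BINARY_SOA_FORMAT.md and are deliberately not decoded here. All function names are hypothetical.

```python
import math

MAGIC = b"RCIASTR1"
FILE_HEADER_SIZE = 32


def check_file_header(header: bytes) -> bytes:
    """Validate size and magic; return the remaining (undecoded) header bytes."""
    if len(header) != FILE_HEADER_SIZE:
        raise ValueError(f"expected {FILE_HEADER_SIZE} header bytes, got {len(header)}")
    if header[: len(MAGIC)] != MAGIC:
        raise ValueError("bad magic, not a BinarySoa stream")
    return header[len(MAGIC):]


def optional_float(value: float):
    """Map the NaN sentinel to None."""
    return None if math.isnan(value) else value


def optional_int(value: int):
    """Map the -1 sentinel to None."""
    return None if value == -1 else value
```

Note that NaN never compares equal to itself, so the float sentinel has to be tested with `math.isnan` rather than `==`.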

ypriverol · 2026-05-11T13:49:51Z

@filiprumenovski can this be aligned with the Parquet representation, at least the column names?

caetera · 2026-05-11T13:56:01Z

+            precursorMz = CalculateSelectedIonMz(reaction, monoisotopicMz, isolationWidthTrailer);
+            collisionEnergy = (float)reaction.CollisionEnergy;
+
+            double iw = isolationWidthTrailer ?? reaction.IsolationWidth;


The isolation window offset needs to be implemented to make the output similar to the other formats.

caetera · 2026-05-11T14:06:17Z

+            int filterLen = filterBytes.Length;
+            if (filterLen > MaxFilterStringLen)
+            {
+                Log.Warn($"Filter string for scan {scanNumber} truncated from {filterLen} to {MaxFilterStringLen} bytes");


In general, issued warnings affect the error code through parseInput.NewWarn(). Here the warning is "silenced". Is that intended?

caetera · 2026-05-11T14:09:43Z

+| 2     | HCD (Higher-Energy Collisional Dissociation) |
+| 3     | ETD (Electron Transfer Dissociation) |
+| 4     | ECD (Electron Capture Dissociation) |
+| 5     | EThcD (ETD + HCD supplemental) |


Is it planned to support ETciD?
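For context on this question, the encoding rule as described in the PR text can be sketched in Python as below. Only the codes visible in the excerpt above are included, and codes for other activation types are deliberately left out. The function name and the list-of-names input are hypothetical. Following the commit message wording, a CID supplemental after ETD/ECD also maps to 5 in this sketch, so ETciD gets no distinct value.

```python
# Codes taken from the table excerpt above; other codes are not shown there.
CODES = {"HCD": 2, "ETD": 3, "ECD": 4, "EThcD": 5}


def encode_activation(reactions, supplemental_on: bool) -> int:
    """Encode the activation type for a scan.

    reactions: activation names in acquisition order, e.g. ["ETD", "HCD"].
    supplemental_on: True when supplemental activation is reported as on.
    """
    last = reactions[-1]
    primary = reactions[-2] if len(reactions) >= 2 else None
    if supplemental_on and primary in ("ETD", "ECD") and last in ("HCD", "CID"):
        # Tag the combined method rather than the supplemental step's own type.
        return CODES["EThcD"]
    if last not in CODES:
        raise KeyError(f"no code for {last!r} in this excerpt")
    return CODES[last]
```

A separate ETciD code would need one more branch that distinguishes a CID supplemental step from an HCD one.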

caetera · 2026-05-11T14:15:23Z

This changes the code for None. Since we are doing a major version update now, I think it is justified.

caetera · 2026-05-11T14:16:13Z

                {
                    "f=|format=",
-                    "The spectra output format: 0 for MGF, 1 for mzML, 2 for indexed mzML, 3 for Parquet, 4 for None (no output); both numeric and text (case insensitive) value recognized. Defaults to indexed mzML if no format is specified.",
+                    "The spectra output format: 0 for MGF, 1 for mzML, 2 for indexed mzML, 3 for Parquet, 4 for BinarySoa (RCIA streaming binary, see BINARY_SOA_FORMAT.md), 5 for None (no output); both numeric and text (case insensitive) value recognized. Defaults to indexed mzML if no format is specified.",


Is it necessary to include a reference to the format specification in the help output?

caetera · 2026-05-11T14:24:13Z

Hi @filiprumenovski,
thank you for implementing the new format. I did a quick review of the changes and added some inline comments/considerations.
I have a design question as well: I am not sure the format specification belongs in this repository. Should we have a separate place for the specification? @ypriverol, what is your view on this?

caetera requested review from caetera and ypriverol May 11, 2026 13:45

caetera added the enhancement New feature or request label May 11, 2026

caetera reviewed May 11, 2026

filiprumenovski wants to merge 1 commit into CompOmics:master from filiprumenovski:feat/binary-soa-output

3 participants
