feat: SoA binary streaming output for high-throughput pipelines #206
filiprumenovski wants to merge 1 commit into
…mers
Adds a new --format=BinarySoa (4) output that emits each scan as a
self-describing binary record laid out as Structure-of-Arrays (mz f64,
intensity f32). Designed for downstream pipelines (Rust engines, GPU
rescorers, columnar database loaders) that prefer zero-copy ingestion
over portable XML.
The format is fully documented in BINARY_SOA_FORMAT.md and consists of:
- 32-byte file header (magic "RCIASTR1", format_version, flags)
- per-spectrum records with a 128-byte fixed scalar header capturing
every commonly-needed field (rt, precursor mz, isolation window,
collision energy, FAIMS CV, ion injection time, base peak, TIC,
low/high mass, charge, master scan, activation type, ...) with
graceful nullability via NaN floats and -1 int sentinels
- an optional verbatim trailer key/value dump preserving every
per-scan vendor-reported field (AGC target, conversion parameters,
lock-mass calibration, etc.) without selective filtering
- SoA peak arrays (f64 mz, then f32 intensity), naturally aligned
- u32 = 0 EOF marker
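A consumer-side sketch of two details from the list above: checking the file magic and decoding the NaN / -1 sentinels. It assumes the magic sits at offset 0 of the 32-byte header (the offsets are specified in BINARY_SOA_FORMAT.md, not here), and the class and method names are hypothetical.

```csharp
using System;

static class SoaReaderSketch
{
    // Header size and magic string as listed above; the magic's position at offset 0 is an assumption.
    const int FileHeaderSize = 32;
    static ReadOnlySpan<byte> Magic => "RCIASTR1"u8;

    // True when the buffer holds a full file header that starts with the magic bytes.
    public static bool HasValidMagic(ReadOnlySpan<byte> header) =>
        header.Length >= FileHeaderSize && header.Slice(0, Magic.Length).SequenceEqual(Magic);

    // A NaN float marks an absent field.
    public static double? DecodeFloat(double raw) => double.IsNaN(raw) ? (double?)null : raw;

    // A -1 integer marks an absent field.
    public static int? DecodeInt(int raw) => raw == -1 ? (int?)null : raw;
}
```

Mapping the sentinels to nullable values at the boundary keeps NaN and -1 from leaking into downstream arithmetic.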
Both --stdout and --output produce the identical byte format, so a
file written with --output can be played back through the same
downstream consumer that reads from a streaming pipe.
Performance notes from a 3.7 GB Orbitrap DDA benchmark
(143k spectra, 60M peaks):
- Output is wrapped in a 1 MB BufferedStream to coalesce small
writes into few large pipe syscalls (sys time dropped ~22x in
measurement vs naive per-element writes)
- mz array emitted via zero-copy MemoryMarshal.AsBytes over the
existing double[]
- intensity narrowing (f64->f32) uses ArrayPool<float>.Shared and
a tight loop the JIT auto-vectorizes
- per-spectrum header is built into a reusable 128-byte buffer with
inline little-endian writers (no BinaryWriter virtual calls)
- metadata block built into a reusable MemoryStream that's reset
(not freed) between scans
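The buffering, zero-copy, and pooled-narrowing points above can be sketched roughly as follows. This is not the PR's code: the class and method names are hypothetical, only the two peak arrays are written (no header or counts), and the byte reinterpretation assumes a little-endian host.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Runtime.InteropServices;

static class PeakWriterSketch
{
    // Wraps the destination once so many small writes reach it as few large ones.
    public static Stream Buffer(Stream destination) => new BufferedStream(destination, 1 << 20);

    // Writes the mz array as raw f64 bytes, then the intensities narrowed to f32.
    public static void WritePeaks(Stream output, double[] mz, double[] intensity)
    {
        // Reinterprets the double[] as bytes without copying (host byte order).
        output.Write(MemoryMarshal.AsBytes(mz.AsSpan()));

        // Rents a scratch buffer instead of allocating a new float[] per scan.
        float[] rented = ArrayPool<float>.Shared.Rent(intensity.Length);
        try
        {
            Span<float> narrowed = rented.AsSpan(0, intensity.Length);
            for (int i = 0; i < narrowed.Length; i++)
                narrowed[i] = (float)intensity[i];
            output.Write(MemoryMarshal.AsBytes(narrowed));
        }
        finally
        {
            ArrayPool<float>.Shared.Return(rented);
        }
    }
}
```

The rented array may be longer than requested, which is why the loop and the write both go through a span trimmed to intensity.Length.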
Activation type encoding handles EThcD correctly: when the instrument
reports SupplementalActivation == TriState.On AND the primary reaction
is ETD/ECD followed by HCD/CID, the encoded byte is 5 (EThcD) rather
than the supplemental's HCD/CID value.
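A minimal sketch of that decision, using a stand-in enum rather than the vendor types. Codes 2 to 5 match the table in BINARY_SOA_FORMAT.md as quoted in the review; the CID code is an assumption, and the names are hypothetical.

```csharp
using System;

enum Activation : byte
{
    Cid = 1,   // assumed code; not shown in this excerpt
    Hcd = 2,
    Etd = 3,
    Ecd = 4,
    EThcD = 5,
}

static class ActivationSketch
{
    // primary: the first reaction; last: the final (possibly supplemental) reaction.
    public static Activation Encode(Activation primary, Activation last, bool supplementalOn)
    {
        bool electronBased = primary == Activation.Etd || primary == Activation.Ecd;
        bool collisional = last == Activation.Hcd || last == Activation.Cid;

        // Only the combination of all three conditions is tagged as EThcD;
        // otherwise the last reaction's own code is kept.
        return supplementalOn && electronBased && collisional ? Activation.EThcD : last;
    }
}
```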
Compatibility:
- Additive: new OutputFormat.BinarySoa = 4 enum value, existing
formats (MGF, mzML, IndexMzML, Parquet) untouched
- SpectrumWriter.ConfigureWriter handles BinarySoa as a binary
destination (no text-encoded StreamWriter wrapper, optional gzip
via --gzip)
- CLI help text updated to document the new format
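To illustrate how one enum can accept both numeric and case-insensitive text values for the new format, here is a stand-in sketch. The enum member names are guesses based on the list above, and TryParseFormat is hypothetical; it is not the project's actual parsing helper.

```csharp
using System;

// Values follow the updated help text; member names are assumed.
enum OutputFormat { MGF = 0, MzML = 1, IndexMzML = 2, Parquet = 3, BinarySoa = 4, None = 5 }

static class FormatParsingSketch
{
    // Accepts "4", "BinarySoa", or "binarysoa"; rejects numbers with no enum member.
    public static bool TryParseFormat(string text, out OutputFormat format) =>
        Enum.TryParse(text, ignoreCase: true, out format)
        && Enum.IsDefined(typeof(OutputFormat), format);
}
```

Enum.TryParse alone accepts any numeric string, so the Enum.IsDefined check is what rejects an out-of-range value such as 9.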
@filiprumenovski can this be aligned, at least the column names, with the Parquet representation?
precursorMz = CalculateSelectedIonMz(reaction, monoisotopicMz, isolationWidthTrailer);
collisionEnergy = (float)reaction.CollisionEnergy;

double iw = isolationWidthTrailer ?? reaction.IsolationWidth;
The isolation window offset needs to be implemented to make the output consistent with the other formats.
int filterLen = filterBytes.Length;
if (filterLen > MaxFilterStringLen)
{
    Log.Warn($"Filter string for scan {scanNumber} truncated from {filterLen} to {MaxFilterStringLen} bytes");
In general, issued warnings affect the error code through parseInput.NewWarn(). The warning is "silenced" here. Is that intended?
| 2 | HCD (Higher-Energy Collisional Dissociation) |
| 3 | ETD (Electron Transfer Dissociation) |
| 4 | ECD (Electron Capture Dissociation) |
| 5 | EThcD (ETD + HCD supplemental) |
Is it planned to support ETciD?
This changes the code for None. Since we are doing a major version update now, I think it is justified.
 {
 "f=|format=",
-"The spectra output format: 0 for MGF, 1 for mzML, 2 for indexed mzML, 3 for Parquet, 4 for None (no output); both numeric and text (case insensitive) value recognized. Defaults to indexed mzML if no format is specified.",
+"The spectra output format: 0 for MGF, 1 for mzML, 2 for indexed mzML, 3 for Parquet, 4 for BinarySoa (RCIA streaming binary, see BINARY_SOA_FORMAT.md), 5 for None (no output); both numeric and text (case insensitive) value recognized. Defaults to indexed mzML if no format is specified.",
Is it necessary to include a reference to the format specification in the help output?
Hi @filiprumenovski,
What
Adds --format BinarySoa (4), a binary output that emits each scan as [128-byte header][filter string][f64 mz array][f32 intensity array][optional trailer dump]. The format is documented in BINARY_SOA_FORMAT.md.
Both --stdout and --output produce the same byte format.
Why
For pipelines that consume TRFP output and don't need mzML's portability, the XML round-trip is meaningful overhead. This format skips it: downstream consumers cast the bytes into native arrays directly.
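As a rough illustration of what "cast the bytes into native arrays" could look like on the consumer side, the sketch below views a peak payload as typed spans without copying. It assumes the payload is count f64 values followed by count f32 values, that count was already read from the record header, and that the host is little-endian. The names are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

static class PeakViewSketch
{
    // Views the first `count` f64 values of the payload as mz values, without copying.
    public static ReadOnlySpan<double> Mz(ReadOnlySpan<byte> payload, int count) =>
        MemoryMarshal.Cast<byte, double>(payload.Slice(0, count * sizeof(double)));

    // Views the `count` f32 values that follow the mz block as intensities.
    public static ReadOnlySpan<float> Intensity(ReadOnlySpan<byte> payload, int count) =>
        MemoryMarshal.Cast<byte, float>(
            payload.Slice(count * sizeof(double), count * sizeof(float)));
}
```

Both spans alias the original buffer, so they are only valid while that buffer is alive and unmodified.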
Implementation notes
- Output is wrapped in a 1 MB BufferedStream. On a 3.7 GB Orbitrap DDA file this dropped sys time from ~72s to ~3s vs unbuffered.
- The mz array is written directly from the existing double[], no copy.
- Intensity narrowing uses ArrayPool<float>.Shared and a tight loop the JIT vectorizes.
- Trailer fields are dumped from ILogEntryAccess verbatim, so consumers can pick out instrument-specific fields (lock-mass calibration, AGC, conversion params, etc.) without us having to enumerate them.
- Activation type encoding uses FindLastReaction + SupplementalActivation == TriState.On to tag the spectrum as 5 (EThcD) rather than the supplemental HCD/CID's type.
Compatibility
Strictly additive:
- OutputFormat.BinarySoa = 4 enum value before None. Existing format paths untouched.
- SpectrumWriter.ConfigureWriter adds a branch for binary destinations (raw FileStream, no text-encoding wrapper). --gzip works on file output.
- -f/--format help updated. Numeric and case-insensitive name parsing inherits the existing ParseToEnum helper, so --format 4, --format BinarySoa, --format binarysoa all work.
Tested
Built clean on .NET 8 (macOS arm64 dev with Rosetta'd x64 dotnet for the Thermo DLL; CI Ubuntu should be unaffected).
Smoke-tested on:
- Data/small.RAW (48 spectra, FTMS+ITMS hybrid)
- Data/small2.RAW (95 spectra, ETD-capable instrument with reagent-ion trailer fields)
--stdout and --output produce byte-exact identical streams in all three cases.
Happy to add a WriterTests entry in the same shape as the MzML/Parquet ones if you'd prefer the test be in-tree.