feat: SoA binary streaming output for high-throughput pipelines #206

Open
filiprumenovski wants to merge 1 commit into CompOmics:master from filiprumenovski:feat/binary-soa-output

filiprumenovski commented May 10, 2026

What

Adds --format BinarySoa (4) — a binary output that emits each scan as [128-byte header][filter string][f64 mz array][f32 intensity array][optional trailer dump]. Format is documented in BINARY_SOA_FORMAT.md.

Both --stdout and --output produce the same byte format.

Why

For pipelines that consume TRFP output and don't need mzML's portability, the XML round-trip is meaningful overhead. This format skips it — downstream consumers cast the bytes into native arrays directly.
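As a rough consumer-side illustration of that cast (not code from this PR), the sketch below views one scan's peak payload as typed arrays without copying. It assumes the peak count `n` is already known from the scan header, that the payload is `n` f64 m/z values followed by `n` f32 intensities, and that the host is little-endian. `view_peaks` is a hypothetical name.

```python
import struct
import sys


def view_peaks(payload: bytes, n: int):
    """Return zero-copy (mz, intensity) views over one scan's peak payload.

    Assumed layout (hypothetical helper, not part of the PR):
    n little-endian f64 m/z values, then n little-endian f32 intensities.
    """
    if sys.byteorder != "little":
        # memoryview.cast interprets bytes in native order only.
        raise RuntimeError("this sketch requires a little-endian host")
    raw = memoryview(payload)
    mz_bytes = 8 * n
    mz = raw[:mz_bytes].cast("d")                               # f64 view
    intensity = raw[mz_bytes:mz_bytes + 4 * n].cast("f")        # f32 view
    return mz, intensity


# Build a three-peak payload in the assumed layout, then view it.
payload = struct.pack("<3d", 100.5, 200.25, 300.125) + struct.pack("<3f", 1.5, 2.0, 0.25)
mz, intensity = view_peaks(payload, 3)
print(list(mz), list(intensity))
```

Both views share memory with `payload`, so nothing is parsed or copied until a value is read.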

Implementation notes

  • Output wrapped in a 1 MB BufferedStream. On a 3.7 GB Orbitrap DDA file this dropped sys time from ~72s to ~3s vs unbuffered.
  • mz array goes out as a span over the existing double[], no copy.
  • Intensity narrowing (f64→f32) uses ArrayPool<float>.Shared and a tight loop the JIT vectorizes.
  • 128-byte fixed scalar header per scan is built into a reusable buffer with inline little-endian writers.
  • The optional trailer dump captures every key/value pair from ILogEntryAccess verbatim, so consumers can pick out instrument-specific fields (lock-mass calibration, AGC, conversion params, etc.) without us having to enumerate them.
  • EThcD detection: uses FindLastReaction + SupplementalActivation == TriState.On to tag the spectrum as 5 (EThcD) rather than the supplemental HCD/CID's type.
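The f64→f32 intensity narrowing mentioned above trades precision for size. A small sketch of what a consumer can expect, using Python's `array` module rather than the PR's C# loop. The sample values are made up.

```python
from array import array

# Hypothetical intensities as 64-bit floats (sample values, not real data).
intensities_f64 = array("d", [1234567.891, 0.1, 42.0])

# Narrowing: array("f", ...) stores each value as the nearest 32-bit float.
intensities_f32 = array("f", intensities_f64.tolist())

# For normal-range values, round-to-nearest keeps the relative error
# within 2**-24 (about 6e-8), i.e. roughly seven significant digits.
max_rel_err = max(
    abs(narrow - wide) / wide
    for narrow, wide in zip(intensities_f32, intensities_f64)
)

print(intensities_f64.itemsize, intensities_f32.itemsize)  # bytes per value
print(max_rel_err)
```

The payload for intensities halves in size, which is usually a fair trade for detector intensities but would not be for m/z values, which stay f64.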

Compatibility

Strictly additive:

  • New OutputFormat.BinarySoa = 4 enum value before None. Existing format paths untouched.
  • SpectrumWriter.ConfigureWriter adds a branch for binary destinations (raw FileStream, no text-encoding wrapper). --gzip works on file output.
  • CLI -f/--format help updated. Numeric and case-insensitive name parsing inherits the existing ParseToEnum helper, so --format 4, --format BinarySoa, --format binarysoa all work.
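To make the accepted spellings concrete, here is a minimal Python stand-in for that parsing behaviour. It is not the repository's ParseToEnum helper; `parse_format` and `OUTPUT_FORMATS` are hypothetical names, and the numeric values are the ones in the updated CLI help text.

```python
# Numeric values after this PR, taken from the updated help text.
OUTPUT_FORMATS = {
    "MGF": 0,
    "mzML": 1,
    "IndexMzML": 2,
    "Parquet": 3,
    "BinarySoa": 4,
    "None": 5,
}


def parse_format(text: str) -> int:
    """Accept a numeric value or a case-insensitive format name."""
    by_name = {name.lower(): value for name, value in OUTPUT_FORMATS.items()}
    token = text.strip()
    if token.isdecimal():
        value = int(token)
        if value in by_name.values():
            return value
    elif token.lower() in by_name:
        return by_name[token.lower()]
    raise ValueError(f"unknown output format: {text!r}")


for spelling in ("4", "BinarySoa", "binarysoa", "none"):
    print(spelling, parse_format(spelling))
```

Note that inserting BinarySoa before None moves None from 4 to 5, so a script that passed `--format 4` to suppress output would now get binary output.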

Tested

Built clean on .NET 8 (macOS arm64 dev with Rosetta'd x64 dotnet for the Thermo DLL; CI Ubuntu should be unaffected).

Smoke-tested on:

  • bundled Data/small.RAW (48 spectra, FTMS+ITMS hybrid)
  • bundled Data/small2.RAW (95 spectra, ETD-capable instrument with reagent-ion trailer fields)
  • a 143,136-spectrum / ~60M-peak Orbitrap DDA file from PXD028735 (0 errors, 0 warnings)

--stdout and --output produce byte-exact identical streams in all three cases.

Happy to add a WriterTests entry in the same shape as the MzML/Parquet ones if you'd prefer the test be in-tree.

…mers

Adds a new --format=BinarySoa (4) output that emits each scan as a
self-describing binary record laid out as Structure-of-Arrays (mz f64,
intensity f32). Designed for downstream pipelines (Rust engines, GPU
rescorers, columnar database loaders) that prefer zero-copy ingestion
over portable XML.

The format is fully documented in BINARY_SOA_FORMAT.md and consists of:

  - 32-byte file header (magic "RCIASTR1", format_version, flags)
  - per-spectrum records with a 128-byte fixed scalar header capturing
    every commonly-needed field (rt, precursor mz, isolation window,
    collision energy, FAIMS CV, ion injection time, base peak, TIC,
    low/high mass, charge, master scan, activation type, ...) with
    graceful nullability via NaN floats and -1 int sentinels
  - an optional verbatim trailer key/value dump preserving every
    per-scan vendor-reported field (AGC target, conversion parameters,
    lock-mass calibration, etc.) without selective filtering
  - SoA peak arrays (f64 mz, then f32 intensity), naturally aligned
  - u32 = 0 EOF marker
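On the consumer side, the NaN and -1 sentinels can be mapped back to "missing" with a couple of helpers. This is a sketch only: the 8-byte fragment below (an f32 collision energy followed by an i32 charge) is a made-up slice for illustration, and the real field offsets and widths are whatever BINARY_SOA_FORMAT.md specifies.

```python
import math
import struct


def optional_float(value: float):
    """Map the NaN sentinel to None."""
    return None if math.isnan(value) else value


def optional_int(value: int):
    """Map the -1 sentinel to None."""
    return None if value == -1 else value


def decode_fragment(fragment: bytes):
    """Decode a hypothetical (f32 collision energy, i32 charge) pair."""
    energy, charge = struct.unpack("<fi", fragment)
    return optional_float(energy), optional_int(charge)


# An MS1-like scan with both fields absent, and an MS2-like scan with both set.
ms1_fields = struct.pack("<fi", math.nan, -1)
ms2_fields = struct.pack("<fi", 27.0, 2)
print(decode_fragment(ms1_fields), decode_fragment(ms2_fields))
```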

Both --stdout and --output produce the identical byte format, so a
file written with --output can be played back through the same
downstream consumer that reads from a streaming pipe.
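Because the bytes are the same either way, a consumer can take any binary stream. The sketch below validates only the file header and leaves the rest unparsed. It assumes the magic occupies the first 8 of the 32 header bytes; the positions of format_version and flags are not guessed here. `read_file_header` is a hypothetical name.

```python
import io

MAGIC = b"RCIASTR1"
FILE_HEADER_SIZE = 32


def read_file_header(stream) -> bytes:
    """Check the magic and return the 24 header bytes after it, unparsed.

    Assumption: the magic is the first 8 bytes of the 32-byte header.
    """
    header = stream.read(FILE_HEADER_SIZE)
    if len(header) != FILE_HEADER_SIZE:
        raise EOFError("stream ended inside the file header")
    if header[:len(MAGIC)] != MAGIC:
        raise ValueError(f"bad magic: {header[:len(MAGIC)]!r}")
    return header[len(MAGIC):]


# Works the same for open(path, "rb") or sys.stdin.buffer; BytesIO stands in here.
fake_stream = io.BytesIO(MAGIC + bytes(24) + b"records follow")
rest_of_header = read_file_header(fake_stream)
print(len(rest_of_header), fake_stream.tell())
```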

Performance notes from a 3.7 GB Orbitrap DDA benchmark
(143k spectra, 60M peaks):

  - Output is wrapped in a 1 MB BufferedStream to coalesce small
    writes into few large pipe syscalls (sys time dropped ~22x in
    measurement vs naive per-element writes)
  - mz array emitted via zero-copy MemoryMarshal.AsBytes over the
    existing double[]
  - intensity narrowing (f64->f32) uses ArrayPool<float>.Shared and
    a tight loop the JIT auto-vectorizes
  - per-spectrum header is built into a reusable 128-byte buffer with
    inline little-endian writers (no BinaryWriter virtual calls)
  - metadata block built into a reusable MemoryStream that's reset
    (not freed) between scans
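The buffering effect in the first bullet is easy to reproduce in miniature. The sketch below is a Python analogue, not the PR's BufferedStream code: `CountingSink` is a hypothetical stand-in for a pipe that counts how many writes reach it.

```python
import io


class CountingSink(io.RawIOBase):
    """Raw stream that counts write calls instead of doing real I/O."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.total = 0

    def writable(self):
        return True

    def write(self, data):
        self.calls += 1
        self.total += len(data)
        return len(data)


def write_values(out, count):
    """Emit `count` 8-byte values one at a time, like per-element writes."""
    for i in range(count):
        out.write(i.to_bytes(8, "little"))


# Naive path: every small write reaches the sink.
unbuffered = CountingSink()
write_values(unbuffered, 10_000)

# Buffered path: a 1 MiB buffer coalesces the same writes.
sink = CountingSink()
buffered = io.BufferedWriter(sink, buffer_size=1 << 20)
write_values(buffered, 10_000)
buffered.flush()

print(unbuffered.calls, sink.calls, sink.total)
```

With a real pipe each of those sink writes is a syscall, which is where the sys-time saving comes from.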

Activation type encoding handles EThcD correctly: when the instrument
reports SupplementalActivation == TriState.On AND the primary reaction
is ETD/ECD followed by HCD/CID, the encoded byte is 5 (EThcD) rather
than the supplemental's HCD/CID value.
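Read literally, that rule can be sketched as follows. This is an illustration in Python, not the PR's C# code: `activation_label` is a hypothetical name, the list of activation names is a simplification of the vendor reaction objects, and the code table holds only the values visible in the BINARY_SOA_FORMAT.md excerpt quoted in this thread.

```python
# Codes visible in the format excerpt; other activation codes are not shown there.
ACTIVATION_CODES = {"HCD": 2, "ETD": 3, "ECD": 4, "EThcD": 5}


def activation_label(reactions, supplemental_on):
    """Pick the activation label for a scan.

    reactions: activation names in the order they were applied
    (hypothetical simplification of the vendor reaction list).
    supplemental_on: True when SupplementalActivation is reported as On.
    """
    if not reactions:
        raise ValueError("scan has no reactions")
    last = reactions[-1]
    electron_first = reactions[0] in ("ETD", "ECD")
    collisional_last = last in ("HCD", "CID")
    if supplemental_on and len(reactions) > 1 and electron_first and collisional_last:
        return "EThcD"
    # Otherwise the last reaction decides the label.
    return last


label = activation_label(["ETD", "HCD"], supplemental_on=True)
print(label, ACTIVATION_CODES[label])
```

Following the commit message as written, ETD followed by supplemental CID also lands on EThcD, which is the case the ETciD question later in the review touches on.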

Compatibility:

  - Additive: new OutputFormat.BinarySoa = 4 enum value, existing
    formats (MGF, mzML, IndexMzML, Parquet) untouched
  - SpectrumWriter.ConfigureWriter handles BinarySoa as a binary
    destination (no text-encoded StreamWriter wrapper, optional gzip
    via --gzip)
  - CLI help text updated to document the new format
caetera requested review from caetera and ypriverol on May 11, 2026 13:45
caetera added the enhancement (New feature or request) label on May 11, 2026
@ypriverol
Contributor

@filiprumenovski can this be aligned with the Parquet representation, at least the column names?

precursorMz = CalculateSelectedIonMz(reaction, monoisotopicMz, isolationWidthTrailer);
collisionEnergy = (float)reaction.CollisionEnergy;

double iw = isolationWidthTrailer ?? reaction.IsolationWidth;

The isolation window offset needs to be implemented to make the output consistent with the other formats.

int filterLen = filterBytes.Length;
if (filterLen > MaxFilterStringLen)
{
Log.Warn($"Filter string for scan {scanNumber} truncated from {filterLen} to {MaxFilterStringLen} bytes");

In general, an issued warning affects the error code through parseInput.NewWarn(). Here the warning is "silenced". Is that intended?

Comment thread BINARY_SOA_FORMAT.md
| 2 | HCD (Higher-Energy Collisional Dissociation) |
| 3 | ETD (Electron Transfer Dissociation) |
| 4 | ECD (Electron Capture Dissociation) |
| 5 | EThcD (ETD + HCD supplemental) |

Is it planned to support ETciD?

Comment thread OutputFormat.cs

This changes the code for None. Since we are doing a major version update now, I think it is justified.

Comment thread MainClass.cs
{
"f=|format=",
- "The spectra output format: 0 for MGF, 1 for mzML, 2 for indexed mzML, 3 for Parquet, 4 for None (no output); both numeric and text (case insensitive) value recognized. Defaults to indexed mzML if no format is specified.",
+ "The spectra output format: 0 for MGF, 1 for mzML, 2 for indexed mzML, 3 for Parquet, 4 for BinarySoa (RCIA streaming binary, see BINARY_SOA_FORMAT.md), 5 for None (no output); both numeric and text (case insensitive) value recognized. Defaults to indexed mzML if no format is specified.",

Is it necessary to include a reference to the format specification in the help output?

@caetera
Contributor

caetera commented May 11, 2026

Hi @filiprumenovski,
thank you for implementing the new format. I did a quick review of the changes and added some inline comments and considerations.
I have a design question as well: I am not sure the format specification belongs in this repository. Should we have a separate place for the specification? @ypriverol, what is your view on this?
