The petabyte-scale data generated by high energy physics (HEP) experiments presents a significant storage challenge. We present the Bytewise Online Autoregressive (BOA) Constrictor, a new pseudo-streaming lossless neural compressor built upon the Mamba state space model. BOA achieves competitive compression ratios (CRs) across diverse structured HEP datasets, matching or exceeding the Lempel–Ziv–Markov chain algorithm (LZMA), ZSTD and ZLIB at maximum compression, among other tested algorithms. With a 2.21 MB model, BOA achieves an effective CR (defined as the ratio of original to compressed file size, inclusive of model size) of 7.23× on ATLAS Open Data (HDF5) and 9.13× on simulated particle collision records (HepMC v3), outperforming the next-best traditional algorithm (6.79× and 5.33×, respectively). BOA also demonstrates robust cross-file and cross-condition generalisation on CMS Open Data (NanoAOD format), where it obtains comparable or improved effective CRs (within 5%) relative to the next-best traditional algorithm. Ablation studies show that transitioning to half-precision (FP16) weights reduces the model footprint without degrading predictive accuracy, and data-type analyses reveal that BOA performs best on high-entropy float32 payloads. The model has also been tested on other kinds of scientific data, yielding 1.61× (vs 1.14× for the next-best algorithm) on computational fluid dynamics and up to 1.53× (vs 1.27×) on cosmology (CAMELS) datasets. BOA is supported by a deterministic reference C++ implementation which ensures bit-exact reproducibility across different CUDA architectures. In this proof-of-principle implementation, BOA delivers a compression throughput of ∼3.5 to 45 MB/s and a decompression throughput of ∼1.5 to 25 MB/s, which is not yet competitive with optimised algorithms such as ZSTD or LZMA but provides a first step towards improved data compression for next-generation scientific data.
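The effective CR used above can be illustrated with a minimal Python sketch. It assumes that "inclusive of model size" means the model's bytes are added to the compressed payload in the denominator; the function name `effective_cr` and its parameters are hypothetical and are not taken from the BOA code base.

```python
def effective_cr(original_bytes: int, compressed_bytes: int, model_bytes: int = 0) -> float:
    """Effective compression ratio: original size / (compressed payload + model size).

    Assumption: the model is charged once against the compressed output.
    With model_bytes=0 this reduces to the plain compression ratio, which
    is the appropriate setting for model-free compressors such as ZLIB.
    """
    total_bytes = compressed_bytes + model_bytes
    if total_bytes <= 0:
        raise ValueError("compressed size plus model size must be positive")
    return original_bytes / total_bytes


# Illustrative numbers only (not results from the paper):
# a 100-byte input, an 8-byte payload and a 2-byte model.
print(effective_cr(100, 8, 2))  # 10.0
```

Under this reading, the fixed cost of the model matters less as the dataset grows, so the effective CR approaches the plain CR for large inputs.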
Journal article
2026-06-01T00:00:00+00:00
7