

<?xml version="1.0" encoding="UTF-8"?>
<record>
  <title>Complexity-Invariant Rate-Distortion Gains of Transformer-Based Neural Image Codecs: A Stratified Evaluation Framework</title>
  <journal>Digital Signal Processing and Artificial Intelligence for Automatic Learning</journal>
  <author>Maleerat Maliyaem</author>
  <volume>5</volume>
  <issue>1</issue>
  <year>2026</year>
  <doi>https://doi.org/10.6025/dspaial/2026/5/1/18-31</doi>
  <url>https://www.dline.info/dspai/fulltext/v5n1/dspaiv5n1_2.pdf</url>
  <abstract>This study investigates the rate-distortion-complexity tradeoffs of modern neural image codecs, with emphasis
on practical deployment in resource-constrained environments such as edge and augmented reality devices.
While neural compression models often surpass classical standards (e.g., HEVC, VVC) in rate-distortion
performance, their high decoding complexity, particularly from autoregressive entropy models, hinders real-world
adoption. The authors address this by evaluating three representative architectures on the Kodak
dataset (24 natural RGB images, 768×512): a hyperprior baseline, an autoregressive context model, and a
transformer-based codec. To ensure robust analysis, images were objectively stratified into low-, medium-,
and high-complexity bins based on Sobel-based gradient energy.
Results demonstrate that transformer-based codecs achieve approximately 44% BD-rate improvement
over the hyperprior baseline, whereas autoregressive models yield approximately 30% savings. Critically,
these gains remain consistent across all complexity levels (variation &lt;1.5 percentage points), indicating
architectural robustness rather than content-specific optimization. At 0.62 bits per pixel, transformers deliver
a 2.5 dB PSNR advantage with visibly superior texture and edge preservation. All performance differences
were statistically significant (p &lt; 0.001). The findings underscore a paradigm shift from pure rate-distortion
optimization toward balanced rate-distortion-complexity design. Transformer architectures, with their
capacity for global context modeling, emerge as particularly promising for next-generation standards where
bandwidth efficiency and visual fidelity must coexist with computational constraints. The study establishes
a reproducible evaluation framework grounded in objective complexity metrics and rigorous statistical
validation, offering a methodological foundation for future codec development and benchmarking.</abstract>
</record>
