Delving into Arithmetic Coding
Arithmetic coding is a sophisticated form of entropy encoding used in lossless data compression. Unlike algorithms such as Huffman coding, which assign a whole number of bits to each symbol, arithmetic coding can effectively spend a fractional number of bits per symbol, often achieving better compression, especially for sources with small alphabets or highly skewed probabilities.
How Arithmetic Coding Works
The core idea of arithmetic coding is to represent an entire message (or sequence of symbols) as a single fraction, a number in the interval [0, 1). As the message becomes longer, the interval representing it becomes smaller, and the number of bits needed to specify that interval grows.
Here's a simplified view of the process (a minimal code sketch follows the list):
- Model Probabilities: First, you need a model that provides the probabilities of the symbols in the input data. This could be a fixed model or an adaptive one that updates probabilities as it processes the data.
- Interval Division: Start with the range [0, 1). For each symbol in the input sequence, this range is narrowed based on the symbol's probability. The current range is subdivided into sub-intervals, with the size of each sub-interval proportional to the probability of the corresponding symbol.
- Select Sub-interval: The sub-interval corresponding to the next symbol in the input sequence becomes the new current range.
- Repeat: This process is repeated for all symbols in the message.
- Final Code: The final compressed message is a number that uniquely identifies the final, very small, interval. In practice, only enough bits to distinguish this interval from all other possible final intervals are transmitted.
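To make the interval-narrowing idea concrete, here is a purely conceptual Python sketch using floating-point arithmetic; the probability model and message are made up for illustration, and a real coder would use fixed-precision integers with renormalization and emit bits incrementally rather than a single number at the end.

```python
def arithmetic_encode(message, probs):
    """Narrow [0, 1) once per symbol; return a number inside the final interval."""
    # Precompute the cumulative probability at which each symbol's sub-interval starts.
    cum = {}
    total = 0.0
    for s in sorted(probs):
        cum[s] = total
        total += probs[s]

    low, high = 0.0, 1.0
    for s in message:
        width = high - low
        # The sub-interval for symbol s has size proportional to probs[s].
        high = low + width * (cum[s] + probs[s])
        low = low + width * cum[s]
    # Any number in [low, high) identifies the message; the midpoint is one simple choice.
    return (low + high) / 2, (low, high)

if __name__ == "__main__":
    model = {"a": 0.6, "b": 0.3, "c": 0.1}   # hypothetical fixed model
    code, interval = arithmetic_encode("aabac", model)
    print(f"final interval: {interval}, code value: {code}")
```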
Advantages of Arithmetic Coding
- Higher Compression Ratios: Often achieves compression ratios closer to the theoretical limit (entropy) than Huffman coding, particularly when symbol probabilities are not close to powers of 1/2.
- Adaptability: It combines easily with adaptive probability models, in which symbol probabilities are updated as the data is processed, giving better compression on sources whose statistics vary (see the sketch after this list).
- Separation of Model and Coding: The probability model and the encoding process are distinct, allowing for easier experimentation with different models.
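As a rough illustration of the adaptability and model/coder separation points above, the sketch below shows a simple count-based adaptive model; the class name and alphabet are hypothetical, and in a full coder both encoder and decoder would call update() after every symbol so their probability estimates stay synchronized.

```python
# A minimal adaptive model sketch: probabilities are estimated from the
# counts of symbols seen so far, keeping the model separate from the coder.

class AdaptiveModel:
    def __init__(self, alphabet):
        # Start every symbol with a count of 1 so no probability is ever zero.
        self.counts = {s: 1 for s in alphabet}
        self.total = len(alphabet)

    def prob(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        # Called after each symbol is coded, by encoder and decoder alike.
        self.counts[symbol] += 1
        self.total += 1

model = AdaptiveModel("ab")
for s in "aaba":
    print(s, round(model.prob(s), 3))
    model.update(s)
```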
Key Differences from Huffman Coding
While both are entropy coders, Huffman coding assigns each symbol its own prefix code containing a whole number of bits (e.g., 0, 10, 110). Arithmetic coding, on the other hand, effectively allocates a "fractional" number of bits to each symbol based on its probability, producing a single binary fraction in [0, 1) that represents the entire input sequence rather than a separate codeword per symbol.
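A small worked example of the "fractional bits" idea: the information content of a symbol with probability p is -log2(p) bits, which arithmetic coding can approach, whereas any prefix code such as Huffman must spend a whole number of bits (and never fewer than 1) on each symbol. The probabilities below are arbitrary illustrations.

```python
import math

# -log2(p) is the fractional bit cost arithmetic coding can approach for a
# symbol of probability p; a Huffman codeword must contain at least 1 whole bit.
for p in (0.9, 0.5, 0.25, 0.01):
    print(f"p = {p:<5}: about {-math.log2(p):.3f} bits with arithmetic coding")
```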
Disadvantages of Arithmetic Coding
- Computational Complexity: Generally slower than Huffman coding due to the more complex mathematical operations involved in maintaining and subdividing intervals.
- Patents: Historically, its use was complicated by patents, though many have now expired.
- Implementation Sensitivity: Requires careful implementation, especially of the finite-precision arithmetic and interval rescaling needed to avoid underflow and loss of precision as the interval shrinks (a minimal renormalization sketch follows this list).
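To hint at what the rescaling concern looks like in practice, here is a minimal renormalization sketch using fixed-precision integers; the constants and function name are illustrative rather than taken from any particular codec, and the underflow case (an interval straddling the midpoint) is deliberately omitted to keep the sketch short.

```python
# Renormalization: once low and high agree on their top bit, that bit is
# settled, so it is emitted and both bounds are shifted left to restore precision.

PRECISION = 16
TOP = 1 << PRECISION
HALF = TOP >> 1

def renormalize(low, high, output_bits):
    while high < HALF or low >= HALF:
        if high < HALF:              # interval entirely in the lower half: emit 0
            output_bits.append(0)
        else:                        # interval entirely in the upper half: emit 1
            output_bits.append(1)
            low -= HALF
            high -= HALF
        low <<= 1                    # rescale the interval back to full width
        high <<= 1
    return low, high

bits = []
low, high = renormalize(0x2000, 0x3000, bits)   # example interval in the lower half
print(bits, hex(low), hex(high))
```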
Prominent Use Cases
Despite its complexities, arithmetic coding is used in several standards due to its efficiency:
- JPEG 2000: The image compression standard uses arithmetic coding (specifically, the MQ coder, a binary arithmetic coder).
- H.264/AVC and H.265/HEVC (CABAC): Context-Adaptive Binary Arithmetic Coding (CABAC) is a significant part of these modern video compression standards.
- The original bzip compressor used arithmetic coding as its final stage; its successor bzip2 replaced it with Huffman coding (applied after the Burrows-Wheeler Transform), largely because of the patent concerns mentioned above.
For a more in-depth mathematical treatment, you can refer to resources like the Wikipedia page on Arithmetic Coding.