ANS experimentations with 8-bit systems
Note! There's some proof reading still to be done. There are a lot of layout issues that are beyond me how to fix. This online editor is so great..
Since ANS came out I always wanted to try it, but honestly, never got around to do any experimentation until summer 2025. My procrastination abilities are amazing and maths is not really my strongest super power. The older I get the less gets into my head, and stays there.. And not helped by the fact that the stuff I do at work tends to reserve the last remaining brain cells I have left. Anyway, the best way to learn is by doing. My main goal was to make ANS implementations whose decoders are suitable for 8-bit systems, mainly the best ever, Z80.
I ended up implementing mock up code in Python and C++ for all three main ANS "variants":
- rANS a generic arithmetic coding replacement [4]
- tANS a fully table driven ANS variant, which can be seen as a Huffman coding replacement
- rABS a two symbol (0 and 1 bits) rANS variant
Decoders in Z80 assembly are also available for tANS and rABS encoders. The entire code base is available in my github page.
This sad blog is going to concentrate on tANS and rABS issues/challenges I faced when tinkering with the Z80 assembly versions of the decoders. Again here, I was willing to sacrifice code size over the speed.
For the tANS implementation I spent long time thinking how to encode symbol frequencies into the final output file. Outputting 8 bits for every used or non-used symbol seemed a bit wasteful. We'll discuss the symbol frequency encoding problem more later, but as a teaser I ended up using Rice-Golomb [1,2,3] codes to encode symbol frequencies.
ANS has an interesting property that decoding is done in a reverse order to encoding. Internet told me most people have solved the problem by reversing the input file before encoding or encoded the input file from end to beginning. In my typical use-cases I prefer in-place decompression where the encoded file is at the end overwritten by the decoded file. Most of the good old LZ-based in-place compression and decompression approaches I am aware of first compress/encode the input file from beginning to end using forward match references, and decompress/decode from end of the compressed file towards the beginning. Such arrangement is perfect fit for ANS, since one can use ANS encoder "naturally" and once the input file is encoded we have the final ANS state. The following pictures try to illustrate what I am talking about.
In most cases there needs to be a bit of "security distance" (Seclen 0 to n bytes) between compressed and decompressed memory areas. Basically one needs to place the compressed file few bytes lower in memory than the start of the decompressed memory area. This is to avoid decoder and writing over the compressed file before reading it. Sometimes input file for compression is such that the "security distance" can be 0 bytes, which is the ideal case.
Tabled ANS i.e. tANS
There are three topics to discuss in more detail:
- encoding symbol frequencies into the output file
- creation of tANS decoding tables
- tANS Z80 decoder
- Maximum size of the alphabet is 256, and symbols are from 0 to 255.
- All symbol frequencies must fit into 8 bits, which means the maximum single symbol frequency is 255 or less. 0 frequency is reserved for "no symbol" marker.
- The sum of all symbol frequencies must be a power of 2.
- Total sum of all symbol frequencies is maximum 256, but not a single symbol can have the maximum frequency (see previous bullet). This implies there should always be at least 2 symbols in the encoded file.
- Maximum size of the alphabet is 256, and symbols are from 0 to 255.
- All symbol frequencies must fit into 8 bits, which means the maximum single symbol frequency is 255 or less. 0 frequency is reserved for "no symbol" marker.
- The sum of all symbol frequencies must be a power of 2.
- Total sum of all symbol frequencies is maximum 256, but not a single symbol can have the maximum frequency (see previous bullet). This implies there should always be at least 2 symbols in the encoded file.
Encoding symbol frequencies into the output file
I wasted quite a bit of time thinking how to create a table of all symbol frequencies which would comply with the above design limitations but also have a single symbol frequency as a power of 2 value. Despite of my attempts that did not work out and I ended up just ensuring the total sum of all symbols was a power of 2. The basis of my implementation follows the Integer Normalization algorithm [6]. The decoder side is simpler, since all transmitted output file symbol frequencies are already normalized to a total power of 2 sum.
Let's assume the total sum (i.e. M) of symbols is 32. We need 5 bits to encode every symbol frequency in the transmitted output file. How could we encode symbols frequencies in less than 5 bits, especially symbols that do not exist (i.e. those with frequency of 0)?? This was so easy with Huffman and canonical codes.. Anyway, Internet gave me a hint that one could use e.g. Rice or Rice-Golomb codes. I ended up experimenting with Rice codes with shift value K=2 bits.
int total_bits, n = 1;
r = value & ((1 << K) - 1);
q = value >> K;
value = 0;
while (q > 0) {
value = (value + 1) << 1;
--q; ++n;
}
value = (value << K) | r;
total_bits = n + K;
Creating tANS decoding tables
Once we have all the symbol frequencies known and a total sum of a power of 2, when can generate tables to help decoding. We only need 2 tables, one table L_ for ANS state_x to symbol s mapping and another table Y_ for next ANS state_x after each decoded symbol. I found this blog [5] to be excellent describing how the basic principle of table generation works.
Let's assume we have symbols S={a,b,c,d,e,f,g,h}, and total sum M of symbols is 32, which also is a power of 2. The symbol frequencies are F={2,6,9,0,0,2,6,7}. Next we are going to generate two tables with M entries in both. Note that the alphabet size can be and should be smaller than M, and does not need to be power of 2.
One possible Z80 implementation looks like the one below. It is not necessarily the most optimized and compact.
; IX = ptr to Y_ and L_ tables, both M_ bytes, so total 2M_ bytes
; A = initial state_x i.e. INITIAL_STATE
; B = number of symbols in our alphabet.
tans_build_decoding_tables:
ld de,0x8000
; E = s(ymbol), starting from 0
_main_loop:
push bc
push af
EXTRACT_SYMBOL_FREQ_TO_A
jr z, _zero_symbol
ld b,a ; B = f(requency)
ld c,a ; C = f(requency)
pop af
_per_symbol_loop:
push ix
push af
add a,ixl ; Can be optimized if IX is 256 byte aligned
ld ixl,a
; Y_[state_x] = n
ld (ix+Y_),c
inc c
; L_[state_x] = s
ld (ix+L_),e
; spread step i.e. state_x = f(state_x)
pop af
pop ix
add a,SPREAD_STEP
; make sure 0 <= state_x < M
and M_-1
;
djnz _per_symbol_loop
; Trick to avoid a jump i.e. CP nn; POP AF has already been done
db $fe
_zero_symbol:
pop af
pop bc
inc e
djnz _main_loop
; IX = ptr to Y_ and L_ tables, both M_ bytes, so total 2M_ bytes
; A = initial state_x i.e. INITIAL_STATE
; B = number of symbols in our alphabet.
tans_build_decoding_tables:
ld de,0x8000
; E = s(ymbol), starting from 0
_main_loop:
push bc
push af
EXTRACT_SYMBOL_FREQ_TO_A
jr z, _zero_symbol
ld b,a ; B = f(requency)
ld c,a ; C = f(requency)
pop af
_per_symbol_loop:
push ix
push af
add a,ixl ; Can be optimized if IX is 256 byte aligned
ld ixl,a
; Y_[state_x] = n
ld (ix+Y_),c
inc c
; L_[state_x] = s
ld (ix+L_),e
; spread step i.e. state_x = f(state_x)
pop af
pop ix
add a,SPREAD_STEP
; make sure 0 <= state_x < M
and M_-1
;
djnz _per_symbol_loop
; Trick to avoid a jump i.e. CP nn; POP AF has already been done
db $fe
_zero_symbol:
pop af
pop bc
inc e
djnz _main_loop
tANS Z80 decoder
After we have generated both table Y_ and table L_, the decoder is extremely simple. Since we have no table k for how many bits to input from the compressed data stream, we just loop until state_x>=M. The decoder must always input at least 1 bit i.e. implicitly k>=1.
One possible Z80 implementation of the tANS decoder looks like the one below. I believe more seasoned Z80 wizards can optimize it even further :)
; Inputs:
; IX = ptr to Y_ and L_ tables, both M_ bytes, so total 2M_ bytes
; HL = ptr to compressed data
; C = input bit buffer
; A = state_x
; Return:
; A = new state_x
; B = decoded symbol
; Modifies:
; C and HL
tans_decode_symbol:
; Possible optimization here. If IX starts from 256 byte
; boundary then the 'add' below can be omitted, and just
; use the 'ld' directly.
add a,ixl
ld ixl,a
; get a symbol
ld b,(ix+L_)
ld a,(ix+Y_)
; k-tableless adding of bits for the next state
_do_while:
sla c
jr nz, _not_empty
dec hl
ld c,(hl)
rl c
_not_empty:
adc a,a
cp M_
jr c, _do_while
; Scale state between [0..M) or rather [L..2L)
and M_-1
Binary symbol distribution i.e. rAbS
First, we need to have some design limitations to make rABS a fit for 8-bit architectures with 16-bit registers (aaahh.. Z80 goodness), no multiplication and no division. For implementation simplicity sake when modelling symbol probabilities (for 0 or 1), the symbol frequencies/probabilities use 1 byte i.e. the range is [1..255] per context. Total of symbol frequencies is 256 (for symbols 0 and 1). Since our alphabet is only 2 symbols, we need only maintain the probability for one symbol and that probability can also be used as a cumulative probability. Both symbol probability and cumulative probability book keeping we get away with just one 8-bit value. This indeed looks good.
My implementation uses probability of 0 i.e. p0 as the book keeping variable. The p0 is actually p0/256 as a probability and just p0 as a symbol frequency. The probability p1 of symbol 1 is then p1=1-p0 i.e. p1=256-p0, which is in 8-bit values just -p0. Similarly, the cumulative probability for symbol 0 is 0, and for symbol 1 it is p0.
In case of dynamic updates to symbol probabilities there is a great article by Ryg on "Models for adaptive arithmetic coding" [8] that can be ~directly adapted for rABS purposes. Ferris' blog [7] discussed the same modelling even further from 8-bit systems point of view. I am not going to repeat the modelling principles again in this blog entry, instead click the provided links and read from there :)
It is important that the total sum of symbol frequencies is 256. This allows us to define M=256, and we will later use as M a divider. When M=256 we can replace a division with 8 right bit shifts. Or even better in our case by picking up the high 8-bit register of a combined 16-bit register (aahhh.. Z80 goodness). For example, if we have a 16-bit value in register HL, then when dividing HL by 256 the quotient is just H and the reminder is in L. No shifts or nothing else needed.
rABS with binary alphabet is just a massive simplification of rANS. The basic decoding procedure follows the steps:
decode_next_symbol:
// Find symbol and and related symbol stats
d = state_x // M
r = state_x mod M
symbol,Fs,Is = lookup_symbol(r)
// Calculate the next state
// Is = symbol's cumulative probability
// Fs = symbol's frequency
new_state_x = (d * Fs) + r - Is
// Renormalize new_state_x so that it is >= state lower bound
while (new_state_x < L_BIT_LOW)
b = input_n_bits(L_BITS)
new_state = (new_state << L_BITS) | b
return symbol,new_state_x
In the above we have introduced 2 new constants: L_BITS and L_BITS_LOW. The former defines how many bits we extract from the input stream at once when we decode. For an 8-bit architecture, such as Z80, L_BITS=1 is a good choice. L_BITS_LOW is the lower bound of the state_x. L_BITS_LOW must be chosen so that (L_BITS_LOW -1) << L_BITS does not overflow your register (here we target Z80 i.e. 16-bit register). In my case L_BITS=1 and L_BITS_LOW = 0x10000 >> L_BITS i.e. L_BITS_LOW = 0x8000. Why this? Now I can make use of Z80's 16-bit arithmetic and during the renormalization the while-loop exists when new_state_x sets the sign flag. Neat.. however, I was so hoping to be able to use 16-bit arithmetic overflow instead of sign, but my attempts did not lead anywhere.. yet.
Now let's get into the interesting part i.e. how to do this in 8-bits and Z80.. and cutting all possible corners we can. Most likely some more seasoned Z80 wizard codes this better, so do not hesitate to share those findings then ;) We have two fundamental requirements: M=256 and 0 <= p0 < 256.
; Input:
; HL = state_x; d=H and e=L
; DE = ptr to compressed file
; A = p0
; A' = input bit buffer
;
; Return:
; C_flag = decoded symbol i.e. 0 or 1
; A = p0
; A' = bitbuffer
; B = 0
; HL = new_state_x
; DE = possibly updated ptr to compressed file
decode_symbol:
First step is to find the symbol, its frequency and cumulative probability
; is r < p0
cp l
jr nz, _not_zero
; special case L == A.. Any ideas to avoid jump?
scf
_not_zero: ; symbol = C_flag
push af
push de
ld d,0
jr nc, _symbol_0
_symbol_1: ; r >= p0
; Fs = 256-p0 = -p0 = -A
; Is = p0 = A
ld c,a
neg
db $fe ; Old 'CP nn' trick to avoid jump
_symbol_0: ; r < p0
; Fs = p0 i.e. A
; Is = 0 i.e. D
ld c,d
Symbol is found and pushed into stack (status register C_flag). We also have both Fs and Is figured out using the single input p0 value.. Next calculate new_state_x.
_new_state: ld e,a
ld a,l
; calculate new_state_x, E = Fs, H = d
Multiplication is a classic 8*8->16 bits algorithm. DE holds Fs (i.e. DE < 256 always), H is state_x // 256 implicitly. No need to shift or load anything. L is set to 0, so when adding DE to HL there is never unwanted overflow overwriting bits in H.. and we use H as the bitmask when just shift HL to left an when also add DE into it. Once we have looped 8 times, HL is max 255*255.
_umulHxE_HL: ld b,8
ld l,d
_mul_loop: add hl,hl
jr nc, $+3
add hl,de
djnz _mul_loop
; C_flag is always cleared as last ADD never overflows
; new_state = HL = d * Fs - Is + r
ld e,c
; Always clears C_flag
sbc hl,de
ld e,a
; Sets S_flag if new_state_x >= 0x8000
adc hl,de
pop de
Finally renormalize new_state_x.. make use of the fact that when new_state_x >= L_BIT_LOW sign bit gets set i.e. new_state_x >= 0x8000.
; while (new_state < L_BIT_LOW)
jp m, _end_while
ex af,af'
_while_loop: ; Add one bit at time since L_BITS=1
add a,a
jr nz, _not_empty
ld a,(de)
adc a,a
dec de
_not_empty: adc hl,hl
jp p, _while_loop
ex af,af'
Finally, return symbol in C_flag
_end_while: pop af
That was basically it..
Useful references for further reading
Rice (Unary) & Rice-Golomb encoding links:
- [1] https://en.wikipedia.org/wiki/Unary_coding
- [2] https://en.wikipedia.org/wiki/Golomb_coding
- [3] https://michaeldipperstein.github.io/rice.html
Generic ANS links:Tabled ANS links:rABS links:
- [1] https://en.wikipedia.org/wiki/Unary_coding
- [2] https://en.wikipedia.org/wiki/Golomb_coding
- [3] https://michaeldipperstein.github.io/rice.html