Wednesday, February 4, 2026

rABS with byte-size input in Z80

rABS with byte-size input in Z80

In my previous blog I went through an rABS implementation with one bit at a time input during the state renormalization. Since the mockup Python rABS encoder already supported varying input/output lengths during state renormalization, I decided to experiment with byte-wise decoder side as well. The motivational push for this came from recent Baze/3SC rANS experimentations ;)

On top of nice byte-wise handling of compressed input stream as a bonus I also managed to cut a bit of decoder size. Let's cut the prologue short and check what we have as Z80 implementation. Nothing really changed to the point we start calculating the new ANS state_x. (note: I have swapped DL and HL i.e. HL is now a ptr to compressed data and DE holds ANS state_x).

; with 8 bits input the L_BIT_LOW becomes 0x0100, which allows

; byte-wise input from the compressed data stream during state_x

; renormalization.

L_BITS_ equ 8

L_BIT_LOW_ equ 0x10000 >> L_BITS_

...

_new_state: ; new_state = d * Fs - Is + r

; = d * Fs + (r - Is)

Some unwanted register mangling.. partly because HL and DL are

now swapped from the original design. On the other hand now

the register assignment is more LDD(R) friendly.

ex de,hl

ld e,a

ld d,b

ld a,l

ld l,b

; H = (d = state_x // M)

; L = 0

; DE = Fs

; BC = Is

; A = r = state_x & (M - 1)

Multiplication has not changed.

ld b,8

_umulDxL_HL: add hl,hl

jr nc, $+3

add hl,de

djnz _umulDxL_HL

C_flag is always cleared either after last 'jr nc' or ADD

(which never overflows).

sbc hl,bc

ld c,a

add hl,bc

ex de,hl

pop hl

Renormalization 'while (new_state_x < L_BIT_LOW)' loop. Now that

our L_BIT_LOW == 256, there's a nice optimization available. We

only need to check if ANS state_x high byte is zero (i.e. D, since

state_x is kept in DE) and everything less than 256 is [0..255].

Since we input a whole byte at time there is no need for a

loop.. or well, right.. if state_x would be 0 before renormalization

then yes, we would input 2 bytes. I guess that cannot ever happen,

can it? ;)

ld a,d

and a

jr nz, _end_while

ld d,e

ld e,(hl)

dec hl

_end_while: pop af

ret

Then there would be more size optimizations possible if we were to drop the fancy multiplication routine and replace it with a brute force add-loop. Now new ANS state_x calculation would become something like below.. But is 6 bytes saving worth doing a brute force multiplication? Anyway, Z80 source code is available in my github. Although the "multiplication" loop setup looks suspicious i.e. A = d = state_x // M, the loop counter A is always at least 1. The renormalization during previous round made sure that state_x >= 0x100.

...

_new_state: ; new_state = d * Fs - Is + r

; = d * Fs + (r - Is)

ld h,b

ld l,e ; HL = r

ld e,a

ld a,d ; A = d = state_x // M

ld d,b ; DE = Fs

and a ; BC = Is

sbc hl,bc ; HL = r - Is

_umul_HL: add hl,de

dec a

jr nz, _umul_HL

ex de,hl

pop hl

; while (new_state < L_BIT_LOW)

; L_BIT_LOW == 256

or d

...

Sunday, February 1, 2026

ANS experimentations with 8-bit systems

ANS experimentations with 8-bit systems

Updated 9-Feb-26.
Since ANS came out I always wanted to try it, but honestly, never got around to do any experimentation until summer 2025. My procrastination abilities are amazing and maths is not really my strongest super power. The older I get the less gets into my head, and stays there.. And not helped by the fact that the stuff I do at work tends to reserve the last remaining brain cells I have left. Anyway, the best way to learn is by doing. My main goal was to make ANS implementations whose decoders are suitable for 8-bit systems, mainly the best ever, Z80.
I ended up implementing mock up code in Python and C++ for all three main ANS "variants":
rANS a generic arithmetic coding replacement [4]
tANS a fully table driven ANS variant, which can be seen as a Huffman coding replacement
rABS a two symbol (0 and 1 bits) rANS variant
Decoders in Z80 assembly are also available for tANS and rABS encoders. The entire code base is available in my github page.
This sad blog is going to concentrate on tANS and rABS issues/challenges I faced when tinkering with the Z80 assembly versions of the decoders. Again here, I was willing to sacrifice code size over the speed.
For the tANS implementation I spent long time thinking how to encode symbol frequencies into the final output file. Outputting 8 bits for every used or non-used symbol seemed a bit wasteful. We'll discuss the symbol frequency encoding problem more later, but as a teaser I ended up using Rice-Golomb [1,2,3] codes to encode symbol frequencies.
ANS has an interesting property that decoding is done in a reverse order to encoding. Internet told me most people have solved the problem by reversing the input file before encoding or encoded the input file from end to beginning. In my typical use-cases I prefer in-place decompression where the encoded file is at the end overwritten by the decoded file. Most of the good old LZ-based in-place compression and decompression approaches I am aware of first compress/encode the input file from beginning to end using forward match references, and decompress/decode from end of the compressed file towards the beginning. Such arrangement is perfect fit for ANS, since one can use ANS encoder "naturally" and once the input file is encoded we have the final ANS state. The following pictures try to illustrate what I am talking about.
In most cases there needs to be a bit of "security distance" (Seclen 0 to n bytes) between compressed and decompressed memory areas. Basically one needs to place the compressed file few bytes lower in memory than the start of the decompressed memory area. This is to avoid decoder and writing over the compressed file before reading it. Sometimes input file for compression is such that the "security distance" can be 0 bytes, which is the ideal case.

Tabled ANS i.e. tANS

There are three topics to discuss in more detail:
encoding symbol frequencies into the output file
creation of tANS decoding tables
tANS Z80 decoder
Design based limitations:

Maximum size of the alphabet is 256, and symbols are from 0 to 255.
All symbol frequencies must fit into 8 bits, which means the maximum single symbol frequency is 255 or less. 0 frequency is reserved for "no symbol" marker.
The sum of all symbol frequencies must be a power of 2.
Total sum of all symbol frequencies is maximum 256, but not a single symbol can have the maximum frequency (see previous bullet). This implies there should always be at least 2 symbols in the encoded file.

Encoding symbol frequencies into the output file

I wasted quite a bit of time thinking how to create a table of all symbol frequencies which would comply with the above design limitations but also have a single symbol frequency as a power of 2 value. Despite of my attempts that did not work out and I ended up just ensuring the total sum of all symbols was a power of 2. The basis of my implementation follows the Integer Normalization algorithm [6]. The decoder side is simpler, since all transmitted output file symbol frequencies are already normalized to a total power of 2 sum.
Let's assume the total sum (i.e. M) of symbols is 32. We need 5 bits to encode every symbol frequency in the transmitted output file. How could we encode symbols frequencies in less than 5 bits, especially symbols that do not exist (i.e. those with frequency of 0)?? This was so easy with Huffman and canonical codes.. Anyway, Internet gave me a hint that one could use e.g. Rice or Rice-Golomb codes. I ended up experimenting with Rice codes with shift value K=2 bits.
So, we always encode K lowest bits of a value and unary encode the quotient after shifting the value K bits to right. Values between 0 to (1<<K)-1 take K+1 bits to encode. In the unary part 0 bit is the end marker.
  uint32_t r,q;
  int total_bits, n = 1;
  r = value & ((1 << K) - 1);
  q = value >> K;
  value = 0;
  while (q > 0) {
  value = (value + 1) << 1;
  --q; ++n;
  }
  value = (value << K) | r;
  total_bits = n + K;

For example, assume M=5 and K=2, then a value 14 Rice encoded would become:
  1110 10.  // Total 6 bits i.e. 1 bits loss
And a value 3 Rice encoded would become:
  0 11      // Total 3 bits i.e. 2 bits gain

Creating tANS decoding tables

Once we have all the symbol frequencies known and a total sum of a power of 2, when can generate tables to help decoding. We only need 2 tables, one table L_ for ANS state_x to symbol s mapping and another table Y_ for next ANS state_x after each decoded symbol. I found this blog [5] to be excellent describing how the basic principle of table generation works.
Let's assume we have symbols S={a,b,c,d,e,f,g,h}, and total sum M of symbols is 32. The symbol frequencies are F={2,6,9,0,0,2,6,7}. Next we are going to generate two tables with M entries in both. Note that the alphabet size can be and should be smaller than M, and does not need to be power of 2.
The selection of the "spreading function" for tANS table generation seems to be of greater science, however, I picked something very simple i.e. a plain number that allows me to visit every table entry without a collision. Prime numbers are great for that purpose. Lets define our spreading function f(state_x) be "(state_x+5) mod M". We also need to pick up the initial starting state i.e. INITIAL_STATE. That is an encoder problem, but a tANS decoder has to know it also when tables are build. My INITIAL_STATE was selected for this blog example somehow to be 3.

One of the key ideas in tANS is to keep the ANS state_x value within a certain range [L..2L), where in our case L==M. If M is 32 then the state_x has to remain between [32..64). When we look up the next state_x from the decoding tables, the stored value is always less than L, and we need to input k more bits until state_x is again between [L..2L). Here comes in a nice simplification: why bother with state_x range [L..2L), which it is actually the same as [0..M). It would make decoder and table handling much simpler. So, my decoder keeps the state_x between [0..M) and not [L..2L).

For any symbol s in S with frequency f>0, create n, where f<=n<2f, entries in table Y_[state_x]=n and table L_[state_x]=s. For each step after entering values into the table update state_x=f(state_x). The picture below illustrates what kind of tables we generate. The table k is not needed, but added here to show how many bits the decoder needs to input on each step to get to the next state. The arrows on state_x indices show where the next symbol staring point is after current symbol is done e.g. the last state for symbol b is 6 and the first state for symbol c to start is 11.

One possible Z80 implementation looks like the one below. It is not necessarily the most optimized and compact.

; IX = ptr to Y_ and L_ tables, both M_ bytes, so total 2M_ bytes
; A = initial state_x i.e. INITIAL_STATE
; B = number of symbols in our alphabet.
tans_build_decoding_tables:
ld de,0x8000
; E = s(ymbol), starting from 0
_main_loop:
push bc
push af
EXTRACT_SYMBOL_FREQ_TO_A
jr z, _zero_symbol
ld b,a ; B = f(requency)
ld c,a ; C = f(requency)
pop af
_per_symbol_loop:
push ix
push af
add a,ixl ; Can be optimized if IX is 256 byte aligned
ld ixl,a
; Y_[state_x] = n
ld (ix+Y_),c
inc c
; L_[state_x] = s
ld (ix+L_),e
; spread step i.e. state_x = f(state_x)
pop af
pop ix
add a,SPREAD_STEP
; make sure 0 <= state_x < M
and M_-1
;
djnz _per_symbol_loop
; Trick to avoid a jump i.e. CP nn; POP AF has already been done
db $fe
_zero_symbol:
pop af
pop bc
inc e
djnz _main_loop

tANS Z80 decoder

After we have generated both table Y_ and table L_, the decoder is extremely simple. Since we have no table k for how many bits to input from the compressed data stream, we just loop until state_x>=M. The decoder must always input at least 1 bit i.e. implicitly k>=1.
One possible Z80 implementation of the tANS decoder looks like the one below. I believe more seasoned Z80 wizards can optimize it even further :)
; Inputs:
; IX = ptr to Y_ and L_ tables, both M_ bytes, so total 2M_ bytes
; HL = ptr to compressed data
; C = input bit buffer
; A = state_x
; Return:
; A = new state_x
; B = decoded symbol
; Modifies:
; C and HL
tans_decode_symbol:
; Possible optimization here. If IX starts from 256 byte
; boundary then the 'add' below can be omitted, and just
; use the 'ld' directly.
add a,ixl
ld ixl,a
; get a symbol
ld b,(ix+L_)
ld a,(ix+Y_)
; k-tableless adding of bits for the next state
_do_while:
sla c
jr nz, _not_empty
dec hl
ld c,(hl)
rl c
_not_empty:
adc a,a
cp M_
jr c, _do_while
; Scale state between [0..M) or rather [L..2L)
and M_-1

Binary symbol distribution i.e. rAbS

First, we need to have some design limitations to make rABS a fit for 8-bit architectures with 16-bit registers (aaahh.. Z80 goodness), no multiplication and no division. For implementation simplicity sake when modelling symbol probabilities (for 0 or 1), the symbol frequencies/probabilities use 1 byte i.e. the range is [1..255] per context. Total of symbol frequencies is 256 (for symbols 0 and 1). Since our alphabet is only 2 symbols, we need only maintain the probability for one symbol and that probability can also be used as a cumulative frequency. Both symbol probability and cumulative frequency book keeping we get away with just one 8-bit value. This indeed looks good.
My implementation uses probability of 0 i.e. p0 as the book keeping variable. The p0 is actually p0/256 as a probability and just p0 as a symbol frequency. The probability p1 of symbol 1 is then p1=1-p0 i.e. p1=256-p0, which is in 8-bit values just -p0. Similarly, the cumulative frequency for symbol 0 is 0, and for symbol 1 it is p0.
In case of dynamic updates to symbol probabilities there is a great article by Ryg on "Models for adaptive arithmetic coding" [8] that can be ~directly adapted for rABS purposes. Ferris' blog [7] discussed the same modelling even further from 8-bit systems point of view. I am not going to repeat the modelling principles again in this blog entry, instead click the provided links and read from there :)
It is important that the total sum of symbol frequencies is 256. This allows us to define M=256, and we will later use as M a divider. When M=256 we can replace a division with 8 right bit shifts. Or even better in our case by picking up the high 8-bit register of a combined 16-bit register (aahhh.. Z80 goodness). For example, if we have a 16-bit value in register HL, then when dividing HL by 256 the quotient is just H and the reminder is in L. No shifts or nothing else needed.
rABS with binary alphabet is just a massive simplification of rANS. The basic decoding procedure follows the steps:
decode_next_symbol:
// Find symbol and and related symbol stats
d = state_x // M
r = state_x mod M
symbol,Fs,Is = lookup_symbol(r)

// Calculate the next state
// Is = symbol's cumulative frequency
// Fs = symbol's frequency
new_state_x = (d * Fs) + r - Is

// Renormalize new_state_x so that it is >= state lower bound
while (new_state_x < L_BIT_LOW)
b = input_n_bits(L_BITS)
new_state = (new_state << L_BITS) | b

return symbol,new_state_x
In the above we have introduced 2 new constants: L_BITS and L_BITS_LOW. The former defines how many bits we extract from the input stream at once when we decode. For an 8-bit architecture, such as Z80, L_BITS=1 is a good choice. L_BITS_LOW is the lower bound of the state_x. L_BITS_LOW must be chosen so that (L_BITS_LOW -1) << L_BITS does not overflow your register (here we target Z80 i.e. 16-bit register). In my case L_BITS=1 and L_BITS_LOW = 0x10000 >> L_BITS i.e. L_BITS_LOW = 0x8000. Why this? Now I can make use of Z80's 16-bit arithmetic and during the renormalization the while-loop exists when new_state_x sets the sign flag. Neat.. however, I was so hoping to be able to use 16-bit arithmetic overflow instead of sign, but my attempts did not lead anywhere.. yet.
Now let's get into the interesting part i.e. how to do this in 8-bits and Z80.. and cutting all possible corners we can. Most likely some more seasoned Z80 wizard codes this better, so do not hesitate to share those findings then ;) We have two fundamental requirements: M=256 and 0 <= p0 < 256.
; Input:
; HL = state_x; d=H and e=L
; DE = ptr to compressed file
; A = p0
; A' = input bit buffer
;
; Return:
; C_flag = decoded symbol i.e. 0 or 1
; A = p0
; A' = bitbuffer
; B = 0
; HL = new_state_x
; DE = possibly updated ptr to compressed file
decode_symbol:
First step is to find the symbol, its frequency and cumulative frequency
; is r < p0
cp l
jr nz, _not_zero
; special case L == A.. Any ideas to avoid jump?
scf
_not_zero: ; symbol = C_flag
push af
push de
ld d,0
jr nc, _symbol_0
_symbol_1:   ; r >= p0
  ; Fs = 256-p0 = -p0 = -A
  ; Is = p0 = A
  ld   c,a
  neg
db $fe ; Old 'CP nn' trick to avoid jump
_symbol_0: ; r < p0
; Fs = p0 i.e. A
; Is = 0 i.e. D
ld c,d
Symbol is found and pushed into stack (status register C_flag). We also have both Fs and Is figured out using the single input p0 value.. Next calculate new_state_x.
_new_state: ld e,a
ld a,l
; calculate new_state_x, E = Fs, H = d
Multiplication is a classic 88->16 bits algorithm. DE holds Fs (i.e. DE < 256 always), H is state_x // 256 implicitly. No need to shift or load anything. L is set to 0, so when adding DE to HL there is never unwanted overflow overwriting bits in H.. and we use H as the bitmask when just shift HL to left an when also add DE into it. Once we have looped 8 times, HL is max 255255.
_umulHxE_HL: ld b,8
ld l,d
_mul_loop: add hl,hl
jr nc, $+3
add hl,de
djnz _mul_loop
; C_flag is always cleared as last ADD never overflows
; new_state = HL = d * Fs - Is + r
ld e,c
  ; Always clears C_flag
sbc hl,de
ld e,a
        ; Sets S_flag if new_state_x >= 0x8000
adc hl,de
pop de
Finally renormalize new_state_x.. make use of the fact that when new_state_x >= L_BIT_LOW sign bit gets set i.e. new_state_x >= 0x8000.
; while (new_state < L_BIT_LOW)
jp m, _end_while
ex af,af'
_while_loop: ; Add one bit at time since L_BITS=1
add a,a
jr nz, _not_empty
ld a,(de)
adc a,a
dec de
_not_empty: adc hl,hl
jp p, _while_loop
ex af,af'
Finally, return symbol in C_flag
_end_while: pop af

That was basically it.. See a follow-up blog on byte-wise input during renormalization step here: rABS with byte-size input in z80

Useful references for further reading

Rice (Unary) & Rice-Golomb encoding links:
[1] https://en.wikipedia.org/wiki/Unary_coding
[2] https://en.wikipedia.org/wiki/Golomb_coding
[3] https://michaeldipperstein.github.io/rice.html
Generic ANS links:
[4] https://github.com/rygorous/ryg_rans
Tabled ANS links:
[5] https://medium.com/@bredelet/understanding-ans-coding-through-examples-d1bebfc7e076
[6] https://stackoverflow.com/questions/31121591/normalizing-integers
rABS links:
[7] https://yupferris.github.io/blog/2019/02/11/rANS-on-6502.html
[8] https://fgiesen.wordpress.com/2015/05/26/models-for-adaptive-arithmetic-coding/

Subscribe to: Comments (Atom)