rABS with byte-size input in Z80

In my previous blog post I went through an rABS implementation with one-bit-at-a-time input during state renormalization. Since the mockup Python rABS encoder already supported varying input/output lengths during state renormalization, I decided to experiment with a byte-wise decoder side as well. The motivational push for this came from the recent rANS experiments by Baze/3SC ;)

On top of the nice byte-wise handling of the compressed input stream, as a bonus I also managed to cut the decoder size a bit. Let's cut the prologue short and check what we have as a Z80 implementation. Nothing really changes up to the point where we start calculating the new ANS state_x. (Note: I have swapped DE and HL, i.e. HL is now a pointer to the compressed data and DE holds the ANS state_x.)

; With 8 bits of input, L_BIT_LOW becomes 0x0100, which allows
; byte-wise input from the compressed data stream during state_x
; renormalization.
L_BITS_      equ 8
L_BIT_LOW_   equ 0x10000 >> L_BITS_
...
_new_state:  ; new_state = d * Fs - Is + r
             ;           = d * Fs + (r - Is)
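To make the numbers concrete, here is a minimal Python sketch of the constants and of the state split. It assumes M = 256, which is what the Z80 code implies: d = state_x // M ends up in H, the high byte of state_x, and r is its low byte. The helper names `split_state` and `new_state` are hypothetical and not taken from the author's Python mockup encoder.

```python
# Hypothetical Python model of the decoder constants (not the author's mockup).
L_BITS = 8                       # bits read per renormalization step
L_BIT_LOW = 0x10000 >> L_BITS    # lower bound of the 16-bit state: 0x100
M = 256                          # assumed: d = high byte, r = low byte of state_x


def split_state(state_x):
    """Return (d, r) = (state_x // M, state_x & (M - 1))."""
    return state_x // M, state_x & (M - 1)


def new_state(state_x, fs, is_):
    """new_state = d * Fs + (r - Is) for the symbol whose slot holds r."""
    d, r = split_state(state_x)
    if not is_ <= r < is_ + fs:
        raise ValueError("r is not inside the slot [Is, Is + Fs)")
    return d * fs + (r - is_)


print(hex(L_BIT_LOW))             # 0x100
print(split_state(0x1234))        # (18, 52)
print(new_state(0x1234, 100, 0))  # 1852 = 18 * 100 + 52
```

Since every valid state_x lies in [0x100, 0xFFFF], d is always in [1, 255].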

Some unwanted register mangling.. partly because HL and DE are now swapped from the original design. On the other hand, the register assignment is now more LDD(R) friendly (HL is the source pointer, and it is read backwards, just like LDD(R) uses it).

        ex      de,hl
        ld      e,a
        ld      d,b
        ld      a,l
        ld      l,b
        ; H  = (d = state_x // M)
        ; L  = 0
        ; DE = Fs
        ; BC = Is
        ; A  = r = state_x & (M - 1)
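The shuffle is easy to get wrong, so here is a hypothetical Python replay of the five instructions above on a dict of 8-bit registers. It assumes what the register comments imply on entry: DE = state_x, A = Fs, and BC = Is with B == 0 (Is fits in 8 bits), while HL holds a value that may be overwritten (the routine later reloads the data pointer with `pop hl`). The `regs` dict and the function name are illustrative only.

```python
def replay_prologue(state_x, hl, fs, is_):
    """Replay ex de,hl / ld e,a / ld d,b / ld a,l / ld l,b on 8-bit registers."""
    regs = {
        "d": state_x >> 8, "e": state_x & 0xFF,   # DE = state_x
        "h": hl >> 8, "l": hl & 0xFF,             # HL = value that gets overwritten
        "a": fs,                                   # assumed: A = Fs on entry
        "b": is_ >> 8, "c": is_ & 0xFF,           # BC = Is, B == 0 since Is < 256
    }
    regs["d"], regs["h"] = regs["h"], regs["d"]   # ex de,hl (high bytes)
    regs["e"], regs["l"] = regs["l"], regs["e"]   # ex de,hl (low bytes)
    regs["e"] = regs["a"]                          # ld e,a
    regs["d"] = regs["b"]                          # ld d,b
    regs["a"] = regs["l"]                          # ld a,l
    regs["l"] = regs["b"]                          # ld l,b
    return regs


regs = replay_prologue(0x1234, 0xC000, 100, 40)
print(regs["h"], regs["l"])            # 18 0   -> H = d, L = 0
print((regs["d"] << 8) | regs["e"])    # 100    -> DE = Fs
print(regs["a"])                       # 52     -> A = r
```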

Multiplication has not changed.

        ld      b,8
_umulDxL_HL:
        add     hl,hl
        jr      nc,$+3
        add     hl,de
        djnz    _umulDxL_HL
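A small Python model of the shift-and-add loop can check two things the code relies on: that HL ends up as H * DE, and that the C flag is clear when the loop exits. The model assumes DE = Fs < 256 (D was loaded from B == 0); `umul_dxl_hl` is a hypothetical name mirroring the `_umulDxL_HL` label.

```python
def umul_dxl_hl(h, de):
    """Model of _umulDxL_HL: returns (HL, C flag) after HL = H * DE."""
    hl = h << 8                      # H = multiplier d, L = 0
    carry = False
    for _ in range(8):               # ld b,8 ... djnz (DJNZ leaves flags alone)
        hl <<= 1                     # add hl,hl: top bit of H goes to C
        carry = hl > 0xFFFF
        hl &= 0xFFFF
        if carry:                    # jr nc,$+3 skips the add when C is clear
            hl += de                 # add hl,de
            carry = hl > 0xFFFF
            hl &= 0xFFFF
    return hl, carry


# Exhaustive check over every 8-bit d and every 8-bit Fs.
for d in range(256):
    for fs in range(256):
        assert umul_dxl_hl(d, fs) == (d * fs, False)
print("HL == d * Fs and C clear for all inputs")
```

The exhaustive loop is cheap (65536 cases) and backs up the C-flag claim that follows: for 8-bit d and Fs the product is at most 255 * 255 = 65025, so the last ADD cannot carry out of HL.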

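The C flag is always clear at this point: either the last 'jr nc' was taken (so C was already clear), or the final ADD ran, and it never overflows because d * Fs <= 255 * 255 fits in 16 bits.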

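        sbc     hl,bc       ; HL = d * Fs - Is
        ld      c,a         ; B == 0 after DJNZ, so BC = r
        add     hl,bc       ; HL = d * Fs + (r - Is)
        ex      de,hl       ; DE = new state_x
        pop     hl          ; HL = compressed data pointer again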
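One detail worth spelling out: the intermediate HL = d * Fs - Is can go below zero (for example d = 1, Fs = 56, Is = 200), yet the final HL is still exact. All of this is 16-bit modular arithmetic, and the true result d * Fs + (r - Is) is always below (d + 1) * Fs <= 256 * 255, so it fits in [0, 0xFFFF]. A hypothetical Python model of the three instructions, again assuming B == 0 after the DJNZ loop:

```python
def finish_new_state(product, is_, r):
    """Model of sbc hl,bc / ld c,a / add hl,bc with 16-bit wrap-around."""
    hl = (product - is_) & 0xFFFF    # sbc hl,bc (C is already clear)
    hl = (hl + r) & 0xFFFF           # B == 0, so after ld c,a BC = r
    return hl


d, fs, is_, r = 1, 56, 200, 205      # symbol slot [200, 256), so r - Is = 5
print(hex((d * fs - is_) & 0xFFFF))      # 0xff70: the intermediate HL wrapped
print(finish_new_state(d * fs, is_, r))  # 61 = 1 * 56 + (205 - 200)
```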

Renormalization: the 'while (new_state_x < L_BIT_LOW)' loop. Now that our L_BIT_LOW == 256, there's a nice optimization available. We only need to check whether the high byte of the ANS state_x is zero (i.e. D, since state_x is kept in DE), because every value less than 256 lies in [0..255]. Since we input a whole byte at a time, there is no need for a loop.. or well, right.. if state_x were 0 before renormalization, then yes, we would need to input 2 bytes. I guess that cannot ever happen, can it? ;) (It cannot: d >= 1 and Fs >= 1, so the new state_x is at least 1, and a single input byte always lifts it to 0x100 or above.)

        ld      a,d
        and     a
        jr      nz,_end_while
        ld      d,e
        ld      e,(hl)
        dec     hl
_end_while:
        pop     af
        ret
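Putting the pieces together, here is a hypothetical Python model of one decode step with the loop-free renormalization, reading the compressed stream backwards as `dec hl` does. The binary split (bit 0 owns the slots [0, F0), bit 1 owns [F0, 256)) is a standard rABS convention assumed here; the post does not show how the decoder picks Fs and Is. The function `one_byte_is_enough` checks exhaustively that a single input byte always suffices.

```python
M = 256              # d = state_x >> 8, r = state_x & 0xFF
L_BIT_LOW = 0x100


def slot(r, f0):
    """Assumed binary split: r < F0 -> bit 0, else bit 1. Returns (bit, Fs, Is)."""
    return (0, f0, 0) if r < f0 else (1, M - f0, f0)


def renormalize(state_x, data, ptr):
    """Loop-free renormalization: at most one byte, read backwards."""
    if state_x >> 8 == 0:                               # ld a,d / and a / jr nz
        state_x = ((state_x & 0xFF) << 8) | data[ptr]   # ld d,e / ld e,(hl)
        ptr -= 1                                        # dec hl
    return state_x, ptr


def decode_bit(state_x, f0, data, ptr):
    """One hypothetical rABS decode step; returns (bit, state_x, ptr)."""
    d, r = state_x >> 8, state_x & 0xFF
    bit, fs, is_ = slot(r, f0)
    state_x, ptr = renormalize(d * fs + (r - is_), data, ptr)
    return bit, state_x, ptr


def one_byte_is_enough(f0):
    """True if every valid state_x lands back in [0x100, 0xFFFF] after one step."""
    for x in range(L_BIT_LOW, 0x10000):
        for byte in (0x00, 0xFF):
            _, nx, _ = decode_bit(x, f0, [byte], 0)
            if not L_BIT_LOW <= nx <= 0xFFFF:
                return False
    return True


print(decode_bit(0x1234, 128, [0xAB, 0xCD], 1))  # (0, 2356, 1): no input needed
print(decode_bit(0x01FF, 128, [0xAB, 0xCD], 1))  # (1, 65485, 0): one byte read
print(one_byte_is_enough(128))                    # True
```

F0 must stay in [1, 255] so that both symbols keep Fs >= 1; that is exactly the condition that keeps the new state_x at 1 or above.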

There are more size optimizations possible if we drop the fancy multiplication routine and replace it with a brute-force add loop. The new ANS state_x calculation would then become something like below. But is a 6-byte saving worth a brute-force multiplication that loops d times (up to 255) instead of a fixed 8? Anyway, the Z80 source code is available on my GitHub. Although the "multiplication" loop setup looks suspicious, i.e. A = d = state_x // M is used directly as the loop counter, A is always at least 1: the renormalization during the previous round made sure that state_x >= 0x100.

...
_new_state:  ; new_state = d * Fs - Is + r
             ;           = d * Fs + (r - Is)
        ld      h,b
        ld      l,e         ; HL = r
        ld      e,a
        ld      a,d         ; A = d = state_x // M
        ld      d,b         ; DE = Fs
        and     a           ; BC = Is, clear C
        sbc     hl,bc       ; HL = r - Is
_umul_HL:
        add     hl,de
        dec     a
        jr      nz,_umul_HL
        ex      de,hl
        pop     hl
        ; while (new_state < L_BIT_LOW)
        ; L_BIT_LOW == 256
        or      d           ; A == 0 here, so Z is set iff D == 0
...
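A hypothetical Python model of the brute-force version makes the trade-off visible: the result matches d * Fs + (r - Is), `or d` works as the zero test only because A is 0 when the loop exits, and the loop runs d times instead of a fixed 8. It also shows why d >= 1 matters: with d = 0, `dec a` wraps to 255 and the loop would run 256 times. The register-entry assumptions are the ones implied by the listing (DE = state_x, A = Fs, BC = Is with B == 0).

```python
def brute_force_new_state(state_x, fs, is_):
    """Model of the add-loop _new_state; returns (DE, Z flag of 'or d', iterations)."""
    d, r = state_x >> 8, state_x & 0xFF
    hl = (r - is_) & 0xFFFF          # ld h,b / ld l,e / and a / sbc hl,bc
    a = d                            # ld a,d: the loop counter
    iterations = 0
    while True:
        hl = (hl + fs) & 0xFFFF      # add hl,de
        a = (a - 1) & 0xFF           # dec a
        iterations += 1
        if a == 0:                   # jr nz,_umul_HL falls through when A == 0
            break
    de = hl                          # ex de,hl
    zero = (a | (de >> 8)) == 0      # or d: A == 0, so Z is set iff D == 0
    return de, zero, iterations


print(brute_force_new_state(0x1234, 100, 0))     # (1852, False, 18)
print(brute_force_new_state(0x01FF, 128, 128))   # (255, True, 1): renormalize
print(brute_force_new_state(0x0034, 100, 0)[2])  # 256: d == 0 would spin 256 times
```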

Cracking Stones

Wednesday, February 4, 2026