Sunday, July 18, 2021

Smarty and the Nasty Gluttons - part 1 (compression)

Preface

Recently, i.e., about a year ago, a bunch of old Amiga demo sceners released a game that was never finished back in the 1990s. See the official Smarty and the Nasty Gluttons page for more details. I have long wished to write about the stuff I have done with my beloved Amigas, and documenting the tech I put into Smarty and the Nasty Gluttons is a good starting point. The master plan is to do write-ups on:

  • Compression (S405/STC5) used in the game (this post)
  • Disk system and the FFS file loader
  • Bootblock(s) including the obfuscation and encryption pieces
    • The boxed game version
    • The online ADF game version


Short history

StoneCracker "S405" or "STC5" was used to compress all data in the official release of the game "Smarty and the Nasty Gluttons". It is used for a standalone RAM to RAM file decompression and also integrated into the game disk system file loader allowing on the fly decompression of files into RAM.

S405 is the "last" of the StoneCracker cruncher series. Originally started later part of the 1990s and intended to be the next generation of the slightly successful S404 found in StoneCracker 4.10.3 which dates back to January 1994. Based of the references found in my old codes the development for the first public'ish release was done in 1998-9.

During 2002-2004, S405 appeared as a 32-bit ARMv4 version in the GP32 scene (and at least once on the GBA). Anyhow, I think it was 2015 when the decompressor was rewritten in mc68000 to be used in the Flashtro & Scoopex release of the Smarty and the Nasty Gluttons playable preview. S405 is a "look ma, no hands" product, i.e., a series of reinventions of the wheel with a lot of tinkering and without spending too much time studying the literature. The retrofitting of S405 into the Smarty and the Nasty Gluttons file loader was done in the summer of 2020.


S405 short tech notes

S405 is very (G)ZIP-like but to my recollection performs marginally better - so it is not state of the art in any way. It is basic LZSS with a Huffman backend, a 32K sliding window, a minimum match length of 2 and a maximum match length of 256. The original file is compressed in 32K blocks, Huffman trees are recalculated for each block, and the maximum length of any prefix code is 15 bits (trees are rebalanced if some prefix becomes longer than 15 bits). The match search algorithm uses simple "lazy evaluation", hash tables for linked lists, context confirmation for found hashes and simple cut-off heuristics for pathetic searches. This is very basic stuff but appeared very fancy back then to a young c0der lad. Maybe one could get better compression just by having a modern match finding algorithm instead of "lazy evaluation"...
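To make the match search concrete, below is a minimal C sketch of a hash-chained match finder in the spirit described above. All names and tuning values (HASH_SIZE, MAX_CHAIN) are mine for illustration and not from the actual S405 source:

#include <stddef.h>
#include <stdint.h>

#define WIN_SIZE  32768   /* 32K sliding window                */
#define MIN_MATCH 2       /* shortest match S405 encodes       */
#define MAX_MATCH 256     /* longest match S405 encodes        */
#define HASH_SIZE 4096    /* illustrative hash table size      */
#define MAX_CHAIN 64      /* cut-off for hopeless searches     */

/* Hash the next two bytes (the minimum match length). */
static unsigned hash2(const uint8_t *p)
{
    return (((unsigned)p[0] << 4) ^ p[1]) & (HASH_SIZE - 1);
}

/* Find the longest match for src[pos..] within the sliding window.
 * head[] holds the most recent position for each hash value and
 * prev[] chains earlier positions that share the same hash.        */
static size_t find_match(const uint8_t *src, size_t pos, size_t len,
                         const long *head, const long *prev,
                         size_t *match_pos)
{
    size_t best = 0;
    size_t max = (len - pos < MAX_MATCH) ? len - pos : MAX_MATCH;
    int chain = MAX_CHAIN;
    long cand = head[hash2(src + pos)];

    while (cand >= 0 && pos - (size_t)cand < WIN_SIZE && chain--) {
        /* context confirmation: verify the candidate really matches */
        size_t n = 0;
        while (n < max && src[cand + n] == src[pos + n])
            n++;
        if (n > best) {
            best = n;
            *match_pos = (size_t)cand;
        }
        cand = prev[cand % WIN_SIZE];   /* follow the hash chain */
    }
    return (best >= MIN_MATCH) ? best : 0;
}

Lazy evaluation then amounts to running find_match() again at pos+1 and emitting a literal first if the deferred match turns out longer.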

The "history references" are to history, not "forward references" as most of the demo scene crunchers had in 1990s. The former allows easy streaming of data for both compression and decompression, while the latter makes in-place decompression easier (the decompression is done from higher to lower memory).

S405 uses a total of 4 Huffman trees:

  1. Pre-tree (PRE -> 16 symbols) used for encoding:
    • 3 other Huffman trees that are transmitted as a part of each 32K block.
  2. Literals+match tree (LTM -> 512 symbols) for encoding:
    • a literal (symbols 0 to 255).
    • end of block (symbol 256).
    • previous match reference (PRM) (symbol 257).
    • match length (symbols 258 to 511).
  3. High tree (HGH -> 256 symbols) for encoding:
    • short offset from 1 to 127 bytes in history (symbols 1 to 127).
    • high 7 bits of a 15 bit long offset (symbols 128 to 255).
    • previous long offset (symbol 0).
  4. Low tree (LOW -> 256 symbols) for encoding:
    • low 8 bits of a new 15 bit long offset (0 to 255).

The previous match reference (PRM) has a history of 1 previously used offset (short or long) and its match length. The use of PRM is a bit lame since both the match length and the offset need to match. In any case, both PRM and the "previous long offset" help compression enough to be usable.

As can be seen, the encoding of S405 literal, match and offset "tags" is such that changing to a different entropy coder backend would be really easy. Except for the block framing information, no raw bit strings as such are encoded into the compressed stream. The intention was at some point to use a Range Coder backend. Maybe an ANS coder some day, heh ;)
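As an illustration, the following C sketch shows how the tags could map onto the tree symbols listed above. huff_put() is an assumed helper that emits one Huffman-coded symbol from the given tree, and the exact length-to-symbol bias is my assumption; this is not the actual S405 interface:

#include <stdint.h>

enum tree_id { LTM, HGH, LOW };
/* assumed helper: emit the Huffman code of 'sym' from 'tree' */
void huff_put(enum tree_id tree, unsigned sym);

static unsigned prev_long;   /* previously used long offset */

static void put_literal(uint8_t c) { huff_put(LTM, c);   }
static void put_eob(void)          { huff_put(LTM, 256); }
static void put_prm(void)          { huff_put(LTM, 257); }

/* len is 2..256, off is 1..32767 (15 bits) */
static void put_match(unsigned len, unsigned off)
{
    huff_put(LTM, 258 + len - 2);        /* length -> symbols 258..511 */
    if (off < 128) {
        huff_put(HGH, off);              /* short offset 1..127        */
    } else if (off == prev_long) {
        huff_put(HGH, 0);                /* reuse previous long offset */
    } else {
        huff_put(HGH, 128 + (off >> 8)); /* high 7 bits of long offset */
        huff_put(LOW, off & 0xff);       /* low 8 bits of long offset  */
        prev_long = off;
    }
}

Note how everything, including offsets, goes through one of the trees; nothing is written as raw bits, which is exactly what makes swapping the entropy coder backend easy.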

Due to my lack of proper understanding of compression theory, the construction of the trees is what it is and could be better. For example, the HGH tree mixes two very differently behaving sets of input data: short offsets and the 7 high bits of 15-bit long offsets. I would assume that these two data sets do not really support each other in creating a good probability distribution when placed into the same tree.

The latest S405 mc68000 decompression routine needs 2816 bytes of RAM to store all 4 Huffman trees, lookup tables and other work space simultaneously during decompression. A small decompressor code size and as low as possible runtime RAM usage have always had priority over other aspects.. such as decompression speed or compression efficiency.


S405 file structure

The S405 compressed file has the following structure:

4 bytes of ID
repeat until 2 bytes of BLOCK_SIZE = 0
  if 0 < BLOCK_SIZE < 32768 then
    6 bytes of pre-tree
    i bytes of LTM tree
    j bytes of HGH tree
    k bytes of LOW tree
    n bytes of compressed data.
  if BLOCK_SIZE > 32768 then
    (BLOCK_SIZE & 0x7fff) bytes of 8 bit literals
4 bytes of the original file size


All numbers are in Big Endian and blocks are 16 bit aligned.

The ID word is as follows:

 3   2   1   0 0
 1   3   5   7 0
+---+---+---+---+
|'S'|'4'|'0'|'5'|
+---+---+---+---+

The BLOCK_SIZE is encoded as follows:

 1 1             0
 5 4             0
+-+---------------+
|C| BLOCK_SIZE    |
+-+---------------+

if C=0 then the block contains BLOCK_SIZE bytes
       of compressed data.
     1 then the block contains BLOCK_SIZE bytes of 8
       bit literals.
if C=0 and BLOCK_SIZE=0 then no more blocks follow.

The PRE tree is encoded as follows:

 4   4   4   3   3   3   2   2   2   2   1   1   1   0   0   0 0
 7   4   1   8   5   2   9   6   3   0   7   4   1   8   5   2 0
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|P0 |P1 |P2 |P3 |P4 |P5 |P6 |P7 |P8 |P9 |P10|P11|P12|P13|P14|P15|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Pn = 0     -> MTF index 'n' does not occur in this block
0 < Pn < 8 -> depth of MTF index 'n' in the Huffman PRE tree

Each Pn is a 3-bit field, so the 16 fields make up the 6 bytes of pre-tree.


The PRE tree maximum depth / prefix length is 7. As can be seen below, the PRE tree is used to encode MTF output rather than the actual depth of a symbol in the compressed Huffman tree.

The LTM, HGH and LOW trees are all encoded in the same manner:

  1. find the leaf depths / prefix lengths for all symbols in the Huffman tree. Depth 0 means the symbol does not exist; it is still encoded, however.
  2. do a move to front (MTF) for all symbol depths / prefix lengths.
  3. encode the MTF output using the PRE tree.
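A minimal C sketch of steps 2 and 3, assuming a helper pre_put() that emits one index through the PRE tree (an assumed interface, like huff_put() earlier) and the 16 possible depth values 0..15:

#include <stdint.h>

/* assumed helper: emit one MTF index through the PRE tree */
void pre_put(unsigned mtf_index);

/* depth[s] is the prefix length (0..15) of symbol 's', where
 * depth 0 marks a symbol that does not occur but is still coded. */
void encode_tree(const uint8_t *depth, int nsyms)
{
    uint8_t mtf[16];
    for (int i = 0; i < 16; i++)
        mtf[i] = (uint8_t)i;             /* initial MTF order 0..15    */

    for (int s = 0; s < nsyms; s++) {
        int i = 0;
        while (mtf[i] != depth[s])       /* current rank of this depth */
            i++;
        pre_put((unsigned)i);            /* emit rank via the PRE tree */
        for (; i > 0; i--)               /* move the depth to the front */
            mtf[i] = mtf[i - 1];
        mtf[0] = depth[s];
    }
}

The MTF stage means that runs of equal or recently seen depths turn into small indices, which the PRE tree can then code with short prefixes.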

It looks like I put a lot of love into how the Huffman trees are encoded into the compressed file. I recall it was kind of a hot topic at some point in time. I never benchmarked how well the above tree encoder works, but looking at the output these days I get the feeling it could do much better..

Finally, the original file size is encoded as follows:

 3                              0
 1                              0
+--------------------------------+
| original file size in bytes    |
+--------------------------------+

The original file size is placed at the end of the file to allow streamed compression (in 32K blocks); we cannot know the original data size until we have compressed all of it. If the decompressor has the entire compressed file at hand, it can easily traverse the compressed file, using each BLOCK_SIZE as an offset to the next block as in a linked list, and find the original file size information that way.
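For illustration, here is a small C sketch of that traversal under the structure above. It assumes the whole file is in memory and that BLOCK_SIZE counts the entire block payload (trees plus data); the even-address padding is my reading of "blocks are 16 bit aligned":

#include <stddef.h>
#include <stdint.h>

static uint32_t be32(const uint8_t *p)    /* read a big endian long */
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |   p[3];
}

/* Hop block to block and return the original file size, or -1. */
long s405_orig_size(const uint8_t *buf, size_t len)
{
    size_t pos = 4;                       /* skip the "S405" ID     */

    for (;;) {
        if (pos + 2 > len)
            return -1;
        unsigned bs = ((unsigned)buf[pos] << 8) | buf[pos + 1];
        pos += 2;
        if (bs == 0)                      /* C=0, BLOCK_SIZE=0: end */
            break;
        pos += bs & 0x7fff;               /* skip the block payload */
        pos += pos & 1;                   /* keep blocks 16-bit aligned */
    }
    return (pos + 4 <= len) ? (long)be32(buf + pos) : -1;
}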

Some decompression notes

The S405 mc68000 decompressor is not really a fast one. Quite a bit of this can be attributed to the extensive use of Huffman decoding without one-step lookup tables. Those would have required 64K of RAM per tree, which is plain out of the question. Instead, a lookup table of at most 120 bytes (plus the symbol array) per tree is used, which allows finding all symbols with the same prefix length in a single lookup / comparison without any "shifting in additional bits".. so the worst case is 15 lookups until a symbol is found. If some prefix length is not used at all, it is not part of the lookup table. For example, if the shortest prefix length is 2 bits and the longest is 10, then the size of the lookup table is 8 entries and the prefix search starts from length 2.

The bit buffer for reading the compressed data stream always has at least 16 bits of data available; 16 bits of new data are read into the bit buffer whenever the number of available bits drops below 16. The bit buffer is kept in a 32-bit register, and another register counts the available "MSBs to extract".

The bit buffer:

 3              6 5              0
 1              1 1              0
+----------------+----------------+
| MSBs to extract|  new data here |
+----------------+----------------+
<--- shift direction
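
Below is a small C model of this bit buffer, written for clarity rather than to match the register-level mc68000 implementation; the variable names are mine:

#include <stdint.h>

static uint32_t       bitbuf;  /* valid bits live in the MSBs      */
static int            bitcnt;  /* number of valid MSBs in bitbuf   */
static const uint8_t *in;      /* compressed input stream          */

static void refill16(void)     /* keep at least 16 bits available  */
{
    if (bitcnt < 16) {
        uint32_t w = ((uint32_t)in[0] << 8) | in[1];
        in += 2;
        bitbuf |= w << (16 - bitcnt);  /* new data below the old bits */
        bitcnt += 16;
    }
}

static unsigned getbits(int n) /* read the next n bits, 1 <= n <= 16 */
{
    refill16();
    unsigned v = bitbuf >> (32 - n);   /* n MSBs, e.g. bits 31..29  */
    bitbuf <<= n;                      /* shift the used bits out   */
    bitcnt -= n;
    return v;
}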

So, if one wants to read the next 3 bits out of the compressed data stream, they are located at bits 31 to 29. This has a nice property: we can easily construct a table containing the highest prefix code value for each tree leaf depth (padded with 1-bits) and do a linear search into the table until we find an entry that is higher than or equal to the bit buffer value. At that index the decoder finds the stored prefix length and the base offset into an array of symbols. After that, we shift out the prefix and add/subtract the prefix code and the base offset to find the index of the symbol in the symbol array. Looking at my old code, I can see I was trying to add a two-stage lookup mechanism at some point, where the first lookup would use the top 8 bits as a direct lookup and, if that was not successful, fall back to a second stage iterative lookup. I don't remember why that never got finished..


The current Huffman decoder:


getSyml:
        ; D7 = bit buffer with the next prefix code in its MSBs
        ; A3 = table of per-depth maximum prefix codes, left
        ;      aligned and padded with 1-bits
        ; Linear search into the lookup table..
.lut:   cmp.l    (a3)+,d7
        bhi.b    .lut
        ;
        movem.w  HALFLUTSIZE-4(a3),d0/d1 ; Clears bits 31-16 of D1
        ; D0 = base offset into the symbol array
        ; D1 = number of bits to extract
        move.l   d7,d2
        clr.w    d2                ; keep only the high word of the bit buffer
        rol.l    d1,d2             ; rotate the prefix code into the low word
        sub.w    d0,d2             ; prefix code - base offset = symbol index
        add.w    d2,d2             ; scale the index to 16 bit words
        ; read 16 bit symbol from the array
        move.w   -4(a3,d2.w),d2

Since the longest prefix code is 15 bits, we could actually do a "cmp.w" and save up to 30 bytes per lookup table in addition to some cycles. However, that would also require rearranging the bit buffer to hold the "MSBs to extract" in the low 16 bits of the long word. Another alternative would be to clear bit 16 and use the upper 16 bits of the bit buffer as a direct lookup into a symbol array (assuming we could spend 64K of lookup tables per tree).


Below is an example of the lookup table for the following Huffman tree:

   x             C: 0
  / \            A: 100
 /   \           B: 101
C     x          D: 110
     / \         E: 1110
    /   \        F: 1111
   x     x
  / \   / \
 /   \ /   \
A    B D    x
           / \
          /   \
         E    F


Symbol array:


 0   1   2   3   4   5
+---+---+---+---+---+---+
|'C'|'A'|'B'|'D'|'E'|'F'|
+---+---+---+---+---+---+
 ^   ^           ^
 |   |           |
 |   |           +- prefix length 4 base offset
 |   +- prefix length 3 base offset
 +- prefix length 1 base offset


Lookup table (n equals 1):

   3              6 5              0
   1              1 1              0
  +----------------+----------------+
0 |0nnnnnnnnnnnnnnn|nnnnnnnnnnnnnnnn| -> depth 1
1 |110nnnnnnnnnnnnn|nnnnnnnnnnnnnnnn| -> depth 3
2 |1111nnnnnnnnnnnn|nnnnnnnnnnnnnnnn| -> depth 4
  +----------------+----------------+
  :   12 long words empty space     :
  +----------------+----------------+
  |       0        |       1        | -> symbols 'C'
  |      -4+1      |       3        | -> symbols 'A','B','D'
  |     -14+4      |       4        | -> symbols 'E','F'
  +----------------+----------------+
  :   12 long words empty space     :
  +----------------+----------------+
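
To tie the pieces together, here is a hand-built C model of this exact table and the linear search for the example tree. The struct layout is illustrative and does not mirror the mc68000 memory layout, and the refilling of the bit buffer is left out:

#include <stdint.h>

struct entry {
    uint32_t limit;   /* highest prefix code of this depth, 1-padded  */
    int      bits;    /* prefix length at this depth                  */
    int      base;    /* subtracted from the code to index the symbols */
};

static const struct entry lut[] = {
    { 0x7fffffffu, 1, 0  },   /* depth 1: highest code 0               */
    { 0xdfffffffu, 3, 3  },   /* depth 3: highest code 110, base 4-1   */
    { 0xffffffffu, 4, 10 },   /* depth 4: highest code 1111, base 14-4 */
};

static const char syms[] = { 'C', 'A', 'B', 'D', 'E', 'F' };

/* bitbuf holds the next prefix code in its MSBs (see getbits() above) */
static char huff_decode(uint32_t *bitbuf)
{
    int i = 0;
    while (lut[i].limit < *bitbuf)         /* the .lut: loop above   */
        i++;
    unsigned code = *bitbuf >> (32 - lut[i].bits);
    *bitbuf <<= lut[i].bits;               /* consume the prefix     */
    return syms[code - lut[i].base];       /* index the symbol array */
}

Feeding it a code from the tree, e.g. 1110... for 'E', walks past the first two limits, extracts the 4-bit code value 14 and lands on syms[14 - 10] = 'E'.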

Most of the decompressor code size goes into building the prefix lookup tables. Anyway, S405 has seen its best days, but it was a fun project.. eh.. over 20 years ago ;)

Part 2 will be about the disk system and the floppy disk layout used in Smarty and the Nasty Gluttons. A good part of that post will revolve around the on-the-fly decompression as part of the file loader.

