Cracking Stones

Tuesday, July 27, 2021

Blitter c2p for a 16 colour rotozoomer

Preface

I released a simple rotozoomer in a "Lure of the Temptress" (see the Pouet link) crack intro for Flashtro. The original "concept" was done summer 1994 but the intro actually got finalised & optimised the heck out of it during summer 2015 ;-) Some delivery process hiccups I would say..

There's nothing new in rotozoomers and c2p routines but there's one neat piece of blitter use in this one - at least I think so. The rotozoomer uses 4x4 graphics pixels, not copper as usual, and in 16 colours. The c2p used in the intro is fully done in 4 blitter interrupt driven blitter passes, which with code in FASTMEM allows easily to do full screen zero-precalculated (320x256) rotozoomer in 50Hz with a plain mc68000 and 2 additional bitplanes on for a text writer and stuff. With the code in CHIPMEM I almost managed the full screen but with the writer and scroller I think I had to cut AFAIR 4 or 8 chunky pixels on each row.

The blitter c2p is what I'll go through here. Contrary to other neat tricks using 7 bitplanes OCS/ECS undocumented features this method works in all chipsets, although, I can't think of a reason for someone to use this with AGA. I cannot take credit of inventing this method as I recall at least a coder called CrazyCrack has used similar method prior my stuff. Anyway, I just wanted to document this.. since I have not seen it done earlier.

Maybe the next post is about the rotozoomer itself as I always had hard time to get my head around it when described in "proper terms" etc. Back in day I just realised I can do that using a unit circle and a single scaled sin()/cos() value as an affine "texture mapping" interpolation value.

Prerequisites

The blitter c2p uses two "linear" chunky buffers for 4x4 pixels - actually in RAM they are 4x1 pixels but more about that later. A pixel in a chunky buffer is 8 bits. The 4 bit colour information is repeated in both nibbles of the byte:

7 0

+----+----+

|abcd|abcd|

+----+----+

So, there are two chunky buffers. One for odd pixels and another for even pixels. The rotozoomer inner loop must therefore write into two buffers in an interleaved manner. This is not really an issue, for example, if the inner loop unrolls the entire row of writes. Below is an example ASCII pictures showing the buffers:

7 0 7 0 7 0 7 0 7 0

+----+----+----+----+-...-+----+----+----+----+----+----+

+----+----+----+----+-...-+----+----+----+----+----+----+

7 0 7 0 7 0 7 0 7 0

+----+----+----+----+-...-+----+----+----+----+----+----+

+----+----+----+----+-...-+----+----+----+----+----+----+

The blitter is setup as follows:

Channels A, B and D are enabled.
BLTCDAT = $8888
BLTBPTH & L points to the last word of even pixels
BLTAPTH & L points to the last word of odd pixels
BLTDPTH & L points to the last word of the respective bitplane buffer

The BLTCON0 and BLTCON1 are set to:

Pass #1 - $7d283012
Pass #2 - $6d282012
Pass #3 - $5d281012
Pass #4 - $4d280012

We can see that we use the blitter in "descending mode" i.e. the DESC bit in BLTCON1 is set. Also, the EFE bit in BLTCON1 is set, which mean we also use AREA FILLING in exclusive mode.

Let's go through one blitt taking an example of the Pass #1 (for preparing the bitplane 0) to see how the shifts/masks are done. But first, let's have a look at the minterm ($28 i.e. 00101000b). I always found it easier to use the Venn Diagram instead of "blitter logic" stuff when figuring out the minterms. We have bits 5 and 3 set. From the Venn Diagram we see those match to (A and C) and (B and C) and no other combination. This means our logic operation is (AC)+(BC) i.e. a bit gets set in the destination iff (A and C)=1 or (B and C)=1 but not e.g., when (A and B and C)=1. Obviously no bit is set when C=0. We use C channel as a mask for all our blitts. So essentially the minterm means a logic operation "(A XOR B) AND C".

______ 0 ______
/ \ / \
/ \/ \
/ /\ \
/ A / \ B \
| - | | - |
| | 6 | |
| | | |
| 4 |____| 2 |
| /| |\ |
| / | 7 | \ |
\ / \ / \ /
\ / 5 \/ 3 \ /
\ | /\ | /
\|_____/ \_____|/
| |
| 1 |
| |
| |
\ /
\ C /
\ - /
\______/

When applying the BLTCON0 and BLTCON1 $7d283012 we first shift A (odd) by 7 bits to the left and B (even) by 3 bits to the left. After that the logic operation i.e. the "A XOR B" and mask the result using the static C channel value $8888 are applied. Let us assume:

qrst = 1101
QRST = 0111
mnop = 1100
MNOP = 1001
ijkl = 1111
IJKL = 1111

What we are after is a linear bitplane 0 content looking like this ("upper case" pixels underlined). Note the 3 pixel shift on the right edge, which can be "corrected" using $dff102 shift of 3 pixels to the right..

7 0 7 0 7 0 7 0 7 0

+----+----+----+----+-...-+----+----+----+----+----+----+

|....|....|.........| |1111|1000|0111|1111|1111|1...| bitplane 0

+----+----+----+----+-...-+----+----+----+----+----+----+

The above blitter setup applied into the odd and even chunky buffer yields to the following after SHIFTs, masks, logic operation and area fill. The bits affected by the logic operation for the XOR operation are underlined. The green background marks a left shift.

7 0 7 0 7 0 7 0 7 0

+----+----+----+----+-...-+----+----+----+----+----+----+

|defg|hefg|h........| |l110|0110|0110|1110|1000|0000| odd pixels

|DABC|DEFG|HEFG|H...| |1111|1100|1100|1011|1011|1000| even pixels

|1000|1000|1000|1000| |1000|1000|1000|1000|1000|1000| Mask

|?000|?000|?000|?000| |0000|1000|1000|0000|0000|1000| XOR result

+----+----+----+----+-...-+----+----+----+----+----+----+

|????|????|????|????| |1111|1000|0111|1111|1111|1000| filled result

+----+----+----+----+-...-+----+----+----+----+----+----+

The same is repeated for all remaining 3 bitplanes, however, using different shift amounts. The produced planar pixels are actually 4x1 size. One can use bitplane modulos or reload bitplane pointers to stretch the pixels in y-direction to achieve 4x4 pixel size. I think I used the latter in my crack intro due to bitplanes 4 and 5 being used for the writer and the scroller. Actually halfbrite mode is used.. I recall places where there is no writer or scroller copper enables normal 4 bitplane mode to save some DMA. There are other nice features with this method. For example, there is no need to setup destination channel pointer more than once i.e., at the beginning of the Pass #1 blitt and maybe some other stuff I have forgotten.

Wednesday, July 21, 2021

Smarty and the Nasty Gluttons - Part 4 (bootblock for the online ADF game)

Preface

Continuing from where we left off in the post #3 on Smarty and the Nasty Gluttons.. this post is about the floppy disk bootblock used in the online ADF version of the game.

The online ADF game bootblock

I had never coded a TVD (trace vector decoder) or even cracked one. Also my experience using "weird stuff" one can do with mc68000 prefetch, use of instructions like STOP and RESET were very thin. Demo coding had little use for those and my demos hardly had anything worth protecting from someone to take a peek.. Anyway, another field that was yet to be explored and I knew(!) there are folks that find such stuff "a nice surprise" in a game. So, the opportunity to code and release "a protection scheme" or an attempt of such came when we got closer releasing the Smarty and the Nasty Gluttons. I kept saying to my fellow game developers that there are folks who hardly ever will play the game but will disasm, train, fix etc the heck out of it. Adding even a small non-obvious "road block" would definitely touch a soft spot for some of them..

I had my plan - a lot did not materialize, though:

must have a TVD - a bucket list item
must have use of seldom used opcodes like STOP, RESET, RTR, etc
must have obfuscated stuff and somehow using Amiga's custom chips/hardware peculiarities
must have checksums
must have encryption depending on the bootblock that is spread all around the game files
must have self-modifying code
must have code that makes no sense but is still required..
must work on all game dev group Amigas (includes CPUs up to mc68040 - at the end Jope @EAB did huge work testing with many.. many.. Amiga setups)
should have AR protection

I started with experimenting with RESET + Memory Overlay. Got it working on my A600 (up to 68020) but never on my A3000 (with 68040). There was even a short discussion started by me about it in EAB. Sadly, I had to drop this. It would have been so cool just to have it ;-)

Next in testing was the mc68000 prefetch experimenting with self-modifying code. It turned out to work nicely. Then I targeted STOP instruction and releasing it with a Copper generated IRQ. Again turned out to work nicely.

AR protection I thought a bit but the only "good" mechanism I knew of was using odd address A7 and that did not really fit to the rest of the plan. I needed to use stack and IRQs to work. Also, I had no AR hardware to use for testing. Found one on the web but sadly I got no Amiga model that it required. Another item to drop..

Then came the TVD.. I have now huge respect to folks in 1980s and early 1990s implementing and cracking these weirdo TVD protections with tools back then.. I mean, I coded/tested/debugged the one and only in Smarty on a REAL HARDWARE and getting that run was an absolute nightmare ;-) Once I finally managed to get the execution path recorded and a mini sized emulator written to do the trace vector encoding part I never changed that anymore. Remember, this was my first ever TVD and had no experience of cracking a TVD. Obviously I had looked at few in past and read the excellent article found in Flashtro about "Basic TVD Cracking" (kudos to WayneKerr).. still, I consider myself a total newbie on anything cracking & protecting related foo.

Anyway, at some point of time we just ran out of time (no pun intended for a game being under development since early 1990s), which meant I dropped checksums, blitter decoding of stuff, copper orchestrated stuff (e.g., blitter) and actually anything that would have required modifications into the actual game code. We were play testing, fixing bugs and fine tuning so much that all this checksum crap etc just did not make it into the plan anymore. I really regret it but.. it was more important to get the game out. That's the reason the "protection attempt" is only on the bootblock and nothing in the actual game is protected. Sorry.

Let's start looking into the actual code.. against my usual principles the code snippets etc are taken from FS-UAE. Stepping through all this weird stuff is just too easy with it. Taking screen captures using Amiga and transferring them to a modern social-media-enabled platform is a bit tedious.. and I am currently slightly short of soothing liqueurs to ease the work as well. And don't laugh.. while the stuff here is trivial'ish I really, really, enjoyed the journey except for some parts of the TVD debugging.

I have highlighted few important areas in the bootblock hexdump. The hexdump in red with yellow background is the copper list, which also is runnable mc68000 code. Actually, our copper list is the privilege violation exception handler.

The hexdump in black and yellow background is a stop mark for the loop relocating the bootblock into low memory and the following word in red background is the initial USP pointer. The hexdump in green background is the reminder of the non-system cache killing code (remember what I said about the TVD code execution path recording..) These three mentioned here are all related in the bootblock code.

00005C40 444F 5300 605A 247E 7FFF 7F7F 4BFA 001A DOS.`Z$~....K...
00005C50 43FA FFF2 4CD1 0030 740A 95CA 264F 4EAE C...L..0t...&ON.
00005C60 FFE2 58AF 0002 4E73 49F9 00DF 89A0 2945 ..X...NsI.....)E
00005C70 66FA 13C5 00BF ED01 429A 9BC9 24D9 66FC f.......B...$.f.
00005C80 21E3 0010 3659 4E63 7000 7200 4E7B 0801 !...6YNcp.r.N{..
00005C90 4E7A 1002 4E91 0695 FFFD B203 06A5 001D Nz..N...........
00005CA0 A7E1 3F0D 2F15 3F02 0000 006C 2978 0068 ..?./.?....l)x.h
00005CB0 66FA 4E71 4E73 4C97 0301 4C90 0801 B153 f.NqNsL...L....S
00005CC0 D25D 4841 3009 B540 4890 0201 B151 4E73 .]HA0..@H....QNs
00005CD0 4C97 0302 4841 0342 D29D 0542 0C83 00FE L...HA.B...B....
00005CE0 009C B399 4E77 00FE 0180 0228 FFFF FFFE ....Nw.....(....
00005CF0 0D04 00B0 66E0 397C 8290 66F6 06AF C372 ....f.9|..f....r
00005D00 4576 FFF2 760A 7A92 66FA 7C08 247C 0000 Ev..v.z.f.|.$|..
00005D10 00B0 4E62 A166 4E71 4C90 66EA 2768 47A0 ..Nb.fNqL.f.'hG.
........ quite a bit of encrypted code follows..
00005FC0 24EC 58B4 75BE 4D40 D8C7 5CF2 CEC6 BF51 $.X.u.M@..\....Q
00005FD0 7AC0 4A25 DA0D 0000 0000 5000 0000 0000 z.J%......P.....
00005FE0 00CC 4A81 6A02 F4F8 4E7B 0002 4E7B 0808 ..J.j...N{..N{..
00005FF0 4E75 434F 4F50 4552 2F43 4F4F 5045 522F NuCOOPER/COOPER/
00006000 84A2 A262 62F2 C204 A2B2 F2CA 04A2 D292 ...bb...........
00006010 3204 2232 AAF2 EA04 9204 2272 8204 CA4A 2."2......"r...J
00006020 82A2 9A04 AC4C 044A A22A 6282 042A A2A2 .....L.J.*b..*..
00006030 B204 2292 2204 A2EA A5AD B007 C0DE DBAD .."."...........

The bootblock code starts here.. The D2 is initialized to $10 as we need it for code decrypting. A5 points at the code to be run in Supervisor mode. A1 points at bootblock after the DOS Type.

00005C4C 4bfa 001a LEA.L (PC,$001a) == $00005c68,A5
00005C50 43fa fff2 LEA.L (PC,$fff2) == $00005c44,A1
00005C54 4cd1 0030 MOVEM.L (A1),D4-D5
00005C58 740a MOVE.L #$0000000a,D2
00005C5A 95ca SUBA.L A2,A2
00005C5C 264f MOVEA.L A7,A3
00005C5E 4eae ffe2 JSR (A6, -$001e) == $00000658

This is the Illegal Instruction exception handler used by the non-system cache killing code to skip the 4 bytes long offending instruction. We saved the current USP to A3 to later pop the JSR return address (i.e., $5c62) as the exception address and put that into address $10. The neat thing here is that if the CPU is mc68000 the write really goes into the exception vector at address $10. On better CPUs the exception vector may be relocated using VBR into what ever address, however, the code we have for killing cache etc do not trigger illegal instruction exception either..

00005C62 58af 0002 ADD.L #$00000004,(A7, $0002) == $000018b4
00005C66 4e73 RTE

Take note here.. A4 is loaded with a $df89a0, which we use to refer to hardware registers with an offset $6660. The value is selected purposely and later we will see why. Here these three lines disables all relevant interrupts.

00005C68 49f9 00df 89a0 LEA.L $00df89a0,A4
00005C6E 2945 66fa MOVE.L D5,(A4, $66fa) == $00dff09a
00005C72 13c5 00bf ed01 MOVE.B D5,$00bfed01

Relocate the bootblock code starting from $5c44 i.e., just after the bootblock DOS Type to memory address $4. We clear the bytes at $0 to $3 as that serves as an initial context for the TVD. Take a note of "SUB.L A1,A5", which will init A5 to $24 i.e., the address for the Trace exception vector. In the non-relocated code A5 would point at $5c64. The copying continues until the first 4 bytes aligned 0 long word shows up in the bootblock.

00005C78 429a CLR.L (A2)+
00005C7A 9bc9 SUBA.L A1,A5
00005C7C 24d9 MOVE.L (A1)+,(A2)+
00005C7E 66fc BNE.B #$fffffffc == $00005c7c (F)
00005C80 21e3 0010 MOVE.L -(A3),$00000010

Get the initial USP ($cc) and now A1 points at the reminder of the non-system cache killing code.

00005C84 3659 MOVEA.W (A1)+,A3

00005C86 4e63 MVR2USP.L A3

00005C88 7000 MOVE.L #$00000000,D0
00005C8A 7200 MOVE.L #$00000000,D1
00005C8C 4e7b 0801 [ MOVEC D0,VBR ]
00005C8E 0801 4e7a BTST.L #$4e7a,D1
00005C92 1002 MOVE.B D2,D0
00005C94 4e91 JSR (A1)

The following two lines modify the "relocated code" (we are still running non-relocated code here) at addresses $20 (equals to $5c60) and $24 (equals to $5c64) generating the privilege violation and trace exception handler addresses.

00005C96 0695 fffd b203 ADD.L #$fffdb203,(A5)
00005C9C 06a5 001d a7e1 ADD.L #$001da7e1,-(A5)

$00000020 will contain $00000090 i.e., code also located at $5cd0.

$00000024 will contain $00000076 i.e., code also located at $5cb6.

Next three instructions create a stack frame Format 0 that looks like we were returning from a privilege violation exception handler. The D2 here will cause setting the X-flag upon return and move CPU to user mode (and no trace bit set). Code execution will resume at the relocated address $90 i.e. the privilege violation exception handler once we execute the RTE instruction a bit later. A detail there is that the exception handler will then be called in user mode not in supervisor and this has a significance when it comes to selection of stack pointers.

00005CA2 3f0d MOVE.W A5,-(A7)
00005CA4 2f15 MOVE.L (A5),-(A7)
00005CA6 3f02 MOVE.W D2,-(A7)

In the relocated bootblock code $5ca8 is at address $68 i.e., Level 2 exception vector. It contains value $6c meaning the code for the exception handler is at $6c. The same value is also adequate to clear the pending IRQs triggered the Level 2 IRQ. We write the entire content of $68 i.e., $0000006c into $dff09a but only the bits for $dff09c take effect.

00005CA8 0000 006c OR.B #$6c,D0
00005CAC 2978 0068 66fa MOVE.L $00000068,(A4, $66fa) == $0000c33a
00005CB2 4e71 NOP
00005CB4 4e73 RTE

The RTE causes a jump to address $90. The stack used now is USP, which points at $cc. Note the loading of D1/A0/A1 with values from stack and later the "EOR.L D1,(A1)+" to modify the instruction that caused the exception just before exiting the handler. Although in this case below we have a fabricated stack frame thus we modify a different place in memory.. see below.

Also, the "ADD.L (A5)+,D1" starts to calculate a checksum over the bootblock code. Note that the "RTR" pops also the CCR but the supervisor portion of the status register is unaffected.

00000090 4c97 0302 MOVEM.W (A7),D1/A0-A1
00000094 4841 SWAP.W D1
00000096 0342 BCHG.L D1,D2
00000098 d29d ADD.L (A5)+,D1
0000009A 0542 BCHG.L D2,D2
0000009C 0c83 00fe 009c CMP.L #$00fe009c,D3
000000A2 b399 EOR.L D1,(A1)+
000000A4 4e77 RTR

What we have at $cc is a piece of code, which is actually used later to load A2 with a wanted value. The handler loads the value $000000b0 into A1 (remember the sign extending properties of "movem.w" and loading a word into an address register in general). Address $b0 is also the address where code execution continues after the "RTR".

000000CC 247c 0000 00b0 MOVEA.L #$000000b0,A2

Also, the above handler will now modify code at address $b0, which originally looks like:

000000B0 0d04 BTST.L D6,D4
000000B2 00b0 66e0 397c 8290 OR.L #$66e0397c,(A0, A0.W*2, $ffffff90)
000000BA 66f6 BNE.B #$fffffff6 == $000000b2 (F)

After that "EOR.L D1,(A1)+" the code changes to:

000000B0 2978 0020 66e0 MOVE.L $00000020,(A4, $66e0) == $00dff080
000000B6 397c 8290 66f6 MOVE.W #$8290,(A4, $66f6) == $00dff096

When the above code is executed the privilege violation exception handler address is used as the COP1 address and the relevant DMAs are started. The privilege violation exception handler at $90 is also a valid copper list. The significant piece there is the triggering of a Level 2 IRQ (write of $b399 to $dff09c). This will be useful later.

00000090: 4c97 0302 ; VP 4c, VE 03; HP 96, HE 02; BFD 0

00000094: 4841 0342 ; VP 48, VE 03; HP 40, HE 42; BFD 0

00000098: d29d 0542 ; VP d2, VE 05; HP 9c, HE 42; BFD 0

0000009c: 0c83 00fe ; VP 0c, VE 00; HP 82, HE fe; BFD 0

000000a0: 009c b399 ; INTREQ := 0xb399

000000a4: 4e77 00fe ; VP 4e, VE 00; HP 76, HE fe; BFD 0

000000a8: 0180 0228 ; COLOR00 := 0x0228

000000ac: ffff fffe ; VP ff, VE 7f; HP fe, HE fe; BFD 1

; End of Copperlist

The next four lines of funky code play around with the prefetch of the mc680x0. After the recent "RTR" the USP points at $d2. The "ADD.L" here will therefore change the code at address $c4 i.e. the code immediately following the "ADD.L".

000000BC 06af c372 4576 fff2 ADD.L #$c3724576,(A7, -$000e) == $000000c4
000000C4 760a MOVE.L #$0000000a,D3
000000C6 7a92 MOVE.L #$ffffff92,D5
000000C8 66fa BNE.B #$fffffffa == $000000c4 (F)

000000CA 7c08 MOVE.L #$00000008,D6
000000CC 247c 0000 00b0 MOVEA.L #$000000b0,A2

However, due to prefetch the CPU has already loaded the "old code", which sets registers D3 and D5. The "BNE.B" above causes looping back to just modified code at $c4. Now the selected offset for A4 register makes sense. It was the "BNE.B" instruction opcode. The code essentially initializes registers for the TVD and enables a "PORTS" Level 2 IRQ (that will be triggered from the copper list).

000000C4 397c c008 66fa MOVE.W #$c008,(A4, $66fa) == $00dff09a
000000CA 7c08 MOVE.L #$00000008,D6
000000CC 247c 0000 00b0 MOVEA.L #$000000b0,A2

The following instruction looks legitimate and actually is. However, it will trigger a privilege violation exception and not really change the USP at all.. but having $b0 in A2 is important later.

000000D2 4e62 MVR2USP.L A2

000000D4 a166 ILLEGAL

000000D6 4e71 NOP

Once the privilege violation handler returns back to $d2 the code has changed a bit and remember that we return from the handler using "RTR" so we remain in supervisor mode..

000000D2 4e72 a110 STOP #$a110

000000D6 4e71 NOP

..and the CPU stops until an IRQ takes place with priority/level higher than 1. This means we just sit here until the copper triggers next Level 2 IRQ. This code really has no particular useful meaning but it was just something I had to have to justify the copper list in a code thingy ;-) Well, actually, the "STOP" writes the status register and here it will keep the CPU in supervisor and enable tracing! The next instruction ("NOP") will then kick-off our TVD..

The trace exception handler is located at address $76. Few notes.. The bootblock is always in the lower 32K of RAM, thus we can me use of 16 bit addressing when assigning values into address registers. For example the first "MOVEM.W" will always load A0 with address $0 and in that location we maintain the TVD context for previously decrypted instruction. We want to decrypt the next instruction but also re-encrypt what we already executed. The word at $0 is the "previous XOR key" and the word at $2 is the "previously decrypted instruction address". Initially these both were initialized to zero.

Here D0 is not used for anything. A0 is the TVD context address i.e. $0 and A1 is the address of the next instruction to execute. D1 will accumulate the checksum over the bootblock and D2 is "salt" that gets modified here and there in the decrypted code while TVD runs.

00000076 4c97 0301 MOVEM.W (A7),D0/A0-A1

Re-encrypt the previously executed instruction.

0000007A 4c90 0801 MOVEM.W (A0),D0/A3
0000007E b153 EOR.W D0,(A3)

Update the code checksum. Note that we changed to ".W" and this is to avoid the checksum calculation pointer (A5) reaching the code we decrypt with non-TVD decrypters.. those decrypters are protected by the TVD.. uhh.. I was lazy as the crossing point would have required some extra care to work properly. Most of my TVD stuff was trial'n'error and at this point the error rate started to get too high.

00000080 d25d ADD.W (A5)+,D1
00000082 4841 SWAP.W D1

The actual "decryption XOR key" is a mix of the D2 ("salt") and the address of the next instruction.. before decrypting the next instruction both the calculated "decryption XOR key" and the instruction address get stored into the TVD context location. This lame'ish key algorithm was selected because we needed to allow encrypting and decrypting code that has conditional branches. So, if the D2 ("salt") remains constant within a code block that branches around we can deterministically calculate the "decryption XOR key". As can be seen later the D2 gets modified only in carefully selected spots..

00000084 3009 MOVE.W A1,D0
00000086 b540 EOR.W D2,D0
00000088 4890 0201 MOVEM.W D0/A1,(A0)
0000008C b151 EOR.W D0,(A1)
0000008E 4e73 RTE

The following code is what gets decrypted by the TVD. The actual execution path varies depending on your Amiga's memory configuration as the memlist check for autoconfigured memory is also part of the code that is protected with the TVD. At the beginning D3 is $0000000a, D5 is $ffffff92 and A2 is $000000b0.

000000D8 4042 NEGX.W D2
000000DA 95c5 SUBA.L D5,A2

A2 points at $11e now, which is the start of the "normal encrypted code" outside the TVD protected code. The TVD protected code contains two decryption loops to decrypt the minimal FFS file loader etc. The one below is the decrypter #1 and it uses the previous code checksum as the decryption key.

000000DC d441 ADD.W D1,D2
000000DE b39a EOR.L D1,(A2)+
000000E0 5303 SUB.B #$00000001,D3
000000E2 6afa BPL.B #$fffffffa == $000000de (T)

000000E4 d705 ADDX.B D5,D3
000000E6 4681 NOT.L D1
000000E8 d441 ADD.W D1,D2

The next "ADD.L" is important.. at the very beginning of the boot the system passed Execbase address in A6 to the bootblock code. A2 points at address $14a after the first decrypter and tada.. that's also the index to memlist structure in the Execbase for autoconfigured memory.

Furthermore, D3 was $000000ff after the first decrypter and adding D5 to it gives us $92 (note, X-flag was 1 when the "ADDX.B" was executed), which is the decrypter #2 loop count.

000000EA ddca ADDA.L A2,A6
000000EC b39a EOR.L D1,(A2)+
000000EE 51cb fffc DBF .W D3,#$fffc == $000000ec (F)
000000F2 4042 NEGX.W D2

After the second decrypter follows (a slightly bugged as reported by Ross @EAB ;-) memlist code that finds out the autoconfigured and ranger RAM. We will use 512K as a RAM disk during the game.

000000F4 9cc6 SUBA.W D6,A6
000000F6 4846 SWAP.W D6
000000F8 2c56 MOVEA.L (A6),A6
000000FA 2816 MOVE.L (A6),D4
000000FC 6718 BEQ.B #$00000018 == $00000116 (F)
000000FE 282e 0014 MOVE.L (A6, $0014) == $000008d6,D4
00000102 2a2e 0018 MOVE.L (A6, $0018) == $000008da,D5
00000106 b886 CMP.L D6,D4
00000108 6402 BCC.B #$00000002 == $0000
0000010A 2806 MOVE.L D6,D4
0000010C 4244 CLR.W D4
0000010E da83 ADD.L D3,D5
00000110 4245 CLR.W D5
00000112 9a84 SUB.L D4,D5
00000114 6fe2 BLE.B #$ffffffe2 == $000000f8 (F)

We are done with decrypting and memlist stuff now. Time to set stacks and exist the TVD by clearing the trace bit in the status register and after setting the SSP to $300 returning to user mode.

00000116 46fc 2700 MV2SR.W #$2700
0000011A 4ff8 0300 LEA.L $00000300,A7
0000011E 46c6 MV2SR.W D6
00000120 4ff8 0400 LEA.L $00000400,A7

The last tweaks of obfuscation before calling the file loader.. D3 is $0000ffff and the code below will make D6 to $00080080, which we use to stop the copper DMA and disable the "PORTS" Level 2 IRQ. Finally we store the found memory location & size of RAM disk in D4 and D5 into addresses $8c and $90. Those locations are used by the main game engine.

00000124 ea0b LSR.B #$00000005,D3
00000126 07c6 BSET.L D3,D6
00000128 3946 66f6 MOVE.W D6,(A4, $66f6) == $00dff096
0000012C 2946 66fa MOVE.L D6,(A4, $66fa) == $00dff09a
00000130 397c 9500 66fe MOVE.W #$9500,(A4, $66fe) == $00dff09e
00000136 397c 4489 66de MOVE.W #$4489,(A4, $66de)
0000013C 3446 MOVEA.W D6,A2
0000013E 48d2 003e MOVEM.L D1-D5,(A2)

And call the FFS file loader, load a file called "DOS" (which is the second stage loader and the main game engine) into address $5000 and execute it from there. I won't go through the loader. It is at this point uninteresting. Just a quick note that the loader is tailored to load a file named "DOS" i.e. hash functions to locate specific structures in FFS are precalculated etc. The reason for the imaginary naming of the second stage loader is that originally I intended to use the DOS Type at address $0 as the file name (i.e. the string "DOS\0").

00000148 610e BSR.B #$0000000e == $00000158
0000014A 6604 BNE.B #$00000004 == $00000150 (T)
0000014C 4ef8 5000 JMP $00005000
00000150 396c 6666 67e0 MOVE.W (A4, $6666) == $00dff006,(A4, $67e0) == $00dff180
00000156 60f8 BT .B #$fffffff8 == $00000150 (T)

Phew.. that was it. Again, useless stuff but it was worth the journey for me ;-)

Sunday, July 18, 2021

Smarty and the Nasty Gluttons - Part 3 (bootblock for the boxed game)

Preface

Continuing from where we left off in the post #2 on Smarty and the Nasty Gluttons.. this post is about the floppy disk bootblock used in the boxed version of the game. There will be a follow-up post going through the online ADF version bootblock once I get around to do it. It is a bit involved. The latter bootblock contains encrypted data and a protection. We received reports from folks using FPGA based Amiga emulators that the game does not boot.. well.. the ADF version bootblock requires exact emulation of the mc680x0 hardware and obviously those FPGA emulators do not have such ;-) The boxed game bootblock is not encrypted or obfuscated in any way.

The boxed game bootblock

This post contains a lot of source code as it is easier to explain certain things directly from the code. Since the whole release process of a boxed Amiga game in 2020s is an "act of love" or a serious hobby at best the fine line between what makes sense and what not is meaningless. Therefore, the bootblock contains geek things just for "because we can" and "I had to try it" reasons.

In comparison to the ADF version bootblock the boxed game bootblock does not contain a minimal FFS file loader for loading the second stage file loader program. Instead, the floppy disk layout is mastered (by hand) in a way that the all data sectors of the Amiga DOS file containing the second stage loader start immediately after the bootblock sectors. So when the bootblock code rereads itself from the floppy (using trackdisk.device) it purposely reads past the bootblock to also get the secondary loader compressed file into the memory. The bootblock has the S405 decompressor in it so we can now decompress the second stage FFS file loader and jump into it.

The other geek or useful things added to the bootblock include:

Booting and playing the game from any disk drive, not just DF0:. And furthermore, this also works on any Kickstart, not just on 2.x or newer. There's a hack'ish example code for that.
The bootblock content starts from offset 4, not the standard 12. This involves generating such bootblock content that the checksum is exactly what we want it to be ;-) There's an example code how to generate a bootblock checksum with a predefined content.
Using code embedded into perfectly readable ASCII text messages.

Generating bootblock checksums with a desired content

To fully understand the checksum code below look at the bootblock source itself. It is later in this post. The main idea here is that we first calculate the normal bootblock checksum and subtract the value of the "desired checksum value" from that. The difference then is placed somewhere in the bootblock and then recalculate the checksum. Now the checksum becomes to the value we wanted it to be. I must thank Ross @EAB getting me to think about this. His superb one disk trainer (which, btw has a superb trainer intro as well ;-) for the Smarty and the Nasty Glutton had this thing. That man is full of crazy ideas!

The code below does the magic. It assumes the "desired checksum" is already in the place of the checksum field of the bootblock.

section boot,code_c
j:      ; "sta" is the start of the bootblock code
lea sta+4(pc),a1
move.l (a1),d2
move.l d2,d3
clr.l (a1)
not.l d2
bsr.b calcBootCRC
sub.l d0,d2
bcc.b .ok
subq.l #1,d2
.ok:
       move.l d2,crc
bsr.b calcBootCRC
not.l d0
move.l d0,d2
       ; This checksum calculation is for verification purposes
bsr.b calcBootCRC
not.l d0
move.l d2,(a1)
rts
calcBootCRC:
lea sta(pc),a0
moveq #0,d0
move.w #(1024/4)-1,d1
.loop:
       add.l (a0)+,d0
bcc.b .skip
addq.l #1,d0
.skip:
dbf d1,.loop
rts

The boxed game bootblock source in full

The following lengthy piece of code is the actual boxed game version bootblock. I have put some comments in between here and there. First the defines.. I am still stuck at old Seka times doing assembly coding... sorry. If there are bugs in the bootblock code.. oh boy.. this code is already out in the wild as part of the physical game disk of the boxed version.

;-----------------------------------------------------------------
;
LVOFindResident equ -96
LVOSupervisor equ -30
LVODoIO equ -456
LVOAllocMem equ -198
LVOAllocAbs equ -204
LVOOpenDevice equ -444
LVOCloseDevice equ -450
LVODisable equ -120
IOSTD_SIZE equ 56
IO_UNIT equ 24
MP_SIZE equ 34
MN_REPLYPORT equ 14
MN_LENGTH equ 18
LN_TYPE equ 8
NT_REPLYMSG equ 7
MH_UPPER equ 24
MH_LOWER equ 20
MAXLOCMEM equ 62
LOADSIZE equ (18+2)*512

The bootblock itself. Note that the copyright message starts immediately after the DOS Type at offset 4. You need the above checksum calculation code to fabricate "*Sma" as the bootblock checksum. Furthermore, as you can see there is no "visible" code at the offset 12, instead the ASCII "and the.." starts from there. The geek thing is that "an" is $616e in hex, which translates to a "BSR.B go" in the bootblock binary. Since "bsr" pushes the return address into the stack we pop it later and use that address as a base address for variables.

sta: dc.b "DOS",0
dc.b "*Smarty and the Nasty Gluttons*"
dc.b " (c) 2020 Eero Tunkelo."
dc.b " HardCopy release (c) 2020 by Bitmap Soft."
crc: dc.l 0 ; used for biasing the real checksum
tdname: dc.b "trackdisk.device",0
dc.b 'SCX'
; A4=area for variables
go: move.l (sp)+,a4
move.l a1,a2 ; save IORequest

Next we reread the bootblock into the CHIPRAM address $30000. This address is "high enough" in memory not to mess with system stuff and "low enough" to be available in Amigas with just 256K of CHIPRAM (I know this is ridiculous but our friend Ross insisted on stuff like this and those non-existent 256K Fast RAM expansions). The LOADSIZE is way more that 2 sectors needed for the bootblock - remember the trick to load the second stage file loader code at this step.

; Read start program into CHIP..
lea $30000,a3
move.w #2,$1c(a2)    ; io_Command = CMD_READ
move.l #LOADSIZE,$24(a2)   ; io_length
move.l a3,$28(a2)    ; io_Data
clr.l $2c(a2)    ; io_Offset
move.l a2,a1
jsr LVODoIo(a6)
tst.l d0
bne.b errorExit

The next routine finds out which disk drive was used to boot the game disk. This code is contributed by one EAB thread and specifically Stingray (greetings fellow Scoopexian). The basic idea is to iterate through all device units and find which device has the same IO_UNIT as the IORequest address passed to the bootblock. This is somewhat hack'ish but seems to work just fine.

findBootDrive:
; Find boot drive - from EAB & Stingray
moveq #IOSTD_SIZE+MP_SIZE-1,d6
.clear:
clr.b 0(a4,d6.w)
dbf d6,.clear
;
moveq #4-1,d6 ; device number & count
move.l IO_UNIT(a2),d7
lea IOSTD_SIZE(a4),a5 ; A5 = Message
move.l a5,MN_REPLYPORT(a4) ; IOExtReq.io_Message.mn_ReplyPort
;move.w #IOSTD_SIZE,MN_LENGTH(a4)
;move.b #NT_REPLYMSG,LN_TYPE(a4)
.findLoop:
move.l d6,d0 ; Unit
moveq #0,d1 ; Flags
lea tdname(pc),a0 ;
move.l a4,a1 ; A4 = IOStdReq
jsr LVOOpenDevice(a6)
tst.l d0
bne.b .noDevice
move.l IO_UNIT(a4),d5
move.l a4,a1
jsr LVOCloseDevice(a6)
cmp.l d5,d7
.noDevice:
dbeq d6,.findLoop
tst.w d6
bpl.b bootDriveFound
errorExit:
moveq #-1,d0
rts

Now we found the boot drive. The next code does the cache disable stuff. Again, a variation and result of the codes found in relevant EAB threads. Kudos to Keir and Ross with a bit of my own (the illegal instruction handler stuff). We do not use the typical (and what would make sense) CacheControl() provided by Kickstart 2.x and newer. Instead, we have a code that works (has been tested) on any Kickstart and 680x0. I bet there is an accelerator out there which will cause this code to break due some magic done in the accelerator firmware... Anyway, I had a perfectly good reason for what I did. The main game developer Sami Karjalainen has Amiga 2000 (Kickstart ROM 1.3) with GVP 68030 turbo. So when the game is booted up for the very first time Sami's A2000 has Kickstart 1.3 and 68030. Obviously the game had to work on his machine as well.

Furthermore, his GVP was a source of additional headache. It generated spurious Level 7 IRQs, which we did not count for.. it took quite a bit of debugging to find this out. Once I figured it all the numerous threads in EAB came immediately as a backflash about the very same topic. Sigh..

bootDriveFound:
; Turn disk motor off
move.l a2,a1
move.w #9,$1c(a1)
clr.l $24(a1)
jsr LVODoIo(a6)
jsr LVODisable(a6)
; Cache code
lea disableCaches(pc),a5
move.l sp,a2
jsr LVOSupervisor(a6)
; Illegal instruction handler
addq.l #4,2(sp)
rte
disableCaches:
move.l -(a2),$10.w ; JSR return address is the handler
moveq #0,d0 ;
moveq #0,d1
dc.l $4e7b0801 ; movec d0,vbr
dc.l $4e7a1002 ; movec cacr,d1
tst.l d1
bpl.b .skip
dc.w $f478 ; cpusha dc
.skip: dc.l $4e7b0002 ; movec d0,cacr
dc.l $4e7b0808 ; movec d0,pcr
; Jump to code in fixed address..
jmp findMemory-sta(a3)

At this point we have all caches etc fancy features turned off and VBR set to 0. We are also executing the code from the fixed address somewhere around that $30000.

I have a track record embedding bugs into my memlist code finding out all autoconfiguring + ranger memory. The original Smarty and the Nasty Gluttons code from 1990s had a bug, the "cracked" game preview release had a different bug and finally the ADF version of the game had a bug ;-) I hope I finally got this right.. I presume Ross @EAB (if he ever reads this post) will point out the bugs.. as he did previous times.

; Search exec memlist for 512K RAM disk.. any mem
findMemory:
lea $142(a6),a0 ; memlist
moveq #8,d0
swap d0 ; D0=$80000
moveq #0,d1
subq.w #1,d1 ; D1=$ffff
moveq #0,d4 ; mem found flag
; Check for 256K CHIP
cmp.l MAXLOCMEM(a6),d0
bhi.b .noMemFound
.memLoop:
move.l (a0),a0
tst.l (a0)
beq.b .noMemFound
movem.l MH_LOWER(a0),d2/d3
add.l d1,d3
clr.w d3
sub.l d0,d3
; loop if 256K or 512K of CHIP
ble.b .memLoop
; loop if 256K mem expansion
clr.w d2
cmp.l d2,d3
blo.b .memloop
.done:
move.l d0,d4
.noMemFound:
movem.l d3/d4/d6,$8c.w ; $8C = ptr to min 512K RAM disk
; $90 = amount of RAM disk
; $94 = boot drive

If there is not enough RAM or 256K of CHIPRAM is found the second stage file loader code will inform the player about the fact.

Now we are at the final steps.. the rest of to code relocates both stacks and calls the decompressor. The code execution for the second stage file loader starts at $5000. That's it. For your convenience the S405 decompression routine is also included in full..

; Stacks..
lea $400.w,a7 ; SSP
move.l a7,a2 ; decompressor work space
move.w d0,sr ; back to user mode
lea $300.w,a7 ; USP
lea $dff096,a0
move.w #$0180,(a0) ; #$0180,$dff096
move.w d0,$180-$96(a0)
      ; decompress and execute..
lea $604(a3),a0
lea $5000.w,a1
;lea ($5000-TABLE_SIZE).w,a2
pea (a1)

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Version: 5d
;
; This file is released into the public domain for commercial
; or non-commercial usage with no restrictions placed upon it.
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

LUTSIZE     equ 15*8
HALFLUTSIZE    equ LUTSIZE/2
PRESYMS     equ 16
LTMSYMS     equ 512
HGHSYMS     equ 256
LOWSYMS     equ 256
rsreset ; DO NOT REARRANGE
PRETABLE rs.b (LUTSIZE+PRESYMS*2)
LTMTABLE rs.b (LUTSIZE+LTMSYMS*2)
HGHTABLE rs.b (LUTSIZE+HGHSYMS*2)
LOWTABLE rs.b (LUTSIZE+LOWSYMS*2)
OVLTABLE rs.b 256
TABLE_SIZE rs.b 0
MTF equ LOWTABLE+HGHSYMS

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; Parameters:
; A0 - src - a ptr to the compressed data.
; A1 - dst - a ptr to the destination memory.
; A2 = ptr to work area (TABLE_SIZE bytes)
;
; Returns:
; none
;
; Notes:
; No checks if source and destination memory areas
; overlap. In such case a crash is very probable.
; Runtime RAM usage is 2816 ($b00) bytes, relocatable using A2.
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

decrunch: addq.w #2,a0
;
; A0 = ptr to crunched data (ID skipped) + 2
; A1 = ptr to destination mem
; A2 = ptr to work area
;
blockLoop: tst.w -(a0)
bne.b .cont
rts
;
; Initialize the bitshifter
.cont: move.l (a0)+,d7
swap d7
moveq #0,d6
;
; A0 = src
; A1 = dst
; a2 = tmp
; D6 = _bc
; D7 = _bb
;
decodeTrees:
lea MTF(a2),a3 ; MTF table
move.l a2,a4 ; A4 = PRETABLE
lea LUTSIZE+PRESYMS*2(a2),a5 ; leaf depths
;
; Init MTF table M while reading 16 leaf depths.
;
moveq #PRESYMS,d3
moveq #29,d4
moveq #0,d5 ; max leaf depth is 15
.getb3: ;
move.b d5,(a3)+ ; init MTF
move.l d7,d0
lsr.l d4,d0
move.b d0,(a5)+
moveq #3,d1
bsr.b getB
addq.w #1,d5
cmp.w d5,d3
bne.b .getb3
; Build pretree..
; D3 = num symbols = 16
; A4 = ptr to PRETABLE
bsr.w buildDecodingTables ; PRETREE

;; Build LTMTABLE
lsl.w #5,d3
; D3 = 512
; A4 = A2+LTMTABLE
;lea LTMTABLE(a2),a4
bsr.b buildDecodingTree
lsr.w #1,d3
; D3 = 256
; A4 = A2+HGHTABLE
bsr.b buildDecodingTree
;
; D3 = 256
; A4 = A2+LOWTABLE
bsr.b buildDecodingTree
;
; D3 = 256
; D5 = oldLength (PMR) no init required
; A4 = oldOffset (PMR) no init required
; A5 = oldOffsetLong (PMR) no init required
;
mainLoop:
lea LTMTABLE(a2),a3    ; LTMTABLE
bsr.b getSyml
sub.w d3,d2
decodeLiteral: ;
; Symbol > 256 => PMR or match
bgt.b matchFound
;
; Symbol = 256 => End of Block
beq.b blockLoop
;
; Symbol < 256 => Literal
move.b d2,(a1)+
bra.b mainLoop
;
matchFound:
subq.w #1,d2
ble.b copyLoop ; Symbol = 257 => PMR
                       ; D2 = match_length-1
move.w d2,d5 ; D5 = PMR oldLength
decodeOffset:
lea HGHTABLE(a2),a3
bsr.b getSyml
move.w d2,d4
bne.b .notoldoffsetlong
move.w a5,d4
bra.b oldOffsetLong
.notoldoffsetlong:
bclr #7,d4
beq.b oldOffsetLong
;
lsl.w #8,d4
;
lea LOWTABLE(a2),a3
bsr.b getSyml
move.b d2,d4
;
move.w d4,a5    ; A5 = PMR oldOffsetLong
oldOffsetLong: move.w d4,a4    ; A4 = PMR oldOffset
copyLoop: move.w d5,d0    ; D5 = PMR oldLength
;
; D3 = matchLength-1
; D4 = offset
move.l a1,a3
sub.l a4,a3
;
.copy: move.b (a3)+,(a1)+
dbf d0,.copy
bra.b mainLoop
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; Description:
; Decodes the next symbol from the given huffman tree.
;
; Parameters:
; a3 = ptr to (huffman) tables..
;
; Returns:
; D1 = cnt (num bits needed for symbol..)
; D2 = symbol
;
; Trashes:
; D0,a3
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
getSyml:
.lut: cmp.l (a3)+,d7
bhi.b .lut
;
movem.w HALFLUTSIZE-4(a3),d0/d1
; D0 = base index
; D1 = number of bits to extract
move.l d7,d2
clr.w d2
rol.l d1,d2
sub.w d0,d2
add.w d2,d2
move.w -4(a3,d2.w),d2
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; Description:
; Extract n bits from the compressed data stream.
;
; Parameters:
; D1.w = num_bits_to_extract
;
; Returns:
; Nothing.
;
; Trashes:
; flags,D1
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
getB: cmp.w d1,d6
bge.b .getb
lsl.l d6,d7
move.w (a0)+,d7
sub.w d6,d1
moveq #16,d6
.getb: sub.w d1,d6
lsl.l d1,d7
return: rts

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; Description:
; Build huffman tree & decoding structures. Read symbols
; from the compressed stream and then does inverse MTF
; symbols. Used only to decode/build the pretree.
;
; Parameters:
; D3 = num symbold
; A4 = ptr to the tree
;
; Returns:
; D3 = num of symbols
; A4 = ptr to dest table
;
; Trashes:
; D0,D1,D2,D4,D5,A3,A4,A5
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

buildDecodingTree:
move.w d3,d0
add.w d0,d0
lea LUTSIZE(a4,d0.w),a5
;moveq #0,d5
move.w d3,d5
.loop: move.l a2,a3 ; PRETABLE
bsr.b getSyml ; trashes a3
; inverse MTF
lea MTF(a2),a3
add.w d2,a3
move.w d2,d0
move.b (a3),d2
bra.b .mtf1
;
.mtf: move.b -(a3),1(a3)
.mtf1: dbf d0,.mtf
move.b d2,(a3)
move.b d2,(a5)+
;addq.w #1,d5
;cmp.w d5,d3
subq.w #1,d5
bne.b .loop
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
; Description:
; Build huffman tree & decoding structures. Assume that all
; symbols are already loaded into memory.
;
; Parameters:
; D3 = num symbols
; A4 = ptr to the tree table
;
; Returns:
; D3 = num of symbols
; A4 = ptr to next table
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

buildDecodingTables:
move.l a4,a5
moveq #HALFLUTSIZE/4,d0
; clear LUTTABLE
.clrLoop:
clr.l (a5)+
subq.w #1,d0
bne.b .clrLoop
;
move.w d3,d1
add.w d1,d1
add.w d3,d1
lea LUTSIZE(a4,d1.w),a3
; Read depths and count occurrences of each depth
; A3 = leaf depths
; A4 = tree/lut
move.w d3,d2
;moveq #0,d0 ; D0=0
.countLoop:
move.b -(a3),d0
beq.b .zeroDepth
add.b d0,d0
add.b d0,d0
addq.w #1,2-4(a4,d0.w) ; Count or Index
.zeroDepth: subq.w #1,d2
bne.b .countLoop
; count prefix and base
move.l a4,a5
moveq #15,d0
;moveq #0,d2 ; prefix, D2=0
moveq #HALFLUTSIZE,d5 ; inxed or count
.indexLoop:
move.l (a5)+,d4
beq.b .zeroCount
add.w d4,d2
movem.w d2/d5,-4(a5)
add.w d4,d5
.zeroCount:
add.w d2,d2
subq.w #1,d0
bne.b .indexLoop
;
; Sort symbols
;moveq #0,d0 ; D0=0
moveq #0,d1
.sortLoop:
move.b 0(a3,d0.w),d1
beq.b .zeroIndex
add.b d1,d1
add.b d1,d1
move.w 2-4(a4,d1.w),d2
addq.w #1,2-4(a4,d1.w) ; Count or Index
add.w d2,d2
move.w d0,0(a4,d2.w)
.zeroIndex:
addq.w #1,d0
cmp.w d0,d3
bne.b .sortLoop
;
; Calculate LUT tables
move.l a4,a5
moveq #1,d1
moveq #-1,d4
moveq #0,d5
.lutLoop: ;
move.w (a4)+,d0 ; prefix
move.w (a4)+,d2 ; index
beq.b .zeroLut
lsl.l d1,d4
or.w d0,d4
subq.w #1,d4
ror.l d1,d4
move.l d4,(a5)+ ; code
;
sub.w d5,d2
sub.w d2,d0
movem.w d0/d1,HALFLUTSIZE-4(a5)
addq.w #2,d5
.zeroLut:
addq.w #1,d1
cmp.w #16,d1 ; this counter is wrong
bne.b .lutLoop ; should be max 15 times
move.l a3,a4
rts
;
   dc.b "Thanks to EAB for the idea of any-drive "
   dc.b "booting & loading and the non-system cache code. "
   dc.b "Greetings to Scoopex, Flashtro and Cave. "
cnop 0,4
end: