FTN.e08002

Apple II FTN - ShrinkIt (NuFX) document

Apple II
File Type Notes
_____________________________________________________________________________
                                                  Developer Technical Support

File Type:         $E0 (224)
Auxiliary Type:    $8002

Full Name:     NuFile Exchange Archival Library
Short Name:    ShrinkIt (NuFX) document

Revised by:    Andy Nicholas and Matt Deatherage                    July 1990
Written by:    Matt Deatherage                                      July 1989

Files of this type and auxiliary type contain NuFX Archival Libraries.
Changes since July 1989:  Rewrote major portions to reflect Master Version 
$0002 of the NuFX standard.
_____________________________________________________________________________

Introduction

NuFX is a robust, full-featured archival standard for the Apple II family.  
The standard, as presented in this Note, allows for full archival of ProDOS 
and GS/OS files while keeping all file attributes with each file, as well as 
providing necessary archival functions such as multiple compression schemes 
and multiple archival implementations of the same standard.  NuFX is 
implemented in the application ShrinkIt, a free archival utility program for 
enhanced IIe, IIc and IIgs computers.  (Versions for earlier Apple II models 
are also available.)

The NuFX standard was developed by Andrew Nicholas for Paper Bag Productions.  
Comments and suggestions on the NuFX standard or on ShrinkIt are welcome at:

                    Paper Bag Productions
                    8415 Thornberry Drive East
                    Upper Marlboro, MD  20772
                    Attn:  NuFX Technical Support
                    America Online:    ShrinkIt
                    GEnie:    ShrinkIt
                    CompuServe:    70771,2615


History

The Apple II community has always lacked a well-defined method for archiving 
files.  NuFX is an attempt to rectify the situation by providing a flexible, 
consistent standard for archiving files, disks, and other computer media.  
Although many files are archived using the Binary II standard (see Apple II 
File Type Note, File Type $E0, Auxiliary Type $8000), it was not designed as 
an archival standard and its continued use as such creates problems.  More 
people are using Binary II as an archival standard than as a way to keep 
attributes with a file when transferred, and this use is causing the original 
intent of Binary II to become lost and unused.

NuFX, developed as an archival standard for the days of GS/OS, allows:

  o  Filenames longer than 64 characters (GS/OS can create 8,000-
     character filenames).
  o  A convenient way to add to, remove from, and work on an archive.
  o  Including GS/OS files which contain resource forks.
  o  Including entire disk images.
  o  Including comments with a file.
  o  A convenient way to represent a file compressed or encrypted by a 
     specific application.
  o  A true archive standard.  Binary II's original intent was to make 
     transfer of Apple II files from local machines to large 
     information services possible; otherwise, a file's attribute 
     information would be lost.  Use of Binary II to archive files 
     rather than simply maintain their attributes stretches it beyond 
     its original intent.

Adding all of these features to the existing Binary II standard would be 
nearly impossible without violating the existing standard and causing a great 
deal of confusion.  Although Binary II is flexible, it is simply unable to 
address all of these concerns without alienating existing Binary II extraction 
programs.

To provide some differentiation between standards and provide a better 
functioning format, this Note presents a new standard called NuFX (NuFile 
eXchange for the Apple II; pronounced new-F-X).  NuFX fixes the problems that 
Apple IIgs users would soon be experiencing as other filing systems become 
available for GS/OS.  NuFX attempts to stem a set of problems before they have 
a chance to develop.  NuFX provides all of the features of Binary II, but goes 
further to allow the user the ultimate in flexibility, usefulness and 
performance.


Additional Date/Time Data type:

Date/Time (8 Bytes):

+000    second    Byte    The second, 0 through 59.
+001    minute    Byte    The minute, 0 through 59.
+002    hour      Byte    The hour, 0 through 23.
+003    year      Byte    The current year minus 1900.
+004    day       Byte    The day, 0 through 30.
+005    month     Byte    The month, 0 through 11 (0 = January).
+006    filler    Byte    Reserved, must be zero.
+007    weekDay   Byte    The day of the week, 1 through 7 (1 = Sunday).

The format of the Date/Time field is identical to that described for the 
ReadTimeHex call in the Apple IIgs Toolbox Reference Manual.
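
As an illustration of the layout above, here is a minimal Python sketch that
decodes one 8-byte Date/Time field.  The function name parse_nufx_datetime is
hypothetical, and treating an all-zero field as "unknown" follows the
convention this Note gives for unknown dates in the header fields.

```python
from datetime import datetime
from typing import Optional


def parse_nufx_datetime(raw: bytes) -> Optional[datetime]:
    """Decode an 8-byte Date/Time field; return None when it is all zero."""
    if len(raw) != 8:
        raise ValueError("a Date/Time field is exactly 8 bytes")
    if raw == bytes(8):
        return None  # an all-zero field means the date is not known
    second, minute, hour, year, day, month, _filler, _week_day = raw
    # year is stored minus 1900; day and month are stored zero-based.
    return datetime(1900 + year, month + 1, day + 1, hour, minute, second)
```

For example, the bytes 30, 15, 12, 90, 3, 6, 0, 4 decode to 12:15:30 on
July 4, 1990.  The weekDay byte is read but not checked in this sketch.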


Implementation

Figure 1 illustrates the basic structure of a NuFX archive.

                    |     First Record      |      Next Record      |
     _______________|_______________________|_______________________|
    | Master Header | Header |     Data     | Header |     Data     |
    |_______________|________|______________|________|______________|

                    Figure 1-NuFX Archive Structure

A single master header block contains values which describe the entire archive 
(those with knowledge of structured programming may consider them archive 
globals).  Each of the succeeding header blocks contains only information 
about the record it precedes (consider each an archive local).

Each header block is followed by a list of threads, which is followed by the 
actual threads.  The data for each thread may be a data fork, resource fork, 
message, control sequence for a NuFX utility program, or almost any kind of 
sequential data.

Possible Block Combinations:

The blocks must occur in the following fashion:

    Master Header Block containing N entries

    Header Block
    Threads list:
        filename_thread (16 bytes)
        message_thread (16 bytes)
        data_thread (16 bytes)
        .
        .
        .
    filename_thread's data (filename_thread's comp_thread_eof # of bytes)
    message_thread's data (message_thread's comp_thread_eof # of bytes)
    data_thread's data (data_thread's comp_thread_eof # of bytes)
    .
    .
    .
    Next Header Block (notice no second Master Header block)
    Threads list (message, control, data or resource)
    .
    .
    .
    Nth Header Block
    Threads list (message, control, data or resource)
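
Because the blocks are strictly sequential, the position of the next Header
Block can be computed from the sizes of the pieces that precede it.  The
sketch below is only an illustration of that arithmetic; the function name
and its parameters are hypothetical, and the field names it mentions
(attrib_count, filename_length, comp_thread_eof) are the ones defined in the
Header Block and Threads sections of this Note.

```python
THREAD_RECORD_SIZE = 16  # each entry in the threads list is 16 bytes


def next_record_offset(record_start, attrib_count, filename_length,
                       comp_thread_eofs):
    """Return the archive offset of the Header Block after this record."""
    # One 16-byte thread record per thread, then each thread's stored data.
    threads_list_size = THREAD_RECORD_SIZE * len(comp_thread_eofs)
    thread_data_size = sum(comp_thread_eofs)
    return (record_start + attrib_count + filename_length
            + threads_list_size + thread_data_size)
```

For a first record at offset 48 with an attrib_count of 60, no filename in the
header, and two threads occupying 32 and 1000 bytes, the next Header Block
would begin at 48 + 60 + 0 + 32 + 1032 = 1172.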

Master Header Block Contents

+000    nufile_id      6 Bytes    These six bytes spell the word "NuFile" in 
                                  alternating ASCII (low, then high) for 
                                  uniqueness.  The six bytes are $4E $F5 $46 
                                  $E9 $6C $E5.
+006    master_crc     Word       A 16-bit cyclic redundancy check 
                                  (CRC) of the remaining fields in this 
                                  block (bytes +008 through +047).  Any 
                                  programs which modify the master header 
                                  block must recalculate the CRC for the 
                                  master header.  (see the section "A Sample 
                                  CRC Algorithm")  The initial value of this 
                                  CRC is $0000.
+008    total_records  Long       The total number of records in this 
                                  archive file.  It is possible to chain 
                                  multiple records (files or disks) 
                                  together, as it is possible to chain 
                                  different types of records together (mixed 
                                  files and disks).
+012    archive_create_when
                       Date/Time  The date and time on which this archive 
                                  was initially created.  This field should 
                                  never be changed once initially written.  
                                  If the date is not known, or is unable to 
                                  be calculated, this field should be set to 
                                  zero.  If the weekday is not known, or is 
                                  unable to be calculated, this field should 
                                  be set to null.
+020    archive_mod_when
                       Date/Time  The date of the last modification to this 
                                  archive.  This field should be changed 
                                  every time a change is made to any of the 
                                  records in the archive.  If the date is 
                                  not known, or is unable to be calculated, 
                                  this field should be set to zero.  If the 
                                  weekday is not known, or is unable to be 
                                  calculated, this field should be set to 
                                  null.
+028    master_version
                       Word       The master version number of the NuFX 
                                  archive.  This Note describes 
                                  master_version $0002, for which the next 
                                  eight bytes are zeroed.
+030    reserved       8 Bytes    Must be null (all eight bytes $00).
+038    master_eof     Long       The length of the NuFX archive, in 
                                  bytes.  Any programs which modify the 
                                  length of an archive, either increasing it 
                                  or decreasing it in size, must change this 
                                  field in the master header to reflect the 
                                  new size.
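
The following Python sketch shows one way to read the fields above and check
master_crc.  Several things in it are assumptions rather than statements from
this section: multi-byte values are read low byte first (the usual 65816
convention), the CRC is taken to use polynomial $1021 processed most
significant bit first, and the block is taken to be 48 bytes long because the
CRC covers bytes +008 through +047 (bytes +042 through +047 are not described
above and are simply included in the CRC).  All function names are
hypothetical.

```python
import struct

MASTER_HEADER_SIZE = 48  # implied by the CRC range +008 through +047


def alternating_ascii(word: str) -> bytes:
    """Set the high bit on every second character (low, then high)."""
    return bytes(ord(ch) | (0x80 if i % 2 else 0x00)
                 for i, ch in enumerate(word))


def crc16(data: bytes, crc: int = 0x0000) -> int:
    """Bitwise CRC-16; polynomial $1021, MSB first (an assumption)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def parse_master_header(block: bytes) -> dict:
    """Read the Master Header Block fields described in this Note."""
    if len(block) < MASTER_HEADER_SIZE:
        raise ValueError("master header block is too short")
    if block[0:6] != alternating_ascii("NuFile"):
        raise ValueError("nufile_id does not match")
    master_crc, total_records = struct.unpack_from("<HL", block, 6)
    (master_version,) = struct.unpack_from("<H", block, 28)
    (master_eof,) = struct.unpack_from("<L", block, 38)
    return {
        "total_records": total_records,
        "archive_create_when": block[12:20],  # raw 8-byte Date/Time
        "archive_mod_when": block[20:28],     # raw 8-byte Date/Time
        "master_version": master_version,
        "master_eof": master_eof,
        "crc_ok": crc16(block[8:MASTER_HEADER_SIZE]) == master_crc,
    }
```

A program that modifies any of these fields would recompute crc16 over bytes
+008 through +047 and store the result at +006 before writing the block back.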

Header Block Contents:

Following the Master Header block is a regular Header Block, which precedes 
each record within the NuFX archive.  A cyclic redundancy check (CRC) has been 
provided to detect archives which have possibly been corrupted.  The only time 
the CRC should be included in a block is for the Master Header and for each of 
the regular Header Blocks.  The CRC ensures reliability and data integrity.

+000    nufx_id        4 Bytes    These four bytes spell the word "NuFX" in 
                                  alternating ASCII (low, then high) for 
                                  uniqueness.  The four bytes are $4E $F5 
                                  $46 $D8.
+004    header_crc     Word       The 16-bit CRC of the remaining 
                                  fields of this block (bytes +006 through 
                                  the end of the header block and any 
                                  threads following it).  This field is used 
                                  to verify the integrity of the rest of the 
                                  block.  Programs which create NuFX 
                                  archives must include this in every 
                                  header.  It is up to the discretion of the 
                                  extracting program to check the validity 
                                  of this CRC.  Any programs which might 
                                  modify the header of a particular record 
                                  must recalculate the CRC for the header 
                                  block.  The initial value for this CRC is 
                                  zero ($0000).
+006    attrib_count   Word       This field describes the length of 
                                  the attribute section of each record in 
                                  bytes.  This count measures the distance 
                                  in bytes from the first field (offset 
                                  +000) up to and including the 
                                  filename_length field.  By convention, the 
                                  filename_length field will always be the 
                                  last 2 bytes of the attribute section 
                                  regardless of what has preceded it.
+008    version_number
                       Word       Version of this record.  If version_number 
                                  is $0000, no option_list fields are 
                                  present.  If the version_number is $0001 
                                  option_list fields may be present.  If the 
                                  version_number is $0002 then option_list 
                                  fields may be present and a valid CRC-16 
                                  exists for the compressed data in the data 
                                  threads of this record.  If the 
                                  version_number is $0003 then option_list 
                                  fields may be present and a valid CRC-16 
                                  exists for the uncompressed data in the 
                                  data threads of this record.  The current 
                                  version number is $0003 and should always 
                                  be used when making archives.
+010    total_threads  Long       The number of thread subrecords 
                                  which should be expected immediately 
                                  following the filename or pathname at the 
                                  end of this header block.  This field is 
                                  extremely important as it contains the 
                                  information about the length of the last 
                                  third of the header.
+014    file_sys_id    Word       The native file system identifier:
                                      $0000    reserved
                                      $0001    ProDOS/SOS
                                      $0002    DOS 3.3
                                      $0003    DOS 3.2
                                      $0004    Apple II Pascal
                                      $0005    Macintosh HFS
                                      $0006    Macintosh MFS
                                      $0007    Lisa File System
                                      $0008    Apple CP/M
                                      $0009    reserved, do not use (The 
                                               GS/OS Character FST returns 
                                               this value)
                                      $000A    MS-DOS
                                      $000B    High Sierra
                                      $000C    ISO 9660
                                      $000D    AppleShare
                                      $000E-$FFFF    Reserved, do not use
                                  If the file system of a disk being 
                                  archived is not known, it should be set to 
                                  zero.
+016    file_sys_info  Word       Information about the current filing 
                                  system.  The low byte of this word (offset 
                                  +016) is the native file system separator.  
                                  For ProDOS, this is the slash (/ or $2F).  
                                  For HFS and GS/OS, the colon (: or $3A) is 
                                  used, and for MS-DOS, the separator is the 
                                  backslash (\ or $5C).  This separator is 
                                  provided so archival utilities may know 
                                  how to parse a valid file or pathname from 
                                  the filename field for the receiving file.  
                                  GS/OS archival utilities should not 
                                  attempt to parse pathnames, as it is not 
                                  possible to build in syntax rules for file 
                                  systems not currently defined.  Instead, 
                                  pass the pathname directly to GS/OS and 
                                  attempt translation (asking the user for 
                                  suggestions) only if GS/OS returns an 
                                  "Invalid Path Name Syntax" error.  The 
                                  high byte of this word is reserved and 
                                  should remain zero.
+018    access         Flag Long      Bits 31-8    reserved, must be zero
                                      Bit 7 (D)    1 = destroy enabled
                                      Bit 6 (R)    1 = rename enabled
                                      Bit 5 (B)    1 = file needs to be 
                                                   backed up
                                      Bits 4-3    reserved, must be zero
                                      Bit 2 (I)    1 = file is invisible
                                      Bit 1 (W)    1 = write enabled
                                      Bit 0 (R)    1 = read enabled
+022    file_type      Long       The file type of the file being archived.  
                                  For ProDOS 8 or GS/OS, this field should 
                                  always be what the operating system 
                                  returns when asked.  For disks being 
                                  archived, this field should be zero.
+026    extra_type     Long       The auxiliary type of the file being 
                                  archived.  For ProDOS 8 or GS/OS, this 
                                  field should always be what the operating 
                                  system returns when asked.  For disks 
                                  being archived, this field should be the 
                                  total number of blocks on the disk.
+030    storage_type   Word       For Files:  The storage type of the 
                                  file.  Types $1 through $3 are standard 
                                  (one-forked) files, type $5 is an extended 
                                  (two-forked) file, and type $D is a 
                                  subdirectory.
        file_sys_block_size
                       Word       For Disks:  The block size used by the 
                                  device should be placed in this field.  
                                  For example, under ProDOS, this field will 
                                  be 512, while for HFS it might be 524.  
                                  The GS/OS Volume call will return this 
                                  information if asked.
+032    create_when    Date/Time  The date and time on which 
                                  this record was initially created.  If the 
                                  creation date and time are available from 
                                  a disk device, this information should be 
                                  included.  If the date is not known, or is 
                                  unable to be calculated, this field should 
                                  be set to zero.  If the weekday is not 
                                  known, or is unable to be calculated, this 
                                  field should be set to zero.
+040    mod_when       Date/Time  The date and time on which this record was 
                                  last modified.  If the modification date 
                                  and time are available from a disk device, 
                                  this information should be included.  If 
                                  the date is not known, or is unable to be 
                                  calculated, this field should be set to 
                                  zero.  If the weekday is not known, or is 
                                  unable to be calculated, this field should 
                                  be set to zero.
+048    archive_when   Date/Time  The date and time on which 
                                  this record was placed in this archive.  
                                  If the date is not known, or is unable to 
                                  be calculated, this field should be set to 
                                  zero.  If the weekday is not known, or is 
                                  unable to be calculated, this field should 
                                  be set to zero.
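
The fixed portion of the Header Block (offsets +000 through +055) can be read
in one pass.  The Python sketch below is an illustration only: it assumes
low-byte-first storage of Words and Longs, does not verify header_crc, and
uses hypothetical names (parse_record_header, decode_access, and the flag
labels) that do not come from this Note.

```python
import struct

NUFX_ID = bytes([0x4E, 0xF5, 0x46, 0xD8])
FIXED_FIELDS = struct.Struct("<HHHLHHLLLH")  # offsets +004 through +031
FIELD_NAMES = ("header_crc", "attrib_count", "version_number",
               "total_threads", "file_sys_id", "file_sys_info", "access",
               "file_type", "extra_type", "storage_type")
ACCESS_BITS = {7: "destroy", 6: "rename", 5: "backup_needed",
               2: "invisible", 1: "write", 0: "read"}


def decode_access(access: int) -> set:
    """Return the labels of the access bits that are set."""
    return {name for bit, name in ACCESS_BITS.items() if access & (1 << bit)}


def parse_record_header(block: bytes) -> dict:
    """Read the fixed-size fields at the start of a Header Block."""
    if len(block) < 56:
        raise ValueError("header block is too short")
    if block[0:4] != NUFX_ID:
        raise ValueError("nufx_id does not match")
    fields = dict(zip(FIELD_NAMES, FIXED_FIELDS.unpack_from(block, 4)))
    # The low byte of file_sys_info is the pathname separator.
    fields["separator"] = chr(fields["file_sys_info"] & 0xFF)
    fields["access_flags"] = decode_access(fields["access"])
    # For disks, storage_type holds file_sys_block_size instead.
    fields["create_when"] = block[32:40]
    fields["mod_when"] = block[40:48]
    fields["archive_when"] = block[48:56]
    return fields
```

An access value of $C3, for instance, decodes to destroy, rename, write and
read enabled.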

The following option_list information is only present if the NuFX version 
number for this record is $0001 or greater.

+056    option_size    Word       The length of the FST-specific 
                                  portion of a GS/OS option_list returned by 
                                  GS/OS.  This field may be $0000, 
                                  indicating the absence of a valid 
                                  option_list.

A GS/OS option_list is formatted as follows:

        +000           buffer_size
                       Word       Size of the buffer for GS/OS to 
                                  place the option_list in, including 
                                  this count word.  This must be at 
                                  least $2E.
        +002           list_size
                       Word       The number of bytes of information 
                                  returned by GS/OS.
        +004           file_sys_ID
                       Word       A file system ID word (see list 
                                  above) identifying the FST owning 
                                  the file in question.
        +006           option_bytes
                       Bytes      The bytes returned by the FST.  
                                  There are (buffer_size - 6) of them.

The option_list contains information specific to native file systems that 
GS/OS doesn't normally use (such as true creator_type, file_type, and access 
privileges for AppleShare).  Other FSTs released in the future will follow 
similar conventions to return native file system specific parameters in the 
option_list.  Information in the option_list should always be copied from file 
to file.

The value option_size in the NuFX header is the value of list_size minus two.  
Immediately following the option_size count word are (list_size - 2) bytes.  
To pass these values back to the destination file system, construct an 
option_list with a suitably large buffer_size, a list_size of the NuFX 
option_size + 2, the file_sys_id of the source file, and the FST-returned 
option_bytes.
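
The reconstruction described above might be sketched in Python as follows.
This is an illustration under assumptions: Words are stored low byte first,
the default buffer_size of $0100 is an arbitrary "suitably large" choice, and
the function name build_option_list is hypothetical.  The list_bytes argument
stands for the FST-specific bytes stored in the NuFX header after
option_size.

```python
import struct


def build_option_list(file_sys_id: int, list_bytes: bytes, option_size: int,
                      buffer_size: int = 0x100) -> bytes:
    """Rebuild a GS/OS option_list buffer from the NuFX header values."""
    if buffer_size < 0x2E:
        raise ValueError("buffer_size must be at least $2E")
    option_bytes = list_bytes[:option_size]  # drop the pad byte, if any
    if len(option_bytes) != option_size:
        raise ValueError("fewer FST bytes than option_size promises")
    # buffer_size, list_size (option_size + 2), then the file system ID.
    header = struct.pack("<HHH", buffer_size, option_size + 2, file_sys_id)
    if len(header) + option_size > buffer_size:
        raise ValueError("buffer_size is too small for this option_list")
    return (header + option_bytes).ljust(buffer_size, b"\x00")
```

The unused remainder of the buffer is filled with zeros here; this Note does
not say what GS/OS expects in that space.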

+058    list_bytes     Bytes      FST-specific bytes returned in an 
                                  option_list.  These are the bytes in the 
                                  GS/OS option_list not including the FST ID 
                                  word.  There are option_size of them.  If 
                                  option_size is an odd number, one zero 
                                  byte of padding is added to keep the block 
                                  size an even number.

Because the attributes section does not have a fixed size, the next field must 
be found by looking two bytes before the offset indicated by attrib_count 
(+006).

+attrib_count - 2
        filename_length
                       Word       Obsolete, should be set to zero.  In 
                                  previous versions of NuFX, this field was 
                                  the length of a file name or pathname 
                                  immediately following this field.

                                  To allow the inclusion of future 
                                  additional parameters in the attributes 
                                  section, NuFX utility programs should rely 
                                  on the attrib_count field to find the 
                                  filename_length field.

                                  Current convention is to zero this field 
                                  when building an archive and put the file 
                                  or pathname into a filename thread so the 
                                  record can be renamed in the archive.  
                                  Archival programs should recognize both 
                                  methods to find a valid file name or 
                                  pathname.
+attrib_count
        filename       Bytes      Filename or partial pathname if 
                                  applicable.  If this is a disk being 
                                  archived, then the volume_name should be 
                                  included in this field.  If a volume name 
                                  is included in this field, a separator 
                                  should not be included in, or precede the 
                                  name.  If a volume name is not available, 
                                  then this field should be zeros.

                                  If a partial pathname is specified, the 
                                  directories to which the current pathname 
                                  refers need not have preceded this 
                                  particular record.  The extraction program 
                                  must test each referenced directory 
                                  individually.  If the directory in 
                                  question does not exist, the extracting 
                                  program should create it.

                                  Any utility which extracts files from a 
                                  NuFX archive must not assume that this 
                                  field will be in a format it is able to 
                                  handle.  In particular, extraction 
                                  programs should check for syntax 
                                  unacceptable to the operating system under 
                                  which they run and perform whatever 
                                  conversions are necessary to parse a legal 
                                  filename or pathname.  In general, assume 
                                  nothing.  (GS/OS programs should pass the 
                                  filename or pathname directly to GS/OS, 
                                  and only attempt to convert the name if 
                                  GS/OS returns an "invalid pathname syntax" 
                                  error.)

                                  Both high and low ASCII values are valid 
                                  but may not mean the same to each file 
                                  system (for example, all eight bits are 
                                  significant in AppleShare pathnames while 
                                  only seven are significant in ProDOS 
                                  pathnames).
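
Since the attributes section varies in size, a reader locates
filename_length and the filename relative to attrib_count rather than at
fixed offsets.  A minimal Python sketch of that lookup follows; it assumes
Words are stored low byte first, and the function name locate_filename is
hypothetical.

```python
import struct


def locate_filename(header: bytes) -> tuple:
    """Return (filename bytes, offset of the threads list) for one header."""
    (attrib_count,) = struct.unpack_from("<H", header, 6)
    # filename_length is always the last Word of the attributes section.
    (filename_length,) = struct.unpack_from("<H", header, attrib_count - 2)
    filename = header[attrib_count:attrib_count + filename_length]
    # The thread records begin immediately after the filename, if any.
    return filename, attrib_count + filename_length
```

When filename_length is zero, the returned name is empty and the caller would
look for a filename thread among the thread records instead.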


Threads

Thread Records are 16-byte records which immediately follow the Header Block 
(composed of the attributes and file name of the current record) and describe 
the types of data structures which are included with a given record.  The 
number of Thread Records is described in the attribute section by a Long, 
total_threads.

Each Thread Record should be checked for the type of information that a given 
utility program can extract.  If a utility is incapable of extracting a 
particular thread, that thread should be skipped (with the exception of 
extended files under ProDOS 8, which should be dearchived into AppleSingle 
format, or both threads should be skipped).  If a utility finds a redundancy 
in a Thread Record, it must decide whether to skip the record or to do 
something with that particular thread.  For example, if a utility finds two 
message_thread threads, it can either ignore the second one or display it.  
Likewise, if a utility finds two data_thread threads for the same file, it 
should inspect the thread_kind of each.  If they match, it can either 
overwrite the first thread extracted, or warn the user and skip the second 
thread.

Thread records can be represented as follows:

+000    thread_class   Word       The classification of the thread:
                                      $0000    message_thread
                                      $0001    control_thread
                                      $0002    data_thread
                                      $0003    filename_thread
+002    thread_format  Word       The format of the data within the thread:
                                      $0000    Uncompressed
                                      $0001    Huffman Squeeze
                                      $0002    Dynamic LZW/1 (ShrinkIt 
                                               specific)
                                      $0003    Dynamic LZW/2 (ShrinkIt 
                                               specific)
                                      $0004    Unix 12-bit Compress
                                      $0005    Unix 16-bit Compress
+004    thread_kind    Word       Describes the kind of data within 
                                  the thread.

thread_kind must be interpreted on the basis of thread_class.  See the table 
below for the currently defined thread_kind interpretations:

            class $0000  class $0001       class $0002            class $0003
            -----------  ----------------  ---------------------  -----------
kind $0000  ASCII text   create directory  data fork of file      filename
kind $0001  see below    undefined         disk image             undefined
kind $0002  see below    undefined         resource fork of file  undefined

+006    thread_crc     Word       For version_number $0003, this field 
                                  is the CRC of the original data before it 
                                  was compressed or otherwise transformed.  
                                  The CRC-16's initial value is set to $FFFF.
+008    thread_eof     Long       The length of the thread when uncompressed.
+012    comp_thread_eof
                       Long       The length of the thread when compressed.

Class $0000 with kind $0000 is obsolete and should not be used.

Class $0000 with kind $0001 has a predefined comp_thread_eof and a thread_eof 
whose length may change.  This way, a certain amount of space may be allocated 
when a record is created and edited later.

Class $0000 with kind $0002 is a standard Apple IIgs icon.  comp_thread_eof is 
the length of the icon image; thread_eof is ignored.

Class $0003 with kind $0000 has a predefined comp_thread_eof and a thread_eof 
whose length may change.  After this record is placed into the archive, the 
thread_eof can be changed if the name is changed, but the length of the name 
may not extend beyond the space allocated for it, comp_thread_eof.
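
The 16-byte Thread Record layout and the class/kind table can be expressed
compactly in code.  The Python sketch below is illustrative only: it assumes
low-byte-first storage, and the names ThreadRecord, parse_threads, describe
and the wording of the labels are hypothetical.

```python
import struct
from typing import NamedTuple


class ThreadRecord(NamedTuple):
    thread_class: int
    thread_format: int
    thread_kind: int
    thread_crc: int
    thread_eof: int
    comp_thread_eof: int


THREAD_LAYOUT = struct.Struct("<HHHHLL")  # 4 Words + 2 Longs = 16 bytes

# (thread_class, thread_kind) pairs defined in this Note.
THREAD_KINDS = {
    (0x0000, 0x0000): "ASCII text (obsolete)",
    (0x0000, 0x0001): "message with preallocated space",
    (0x0000, 0x0002): "Apple IIgs icon",
    (0x0001, 0x0000): "create directory",
    (0x0002, 0x0000): "data fork of file",
    (0x0002, 0x0001): "disk image",
    (0x0002, 0x0002): "resource fork of file",
    (0x0003, 0x0000): "filename",
}


def parse_threads(buf: bytes, offset: int, total_threads: int) -> list:
    """Read total_threads consecutive Thread Records starting at offset."""
    return [ThreadRecord(*THREAD_LAYOUT.unpack_from(buf, offset + 16 * i))
            for i in range(total_threads)]


def describe(thread: ThreadRecord) -> str:
    """Interpret thread_kind on the basis of thread_class."""
    key = (thread.thread_class, thread.thread_kind)
    return THREAD_KINDS.get(key, "undefined")
```

A utility could call describe on each record and skip any thread whose
result is "undefined" or whose thread_format it cannot decode.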

A thread_format of $0001 indicates Huffman Squeeze.  NuFX's Huffman is the 
same Huffman used by ARC v5.x, SQ and USQ, the source of which is publicly 
available and was originally written by Richard Greenlaw.  The first word of 
the thread data is the number of nodes followed by the Huffman tree and the 
actual data.  This is also the same algorithm decoded by the Apple II version 
of USQ written by Don Elton.  The C source to this is widely available.

A thread_format of $0002 indicates a special variant of LZW (LZW/1) used by 
ShrinkIt.  The first two bytes of this thread are a CRC-16 of the uncompressed 
data within the thread.  This CRC-16 is initialized to zero ($0000). The third 
byte is the low-level volume number used by the eight-bit version of ShrinkIt 
to format 5.25" disks.  The fourth byte is the run-length character used to 
decode the rest of the thread.  The data which comprises the compressed file 
or disk immediately follows the RLE character.
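
The four-byte LZW/1 thread header described above might be read as in the
following Python sketch.  It assumes the CRC-16 is stored low byte first;
the function name parse_lzw1_header is hypothetical.

```python
import struct

def parse_lzw1_header(thread_data):
    """Return (crc, volume_number, rle_char, offset_of_first_chunk)."""
    # Word CRC-16 (assumed little-endian), then two single bytes.
    crc, volume_number, rle_char = struct.unpack_from("<HBB", thread_data, 0)
    return crc, volume_number, rle_char, 4  # chunk data starts at byte 4
```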

When ShrinkIt compresses a file, it reads 4096-byte chunks of the file until 
it reaches the file's EOF.  The last 4096-byte chunk is padded with zeroes if 
the file's length is not an exact multiple of 4096.  A disk is compressed the 
same way, by reading sequential blocks of 4096 bytes.
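
The chunking and zero padding just described can be sketched in Python as
follows.  This is only an illustration of the padding rule, not ShrinkIt's
own code; the names CHUNK_SIZE and iter_chunks are hypothetical.

```python
CHUNK_SIZE = 4096

def iter_chunks(data):
    """Yield successive 4096-byte chunks, zero-padding the last one."""
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        # ljust pads on the right with $00 up to the full chunk size
        yield chunk.ljust(CHUNK_SIZE, b"\x00")
```

For example, a 5000-byte file yields two chunks, the second holding 904
bytes of file data followed by 3192 zero bytes.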

Each 4K chunk is first compressed with RLE compression.  The RLE character is 
determined by reading the fourth byte of the thread.  The RLE character which 
is used by most current versions of ShrinkIt is $DB.  A run of characters is 
represented by three bytes, consisting of the run character, the number of 
characters in the run and the character in the run.  If the 4K chunk expands 
after being compressed with RLE then the uncompressed 4K chunk is passed to 
the LZW compressor. If the 4K chunk shrinks after being compressed with RLE 
then the RLE-compressed image of the 4K chunk is passed to the LZW compressor.

ShrinkIt's LZW compressor individually compresses each 4K chunk passed to it 
by using variable length (9 to 12 bits) codes.  The way that ShrinkIt's LZW 
compressor functions is almost identical to the algorithm used in the public 
domain utility Compress.  The first code is $0101.  The LZW string table is 
cleared before compressing each 4K chunk.  If LZW makes the chunk larger, 
then the chunk as it was passed to the LZW compressor (which may be 
run-length-encoded or just uncompressed data) is written to the file instead.

The first word of every 4K chunk is aligned to a byte boundary within the file 
and is the length which resulted from the attempt at compressing the chunk 
with RLE.  If the value of this word is 4096, then RLE was not successful at 
compressing the chunk.  A single byte follows the word and indicates whether 
or not LZW was performed on this chunk.  A value of zero indicates that LZW 
was not used, while a value of one indicates that LZW was used and that the 
chunk must first be decompressed with LZW before doing any further processing.
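
A minimal Python sketch of reading this three-byte LZW/1 chunk header
follows.  It assumes the word is stored low byte first; the function name
parse_lzw1_chunk_header is hypothetical.

```python
import struct

def parse_lzw1_chunk_header(buf, offset):
    """Return (rle_length, lzw_used, offset_of_chunk_data)."""
    rle_length, lzw_flag = struct.unpack_from("<HB", buf, offset)
    if lzw_flag not in (0, 1):
        # Only zero and one are defined for this byte.
        raise ValueError("unexpected LZW flag: %d" % lzw_flag)
    return rle_length, lzw_flag == 1, offset + 3
```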

To decompress a file, each 4K chunk must first be expanded if it was 
compressed by LZW.  If the 4K chunk wasn't compressed with LZW, then the word 
which appears at the beginning of each chunk must be used to determine if the 
data for the current chunk needs to be processed by the run-length decoder.  
If the value of the word is 4096, then run-length decoding does not need to 
occur because the data is uncompressed.

If the word indicates that the length of the chunk after being decompressed by 
LZW is 4096 bytes, then no run-length decoding needs to take place.  If the 
value of the word is less than 4096, then the chunk must be run-length decoded 
to 4096 bytes.

A chunk can be stored in one of four states.  It can be uncompressed data.  It 
can be run-length-encoded data without LZW compression.  It can be 
uncompressed data on which RLE was attempted (but failed) and which was then 
compressed with LZW.  Or, finally, it can be compressed with RLE and then also 
compressed with LZW.
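
The four cases can be told apart from the two values in the chunk header.
The following Python sketch lists which decoders must be applied, in order,
to recover a 4096-byte chunk.  The function name decode_steps and the step
labels are hypothetical.

```python
def decode_steps(rle_length, lzw_used):
    """Return the decoders to run, in order, for one 4K chunk."""
    steps = []
    if lzw_used:
        steps.append("lzw")   # LZW is always undone first
    if rle_length < 4096:
        steps.append("rle")   # RLE succeeded, so it must be undone too
    return steps
```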

A thread_format of $0003 indicates a special variant of LZW (LZW/2) used by 
ShrinkIt.  The first byte is the low-level volume number used by the eight-bit 
version of ShrinkIt to format 5.25" disks.  The second byte is the run-length 
character used to decode the rest of the thread.  The data which comprises the 
compressed file or disk immediately follows the second byte of the thread.

The format of LZW/2 is almost the same as LZW/1 with a few exceptions.  Unlike 
LZW/1, where the LZW string table is automatically cleared before each 4K 
chunk is processed, the LZW string table used by LZW/2 is only cleared when 
the table becomes full, indicating a change in the redundancy of the source 
text.  Not clearing the string table almost always yields improved compression 
ratios because the compressor's dictionary is not being depleted every 4K and 
larger strings are allowed to accumulate. The clear code used by ShrinkIt is 
$100.  Whenever the decompressor sees a $100 code, it must clear the string 
table.

The string table is also cleared when the compressor has to "back track" 
because a 4K chunk became larger.  Whenever a chunk that is not compressed by 
LZW is seen by the decompressor, the LZW string table must be cleared.  Bits 
0-12 of the first word of each chunk in an LZW/2 thread indicate the size of 
the chunk after being compressed with RLE.  The high bit (bit 15) indicates 
whether or not LZW was used on the chunk.  If LZW was not used (bit 15 = 0), 
the data for the chunk immediately follows the first word.  If LZW was used 
(bit 15 = 1), a second word which is a count of the total number of bytes used 
by the current chunk follows the first word.  The mark of the next chunk can 
be found by taking the mark at the beginning of the current chunk and adding 
the second word to it, using that as an offset for a ProDOS 8 or GS/OS SetMark 
call.  This is not normally necessary because the next chunk is processed 
immediately after the current chunk.

This second word is an improvement over LZW/1 because if a chunk becomes 
corrupted, but the second word is valid, the next chunk can be found and most 
of the file recovered.  The second word is not needed (and not present) when 
LZW is not used on the chunk because the first word is also a count of the 
number of bytes which follow that word.
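
The LZW/2 chunk header and the calculation of the next chunk's mark can be
sketched in Python as follows.  It assumes both words are stored low byte
first and takes the second word, as described above, to be counted from the
mark at the beginning of the chunk.  The function name
parse_lzw2_chunk_header is hypothetical.

```python
import struct

def parse_lzw2_chunk_header(buf, mark):
    """Return (rle_length, lzw_used, data_start, next_mark)."""
    (first,) = struct.unpack_from("<H", buf, mark)
    rle_length = first & 0x1FFF        # bits 0-12
    lzw_used = bool(first & 0x8000)    # bit 15
    if lzw_used:
        # A second word gives the total bytes used by this chunk.
        (total,) = struct.unpack_from("<H", buf, mark + 2)
        data_start = mark + 4
        next_mark = mark + total
    else:
        # No second word; the first word counts the bytes that follow it.
        data_start = mark + 2
        next_mark = data_start + rle_length
    return rle_length, lzw_used, data_start, next_mark
```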

A thread_format of $0004 indicates that a maximum of 12 bits per LZW code by 
Compress was used to build this thread.  The actual thread data contains 
Compress's usual three-byte signature, the third byte of which contains the 
number of bits per LZW code that was actually used.  The number of bits 
may be less than or equal to 12.  Optimally, this requires (at 12 bits) a 16K 
hash table to decode and should be used only for transferring to machines with 
limited amounts of memory.  The C source to Compress is in the public domain 
and is widely available.

A thread_format of $0005 indicates that a maximum of 16 bits per LZW code by 
Compress was used to build this thread.  The actual thread data contains 
Compress's usual three-byte signature, the third byte of which contains the 
number of bits per LZW code that was actually used.  The number of bits 
may be less than or equal to 16.  Optimally, this requires (at 16 bits) a 256K 
hash table to decode.  The C source to Compress is in the public domain and is 
widely available.
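
A decoder can check the Compress signature before committing memory to a
table.  The following Python sketch assumes the signature used by common
Compress sources: the bytes $1F $9D, followed by a byte whose low five bits
hold the maximum code size.  Those values come from Compress itself and are
not defined by this Note.  The function name compress_max_bits is
hypothetical.

```python
COMPRESS_MAGIC = b"\x1f\x9d"  # assumed from common Compress sources

def compress_max_bits(thread_data, limit):
    """Return the code size from the signature; limit is 12 or 16."""
    if len(thread_data) < 3 or thread_data[:2] != COMPRESS_MAGIC:
        raise ValueError("missing Compress signature")
    max_bits = thread_data[2] & 0x1F   # low five bits of the third byte
    if max_bits > limit:
        raise ValueError("code size exceeds limit for this thread_format")
    return max_bits
```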

If a control_thread indicates that a directory should be created on the 
destination device, the path to be created must take the form of a ProDOS 
partial pathname.  That is, the path must not be preceded with a volume name.  
For example, /Stuff/SubDir is an invalid path for this control_thread, while 
SubDir/AnotherSubDir is valid.

If a control_thread indicates that a path is to be created, all subdirectories 
that are contained in the pathname must be created.
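
The two rules above (partial pathnames only, and every subdirectory in the
path must be created) can be sketched in Python as follows.  The sketch only
computes the list of directories, shallowest first, and assumes "/" as the
separator; the function name directories_to_create is hypothetical.

```python
def directories_to_create(path):
    """List each directory in a partial pathname, shallowest first."""
    if path.startswith("/"):
        # A leading "/" would name a volume, which is not allowed here.
        raise ValueError("path must be a partial pathname")
    parts = [p for p in path.split("/") if p]
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
```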

control_thread threads will eventually be used to control the execution of 
utility programs by allowing them to create, rename, and delete directories 
and files and to move and modify files.  A form of scripting language will 
eventually be able to allow utility programs to perform these actions 
automatically.  control_thread threads will allow extraction programs to 
perform operations similar to those of the Apple IIgs Installer, allowing 
updates to program sets dependent on such things as creation or modification 
dates and version numbers.


Extra Information

If the file system of a particular disk is not known, the file_sys_id field 
should be set to zero, the volume name should also be zeroed, and all the 
other fields pertaining only to files should be set to zero.

If the file system of a particular disk is known, as many of the fields as 
possible should be filled with the correct information.  Fields which do not 
pertain to an archived disk should remain set to zero.

If an entire disk is added to the archive without some form of compression 
(i.e., record_format = uncompressed), then the blocks which comprise the disk 
image must be added sequentially from the first through the last block.  Since 
there will be no character included in the data stream to mark the end or 
beginning of a block, extraction programs should rely on the 
file_sys_block_size field to determine how many bytes to read from the record 
to properly fill a block.

Some Useful Thread Algorithms:

The beginning of the thread records can be found with the following algorithm:

    Threads := (mark at beginning of header) + (attrib_count) + 
               (filename_length)

The end of the thread records can be found with the following algorithm:

    endOfThreads := Threads + (16 * total_threads)

The beginning of a data_thread can be found with the following formula:

    Data Mark := endOfThreads + (comp_thread_eof of all threads in the thread 
                 list which are not data prior to finding a data_thread)

The beginning of a resource_thread may be found with the following algorithm:

    Resource Mark := endOfThreads + (comp_thread_eof of all threads in the 
                     thread list which are not data prior to finding a 
                     resource_thread)

The next record can be found using the following algorithm:

    Next Mark := endOfThreads + (sum of comp_thread_eof of every thread)
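
The mark calculations above can be sketched in Python as follows.  Here
comp_eofs is a hypothetical list holding the comp_thread_eof of each thread
in thread-list order, and index is the position of the wanted thread (for
example, the first data_thread) in that list.  All function names are
hypothetical.

```python
THREAD_RECORD_SIZE = 16

def threads_mark(header_mark, attrib_count, filename_length):
    """Mark of the first Thread Record."""
    return header_mark + attrib_count + filename_length

def end_of_threads(threads, total_threads):
    """Mark just past the last Thread Record."""
    return threads + THREAD_RECORD_SIZE * total_threads

def thread_data_mark(end_mark, comp_eofs, index):
    """Mark of one thread's data: skip the data of all earlier threads."""
    return end_mark + sum(comp_eofs[:index])

def next_record_mark(end_mark, comp_eofs):
    """Mark of the next record: skip the data of every thread."""
    return end_mark + sum(comp_eofs)
```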

The file name and its length can be found with the following algorithm:

    if (filename_length > 0)
        then
            length of filename is filename_length;
            filename is found at (mark at beginning of header) + attrib_count;
        else
            look through list of threads for a filename_thread;
            if you find one, then length of filename is thread_eof;
            if you don't find one, then you don't have a filename.


Directories

Directories are handled almost the same way that normal files are handled with 
the exception that there will be no data in the thread which follows the 
entry.  A Thread Record must exist to inform a utility that a directory is to 
be created through the use of the proper control_thread value.

Directories do not necessarily have to precede a record which references a 
directory.  For example, if a record contains Stuff/MyStuff, the directory 
Stuff need not exist for the extracting program to properly extract the 
record.  The extracting program must check to see if each of the directories 
referenced exist, and if one does not exist, create it.  While this method 
places a great burden on the abilities of the extraction program, it avoids 
the anomalies associated with the deletion of directories within an archive.


A Sample CRC Algorithm

Paper Bag Productions provides the source code to a very fast routine which 
does the CRC calculation as needed for NuFX archives.  The routine makeLookup 
needs to be called only once.  After the first call, the routine doByte should 
be called repeatedly with each new byte in succession to generate the 
cumulative CRC for the block.  The CRC word should be reset to null ($0000) 
before beginning each new CRC.

This is the same CRC calculation which is done for CRC/Xmodem and Ymodem.  The 
code is easily portable to a 16-bit environment like the Apple IIgs.  The only 
detrimental factor with this routine is that it requires 512 bytes of main 
memory to operate.  If you can spare the space, this is one of the fastest 
routines Paper Bag Productions knows to generate a CRC-16 on a 6502-type 
machine.

The CRC word should be reset to $0000 for normal CRC-16 and to $FFFF before 
generating the CRC on the unpacked data for each data thread.


*-------------------------------
* fast crc routine based on table lookups by
* Andy Nicholas - 03/30/88 - 65C02 - easily portable to nmos 6502 also.
* easily portable into orca/m format, just snip and save.
* Modified for generic EDAsm type assemblers - MD 6/19/89

         X6502                          turn 65c02 opcodes on

*-------------------------------
* routine to make the lookup tables
*-------------------------------

makeLookup
         LDX   #0                       zero first page
zeroLoop STZ   crclo,x                  zero crc lo bytes
         STZ   crchi,x                  zero crc hi bytes
         INX
         BNE   zeroLoop

*-------------------------------
* the following is the normal bitwise computation
* tweaked a little to work in the table-maker

docrc
         LDX   #0                       number to do crc for

fetch    TXA
         EOR   crchi,x                  add byte into high
         STA   crchi,x                  of crc

         LDY   #8                       do 8 bits
loop     ASL   crclo,x                  shift current crc-16 left
         ROL   crchi,x
         BCC   loop1

* if previous high bit wasn't set, then don't add crc
* polynomial ($1021) into the cumulative crc.  else add it.

         LDA   crchi,x                  add hi part of crc poly into
         EOR   #$10                     cumulative crc hi
         STA   crchi,x

         LDA   crclo,x                  add lo part of crc poly into
         EOR   #$21                     cumulative crc lo
         STA   crclo,x
loop1    DEY                            do next bit
         BNE   loop                     done? nope, loop

         INX                            do next number in series (0-255)
         BNE   fetch                    didn't roll over, so fetch more
         RTS                            done

crclo    ds    256                      space for low byte of crc table
crchi    ds    256                      space for high bytes of crc table


*-------------------------------
* do a crc on 1 byte/fast
* on initial entry, CRC should be initialized to 0000
* on entry, A = byte to be included in CRC
* on exit, CRC = new CRC
*-------------------------------

doByte
         EOR   crc+1                    add byte into crc hi byte
         TAX                            to make offset into tables

         LDA   crc                      get previous lo byte back
         EOR   crchi,x                  add it to the proper table entry
         STA   crc+1                    save it

         LDA   crclo,x                  get new lo byte
         STA   crc                      save it back

         RTS                            all done

crc      dw    0000                     cumulative crc for all data

The following CRC check is written in APW assembler format for an Apple IIgs 
with 16-bit memory and registers on entry.

crcByte  start

crc      equ         $0
crca     equ         $2
crcx     equ         $4
crctemp  equ         $6

         sta         crca                                                 4
         stx         crcx                                                 4

         eor         crc+1              on entry, number to add to CRC    4
         and         #$00ff             is in (A)                         3
         asl         a                                                    2
         tax                                                              2
         lda         crc16Table,x                                         5
         and         #$00ff                                               3
         sta         crcTemp                                              4

         lda         crc-1                                                4
         eor         crc16Table,x                                         5
         and         #$ff00                                               3
         ora         crcTemp                                              4
         sta         crc                                                  4

         lda         crca                                                 4
         ldx         crcx                                                 4
         rts                                                    cycles = 59


;
; CRC-16 Polynomial = $1021
;
crc16table anop
         dc    i'$0000, $1021, $2042, $3063, $4084, $50a5, $60c6, $70e7'
         dc    i'$8108, $9129, $a14a, $b16b, $c18c, $d1ad, $e1ce, $f1ef'
         dc    i'$1231, $0210, $3273, $2252, $52b5, $4294, $72f7, $62d6'
         dc    i'$9339, $8318, $b37b, $a35a, $d3bd, $c39c, $f3ff, $e3de'
         dc    i'$2462, $3443, $0420, $1401, $64e6, $74c7, $44a4, $5485'
         dc    i'$a56a, $b54b, $8528, $9509, $e5ee, $f5cf, $c5ac, $d58d'
         dc    i'$3653, $2672, $1611, $0630, $76d7, $66f6, $5695, $46b4'
         dc    i'$b75b, $a77a, $9719, $8738, $f7df, $e7fe, $d79d, $c7bc'
         dc    i'$48c4, $58e5, $6886, $78a7, $0840, $1861, $2802, $3823'
         dc    i'$c9cc, $d9ed, $e98e, $f9af, $8948, $9969, $a90a, $b92b'
         dc    i'$5af5, $4ad4, $7ab7, $6a96, $1a71, $0a50, $3a33, $2a12'
         dc    i'$dbfd, $cbdc, $fbbf, $eb9e, $9b79, $8b58, $bb3b, $ab1a'
         dc    i'$6ca6, $7c87, $4ce4, $5cc5, $2c22, $3c03, $0c60, $1c41'
         dc    i'$edae, $fd8f, $cdec, $ddcd, $ad2a, $bd0b, $8d68, $9d49'
         dc    i'$7e97, $6eb6, $5ed5, $4ef4, $3e13, $2e32, $1e51, $0e70'
         dc    i'$ff9f, $efbe, $dfdd, $cffc, $bf1b, $af3a, $9f59, $8f78'
         dc    i'$9188, $81a9, $b1ca, $a1eb, $d10c, $c12d, $f14e, $e16f'
         dc    i'$1080, $00a1, $30c2, $20e3, $5004, $4025, $7046, $6067'
         dc    i'$83b9, $9398, $a3fb, $b3da, $c33d, $d31c, $e37f, $f35e'
         dc    i'$02b1, $1290, $22f3, $32d2, $4235, $5214, $6277, $7256'
         dc    i'$b5ea, $a5cb, $95a8, $8589, $f56e, $e54f, $d52c, $c50d'
         dc    i'$34e2, $24c3, $14a0, $0481, $7466, $6447, $5424, $4405'
         dc    i'$a7db, $b7fa, $8799, $97b8, $e75f, $f77e, $c71d, $d73c'
         dc    i'$26d3, $36f2, $0691, $16b0, $6657, $7676, $4615, $5634'
         dc    i'$d94c, $c96d, $f90e, $e92f, $99c8, $89e9, $b98a, $a9ab'
         dc    i'$5844, $4865, $7806, $6827, $18c0, $08e1, $3882, $28a3'
         dc    i'$cb7d, $db5c, $eb3f, $fb1e, $8bf9, $9bd8, $abbb, $bb9a'
         dc    i'$4a75, $5a54, $6a37, $7a16, $0af1, $1ad0, $2ab3, $3a92'
         dc    i'$fd2e, $ed0f, $dd6c, $cd4d, $bdaa, $ad8b, $9de8, $8dc9'
         dc    i'$7c26, $6c07, $5c64, $4c45, $3ca2, $2c83, $1ce0, $0cc1'
         dc    i'$ef1f, $ff3e, $cf5d, $df7c, $af9b, $bfba, $8fd9, $9ff8'
         dc    i'$6e17, $7e36, $4e55, $5e74, $2e93, $3eb2, $0ed1, $1ef0'
         end
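
For readers working outside assembly language, the same table-driven
calculation can be sketched in Python as follows.  This is an illustration
only, not Paper Bag Productions' code; the names make_table, CRC16_TABLE and
crc16 are hypothetical.  The initial value is passed in, so the same routine
serves for both the $0000 and the $FFFF cases.

```python
def make_table():
    """Build the 256-entry table for polynomial $1021, one bit at a time."""
    table = []
    for n in range(256):
        crc = n << 8                      # byte goes into the high half
        for _ in range(8):
            if crc & 0x8000:              # high bit set: shift, add poly
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
        table.append(crc)
    return table

CRC16_TABLE = make_table()

def crc16(data, crc=0x0000):
    """Fold each byte of data into crc, as doByte does."""
    for byte in data:
        index = (crc >> 8) ^ byte         # old high byte XOR new byte
        crc = ((crc << 8) & 0xFFFF) ^ CRC16_TABLE[index]
    return crc
```

With an initial value of $0000, the ASCII string "123456789" gives $31C3,
the usual check value for the Xmodem CRC; with $FFFF it gives $29B1.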

Further Reference
_____________________________________________________________________________
  o  ProDOS 8 Technical Reference Manual
  o  GS/OS Reference
  o  Apple IIgs Toolbox Reference Manual
  o  Apple II File Type Note, File Type $E0, Auxiliary Type $8000
  o  Apple II Miscellaneous Technical Note #14, Guidelines for 
     Telecommunication Programs
  o  "A Technique for High-Performance Data Compression," T. Welch, 
     IEEE Computer, Vol. 17, No. 6, June 1984, pp. 8-19.