NuFX Addendum
This addendum clarifies and extends certain aspects of the NuFX specification. This is not an "official" modification of the original document - it has not been reviewed and approved by the original author - but anyone developing NuFX utilities would do well to follow these recommendations.
The NuFX specification defines a very loose structure, and leaves much to the imagination of the implementer. For example, "If a utility finds a redundancy in a Thread Record, it must decide whether to skip the record or to do something with that particular thread...". A strict specification would declare that the situation must never arise, and define a standard approach for dealing with the anomalous condition. The current specification declares that the situation may arise, and requires the application author to come up with a solution.
This document refines the NuFX specification and brings some of the "fuzzy" areas into sharper focus. Nothing in this document contravenes the original document.
In the text below, "must" is an imperative that has to be obeyed, and "should" is a recommendation that authors are strongly encouraged to follow.
What's the correct way to pronounce "NuFX"? The specification doesn't say. There are two basic camps, letter-by-letter ("en you eff ecks") and minimal-syllable ("new fix" or "new fuchs"). I don't recall how Andy Nicholas says it, so let it be "new fix".
Originally, only ".SHK" was used to represent a NuFX archive. Over time, a convention of using ".SDK" to represent archives with a single disk image in them has arisen. This is very convenient for emulators on systems that rely on the file extension (e.g. Windows), so use of ".SDK" is encouraged.
An archive without records, i.e. nothing but a master header block, serves no purpose.
Creating: Archives without any records in them must never be created. All archives must have at least one record.
Opening: If asked to open a record-less archive, the application should recognize that the archive is empty and proceed as if it were a new archive.
Modifying: If all records in an archive are deleted, the archive file must be deleted as well.
A record without threads is pretty pointless.
Creating: Records without threads must never be created. All records must have at least one thread.
Extracting: Empty records should be ignored.
GS/ShrinkIt v1.1 has a bug that prevents it from creating an empty data thread when asked to add a zero-byte file. This results in a thread with a filename and nothing else. (If it was the first new record added, it will have an empty comment thread as well.)
There is no valid reason for deliberately creating such a file.
Creating: Records composed solely of a filename thread must not be created.
Extracting: Records with nothing but a filename thread should be ignored. For GSHK v1.1 bug compatibility: if a record has a filename thread, and no other threads except "message" threads (i.e. no data threads or control threads), then a zero-byte data fork file should be created. Otherwise, the record should be ignored. If the ProDOS storage type field indicates an extended file, a zero-byte resource fork should also be created.
A record without a filename thread is a curious beast. Ideally, there wouldn't be any such thing as a filename thread, since it doesn't really make sense to have a record without one. Expanding the record header to hold a pre-sized buffer would've made many things simpler.
This particular situation occurred with older versions of ShrinkIt (e.g. v1.1) that failed to store a volume name when compressing a DOS 3.3 disk. There was no filename in the record header, nor one in a filename thread.
The only situation where a record without a filename makes sense is if the record holds nothing but comments or other archive "meta data", such as a "create directory" control thread.
Creating: Records without filenames must not be created, unless the record is intended to contain nothing but archive meta-data. Deletion of the filename thread should only be done if a new filename thread is being added. If data threads are added to a record without a filename, then a filename thread must be added as well.
Extracting: If the record contains file data, the application may either prompt the user for a filename to use, or supply a generated one.
This is an unusual situation that should only arise if an application is buggy. Every record created by a modern application should have no more than one filename thread.
Creating: Records with multiple filename threads must not be created.
Extracting: Applications must use the first filename thread. If a buggy application appends an additional filename thread, that extra filename will be ignored.
The old way of storing filenames, used by NuLib and old versions of ShrinkIt, was to put the filename in the record header. To facilitate renaming, the filename was moved into a thread. Thus, there are two possible locations where the filename may live, and no guarantee that only one will be used.
Creating: Never put the filename in the record header when creating a new record. It's okay to leave existing records alone, but if an application has the opportunity to rewrite the record header, the record filename must be removed.
Extracting: The thread filename takes precedence over the record header filename.
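The two precedence rules (first filename thread wins, and a thread filename beats the record header filename) can be sketched together as follows. This is a minimal illustration, not a definitive implementation; the function name and its arguments are hypothetical, and the record is assumed to have been parsed already.

```python
from typing import Optional, Sequence


def pick_filename(header_name: bytes,
                  filename_threads: Sequence[bytes]) -> Optional[bytes]:
    """Choose the filename for a record, or return None if it has none.

    header_name: filename bytes from the record header (may be empty).
    filename_threads: contents of each filename thread, in archive order.
    Both arguments are hypothetical inputs from an already-parsed record.
    """
    if filename_threads:
        # Thread filename takes precedence over the header filename,
        # and the first filename thread takes precedence over later ones.
        return filename_threads[0]
    if header_name:
        return header_name
    # No filename anywhere; the caller must prompt or generate one.
    return None
```

A record with header name "OLD" and filename threads "NEW" and "BOGUS" would be extracted as "NEW".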
Filenames in NuFX archives use the Mac OS Roman character set, which is ASCII plus some symbols and the usual set of Latin language characters (see Unicode definition). The NuFX filename definition was intended to accommodate files from HFS volumes, which may contain any character except ':'. Control characters, including NUL ('\0'), were allowed but discouraged.
On modern systems, converting between Mac OS Roman and Unicode is useful and (mostly) straightforward. Dealing with embedded null bytes is very annoying in C-like languages though.
Creating: Convert Unicode to Mac OS Roman, replacing any untranslatable characters with '?'. Embedded nulls must be replaced with '?'.
Extracting: Convert Mac OS Roman to Unicode. If embedded nulls are encountered, they should be replaced with something appropriate for the current system. Applications are allowed to ignore the problem and truncate the filename, but must be prepared to handle duplicate or empty filenames.
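A minimal sketch of both conversions, using Python's built-in "mac_roman" codec. The function names are hypothetical, and the choice of '_' as the substitute for embedded nulls on extraction is only an assumption; pick whatever suits the target system.

```python
def unicode_to_nufx_name(name: str) -> bytes:
    """Convert a Unicode filename to Mac OS Roman bytes for storage."""
    # errors="replace" substitutes b"?" for characters Mac OS Roman lacks.
    raw = name.encode("mac_roman", errors="replace")
    # Embedded nulls must not be stored; they become '?' as well.
    return raw.replace(b"\x00", b"?")


def nufx_name_to_unicode(raw: bytes, nul_substitute: str = "_") -> str:
    """Convert stored Mac OS Roman bytes to a Unicode filename."""
    text = raw.decode("mac_roman")
    # Replace embedded nulls rather than truncating at the first one,
    # which avoids producing duplicate or empty names.
    return text.replace("\x00", nul_substitute)
```

Replacing nulls instead of truncating means the application does not have to cope with the duplicate or empty filenames that truncation can produce.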
Every record header has a "file system separator" character ("fssep") in the "file_sys_info" word. This is usually something like ':' for GS/OS or '/' for UNIX. It's necessary to know what the separator is in order to break a pathname down into its individual components.
Not all filesystems support subdirectories, however, which means that not all filenames need to have a separator. The appropriate separator character for such a filesystem is not defined in the NuFX spec. Clearly it should be something illegal on the source filesystem, or we could inadvertently see pathnames where they don't exist (e.g. a file called "foo:bar" on DOS 3.3 if the fssep char were set to ':').
The trouble is, DOS 3.3 doesn't actually have any illegal characters, just a field of 30 characters padded with spaces. Pascal disks are similar. Since we must define an fssep for every filename, our best choice is to use '\0' (0x00), because it's unlikely to occur, and any program that stores names in C strings will find it awkward to store and scan for '\0'.
This situation also applies to archived disk images, which must be simple filenames.
(NOTE: as of v2.0.3, NufxLib rejects 0x00 as an fssep character. This is a bug.)
Creating: When adding files directly from filesystems without subdirectories, use 0x00 as the fssep char.
Extracting: An fssep char of 0x00 means the pathname is just the filename.
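The extraction rule can be sketched as a small path-splitting helper. The function name is hypothetical, and the fssep value is assumed to have been pulled out of the "file_sys_info" word already.

```python
from typing import List


def split_path(name: bytes, fssep: int) -> List[bytes]:
    """Break a stored pathname into components using the record's fssep.

    fssep is the separator as a byte value (e.g. ord(':') or ord('/')).
    A value of 0x00 means the source filesystem had no subdirectories,
    so the whole name is a single filename.
    """
    if fssep == 0:
        return [name]
    return name.split(bytes([fssep]))
```

With this rule, "foo:bar" from a DOS 3.3 disk (fssep 0x00) stays a single component, while "subdir:subdir2:filename" with fssep ':' splits into three.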
While files may have multiple path components (e.g. "subdir:subdir2:filename"), it makes no sense for disk images to have them. The stored filename for a disk is either the disk's ProDOS volume name, or for non-ProDOS disks, a simple label defined by the user. Since the eventual target is a disk device, specifying a subdirectory path makes no sense.
The issue becomes a little more confusing when storage of disk images used for emulators is considered. At first glance, it seems useful to be able to store a hierarchy of disk images. In practice, such images would either be archived as a hierarchy of .PO files, or as an archive of .SDK archives.
Adding/renaming: Applications must strip any leading path components from disk image "storage names". (The NuFX specification does explicitly forbid the use of a filesystem separator character in a disk volume name.)
Extracting: Applications extracting directly to a disk must strip leading path components before assigning the ProDOS volume name. Applications extracting images to a file don't need to do anything unusual.
There isn't a "filename is case-sensitive" flag in NuFX archives. Since it was designed primarily for ProDOS and HFS filesystems, neither of which is case-sensitive, we should assume that case is not meant to be significant when determining whether two records have the same filename. This becomes important when adding files (to test for duplicates), extracting files by name, and when attempting to display archive contents as a hierarchical tree.
Applications should try to recognize that "foo/bar", "foo/BAR", and "FOO/bar" are the same file, but it's probably not worth "probing" a case-sensitive filesystem like Linux ext2 to guarantee such.
There is nothing in the NuFX specification that prevents having more than one file with the same name in an archive. In practice, this is inconvenient, especially for users with command-line tools. On the other hand, if the underlying filesystem is case-sensitive, the extracted files may not actually collide, so it may not make sense for all applications to treat this as an iron-clad rule.
When comparing names, be sure to take the filesystem separator character into account. "foo:bar" could be a simple filename or a partial pathname depending on whether ':' is the separator. Two names should be considered identical if each distinct path component matches, so "foo/bar" and "foo:bar" are identical if the separators are '/' and ':', respectively. Comparisons should be case-insensitive.
Adding/renaming: Applications should prevent multiple records from having the same filename.
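One way to sketch the comparison described above: split each name with its own separator, then compare the components without regard to case. The function name is hypothetical, and using Python's str.lower() on the decoded Mac OS Roman text is an assumption; it is not necessarily identical to the case-folding rules of ProDOS or HFS.

```python
from typing import List


def _components(name: bytes, fssep: int) -> List[str]:
    # An fssep of 0x00 means the name has no path components to split.
    parts = [name] if fssep == 0 else name.split(bytes([fssep]))
    # Assumption: lower-casing the decoded text is "close enough" to the
    # case-insensitive matching done by ProDOS and HFS.
    return [part.decode("mac_roman").lower() for part in parts]


def same_name(name_a: bytes, fssep_a: int,
              name_b: bytes, fssep_b: int) -> bool:
    """Return True if two stored names refer to the same file."""
    return _components(name_a, fssep_a) == _components(name_b, fssep_b)
```

Here "foo/bar" with separator '/' matches "FOO:BAR" with separator ':', but "foo:bar" with no separator does not match "foo:bar" with separator ':'.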
The specification declares that filename threads and comments use pre-sized buffers. It does not define what other members of the message and filename classes are, which makes it difficult to know what to do with a request to create a heretofore undefined thread type. The NuFX format does not provide any definitive clue as to whether a thread is pre-sized, so such decisions must be based on the thread class and thread kind.
Filename threads and comment threads are pre-sized. All other threads are not pre-sized (including other members of the "filename" and "message" classes).
ShrinkIt allocates a 32-byte pre-sized buffer for the filename. If the filename is larger than 32 bytes, the buffer grows to fit the filename exactly. If renaming files is considered useful, then the buffer should always be slightly larger than is needed to hold the filename. (Filenames longer than 32 characters are most likely the result of nested directories, so renaming the file itself is inhibited if the buffer length is an exact match.)
Side note: GSHK appears to have a bug where it can't deal with 32-byte HFS filenames (e.g. "foo:abcdefghijabcdefghijabcdefghijxy" can't be added to an archive). Emulating this behavior is discouraged.
Creating: If GS/ShrinkIt compatibility is not important, all filenames should have at least 8 bytes of free space in the filename thread. For GSHK compatibility, the filename thread compThreadEOF must be the greater of 32 and the filename length.
Renaming: It is acceptable to have fewer than 8 bytes of free space remaining after a file is renamed. However, if the filename itself exceeds the buffer size and the thread must be rebuilt, the 8-byte padding should be added.
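The sizing rules for the filename thread buffer amount to a little arithmetic, sketched here. The function and constant names are hypothetical; the returned value is what would be written as the filename thread's compThreadEOF.

```python
GSHK_MIN_FILENAME_BUFFER = 32  # ShrinkIt's pre-sized filename buffer
RENAME_PADDING = 8             # free space recommended for renaming


def new_filename_buffer_size(name_len: int,
                             gshk_compatible: bool = False) -> int:
    """Buffer size to allocate when creating a filename thread."""
    if gshk_compatible:
        # GS/ShrinkIt behavior: the greater of 32 and the filename length.
        return max(GSHK_MIN_FILENAME_BUFFER, name_len)
    # Otherwise leave at least 8 bytes free so the file can be renamed.
    return name_len + RENAME_PADDING


def buffer_size_after_rename(current_size: int, new_name_len: int) -> int:
    """Buffer size after a rename; the thread is rebuilt only if needed."""
    if new_name_len <= current_size:
        # The new name fits in place, even if little padding remains.
        return current_size
    # The thread must be rebuilt, so add the padding again.
    return new_name_len + RENAME_PADDING
```

For a 10-byte filename this gives a 32-byte buffer in GSHK-compatible mode and an 18-byte buffer otherwise.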
The NuFX specification does not require that threads appear in any particular order. However, writing them in a certain order can make some operations significantly easier.
For example, if an archive is being unpacked as it is received, it is important to know the filename before receiving the data. If the filename thread comes after the data threads, the application has to write the incoming data into a temp file, and then rename it later when the filename thread finally shows up. It would also be nice to be able to display file comments as the file is being downloaded.
Creating: The filename thread must precede all other threads. The recommended (but not required) ordering for common thread types is:
Filename
Message(s) (i.e. comments)
Data fork
Disk image
Resource fork
all other threads
Extracting: If the filename thread does not appear before the first data-class thread, the record may be ignored.
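A sketch of arranging threads into the recommended order before writing a record. The string labels are hypothetical stand-ins for the real thread class and kind values, and the helper name is an assumption.

```python
from typing import List

# Recommended ordering for common thread types; anything not listed
# sorts after these. The labels are illustrative, not NuFX constants.
RECOMMENDED_ORDER = ["filename", "message", "data_fork",
                     "disk_image", "resource_fork"]


def order_threads(kinds: List[str]) -> List[str]:
    """Return the thread labels sorted into the recommended order."""
    def rank(kind: str) -> int:
        if kind in RECOMMENDED_ORDER:
            return RECOMMENDED_ORDER.index(kind)
        return len(RECOMMENDED_ORDER)
    # sorted() is stable, so threads of equal rank keep their
    # original relative order.
    return sorted(kinds, key=rank)
```

Only the position of the filename thread is mandatory; the rest of the ordering is a recommendation.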
There are some combinations of threads that must never appear in a single record.
Creating:
If a data fork is present, the record must not contain another data fork or a disk image.
If a resource fork is present, the record must not contain another resource fork or a disk image.
If a disk image is present, the record must not contain another disk image, a data fork, or a resource fork.
If a control-class thread is present, the record must not contain any data-class threads.
Extracting: When incompatible threads are found, they should be ignored in favor of the earlier threads. For example, if two data forks are found in the same record, only the first one should be extracted. If a data-class thread is found first, subsequent control-class threads should be ignored, and vice-versa.
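The extraction rule ("earlier threads win") can be sketched as a single pass over the threads of a record. The string labels and the function name are hypothetical; a real implementation would test the thread class and kind fields instead.

```python
from typing import List

DATA_CLASS = ("data_fork", "resource_fork", "disk_image")


def filter_threads(kinds: List[str]) -> List[str]:
    """Drop threads that conflict with an earlier thread in the record."""
    kept: List[str] = []
    have_data_class = False  # a data-class thread has been kept
    have_control = False     # a control-class thread has been kept
    for kind in kinds:
        if kind in DATA_CLASS:
            if have_control:
                continue  # control came first; ignore data-class threads
            if kind in kept:
                continue  # second data fork, resource fork, or disk image
            if kind == "disk_image" and have_data_class:
                continue  # a fork came first; ignore the disk image
            if "disk_image" in kept:
                continue  # disk image came first; ignore the forks
            have_data_class = True
        elif kind == "control":
            if have_data_class:
                continue  # data came first; ignore control threads
            have_control = True
        kept.append(kind)
    return kept
```

Filename and message threads pass through untouched here; duplicate filename threads are a separate matter.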
Some threads are compressed, some aren't. The specification isn't very specific.
All data-class threads may be compressed. All other classes of threads must not be compressed.
The ProDOS storage type has little meaning on most systems. However, certain values are significant.
For records with only a data fork, the storage type must be one of 0, 1, 2, or 3. The value "2" is recommended for applications that don't wish to mimic ProDOS behavior exactly.
For records with a resource fork, the storage type must be "5" (ProDOS extended file).
For records with a disk image thread, the storage type must be equal to the disk block size (typically 512).
For records without data-class threads, the storage type must be "0".
Storage type 0x0d, which is used by ProDOS for directories, must not be used.
It is important to update the storage type as threads are added and deleted, so that it always accurately reflects the contents of the record.
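The storage type rules can be collapsed into one function that is called whenever threads are added or deleted. This is a sketch under the assumption that the record's threads are already described by simple labels; the labels, function name, and parameters are all hypothetical.

```python
from typing import Iterable


def storage_type_for(kinds: Iterable[str],
                     disk_block_size: int = 512,
                     data_only_type: int = 2) -> int:
    """Compute the storage type that matches a record's current threads."""
    if data_only_type not in (0, 1, 2, 3):
        raise ValueError("data-only storage type must be 0, 1, 2, or 3")
    kinds = set(kinds)
    if "disk_image" in kinds:
        return disk_block_size   # block size, typically 512
    if "resource_fork" in kinds:
        return 5                 # ProDOS extended file
    if "data_fork" in kinds:
        return data_only_type    # 2 is the recommended default
    return 0                     # no data-class threads
```

Note that 0x0d never comes out of this function, which is consistent with the rule that the ProDOS directory storage type must not be used.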
For a compressed disk image, the "storage_type" and "extra_type" fields take on a different meaning, notably the block size (typically 512) and block count (e.g. 280 for a 140K disk) of the disk.
These fields are more important than you might expect, because some older versions of ShrinkIt would set the thread EOF to a strange value like 68096 (which, curiously enough, is 133 * 512). These same versions of ShrinkIt tended to leave the "storage_type" set to 2. Apparently, ShrinkIt just used extra_type * 512 as the uncompressed size when trying to figure out what sort of disk it had. An early version of GS/ShrinkIt went one step further: it used a block count of 280 with a block size of 256, resulting in archives that apparently held 70K disk images.
It is simple enough to disregard the thread EOF value, and replace the storage_type when it is absurdly small, but there is a deeper problem. If you delete a 140K disk image thread and replace it with an 800K disk image thread, the block count stored in the extra_type no longer accurately reflects the contents of the record. (This linkage between the record header and the thread contents is the reason why this document forbids mixing of disk image threads with any other data-class thread, including other disk images.)
Creating: Applications must update the extra_type whenever a disk image thread is added. The value (storage_type * extra_type) must be equal to the uncompressed size. The application may wish to reject threads that are not a multiple of 512 bytes.
Extracting: The application must normalize storage_type to 512 if it is less than 16 (0x0f is the largest possible ProDOS storage type). The value storage_type * extra_type must then be used as the uncompressed size. If the uncompressed size is zero, the thread may be ignored.
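The extraction rule is simple arithmetic, sketched below with a hypothetical function name. It ignores the thread EOF entirely and relies only on the two record header fields.

```python
def disk_image_size(storage_type: int, extra_type: int) -> int:
    """Uncompressed size of a disk image thread, from the record header.

    storage_type holds the block size and extra_type the block count.
    Values below 16 cannot be block sizes (0x0f is the largest possible
    ProDOS storage type), so they are normalized to 512.
    """
    block_size = 512 if storage_type < 16 else storage_type
    return block_size * extra_type
```

For example, an old archive with storage_type 2 and extra_type 280 yields 512 * 280 = 143360 bytes (a 140K disk). A result of zero means the thread may be ignored. The early GS/ShrinkIt case of 280 blocks of 256 bytes passes through this rule unchanged and still comes out as 70K.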
NuFX supports four boolean access permission flags (read, write, destroy, rename) and two boolean attributes (backup needed, invisible) in the "access" field. This matches up with ProDOS capabilities nicely, but very few other operating systems support all six.
Application authors should consider the following approaches:
1. Preserve all. All flags in the access field must be preserved. It is not required that the extracted files obey the original semantics -- an "invisible" file might be visible, and a file with "rename" disabled might still be rename-able -- but when the files are re-added, the permissions must match.
2. Locked/unlocked. A file with read enabled, and write, destroy, rename, and invisible disabled, is considered "locked" (access 0x01 or 0x21). All other files are considered "unlocked". When a file is extracted and then added to an archive, the locked/unlocked status must be preserved. Locked files are added with access 0x21, and unlocked files are added with access 0xe3.
It is acceptable for an application to find a middle ground between these two, and preserve more of the flags accurately than approach #2 does, but approach #2 should be considered the minimum acceptable level of support.
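A sketch of the locked/unlocked approach. The individual bit assignments below follow the usual ProDOS access byte layout, which is an assumption here since this document only gives the combined values 0x01, 0x21, and 0xe3; the names are hypothetical.

```python
# Assumed ProDOS-style access bits (not spelled out in this document).
ACCESS_READ = 0x01
ACCESS_WRITE = 0x02
ACCESS_INVISIBLE = 0x04
ACCESS_BACKUP = 0x20
ACCESS_RENAME = 0x40
ACCESS_DESTROY = 0x80

# The flags that decide "locked"; the backup bit is deliberately left out.
_LOCK_MASK = (ACCESS_READ | ACCESS_WRITE | ACCESS_INVISIBLE |
              ACCESS_RENAME | ACCESS_DESTROY)


def is_locked(access: int) -> bool:
    """Read enabled; write, destroy, rename, and invisible disabled."""
    return (access & _LOCK_MASK) == ACCESS_READ


def access_for_add(locked: bool) -> int:
    """Access value to store when a file is added to an archive."""
    return 0x21 if locked else 0xE3
```

On a host that only has a read-only attribute, an extracted file would be marked read-only when is_locked() is true, and that attribute would drive access_for_add() when the file is re-added.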
Directories do not need to be stored explicitly unless they are empty. The NuFX specification manages to avoid describing how directories are actually supposed to be stored, saying only: "A Thread Record must exist to inform a utility that a directory is to be created through the use of the proper control_thread value."
What is in a "create directory" control thread? It appears that the intent was to have the thread contain the pathname that needed to be created. In theory, you could have several of these things, and create an entire hierarchy from a single record. Such threads should not be compressed, but their compThreadEOF should always match their threadEOF (i.e. they're not pre-sized).
It's a little tricky to say, "add a control thread whenever you find a directory with nothing in it". What if the directory has files in it, but you don't have the access permissions necessary to read the files?
Does such a record require a filename? Probably not. However, if it doesn't have a filename, ShrinkIt might not display the record, and you'd have no way to manipulate it. Adding a "record label" is easy and useful.
(I'm strongly tempted to punt on the control threads and just use storage type 0x0d to indicate that a directory should be created. This is in direct opposition to the NuFX specification, however, so I'm reluctant to do so.)
Creating: Applications not interested in preserving empty directories need do nothing. Otherwise, the application must add a "create directory" control thread whenever a directory is encountered for which no files are added to the archive.
Extracting: A directory must be created when a control thread is present. As noted in the NuFX specification, the application must also create any directories listed in the record's pathname that don't yet exist.
The specification says that message threads are ASCII text, but doesn't specify an EOL character. For the benefit of Apple II utilities, it's best to use a carriage return (ctrl-M). The comments are expected to be readable on 8-bit Apple IIs, so plain ASCII rather than Mac OS Roman should be used.
Creating: Convert any EOL markers to CR, and any non-ASCII characters (i.e. bytes with the high bit set) to ASCII.
Extracting: Assume that the comment may be using CR, LF, or CRLF, and convert as needed for display. GS/ShrinkIt used a proportional font, so there is no need to worry about formatting to preserve "ASCII art" in comments.
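A minimal sketch of the two conversions. The function names are hypothetical, and replacing non-ASCII characters with '?' is just one possible way to reduce a comment to plain ASCII; a friendlier transliteration would also satisfy the rule.

```python
def comment_for_archive(text: str) -> bytes:
    """Prepare a comment for storage: CR line endings, plain ASCII."""
    # Handle CRLF first so it does not turn into two CRs.
    text = text.replace("\r\n", "\r").replace("\n", "\r")
    # Assumption: anything outside ASCII simply becomes '?'.
    return text.encode("ascii", errors="replace")


def comment_for_display(raw: bytes, eol: str = "\n") -> str:
    """Prepare a stored comment for display, accepting CR, LF, or CRLF."""
    text = raw.decode("ascii", errors="replace")
    return text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", eol)
```

The eol parameter lets the caller pick the line ending of the display system; it defaults to LF.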
Files archived from HFS AppleShare volumes come with "option lists", a GS/OS feature that provides a way for non-ProDOS filesystem information to be preserved. GS/ShrinkIt tries to save this information, but it doesn't seem to do a very good job. It appears to drop a big chunk of the data without altering the size (e.g. the size field says 36 bytes, but there's only space for 18 bytes in the record header).
GS/ShrinkIt seems to work correctly whether the option list size is correct or not, so other applications should do the same.
Opening: Assume the option_size field is correct unless it exceeds attrib_count-2. If it's too large, clip it down to size.
Updating: Always use the actual size. Do not propagate incorrect values. Discarding existing option lists is discouraged but allowed.
For the most part, ShrinkIt correctly sets the MasterEOF field in the Master Header block. A very old version of ShrinkIt left it set to zero (this is the same version that completely omitted the filename for DOS 3.3 disk images). GS/ShrinkIt appears to initialize it to 48 (the size of the MH block), and if the creation process is interrupted you can end up with a partial archive with a nonzero EOF.
Opening: Accept a MasterEOF of zero, but reject a MasterEOF of 48. Don't assume the MasterEOF is accurate.
Updating: Applications must write the correct MasterEOF value if an archive is modified.
Unofficial extensions to the NuFX specification. Anyone working with NuFX archives should take heed.
Thread formats 0x0000 through 0x0005 are already defined. The following thread format values have been added:
0x0006 - deflate. The thread contains data conforming to RFC 1951 (deflate 1.3 specification). A more practical way of putting it is that it contains exactly the data that zlib v1.1.4 outputs. Visit http://www.zlib.org/ for more details.
0x0007 - bzip2. The thread contains BWT+Huffman compressed data as output by Julian Seward's "libbz2" v1.0.2. Visit http://sources.redhat.com/bzip2/ for more information.
Support for these formats is nonexistent on the Apple II, so they should not be used except in situations where compatibility is unimportant (e.g. collections of disk archives for use with A2 emulators).
I found that "deflate" generally does as well as or better than "bzip2" on Apple II binaries, disk images, and small text files. Deflate is also faster and uses less memory, and you're more likely to find libz installed on a given system than you are libbz2. For these reasons, use of deflate should be encouraged in favor of bzip2.
This section identifies some quirks in NuFX or ShrinkIt that, while not bugs, are worth noting.
Originally, the filename was stored in the record header, so it made sense that the filename separator character ("fssep char") should also be there. When the filenames were moved into threads, the fssep char got left behind. If a record has two filenames, they'd better have the same fssep char, or interpreting one of them will be impossible. (This is one of the reasons why it's important to clearly define which filename takes precedence in all circumstances.)
The "threadCRC" field in the thread header block can have one of three meanings: nothing (v0, v1), the CRC of the compressed data (v2), or the CRC of the uncompressed data (v3). The version 2 meaning wasn't used in anything significant, and can be ignored.
Version 1 records generally have threads compressed with LZW/1 data. The LZW/1 compression format includes a 16-bit CRC at the start of the thread. Version 3 records generally have threads compressed with LZW/2 data, which does not include a CRC.
Applications like P8 ShrinkIt and NuLib create v1 records and compress with LZW/1, while GS/ShrinkIt and NuLib2 create v3 records and compress with LZW/2. This means that each compressed thread has exactly one CRC. So what happens if you tell NuLib2 to create a new record with LZW/1, or tell it to add a new LZW/2 thread to an existing v1 record?
In one case, you end up with two CRCs; in the other, you end up with no CRC on your data at all. For some bizarre reason, the v3 thread CRC is computed with a different initial value, so it is necessary to compute the CRC twice, not merely store the same value twice.
Please select your compression methods appropriately. Also, bear in mind that uncompressed data stored with P8 ShrinkIt has no CRC whatsoever.
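To illustrate why the CRC must be computed twice, here is a sketch using Python's binascii.crc_hqx, which implements a 16-bit CRC with the CCITT polynomial (0x1021) and a caller-supplied initial value. The polynomial and the two seed values (0x0000 for the CRC embedded in LZW/1 data, 0xffff for the v3 thread CRC) are assumptions made for this example; this document does not state them, so check them against the NuFX specification or an existing implementation before relying on them.

```python
import binascii

# Assumed seed values; not given in this document.
ASSUMED_LZW1_SEED = 0x0000
ASSUMED_V3_THREAD_SEED = 0xFFFF


def crc16(data: bytes, seed: int) -> int:
    """16-bit CRC (polynomial 0x1021) of data, starting from seed."""
    return binascii.crc_hqx(data, seed)


def both_crcs(uncompressed: bytes):
    """CRCs needed when an LZW/1 thread is stored in a v3 record."""
    return (crc16(uncompressed, ASSUMED_LZW1_SEED),
            crc16(uncompressed, ASSUMED_V3_THREAD_SEED))
```

The same input produces two different values, so storing one computed CRC in both places would fail verification in one of them.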
ShrinkIt adds an extra byte at the end of all LZW compressed data, probably due to an off-by-one bug in the compression code. It turns out that it's possible to get even more "extra" bytes at the end.
ShrinkIt's LZW/1 algorithm always operates on a 4K buffer, largely because it was originally designed for compressing 5.25" disks with 4K tracks. On small files, or at the end of a large one, the last bit of data is padded out to 4K and then compressed. Ordinarily this is barely noticeable, because the compression routines do an RLE (Run-Length Encoding) pass before applying LZW.
However, if both RLE and LZW fail to make the 4K block any smaller, it is stored without compression. This means the whole 4K, complete with padding, gets written to the archive. This doesn't cause any problems, but can make you wonder where all the extra bits came from.
The SQ compression algorithm, as implemented by Don Elton's SQ3, appears to add an extra 0xff to the end of the compressed data. It can safely be ignored.
Preserving BXY wrappers is pretty easy, since the Binary II format is well documented. Updating block counts and file lengths is all that is required.
Preserving SEA wrappers is a little harder, since (as far as I can tell) there is no documentation on the format. A little experimentation shows that the SEA header is always 12005 bytes long, and the only part that changes from file to file is a short piece right before the NuFX archive begins.
It is necessary to update the file length in three different places, all right next to each other, one of which is offset by 64 bytes. I would guess the header allows for more than one archive to be present, but since such things have never actually been created, the possibility can be ignored.
The NuFX standard says that the Date/Time format is the same as that returned by the IIgs ReadTimeHex toolbox call. That call returns the year as (year - 1900), so the year 2000 is stored as "100". ProDOS 8 clock drivers, on the other hand, return 40-99 for 1940-1999, and 0-39 for 2000-2039. As a result, archives created with P8 ShrinkIt use 0 for the year 2000 instead of 100.
When creating archives, always use 100 for the year 2000, but also accept the year 0. However, if you find a Date/Time with zero in all useful fields (second, minute, hour, day, month, year), treat it as an unspecified date rather than midnight of January 1, 2000.
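The year handling can be sketched as follows. The function names are hypothetical, and the fields are passed in as already-extracted integers so that no assumption is made about the byte layout of the Date/Time structure. Treating every stored year below 40 as 2000-2039 generalizes the "accept the year 0" rule using the ProDOS 8 clock driver ranges described above; that generalization is an assumption.

```python
from typing import Optional


def decode_year(second: int, minute: int, hour: int,
                day: int, month: int, year: int) -> Optional[int]:
    """Return the calendar year, or None if the date is unspecified."""
    if (second, minute, hour, day, month, year) == (0, 0, 0, 0, 0, 0):
        # All useful fields are zero: no date, not January 1, 2000.
        return None
    if year < 40:
        # P8 ShrinkIt style: 0-39 means 2000-2039 (assumed range).
        return 2000 + year
    # IIgs ReadTimeHex style: the stored value is (year - 1900).
    return 1900 + year


def encode_year(calendar_year: int) -> int:
    """Stored year value to use when creating archives."""
    return calendar_year - 1900
```

Both a stored 0 (with some other field nonzero) and a stored 100 decode to 2000, while encode_year(2000) always writes 100.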
This document is Copyright © 2000-2004 by Andy McFadden. All Rights Reserved.
The latest version can be found on the NuLib web site at http://www.nulib.com/.