Internal Directory IDs
----------------------
As of Sun Apr 26.
The current implementation of the Macintosh Hierarchical File System
uses "fixed dirids" which means that every directory has a unique id
fixed for the lifetime of the file system and which is never reused.
A directory id should be fixed over invocations of the server
because the macintosh remembers directory ids. An example of this is
the standard file package which likes to pickup the selection in the
directory you last left off.
Unfortunately, under unix there is no mechanism available for
generating consistent directory ids in the manner required. The
directory/file inode would be suitable since it is unique, but (a) it
is derived from the file's disk address and it is possible to see
recycled ids (even in the same session) and (b) there is no efficient
"unix version independent" method of translating from inode to file.
We have determined that the minimum functionality that the AFP client
requires is that the directory IDs be unique and constant over the
lifetime of the server. It would be nice if a full function variable
dirid filesystem would be available on the macintosh side; however,
evidently Apple has no plans to do this.
The mechanism we choose is to return indices of a table which contains
pointers to nodes in an internal tree representing the volume's
directory structure. To minimize the effect of "same dirid, but
different directory" we randomly assign a "base" to the range of
dirids (e.g. pick a number like 300 out of the hat and add it to every
index of the table when translating from internal to external). The
internal structure is called an IDir:
typedef struct idir { /* local directory info (internal) */
char *name; /* the directory name */
struct idir *next; /* ptr to next at same level */
struct idir *subs; /* ptr to children */
struct idir *pdir; /* ptr to parent */
} IDir, *IDirP;
The name stored is the component name, not the full path name. The
next pointer links directories at the same level. The subs pointer is
a list of subdirectories, and the pdir pointer is a pointer to the
parent directory.
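The table-and-base translation described above can be sketched as
follows. This is only an illustration: the names dirtab, MAXDIRS,
dirid_base, dirid_assign and dirid_lookup are hypothetical, the table
is a fixed-size array for brevity, and a real base would be picked at
random at startup and kept clear of the special ids 1 and 2.

```c
#include <assert.h>
#include <stddef.h>

typedef struct idir {           /* same fields as the IDir above */
    char *name;
    struct idir *next;
    struct idir *subs;
    struct idir *pdir;
} IDir, *IDirP;

#define MAXDIRS 1024            /* hypothetical fixed table size */

static IDirP dirtab[MAXDIRS];   /* table index -> tree node */
static int   ndirs = 0;         /* number of slots in use */
static long  dirid_base = 300;  /* hypothetical; chosen at random in practice */

/* Enter a node in the table; return its external dirid, or -1 if full. */
long dirid_assign(IDirP dir)
{
    assert(dir != NULL);
    if (ndirs >= MAXDIRS)
        return -1;
    dirtab[ndirs] = dir;
    return dirid_base + ndirs++;
}

/* Translate an external dirid back to a node; NULL if out of range. */
IDirP dirid_lookup(long dirid)
{
    long idx = dirid - dirid_base;

    if (idx < 0 || idx >= ndirs)
        return NULL;
    return dirtab[idx];
}
```

Because the base differs from one server invocation to the next, a
dirid remembered by the macintosh from an earlier session is more
likely to fall outside the table than to name the wrong directory.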
So, given an IDir you can generate the path by following the parent
links until pdir == rootd.
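A minimal sketch of that walk, assuming the caller passes the volume's
root node explicitly. The function name idir_path is hypothetical and
the path is built relative to the root, filling the buffer from the
end backwards so that no reversal step is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct idir {           /* same fields as the IDir above */
    char *name;
    struct idir *next;
    struct idir *subs;
    struct idir *pdir;
} IDir, *IDirP;

/*
 * Build the path of dir relative to root into buf (buflen bytes).
 * Returns buf on success, or NULL if the path does not fit or dir
 * is not below root.  The root itself yields the empty string.
 */
char *idir_path(IDirP dir, IDirP root, char *buf, size_t buflen)
{
    char *p;

    assert(buf != NULL && buflen > 0);
    p = buf + buflen - 1;       /* start at the end of the buffer */
    *p = '\0';
    for (; dir != root; dir = dir->pdir) {
        size_t n;

        if (dir == NULL)
            return NULL;        /* ran off the top without meeting root */
        n = strlen(dir->name);
        if ((size_t)(p - buf) < n + 1)
            return NULL;        /* no room left for "/name" */
        p -= n;
        memcpy(p, dir->name, n);
        *--p = '/';
    }
    memmove(buf, p, strlen(p) + 1);
    return buf;
}
```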
The mount point for a volume is not necessarily at "/" so there is
also the notion of the volume's root directory or mount point. This
is needed in order to translate between the special ids 1 and 2 which
are parent of root directory and root directory.
Using the IDir structure you can either build the tree completely by
scanning the unix filesystem at startup or you can add nodes as
needed. We choose to add nodes as needed.
The internal directory structure needs to be modified to be kept in
sync with the actual filesystem directory structure. Because of this
the FPMove and FPRename calls can result in the modification of the
internal representation.
There are some large drawbacks in this design:
1. Does not handle the case of other servers or unix users changing
the directory structure.
2. Does not generate constant directory ids between invocations.
The benefits:
1. Very simple and quick resolution of directory ids to paths.
Given the alternatives we will be keeping this method for the time
being and including a validation word (or magic word) in the IDir
nodes to prevent random memory references.
We have considered other methods which include using the directory's
inode in the IDir node, and using the inode as the directory id. In
this case the IDir would have a second thread(s) for looking up by
inode number. This would marginally improve the situation and would
increase the complexity of the code substantially.
Folder groups and creators
--------------------------
As of Aug 1987
Under AFP a folder has a group and creator. These attributes are used
in resolving access privileges.
Aufs maps AFP groups and creators into unix groups and owners. This
means that a file you would have access to under unix by virtue of
group/owner modes is also available under the server.
Under BSD unix you need to be su in order to change a file's creator,
thus you may not change the owners (this is available under sys v
machines).
Protections
-----------
As of Aug 1987
Under AFP only folders, aka directories, have protections. These
protections are: See Files (read), See Folders (search), write objects.
This does not map well into the unix protection scheme. Our
"solution" is simply to respect the unix protections as best as
possible. The ramifications are:
(1) we do not distinguish between see files and see folders
(2) read and write ability is based upon the protection of
individual files - not folders
We do not see (1) as a serious problem. (2) does not present any
problems if the user is not the owner of the file. (2) does present a
minor problem in that there is no way for the user to change the
permissions through AFP.
Specifically, when mapping from unix to mac protections:
Read access is translated to search folders/files (search, read)
Write access is translated to write access (write)
Note: The owner protection is the same as the user if the user owns
the file, else group if you have group access, else other access. The
elses are important because if you have write access as other, but not
as group or user, then you will not be able to write to a directory -
this is the way unix interprets the protections (right or wrong).
From Mac to unix, we take:
See Files (read), to be unix read/search(execute)
Make Changes (write), to be unix write
See Folders (search), is not mapped
When a "folder" protection is changed, all files in the folder also
have their protections changed to the same protection.
On file or folder creates, the protection is taken from the protection
of the superior "folder" (directory).
Newline conversion
As of Tue July 22, 1987
On Unix the newline character is \n (code 012, lf); however, most Macintosh
applications use \r (code 015, cr) as newline.
We now translate lf to cr on reads and cr to lf on writes if the file
creator is unix and the file type is TEXT. These are the defaults for
"unix" files.
In addition, lf and cr are defined internally as INEWLINE (internal
unix newline) and ENEWLINE (external mac newline), and conversion
between them is carried out when both of the following hold:
1. NONLXLATE must not be defined when compiling module afpos.c.
The symbol NONLXLATE disables newline conversion.
2. If conversion code is enabled and FPRead is issued with
NewLineMask equal to 0xFF, NewLineChar equal to
ENEWLINE, then the read terminates on either an ENEWLINE
or INEWLINE, and the INEWLINE is converted to ENEWLINE.
In summary, for this method: conversion only occurs when reading from
the unix filesystem and FPRead is issued in the special break on
newline mode, and the requested break character is the expected
macintosh newline character. There is no conversion when writing to
the Unix filesystem.
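A minimal sketch of the read-side behaviour summarized above. The
function names are hypothetical and the buffer handling is simplified;
only INEWLINE, ENEWLINE and the NewLineMask/NewLineChar test come from
the description.

```c
#include <assert.h>
#include <stddef.h>

#define INEWLINE '\n'           /* internal (unix) newline, lf */
#define ENEWLINE '\r'           /* external (mac) newline, cr */

/* Nonzero if the FPRead parameters select the translating mode. */
int nl_break_requested(unsigned char mask, unsigned char nlchar)
{
    return mask == 0xFF && nlchar == ENEWLINE;
}

/*
 * Copy from src to dst, stopping after the first newline of either
 * kind.  An INEWLINE is delivered to the client as ENEWLINE.
 * Returns the number of bytes placed in dst.
 */
size_t read_to_newline(const char *src, size_t srclen,
                       char *dst, size_t dstlen)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < srclen && i < dstlen; i++) {
        char c = src[i];

        if (c == INEWLINE || c == ENEWLINE) {
            dst[i++] = ENEWLINE;    /* terminate on, and convert, the newline */
            break;
        }
        dst[i] = c;
    }
    return i;
}
```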
Name Conversion
---------------
Filenames on the mac can contain characters which are illegal in unix
file names. These include all 8 bit characters and /. The only chars
which are illegal on the mac are ":" and null.
The current name conversion maps special mac characters to be ":"
followed by two hex digits. For example the name "Copy/Hey" on the
mac would convert to "Copy:2fHey" on unix. A ":" found on a unix file
which does not have 2 hex digits following is mapped into "|" on the
mac.
When a unix file name is encountered a check is made to see if the
converted name is longer than MAXLFLEN (31) chars. If the name is
longer than the mac allows, then it is skipped completely and is not
seen by the mac.
The algorithm for name conversion is more specifically:
Mac to Unix
if (char is ascii and not control and is printable and
is not "/") then
leave as is
else
replace with ":" followed by two hexadecimal
digits that are a direct encoding of the binary
value of the char
Unix to Mac:
if (":" followed by two hex digits) then
replace with binary value of hex digits
if (":" not followed by two hex digits) then
replace ":" with "|".
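The two directions of the algorithm above can be sketched in C as
follows. The function names and the error convention (-1 when the
output buffer is too small) are hypothetical; "printable" is taken
to mean isprint() in the C locale, which is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAXLFLEN 31             /* longest name the mac will accept */

static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Mac to unix: printable ascii other than "/" is kept, the rest is ":xx". */
int mac_to_unix_name(const unsigned char *mac, char *out, size_t outlen)
{
    static const char hex[] = "0123456789abcdef";
    size_t o = 0;

    assert(mac != NULL && out != NULL);
    for (; *mac != '\0'; mac++) {
        unsigned char c = *mac;

        if (c < 0x80 && isprint(c) && c != '/') {
            if (o + 1 >= outlen) return -1;
            out[o++] = (char)c;
        } else {
            if (o + 3 >= outlen) return -1;
            out[o++] = ':';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0f];
        }
    }
    if (o >= outlen) return -1;
    out[o] = '\0';
    return 0;
}

/* Unix to mac: ":xx" becomes one byte, any other ":" becomes "|". */
int unix_to_mac_name(const char *ux, unsigned char *out, size_t outlen)
{
    size_t o = 0;

    assert(ux != NULL && out != NULL);
    while (*ux != '\0') {
        int hi, lo;

        if (o + 1 >= outlen) return -1;
        if (*ux != ':') {
            out[o++] = (unsigned char)*ux++;
        } else if ((hi = hexval((unsigned char)ux[1])) >= 0 &&
                   (lo = hexval((unsigned char)ux[2])) >= 0) {
            out[o++] = (unsigned char)((hi << 4) | lo);
            ux += 3;
        } else {
            out[o++] = '|';
            ux++;
        }
    }
    if (o >= outlen) return -1;
    out[o] = '\0';
    return 0;
}

/* Nonzero if a converted name is short enough to be shown to the mac. */
int name_fits_mac(const unsigned char *mac)
{
    return strlen((const char *)mac) <= MAXLFLEN;
}
```

A caller enumerating a unix directory would convert each name with
unix_to_mac_name and skip any entry for which name_fits_mac is false.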
File Format (.finderinfo, .resource, data)
Macintosh files are currently stored in a directory in three main
files. We have the resource fork, the data fork, and various file
specific finder information to store. Since the data fork is the
closest match to a unix file, it is stored as-is in the specified
directory. The resource fork and the so called "finder info" fork are
"special" and are stored under the same name in special
subdirectories of the specified directory. To be concrete, the Mac
file "keeper" stored in a directory "stuff" would be stored by Aufs on
the unix file system as:
stuff/keeper - data fork
stuff/.finderinfo/keeper - "finder info" fork
stuff/.resource/keeper - resource fork
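The layout above amounts to a simple path rule, sketched here. The
enum and function names are hypothetical; only the two subdirectory
names are taken from the listing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum aufs_fork { FORK_DATA, FORK_FINDERINFO, FORK_RESOURCE };

/*
 * Build the unix path holding one fork of file in directory dir.
 * file must be a single component.  Returns 0 on success, -1 if
 * the result does not fit or file contains a "/".
 */
int fork_path(const char *dir, const char *file, enum aufs_fork which,
              char *out, size_t outlen)
{
    static const char *const sub[] = { "", ".finderinfo/", ".resource/" };
    int n;

    assert(dir != NULL && file != NULL && out != NULL);
    if (strchr(file, '/') != NULL)
        return -1;
    n = snprintf(out, outlen, "%s/%s%s", dir, sub[which], file);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```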
It is important to note that the .finderinfo and .resource directories
are only created by a "create directory" afp (Finder new folder)
command. This prevents these directories from drifting into places
people probably don't want them. However, these directories are
created iff the superior directory also had them.
Getting to the finder info and resource fork is a pain under unix, but
how often do you really need to anyway?
The defaults for a file that has no finder information are
type: TEXT [first four bytes]
creator: unix [second four bytes]
rest is set to zeros
file attributes: none
comment:
"This is a Unix\252 created file." (\252 is the tm sign)
A directory with no finder information defaults the comment to one of:
This is a unix directory
This is an Aufs Macintosh directory
This is an Aufs unix directory (.finderinfo only)
if it has no .finderinfo and .resource directory or just a .resource
directory, both a .finderinfo and .resource directory, or just a
.finderinfo directory respectively.
Turning on SMART_FINDERINFO in afpudb.c will yield more information;
however, it is unix variant dependent and slows things down
considerably.
See MAJOR FILE FORMATS below for the finderinfo formats.
Desktop databases.
As of Feb 1988
The icon data base is stored in .IDeskTop in the volume's root. New
icons are always appended to the end unless one replaces an old one, in
which case the old space will be reused if possible. New icons are
written out when received. The .IDeskTop is only read on the initial
"open desk" call. Locking is done if possible to prevent corruption.
(cf. section on locking).
The APPL mappings are stored in .ADeskTop in the volume's root. We
store for each mapping the File creator, the path relative to the
volume root to the application, and the application name. Modified or
changed entries are appended to the .ADeskTop on every "flush" if you
have write access. It is possible for this database to grow rapidly
or be corrupted. The problems lie in the fact that we always append
(the solution for now is to rebuild the desktop now and then). It may
get corrupted because people move directories around (though we try to
minimize this). Also, note that entries are never deleted from the
.ADeskTop - there should be a mechanism to do this. Like the Icon
database, the APPL mappings are read only when the initial desktop open
is issued.
See MAJOR FILE FORMATS below for the .ADeskTop and .IDeskTop file formats.
MAJOR FILE FORMATS
Aufs Version 3 File Formats (CURRENT)
-------------------------------------
In the following:
byte: unsigned 8 bits
word: unsigned 16 bits
dword: unsigned 32 bits
sdword: signed 32 bits
Important: all items are stored in network order! This means on a vax
you use htons/ntohs on words and htonl/ntohl on dwords.
.ADeskTop format:
The Applications mapping database is kept as an array of the
APPLFileRecords and associated data as shown following. The
associated data is the parent directory name relative to the volume
root and the application name as null terminated strings.
/* never use zero or 0x1741 as the major version */
#define AFR_MAGIC 0x00010002
/* version 1.2 (don't use 1.1, 2.2, etc) */
/* version 1.0 (version 0x1741.0000/0x1741) */
typedef struct { /* APPL information */
byte a_FCreator[4]; /* creator of application */
byte a_ATag[4]; /* user bytes */
} APPLInfo;
typedef struct { /* File Format APPL record */
dword afr_magic; /* magic number for check */
APPLInfo afr_info; /* the appl info */
sdword afr_pdirlen; /* length of (relative) parent directory */
sdword afr_fnamlen; /* length of application name */
/* names follows */
} APPLFileRecord;
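As an illustration of the network-order rule, the fixed part of an
APPLFileRecord could be written to and read from a byte buffer as
sketched below. AfrHeader, AFR_HDRLEN, afr_pack and afr_unpack are
hypothetical names; the sketch assumes the fields appear in the file
back to back in the order declared above, and it leaves the two
trailing names to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define AFR_MAGIC  0x00010002
#define AFR_HDRLEN 20           /* magic + creator + tag + two lengths */

typedef struct {                /* hypothetical in-memory form */
    unsigned char creator[4];
    unsigned char tag[4];
    int32_t pdirlen;
    int32_t fnamlen;
} AfrHeader;

static void put32(unsigned char *p, uint32_t v)
{
    v = htonl(v);               /* host to network order */
    memcpy(p, &v, sizeof v);
}

static uint32_t get32(const unsigned char *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof v);
    return ntohl(v);            /* network to host order */
}

void afr_pack(unsigned char buf[AFR_HDRLEN], const AfrHeader *h)
{
    assert(buf != NULL && h != NULL);
    put32(buf, AFR_MAGIC);
    memcpy(buf + 4, h->creator, 4);
    memcpy(buf + 8, h->tag, 4);
    put32(buf + 12, (uint32_t)h->pdirlen);
    put32(buf + 16, (uint32_t)h->fnamlen);
}

/* Returns 0 on success, -1 if the magic number does not match. */
int afr_unpack(const unsigned char buf[AFR_HDRLEN], AfrHeader *h)
{
    assert(buf != NULL && h != NULL);
    if (get32(buf) != AFR_MAGIC)
        return -1;
    memcpy(h->creator, buf + 4, 4);
    memcpy(h->tag, buf + 8, 4);
    h->pdirlen = (int32_t)get32(buf + 12);
    h->fnamlen = (int32_t)get32(buf + 16);
    return 0;
}
```

Packing field by field, rather than writing the C structure directly,
also sidesteps any differences in structure padding between machines.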
.IDeskTop format:
The icon database is kept as an array of IconFileRecords and
associated data as shown following. The
associated data is the bitmap.
IconInfo below is padded to a double word boundary. Hopefully,
this is good enough to prevent differences in structure size in
ICONFileRecord on different machines.
/* never use zero or 0x2136 as the major version */
#define IFR_MAGIC 0x00010002 /* Version 1.2, skip 1.1, 2.2, etc. */
/* version 1.0: 0x2136.0000/0x2136 */
#define FCreatorSize 4
#define FTypeSize 4
#define ITagSize 4
typedef struct { /* Icon Information */
sdword i_bmsize; /* 4: size of the icon bitmap */
byte i_FCreator[FCreatorSize]; /* 4[8]: file's creator type */
byte i_FType[FTypeSize]; /* 4[12] file's type */
byte i_IType; /* 1[13] icon type */
byte i_pad1; /* 1[14] */
byte i_ITag[ITagSize]; /* 4[18] user bytes */
byte i_pad2[2]; /* 2[20] pad to double word boundary */
} IconInfo;
typedef struct { /* File Format ICON record */
dword ifr_magic; /* the magic check */
IconInfo ifr_info; /* the icon info */
/* bitmap follows this */
} IconFileRecord;
.finderinfo format:
In the following space for all entries is allocated. The bitmap
merely tells us if the indicated items are valid.
#define FINFOLEN 32
#define MAXCLEN 199
typedef struct {
byte fi_fndr[FINFOLEN]; /* finder info */
word fi_attr; /* attributes */
#define FI_MAGIC1 255
byte fi_magic1; /* was: length of comment */
#define FI_VERSION 0x10 /* version major 1, minor 0 */
/* if more than 8 versions then */
/* something wrong anyway */
byte fi_version; /* version number */
#define FI_MAGIC 0xda
byte fi_magic; /* magic word check */
byte fi_bitmap; /* bitmap of included info */
#define FI_BM_SHORTFILENAME 0x1 /* is this included? */
#define FI_BM_MACINTOSHFILENAME 0x2 /* is this included? */
byte fi_shortfilename[12+1]; /* possible short file name */
byte fi_macfilename[32+1]; /* possible macintosh file name */
byte fi_comln; /* comment length */
byte fi_comnt[MAXCLEN+1]; /* comment string */
} FileInfo;
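A reader of .finderinfo files might use the magic bytes to tell a
version 3 record from an older one, as sketched below. The OFF_*
offsets and function names are hypothetical and assume the fields are
laid out in the file with no padding. The sketch also assumes that
FI_MAGIC1 works as a marker because an old-format comment length could
never reach 255; that is an inference from MAXCLEN, not something
stated above.

```c
#include <assert.h>
#include <stddef.h>

#define FINFOLEN   32
#define FI_MAGIC1  255
#define FI_MAGIC   0xda
#define FI_BM_SHORTFILENAME     0x1
#define FI_BM_MACINTOSHFILENAME 0x2

/* Assumed byte offsets: fi_fndr[32], then the 2 byte fi_attr. */
#define OFF_MAGIC1  (FINFOLEN + 2)
#define OFF_VERSION (FINFOLEN + 3)
#define OFF_MAGIC   (FINFOLEN + 4)
#define OFF_BITMAP  (FINFOLEN + 5)

/* Nonzero if buf looks like a version 3 finder info record. */
int fi_is_v3(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    return len > (size_t)OFF_BITMAP &&
           buf[OFF_MAGIC1] == FI_MAGIC1 &&
           buf[OFF_MAGIC] == FI_MAGIC;
}

/* Nonzero if the record says its macintosh file name is valid. */
int fi_has_macname(const unsigned char *buf, size_t len)
{
    return fi_is_v3(buf, len) &&
           (buf[OFF_BITMAP] & FI_BM_MACINTOSHFILENAME) != 0;
}
```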
Aufs Version 1 and Version 2 FILE FORMATS
-----------------------------------------
In the following, "bit.x" defines a type of x bits. "str" means an
ascii string (256 character ascii set) terminated by a null. The
formats are defined below in the formats section.
IMPORTANT: THESE FILES WERE STORED IN THE HOST MACHINE ORDER. You
cannot transport these files from a byte swapped machine to a
non-byte swapped machine.
.ADeskTop format:
The .ADeskTop file contains an array of the following structure:
bit.32 afr_magic; /* magic number for check */
bit.8 a_FCreator[4]; /* creator of application */
bit.8 a_ATag[4]; /* user tag information */
bit.32 afr_pdirlen; /* length of parent directory name */
bit.32 afr_fnamlen; /* length of application name */
str pdir[afr_pdirlen]; /* path to directory holding */
/* appl. relative to volume root */
str file[afr_fnamlen]; /* file name */
The file names stored are the unix file names. Note: the
directory path is relative to the volume root directory. The magic
number is a consistency check and is currently: AFR_MAGIC (8107+8556).
.IDeskTop format
The .IDeskTop file contains an array of the following structure:
bit.32 ifr_magic; /* the magic check */
bit.8 i_FCreator[4]; /* file's creator type */
bit.8 i_FType[4]; /* file's type */
bit.8 i_IType; /* icon type */
bit.8 i_ITag[4]; /* user bytes */
bit.32 i_bmsize; /* size of the icon bitmap */
bit.8 i_icon[i_bmsize]; /* icon */
The magic number is a consistency check and is currently: IFR_MAGIC
(8107+5750).
The .finderinfo files contain the following information:
bit.8 fi_fndr[32]; /* finder info */
bit.16 fi_attr; /* attributes */
bit.8 fi_comln; /* length of comment */
bit.8 fi_comnt[200]; /* comment string */
LOCKING
Draft 2: Jan, 1988
Charlie C. Kim
User Services
Columbia University
Coordination of multiple access to files is best done through system
calls that implement the locks internal to the system. Advisory locks
allow coordination of multiple Aufs processes (if they all honor the
locks), but processes external to Aufs may cause problems. "Hard"
locks would be real real nice, but we haven't seen them.
Two system calls, "lockf" and "flock", are known to exist in a number
of different unix systems to allow "advisory" locks. Where these
exist, they can be used to coordinate Aufs processes (c.f.
INSTALLATION notes). The basic semantics of these calls (as known)
are:
flock - for an open file, establish a "shared" or "exclusive"
lock. An "exclusive lock" may be placed iff no locks are in place. A
shared lock may be upgraded to an exclusive lock iff no other locks
are in place. Multiple shared locks are allowed :-). Locks go away
when the file is closed. Locks can be tested and removed (can't
distinguish between exclusive and shared on test though).
lockf - for an open file, allow exclusive locks at various
offsets for particular lengths. Also allow locking of the entire
file. Locks are only allowed if the file is open for read/write (sigh).
Locks can be removed and/or tested. (Do locks go away when file
closes?).
FPOpen
FPOpen allows "deny read", "deny write", "deny r/w" and "deny
none" "locks" to be placed on a file. We still do not implement these
(major pain because it requires access to lock information AND
previous open statuses).
File-locks
Certain files, such as .ADeskTop, .IDeskTop and the file info files,
must be coordinated between servers (ignoring outside access). Two
system calls (which exist in various unixes) help do this: flock and lockf.
Coordination can be easily accomplished by using the "flock" system
call if it exists. flock allows exclusive and shared locks.
Basically, when a file is "read", then a shared lock is set first. If
a file is to be written, then an exclusive lock must be set first -
this fails if a shared lock is already set. Some systems might have
"lockf" available which allows "exclusive" locks only. In theory this
would be okay (though you can't have multiple readers then) too;
however, "lockf" only works if the file is open for write, so if a
process has "read-only" access to one of the above files, then it
can't be guaranteed that the data is okay.
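The flock discipline described above (shared lock before reading,
exclusive lock before writing) can be sketched as follows. The
function names are hypothetical, the calls block until the lock is
granted, and flock must exist on the host system.

```c
#define _DEFAULT_SOURCE         /* ask glibc to declare flock() */
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Open path and take a shared lock (for_write == 0) or an exclusive
 * lock (for_write != 0).  Returns the descriptor, or -1 on failure.
 */
int open_locked(const char *path, int for_write)
{
    int fd;

    assert(path != NULL);
    fd = open(path, for_write ? (O_RDWR | O_CREAT) : O_RDONLY, 0644);
    if (fd < 0)
        return -1;
    if (flock(fd, for_write ? LOCK_EX : LOCK_SH) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Drop the lock and close the file. */
void close_locked(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}
```

Since the locks are advisory, this only protects the desktop files
from other processes that follow the same discipline.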
ByteRangeLock
The only available unix system call option for this is "lockf". This
allows pretty much what is necessary except you cannot lock
"read-only" files. (Reading Inside Mac Volume 4 seems to lead me to
believe that this is correct, but the AFP specification doesn't
really make this clear).
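A byte range lock built on lockf might look like the sketch below.
The function names are hypothetical. lockf locks a region starting at
the current file offset, so the sketch seeks first; F_TLOCK is used so
that a conflicting lock is reported as a failure instead of blocking.

```c
#define _DEFAULT_SOURCE         /* ask glibc to declare lockf() */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* lockf needs write access, so open read/write even for readers. */
int open_for_range_locks(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDWR | O_CREAT, 0644);
}

/* Try to lock len bytes at offset.  Returns 0 on success, -1 on failure. */
int range_lock(int fd, off_t offset, off_t len)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return lockf(fd, F_TLOCK, len);
}

/* Release a lock on len bytes at offset. */
int range_unlock(int fd, off_t offset, off_t len)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return lockf(fd, F_ULOCK, len);
}
```

Note that both calls move the file offset, so a caller that is also
reading or writing through the same descriptor has to reposition
afterwards.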
Warning: NFS systems may not allow locks across remotely mounted file
systems. Even when they are allowed, special daemons must be run
since locking is not within the NFS protocol.
NORMALIZING CHARACTER SETS
Dan Sahlin of the Swedish Institute of Computer Science pointed out
the need to normalize between Unix character sets and the Macintosh
character set. Previously, Aufs provided this feature in a very
limited fashion: it would map between cr and lf when the file type was
"TEXT" and creator was "unix" (defaults for unix files). This
provided "good" functionality in the US.
However, people outside the United States need to make use of various
international character sets that must be mapped to the Macintosh
character sets to be useful. The primary intent of this mapping is to
allow unix files to be mapped; however, it is also possible to allow a
large class of files to be mapped (such as all text files ending in
.swe, etc).
The design is quite simple: define a mac to unix and unix to mac table
of 256 entries each that contain a direct mapping (must be one
character to one since file sizes, etc. require this).
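A minimal sketch of such a pair of tables and of applying one to a
buffer in place. The function names are hypothetical, and the only
non-identity entries shown are the cr/lf pair from the old unix
mapping; a real national character set would fill in many more.

```c
#include <assert.h>
#include <stddef.h>

/* Start from a table that maps every byte to itself. */
void xlate_identity(unsigned char table[256])
{
    int i;

    assert(table != NULL);
    for (i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
}

/* Map buf through table in place; the length never changes. */
void xlate_apply(const unsigned char table[256],
                 unsigned char *buf, size_t len)
{
    size_t i;

    assert(table != NULL && buf != NULL);
    for (i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}
```

For example, setting unix_to_mac['\n'] = '\r' and
mac_to_unix['\r'] = '\n' in two identity tables reproduces the
original TEXT/unix newline mapping.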
The routine that decides whether mapping is necessary or not bases its
decision on an internal table (should be per volume, not per server as
it is now). For each normalizing set of tables, Aufs records a file
extension, file creator, and file type, any of which can be null. In
addition it stores a "conjunction" operator. A set of tables is
applied when the file has the specified extension "conjunction" the
file type and file creator match. Null entries are treated as always
true.
For the old unix files, the table entry is:
extension: NULL
creator: unix
type: TEXT
and for Swedish D47:
extension: .swe
creator: NULL
type: TEXT
which means any file of type TEXT with the extension .swe will have
normalization applied.
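The matching rule can be sketched as below. The structure, the names
and the choice of AND and OR as the possible conjunction operators are
assumptions; the text above only says that an operator is stored and
that null entries always match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { CONJ_AND, CONJ_OR } Conj;    /* hypothetical operator set */

typedef struct {
    const char *ext;        /* file extension, or NULL */
    const char *creator;    /* 4 byte creator, or NULL */
    const char *type;       /* 4 byte type, or NULL */
    Conj conj;              /* joins the extension test to the rest */
} NormRule;

static int has_suffix(const char *s, const char *suf)
{
    size_t ls = strlen(s), lf = strlen(suf);

    return ls >= lf && strcmp(s + ls - lf, suf) == 0;
}

/* Nonzero if the rule's tables should be applied to this file. */
int rule_matches(const NormRule *r, const char *fname,
                 const char *creator, const char *type)
{
    int ext_ok, ct_ok;

    assert(r != NULL && fname != NULL && creator != NULL && type != NULL);
    ext_ok = r->ext == NULL || has_suffix(fname, r->ext);
    ct_ok  = (r->creator == NULL || memcmp(creator, r->creator, 4) == 0) &&
             (r->type == NULL || memcmp(type, r->type, 4) == 0);
    return r->conj == CONJ_AND ? (ext_ok && ct_ok) : (ext_ok || ct_ok);
}
```

Both example entries above behave as described when the operator is
CONJ_AND. With CONJ_OR a null extension would match every file, so
that combination is probably not useful.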
Note: the defaults for Swedish D47, Swedish-Finnish E47, and ISO
8859-1 Latin 1 were established by Dan Sahlin.
Packing unpacking packets.
Enumeration cache.