There has been a lot of interest in the format of MacWrite documents,
specifically in terms of the compressed data format for the 3.x and 4.x
test versions which will support disk-based files.  I am hesitant to type
the entire documentation for 3.x and 4.x file formats here and now, but if
there is a sufficient uproar, I may be persuaded to do it.

First, a word about MW 2.2.  A MACA/WORD file whose first two bytes
are '00 03' is in 2.2 format.  As we all know, 2.2 reads the entire document
into memory, and you manipulate the document within memory. Here is the RECORD
format for some global variables stored at the beginning of a 2.2 file:

MWGlobals = RECORD
        VerNum,          { Version number = 3 for MW2.2 }
        ParaOffset,      { Pointer to start of first "paragraph" }
        MainPCount,      { Number of paragraphs in Main document }
        HeadPCount,      { Number of paragraphs in Header }
        FootPCount:      { Number of paragraphs in Footer }
                        INTEGER;

        TitlePgF,        { Title page? }
        ScrapDispF,      { Display scrap? }
        FootDispF,       { Display footers? }
        HeadDispF,       { Display headers? }
        RulerDispF,      { Display rulers? }
        Unused1:         { Byte for word alignment }
                        BOOLEAN;

        AcDocNum,        { Document that is currently active: }
                         {       0 = Main }
                         {       1 = Header }
                         {       2 = Footer }
        StPgNum:         { Starting page number offset }
                        INTEGER;
END;  { MWGlobals }
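
As a rough illustration (not part of the Encore documentation), the record
above could be unpacked in Python along the following lines.  The sketch
assumes big-endian two-byte INTEGERs and one byte per BOOLEAN, which the
record implies but does not state; the names MW22Globals, MW22_FORMAT, and
read_mw22_globals are made up for the example.

```python
import struct
from collections import namedtuple

# Field names follow the MWGlobals record for MW 2.2.
# Assumed layout: big-endian 2-byte INTEGERs, one byte per BOOLEAN.
MW22Globals = namedtuple(
    "MW22Globals",
    "VerNum ParaOffset MainPCount HeadPCount FootPCount "
    "TitlePgF ScrapDispF FootDispF HeadDispF RulerDispF Unused1 "
    "AcDocNum StPgNum",
)
MW22_FORMAT = ">5h6B2h"  # 5 INTEGERs, 6 one-byte flags, 2 INTEGERs = 20 bytes


def read_mw22_globals(data):
    """Unpack the MW 2.2 globals from the first 20 bytes of the file data."""
    g = MW22Globals(*struct.unpack_from(MW22_FORMAT, data, 0))
    if g.VerNum != 3:
        raise ValueError("not a MacWrite 2.2 file (VerNum = %d)" % g.VerNum)
    return g
```

Given the whole file in a bytes object, read_mw22_globals(data).ParaOffset
is then the position of the first ruler.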

Every MacWrite formatted file has three sections (main, header, footer), each
composed of paragraphs. A "paragraph" can be either text, a ruler, or a
picture. The first paragraph is ALWAYS a ruler. So if you are using Fedit,
and you follow the ParaOffset pointer, you will start at a ruler.  The first
two bytes of a paragraph are an integer indicating what kind of paragraph it
is (0=ruler, 1=text, 2=picture).  The next two bytes are an integer indicating
the length of the paragraph.  If you are only interested in text paragraphs,
you can follow ParaOffset to the first paragraph, read the paragraph type
and paragraph length, and skip that many bytes if it is not a text paragraph.
Then read the next paragraph, and so on.  There are MainPCount paragraphs for
the main, followed by HeadPCount paragraphs for the header, followed by
FootPCount paragraphs for the footer.

A text paragraph has the two bytes for paragraph type, two bytes for
paragraph length, and then two bytes for the length of the text.  After the
string of text characters, there is a formatting run, the layout of which is
rather grody.
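
The paragraph walk for 2.2 can be sketched as follows.  This is only an
illustration under two assumptions the description leaves open: integers are
big-endian, and the paragraph length counts the bytes AFTER the four-byte
type/length header.  The function names are invented for the example, and
the formatting run is simply skipped.

```python
import struct

RULER, TEXT, PICTURE = 0, 1, 2  # paragraph type codes for MW 2.2


def iter_mw22_paragraphs(data, para_offset, count):
    """Yield (kind, body) for `count` paragraphs starting at para_offset."""
    pos = para_offset
    for _ in range(count):
        kind, length = struct.unpack_from(">hH", data, pos)
        # Assumption: `length` excludes the 4-byte type/length header.
        yield kind, data[pos + 4 : pos + 4 + length]
        pos += 4 + length


def mw22_text(body):
    """Return the raw text bytes of a text paragraph body."""
    (text_len,) = struct.unpack_from(">H", body, 0)
    # Whatever follows the text (the formatting run) is ignored here.
    return body[2 : 2 + text_len]
```

Passing MainPCount as the count walks the main document only; adding
HeadPCount and FootPCount continues on through the header and footer.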

Enough for MW2.2 format.  That should be enough to get people started.  I
encourage people to hack around in Fedit to learn more about it.

A real sticky wicket, a veritable nasty noodle, is MW 3.x and 4.x file
formats.  They can be identified by a two-byte version number equal to
'00 06'.  The format for the globals at the beginning of the file is similar,
but not identical, to 2.2:

MWGlobals = RECORD
        VerNum,          { Version number = 6 for MW 3.x and 4.x }
        MainPCount,      { Number of paragraphs in Main document }
        HeadPCount,      { Number of paragraphs in Header }
        FootPCount:      { Number of paragraphs in Footer }
                        INTEGER;

        TitlePgF,        { Title page? }
        Unused1,         { Byte for word alignment }
        ScrapDispF,      { Display scrap? }
        FootDispF,       { Display footers? }
        HeadDispF,       { Display headers? }
        RulerDispF:      { Display rulers? }
                        BOOLEAN;

        AcDocNum,        { Document that is currently active: }
                         {       0 = Main }
                         {       1 = Header }
                         {       2 = Footer }
        StPgNum:         { Starting page number offset }
                        INTEGER;
END;  { MWGlobals }
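
For comparison, here is a minimal Python sketch of unpacking the 3.x/4.x
globals.  It assumes big-endian two-byte INTEGERs, one byte per BOOLEAN, and
the field order exactly as listed above; MW6_FIELDS, MW6_FORMAT, and
read_mw6_globals are names invented for the example.

```python
import struct

# Field order as listed in the 3.x/4.x MWGlobals record.
# Assumed layout: big-endian 2-byte INTEGERs, one byte per BOOLEAN.
MW6_FIELDS = (
    "VerNum", "MainPCount", "HeadPCount", "FootPCount",
    "TitlePgF", "Unused1", "ScrapDispF", "FootDispF",
    "HeadDispF", "RulerDispF", "AcDocNum", "StPgNum",
)
MW6_FORMAT = ">4h6B2h"  # 4 INTEGERs, 6 one-byte flags, 2 INTEGERs = 18 bytes


def read_mw6_globals(data):
    """Unpack the 3.x/4.x globals from the first 18 bytes of the file data."""
    g = dict(zip(MW6_FIELDS, struct.unpack_from(MW6_FORMAT, data, 0)))
    if g["VerNum"] != 6:
        raise ValueError("not a MacWrite 3.x/4.x file")
    return g
```

Note that there is no ParaOffset here; paragraphs are located through the
document variables instead.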

After the starting page number, there is information for using the "free list",
which is how MW 3.x and 4.x manipulate pages on disk and in memory.  It is not
at all necessary to understand the "free list" unless you are writing a
MacWrite clone where you need to be able to swap and edit pages from a
MacWrite-formatted file.  If you only need to be able to read, and perhaps
display, a MacWrite document, you can pretty much avoid the free list stuff
altogether.

Reading a paragraph is tricky.  There are "document variables" at positions
00A0 (footer), 00CE (header), and 00FC (main).  Bytes 12-15 in the document
variables contain the position of the "information array", which begins with
the information block for the first paragraph.  (There is an information
block for each paragraph in the header, footer, and main, and each
information block is a total of 16 bytes long.)  All byte numbers here count
from zero.  Bytes 0-1 of the information block contain the paragraph height,
which also tells you what kind of paragraph it is: a positive value indicates
the paragraph is text, a negative value means a picture, and zero indicates a
ruler.  Byte 8 is the status byte for that paragraph.  Bit number 3 is of
special interest, in that it tells whether the paragraph is in compressed
format (to be discussed shortly).  Bytes 9-11 contain the location of the
paragraph data, and bytes 12-13 are the length of the paragraph data.  Now
follow bytes 9-11 to find the paragraph data.  If it is text, the first two
bytes of the paragraph data will be the length of the text in bytes.  It is
important to note that the length is in bytes, not characters, since the
characters may be compressed.
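
To make the information block layout concrete, here is a small Python sketch
that picks one apart.  It assumes big-endian integers and that "bit 3" is
counted from the least significant bit (mask 08 hex); neither is spelled out
in the description.  InfoBlock, parse_info_block, and paragraph_kind are
names made up for the example.

```python
import struct
from collections import namedtuple

InfoBlock = namedtuple("InfoBlock", "height status compressed data_pos data_len")


def parse_info_block(data, pos):
    """Decode the 16-byte information block that starts at `pos`."""
    block = data[pos : pos + 16]
    if len(block) < 16:
        raise ValueError("truncated information block")
    (height,) = struct.unpack_from(">h", block, 0)   # bytes 0-1, signed
    status = block[8]                                # status byte
    compressed = bool(status & 0x08)                 # assumption: bit 0 is the LSB
    data_pos = int.from_bytes(block[9:12], "big")    # bytes 9-11
    (data_len,) = struct.unpack_from(">H", block, 12)  # bytes 12-13
    return InfoBlock(height, status, compressed, data_pos, data_len)


def paragraph_kind(height):
    """Map the sign of the paragraph height to a paragraph kind."""
    if height > 0:
        return "text"
    return "picture" if height < 0 else "ruler"
```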

***Text Compression***
This is one of the most interesting things about MW 3.x and 4.x.  In an
attempt to save disk space, Apple came up with a scheme to compress ASCII
text down so that two compressed characters would fit into one byte.  It
works as follows: for any given language (in our case, English), the
developers of the MacWrite system determined the 15 most common characters in
that language.  Note that for almost every language, the most common character
is the space character.  After the space character, it varies from one language
to another, based upon statistical analysis.  These 15 characters are then
combined to form a Str255 of length 15.  The resource fork of the file
contains a resource of type STR (id = 700) which holds this string.  For the
English language, it looks like this: ' etnroaisdlhcfp'.  Now, when it comes
to compressing text, all you need is one NIBBLE to represent one of these 15
characters: the nibble is the character's position in the string, counting
from zero (space = 0, e = 1, t = 2, and so on).  The nibbles that are used
are 0-E.  The nibble F is used to indicate that the two nibbles that follow
are NOT compressed characters, but comprise a complete ASCII character.  So
for example, the word 'tent' would
look like '21 32', and the word 'Tent' would look like 'F5 41 32'.  The
string 'The tent' would look like 'F5 4B 10 21 32'.  Notice immediately that
we gain a nibble for each compressed character, but we lose one for each
non-compressed character.  In the long run, this technique wins bytes.  If,
however, the text was pathologically bizarre, and it had lots of capital
letters and punctuation and infrequent letters, then we would be wasting space
to use up an extra nibble on each uncompressible character.  That is why there
is a compressed bit.  The MacWrite program determines whether it will win
anything by using this compression technique, and acts accordingly.
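
The nibble scheme is easy to try out.  The Python sketch below packs and
unpacks bytes using the English string quoted above and reproduces the three
examples.  One thing the description does not cover is what happens when the
nibble count is odd; the sketch pads with a final F nibble and treats a
dangling F as padding when unpacking, which is purely an assumption.  The
names COMMON, compress, and decompress are invented for the example.

```python
COMMON = b" etnroaisdlhcfp"  # English string from the STR 700 resource


def compress(raw, table=COMMON):
    """Pack bytes into nibbles: table index, or F followed by the full byte."""
    nibbles = []
    for ch in raw:
        idx = table.find(bytes([ch]))
        if idx >= 0:
            nibbles.append(idx)
        else:
            nibbles += [0xF, ch >> 4, ch & 0xF]
    if len(nibbles) % 2:
        nibbles.append(0xF)  # assumption: odd nibble counts are padded with F
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def decompress(packed, table=COMMON):
    """Reverse compress(); a trailing F without two more nibbles is padding."""
    nibbles = []
    for b in packed:
        nibbles += [b >> 4, b & 0xF]
    out = bytearray()
    i = 0
    while i < len(nibbles):
        n = nibbles[i]
        if n != 0xF:
            out.append(table[n])
            i += 1
        elif i + 2 < len(nibbles):
            out.append((nibbles[i + 1] << 4) | nibbles[i + 2])
            i += 3
        else:
            break  # dangling F: treated as padding (assumption)
    return bytes(out)
```

For instance, compress(b"The tent").hex() gives 'f54b102132', matching the
bytes worked out by hand above.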

Let's try running through how we can read out the text of the main part of
a MacWrite document:

1) Verify version number = 6

2) MainPCount = bytes 0002-0003

3) M_IAP = bytes 0108-010B  (Main Info Array Pointer)
   The byte positions are calculated from the Main Document Variable
   offset of 00FC, plus offsets 12-15 (decimal) into the MainDocVars.

4) Paratype = bytes (M_IAP) through (M_IAP+1)  (the paragraph height; its
   sign gives the type)

5) psb = byte (M_IAP+8)  (Paragraph Status Byte)

6) CompBit = bit 3 of psb  (Compression Bit)

7) pdp = bytes (M_IAP+9) through (M_IAP+11)  (Paragraph Data Pointer)

8) para_len = bytes (M_IAP+12) through (M_IAP+13)  (Paragraph length)

Now, if Paratype is positive, we go to the paragraph itself...

9) text_len = bytes (pdp) through (pdp+1)

10) proceed to read text_len number of bytes, and decompress using the
    scheme described above (if compressed bit is set)
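
The steps above can be strung together roughly as follows.  Treat this as a
sketch under stated assumptions rather than a tested reader: integers are
taken to be big-endian, bit 3 is taken to mean mask 08 hex, the information
blocks for successive main paragraphs are assumed to sit back to back at
16-byte intervals starting at M_IAP, and the English compression string is
hard-coded instead of being read from the STR 700 resource.  The name
read_main_text is invented for the example.

```python
import struct

COMMON = b" etnroaisdlhcfp"  # English STR 700 string (hard-coded here)
MAIN_DOC_VARS = 0x00FC       # position of the main document variables


def decompress(packed, table=COMMON):
    """Unpack nibble-compressed text; a dangling F nibble is taken as padding."""
    nibbles = []
    for b in packed:
        nibbles += [b >> 4, b & 0xF]
    out = bytearray()
    i = 0
    while i < len(nibbles):
        if nibbles[i] != 0xF:
            out.append(table[nibbles[i]])
            i += 1
        elif i + 2 < len(nibbles):
            out.append((nibbles[i + 1] << 4) | nibbles[i + 2])
            i += 3
        else:
            break
    return bytes(out)


def read_main_text(data):
    """Return the text of each text paragraph in the main document."""
    ver, main_count = struct.unpack_from(">hh", data, 0)        # steps 1-2
    if ver != 6:
        raise ValueError("not a MacWrite 3.x/4.x file")
    (m_iap,) = struct.unpack_from(">I", data, MAIN_DOC_VARS + 12)  # step 3
    texts = []
    for i in range(main_count):
        info = m_iap + 16 * i  # assumption: blocks are contiguous
        (height,) = struct.unpack_from(">h", data, info)        # step 4
        if height <= 0:
            continue  # ruler or picture
        psb = data[info + 8]                                    # step 5
        pdp = int.from_bytes(data[info + 9 : info + 12], "big")  # step 7
        (text_len,) = struct.unpack_from(">H", data, pdp)       # step 9
        raw = data[pdp + 2 : pdp + 2 + text_len]                # step 10
        texts.append(decompress(raw) if psb & 0x08 else raw)    # step 6
    return texts
```

The same loop should work for the header and footer by starting from 00CE or
00A0 and using HeadPCount or FootPCount.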


The information described above is excerpted from documentation furnished by:
Encore Systems
20823 Stevens Creek Blvd. C1-B
Cupertino, California  95014
(408)446-9565

We can vouch for the above description, since we are currently involved with
developing software which will interact with MacWrite files.  If you have
any questions about details of MacWrite file format, or about our project here
at Cornell, write or call us.  We may respond individually, or via the net.

Kate MacGregor and Douglas Young
Decentralized Computer Services
401 Uris Hall
Cornell University
Ithaca, NY  14853
(607)256-4981

Doug Young's electronic address is DMYJARTJ@CORNELLA.BITNET
-------