There has been a lot of interest in the format of MacWrite documents, specifically in the compressed data format used by the 3.x and 4.x test versions, which will support disk-based files. I am hesitant to type in the entire documentation for the 3.x and 4.x file formats here and now, but if there is a sufficient uproar, I may be persuaded to do it.

First, a word about MW 2.2. A MACA/WORD file whose first two bytes are '00 03' is in 2.2 format. As we all know, 2.2 reads the entire document into memory, and you manipulate the document within memory. Here is the RECORD format for some global variables stored at the beginning of a 2.2 file:

MWGlobals =
  RECORD
    VerNum,        {Version number = 3 for MW 2.2}
    ParaOffset,    {Pointer to start of first "paragraph"}
    MainPCount,    {Number of paragraphs in Main document}
    HeadPCount,    {Number of paragraphs in Header}
    FootPCount:    {Number of paragraphs in Footer}
      INTEGER;
    TitlePgF,      {Title page?}
    ScrapDispF,    {Display scrap?}
    FootDispF,     {Display footers?}
    HeadDispF,     {Display headers?}
    RulerDispF,    {Display rulers?}
    Unused1:       {Byte for word alignment}
      BOOLEAN;
    AcDocNum,      {Document that is currently active:}
                   {  0 = Main, 1 = Header, 2 = Footer}
    StPgNum:       {Starting page number offset}
      INTEGER;
  END; {MWGlobals}

Every MacWrite-formatted file has three sections (main, header, footer), each made up of paragraphs. A "paragraph" can be either text, a ruler, or a picture. The first paragraph is ALWAYS a ruler. So if you are using Fedit, and you follow the ParaOffset pointer, you will start at a ruler.

The first two bytes of a paragraph are an integer indicating what kind of paragraph it is (0 = ruler, 1 = text, 2 = picture). The next two bytes are an integer indicating the length of the paragraph. If you are only interested in text paragraphs, you can follow ParaOffset to the first paragraph, read the paragraph type and paragraph length, and skip that many bytes if it is not a text paragraph. Then read the next paragraph, and so on.
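
As a rough illustration of the 2.2 layout described so far, here is a minimal Python sketch that unpacks the globals and steps from one paragraph to the next. It is not taken from any MacWrite documentation. It assumes 16-bit big-endian integers (as on the 68000), one byte per BOOLEAN, that the paragraph length counts the bytes that follow the four-byte type/length header, and that the main document's paragraphs come first. The names read_globals, walk_paragraphs, GLOBALS, FIELDS and KINDS are invented for this example.

```python
import struct

# Assumed layout: five 16-bit integers, six one-byte booleans
# (including the alignment byte), two 16-bit integers; big-endian.
GLOBALS = struct.Struct(">5h6B2h")
FIELDS = ("VerNum", "ParaOffset", "MainPCount", "HeadPCount", "FootPCount",
          "TitlePgF", "ScrapDispF", "FootDispF", "HeadDispF", "RulerDispF",
          "Unused1", "AcDocNum", "StPgNum")
KINDS = {0: "ruler", 1: "text", 2: "picture"}


def read_globals(buf):
    """Unpack the MWGlobals record found at the start of a 2.2 file."""
    g = dict(zip(FIELDS, GLOBALS.unpack_from(buf, 0)))
    if g["VerNum"] != 3:
        raise ValueError("not a MacWrite 2.2 file")
    return g


def walk_paragraphs(buf, offset, count):
    """Yield (kind, data_offset, length) for `count` paragraphs.

    Assumption: `length` is the number of bytes after the 4-byte header,
    so the next paragraph starts at data_offset + length.
    """
    for _ in range(count):
        kind, length = struct.unpack_from(">hH", buf, offset)
        yield KINDS.get(kind, kind), offset + 4, length
        offset += 4 + length
```

With the file's data fork in buf, something like walk_paragraphs(buf, g["ParaOffset"], g["MainPCount"]) where g = read_globals(buf) would list the main document's paragraphs, and the non-text ones can simply be ignored.
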
There are MainPCount paragraphs for the main, followed by HeadPCount paragraphs for the header, followed by FootPCount paragraphs for the footer. A text paragraph has the two bytes for paragraph type, two bytes for paragraph length, and then two bytes for the length of the text. After the string of text characters, there is a formatting run, the layout of which is rather grody.

Enough for MW 2.2 format. That should be enough to get people started. I encourage people to hack around in Fedit to learn more about it.

A real sticky wicket, a veritable nasty noodle, is the MW 3.x and 4.x file format. These files can be identified by a two-byte version number equal to '00 06'. The format for the globals at the beginning of the file is similar, but not identical, to 2.2:

MWGlobals =
  RECORD
    VerNum,        {Version number = 6 for MW 3.x and 4.x}
    MainPCount,    {Number of paragraphs in Main document}
    HeadPCount,    {Number of paragraphs in Header}
    FootPCount:    {Number of paragraphs in Footer}
      INTEGER;
    TitlePgF,      {Title page?}
    Unused1,       {Byte for word alignment}
    ScrapDispF,    {Display scrap?}
    FootDispF,     {Display footers?}
    HeadDispF,     {Display headers?}
    RulerDispF:    {Display rulers?}
      BOOLEAN;
    AcDocNum,      {Document that is currently active:}
                   {  0 = Main, 1 = Header, 2 = Footer}
    StPgNum:       {Starting page number offset}
      INTEGER;
  END; {MWGlobals}

After the starting page number, there is information for using the "free list", which is how MW 3.x and 4.x manipulate pages on disk and in memory. It is not at all necessary to understand the "free list" unless you are writing a MacWrite clone where you need to be able to swap and edit pages from a MacWrite-formatted file. If you only need to be able to read, and perhaps display, a MacWrite document, you can pretty much avoid the free list stuff altogether.

Reading a paragraph is tricky. There are "document variables" at positions 00A0 (footer), 00CE (header), and 00FC (main).
Bytes 12-15 in the document variables contain the position of the "information array" for the first paragraph. (There is an information block for each paragraph in the header, footer, and main, and each information block is a total of 16 bytes long.) Byte 8 of the information block is the status byte for that paragraph. Bit number 3 is of special interest, in that it tells whether the paragraph is in compressed format (to be discussed shortly). Bytes 9-11 contain the location of the paragraph data, and bytes 12-13 are the length of the paragraph data.

To determine what kind of paragraph it is, look at bytes 0-1 of the information block, which contain the paragraph height. A positive value indicates the paragraph is text, a negative value means a picture, and zero indicates a ruler. Now follow bytes 9-11 to find the paragraph data. If it is text, the first two bytes of the paragraph data will be the length of the text in bytes. It is important to note that the length is in bytes, not characters, since the characters may be compressed.

***Text Compression***

This is one of the most interesting things about MW 3.x and 4.x. In an attempt to save disk space, Apple came up with a scheme to compress ASCII text so that two compressed characters fit into one byte. It works as follows: for any given language (in our case, English), the developers of the MacWrite system determine the 15 most common characters in that language. Note that for almost every language, the most common character is the space character. After the space character, it varies from one language to another, based upon statistical analysis. These 15 characters are then combined to form a Str255 of length 15. The resource fork of the file contains a resource of type STR (id = 700) which is this string. For the English language, it looks like this: ' etnroaisdlhcfp'.

Now, when it comes to compressing text, all you need is one NIBBLE to represent one of these 15 characters. Each character's position in the string, counting from 0, is its nibble value, so the space is 0, 'e' is 1, 't' is 2, and so on up to 'p', which is E. The nibbles that are used are 0-E.
The nibble F is used to indicate that the two nibbles that follow are NOT compressed characters, but together form a complete ASCII character. So for example, the word 'tent' would look like '21 32', and the word 'Tent' would look like 'F5 41 32'. The string 'The tent' would look like 'F5 4B 10 21 32'.

Notice immediately that we gain a nibble for each compressed character, but we lose one for each non-compressed character. In the long run, this technique wins bytes. If, however, the text were pathologically bizarre, with lots of capital letters, punctuation, and infrequent letters, then we would be wasting space by using up an extra nibble on each uncompressible character. That is why there is a compressed bit. The MacWrite program determines whether it will win anything by using this compression technique, and acts accordingly.

Let's try running through how we can read out the text of the main part of a MacWrite document:

 1) Verify version number = 6
 2) MainPCount = bytes 0002-0003
 3) M_IAP = bytes 0108-010B                        (Main Info Array Pointer)
    The byte positions are calculated from the Main Document Variable offset of 00FC, plus the offsets 12-15 (decimal) into the MainDocVars.
 4) Paratype = bytes (M_IAP) through (M_IAP+1)
 5) psb = byte (M_IAP+8)                           (Paragraph Status Byte)
 6) CompBit = bit 3 of psb                         (Compression Bit)
 7) pdp = bytes (M_IAP+9) through (M_IAP+11)       (Paragraph Data Pointer)
 8) para_len = bytes (M_IAP+12) through (M_IAP+13) (Paragraph length)

Now, if Paratype is positive, we go to the paragraph itself...

 9) text_len = bytes (pdp) through (pdp+1)
10) Proceed to read text_len bytes, and decompress them using the scheme described above (if the compressed bit is set)

The information described above is excerpted from documentation furnished by:

    Encore Systems
    20823 Stevens Creek Blvd. C1-B
    Cupertino, California 95014
    (408)446-9565

We can vouch for the above description, since we are currently involved with developing software which will interact with MacWrite files.
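
For readers who would rather see the nibble scheme and the ten steps as code, here is a minimal Python sketch. It is our own illustration and not part of the Encore documentation. It assumes big-endian integers, that "bit 3" is counted from the least significant bit (mask 0x08), that the 16-byte information blocks of the main document sit one after another starting at M_IAP, and that uncompressed text is Mac Roman. The description above does not say how an odd number of nibbles is padded, so the sketch decodes every nibble and only drops an incomplete F escape at the very end. The names KEY, MAIN_DOC_VARS, decompress and read_main_text are invented for this example.

```python
import struct

KEY = " etnroaisdlhcfp"      # English STR 700 string quoted above
MAIN_DOC_VARS = 0x00FC       # position of the main document variables


def decompress(data, key=KEY):
    """Decode nibble-compressed text; nibble F escapes a full byte."""
    table = key.encode("mac_roman")
    nibbles = []
    for b in data:
        nibbles += [b >> 4, b & 0x0F]
    out = bytearray()
    i = 0
    while i < len(nibbles):
        if nibbles[i] != 0xF:
            out.append(table[nibbles[i]])          # one of the 15 common chars
            i += 1
        elif i + 2 < len(nibbles):
            out.append((nibbles[i + 1] << 4) | nibbles[i + 2])
            i += 3
        else:
            break                                  # incomplete escape: stop
    return out.decode("mac_roman")


def read_main_text(buf):
    """Follow steps 1-10 for every text paragraph of the main document."""
    ver, main_count = struct.unpack_from(">hh", buf, 0)       # steps 1-2
    if ver != 6:
        raise ValueError("not a MacWrite 3.x/4.x file")
    (iap,) = struct.unpack_from(">I", buf, MAIN_DOC_VARS + 12)  # step 3
    texts = []
    for k in range(main_count):
        blk = iap + 16 * k                  # assumed contiguous info blocks
        (height,) = struct.unpack_from(">h", buf, blk)        # step 4
        if height <= 0:
            continue                        # ruler (0) or picture (< 0)
        status = buf[blk + 8]                                 # step 5
        pdp = int.from_bytes(buf[blk + 9:blk + 12], "big")    # step 7
        (text_len,) = struct.unpack_from(">H", buf, pdp)      # step 9
        raw = buf[pdp + 2:pdp + 2 + text_len]                 # step 10
        if status & 0x08:                                     # step 6
            texts.append(decompress(raw))
        else:
            texts.append(raw.decode("mac_roman"))
    return texts
```

Feeding decompress the example bytes given earlier (21 32, F5 41 32, and F5 4B 10 21 32) is a quick way to check the table against your own reading of the scheme. For a non-English document the key would presumably have to be read from the file's STR 700 resource rather than hard-coded.
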
If you have any questions about details of the MacWrite file format, or about our project here at Cornell, write or call us. We may respond individually, or via the net.

    Kate MacGregor and Douglas Young
    Decentralized Computer Services
    401 Uris Hall
    Cornell University
    Ithaca, NY 14853
    (607)256-4981

Doug Young's electronic address is DMYJARTJ@CORNELLA.BITNET

-------