Internals and Programmer's Guide


This guide specifies the modules in the World-Wide Web Library of common code and the relationships between them. Every module in the library has an HTML document associated with it containing a detailed description of its functionality and its interface to other modules. This page is the top node for the implementation-specific documentation and contains links to all the modules in the library. The documentation is kept up to date with the code because the actual include files (.h) are generated from the HTML documents using the Line Mode Browser. Look here to see the current version number of the library.

The document is divided into the following sections:

Note: When compiling the library, please make sure that you have a relatively new version of the Line Mode Browser that parses the HTML documents correctly (currently 2.13 and newer versions, June 1994). Also remember that when editing the module interfaces or adding functionality, always edit the HTML files and not the .h files!

Overview

Many current World-Wide Web browsers and servers share a common architecture based on the World-Wide Web Common Code Library. The interface between the library and a WWW client or server is described in Browser operation and Using the common library. See also a general description of Hypertext Terms.

The control flow diagram of the Library shows the module interface to a WWW Browser. In the diagram, the common code presented in this document is to the right of the green line.

The main application module is a platform dependent module called by the operating system, and it manages the overall running of the program. In the CERN Line Mode Browser, it is HTBrowse. In the CERN HTTP Server, it is HTDaemon.

On behalf of the user, the client passes a request to the HTAccess Module, which is the main entry point to the library. When the request has been fulfilled (either by a load or by an error message), the data is passed back to the client through the HTML Module and the HText Module; see also Information on presentation modules.

HTAccess
The Access manager module which actually loads documents is based in HTAccess.c. This uses all the protocol modules. Given an anchor ID to jump to, it asks the anchor object for the address in order to load it.
HTFormat
The Format Manager uses the parser modules to load the document as appropriate. It can also decide on the format of a file from its name. This module defines the HTConverter type.
HTAnchor object
An anchor exists for every URL or fragment identifier. The HTAnchor Module takes care of creating anchors and managing the links between them and their attributes. This module is independent of the type of graphic object (text, line drawing, etc.). It stores hypertext addresses of anchors, and ensures that anchors with the same address are the same anchor. See also the Anchor Description.
HTHistory
An optional client module that records the documents which the user visits and replays them on request.
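
The rule that anchors with the same address are the same anchor can be illustrated with a small registry. This is only a sketch: the Anchor structure and the anchor_find() function are hypothetical names, not the HTAnchor interface, and a linear list stands in for whatever lookup structure the library actually uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical anchor record; the real HTAnchor structure holds much more. */
typedef struct Anchor {
    char *address;          /* hypertext address of this anchor */
    struct Anchor *next;    /* next anchor in the registry */
} Anchor;

static Anchor *registry = NULL;   /* all anchors created so far */

/* Return the anchor for an address, creating it only on first use. */
Anchor *anchor_find(const char *address)
{
    Anchor *a;
    assert(address != NULL);
    for (a = registry; a != NULL; a = a->next)
        if (strcmp(a->address, address) == 0)
            return a;                    /* same address: same anchor */

    a = malloc(sizeof *a);
    if (a == NULL) return NULL;
    a->address = malloc(strlen(address) + 1);
    if (a->address == NULL) { free(a); return NULL; }
    strcpy(a->address, address);
    a->next = registry;                  /* link into the registry */
    registry = a;
    return a;
}
```

Because callers always go through the lookup function, two requests for one address can be compared by pointer.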

Configuration of the Library

Much of the trouble is getting everything defined appropriately, because the rest of the library is so flexible. Modules to note are:
HTInit
The initialisation module for clients. This module contains functions to set up a list of MIME type converters that the client is capable of handling and a set of bindings between file suffixes and MIME types. The module may be replaced in custom clients.
HTSInit
The initialisation module for servers. May be replaced for custom servers. Not part of the library, but part of the server.
HTRule
The module which reads the configuration file, and does translation of URLs according to rules. It is currently replaced by HTConfig Module in the CERN Server and no rule file is used in the Line Mode Browser in order to keep it simple.
Many modules have a set of configuration flags that are external globals described in the documentation of each module, i.e., the HTML files that define the .h files.
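
As an illustration of the bindings between file suffixes and MIME types that the client initialisation module sets up, here is a minimal sketch. The SuffixBinding type, the table contents and mime_type_for() are assumptions made for this example; the library's own registration functions are described in the module documentation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical binding between a file suffix and a MIME type. */
typedef struct {
    const char *suffix;
    const char *mime_type;
} SuffixBinding;

/* Example table; a client would register its own set at start-up. */
static const SuffixBinding bindings[] = {
    { ".html", "text/html"  },
    { ".txt",  "text/plain" },
    { ".gif",  "image/gif"  },
};

/* Return the MIME type bound to the suffix of `filename`, or `fallback`. */
const char *mime_type_for(const char *filename, const char *fallback)
{
    size_t i, flen;
    assert(filename != NULL);
    flen = strlen(filename);
    for (i = 0; i < sizeof bindings / sizeof bindings[0]; i++) {
        size_t slen = strlen(bindings[i].suffix);
        /* Compare the tail of the file name with the suffix. */
        if (flen >= slen &&
            strcmp(filename + flen - slen, bindings[i].suffix) == 0)
            return bindings[i].mime_type;
    }
    return fallback;
}
```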

Modules to Overwrite

Many of the library modules that output data directly to the client or through a server can easily be overwritten, e.g., by a GUI client. Below is a list of modules that have been specifically prepared for this, but other modules can be overwritten as well:
HTML
The HTML parser, built upon the SGML parser.
HTErrorMsg
This module generates the messages on the error stack
HTAlert
See also Description of HTAlert
HTEvent
This is the internal event loop in the library for multi-threaded protocol access. See also the description of multiple threads

Communication Modules

This section covers the part of the library that is directly related to the Network communication.

Communications Interface to the Network

The functionality of the HTTCP Module covers several topics but they are all related to TCP/IP communication. All active and passive connection establishment from the Protocol Modules goes through this module. Furthermore, the module manages a local host cache of visited hosts so that the Domain Name Server is only consulted when necessary.
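
The idea of the local host cache can be sketched as follows. Everything here is hypothetical (the Resolver callback, the fixed-size table, the host_lookup() name); a stand-in resolver that counts its calls replaces the Domain Name Server so that the effect of the cache is visible.

```c
#include <assert.h>
#include <string.h>

#define HOST_CACHE_SIZE   16
#define HOST_NAME_MAX_LEN 64

/* Hypothetical resolver callback: returns an address, or 0 on failure. */
typedef unsigned long (*Resolver)(const char *host);

typedef struct {
    char name[HOST_NAME_MAX_LEN];
    unsigned long address;
} HostEntry;

static HostEntry cache[HOST_CACHE_SIZE];
static int cache_used = 0;

/* Stand-in resolver for demonstration: counts calls instead of using DNS. */
static int resolver_calls = 0;
static unsigned long fake_resolver(const char *host)
{
    (void)host;
    resolver_calls++;
    return 0x7F000001UL;
}

/* Look up a host, consulting the resolver only when it is not cached. */
unsigned long host_lookup(const char *host, Resolver resolve)
{
    int i;
    unsigned long addr;
    assert(host != NULL && resolve != NULL);
    for (i = 0; i < cache_used; i++)
        if (strcmp(cache[i].name, host) == 0)
            return cache[i].address;          /* already visited */

    addr = resolve(host);
    if (addr != 0 && cache_used < HOST_CACHE_SIZE &&
        strlen(host) < HOST_NAME_MAX_LEN) {
        strcpy(cache[cache_used].name, host); /* remember for next time */
        cache[cache_used].address = addr;
        cache_used++;
    }
    return addr;
}
```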

Other topics include:

Protocol modules

A protocol module is invoked by the HTAccess Module in order to access a document. Each protocol module is responsible for extracting information from a local file or remote server using a particular protocol. Depending on the protocol, the protocol module either builds a graphic object (e.g. hypertext) itself, or it passes a socket descriptor to the format manager for parsing by one of the parser modules.

When the client passes a request to the library, an HTRequest Structure is filled out and passed to, e.g., HTLoadAnchor() in the HTAccess module. HTRequest is a hierarchical data structure that contains all information needed by the client, the server and the library to fulfill the request. The default values in the structure make it very easy for the client to do a normal request of, e.g., an HTML document.

File access
This module provides access to files on a local file system. Due to general confusion about the "file://" access scheme in the URL Specifications, the module tries FTP access on failure, but this will be changed in a new major release, June 1994.
FTP access
Uses HTTCP for common TCP routines.
HTTP access
The HTTP module handles document search and retrieval using the HTTP protocol. See also information on the current implementation of the HTTP client. The module is now a complete state machine, which is required in a multi-threaded environment.
News access
The NNTP internet news protocol is handled by HTNews which builds a hypertext object.
Gopher access
The internet gopher access to menus and flat files (and links to telnet nodes, WhoIs servers, CSO Name Server etc.) is handled by HTGopher Module.
Telnet access
A dummy access which forks a session. Also rlogin, tn3270.
WAIS access
WAIS access is also implemented in a separate gateway program.
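
One way to picture how the access manager selects a protocol module is a table keyed on the access scheme of the URL. The sketch below is an assumption about the general shape only: the Protocol type, the stub loaders and protocol_for() are hypothetical and do not reproduce the library's registration mechanism.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical loader signature: returns 0 on success, negative on error. */
typedef int (*ProtocolLoader)(const char *url);

typedef struct {
    const char *scheme;       /* access scheme, e.g. "http" */
    ProtocolLoader load;      /* entry point of the protocol module */
} Protocol;

/* Stub loaders standing in for real protocol modules. */
static int load_http(const char *url) { (void)url; return 0; }
static int load_file(const char *url) { (void)url; return 0; }

static const Protocol protocols[] = {
    { "http", load_http },
    { "file", load_file },
};

/* Find the protocol whose scheme matches the part of `url` before ':'. */
const Protocol *protocol_for(const char *url)
{
    const char *colon;
    size_t i, len;
    assert(url != NULL);
    colon = strchr(url, ':');
    if (colon == NULL) return NULL;       /* no access scheme given */
    len = (size_t)(colon - url);
    for (i = 0; i < sizeof protocols / sizeof protocols[0]; i++)
        if (strlen(protocols[i].scheme) == len &&
            strncmp(url, protocols[i].scheme, len) == 0)
            return &protocols[i];
    return NULL;                          /* no module for this scheme */
}
```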

Multi-Threaded Clients

From version 3.0 (unreleased) of the Common Code Library, multi-threaded functionality has been added as an extra set of modules. For the moment, only the HTTP module has multiple threads but both FTP and Gopher are foreseen for the same functionality. For more information, see Specifications on Multiple Threads. The modules included are:

HTThread
This module provides the functionality for registering sockets as ready for READ or WRITE (this includes the CONNECT statement, which is basically a WRITE request).
HTEvent
This is the Library's own version of the event loop serving the HTTP client. The user can interrupt via stdin, and a call-back function is used so that the user can determine whether it was an interrupt or a type-ahead.
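
Registering sockets for READ or WRITE and then polling them is commonly done with the POSIX select() call. The sketch below assumes that approach; the SocketSets structure and the sockets_* functions are hypothetical names and not the interface of the thread or event modules.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical bookkeeping for sockets that a client is waiting on. */
typedef struct {
    fd_set read_set;    /* sockets registered for READ */
    fd_set write_set;   /* sockets registered for WRITE (and CONNECT) */
    int max_fd;         /* highest registered descriptor, -1 if none */
} SocketSets;

void sockets_init(SocketSets *s)
{
    assert(s != NULL);
    FD_ZERO(&s->read_set);
    FD_ZERO(&s->write_set);
    s->max_fd = -1;
}

/* Register `fd` for READ (for_write == 0) or WRITE (for_write != 0). */
void sockets_register(SocketSets *s, int fd, int for_write)
{
    assert(s != NULL && fd >= 0 && fd < FD_SETSIZE);
    FD_SET(fd, for_write ? &s->write_set : &s->read_set);
    if (fd > s->max_fd) s->max_fd = fd;
}

/* Poll without blocking; returns nonzero if `fd` is ready for reading. */
int sockets_readable(const SocketSets *s, int fd)
{
    fd_set ready;
    struct timeval no_wait = { 0, 0 };
    assert(s != NULL);
    ready = s->read_set;   /* select() overwrites its argument */
    if (select(s->max_fd + 1, &ready, NULL, NULL, &no_wait) <= 0)
        return 0;
    return FD_ISSET(fd, &ready) != 0;
}
```

An event loop would call select() with both sets and a real timeout, then hand each ready socket back to the protocol state machine that registered it.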

Reading Data from a Socket

To avoid reading directly from the socket, a module has been set up to provide an input buffer and some functionality that makes it easier to get data from the network.
Non-reentrant Reading Functions
This module is a submodule of HTFormat and it provides the functionality of reading data from the network. When multi-threaded access gets incorporated, it is essential that all requests to the network go through one point in the library. The reason for keeping these functions is to support the single-threaded protocol access.
Reentrant Reading Functions
This is also a submodule of the HTFormat but the difference from the other socket parsing module is that this is completely reentrant. That is, it can be used together with a multi-threaded client.
The same is not yet the case for writing to a socket, but this is on the "to do list".
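
What makes a reading function reentrant is that all of its state lives in a structure owned by the caller rather than in static variables. The following sketch shows the idea with a buffered reader over a POSIX file descriptor; InputBuffer, input_init() and input_next() are hypothetical names, not the library's own.

```c
#include <assert.h>
#include <unistd.h>

#define INPUT_BUFFER_SIZE 1024

/* All reading state lives in the caller's structure, not in globals,
   so several requests can each have their own buffer. */
typedef struct {
    int fd;                         /* socket or file descriptor */
    char data[INPUT_BUFFER_SIZE];   /* bytes read but not yet consumed */
    ssize_t length;                 /* bytes currently held in `data` */
    ssize_t position;               /* next byte to hand out */
} InputBuffer;

void input_init(InputBuffer *in, int fd)
{
    assert(in != NULL);
    in->fd = fd;
    in->length = 0;
    in->position = 0;
}

/* Return the next byte (0..255), or -1 on end of input or read error. */
int input_next(InputBuffer *in)
{
    assert(in != NULL);
    if (in->position >= in->length) {
        /* Buffer exhausted: refill it with one read() call. */
        in->length = read(in->fd, in->data, sizeof in->data);
        in->position = 0;
        if (in->length <= 0) {
            in->length = 0;
            return -1;
        }
    }
    return (unsigned char)in->data[in->position++];
}
```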

Access Authorization

In order to prevent unauthorized access on a World-Wide Web server, a basic authorization scheme has been developed, see Access Authorization for more details on the scheme. The access authorization is implemented in the following modules:
HTAABrow
This module contains WWW Browser specific code, that is, composing the HTTP Authorization header, recording user information, etc.
HTAAUtil
This module contains the authorization code that is common to both servers and clients, e.g., handling information on different authentication schemes.
UU Encoding and Decoding
Provides functions to encode and decode a data buffer according to the RFC 1113 printable encoding format.
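
The printable encoding of RFC 1113 maps each group of three bytes onto four characters from a 64-character alphabet, padding a short final group with '='. A minimal encoder sketch follows; the function name printable_encode() is an assumption for this example and not the name used in the library.

```c
#include <assert.h>
#include <stddef.h>

/* The 64-character alphabet of the RFC 1113 printable encoding. */
static const char alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode `len` bytes from `in` into `out`, which must hold at least
   4 * ((len + 2) / 3) + 1 characters. Returns the encoded length. */
size_t printable_encode(const unsigned char *in, size_t len, char *out)
{
    size_t i, o = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i += 3) {
        /* Pack up to three bytes into a 24-bit group. */
        unsigned long group = (unsigned long)in[i] << 16;
        if (i + 1 < len) group |= (unsigned long)in[i + 1] << 8;
        if (i + 2 < len) group |= (unsigned long)in[i + 2];

        /* Emit four 6-bit values, padding a short final group with '='. */
        out[o++] = alphabet[(group >> 18) & 0x3F];
        out[o++] = alphabet[(group >> 12) & 0x3F];
        out[o++] = (i + 1 < len) ? alphabet[(group >> 6) & 0x3F] : '=';
        out[o++] = (i + 2 < len) ? alphabet[group & 0x3F] : '=';
    }
    out[o] = '\0';
    return o;
}
```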

Presenting Directory Listings and other Listings

When listings return from the protocol modules, they are converted into HTML and passed to the client. Listings might be HTTP directory listings, Gopher menus, FTP directory listings, CSO Name Server results, etc. The modules providing this functionality are:
HTDirBrw
This is a very configurable module to actually present the listings
HTDescript
This module handles the description field in an HTTP directory listing. For an HTML file, the default action is to peek at the title of the document.
HTIcons
This module handles the set of icons used in the listings (HTTP, Gopher, FTP etc.).

Multi Linguistic Documents

The HTTP protocol provides the possibility of handling Multi Linguistic Documents.
HTMulti
This is the current implementation of the Multi Linguistic support in the Library.

The Stream Concept

This section describes the stream concept which is heavily used throughout the library.

Streams

Streams are unidirectional objects where you can pump data character by character, using strings, and/or using large blocks of data. The HTStream Module is a generic representation of a stream class so that the interface to any stream in the library is identical.

Streams can be thought of as files open for writing. The stream-based architecture allows the software to be event-driven in the sense that when input arrives, it is put into a stream, and any necessary actions then cascade off that. Even reading from the network can in fact be done using a stream, with the read function pumping data down the input stream.

Streams may be cascaded so that one stream writes into another stream after having performed some processing on the data. An output stream is often referred to as the "target" or "sink" stream.

Currently the following specific stream objects are implemented:

HTWriter
Writes to a socket or something opened with the UNIX file I/O open() function.
HTFWriter
Writes to an ANSI C FILE * object, as opened by fopen(), etc.
SGML
Parses the data and generates a structured stream. Each parser instance is created with reference to a particular DTD structure.
Furthermore a set of basic utility stream objects have been implemented to be used in a variety of constructions:
HTTee
Just writes into two streams at once. Useful for taking a copy for a cache.
Black Hole
A rather expensive way of piping data into a hole, where it is forgotten forever.
Through Line
A short circuited stream that returns the same output sink as it is called with.
Content Length Counter
This stream counts the number of bytes pumped into the stream. This is used, e.g., when POSTing in the HTTP protocol.
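
The generic stream interface and the way streams cascade can be sketched with a table of function pointers shared by all streams of one class. The names below (StreamClass, Counter, Tee and their functions) are assumptions made for illustration and are deliberately simpler than the HTStream definitions; the sketch shows a byte counter and a tee that writes into two targets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Stream Stream;

/* Hypothetical method table shared by every stream of one class. */
typedef struct {
    const char *name;
    void (*put_character)(Stream *me, char c);
    void (*put_block)(Stream *me, const char *data, size_t len);
} StreamClass;

/* Every stream starts with a pointer to its class. */
struct Stream { const StreamClass *isa; };

/* Generic entry point: a string is forwarded as one block. */
void stream_put_string(Stream *s, const char *str)
{
    assert(s != NULL && str != NULL);
    s->isa->put_block(s, str, strlen(str));
}

/* A counting stream: discards data but remembers how much arrived. */
typedef struct { Stream base; size_t count; } Counter;

static void counter_char(Stream *me, char c)
{
    (void)c;
    ((Counter *)me)->count++;
}
static void counter_block(Stream *me, const char *d, size_t n)
{
    (void)d;
    ((Counter *)me)->count += n;
}
static const StreamClass counter_class =
    { "Counter", counter_char, counter_block };

void counter_init(Counter *c)
{
    assert(c != NULL);
    c->base.isa = &counter_class;
    c->count = 0;
}

/* A tee stream: copies everything into two target streams. */
typedef struct { Stream base; Stream *a, *b; } Tee;

static void tee_char(Stream *me, char c)
{
    Tee *t = (Tee *)me;
    t->a->isa->put_character(t->a, c);
    t->b->isa->put_character(t->b, c);
}
static void tee_block(Stream *me, const char *d, size_t n)
{
    Tee *t = (Tee *)me;
    t->a->isa->put_block(t->a, d, n);
    t->b->isa->put_block(t->b, d, n);
}
static const StreamClass tee_class = { "Tee", tee_char, tee_block };

void tee_init(Tee *t, Stream *a, Stream *b)
{
    assert(t != NULL && a != NULL && b != NULL);
    t->base.isa = &tee_class;
    t->a = a;
    t->b = b;
}
```

Because every stream begins with the same class pointer, the caller never needs to know which kind of stream it is writing into.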

Converters

Streams can be more than just a hole to put data into (e.g. a file). They can have a converter connected to them so that the data format is modified as the data goes through the stream. The Library contains a large number of converters, many of them converting to or from HTML.
HTNetToText
Converts "Net ascii" (the stuff telnet sessions are made of) into a local C textual data stream.
HTMIME
Parses a MIME format message. This module also contains a Content Length Counter.
HTWSRC
Parses a "WAIS source" description.
HTPlain
Takes plain ASCII text and converts into whatever -- typically styled text for display or HTML.

Structured streams

A structured stream is a subclass of a stream, but instead of just accepting data, it also accepts SGML events such as begin and end elements. A structured stream therefore represents a structured document. You can think of a structured stream as the output from an SGML parser. It is more efficient for modules which generate hypertext objects to output a structured stream than to output SGML which is then parsed.

The elements and entities in the stream are referred to by numbers, rather than strings. The DTD contains the mapping between element names and numbers, so each structured stream, when created, is associated with the DTD which it is using.

The structured stream data structures are defined in the SGML module above.

Any instance of a structured stream has a related DTD which gives the rules and element and entity names for events on the structured stream. The only DTD which is currently in the library is the HTML DTD, in the HTMLDTD module.

The SGML parser uses a DTD to output to a structured stream from a stream of SGML. A hypertext editor will output to a structured stream when writing out a document. Many protocol modules output to a structured stream when generating their data structures.

The current structured stream modules are

HTML
This module is an important one as it actually PRESENTS a hypertext object to the user. GUI writers can replace this module with their own, or use it and replace the HText module which it feeds.
HTMLGen
This structured stream regenerates a plain stream. It makes reference to the HTML DTD to regenerate the names of entities and elements. This is used by the server to pass a hypertext document on to the client, and also by the client to save HTML.
HTTeXGen
Structured streams can be used for other conversions than from SGML to HTML. As an example, this is an SGML to LaTeX converter that makes it easier to construct paper versions of an HTML document.
These modules refer to the DTD modules (pick one of the following, probably HTMLPDTD)
HTMLDTD
Table describing some aspects of the HTML document type: entity and element names, element contents and allowed attributes.
HTMLPDTD
The same thing but for the HTML-Plus document type.
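
To show how element numbers and a DTD table work together, here is a sketch of a structured stream that regenerates markup text from events, in the spirit of the generator module described above. The three-element table, the Generator type and the gen_* functions are hypothetical and far smaller than the real DTD tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature DTD: the element number is the table index. */
static const char *const element_names[] = { "TITLE", "H1", "P" };
enum { ELEMENT_COUNT = sizeof element_names / sizeof element_names[0] };

/* Output buffer that the generator appends to. */
typedef struct {
    char text[256];
    size_t used;
} Generator;

static void gen_append(Generator *g, const char *s)
{
    size_t n = strlen(s);
    assert(g->used + n < sizeof g->text);   /* sketch: fixed capacity */
    memcpy(g->text + g->used, s, n + 1);
    g->used += n;
}

void gen_init(Generator *g)
{
    assert(g != NULL);
    g->used = 0;
    g->text[0] = '\0';
}

/* Structured-stream events: elements arrive as numbers, not names,
   and the DTD table turns them back into names. */
void gen_start_element(Generator *g, int element)
{
    assert(g != NULL && element >= 0 && element < ELEMENT_COUNT);
    gen_append(g, "<");
    gen_append(g, element_names[element]);
    gen_append(g, ">");
}

void gen_end_element(Generator *g, int element)
{
    assert(g != NULL && element >= 0 && element < ELEMENT_COUNT);
    gen_append(g, "</");
    gen_append(g, element_names[element]);
    gen_append(g, ">");
}

/* Character data passes through unchanged. */
void gen_put_string(Generator *g, const char *s)
{
    assert(g != NULL && s != NULL);
    gen_append(g, s);
}
```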

Cascading Streams Using a Stream Stack

When more than one conversion is needed, a stack of streams can be created, with the first stream pumping data into the next, and so on. Creation of each stream is based on the content type of the actual data. The following modules provide the functionality for handling stream conversions:
HTFormat
This module performs the registration of MIME types and conversion modules. Currently we only parse HTML, plain text and MIME messages, but obviously other formats can be added.
Stream Stack
As mentioned above, streams can be cascaded to perform a multistep conversion. This submodule handles a stack of streams, but for the moment only direct conversion is supported, so the size of the stack is always 1.
HTGuess
If the input format is unknown at the time when putting up a stream stack, then this module scans a part of the stream and on a statistical basis determines the type of stream needed from the content-type.
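
A guess made on a statistical basis can be as simple as counting character classes in the first part of the data. The sketch below is purely illustrative: the Guess type, guess_content() and both thresholds are assumptions and do not describe the heuristics actually used by the guessing module.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { GUESS_BINARY, GUESS_HTML, GUESS_PLAIN } Guess;

/* Hypothetical guesser; the thresholds are illustrative only. */
Guess guess_content(const unsigned char *data, size_t len)
{
    size_t i, text = 0, markup = 0;
    assert(data != NULL || len == 0);
    if (len == 0) return GUESS_PLAIN;

    for (i = 0; i < len; i++) {
        unsigned char c = data[i];
        if (c == '<' || c == '>')
            markup++;                       /* angle brackets hint at markup */
        if (c == '\t' || c == '\n' || c == '\r' || (c >= 0x20 && c < 0x7F))
            text++;                         /* printable ASCII or whitespace */
    }
    if (text * 10 < len * 9)                /* under 90% printable */
        return GUESS_BINARY;
    if (markup * 50 >= len)                 /* at least 2% angle brackets */
        return GUESS_HTML;
    return GUESS_PLAIN;
}
```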

Presentation Modules

This section describes the presentation to the user. Often the implementation is made for a simple browser like the Line Mode Browser so more advanced browsers will overwrite the actual implementation.

From Structured Streams to Styles

Some hypertext objects work by storing the whole structure of the document. Others work by converting the nested structure into a linear sequence of styled text for display.

The library provides code for doing this "flattening". You don't have to use these modules: if you want to work the first way, just replace the entry points in HTML with your own, to prevent the library modules from being loaded.

The HTML object mentioned above flattens structured text into styled text. The HTPlain object generates a styled text for a plain ascii document.

The style system uses a set of named styles, each of which contains paragraph and character formatting information. This is managed by the HTStyle module.

HTStyle
Style and style sheet manipulation.

Styled Text object

A graphic object is a (complex) displayable entity. It is built by a protocol module directly or using a parser. Graphic objects are in general necessarily coded differently on different window systems. The graphic object is responsible for displaying itself, catching mouse clicks, and calling the navigation object in order to follow links. We use the more common term "document" to describe the logical entity which a graphics object represents and displays.
HText
This object is window-system dependent. In the line mode browser, the GridText module is the hypertext object, providing the generic functionality of this module. Note that this interface is an alternative to the HTML.h interface above when you are building a client: you can replace the library code at either point depending on the interface you require.

General Modules and Utilities

This section covers the basic programming utilities that can be used in the client or server to make life easier.

Container Modules

These modules are generic data object storage modules that might be used wherever convenient. The general rule for freeing memory from these modules is that the free methods handle data structures generated within the modules, whereas user data is for the caller to free. The modules consist of:

Binary Trees
This is a complete balanced binary tree that might be used for storing and sorting a large number of data objects, e.g. filenames in directory listings.
Chunks
A Chunk is a blockwise expandable array of type (char *) and is a sort of apology for real strings in C. Chunks make it easier to handle dynamic strings of unknown size. It is often faster than using the String Copy Routines.
Linked Lists
This module provides the functionality for managing a generic list of data objects. The module is implemented as a singly linked list using the first in, last out (FILO) scheme.
Association Lists
This is a small module built on top of HTList that provides an easy way to store Name-Value pairs.
Strings
A few manipulation routines for dynamic arrays of characters. The routines include string copy, case insensitive comparison etc.
Atoms
Atoms are strings which are given representative pointer values so that they can be stored more efficiently, and comparisons for equality done more efficiently. The pointer values are in fact entries into a hash table.
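
The blockwise expansion behind a Chunk can be sketched as below. The structure layout and the chunk_* function names are assumptions made for this example rather than the library's own declarations; the point is that storage grows one block at a time, so appending a character rarely needs a reallocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical chunk: a character array that grows in fixed-size blocks. */
typedef struct {
    char *data;
    size_t size;        /* characters in use */
    size_t allocated;   /* characters allocated */
    size_t growby;      /* block size used when expanding */
} Chunk;

void chunk_init(Chunk *ch, size_t growby)
{
    assert(ch != NULL && growby > 0);
    ch->data = NULL;
    ch->size = 0;
    ch->allocated = 0;
    ch->growby = growby;
}

/* Append one character; returns 0 on success, -1 if memory ran out. */
int chunk_putc(Chunk *ch, char c)
{
    assert(ch != NULL);
    if (ch->size >= ch->allocated) {
        /* Full: expand by one block. */
        char *grown = realloc(ch->data, ch->allocated + ch->growby);
        if (grown == NULL) return -1;
        ch->data = grown;
        ch->allocated += ch->growby;
    }
    ch->data[ch->size++] = c;
    return 0;
}

/* Append a string (without its terminating null character). */
int chunk_puts(Chunk *ch, const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (chunk_putc(ch, *s) != 0) return -1;
    return 0;
}

/* Release the storage owned by the chunk. */
void chunk_free(Chunk *ch)
{
    assert(ch != NULL);
    free(ch->data);
    ch->data = NULL;
    ch->size = 0;
    ch->allocated = 0;
}
```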

Error Message and Information Handling

These modules are available for handling information and passing it from the Library back to the user:

HTAlert User Messages
This module contains the code for prompting the user for file names, userid, password, etc. Furthermore, it presents messages containing status information, error messages, etc. to the user. The implementation in the library is meant for the Line Mode Browser (i.e. it writes to stderr) but can easily be overwritten by GUI browsers.
HTError Module
This module maintains a message stack within the HTRequest Structure. The module classifies messages in the range from Information to Fatal Error. As an example, the new URL specified in an HTTP redirection is passed back to the user so that a clever client can edit the link directly.
HTErrorMsg
This module is in fact a part of the HTError module, but as it is written specifically for a browser like the Line Mode Browser, GUI clients will probably want to overwrite it. It presents the content of the message stack to the user in a readable form. Often, the information could be shown in a separate information window in the client.
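
A message stack with severity classes can be sketched as follows. The severity names, the fixed stack size and the error_* functions are assumptions for illustration; in the library the stack belongs to the request structure and carries more information per message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical severity classes, ordered from Information to Fatal Error. */
typedef enum { SEV_INFO, SEV_WARNING, SEV_ERROR, SEV_FATAL } Severity;

typedef struct {
    Severity severity;
    const char *message;
} ErrorEntry;

#define ERROR_STACK_MAX 8

typedef struct {
    ErrorEntry entries[ERROR_STACK_MAX];
    int depth;                      /* number of messages on the stack */
} ErrorStack;

void error_init(ErrorStack *s)
{
    assert(s != NULL);
    s->depth = 0;
}

/* Push a message; returns 0 on success, -1 if the stack is full. */
int error_push(ErrorStack *s, Severity severity, const char *message)
{
    assert(s != NULL && message != NULL);
    if (s->depth >= ERROR_STACK_MAX) return -1;
    s->entries[s->depth].severity = severity;
    s->entries[s->depth].message = message;
    s->depth++;
    return 0;
}

/* Return the most severe class on the stack (SEV_INFO if it is empty). */
Severity error_worst(const ErrorStack *s)
{
    int i;
    Severity worst = SEV_INFO;
    assert(s != NULL);
    for (i = 0; i < s->depth; i++)
        if (s->entries[i].severity > worst)
            worst = s->entries[i].severity;
    return worst;
}
```

A presentation module can then walk the entries from the top of the stack downwards and decide, from the worst severity, how prominently to show them.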

URL Management

The functionality for handling URLs is all placed in one module:
HTParse
This module provides functions for parsing URLs, simplifying them by removing redundant information, and escaping and unescaping them according to the URL Specifications.
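
Unescaping a URL means replacing each %XX sequence with the byte that the two hexadecimal digits denote. The sketch below does this in place; url_unescape() and hex_value() are hypothetical names, and malformed escapes are simply copied through unchanged, which is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if `c` is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Replace each %XX escape in `s` with the byte it denotes, in place.
   The result is never longer than the input. Returns `s`. */
char *url_unescape(char *s)
{
    char *in = s, *out = s;
    assert(s != NULL);
    while (*in != '\0') {
        int hi, lo;
        if (in[0] == '%' &&
            (hi = hex_value(in[1])) >= 0 &&
            (lo = hex_value(in[2])) >= 0) {
            *out++ = (char)(hi * 16 + lo);  /* decoded byte */
            in += 3;
        } else {
            *out++ = *in++;                 /* ordinary or malformed: copy */
        }
    }
    *out = '\0';
    return s;
}
```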

Basic Utilities

The list of basic utilities is currently as follows:
System specifics
The tcp.h file includes system-specific include files and flags for I/O to network and disk. The only reason for this file is that the Internet world is more complicated than Posix and ANSI.
HTUtils
The HTUtils.h file includes things we need everywhere, generally macros for declarations, booleans, etc.

Tim BL and Henrik Frystyk, libwww@info.cern.ch, August 1994