Internals and Programmer's Guide
This guide includes a specification of the modules in the World-Wide
Web Library of common code and the relationships between them.
Every module in the library has an HTML document associated with it
containing a detailed description of the functionality and interface
to other modules. This page is the top node for the implementation
specific documentation and contains links to all the modules in the
library. The documentation on the library is dynamically kept up to
date as the actual include files (.h) are generated from the HTML
documents using the Line
Mode Browser. Look here to see the current
version number of the library.
The document is divided into the following sections:
Note: When compiling the library, please make sure that you
have a relatively new version of the Line Mode Browser that parses the
HTML documents correctly (currently 2.13 and newer versions, June 1994).
Also remember that when editing the module interfaces or adding functionality,
you should always edit the HTML files and not the .h files!
Overview
Many current World-Wide
Web browsers and servers share a common architecture based on the
World-Wide Web Common Code Library. The interface between the library
and a WWW client or server is described in
Browser operation and
Using the common library. See also a general description of Hypertext
Terms.
The
control flow diagram of the Library shows the module interface to a WWW
Browser. In the diagram, the common code presented in this document is to the
right of the green line.
The main application module is a platform dependent module called by the
operating system, and it manages the overall running of the program. In the
CERN Line
Mode Browser, it is
HTBrowse. In the CERN HTTP
Server, it is
HTDaemon.
On behalf of the user, the client passes a request to the HTAccess Module, which is the main entry point to the
library. When the request has been fulfilled (either in terms of a load or an
error message) the data is passed back to the client through the HTML Module and the HText
Module, see also Information on
presentation modules.
- HTAccess
- The Access manager module which actually loads documents is based in
HTAccess.c. This uses all the protocol
modules. Given an anchor ID to jump to, it asks the anchor object for the
address in order to load it.
- HTFormat
- The Format Manager uses the parser
modules to load the document as appropriate. It can also decide on the
format of a file from its name. This module defines the HTConverter type.
- HTAnchor object
- An anchor exists for every URL or fragment identifier. The HTAnchor Module takes care of creating anchors,
managing the links between them and their attributes. This module is
independent of the type of graphics object (text, line drawing etc). It stores
hypertext addresses of anchors, and ensures that anchors with the same address
are the same anchor. See also the Anchor
Description.
- HTHistory
- Optional client module which records and, on request, replays the documents which
the user visits.
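The rule that anchors with the same address are the same anchor can be pictured as a lookup-before-create table. The following is a minimal sketch in C and not the HTAnchor code itself; the names Anchor and anchor_find_or_create, the hash function and the fixed-size table are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical anchor record: one per distinct address. */
typedef struct Anchor {
    char *address;          /* private copy of the hypertext address */
    struct Anchor *next;    /* next anchor in the same hash bucket */
} Anchor;

#define ANCHOR_BUCKETS 101
static Anchor *anchor_table[ANCHOR_BUCKETS];

/* Simple string hash; any reasonable hash would do. */
static unsigned anchor_hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31u + (unsigned char)*s++;
    return h % ANCHOR_BUCKETS;
}

/* Return the existing anchor for this address, or create one.
   Returns NULL only if memory runs out. */
Anchor *anchor_find_or_create(const char *address)
{
    unsigned h;
    Anchor *a;
    assert(address != NULL);
    h = anchor_hash(address);
    for (a = anchor_table[h]; a != NULL; a = a->next)
        if (strcmp(a->address, address) == 0)
            return a;               /* same address, same anchor */
    a = malloc(sizeof *a);
    if (a == NULL) return NULL;
    a->address = malloc(strlen(address) + 1);
    if (a->address == NULL) { free(a); return NULL; }
    strcpy(a->address, address);
    a->next = anchor_table[h];
    anchor_table[h] = a;
    return a;
}
```

Because two requests for the same address return the same pointer, the rest of a client can compare anchors by pointer instead of comparing address strings.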
Configuration of the Library
Much of the trouble is getting everything defined appropriately, because the
rest of the library is so flexible. Modules to note are:
- HTInit
- The initialisation module for clients. This module contains functions to
set up a list of MIME type converters that the client is capable of handling
and a set of bindings between file suffixes and MIME types. The module may be
replaced in custom clients.
- HTSInit
- The initialisation module for servers. May be replaced for custom
servers. Not part of the library, but part of the server.
- HTRule
- The module which reads the configuration file, and does translation of
URLs according to rules. It is currently replaced by
the HTConfig Module in the CERN Server, and no rule file is used in the Line
Mode Browser in order to keep it simple.
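The bindings between file suffixes and MIME types that HTInit sets up can be thought of as a small lookup table. The sketch below is illustrative only: the structure SuffixBinding, the function mime_type_for_file and the four example bindings are assumptions, not the contents of HTInit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical binding between a file suffix and a MIME type. */
typedef struct {
    const char *suffix;
    const char *mime_type;
} SuffixBinding;

/* Example bindings only; a real client would register its own set. */
static const SuffixBinding bindings[] = {
    { ".html", "text/html" },
    { ".txt",  "text/plain" },
    { ".gif",  "image/gif" },
    { ".ps",   "application/postscript" }
};

/* Return the MIME type bound to the suffix of filename, or
   fallback when no binding matches. */
const char *mime_type_for_file(const char *filename, const char *fallback)
{
    size_t i, flen, slen;
    assert(filename != NULL);
    flen = strlen(filename);
    for (i = 0; i < sizeof bindings / sizeof bindings[0]; i++) {
        slen = strlen(bindings[i].suffix);
        if (flen >= slen &&
            strcmp(filename + flen - slen, bindings[i].suffix) == 0)
            return bindings[i].mime_type;
    }
    return fallback;
}
```

A custom client that replaces the initialisation module would, in this picture, simply supply a different table.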
Many modules have a set of configuration flags that are external globals
described in the documentation of each module, i.e., the HTML files that
define the .h files.
Modules to Overwrite
Many of the library modules that output data directly to the client or through
a server can easily be overwritten, e.g., by a GUI client. Below is a list of
modules that have been specifically prepared for this, but other modules can
be overwritten as well:
- HTML
- The HTML parser built upon the SGML parser.
- HTErrorMsg
- This module generates the messages on the error
stack
- HTAlert
- See also Description of HTAlert
- HTEvent
- This is the internal event loop in the library for multi-threaded protocol
access. See also the description of multiple
threads
Communication Modules
This section covers the part of the library that is directly related to the
Network communication.
Communications Interface to the Network
The functionality of the HTTCP Module covers several
topics but they are all related to TCP/IP communication. All active and
passive connection establishment from the
Protocol Modules goes through this module. Furthermore, the module manages
a local host cache of visited hosts so that the Domain Name Server is only
consulted when necessary.
Other topics include:
- I/O status indication (errno etc.)
- Information on remote hosts
- Information on local host (domain name etc.)
- Information on current user (mail address)
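The local host cache mentioned above follows a simple pattern: look in the cache first and consult the Domain Name Server only on a miss. The sketch below illustrates that pattern under stated assumptions; the names host_lookup, Resolver and fake_resolver, the fixed cache size and the 32-bit address are all hypothetical, and the real name lookup is replaced by a stand-in so that the example runs without a network.

```c
#include <assert.h>
#include <string.h>

#define HOST_CACHE_SIZE 16
#define HOST_NAME_MAX_LEN 64

/* Hypothetical resolver callback: fills in an address for a host name
   and returns 0 on success. In a real client this is where the
   Domain Name Server would be consulted. */
typedef int (*Resolver)(const char *host, unsigned long *addr);

typedef struct {
    char name[HOST_NAME_MAX_LEN];
    unsigned long addr;
    int used;
} HostEntry;

static HostEntry host_cache[HOST_CACHE_SIZE];
static int host_cache_next = 0;     /* next slot to overwrite */

/* Look the host up in the cache first; call the resolver only on a
   miss and remember the answer. Returns 0 on success, -1 on failure. */
int host_lookup(const char *host, Resolver resolve, unsigned long *addr)
{
    int i;
    HostEntry *e;
    assert(host != NULL && resolve != NULL && addr != NULL);
    for (i = 0; i < HOST_CACHE_SIZE; i++) {
        if (host_cache[i].used && strcmp(host_cache[i].name, host) == 0) {
            *addr = host_cache[i].addr;     /* cache hit, no DNS query */
            return 0;
        }
    }
    if (resolve(host, addr) != 0)
        return -1;
    if (strlen(host) < HOST_NAME_MAX_LEN) { /* cache only names that fit */
        e = &host_cache[host_cache_next];
        host_cache_next = (host_cache_next + 1) % HOST_CACHE_SIZE;
        strcpy(e->name, host);
        e->addr = *addr;
        e->used = 1;
    }
    return 0;
}

/* Stand-in resolver for demonstration: counts how often it is asked. */
static int fake_resolver_calls = 0;
static int fake_resolver(const char *host, unsigned long *addr)
{
    (void)host;
    fake_resolver_calls++;
    *addr = 0x7f000001UL;           /* 127.0.0.1 */
    return 0;
}
```

Looking up the same host twice with fake_resolver leaves fake_resolver_calls at 1, which is the whole point of the cache.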
A protocol module is invoked by the HTAccess Module
in order to access a document. Each protocol module is responsible for
extracting information from a local file or remote server using a particular
protocol. Depending on the protocol, the protocol module either builds a
graphic object (e.g. hypertext) itself, or it passes a socket descriptor to
the format manager for parsing by one of the parser modules.
When the client passes a request to the library, an
HTRequest Structure is filled out and passed to, e.g., HTLoadAnchor() in the
HTAccess module. HTRequest is a hierarchical data structure that contains all
information needed by the client, the server and the library to fulfill the
request. The default values in the structure make it very easy for the client
to do a normal request of, e.g., an HTML document.
- File access
- This module provides access to files on a local file system. Due
to general confusion about the "file://" access scheme in the URL
Specifications, the module tries FTP access on failure, but this will be
changed in a new major release, June 1994.
- FTP access
- Uses HTTCP for common TCP routines.
- HTTP access
- The HTTP module handles document
search and retrieve using the HTTP
protocol. See also information on the
current implementation of the HTTP client. The module is now a complete
state machine, which is required functionality in a multi-threaded environment.
- News access
- The NNTP internet news protocol is handled by HTNews which builds a
hypertext object.
- Gopher access
- The internet gopher access to menus and flat files (and links to telnet
nodes, WhoIs servers, CSO Name Server etc.) is handled by HTGopher Module.
- Telnet access
- A dummy access which forks a session. Also rlogin, tn3270.
- WAIS access
- WAIS access is also implemented in a separate gateway
program.
Multi-Threaded Clients
From version 3.0 (unreleased) of the Common Code Library, multi-threaded
functionality has been added as an extra set of modules. For the moment, only
the HTTP module has multiple threads but both FTP and Gopher are foreseen for
the same functionality. For more information, see Specifications
on Multiple Threads. The modules included are:
- HTThread
- This module provides the functionality for registering sockets as ready
for READ or WRITE (this includes the CONNECT statement that is basically a
WRITE request).
- HTEvent
- This is the Library's own version of the event-loop serving the HTTP
client. The user can interrupt via stdin and a call-back function is used so
that the user can determine whether it was an interrupt or a type-ahead.
Reading Data from a Socket
To avoid reading directly from the socket, a module has been set up to provide
an input buffer and some functionality to make it easier to get data from the
network.
- Non-reentrant Reading Functions
- This module is a submodule of HTFormat and it
provides the functionality of reading data from the network. When
multi-threaded access gets incorporated it is essential that all requests to
the network go through one point in the library. The reason for keeping these
functions is to support the single-threaded protocol access.
- Reentrant Reading Functions
- This is also a submodule of the HTFormat but
the difference from the other socket parsing module is that this is completely
reentrant. That is, it can be used together with a multi-threaded client.
The same is not yet the case for writing to a socket, but this is on the "to
do list".
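The difference between the two sets of reading functions comes down to where the buffer state lives. A reentrant reader keeps it in a structure owned by the caller rather than in static variables. The following sketch shows that idea; InputBuffer, input_buffer_init and input_buffer_getc are hypothetical names and the buffer size is arbitrary.

```c
#include <assert.h>
#include <unistd.h>

#define INPUT_BUFFER_SIZE 4096

/* Hypothetical per-connection input buffer. Because all state lives
   in this structure and none in static variables, several connections
   can be read in an interleaved fashion. */
typedef struct {
    int fd;                         /* socket or file descriptor */
    char data[INPUT_BUFFER_SIZE];
    ssize_t length;                 /* number of valid bytes in data */
    ssize_t pos;                    /* next byte to hand out */
} InputBuffer;

void input_buffer_init(InputBuffer *ib, int fd)
{
    assert(ib != NULL);
    ib->fd = fd;
    ib->length = 0;
    ib->pos = 0;
}

/* Return the next character as an unsigned char value, or -1 on end
   of file or read error. Refills the buffer with read() when empty. */
int input_buffer_getc(InputBuffer *ib)
{
    assert(ib != NULL);
    if (ib->pos >= ib->length) {
        ib->length = read(ib->fd, ib->data, sizeof ib->data);
        ib->pos = 0;
        if (ib->length <= 0) {
            ib->length = 0;
            return -1;
        }
    }
    return (unsigned char)ib->data[ib->pos++];
}
```

Note that read() may block in this sketch; an event-driven client would call input_buffer_getc() only after the socket has been reported ready for READ.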
Access Authorization
In order to prevent unauthorized access on a World-Wide Web server, a basic
authorization scheme has been developed, see
Access Authorization for more details on the scheme. The access
authorization is implemented in the following modules:
- HTAABrow
- This module contains WWW Browser specific code, that is, composing the
HTTP Authorization Header, recording user information etc.
- HTAAUtil
- This module contains the authorization code that is common to both the
servers and clients, e.g., handling information on different authentication
schemes etc.
- UU Encoding and Decoding
- Provides functions to encode and decode a data buffer according to the
RFC 1113 printable encoding format.
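The RFC 1113 printable encoding maps each group of three bytes onto four characters from a 64-character alphabet and pads a short final group with "=". The encoder below is a sketch of that scheme and not the library's own routine; the function name printable_encode and its calling convention are assumptions.

```c
#include <assert.h>
#include <stddef.h>

static const char printable[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode len bytes from in into out as NUL-terminated printable text.
   out must have room for 4 * ((len + 2) / 3) + 1 characters.
   Returns the number of characters written, excluding the NUL. */
size_t printable_encode(const unsigned char *in, size_t len, char *out)
{
    size_t i, o = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i + 2 < len; i += 3) {          /* full 3-byte groups */
        out[o++] = printable[in[i] >> 2];
        out[o++] = printable[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[o++] = printable[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
        out[o++] = printable[in[i + 2] & 0x3f];
    }
    if (i < len) {                              /* 1 or 2 bytes left */
        out[o++] = printable[in[i] >> 2];
        if (i + 1 < len) {
            out[o++] = printable[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
            out[o++] = printable[(in[i + 1] & 0x0f) << 2];
        } else {
            out[o++] = printable[(in[i] & 0x03) << 4];
            out[o++] = '=';
        }
        out[o++] = '=';
    }
    out[o] = '\0';
    return o;
}
```

For example, the three bytes "Man" encode to "TWFu", and the single byte "M" encodes to "TQ==".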
Presenting Directory Listings and other Listings
When listings return from the protocol modules they are converted into HTML
and passed to the client. Listings might be HTTP directory listings, Gopher
menus, FTP directory listings, CSO Name server etc. The modules providing this
functionality are:
- HTDirBrw
- This is a very configurable module to actually present the listings
- HTDescript
- This module handles the description field in an HTTP directory listing.
For an HTML file, the default action is to peek at the title of the document.
- HTIcons
- This module handles the set of icons used in the listings (HTTP, Gopher,
FTP etc.).
Multi Linguistic Documents
The HTTP protocol provides the possibility of handling
Multi Linguistic Documents.
- HTMulti
- This is the current implementation of the Multi Linguistic support in the
Library.
The Stream Concept
This section describes the stream concept which is heavily used throughout the
library.
Streams are unidirectional objects where you can pump data character by
character, using strings, and/or using large blocks of data. The HTStream Module is a generic representation of a
stream class so that the interface to any stream in the library is
identical.
Streams can be thought of as being like files opened for writing. The stream-based
architecture allows the software to be event-driven in the sense that when
input arrives, it is put into a stream, and any necessary actions then cascade
off that. Even reading from the Network can in fact be done using a stream
having the read function pumping data down the input stream.
Streams may be cascaded so that one
stream writes into another stream after having performed some processing
on the data. An output stream is often referred to as the "target" or "sink"
stream.
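The idea of a common stream interface and of cascading can be sketched with a structure holding a method pointer and a target. This is a simplified illustration and not the HTStream definition: the structure layout, the function names and the upper-casing filter are all assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical generic stream: one method plus private state. */
typedef struct Stream Stream;
struct Stream {
    void (*put_block)(Stream *me, const char *data, size_t len);
    Stream *target;     /* next stream in the cascade, NULL for a sink */
    char *buffer;       /* used by the sink below only */
    size_t used, size;
};

/* The same three calls work on any stream, whatever it does inside. */
void stream_put_block(Stream *s, const char *data, size_t len)
{
    s->put_block(s, data, len);
}
void stream_put_char(Stream *s, char c)
{
    s->put_block(s, &c, 1);
}
void stream_put_string(Stream *s, const char *str)
{
    s->put_block(s, str, strlen(str));
}

/* Sink: stores what it receives in a caller-supplied buffer and
   silently drops anything that does not fit. */
static void sink_put_block(Stream *me, const char *data, size_t len)
{
    size_t i;
    for (i = 0; i < len && me->used + 1 < me->size; i++)
        me->buffer[me->used++] = data[i];
    me->buffer[me->used] = '\0';
}

void sink_init(Stream *s, char *buffer, size_t size)
{
    assert(s != NULL && buffer != NULL && size > 0);
    s->put_block = sink_put_block;
    s->target = NULL;
    s->buffer = buffer;
    s->used = 0;
    s->size = size;
    buffer[0] = '\0';
}

/* Filter: converts to upper case, then writes into its target. */
static void upper_put_block(Stream *me, const char *data, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        stream_put_char(me->target, (char)toupper((unsigned char)data[i]));
}

void upper_init(Stream *s, Stream *target)
{
    assert(s != NULL && target != NULL);
    s->put_block = upper_put_block;
    s->target = target;
    s->buffer = NULL;
    s->used = s->size = 0;
}
```

The caller only ever sees the first stream of the cascade; data pumped into the filter arrives in the sink already processed.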
Currently the following specific stream objects are implemented:
- HTWriter
- Writes to a socket or something opened with the UNIX file I/O open()
function.
- HTFWriter
- Writes to an ANSI C FILE * object, as opened by fopen(), etc.
- SGML
- Parses the data and generates a structured
stream. Each parser instance is created with reference to a particular DTD
structure.
Furthermore a set of basic utility stream objects have been implemented to be
used in a variety of constructions:
- HTTee
- Just writes into two streams at once. Useful for taking a copy for a
cache.
- Black Hole
- A quite expensive way of piping data into a hole, where it is forgotten
forever.
- Through Line
- A short circuited stream that returns the same output sink as it is
called with.
- Content Length Counter
- This stream counts the number of bytes pumped into the stream. This is
used in, e.g.,
POST'ing in the HTTP Protocol
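The utility streams listed above are small enough to sketch in a few lines each. The code below shows a tee, a black hole and a byte counter sharing one hypothetical interface; the Stream structure and all function names are assumptions for illustration and do not reproduce the library's declarations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal stream interface for this example. */
typedef struct Stream Stream;
struct Stream {
    void (*put_block)(Stream *me, const char *data, size_t len);
    Stream *first, *second;   /* targets, used by the tee only */
    size_t count;             /* bytes seen, used by the counter only */
};

/* Counter: remembers how many bytes were pumped in, discards them. */
static void counter_put_block(Stream *me, const char *data, size_t len)
{
    (void)data;
    me->count += len;
}

void counter_init(Stream *s)
{
    assert(s != NULL);
    s->put_block = counter_put_block;
    s->first = s->second = NULL;
    s->count = 0;
}

/* Black hole: accepts everything and forgets it. */
static void hole_put_block(Stream *me, const char *data, size_t len)
{
    (void)me; (void)data; (void)len;
}

void hole_init(Stream *s)
{
    assert(s != NULL);
    s->put_block = hole_put_block;
    s->first = s->second = NULL;
    s->count = 0;
}

/* Tee: writes the same data into two streams. */
static void tee_put_block(Stream *me, const char *data, size_t len)
{
    me->first->put_block(me->first, data, len);
    me->second->put_block(me->second, data, len);
}

void tee_init(Stream *s, Stream *first, Stream *second)
{
    assert(s != NULL && first != NULL && second != NULL);
    s->put_block = tee_put_block;
    s->first = first;
    s->second = second;
    s->count = 0;
}
```

A tee whose two targets are a display stream and a file writer is the pattern behind taking a copy for a cache.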
Converters
Streams can be more than just a hole to put data into (e.g. to a file). They
can have a converter connected to them so that the data format is modified going
through the stream. The Library contains a large number of converters, many of
them converting to or from HTML.
- HTNetToText
- Converts "Net ascii" (the stuff telnet sessions are made of) into a local C
textual data stream.
- HTMIME
- Parse a MIME format message. This module also contains a Content Length Counter
- HTWSRC
- Parses a "WAIS source" description.
- HTPlain
- Takes plain ASCII text and converts into whatever -- typically styled
text for display or HTML.
A structured stream is a subclass of a stream, but instead of just accepting data, it also
accepts SGML events such as begin and end elements. A structured stream
therefore represents a structured document. You can think of a structured
stream as being like the output from an SGML parser. It is more efficient for
modules which generate hypertext objects to output a structured stream than to
output SGML which is then parsed.
The elements and entities in the stream are referred to by numbers, rather
than strings. The DTD contains the mapping between element names and numbers,
so each structured stream, when created, is associated with the DTD which it is
using.
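Referring to elements by number means the DTD acts as a two-way table between names and small integers. The sketch below uses a made-up miniature DTD with four elements; MiniDTD, dtd_element_number and dtd_element_name are hypothetical and much simpler than the HTML DTD tables in the library.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical miniature DTD: the index of a name in the array is
   the element number used on the structured stream. */
typedef struct {
    const char **element_names;
    int element_count;
} MiniDTD;

static const char *mini_names[] = { "A", "H1", "P", "TITLE" };
static const MiniDTD mini_dtd = { mini_names, 4 };

/* Map an element name to its number, or -1 if the DTD does not
   define it. */
int dtd_element_number(const MiniDTD *dtd, const char *name)
{
    int i;
    assert(dtd != NULL && name != NULL);
    for (i = 0; i < dtd->element_count; i++)
        if (strcmp(dtd->element_names[i], name) == 0)
            return i;
    return -1;
}

/* Map an element number back to its name, or NULL if out of range. */
const char *dtd_element_name(const MiniDTD *dtd, int number)
{
    assert(dtd != NULL);
    if (number < 0 || number >= dtd->element_count)
        return NULL;
    return dtd->element_names[number];
}
```

A parser would do the name lookup once per tag, and every stream further down the cascade could then switch on the number instead of comparing strings. A regenerating stream needs only the reverse lookup to print the names again.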
The structured stream data structures are defined in the SGML module above.
Any instance of a structured stream
has a related DTD which gives the rules and element and entity names for
events on the structured stream. The only DTD which is currently in the
library is the HTML DTD, in the HTMLDTD
module.
The SGML parser uses a DTD to output to a structured stream from a stream of
SGML. A hypertext editor will output to a structured stream when writing out
a document. Many protocol modules output
to a structured stream when generating their data structures.
The current structured stream modules are:
- HTML
- This module is an important one as it actually PRESENTS a hypertext
object to the user. GUI writers can replace this module with their own, or use it
and replace the HText module which it feeds.
- HTMLGen
- This structured stream regenerates a plain stream. It makes reference to
the HTML DTD to regenerate the names of entities and elements. This is used by
the server to pass a hypertext document on to the client, and also by the
client to save HTML.
- HTTeXGen
- Structured streams can be used for other conversions than from SGML to
HTML. As an example, this is an SGML to LaTeX converter that makes it easier to
construct paper versions of an HTML document.
These modules refer to the DTD modules (pick one of the following, probably
HTMLPDTD):
- HTMLDTD
- Table describing some aspects of the HTML document type: entity and
element names, element contents and allowed attributes.
- HTMLPDTD
- The same thing but for HTML-Plus document type.
When more than one conversion is needed a stack of streams can be created
having the first stream pumping data into the next etc. Creation of each
stream is based on the content type of the actual data. The following modules
provide the functionality for handling stream conversions:
- HTFormat
- This module performs the registration of MIME types and conversion
modules. Currently we only parse HTML, plain text and MIME messages, but
obviously other formats can be added.
- Stream Stack
- As mentioned above, streams can be cascaded to perform a multistep
conversion. This submodule handles a stack of streams, but for the moment only
direct conversion is supported so the size of the stack is always 1.
- HTGuess
- If the input format is unknown at the time when putting up a stream
stack, then this module scans a part of the stream and on a statistical basis
determines the type of stream needed from the content-type.
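Guessing a format on a statistical basis amounts to counting tell-tale characters in a sample and comparing the counts against thresholds. The sketch below is one possible heuristic and should not be read as the rules HTGuess applies: the function name guess_type, the three categories and both thresholds are assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { GUESS_PLAIN_TEXT, GUESS_HTML, GUESS_BINARY } GuessedType;

/* Guess a type from a sample of the data. The thresholds are
   illustrative assumptions, not measured values. */
GuessedType guess_type(const unsigned char *sample, size_t len)
{
    size_t i, control = 0, markup = 0;
    assert(sample != NULL || len == 0);
    for (i = 0; i < len; i++) {
        unsigned char c = sample[i];
        if (c == '<' || c == '>')
            markup++;
        else if (c < 32 && c != '\n' && c != '\r' && c != '\t')
            control++;
    }
    if (control * 10 > len)     /* more than 10% control characters */
        return GUESS_BINARY;
    if (markup * 50 > len)      /* more than 2% angle brackets */
        return GUESS_HTML;
    return GUESS_PLAIN_TEXT;
}
```

The result of such a guess would then decide which converter is put at the head of the stream stack.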
Presentation Modules
This section describes the presentation to the user. Often the implementation
is made for a simple browser like the Line Mode Browser so more advanced
browsers will overwrite the actual implementation.
From Structured Streams to Styles
Some hypertext objects work by storing the whole structure of the document.
Others work by converting the nested structure into a linear sequence of
styled text for display.
The library provides code for doing this "flattening". You don't have to use
these modules: if you want to work the first way, just replace the entry
points in HTML with your own, to prevent the library modules from being
loaded.
The HTML object mentioned above flattens
structured text into styled text. The HTPlain object generates a styled text
for a plain ascii document.
The style system uses a set of named styles, each of which contains paragraph
and character formatting information. This is managed by the HTStyle module.
- HTStyle
- Style and style sheet manipulation.
A graphic object is a (complex) displayable entity. It is built by a protocol
module directly or using a parser. Graphic objects are in general necessarily
coded differently on different window systems. The graphic object is
responsible for displaying itself, catching mouse clicks, and calling the
navigation object in order to follow links. We use the more common term
"document" to describe the logical entity which a graphics object represents
and displays.
- HText
- This object is window-system dependent. In the line mode browser, the GridText
module is the hypertext object, providing the generic functionality of this
module. Note that this interface is an alternative to the HTML.h interface
above when you are building a client: you can replace the library code at
either point depending on the interface you require.
General Modules and Utilities
This section covers the basic programming utilities that can be used in the
client or server to make life easier.
Container Modules
These modules are generic data object storage modules that might be used
wherever convenient. The general rule for freeing memory from these modules is
that the free methods handle data structures generated within the modules, whereas
user data is for the caller to free. The modules consist of:
Binary Trees
This is a complete balanced binary tree that might be used for storing
and sorting a large number of data objects, e.g. filenames in directory
listings etc.
Chunks
A Chunk is a blockwise expandable array of type (char *) and is a sort of
apology for real strings in C. Chunks make it easier to handle dynamic strings
of unknown size. It is often faster than using the String Copy Routines.
Linked Lists
This module provides the functionality for managing a generic list of data
objects. The module is implemented as a singly linked list using the scheme
first in - last out (FILO).
Association Lists
This is a small module built on top of HTList that provides a way to
store Name-Value pairs in an easy way.
Strings
A few manipulation routines for dynamic arrays of characters. The
routines include string copy, case insensitive comparison etc.
Atoms
Atoms are strings which are given representative pointer values so that
they can be stored more efficiently, and comparisons for equality done more
efficiently. The pointer values are in fact entries into a hash table.
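Of the container modules above, the Chunk is perhaps the easiest to picture in code: a character array that grows by a fixed block whenever it fills up. The sketch below is written from that description alone; the structure fields and the function names chunk_init, chunk_putc, chunk_puts and chunk_free are assumptions and may differ from the library's Chunk module.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical chunk: a character array that grows in fixed steps. */
typedef struct {
    char *data;         /* allocated storage, or NULL when empty */
    size_t size;        /* number of characters in use */
    size_t allocated;   /* number of characters allocated */
    size_t growby;      /* allocation step */
} Chunk;

void chunk_init(Chunk *ch, size_t growby)
{
    assert(ch != NULL && growby > 0);
    ch->data = NULL;
    ch->size = 0;
    ch->allocated = 0;
    ch->growby = growby;
}

/* Append one character, growing the array by a whole block when it
   is full. Returns 0 on success, -1 if memory runs out. */
int chunk_putc(Chunk *ch, char c)
{
    assert(ch != NULL);
    if (ch->size >= ch->allocated) {
        char *bigger = realloc(ch->data, ch->allocated + ch->growby);
        if (bigger == NULL) return -1;
        ch->data = bigger;
        ch->allocated += ch->growby;
    }
    ch->data[ch->size++] = c;
    return 0;
}

/* Append a string without its terminating NUL. */
int chunk_puts(Chunk *ch, const char *s)
{
    assert(s != NULL);
    for (; *s; s++)
        if (chunk_putc(ch, *s) != 0) return -1;
    return 0;
}

/* Release the storage; the chunk can be reused afterwards. */
void chunk_free(Chunk *ch)
{
    assert(ch != NULL);
    free(ch->data);
    ch->data = NULL;
    ch->size = ch->allocated = 0;
}
```

Growing by whole blocks means most appends touch no allocator at all, which is one plausible reason a chunk can beat repeated string copies.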
These modules are available for handling information and passing it from the
Library back to the user:
- HTAlert User Messages
- This module contains the code for prompting the user for file names,
userid, password etc. Furthermore, it presents messages containing status
information, error messages etc. to the user. The implementation in the
library is meant for the Line Mode Browser (i.e. it writes to stderr) but can
easily be overwritten by GUI browsers.
- HTError Module
- This module maintains a message stack within the
HTRequest Structure. The module classifies messages in the range from
Information to Fatal Error. As an example, the new URL
specified in an HTTP
Redirection gets passed back to the user so that a clever client can edit
the link directly.
- HTErrorMsg
- This module is in fact a part of the HTError module, but as it is written
specifically for a browser like the Line Mode Browser, GUI clients would like to
overwrite it. It presents the content of the message stack to the user in a
readable form. Often, the information could be shown in a separate information
window in the client.
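A message stack with severity classes can be sketched as an array of (severity, text) pairs with a push operation and a filtered lookup. This is an illustration of the concept only; the enumeration values, MessageStack and the message_ functions are hypothetical and do not mirror the HTError interface.

```c
#include <assert.h>
#include <stddef.h>

/* Severity classes in increasing order, from information to fatal. */
typedef enum { SEV_INFO, SEV_WARNING, SEV_ERROR, SEV_FATAL } Severity;

typedef struct {
    Severity severity;
    const char *text;       /* not copied; must outlive the stack */
} Message;

#define MAX_MESSAGES 8

/* Hypothetical per-request message stack. */
typedef struct {
    Message items[MAX_MESSAGES];
    int count;
} MessageStack;

void message_stack_init(MessageStack *s)
{
    assert(s != NULL);
    s->count = 0;
}

/* Push a message. Returns 0 on success, -1 if the stack is full. */
int message_push(MessageStack *s, Severity severity, const char *text)
{
    assert(s != NULL && text != NULL);
    if (s->count >= MAX_MESSAGES) return -1;
    s->items[s->count].severity = severity;
    s->items[s->count].text = text;
    s->count++;
    return 0;
}

/* Return the most recent message whose severity is at least the
   given level, or NULL if there is none. */
const Message *message_latest(const MessageStack *s, Severity at_least)
{
    int i;
    assert(s != NULL);
    for (i = s->count - 1; i >= 0; i--)
        if (s->items[i].severity >= at_least)
            return &s->items[i];
    return NULL;
}
```

A line mode client might print everything on the stack, while a GUI client could show only messages at error level or above in a dialog and route the rest to a status window.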
URL Management
The functionality for handling URLs is all placed in one module:
- HTParse
- This module provides functions for parsing URLs, simplifying them by
removing redundant information, and escaping and unescaping them according to the URL
Specifications.
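Unescaping a URL means replacing each %XX sequence with the character whose hexadecimal code is XX. The sketch below shows one way to do this in place; url_unescape is a hypothetical name, and the choice to copy malformed escapes through unchanged is an assumption rather than documented HTParse behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Replace every %XX escape in str by the character it stands for.
   Works in place, since the result is never longer than the input.
   Malformed escapes are copied through unchanged. Returns str. */
char *url_unescape(char *str)
{
    char *in, *out;
    assert(str != NULL);
    in = out = str;
    while (*in) {
        int hi = -1, lo = -1;
        if (in[0] == '%' && (hi = hex_value(in[1])) >= 0
                         && (lo = hex_value(in[2])) >= 0) {
            *out++ = (char)(hi * 16 + lo);
            in += 3;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
    return str;
}
```

For example, "/my%20file%2Etxt" becomes "/my file.txt". Note that an escaped zero byte would end the C string early, so a careful caller may want to reject "%00" before unescaping.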
Basic Utilities
The list of basic utilities is currently as follows:
- System specifics
- The tcp.h file includes system-specific include files and flags for I/O
to network and disk. The only reason for this file is that the Internet world
is more complicated than Posix and ANSI.
- HTUtils
- The HTUtils.h file includes things we need everywhere, generally macros
for declarations, booleans, etc.
Tim BL and Henrik
Frystyk, libwww@info.cern.ch, August 1994