Most early development of computers and related systems was done by English-language speakers, so the encoding systems were based on English. The encodings covered the English alphabet, numbers, common punctuation, and special non-printing characters for controlling computer display (tabs, line breaks, and so on). Most computing systems in use today still rely on English-based encodings, which can be problematic when you want to create documents in other languages.
The most common encoding is ASCII (American Standard Code for Information Interchange). In ASCII, the upper- and lower-case English letters total 52. Add the ten numeric digits, the punctuation marks, the other common special characters, and the space, and the total reaches 95 printing characters. Most of these printing characters are available on virtually every computer system. In addition to the printing characters, ASCII reserves 33 non-printing "control" characters (including "newline", "escape", and "delete"), bringing the total number of characters to 128.
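These counts are easy to verify. The sketch below assumes Python 3, whose str.isprintable method treats the space as a printing character and "delete" (code 127) as a control character:

```python
# Count the printing and non-printing characters in the 128-character ASCII range.
printable = [chr(i) for i in range(128) if chr(i).isprintable()]
control = [chr(i) for i in range(128) if not chr(i).isprintable()]

print(len(printable))  # 95 printing characters, including the space
print(len(control))    # 33 control characters: codes 0-31 plus "delete" (127)
```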
In the base-2 number system used by computers, the 128 characters of ASCII fit in 7 bits (2^7 = 128). Seven is an awkward size for hardware, however, so 8 bits became the common base unit in computing; the 8-bit unit is known as a byte. Eight bits allow 256 characters in a set, and ISO standard 8859 defines 8-bit character sets that cover several common languages with relatively small numbers of characters. For example, ISO 8859-1 "Latin 1" covers many western European languages; ISO 8859-5 is Cyrillic; ISO 8859-8 is Hebrew; and ISO 8859-7 is Greek.
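Because each ISO 8859 part reuses the same 256 byte values, a given byte stands for a different character depending on which part is in effect. A small sketch, assuming Python's bundled codecs (the names "latin-1", "iso8859-5", and "iso8859-7" are Python's codec names):

```python
# The same 8-bit value decodes to a different character in each ISO 8859 part.
byte = bytes([0xE9])

for codec in ("latin-1", "iso8859-5", "iso8859-7"):
    ch = byte.decode(codec)
    print(codec, ch, hex(ord(ch)))
# Expect "é" from Latin 1, a Cyrillic letter from ISO 8859-5,
# and a Greek letter from ISO 8859-7.
```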
The legacy of the days before ISO 8859 (first published in the late 1980s) remains. Macintosh and PC development began before ISO 8859 was common, and the creators of those systems followed their own methods of extending the ASCII character coding to 8 bits. In addition, national standards predating ISO 8859 continue to be used (for example, KOI8-R for Cyrillic, and the other non-ISO encodings listed in Netscape under the menu View > Character Set).
Furthermore, some software applications were not designed to handle 8 bits of character data, expecting only the 7 bits of ASCII (electronic mail systems are a common example), with the one free bit in each byte being used for special purposes. The consequence is that some programs may strip the eighth bit from ISO 8859-1 characters, turning them into unrelated ASCII characters: "café" becomes "cafi", and "München" becomes "M|nchen". This is why HTML provides for non-ASCII characters through entities such as &eacute;.
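The damage is easy to reproduce. Below is a minimal sketch, in Python, of a 7-bit-only channel that keeps just the low 7 bits of each ISO 8859-1 byte, followed by the entity form that survives such a channel (the helper name strip_high_bit is made up for the illustration):

```python
import html

def strip_high_bit(text):
    """Simulate a 7-bit-only mail path: keep only the low 7 bits of each byte."""
    return bytes(b & 0x7F for b in text.encode("latin-1")).decode("ascii")

print(strip_high_bit("café"))        # cafi     (é, byte 0xE9, becomes i, 0x69)
print(strip_high_bit("München"))     # M|nchen  (ü, byte 0xFC, becomes |, 0x7C)
print(html.unescape("caf&eacute;"))  # café -- the entity form uses only ASCII
```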
Early typewriters had no separate characters for the numerals one (1) or zero (0), substituting instead the letters 'l' (lower-case L) and 'O' (upper case). Although they appear the same (on typewriters), they are not the same, and for computer data it is critical that the correct character be used. The visual representation of a character (for example, the character named "LATIN CAPITAL LETTER O") is called a "glyph". ASCII, ISO 8859, the various obsolete encodings, and Unicode all specify a character by mapping it to a number. They do not specify the glyph (i.e., how the character looks).
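Each of these look-alike characters has its own number and its own name. A quick sketch using Python's unicodedata module:

```python
import unicodedata

# Look-alike characters are distinct characters with distinct numbers and names.
for ch in "0O1l":
    print(repr(ch), hex(ord(ch)), unicodedata.name(ch))
# '0' 0x30 DIGIT ZERO
# 'O' 0x4f LATIN CAPITAL LETTER O
# '1' 0x31 DIGIT ONE
# 'l' 0x6c LATIN SMALL LETTER L
```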
Thus the number-to-character mapping does not carry information about the style of the character: its exact look, detailed shape, size, weight (plain or bold), color, and so on. It is simply an agreed-upon mapping of the abstract concept of a particular character to a number. An author producing a document usually wants to control such style features, however, and so digital fonts were created, along with software that lets the user control character styles and other formatting.
Once the concept of applying a font to textual data was established, an approach to working in non-English alphabets followed. Any alphabet (or symbol set) containing fewer than 256 characters can become a "font". When that font is selected and you type the character "a" on the keyboard (normally programmed to send the ASCII code 97), the computer looks up what is matched with 97 in the selected font and displays that character or symbol.
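The idea can be sketched as a simple lookup table. The "pretend Greek font" below is purely illustrative (a real font maps every code in the set), but it shows how the same code can display different symbols:

```python
# Purely illustrative: a pretend "Greek font" that reuses ASCII codes 97 and 98.
fake_greek_font = {97: "α", 98: "β"}

code = ord("a")               # the keyboard sends 97
print(code)                   # 97
print(chr(code))              # "a" under the ordinary ASCII mapping
print(fake_greek_font[code])  # "α" when the pretend Greek font is selected
```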
The problem with this approach is that fonts are typically machine- and/or operating-system-dependent, and each font can have a different mapping of ASCII numbers to characters, even for the same language. If you have ever tried to exchange a file written in a non-English alphabet with a colleague using a different type of computer, you have almost certainly come upon this problem. It is also difficult to mix text from several different languages in one document with this approach, because every change of language requires a change of font, and there is no single encoding that can be applied to the entire document. (More complex formatting information of the kind possible in applications like Word or WordPerfect -- the size and style of the characters, complex page-layout information -- is usually stored in an encoding completely different from ASCII, and is even more machine-dependent and difficult to transfer.)
Not discussed above is the problem of languages such as Chinese, Japanese, and Korean, which have more characters than can be represented in 8 bits. For each of these languages, two bytes of data are used to represent a single character. Because of the way binary numbers work, two bytes can represent up to 65,536 unique characters, more than enough for any single writing system. Korean, Japanese, and Chinese each have at least one, and often several different, two-byte encoding systems to represent their characters.
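A small sketch, assuming Python's bundled codecs for three such legacy encodings (Shift_JIS for Japanese, GBK for Chinese, EUC-KR for Korean), showing that each native character comes out as two bytes:

```python
# One character, two bytes, in three legacy double-byte encodings.
samples = [("あ", "shift_jis"), ("中", "gbk"), ("한", "euc-kr")]

for ch, codec in samples:
    data = ch.encode(codec)
    print(ch, codec, data, len(data))  # each native character encodes to 2 bytes
```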
There is now a standardized character encoding for nearly all languages (with more being added). This is the merger of ISO standard 10646 with the work of the industry-sponsored Unicode consortium, and it is referred to interchangeably as "Unicode" or ISO 10646. The primary difference between ISO 10646 and its predecessors is that it allows up to 4 bytes, or 32 bits, per character. This is required to accommodate all the characters of all the languages of the world. A subset of the full encoding is the 2-byte character set of Unicode. For details, see The Unicode Home Page.
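Most characters in common use fit in the 2-byte subset, but some do not. A brief sketch in Python of inspecting character numbers ("code points"):

```python
# Character numbers: most fit in two bytes, but some require more.
print(hex(ord("A")))           # 0x41    -- plain ASCII
print(hex(ord("中")))          # 0x4e2d  -- fits in two bytes
print(hex(ord("\U0001D11E")))  # 0x1d11e -- a musical symbol beyond the 2-byte range
```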
The key to understanding Unicode is the relationship between number, character, and glyph as described above. Terms to understand are UCS, for "Universal Character Set", and UTF for "UCS Transformation Format". You may encounter these terms in the Character Set menu of your browser.
UCS-2 is the two-byte form of ISO 10646; UCS-4 is the full four-byte specification. Programs may store files in memory or on disk in UCS-2 or UCS-4.
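Both forms are fixed-width: every character occupies the same number of bytes. A sketch using Python's UTF-16 and UTF-32 codecs (without a byte-order mark) as stand-ins for UCS-2 and UCS-4 storage:

```python
# Fixed-width storage: every character takes the same number of bytes.
text = "abc"
print(len(text.encode("utf-16-le")))  # 6  -- two bytes per character (UCS-2 style)
print(len(text.encode("utf-32-le")))  # 12 -- four bytes per character (UCS-4 style)
```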
A UTF is an alternative representation, used to transfer data from one place to another or to store UCS-2 or UCS-4 data more compactly. Three UTFs are currently in use: UTF-8, UTF-7, and UTF-16.
UTF-8 encodes character data in 1 to 6 bytes, as follows: ASCII characters use 1 byte; many UCS-2 characters are represented by 2 bytes; the rest of UCS-2 takes 3 bytes; UCS-4 may take up to 6 bytes.
UTF-16 uses pairs of 2-byte units ("surrogate pairs") to represent characters beyond the range of UCS-2.
UTF-7 is an internet standard to allow UCS-2 data to be sent using only ASCII characters.
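A short sketch of how the three transformation formats behave, assuming Python's standard codecs (the UTF-7 byte sequence shown in the comment is what Python typically produces):

```python
# UTF-8 is variable-width, UTF-16 uses 2-byte units (with surrogate pairs),
# and UTF-7 emits only ASCII bytes.
samples = ["A", "é", "中", "\U0001D11E"]  # 1-, 2-, 3- and 4-byte cases for UTF-8

for ch in samples:
    print(hex(ord(ch)),
          len(ch.encode("utf-8")),      # UTF-8: 1, 2, 3, 4 bytes respectively
          len(ch.encode("utf-16-le")))  # UTF-16: 2 bytes, or 4 for a surrogate pair

print("中".encode("utf-7"))             # something like b'+Ti0-': ASCII bytes only
```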
Unicode/ISO 10646 specifies only the mapping of character to number. It is necessary for internationalization, but it is not by itself sufficient. The language still needs to be specified, because some characters have different glyphs depending on the language in which they are used. Also, some languages are written in directions other than left to right.
HTML 4.0 includes support for labeling documents or parts of a document with the language and direction of text, and provides other features for internationalization. Most importantly, it defines the character set of HTML documents to be ISO 10646.