Internal aspects

Suppose that four elementary steps are selected at path optimization time. Then recode will split itself into four different tasks interconnected with pipes, logically equivalent to:

step1 <input | step2 | step3 | step4 >output

Overall organization

While initializing all conversion modules, the main driver constructs a table listing every available conversion routine (single step) together with its starting charset and its ending charset. If we consider these charsets as the nodes of a directed graph, each single step may be considered as an oriented arc from one node to another. A cost is attributed to each arc: for example, a high penalty is given to single steps which are prone to losing characters, a low penalty is given to those which need to study more than one input character to produce an output character, etc.

Given a starting code and a goal code, recode computes the most economical route through the elementary recodings, that is, the best sequence of conversions that will transform the input charset into the final charset. To speed up execution, recode looks for subsequences of conversions which are simple enough to be merged; it then dynamically creates new single steps for these mergings.
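The route search described above can be illustrated with a small, self-contained sketch. It is not taken from the recode sources: the structure `demo_arc`, the function `demo_find_route`, and the limit `DEMO_MAX_CHARSETS` are all hypothetical names, and the text does not say which algorithm recode actually uses. The sketch applies plain Bellman-Ford relaxation and assumes small, non-negative costs.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical model of one single step: an arc between two charsets.  */
struct demo_arc
{
  int before;			/* index of the starting charset */
  int after;			/* index of the ending charset */
  int cost;			/* penalty; higher means less desirable */
};

#define DEMO_MAX_CHARSETS 16

/* Find the cheapest sequence of arcs leading from START to GOAL by
   repeated relaxation (Bellman-Ford).  Store the indices of the chosen
   arcs, in order, into ROUTE and return how many there are, or -1 if
   GOAL cannot be reached.  Costs are assumed small and non-negative.  */
int
demo_find_route (const struct demo_arc *arcs, size_t narcs,
		 int ncharsets, int start, int goal, size_t *route)
{
  int cost[DEMO_MAX_CHARSETS];
  int via[DEMO_MAX_CHARSETS];	/* arc used to reach each charset */
  int pass, node, length, i;
  size_t a;

  assert (ncharsets > 0 && ncharsets <= DEMO_MAX_CHARSETS);
  for (node = 0; node < ncharsets; node++)
    {
      cost[node] = INT_MAX;	/* INT_MAX marks "not reached yet" */
      via[node] = -1;
    }
  cost[start] = 0;

  /* ncharsets - 1 passes are enough to settle every shortest path.  */
  for (pass = 1; pass < ncharsets; pass++)
    for (a = 0; a < narcs; a++)
      if (cost[arcs[a].before] != INT_MAX
	  && cost[arcs[a].before] + arcs[a].cost < cost[arcs[a].after])
	{
	  cost[arcs[a].after] = cost[arcs[a].before] + arcs[a].cost;
	  via[arcs[a].after] = (int) a;
	}

  if (cost[goal] == INT_MAX)
    return -1;

  /* Walk back from the goal once to count the arcs, then once more to
     fill ROUTE from its end towards its beginning.  */
  length = 0;
  for (node = goal; node != start; node = arcs[via[node]].before)
    length++;
  i = length;
  for (node = goal; node != start; node = arcs[via[node]].before)
    route[--i] = (size_t) via[node];
  return length;
}
```

With four charsets and the arcs 0->3 (cost 10), 3->2 (cost 10), 0->1 (cost 2) and 1->2 (cost 2), a request to go from charset 0 to charset 2 selects the two cheap arcs through charset 1 rather than the costly pair through charset 3.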

A double step is a sequence of two single steps, the output of the first being the special charset rfc1345, the input of the second single step being also rfc1345. A special machinery dynamically produces efficient, reversible, merge-able single steps out of these double steps.

The main part of recode is written in C, as are most single steps. A few single steps need to recognize sequences of multiple characters; these are often better written in flex.

Adding new charsets

It is easy for a programmer to add a new charset to recode. All it requires is making a few functions kept in a single `.c' file, adjusting `Makefile.in', and remaking recode.

One of the functions should convert from any previous charset to the new one. Any previous charset will do, but try to select it so you will not lose too much information while converting. The other function should convert from the new charset to any older one. You do not have to select the same old charset as the one you selected for the previous routine. Once again, select any charset for which you will not lose too much information while converting.

If, for either of these two functions, you have to read multiple bytes of the old charset before recognizing the character to produce, you might prefer programming it in flex in a separate `.l' file. Prototype your C or flex files after one of those which exist already, so as to keep the sources uniform. Besides, at make time, all `.l' files are automatically merged into a single big one by the script `mergelex.awk', which requires sources to follow some rules. Imitation is a simple approach which relieves me of explaining all these rules!

Each of your source files should have its own initialization function, named module_charset, which is meant to be executed quickly, once, prior to any recoding. It should declare the name of your charsets and the single steps (or elementary recodings) you provide, by calling declare_step one or more times. Besides the charset names, declare_step expects a description of the recoding quality (see `recode.h') and two functions you also provide.

The first such function has the purpose of allocating structures, preconditioning conversion tables, etc. It is also the usual way of further modifying the STEP structure. This function is executed only if and when the single step is retained in an actual recoding sequence. If you do not need such delayed initialization, merely use NULL for the function argument.
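The idea of delayed initialization can be sketched as follows. Everything here is hypothetical and only mimics the behaviour described above: `struct demo_step` is a stand-in, not the real STEP structure from `recode.h', and `demo_prepare_step' and `demo_init_upcase' are invented names. The example table assumes an ASCII-compatible execution charset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the STEP structure, reduced to what this
   sketch needs.  */
struct demo_step
{
  const char *before;			/* name of the starting charset */
  const char *after;			/* name of the ending charset */
  void (*init) (struct demo_step *);	/* delayed initialization, or NULL */
  const unsigned char *one_to_one;	/* table preset by the init function */
  int initialized;			/* nonzero once prepared */
};

static unsigned char demo_upcase_table[256];

/* Delayed initialization: build the conversion table only when the step
   is really going to be used, then record it in the step.  */
void
demo_init_upcase (struct demo_step *step)
{
  int c;

  assert (step != NULL);
  for (c = 0; c < 256; c++)
    demo_upcase_table[c]
      = (unsigned char) (c >= 'a' && c <= 'z' ? c - 'a' + 'A' : c);
  step->one_to_one = demo_upcase_table;
}

/* Run the delayed initialization of STEP at most once.  A NULL init
   function means the step needs no such preparation.  */
void
demo_prepare_step (struct demo_step *step)
{
  if (!step->initialized)
    {
      if (step->init != NULL)
	step->init (step);
      step->initialized = 1;
    }
}
```

In this sketch, a driver would call `demo_prepare_step' only for the steps retained in the chosen sequence, so steps which are declared but never used cost nothing beyond their declaration.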

The second function executes the elementary recoding on a whole file. There are a few cases when you can spare writing this function:

Some single steps do nothing else than a pure copy of the input onto the output; in this case, you can use the predefined function file_one_to_one, while having a delayed initialization for presetting the STEP field one_to_one to the predefined value one_to_same.
Some single steps are driven by a table which recodes one character into another; if the recoding does nothing else, you can use the predefined function file_one_to_one, while having a delayed initialization for presetting the STEP field one_to_one with your table.
Some single steps are driven by a table which recodes one character into a string; if the recoding does nothing else, you can use the predefined function file_one_to_many, while having a delayed initialization for presetting the STEP field one_to_many with your table.
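The two kinds of table-driven recoding listed above can be sketched like this. The functions below are illustrations only: they are not file_one_to_one or file_one_to_many, they work on memory buffers instead of whole files, and the convention that a NULL entry in the string table means "copy the character unchanged" is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One character in, one character out.  TABLE has 256 entries; an
   identity table gives a pure copy.  OUT must hold LENGTH bytes.  */
void
demo_one_to_one (const unsigned char *table,
		 const unsigned char *in, size_t length, unsigned char *out)
{
  size_t i;

  assert (table != NULL);
  for (i = 0; i < length; i++)
    out[i] = table[in[i]];
}

/* One character in, a string out.  TABLE has 256 entries; a NULL entry
   copies the input character unchanged (assumption of this sketch).
   Return the number of bytes written to OUT, or (size_t) -1 if OUT,
   of size OUT_SIZE, is too small.  No terminating NUL is written.  */
size_t
demo_one_to_many (const char *const *table,
		  const unsigned char *in, size_t length,
		  char *out, size_t out_size)
{
  size_t i, used = 0;

  for (i = 0; i < length; i++)
    {
      const char *piece = table[in[i]];
      size_t n = piece != NULL ? strlen (piece) : 1;

      if (n > out_size - used)
	return (size_t) -1;
      if (piece != NULL)
	memcpy (out + used, piece, n);
      else
	out[used] = (char) in[i];
      used += n;
    }
  return used;
}
```

Keeping the whole recoding in a table, rather than in code, is what lets one generic routine serve many single steps: only the table differs from one step to the next.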

If you have a recoding table handy in a suitable format but do not use one of the predefined recoding functions, it is still a good idea to use a delayed initialization to save it anyway, because recode option -h will take advantage of this information when available.

Finally, edit `Makefile.in' to add the source file name of your routines to the C_STEPS or L_STEPS macro definition, depending on whether your routines are written in C or in flex. For C files only, also modify the STEPOBJS macro definition.