6.813/6.831 — User Interface Design & Implementation
Spring 2018

Reading 19: Internationalization

UI Hall of Fame or Shame?

Source: i18nguy.com

Today’s reading concerns internationalization: supporting users who speak different languages and have different cultural conventions. For instance, consider our two Hall of Fame or Shame candidates. The first one shows an advertisement for a laundry detergent. A Western reader will clearly be able to parse this sequence of images: put dirty shirt into laundry machine, add detergent, get clean shirt. But what if the same advertisement were displayed in the Middle East or in Asia, where people follow different reading conventions?

UI Hall of Fame or Shame?

Source: i18nguy.com

Consider this second example: the all-too-familiar VCR, which uses arrows to depict its first, previous, next, and last functions. What would a Western user expect when pushing the outer left button? What would an Eastern user expect?

Internationalization & Localization

In the next sections, we will talk more about the reasons why internationalization can be hard, and discuss some of the support that exists in GUI toolkits for making it easier.

A good source of information about this problem is Java Internationalization, by Andy Deitsch and David Czarnecki (O’Reilly, 2001). There’s also a trail in the Java Tutorial about Java’s internationalization features.

Internationalization and Localization

Interfaces with international user populations - such as Microsoft Word, shown here - have to be carefully designed to make them easy to adapt to other languages and cultures. The process of making a user interface ready for translation is called internationalization (often called i18n for short - “18” because it replaces 18 characters in the middle of “internationalization”).

Essentially, internationalization separates the language-specific parts of the interface from the rest of the code, so that those parts can be easily replaced. The translation is usually done by nonprogrammers, so their job is easier if the textual messages are separate from the code. Actually doing this translation for a particular language and culture is called localization.

One way to understand the difference between these two technical terms is by analogy to portability across operating system platforms. If you write your program carefully so that it doesn’t depend on specific features of an operating system or processor, you’ve made it portable. Making a program portable is analogous to internationalizing it. Actually porting it to another particular platform, e.g., by recompiling it, is analogous to localizing it.

Unfortunately, localization is much harder than merely knowing what words to substitute (and online translators like Babelfish and Google Translate can only barely do that, so don’t rely on them!). You can’t necessarily rely on bilingual members of your design team, either. They may be reasonably fluent in the other language, but not sufficiently immersed in the culture or national standards to notice all the places where the application needs to change. “You are not the user” is especially true in internationalization.

Design Challenges

Translation

  • All user-visible text has to be translated

    • Object approach
      <button> OK </button>

    • Stroke approach
      canvas.fillText("Name:",...)

    • Pixel approach

  • Error messages too

Here are some of the reasons why internationalization is hard.

First, every piece of text that might be shown to the user is a potential candidate for translation. That means not just properties of components (like menu items and button labels), but also text drawn with stroke drawing calls, and text embedded in pixel images (like this one taken from the MIT EECS web page). Translation can easily change the size or aspect ratio of the text; German labels tend to be much longer than English ones, for example.

Error messages also need to be translated, of course - which is another reason not to expose internal system names in error messages. An English-reading user might be able to figure out what FileNotFoundException means, but it won’t internationalize well.

Risks of Translation

Source: BBC

Here’s a sign from Wales, where official signs are required to be bilingual (English and Welsh). The English is clear enough to English-speaking lorry drivers - but the Welsh actually reads “I am not in the office at the moment. Send any work to be translated.” The translation was outsourced by email, you see…

There’s a larger lesson here that translation without sufficient context can lead to errors. The BBC article cited just above has some amusing examples of other English/Welsh signs that are mistranslated (“staff” → “wooden stick”) because the translator wasn’t fully aware of the context.

Different Scripts

  • Cyrillic

  • Hangul (Korean)

  • Chinese

  • Greek

  • Arabic

Different languages obviously use scripts other than the Latin alphabet. Here are some of the scripts that Windows, Mac, and web browsers all support.

It’s important to distinguish between script (or alphabet) and language. Western languages like English, French, German, and Italian are different languages that all use the Latin alphabet (basically). Russian, Ukrainian, and Bulgarian (among others) use the Cyrillic alphabet.

Text Direction

  • Some scripts don’t read left-to-right

    • Arabic, Hebrew are right-to-left

    • Affects drawing, screen layout, even icons

Many scripts are not even written left-to-right; Arabic and Hebrew are the most common languages with scripts written right-to-left. CJK (Chinese, Japanese, Korean) characters are usually written left-to-right, but can also appear vertically (top-to-bottom) and occasionally even right-to-left. Reversing the direction of reading requires reversing the entire layout of your screen, since the user is now accustomed to starting from the right edge when they scan. It might even affect the “natural” direction of arrow icons.

The picture above shows the Hebrew version of Firefox. Notice that the menu bar is reversed (it starts from the right, and the rightmost menu is the File menu), the toolbar is reversed, and the Back button (which is now the rightmost button) is now pointing to the right! The URL box isn’t reversed, however, because it uses the Latin alphabet, not Hebrew.

This is another common wrinkle in right-to-left languages: when they embed foreign words, or Arabic numbers, the embedded words go in left-to-right order. So the text might be constantly switching direction.

Sort Order

  • Unicode order isn’t even right for English

  • Uppercase/lowercase, accents affect order

  • Norwegian: … x y z æ ø å

  • Traditional Spanish: c, ch, d, …, l, ll, m, …

Sorting, or collation, is another way that languages differ. In software, each character is represented by a number (or a short sequence of numbers). This mapping is the character encoding. Modern programming languages use the Unicode character set, and may represent each character internally by a 16-bit or 32-bit number (since Unicode has now outgrown its original 16 bits).
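
As a rough illustration of characters as numbers, the sketch below inspects a few strings in JavaScript, whose strings are sequences of 16-bit code units. The variable names are only for illustration.

```javascript
// Each character maps to a number (its Unicode code point).
const upperA = "A".codePointAt(0);       // 65
const lowerA = "a".codePointAt(0);       // 97
const aRing = "\u00E5".codePointAt(0);   // 229, the letter å

// A character outside the original 16-bit range needs two 16-bit code units.
const grin = "\u{1F600}";                // grinning face emoji
const codeUnits = grin.length;           // 2 (length counts code units)
const codePoints = [...grin].length;     // 1 (string iteration walks code points)

console.log(upperA, lowerA, aRing, codeUnits, codePoints);
```

Note that uppercase A (65) and lowercase a (97) are far apart, and å (229) is nowhere near either of them.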

But the ordering of these numbers doesn’t necessarily match the conventional ordering of the characters in the language, so sorting text with < or String.compareTo() is almost certainly wrong. It’s even wrong for English! Unicode groups the uppercase and lowercase letters separately, so that the sort order by < would be ABC…XYZ…abc…xyz…

Similarly, in most European languages, accented characters are sorted with the plain version of the character, even though the Unicode characters may be nowhere near each other in numerical order. And that general rule is not true in Norwegian, where å actually appears at the end of the alphabet, after z.
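
The difference between numeric order and conventional order can be sketched in JavaScript with the built-in Intl.Collator. The word lists are made up for illustration, and the Norwegian result assumes the runtime ships locale data for "nb" (recent browsers and Node.js builds generally do).

```javascript
const words = ["apple", "Zebra", "Banana", "cherry"];

// The default sort compares UTF-16 code units, so uppercase sorts before lowercase.
const byCodeUnit = [...words].sort();
// → ["Banana", "Zebra", "apple", "cherry"]

// Intl.Collator applies the collation rules of the given locale.
const byEnglish = [...words].sort(new Intl.Collator("en").compare);
// → ["apple", "Banana", "cherry", "Zebra"]

const letters = ["z", "\u00E5", "a"]; // \u00E5 is å

// English sorts å together with a; Norwegian puts it after z.
const inEnglish = [...letters].sort(new Intl.Collator("en").compare);
// → ["a", "å", "z"]
const inNorwegian = [...letters].sort(new Intl.Collator("nb").compare);
// → ["a", "z", "å"]

console.log(byCodeUnit, byEnglish, inEnglish, inNorwegian);
```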

Formatting

  • Numbers

    US/UK:    72,350.55

    France:   72 350,55

    Germany:  72.350,55

  • Date & time formatting

    US:                    10/31/2006 (M/D/Y)

    Most other countries:  31/10/2006 (D/M/Y)

Number formats and date formats also vary - not just by language, but by country. In the US, commas separate thousands and millions, and a period marks the decimal point, as in “72,350.55”. But the convention in Germany is precisely the opposite: “72.350,55”. Even countries that share the same language may differ on conventional formats. Americans tend to write dates as MM/DD/YY, but the British write DD/MM/YY (as does most of the rest of the world).

The target for localization therefore needs to be specified by a language/country pair, also called a locale, such as US English, UK English, or Canadian French.
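
In JavaScript, a locale is usually written as a tag such as "en-US" or "fr-CA". The sketch below, with illustrative variable names, uses the built-in Intl.Locale to split such tags into their language and country parts.

```javascript
// A locale tag combines a language code with a country (region) code.
const canadianFrench = new Intl.Locale("fr-CA");
console.log(canadianFrench.language); // "fr"
console.log(canadianFrench.region);   // "CA"

// US English and UK English share a language but differ in region.
const tags = ["en-US", "en-GB", "fr-CA"];
const parsed = tags.map((tag) => {
  const { language, region } = new Intl.Locale(tag);
  return { tag, language, region };
});
console.log(parsed);
```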

Color Conventions

US

China

White

Red

Localizing a user interface requires knowing about the cultural associations attached to symbols or colors, and making sure you don’t send the wrong message.

For example, colors have different meanings in different cultures. In East Asia, particularly China, white is associated with death, and is used as a color theme for funerals. In the West, on the other hand, white is a symbol of purity, and brides wear white at their weddings. Traditional Chinese weddings involve a lot of red, because it symbolizes luck. Western cultures don’t have the same association for red.

Icons

  • Familiar icons in one culture aren’t in others

Icons must also be carefully chosen, or replaced when the interface is localized. Metaphorical icons that refer to everyday objects like mailboxes and stop signs aren’t necessarily recognizable, because the objects may look different in different countries. (Stop signs are actually pretty universal, however - I had to look hard to find a stop sign that wasn’t a red octagon, like this Japanese inverted triangle.) Hand gestures pictured as icons may actually be offensive in some countries. And visual puns are always a bad idea - an English-speaking designer might think it’s cute to use a picture of a table (the furniture) to represent table (the 2D grid), because the words are the same in English. But the words in German are Tisch (furniture) and Tabelle (grid), so a German may find the joke incomprehensible.

reading exercises

Exercise

When localizing a website, it is important to pay attention to (choose all good answers):


Implementation Techniques

Implementation Support for I18N

  • Message files

  • Unicode

  • Bidirectionality

  • Formatting libraries

  • Separating structure from presentation

Now that we’ve surveyed the challenges, let’s talk about some solutions. Modern UI toolkits provide support that makes it easier to implement internationalized interfaces.

Message Files

  • A message file separates localizable messages from source code

    • Also called resource files or resource bundles

  • Human translators generate a message file for each supported locale

    • Doesn’t require translators to read source code or recompile

  • Messages with dynamic parts can be tricky

    • {N} users have visited since {date}

The most basic feature for internationalization is supporting translation into different languages. UI toolkits that support this allow textual messages to be stored separately from the code, as dynamically loadable, structured text files, so that an application can be localized simply by replacing those text messages. The messages are referred to by names, such as "file-menu-label".

This is an example of the general strategy for internationalization. First, use abstraction to isolate the parts of your system that need to change from one locale to another, separating them from the rest of your program. (This is an application of a familiar software engineering rule - if you know something will change, isolate it.) Second, as much as possible, design these locale-specific parts so that they don’t require reading source code or recompiling the program, so that localization can be done by nonprogrammers.
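
A minimal sketch of the message-file idea follows. The messageFiles object, the "ok-button" name, and the getMessage helper are hypothetical. In a real application, each locale’s messages would live in a separate file that translators can edit without touching the code.

```javascript
// Hypothetical message files, one per locale, keyed by message name.
const messageFiles = {
  en: { "file-menu-label": "File", "ok-button": "OK" },
  de: { "file-menu-label": "Datei", "ok-button": "OK" },
  fr: { "file-menu-label": "Fichier" }, // "ok-button" not yet translated
};

const DEFAULT_LOCALE = "en";

// Look up a message by name, falling back to the default locale when the
// requested locale has no translation, and to the name itself as a last resort.
function getMessage(locale, name) {
  const messages = messageFiles[locale] ?? {};
  return messages[name] ?? messageFiles[DEFAULT_LOCALE][name] ?? name;
}

console.log(getMessage("de", "file-menu-label")); // Datei
console.log(getMessage("fr", "ok-button"));       // OK (English fallback)
```

The rest of the program refers only to names like "file-menu-label", so adding a locale means adding one more message file.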

Internationalization gets a little tricky when a message has dynamic parts, like “25 users have visited since January 1”. In an uninternationalized program, you might simply concatenate in your source code: num + " users have visited since " + date. For internationalization, you need to give the translator flexibility to put the dynamic parts anywhere, using a format like “%1 users have visited since %2”, so that it could be rewritten as “Since %2, %1 users have visited” if the language demands it. And you also need to think about plurals, usually by having different versions of the entire message that depend on the value of num:

num == 0 → “%1 users have visited since %2”

num == 1 → “%1 user has visited since %2”

num > 1 → “%1 users have visited since %2”

Format.js supports pluralization rules that make this task somewhat easier. But be careful - Arabic has a different plural form when num == 2 than when num > 2.
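
The sketch below shows one way placeholders and plural forms might fit together, using the built-in Intl.PluralRules rather than Format.js. The visitMessages table and the fillTemplate and visitMessage helpers are hypothetical, and the Arabic categories assume the runtime ships locale data for "ar".

```javascript
// Hypothetical message file entry: one template per plural category.
const visitMessages = {
  en: {
    one: "%1 user has visited since %2",
    other: "%1 users have visited since %2",
  },
};

// Replace %1, %2, ... with the corresponding argument, wherever they appear.
function fillTemplate(template, ...args) {
  return template.replace(/%(\d+)/g, (match, index) => {
    const value = args[Number(index) - 1];
    return value === undefined ? match : String(value);
  });
}

// Pick the template for the plural category of num, then fill it in.
function visitMessage(locale, num, date) {
  const category = new Intl.PluralRules(locale).select(num);
  const forms = visitMessages[locale];
  const template = forms[category] ?? forms.other;
  return fillTemplate(template, num, date);
}

console.log(visitMessage("en", 1, "January 1"));  // 1 user has visited since January 1
console.log(visitMessage("en", 25, "January 1")); // 25 users have visited since January 1

// Arabic distinguishes more plural categories than English does.
const arabic = new Intl.PluralRules("ar");
const arabicCategories = [0, 1, 2, 3, 11, 100].map((n) => arabic.select(n));
console.log(arabicCategories); // zero, one, two, few, many, other
```

Because the placeholders are numbered, a translator can reorder them freely, as in “Since %2, %1 users have visited”.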

Bidirectionality

  • Bidirectional text display and editing

    • String in memory: This is arabic text

    • Drawn on screen:

      (base direction English) This is txet cibara

      (base direction Arabic) txet cibara This is

  • Bidirectional layout

To handle languages that read right-to-left, UI toolkits like HTML/CSS provide support for bidirectional text (sometimes called “BiDi” or BIDI for short). The trickiest part here is that Unicode strings may (and often do!) mix characters from multiple scripts: Arabic and English, for example. A good UI toolkit will ensure that when you draw such a string to the screen, it draws the appropriate characters in the appropriate order. There must be a base direction that determines whether the whole string starts at the left or the right; if the interface is primarily English, for example, then the base direction should be left to right, but if it’s primarily Arabic, the base direction should be right to left. To avoid messing up bidirectionality, don’t try to draw a sentence in little pieces; instead, put together a string first, and draw it all at once, letting the toolkit figure it out. (If you’re using message files properly, of course, this will happen anyway.)

International toolkits must also support bidirectional text editing, making (for example) arrow keys and selection work in the correct direction for the script.

Automatic layout managers can also support bidirectionality. In CSS, for example, table cells are laid out either left-to-right or right-to-left, depending on the HTML dir attribute or CSS direction property of the parent.
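
As a sketch of how a page might set its base direction, the code below chooses a value for the HTML dir attribute from a locale tag. The RTL_LANGUAGES set is a small hard-coded assumption for illustration, not a complete list, and the helper names are hypothetical.

```javascript
// Assumption: a short, incomplete list of right-to-left languages.
const RTL_LANGUAGES = new Set(["ar", "he", "fa", "ur"]);

// Decide the base direction from the language part of a locale tag.
function baseDirection(localeTag) {
  const { language } = new Intl.Locale(localeTag);
  return RTL_LANGUAGES.has(language) ? "rtl" : "ltr";
}

// Build the opening <html> tag so that the browser lays out the whole page,
// including table cells, in the locale's base direction.
function htmlOpenTag(localeTag) {
  return `<html lang="${localeTag}" dir="${baseDirection(localeTag)}">`;
}

console.log(htmlOpenTag("en-US")); // <html lang="en-US" dir="ltr">
console.log(htmlOpenTag("he-IL")); // <html lang="he-IL" dir="rtl">
```

Setting the direction once at the top of the page leaves the ordering of mixed-script strings to the toolkit, rather than to application code.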

Formatting Libraries

  • JavaScript library support for parsing and printing numbers and dates
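
As one example of such support, the sketch below uses JavaScript’s built-in Intl.NumberFormat and Intl.DateTimeFormat to print the same number and date for several locales. The helper names are hypothetical. Note that Intl only formats values; parsing localized input would need a separate library.

```javascript
const amount = 72350.55;
// 31 October 2006 in UTC (months are 0-based in JavaScript).
const halloween = new Date(Date.UTC(2006, 9, 31));

function formatNumber(locale, value) {
  return new Intl.NumberFormat(locale).format(value);
}

function formatDate(locale, date) {
  // Format in UTC so the result does not depend on the machine's time zone.
  return new Intl.DateTimeFormat(locale, { timeZone: "UTC" }).format(date);
}

for (const locale of ["en-US", "en-GB", "de-DE"]) {
  console.log(locale, formatNumber(locale, amount), formatDate(locale, halloween));
}
// en-US 72,350.55 10/31/2006
// en-GB 72,350.55 31/10/2006
// de-DE 72.350,55 31.10.2006
```

For "fr-FR", the thousands separator produced is a no-break space character rather than an ordinary space, so comparing formatted output against hand-typed strings can fail unexpectedly.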

Separating Structure From Presentation

  • Replaceable icons and images

  • Fonts

  • Colors

Finally, to handle other changes that localization might impose, it helps to isolate details of the presentation. Images and icons might need language translation (if they contain text) or cultural translation (if they use unfamiliar symbols). Fonts might need to change to handle different scripts, since fonts rarely have glyphs for every script in Unicode. And colors might need to change if they have cultural problems.

For web programming, CSS makes this kind of separation easier.

reading exercises

Exercise

Which of the following are useful implementation techniques for localization? (choose all good answers):


Summary

  • Internationalization abstracts a user interface so that it can be localized for different locales

    • Languages

    • Scripts

    • Formatting conventions

    • Cultures