Unicode – long trails

Yesterday, a new major version of the Unicode Standard, Unicode 6.0, was published, a year after version 5.2 and more than four years after the last major upgrade, 5.0.

There is of course a slew of new stuff in it, and I’m sure I’ll spend a good while digesting at least some of it. The most visible effect of a Unicode Standard revision is the new characters: 2,088 of them, bringing the total to 109,449. Most of them are added to the Supplementary Planes, outside the Basic Multilingual Plane (BMP), and therefore require surrogate pairs in UTF-16. (In other words, they are encoded in UTF-16 using two 16-bit code units, unlike the BMP characters, which use only one.)
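To make the surrogate-pair point concrete, here is a minimal Python 3 sketch. The helper name `utf16_code_units` is my own, not a library function, and it assumes its input is a valid code point that is not itself a surrogate. It splits U+1F427 PENGUIN, one of the new supplementary-plane characters, into its two UTF-16 code units.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) for a single code point.

    Illustrative helper; assumes a valid, non-surrogate code point.
    """
    if code_point < 0x10000:
        # BMP characters fit in a single 16-bit code unit.
        return [code_point]
    # Supplementary planes: subtract 0x10000, then split the remaining
    # 20 bits into a high (lead) and a low (trail) surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


penguin = utf16_code_units(0x1F427)
print([f"{unit:04X}" for unit in penguin])  # ['D83D', 'DC27']
```

In practice you would let the language do this for you, for example with `chr(0x1F427).encode("utf-16-be")`, which yields the same four bytes.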

But enough initialisms: what’s in it? New scripts for languages, of course: Brāhmī, an abugida (an alphasyllabary based on consonants with secondary but required vowel notation), a historical script from India of interest to archaeologists and historical linguists; the Mandaic alphabet, for a variant of Aramaic that has a classical (liturgical Mandaean) form and a vernacular form used by small communities in Iran and Iraq; and Batak, another abugida, used to write an Austronesian language spoken by millions of people in northern Sumatra. In addition, of course, there are numerous updates, additions and improvements to existing scripts.

More striking is the number of new symbols and pictograms, including entirely new blocks. Emoticons (including Western ones, emoji, and also, for example, U+1F648 SEE-NO-EVIL MONKEY, U+1F649 HEAR-NO-EVIL MONKEY and U+1F64A SPEAK-NO-EVIL MONKEY). Useful transport and map symbols. Alchemical symbols certain to be welcomed by historians and fortune-tellers. Playing cards. And the catch-all block “Miscellaneous Symbols And Pictographs”, which brings us hundreds of animals, vegetables, fruit, tools, office symbols, communication symbols and so on, down to stuff like U+1F4A9, useful if you need to represent a pile of dog poop in a comic-book style.

Some have joked that the date was apropos: in the US, it was National Coming Out Day, so fittingly we now have U+1F46C and U+1F46D: TWO MEN HOLDING HANDS and TWO WOMEN HOLDING HANDS.

But to get those new characters on paper or screen, we need fonts. Unfortunately, fonts are often many versions behind, and usually only implement specific ranges or blocks of the standard, depending on the font’s purpose. I opened Google to search for what’s out there already, and thanks to Hacker News found an impressively up-to-date font called Symbola by George Douros, third from the top on this collection of fonts for Ancient scripts.

It downloaded and installed fine on my Mac (running OS X Snow Leopard), but the OS X Character Viewer application has clearly not been updated (I ran Software Update just to make sure I wasn’t missing anything). As the highlighted areas show, neither the character names nor the Unicode blocks are known to OS X just yet.

OS X Character Viewer with new Unicode 6.0 pictogram characters

This doesn’t prevent us from using the font, but in the end, whether the characters are displayed depends on the application. My browser, unfortunately, seems to be stuck on Unicode 5.2. Still, here I bring you U+1F427 PENGUIN, in the hope it will automagically appear on this page as soon as the application stack has caught up:

🐧
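Whether a given piece of software knows a character’s name depends on the version of the Unicode Character Database it ships with, just as with Character Viewer above. As a rough sketch, assuming Python 3 and its standard `unicodedata` module (the `describe` helper and the placeholder text are mine), you can check what your own installation knows about the new characters:

```python
import unicodedata


def describe(code_point):
    """Return 'U+XXXX NAME', or a placeholder if the name is unknown."""
    # The second argument is a default that is returned when the local
    # Unicode database has no name for this character, for example
    # because the database predates Unicode 6.0.
    name = unicodedata.name(chr(code_point), "<unknown to this Unicode database>")
    return f"U+{code_point:04X} {name}"


print("Unicode database version:", unicodedata.unidata_version)
for cp in (0x1F427, 0x1F648, 0x1F46C):
    print(describe(cp))
```

Knowing the name says nothing about display: a glyph still only shows up if an installed font covers the code point and the application is willing to use it.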

Update: I can see it in Firefox! But not in Chrome or Safari, so it may be a WebKit problem.

A few days ago, my friend Melinda Shore, who knows I’m interested in internationalization, sent me a screenshot from the search bar of her Safari browser. It is a drop-down list of search suggestions provided by Google just after typing the letter h:

The top suggestion is a mess:

1. What does it mean?
2. Why is it a legitimate search suggestion for the letter h? (If it is.)

Regarding 1., the search suggestion in Firefox is nearly identical, but I cannot reproduce the effect in Google’s own browser Chrome or on the search page directly. In the Safari example, we’re dealing with an odd mix of regular character strings (6.626068, 10, sup, -34), numeric HTML (or XML) entities (&#215;) and raw Unicode-escaped characters that you might find in Python, C or Java source code (\u003C, \u003E). Let’s decode the second and third type of components:

\u003C and \u003E simply represent the Unicode code points U+003C and U+003E: the less-than and the greater-than signs < and >.
&#215; is U+00D7 MULTIPLICATION SIGN: ×

Putting it together, we get the already much more user-friendly form

6.626068 × 10<sup>-34

or, completing and resolving the HTML: 6.626068 × 10^-34.
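The decoding steps above can be sketched in a few lines of Python 3. The `raw` string is my reconstruction of the suggestion text, not a captured value, and it assumes the numeric entity was the decimal form `&#215;`. The function name `decode_suggestion` is likewise just illustrative.

```python
import html
import re

# Assumed reconstruction of the raw suggestion text (see lead-in).
raw = r"6.626068 &#215; 10\u003Csup\u003E-34"


def decode_suggestion(text):
    """Undo the two layers of escaping found in the suggestion."""
    # Step 1: turn literal \uXXXX escapes into the characters they name.
    text = re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )
    # Step 2: resolve HTML character references such as &#215;.
    return html.unescape(text)


decoded = decode_suggestion(raw)
print(decoded)  # 6.626068 × 10<sup>-34

# Step 3 (display only): render the dangling <sup> tag as a caret.
print(decoded.replace("<sup>", "^").replace("</sup>", ""))  # 6.626068 × 10^-34
```

The order matters little here, but skipping either step leaves exactly the kind of character salad seen in the drop-down list.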

Once I realized this, my physics training kicked in and the answer to 2. became clearer – Planck’s constant, abbreviated as h, has the value of 6.62606889 × 10^-34 J s (or m² kg / s). This is not the result of injecting broken text into the search engine results, but a feature of Google’s calculator. Typing “G” into the browser’s search bar also yields a similar semi-numeric character salad, while the results for “c”, “e” or “pi” are much more legible.

Still, the entire story raises questions about intent and execution. This is not really an internationalization issue because the form of those physical and mathematical constants is largely invariant by convention. Yet, the tools of internationalization — HTML entities, Unicode code point escapes — have leaked into scientific character display, too. Internationalization is a user interface (usability, user experience) issue [1].

On the execution side, Google got it wrong on several counts, and Apple and Mozilla share some of the blame. Browser search bar drop-down lists don’t allow for superscripts and aren’t sophisticated enough to strip markup, so they display ugly raw HTML. Choosing a numeric entity instead of the character × probably led to its display breaking. And < and > are plain ASCII, so they should display fine, but security concerns and their status as reserved HTML characters probably led to the odd choice of escaping method. All in all, at least one decoding step was not carried out.

More fundamentally, should Google suggest “6.626068 × 10^-34 m² kg / s” when you type a lowercase h? There was a time in my life when I used Planck’s constant daily, and I do use Google’s handy calculator via my browser’s search bar for quick arithmetic and unit conversions. But I think just spitting out the value with no label is going a little too far, and will be entirely unexpected for more than 99% of users: too different from the genuinely useful (for Americans) “hotmail”, “hulu” and “home depot”. Especially considering that for most letters of the alphabet, you could probably find a scientific constant, function or theorem that starts with it.

Though maybe it is a ploy to spread more science among the people.

[1] It is also a design issue. The two aren’t mutually exclusive.

Google search result for "h" - very different from the broken suggestion

EDIT (2010-08-02): Commenters inside Facebook’s walled garden have remarked that if you actually take up the suggestion and search for it, you get to Planck’s constant. Currently this is partially right, but these things are constantly shifting: in my tests this morning, whether you use the Safari/Firefox search field or Google’s search page directly, you get a mix of results, the first of which are people wondering about the odd string on SEO forums. A little further down you do get collections of scientific constants, but you have to read the results attentively. Right now, this post is (after less than 12h) number 7 on the results page. None of the pages looks like what you get if you do a Google search for “h” (and hit return), which is nice and helpful.

Another commenter remarks that for her, the suggestion is now prefaced with “Planck’s constant”, which is a vast improvement.

Tag: Unicode

Welcome Unicode 6.0 and your crazy stable of symbols

Google’s h mystery