Friday, May 16, 2008

Google & Unicode - Misinformation

Some of the discussions about Google moving to Unicode 5.1 had some misperceptions:
  • UTF-8 doubles the storage of web pages. UTF-8 is one of the ways Unicode is encoded. It varies from 1 to 4 bytes per character. For languages using A-Z there is no penalty: 1 byte per character, just like ASCII. Moreover, because web pages on average have so much ASCII in them (markup, JavaScript, spaces between words, etc.), even for languages taking more than 1 byte per character, the average cost is much lower than people suspect.
  • Google is miscounting the encodings by using "stated encodings". We cannot depend on what the page (or server) says about its encoding (or language), because it is far too often wrong or missing. We use encoding detection to determine the encoding, and language detection to determine the language. The stated values can't be used as more than a hint as to the real value.
  • Google is counting ASCII as UTF-8. While ASCII is valid UTF-8 (and also valid Latin-1, and so on), we choose encodings based on the "narrowest" valid one. So the ASCII pages for the chart are only those whose bytes all fall in the range 00 to 7F.
  • Chinese use on the web is declining. That couldn't be further from the case. The encoding is very different from the language. With HTML I can encode Chinese in GB2312, or Unicode (e.g. UTF-8), or even ASCII (using a numeric character reference such as &#x4E2D; to represent 中).
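A few of the points above are easy to check for yourself. The Python sketch below is only an illustration, not the detection code Google uses: the function names are made up for this example, and narrowest_encoding is a deliberately crude stand-in that only separates pure ASCII, valid UTF-8, and everything else.

```python
import html


def utf8_len(text: str) -> int:
    """Number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


def narrowest_encoding(data: bytes) -> str:
    """Hypothetical classifier: pick the narrowest label that fits the bytes.

    A page counts as ASCII only if every byte is in the range 00 to 7F.
    Real encoding detection is statistical and far more involved.
    """
    if all(b <= 0x7F for b in data):
        return "ASCII"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "other"


def to_ascii_html(text: str) -> str:
    """Write any text as pure ASCII, using numeric character references."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # 1 to 4 bytes per character, depending on the character.
    for ch in ["A", "é", "中", "😀"]:
        print(repr(ch), utf8_len(ch), "byte(s)")

    # Markup is ASCII, so it pulls the per-character average down.
    page = '<p class="x">中文</p>'
    print(utf8_len(page), "bytes for", len(page), "characters")

    # The same Chinese text, three encodings, three different labels.
    print(narrowest_encoding("中文".encode("utf-8")))
    print(narrowest_encoding("中文".encode("gb2312")))
    print(narrowest_encoding(to_ascii_html("中文").encode("ascii")))

    # A browser turns the reference back into the character.
    print(html.unescape("&#x4E2D;"))
```

Note that the last case is why a chart of encodings says nothing about languages: a page written entirely in Chinese can show up in the ASCII column.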
Some questions people asked:
  • What about UTF-16? UTF-16 is the other main form of Unicode, but isn't nearly as frequent on the web. It is good for internal processing, but for web pages UTF-8 is generally better for reducing size and because of its ASCII compatibility.
  • Since any ASCII page is also UTF-8, doesn't that mean that the web is 50% UTF-8? While that's one way to look at it, it'd be misleading, mostly because most encodings also include ASCII. By that logic 48% would be Latin-1/1252, 28% would be GB2312, 27% would be SJIS, and so on.
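Both answers can be tried out in a few lines of Python. This is a rough sketch under stated assumptions: the helper names and the sample page are invented for the example, and the candidate list is just a handful of codecs that ship with Python, not the set behind the chart.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in UTF-8 and UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


def encodings_accepting(data: bytes,
                        candidates=("ascii", "utf-8", "latin-1",
                                    "cp1252", "gb2312", "shift_jis")) -> list:
    """Return every candidate encoding that can decode the bytes without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


if __name__ == "__main__":
    # Bare Chinese text is smaller in UTF-16, but add markup and UTF-8 wins.
    print(encoded_sizes("中文"))
    print(encoded_sizes('<p class="x">中文</p>'))

    # A pure-ASCII page is valid in every one of these encodings,
    # so counting it as UTF-8 would be as arbitrary as counting it as SJIS.
    ascii_page = b"<html><body>Hello, world</body></html>"
    print(encodings_accepting(ascii_page))
```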
I have two related presentations on my home page, macchiato.com (and will be talking more about this topic at the next Unicode conference).
  • Unicode at Google
  • Unicode Myths