soledad penadés
repeat 4[fd 100 rt 90]

Reasons for using UTF-8

Encoding is a confusing subject. At first you never really know what the differences between encodings are or, more importantly, what the consequences of choosing ISO-8859 over UTF-8 will be. Now that I am starting to have better arguments than "trust me, I think this is the right decision", I want to share what I know. And of course, please correct me where I am wrong!

The main problem is the development platform, which is Windows most of the time, and its default encoding, which seems to be ISO-8859. Most web developers live in countries where ISO-8859 is more than enough (Europe, North America, etc.), and most of them also tend to use Windows. So their servers are set to ISO-8859, their databases are created in ISO-8859, and the code, the templates and, by extension, the pages come out in ISO-8859 automatically. (I have also noticed that Eclipse sets its default encoding to CP-1252 on every platform, which keeps puzzling me!)

That is fine if you never expect to have any non-ISO-8859 characters in your content, but that only happens in very specific cases, often when you are the only person entering content. Most of the websites you build will probably let people from all around the world register and submit their own content, and here's where the fun begins:

  1. Even if the site is in English, people's names are still in their own language. Let them enter their names with their own characters and don't force them to pseudo-translate them into English. Names are just one example: the same goes for book and movie titles, music albums, etc.
  2. If you aggregate feeds from other sites, they will most probably come in UTF-8. If your site is not in UTF-8, you'll have to either use utf8_decode (in PHP), which loses every character ISO-8859-1 lacks, or convert that text into HTML entities.
  3. If you use Flash with dynamic content (which you generate), it will expect that content to be encoded in UTF-8. There's no way of changing that unless you mess around with the evil System.useCodepage setting (but that's a bad idea).
  4. If you use AJAX, you need to return UTF-8 content, just like in the Flash case.
  5. If you expect to reuse the content of your non-UTF-8 website in other applications that do support UTF-8 but are not web based (for example, a reporting system), and you used the HTML entities trick to store that content in your database, you'll have to convert the HTML entities back into UTF-8 or something like it (and fingers crossed!).
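
To make points 2 and 5 a bit more concrete, here is a minimal Python sketch (Python rather than PHP, and the sample name is only an assumption for illustration). It shows what can happen to a Japanese name in an ISO-8859-1 pipeline: it cannot be encoded at all, the lossy route turns it into question marks, and the HTML entities trick only works if every consumer remembers to unescape it again.

```python
import html

# Hypothetical user input: a Japanese name.
name = "山田太郎"

# 1. ISO-8859-1 has no codes for these characters, so strict encoding fails.
try:
    name.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# 2. Lossy conversion: every unsupported character is replaced by "?".
lossy = name.encode("iso-8859-1", errors="replace")

# 3. The HTML entities trick: store numeric character references instead.
as_entities = name.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
# Any non-web consumer (a report, an export...) has to undo this step.
restored = html.unescape(as_entities)

# 4. With UTF-8 the text round-trips without any special handling.
utf8_bytes = name.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(fits_latin1, lossy, as_entities, restored == name, round_trip == name)
```

The entities version does survive, but it is no longer the text itself: searching, sorting or measuring its length in the database now operates on the escaped form.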

Considering all the situations above, it's easy to see that it's better to use UTF-8 right from the beginning.

In that case:

  1. People can register using their normal name. Japanese people (for example) will be happy.
  2. Aggregate everything you want and don't worry about external feeds having characters that your page encoding doesn't include.
  3. Flash will be happy. Now you just need to make sure to embed all the characters you may need - but that's another story.
  4. AJAX will be happy too.
  5. Generate reports without having to mess with HTML entities. What you query is what you need.

… and best of all, as long as your system is properly set up, you don't need to do anything special about UTF-8 in your code. You just need to think about the content and stop worrying about utf8_encode, utf8_decode, htmlentities and all that mess!

Did I convince you to use UTF-8?

// 16 responses to Reasons for using UTF-8

esaiz
20071203

Anyone who builds two websites ends up convinced to use UTF-8, both in the encoding and in the database.

JP
20071204

Windows has been capable of handling Unicode applications for years (since NT 3.5, I think). The problem lies in application development: it is still widespread practice not to use Unicode.
Also, assuming that everything on the web is UTF may lead to some bitter surprises: not every Unix defaults to UTF-8 (namely the *BSDs, some old/weirdo Linux distros, Solaris?), and you will run into problems in those situations where you can't mess with the server.
I still believe the best strategy is to assume nothing about the encoding: define the current encoding as some sort of variable and then explicitly convert from/to UTF-8 when needed. If someday everything is UTF compatible, fine - just bypass the conversion and your app will run happily.
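
A rough Python sketch of that strategy might look like the following. The names SYSTEM_ENCODING, to_utf8 and from_utf8 are hypothetical, and falling back to numeric character references for unsupported characters is just one possible choice, not something the comment prescribes.

```python
import codecs

# Hypothetical setting: the encoding the deployment target actually uses.
SYSTEM_ENCODING = "iso-8859-1"


def _is_utf8(encoding: str) -> bool:
    # codecs.lookup normalises aliases such as "utf8" or "UTF-8".
    return codecs.lookup(encoding).name == "utf-8"


def to_utf8(data: bytes, source: str = SYSTEM_ENCODING) -> bytes:
    """Convert bytes from the system encoding to UTF-8 (no-op if already UTF-8)."""
    if _is_utf8(source):
        return data
    return data.decode(source).encode("utf-8")


def from_utf8(data: bytes, target: str = SYSTEM_ENCODING) -> bytes:
    """Convert UTF-8 bytes to the system encoding (no-op if it is UTF-8)."""
    if _is_utf8(target):
        return data
    # Assumption: characters the target lacks become numeric character references.
    return data.decode("utf-8").encode(target, errors="xmlcharrefreplace")


print(to_utf8("café".encode("iso-8859-1")))
print(from_utf8("café".encode("utf-8")))
```

Switching SYSTEM_ENCODING to "utf-8" turns both functions into pass-throughs, which is the "bypass the conversion" case.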

Christoffer Hammarström
20071204

Windows does not in fact use ISO-8859; it uses CP 1252 and friends. The Windows "codepages" (encodings) were invented before ISO-8859 was finalized, but are based on it.

So CP 1252, which is the default encoding in western Windows, is like ISO-8859-1 but with some extra characters in code points 128-159, which are reserved in ISO-8859.
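
The difference is easy to see with a few bytes from that range. This is a small Python sketch; the three sample bytes are my own choice, and note that Python's iso-8859-1 codec maps 128-159 to the invisible C1 control characters rather than rejecting them.

```python
# Three bytes from the 128-159 range where the two encodings disagree.
raw = bytes([0x80, 0x93, 0x94])

# CP 1252 assigns printable characters here: euro sign and curly quotes.
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 (as implemented by Python) yields C1 control characters instead.
as_latin1 = raw.decode("iso-8859-1")

# From 160 to 255 both encodings map every byte to the same character.
same_above_159 = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("iso-8859-1")
    for b in range(160, 256)
)

print(as_cp1252, [hex(ord(c)) for c in as_latin1], same_above_159)
```

This is why "smart quotes" pasted from Windows applications are a typical symptom of the two encodings being mixed up.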

sole
20071204

@JP: I don't assume anything - I'm saying "as long as your system is properly set up". Mine are! Although it might be interesting to write a checklist about this, actually. Thanks for the indirect suggestion :-)

@Christoffer: yeah, you're right, it just slipped my mind. So that's why Eclipse on Windows inherits that: it uses the system's encoding.

JP
20071204

sole: the problem is that when you deploy your UTF-based application, the system may be using another default encoding. Using or not using UTF has nothing to do with "proper setup". It is all fine and dandy until the day you need interoperability with older systems, or you need to deploy on a production infrastructure you don't control.
I'm not saying UTF is bad, far from it, or that you don't raise some interesting and valid points. I'm just saying that defaulting everything to UTF-8 doesn't mean you won't have problems, and some of the problems that arise may not be easily solvable.

Derek Allard
20071204

Sole, this is beautifully written. Just 4 days ago I was discussing creating databases and setting default encodings, and the only reason I could come up with for why I'd suggest they use UTF-8 was "it's more international, trust me" ;)

What you've done here is outline the arguments wonderfully, so I'll just be pointing people here from now on. Thanks!

sole
20071204

@JP: I guess that falls under the exceptions, and as such needs to be treated with special care.

@Derek: Thanks! I wish I had understood these things from the beginning… I wouldn't have done certain things the (wrong) way I did them :-)

Remco
20071204

This particular discussion was decided years ago - using Unicode is a no-brainer. It's not more painful than ISO-8859 and works in far more cases. In practice if you're in contact with any XML etc, you're going to be in contact with UTF-8 anyway.

I thought you were going to talk about whether UTF-8 is the best encoding to use for Unicode. After all, people in Asia aren't very happy with it - western characters take 1 or 2 bytes, but many Asian ones take 3, 4 or even 5.

UTF-8 is surely the most ubiquitous and best supported encoding now, but I wonder if it'll stay that way.
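
For reference, the byte counts can be checked directly. This small Python sketch (the sample characters are my own choice) compares UTF-8 with UTF-16. Keep in mind that UTF-8 as standardized in RFC 3629 uses at most 4 bytes per code point; the longer 5- and 6-byte forms of the original design are no longer valid.

```python
# Sample characters: ASCII, accented Latin, a CJK ideograph, an emoji.
samples = ["A", "é", "日", "😀"]

# Bytes needed per character in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# Bytes needed in UTF-16 (little endian, so no byte order mark is added).
utf16_lengths = {ch: len(ch.encode("utf-16-le")) for ch in samples}

for ch in samples:
    print(f"U+{ord(ch):04X}  utf-8: {utf8_lengths[ch]}  utf-16: {utf16_lengths[ch]}")
```

So for CJK-heavy text UTF-8 costs 3 bytes per character where UTF-16 costs 2, while for mostly-ASCII markup the balance goes the other way.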

Remco
20071204

@JP: of course, regardless of the encoding you use, the system you deploy to may be using another default encoding. That's not particular to UTF-8.

UTF-8 checklist - soledad penadés
20071211

[...] the discussion in the previous post (Reasons for using UTF-8) I thought it could be interesting to gather a series of steps needed to get a UTF-8 friendly [...]

Razões para utilizar UTF-8 | Tecnologia da Informação - Desenvolvimento e Educação
20071211

[...] Soledad Penadés published a post about this question. She notes that it is very common for solutions to start out using ISO-8859. However, she lists the reasons why UTF-8 should be recommended: [...]

Cid R Andrade
20071211

Posted about it in my blog

Isaac Z. Schlueter
20071213

Sole,

Another reason to use UTF-8 in all code and databases is that it greatly facilitates compatibility between team members. Enforcing it requires a bit of buy-in from the team, but it's worth it.

On my current project at Yahoo, some of us use Macs, and others use Windows. A few developers write their code on RHEL or FreeBSD, either in Vim or Eclipse. The servers run either FreeBSD or RHEL.

Once you set up your editor and databases to use UTF-8, it's immediately apparent when someone is saving in ISO-8859-1. The bad characters stick out like a sore thumb. The site looks wrong if they're in the markup. Our policy is to stop everything, immediately change the bad characters, check the file back into CVS, and then look at the CVS logs and figure out who's not using UTF-8. It's very effective, because no one wants to be "that guy."
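
A check like that can also be automated. Here is a minimal Python sketch (the function name is hypothetical, and this is not the process described above, just one way to catch the same problem) that reports where a file's content stops being valid UTF-8:

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return exc.start
    return None


good = "café".encode("utf-8")
bad = "café".encode("iso-8859-1")  # what a misconfigured editor would save

print(first_invalid_utf8_offset(good))
print(first_invalid_utf8_offset(bad))
```

Run over the raw bytes of each file before it is checked in, it flags files saved in ISO-8859-1 as soon as they contain a non-ASCII character; pure ASCII files pass either way, since ASCII is valid UTF-8.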

The default encoding on most *nix systems (including Mac) is UTF-8, but the default in Windows is ISO-8859-1, and some *nix programs try to be "convenient" by silently supporting the Windows encoding. Unless we are all vigilant, it will cause problems.

@Remco

people in Asia aren't very happy with it - western characters take 1 or 2 bytes, but many Asian ones take 3, 4 or even 5.

Forgive the insensitivity, but they need to get over it. At least for the foreseeable future, UTF-8 is the most widely supported Unicode encoding worldwide. There are editors for every Asian language that can save in UTF-8. As long as a request is all in one language, the extra bytes are mostly taken care of by serving gzip-encoded pages anyhow. (If you're not telling your web server to gzip textual files, why not? It's not 1990. Browsers actually support gzip these days!)

Hardware increases in power and capacity with unbelievable speed, and software is necessarily fraught with irreducible complexity. So, whenever possible, it seems that it is generally best to make the hardware do some extra work (storing and encoding 5 bytes per glyph instead of 2), if doing so will make the software simpler and easier to understand (by eliminating the character encoding layer.)

sole
20071213

Very good point actually. I unconsciously had it in mind but it failed to materialise when writing this.

And it's definitely right. I've had virtually zero problems when working with mac and linux together, whereas with windows there's always the problem with encodings and line feeds (although line feeds do not show up as weird characters in the middle of the page).

Norbert
20071219

Sole, I agree with you on many points. In fact, our team is thinking about converting our Windows application from Latin-1 to UTF-8 instead of going the entire way to Unicode. The reason is that not all places in our code (several hundred thousand lines of it) are written cleanly enough that just flipping a switch in the project settings will bring us to Unicode. It seems a lot less painful to me to stay with an 8-bit based encoding like UTF-8 and nevertheless enjoy all the advantages of Unicode.

That, however, means talking to all Windows controls in Unicode (via the …W calls) while the main application code remains 8-bit based. Has anybody attempted this approach before us, or would you deem it more troublesome in the end than just fixing all the "unclean" places in the code, e.g. where literals are used without T(" …. ")?

sole
20071219

I understand (and I agree): sometimes it's not easy to switch to UTF-8. For example, I have to do some work on a certain website which currently has a mix of Latin and UTF-8 content, and it looks like it will take quite a bit of work!

But once that's sorted out "you just forget about that".

Unfortunately I don't have any experience programming Windows controls in Unicode or UTF-8… the last time I did any Windows GUI work was with MFC, and I can hardly remember anything! Sorry :-/
