Working with UTF-8 on the Web
Ignoring older (and badly implemented) browsers for a second, handling UTF-8 data on the web is quite simple. You just need to declare the character set in the HTTP header and/or the body of your document, like so (using PHP):
PHP:
<?php header("Content-type: text/html; charset=utf-8"); ?>
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
...
If your HTML page contains a form, browsers will generally send the results back in the character set of the page. So if your page is sent in UTF-8, you will (usually) get UTF-8 results back. The default encoding of HTML documents is ISO-8859-1, so by default you will get form data encoded as ISO-8859-1, with one big exception: some browsers (including Microsoft Internet Explorer and Apple Safari) will actually send the data encoded as Windows-1252, which extends ISO-8859-1 with some special symbols, like the euro (€) and the curly quotes (“”).
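To make the difference between the two encodings concrete, here is a small sketch that decodes the same bytes both ways. It assumes the iconv extension is available and that your iconv build knows the name "windows-1252"; the variable names are just for illustration.

```php
<?php
// Bytes 0x80, 0x93 and 0x94 are the euro sign and the curly quotes in
// Windows-1252, but invisible C1 control characters in ISO-8859-1.
$bytes = "\x80 \x93hi\x94";

$asCp1252 = iconv("windows-1252", "utf-8", $bytes);
$asLatin1 = iconv("iso-8859-1", "utf-8", $bytes);

echo bin2hex($asCp1252), "\n"; // e282ac20e2809c6869e2809d
echo bin2hex($asLatin1), "\n"; // c28020c2936869c294
```

So if you convert form data as ISO-8859-1 when the browser really sent Windows-1252, the euro sign and curly quotes silently turn into control characters instead of raising an error.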
It's those "usually" and "ignoring older (and badly implemented) browsers" qualifiers that make it a little bit tricky: if you want to make sure to catch these edge cases, you'll need to do a little bit of extra work. One thing you can do is add a hidden field to your form containing some data that is likely to be corrupted if the client isn't handling the character set correctly:
Code:
<input type="hidden" name="charset_check" value="ä™®">
You can also verify that you have received valid UTF-8 content with a regular expression published by the W3C.
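As a rough sketch of that check in PHP, the helper below (is_valid_utf8 is a made-up name) uses byte ranges along the lines of the W3C pattern: it accepts tab, LF, CR, printable ASCII and well-formed multi-byte sequences, and rejects overlong forms and surrogates. Treat it as an illustration under those assumptions rather than a verbatim copy of the W3C's expression.

```php
<?php
// Hypothetical helper: returns true if $s consists only of well-formed
// UTF-8 sequences (plus tab, LF, CR and printable ASCII).
function is_valid_utf8(string $s): bool
{
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII (tab, LF, CR, printable)
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x';

    return preg_match($pattern, $s) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true):  "café" in UTF-8
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): "café" in ISO-8859-1
```

Keep in mind that passing this check doesn't prove the browser actually sent UTF-8. The Windows-1252 bytes for "ä™®" (E4 99 AE) happen to form a valid three-byte UTF-8 sequence, so it's still worth comparing the hidden field against the exact bytes you expect.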
If the data is not valid UTF-8, or you already know that you are dealing with data in another character set that you want to convert into UTF-8, PHP supports a few different ways of converting the data:
- mbstring extension: mb_convert_encoding(string, to, from)
- iconv extension: iconv(from, to, string)
- recode extension: recode_string(request, string) (the extension was unbundled from PHP as of 7.4)
- built-in function: utf8_encode(string) (converts from ISO-8859-1 to UTF-8; deprecated as of PHP 8.2)
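The argument orders above are easy to mix up, so here is a minimal sketch that wraps the first two options. The helper name to_utf8 is hypothetical, and the fallback order (mbstring first, then iconv) is just an assumption for the example, not a recommendation.

```php
<?php
// Hypothetical helper: convert $data from the encoding $from to UTF-8
// using whichever extension is loaded. Note the differing argument orders.
function to_utf8(string $data, string $from): string
{
    if (function_exists('mb_convert_encoding')) {
        // mbstring: (string, to, from)
        $out = mb_convert_encoding($data, 'UTF-8', $from);
        if ($out !== false) {
            return $out;
        }
    }
    if (function_exists('iconv')) {
        // iconv: (from, to, string); returns false on failure
        $out = iconv($from, 'UTF-8', $data);
        if ($out !== false) {
            return $out;
        }
    }
    throw new RuntimeException("Cannot convert from $from to UTF-8");
}

echo bin2hex(to_utf8("\xE4", 'ISO-8859-1')), "\n";   // c3a4   (ä)
echo bin2hex(to_utf8("\x99", 'Windows-1252')), "\n"; // e284a2 (™)
```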
So handling input might look something like this:
PHP:
<?php
$test  = $_REQUEST['charset_check'] ?? ''; /* our test field */
$field = $_REQUEST['field'] ?? '';         /* the data field */

if (bin2hex($test) == "c3a4e284a2c2ae") { /* UTF-8 for "ä™®" */
    /* Nothing to do: it's UTF-8! */
} elseif (bin2hex($test) == "e499ae") { /* Windows-1252 */
    $field = iconv("windows-1252", "utf-8", $field);
} else {
    die("Sorry, I didn't understand the character set of the data you sent!");
}

/* $db is assumed to be an open mysqli connection (hypothetical name) */
$db->query("INSERT INTO `table` SET field = _utf8'" . $db->real_escape_string($field) . "'")
    or die("INSERT failed: " . $db->error);