Unicode search query in MySQL

Status
Not open for further replies.

sachin_kothari

Ambassador of Buzz
I am doing in web project in which i take input in a text box in hindi. I have a mysql database which again contains hindi elements. I want to match the user string with that in my mysql database. I did a search query for it, but I am unable to retrieve records from the database. How do I write the query so that i can retrieve elements from database?

Btw, i am doing a project like google suggest, which gives suggestion in hindi. I have managed all other aspects, except the retrieving from the database. I tried it with english and it did work, so the code is ready, only retrieving part in unicode(hindi) is remaining.
 

rohan

In the zone
Check the collation of your database. Make sure it is set to utf-8 or any other unicode transformation format. I always use utf-8 and everything works perfect for me, searching, inserting etc.

@QwertyManiac: that is the case with MS SQL, not MySQL.
 
OP
sachin_kothari

sachin_kothari

Ambassador of Buzz
Rohan, could you write a sample query to retrieve records from the database. Because i have done everything you said but i am still unable to retrieve records from database.
 

rohan

In the zone
I primarily use MySQL along with php, and here is a query:

Code:
$query  = "INSERT INTO entries(u_id, content, category_id, genre_id, timestamp, orig_author, disp_type) values(";
				$query .= "".$this->auth->u_id.", \"".$content."\", ".$cat.", ".$gen.", ".time().", \"".$orig_author."\", '".$disp_type."')";

Now here, all the inputs i get are from a html form. Even if i type in anything that is in Unicode, it searches it properly. The thing is that.. a query doesn't have much to do with character encoding in MySQL. it's how your database is set.. For example, the complete parameters of my table are:

Code:
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

this might help:

*dev.mysql.com/tech-resources/articles/4.1/unicode.html
*dev.mysql.com/doc/refman/5.0/en/charset-unicode.html
 
OP
sachin_kothari

sachin_kothari

Ambassador of Buzz
rohan bro, the query you have displayed is for inserting records into database. i have no problems inserting records into database, my problem is to retrieve them back. would be a great help if you could solve his problem for me.
 

victor_rambo

हॉर्न ओके प्लीज़
Will this help?
PHP:
$sql = 'SELECT * FROM `table` WHERE `column` LIKE CONVERT(_utf8 \'$searchString\' USING latin1) COLLATE latin1_general_ci';
 

DigitalDude

PhotonAttack
aviod using innodb engine unless you have a lot of transactions and have performance issues with myisam... innodb uses lot of memory :D

btw its better to set your db table types directly instead to hardcode it in the query unless its specifically necessary ;)

if you have your table set for unicode theres nothing to worry in query...



_
 
OP
sachin_kothari

sachin_kothari

Ambassador of Buzz
Will this help?
PHP:
$sql = 'SELECT * FROM `table` WHERE `column` LIKE CONVERT(_utf8 \'$searchString\' USING latin1) COLLATE latin1_general_ci';
I am using UTF8 and COLLATE as utf8_unicode_ci. So that's no problem.

aviod using innodb engine unless you have a lot of transactions and have performance issues with myisam... innodb uses lot of memory :D

btw its better to set your db table types directly instead to hardcode it in the query unless its specifically necessary ;)

if you have your table set for unicode theres nothing to worry in query...



_
Thanks for that tip. Will keep that in mind.

The problem gets a bit clearer.
When i type anything into the text box, for eg. if i type 'स्' i get '%u0938%u094D' and i pass this to the query. How can i write the code such that i can pass 'स्' to the query? Because when i hardcoded my query with 'स्' in phpmyadmin i got my intended records.

So the problem now is not the query but the conversion form '%u0938%u094D' to 'स्'.
 

DigitalDude

PhotonAttack
I guess this part will be apt for you

Working with UTF-8 on the Web

Ignoring older (and badly implemented) browsers for a second, handling UTF-8 data on the web is quite simple. You just need to indicate in the header and/or body of your document the character set, like so (using PHP):
PHP:
<?php header("Content-type: text/html; charset=utf-8");?> <html> <head> <meta http-equiv="Content-type" value="text/html; charset=utf-8"> ...
If your HTML page contains a form, browsers will generally send the results back in the character set of the page. So if your page is sent in UTF-8, you will (usually) get UTF-8 results back. The default encoding of HTML documents is ISO-8859-1, so by default you will get form data encoded as ISO-8859-1, with one big exception: some browsers (including Microsoft Internet Explorer and Apple Safari) will actually send the data encoded as Windows-1252, which extends ISO-8859-1 with some special symbols, like the euro (€) and the curly quotes (“”).
It's those "usually" and "ignoring older (and badly implemented) browsers" qualifiers that make it a little bit tricky: if you want to make sure to catch these edge cases, you'll need to do a little bit of extra work. One thing you can do is add a hidden field to your form containing some data is likely to be corrupted if the client isn't handling the character set correctly:
Code:
<input type="hidden" name="charset_check" value="ä™®">
You can also verify that you have gotten valid UTF-8 content with this regular expression published by the W3C.
If the data is not valid UTF-8, or you already know that you are dealing with data in another character set that you want to convert into UTF-8, PHP supports a few different ways of converting the data:
  • mbstring extension: mb_convert_encoding(string, to, from)
  • iconv extension: iconv(from, to, string)
  • recode extension: recode_string(request, string)
  • built-in function: utf8_encode(string) (converts from ISO-8859-1 to UTF-8 )
So handling input might look something like this:
PHP:
<?php $test = $_REQUEST['charset_check']; /* our test field */ $field = $_REQUEST['field']; /* the data field */ if (bin2hex($test) == "c3a4e284a2c2ae") { /* UTF-8 for "ä™®" */ /* Nothing to do: it's UTF-8! */ } elseif (bin2hex($test) == "e499ae") { /* Windows-1252 */ $field = iconv("windows-1252", "utf-8", $field); } else { die("Sorry, I didn't understand the character set of the data you sent!"); } mysql_query("INSERT INTO table SET field = _utf8'" . addslashes($field) . "'") or die("INSERT failed: " . mysql_error());

quoted from *dev.mysql.com/tech-resources/articles/4.1/unicode.html (the link rohan gave)

and what version of mysql r u using ? the above is for 4 and 4.1

if you are using 5 check this page *dev.mysql.com/doc/mysql/en/charset-metadata.html




and one more point.. I dont know how this might help you but I think there might be some connection

you must be familiar with phpMyAdmin... it requires 'mbstring' php extension to work with utf-8 characters so mbstring might also have a role to play :D



_
 
Status
Not open for further replies.
Top Bottom