3 Nov 2009

MySQL and UTF-8 - locales and some advanced settings

Default locales, and collation in MySQL since 4.1 is latin1_swedish_ci, it works well but if you have some polish texts it's not so good, there is way to change defaults, and it is quite easy. After installation of MySQL edit your config file (default location of this file is /etc/mysql/my.cnf).
  1. Find [mysqld] section in my.cnf
  2. Add this entries to it:
    [mysqld]
    #
    # Enable Polish collocation and default utf8 encoding
    # Change mysql default character set
    #
    character_set_server=utf8
    collation_server=utf8_polish_ci
    default-character-set=utf8
    
    init-connect="SET NAMES utf8;"
    # It makes
    # SET character_set_client = utf8;
    # SET character_set_results = utf8;
    # SET character_set_connection = utf8;
    You can also add option: skip-character-set-client-handshake to enforce using of utf8 encoding in database and connection, to read more about this you can visit mysql manual.
  3. Type: /etc/init.d/mysql restart
  4. You're done ;)
This setting alters only databases which will be created.

Changing character sets in VARCHAR data

To change charset in table loaded with text data, you could run SQL query like:
ALTER DATABASE `DATABASE_NAME` DEFAULT CHARACTER SET utf8 COLLATE utf8_polish_ci;
Or in batch: Snippet below comes from stackoverflow.com (visit it for more details) (if you're using some indices you should drop it).
mysql --database=dbname -B -N -e "SHOW TABLES" \
| awk '{print "ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_polish_ci;"}' \
| mysql --database=dbname &

If you want to check current CHARACTER_SET and COLLATION

SHOW VARIABLES LIKE "character_set%";
SHOW VARIABLES LIKE "collation%";
or
SHOW CREATE DATABASE `DATABASE_NAME`;

6 comments:

  1. If you are interested in this topic you could read further guides and advices - on this page is also PHP script to do massive change of collation.

    ReplyDelete
  2. If you're web developer, you should checkout this optiuons too (via: http://fschiettecatte.wordpress.com/2007/05/18/mysql-encoding/)

    httpd.conf:
    AddCharset UTF-8 .utf8
    AddDefaultCharset UTF-8

    php.ini
    default_charset = "utf-8"

    ReplyDelete
  3. If you're having problems with SELECT ... WHERE field='your_accents' returns data even without accents, try to change:

    in my.cnf:

    character_set_server=utf8
    collation_server=utf8_bin
    default-character-set=utf8

    init-connect="SET NAMES utf8 COLLATE utf8_bin;"


    and create tables like:

    CREATE TABLE `tab` (
    `field` varchar(255) NOT NULL,
    ) DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

    Eventually try:

    SET NAMES binary;

    or:

    SELECT * FROM tab t WHERE t.field = 'ść' COLLATE utf8_bin;

    or:

    SET NAMES utf8 COLLATE utf8_bin;

    Maybe one of these will work ... :> argh! I don't like MySQL collations ...

    ReplyDelete
  4. ps - recommended reading: http://www.bluetwanger.de/blog/2006/11/20/mysql-and-utf-8-no-more-question-marks/

    and try to invoke in mysql client:

    SHOW VARIABLES LIKE "char%";

    maybe output of these will help you :)

    ** this is note for myself;

    ReplyDelete
  5. Fixing column encoding mess in MySQL


    Citation:

    The column (actually the whole table) was created with DEFAULT CHARSET cp1251. Most of the data was in proper cp1251 national encoding indeed. However, because of web application failure to properly set the encoding, some of the rows were actually in UTF-8. That needed to be fixed.
    Simply using CONVERT(column USING xxx) did not work because MySQL treated the source data as if it was in cp1251. One obvious solution would be to write a throwaway PHP script which would SET NAMES cp1251, pull the offending rows (they’d come out in UTF-8), iconv() them to proper cp1251, and UPDATE them with new values.
    However it’s possible to fix the issue within MySQL. The trick is to tell it to treat the string coming from the table as binary, and then do charset conversion:

    RUN:

    UPDATE table SET column=CONVERT(CONVERT(CONVERT(column USING binary) USING utf8) USING cp1251) WHERE id=123;

    UPDATE table SET column=CONVERT(CONVERT(column USING binary) USING utf8) WHERE id=123;

    ReplyDelete