21.2. Character Set Support

The character set support in PostgreSQL allows you to store text in a variety of character sets (also called encodings), including single-byte character sets such as the ISO 8859 series and multibyte character sets such as EUC (Extended Unix Code), UTF-8, and Mule internal code. All supported character sets can be used transparently by clients, but a few cannot be used as server encodings; Table 21-2 marks those as not supported as a server encoding. (Extension functions from other sources work correctly only if their authors wrote them to handle the encoding in use.) The default character set is selected when you initialize your PostgreSQL database cluster with initdb. It can be overridden when you create a database with createdb or with the SQL command CREATE DATABASE, so a single cluster can contain multiple databases, each with a different character set.

21.2.1. Supported Character Sets

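Table 21-1 shows the character sets available for use in PostgreSQL. Not all of them can be used as server encodings: BIG5, GB18030, GBK, SJIS, and UHC are available only on the client side (see Table 21-2).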

Table 21-1. Server Character Sets

Name          | Description                | Language                       | Bytes/Char | Aliases
--------------+----------------------------+--------------------------------+------------+---------------------------------------
BIG5          | Big Five                   | Traditional Chinese            | 1-2        | WIN950, Windows950
EUC_CN        | Extended UNIX Code-CN      | Simplified Chinese             | 1-3        |
EUC_JP        | Extended UNIX Code-JP      | Japanese                       | 1-3        |
EUC_KR        | Extended UNIX Code-KR      | Korean                         | 1-3        |
EUC_TW        | Extended UNIX Code-TW      | Traditional Chinese, Taiwanese | 1-3        |
GB18030       | National Standard          | Chinese                        | 1-4        |
GBK           | Extended National Standard | Simplified Chinese             | 1-2        | WIN936, Windows936
ISO_8859_5    | ISO 8859-5, ECMA 113       | Latin/Cyrillic                 | 1          |
ISO_8859_6    | ISO 8859-6, ECMA 114       | Latin/Arabic                   | 1          |
ISO_8859_7    | ISO 8859-7, ECMA 118       | Latin/Greek                    | 1          |
ISO_8859_8    | ISO 8859-8, ECMA 121       | Latin/Hebrew                   | 1          |
JOHAB         | JOHAB                      | Korean (Hangul)                | 1-3        |
KOI8          | KOI8-R(U)                  | Cyrillic                       | 1          | KOI8R
LATIN1        | ISO 8859-1, ECMA 94        | Western European               | 1          | ISO88591
LATIN2        | ISO 8859-2, ECMA 94        | Central European               | 1          | ISO88592
LATIN3        | ISO 8859-3, ECMA 94        | South European                 | 1          | ISO88593
LATIN4        | ISO 8859-4, ECMA 94        | North European                 | 1          | ISO88594
LATIN5        | ISO 8859-9, ECMA 128       | Turkish                        | 1          | ISO88599
LATIN6        | ISO 8859-10, ECMA 144      | Nordic                         | 1          | ISO885910
LATIN7        | ISO 8859-13                | Baltic                         | 1          | ISO885913
LATIN8        | ISO 8859-14                | Celtic                         | 1          | ISO885914
LATIN9        | ISO 8859-15                | LATIN1 with Euro and accents   | 1          | ISO885915
LATIN10       | ISO 8859-16, ASRO SR 14111 | Romanian                       | 1          | ISO885916
MULE_INTERNAL | Mule internal code         | Multilingual Emacs             | 1-4        |
SJIS          | Shift JIS                  | Japanese                       | 1-2        | Mskanji, ShiftJIS, WIN932, Windows932
SQL_ASCII     | unspecified (see text)     | any                            | 1          |
UHC           | Unified Hangul Code        | Korean                         | 1-2        | WIN949, Windows949
UTF8          | Unicode, 8-bit             | all                            | 1-4        | Unicode
WIN866        | Windows CP866              | Cyrillic                       | 1          | ALT
WIN874        | Windows CP874              | Thai                           | 1          |
WIN1250       | Windows CP1250             | Central European               | 1          |
WIN1251       | Windows CP1251             | Cyrillic                       | 1          | WIN
WIN1252       | Windows CP1252             | Western European               | 1          |
WIN1256       | Windows CP1256             | Arabic                         | 1          |
WIN1258       | Windows CP1258             | Vietnamese                     | 1          | ABC, TCVN, TCVN5712, VSCII

Not all APIs support all the listed character sets. For example, the PostgreSQL JDBC driver does not support MULE_INTERNAL, LATIN6, LATIN8, or LATIN10.
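The Bytes/Char column of Table 21-1 can be checked outside the server. The following is a minimal sketch in Python, not PostgreSQL code. It assumes that Python's codec names utf-8, euc_jp, and latin-1 correspond to the server encodings UTF8, EUC_JP, and LATIN1, and byte_lengths is a hypothetical helper introduced only for this illustration.

```python
def byte_lengths(text, codec):
    """Return how many bytes each character of `text` occupies in `codec`."""
    return [len(ch.encode(codec)) for ch in text]


if __name__ == "__main__":
    # "A", e-acute, the euro sign, and an emoji: 1 to 4 bytes each in UTF-8.
    print("utf-8  ", byte_lengths("A\u00e9\u20ac\U0001F600", "utf-8"))
    # "A" and one hiragana letter: ASCII stays 1 byte, the kana takes 2 in EUC-JP.
    print("euc_jp ", byte_lengths("A\u3042", "euc_jp"))
    # A single-byte character set uses exactly 1 byte per character.
    print("latin-1", byte_lengths("A\u00e9", "latin-1"))
```

The width of a character therefore depends on the encoding, which is why the table gives a range rather than a single value for the multibyte character sets.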

The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard, while byte values 128-255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting, because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.
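To see why uninterpreted bytes are a problem, consider what a single byte above 127 can mean. The sketch below is plain Python rather than anything PostgreSQL does internally. The codec names latin-1, cp1251, and utf-8 are assumed stand-ins for LATIN1, WIN1251, and UTF8, and both helper functions are hypothetical.

```python
def interpretations(raw, codecs):
    """Decode the same bytes under several codecs; None marks invalid input."""
    result = {}
    for codec in codecs:
        try:
            result[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            result[codec] = None
    return result


def is_pure_ascii(raw):
    """True if every byte is in 0-127, the only range SQL_ASCII interprets."""
    return all(b < 128 for b in raw)


if __name__ == "__main__":
    raw = b"caf\xe9"  # the last byte is above 127
    print("pure ASCII:", is_pure_ascii(raw))
    for codec, text in interpretations(raw, ["latin-1", "cp1251", "utf-8"]).items():
        print(codec, repr(text))
```

The same four bytes are a valid but different string in two single-byte encodings and are not valid UTF-8 at all. With SQL_ASCII the server stores them without knowing which reading was intended, so it can neither validate nor convert them.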

21.2.2. Setting the Character Set

initdb defines the default character set for a PostgreSQL cluster. For example,

initdb -E EUC_JP

sets the default character set (encoding) to EUC_JP (Extended Unix Code for Japanese). You can use --encoding instead of -E if you prefer to type longer option strings. If no -E or --encoding option is given, initdb attempts to determine the appropriate encoding to use based on the specified or default locale.

You can create a database with a different character set:

createdb -E EUC_KR korean

This will create a database named korean that uses the character set EUC_KR. Another way to accomplish this is to use this SQL command:

CREATE DATABASE korean WITH ENCODING 'EUC_KR';

The encoding of each database is stored in the system catalog pg_database. You can list the encodings with the -l option or the \l command of psql.

$ psql -l
            List of databases
   Database    |  Owner  |   Encoding    
---------------+---------+---------------
 euc_cn        | t-ishii | EUC_CN
 euc_jp        | t-ishii | EUC_JP
 euc_kr        | t-ishii | EUC_KR
 euc_tw        | t-ishii | EUC_TW
 mule_internal | t-ishii | MULE_INTERNAL
 postgres      | t-ishii | EUC_JP
 regression    | t-ishii | SQL_ASCII
 template1     | t-ishii | EUC_JP
 test          | t-ishii | EUC_JP
 utf8          | t-ishii | UTF8
(10 rows)

Important: Although you can specify any encoding you want for a database, it is unwise to choose an encoding that is not what is expected by the locale you have selected. The LC_COLLATE and LC_CTYPE settings imply a particular encoding, and locale-dependent operations (such as sorting) are likely to misinterpret data that is in an incompatible encoding.

Since these locale settings are frozen by initdb, the apparent flexibility to use different encodings in different databases of a cluster is more theoretical than real. It is likely that these mechanisms will be revisited in future versions of PostgreSQL.

One way to use multiple encodings safely is to set the locale to C or POSIX during initdb, thus disabling any real locale awareness.
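A locale name usually shows which encoding it implies. The following Python sketch assumes the common POSIX naming pattern language[_territory][.codeset][@modifier]. The function locale_codeset and the sample locale names are illustrative only, and whether a given locale exists on your system is not checked.

```python
def locale_codeset(locale_name):
    """Return the codeset part of a POSIX-style locale name, or None."""
    if locale_name in ("C", "POSIX"):
        return None  # no real locale awareness, so no implied encoding
    name = locale_name.split("@", 1)[0]  # drop an optional @modifier
    _, dot, codeset = name.partition(".")
    return codeset if dot else None


if __name__ == "__main__":
    for name in ("ja_JP.eucJP", "de_DE.ISO8859-15@euro", "en_US", "C"):
        print(name, "->", locale_codeset(name))
```

A database whose encoding differs from the codeset reported for the cluster's LC_COLLATE and LC_CTYPE locale is the situation the warning above describes. The C and POSIX locales report no codeset, which is why they are safe with any encoding.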

21.2.3. Automatic Character Set Conversion Between Server and Client

PostgreSQL supports automatic character set conversion between server and client for certain character sets. The conversion information is stored in the pg_conversion system catalog. You can create a new conversion by using the SQL command CREATE CONVERSION. PostgreSQL comes with some predefined conversions. They are listed in Table 21-2.

Table 21-2. Client/Server Character Set Conversions

Server Character Set | Available Client Character Sets
---------------------+---------------------------------------------------------------
BIG5                 | not supported as a server encoding
EUC_CN               | EUC_CN, MULE_INTERNAL, UTF8
EUC_JP               | EUC_JP, MULE_INTERNAL, SJIS, UTF8
EUC_KR               | EUC_KR, MULE_INTERNAL, UTF8
EUC_TW               | EUC_TW, BIG5, MULE_INTERNAL, UTF8
GB18030              | not supported as a server encoding
GBK                  | not supported as a server encoding
ISO_8859_5           | ISO_8859_5, KOI8, MULE_INTERNAL, UTF8, WIN866, WIN1251
ISO_8859_6           | ISO_8859_6, UTF8
ISO_8859_7           | ISO_8859_7, UTF8
ISO_8859_8           | ISO_8859_8, UTF8
JOHAB                | JOHAB, UTF8
KOI8                 | KOI8, ISO_8859_5, MULE_INTERNAL, UTF8, WIN866, WIN1251
LATIN1               | LATIN1, MULE_INTERNAL, UTF8
LATIN2               | LATIN2, MULE_INTERNAL, UTF8, WIN1250
LATIN3               | LATIN3, MULE_INTERNAL, UTF8
LATIN4               | LATIN4, MULE_INTERNAL, UTF8
LATIN5               | LATIN5, UTF8
LATIN6               | LATIN6, UTF8
LATIN7               | LATIN7, UTF8
LATIN8               | LATIN8, UTF8
LATIN9               | LATIN9, UTF8
LATIN10              | LATIN10, UTF8
MULE_INTERNAL        | MULE_INTERNAL, BIG5, EUC_CN, EUC_JP, EUC_KR, EUC_TW, ISO_8859_5, KOI8, LATIN1 to LATIN4, SJIS, WIN866, WIN1250, WIN1251
SJIS                 | not supported as a server encoding
SQL_ASCII            | any (no conversion will be performed)
UHC                  | not supported as a server encoding
UTF8                 | all supported encodings
WIN866               | WIN866, ISO_8859_5, KOI8, MULE_INTERNAL, UTF8, WIN1251
WIN874               | WIN874, UTF8
WIN1250              | WIN1250, LATIN2, MULE_INTERNAL, UTF8
WIN1251              | WIN1251, ISO_8859_5, KOI8, MULE_INTERNAL, UTF8, WIN866
WIN1252              | WIN1252, UTF8
WIN1256              | WIN1256, UTF8
WIN1258              | WIN1258, UTF8
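Table 21-2 has three special kinds of rows: encodings that cannot be used on the server at all, SQL_ASCII (which accepts anything but converts nothing), and UTF8 (which converts to every supported encoding). The Python sketch below shows one way to read the table as a lookup. It copies only three ordinary rows, and the names CLIENT_SETS, CLIENT_ONLY, and conversion_available are hypothetical, not part of PostgreSQL.

```python
# A few ordinary rows copied from Table 21-2; the remaining rows are omitted.
CLIENT_SETS = {
    "EUC_JP": {"EUC_JP", "MULE_INTERNAL", "SJIS", "UTF8"},
    "LATIN2": {"LATIN2", "MULE_INTERNAL", "UTF8", "WIN1250"},
    "WIN874": {"WIN874", "UTF8"},
}

# Rows marked "not supported as a server encoding".
CLIENT_ONLY = {"BIG5", "GB18030", "GBK", "SJIS", "UHC"}


def conversion_available(server, client):
    """Return True if Table 21-2 lists `client` for the server encoding `server`.

    Raises ValueError for client-only encodings and KeyError for rows
    that were not copied into this sketch.
    """
    if server in CLIENT_ONLY:
        raise ValueError(f"{server} is not supported as a server encoding")
    if server == "SQL_ASCII":
        return True  # any client encoding is accepted, but nothing is converted
    if server == "UTF8":
        return True  # listed as "all supported encodings"; `client` is not validated here
    return client in CLIENT_SETS[server]


if __name__ == "__main__":
    print(conversion_available("EUC_JP", "SJIS"))
    print(conversion_available("EUC_JP", "LATIN1"))
```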

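To enable automatic character set conversion, you have to tell PostgreSQL the character set (encoding) you would like to use in the client. There are several ways to accomplish this: the \encoding command in psql, the libpq function PQsetClientEncoding(), the SQL command SET client_encoding TO 'value' (or the equivalent standard syntax SET NAMES 'value'), the PGCLIENTENCODING environment variable, and the client_encoding configuration variable.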

If a particular character cannot be converted, it is transformed to its hexadecimal byte values in parentheses, e.g., (826C). For example, if you chose EUC_JP for the server and LATIN1 for the client, some Japanese characters have no LATIN1 equivalent and are returned in this form.
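The fallback described above can be imitated in a few lines. This is a Python emulation of the documented behavior under stated assumptions, not the server's conversion code: the codecs euc_jp and latin-1 stand in for EUC_JP and LATIN1, and convert_with_fallback is a hypothetical helper.

```python
def convert_with_fallback(data, server_codec, client_codec):
    """Re-encode `data` from server_codec to client_codec.

    A character with no equivalent in client_codec is replaced by the
    hexadecimal value of its server-side bytes, wrapped in parentheses.
    """
    out = bytearray()
    for ch in data.decode(server_codec):
        try:
            out += ch.encode(client_codec)
        except UnicodeEncodeError:
            hex_bytes = ch.encode(server_codec).hex().upper()
            out += ("(" + hex_bytes + ")").encode("ascii")
    return bytes(out)


if __name__ == "__main__":
    stored = "A\u3042".encode("euc_jp")  # "A" followed by one hiragana letter
    print(convert_with_fallback(stored, "euc_jp", "latin-1"))
```

The ASCII letter passes through unchanged, while the hiragana letter, which LATIN1 cannot represent, appears as its two EUC_JP bytes in hexadecimal.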

If the client character set is defined as SQL_ASCII, encoding conversion is disabled, regardless of the server's character set. Just as for the server, use of SQL_ASCII is unwise unless you are working with all-ASCII data.

21.2.4. Further Reading

These are good sources to start learning about various kinds of encoding systems.

http://www.i18ngurus.com/docs/984813247.html
    An extensive collection of documents about character sets, encodings, and code pages.

ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
    Detailed explanations of EUC_JP, EUC_CN, EUC_KR, and EUC_TW appear in section 3.2.

http://www.unicode.org/
    The web site of the Unicode Consortium.

RFC 2044
    UTF-8 is defined here. (RFC 2044 has since been superseded by RFC 3629.)