Firebird Character Sets and Collations

Firebird Character Sets and Collations

Every CHAR or VARCHAR field can (or, better: must) have a character set assigned. Firebird uses this information to correctly store the bytes that make up the character string. 

In order to be able to sort or compare strings, you also need to define a collation. A collation defines the sort ordering and uppercase conversions for a string.

Firebird is unable to transliterate between character sets. So you must set the correct values on the server and on the client if everything is to work smoothely.

An Example

In the German language there are the "Umlauts", special vowels with a double-dot (diaeresis) over them. A common last name in Germany is Müller. (If you don't have umlauts on your keyboard, you could also write "Mueller", but that's not what we want to discuss here ;-)

When you convert Müller to uppercase you get MÜLLER, so there is an uppercase Ü and a lowercase ü.

When you want to perform a lexicographic compare on the name, you have several options:

  • You can treat the Ü like a U (German "Duden" dictionary)
  • You can treat the Ü like UE (German telephone book)
  • You can treat the Ü like a special character, sorted in after Z (a common practice in Scandinavia)

Creating a Database

You can define the default character set for a new database in the CREATE DATABASE statement:

CREATE DATABASE <database> 
  USER <username> 
  PASSWORD <password> 
  PAGE_SIZE <pagesize> 
  DEFAULT CHARACTER SET <charset>

For example:

CREATE DATABASE localhost:meter
  USER SYSDBA
  PASSWORD masterkey
  PAGE_SIZE 4096
  DEFAULT CHARACTER SET ISO8859_1;

From now on, any VARCHAR or CHAR field will default to the ISO8859_1 character set. You can, however, specify a special character set for each column:

CREATE TABLE users (
  CZECH_NAME VARCHAR(50) CHARACTER SET ISO8859_2,
  ...

Collations

There is no default collation. So you should define a collation for every field that is to be used for sorting (ORDER BY) or comparing (UPPER):

CREATE TABLE users (
  NAME VARCHAR(50) COLLATE DE_DE,
  ...

COLLATE DE_DE means: "use a collation for the German language (the first DE), applying the rules from Germany (the second DE)"

You can also specify the collation with the ORDER BY clause:

ORDER BY LASTNAME COLLATE FR_CA, FIRSTNAME COLLATE FR_CA

or with the WHERE clause:

WHERE LASTNAME COLLATE FR_CA = :lastnametosearch

or when searching:

WHERE UPPER (LAST_NAME COLLATE SV_SV) = 'PAULSEN';

The UPPER() function

UPPER() only works correctly if there is a collation defined for the parameter field:

WHERE UPPER (NAME COLLATE DE_DE) = 'MÜLLER';

Specifying the client character set

ISQL

SET NAMES ISO8859_1;

InterBase Objects (Ibo) by Jason Wharton

The TIb_Connection class has a string property named CharSet. Assign it the name of the character set to use:

Ib_Connection1.CharSet := 'ISO8859_1';

InterBase Express (IBX), built into Delphi

The TIbDatabase class has a TStrings property named Params. Add a field with the name lc_ctype and specify the character set:

IbDatabase1.Params.Add ('lc_ctype=ISO8859_1');

PHP

In PHP you define the Client Character Set when you connect (or pconnect) to the database.

$db = ibase_connect ($Name, $Usr, $Pwd, "ISO8859_1");

Conversions

Conversions between character sets are always done as: CHARSET1 -> UNICODE -> CHARSET2

With NONE or OCTETS as the connection character set, the bytes are just copied: NONE/OCTETS -> CHARSET2 and CHARSET1 -> NONE/OCTETS.

Case insensitive searching

I have written a separate article about this.


Character Sets and Collations

Character SetLanguagesCollationComments
Generic Character Sets
NONEAllNONENo character set applied. With this character set setting, Firebird is unable to perform conversion operations like UPPER() correctly on anything other than the standard 26 latin letters.
OCTETS | BINARYAllOCTETSSame as NONE. Cannot be used as client connection character set. Space character is #x00. Will be displayed as hex in ISQL 2.0.
ASCIIEnglishASCIIEnglish
Unicode based Character Sets
UNICODE_FSSAllUNICODE_FSSUnicode UTF-8. An old implementation that accepts malformed strings and does not enforce correct max. string length. All characters 3 bytes, no case mapping.

Superseded in Firebird 2.0 with the UTF8 character set. Deprecated.

UTF8AllUCS_BASICUCS_BASIC sorts in Unicode code-point order (Firebird 2.0)
UNICODESorts using the Unicode Collation Algorithm (UCA) (Firebird 2.0)
UTF-8Case insensitive collation (Firebird 2.1)
UNICODE_CI_AICase insentive, Accent insensitive collation for Unicode (Firebird 2.5)
Current Character Sets
ISO8859_1Western EuropeISO8859_1Latin-1
DA_DADanish/Danmark
DE_DEGerman/Germany
DU_NLDutch/The Netherlands
EN_UKEnglish/United Kingdom
EN_USEnglish/USA
ES_ESSpanish/Spain
ES_ES_CI_AISpanish/Spain, case insensitive, accent insensitive (Firebird 2.0)
FI_FIFinnish/Finnland
FR_CAFrench/Canada
FR_FRFrench/France
FR_FR_CI_AIFrench/France, case insensitive, accent insensitive (Firebird 2.1)
IS_ISIcelandic/Iceland
IT_ITItalian/Italy
NO_NONorwegian/Norway
PT_PTPortuguese/Portugal
PT_BRPortuguese/Brasil (Firebird 2.0). Case+Accent insensitive
SV_SVSwedish/Sweden
ISO8859_2Central EuropeISO8859_2Central Europe
CS_CZCzech
ISO_HUNHungarian
ISO_PLKPolish (Firebird 2.0)
ISO8859_3Southern EuropeISO8859_3Maltese, Esperanto
ISO8859_4North EuropeanISO8859_4Estonian, Latvian, Lithuanian, Greenlandic, Lappish
ISO8859_5CyrillicISO8859_5Russian, Ukrainian
ISO8859_6ArabicISO8859_6
ISO8859_7Modern GreekISO8859_7
ISO8859_8HebrewISO8859_8
ISO8859_9TurkishISO8859_9
ISO8859_13BalticISO8859_13Baltic
LT_LTLithuanian
WIN1250Central EuropeWIN1250Central Europe
BS_BABosnian (Firebird 2.0)
WIN_CZCzech, case-insensitive (Firebird 2.0)
WIN_CZ_CI_AICzech, case-insensitive, accent-insensitive (Firebird 2.0)
PXW_CSYCzech
PXW_HUNHungarian
PXW_HUNDCHungarian, Dictionary sort
PXW_PLKPolish
PXW_SLOVSlovanian
WIN1251CyrillicWIN1251Cyrillic
WIN1251_UAUkrainian
PXW_CYRLCyrillic, Paradox compatibility
WIN1252Western Europe, AmericaWIN1252Latin-1 with Windows extensions
WIN_PTBRBrasilian Portuguese (Firebird 2.0). Case+Accent insensitive
PXW_INTLParadox ANSI International
PXW_INTL850Paradox Multi-Lingual Latin-1
PXW_NORDAN4Paradox Norwegian and Danish
PXW_SPANParadox Spanish
PXW_SWEDFINParadox Swedish, Finnish
WIN1253Modern GreekWIN1253
PXW_GREEKParadox Greek
WIN1254TurkishWIN1254
PXW_TURKParadox Turkish
WIN1255HebrewWIN1255
WIN1256ArabicWIN1256
WIN1257BalticWIN1257Baltic
WIN1257_LVLatvian dictionary collation (Firebird 2.0)
WIN1257_LTLithuanian dictionary collation (Firebird 2.0)
WIN1257_EEEstonian dictionary collation (Firebird 2.0)
WIN1258VietnameseVietnamese (Firebird 2.0)
MS-DOS, dBASE and Paradox compatibility
DOS437Western Europe, AmericaDOS437English/USA
DB_DEU437dBASE German
DB_ESP437dBASE Spanish
DB_FIN437dBASE Finnish
DB_FRA437dBASE French
DB_ITA437dBASE Italian
DB_NLD437dBASE Dutch
DB_SVE437dBASE Swedisch
DB_UK437dBASE English/UK
DB_US437dBASE English/US
PDOX_ASCIIParadox ASCII code page
PDOX_INTLParadox International English code page
PDOX_SWEDFINParadox Swedish/Finnish code page
DOS737GreekDOS737Greek
DOS775BalticDOS775Baltic
DOS850Western Europe, AmericaDOS850Latin-1 (without Euro € symbol)
DB_DEU850dBASE German
DB_ESP850dBASE Spanish
DB_FRA850dBASE French/France
DB_FRC850dBASE French/Canada
DB_ITA850dBASE Italian
DB_NLD850dBASE Dutch
DB_PTB850dBASE Portuguese/Brasil
DB_SVE850dBASE Swedish
DB_UK850dBASE English/UK
DB_US850dBASE English/USA
DOS852Central EuropeDOS852Latin-2 (Central Europe)
DB_CSYdBASE Czech
DB_PLKdBASE Polish
DB_SLOdBASE Slovakian
PDOX_CSYParadox Czech
PDOX_HUNParadox Hungarian
PDOX_PLKParadox Polish
PDOX_SLOParadox Slovakian
DOS857TurkishDOS857Turkish
DB_TRKdBASE Turkish
DOS858DOS858Latin-1 plus Euro symbol €
DOS860PortugueseDOS860Portuguese
DB_PTG860dBASE Portuguese
DOS861IcelandicDOS861Icelandic
PDOX_ISLParadox Icelandic
DOS862HebrewDOS862Hebrew
DOS863Canadian FrenchDOS863French/Canada
DB_FRC863dBASE French/Canada
DOS864ArabicDOS864Arabic
DOS865ScandinavianDOS865Nordic
DB_NOR865dBASE Norwegian
DB_DAN865dBASE Danish
PDOX_NORDAN4Paradox Norwegian
DOS866RussianDOS866Russian
DOS869GreekDOS869Modern Greek
Others
BIG_5ChineseBIG_5Chinese
KOI8RRussianRussian character set and dictionary collation (Firebird 2.0)
KOI8UUkrainianUkrainian character set and dictionary collation (Firebird 2.0)
CYRLRussian/UkrainianCYRLCyrillic
DB_RUSdBASE Russian
PDOX_CYRLParadox Cyrillic
KSC_5601KoreanKSC_5601Unified Korean Hangeul, also known as windows-949
KSC_DICTIONARYKorean dictionary ordering
NEXTNeXT ComputersNEXTNeXTSTEP encoding
NXT_DEUGerman
NXT_ESPSpanish
NXT_FRAFrench
NXT_ITAItalian
NXT_USUS-English
SJIS_0208JapaneseSJIS_0208Shift-JIS
EUCJ_0208JapaneseEUCJ_0208EUC Japanese
GB_2312ChineseGB_2312Simplified Chinese (HongKong, PRC), a subset of GBK/windows-936
CP943CJapaneseCP943C_UNICODEJapanese character set (Firebird 2.1)
TIS620ThaiTIS620_UNICODEThai character set, single byte (Firebird 2.1)

Which one to choose?

The question now is: which character set do I choose for my database?

Note: I don't have any experience with Asian scripts (Chinese, Korean, Japanese) so I can't give you any hint on these

  • You should chose the DOS, dBASE and Paradox character sets only if you have legacy applications to support
  • The WINxxx character sets are extensions of the corresponding ISOxxx character sets, however you will have problems on non-Windows systems. So if you have a cross-platform application, stay with the ISOxxx character sets.
  • The ISOxxx character sets are missing a few characters of the WINxxx character sets (like typographic dash signs) or they can have some different characters, so your application must be prepared to handle this.

Unicode?

The Unicode situation dramatically improved with Firebird 2.0. Now there is the new UTF8 character set that correctly handles Unicode strings in UTF-8 format. The Unicode collation algorithm has been implemented so now you can use UPPER() and the new LOWER() function without the need to specify a collation.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值