(1) Character Set Encoding
A code point (also called a code value) is the numeric code assigned to a character.
A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as a character set. An encoded character set assigns a unique numeric code to each character in the character set. The numeric codes are called code points or encoded values.
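As a small illustration of the character-to-code mapping described above, the sketch below uses Python's built-in `ord` and `chr`, which work with Unicode code points. Unicode is just one encoded character set, chosen here for convenience; the helper name `code_points` is made up for this example.

```python
def code_points(text):
    """Return the numeric code (code point) of each character in text."""
    return [ord(ch) for ch in text]

print(code_points("Az9"))  # [65, 122, 57]
print(chr(65))             # A
```

Each character maps to exactly one number, and `chr` reverses the mapping.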
A single character set can support multiple languages, but what it can represent is limited by its character repertoire.
Different character sets support different character repertoires. Because character sets are typically based on a particular writing script, they can support multiple languages. When character sets were first developed, they had a limited character repertoire. Even now there can be problems using certain characters across platforms.
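Whatever character set the Oracle database uses, the following characters can always be converted. For any other characters, check whether the database character set supports them.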
The following CHAR and VARCHAR characters are represented in all Oracle Database character sets and can be transported to any platform:
- Uppercase and lowercase English characters A through Z and a through z
- Arabic digits 0 through 9
- The following punctuation marks: % ' ' ( ) * + - , . / \ : ; < > = ! _ & ~ { } | ^ ? $ # @ " [ ]
- The following control characters: space, horizontal tab, vertical tab, form feed
If you are using characters outside this set, then take care that your data is supported in the database character set that you have chosen.
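The check suggested above can be sketched in a few lines of Python. This is only an illustration: the punctuation string is transcribed from the list above, and the names `PORTABLE` and `non_portable_chars` are hypothetical helpers, not anything Oracle provides.

```python
import string

# Portable repertoire from the list above: English letters, digits,
# the listed punctuation marks, and four control characters.
PUNCTUATION = "%'()*+-,./\\:;<>=!_&~{}|^?$#@\"[]"
CONTROLS = " \t\v\f"  # space, horizontal tab, vertical tab, form feed
PORTABLE = frozenset(string.ascii_letters + string.digits + PUNCTUATION + CONTROLS)

def non_portable_chars(text):
    """Return the distinct characters of text outside the portable set, in order."""
    found = []
    for ch in text:
        if ch not in PORTABLE and ch not in found:
            found.append(ch)
    return found

print(non_portable_chars("SELECT 'abc' FROM t;"))  # []
print(non_portable_chars("caf\u00e9 \u4e2d\u6587"))  # ['é', '中', '文']
```

An empty result means the text uses only characters from the portable list. Anything returned needs to be checked against the chosen database character set.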
- How are Characters Encoded?
- Single-Byte Encoding Schemes
Each character is stored in one byte.
Single-byte encoding schemes are efficient. They take up the least amount of space to represent characters and are easy to process and program with because one character can be represented in one byte.
Single-byte encoding schemes are classified as one of the following types:
- 7-bit encoding schemes
Single-byte 7-bit encoding schemes can define up to 128 characters and normally support just one language. One of the most common single-byte character sets, used since the early days of computing, is ASCII (American Standard Code for Information Interchange).
- 8-bit encoding schemes
Single-byte 8-bit encoding schemes can define up to 256 characters and often support a group of related languages. One example is ISO 8859-1, which supports many Western European languages. The following figure shows the ISO 8859-1 8-bit encoding scheme.
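The difference between the 7-bit and 8-bit cases can be seen with Python's standard codecs. In this sketch, `ascii` stands for the 7-bit scheme and `latin-1` is Python's name for ISO 8859-1. The sample word is arbitrary.

```python
text = "caf\u00e9"  # "café": the last character is outside 7-bit ASCII

# ISO 8859-1 stores every character of this word in exactly one byte.
latin1_bytes = text.encode("latin-1")
print(len(text), len(latin1_bytes))  # 4 4

# 7-bit ASCII has no code for the accented letter, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print("not representable in ASCII:", exc.object[exc.start])

# All 128 seven-bit values are valid ASCII; all 256 byte values are valid ISO 8859-1.
print(len(bytes(range(128)).decode("ascii")))     # 128
print(len(bytes(range(256)).decode("latin-1")))   # 256
```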
- Multibyte Encoding Schemes
Multibyte encoding schemes are used for Asian languages such as Chinese and Japanese because these languages use thousands of characters. These encoding schemes use either a fixed number or a variable number of bytes to represent each character.
- Fixed-width multibyte encoding schemes
In a fixed-width multibyte encoding scheme, each character is represented by a fixed number of bytes. The number of bytes is at least two in a multibyte encoding scheme.
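A quick way to see the fixed-width property is to measure the encoded size of each character. The sketch below uses UTF-32 as a stand-in for a fixed-width multibyte scheme. That choice is an assumption made for illustration and is not named in the text above. `bytes_per_char` is a hypothetical helper.

```python
def bytes_per_char(text, encoding):
    """Return the encoded size in bytes of each character of text."""
    return [len(ch.encode(encoding)) for ch in text]

sample = "a\u00e9\u4e2d"  # Latin letter, accented letter, Chinese ideograph
print(bytes_per_char(sample, "utf-32-be"))  # [4, 4, 4]

# With a fixed width, character n always starts at byte n * width.
encoded = sample.encode("utf-32-be")
third = encoded[2 * 4:3 * 4].decode("utf-32-be")
print(third == sample[2])  # True
```

Because every character has the same size, the byte offset of any character can be computed without scanning the data.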
- Variable-width multibyte encoding schemes
A variable-width encoding scheme uses one or more bytes to represent a single character. Some multibyte encoding schemes use certain bits to indicate the number of bytes that represent a character. For example, if two bytes is the maximum number of bytes used to represent a character, then the most significant bit can be used to indicate whether that byte is a single-byte character or the first byte of a double-byte character.
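The most-significant-bit rule described above can be sketched as a small splitter. The example assumes a scheme with at most two bytes per character and uses Python's `gb2312` codec as sample data, since its double-byte characters start with a byte that has the high bit set. `split_characters` is a hypothetical helper and does no validation.

```python
def split_characters(data):
    """Split bytes into per-character units using the most significant bit.

    A byte with the high bit clear is a single-byte character; a byte with
    the high bit set is the first byte of a double-byte character.
    """
    units = []
    i = 0
    while i < len(data):
        width = 2 if data[i] & 0x80 else 1
        units.append(data[i:i + width])
        i += width
    return units

encoded = "A\u4e2dB".encode("gb2312")  # "A", one Chinese character, "B"
units = split_characters(encoded)
print([len(u) for u in units])  # [1, 2, 1]
```

Unlike the fixed-width case, finding the n-th character requires scanning from the start, because each character's width depends on its first byte.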
- Shift-sensitive variable-width multibyte encoding schemes
Some variable-width encoding schemes use control codes to differentiate between single-byte and multibyte characters with the same code values. A shift-out code indicates that the following character is multibyte. A shift-in code indicates that the following character is single-byte. Shift-sensitive encoding schemes are used primarily on IBM platforms. Note that ISO-2022 character sets cannot be used as database character sets, but they can be used for applications such as a mail server.
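The role of the shift codes can be sketched with a toy parser. Everything here is an assumption for illustration: the stream format is made up, the multibyte width is fixed at two, and the ASCII control codes SO (0x0E) and SI (0x0F) are used as the shift-out and shift-in markers. It is not a decoder for any real Oracle or IBM character set.

```python
SHIFT_OUT = 0x0E  # following characters are multibyte
SHIFT_IN = 0x0F   # following characters are single-byte

def split_shifted(data, multibyte_width=2):
    """Split a toy shift-sensitive byte stream into per-character units."""
    units = []
    width = 1  # assume the stream starts in single-byte mode
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == SHIFT_OUT:
            width = multibyte_width
            i += 1
        elif byte == SHIFT_IN:
            width = 1
            i += 1
        else:
            units.append(data[i:i + width])
            i += width
    return units

# The value 0x41 appears twice: once as a single-byte character and
# once as the first byte of a double-byte character.
stream = bytes([0x41, SHIFT_OUT, 0x41, 0x42, SHIFT_IN, 0x42])
print(split_shifted(stream))  # [b'A', b'AB', b'B']
```

The same byte value is interpreted differently depending on the current shift state, which is why such a stream cannot be decoded starting from an arbitrary position.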
- Naming Convention for Oracle Database Character Sets
Oracle Database uses the following naming convention for its character set names:
<region><number of bits used to represent a character><standard character set name>[S|C]
The optional S or C distinguishes character sets that can be used only on the server (S) or only on the client (C).
Keep in mind that:
- You should use the server character set (S) on the Macintosh platform. The Macintosh client character sets are obsolete. On EBCDIC platforms, use the server character set (S) on the server and the client character set (C) on the client.
- UTF8 and UTFE are exceptions to the naming convention.
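The naming convention can be sketched as a small parser. This is a rough illustration, not an Oracle utility: `parse_charset_name` and its regular expression are hypothetical. The trailing S or C is deliberately left inside the last field, since a name such as JA16SJIS ends in S as part of the standard name, so the suffix cannot be split off by pattern alone.

```python
import re

# <region><number of bits><standard character set name>[S|C]
NAME_PATTERN = re.compile(r"([A-Z]+)(\d+)(.+)")

def parse_charset_name(name):
    """Split a character set name into region, bits, and the remainder.

    Returns None for names that do not follow the convention.
    """
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    region, bits, rest = match.groups()
    return {"region": region, "bits": int(bits), "rest": rest}

for name in ["US7ASCII", "WE8ISO8859P1", "JA16SJIS", "AL32UTF8", "UTF8", "UTFE"]:
    print(name, parse_charset_name(name))
```

With this pattern, UTF8 and UTFE come back as `None`, which matches their status as exceptions to the convention.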
The following table shows examples of Oracle Database character set names.
- Subsets and Supersets
The terms subset