iconv

Latest recommended article published on 2023-09-25 16:04:28

cuihui8789

National Language Support Guide and Reference

Using the iconv Command

Any converter installed in the system can be used through the iconv command, which uses the iconv library. The iconv command acts as a filter for converting from one code set to another. For example, the following command filters data from PC Code (IBM-850) to ISO8859-1:

 iconv -f IBM-850 -t ISO8859-1 File

The iconv command converts the encoding of characters read from either standard input or the specified file and then writes the results to standard output.

Understanding libiconv

The iconv application programming interface (API) consists of the following subroutines that accomplish conversion:

iconv_open

Performs the initialization required to convert characters from the code set specified by the FromCode parameter to the code set specified by the ToCode parameter. The strings specified are dependent on the converters installed in the system. If initialization is successful, the converter descriptor, iconv_t, is returned in its initial state.

iconv

Invokes the converter function using the descriptor obtained from the iconv_open subroutine. The inbuf parameter points to the first character in the input buffer, and the inbytesleft parameter indicates the number of bytes to the end of the buffer being converted. The outbuf parameter points to the first available byte in the output buffer, and the outbytesleft parameter indicates the number of available bytes to the end of the buffer.

For state-dependent encoding, the subroutine is placed in its initial state by a call for which the inbuf value is a null pointer. Subsequent calls with the inbuf parameter as something other than a null pointer cause the internal state of the function to be altered as necessary.

iconv_close

Closes the conversion descriptor specified by the cd variable and makes it usable again.
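
The three subroutines above are typically called in sequence. The following is a minimal sketch of that sequence, not code from the original guide: the helper name convert_buffer and its parameters are hypothetical, and the code set names a system accepts depend on the converters installed.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes at `in` (code set `from`)
 * into `out` (code set `to`). Returns the number of bytes written,
 * or (size_t)-1 on any failure. */
size_t convert_buffer(const char *to, const char *from,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);   /* ToCode first, FromCode second */
    if (cd == (iconv_t)-1)
        return (size_t)-1;               /* no matching converter found */

    char *inbuf = (char *)in;            /* iconv only reads the input bytes */
    char *outbuf = out;
    size_t inleft = inlen, outleft = outcap;

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    if (rc != (size_t)-1)
        /* A null inbuf resets the state and emits any closing shift sequence. */
        rc = iconv(cd, NULL, NULL, &outbuf, &outleft);

    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

A caller would pass an output buffer large enough for the converted text and compare the return value with (size_t)-1 before using the result.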

In a network environment, the following factors determine how data should be converted:

Code sets of the sender and the receiver
Communication protocol (8-bit or 7-bit data)

The following table outlines the conversion methods and recommends how to convert data in different situations. See the Interchange Converters--7-bit and the Interchange Converters--8-bit for more information.

Outline of Methods and Recommended Choices
Method to choose	Same code set, 7-bit only protocol	Same code set, 8-bit protocol	Different or unknown code set, 7-bit only protocol	Different or unknown code set, 8-bit protocol
as is	Not valid	Best choice	Not valid	Not valid if remote code set is unknown
fold7	OK	OK	Best choice	OK
fold8	Not valid	OK	Not valid	Best choice
uucode	Best choice	OK	Not valid	Not valid

If the sender uses the same code set as the receiver, the following possibilities exist:

When the protocol allows 8-bit data, the data can be sent without conversion.
When the protocol allows only 7-bit data, the 8-bit code points must be mapped to 7-bit values. Use the iconv interface and one of the following methods:

uucode	Provides the same mapping as the uuencode and uudecode commands. This is the recommended method. For more information, see Interchange Converters--uucode.
7-bit	Converts internal code sets using 7-bit data. This method passes ASCII without any change. For more information, see Interchange Converters--7-bit.

If the sender uses a code set different from the receiver, there are two possibilities:

When the protocol allows only 7-bit data, use the fold7 method.
When the protocol allows 8-bit data and you know the receiver's code set, use the iconv interface to convert the data directly. If you do not know the receiver's code set, use the following method:

8-bit

Converts internal code sets to standard interchange formats. The 8-bit data is transmitted and the information is preserved so that the receiver can reconstruct the data in its code set. For more information, see Interchange Converters--8-bit.
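
The recommendations above can be summarized as a small decision function. This is only a sketch of the "best choice" entries described in this section; the enum and the function name choose_method are hypothetical and are not part of the iconv library.

```c
#include <stdbool.h>

/* Possible outcomes, named after the methods discussed above. */
enum method {
    METHOD_AS_IS,    /* send the data unchanged */
    METHOD_DIRECT,   /* convert straight to the receiver's code set */
    METHOD_UUCODE,
    METHOD_FOLD7,
    METHOD_FOLD8
};

/* Hypothetical helper: pick the recommended method for one transfer. */
enum method choose_method(bool same_code_set, bool eight_bit_protocol,
                          bool receiver_code_set_known)
{
    if (same_code_set)
        return eight_bit_protocol ? METHOD_AS_IS : METHOD_UUCODE;
    if (!eight_bit_protocol)
        return METHOD_FOLD7;
    return receiver_code_set_known ? METHOD_DIRECT : METHOD_FOLD8;
}
```

The function returns only the recommended choice for each case; other entries marked OK in the table remain valid alternatives.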

Using the iconv_open Subroutine

The following examples illustrate how to use the iconv_open subroutine in different situations:

When the sender and receiver use the same code sets, and if the protocol allows 8-bit data, you can send data without converting it. If the protocol allows only 7-bit data, do the following:

Sender:
 cd = iconv_open("uucode", nl_langinfo(CODESET));


Receiver:
 cd = iconv_open(nl_langinfo(CODESET), "uucode");

When the sender and receiver use different code sets, and if the protocol allows 8-bit data and the receiver's code set is unknown, do the following:

Sender:
 cd = iconv_open("fold8", nl_langinfo(CODESET));


Receiver:
 cd = iconv_open(nl_langinfo(CODESET), "fold8");

If the protocol allows only 7-bit data, do the following:

Sender:
 cd = iconv_open("fold7", nl_langinfo(CODESET));


Receiver:
 cd = iconv_open(nl_langinfo(CODESET), "fold7");
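
The calls above do not show error handling. The following sketch wraps the sender-side call and reports a failed open; the helper name open_sender is hypothetical, and whether a method such as uucode, fold7, or fold8 is available depends on the converters installed on the system.

```c
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open a sender-side converter from the current
 * locale's code set to the interchange method named by `method`.
 * Returns (iconv_t)-1 if no such converter can be opened. */
iconv_t open_sender(const char *method)
{
    const char *codeset = nl_langinfo(CODESET);
    iconv_t cd = iconv_open(method, codeset);
    if (cd == (iconv_t)-1) {
        int err = errno;   /* save errno before calling other functions */
        fprintf(stderr, "iconv_open(\"%s\", \"%s\"): %s\n",
                method, codeset, strerror(err));
    }
    return cd;
}
```

A program would normally call setlocale(LC_ALL, "") first so that nl_langinfo(CODESET) reflects the user's locale, and the receiver would use the same pattern with the two arguments to iconv_open swapped.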

The iconv_open subroutine uses the LOCPATH environment variable to search for a converter whose name is in the following form:

 iconv/FromCodeSet_ToCodeSet

The FromCodeSet string represents the sender's code set, and the ToCodeSet string represents the receiver's code set. The underscore character separates the two strings.

Note:

All setuid and setgid programs ignore the LOCPATH environment variable.

Because the iconv converter is a loadable object module, a different object is required when running in the 64-bit environment. In the 64-bit environment, the iconv_open routine uses the LOCPATH environment variable to search for a converter whose name is in the following form:

The iconv library automatically chooses whether to load the standard converter object or the 64-bit converter object. If the iconv_open subroutine does not find the converter, it uses the from,to pair to search for a file that defines a table-driven conversion. The file contains a conversion table created by the genxlt command.

The iconvTable converter uses the LOCPATH environment variable to search for a file whose name is in the following form:

 iconvTable/FromCodeSet_ToCodeSet

If the converter is found, it performs a load operation and is initialized. The converter descriptor, iconv_t, is returned in its initial state.

Converter Programs versus Tables

Converter programs are executable functions that convert data according to a set of rules. Converter tables are single-byte conversion tables that perform stateless conversions. Programs and tables are in separate directories, as follows:

/usr/lib/nls/loc/iconv	Converter programs
/usr/lib/nls/loc/iconvTable	Converter tables

After a converter program is compiled and linked with the libiconv.a library, the program is placed in the /usr/lib/nls/loc/iconv directory.

To build a table converter, build a source converter table file. Use the genxlt command to compile translation tables into a format understood by the table converter. The output file is then placed in the /usr/lib/nls/loc/iconvTable directory.

Unicode and Universal Converters

Unicode (or UCS-2) conversion tables are found in:

The $LOCPATH/uconv/UCSTBL converter program is used to perform the conversion to and from UCS-2 using the iconv utilities.

A Universal converter program is provided that can be used to convert between any two code sets whose conversions to and from UCS-2 are defined. Given the following uconv tables:

a universal conversion can be defined that maps the following:

by use of the $LOCPATH/iconv/Universal_UCS_Conv.

Universal UCS Converter

UCS-2 is a universal 16-bit encoding that can be used as an interchange medium to provide conversion capability between virtually any code sets. The conversion can be accomplished using the Universal UCS Converter, which converts between any two code sets XXX and YYY as follows:

 XXX <-> UCS-2 <-> YYY

The XXX and YYY conversions must be included in the supported List of UCS-2 Interchange Converters, and must be installed on the system.

The universal converter is installed as the file /usr/lib/nls/loc/iconv/Universal_UCS_Conv.

The conversion between multibyte and wide character code depends on the current locale setting. Do not exchange wide character codes between two processes, unless you have knowledge that each locale that might be used handles wide character codes in a consistent fashion. Most locales for this operating system use the Unicode character value as a wide character code, except locales based on IBM-eucTW codesets.

Using Converters

The iconv interface is a set of the following subroutines used to open, perform, and close conversions:

iconv_open
iconv
iconv_close

Code Set Conversion Filter Example

The following example shows how you can use these subroutines to create a code set conversion filter that accepts the ToCode and FromCode parameters as input arguments:
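
The example itself is not reproduced in this copy of the text. As a stand-in, here is a minimal sketch of such a filter, written as a function over two streams; it is not the original IBM example, and the name convert_stream and the buffer sizes are assumptions.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy `in` to `out`, converting from code set
 * `from` to code set `to`. Returns 0 on success, -1 on failure. */
int convert_stream(FILE *in, FILE *out, const char *to, const char *from)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;

    char inbuf[4096], outbuf[4096];
    size_t pending = 0;   /* bytes of an incomplete character carried over */
    int status = 0;

    while (status == 0) {
        size_t got = fread(inbuf + pending, 1, sizeof inbuf - pending, in);
        if (got == 0) {
            if (pending > 0 || ferror(in))
                status = -1;          /* truncated character or read error */
            break;
        }
        char *ip = inbuf;
        size_t inleft = pending + got;

        while (inleft > 0) {
            char *op = outbuf;
            size_t outleft = sizeof outbuf;
            size_t rc = iconv(cd, &ip, &inleft, &op, &outleft);
            int err = errno;          /* save before fwrite can change it */
            fwrite(outbuf, 1, (size_t)(op - outbuf), out);
            if (rc != (size_t)-1 || err == E2BIG)
                continue;             /* chunk finished, or output buffer was full */
            if (err != EINVAL)
                status = -1;          /* invalid input sequence */
            break;                    /* EINVAL: character split across reads */
        }
        pending = inleft;
        memmove(inbuf, ip, pending);  /* keep the unconverted tail for the next read */
    }

    if (status == 0) {
        char *op = outbuf;
        size_t outleft = sizeof outbuf;
        /* A null inbuf emits any closing shift sequence. */
        if (iconv(cd, NULL, NULL, &op, &outleft) == (size_t)-1)
            status = -1;
        else
            fwrite(outbuf, 1, (size_t)(op - outbuf), out);
    }
    iconv_close(cd);
    return status;
}
```

A command-line filter could call convert_stream(stdin, stdout, ToCode, FromCode) from main with the two code set names taken from its arguments, and exit with a nonzero status when the function returns -1.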

Naming Converters

Code set names are in the form CodesetRegistry-CodesetEncoding where:

CodesetRegistry	Identifies the registration authority for the encoding. The CodesetRegistry must be made of characters from the portable code set (usually A-Z and 0-9).
CodesetEncoding	Identifies the coded character set defined by the registered authority.

The from,to variable used by the iconv command and iconv_open subroutine identifies a file whose name should be in the form /usr/lib/nls/loc/iconv/%f_%t or /usr/lib/nls/loc/iconvTable/%f_%t, where:

%f	Represents the FromCode set name
%t	Represents the ToCode set name
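
To illustrate the naming rule, the sketch below builds such a file name from a directory and a from,to pair. The helper converter_path is hypothetical; the iconv library performs this lookup internally, so applications normally have no need to construct the path themselves.

```c
#include <stdio.h>

/* Hypothetical helper: write "<dir>/<from>_<to>" into buf.
 * Returns 0 on success, -1 if the result does not fit in cap bytes. */
int converter_path(char *buf, size_t cap, const char *dir,
                   const char *from, const char *to)
{
    int n = snprintf(buf, cap, "%s/%s_%s", dir, from, to);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

For example, passing the directory /usr/lib/nls/loc/iconvTable with the pair IBM-850 and ISO8859-1 yields the name of the table listed later in this document as IBM-850_ISO8859-1.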

List of Converters

Converters change data from one code set to another. The sets of converters supported with the iconv library are listed in the following sections. All converters shipped with the BOS Runtime Environment are located in the /usr/lib/nls/loc/iconv/* or /usr/lib/nls/loc/iconvTable/* directory.

These directories also contain private converters; that is, they are used by other converters. However, users and programs should only depend on the converters in the following lists.

Any converter shipped with the BOS Runtime Environment and not listed here should be considered private and subject to change or deletion. Converters supplied by other products can be placed in the /usr/lib/nls/loc/iconv/* or /usr/lib/nls/loc/iconvTable/* directory.

Programmers are encouraged to use registered code set names or code set names associated with an application. The X Consortium maintains a registry of code set names for reference. See Code Sets for National Language Support for more information about code sets.

PC, ISO, and EBCDIC Code Set Converters

These converters provide conversion between PC, ISO, and EBCDIC single-byte stateless code sets. The following types of conversions are supported: PC to/from ISO, PC to/from EBCDIC, and ISO to/from EBCDIC.

Conversion is provided between compatible code sets such as Latin-1 to Latin-1 and Greek to Greek. However, conversion between different EBCDIC national code sets is not supported. For information about converting between incompatible character sets, refer to the Interchange Converters--7-bit and the Interchange Converters--8-bit.

Conversion tables in the iconvTable directory are created by the genxlt command.

Compatible Code Set Names

The following table lists code set names that are compatible. Each line defines to/from strings that may be used when requesting a converter.

Note:

The PC and ISO code sets are ASCII-based.

Code Set Compatibility
Character Set	Languages	PC	ISO	EBCDIC
Latin-1	U.S. English, Portuguese, Canadian French	N/A	ISO8859-1	IBM-037
Latin-1	Danish, Norwegian	N/A	ISO8859-1	IBM-277
Latin-1	Finnish, Swedish	N/A	ISO8859-1	IBM-278
Latin-1	Italian	N/A	ISO8859-1	IBM-280
Latin-1	Japanese	N/A	ISO8859-1	IBM-281
Latin-1	Spanish	N/A	ISO8859-1	IBM-284
Latin-1	U.K. English	N/A	ISO8859-1	IBM-285
Latin-1	German	N/A	ISO8859-1	IBM-273
Latin-1	French	N/A	ISO8859-1	IBM-297
Latin-1	Belgian, Swiss German	N/A	ISO8859-1	IBM-500
Latin-2	Croatian, Czechoslovakian, Hungarian, Polish, Romanian, Serbian Latin, Slovak, Slovene	IBM-852	ISO8859-2	IBM-870
Cyrillic	Bulgarian, Macedonian, Serbian Cyrillic, Russian	IBM-855	ISO8859-5	IBM-880 IBM-1025
Cyrillic	Russian	IBM-866	ISO8859-5	IBM-1025
Hebrew	Hebrew	IBM-856 IBM-862	ISO8859-8	IBM-424 IBM-803
Turkish	Turkish	IBM-857	ISO8859-9	IBM-1026
Arabic	Arabic	IBM-864 IBM-1046	ISO8859-6	IBM-420
Greek	Greek	IBM-869	ISO8859-7	IBM-875
Baltic	Lithuanian, Latvian, Estonian	IBM-921 IBM-922		IBM-1112 IBM-1122

Note:

A character that exists in the source code set but does not exist in the target code set is converted to a converter-defined substitute character.

Files

The following table describes the iconvTable converters found in the /usr/lib/nls/loc/iconvTable directory:

iconvTable Converters
Converter Table	Description	Language
IBM-037_IBM-850	IBM-037 to IBM-850	U.S. English, Portuguese, Canadian-French
IBM-273_IBM-850	IBM-273 to IBM-850	German
IBM-277_IBM-850	IBM-277 to IBM-850	Danish, Norwegian
IBM-278_IBM-850	IBM-278 to IBM-850	Finnish, Swedish
IBM-280_IBM-850	IBM-280 to IBM-850	Italian
IBM-281_IBM-850	IBM-281 to IBM-850	Japanese-Latin
IBM-284_IBM-850	IBM-284 to IBM-850	Spanish
IBM-285_IBM-850	IBM-285 to IBM-850	U.K. English
IBM-297_IBM-850	IBM-297 to IBM-850	French
IBM-420_IBM-1046	IBM-420 to IBM-1046	Arabic
IBM-424_IBM-856	IBM-424 to IBM-856	Hebrew
IBM-424_IBM-862	IBM-424 to IBM-862	Hebrew
IBM-500_IBM-850	IBM-500 to IBM-850	Belgian, Swiss German
IBM-803_IBM-856	IBM-803 to IBM-856	Hebrew
IBM-803_IBM-862	IBM-803 to IBM-862	Hebrew
IBM-850_IBM-037	IBM-850 to IBM-037	U.S. English, Portuguese, Canadian-French
IBM-850_IBM-273	IBM-850 to IBM-273	German
IBM-850_IBM-277	IBM-850 to IBM-277	Danish, Norwegian
IBM-850_IBM-278	IBM-850 to IBM-278	Finnish, Swedish
IBM-850_IBM-280	IBM-850 to IBM-280	Italian
IBM-850_IBM-281	IBM-850 to IBM-281	Japanese-Latin
IBM-850_IBM-284	IBM-850 to IBM-284	Spanish
IBM-850_IBM-285	IBM-850 to IBM-285	U.K. English
IBM-850_IBM-297	IBM-850 to IBM-297	French
IBM-850_IBM-500	IBM-850 to IBM-500	Belgian, Swiss German
IBM-856_IBM-424	IBM-856 to IBM-424	Hebrew
IBM-856_IBM-803	IBM-856 to IBM-803	Hebrew
IBM-856_IBM-862	IBM-856 to IBM-862	Hebrew
IBM-862_IBM-424	IBM-862 to IBM-424	Hebrew
IBM-862_IBM-803	IBM-862 to IBM-803	Hebrew
IBM-862_IBM-856	IBM-862 to IBM-856	Hebrew
IBM-864_IBM-1046	IBM-864 to IBM-1046	Arabic
IBM-921_IBM-1112	IBM-921 to IBM-1112	Lithuanian, Latvian
IBM-922_IBM-1122	IBM-922 to IBM-1122	Estonian
IBM-1112_IBM-921	IBM-1112 to IBM-921	Lithuanian, Latvian
IBM-1122_IBM-922	IBM-1122 to IBM-922	Estonian
IBM-1046_IBM-420	IBM-1046 to IBM-420	Arabic
IBM-1046_IBM-864	IBM-1046 to IBM-864	Arabic
IBM-037_ISO8859-1	IBM-037 to ISO8859-1	U.S. English, Portuguese, Canadian French
IBM-273_ISO8859-1	IBM-273 to ISO8859-1	German
IBM-277_ISO8859-1	IBM-277 to ISO8859-1	Danish, Norwegian
IBM-278_ISO8859-1	IBM-278 to ISO8859-1	Finnish, Swedish
IBM-280_ISO8859-1	IBM-280 to ISO8859-1	Italian
IBM-281_ISO8859-1	IBM-281 to ISO8859-1	Japanese-Latin
IBM-284_ISO8859-1	IBM-284 to ISO8859-1	Spanish
IBM-285_ISO8859-1	IBM-285 to ISO8859-1	U.K. English
IBM-297_ISO8859-1	IBM-297 to ISO8859-1	French
IBM-420_ISO8859-6	IBM-420 to ISO8859-6	Arabic
IBM-424_ISO8859-8	IBM-424 to ISO8859-8	Hebrew
IBM-500_ISO8859-1	IBM-500 to ISO8859-1	Belgian, Swiss German
IBM-803_ISO8859-8	IBM-803 to ISO8859-8	Hebrew
IBM-852_ISO8859-2	IBM-852 to ISO8859-2	Croatian, Czechoslovakian, Hungarian, Polish, Romanian, Serbian Latin, Slovak, Slovene
IBM-855_ISO8859-5	IBM-855 to ISO8859-5	Bulgarian, Macedonian, Serbian Cyrillic, Russian
IBM-866_ISO8859-5	IBM-866 to ISO8859-5	Russian
IBM-869_ISO8859-7	IBM-869 to ISO8859-7	Greek
IBM-875_ISO8859-7	IBM-875 to ISO8859-7	Greek
IBM-870_ISO8859-2	IBM-870 to ISO8859-2	Croatian, Czechoslovakian, Hungarian, Polish, Romanian, Serbian Latin, Slovak, Slovene
IBM-880_ISO8859-5	IBM-880 to ISO8859-5	Bulgarian, Macedonian, Serbian Cyrillic, Russian
IBM-1025_ISO8859-5	IBM-1025 to ISO8859-5	Bulgarian, Macedonian, Serbian Cyrillic, Russian
IBM-857_ISO8859-9	IBM-857 to ISO8859-9	Turkish
IBM-1026_ISO8859-9	IBM-1026 to ISO8859-9	Turkish
IBM-850_ISO8859-1	IBM-850 to ISO8859-1	Latin
IBM-856_ISO8859-8	IBM-856 to ISO8859-8	Hebrew
IBM-862_ISO8859-8	IBM-862 to ISO8859-8	Hebrew
IBM-864_ISO8859-6	IBM-864 to ISO8859-6	Arabic
IBM-1046_ISO8859-6	IBM-1046 to ISO8859-6	Arabic
ISO8859-1_IBM-850	ISO8859-1 to IBM-850	Latin
ISO8859-6_IBM-864	ISO8859-6 to IBM-864	Arabic
ISO8859-6_IBM-1046	ISO8859-6 to IBM-1046	Arabic
ISO8859-8_IBM-856	ISO8859-8 to IBM-856	Hebrew
ISO8859-8_IBM-862	ISO8859-8 to IBM-862	Hebrew
ISO8859-1_IBM-037	ISO8859-1 to IBM-037	U.S. English, Portuguese, Canadian French
ISO8859-1_IBM-273	ISO8859-1 to IBM-273	German
ISO8859-1_IBM-277	ISO8859-1 to IBM-277	Danish, Norwegian
ISO8859-1_IBM-278	ISO8859-1 to IBM-278	Finnish, Swedish
ISO8859-1_IBM-280	ISO8859-1 to IBM-280	Italian
ISO8859-1_IBM-281	ISO8859-1 to IBM-281	Japanese-Latin
ISO8859-1_IBM-284	ISO8859-1 to IBM-284	Spanish
ISO8859-1_IBM-285	ISO8859-1 to IBM-285	U.K. English
ISO8859-1_IBM-297	ISO8859-1 to IBM-297	French
ISO8859-1_IBM-500	ISO8859-1 to IBM-500	Belgian, Swiss German
ISO8859-2_IBM-852	ISO8859-2 to IBM-852	Croatian, Czechoslovakian, Hungarian, Polish, Romanian, Serbian Latin, Slovak, Slovene
ISO8859-2_IBM-870	ISO8859-2 to IBM-870	Croatian, Czechoslovakian, Hungarian, Polish, Romanian, Serbian Latin, Slovak, Slovene
ISO8859-5_IBM-855	ISO8859-5 to IBM-855	Bulgarian, Macedonian, Serbian Cyrillic, Russian
ISO8859-5_IBM-880	ISO8859-5 to IBM-880	Bulgarian, Macedonian, Serbian Cyrillic, Russian
ISO8859-5_IBM-1025	ISO8859-5 to IBM-1025	Bulgarian, Macedonian, Serbian Cyrillic, Russian
ISO8859-6_IBM-420	ISO8859-6 to IBM-420	Arabic
ISO8859-5_IBM-866	ISO8859-5 to IBM-866	Russian
ISO8859-7_IBM-869	ISO8859-7 to IBM-869	Greek
ISO8859-7_IBM-875	ISO8859-7 to IBM-875	Greek
ISO8859-8_IBM-424	ISO8859-8 to IBM-424	Hebrew
ISO8859-8_IBM-803	ISO8859-8 to IBM-803	Hebrew
ISO8859-9_IBM-857	ISO8859-9 to IBM-857	Turkish
ISO8859-9_IBM-1026	ISO8859-9 to IBM-1026	Turkish

Multibyte Code Set Converters

Multibyte code-set converters convert characters among the following code sets:

PC multibyte code sets
EUC multibyte code sets (ISO-based)
EBCDIC multibyte code sets

The following table lists code set names that are compatible. Each line defines to/from strings that may be used when requesting a converter.

Code Set Compatibility
Language	PC	ISO	EBCDIC
Japanese	IBM-932	IBM-eucJP	IBM-930, IBM-939
Japanese (MS compatible)	IBM-943	IBM-eucJP	IBM-930, IBM-939
Korean	IBM-934	IBM-eucKR	IBM-933
Traditional Chinese	IBM-938, big-5	IBM-eucTW	IBM-937
Simplified Chinese	IBM-1381	IBM-eucCN	IBM-935

Conversions between Simplified and Traditional Chinese are provided (IBM-eucTW to and from IBM-eucCN, and big5 to and from IBM-eucCN). UTF-8 is an additional code set. See UTF-8 Interchange Converters for more information.

Files

The following list describes the Multibyte Code Set converters that are found in the /usr/lib/nls/loc/iconv directory.

Converter	Description
IBM-eucJP_IBM-932	IBM-eucJP to IBM-932
IBM-eucJP_IBM-943	IBM-eucJP to IBM-943
IBM-eucJP_IBM-930	IBM-eucJP to IBM-930
IBM-eucCN_IBM-936(PC5550)	IBM-eucCN to IBM-936(PC5550)
IBM-eucCN_IBM-935	IBM-eucCN to IBM-935
IBM-eucJP_IBM-939	IBM-eucJP to IBM-939
IBM-eucCN_IBM-1381	IBM-eucCN to IBM-1381
IBM-943_IBM-932	IBM-943 to IBM-932
IBM-932_IBM-943	IBM-932 to IBM-943
IBM-930_IBM-932	IBM-930 to IBM-932
IBM-930_IBM-943	IBM-930 to IBM-943
IBM-930_IBM-eucJP	IBM-930 to IBM-eucJP
IBM-932_IBM-eucJP	IBM-932 to IBM-eucJP
IBM-932_IBM-930	IBM-932 to IBM-930
IBM-943_IBM-eucJP	IBM-943 to IBM-eucJP
IBM-943_IBM-930	IBM-943 to IBM-930
IBM-936(PC5550)_IBM-935	IBM-936(PC5550) to IBM-935
IBM-936_IBM-935	IBM-936 to IBM-935
IBM-932_IBM-939	IBM-932 to IBM-939
IBM-939_IBM-932	IBM-939 to IBM-932
IBM-943_IBM-939	IBM-943 to IBM-939
IBM-939_IBM-943	IBM-939 to IBM-943
IBM-935_IBM-936(PC5550)	IBM-935 to IBM-936(PC5550)
IBM-935_IBM-936	IBM-935 to IBM-936
IBM-1381_IBM-935	IBM-1381 to IBM-935
IBM-935_IBM-1381	IBM-935 to IBM-1381
IBM-935_IBM-eucCN	IBM-935 to IBM-eucCN
IBM-936(PC5550)_IBM-eucCN	IBM-936(PC5550) to IBM-eucCN
IBM-eucTW_IBM-eucCN	IBM-eucTW to IBM-eucCN
big5_IBM-eucCN	big5 to IBM-eucCN
IBM-1381_IBM-eucCN	IBM-1381 to IBM-eucCN
IBM-939_IBM-eucJP	IBM-939 to IBM-eucJP
IBM-eucKR_IBM-934	IBM-eucKR to IBM-934
IBM-934_IBM-eucKR	IBM-934 to IBM-eucKR
IBM-eucKR_IBM-933	IBM-eucKR to IBM-933
IBM-933_IBM-eucKR	IBM-933 to IBM-eucKR
IBM-eucTW_IBM-937	IBM-eucTW to IBM-937
IBM-938_IBM-937	IBM-938 to IBM-937
big-5_IBM-937	big-5 to IBM-937
IBM-eucCN_IBM-eucTW	IBM-eucCN to IBM-eucTW
IBM-937_IBM-eucTW	IBM-937 to IBM-eucTW
IBM-937_IBM-938	IBM-937 to IBM-938
IBM-eucTW_IBM-938	IBM-eucTW to IBM-938
IBM-eucCN_big5	IBM-eucCN to big5

Source: ITPUB blog, http://blog.itpub.net/7343861/viewspace-888299/. If you repost this article, credit the source; otherwise legal liability will be pursued.
