Unicode-enabling Microsoft C/C++ Source Code

Initial Steps for Unicode-enabling Microsoft C/C++ Source

·         Define _UNICODE, undefine _MBCS if defined.

·         Convert literal strings to use L or _T

·         Convert string functions to use Wide or TCHAR versions.

·         Clarify whether string lengths in each API are byte counts or character counts. For character-based display or printing (as opposed to a GUI, which is pixel-based), use column counts, not byte or character counts.

·         Replace character pointer arithmetic with GetNext style, as characters may consist of more than one Unicode code unit.

·         Watch buffer sizes and buffer overflows. Changing encodings may require either larger buffers or shorter string limits. If the character size changes from 1 byte to as many as 4 bytes, and the string length was formerly 20 characters (and 20 bytes), either expand the string buffer(s) from 20 to 80 bytes or limit the string to 5 characters (and therefore 20 bytes). Note that maximum buffer expansion may be constrained (for example, to 64 KB). Reducing string length to a fixed number of characters may break existing applications. Limiting strings to a fixed byte length, for example allowing any string that fits into 20 bytes, is dangerous: simple operations such as uppercasing a string may cause it to grow and exceed the byte length.

·         Replace functions that accept or return a single character with functions that use strings instead. In international text, an operation on a single character may return more than one code point. For example, upper('ß') returns "SS".

·         Use wmain instead of main. The environment is then available through _wenviron instead of _environ.
int wmain( int argc, wchar_t *argv[ ], wchar_t *envp[ ] )

·         MFC Unicode applications use wWinMain as the entry point.
In the Output page of the Linker folder in the project's Property Pages dialog box, set the Entry Point symbol to wWinMainCRTStartup.

·         Consider fonts. Identify the fonts that will render each language or script used.
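The points above about pointer arithmetic, character counts and buffer sizes can be illustrated with a small sketch. It assumes UTF-16 text held in a std::u16string (on Windows, wchar_t strings are also UTF-16); the helper names NextCodePoint, CountCodePoints and BufferBytes are hypothetical and only stand in for whatever "GetNext" style routine a project adopts. Note that a code point is still not always a user-perceived character (combining sequences exist), so this is a lower bound on the care needed, not a complete solution.

```cpp
#include <cstddef>
#include <string>

// True if u is a UTF-16 high (lead) surrogate, the first half of a pair.
inline bool IsHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if u is a UTF-16 low (trail) surrogate, the second half of a pair.
inline bool IsLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical "GetNext" style helper: returns the index of the code unit
// that starts the code point following the one at position i.
inline std::size_t NextCodePoint(const std::u16string& s, std::size_t i) {
    if (i + 1 < s.size() && IsHighSurrogate(s[i]) && IsLowSurrogate(s[i + 1]))
        return i + 2;  // surrogate pair: two code units, one code point
    return i + 1;      // BMP character (or an unpaired surrogate)
}

// Counts code points by stepping with NextCodePoint instead of ++i.
inline std::size_t CountCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); i = NextCodePoint(s, i))
        ++count;
    return count;
}

// Buffer size in bytes for maxCodeUnits UTF-16 code units plus a terminator.
// Sizing by code units (not by "characters") keeps the byte count exact.
inline std::size_t BufferBytes(std::size_t maxCodeUnits) {
    return (maxCodeUnits + 1) * sizeof(char16_t);
}
```

For a string such as u"a\U0001D11Eb", size() reports four code units while CountCodePoints reports three code points, which is exactly the mismatch that plain pointer increments overlook.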

File I/O, Database, Transfer Protocol Considerations

·         Consider whether to read/write UTF-8 or UTF-16 in files, databases, and for data exchange.

·         Consider endian-ness in UTF-16 files.
Read and write big-endian on networks. Use big-endian if you don't produce a BOM.
The endian-ness of a file depends on the file format and/or the architecture of the source or target machine.
When reading files encoded in UTF-16 or UTF-32, be prepared to swap bytes to convert endian-ness.
Also consider streams and transfer protocols, and the encoding used in each.

·         Label files or protocols for data exchange with the correct character encoding. For example, declare UTF-8 or UTF-16 in the HTTP Content-Type header, the HTML meta element, or the XML declaration.

·         Consider the Unicode BOM (Byte Order Mark) and whether it should be written with data. Remove it when reading data.

·         Consider encoding conversion of legacy data and files, import and export, and transfer protocols. (MultiByteToWideChar, WideCharToMultiByte, mbtowc, wctomb, wcstombs, mbstowcs)

·         Consider writing to the Clipboard: use the CF_TEXT format to write native character encoding (ANSI) text, and use the CF_UNICODETEXT format to write Unicode text.

·         Database applications should consider data types (NCHAR, NVARCHAR) and schema changes, triggers, stored procedures, and queries, as well as data storage growth, indexes, and performance.
Note that the Unicode schema changes will have different impacts and concerns on different vendors' databases. If database portability is a requirement, the features and behaviors of each database need to be taken into account.
(I know this item is seriously understated. To be expanded sometime in the future.)
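As a rough illustration of the encoding-conversion item above, here is a portable sketch that turns UTF-16 code units into UTF-8 bytes before writing them to a file or protocol. On Windows the same job is normally done with WideCharToMultiByte and the CP_UTF8 code page; the function name Utf16ToUtf8 and the choice to replace unpaired surrogates with U+FFFD are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Converts UTF-16 code units to UTF-8 bytes.
// Assumption for this sketch: unpaired surrogates become U+FFFD.
inline std::string Utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high surrogate
        }

        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A single UTF-16 code unit can expand to three UTF-8 bytes, so output buffers sized for the wide string are not necessarily large enough for the converted one.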

Stream I/O

Streams are difficult in Microsoft C++. You may run into three types of problems:

1.    Unicode filenames are not supported. The workaround is to open the file with _wfopen, which returns a FILE *, and if needed use that FILE * in subsequent stream I/O.

std::ifstream stm(_wfopen(pFilename, L"r"));

2.    Stream I/O converts Unicode data to and from the native (ANSI) code page on read/write, not UTF-8 or UTF-16. However, the stream class can be modified to read/write UTF-8: you can implement a codecvt facet that converts between Unicode and UTF-8.

std::codecvt<wchar_t, char, std::mbstate_t>

3.    To read/write UTF-16 with stream I/O, use binary opens and binary I/O. To set binary I/O:

_setmode( _fileno( stdin ), _O_BINARY );


Also see the Microsoft run-time library reference: "Unicode Stream I/O in Text and Binary Modes".

Note: There aren't TCHAR equivalents for cout/wcout, cin/wcin, etc. You may want to make your own preprocessor definition for "tout", if you are compiling code both ways.
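One possible shape for such a definition is sketched below. The names tout, tcin, tstring and TSTR are hypothetical (they are not provided by any Microsoft header); the sketch only assumes that _UNICODE is defined for wide builds, as it is for the C run-time generic-text mappings.

```cpp
#include <iostream>
#include <string>

// Hypothetical names: tout, tcin, tstring and TSTR are not part of any
// Microsoft header; they mirror the _T / TCHAR convention for streams.
#ifdef _UNICODE
  #define tout std::wcout
  #define tcin std::wcin
  #define TSTR(s) L##s
  typedef std::wstring tstring;
#else
  #define tout std::cout
  #define tcin std::cin
  #define TSTR(s) s
  typedef std::string tstring;
#endif

// Writes a greeting through whichever stream the build selected.
inline void Greet(const tstring& name) {
    tout << TSTR("Hello, ") << name << TSTR("\n");
}
```

In code that already includes tchar.h, the existing _T macro could be used in place of TSTR; the separate macro here just keeps the sketch free of Windows-only headers.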

Internationalization, Advanced Unicode, Platform and Other Considerations

·         Consider using locale-based routines and further internationalization.

·         For Windows 95, 98 and ME, consider using the Microsoft MSLU (Microsoft Layer for Unicode)

·         Consider string compares and sorting, Unicode Collation Algorithm

·         Consider Unicode Normalization

·         Consider Character Folding

·         Reconsider doing this on your own. Bring in an experienced Unicode consultant, and deploy your existing resources on the tasks they do best. (Hey, an I18nGuy's gotta earn a living...)

Unicode BOM Encoding Values

Encoding Form | BOM Encoding
UTF-8 | EF BB BF
UTF-16 (big-endian) | FE FF
UTF-16 (little-endian) | FF FE
UTF-16BE, UTF-32BE (big-endian) | No BOM!
UTF-16LE, UTF-32LE (little-endian) | No BOM!
UTF-32 (big-endian) | 00 00 FE FF
UTF-32 (little-endian) | FF FE 00 00
SCSU (compression) | 0E FE FF

The Byte Order Mark (BOM) is the Unicode character U+FEFF. (It can also represent a Zero Width No-Break Space.) The byte-swapped value, U+FFFE, is a noncharacter and should never appear in a Unicode character stream. Therefore the BOM can be used as the first character of a file (or, more generally, a string) as an indicator of endian-ness. With UTF-16, if the first 16-bit code unit is read as the value 0xFEFF, then the text has the same endian-ness as the machine reading it. If it is read as 0xFFFE, then the endian-ness is reversed and all 16-bit words should be byte-swapped as they are read in. In the same way, the BOM indicates the endian-ness of text encoded with UTF-32.

Note, however, that not all files start with a BOM. In fact, the Unicode Standard says that UTF-16 or UTF-32 text that does not begin with a BOM should be interpreted as big-endian, unless a higher-level protocol says otherwise.

The character U+FEFF also serves as an encoding signature for the Unicode Encoding Forms. The table shows the encoding of U+FEFF in each of the Unicode encoding forms. Note that by definition, text labeled as UTF-16BE, UTF-32BE, UTF-32LE or UTF-16LE should not have a BOM. The endian-ness is indicated in the label.

For text that is compressed with the SCSU (Standard Compression Scheme for Unicode) algorithm, there is also a recommended signature.
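The signature values in the table can be checked with a few byte comparisons. The sketch below is one way to do it, assuming the raw bytes are already in memory; the names Bom, BomInfo, DetectBom and DecodeUtf16 are hypothetical. DecodeUtf16 also applies the default described above: without a BOM, the bytes are treated as big-endian.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

struct BomInfo {
    Bom kind;            // which signature was found
    std::size_t length;  // number of bytes the signature occupies
};

// Identifies a leading BOM. The 4-byte UTF-32 signatures are tested first
// because FF FE 00 00 begins with the UTF-16 little-endian signature FF FE.
// (UTF-16LE text that starts with a BOM followed by U+0000 is therefore
// reported as UTF-32LE; that ambiguity is inherent in the signatures.)
inline BomInfo DetectBom(const std::vector<std::uint8_t>& b) {
    const std::size_t n = b.size();
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return {Bom::Utf32BE, 4};
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return {Bom::Utf32LE, 4};
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return {Bom::Utf8, 3};
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return {Bom::Utf16BE, 2};
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return {Bom::Utf16LE, 2};
    return {Bom::None, 0};
}

// Decodes bytes known to be UTF-16 into code units. A leading BOM selects the
// byte order and is removed; with no BOM, big-endian is assumed.
inline std::u16string DecodeUtf16(const std::vector<std::uint8_t>& b) {
    bool littleEndian = false;
    std::size_t start = 0;
    if (b.size() >= 2) {
        if (b[0] == 0xFF && b[1] == 0xFE) { littleEndian = true; start = 2; }
        else if (b[0] == 0xFE && b[1] == 0xFF) { start = 2; }
    }
    std::u16string out;
    for (std::size_t i = start; i + 1 < b.size(); i += 2) {
        const unsigned hi = littleEndian ? b[i + 1] : b[i];
        const unsigned lo = littleEndian ? b[i] : b[i + 1];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

Assembling each code unit from individual bytes, rather than casting the buffer and swapping, keeps the decoder independent of the endian-ness of the machine running it.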

 

Data Types

ANSI | Wide | TCHAR
char | wchar_t | _TCHAR
_finddata_t | _wfinddata_t | _tfinddata_t
__finddata64_t | __wfinddata64_t | _tfinddata64_t
_finddatai64_t | _wfinddatai64_t | _tfinddatai64_t
int | wint_t | _TINT
signed char | wchar_t | _TSCHAR
unsigned char | wchar_t | _TUCHAR
char | wchar_t | _TXCHAR
(no prefix) | L | _T or _TEXT
LPSTR (char *) | LPWSTR (wchar_t *) | LPTSTR (_TCHAR *)
LPCSTR (const char *) | LPCWSTR (const wchar_t *) | LPCTSTR (const _TCHAR *)
LPOLESTR (For OLE) | LPWSTR | LPTSTR


Constants and Global Variables

ANSI | Wide | TCHAR
EOF | WEOF | _TEOF
_environ | _wenviron | _tenviron
_pgmptr | _wpgmptr | _tpgmptr


Platform SDK String Functions

Many Windows APIs compile into ANSI or Wide forms, depending on whether the symbol UNICODE is defined. Modules that operate on both ANSI and Wide characters need to be aware of this. Otherwise, code that uses the Character Data Type-independent name requires no changes; just compile with the symbol UNICODE defined.
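The mechanism behind the independent names is a preprocessor mapping. The sketch below imitates it with a made-up function pair so that it compiles anywhere; MyStrLenA, MyStrLenW, MyStrLen, MyTChar and MY_TEXT are hypothetical stand-ins for a real pair such as lstrlenA / lstrlenW and the TCHAR and TEXT definitions in the Windows headers.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical API pair, standing in for an ANSI/Wide pair such as
// lstrlenA / lstrlenW. Both forms always exist and can be called directly.
inline std::size_t MyStrLenA(const char* s)    { return std::strlen(s); }
inline std::size_t MyStrLenW(const wchar_t* s) { return std::wcslen(s); }

// The data-type-independent name is only a macro that picks one of the two,
// depending on whether UNICODE is defined at compile time.
#ifdef UNICODE
  #define MyStrLen MyStrLenW
  typedef wchar_t MyTChar;
  #define MY_TEXT(s) L##s
#else
  #define MyStrLen MyStrLenA
  typedef char MyTChar;
  #define MY_TEXT(s) s
#endif

// Code written against the independent names compiles unchanged either way.
inline std::size_t TitleLength() {
    const MyTChar* title = MY_TEXT("Unicode");
    return MyStrLen(title);
}
```

Note that the Windows headers key on UNICODE while the C run-time generic-text mappings key on _UNICODE, so projects usually define both together.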

The following list is by no means all of the Character Data Type-dependent APIs, just some of the character- and string-related ones. Look in WinNLS.h for some code page and locale related APIs.

ANSI | Wide | Character Data Type-Independent Name
CharLowerA | CharLowerW | CharLower
CharLowerBuffA | CharLowerBuffW | CharLowerBuff
CharNextA | CharNextW | CharNext
CharNextExA | CharNextExW | CharNextEx
CharPrevA | CharPrevW | CharPrev
CharPrevExA | CharPrevExW | CharPrevEx
CharToOemA | CharToOemW | CharToOem
CharToOemBuffA | CharToOemBuffW | CharToOemBuff
CharUpperA | CharUpperW | CharUpper
CharUpperBuffA | CharUpperBuffW | CharUpperBuff
CompareStringA | CompareStringW | CompareString
FoldStringA | FoldStringW | FoldString
GetStringTypeA | GetStringTypeW | GetStringType
GetStringTypeExA | GetStringTypeExW | GetStringTypeEx
IsCharAlphaA | IsCharAlphaW | IsCharAlpha
IsCharAlphaNumericA | IsCharAlphaNumericW | IsCharAlphaNumeric
IsCharLowerA | IsCharLowerW | IsCharLower
IsCharUpperA | IsCharUpperW | IsCharUpper
LoadStringA | LoadStringW | LoadString
lstrcatA | lstrcatW | lstrcat
lstrcmpA | lstrcmpW | lstrcmp
lstrcmpiA | lstrcmpiW | lstrcmpi
lstrcpyA | lstrcpyW | lstrcpy
lstrcpynA | lstrcpynW | lstrcpyn
lstrlenA | lstrlenW | lstrlen
OemToCharA | OemToCharW | OemToChar
OemToCharBuffA | OemToCharBuffW | OemToCharBuff
wsprintfA | wsprintfW | wsprintf
wvsprintfA | wvsprintfW | wvsprintf


TCHAR String Functions

Functions sorted by ANSI name, for ease of converting to Unicode.

ANSI | Wide | TCHAR
_access | _waccess | _taccess
_atoi64 | _wtoi64 | _tstoi64
_atoi64 | _wtoi64 | _ttoi64
_cgets | _cgetws | _cgetts
_chdir | _wchdir | _tchdir
_chmod | _wchmod | _tchmod
_cprintf | _cwprintf | _tcprintf
_cputs | _cputws | _cputts
_creat | _wcreat | _tcreat
_cscanf | _cwscanf | _tcscanf
_ctime64 | _wctime64 | _tctime64
_execl | _wexecl | _texecl
_execle | _wexecle | _texecle
_execlp | _wexeclp | _texeclp
_execlpe | _wexeclpe | _texeclpe
_execv | _wexecv | _texecv
_execve | _wexecve | _texecve
_execvp | _wexecvp | _texecvp
_execvpe | _wexecvpe | _texecvpe
_fdopen | _wfdopen | _tfdopen
_fgetchar | _fgetwchar | _fgettchar
_findfirst | _wfindfirst | _tfindfirst
_findnext64 | _wfindnext64 | _tfindnext64
_findnext | _wfindnext | _tfindnext
_findnexti64 | _wfindnexti64 | _tfindnexti64
_fputchar | _fputwchar | _fputtchar
_fsopen | _wfsopen | _tfsopen
_fullpath | _wfullpath | _tfullpath
_getch | _getwch | _gettch
_getche | _getwche | _gettche
_getcwd | _wgetcwd | _tgetcwd
_getdcwd | _wgetdcwd | _tgetdcwd
_ltoa | _ltow | _ltot
_makepath | _wmakepath | _tmakepath
_mkdir | _wmkdir | _tmkdir
_mktemp | _wmktemp | _tmktemp
_open | _wopen | _topen
_popen | _wpopen | _tpopen
_putch | _putwch | _puttch
_putenv | _wputenv | _tputenv
_rmdir | _wrmdir | _trmdir
_scprintf | _scwprintf | _sctprintf
_searchenv | _wsearchenv | _tsearchenv
_snprintf | _snwprintf | _sntprintf
_snscanf | _snwscanf | _sntscanf
_sopen | _wsopen | _tsopen
_spawnl | _wspawnl | _tspawnl
_spawnle | _wspawnle | _tspawnle
_spawnlp | _wspawnlp | _tspawnlp
_spawnlpe | _wspawnlpe | _tspawnlpe
_spawnv | _wspawnv | _tspawnv
_spawnve | _wspawnve | _tspawnve
_spawnvp | _wspawnvp | _tspawnvp
_spawnvpe | _wspawnvpe | _tspawnvpe
_splitpath | _wsplitpath | _tsplitpath
_stat64 | _wstat64 | _tstat64
_stat | _wstat | _tstat
_stati64 | _wstati64 | _tstati64
_strdate | _wstrdate | _tstrdate
_strdec | _wcsdec | _tcsdec
_strdup | _wcsdup | _tcsdup
_stricmp | _wcsicmp | _tcsicmp
_stricoll | _wcsicoll | _tcsicoll
_strinc | _wcsinc | _tcsinc
_strlwr | _wcslwr | _tcslwr
_strncnt | _wcsncnt | _tcsnbcnt
_strncnt | _wcsncnt | _tcsnccnt
_strncoll | _wcsncoll | _tcsnccoll
_strnextc | _wcsnextc | _tcsnextc
_strnicmp | _wcsnicmp | _tcsncicmp
_strnicmp | _wcsnicmp | _tcsnicmp
_strnicoll | _wcsnicoll | _tcsncicoll
_strnicoll | _wcsnicoll | _tcsnicoll
_strninc | _wcsninc | _tcsninc
_strnset | _wcsnset | _tcsncset
_strnset | _wcsnset | _tcsnset
_strrev | _wcsrev | _tcsrev
_strset | _wcsset | _tcsset
_strspnp | _wcsspnp | _tcsspnp
_strtime | _wstrtime | _tstrtime
_strtoi64 | _wcstoi64 | _tcstoi64
_strtoui64 | _wcstoui64 | _tcstoui64
_strupr | _wcsupr | _tcsupr
_tempnam | _wtempnam | _ttempnam
_ui64toa | _ui64tow | _ui64tot
_ultoa | _ultow | _ultot
_ungetch | _ungetwch | _ungettch
_unlink | _wunlink | _tunlink
_utime64 | _wutime64 | _tutime64
_utime | _wutime | _tutime
_vscprintf | _vscwprintf | _vsctprintf
_vsnprintf | _vsnwprintf | _vsntprintf
asctime | _wasctime | _tasctime
atof | _wtof | _tstof
atoi | _wtoi | _tstoi
atoi | _wtoi | _ttoi
atol | _wtol | _tstol
atol | _wtol | _ttol
character compare | Maps to macro or inline function | _tccmp
character copy | Maps to macro or inline function | _tccpy
character length | Maps to macro or inline function | _tclen
ctime | _wctime | _tctime
fgetc | fgetwc | _fgettc
fgets | fgetws | _fgetts
fopen | _wfopen | _tfopen
fprintf | fwprintf | _ftprintf
fputc | fputwc | _fputtc
fputs | fputws | _fputts
freopen | _wfreopen | _tfreopen
fscanf | fwscanf | _ftscanf
getc | getwc | _gettc
getchar | getwchar | _gettchar
getenv | _wgetenv | _tgetenv
gets | _getws | _getts
isalnum | iswalnum | _istalnum
isalpha | iswalpha | _istalpha
isascii | iswascii | _istascii
iscntrl | iswcntrl | _istcntrl
isdigit | iswdigit | _istdigit
isgraph | iswgraph | _istgraph
islead (Always FALSE) | (Always FALSE) | _istlead
isleadbyte (Always FALSE) | isleadbyte (Always FALSE) | _istleadbyte
islegal (Always TRUE) | (Always TRUE) | _istlegal
islower | iswlower | _istlower
isprint | iswprint | _istprint
ispunct | iswpunct | _istpunct
isspace | iswspace | _istspace
isupper | iswupper | _istupper
isxdigit | iswxdigit | _istxdigit
main | wmain | _tmain
perror | _wperror | _tperror
printf | wprintf | _tprintf
putc | putwc | _puttc
putchar | putwchar | _puttchar
puts | _putws | _putts
remove | _wremove | _tremove
rename | _wrename | _trename
scanf | wscanf | _tscanf
setlocale | _wsetlocale | _tsetlocale
sprintf | swprintf | _stprintf
sscanf | swscanf | _stscanf
strcat | wcscat | _tcscat
strchr | wcschr | _tcschr
strcmp | wcscmp | _tcscmp
strcoll | wcscoll | _tcscoll
strcpy | wcscpy | _tcscpy
strcspn | wcscspn | _tcscspn
strerror | _wcserror | _tcserror
strftime | wcsftime | _tcsftime
strlen | wcslen | _tcsclen
strlen | wcslen | _tcslen
strncat | wcsncat | _tcsncat
strncat | wcsncat | _tcsnccat
strncmp | wcsncmp | _tcsnccmp
strncmp | wcsncmp | _tcsncmp
strncpy | wcsncpy | _tcsnccpy
strncpy | wcsncpy | _tcsncpy
strpbrk | wcspbrk | _tcspbrk
strrchr | wcsrchr | _tcsrchr
strspn | wcsspn | _tcsspn
strstr | wcsstr | _tcsstr
strtod | wcstod | _tcstod
strtok | wcstok | _tcstok
strtol | wcstol | _tcstol
strtoul | wcstoul | _tcstoul
strxfrm | wcsxfrm | _tcsxfrm
system | _wsystem | _tsystem
tmpnam | _wtmpnam | _ttmpnam
tolower | towlower | _totlower
toupper | towupper | _totupper
ungetc | ungetwc | _ungettc
vfprintf | vfwprintf | _vftprintf
vprintf | vwprintf | _vtprintf
vsprintf | vswprintf | _vstprintf
WinMain | wWinMain | _tWinMain