(Unicode) Converting Between UTF-8 and UTF-16

韩搏

Article link: https://blog.csdn.net/hanbo622/article/details/52882438

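I. The origin of Unicode
1. A computer only understands strings of bits such as 0101. Long runs of 0s and 1s are tiring for people to read, so values are usually written in octal, decimal, or hexadecimal; all of these are equivalent. Text, images, audio, and video mean nothing to a computer either, so to represent them they must be converted into numbers according to some rule. An early example is the ASCII character set (American Standard Code for Information Interchange), which uses 7 bits per character and defines 128 characters in total. The byte (8 bits) is the usual basic unit, so a byte holding an ASCII character always has its first bit set to 0, and the remaining seven bits carry the actual content. IBM later extended this scheme to use all 8 bits, for 256 characters in total: when the first bit is 0 the byte still means the same common character as before, and when it is 1 it means one of the additional characters.
2. English letters plus punctuation and similar symbols do not exceed 256 characters, so one byte is enough for them. Other scripts need far more; Chinese alone has well over ten thousand characters, so many other character sets appeared. Exchanging data between character sets then became a problem: one set might use a certain number for the character A while another set uses a different number for it. ASCII could no longer meet the needs of globalization and of interoperability between languages, so two organizations, the Unicode Consortium and ISO, set out to define a single standard in which every character corresponds to exactly one number. ISO named its standard UCS (Universal Character Set), where UCS-2 corresponds roughly to UTF-16 and UCS-4 to UTF-32; the Unicode Consortium simply called its standard Unicode.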

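II. The origin of UTF-8 and UTF-16
1. The first version of Unicode involves two steps. The first is a specification that assigns every character a unique number: Unicode used the numbers 0 to 65535 (2^16 values in total) for all characters, with 0 to 127 representing exactly the same 128 characters as ASCII. The second is deciding how the number for a character (0 to 65535) is turned into a bit string and stored in the computer. Storage raises the question of how many bytes to use, and the different answers gave rise to the UTFs (Unicode Transformation Formats): UTF-8 and UTF-16.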

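III. The difference between UTF-8 and UTF-16

1. UTF-16: in this first version (values up to 65535), the number for every character is stored in two bytes. For text that is all English letters, where one byte per character would do, this is somewhat wasteful.
2. UTF-8: the space used to store each character's number is variable. In this first version a character takes one, two, or three bytes.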
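
The byte counts above can be sketched as two small helper functions. This is a minimal illustration, not part of the original code: the names `utf8_length` and `utf16_length` are hypothetical, the four-byte cases anticipate the second version of Unicode described later in this article, and surrogate values are not treated specially.

```c
#include <stdint.h>

/* Number of bytes UTF-8 needs for one code point (0 if out of range). */
int utf8_length(uint32_t cp)
{
    if (cp < 0x80)     return 1;   /* 0..127: same as ASCII        */
    if (cp < 0x800)    return 2;   /* up to 11 payload bits        */
    if (cp < 0x10000)  return 3;   /* up to 16 payload bits        */
    if (cp < 0x110000) return 4;   /* beyond the first 65536 codes */
    return 0;
}

/* Number of bytes UTF-16 needs for one code point (0 if out of range). */
int utf16_length(uint32_t cp)
{
    if (cp < 0x10000)  return 2;   /* one 16-bit unit  */
    if (cp < 0x110000) return 4;   /* two 16-bit units */
    return 0;
}
```

For example, the letter A (0x41) takes 1 byte in UTF-8 and 2 bytes in UTF-16, while "汉" (0x6C49) takes 3 bytes in UTF-8 and 2 bytes in UTF-16.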

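IV. Strengths and weaknesses of UTF-8 and UTF-16
1. If the text is all English, or English mixed with other scripts but mostly English, UTF-8 saves a lot of space compared with UTF-16.
2. If the text is all Chinese or similar characters, or a mix that is mostly Chinese, UTF-16 saves space (two bytes per character instead of three). There is also the question of error tolerance. A UTF-8 decoder has to inspect the marker bits at the start of every byte, so a byte corrupted in transit spoils the character it belongs to; because lead bytes and continuation bytes are distinguishable, the decoder can pick up again at the next lead byte. UTF-16 has no such markers: a corrupted byte damages only one code unit, but a lost or inserted byte shifts the pairing of every byte that follows.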
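
The space trade-off can be checked with a rough sketch that adds up the encoded size of a sequence of code points in both encodings. The function name `encoded_sizes` is hypothetical, and the sketch assumes every input value is a valid code point no larger than 0x10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds up how many bytes a sequence of code points occupies in UTF-8
 * and in UTF-16. Assumes each value is a valid code point <= 0x10FFFF. */
void encoded_sizes(const uint32_t *cps, size_t n,
                   size_t *utf8_bytes, size_t *utf16_bytes)
{
    size_t i;

    *utf8_bytes = 0;
    *utf16_bytes = 0;
    for (i = 0; i < n; i++) {
        uint32_t cp = cps[i];
        *utf8_bytes  += (cp < 0x80) ? 1 : (cp < 0x800) ? 2
                      : (cp < 0x10000) ? 3 : 4;
        *utf16_bytes += (cp < 0x10000) ? 2 : 4;
    }
}
```

For the five letters of "hello" this gives 5 bytes in UTF-8 against 10 in UTF-16; for the three characters of "程序员" (0x7A0B, 0x5E8F, 0x5458) it gives 9 bytes in UTF-8 against 6 in UTF-16.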

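V. A worked Unicode example
1. For example, the Chinese character "汉" has the Unicode value 6C49 (hexadecimal; 27721 in decimal).
2. "汉" in UTF-16: this is simple, just 01101100 01001001 (16 bits, two bytes). A program that knows the data is UTF-16 parses each pair of bytes as one unit.
3. "汉" in UTF-8: this is more involved. The program reads one byte at a time and uses the marker bits at the start of each byte to decide whether one, two, or three bytes form a unit. The rules are:
   0xxxxxxx: a byte that starts with 0 is a unit by itself, exactly as in ASCII;
   110xxxxx 10xxxxxx: in this format, two bytes form one unit;
   1110xxxx 10xxxxxx 10xxxxxx: in this format, three bytes form one unit.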
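
The three rules above amount to a few mask tests on the first byte of a unit. The following is a minimal sketch with a hypothetical function name, `utf8_sequence_length`; it deliberately covers only the three patterns listed here and reports everything else, including continuation bytes, as 0.

```c
/* Returns how many bytes form the unit introduced by lead byte b,
 * or 0 if b is a continuation byte (10xxxxxx) or matches none of
 * the three patterns listed above. */
int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    return 0;
}
```

A decoder would call this on the first byte of each unit, then check that the following bytes all have the form 10xxxxxx before combining their payload bits.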

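4. Because UTF-16 needs no marker bits, two bytes can represent 2^16 = 65536 characters.
5. UTF-8 carries extra marker bits, so one byte can represent only 2^7 = 128 characters, two bytes only 2^11 = 2048 characters, and three bytes 2^16 = 65536 characters.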

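6. The code for "汉", 27721, is beyond the two-byte range of 0 to 2047, so two bytes are not enough and the format 1110xxxx 10xxxxxx 10xxxxxx is used. The binary form of 27721, grouped as 0110 110001 001001, is filled into the x positions from left to right, most significant bit first, giving 11100110 10110001 10001001 (hexadecimal E6 B1 89).
7. The terms Big-Endian and Little-Endian describe the order in which the bytes of a multi-byte unit are stored, which matters for UTF-16. Big-Endian stores the high byte first (6C 49 for "汉") and Little-Endian stores the low byte first (49 6C). UTF-8 is defined as a sequence of single bytes, so its byte order is fixed.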
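
To make the "汉" example concrete, here is a small sketch that fills the 16 bits of a value into the pattern 1110xxxx 10xxxxxx 10xxxxxx, most significant bits first, and that writes a 16-bit UTF-16 unit as two bytes in either byte order. Both function names are hypothetical, and the first function assumes its input lies in the three-byte range 0x0800 to 0xFFFF.

```c
#include <stdint.h>

/* Encodes a value in the range 0x0800..0xFFFF as three UTF-8 bytes. */
void utf8_encode_3byte(uint16_t cp, unsigned char out[3])
{
    out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* 1110xxxx: top 4 bits    */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F)); /* 10xxxxxx: middle 6 bits */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx: low 6 bits    */
}

/* Writes one 16-bit UTF-16 unit as two bytes in the requested byte order. */
void utf16_unit_to_bytes(uint16_t unit, int big_endian, unsigned char out[2])
{
    unsigned char hi = (unsigned char)(unit >> 8);
    unsigned char lo = (unsigned char)(unit & 0xFF);

    out[0] = big_endian ? hi : lo;
    out[1] = big_endian ? lo : hi;
}
```

For "汉" (0x6C49) the first function produces E6 B1 89, and the second produces 6C 49 in big-endian order and 49 6C in little-endian order.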

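VI. The second version of Unicode
The 65536 values of the first version are clearly not a large number. They are enough for commonly used characters, but not once many special characters are added. So in 1996 a second version appeared, which allows up to four bytes per character. This gives us UTF-8, UTF-16, and UTF-32, and the principle is exactly the same as before. UTF-32 stores every character in 32 bits, that is, 4 bytes. UTF-8 and UTF-16 vary with the character: UTF-8 uses one to four bytes (the original design allowed sequences of up to six bytes, which is why the code below still has cases for them and marks them as illegal), while UTF-16 uses either two or four bytes.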
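
The four-byte form of UTF-16 is a surrogate pair: a value above 0xFFFF is split into two 16-bit units. The sketch below shows the arithmetic with hypothetical function names; the constants 0x10000, 0xD800, 0xDC00, and the shift of 10 correspond to `halfBase`, `UNI_SUR_HIGH_START`, `UNI_SUR_LOW_START`, and `halfShift` in the code of this article. It assumes the input is in the range 0x10000 to 0x10FFFF.

```c
#include <stdint.h>

/* Splits a value in 0x10000..0x10FFFF into a UTF-16 surrogate pair. */
void utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;                  /* 20 significant bits remain */

    *high = (uint16_t)(0xD800 + (v >> 10));     /* upper 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));   /* lower 10 bits */
}

/* Recombines a surrogate pair into the original value. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    return (((uint32_t)high - 0xD800) << 10)
         + ((uint32_t)low - 0xDC00) + 0x10000;
}
```

For example, 0x1F600 becomes the pair D83D DE00, and decoding that pair gives back 0x1F600.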

VII. Code

utf.c

/* ************************************************************************
 *       Filename:  utf.c
 *    Description:  
 *        Version:  1.0
 *        Created:  2016-10-21 09:50:05
 *       Revision:  none
 *       Compiler:  gcc
 *         Author:  YOUR NAME (), 
 *        Company:  
 * ************************************************************************/
#include <stdio.h>
#include <string.h>
#include "utf.h"
static boolean isLegalUTF8(const UTF8 *source, int length)
{
    UTF8 a;
    const UTF8 *srcptr = NULL;
    
    if (NULL == source){
        printf("ERR, isLegalUTF8: source=%p\n", source);
        return FALSE;
    }
    srcptr = source+length;

    switch (length) {
		default:
			printf("ERR, isLegalUTF8 1: length=%d\n", length);
			return FALSE;
		/* Everything else falls through when "TRUE"... */
		case 4:
			if ((a = (*--srcptr)) < 0x80 || a > 0xBF){
				printf("ERR, isLegalUTF8 2: length=%d, a=%x\n", length, a);
				return FALSE;
			}
		case 3:
			if ((a = (*--srcptr)) < 0x80 || a > 0xBF){
				printf("ERR, isLegalUTF8 3: length=%d, a=%x\n", length, a);
				return FALSE;
			}
		case 2: 
			if ((a = (*--srcptr)) > 0xBF){
				printf("ERR, isLegalUTF8 4: length=%d, a=%x\n", length, a);
				return FALSE;
			}
			switch (*source)
			{
				/* no fall-through in this inner switch */
				case 0xE0: 
					if (a < 0xA0){
						printf("ERR, isLegalUTF8 1: source=%x, a=%x\n", *source, a);
						return FALSE; 
					}
					break;
				case 0xED:
					if (a > 0x9F){
						printf("ERR, isLegalUTF8 2: source=%x, a=%x\n", *source, a);
						return FALSE; 
					}
					break;
				case 0xF0:
					if (a < 0x90){
						printf("ERR, isLegalUTF8 3: source=%x, a=%x\n", *source, a);
						return FALSE; 
					}
					break;
				case 0xF4:
					if (a > 0x8F){
						printf("ERR, isLegalUTF8 4: source=%x, a=%x\n", *source, a);
						return FALSE; 
					}
					break;
				default:
					if (a < 0x80){
						printf("ERR, isLegalUTF8 5: source=%x, a=%x\n", *source, a);
						return FALSE; 
					}
			}
		case 1: 
			if (*source >= 0x80 && *source < 0xC2){
				printf("ERR, isLegalUTF8: source=%x\n", *source);
				return FALSE;
			}
    }
    if (*source > 0xF4)
		return FALSE;
    return TRUE;
}
ConversionResult Utf8_To_Utf16 (const UTF8* sourceStart, UTF16* targetStart, size_t outLen , ConversionFlags flags)
{
    ConversionResult result = conversionOK;
    const UTF8* source = sourceStart;
    UTF16* target      = targetStart;
    UTF16* targetEnd   = targetStart + outLen/2;
    const UTF8*  sourceEnd = NULL;

    if ((NULL == source) || (NULL == targetStart)){
        printf("ERR, Utf8_To_Utf16: source=%p, targetStart=%p\n", source, targetStart);
        return conversionFailed;
    }
    sourceEnd   = strlen((const char*)sourceStart) + sourceStart;

    while (*source){
        UTF32 ch = 0;
        unsigned short extraBytesToRead = trailingBytesForUTF8[*source];
        if (source + extraBytesToRead >= sourceEnd){
            printf("ERR, Utf8_To_Utf16----sourceExhausted: source=%p, extraBytesToRead=%d, sourceEnd=%p\n", source, extraBytesToRead, sourceEnd);
            result = sourceExhausted;
			break;
        }
        /* Do this check whether lenient or strict */
        if (! isLegalUTF8(source, extraBytesToRead+1)){
            printf("ERR, Utf8_To_Utf16----isLegalUTF8 return FALSE: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
            result = sourceIllegal;
            break;
        }
        /*
        * The cases all fall through. See "Note A" below.
        */
        switch (extraBytesToRead) {
			case 5: ch += *source++; ch <<= 6; /* remember, illegal UTF-8 */
			case 4: ch += *source++; ch <<= 6; /* remember, illegal UTF-8 */
			case 3: ch += *source++; ch <<= 6;
			case 2: ch += *source++; ch <<= 6;
			case 1: ch += *source++; ch <<= 6;
			case 0: ch += *source++;
        }
        ch -= offsetsFromUTF8[extraBytesToRead];

        if (target >= targetEnd) {
            source -= (extraBytesToRead+1); /* Back up source pointer! */
            printf("ERR, Utf8_To_Utf16----target >= targetEnd: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
            result = targetExhausted;
			break;
        }
        if (ch <= UNI_MAX_BMP){
			/* Target is a character <= 0xFFFF */
            /* UTF-16 surrogate values are illegal in UTF-32 */
            if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END){
                if (flags == strictConversion){
                    source -= (extraBytesToRead+1); /* return to the illegal value itself */
                    printf("ERR, Utf8_To_Utf16----ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
                    result = sourceIllegal;
                    break;
                } else {
                    *target++ = UNI_REPLACEMENT_CHAR;
                }
            } else{
                *target++ = (UTF16)ch; /* normal case */
            }
        }else if (ch > UNI_MAX_UTF16){
            if (flags == strictConversion) {
                result = sourceIllegal;
                source -= (extraBytesToRead+1); /* return to the start */
                printf("ERR, Utf8_To_Utf16----ch > UNI_MAX_UTF16: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
                break; /* Bail out; shouldn't continue */
            } else {
                *target++ = UNI_REPLACEMENT_CHAR;
            }
        } else {
            /* target is a character in range 0xFFFF - 0x10FFFF. */
            if (target + 1 >= targetEnd) {
                source -= (extraBytesToRead+1); /* Back up source pointer! */
                printf("ERR, Utf8_To_Utf16----target + 1 >= targetEnd: source=%p, extraBytesToRead=%d\n", source, extraBytesToRead);
                result = targetExhausted; break;
            }
            ch -= halfBase;
            *target++ = (UTF16)((ch >> halfShift) + UNI_SUR_HIGH_START);
            *target++ = (UTF16)((ch & halfMask) + UNI_SUR_LOW_START);
        }
    }
    return result;
}

int Utf16_To_Utf8 (const UTF16* sourceStart, UTF8* targetStart, size_t outLen ,  ConversionFlags flags)
{
    int result = 0;
    const UTF16* source = sourceStart;
    UTF8* target        = targetStart;
    UTF8* targetEnd     = targetStart + outLen;
    
    if ((NULL == source) || (NULL == targetStart)){
        printf("ERR, Utf16_To_Utf8: source=%p, targetStart=%p\n", source, targetStart);
        return conversionFailed;
    }
    
    while ( *source ) {
        UTF32 ch;
        unsigned short bytesToWrite = 0;
        const UTF32 byteMask = 0xBF;
        const UTF32 byteMark = 0x80; 
        const UTF16* oldSource = source; /* In case we have to back up because of target overflow. */
        ch = *source++;
        /* If we have a surrogate pair, convert to UTF32 first. */
        if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_HIGH_END) {
            /* If the 16 bits following the high surrogate are in the source buffer... */
            if ( *source ){
                UTF32 ch2 = *source;
                /* If it's a low surrogate, convert to UTF32. */
                if (ch2 >= UNI_SUR_LOW_START && ch2 <= UNI_SUR_LOW_END) {
                    ch = ((ch - UNI_SUR_HIGH_START) << halfShift) + (ch2 - UNI_SUR_LOW_START) + halfBase;
                    ++source;
                }else if (flags == strictConversion) { /* it's an unpaired high surrogate */
                    --source; /* return to the illegal value itself */
                    result = sourceIllegal;
                    break;
                }
            } else { /* We don't have the 16 bits following the high surrogate. */
                --source; /* return to the high surrogate */
                result = sourceExhausted;
                break;
            }
        } else if (flags == strictConversion) {
            /* UTF-16 surrogate values are illegal in UTF-32 */
            if (ch >= UNI_SUR_LOW_START && ch <= UNI_SUR_LOW_END){
                --source; /* return to the illegal value itself */
                result = sourceIllegal;
                break;
            }
        }
        /* Figure out how many bytes the result will require */
        if(ch < (UTF32)0x80){	     
			bytesToWrite = 1;
        } else if (ch < (UTF32)0x800) {     
            bytesToWrite = 2;
        } else if (ch < (UTF32)0x10000) {  
            bytesToWrite = 3;
        } else if (ch < (UTF32)0x110000){ 
            bytesToWrite = 4;
        } else {	
            bytesToWrite = 3;
            ch = UNI_REPLACEMENT_CHAR;
        }
		
        target += bytesToWrite;
        if (target > targetEnd) {
            source = oldSource; /* Back up source pointer! */
            target -= bytesToWrite; result = targetExhausted; break;
        }
        switch (bytesToWrite) { /* note: everything falls through. */
			case 4: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
			case 3: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
			case 2: *--target = (UTF8)((ch | byteMark) & byteMask); ch >>= 6;
			case 1: *--target = (UTF8)(ch | firstByteMark[bytesToWrite]);
        }
        target += bytesToWrite;
    }
    return result;
}
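int main(int argc, char *argv[])
{
	int i=0;
	UTF8 buf8[256]="";
	UTF16 buf16[256]={0};
	/* UTF8 is unsigned char, so cast for the char-based library calls.
	 * The literal is UTF-8 only if this source file is saved as UTF-8. */
	strcpy((char *)buf8,"程序员");
	Utf8_To_Utf16(buf8,buf16,sizeof(buf16),strictConversion);
	printf("\nUTF-8 => UTF-16 = ");
	/* buf16 was zero-initialized, so the first 0 unit marks the end */
	while(buf16[i])
	{
		printf("%#x  ",buf16[i]);
		i++;
	}

	memset(buf8,0,sizeof(buf8));
	memset(buf16,0,sizeof(buf16));
	/* the UTF-16 code units of the same three characters */
	buf16[0]=0x7a0b;
	buf16[1]=0x5e8f;
	buf16[2]=0x5458;
	Utf16_To_Utf8 (buf16, buf8, sizeof(buf8) , strictConversion);
	printf("\nUTF-16 => UTF-8 = %s\n\n",(const char *)buf8);
	return 0;
}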

utf.h

/* ************************************************************************
 *       Filename:  utf.h
 *    Description:  
 *        Version:  1.0
 *        Created:  2016-10-21 09:50:47
 *       Revision:  none
 *       Compiler:  gcc
 *         Author:  YOUR NAME (), 
 *        Company:  
 * ************************************************************************/
#ifndef __UTF_H__
#define __UTF_H__

#define FALSE  0
#define TRUE   1

#define halfShift	10
#define UNI_SUR_HIGH_START  (UTF32)0xD800
#define UNI_SUR_HIGH_END    (UTF32)0xDBFF
#define UNI_SUR_LOW_START   (UTF32)0xDC00
#define UNI_SUR_LOW_END     (UTF32)0xDFFF
/* Some fundamental constants */
#define UNI_REPLACEMENT_CHAR (UTF32)0x0000FFFD
#define UNI_MAX_BMP (UTF32)0x0000FFFF
#define UNI_MAX_UTF16 (UTF32)0x0010FFFF
#define UNI_MAX_UTF32 (UTF32)0x7FFFFFFF
#define UNI_MAX_LEGAL_UTF32 (UTF32)0x0010FFFF

typedef unsigned char   boolean;
typedef unsigned int	CharType ;
typedef unsigned char	UTF8;
typedef unsigned short	UTF16;
typedef unsigned int	UTF32;

static const UTF32 halfMask = 0x3FFUL;
static const UTF32 halfBase = 0x0010000UL;
static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
static const UTF32 offsetsFromUTF8[6] = { 0x00000000UL, 0x00003080UL, 0x000E2080UL, 0x03C82080UL, 0xFA082080UL, 0x82082080UL };
static const char trailingBytesForUTF8[256] =
{
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
typedef enum 
{
	strictConversion = 0,
	lenientConversion
} ConversionFlags;
typedef enum 
{
	conversionOK, 		/* conversion successful */
	sourceExhausted,	/* partial character in source, but hit end */
	targetExhausted,	/* insuff. room in target for conversion */
	sourceIllegal,		/* source sequence is illegal/malformed */
	conversionFailed
} ConversionResult;
#endif

The output of the run is as follows: