Automatically Detecting Whether Chinese Text Is Encoded in GB18030 or UTF-8

Article link: https://blog.csdn.net/firstboy0513/article/details/7349854

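This article describes a way to tell GB18030 from UTF-8 by examining byte patterns, and provides a C implementation. The check relies on the characteristic byte patterns of the two encodings and was tried on one sample file of each encoding.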


First, look at the byte ranges that Chinese characters occupy in GB18030 and in UTF-8.

GB18030 (two-byte form)
Byte 1: 0x81 ~ 0xFE                       1000 0001 ~ 1111 1110
Byte 2: 0x40 ~ 0x7E                       0100 0000 ~ 0111 1110
    or: 0x80 ~ 0xFE                       1000 0000 ~ 1111 1110

UTF-8 (three-byte form)
Byte 1: 0xE0 ~ 0xEF                       1110 0000 ~ 1110 1111
Byte 2: 0x80 ~ 0xBF                       1000 0000 ~ 1011 1111
Byte 3: 0x80 ~ 0xBF                       1000 0000 ~ 1011 1111
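
The two tables above can be written down as a pair of range checks. This is only a sketch: the helper names is_gb18030_pair and is_utf8_triple are made up for illustration and are not part of the original code, and the UTF-8 check tests the bit pattern only (it does not reject overlong forms or surrogates).

```c
#include <stdbool.h>

/* Hypothetical helper: true if (b1, b2) fall inside the two-byte GB18030 ranges. */
bool is_gb18030_pair(unsigned char b1, unsigned char b2)
{
	bool lead_ok = (b1 >= 0x81 && b1 <= 0xFE);
	bool trail_ok = (b2 >= 0x40 && b2 <= 0x7E) || (b2 >= 0x80 && b2 <= 0xFE);
	return lead_ok && trail_ok;
}

/* Hypothetical helper: true if (b1, b2, b3) match 1110xxxx 10xxxxxx 10xxxxxx. */
bool is_utf8_triple(unsigned char b1, unsigned char b2, unsigned char b3)
{
	return (b1 & 0xF0) == 0xE0
		&& (b2 & 0xC0) == 0x80
		&& (b3 & 0xC0) == 0x80;
}
```

Note that the ranges overlap: the bytes E4 B8 pass the GB18030 pair check, while E4 B8 AD pass the UTF-8 check, so a single character is not always conclusive.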

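The following features can be used to identify the encoding of Chinese characters:
   1. If the first byte begins with bit 0, there is nothing to decide: it is an ASCII character.
   2. If the first byte begins with bit 1 and the second byte begins with bit 0, it must be GB-encoded, because every UTF-8 continuation byte begins with 10.
   3. If the first byte does not begin with 1110, it must be GB-encoded. (This assumes the non-ASCII text consists of Chinese characters, which take three bytes in UTF-8.)
   4. Check several Chinese characters rather than just one.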
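
Rule 4 can be sketched as a scan that keeps going after the first match instead of stopping there. The function name guess_code_type_multi and the max_chars parameter are assumptions made for this illustration; the sketch also inherits the assumption of rule 3 that all non-ASCII text is Chinese characters in three-byte UTF-8.

```c
#include <stddef.h>

/*
 * Hypothetical variant that checks up to max_chars non-ASCII characters.
 * Returns 1 if every one looks like three-byte UTF-8, 0 if any does not
 * (taken as GB18030), -1 if s is NULL or no non-ASCII character was checked.
 */
int guess_code_type_multi(const char *s, size_t max_chars)
{
	if (s == NULL)
		return -1;

	const unsigned char *p = (const unsigned char *)s;
	size_t checked = 0;

	while (*p != '\0' && checked < max_chars) {
		if (*p < 0x80) {
			/* Rule 1: ASCII byte, nothing to decide. */
			p++;
			continue;
		}
		/* Rules 2 and 3: anything other than 1110xxxx 10xxxxxx 10xxxxxx
		 * is taken as GB. Short-circuit evaluation stops at a NUL byte,
		 * so the reads never run past the terminator. */
		if ((p[0] & 0xF0) != 0xE0
			|| (p[1] & 0xC0) != 0x80
			|| (p[2] & 0xC0) != 0x80)
			return 0;
		/* Rule 4: this character passed, move on to the next one. */
		p += 3;
		checked++;
	}
	return checked > 0 ? 1 : -1;
}
```

Checking more characters costs more time but lowers the chance that GB18030 bytes which happen to fit the UTF-8 pattern are misreported.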

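With efficiency in mind, the short function below is enough for a basic result. It decides as soon as it meets the first non-ASCII character: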

/*
 * Get the character code type (UTF-8 or GB18030).
 * @param s the NUL-terminated string to examine.
 * @return the code type: 1 means UTF-8, 0 means GB18030,
 *         -1 means s is NULL or contains only ASCII characters.
 */
int get_character_code_type(const char* s)
{
	if (NULL == s)
	{
		return -1;
	}
	
	int i = 0;
	for(; s[i] != '\0'; i++)
	{
		// ASCII character.
		if (!(s[i] & 0x80))
		{
			continue;
		}
		// Possibly a Hanzi encoded in UTF-8 (1110xxxx 10xxxxxx 10xxxxxx).
		else if(!( (s[i] & 0xF0) ^ 0xE0) 
				&& s[i+1] 
				&& !( (s[i+1] & 0xC0) ^ 0x80) 
				&& s[i+2] 
				&& !( (s[i+2] & 0xC0) ^ 0x80))
		{
			return 1;
		}
		// Not a UTF-8 code.
		else
		{
			return 0;
		}
	}
	
	// No non-ASCII byte was found, so the encoding cannot be determined.
	return -1;
}

Write a small test program to try it out:

#include "char_code.h"
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
	if (argc < 2)
	{
		printf("%s [file_path]\n", argv[0]);
		return -1;
	}
	
	// open file and read buf.
	int f = open(argv[1], O_RDONLY);
	if ( -1 == f )
	{
		fprintf(stderr, "file %s open failed.\n", argv[1]);
		return -1;
	}
	
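	// Read at most 1023 bytes; buf is zero-initialized, so it stays NUL-terminated.
	char buf[1024] = {0};
	ssize_t n = read(f, buf, sizeof(buf) - 1);
	if (n < 0)
	{
		fprintf(stderr, "file %s read failed.\n", argv[1]);
		close(f);
		return -1;
	}
	int ret = get_character_code_type(buf);
	// -1 means no non-ASCII byte was found in the sample.
	fprintf(stdout, "char code type = %s\n",
			(ret == 1 ? "UTF-8" : (ret == 0 ? "GB18030" : "ASCII (undetermined)")));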
	
	
	// close file.
	if ( 0 != close(f))
	{
		fprintf(stderr, "file %s close failed.\n", argv[1]);
		return -1;
	}
	
	return 0;
}

Compile:

gcc test.c -o test -O0 -g3 -Wall

Output:

$ ./test gb18030.txt 
char code type = GB18030
$ ./test utf8.txt 
char code type = UTF-8

References:
   A special case where UTF-8 detection fails
   http://www.kuqin.com/language/20071201/2740.html
   The Unicode signature BOM (Byte Order Mark) problem in UTF-8 files
   http://blog.csdn.net/thimin/article/details/1724393
   UTF-8
   http://zh.wikipedia.org/zh-cn/UTF-8
   GB 18030
   http://zh.wikipedia.org/wiki/GB_18030

Download:

http://download.csdn.net/detail/firstboy0513/4137551