ASCII Codes of an NSString Under Different Character Sets

miki西游

Article link: https://blog.csdn.net/mikixiyou/article/details/9302329

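While developing for iOS in Xcode, I never really understood how character sets fit in: once a string has been assigned, which character set is it made of? So I looked at the numeric code of every character in a string and collected a few observations.

Take the string "abc美国人123", which mixes English letters, Chinese characters, and digits.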

NSStringEncoding encoding2 = NSUTF8StringEncoding;

NSString *testStr = @"abc美国人123";

for (NSUInteger i = 0; i < [testStr length]; i++) {
    // Numeric value stored by NSString at index i
    unichar c = [testStr characterAtIndex:i];
    NSString *ch = [testStr substringWithRange:NSMakeRange(i, 1)];

    // Number of bytes this character needs when encoded with encoding2
    NSUInteger bytesLeng = [ch lengthOfBytesUsingEncoding:encoding2];

    NSLog(@"testStr[%lu]=%@ = %d,%lu", (unsigned long)i, ch, c, (unsigned long)bytesLeng);
}

The output is as follows:

2013-07-11 15:43:51.918 demo[2561:13d03] testStr[0]=a = 97,1
2013-07-11 15:43:51.918 demo[2561:13d03] testStr[1]=b = 98,1
2013-07-11 15:43:51.918 demo[2561:13d03] testStr[2]=c = 99,1
2013-07-11 15:43:51.918 demo[2561:13d03] testStr[3]=美 = 32654,3
2013-07-11 15:43:51.918 demo[2561:13d03] testStr[4]=国 = 22269,3
2013-07-11 15:43:51.919 demo[2561:13d03] testStr[5]=人 = 20154,3
2013-07-11 15:43:51.919 demo[2561:13d03] testStr[6]=1 = 49,1
2013-07-11 15:43:51.919 demo[2561:13d03] testStr[7]=2 = 50,1
2013-07-11 15:43:51.919 demo[2561:13d03] testStr[8]=3 = 51,1

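The string is encoded with some default character set. The English letters and digits are easy to recognize, since their values match the ASCII character set. But which character set do the Chinese characters use? Is it the default one as well?

My first thought was the operating system: perhaps the string uses the operating system's default character set. So how do I find out what that default is?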

According to the NSString documentation, the class method defaultCStringEncoding returns it. The value I got was 30, which corresponds to NSMacOSRomanStringEncoding.
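The lookup described above can be sketched roughly as follows. This is a minimal sketch rather than the exact code I ran; the returned value depends on the platform and locale, so it may not be 30 elsewhere.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // Ask NSString for the default C string encoding of this environment.
        NSStringEncoding enc = [NSString defaultCStringEncoding];
        // Human-readable name of that encoding constant.
        NSString *name = [NSString localizedNameOfStringEncoding:enc];
        NSLog(@"default C string encoding = %lu (%@)", (unsigned long)enc, name);

        // For comparison: Foundation defines NSMacOSRomanStringEncoding as 30.
        NSLog(@"NSMacOSRomanStringEncoding = %lu",
              (unsigned long)NSMacOSRomanStringEncoding);
    }
    return 0;
}
```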

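That is the value I got inside Xcode, and it actually misled me a little.

The assignment NSString *testStr = @"abc美国人123"; is written in a source file, so the encoding of the string literal should first be considered in terms of the encoding of that file. Whatever encoding the file uses is the encoding in which the literal is written.
In Xcode the text encoding is UTF-8, configured in advance.
This is the same as in the Eclipse IDE. When source files in Eclipse show a lot of garbled characters, the cause is usually a character set conversion.

Therefore, the string literal we see in Xcode is encoded as UTF-8.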

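The string can be converted to other encodings. Before converting, use the method - (BOOL)canBeConvertedToEncoding:(NSStringEncoding)encoding to check whether the string can be converted to the target encoding without loss.

Some character sets can be converted to a superset, some cannot be converted at all, and some conversions cannot be reversed. For example, GB2312 can be converted to GBK, but GBK text cannot in general be converted back to GB2312. It depends on which characters each character set covers.

GB2312 is a subset of GBK, and GBK is a subset of GB18030.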

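UTF-8 (Unicode Transformation Format, 8-bit) may carry a BOM but usually does not. It is a multi-byte encoding designed to handle the world's characters: English characters take 8 bits (one byte), and Chinese characters such as the ones in this example take 24 bits (three bytes). UTF-8 can represent every character in Unicode, so it is an international encoding with broad applicability. UTF-8 text displays in any browser that supports UTF-8, whatever the country. For example, Chinese text encoded as UTF-8 also shows up in a foreign user's English IE, without downloading the Chinese language support pack for IE.
GBK extends the national standard GB2312 and stays compatible with it. GBK encodes each Chinese character in two bytes, with the highest bit of the first byte set to 1 to tell it apart from ASCII, while ASCII characters remain a single byte. GBK covers the Chinese characters in common use. It is a national encoding and less general than UTF-8, although Chinese text stored as UTF-8 takes more space in a database than the same text in GBK.
GBK, GB2312, and similar encodings must go through Unicode to convert to and from UTF-8:
GBK, GB2312 -> Unicode -> UTF-8
UTF-8 -> Unicode -> GBK, GB2312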
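The two conversion paths above can be sketched with NSString acting as the Unicode step in the middle. This is only an illustration: Transcode is a hypothetical helper name, and GB18030 stands in for GBK because that is the constant used elsewhere in this article (the two-byte codes are the same for the character shown).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decode `input` from one encoding into an NSString
// (the Unicode step), then encode that string into the target encoding.
static NSData *Transcode(NSData *input, NSStringEncoding from, NSStringEncoding to) {
    NSString *unicode = [[NSString alloc] initWithData:input encoding:from];
    if (unicode == nil) {
        return nil;  // the bytes are not valid in the source encoding
    }
    return [unicode dataUsingEncoding:to];  // nil if the conversion would lose data
}

int main(void) {
    @autoreleasepool {
        NSStringEncoding gb =
            CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);

        // The two bytes C3 C0 are the GBK/GB18030 code for "美".
        const unsigned char gbBytes[] = {0xC3, 0xC0};
        NSData *gbData = [NSData dataWithBytes:gbBytes length:sizeof gbBytes];

        // GB -> Unicode -> UTF-8, then UTF-8 -> Unicode -> GB.
        NSData *utf8Data = Transcode(gbData, gb, NSUTF8StringEncoding);
        NSData *backToGB = Transcode(utf8Data, NSUTF8StringEncoding, gb);

        NSLog(@"GB18030 %@ -> UTF-8 %@ -> GB18030 %@", gbData, utf8Data, backToGB);
    }
    return 0;
}
```

The UTF-8 result should be the three bytes E7 BE 8E, and converting back should return the original two bytes.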

When a data stream is fetched from the internet, it has to be converted according to its character set, for example NSUTF8StringEncoding, CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000), or NSMacOSRomanStringEncoding.

NSStringEncoding encodings[] = {
    NSUTF8StringEncoding,
    CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000),
    NSMacOSRomanStringEncoding
};

for (int k = 0; k < 3; k++) {
    // Check whether the string converts to this encoding without loss
    BOOL canEncode = [testStr canBeConvertedToEncoding:encodings[k]];

    NSLog(@" encode \"%@\" using encoding %lX", testStr, (unsigned long)encodings[k]);

    if (!canEncode) {
        NSLog(@" Can not encode \"%@\" using encoding %lX", testStr, (unsigned long)encodings[k]);
    } else {
        // Round trip: NSString -> bytes in this encoding -> NSString
        NSData *strData = [testStr dataUsingEncoding:encodings[k]];
        NSString *str = [[NSString alloc] initWithData:strData encoding:encodings[k]];

        for (NSUInteger i = 0; i < [str length]; i++) {
            unichar c = [str characterAtIndex:i];
            NSString *ch = [str substringWithRange:NSMakeRange(i, 1)];
            NSUInteger bytesLeng = [ch lengthOfBytesUsingEncoding:encodings[k]];
            NSLog(@"testStr[%lu]=%@ = %d,%lu", (unsigned long)i, ch, c, (unsigned long)bytesLeng);
        }
    }
}

The output is as follows:

2013-07-11 16:21:49.875 demo[3557:13d03] encode "abc美国人123" using encoding 4
2013-07-11 16:21:49.875 demo[3557:13d03] testStr[0]=a = 97,1
2013-07-11 16:21:49.875 demo[3557:13d03] testStr[1]=b = 98,1
2013-07-11 16:21:49.875 demo[3557:13d03] testStr[2]=c = 99,1
2013-07-11 16:21:49.876 demo[3557:13d03] testStr[3]=美 = 32654,3
2013-07-11 16:21:49.894 demo[3557:13d03] testStr[4]=国 = 22269,3
2013-07-11 16:21:49.894 demo[3557:13d03] testStr[5]=人 = 20154,3
2013-07-11 16:21:49.894 demo[3557:13d03] testStr[6]=1 = 49,1
2013-07-11 16:21:49.895 demo[3557:13d03] testStr[7]=2 = 50,1
2013-07-11 16:21:49.895 demo[3557:13d03] testStr[8]=3 = 51,1

2013-07-11 16:21:49.895 demo[3557:13d03] encode "abc美国人123" using encoding 80000632
2013-07-11 16:21:49.895 demo[3557:13d03] testStr[0]=a = 97,1
2013-07-11 16:21:49.896 demo[3557:13d03] testStr[1]=b = 98,1
2013-07-11 16:21:49.896 demo[3557:13d03] testStr[2]=c = 99,1
2013-07-11 16:21:49.896 demo[3557:13d03] testStr[3]=美 = 32654,2
2013-07-11 16:21:49.896 demo[3557:13d03] testStr[4]=国 = 22269,2
2013-07-11 16:21:49.896 demo[3557:13d03] testStr[5]=人 = 20154,2
2013-07-11 16:21:49.897 demo[3557:13d03] testStr[6]=1 = 49,1
2013-07-11 16:21:49.897 demo[3557:13d03] testStr[7]=2 = 50,1
2013-07-11 16:21:49.897 demo[3557:13d03] testStr[8]=3 = 51,1

2013-07-11 16:21:49.897 demo[3557:13d03] encode "abc美国人123" using encoding 1E
2013-07-11 16:21:49.897 demo[3557:13d03] Can not encode "abc美国人123" using encoding 1E

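The results show three things:

1. With NSUTF8StringEncoding the conversion works. Of course it does, since the text was UTF-8 to begin with.

2. With the GB encoding (GB18030 here, a superset of GBK) the conversion also works. The characters are Chinese characters that this character set covers, so nothing is lost in the conversion.
3. With NSMacOSRomanStringEncoding the conversion fails.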

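Under different character sets, the number of bytes per character differs: a Chinese character takes three bytes in UTF-8 and two in GBK. No surprise there.
What puzzles me is why the numeric code obtained for each character is the same in every encoding.

Take the character "美" as an example: the code value I obtained for it under UTF-8 is 15712189 (0xEFBFBD), and under GBK it is 50112 (0xC3C0).

Is something wrong with the test method?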
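One way to check this: characterAtIndex: returns a UTF-16 code unit of the NSString itself, which does not depend on any target encoding, so the printed value stays the same. To see the encoding-specific value, the string has to be turned into bytes first. The following is a minimal sketch under that assumption; PackedBytes is a hypothetical helper that packs the encoded bytes into a single integer so it can be compared with the numbers above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: encode `s` with `enc` and pack the resulting bytes,
// most significant byte first, into one integer. Returns 0 if `s` cannot
// be encoded (dataUsingEncoding: returns nil in that case).
static unsigned long long PackedBytes(NSString *s, NSStringEncoding enc) {
    NSData *data = [s dataUsingEncoding:enc];
    const unsigned char *bytes = (const unsigned char *)[data bytes];
    unsigned long long value = 0;
    for (NSUInteger i = 0; i < [data length]; i++) {
        value = (value << 8) | bytes[i];
    }
    return value;
}

int main(void) {
    @autoreleasepool {
        NSString *mei = @"美";
        NSStringEncoding gb =
            CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);

        // UTF-16 code unit held by NSString; independent of any byte encoding.
        NSLog(@"characterAtIndex: %d", [mei characterAtIndex:0]);
        // Encoding-specific byte values.
        NSLog(@"UTF-8 bytes packed:   %llu", PackedBytes(mei, NSUTF8StringEncoding));
        NSLog(@"GB18030 bytes packed: %llu", PackedBytes(mei, gb));
    }
    return 0;
}
```

With this sketch the packed values should be 15187598 (0xE7BE8E) for UTF-8 and 50112 (0xC3C0) for GB18030, while characterAtIndex: stays at 32654 (0x7F8E). The value 15712189 quoted above is 0xEFBFBD, which is the UTF-8 form of the replacement character U+FFFD, so it most likely came from a lossy conversion rather than from "美" itself.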