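I came across a great blog today. I couldn't find a way to follow it, so I'm reposting a few of its most useful posts instead.
Reposted from: http://www.cnblogs.com/zhwl/archive/2012/12/31/2840746.html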
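Today I ran into a problem while trying to fetch the home page of Qidian (起点中文网): if you don't use the right encoding, you can't read anything at all.
This is a bad habit I picked up from using C# too much; I rarely had to think about encodings before. In C#, even with the wrong encoding you still get something back,
it's just a screen full of garbage. Cocoa is harsher: the read simply fails, returning nil and an error.
Below is the code I wrote at first, which downloads the Qidian home page into a string.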
NSURL *url = [[NSURL alloc] initWithString:@"http://www.cmfu.com"];
NSError *error = nil;
// Try to decode the downloaded page as UTF-8.
NSString *xml = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:&error];
if (xml == nil)
{
    NSLog(@"Error reading URL: %@", [error localizedFailureReason]);
}
else
{
    // result is an NSMutableString declared elsewhere.
    [result setString:xml];
}
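The download failed no matter what I did, and the error said the encoding was wrong. Fine, I opened the documentation and looked up all the encodings: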
enum {
NSASCIIStringEncoding = 1,
NSNEXTSTEPStringEncoding = 2,
NSJapaneseEUCStringEncoding = 3,
NSUTF8StringEncoding = 4,
NSISOLatin1StringEncoding = 5,
NSSymbolStringEncoding = 6,
NSNonLossyASCIIStringEncoding = 7,
NSShiftJISStringEncoding = 8,
NSISOLatin2StringEncoding = 9,
NSUnicodeStringEncoding = 10,
NSWindowsCP1251StringEncoding = 11,
NSWindowsCP1252StringEncoding = 12,
NSWindowsCP1253StringEncoding = 13,
NSWindowsCP1254StringEncoding = 14,
NSWindowsCP1250StringEncoding = 15,
NSISO2022JPStringEncoding = 21,
NSMacOSRomanStringEncoding = 30,
NSUTF16StringEncoding = NSUnicodeStringEncoding,
NSUTF16BigEndianStringEncoding = 0x90000100,
NSUTF16LittleEndianStringEncoding = 0x94000100,
NSUTF32StringEncoding = 0x8c000100,
NSUTF32BigEndianStringEncoding = 0x98000100,
NSUTF32LittleEndianStringEncoding = 0x9c000100,
};
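I tried them one by one,
and not a single one worked! I was about to lose it. In this day and age, could Cocoa really not support Chinese? Impossible.
My guess was that the documentation lists only the most commonly used encodings (most common in Apple's view, that is, which apparently leaves China out. Boo!),
so I wrote the following code to print every supported encoding: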
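const NSStringEncoding *encodings = [NSString availableStringEncodings];
NSMutableString *str = [[NSMutableString alloc] init];
NSStringEncoding encoding;
// The array returned by availableStringEncodings is terminated by 0.
while ((encoding = *encodings++) != 0)
{
    // Cast to int so the value prints as the signed 32-bit number shown in the list below.
    [str appendFormat:@"%@ === %i\n", [NSString localizedNameOfStringEncoding:encoding], (int)encoding];
}
[result setString:str];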
Sure enough, I had guessed right. Here is the full list of supported encodings:
Western (Mac OS Roman) === 30
Japanese (Mac OS) === -2147483647
Traditional Chinese (Mac OS) === -2147483646
Korean (Mac OS) === -2147483645
Arabic (Mac OS) === -2147483644
Hebrew (Mac OS) === -2147483643
Greek (Mac OS) === -2147483642
Cyrillic (Mac OS) === -2147483641
Devanagari (Mac OS) === -2147483639
Gurmukhi (Mac OS) === -2147483638
Gujarati (Mac OS) === -2147483637
Thai (Mac OS) === -2147483627
Simplified Chinese (Mac OS) === -2147483623
Tibetan (Mac OS) === -2147483622
Central European (Mac OS) === -2147483619
Symbol (Mac OS) === 6
Dingbats (Mac OS) === -2147483614
Turkish (Mac OS) === -2147483613
Croatian (Mac OS) === -2147483612
Icelandic (Mac OS) === -2147483611
Romanian (Mac OS) === -2147483610
Celtic (Mac OS) === -2147483609
Gaelic (Mac OS) === -2147483608
Keyboard Symbols (Mac OS) === -2147483607
Farsi (Mac OS) === -2147483508
Cyrillic (Mac OS Ukrainian) === -2147483496
Inuit (Mac OS) === -2147483412
Unicode (UTF-32LE) === -1677721344
Unicode (UTF-8) === 4
Unicode (UTF-16) === 10
Unicode (UTF-16BE) === -1879047936
Unicode (UTF-16LE) === -1811939072
Unicode (UTF-32) === -1946156800
Unicode (UTF-32BE) === -1744830208
Western (ISO Latin 1) === 5
Central European (ISO Latin 2) === 9
Western (ISO Latin 3) === -2147483133
Central European (ISO Latin 4) === -2147483132
Cyrillic (ISO 8859-5) === -2147483131
Arabic (ISO 8859-6) === -2147483130
Greek (ISO 8859-7) === -2147483129
Hebrew (ISO 8859-8) === -2147483128
Turkish (ISO Latin 5) === -2147483127
Nordic (ISO Latin 6) === -2147483126
Thai (ISO 8859-11) === -2147483125
Baltic Rim (ISO Latin 7) === -2147483123
Celtic (ISO Latin) === -2147483122
Western (ISO Latin 9) === -2147483121
Romanian (ISO Latin 10) === -2147483120
Latin-US (DOS) === -2147482624
Greek (DOS) === -2147482619
Baltic Rim (DOS) === -2147482618
Western (DOS Latin 1) === -2147482608
Greek (DOS Greek 1) === -2147482607
Central European (DOS Latin 2) === -2147482606
Cyrillic (DOS) === -2147482605
Turkish (DOS) === -2147482604
Portuguese (DOS) === -2147482603
Icelandic (DOS) === -2147482602
Hebrew (DOS) === -2147482601
Canadian French (DOS) === -2147482600
Arabic (DOS) === -2147482599
Nordic (DOS) === -2147482598
Cyrillic (DOS) === -2147482597
Greek (DOS Greek 2) === -2147482596
Thai (Windows, DOS) === -2147482595
Japanese (Windows, DOS) === 8
Simplified Chinese (Windows, DOS) === -2147482591
Korean (Windows, DOS) === -2147482590
Traditional Chinese (Windows, DOS) === -2147482589
Western (Windows Latin 1) === 12
Central European (Windows Latin 2) === 15
Cyrillic (Windows) === 11
Greek (Windows) === 13
Turkish (Windows Latin 5) === 14
Hebrew (Windows) === -2147482363
Arabic (Windows) === -2147482362
Baltic Rim (Windows) === -2147482361
Vietnamese (Windows) === -2147482360
Western (ASCII) === 1
Japanese (Shift JIS X0213) === -2147482072
Chinese (GBK) === -2147482063
Chinese (GB 18030) === -2147482062
Japanese (ISO 2022-JP) === 21
Korean (ISO 2022-KR) === -2147481536
Japanese (EUC) === 3
Simplified Chinese (EUC) === -2147481296
Traditional Chinese (EUC) === -2147481295
Korean (EUC) === -2147481280
Japanese (Shift JIS) === -2147481087
Cyrillic (KOI8-R) === -2147481086
Traditional Chinese (Big 5) === -2147481085
Western (Mac Mail) === -2147481084
Simplified Chinese (HZ GB 2312) === -2147481083
Traditional Chinese (Big 5 HKSCS) === -2147481082
Ukrainian (KOI8-U) === -2147481080
Traditional Chinese (Big 5-E) === -2147481079
Western (NextStep) === 2
Non-lossy ASCII === 7
Western (EBCDIC Latin 1) === -2147480574
At last, the familiar GBK encoding; its value is -2147482063. OK, let's change the original code:
NSURL *url = [[NSURL alloc] initWithString:@"http://www.cmfu.com"];
NSError *error = nil;
// GBK: 0x80000631 is the value printed above as -2147482063 (signed 32-bit).
NSStringEncoding gbkEncoding = 0x80000631;
NSString *xml = [NSString stringWithContentsOfURL:url encoding:gbkEncoding error:&error];
if (xml == nil)
{
    NSLog(@"Error reading URL: %@", [error localizedFailureReason]);
}
else
{
    [result setString:xml];
}
Finally done! Seeing the familiar Chinese text was a real thrill.
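A note on that magic number. As far as I can tell, the encodings outside the small built-in NSStringEncoding enum are CoreFoundation encoding IDs with the top bit set, and the negative values in the list are simply those 32-bit numbers printed as signed integers by %i. The plain C sketch below (it also compiles as Objective-C) redoes the arithmetic. The ID 0x0631 for GBK (kCFStringEncodingGBK_95 in CFStringEncodingExt.h) and both helper names are assumptions of this sketch, not something the original post states.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed CoreFoundation ID for GBK (kCFStringEncodingGBK_95). */
#define CF_ENCODING_GBK 0x0631u

/* Hypothetical helper: set the top bit, which is how encodings outside the
   built-in NSStringEncoding enum appear to be numbered. */
static uint32_t ns_encoding_from_cf(uint32_t cf_encoding)
{
    assert((cf_encoding & 0x80000000u) == 0);
    return 0x80000000u | cf_encoding;
}

/* Hypothetical helper: the number %i would show for a 32-bit encoding value. */
static long long as_signed_32(uint32_t value)
{
    return (long long)value - (value >= 0x80000000u ? 4294967296LL : 0LL);
}
```

Under these assumptions, ns_encoding_from_cf(CF_ENCODING_GBK) is 0x80000631, and as_signed_32 of that is the -2147482063 from the list. In real code it is probably safer to call CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGBK_95) than to paste the number, especially in 64-bit builds where NSStringEncoding is wider than 32 bits.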
More on the encoding problem______________________________________________
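Presumably the first attempt failed because the page was served as GBK, and GBK bytes are usually not well-formed UTF-8, so a strict decoder rejects them instead of guessing. The plain C sketch below illustrates the idea with a simplified structural check. The function name is hypothetical, and how Cocoa validates input internally is not something the post shows.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..len) has well-formed UTF-8 structure, 0 otherwise.
   Simplified: checks lead and continuation bytes only, and does not catch
   every overlong or surrogate sequence. */
static int looks_like_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;                          /* continuation bytes expected */
        if (b < 0x80)                    extra = 0;   /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return 0;                         /* stray continuation or bad lead */

        if (extra > len - i - 1)
            return 0;                          /* sequence is cut off */
        for (size_t k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;                      /* not a continuation byte */
        }
        i += extra + 1;
    }
    return 1;
}
```

For example, the character 中 is E4 B8 AD in UTF-8, which passes, but D6 D0 in GBK, which fails because D0 is not a continuation byte.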
// our secure service :-)
NSURL *server = [NSURL URLWithString:@"http://www.cocoanetics.com/feed/"];
NSURLRequest *request = [NSURLRequest requestWithURL:server];
// use synchronous convenience method
NSURLResponse *response = nil;
NSError *error = nil;
NSData *returnedData = [NSURLConnection sendSynchronousRequest:request
                                             returningResponse:&response
                                                         error:&error];
if (!returnedData)
{
    NSLog(@"Error retrieving data, %@", [error localizedDescription]);
    return NO;
}
// get the correct text encoding
// http://stackoverflow.com/questions/1409537/nsdata-to-nsstring-converstion-problem
// fall back to UTF-8 only when the response does not name a usable encoding
NSStringEncoding encoding = NSUTF8StringEncoding;
NSString *encodingName = [response textEncodingName];
if (encodingName != nil)
{
    CFStringEncoding cfEncoding = CFStringConvertIANACharSetNameToEncoding((CFStringRef)encodingName);
    if (cfEncoding != kCFStringEncodingInvalidId)
        encoding = CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}
// output
NSString *xml = [[[NSString alloc] initWithData:returnedData encoding:encoding] autorelease];
NSLog(@"%@", xml);
Tip: instead of hard-coding UTF-8, we take the appropriate encoding from the response.
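As far as I know, the name returned by textEncodingName normally comes from the charset parameter of the response's Content-Type header. The plain C sketch below shows what pulling that parameter out by hand might look like. The function name is hypothetical and the parsing is deliberately loose (it matches "charset=" anywhere in the value), so treat it as an illustration rather than a full header parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the charset parameter of a Content-Type value
   into out (NUL-terminated). Returns 1 if a charset was found, else 0. */
static int charset_from_content_type(const char *value, char *out, size_t out_size)
{
    assert(value != NULL && out != NULL && out_size > 0);
    static const char key[] = "charset=";
    const size_t key_len = strlen(key);

    for (const char *p = value; *p != '\0'; p++) {
        /* Case-insensitive match of "charset=" starting at p. */
        size_t k = 0;
        while (k < key_len && tolower((unsigned char)p[k]) == key[k])
            k++;
        if (k != key_len)
            continue;

        const char *start = p + key_len;
        if (*start == '"')
            start++;                           /* tolerate a quoted value */

        /* Copy up to the closing quote, a ';', whitespace, or the end. */
        size_t n = 0;
        while (start[n] != '\0' && start[n] != '"' && start[n] != ';' &&
               !isspace((unsigned char)start[n]) && n + 1 < out_size) {
            out[n] = start[n];
            n++;
        }
        out[n] = '\0';
        return n > 0;
    }
    out[0] = '\0';
    return 0;
}
```

Given "text/html; charset=GBK" this yields "GBK", which is the kind of IANA name that CFStringConvertIANACharSetNameToEncoding expects.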
____________________________________________________________
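About HTTP request headers
Accept: the MIME types the browser can accept.
Accept-Charset: the character sets the browser can accept.
Accept-Encoding: the content encodings the browser is able to decode, such as gzip.
Accept-Language: the languages the browser prefers.
Authorization: authorization credentials.
Connection: whether a persistent connection is wanted.
Content-Length: the length of the request body.
Cookie: cookies the client sends back to the server.
From: the email address of the person sending the request.
Host: the host and port from the original URL.
If-Modified-Since: return the requested content only if it has been modified after the given date; otherwise return a Not Modified response.
Pragma: the value "no-cache" means the server must return a fresh document, even if it is a proxy server that already has a local copy of the page.
Referer: contains a URL; the user reached the currently requested page from the page at that URL.
User-Agent: the type of browser.
UA-Pixels, UA-Color, UA-OS and UA-CPU: nonstandard request headers giving the screen size, color depth, operating system, and CPU type.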
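To make the list concrete, here is a rough plain C sketch that assembles a minimal GET request using a few of these headers. The function name, the particular header values, and the choice of headers are all assumptions made for illustration; a real client such as NSURLConnection fills most of them in for you.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write a minimal HTTP/1.1 GET request into buf.
   Returns the request length, or -1 if host is empty or buf is too small. */
static int build_get_request(char *buf, size_t buf_size,
                             const char *host, const char *path)
{
    assert(buf != NULL && host != NULL && path != NULL);
    if (strlen(host) == 0)
        return -1;                 /* HTTP/1.1 requires a Host header */

    int n = snprintf(buf, buf_size,
                     "GET %s HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "Accept: text/html\r\n"
                     "Accept-Charset: utf-8, gbk\r\n"
                     "Accept-Language: zh-CN\r\n"
                     "Connection: close\r\n"
                     "\r\n",       /* blank line ends the header section */
                     path, host);
    if (n < 0 || (size_t)n >= buf_size)
        return -1;                 /* output error or truncated */
    return n;
}
```

Calling build_get_request(req, sizeof req, "www.cmfu.com", "/") produces a request whose first line is "GET / HTTP/1.1" and whose headers end with an empty line.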
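HTTP response headers
setContentType: sets the Content-Type header. Most servlets use this method.
setContentLength: sets the Content-Length header. This is useful for browsers that support persistent HTTP connections.
addCookie: sets a cookie.
Allow: the request methods the server supports.
Content-Encoding: the encoding method applied to the document.
Content-Length: the length of the response body, in bytes.
Content-Type: the MIME type of the document that follows.
Date: the current time, in GMT.
Expires: the time at which the document should be considered out of date and no longer cached.
Last-Modified: the time the document was last changed.
Location: where the client should go to fetch the document. Location is usually not set directly but through the sendRedirect() method of HttpServletResponse, which also sets the status code to 302.
Refresh: how long the browser should wait before refreshing the page.
Server: the server name.
Set-Cookie: sets a cookie associated with the page.
WWW-Authenticate: the type of authorization information the client should supply in the Authorization header.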
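On the client side, reading one of these headers out of a raw response could look roughly like the plain C sketch below. Header names are case-insensitive, so the lookup ignores case. Both function names are hypothetical, and the sketch assumes the header block uses CRLF line endings with no folded (multi-line) values.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare the first n bytes of a and b, ignoring case. */
static int equal_ignore_case(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Hypothetical helper: find header `name` in a CRLF-separated header block
   and copy its value into out. Returns 1 if found, else 0. */
static int find_header(const char *headers, const char *name,
                       char *out, size_t out_size)
{
    assert(headers != NULL && name != NULL && out != NULL && out_size > 0);
    const size_t name_len = strlen(name);
    const char *line = headers;

    while (*line != '\0') {
        const char *end = strstr(line, "\r\n");
        size_t line_len = end ? (size_t)(end - line) : strlen(line);

        if (line_len > name_len && line[name_len] == ':' &&
            equal_ignore_case(line, name, name_len)) {
            const char *v = line + name_len + 1;
            const char *line_end = line + line_len;
            while (v < line_end && (*v == ' ' || *v == '\t'))
                v++;                           /* skip space after the colon */
            size_t n = (size_t)(line_end - v);
            if (n >= out_size)
                n = out_size - 1;              /* truncate to fit out */
            memcpy(out, v, n);
            out[n] = '\0';
            return 1;
        }
        if (end == NULL)
            break;
        line = end + 2;                        /* move past the CRLF */
    }
    out[0] = '\0';
    return 0;
}
```

For a redirect response, find_header(headers, "Location", ...) would give the URL the client should fetch next, and find_header(headers, "Content-Type", ...) gives the value that carries the charset.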