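<span style="font-family:Arial, Helvetica, sans-serif;">Fetching the JD page had always worked fine, and then it suddenly started showing [Decode error - output not utf-8]. After I added decode and encode calls, I got a syntax error instead and the script would not compile. Talking it over with a classmate, we worked out that the page was probably being sent compressed. Once I added decompression code, the page could be fetched normally again. The code is below; the try and except block is the decompression part.</span>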
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib
import re
import gzip
import StringIO
import chardet
url = 'http://www.jd.com'
page = urllib.urlopen(url)
content = page.read()
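try:
    # The server may send the page gzip-compressed, so try to decompress it.
    gf = gzip.GzipFile(fileobj=StringIO.StringIO(content), mode="r")
    content = gf.read()
except Exception:
    # Not gzip, or a damaged stream: keep what was decompressed, else the raw bytes.
    content = gf.extrabuf or content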
charidt1 = chardet.detect(content)
print charidt1
print page.info()
print page.info().getparam('charset')
print content.decode('gbk').encode('utf-8')
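
As a possible variation on the try/except above, here is a minimal sketch that checks for the gzip magic bytes (`0x1f 0x8b`) before decompressing, so an uncompressed page is passed through untouched instead of relying on an exception. The names `GZIP_MAGIC`, `maybe_gunzip` and `decode_page` are made up for illustration and are not part of the original script, and the `gbk` fallback is only an assumption carried over from the last line of the script. It avoids the network by compressing a sample string in memory, and it is written so that it should run on both Python 2.7 and Python 3.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of a gzip stream


def maybe_gunzip(data):
    """Return data decompressed if it starts with the gzip magic, else unchanged."""
    if data[:2] != GZIP_MAGIC:
        return data
    return gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb").read()


def decode_page(data, charset=None, fallback="gbk"):
    """Decompress if needed, then decode with the declared charset or the fallback."""
    # "replace" keeps going on bad bytes instead of raising UnicodeDecodeError.
    return maybe_gunzip(data).decode(charset or fallback, "replace")


if __name__ == "__main__":
    # Build a gzip-compressed sample in memory instead of fetching a real page.
    raw = u"\u4eac\u4e1c".encode("gbk")  # the two characters of JD's Chinese name
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(raw)
    # The compressed and uncompressed inputs should decode to the same text.
    print(decode_page(buf.getvalue()) == decode_page(raw))
```

In the script above, `decode_page(content, page.info().getparam('charset'))` could stand in for the try/except plus the final decode, with the result encoded to UTF-8 before printing as before. Unlike the `extrabuf` fallback, this sketch does not try to recover a truncated or corrupted gzip stream.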