Fixing garbled output from Python urllib2 when the response has Content-Encoding: gzip
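Open Chrome's developer tools and look at the page's headers. If the response headers contain Content-Encoding: gzip, the body is gzip-compressed. urllib2 does not decompress it automatically, so read() returns the raw compressed bytes, which show up as garbage when treated as text.
The gzip module handles this. The steps are as follows: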
import gzip
import StringIO
import urllib2

yresponse = urllib2.urlopen(url)
rspheaders = yresponse.info()
yread = yresponse.read()
# Header lookup is case-insensitive, so one check also covers 'content-encoding'.
if rspheaders.get('Content-Encoding', '').lower() == 'gzip':
    # Wrap the compressed bytes in a file-like object so GzipFile can read them.
    ydata = StringIO.StringIO(yread)
    ygz = gzip.GzipFile(fileobj=ydata)
    yread = ygz.read()
    ygz.close()
# Decode the page bytes, then re-encode for the target environment (GB2312 here).
ystr = yread.decode('utf8', 'ignore').encode('GB2312')
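In the code, url is the address to fetch. Choose the decode() and encode() arguments to match the page's actual encoding and the encoding your environment expects.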
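The same steps can be wrapped in a small helper, which makes the gzip check easy to reuse and to test without a network connection. This is only a sketch: the function name decode_body and its charset parameter are made up for illustration, and it uses io.BytesIO instead of StringIO so that it should run unchanged on both Python 2.6+ and Python 3.

```python
import gzip
import io


def decode_body(raw, content_encoding, charset='utf-8'):
    """Return the body as text, gunzipping first if the header says gzip.

    raw: the bytes returned by response.read()
    content_encoding: value of the Content-Encoding header, or None
    charset: assumed text encoding of the page (hypothetical default)
    """
    if content_encoding and content_encoding.strip().lower() == 'gzip':
        # GzipFile needs a file-like object, so wrap the bytes in BytesIO.
        gz = gzip.GzipFile(fileobj=io.BytesIO(raw))
        try:
            raw = gz.read()
        finally:
            gz.close()
    # 'ignore' drops bytes that are invalid in the chosen charset.
    return raw.decode(charset, 'ignore')


# Build a gzip payload in memory to exercise the helper offline.
buf = io.BytesIO()
gz_out = gzip.GzipFile(fileobj=buf, mode='wb')
gz_out.write(u'hello gzip'.encode('utf-8'))
gz_out.close()
compressed = buf.getvalue()

print(decode_body(compressed, 'gzip'))   # compressed body
print(decode_body(b'plain text', None))  # uncompressed body, no header
```

With the variables from the code above, the call would look like decode_body(yresponse.read(), yresponse.info().get('Content-Encoding')). The result is a unicode string, so call .encode() on it afterwards if your environment needs GB2312 bytes.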