python gzip.open shows "utf-8 cannot decode"_Solution for garbled, undecodable response content when Python requests HTML data with a gzip header...-CSDN Blog

Article link: https://blog.csdn.net/weixin_29194693/article/details/111944234

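While preparing my master's thesis I needed to scrape web page data, and page encoding problems kept getting in the way.

For example: when Python requests HTML data with a gzip header, the response content is garbled and cannot be decoded. If the request header of an HTTP request contains "Accept-Encoding": "gzip, deflate" and the response content is parsed with lxml.etree, Chinese text printed in the PyCharm IDE comes out garbled.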

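For example, take the page http://news.sina.com.cn/o/2015-11-25/doc-ifxmainy1109538.shtml. Requesting it with gzip accepted and printing the response headers shows that the body comes back gzip-compressed:

import requests
from lxml import etree

url = "http://news.sina.com.cn/o/2015-11-25/doc-ifxmainy1109538.shtml"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0", "Accept-Encoding": "gzip, deflate"}
response = requests.get(url, headers=headers)
print(response.headers)  # header information
print(response.headers.get("Content-Encoding"))  # gzip

What fixed it for me was setting the request header 'Accept-Encoding': 'deflate' and then decoding the body as UTF-8 before parsing:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0", "Accept-Encoding": "deflate"}
response = requests.get(url, headers=headers)
html = response.content.decode('utf-8')
page = etree.HTML(html)

Only then was the printed Chinese text no longer garbled.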

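An ordinary browser sends "Accept-Encoding": "gzip, deflate" when it visits a page because it automatically decompresses the gzip-compressed pages a server returns; the header announces that it accepts compressed data. When crawler code that simulates the request adds the same header but never decompresses the response, the compressed bytes are treated as ordinary HTML text, and what gets displayed is garbage.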
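The effect is easy to reproduce offline. The following is a minimal sketch, assuming Python 3 and a made-up page body (`original` is not taken from any real site): the gzip-compressed bytes are not valid UTF-8, while the decompressed bytes decode cleanly. `try_decode` is a hypothetical helper written for this illustration.

```python
import gzip

# Made-up page body, standing in for what a server would send.
original = "<html><body>新浪新闻</body></html>"
compressed = gzip.compress(original.encode("utf-8"))

# Gzip data always starts with the magic bytes 0x1f 0x8b.
print(compressed[:2])


def try_decode(data):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Decoding the compressed bytes directly fails.
garbled = try_decode(compressed)
# Decompressing first gives back the original text.
restored = try_decode(gzip.decompress(compressed))

print(garbled)
print(restored)
```

The second magic byte, 0x8b, cannot start a UTF-8 sequence, which is why a direct decode fails instead of quietly producing text.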

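There are two ways to get the correct page content instead of garbage:

1. Do not set the Accept-Encoding header.

2. Set the Accept-Encoding header, and also enable the matching automatic decompression.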
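The second option can be sketched as a small helper that picks a decompressor from the Content-Encoding response header. This is a minimal sketch assuming Python 3; `decode_body` is a hypothetical name, not part of any library, and the sample page is made up. The deflate branch tries the zlib-wrapped form first and falls back to raw deflate, on the assumption that a server may send either one.

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Decompress body according to the Content-Encoding header value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        # 16 + MAX_WBITS: expect a gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        try:
            # zlib-wrapped deflate stream.
            return zlib.decompress(body)
        except zlib.error:
            # Raw deflate stream with no header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No (or unknown) Content-Encoding: return the bytes unchanged.
    return body


# Usage with a made-up page, compressed locally instead of fetched.
page = "<p>中文内容</p>".encode("utf-8")
print(decode_body(gzip.compress(page), "gzip").decode("utf-8"))
print(decode_body(page, None).decode("utf-8"))
```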

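1. Background

When scraping web data with the urllib2 module, you may want to send a request header that reduces the amount of data transferred. The data that comes back is then gzip-compressed. Decoding it directly with content.decode("utf8") raises an exception, and the page's actual character encoding cannot be detected either.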

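2. Analysis

If the request header of an HTTP request contains "Accept-Encoding": "gzip, deflate" and the web server supports it, the returned data is compressed. The benefit is less network traffic; the client is expected to decompress the body according to the response header and then decode it. The HTTP response data obtained through the urllib2 module is the raw data, not decompressed, and that is the root cause of the garbled text.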

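3. Solutions

3.1 Remove "Accept-Encoding": "gzip, deflate" from the request header

This is the quickest fix and gives data that can be decoded directly. The drawback is that much more data is transferred.

3.2 Use the zlib module to decompress, then decode, to get readable plain text

This is the approach used in this article.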
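The wbits argument decides which container format zlib.decompress expects. A short sketch, assuming Python 3 and a made-up payload, that compares 16 + zlib.MAX_WBITS (gzip only) with 32 + zlib.MAX_WBITS, which lets zlib detect a gzip or zlib header on its own:

```python
import gzip
import zlib

payload = "解压后的网页内容".encode("utf-8")  # made-up sample body

gzip_bytes = gzip.compress(payload)  # gzip container
zlib_bytes = zlib.compress(payload)  # zlib container

# 16 + MAX_WBITS: expect a gzip header and trailer.
from_gzip = zlib.decompress(gzip_bytes, 16 + zlib.MAX_WBITS)

# 32 + MAX_WBITS: auto-detect a gzip or zlib header.
auto_gzip = zlib.decompress(gzip_bytes, 32 + zlib.MAX_WBITS)
auto_zlib = zlib.decompress(zlib_bytes, 32 + zlib.MAX_WBITS)

# The gzip-only setting rejects zlib-wrapped data.
try:
    zlib.decompress(zlib_bytes, 16 + zlib.MAX_WBITS)
    gzip_only_accepts_zlib = True
except zlib.error:
    gzip_only_accepts_zlib = False

print(from_gzip.decode("utf-8"))
print(gzip_only_accepts_zlib)
```

The auto-detecting value is convenient when the code does not want to branch on the header, though it still does not cover a raw deflate stream that has no header at all.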

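4. Source code walkthrough

The code is below. It is a typical example of simulating a form submission by sending the request data with POST, based on Python 2.7.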


#!/usr/bin/env python2.7

import sys
import zlib
import urllib
import urllib2
import cookielib

import chardet  # third-party package


def main():
    # Python 2 only: make implicit str/unicode conversions use UTF-8.
    reload(sys)
    sys.setdefaultencoding('utf-8')

    url = 'http://xxx.yyy.com/test'
    values = {
        "form_field1": "value1",
        "form_field2": "TRUE",
    }
    post_data = urllib.urlencode(values)

    # Keep cookies between requests.
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    headers = {
        "User-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:36.0) Gecko/20100101 Firefox/36.0",
        "Referer": "http://xxx.yyy.com/test0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        # "Cookie":"QSession=",
        "Content-Type": "application/x-www-form-urlencoded",
    }

    req = urllib2.Request(url, post_data, headers)
    response = opener.open(req)
    content = response.read()  # raw bytes; urllib2 does not decompress them

    content_encoding = response.headers.get('Content-Encoding')
    if content_encoding == 'gzip':
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        html = zlib.decompress(content, 16 + zlib.MAX_WBITS)
    elif content_encoding == 'deflate':
        # zlib-wrapped deflate stream.
        html = zlib.decompress(content)
    else:
        html = content

    # Guess the character encoding of the decompressed bytes.
    result = chardet.detect(html)
    print(result)

    print html.decode("utf8")


if __name__ == '__main__':
    main()
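Once the body is decompressed, a charset still has to be chosen for the final decode. A minimal sketch, assuming Python 3, that reads the charset from the Content-Type response header using only the standard library and falls back to UTF-8 when none is declared. `charset_from_content_type` and the sample header values are illustrative assumptions, not part of the original code.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type or "text/html"
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header does not declare one.
    return msg.get_content_charset() or default


# Made-up body encoded as GBK, with a matching made-up header value.
body = "新闻标题".encode("gbk")
charset = charset_from_content_type("text/html; charset=GBK")
text = body.decode(charset)

print(charset)
print(text)
```

When the header carries no charset, a detector such as chardet remains a reasonable fallback before defaulting to UTF-8.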