Adding gzip compression/decompression support to fetching web pages via urllib2.urlopen in Python

I had previously implemented fetching web page content in Python; the existing code is:

#------------------------------------------------------------------------------
# get response from url
# note: if you have already used cookiejar, then here will automatically use it
# while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}):
    # make sure url is a string, not unicode, otherwise urllib2.urlopen will error
    url = str(url);

    if (postDict):
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else:
        req = urllib2.Request(url);

    if (headerDict):
        print "added header:", headerDict;
        for key in headerDict.keys():
            req.add_header(key, headerDict[key]);

    req.add_header('User-Agent', gConst['userAgentIE9']);
    req.add_header('Cache-Control', 'no-cache');
    req.add_header('Accept', '*/*');
    #req.add_header('Accept-Encoding', 'gzip, deflate');
    req.add_header('Connection', 'Keep-Alive');
    resp = urllib2.urlopen(req);

    return resp;

#------------------------------------------------------------------------------
# get response html==body from url
def getUrlRespHtml(url, postDict={}, headerDict={}):
    resp = getUrlResponse(url, postDict, headerDict);
    respHtml = resp.read();
    return respHtml;

This code does not support compressed (gzip) HTML responses.

Now I want to add support for that compression and decompression.

I had already implemented the corresponding feature in C# before, so the underlying mechanism is familiar; the question here is only how to implement it in Python.

[Solution process]

1. I had briefly looked through related posts before, but had not gotten around to solving this at the time.

Now I know: first add the gzip header to the HTTP request; the concrete Python code is:

req.add_header('Accept-Encoding', 'gzip, deflate');

Then the data obtained by read() from the returned HTTP response is the gzip-compressed data.

The next step was to figure out how to decompress it.

2. First I looked up gzip; the official Python documentation says:

12.2. gzip — Support for gzip files

This module provides a simple interface to compress and decompress files just like the GNU programs gzip and gunzip would.

The data compression is provided by the zlib module.

That is, the gzip module compresses and decompresses files; for compressing and decompressing data in memory, use zlib.

So I then looked at zlib:

zlib.decompress ( string[, wbits[, bufsize]] )

Decompresses the data in string, returning a string containing the uncompressed data. The wbits parameter controls the size of the window buffer, and is discussed further below. If bufsize is given, it is used as the initial size of the output buffer. Raises the error exception if any error occurs.

The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. When decompressing a stream, wbits must not be smaller than the size originally used to compress the stream; using a too-small value will result in an exception. The default value is therefore the highest value, 15. When wbits is negative, the standard gzip header is suppressed.

bufsize is the initial size of the buffer used to hold decompressed data. If more space is required, the buffer size will be increased as needed, so you don’t have to get this value exactly right; tuning it will only save a few calls to malloc(). The default size is 16384.

Then I called zlib.decompress directly in the program and got an error. It was later solved; for the details see:

[Solved] Python zlib.decompress error: error: Error -3 while decompressing data: incorrect header check

After that, the returned html could be decompressed.
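The wbits fix from that post can be demonstrated with a small standalone sketch (my own demo data, not from the original code): decompressing gzip-wrapped bytes with the default wbits fails with exactly that "incorrect header check" error, while 16+zlib.MAX_WBITS tells zlib to expect, and skip, the gzip header:

```python
import io
import gzip
import zlib

data = b"hello hello hello " * 10

# build a gzip-wrapped stream, like a server response with Content-Encoding: gzip
buf = io.BytesIO()
gzFile = gzip.GzipFile(fileobj=buf, mode="wb")
gzFile.write(data)
gzFile.close()
gzipped = buf.getvalue()

# default wbits expects a zlib header -> fails on gzip data
try:
    zlib.decompress(gzipped)
    raise AssertionError("expected zlib.error")
except zlib.error:
    pass  # "Error -3 while decompressing data: incorrect header check"

# 16 + MAX_WBITS makes zlib expect (and skip) the gzip header
assert zlib.decompress(gzipped, 16 + zlib.MAX_WBITS) == data
```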

3. I referred to this post:

http://flyash.itcao.com/post_1117.html

and learned that one should check whether the returned HTTP response contains "Content-Encoding: gzip", and only then call zlib to decompress.
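That check-then-decompress logic can be sketched as a small helper (a standalone illustration; the name maybe_decompress and the plain dict standing in for resp.info() are my own, not from the post):

```python
import io
import gzip
import zlib

def maybe_decompress(body, headers):
    # only decompress when the server really says the body is gzipped
    if headers.get("Content-Encoding") == "gzip":
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    return body

# build a gzipped body for the demo
buf = io.BytesIO()
gzFile = gzip.GzipFile(fileobj=buf, mode="wb")
gzFile.write(b"<html>demo</html>")
gzFile.close()

assert maybe_decompress(buf.getvalue(), {"Content-Encoding": "gzip"}) == b"<html>demo</html>"
assert maybe_decompress(b"<html>plain</html>", {}) == b"<html>plain</html>"
```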

4. Finally, the complete implementation is as follows:

#------------------------------------------------------------------------------
# get response from url
# note: if you have already used cookiejar, then here will automatically use it
# while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False):
    # make sure url is a string, not unicode, otherwise urllib2.urlopen will error
    url = str(url);

    if (postDict):
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else:
        req = urllib2.Request(url);

    defHeaderDict = {
        'User-Agent'    : gConst['userAgentIE9'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    };

    # add default headers first
    for eachDefHd in defHeaderDict.keys():
        #print "add default header: %s=%s"%(eachDefHd, defHeaderDict[eachDefHd]);
        req.add_header(eachDefHd, defHeaderDict[eachDefHd]);

    if (useGzip):
        #print "use gzip for", url;
        req.add_header('Accept-Encoding', 'gzip, deflate');

    # add customized headers later -> allow overwriting default headers
    if (headerDict):
        #print "added header:", headerDict;
        for key in headerDict.keys():
            req.add_header(key, headerDict[key]);

    if (timeout > 0):
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout);
    else:
        resp = urllib2.urlopen(req);

    return resp;

#------------------------------------------------------------------------------
# get response html==body from url
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True):
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip);
    respHtml = resp.read();
    if (useGzip):
        #print "---before unzip, len(respHtml)=", len(respHtml);
        respInfo = resp.info();

        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # sometimes the request asks for gzip,deflate, but what is actually
        # returned is un-gzipped html
        # -> the response info then does not include "Content-Encoding: gzip"
        # -> so only decode when it is indeed gzipped data
        if (("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip")):
            respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS);
            #print "+++ after unzip, len(respHtml)=", len(respHtml);

    return respHtml;

 

[Summary]

To add gzip support to urllib2.urlopen in Python, the main logic is:

1. Add the corresponding gzip header to the request:

req.add_header('Accept-Encoding', 'gzip, deflate');

2. After obtaining the returned html, decompress it with zlib:

respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);

Before decompressing, check whether the returned content really is gzipped data, i.e. whether the response contains "Content-Encoding: gzip". This matters because your HTTP request may declare gzip support while the server still returns the original, uncompressed html.
