Adding gzip compression/decompression support to fetching web pages via urllib2.urlopen in Python

I had previously implemented fetching web page content in Python; the existing code is:

#------------------------------------------------------------------------------
# get response from url
# note: if you have already used cookiejar, then here will automatically use it
# while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}):
    # make sure url is a string, not unicode, otherwise urllib2.urlopen will error
    url = str(url);

    if (postDict):
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else:
        req = urllib2.Request(url);

    if (headerDict):
        print "added header:", headerDict;
        for key in headerDict.keys():
            req.add_header(key, headerDict[key]);

    req.add_header('User-Agent', gConst['userAgentIE9']);
    req.add_header('Cache-Control', 'no-cache');
    req.add_header('Accept', '*/*');
    #req.add_header('Accept-Encoding', 'gzip, deflate');
    req.add_header('Connection', 'Keep-Alive');
    resp = urllib2.urlopen(req);

    return resp;

#------------------------------------------------------------------------------
# get response html==body from url
def getUrlRespHtml(url, postDict={}, headerDict={}):
    resp = getUrlResponse(url, postDict, headerDict);
    respHtml = resp.read();
    return respHtml;

This code does not support compressed (gzip) HTML responses.

Now I want to add support for that compression and decompression.

I had already implemented the corresponding feature in C# before, so the underlying mechanism is familiar; the question here is only how to implement it in Python.

[Solution process]

1. I had briefly looked through related posts before, but had not gotten around to solving this at the time.

Now I know: first add the gzip header to the HTTP request; the concrete Python code is:

req.add_header('Accept-Encoding', 'gzip, deflate');

Then the data obtained by read() from the returned HTTP response is the gzip-compressed data.

The next step was to figure out how to decompress it.

2. First I looked up gzip; the official Python documentation says:

12.2. gzip — Support for gzip files

This module provides a simple interface to compress and decompress files just like the GNU programs gzip and gunzip would.

The data compression is provided by the zlib module.

That is, the gzip module compresses and decompresses files; for compressing and decompressing data in memory, use zlib.

So I then looked at zlib:

zlib.decompress ( string[, wbits[, bufsize]] )

Decompresses the data in string, returning a string containing the uncompressed data. The wbits parameter controls the size of the window buffer, and is discussed further below. If bufsize is given, it is used as the initial size of the output buffer. Raises the error exception if any error occurs.

The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. When decompressing a stream, wbits must not be smaller than the size originally used to compress the stream; using a too-small value will result in an exception. The default value is therefore the highest value, 15. When wbits is negative, the standard gzip header is suppressed.

bufsize is the initial size of the buffer used to hold decompressed data. If more space is required, the buffer size will be increased as needed, so you don’t have to get this value exactly right; tuning it will only save a few calls to malloc(). The default size is 16384.

Then I called zlib.decompress directly in the program and got an error. It was later solved; for the details see:

[Solved] Python zlib.decompress error: error: Error -3 while decompressing data: incorrect header check

After that, the returned html could be decompressed.
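The wbits fix from that post can be demonstrated with a small standalone sketch (my own demo data, not from the original code): decompressing gzip-wrapped bytes with the default wbits fails with exactly that "incorrect header check" error, while 16+zlib.MAX_WBITS tells zlib to expect, and skip, the gzip header:

```python
import io
import gzip
import zlib

data = b"hello hello hello " * 10

# build a gzip-wrapped stream, like a server response with Content-Encoding: gzip
buf = io.BytesIO()
gzFile = gzip.GzipFile(fileobj=buf, mode="wb")
gzFile.write(data)
gzFile.close()
gzipped = buf.getvalue()

# default wbits expects a zlib header -> fails on gzip data
try:
    zlib.decompress(gzipped)
    raise AssertionError("expected zlib.error")
except zlib.error:
    pass  # "Error -3 while decompressing data: incorrect header check"

# 16 + MAX_WBITS makes zlib expect (and skip) the gzip header
assert zlib.decompress(gzipped, 16 + zlib.MAX_WBITS) == data
```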

3. I referred to this post:

http://flyash.itcao.com/post_1117.html

and learned that one should check whether the returned HTTP response contains "Content-Encoding: gzip", and only then call zlib to decompress.
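That check-then-decompress logic can be sketched as a small helper (a standalone illustration; the name maybe_decompress and the plain dict standing in for resp.info() are my own, not from the post):

```python
import io
import gzip
import zlib

def maybe_decompress(body, headers):
    # only decompress when the server really says the body is gzipped
    if headers.get("Content-Encoding") == "gzip":
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    return body

# build a gzipped body for the demo
buf = io.BytesIO()
gzFile = gzip.GzipFile(fileobj=buf, mode="wb")
gzFile.write(b"<html>demo</html>")
gzFile.close()

assert maybe_decompress(buf.getvalue(), {"Content-Encoding": "gzip"}) == b"<html>demo</html>"
assert maybe_decompress(b"<html>plain</html>", {}) == b"<html>plain</html>"
```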

4. Finally, the complete implementation is as follows:

#------------------------------------------------------------------------------
# get response from url
# note: if you have already used cookiejar, then here will automatically use it
# while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False):
    # make sure url is a string, not unicode, otherwise urllib2.urlopen will error
    url = str(url);

    if (postDict):
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else:
        req = urllib2.Request(url);

    defHeaderDict = {
        'User-Agent'    : gConst['userAgentIE9'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    };

    # add default headers first
    for eachDefHd in defHeaderDict.keys():
        #print "add default header: %s=%s"%(eachDefHd, defHeaderDict[eachDefHd]);
        req.add_header(eachDefHd, defHeaderDict[eachDefHd]);

    if (useGzip):
        #print "use gzip for", url;
        req.add_header('Accept-Encoding', 'gzip, deflate');

    # add customized headers later -> allow overwriting default headers
    if (headerDict):
        #print "added header:", headerDict;
        for key in headerDict.keys():
            req.add_header(key, headerDict[key]);

    if (timeout > 0):
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout);
    else:
        resp = urllib2.urlopen(req);

    return resp;

#------------------------------------------------------------------------------
# get response html==body from url
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True):
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip);
    respHtml = resp.read();
    if (useGzip):
        #print "---before unzip, len(respHtml)=", len(respHtml);
        respInfo = resp.info();

        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # sometimes the request asks for gzip,deflate, but what is actually
        # returned is un-gzipped html
        # -> the response info then does not include "Content-Encoding: gzip"
        # -> so only decode when it is indeed gzipped data
        if (("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip")):
            respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS);
            #print "+++ after unzip, len(respHtml)=", len(respHtml);

    return respHtml;

 

[Summary]

To add gzip support to urllib2.urlopen in Python, the main logic is:

1. Add the corresponding gzip header to the request:

req.add_header('Accept-Encoding', 'gzip, deflate');

2. After obtaining the returned html, decompress it with zlib:

respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);

Before decompressing, check whether the returned content really is gzipped data, i.e. whether the response contains "Content-Encoding: gzip". This matters because your HTTP request may declare gzip support while the server still returns the original, uncompressed html.
