Ways to speed up page fetching in a Python crawler

Latest recommended article published 2024-03-19 15:26:24

鶸者为何战斗



Category: python

Article link: https://blog.csdn.net/yj1499945/article/details/50845574


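1. Use the asynchronous I/O library Twisted. This is currently one of the faster ways to fetch HTML documents. When you use it, be sure to throttle the request rate: putting too much load on the server can get your IP banned.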
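The rate control mentioned above can be sketched as follows. This is a minimal illustration, not Twisted code: it swaps in the standard library's asyncio (Python 3) so that it runs without extra dependencies, and the names fetch_blocking and fetch_all, as well as the values of max_concurrent and delay, are assumptions made for the example. The same idea, a cap on simultaneous requests plus a short pause, carries over to Twisted.

```python
import asyncio
import urllib.request


def fetch_blocking(url, timeout=5):
    """Download one page with the standard library (blocking call)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


async def fetch_all(urls, max_concurrent=5, delay=0.5, fetch=fetch_blocking):
    """Fetch many URLs concurrently while capping the load on the server.

    max_concurrent and delay are illustrative values, not recommendations.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    loop = asyncio.get_running_loop()

    async def fetch_one(url):
        async with semaphore:
            # Run the blocking download in a worker thread.
            body = await loop.run_in_executor(None, fetch, url)
            # Pause before releasing the slot to keep the request rate down.
            await asyncio.sleep(delay)
            return url, body

    # gather() returns the results in the same order as urls.
    return await asyncio.gather(*(fetch_one(u) for u in urls))
```

A call such as asyncio.run(fetch_all(list_of_urls)) returns a list of (url, body) pairs. At most max_concurrent downloads are in flight at any moment, and each slot waits delay seconds before the next request can use it.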

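2. When sending requests with the requests or urllib2 module, add the header 'Accept-encoding: gzip'. The server then reports, through the Content-Encoding response header, whether it sent the page gzip-compressed. A compressed page is smaller and transfers faster.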

Implementation:

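import gzip
import StringIO
import urllib2

def getHtmlByU(url):
    try:
        request = urllib2.Request(url)
        # Add a request header telling the server the client can handle gzip-compressed HTML
        request.add_header('Accept-encoding', 'gzip')
        response = urllib2.urlopen(request, timeout=5)
        data = response.read()
        if response.info().get('Content-Encoding') == 'gzip':
            # Decompress the body with the gzip module
            buff = StringIO.StringIO(data)
            data = gzip.GzipFile(fileobj=buff).read()
        # Return the body as-is when the server did not compress it
        return data
    except Exception as e:
        raise Exception("can't get the html: %s" % e)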
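The code above targets Python 2 (urllib2, StringIO). For readers on Python 3, the same approach might look like the sketch below. The function names decode_body and get_html are made up for this example, and the decompression step is split out so it can be checked without a network connection.

```python
import gzip
import urllib.request


def decode_body(body, content_encoding):
    """Return the raw page bytes, decompressing when the server used gzip."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body


def get_html(url, timeout=5):
    """Fetch a page, advertising gzip support to the server."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Content-Encoding tells us whether the body was compressed.
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

get_html returns bytes, so decode them with the page's character set before parsing. With the requests library this manual step is usually unnecessary: it sends an Accept-Encoding header that includes gzip by default and decompresses the response body automatically.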