Python: A First Try at Scraping Web Pages

Latest recommended article published on 2020-07-13 10:18:56

iamzhuwenhui

Category: python. Tags: python, programming

Article link: https://blog.csdn.net/u013632854/article/details/52970775

The cliff of programming: turn back and the shore is at hand. Python: a first try at scraping web pages

Version: Python 2.7.3 |EPD_free 7.3-2 (32-bit)|

#coding:utf-8
import urllib   # needed by getHtml_1, which calls urllib.urlopen
import urllib2

My first attempt at fetching a page worked. The target was the Baidu home page.

#first try,it works----------------------------
def getHtml_1(url):
    response = urllib.urlopen(url)
    html_body = response.read()
    return html_body

Next I tried fetching through a Request object, which also worked.

#use request,still works------------------------------
def getHtml_2(url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    html_body = response.read()
    return html_body

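When I switched the URL to Qiushibaike, it stopped working. I assumed the request had failed, so I tried to read the status code, but that failed too.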

#try to get status,failed-------------------------
def getHtml_3(url):
    response = None
    try:
        response = urllib2.urlopen(url,timeout=5)
        html_body = response.read()
        return html_body
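    except urllib2.URLError as e:
        # an HTTPError (a URLError subclass) carries an HTTP status code
        if hasattr(e,'code'):
            print 'Error code:',e.code
        # a plain URLError only carries a reason, e.g. a timeout
        elif hasattr(e,'reason'):
            print 'Reason:',e.reason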
    finally:
        if response:
            response.close()
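
The two branches in the except clause correspond to two different failure modes, and they can also be told apart by exception type instead of with hasattr. The following is a minimal sketch, not the code from this post: `describe_error` is a hypothetical helper, the exceptions are built by hand so nothing is sent over the network, and the fallback import is only there so the sketch also runs on Python 3.

```python
# Sketch only: describe_error is a hypothetical helper, not part of urllib2.
try:
    from urllib2 import HTTPError, URLError       # Python 2, as in this post
except ImportError:
    from urllib.error import HTTPError, URLError  # Python 3 equivalent


def describe_error(e):
    """Return a short description of a urllib2-style exception."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(e, HTTPError):
        return 'Error code: %d' % e.code
    if isinstance(e, URLError):
        return 'Reason: %s' % (e.reason,)
    raise TypeError('not a URLError: %r' % (e,))


# Hand-built exceptions stand in for real failures (no network access needed).
http_failure = HTTPError('http://example.invalid/', 404, 'Not Found', {}, None)
no_reply = URLError('timed out')

print(describe_error(http_failure))
print(describe_error(no_reply))
```

A server that never answers produces no status code at all, so only the second branch can apply in that case.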

A Wireshark capture showed that the server never replied at all.

[Screenshot: Wireshark capture]

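Then I noticed that the request header identified the client as Python (urllib2's default User-Agent is Python-urllib/x.y), which was obviously never going to be accepted. So I tried adding a header of my own, and it worked.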
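
The "Python" header in question is the default User-Agent that urllib2 attaches to every request. A small sketch that prints it without touching the network is shown here. It uses only the standard library, and the fallback import is an addition so that it also runs on Python 3.

```python
# Sketch only: inspects the default headers without sending any request.
try:
    from urllib2 import build_opener         # Python 2, as in this post
except ImportError:
    from urllib.request import build_opener  # Python 3 equivalent

opener = build_opener()

# addheaders is a list of (name, value) pairs that the opener adds
# to every request that does not already set them.
default_headers = dict(opener.addheaders)
print(default_headers['User-agent'])
```

The value starts with "Python-urllib/" followed by a version number, which makes the client easy for a site to recognise and ignore.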

#add header--------------------------
def getHtml_4(url):
    response = None
    request = None
    # present a browser-like User-Agent instead of urllib2's default
    headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    try:
        request = urllib2.Request(url,headers = headers)
        response = urllib2.urlopen(request)
        html_body = response.read()
        return html_body
    except urllib2.URLError as e:
        if hasattr(e,'code'):
            print 'Error code:',e.code
        elif hasattr(e,'reason'):
            print 'Reason:',e.reason
    finally:
        if response:
            response.close()
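
To check that the header really ends up on the request, a Request object can be inspected before it is sent. This is a rough sketch under stated assumptions: `build_request` and `BROWSER_UA` are hypothetical names introduced for illustration, nothing is sent over the network, and the fallback import is only there so it also runs on Python 3.

```python
# Sketch only: build_request and BROWSER_UA are illustrative names.
try:
    from urllib2 import Request         # Python 2, as in this post
except ImportError:
    from urllib.request import Request  # Python 3 equivalent

BROWSER_UA = ('Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us; rv:1.9.1.6) '
              'Gecko/20091201 Firefox/3.5.6')


def build_request(url, user_agent=BROWSER_UA):
    """Create a Request that carries a browser-like User-Agent."""
    return Request(url, headers={'User-Agent': user_agent})


plain = Request('http://www.qiushibaike.com/')
disguised = build_request('http://www.qiushibaike.com/')

# Request stores header names capitalized, so the lookup key is 'User-agent'.
print(plain.has_header('User-agent'))
print(disguised.get_header('User-agent'))
```

A bare Request carries no User-Agent of its own. The opener fills in its default only when the request has not set that header, which is why the explicit value takes effect.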

#run -----------------------------------

#html = getHtml_1("https://www.baidu.com/")
#html = getHtml_2("https://www.baidu.com/")
#html = getHtml_3("http://www.qiushibaike.com/")
html = getHtml_4("http://www.qiushibaike.com/")

#show me------------------------------
print html

The successful output looked like this:

[Screenshot: output of the successful run]