Web Crawling Basics (1): Downloading Web Pages with the urllib Library, Basic and Advanced Usage

午夜零时

Column: 爬虫学习之旅 (Crawler Learning Journey). Tags: python, crawler

Article link: https://blog.csdn.net/qq_55796594/article/details/118655015

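This article covers the basic way to fetch a web page, followed by some more advanced techniques: catching exceptions, handling HTTP status codes, retrying a download, and setting the user agent.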

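1. The urllib library

urllib is part of Python's standard library, so there is nothing to install and it can be used directly.

It is split into four modules:

a. urllib.request: the request module, for opening and reading URLs

b. urllib.error: the exception module, for the errors raised by urllib.request

c. urllib.parse: the parsing module, for taking URLs apart and building them

d. urllib.robotparser: the module for parsing robots.txt files

This article mainly uses the request and error modules.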
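The two modules this article does not use can still be tried out without any network access. The following is only a minimal sketch: the query fields, the crawler name my-crawler, and the robots.txt rules are made up for illustration.

```python
from urllib.parse import urlparse, urlencode, urljoin
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into parts, build a query string, resolve a link.
parts = urlparse('https://www.baidu.com/s?wd=python')
query = urlencode({'wd': 'python crawler', 'pn': 10})
logo_url = urljoin('https://www.baidu.com/s?wd=python', '/img/logo.png')

# urllib.robotparser: evaluate robots.txt rules. These rules are invented
# and parsed from a list of lines, so nothing is downloaded.
robots = RobotFileParser()
robots.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(parts.netloc, parts.path, parts.query)
print(query)
print(logo_url)
print(robots.can_fetch('my-crawler', 'https://example.com/private/page.html'))
print(robots.can_fetch('my-crawler', 'https://example.com/index.html'))
```

For a real site, RobotFileParser can also download the rules itself through set_url() and read().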

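2. Basic usage: fetching a page

To fetch a page, Python imitates the HTTP request a browser would send: urlopen() sends the request to the server (it accepts either a URL string or a Request object) and returns the response, and read() reads the body of that response.
Here is the basic usage.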

import urllib.request

url = 'https://www.baidu.com'
print('Downloading : ' + url)
response = urllib.request.urlopen(url)    # send the request and get the HTTP response
html = response.read()                    # read() returns the response body as bytes
print(html)
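Since read() returns bytes, the printed result is a b'...' literal rather than readable text. One possible way to get a string is sketched below. The helper name fetch_text and the UTF-8 fallback are assumptions of this sketch, and the demonstration uses a data: URL (which urlopen also accepts) so that it runs without a network connection.

```python
import urllib.request

def fetch_text(url, default_charset='utf-8'):
    """Download url and return the body decoded to str (hypothetical helper)."""
    # The with-block closes the response even if read() fails.
    with urllib.request.urlopen(url) as response:
        raw = response.read()                       # bytes
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors='replace')

# Offline demonstration with a data: URL instead of a real web server.
print(fetch_text('data:text/plain;charset=utf-8,hello%20crawler'))
```

With a real page the call would look like fetch_text('https://www.baidu.com'). Pages that declare their encoding only inside the HTML would need extra handling that this sketch does not attempt.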

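3. Advanced usage: catching exceptions

When crawling pages you will inevitably run into errors. The simplest case is a mistyped URL: urlopen() raises an exception, and if nothing catches it the program stops. Catching exceptions is therefore important for making a crawler robust. The exceptions to catch are defined in the urllib.error module.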

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

url1 = 'htps://ww.baidu.com'              # misspelled scheme and host name
print('Downloading : ' + url1)
try:
    html1 = urllib.request.urlopen(url1).read()
    print(html1)
except (URLError, HTTPError, ContentTooShortError) as e:
    # HTTPError and ContentTooShortError are both subclasses of URLError
    print('Download error: ', e.reason)        # print the reason for the failure
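Because HTTPError is a subclass of URLError, the two can also be handled in separate except clauses, as long as the more specific one comes first. The sketch below shows one way to do that. The function name try_download, the 10-second timeout, and the (body, error) return convention are choices made for this illustration.

```python
import urllib.request
from urllib.error import HTTPError, URLError

def try_download(url):
    """Return (body, error_message); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read(), None
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return None, 'HTTP {}: {}'.format(e.code, e.reason)
    except URLError as e:
        # No usable answer at all: bad scheme, unknown host, refused connection.
        return None, 'URL error: {}'.format(e.reason)

# The misspelled scheme is rejected before any connection is attempted.
body, error = try_download('htps://ww.baidu.com')
print(body, error)
```

Swapping the two except clauses would make the HTTPError branch unreachable, since URLError would catch everything first.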

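4. Advanced usage: retrying a download

Requests sometimes fail. Some failures are temporary, for example a busy server that does not answer properly. In a browser you would simply refresh the page, and a crawler can do the same by retrying the download. Not every failure is worth retrying, though: a 404 (page not found) usually means the link itself is wrong, and asking again will not help.

Every HTTP response carries a three-digit status code. Its first digit, 1 through 5, gives the class of the response:

1xx - Informational: these status codes indicate a provisional response.

2xx - Success: the server successfully accepted the client's request.

3xx - Redirection: the client must take further action to complete the request. For example, the browser may have to request a different page on the server, or repeat the request through a proxy server.

4xx - Client error: something is wrong on the client side. For example, the client requested a page that does not exist, or did not supply valid credentials. 404 (Not Found) means the requested file has been moved or deleted.

5xx - Server error: the server hit an error and could not complete the request.

Of these, only the 5xx case is worth retrying.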
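The classes above can be written down as a small lookup. This is only an illustrative sketch; the function names status_class and should_retry are made up here and are not part of urllib.

```python
def status_class(code):
    """Map an HTTP status code to the name of its class."""
    names = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    return names.get(code // 100, 'unknown')

def should_retry(code):
    """Retry only server errors, i.e. 500 through 599 inclusive."""
    return 500 <= code < 600

for code in (200, 301, 404, 500, 503):
    print(code, status_class(code), should_retry(code))
```

Note that the range check has to include 500 itself, which is probably the most common server error.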

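import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

# Advanced usage: retry the download. Handle errors by type: retry on 5xx
# server errors, but do not retry on 4xx errors such as a missing page.
# httpstat.us is a test site that replies with the status code given in the URL.
# It can be hard to reach from mainland China, so treat this only as an example.
url2 = 'https://httpstat.us/500'
print('Downloading : ' + url2)
try:
    html2 = urllib.request.urlopen(url2).read()
    print(html2)
except (URLError, HTTPError, ContentTooShortError) as e:
    print('Download error: ', e.reason)
    print(type(e.reason))
    # hasattr() checks whether the error has a 'code' attribute;
    # only HTTPError does, and it holds the HTTP status code
    if hasattr(e, 'code'):
        print(e.code)
        if 500 <= e.code < 600:       # 500 itself is a server error too
            try:
                html2 = urllib.request.urlopen(url2).read()    # retry once
                print(html2)
            except URLError as retry_error:
                print('Retry failed: ', retry_error.reason)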
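A single immediate retry often hits the same busy server again. One common refinement, sketched here under assumptions, is to retry a bounded number of times and pause a little longer before each attempt. The helper call_with_retry, its parameters, and the linear waiting scheme are inventions of this sketch and not something urllib provides.

```python
import time
from urllib.error import HTTPError

def call_with_retry(fetch, retries=2, delay=1.0):
    """Call fetch() and retry on 5xx errors, up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except HTTPError as e:
            last_try = attempt == retries
            if not (500 <= e.code < 600) or last_try:
                raise                            # 4xx, or out of attempts: give up
            time.sleep(delay * (attempt + 1))    # wait a little longer each time
```

It takes any zero-argument callable, so a real download would be passed as call_with_retry(lambda: urllib.request.urlopen(url2).read()) after importing urllib.request.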

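5. Advanced usage: setting the user agent

By default, urllib downloads pages with the user agent Python-urllib/3.x, where x is the Python minor version. Websites can easily use this string to recognize a script and refuse the request. So we set a custom user agent that imitates a browser's identification string. Note that the user agent is only a string in the request headers; it is not a proxy server. The example below uses Chrome.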

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

# Advanced usage: set the user agent, i.e. add identifying information to the request headers.
# There are two ways to add headers: pass them to the constructor, Request(url, data, headers),
# or call .add_header() on the Request object afterwards.
# data is normally only used for POST requests; it can also be assigned later via request.data.

url3 = 'https://www.baidu.com'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'    # Chrome's identification string
referer = ''
                        # Referer says which page the visit came from; this header is optional
header = {
    'User-Agent' : user_agent,
    'Referer' : referer,
}
print('Downloading : ' + url3)
request = urllib.request.Request(url3)
request.add_header('User-Agent', user_agent)    # the header name uses a hyphen, not an underscore
request.add_header('Referer', referer)
# Alternatively, pass the dict in one go: request = urllib.request.Request(url3, headers=header)
try:
    html3 = urllib.request.urlopen(request).read()
    print(html3)
except (URLError, HTTPError, ContentTooShortError) as e:
    print('Download error: ', e.reason)
    print(type(e.reason))
    if hasattr(e, 'code'):
        print(e.code)
        if 500 <= e.code < 600:
            html3 = urllib.request.urlopen(request).read()    # retry with the same request
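A Request object can be inspected before it is sent, which is a quick way to confirm that the headers ended up where they should. The sketch below builds two requests without sending either. The form field q is made up for illustration.

```python
import urllib.request
from urllib.parse import urlencode

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 '
                  '(KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
}
get_request = urllib.request.Request('https://www.baidu.com', headers=headers)

# Request stores header names capitalized as 'User-agent', so look them up that way.
print(get_request.get_method())
print(get_request.get_header('User-agent'))

# Supplying data (as bytes) turns the request into a POST.
form = urlencode({'q': 'python'}).encode('utf-8')
post_request = urllib.request.Request('https://www.baidu.com', data=form, headers=headers)
print(post_request.get_method())
```

A header added under the name 'User_Agent' (with an underscore) is stored as a separate, unrelated header, so urllib would still send its default user agent.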

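There is no need to dwell on the details of the user agent. When an ordinary browser requests a page, it always sends some information along, such as which browser it is and which page the visit came from. For reference, here are the identification strings of several browsers: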

IE 9.0
User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;

Firefox 4.0.1 – Windows
User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1

Opera 11.11 – Windows
User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11

Chrome 17.0 – MAC
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11

Tencent Traveler (TT)
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)

Sogou Browser 1.x
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)

360 Browser
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)

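6. Advanced usage: wrapping it in a function

Wrapping the logic in a function is standard practice in development. The inputs are the page URL, the User-Agent string (Chrome's by default), and the number of retries (2 by default).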

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def GetData(url, user_agent='', retry=2):
    """Download url and return the body as bytes, or None if the download fails."""
    print('download : ' + url)
    if user_agent == '':
        user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'

    request = urllib.request.Request(url)
    request.add_header('User-Agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('download error :', e.reason)
        html = None
        if retry > 0:                                # retry recursively
            if hasattr(e, 'code') and 500 <= e.code < 600:
                html = GetData(url, user_agent, retry-1)
    return html

print(GetData('https://www.baidu.com'))
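The retry rule in a wrapper like this is hard to check against a real site, because a real server rarely fails on demand. One possible variation, sketched below, uses a loop instead of recursion and takes the function that opens the request as a parameter, so that a fake can stand in for the network. The names download, DEFAULT_UA, and opener are assumptions of this sketch.

```python
import urllib.request
from urllib.error import HTTPError, URLError

DEFAULT_UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 '
              '(KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11')

def download(url, user_agent=DEFAULT_UA, retries=2, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the download fails."""
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    for attempt in range(retries + 1):
        try:
            with opener(request) as response:
                return response.read()
        except HTTPError as e:
            print('download error:', e.code, e.reason)
            if not 500 <= e.code < 600:
                return None        # 4xx and similar: retrying will not help
        except URLError as e:
            print('download error:', e.reason)
            return None
    return None                    # every attempt ended in a 5xx error
```

In normal use the default opener is left alone, as in download('https://www.baidu.com'). In a test, any function that accepts a Request and either raises an error or returns a file-like object can be passed instead.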
