Python Web Scraping Theory | (2) Network Requests and Responses

CoreJT

Column: Python3 Web Scraping from Theory to Practice (Base). Tags: Python web scraping theory, sending requests, getting responses, urllib, requests

Article link: https://blog.csdn.net/sdu_hao/article/details/94388385

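The previous post covered the basic crawler workflow, which we called the four-step process:

Step 1: Simulate a browser and send a request to the server
Step 2: Receive the server's response
Step 3: Parse the response content
Step 4: Save the parsed data

This post covers how to send requests to a server from Python code the way a browser would. It introduces the basic usage of Python's request libraries and works through a few small examples:

1. The urllib library

2. The requests library

3. Practice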

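1. The urllib library

urllib, urllib2, and requests

In Python 2, urllib and urllib2 are both built-in standard libraries that open resources by URL. urllib accepts only a URL and cannot disguise the request with custom headers, so its requests are sometimes blocked by websites. urllib2 can accept a Request object, which lets you set the headers for a URL. The two were therefore usually used together. The most commonly used method is urllib.urlopen(), which sends a request.

In Python 3, urllib and urllib2 have been merged into urllib. Requests are sent with urllib.request.urlopen().

requests is a third-party library. It is powerful, but it must be installed before it can be imported.

Note: when studying open-source code found online, always check which Python version it targets.

The urllib library

urllib is Python 3's built-in HTTP request library (see the official documentation).

It contains the following four modules:

1. urllib.request: the HTTP request module, used to send requests. (It contains the urlopen() function and the Request class.)

2. urllib.error: the exception handling module, used to catch request errors. (It contains the URLError and HTTPError classes.)

3. urllib.parse: the URL parsing module, used to split, parse, and join URLs. (It contains the urlparse function, the ParseResult class, and the parse_qs, parse_qsl, urlunparse, urlsplit, urlunsplit, urljoin, quote, quote_plus, unquote, unquote_plus, and urlencode functions.)

4. urllib.robotparser: the robots.txt parsing module, used to read a website's robots.txt file. (It contains the RobotFileParser class.)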

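The urllib.request module

1. The urlopen() function in urllib.request

urlopen() in the urllib.request module is the most basic way to construct an HTTP request. It simulates the way a browser issues a request and follows redirects automatically. The module also provides handler classes for authentication and cookies.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Main parameters:

url: the URL to request

data: the request body as bytes; when it is provided, the request is sent as a POST

timeout: the timeout in seconds; an exception is raised if the request times out

Return value:

For HTTP and HTTPS URLs, the function returns an http.client.HTTPResponse object, which has the following attributes and methods:

1) status: the response status code

2) read(): returns the response body

3) getheaders(): returns the response headers

Example: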

import urllib.request

# Send a request to the URL
response = urllib.request.urlopen('https://www.python.org')
# Print the page content
print(response.read().decode('utf-8'))
print("---------------")
# Check the type of the result
print(type(response))
print("---------------")
# Status code
print(response.status)
# All response headers
print("---------------")
print(response.getheaders())
# The Server header only
print("---------------")
print(response.getheader('Server'))

Note that getheaders() returns all the response headers, while getheader() returns a single header, such as Server.

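2. The Request class in urllib.request

If a request needs headers or other details, build it with the more powerful Request class.

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Main parameters:

url: the URL to request

data: the request body

headers: the request headers

origin_req_host: the host name or IP address of the requester

unverifiable: whether the request is unverifiable, meaning the user had no chance to approve it; the default is False

method: the request method

Return value:

An instance of the Request class.

Note that headers can be passed as an argument when the Request object is built, or added afterwards by calling add_header() on the Request object: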

req = request.Request(url=url,data=data,method='POST')
req.add_header('User-Agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36')

Example:

import urllib.request

# Basic usage
req = urllib.request.Request('https://www.baidu.com')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

from urllib import request,parse

# Extended usage
url = 'https://httpbin.org/post'
headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Host':'httpbin.org'
}
# Named form rather than dict, so the built-in dict is not shadowed
form = {
    'name':'CoreJT'
}
# URL-encode the form, then convert the string to UTF-8 bytes
data = bytes(parse.urlencode(form),encoding='utf-8')
req = request.Request(url,headers=headers,data=data,method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Request() wraps a URL together with details such as headers to build a richer request. It returns an object, which is then passed to urlopen() to get the response.

Note the two import styles. With import urllib.request, later calls must use the full name, such as urllib.request.urlopen(). With from urllib import request, they can be shortened to request.urlopen().
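
A Request object can also be inspected before it is sent, which is handy for checking what urlopen() would transmit. The following is a minimal sketch that runs without a network connection; the form data and the shortened User-Agent string are placeholders chosen for illustration.

```python
from urllib import parse, request

# Hypothetical form data, used only for illustration.
form = {'name': 'CoreJT'}
data = parse.urlencode(form).encode('utf-8')

get_req = request.Request('https://httpbin.org/get')
post_req = request.Request('https://httpbin.org/post', data=data)
post_req.add_header('User-Agent', 'Mozilla/5.0')

print(get_req.get_method())   # GET, because there is no request body
print(post_req.get_method())  # POST, because data was supplied
print(post_req.data)          # b'name=CoreJT'
# Request stores header names in capitalized form, hence 'User-agent'.
print(post_req.get_header('User-agent'))  # Mozilla/5.0
```

Nothing is sent until the object is passed to urlopen(), so this kind of check is safe to run while debugging.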

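The urllib.error module

What happens when an exception occurs, for example on a poor network connection? If these exceptions are not handled, the program will likely stop with an error, so exception handling is essential. This module contains two classes, URLError and HTTPError.

1. The URLError class

It has a reason attribute that gives the cause of the error.

2. The HTTPError class

It is a subclass of URLError and provides more detail through the following attributes:

code: the HTTP status code, for example 404 for a page that does not exist or 500 for an internal server error.

reason: the cause of the error

headers: the response headers

Example: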

from urllib import request,error

try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

from urllib import request,error

try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason,e.code,e.headers,sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully!')

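Catch the subclass exception first, then the parent class exception. Otherwise the URLError clause would also catch every HTTPError.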
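
The subclass relationship can be checked without any network access. In this sketch, describe_error is a hypothetical helper (not part of urllib), and the two exceptions are constructed by hand purely for illustration.

```python
from urllib import error


def describe_error(exc):
    """Return a short description of a urllib exception (hypothetical helper)."""
    # Test the subclass first, for the same reason the except clauses are ordered that way.
    if isinstance(exc, error.HTTPError):
        return 'HTTP error %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URL error: %s' % exc.reason
    return 'other error: %r' % (exc,)


# Build the exceptions by hand so the example runs offline.
not_found = error.HTTPError('https://example.com/missing', 404, 'Not Found', None, None)
unreachable = error.URLError('connection refused')

print(issubclass(error.HTTPError, error.URLError))  # True
print(describe_error(not_found))    # HTTP error 404: Not Found
print(describe_error(unreachable))  # URL error: connection refused
```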

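The urllib.parse module

This module can split URLs and also join them into a standard URL format. None of this requires a network connection. The URL format is:

protocol://hostname[:port]/path[;parameters][?query][#fragment]

1) protocol: the protocol (scheme)

2) hostname: the host name

3) port: the port number

4) path: the path to the resource

5) parameters: special parameters for the path

6) query: the query string, typically used in GET URLs

7) fragment: the anchor, used to jump directly to a position within the page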

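1. The urlparse function in urllib.parse

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Purpose:

Parses a URL into six components.

Parameters:

urlstring: the URL to parse.

scheme: the default scheme. If the URL carries no scheme, the given scheme is used as the default; if the URL has its own scheme, this argument is ignored.

allow_fragments: whether to recognize the fragment. When it is False, the fragment is not parsed separately; it stays part of the path, parameters, or query, and the fragment field is empty. When it is True (the default), the fragment is returned in its own field.

Return value:

A ParseResult object.

2. The ParseResult class in urllib.parse

class urllib.parse.ParseResult(scheme, netloc, path, params, query, fragment)

Example: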

from urllib import request,parse

print(parse.urlparse('https://movie.douban.com/',allow_fragments=False))
print(parse.urlparse('https://movie.douban.com/', scheme='http'))
print(parse.urlparse('movie.douban.com/', scheme='http'))
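
The fields of a ParseResult can be read by name or by index. The sketch below uses an illustrative URL that contains all six components, and it also shows two details that are easy to miss: a URL without a scheme prefix ends up entirely in path, and allow_fragments=False leaves the fragment inside the path.

```python
from urllib.parse import urlparse

# Illustrative URL containing all six components.
result = urlparse('https://www.baidu.com:443/index.html;user?id=5#comment')
print(result.scheme)                  # https
print(result.netloc)                  # www.baidu.com:443
print(result.hostname, result.port)   # www.baidu.com 443
print(result.path, result.params, result.query, result.fragment)
# /index.html user id=5 comment
print(result[1] == result.netloc)     # True: ParseResult is also a tuple

# Without '//' the host is not recognized as netloc; it lands in path.
no_scheme = urlparse('movie.douban.com/', scheme='http')
print(repr(no_scheme.netloc), no_scheme.path)  # '' movie.douban.com/

# With allow_fragments=False the '#comment' part stays in the path.
kept = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
print(kept.path, repr(kept.fragment))  # /index.html#comment ''
```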

3. The parse_qs and parse_qsl functions in urllib.parse

urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None)

Purpose:

Parses the parameters in the query component of a URL and returns a dictionary that maps each key to a list of values.

Example:

from urllib import parse

print(parse.parse_qs('No=1&username=cortjt'))

urllib.parse.parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None)

Purpose:

Parses the parameters in the query component of a URL and returns a list of (key, value) tuples.

Example:

print(parse.parse_qsl('No=1&username=cortjt'))

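4. The urlunparse, urlsplit, urlunsplit, and urljoin functions in urllib.parse

urllib.parse.urlunparse(parts)

Purpose:

Assembles the tuple produced by urlparse() back into a URL.

Parameters:

A list or tuple whose length must be 6; otherwise an exception about the wrong number of values is raised.

urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)

Purpose:

Similar to urlparse(), it parses a URL. The difference is that it does not parse params separately; params is merged into path, so only 5 components are returned.

Parameters:

The same 3 parameters as urlparse.

urllib.parse.urlunsplit(parts)

Purpose:

Similar to urlunparse(), it assembles a URL.

Parameters:

A list or tuple whose length must be 5, since there is no params item.

urllib.parse.urljoin(base, url, allow_fragments=True)

Purpose:

Combines a base URL and a new URL into one complete URL. If url is already a complete URL, it takes precedence.

Parameters:

base is the base URL and url is the new URL. The function analyzes the scheme, netloc, and path of base and uses them to fill in the parts missing from the new URL.

Example: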

from urllib.parse import urlparse
from urllib.parse import urlsplit
from urllib.parse import urlunparse
from urllib.parse import urljoin

# Split the URL into 6 parts
result = urlparse("www.baidu.com/index.html;user?id=5#comment",scheme="https")
print(result)
print("---------------------")
# Split the URL into 5 parts; params is merged into path
print(urlsplit("www.baidu.com/index.html;user?id=5#comment"))
print("---------------------")
# Assemble a URL from a list
data = ['http','www.baidu.com','index.html','user','a=123','commit']
print(urlunparse(data))
print("---------------------")
# Join URLs
print(urljoin('https://movie.douban.com/', 'index'))
print(urljoin('https://movie.douban.com/', 'https://accounts.douban.com/login'))

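5. The quote and unquote functions in urllib.parse

Why are encoding and decoding needed? Characters that are not allowed in a URL (such as spaces, Chinese characters, and a slash inside a parameter value) are replaced with percent escapes of the form %XX, so quote and unquote are used to encode them and to decode them back.

urllib.parse.quote(string, safe='/', encoding=None, errors=None)

Purpose:

Percent-encodes a string so that it can be used safely in a URL.

Parameters:

The first argument is the string to encode. The second, safe, lists characters that are left unchanged during encoding; the default is '/'.

urllib.parse.quote_plus(string, safe='', encoding=None, errors=None)

Purpose:

Similar to quote(), but it converts spaces to plus signs, and safe defaults to the empty string.

urllib.parse.unquote(string, encoding='utf-8', errors='replace')

Purpose:

Decodes a percent-encoded string.

urllib.parse.unquote_plus(string, encoding='utf-8', errors='replace')

Purpose:

Similar to unquote(), but it also decodes plus signs to spaces.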

6. The urlencode() function in urllib.parse

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

Purpose:

Converts the given parameter pairs into a URL query string. The argument must be a mapping or a sequence of two-element tuples.

Parameters:

[(key1, value1), (key2, value2),...] or {'key1': 'value1', 'key2': 'value2',...}

Return value:

A string of the form 'key1=value1&key2=value2', percent-encoded as ASCII text.

Note:

A urlencoded string must be decoded once it has been received. urllib provides unquote() for this; there is no urldecode() function.

Example:

from urllib.parse import urlencode
from urllib.parse import unquote

data = {'id':100,'name':'魔兽'}
res = urlencode(data)
print(res)
print(unquote(res))

from urllib.parse import quote,quote_plus,unquote,unquote_plus
from urllib.parse import urlencode

# Percent-encode a URL
url='https://www.zhihu.com/question/50056807/answer/223566912'
print(quote(url))
print(quote(url,safe=":"))
print("----------------------")
# Encoding spaces and decoding plus signs
print(quote('a&b /c'))  # encodes & and the space; leaves the slash
print(quote_plus('a&b /c'))  # encodes & and the slash; encodes the space as +
print(unquote('1+2'))  # does not decode the plus sign
print(unquote_plus('1+2'))  # decodes the plus sign to a space
print("----------------------")
# Join a dict into a query string
query = {'name': 'walker', 'age': 99}
print(urlencode(query))
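
Since there is no urldecode(), a query string produced by urlencode() can be turned back into key-value data with parse_qs() or parse_qsl(). The following is a small round-trip sketch; the tag list is a made-up value added to show doseq=True, which emits one key=value pair per list item.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

# 'tag' is a hypothetical multi-valued parameter, added for illustration.
params = {'id': 100, 'name': '魔兽', 'tag': ['a', 'b']}

encoded = urlencode(params, doseq=True)
print(encoded)  # id=100&name=%E9%AD%94%E5%85%BD&tag=a&tag=b

decoded = parse_qs(encoded)
print(decoded)  # {'id': ['100'], 'name': ['魔兽'], 'tag': ['a', 'b']}

pairs = parse_qsl(encoded)
print(pairs)    # [('id', '100'), ('name', '魔兽'), ('tag', 'a'), ('tag', 'b')]
```

Note that the round trip returns every value as a string, so the integer 100 comes back as '100'.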

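The urllib.robotparser module

Many websites define a robots.txt file that tells crawlers which restrictions apply when crawling the site. It can be viewed by appending robots.txt to the site's root URL.

For example, the robots.txt file for https://www.douban.com is https://www.douban.com/robots.txt

Reading the contents of robots.txt:

Section 1: defines the Sitemap file, the so-called site map. A Sitemap helps crawlers find the site's latest content.

Section 2: the crawl delay. If it is not commented out, a crawler should wait at least 5 seconds between two consecutive requests. Sometimes a page address is given as well, and when the interval exceeds 5s the page automatically redirects to that link.

Section 3: the robots.txt file forbids crawlers whose user agent is Wandoujia Spider from accessing the site. In other words, the Wandoujia crawler is not allowed to access the site.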

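1. The RobotFileParser class in urllib.robotparser

class urllib.robotparser.RobotFileParser(url='')

A class dedicated to parsing robots.txt. Its commonly used methods are:

set_url(url): sets the URL of the robots.txt file.

read(): fetches the robots.txt file and feeds it to the parser.

parse(lines): parses the lines of a robots.txt file.

can_fetch(useragent, url): returns True or False, indicating whether the given user agent is allowed to fetch the URL.

mtime(): returns the time the robots.txt file was last fetched and parsed.

modified(): sets the time the robots.txt file was last fetched and parsed to the current time.

Example: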

import urllib.robotparser

# Analyze the robots.txt file of the Douban site
rp = urllib.robotparser.RobotFileParser()        # create the parser
rp.set_url('https://www.douban.com/robots.txt')  # set the URL of the robots.txt file
rp.read()                                        # fetch and parse robots.txt
url = 'https://www.douban.com'                   # the Douban home page
user_agent = 'Wandoujia Spider'                  # use the Wandoujia user agent
wsp_info = rp.can_fetch(user_agent, url)         # check whether the page may be fetched
print("Wandoujia Spider access:", wsp_info)      # expected: not allowed
# Wandoujia Spider access: False

user_agent = 'Other Spider'                      # use a different user agent
osp_info = rp.can_fetch(user_agent, url)         # check whether the page may be fetched
print("Other Spider access:", osp_info)          # expected: allowed
# Other Spider access: True
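
The parse(lines) method makes it possible to try out rules without downloading anything. The sketch below feeds the parser a hypothetical robots.txt written only for this example (it is not Douban's actual file) and also reads the crawl delay with crawl_delay(), which is available in Python 3.6 and later.

```python
import urllib.robotparser

# Hypothetical robots.txt content, written only for this example.
robots_lines = """\
User-agent: Wandoujia Spider
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /subject_search
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_lines)  # parse in-memory lines instead of calling read()

blocked = rp.can_fetch('Wandoujia Spider', 'https://www.douban.com/')
allowed = rp.can_fetch('Other Spider', 'https://www.douban.com/')
search = rp.can_fetch('Other Spider', 'https://www.douban.com/subject_search?q=x')
delay = rp.crawl_delay('Other Spider')

print(blocked)  # False: this user agent is disallowed everywhere
print(allowed)  # True: the home page is not covered by any Disallow rule
print(search)   # False: the path starts with /subject_search
print(delay)    # 5
```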

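2. The requests library

This library is used to request network resources, so it needs a network connection. Because it is a third-party library, install it first with pip install requests, then import it with import requests (see the official documentation).

Main functions and classes in requests

7 main functions: requests.request(), requests.get(), and so on

1 response class: requests.Response

6 exception classes: ReadTimeout, HTTPError, ConnectionError, and so on

Requests supports many request methods, but two are used most often: GET and POST.

requests.get('http://httpbin.org/get')

requests.post('http://httpbin.org/post')

Note: http://httpbin.org/ is a test site where you can try out the various request methods.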

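Main functions in requests

The request function

requests.request(method, url, **kwargs)  # **kwargs stands for 13 optional keyword arguments

Common parameters:

params: the query parameters, a dictionary

data: the request body, a dictionary

json: the request body, a JSON-serializable object

headers: the request headers, a dictionary

cookies: a dictionary or a CookieJar object

timeout: the timeout, that is, how many seconds to wait for the server to send data before giving up

allow_redirects: whether to follow redirects, a boolean, True by default

proxies: the proxies, a dictionary

Example:

import requests

res = requests.request("get", "http://httpbin.org/")
print(res.text)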

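The GET request

requests.get(url, params=None, **kwargs)

Main parameters:

Option 1: put everything directly in the URL, for example requests.get('http://www.baidu.com')

Option 2: put the parameters in a dict and pass it as params when sending the request

data={'name': 'tom','age': 20}
requests.get('http://httpbin.org/get', params=data)

**kwargs: 12 optional keyword arguments; the common ones are the same as for request()

Return type:

A requests.Response object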

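The POST request

requests.post(url, data=None, json=None, **kwargs)

Main parameters:

Option 1: put the parameters in a dict, convert it to a JSON string with json.dumps, and pass the string as data

import json
data={'name': 'tom','age': 20}
response = requests.post("http://httpbin.org/post", data=json.dumps(data))

Option 2: pass the dict as json. requests serializes it with json.dumps automatically and sets the Content-Type header to application/json

payload={'name': 'tom','age': 20}
response = requests.post("http://httpbin.org/post",json=payload)

**kwargs: 12 optional keyword arguments; the common ones are the same as for request()

Return type:

A requests.Response object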
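
How params, data, and json affect the outgoing request can be seen without contacting a server by preparing the request instead of sending it. This is a minimal sketch, assuming the requests package is installed; the payload is the same illustrative dict used above.

```python
import requests

payload = {'name': 'tom', 'age': 20}

# prepare() builds the request that would be sent, without sending it.
get_req = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()
form_req = requests.Request('POST', 'http://httpbin.org/post', data=payload).prepare()
json_req = requests.Request('POST', 'http://httpbin.org/post', json=payload).prepare()

print(get_req.url)                       # http://httpbin.org/get?name=tom&age=20
print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # name=tom&age=20
print(json_req.headers['Content-Type'])  # application/json
print(json_req.body)                     # the JSON-encoded payload
```

A dict passed as data is form-encoded, while the same dict passed as json is serialized to JSON, which is why the two POST requests carry different Content-Type headers.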

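The Response class

The requests.Response class has the following attributes and methods:

status_code: the response status code

headers: the response headers

encoding: the encoding taken from the response headers; if the header contains no charset, ISO-8859-1 is assumed

apparent_encoding: the encoding inferred from the page content

text: the response body, decoded using encoding

content: the response body as bytes

json(): parses the response body as JSON and returns the result

Example: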

import requests
# Request the Baidu home page
response = requests.get('http://www.baidu.com')
print(response.status_code)   # status code, 200
print(response.url)           # requested URL
print(response.headers)       # response headers
print(response.cookies)       # cookies
print(response.text)          # page source as text
print(response.content)       # page source as bytes
# print(response.json())      # parses a JSON body; raises an error here because this page is HTML

import requests

# Basic GET request
url="https://www.zhihu.com/explore"
print(requests.get(url).text)   # 400 Bad Request
# Some sites require browser information in the request; without headers the request fails
heads={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
req=requests.get(url, headers=heads)
print(req.text)                 # <html>发现 - 知乎…… the source of the Zhihu page

# Basic POST request
data={'name':'tom','age':'22'}
response=requests.post('http://httpbin.org/post', data=data)
print(response.text)            # headers:{…} json:null …

# Get the cookies
response=requests.get('http://www.baidu.com')
print(response.cookies)
# <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
print(type(response.cookies))
# <class 'requests.cookies.RequestsCookieJar'>
for k,v in response.cookies.items():
    print(k+':'+v)              # BDORZ:27315

Exception classes in requests

All exceptions raised by Requests inherit from requests.exceptions.RequestException.

import requests
from requests.exceptions import ReadTimeout,HTTPError,RequestException

# Send a GET request with a timeout and handle the exceptions
try:
    response = requests.get('http://www.baidu.com',timeout=0.1)
    print(response.status_code)
except ReadTimeout:
    print('timeout')
except HTTPError:
    print('httperror')
except RequestException:
    print('reqerror')
# Output when the request succeeds: 200
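
One detail worth knowing: requests.get() does not raise HTTPError on its own for 4xx or 5xx responses; the exception is raised only when raise_for_status() is called on the response. The sketch below assumes the requests package is installed and builds a Response object by hand, purely for illustration, so that it runs without a network connection.

```python
import requests
from requests.exceptions import HTTPError, ReadTimeout, RequestException, Timeout

# The exception hierarchy: specific classes derive from RequestException.
print(issubclass(ReadTimeout, Timeout))       # True
print(issubclass(Timeout, RequestException))  # True
print(issubclass(HTTPError, RequestException))  # True

# A hand-built response standing in for a real 404 reply (illustration only).
response = requests.Response()
response.status_code = 404
response.url = 'http://httpbin.org/status/404'

caught = None
try:
    response.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
except HTTPError as exc:
    caught = str(exc)
    print('httperror:', caught)
```

In real code, calling response.raise_for_status() right after requests.get() inside the try block is what makes an except HTTPError clause reachable.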