Python 3 Web Crawling: The urllib.request Module

Latest recommended article published 2024-08-17 13:38:49

lss926

Category: Python 3 web crawling. Tags: web crawler, python, urllib

Article link: https://blog.csdn.net/lss926/article/details/80811236

1. Basic usage of the urllib.request module

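Python has many libraries for fetching web pages. Python 2 code typically combined urllib and urllib2; Python 3 merged them into a single urllib package with four modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser.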

urllib.request opens URLs and reads the responses.
urllib.error defines the exceptions raised by urllib.request.
urllib.parse parses and manipulates URLs.
urllib.robotparser parses robots.txt files.
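
To make the division of labor concrete, here is a minimal sketch that touches the three helper modules without sending any request. The query parameters, the crawler name "MyCrawler", and the inline robots.txt rules are made up for illustration; a real crawler would read robots.txt from the target site.

```python
import urllib.error
import urllib.parse
import urllib.robotparser

# urllib.parse: split a URL into components and build a query string.
parts = urllib.parse.urlparse("http://www.baidu.com/s?wd=python#top")
query = urllib.parse.urlencode({"wd": "web crawler", "pn": 10})
print(parts.netloc, parts.path, parts.query)
print(query)

# urllib.error: HTTPError is a subclass of URLError, so catching URLError
# also catches HTTP error statuses.
http_error_is_url_error = issubclass(urllib.error.HTTPError, urllib.error.URLError)

# urllib.robotparser: evaluate robots.txt rules (hypothetical rules, given inline).
robots = urllib.robotparser.RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed = robots.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = robots.can_fetch("MyCrawler", "http://example.com/private/data")
print(allowed, blocked)
```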

(1) urllib.request.urlopen

Start with a short piece of code that fetches the Baidu homepage:

import urllib.request

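# Send a request to the given URL; urlopen() returns a file-like response object
response = urllib.request.urlopen("http://www.baidu.com")

# The response supports file-style methods; read() returns the body as bytes
print(response.read())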

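These few lines are enough to download the HTML of the Baidu homepage. Here urlopen() receives only a URL string, though. For anything more involved, such as adding HTTP headers, create a Request instance from the URL and pass that instance to urlopen() instead.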
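
Before moving on to Request, it can help to see urlopen() used a little more defensively. The sketch below is one possible wrapper, not part of the original example: the function name fetch_text, the 5-second default timeout, and the UTF-8 default encoding are all assumptions. It closes the response with a with block, decodes the bytes to text, and reports failures through the exceptions in urllib.error.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=5.0, encoding="utf-8"):
    """Return (status, text) on success, or (status_or_None, reason) on failure."""
    try:
        # The response is a context manager; leaving the block closes it.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()  # bytes, not str
            return response.status, body.decode(encoding)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return exc.code, exc.reason
    except urllib.error.URLError as exc:
        # No usable answer at all: DNS failure, refused connection, and so on.
        return None, str(exc.reason)
```

Calling fetch_text("http://www.baidu.com") would return the status code together with the decoded page. HTTPError has to be caught before URLError because it is a subclass of it.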

(2) urllib.request.Request

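A web crawler imitates a real browser talking to a server. Browsers identify themselves with different User-Agent headers, and a request sent straight through urllib.request carries the default User-Agent Python-urllib/3.6 (where 3.6 is the major.minor version of the running Python interpreter). A site can use the User-Agent to recognize a crawler and then block its IP address, so setting a browser-like User-Agent in the code makes that kind of block less likely.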

import urllib.request

url = "http://www.baidu.com"
header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"}

# Build a Request instance so that HTTP headers can be attached
request = urllib.request.Request(url, headers=header)

response = urllib.request.urlopen(request)
print(response.read())
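
The default User-Agent mentioned above can be inspected without sending a request, because each opener keeps its default headers in an addheaders list. The following sketch reads that default and then replaces it, which is an alternative to setting the header on every Request. The shortened BROWSER_UA string is an illustrative value, not one taken from the article.

```python
import urllib.request

# Hypothetical browser-like value; any current browser string would do.
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"

# An opener holds default headers that are added to requests lacking them.
opener = urllib.request.build_opener()
default_user_agent = dict(opener.addheaders)["User-agent"]
print(default_user_agent)  # "Python-urllib/" plus the interpreter's major.minor

# Replace the default so requests made with opener.open() carry BROWSER_UA
# unless the Request itself sets a User-Agent.
opener.addheaders = [("User-Agent", BROWSER_UA)]
```

With this in place, opener.open(url) is used where the earlier examples call urllib.request.urlopen(url).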

Besides passing a headers dict to the constructor, you can call request.add_header() to add or change a single header and request.get_header() to read back a header that has already been set.

import urllib.request

url = "http://www.baidu.com"
header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"}
request = urllib.request.Request(url, headers=header)
# Call request.add_header() to add or change a single header
request.add_header("Connection", "keep-alive")
# Call request.get_header() to read back a header that has been set
print(request.get_header(header_name="Connection"))
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))
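
One detail worth knowing when reading headers back: Request stores header names in capitalized form (first letter upper case, the rest lower case), and get_header() looks names up exactly as given. The sketch below shows the effect offline; the User-Agent value is a placeholder, and no request is sent.

```python
import urllib.request

request = urllib.request.Request(
    "http://www.baidu.com",
    headers={"User-Agent": "Mozilla/5.0 (illustrative value)"},
)
request.add_header("Connection", "keep-alive")

# Names are stored via str.capitalize(), so "User-Agent" becomes "User-agent".
stored_names = sorted(name for name, _ in request.header_items())
print(stored_names)

ua_stored_case = request.get_header("User-agent")   # matches the stored name
ua_browser_case = request.get_header("User-Agent")  # no match, returns None
has_connection = request.has_header("Connection")

# remove_header() deletes a header that was added earlier.
request.remove_header("Connection")
```

"Connection" happens to work in the example above because capitalizing it leaves it unchanged. Note also that, as far as CPython's implementation goes, the HTTP handler in urllib.request sets Connection: close itself when it sends the request, so a keep-alive value added here is best read as a demonstration of add_header() rather than a way to get persistent connections.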