Web Crawlers: urllib

SmiledrinkCat

Last modified: 2022-12-29 11:48:17

Category: Python Web Crawlers. Tags: crawler, python, urllib

First published: 2022-12-29 11:47:24

Article link: https://blog.csdn.net/SmiledrinkCat/article/details/128478589

1. Sending requests with urlopen()

1.1 Sending a GET request

import urllib.request
url = 'https://www.python.org/'
response = urllib.request.urlopen(url=url)  # send the request
print('Status code:', response.status)
print('All response headers:', response.getheaders())
print('A specific response header:', response.getheader('Accept-Ranges'))
print('HTML of the Python homepage:\n', response.read().decode('utf-8'))  # read the HTML and decode it as UTF-8

1.2 Sending a POST request

import urllib.request
import urllib.parse
url = 'https://www.httpbin.org/post'
# URL-encode the form data and convert it to bytes using UTF-8
data = bytes(urllib.parse.urlencode({'hello': 'python'}), encoding='utf-8')
response = urllib.request.urlopen(url=url, data=data)  # passing data makes this a POST request
print(response.read().decode('utf-8'))  # read the response body and decode it as UTF-8
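
The role of the data argument can be checked without touching the network. The sketch below only builds Request objects and asks each one which HTTP method it would use; the variable names are illustrative and not part of the original article.

```python
import urllib.parse
import urllib.request

url = 'https://www.httpbin.org/post'
# Same form data as above, encoded to bytes
form = urllib.parse.urlencode({'hello': 'python'}).encode('utf-8')

get_request = urllib.request.Request(url)              # no body
post_request = urllib.request.Request(url, data=form)  # body present

print(get_request.get_method())   # method used when there is no data
print(post_request.get_method())  # method used when data is supplied
print(post_request.data)          # the encoded form body
```

Nothing is sent here; urlopen() would apply the same rule when it is given a data argument directly.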

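1.3 Setting a timeout

import urllib.request
import urllib.error
import socket
url = 'https://www.python.org/'
try:
    # send the request with a timeout of 0.1 seconds
    response = urllib.request.urlopen(url=url, timeout=0.1)
    print(response.read().decode('utf-8'))
except urllib.error.URLError as error:
    if isinstance(error.reason, socket.timeout):  # check whether the error is a timeout
        print('The current task timed out; moving on to the next task!')
except socket.timeout:  # a timeout while reading the body is not wrapped in URLError
    print('The current task timed out; moving on to the next task!')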
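
A timeout can surface in two shapes: wrapped in a URLError (typically while connecting) or raised directly as socket.timeout (typically while reading the body). One possible way to keep that check in one place is a small helper; is_timeout is a hypothetical name introduced here for illustration.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Return True if exc looks like a timeout from urlopen() or read()."""
    if isinstance(exc, urllib.error.URLError):
        # Connection-phase timeouts arrive wrapped in URLError.reason
        return isinstance(exc.reason, socket.timeout)
    # Read-phase timeouts are raised directly
    return isinstance(exc, socket.timeout)


wrapped = urllib.error.URLError(socket.timeout('timed out'))
refused = urllib.error.URLError('connection refused')
print(is_timeout(wrapped))
print(is_timeout(socket.timeout('timed out')))
print(is_timeout(refused))
```

A crawler loop could then catch OSError broadly, skip the task when is_timeout(error) is true, and re-raise everything else.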

2. More complex requests

2.1 Setting request headers => mimic a browser when sending requests to the server, so that the server's anti-crawler measures are less likely to reject them

import urllib.request
import urllib.parse
url = 'https://www.httpbin.org/post'
# define the request headers
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/102.0.0.0 Safari/537.36'}
# URL-encode the form data and convert it to bytes using UTF-8
data = bytes(urllib.parse.urlencode({'hello': 'python'}), encoding='utf-8')
# create the Request object
r = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(r)
print(response.read().decode('utf-8'))
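
Headers can also be inspected or added after the Request object exists. The sketch below sends nothing; it shows that Request stores header names in capitalized form (for example 'User-agent'), which matters when reading them back with get_header(). The header values are placeholders chosen for illustration.

```python
import urllib.request

url = 'https://www.httpbin.org/get'
headers = {'user-agent': 'Mozilla/5.0 (example)', 'accept-language': 'en-US'}
r = urllib.request.Request(url=url, headers=headers)
r.add_header('Referer', 'https://www.httpbin.org/')  # add one more header later

stored = dict(r.header_items())    # header names as Request stores them
print(stored)
print(r.get_header('User-agent'))  # lookup uses the stored (capitalized) name
print(r.get_header('user-agent'))  # the original lowercase key is not found
```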

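2.2 Getting and setting Cookies

After a user logs in successfully, the browser keeps some information in a Cookie. When a crawler needs data that is only available after login, it can either simulate the login or reuse the Cookie obtained after login; sending that Cookie with later requests retrieves the data as the logged-in user.

2.2.1 Simulating a login

import urllib.request
import urllib.parse
url = 'http://XXX.com/dologin.html'  # login request URL
# request headers (same as in section 2.1)
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/102.0.0.0 Safari/537.36'}
# URL-encode the form data and convert it to bytes using UTF-8
data = bytes(urllib.parse.urlencode({'username': 'myname', 'password': 'mypassword'}), encoding='utf-8')
# create the Request object
r = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(r)
print(response.read().decode('utf-8'))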

2.2.2 Getting the Cookies during a simulated login

import urllib.request
import urllib.parse
import http.cookiejar
import json
url = 'http://XXX.com/dologin.html'  # login request URL
# URL-encode the form data and convert it to bytes using UTF-8
data = bytes(urllib.parse.urlencode({'username': 'myname', 'password': 'mypassword'}), encoding='utf-8')
cookie = http.cookiejar.CookieJar()  # create the CookieJar object
cookie_processor = urllib.request.HTTPCookieProcessor(cookie)  # create the Cookie handler
opener = urllib.request.build_opener(cookie_processor)  # create the opener object
response = opener.open(url, data=data)  # send the login request
msg = json.loads(response.read().decode('utf-8'))['msg']
if msg == '登录成功！':  # the server's success message ("Login succeeded!")
    for i in cookie:
        print(i.name + '=' + i.value)  # print the Cookies set by the successful login

2.2.3 Saving the Cookie to a file in LWP format

The Cookie can be saved to a file in LWP format, so that the next request can read the Cookie directly from the file.
First create an LWPCookieJar object, then call cookie.save() to write the Cookie information to the file.

import urllib.request
import urllib.parse
import http.cookiejar
import json
url = 'http://XXX.com/dologin.html'  # login request URL
# URL-encode the form data and convert it to bytes using UTF-8
data = bytes(urllib.parse.urlencode({'username': 'myname', 'password': 'mypassword'}), encoding='utf-8')
cookie_file = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(cookie_file)  # create the LWPCookieJar object
cookie_processor = urllib.request.HTTPCookieProcessor(cookie)  # create the Cookie handler
opener = urllib.request.build_opener(cookie_processor)  # create the opener object
response = opener.open(url, data=data)  # send the login request
msg = json.loads(response.read().decode('utf-8'))['msg']
if msg == '登录成功！':  # the server's success message ("Login succeeded!")
    cookie.save(ignore_discard=True, ignore_expires=True)  # save the Cookie file

2.2.4 Using the Cookie: call cookie.load() to read the local Cookie file

import urllib.request
import http.cookiejar
# URL of a page that requires login
url = 'http://XXX.com/index.html'
cookie_file = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar()
cookie.load(cookie_file, ignore_expires=True, ignore_discard=True)  # read the contents of the Cookie file
cookie_processor = urllib.request.HTTPCookieProcessor(cookie)  # create the Cookie handler
opener = urllib.request.build_opener(cookie_processor)  # create the opener object
response = opener.open(url)
print(response.read().decode('utf-8'))
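
Because XXX.com is only a placeholder, the save and load steps are hard to try as written. The following sketch exercises the same LWPCookieJar round trip offline by building a cookie by hand. The helper make_cookie, the cookie name and value, and the domain example.com are all assumptions made for illustration; a real crawler would get its cookies from the login response instead.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (illustration only)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


cookie_file = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

# Save: same calls as in 2.2.3, but with a hand-made cookie
jar = http.cookiejar.LWPCookieJar(cookie_file)
jar.set_cookie(make_cookie('sessionid', 'abc123', 'example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Load: same calls as in 2.2.4
restored = http.cookiejar.LWPCookieJar()
restored.load(cookie_file, ignore_discard=True, ignore_expires=True)
pairs = [(c.name, c.value) for c in restored]
print(pairs)
```

The ignore_discard=True flag matters here: session cookies are marked as discardable, so without it they would be skipped on both save and load.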

2.3 Setting a proxy IP

To get around anti-crawler measures, it is best to use a new proxy IP for every request.

import urllib.request
url = 'https://httpbin.org/get'
# create the proxy handler
proxy_handler = urllib.request.ProxyHandler({
    'https': '58.220.95.114:10053'  # the key is the protocol (such as http or https), the value is the proxy address
})
# create the opener object
opener = urllib.request.build_opener(proxy_handler)
response = opener.open(url, timeout=2)
print(response.read().decode('utf-8'))
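
Using a new proxy per request could be sketched as a rotation over a proxy pool. The addresses below are reserved documentation addresses (192.0.2.x), not working proxies, and next_opener is a hypothetical helper name; nothing is sent over the network in this example.

```python
import itertools
import urllib.request

# Placeholder pool; replace with proxies that actually work
PROXIES = ['192.0.2.10:8080', '192.0.2.11:8080', '192.0.2.12:8080']
proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return (proxy, opener), where the opener routes through the next proxy."""
    proxy = next(proxy_cycle)
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    return proxy, urllib.request.build_opener(handler)


# Each call picks the next proxy; after the last one it wraps around
used = [next_opener()[0] for _ in range(4)]
print(used)
```

In a crawler, each request would call next_opener() and then opener.open(url, timeout=...), ideally dropping a proxy from the pool once it starts failing.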

3. Exception handling

3.1 Catching URLError

import urllib.request
import urllib.error
try:
    # send a request to an address that does not exist
    response = urllib.request.urlopen('http://abcd.com/index.html')
except urllib.error.URLError as error:
    print(error.reason)  # print the reason for the error

3.2 Catching HTTPError

import urllib.request
import urllib.error
try:
    # request a page that does not exist on the server
    response = urllib.request.urlopen('http://abcd.com/index.html')
    print(response.status)
except urllib.error.HTTPError as error:
    print(error.code)  # print the status code
    print(error.reason)  # print the reason for the error
    print(error.headers)  # print the response headers

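3.3 Catching both exceptions

import urllib.request
import urllib.error
try:
    # send a request to an address that does not exist
    response = urllib.request.urlopen('http://abcd.com/index.html')
    print(response.status)
except urllib.error.HTTPError as error:  # HTTPError is a subclass of URLError, so it must be caught first
    print(error.code)  # print the status code
    print(error.reason)  # print the reason for the error
    print(error.headers)  # print the response headers
except urllib.error.URLError as error:
    print(error.reason)  # print the reason for the error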
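
The order of the two except clauses matters because HTTPError derives from URLError. This can be verified offline by constructing the exceptions by hand; describe is a hypothetical helper, and the URL and messages are made up for illustration.

```python
import io
import urllib.error


def describe(exc):
    """Classify an exception, checking the more specific type first."""
    if isinstance(exc, urllib.error.HTTPError):
        return f'HTTP {exc.code}: {exc.reason}'
    if isinstance(exc, urllib.error.URLError):
        return f'URL error: {exc.reason}'
    return f'other: {exc}'


# Hand-built errors, so no request is sent
http_error = urllib.error.HTTPError(
    'http://example.com/missing.html', 404, 'Not Found', None, io.BytesIO(b''))
url_error = urllib.error.URLError('no such host')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(describe(http_error))
print(describe(url_error))
```

If the URLError check came first, the HTTP error would match it too and its status code would never be reported.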

4. Parsing URLs

4.1 Splitting a URL

4.1.1 The urlparse() function

import urllib.parse
parse_result = urllib.parse.urlparse('https://docs.python.org/3/library/urllib.parse.html')
print(type(parse_result))
print(parse_result)

4.1.2 The urlsplit() function => the result has no separate params component; params stay part of path

import urllib.parse
parse_result = urllib.parse.urlsplit('https://docs.python.org/3/library/urllib.parse.html')
print(type(parse_result))
print(parse_result)
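
The URL used above has no params component, so the two results look the same apart from the missing field. A URL with a ';' segment makes the difference visible. The URL below is made up for illustration.

```python
import urllib.parse

url = 'https://example.com/path;type=a?x=1#top'

parsed = urllib.parse.urlparse(url)  # 6 components, params split out
split = urllib.parse.urlsplit(url)   # 5 components, params left in path

print(parsed.path, '|', parsed.params)
print(split.path)
print(len(parsed), len(split))
```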

4.2 Building a URL

4.2.1 The urlunparse() function

import urllib.parse
list_url = ['https', 'docs.python.org', '/3/library/urllib.parse.html', '', '', '']
tuple_url = ('https', 'docs.python.org', '/3/library/urllib.parse.html', '', '', '')
dict_url = {'scheme': 'https', 'netloc': 'docs.python.org', 'path': '/3/library/urllib.parse.html', 'params': '', 'query': '', 'fragment': ''}
print('URL built from a list:', urllib.parse.urlunparse(list_url))
print('URL built from a tuple:', urllib.parse.urlunparse(tuple_url))
print('URL built from a dict:', urllib.parse.urlunparse(dict_url.values()))

4.2.2 The urlunsplit() function

import urllib.parse
list_url = ['https', 'docs.python.org', '/3/library/urllib.parse.html', '', '']
tuple_url = ('https', 'docs.python.org', '/3/library/urllib.parse.html', '', '')
dict_url = {'scheme': 'https', 'netloc': 'docs.python.org', 'path': '/3/library/urllib.parse.html', 'query': '', 'fragment': ''}
print('URL built from a list:', urllib.parse.urlunsplit(list_url))
print('URL built from a tuple:', urllib.parse.urlunsplit(tuple_url))
print('URL built from a dict:', urllib.parse.urlunsplit(dict_url.values()))
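
Splitting and building combine naturally when only one part of a URL needs to change. Since the parse result is a named tuple, one option is to swap a field with _replace() and rebuild the URL with geturl(). The query value page=2 is an invented example.

```python
import urllib.parse

url = 'https://docs.python.org/3/library/urllib.parse.html'
parts = urllib.parse.urlparse(url)

# Replace only the query component and rebuild the URL
with_query = parts._replace(query='page=2').geturl()
# Rebuilding the untouched parts gives back the original URL
round_trip = urllib.parse.urlunparse(parts)

print(with_query)
print(round_trip == url)
```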

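4.3 Joining URLs => the urljoin() function

import urllib.parse
base_url = 'https://docs.python.org'
# when the second argument is not a complete URL, it is resolved against base_url
print(urllib.parse.urljoin(base_url, '/3/library/urllib.parse.html'))
# when the second argument is a complete URL, it is returned as is
print(urllib.parse.urljoin(base_url, 'https://docs.python.org/3/library/urllib.parse.html#url-parsing'))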
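
When crawling, the base is usually the URL of the page being parsed and the second argument is an href found in it. A few typical cases, as a sketch; the relative links are invented for illustration.

```python
import urllib.parse

page_url = 'https://docs.python.org/3/library/urllib.parse.html'

sibling = urllib.parse.urljoin(page_url, 'urllib.request.html')    # same directory
parent = urllib.parse.urljoin(page_url, '../tutorial/index.html')  # one level up
rooted = urllib.parse.urljoin(page_url, '/3/whatsnew/')            # from the site root

print(sibling)
print(parent)
print(rooted)
```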

4.4 Encoding and decoding URLs

4.4.1 The urlencode() function

import urllib.parse
base_url = 'http://httpbin.org/get?'
params = {'name': 'Jack', 'country': '中国', 'age': 30}
url = base_url + urllib.parse.urlencode(params)
print("Encoded request URL:", url)

4.4.2 The quote() function

import urllib.parse
base_url = 'http://httpbin.org/get?country='
url = base_url + urllib.parse.quote('中国')  # percent-encode a string
print("Encoded request URL:", url)

4.4.3 The unquote() function => decoding

import urllib.parse
u = urllib.parse.urlencode({'country': '中国'})
q = urllib.parse.quote('country=中国')
print('Result of urlencode:', u)
print('Result of quote:', q)
print('Decoding the urlencode result:', urllib.parse.unquote(u))
print('Decoding the quote result:', urllib.parse.unquote(q))
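
How quote() treats '/' and spaces depends on its safe argument, and quote_plus() follows the form-encoding convention that urlencode() also uses. A short comparison, with an arbitrary sample string:

```python
import urllib.parse

text = 'a b/c'

default_quote = urllib.parse.quote(text)        # '/' is kept by default
strict_quote = urllib.parse.quote(text, safe='')  # '/' is encoded as well
plus_quote = urllib.parse.quote_plus(text)      # space becomes '+'

print(default_quote)
print(strict_quote)
print(plus_quote)
print(urllib.parse.unquote_plus(plus_quote))    # decodes '+' back to a space
```

Note that unquote() leaves '+' unchanged, so a value produced by urlencode() or quote_plus() that contains spaces should be decoded with unquote_plus().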

4.5 Converting URL parameters => parse_qs() and parse_qsl()

4.5.1 parse_qs() converts the parameters into a dict

import urllib.parse
url = 'http://httpbin.org/get?name=Jack&country=%E4%B8%AD%E5%9B%BD&age=30'
q = urllib.parse.urlsplit(url).query  # get the query string
q_dict = urllib.parse.parse_qs(q)  # convert the parameters into a dict
print('Data type:', type(q_dict))
print('Converted data:', q_dict)

4.5.2 parse_qsl() converts the parameters into a list

import urllib.parse
str_params = 'name=Jack&country=%E4%B8%AD%E5%9B%BD&age=30'
list_params = urllib.parse.parse_qsl(str_params)  # convert the string into a list of tuples
print('Data type:', type(list_params))
print('Converted data:', list_params)

# Output
# Data type: <class 'list'>
# Converted data: [('name', 'Jack'), ('country', '中国'), ('age', '30')]
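
One detail worth knowing about parse_qs() is that every value in the dict is a list, because a key may appear more than once in a query string. The sketch below uses an invented query with a repeated tag key and shows that urlencode() with doseq=True turns the dict back into the same string.

```python
import urllib.parse

query = 'name=Jack&country=%E4%B8%AD%E5%9B%BD&tag=a&tag=b'

q_dict = urllib.parse.parse_qs(query)                 # values are lists
rebuilt = urllib.parse.urlencode(q_dict, doseq=True)  # expand lists into repeated keys

print(q_dict)
print(rebuilt == query)
```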