[Repost] Web Scraping Series: Fetching Data, urllib Basics (Summary)

1024码字猿

Last modified: 2022-04-16 20:08:06

Category: Using urllib. Tags: python, web scraping

First published: 2022-04-16 19:54:25

Article link: https://blog.csdn.net/weixin_40458518/article/details/124136100

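This article walks through how Python's urllib library is used in web scraping: sending HTTP requests, handling exceptions, parsing URLs, and analyzing the Robots protocol. It covers the urlopen method, the data parameter, exception handling, cookie management, proxy IPs, and URL parsing and joining. It also explains how to handle gzip-compressed sites and Chinese text encoding issues.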

(Summary generated automatically by CSDN.)

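Note: most of the code examples in this article are excerpted from 《Python3网络爬虫开发实战（第2版）》 (Python 3 Web Crawler Development in Practice, 2nd edition).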

I. Sending Requests

1. The urlopen method

# Python version: 3.6
# -*- coding:utf-8 -*-
"""
API of urlopen():
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
"""
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
# Read the page source
# print(response.read().decode('utf-8'))
# Print the type of the response
print(type(response))  # <class 'http.client.HTTPResponse'>
# Print the response status code
print(response.status)  # 200
# Print the response headers (a list of (name, value) tuples)
print(response.getheaders())
# Print the value of a single header, here Content-Type
print(response.getheader('Content-Type'))
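
The commented-out `read().decode('utf-8')` call above assumes the page is UTF-8. As a minimal sketch (not from the original article), the helper below decodes with the charset declared in the Content-Type header and falls back to UTF-8 when none is declared. `read_text` is a hypothetical name, and a `data:` URL stands in for a real site so the example runs without network access.

```python
import urllib.request


def read_text(response, default_charset='utf-8'):
    """Decode a response body using the charset from its Content-Type header."""
    # response.headers is an email.message.Message (or a subclass of it);
    # get_content_charset() returns None when no charset is declared.
    charset = response.headers.get_content_charset() or default_charset
    return response.read().decode(charset)


# urllib handles data: URLs itself, so nothing is fetched over the network.
with urllib.request.urlopen('data:text/plain;charset=utf-8,hello%20urllib') as response:
    text = read_text(response)

print(text)
```

Using `with` also closes the response once the block ends, which the examples in this section otherwise leave to the garbage collector.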

Extension 1: Saving a web page to local disk

a. Read the page source first, then write it to disk with ordinary file operations

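import urllib.request

# Open and fetch a web page
response = urllib.request.urlopen('https://www.baidu.com/')
# Read the page content and decode the bytes to str
html = response.read().decode('utf-8')
# Save to disk; text mode is needed because html is a str, not bytes
with open('html_1.html', mode='w', encoding='utf-8') as f:
    f.write(html)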

b. Use urllib.request.urlretrieve to download straight to local disk

import urllib.request

# Download to disk; the return value is a (filename, headers) tuple
fileName = urllib.request.urlretrieve("https://www.geeksforgeeks.org/", 'html_2.html')
print("fileName:", fileName)  # fileName: ('html_2.html', <http.client.HTTPMessage object at 0x000002A905276B38>)

Extension 2: Getting information about the response

a. Response headers

import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print("Response headers:", response.info())  # Accept-Ranges: bytes Cache-Control: no-cache ...
print("Response headers as a list of tuples:", response.getheaders())  # [('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'), ('Content-Length', '227'), ...]
print("A single header, e.g. Server:", response.getheader('Server'))  # BWS/1.1

b. Status code: status, getcode()

import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print("Status code:", response.getcode())  # 200
print("Status code:", response.status)  # 200

2. The data parameter

# Python version: 3.6
# -*- coding:utf-8 -*-

import urllib.request
import urllib.parse

# URL-encode the form fields, then convert to bytes; data must be bytes
data = bytes(urllib.parse.urlencode({'name': 'germey'}), encoding='utf-8')
print(data)
# Because data is supplied, the request is sent as POST instead of GET
response = urllib.request.urlopen('https://www.httpbin.org/post', data=data)
print(response.url)
print(response.read().decode('utf-8'))
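
Supplying `data` is what switches the request from GET to POST. The following offline sketch shows this by building `Request` objects (covered in section 5) and inspecting them with `get_method()`; nothing is sent, and the URLs are only placeholders taken from the example above.

```python
import urllib.parse
import urllib.request

# urlencode() turns the dict into a query string; encode() turns it into bytes.
data = urllib.parse.urlencode({'name': 'germey'}).encode('utf-8')
print(data)

# Request only describes a request; no connection is made here.
get_req = urllib.request.Request('https://www.httpbin.org/get')
post_req = urllib.request.Request('https://www.httpbin.org/post', data=data)
print(get_req.get_method())
print(post_req.get_method())
```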

3. The timeout parameter

Example 1:

import urllib.request

try:
    # timeout=3: give up if the request blocks for more than 3 seconds
    html = urllib.request.urlopen('http://www.google.com', timeout=3)
    data = html.read()
    print(len(data))
except Exception as e:
    print('Request failed:', str(e))

Example 2:

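import socket
import urllib.request
import urllib.error

try:
    # 0.1 seconds is short enough that this request will almost certainly time out
    response = urllib.request.urlopen('https://www.httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # URLError keeps the underlying error in e.reason
    if isinstance(e.reason, socket.timeout):
        print('time out')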
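
Example 2 only catches a timeout that urllib has wrapped in `URLError`. As far as I can tell, a timeout that occurs while the response is being read can instead reach the caller as a bare `socket.timeout`. The sketch below is a hypothetical helper, `is_timeout`, that recognizes both forms; it is exercised with hand-built exceptions so it runs offline.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Return True if exc looks like a timeout raised by urlopen()."""
    # Case 1: urllib wrapped the error in URLError and stored the
    # original exception in .reason.
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, socket.timeout)
    # Case 2: the socket.timeout exception reached the caller unwrapped.
    return isinstance(exc, socket.timeout)


wrapped = urllib.error.URLError(socket.timeout('timed out'))
bare = socket.timeout('timed out')
other = urllib.error.URLError('connection refused')
print(is_timeout(wrapped), is_timeout(bare), is_timeout(other))
```

In real code the helper would be called from an `except OSError as e:` clause around `urlopen()`, since both `URLError` and `socket.timeout` derive from `OSError`.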

4. Other parameters (for reference)

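# Python version: 3.6
# -*- coding:utf-8 -*-
import urllib.request
import ssl

context = ssl.create_default_context()

"""
context: an ssl.SSLContext that carries the SSL settings
cafile:  path to a CA certificate file
capath:  path to a directory of CA certificates
cafile and capath are deprecated since Python 3.6 and were removed in Python 3.12,
so only context is passed here.
"""
res = urllib.request.urlopen('https://www.sogou.com', context=context)
print(res.status)  # 200
print(res.read().decode('utf-8'))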
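
To make the role of `context` a little more concrete, here is a small offline sketch of what `ssl.create_default_context()` sets up. The certificate path in the comment is a made-up placeholder, not something from the original article.

```python
import ssl

# The default context verifies the server certificate and its host name.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)

# A custom CA bundle is loaded into the context rather than passed to urlopen().
# The path below is hypothetical:
# context.load_verify_locations(cafile='/path/to/ca-bundle.pem')
```

The resulting object is what gets passed as `context=context` to `urlopen()`.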

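5. Request(): only describes a request; pass it to urlopen() to actually send it

Parameters:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
1st parameter, url: the URL to request. This is the only required parameter; all the others are optional.
2nd parameter, data: must be of type bytes if supplied. If the data is a dict, encode it first with the urlencode method from the urllib.parse module.
3rd parameter, headers: a dict of request headers. Set them here at construction time, or add them later by calling the instance's add_header method.
4th parameter, origin_req_host: the host name or IP address of the requester.
5th parameter, unverifiable: whether the request is unverifiable, meaning the user had no chance to approve it. Defaults to False.
		For example, when an image inside an HTML document is fetched automatically and the user could not approve that fetch, unverifiable is True.
6th parameter, method: a string naming the HTTP method to use, e.g. GET, POST, or PUT.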
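
The parameters above can be tried out without sending anything, since a `Request` object can be built and inspected on its own. This is a minimal sketch; the User-Agent value `example-client/0.1` is invented for illustration.

```python
from urllib import parse, request

data = parse.urlencode({'name': 'germey'}).encode('utf-8')
req = request.Request(
    'https://www.httpbin.org/post',
    data=data,
    headers={'User-Agent': 'example-client/0.1'},  # hypothetical value
    method='POST',
)
# add_header() adds one header at a time after construction.
req.add_header('Accept', 'application/json')

print(req.full_url)
print(req.get_method())
# Request stores header names capitalized, so 'User-Agent' becomes 'User-agent'.
print(req.header_items())
```

Note the capitalization when reading a header back: `req.get_header('User-agent')` finds the value, while `req.get_header('User-Agent')` returns None.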

The headers parameter

from urllib.request import Request, urlopen

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/97.0.4692.71 Safari/537.36'
}
# Wrap the URL and headers in a Request, then send it with urlopen()
response = urlopen(Request('https://python.org', headers=headers))
print(response.getcode())
print(response.read().decode('utf-8'))

Passing several parameters, e.g. url, headers, and data

# Python version: 3.6
# -*- coding:utf-8 -*-

from urllib import parse, request

url = 'https://www.httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0',
    'Host': 'www.httpbin.org'
}
dict = {'name': 'germey'}
# URL-encode the dict (parameter encoding), then convert it to bytes
data = bytes(parse.urlencode(dict), encoding='utf-8')
print(data)  # b'