[Python Crawler] A Basic Introduction to urllib

RwatitFahsa

Category: Python, Crawlers

Article link: https://blog.csdn.net/sinat_37529938/article/details/110395625

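I recently studied urllib, and these are the notes I took along the way. If you find them useful, feel free to follow my WeChat official account: 苏小怪的梦呓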

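I. What is urllib?

urllib is the URL-handling package in Python's standard library, most often used to send HTTP requests.

Official documentation: https://docs.python.org/3/library/urllib.html

II. Submodules

urllib.request: the request module, used to open URLs

urllib.error: the exception module, containing the exceptions raised by urllib.request

urllib.parse: the URL parsing module

urllib.robotparser: the robots.txt parsing module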

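III. [Key topic] The core request library, urllib

1. urllib.request: the request module

Python 2: the functionality was split across two modules, urllib2 and urllib

Python 3: urllib and urllib2 were merged, and the request functions now live in urllib.request

from urllib.request import urlopen: urlopen(url, data=None) sends a request to the URL directly. If data is None the request defaults to GET; otherwise it is sent as POST. When urlopen() is given a plain URL string it cannot set custom headers such as User-Agent; pass a Request object for that.

from urllib.request import Request: Request is the class used to build a request (URL, data, headers).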
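The GET/POST rule can be checked without touching the network. The sketch below builds two Request objects and asks each which method it would use. The URL http://example.com/login and the form fields are placeholders, not something from the original notes.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields, for illustration only
url = 'http://example.com/login'
form = {'user': 'alice', 'page': 1}

# No data: urllib treats the request as a GET
get_req = Request(url)

# data must be bytes, so URL-encode the dict and then encode it as UTF-8
post_req = Request(url, data=urlencode(form).encode('utf-8'))

print(get_req.get_method())
print(post_req.get_method())
print(post_req.data)
```

Nothing is sent here. Passing either object to urlopen() is what would actually perform the request.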

1.1 A simple request

from urllib.request import urlopen


# Send the network request
resp = urlopen('https://www.baidu.com/')
assert resp.status == 200
print('Request succeeded')
# Save the fetched page
# f receives the value returned by __enter__() of the object that open() returns
with open('a.html', 'wb') as f:
    f.write(resp.read())

1.2 A request with headers

from urllib.request import Request, urlopen


def search_baidu():
    # The URL of the network resource
    url = 'https://www.baidu.com'

    # Build the request object, wrapping the URL and the request headers
    request = Request(url,
                      headers={
                          'Cookie': 'BIDUPSID=16CECBB89822E3A2F26ECB8FC695AFE0; PSTM=1572182457; BAIDUID=16CECBB89822E3A2C554637A8C5F6E91:FG=1; BD_UPN=123253; H_PS_PSSID=1435_21084_30211_30283; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; H_PS_645EC=6f7aTIObS%2BijtMmWgFQxMF6H%2FhK%2FcpddiytCBDrefRYyFX%2B%2BTpyRMZInx3E',
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
                      })


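    response = urlopen(request)  # Send the request

    assert response.status == 200
    print('Request succeeded')

    # Read the response data
    bytes_ = response.read()

    # Write the response data to a file
    with open('index.html', 'wb') as file:
        file.write(bytes_)


# Call the function (the original notes define it but never invoke it)
search_baidu()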
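Before sending a request it can help to check what was actually attached to it. This is a minimal sketch with a shortened placeholder User-Agent. One detail worth knowing: Request stores header names via str.capitalize(), so the lookup key is 'User-agent' rather than 'User-Agent'.

```python
from urllib.request import Request

# Placeholder User-Agent value, for illustration only
req = Request('https://www.baidu.com',
              headers={'User-Agent': 'Mozilla/5.0 (placeholder)'})

# Headers can also be added after construction
req.add_header('Referer', 'https://www.baidu.com/')

# Request normalizes names with str.capitalize(): 'User-Agent' -> 'User-agent'
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
print(req.header_items())
```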

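1.3 HTTP handlers

urllib's request handlers are mainly passed as arguments to urllib.request.build_opener(), which assembles an opener from several handlers so that it behaves like a simple browser.

HTTPHandler: handles requests made over the HTTP protocol.

HTTPCookieProcessor: handles cookies. Pass it an http.cookiejar.CookieJar instance when creating it; if you omit the argument, it creates a CookieJar of its own.

ProxyHandler: routes requests through a proxy.

from urllib.request import Request, build_opener, HTTPHandler, HTTPCookieProcessor, ProxyHandler

from http.cookiejar import CookieJar

HTTPHandler(): the HTTP protocol request handler

ProxyHandler(proxies={'http': 'http://proxy_ip:port'}): the proxy handler

HTTPCookieProcessor(CookieJar()): the cookie handler, backed by an instance of the http.cookiejar.CookieJar class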

"""
Combining several urllib request handlers
- Cookie
- Proxy
- HTTP
"""
import json
from urllib.request import Request, build_opener, HTTPHandler, HTTPCookieProcessor, ProxyHandler
from http.cookiejar import CookieJar
from urllib.parse import urlencode


opener = build_opener(HTTPHandler(),
                      HTTPCookieProcessor(CookieJar()),
                      ProxyHandler(proxies={
                          'http': 'http://180.113.189.147:9999'
                      })
                      )


post_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=2019111173190'

# Open http://www.renren.com and log in to obtain the login parameters below
data = {
    'rkey': '349d874cb30075e222d45ba63074a793',
    'password': 'xxx',  # password
    'origURL': 'http://www.renren.com/home',
    'key_id': '1',
    'icode': '',
    'f': 'http://www.renren.com/224549540',
    'email': 'xxx',  # email address
    'domain': 'renren.com',
    'captcha_type': 'web_login',
}


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36',
    'Referer': 'http://www.renren.com/SysHome.do'
}


request = Request(post_url,
                  urlencode(data).encode('utf-8'),
                  headers)


resp = opener.open(request)  # http.client.HTTPResponse
bytes_ = resp.read()
ret = json.loads(bytes_.decode('utf-8'))  # {"code":true,"homeUrl":"http://www.renren.com/home"}
if ret['code']:
    resp = opener.open(ret['homeUrl'])
    bs = resp.read()
    print(bs.decode('utf-8'))
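The login flow above relies on the opener remembering cookies between its two open() calls. A small offline sketch of that idea follows: it keeps a reference to the CookieJar so the stored cookies can be inspected later, and sets a default header on the opener through its addheaders list. The User-Agent value and the cookie_names helper are illustrative assumptions.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# Keep a reference to the jar so it can be inspected after requests are made
jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))

# Default headers sent with every request made through this opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (placeholder)')]


def cookie_names(cookie_jar):
    """Return the names of all cookies currently stored in the jar (hypothetical helper)."""
    return sorted(cookie.name for cookie in cookie_jar)


# No request has been made yet, so the jar is still empty
print(len(jar))
print(cookie_names(jar))
```

After a successful opener.open(...) against a site that sets cookies, cookie_names(jar) would list them, and later requests through the same opener would send them back.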

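2. urllib.parse: the URL parsing module

The two functions used most often are quote and urlencode; the table lists them together with the other common helpers.

No.	Name	Purpose
1	quote(str)	Percent-encodes a string (for example Chinese text) for use in a URL
2	unquote(str)	Decodes a percent-encoded string
3	urlencode(query)	Serializes a dict of parameters into a URL-encoded string of the form k1=v1&k2=v2 (commonly used to build GET and POST parameters)
4	urlparse(url)	Splits a URL into its components
5	urlunparse(components)	Builds a URL from its components
6	urljoin(base, url)	Resolves a partial link against a base URL to produce a complete URL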

1. quote

from urllib import parse

word = '编程'
url = 'http://www.baidu.com/s?wd=' + parse.quote(word)
print(parse.quote(word))
print(url)

"""
%E7%BC%96%E7%A8%8B
http://www.baidu.com/s?wd=%E7%BC%96%E7%A8%8B
"""

2. unquote

# unquote: decodes a percent-encoded URL
url = 'http://www.baidu.com/s?wd=%E7%BC%96%E7%A8%8B'
print(parse.unquote(url))
"""
http://www.baidu.com/s?wd=编程
"""

3. urlencode

params = {
    'wd': '123',
    'page': 20
}
params_str = parse.urlencode(params)
print(params_str)
# The output follows the dict's insertion order (guaranteed since Python 3.7)
"""
wd=123&page=20
"""

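As a sketch of how urlencode is typically used for a GET request, the snippet below appends encoded parameters to a base URL and then reads them back with parse_qs. The parameter names wd and pn are illustrative assumptions, and parse_qs is an additional urllib.parse helper that the table above does not cover.

```python
from urllib.parse import parse_qs, urlencode, urlparse

base = 'http://www.baidu.com/s'
params = {'wd': '编程', 'pn': 20}  # illustrative parameter names

# urlencode percent-encodes each value and joins the pairs with '&'
query = urlencode(params)
url = base + '?' + query
print(url)

# parse_qs reverses the process; every value comes back as a list of strings
decoded = parse_qs(urlparse(url).query)
print(decoded)
```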
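4. urlparse

url = 'https://book.qidian.com/info/1004608738?wd=123&page=20#Catalog'
"""
url: the URL to parse
scheme='': the default scheme to use when the URL has none; ignored if the URL already specifies a scheme
allow_fragments=True: whether to recognize the fragment (anchor). With the default True the fragment is split out into its own field; with False it stays attached to the preceding component and the fragment field is empty
"""
result = parse.urlparse(url, scheme='http', allow_fragments=True)
print(result)
print(result.scheme)
"""
ParseResult(scheme='https', netloc='book.qidian.com', path='/info/1004608738', params='', query='wd=123&page=20', fragment='Catalog')
https
scheme: the protocol
netloc: the network location (domain name, plus the port if present)
path: the path
params: parameters for the last path segment
query: the query string, typically the parameters of a GET request
fragment: the anchor, used to jump straight to a specific position on the page
"""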

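The effect of the scheme and allow_fragments arguments is easier to see side by side. The sketch below reuses the URL from the example above and adds a scheme-less variant (starting with //) as an illustrative assumption.

```python
from urllib.parse import urlparse

url = 'https://book.qidian.com/info/1004608738?wd=123&page=20#Catalog'

# The URL already has a scheme, so the default passed here is ignored
with_scheme = urlparse(url, scheme='http')

# A scheme-less URL (starting with //) picks up the default scheme
without_scheme = urlparse('//book.qidian.com/info/1004608738', scheme='http')

# With allow_fragments=False the '#Catalog' part is not split out,
# so it stays attached to the query component
no_fragment = urlparse(url, allow_fragments=False)

print(with_scheme.scheme, without_scheme.scheme)
print(no_fragment.query, repr(no_fragment.fragment))
```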
5. urlunparse

url_params = ('https', 'book.qidian.com', '/info/1004608738', '', 'wd=123&page=20', 'Catalog')
# components: an iterable that must contain exactly 6 items
result = parse.urlunparse(url_params)
print(result)
"""
https://book.qidian.com/info/1004608738?wd=123&page=20#Catalog
"""

6. urljoin

base_url = 'https://book.qidian.com/info/1004608738?wd=123&page=20#Catalog'
sub_url = '/info/100861102'
full_url = parse.urljoin(base_url, sub_url)
print(full_url)  # https://book.qidian.com/info/100861102
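When crawling, the links found in a page come in several shapes. The sketch below resolves a few of them against the same base URL; the link values are hypothetical examples of what might appear in href attributes.

```python
from urllib.parse import urljoin

base = 'https://book.qidian.com/info/1004608738?wd=123&page=20#Catalog'

# Hypothetical links as they might appear in a page's href attributes
links = ['/info/100861102', 'chapter/1', '../about',
         '?page=2', 'https://www.example.com/a']

# Resolve every link against the base URL
resolved = {link: urljoin(base, link) for link in links}

for link, full in resolved.items():
    print(link, '->', full)
```

A link that is already absolute is returned unchanged, while relative paths and bare query strings are resolved against the base URL.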

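3. urllib.error: the exception module

No.	Name	Purpose
1	URLError	Defined in urllib's error module. It inherits from OSError and is the base class of the module's exceptions, so most exceptions raised by the request module can be handled through it. Its reason attribute gives the cause of the error.
2	HTTPError	A subclass of URLError dedicated to HTTP request errors, such as a failed authentication request. It has three attributes. code: the HTTP status code, for example 404 (page not found) or 500 (internal server error). reason: the cause of the error. headers: the response headers.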

# A good way to handle exceptions:
# catch the subclass (HTTPError) first, then the parent class (URLError)
from urllib import request, error


try:
    response = request.urlopen('https://blog.csdn.net/Daycym/article/details/11')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
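The same subclass-first pattern can be wrapped in a small helper so that callers get either the body or an error message. This is a sketch under stated assumptions: the fetch function and its return shape are hypothetical, and the file:// URL points at a path assumed not to exist so the error branch can be exercised without network access.

```python
from urllib import error, request


def fetch(url, timeout=10):
    """Return (body, None) on success or (None, message) on failure (hypothetical helper)."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.read(), None
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        return None, f'HTTP {e.code}: {e.reason}'
    except error.URLError as e:
        # The request never got a valid HTTP response (DNS failure, refused connection, ...)
        return None, f'URL error: {e.reason}'


# A local path assumed not to exist, used to trigger URLError offline
body, message = fetch('file:///no/such/dir/no_such_file.html')
print(body, message)
print(issubclass(error.HTTPError, error.URLError))
```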

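4. urllib.robotparser: the robots.txt parsing module

The Robots protocol
Also known as the crawler protocol or robot protocol, its full name is the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be crawled and which may not. It usually takes the form of a text file named robots.txt, generally placed in the site's root directory.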

from urllib.robotparser import RobotFileParser

# First create a RobotFileParser object, then set the robots.txt URL with set_url()
robotparser = RobotFileParser()
# Or: robotparser = RobotFileParser('https://blog.csdn.net/robots.txt')
robotparser.set_url('https://blog.csdn.net/robots.txt')
robotparser.read()
print(robotparser.can_fetch('*', 'https://blog.csdn.net/lianshaohua'))  # Whether this URL may be crawled
print(robotparser.can_fetch('*', "https://blog.csdn.net/nav/db"))


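'''
Commonly used methods of this class:

set_url()  Sets the URL of the robots.txt file. Not needed if the URL was passed when creating the RobotFileParser object.
read()  Fetches the robots.txt file and parses it. It returns nothing, but it must be called: until the rules have been loaded, every can_fetch() check returns False.
parse()  Parses robots.txt content. Pass it the lines of a robots.txt file and it analyzes them according to the robots.txt syntax rules.
can_fetch()  Takes two arguments, the User-Agent and the URL to fetch, and returns True or False depending on whether that URL may be fetched.
mtime()  Returns the time robots.txt was last fetched and parsed. Long-running crawlers can use it to decide when to re-fetch the latest robots.txt.
modified()  Sets the time robots.txt was last fetched and parsed to the current time; also useful for long-running crawlers.
'''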
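The parse() method makes it possible to try the rules without fetching anything. The sketch below feeds it a hypothetical two-line robots.txt and checks two URLs on example.com; both the rules and the URLs are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = RobotFileParser()
rp.parse(rules)  # parsing also records the current time as the last-checked time

blocked = rp.can_fetch('*', 'https://example.com/private/data.html')
allowed = rp.can_fetch('*', 'https://example.com/index.html')

print(blocked, allowed)
print(rp.mtime() > 0)
```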
