Python Web Scraping, Lesson 2: The urllib Library in Detail

黎明前最后的黑暗


The urllib Library in Detail

Official documentation link

Goals

● What urllib is
● How to use it

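01. What is urllib

urllib is Python's built-in HTTP request library. It consists of four modules:
● urllib.request
The request module. This is the most basic HTTP request module. It simulates sending a request, much like typing a URL into the browser and pressing Enter: you pass the library function a URL plus any extra parameters.
● urllib.error
The exception-handling module. If a request fails, we can catch the exception and then retry or take other action so the program does not terminate unexpectedly.
● urllib.parse
The URL-parsing module. A utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
● urllib.robotparser
The robots.txt parsing module. It reads a site's robots.txt file and reports which URLs on that site may be crawled and which may not. In practice it is used less often.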
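As a minimal sketch of how urllib.robotparser is used, the snippet below feeds a small, made-up robots.txt to RobotFileParser.parse() and asks which URLs may be fetched. The rules and the example.com URLs are assumptions for illustration. A real crawler would more likely call set_url() and read() to download the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, supplied as a list of lines so no network access is needed.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(useragent, url) answers whether that agent may request the URL.
allowed = parser.can_fetch("*", "http://example.com/index.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)
```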

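02. Usage in detail

2.1 urlopen

（1）urlopen parameters
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

● url: the URL to open (a string or a Request object)
● data=None: extra data to send, for example a POST form. With data, the request is sent as POST; without it, as GET.
● [timeout, ]*: the timeout in seconds
● cafile=None: CA certificate setting, usually not needed (deprecated since Python 3.6 and removed in Python 3.13; use context instead)
● capath=None: CA certificate setting, same as above
● cadefault=False: CA certificate setting, same as above
● context=None: an ssl.SSLContext instance describing the SSL/TLS options for HTTPS requests; usually left as None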
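To make the context parameter more concrete, here is a small sketch that builds a default SSL context and passes it to urlopen. The helper name fetch_https and its default timeout are hypothetical, and the function is only defined here, not called, so nothing is downloaded.

```python
import ssl
import urllib.request

# A default context verifies the server certificate and the host name.
context = ssl.create_default_context()

def fetch_https(url, timeout=10):
    """Fetch url over HTTPS using the explicit SSL context (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()

print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```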
（2）Using the urlopen parameters

● 1) The url parameter

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

● 2) The data parameter

import urllib.parse
import urllib.request

# urlencode() returns a str; urlopen() needs bytes for the request body
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
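The effect of data on the request method can be checked without sending anything. The sketch below builds two Request objects (the class that urlopen uses internally for a URL string) and asks each for its method. The variable names are illustrative.

```python
import urllib.parse
import urllib.request

form = {'word': 'hello'}
body = urllib.parse.urlencode(form).encode('utf-8')

# No data: the request will be a GET. With data: it becomes a POST.
get_request = urllib.request.Request('http://httpbin.org/get')
post_request = urllib.request.Request('http://httpbin.org/post', data=body)

print(body)
print(get_request.get_method())
print(post_request.get_method())
```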
● 3) The timeout parameter
If the request finishes within the timeout, the output is printed normally:

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get',timeout = 1)
print(response.read())

If the request times out, an exception is raised:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TimeOut')

http://httpbin.org is a site built for testing HTTP requests; it provides endpoints for the various request methods.
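Building on the timeout example, one possible pattern is to retry a few times before giving up. The sketch below is an assumption about how that could look, not part of the original lesson: fetch_with_retry, is_timeout, and the opener parameter are hypothetical names, and the opener parameter exists only so the function can be exercised without a network. A timeout while reading the body may surface as a bare socket.timeout rather than a URLError, so both are handled.

```python
import socket
import urllib.error
import urllib.request

def is_timeout(exc):
    """Return True if exc looks like a timeout raised during a urlopen call."""
    if isinstance(exc, socket.timeout):
        return True
    return isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, socket.timeout)

def fetch_with_retry(url, timeout=1.0, attempts=3, opener=urllib.request.urlopen):
    """Try up to `attempts` times; return the body, or None if every try timed out."""
    for _ in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            if not is_timeout(exc):
                raise  # not a timeout: let the caller see the real error
    return None
```

A call such as fetch_with_retry('http://httpbin.org/get', timeout=0.1) would return None after three timed-out attempts instead of raising.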

2.2 Response

（1）Response type

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(type(response))

The output shows that the response type is http.client.HTTPResponse.
（2）Status code and response headers
Use response.status, response.getheaders(), and response.getheader('Server') to get the status code and header information.

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
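The headers are also useful for decoding the body. As a sketch, the snippet below reads the charset declared in the Content-Type header instead of hard-coding 'utf-8'. It uses a data: URL (an assumption made here so the example runs without network access), and falling back to 'utf-8' when no charset is declared is likewise an assumption.

```python
import urllib.request

# A data: URL carries its own content, so this runs without network access.
response = urllib.request.urlopen('data:text/plain;charset=utf-8,hello%20urllib')

# response.headers is a message object; ask it for the declared charset.
charset = response.headers.get_content_charset() or 'utf-8'
text = response.read().decode(charset)
print(charset, text)
```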

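Passing a plain URL string to urlopen only covers simple requests. Crawlers often have to send extra information such as headers before the target site will respond, and urlopen has no parameter for that. This is where the Request class comes in.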

2.3 Request

（1）A simple example

import urllib.request

# Build a Request object for the URL (nothing is sent yet)
request = urllib.request.Request('http://python.org')
# Pass the Request object as the first argument to urlopen
response = urllib.request.urlopen(request)
# Read the response body and decode it as UTF-8
print(response.read().decode('utf-8'))

（2）Adding headers to a request with Request: method 1

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {'User-Agent': 'Mozilla/4.0(compatible; MSIE 5.5; Windows NT)',
           'Host': 'httpbin.org'}
dict_demo = {'name': 'Andy'}
data = bytes(parse.urlencode(dict_demo), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

（3）Adding headers to a request with Request: method 2

from urllib import request, parse

url = 'http://httpbin.org/post'
dict_demo = {'name': 'Andy'}
data = bytes(parse.urlencode(dict_demo), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0(compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
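Both methods end up storing the headers on the Request object, which can be inspected before anything is sent. The sketch below combines them and reads the values back. The Accept-Language header is an extra added for illustration. Note that Request normalizes header names with str.capitalize(), so lookups use forms like 'User-agent'.

```python
from urllib import request

req = request.Request(
    'http://httpbin.org/post',
    data=b'name=Andy',
    headers={'User-Agent': 'Mozilla/4.0(compatible; MSIE 5.5; Windows NT)'},
)
req.add_header('Accept-Language', 'en')

# Header names are stored capitalized, e.g. 'User-agent' and 'Accept-language'.
print(req.get_method())
print(req.has_header('User-agent'))
print(req.get_header('Accept-language'))
```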

2.4 Handler

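Proxies. A crawler sends requests to a site continuously, far faster than anyone could by hand, and sites commonly track how many requests a single IP makes within a time window. When the count for our IP looks abnormal, the site treats the traffic as "malicious", cuts off access, and stops returning data. To avoid this we can send requests through a proxy, using ProxyHandler.
（1）Building a ProxyHandler
Pass a dictionary that maps each scheme to a proxy URL, then build an opener from the handler:

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())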

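Put simply, the target site sees the proxy's IP address instead of ours. By switching among several proxies, no single IP sends enough requests to get blocked.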
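One way to switch proxies is to cycle through a list and build a fresh opener for each request. This is only a sketch under assumptions: the second proxy address and the helper name next_opener are made up, and a real setup would need working proxies.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses; replace them with proxies you actually control.
PROXIES = ['http://127.0.0.1:9743', 'http://127.0.0.1:9744']
_proxy_cycle = itertools.cycle(PROXIES)

def next_opener():
    """Return (proxy, opener); the opener sends HTTP and HTTPS traffic through proxy."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    return proxy, urllib.request.build_opener(handler)

first_proxy, first_opener = next_opener()
second_proxy, second_opener = next_opener()
print(first_proxy, second_proxy)
```

Each opener is then used like the one above, for example first_opener.open('http://httpbin.org/get').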

2.5 cookie

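（1）http.cookiejar
Cookies are small pieces of data stored on the client that identify the user. In crawlers they are mainly used to keep a login session alive, so some sites have to be requested with cookie information attached. The http.cookiejar module is used here to capture and store cookies.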

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

（2）http.cookiejar.MozillaCookieJar()
One way to save cookies to a file, using the Mozilla cookies.txt format.

import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

（3）http.cookiejar.LWPCookieJar()
Another way to save cookies to a file, using the libwww-perl Set-Cookie3 format.

import http.cookiejar, urllib.request
filename = 'cookie2.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

（4）Reading a cookie file

import http.cookiejar, urllib.request
# Use load() to read the cookie file saved by LWPCookieJar
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie2.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
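The save and load steps can be tried without contacting any site. The sketch below builds one cookie by hand, saves it with LWPCookieJar, and loads it into a second jar. The make_cookie helper, the cookie values, and the example.com domain are all made up for illustration.

```python
import http.cookiejar
import os
import tempfile

def make_cookie(name, value, domain):
    """Build a session cookie by hand (illustrative helper, not from the lesson)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, 'cookie2.txt')

    saved = http.cookiejar.LWPCookieJar(filename)
    saved.set_cookie(make_cookie('session', 'abc123', 'example.com'))
    # Session cookies are marked "discard", so ignore_discard=True is needed to keep them.
    saved.save(ignore_discard=True, ignore_expires=True)

    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(filename, ignore_discard=True, ignore_expires=True)

restored = {cookie.name: cookie.value for cookie in loaded}
print(restored)
```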

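2.6 Exception handling

Requests can fail, for example with a 404 or 500 error, and an unhandled error stops the program. Such failures are routine in crawling and we do not want the program to terminate, so we catch the exceptions up front and let the program keep running.
（1）Requesting a site that does not exist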

from urllib import request, error
try:
    response = request.urlopen('http://xiaoming.com/index.html')
except error.URLError as e:
    print(e.reason)

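（2）URLError
Attribute: reason (the cause of the error)

（3）HTTPError
HTTPError is a subclass of URLError, so its except clause has to come first.
Attributes:
code: the HTTP status code
reason: the cause of the error
headers: the response headers of the failed request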

from urllib import request, error
try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    # HTTPError is caught first because it is a subclass of URLError
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("request successful")
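The relationship between the two exception classes can be seen offline by constructing them by hand. In this sketch the describe helper and the placeholder URL are hypothetical; it checks HTTPError before URLError for the same reason the except clauses are ordered that way.

```python
from urllib import error

def describe(exc):
    """Summarize a urllib exception; HTTPError is checked first because it is the subclass."""
    if isinstance(exc, error.HTTPError):
        return 'HTTP error {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URL error: {}'.format(exc.reason)
    return 'unexpected error: {!r}'.format(exc)

# Exceptions built by hand so the example runs offline; the URL is a placeholder.
not_found = error.HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
unreachable = error.URLError('connection refused')

print(issubclass(error.HTTPError, error.URLError))
print(describe(not_found))
print(describe(unreachable))
```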

（4）Checking the cause of an error

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIMEOUT')

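2.7 URL parsing

（1）urlparse
● 1) Parameters
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
● 2) Usage
urlparse splits a URL into six components: scheme, netloc, path, params, query, and fragment. The scheme and allow_fragments arguments adjust how the split is done.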

from urllib.parse import urlparse

# The urlstring parameter
result_urlstring = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result_urlstring), result_urlstring)

# The scheme parameter: a default that applies only when the URL has no scheme of its own
result_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result_scheme)
result_scheme2 = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result_scheme2)
# The allow_fragments parameter: when False, the '#...' part stays in the preceding component
result_allow_fragments1 = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result_allow_fragments1)
result_allow_fragments2 = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result_allow_fragments2)
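The value returned by urlparse is a named tuple, so its parts can be read by name or by index. The sketch below also uses parse_qs, a related function from the same module that the lesson does not cover, to turn the query string into a dictionary.

```python
from urllib.parse import urlparse, parse_qs

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

# Fields can be read by name or by position.
print(result.scheme, result.netloc, result.path)
print(result[3], result.query, result.fragment)

# parse_qs turns the query string into a dict of lists.
query = parse_qs(result.query)
print(query)

# Without '//' the host is not recognized as netloc and ends up in path.
no_scheme = urlparse('www.baidu.com/index.html', scheme='https')
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)
```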

（2）urlunparse
The reverse of urlparse: it joins the six components back into a complete URL. The argument must contain exactly six items.

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
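A common use of the pair is to parse a URL, change one component, and put it back together. The following sketch assumes we want to change the query and drop the fragment; the variable names are illustrative.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Parsing and unparsing this URL gives back the original string.
round_trip = urlunparse(parts)

# ParseResult is a named tuple, so _replace returns a modified copy.
changed = urlunparse(parts._replace(query='id=6', fragment=''))
print(round_trip)
print(changed)
```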

（3）urljoin
Joins a base URL with another URL, which may be relative.

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://pythonsite.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://pythonsite.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

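If the second argument is an absolute URL, it is returned as is; otherwise the components it lacks are taken from the base URL.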
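The cases that matter most when following links on a page are relative references. The sketch below resolves a few against one base URL; the /docs/about.html path is an assumption chosen to make the difference between the cases visible.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/docs/about.html'

examples = {
    'FAQ.html': urljoin(base, 'FAQ.html'),    # relative to the base directory
    '/FAQ.html': urljoin(base, '/FAQ.html'),  # relative to the site root
    '?page=2': urljoin(base, '?page=2'),      # only the query is replaced
    'https://pythonsite.com/FAQ.html': urljoin(base, 'https://pythonsite.com/FAQ.html'),
}
for reference, joined in examples.items():
    print(reference, '->', joined)
```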
（4）urlencode
Converts a dictionary into a GET query string.

from urllib.parse import urlencode

params = {'name': 'Andy', 'age': 22}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
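urlencode also takes care of characters that are not safe in a query string. As a short sketch with made-up parameter names, the example below encodes a space and an ampersand, reverses the encoding with parse_qs, and shows the doseq option for list values.

```python
from urllib.parse import urlencode, parse_qs

params = {'q': 'hello world', 'tag': 'a&b'}
query = urlencode(params)
print(query)

# parse_qs reverses the encoding (each value comes back as a list).
decoded = parse_qs(query)
print(decoded)

# doseq=True expands a sequence into repeated keys.
multi = urlencode({'id': [1, 2]}, doseq=True)
print(multi)
```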
