Using the urllib Library

Latest recommended article published on 2022-10-21 20:51:34

GouZe1

Category: Scrapy. Tags: using the urllib library

Article link: https://blog.csdn.net/weixin_43734271/article/details/90115916

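The urllib library contains four modules:

urllib.request: the basic HTTP request module. It can simulate a browser sending requests to a target server.
urllib.error: the exception handling module. Errors raised during a request can be caught through it.
urllib.parse: a utility module that provides URL handling functions, such as encoding and decoding URLs.
urllib.robotparser: parses a site's robots.txt file to determine which URLs may be crawled and which may not.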

import urllib      # import the package
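As a quick offline illustration of the two helper modules in the list above, the sketch below encodes and decodes URL pieces with urllib.parse and checks crawl permission with urllib.robotparser. The robots.txt rules and the example.com URLs are made up for this example; a real crawler would more likely load the rules from the site with set_url() and read().

```python
from urllib import parse, robotparser

# Percent-encode a path component, then decode it again.
encoded = parse.quote('hello world/你好')
decoded = parse.unquote(encoded)

# Build a query string from a dict and parse it back.
query = parse.urlencode({'word': 'hello', 'page': 2})
params = parse.parse_qs(query)

# Feed hypothetical robots.txt rules straight to the parser (no network needed).
rules = [
    'User-agent: *',
    'Disallow: /private/',
]
robots = robotparser.RobotFileParser()
robots.parse(rules)

allowed = robots.can_fetch('*', 'http://example.com/index.html')
blocked = robots.can_fetch('*', 'http://example.com/private/data.html')

print(encoded)
print(query)
print(allowed, blocked)
```

Note that parse_qs() returns every value as a list of strings, so the integer 2 comes back as ['2'].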

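Overview of the urllib.request module

from urllib import request             # import the request module

Its two most important entry points are:

request.Request('url')        # takes a URL and returns a Request object
request.urlopen()              # takes a Request object or a URL string, opens it, and returns an HTTPResponse object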
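Constructing a Request does not send anything; it only records what will be sent once it is handed to urlopen(). The following minimal sketch inspects a Request object without touching the network. The /index.html path is added here purely for illustration.

```python
from urllib import request

# Building a Request is a local operation; no connection is opened yet.
req = request.Request('http://www.baidu.com/index.html')

print(req.full_url)      # the URL as passed in
print(req.type)          # URL scheme
print(req.host)          # host part of the URL
print(req.selector)      # path that goes into the request line
print(req.get_method())  # GET, because no data is attached
```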

A simple example

from urllib import request
url = 'http://www.baidu.com'
request_url = request.Request(url)
response = request.urlopen(request_url)
html = response.read().decode('utf-8')          # decode the response bytes as UTF-8
print(html)

request.urlopen in detail

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

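It returns an HTTPResponse object, which mainly provides methods such as read(), readinto() and getheader(), plus attributes such as status (status is an attribute, not a method).
As the signature shows, besides the URL in the first argument we can pass other values, such as data (the request body) and timeout (the timeout in seconds).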
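To try the response object without depending on an external site, one option is to start a throwaway HTTP server on the local machine and open it with urlopen(). The sketch below does that; HelloHandler and the local server are stand-ins invented for this example, and it assumes no proxy settings redirect requests to 127.0.0.1.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = 'hello urllib'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/' % server.server_port
with request.urlopen(url, timeout=5) as response:
    status = response.status                           # an attribute, not a method
    content_type = response.getheader('Content-Type')  # a single response header
    text = response.read().decode('utf-8')             # body bytes decoded to str

server.shutdown()
server.server_close()

print(status, content_type, text)
```

Using the response as a context manager closes the connection as soon as the block ends.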
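The data parameter
The data parameter is optional. If you supply it, it must be a byte stream, that is, of type bytes; a string can be converted with bytes() or str.encode(). Once data is passed, the request is no longer sent as GET but as POST, and the value of data is sent as the POST body. When posting to an echo service such as http://httpbin.org/post, the printed result shows the values under the form field, which mimics a form submission.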

from urllib import request
from urllib import parse
# httpbin.org/post echoes the request back, so the submitted form data is visible in the output
req = request.Request('http://httpbin.org/post')
data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf-8')       # urlencode the dict, then convert the string to bytes
response = request.urlopen(req, data=data)
html = response.read().decode('utf-8')
print(html)
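The switch from GET to POST can be checked without sending anything, because a Request object reports the method it would use. This is a small sketch; the form fields are made up, and the httpbin.org URLs serve only as labels since no request is sent.

```python
from urllib import parse, request

form = {'word': 'hello', 'lang': 'zh'}          # hypothetical form fields
data = parse.urlencode(form).encode('utf-8')    # str -> bytes

get_req = request.Request('http://httpbin.org/get')
post_req = request.Request('http://httpbin.org/post', data=data)

print(data)
print(get_req.get_method())    # no data attached
print(post_req.get_method())   # data attached
```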

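The timeout parameter

The timeout parameter sets the timeout in seconds: if no response arrives within that time, an exception is raised. If it is not specified, the global default timeout is used. It applies to HTTP, HTTPS and FTP requests, and can be combined with try/except to handle the timeout: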

import socket
from urllib import request
from urllib import error

try:
    response = request.urlopen('http://www.baidu.com', timeout=1)
    html = response.read().decode('utf-8')
    print(html)
except error.URLError as e:
    if isinstance(e.reason, socket.timeout):              # True when e.reason is a socket.timeout instance
        print('Time out!')
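One detail worth knowing: a timeout while connecting is wrapped in URLError, but a timeout while waiting for or reading the response is raised directly as socket.timeout. The sketch below handles both cases. fetch_or_none is a hypothetical helper name, and the silent local socket is only a stand-in for a server that never replies; it assumes no proxy settings redirect requests to 127.0.0.1.

```python
import socket
from urllib import error, request


def fetch_or_none(url, timeout):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except error.URLError as e:
        # Timeouts while connecting arrive wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # Timeouts while waiting for or reading the response are raised directly.
        return None


# A local socket that lets clients connect but never answers (illustration only).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
port = listener.getsockname()[1]

result = fetch_or_none('http://127.0.0.1:%d/' % port, timeout=0.5)
listener.close()

print(result)
```

Other URLError causes, such as a refused connection, are re-raised so that they are not mistaken for a timeout.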

urllib.request.Request: constructing a request object in detail

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

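The data parameter
If data is passed, it must be of type bytes (a byte stream). If you have a dict, encode it first with urllib.parse.urlencode() and then convert the resulting string to bytes.
The headers parameter
headers is a dict. You can pass it when constructing the Request, or add headers afterwards by calling the Request object's add_header() method. The most common use is changing User-Agent to make the request look like it comes from a browser: the default User-Agent is Python-urllib followed by the Python version, and to imitate Firefox, for example, you could set it to Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.1
The method parameter
method is a string that indicates the HTTP method the request uses, such as GET, POST or PUT.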

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    "User-Agent": 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',        # pretend to be a browser
    "Host": 'httpbin.org'          # target host name
}
form_data = {                      # renamed from dict to avoid shadowing the built-in
    "name": "Ze1al"
}
data = bytes(parse.urlencode(form_data), encoding="utf-8")      # convert to bytes
request_url = request.Request(url=url, headers=headers, data=data, method='POST')
html = request.urlopen(request_url).read().decode('utf-8')
print(html)
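The add_header() route and the method argument can also be explored offline. In this sketch the httpbin.org/put URL is only a label (nothing is sent). It also shows a detail that is easy to trip over: Request stores header names in capitalized form, so the header set as User-Agent is looked up as User-agent.

```python
from urllib import request

# method overrides the default choice that would otherwise follow from data.
req = request.Request('http://httpbin.org/put', data=b'name=Ze1al', method='PUT')

# add_header() sets one header at a time after construction.
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

print(req.get_method())
print(req.has_header('User-agent'))   # stored name uses str.capitalize()
print(req.has_header('User-Agent'))   # exact-match lookup, so this spelling misses
print(req.get_header('User-agent'))
print(req.header_items())
```

The capitalization only affects lookups on the Request object; HTTP header names are case-insensitive on the wire.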
