Python Crawler Study Notes (1): Using the Basic Libraries (urllib)

玛卡巴卡巴巴亚卡

Published 2021-07-12 13:59:19

Category: python

Article link: https://blog.csdn.net/weixin_45540609/article/details/118641512

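I. Using the urllib library

urllib is Python's built-in HTTP request library. It contains four modules:

request: sends HTTP requests

error: defines the exceptions raised by request

parse: utility module with functions for parsing and building URLs

robotparser: parses a site's robots.txt file to determine which pages may be crawled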
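As a quick orientation, the sketch below exercises parse and robotparser without any network access. The example.com URLs and the robots.txt rules are made up for illustration; request and error are left out here because they need a live connection.

```python
from urllib import parse, robotparser

# parse: split a URL into its components and build a query string.
parts = parse.urlparse("https://example.com/search?q=urllib#top")
print(parts.netloc, parts.path, parts.query)  # example.com /search q=urllib

query = parse.urlencode({"word": "hello", "page": 2})
print(query)  # word=hello&page=2

# robotparser: feed in robots.txt rules directly instead of downloading them.
# These rules are hypothetical, not taken from any real site.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/index.html"))         # True
print(rp.can_fetch("*", "https://example.com/private/data.html"))  # False
```

For a real site you would normally call rp.set_url() with the robots.txt address and then rp.read() instead of passing the rules by hand.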

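1. Sending requests

(1) urlopen()

urlopen() sends a request much as a browser would and returns the response.

The following code fetches the HTML source of the Python home page: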

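from urllib import request

url = 'https://www.python.org'
# urlopen() sends a GET request and returns a response object
response = request.urlopen(url)
# read() returns bytes; decode them to get the page source as text
print(response.read().decode('utf-8'))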

Calling

print(type(response))

prints <class 'http.client.HTTPResponse'>. The response object provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), and attributes such as msg, version, status, debuglevel, and closed.
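To try a few of these members without depending on an external site, here is a minimal sketch that starts a throwaway HTTP server on localhost and reads its response. The HelloHandler class, the response body, and the choice of an opener with an empty ProxyHandler (so that proxy environment variables are ignored) are assumptions made for this example; opener.open() returns the same kind of response object as urlopen().

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello urllib"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# An opener with an empty ProxyHandler ignores proxy environment variables.
opener = request.build_opener(request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.status                            # numeric status code
    content_type = response.getheader("Content-Type")   # one header by name
    header_names = [name for name, _ in response.getheaders()]
    text = response.read().decode("utf-8")               # body as text

server.shutdown()
server.server_close()

print(status)
print(content_type)
print(header_names)
print(text)
```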

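The signature of urlopen is:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

data, if given, must be bytes and turns the request into a POST; timeout is a number of seconds; context is an ssl.SSLContext for HTTPS connections. cafile, capath, and cadefault have been deprecated since Python 3.6 in favor of context.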
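The timeout parameter is easiest to understand together with the exceptions from urllib.error. The following is a sketch only: the helper name fetch, the 5-second default, and the localhost URL on port 1 (where nothing is expected to be listening) are assumptions chosen so the failure path can be seen quickly.

```python
import socket
from urllib import error, request


def fetch(url, timeout=5.0):
    """Return the body as text, or None if the request fails or times out."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", exc.code, exc.reason)
    except error.URLError as exc:
        # No usable answer: DNS failure, refused connection, connect timeout.
        if isinstance(exc.reason, socket.timeout):
            print("timed out while connecting")
        else:
            print("URL error:", exc.reason)
    except OSError as exc:
        # Errors while reading the response, including socket.timeout,
        # can surface without a URLError wrapper.
        print("socket error:", exc)
    return None


if __name__ == "__main__":
    # Hypothetical target: nothing is expected to listen on port 1 of
    # localhost, so this call should fail quickly and return None.
    print(fetch("http://127.0.0.1:1/", timeout=2))
```

HTTPError is caught before URLError because it is a subclass of URLError; reversing the order would make the HTTPError branch unreachable.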

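The data parameter can carry form data. Here a single field, word, with the value hello is sent: urlencode from urllib.parse turns the parameter dict into a query string, which is then converted to bytes using UTF-8. POSTing it to http://httpbin.org/post returns: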

b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.8", \n "X-Amzn-Trace-Id": "Root=1-60e9acb1-65e3b6fe248bdd517eccab46"\n }, \n "json": null, \n "origin": "125.68.93.64", \n "url": "http://httpbin.org/post"\n}\n'
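A minimal sketch of the request described above, reconstructed from the description and the response shown. The helper name encode_form, the 10-second timeout, and the error handling are assumptions added for this example.

```python
import json
from urllib import parse, request


def encode_form(fields):
    """URL-encode a dict and convert the resulting string to UTF-8 bytes."""
    return bytes(parse.urlencode(fields), encoding="utf-8")


data = encode_form({"word": "hello"})
print(data)  # b'word=hello'

if __name__ == "__main__":
    try:
        # Passing data makes urlopen send a POST instead of a GET.
        with request.urlopen("http://httpbin.org/post", data=data, timeout=10) as response:
            body = response.read()
        print(body)
        # httpbin echoes the submitted form fields back as JSON.
        print(json.loads(body)["form"])
    except (OSError, ValueError) as exc:
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers a body that is not valid JSON.
        print("request failed:", exc)
```

The encoded body is 10 bytes long, which matches the Content-Length of 10 in the response above.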

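(2) Request

urlopen can issue basic requests, but to add headers or other details to a request you need to build a Request object and pass it to urlopen.

request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: the URL to request; the only required parameter

data: if given, must be bytes; a dict can be encoded with urlencode from urllib.parse and then converted to bytes

headers: a dict of request headers; they can be passed here or added later with the Request instance's add_header() method

origin_req_host: the host name or IP address of the requester

unverifiable: whether the request is unverifiable, that is, not explicitly approved by the user; defaults to False

method: the HTTP method as a string, such as GET, POST, or PUT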
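The parameters above can be explored offline, because building a Request does not send anything. In this sketch the User-Agent value my-crawler/0.1 is a made-up placeholder; the rest uses only the constructor arguments listed above plus add_header().

```python
from urllib import parse, request

data = parse.urlencode({"word": "hello"}).encode("utf-8")

# Headers passed to the constructor and headers added later end up together.
req = request.Request(
    "http://httpbin.org/post",
    data=data,
    headers={"Host": "httpbin.org"},
    method="POST",
)
req.add_header("User-Agent", "my-crawler/0.1")  # hypothetical User-Agent value

print(req.full_url)      # http://httpbin.org/post
print(req.get_method())  # POST
print(req.data)          # b'word=hello'
# Request stores header names capitalized as "User-agent", so look them up that way.
print(req.get_header("User-agent"))  # my-crawler/0.1
print(req.header_items())

# Without an explicit method, the default depends on whether data is present.
print(request.Request("http://httpbin.org/get").get_method())              # GET
print(request.Request("http://httpbin.org/post", data=data).get_method())  # POST
```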

Example:

from urllib import request
from urllib import parse

url = 'http://httpbin.org/post'
dic = {'word': 'hello!'}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
    'Host': 'httpbin.org'
}
# URL-encode the dict, then convert the string to bytes
data = bytes(parse.urlencode(dic), encoding='utf-8')
# Build the Request first, then hand it to urlopen
req = request.Request(url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read())

Result:

b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello!"\n }, \n "headers": {\
