Python Security: The urllib Module

Latest recommended article published on 2023-11-14 10:51:41

weixin_39586235


Article link: https://blog.csdn.net/weixin_39586235/article/details/111637684


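This article introduces Python's urllib library: using urlopen, creating Request objects, setting HTTP headers (User-Agent in particular), handling GET and POST requests, SSL certificate verification for HTTPS requests, and proxy configuration. It covers both the basic operations and more advanced usage of HTTP requests.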


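A time will come to ride the wind and cleave the waves; I will hoist my sail into the clouds and cross the vast sea.
Li Bai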

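Preface

The urllib library is a module frequently used in Python crawler scripts; urllib3 and requests are other common choices. This article covers urllib first.

Note: the examples use Python 3. In Python 2 the functionality was split across urllib, urllib2, and other modules.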

Basic usage of the urllib library

1. urlopen

Use urlopen() from the urllib.request module to fetch a page. The data returned by read() is of type bytes, so it must be decoded with decode() to obtain a str.

    from urllib import request

    url = "http://www.baidu.com"
    response = request.urlopen(url)
    html = response.read().decode()
    print(html)
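Calling decode() with no argument assumes UTF-8, which can fail on pages served in another encoding. The following is a minimal sketch, assuming the server declares its charset in the Content-Type header. The helper name decode_body and the UTF-8 fallback are illustrative choices, not part of urllib. An email.message.Message stands in for the response headers so the example runs without a network connection; the object returned by response.info() offers the same get_content_charset() method.

```python
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode response bytes using the charset declared in the headers."""
    # get_content_charset() returns None when no charset is declared.
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


# Stand-in for response.info(); a real call would pass that object instead.
fake_headers = Message()
fake_headers["Content-Type"] = "text/html; charset=gbk"

raw_bytes = "百度一下".encode("gbk")
text = decode_body(raw_bytes, fake_headers)
print(text)
```

With a live response this would be used as decode_body(response.read(), response.info()).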

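The object returned by urlopen provides the following methods:

read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data.
info(): returns an HTTPMessage object holding the headers sent back by the remote server.
getcode(): returns the HTTP status code.
geturl(): returns the URL of the resource actually retrieved, which may differ from the requested URL after a redirect.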
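The methods listed above can be tried without any server. This sketch assumes a data: URL as the target, which urllib.request also handles, purely so that it runs offline; the helper name inspect_response is hypothetical. With an http:// URL the same calls apply. getcode() is left out here because a data: URL involves no HTTP exchange.

```python
from urllib import request


def inspect_response(url):
    """Open a URL and collect what the response object exposes."""
    with request.urlopen(url) as response:
        headers = response.info()          # header container
        first_line = response.readline()   # bytes up to the first newline
        rest = response.read()             # the remaining bytes
        return {
            "final_url": response.geturl(),
            "content_type": headers.get_content_type(),
            "first_line": first_line.decode("utf-8"),
            "rest": rest.decode("utf-8"),
        }


# Offline sample: the payload is "line one\nline two", percent-encoded.
sample = "data:text/plain;charset=utf-8,line%20one%0Aline%20two"
result = inspect_response(sample)
print(result)
```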

2. Request

For more complex operations, such as adding an HTTP header or other request fields, create a Request instance and pass it to urlopen(); the URL to visit becomes an argument of the Request instance.

    from urllib import request

    url = "http://127.0.0.1"
    req = request.Request(url)  # create a Request instance
    response = request.urlopen(req)
    html = response.read().decode()
    print(html)

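When creating a Request instance, two more parameters can be set besides the required URL.

urllib.request.Request(url, data=None, headers={}, method=None)

data (default None): the data submitted along with the URL, such as the body of a POST. It is usually a bytes object; when it is supplied, the HTTP request switches from GET to POST.
headers (default empty): a dictionary of key-value pairs for the HTTP headers to send.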
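The effect of the data parameter on the request method can be checked without sending anything, using Request.get_method(). This is a small sketch; the /login path and the form field are made up for illustration and no connection is opened.

```python
from urllib import parse, request

url = "http://127.0.0.1/login"  # hypothetical endpoint, never contacted here

get_req = request.Request(url)
post_req = request.Request(url, data=parse.urlencode({"user": "demo"}).encode("utf-8"))
put_req = request.Request(url, data=b"{}", method="PUT")

print(get_req.get_method())   # GET: no data supplied
print(post_req.get_method())  # POST: data supplied, method not given
print(put_req.get_method())   # PUT: an explicit method overrides the default
```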

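3. User-Agent

Every website wants visitors to send requests under a legitimate identity, and one common way for a server to check a visitor's identity is to inspect the User-Agent field of the request.

A browser is the identity generally accepted on the internet. If we want our crawler to look more like a real user, the first step is to present it as a recognized browser.

Different browsers send different User-Agent headers. urllib's default User-Agent is Python-urllib/x.y, where x and y are the major and minor Python version numbers, for example Python-urllib/3.8.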

    from urllib import request

    url = "http://127.0.0.1"
    header = {
        "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
    }
    # Build the Request from the url together with the headers
    req = request.Request(url, headers=header)
    # Send the request to the server
    response = request.urlopen(req)
    html = response.read().decode()
    print(html)
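The default User-Agent described above can be read from a freshly built opener, which is a quick way to see what a server receives when no header is set. A minimal sketch; the variable names are illustrative.

```python
import sys
from urllib import request

# A new opener carries urllib's default headers in its addheaders list.
opener = request.build_opener()
default_headers = dict(opener.addheaders)

# The value is derived from the running interpreter's major.minor version.
expected = "Python-urllib/%d.%d" % sys.version_info[:2]
print(default_headers["User-agent"])
```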

4. Adding more header information

Add specific headers to the HTTP request to build a complete HTTP request message.

You can add or modify a specific header by calling Request.add_header(), and read an existing header by calling Request.get_header().

    from urllib import request
    import random

    url = "http://127.0.0.1"
    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
        "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
        "Mozilla/5.0 (Macintosh; Intel Mac OS... "
    ]
    user_agent = random.choice(ua_list)
    req = request.Request(url)
    # Add or modify a header with Request.add_header
    req.add_header("User-Agent", user_agent)
    response = request.urlopen(req)
    html = response.read().decode()
    # Read the header with get_header; only the first letter of the name is capitalized
    print(req.get_header(header_name="User-agent"))
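The capitalization rule for get_header() comes from how Request stores header names: each name is normalized so that only its first character is uppercase. The sketch below shows the stored names and what a lookup with browser-style casing returns; the header values are placeholders and nothing is sent.

```python
from urllib import request

req = request.Request("http://127.0.0.1", headers={"X-Requested-With": "demo"})
req.add_header("USER-AGENT", "demo-agent/1.0")

# Names as stored inside the Request object.
stored_names = sorted(name for name, _ in req.header_items())
print(stored_names)                         # ['User-agent', 'X-requested-with']

print(req.get_header("User-agent"))         # demo-agent/1.0
print(req.get_header("User-Agent"))         # None: lookup is case-sensitive
print(req.has_header("X-requested-with"))   # True
```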

5. GET method

Data submitted with the GET method must be URL-encoded.

urllib.parse provides the urlencode() function, which converts key:value pairs into a "key=value" string. quote() URL-encodes a string directly, and unquote() decodes it.

    from urllib import parse

    word = {"wd": "字字酌情"}
    keyword = "字字酌情"
    word = parse.urlencode(word)
    keyword = parse.quote(keyword)
    print(word)
    print(keyword)
    print(parse.unquote(word))

    from urllib import parse, request

    url = "https://www.baidu.com/s?"
    word = {"wd": "字字酌情"}
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    }
    word = parse.urlencode(word)
    new_url = url + word
    req = request.Request(new_url, headers=header)
    response = request.urlopen(req)
    html = response.read().decode()
    print(html)
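One detail worth seeing side by side is how spaces are encoded: urlencode() writes them as "+", while quote() writes "%20". The sketch below compares the two and decodes the query back with parse_qs(). The parameter names wd and pn are only sample values.

```python
from urllib import parse

params = {"wd": "python urllib", "pn": 10}

query = parse.urlencode(params)                 # form-style: space becomes "+"
path_part = parse.quote("python urllib")        # space becomes "%20"
form_part = parse.quote_plus("python urllib")   # same rule urlencode applies

# parse_qs reverses urlencode; every value comes back as a list of strings.
decoded = parse.parse_qs(query)

full_url = "https://www.baidu.com/s?" + query
print(query)
print(path_part)
print(decoded)
print(full_url)
```

When decoding by hand, unquote() leaves "+" untouched; unquote_plus() is the counterpart that turns it back into a space.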

6. POST method

If the data parameter is included in the request, the request method automatically changes from GET to POST.

    import urllib.parse
    import urllib.request

    # Target URL of the POST request
    url = "http://127.0.0.1/sqli-labs-master/Less-11/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
    }
    formdata = {
        "uname": "123",
        "passwd": "456",
        "submit": "Submit"
    }
    # Convert formdata from dict to str, then to bytes
    data = urllib.parse.urlencode(formdata).encode()
    request = urllib.request.Request(url, data=data, headers=headers)
    response = urllib.request.urlopen(request)
    print(response.read().decode())
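When a form field holds several values (an array), urlencode() needs doseq=True to emit one key=value pair per element; otherwise the whole list is stringified and encoded as a single value. A minimal sketch, assuming a made-up ids field and /submit path; the request is built but not sent.

```python
from urllib import parse, request

form = {"uname": "123", "ids": [1, 2, 3]}

# Default behaviour: the list is turned into the text "[1, 2, 3]" first.
without_doseq = parse.urlencode(form)

# doseq=True repeats the key once per list element.
with_doseq = parse.urlencode(form, doseq=True)

body = with_doseq.encode("utf-8")  # data must be bytes
req = request.Request("http://127.0.0.1/submit", data=body)  # hypothetical URL

print(without_doseq)
print(with_doseq)
print(req.get_method())
```

How the server reassembles the repeated key into an array depends on the server-side framework; some expect a name such as ids[] instead.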

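7. Handling SSL certificate verification for HTTPS requests

Sites whose URLs start with https are now everywhere. urllib can verify the SSL certificate for HTTPS requests, just as a web browser does. If the site's certificate is issued by a trusted CA, access works normally, for example https://www.baidu.com/.

If certificate verification fails, or the operating system does not trust the server's certificate, a browser warns the user that the certificate is untrusted, and urlopen() raises an error.

When you run into such a site, handle the SSL certificate separately: tell the program to skip certificate verification and the request goes through. Skipping verification removes the protection against man-in-the-middle attacks, so reserve it for test environments or hosts you trust.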

    from urllib import request
    import ssl

    # Create a context that skips SSL certificate verification
    context = ssl._create_unverified_context()
    url = "https://xxx.com"
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
    }
    req = request.Request(url, headers=header)
    response = request.urlopen(req, context=context)
    print(response.read().decode())
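To make clear what skipping verification changes, the sketch below compares the settings of a default context with an unverified one, and builds an equivalent context using only public ssl attributes. It creates contexts only and opens no connection; the variable names are illustrative.

```python
import ssl

# What urlopen() uses for https by default: certificates and hostnames are checked.
verified = ssl.create_default_context()

# The private helper used above: both checks are switched off.
unverified = ssl._create_unverified_context()

# Same effect through public attributes. check_hostname has to be disabled
# first, because CERT_NONE is rejected while hostname checking is still on.
relaxed = ssl.create_default_context()
relaxed.check_hostname = False
relaxed.verify_mode = ssl.CERT_NONE

for name, ctx in [("verified", verified), ("unverified", unverified), ("relaxed", relaxed)]:
    print(name, ctx.verify_mode, ctx.check_hostname)
```

Any of these contexts can be passed as urlopen(req, context=...).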

8. Using a proxy

When the target site restricts access, a proxy can be used to fetch the data.

    from urllib import request

    # Configure the proxy
    proxy = request.ProxyHandler({'http': '127.0.0.1:8080'})
    opener = request.build_opener(proxy)  # build an opener that uses the proxy
    request.install_opener(opener)  # install it as the global default opener
    url = "http://www.baidu.com"
    html = opener.open(url).read().decode()
    print(html)
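The two steps in the proxy example serve different purposes: opener.open() already routes through the proxy on its own, while install_opener() only matters if later code calls request.urlopen(). The sketch below builds one proxied opener and one opener with proxies explicitly disabled, then inspects them without sending a request. The helper uses_handler is hypothetical.

```python
from urllib import request

# Opener that sends http:// requests through the given proxy address.
proxy = request.ProxyHandler({"http": "127.0.0.1:8080"})
proxied_opener = request.build_opener(proxy)

# An empty dict disables proxies, including ones detected from the environment.
no_proxy = request.ProxyHandler({})
direct_opener = request.build_opener(no_proxy)


def uses_handler(opener, handler):
    """Return True if this exact handler instance is registered on the opener."""
    return any(h is handler for h in opener.handlers)


print(proxy.proxies)
print(uses_handler(proxied_opener, proxy))
print(no_proxy.proxies)
```

Keeping separate openers like this lets a script choose per request whether to go through the proxy, instead of changing the global default.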

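Epilogue

I am Jiangxin (匠心), a craftsman quietly sharpening his sword beside a clear stream, hoping that one day the blade will leave its sheath and I can roam the world with it.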
