python http

Latest recommended article published on 2024-08-30 09:57:41

不争而善胜


Category: python

Article link: https://blog.csdn.net/qq_42501075/article/details/104391129


For details, see the original blog post.

1. Sending a request

import requests  # import requests, and then you can do whatever you like

# send a GET request
r0 = requests.get("http://yunweicai.com")

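2. URL parameters

You often want to pass data in a URL's query string. If you build the URL by hand, the data goes into the URL as key/value pairs after a question mark, for example yunweicai.com/get?key=val.

The requests library handles this more gracefully: it lets you supply these arguments as a dictionary of strings through the params keyword argument.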

payload = {'key1': 'value1', 'key2': 'value2'}

r = requests.get("http://yunweicai.com/get", params=payload)

Printing the URL shows that it has been encoded correctly:

print(r.url)
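The encoding step can also be inspected without sending anything, by building the request and preparing it locally. This is only a minimal sketch: requests.Request and prepare() belong to the library, while the variable names and the second parameter value are made up for illustration.

```python
import requests

payload = {"key1": "value1", "key2": "value2"}

# Build the request object and prepare it; nothing is sent over the network.
prepared = requests.Request("GET", "http://yunweicai.com/get", params=payload).prepare()

# The prepared URL is the URL a real request would be sent to.
print(prepared.url)  # http://yunweicai.com/get?key1=value1&key2=value2

# Characters that are not URL-safe are percent-encoded automatically.
encoded = requests.Request("GET", "http://yunweicai.com/get", params={"key": "a b&c"}).prepare()
print(encoded.url)
```

Preparing a request this way is handy when you only want to check how a dictionary ends up in the query string.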
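4. Response content

The object returned by a request gives us access to the server's response. Requests automatically decodes content from the server. After a request is made, Requests makes an educated guess about the response encoding based on the HTTP headers, and it uses that guessed encoding when you access r.text.

You can find out which encoding Requests is using, and change it, through the r.encoding attribute: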

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
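Why the chosen encoding matters can be shown with plain Python, no request needed. In this rough sketch the variable raw stands in for the raw response body (r.content), and the two decode calls approximate what r.text would return under each r.encoding setting; the sample string is an assumption.

```python
# The same bytes decoded with two different encodings.
raw = "你好".encode("utf-8")  # stands in for the raw body, i.e. r.content

as_utf8 = raw.decode("utf-8")         # roughly r.text when r.encoding = 'utf-8'
as_latin1 = raw.decode("ISO-8859-1")  # roughly r.text when r.encoding = 'ISO-8859-1'

print(len(as_utf8))    # 2: the two original characters
print(len(as_latin1))  # 6: every UTF-8 byte became its own (garbled) character

# The bytes themselves are untouched; only their interpretation changed.
print(as_latin1.encode("ISO-8859-1") == raw)  # True
```

If r.text looks garbled, the body is usually intact and only r.encoding needs to be corrected.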

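If the response body is JSON, you can call r.json() directly to get a dictionary (or list) to work with.

5. Custom request headers

Some requests only return the expected content when specific headers are sent.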

headers = {'user-agent': 'my-app/0.0.1'}

r = requests.get("http://yunweicai.com", headers=headers)
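To check which headers would actually go out, one approach is to prepare the request through a Session, which merges the library's default headers with your own. The sketch below sends nothing; the variable names are illustrative.

```python
import requests

headers = {"user-agent": "my-app/0.0.1"}

session = requests.Session()
default_agent = session.headers["User-Agent"]  # looks like "python-requests/<version>"

# prepare_request merges session defaults with the per-request headers
# without sending anything over the network.
request = requests.Request("GET", "http://yunweicai.com", headers=headers)
prepared = session.prepare_request(request)

print(default_agent)
print(prepared.headers["User-Agent"])  # my-app/0.0.1 (header names are case-insensitive)
```

The custom value replaces the default User-Agent, while the other defaults such as Accept-Encoding are kept.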

1. Installation

On Windows: open cmd with "Run as administrator" and execute pip install requests

Quick test:

import requests

r = requests.get("http://www.baidu.com")

print(r.status_code)

200

r.text

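2. The 7 main methods of the Requests library

3. Attributes of the Response object

4. Understanding Requests exceptions

5. A general code framework for fetching web pages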

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises requests.HTTPError if the status code is 4xx or 5xx
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))

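6. The HTTP protocol

HTTP stands for Hypertext Transfer Protocol.

HTTP is a stateless application-layer protocol based on a request/response model.

Understanding an HTTP URL: a URL is the Internet path used to access a resource over HTTP, and each URL corresponds to one data resource.

See below for details on the methods.

Reading response information and response headers
1. Response status code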

import requests
r = requests.get('https://api.github.com/some/endpoint')
print(r.status_code)  # response status code
print(r.status_code == requests.codes.ok)  # built-in status code lookup object
r.raise_for_status()  # raises requests.HTTPError if the status code is 4xx or 5xx
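Which status codes make raise_for_status() raise can be checked offline with a Response object built by hand. This is a sketch for illustration only: the helper name outcome is made up, and in real code the Response comes from a request rather than being constructed manually.

```python
import requests


def outcome(status_code):
    """Return 'ok' or 'HTTPError' for a locally built Response with this status."""
    response = requests.Response()  # built by hand; nothing is sent
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.HTTPError:
        return "HTTPError"
    return "ok"


for code in (200, 301, 404, 500):
    print(code, outcome(code))

print(requests.codes.ok, requests.codes.not_found)  # 200 404
```

Only 4xx and 5xx codes raise; a 301 on its own does not.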

2. Response headers

import requests
r = requests.get('http://httpbin.org/get')
print(r.headers)  # all response headers
print(r.headers['Content-Type'])
print(r.headers.get('Content-Length'))

{'X-Processed-Time': '0.000617980957031', 'Connection': 'keep-alive', 'Via': '1.1 vegur', 'Content-Length': '268', 'X-Powered-By': 'Flask', 'Date': 'Thu, 23 Nov 2017 04:13:40 GMT', 'Server': 'meinheld/0.6.1', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Content-Type': 'application/json'}
application/json
268

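One special point: a server may send the same header several times, each time with a different value. Requests merges them so that they can be represented in a single mapping.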
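That mapping is case-insensitive, which is why any capitalization of a header name works. A small sketch, using a hand-built CaseInsensitiveDict (the type r.headers uses) with two values copied from the sample output:

```python
from requests.structures import CaseInsensitiveDict

# r.headers is a CaseInsensitiveDict; this one is built by hand for illustration.
headers = CaseInsensitiveDict({"Content-Type": "application/json", "Content-Length": "268"})

print(headers["content-type"])        # application/json
print(headers.get("CONTENT-LENGTH"))  # 268
print(headers.get("X-Missing"))       # None: .get() does not raise for absent headers
```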
3. If the response contains cookies, you can read them as shown here
Reading cookies from the response

import requests
r = requests.get('https://api.douban.com/v2/book/search?小王子')
print(r.cookies)
print(r.cookies['bid'])
Sending cookies with a request

import requests
url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
print(r.text)
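What the cookies argument does on the wire can be seen by preparing the request locally: the dictionary is turned into a Cookie request header. A minimal sketch that sends nothing:

```python
import requests

cookies = dict(cookies_are="working")

# Prepare the request without sending it and inspect the resulting header.
prepared = requests.Request("GET", "http://httpbin.org/cookies", cookies=cookies).prepare()
print(prepared.headers["Cookie"])  # cookies_are=working
```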

4. Inspecting the request that was sent: r.request.headers
II. Redirection and request history
import requests
r = requests.get('http://github.com')
print(r.history)

[<Response [301]>]
Disabling redirect handling with the allow_redirects parameter

import requests
r = requests.get('http://github.com', allow_redirects=False)
print(r.status_code)
print(r.history)

301
[]
Enabling redirects for HEAD requests (HEAD does not follow redirects by default)

import requests
r = requests.head('http://github.com', allow_redirects=True)
print(r.url)
print(r.status_code)
print(r.history)
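The behaviour of r.history can be reproduced without relying on github.com by pointing requests at a throwaway local server. Everything about the server below (the handler class, the /old and /new paths) is an assumption made for this sketch; only the requests calls mirror the examples above.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectHandler(BaseHTTPRequestHandler):
    """Tiny local server: /old answers 301 -> /new, anything else answers 200."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(301)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"arrived"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

followed = session.get(base + "/old", timeout=5)
not_followed = session.get(base + "/old", allow_redirects=False, timeout=5)

server.shutdown()
server.server_close()

print(followed.status_code, followed.history)          # 200 [<Response [301]>]
print(not_followed.status_code, not_followed.history)  # 301 []
```

With redirects followed, the intermediate 301 ends up in history and r.url points at the final location; with allow_redirects=False the 301 itself is returned.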

Official documentation: http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

Sending GET and POST requests

res = requests.get(url)  # send a GET request and receive the response for that URL
res = requests.post(url, data=data)  # send a POST request; data is a dict of form fields
# POST request
import requests

url = "http://fanyi.baidu.com/sug"
data = {'kw': '早上好'}  # the key/value pairs can be found under Form Data in the browser's developer tools
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Mobile Safari/537.36"
}
res = requests.post(url, data=data, headers=headers)
print(res.text)
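How the data dictionary is encoded into the request body can be inspected locally. The sketch below prepares the same POST without sending it and, for comparison, shows the json= keyword, which the original example does not use; whether a given server accepts JSON depends on that server.

```python
import json

import requests

url = "http://fanyi.baidu.com/sug"
data = {"kw": "早上好"}

# data= produces an HTML-form style body.
form = requests.Request("POST", url, data=data).prepare()
print(form.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form.body)                     # kw=%E6%97%A9%E4%B8%8A%E5%A5%BD

# json= serialises the same dict as a JSON document instead.
as_json = requests.Request("POST", url, json=data).prepare()
print(as_json.headers["Content-Type"])   # application/json
print(json.loads(as_json.body) == data)  # True
```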

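Response attributes and methods

res.text  # decoded text; it often comes out garbled, in which case set res.encoding = 'utf-8' or res.encoding = res.apparent_encoding first
res.content.decode('utf-8')  # or 'gbk'
res.json()  # decodes a JSON response body into a Python dict
res.request.url  # URL of the request that was sent
res.url  # URL of the response (after a redirect, the requested URL and the URL actually opened differ)
res.request.headers  # request headers
res.headers  # response headers
Sending requests with headers

headers = {request headers}  # add as needed: User-Agent, then Referer, then Cookie
- The goal is to mimic a browser and get the same content a browser would.

Timeout parameter: timeout

requests.get(url, headers=headers, timeout=3)  # an exception is raised if no response arrives within 3 seconds
To keep a transient error from aborting the whole run, a common approach is the retry function from the retrying package, used as a decorator.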

from retrying import retry
import requests

@retry(stop_max_attempt_number=3)  # run the decorated function up to three times; the error propagates only if all three attempts fail
def _parse_url(url):
    print("attempting request")  # printed once per attempt, which makes the retries visible
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
    }
    res = requests.get(url, headers=headers, timeout=5)
    print(res.status_code)
    return res.content.decode('utf-8')  # return the page so the caller receives it

# add exception handling
def parse_url(url):
    try:
        html_str = _parse_url(url)
    except Exception:
        html_str = None
    return html_str

if __name__ == '__main__':
    url = 'http://www.baidu.com'
    parse_url(url)
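The retry semantics described above (try up to three times, fail only if every attempt fails) can be illustrated with a hand-written decorator. To be clear, this is a simplified stand-in written for this sketch, not the retrying package's implementation, and the flaky function is hypothetical.

```python
import functools


def retry(max_attempts=3):
    """Hand-written stand-in for retrying.retry(stop_max_attempt_number=...)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # every attempt failed: let the last error escape
        return wrapper
    return decorator


calls = {"count": 0}


@retry(max_attempts=3)
def flaky():
    """Hypothetical function that fails twice and then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky()
print(result, calls["count"])  # ok 3
```

The caller sees a single successful call even though the function ran three times.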

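Handling requests that need cookies

Sending the cookie directly with the request
1. Put the cookie in headers
2. Pass a cookie dict to the cookies parameter
cookie = "…."  # raw cookie string; the next line turns it into a dict with a dict comprehension
cookie_dict = {i.split("=", 1)[0].strip(): i.split("=", 1)[1] for i in cookie.split(";") if "=" in i}
requests.get(url, headers=headers, cookies=cookie_dict)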
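The conversion from a browser-style cookie string to a dictionary can be written as a small helper. This is a sketch: the function name and the sample string are made up, and it assumes the usual "name=value; name=value" layout.

```python
def cookie_string_to_dict(cookie):
    """Turn 'a=1; b=2' (as copied from a browser) into {'a': '1', 'b': '2'}."""
    result = {}
    for part in cookie.split(";"):
        if "=" not in part:
            continue  # skip empty or malformed fragments
        name, value = part.split("=", 1)  # split once: values may contain '='
        result[name.strip()] = value.strip()
    return result


sample = "anonymid=abc123; token=x=y=z; vip=1"  # made-up cookie string
print(cookie_string_to_dict(sample))
# {'anonymid': 'abc123', 'token': 'x=y=z', 'vip': '1'}
```

Splitting only on the first "=" keeps values that themselves contain "=" intact, and stripping removes the space that follows each ";".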

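Alternatively, send a POST request to log in first, obtain the cookie, and then request the logged-in page with it, using requests.session() to keep the session
1. Instantiate a session
session = requests.session()  # the session object offers the same request methods as requests
2. session.post(url, data, headers)  # cookies set by the server are stored in the session

Note that the URL for the POST request can be found in two ways:
- View the source of the login page and find the action URL of the form
- In the browser's Network panel on the login page, tick Preserve log, then find the URL of the POST request after the page redirects

3. session.get(url)  # the GET request carries the cookies stored in the session, so it succeeds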
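The way a session carries cookies from one request to the next can be shown without logging in anywhere. In this sketch the cookie is placed into the session by hand as a stand-in for what a login response would set; the cookie name and value are hypothetical, and nothing is sent.

```python
import requests

session = requests.session()

# Stand-in for what a successful login response would do via Set-Cookie.
session.cookies.set("sessionid", "abc123")  # hypothetical cookie name and value

# Any later request prepared through the same session carries the cookie.
request = requests.Request("GET", "http://zhibo.renren.com/top")
prepared = session.prepare_request(request)
print(prepared.headers.get("Cookie"))  # sessionid=abc123

# A request prepared outside the session has no cookie attached.
plain = requests.Request("GET", "http://zhibo.renren.com/top").prepare()
print(plain.headers.get("Cookie"))  # None
```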

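# Scraping Renren pages
# Method 1: HTML fetched without a cookie
import requests

url = "http://zhibo.renren.com/top"  # same page as in Method 2; the original omitted the request
res = requests.get(url, timeout=5)  # no cookie is sent, so this returns the logged-out page

with open('renren1.html', 'w', encoding='utf-8') as f:
    f.write(res.content.decode())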

# Method 2: put the cookie in headers (nx.user in the returned HTML will contain the user name)
import requests

url = "http://zhibo.renren.com/top"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    "Cookie": "anonymid=jf3bo25ktzrg8e; depovince=SH; r01=1; ick_login=89a0be82-b23f-4aa2-9587-80b352b7d64f; _de=ED5538112FD97F3944B0A57815E527E7696BF75400CE19CC; ick=cb553aab-cde4-413f-864d-25644c96ea00; __utma=151146938.1210742990.1521771869.1521771869.1521771869.1; __utmc=151146938; __utmz=151146938.1521771869.1.1.utmcsr=renren.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utmt=1; __utmb=151146938.1.10.1521771869; t=5812e6c25a16acd7d295e245218a8cd49; societyguester=5812e6c25a16acd7d295e245218a8cd49; id=964865629; xnsid=af52771f; XNESSESSIONID=292a42dae8f0; WebOnLineNotice_964865629=1; ch_id=10016; JSESSIONID=abcJKBMLTnzbs_aP1Rqjw; springskin=set; vip=1; wp_fold=0; jebecookies=cdb11dcd-461b-46e2-bfc5-9548dbe6a95e|||||"
}
res = requests.get(url, headers=headers, timeout=5)

with open('renren2.html', 'w', encoding='utf-8') as f:
    f.write(res.content.decode())

# Method 3: use requests.session
import requests

session = requests.session()  # instantiate a session
post_url = "http://www.renren.com/PLogin.do"  # the action URL of the login form
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
}
post_data = {
    'email': 'your_account',      # placeholder: use your own account
    'password': 'your_password',  # placeholder: use your own password
}

session.post(post_url, headers=headers, data=post_data)  # log in through the session; cookies set by the server are stored in it
url = "http://zhibo.renren.com/top"  # URL of the page that requires login
res = session.get(url, headers=headers)  # request the logged-in page through the session
with open('renren3.html', 'w', encoding='utf-8') as f:
    f.write(res.content.decode())

Uploading files: files

import requests

# build the files dict; the with block closes the file once the upload is done
with open(r"C:\Users\poliy\Desktop\Crawler\1.png", "rb") as f:
    dict_files = {"file": f}
    response = requests.post("http://httpbin.org/post", files=dict_files)
print(response.text)
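What the files argument produces can be checked without a real file or a network connection. The sketch below uses an in-memory file with placeholder bytes and the tuple form (filename, file object, content type) that requests accepts; the request is prepared but never sent.

```python
import io

import requests

# An in-memory file avoids depending on a path that exists on one machine only.
fake_png = io.BytesIO(b"not really a png")  # placeholder bytes
files = {"file": ("1.png", fake_png, "image/png")}  # (filename, file object, content type)

prepared = requests.Request("POST", "http://httpbin.org/post", files=files).prepare()

print(prepared.headers["Content-Type"].split(";")[0])  # multipart/form-data
print(b'filename="1.png"' in prepared.body)            # True
```

The body is a multipart/form-data document, and the boundary string that separates its parts is generated automatically and added to the Content-Type header.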
Certificate verification (12306 example)

# Method 1: set verify=False and suppress the warning
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
res = requests.get("https://www.12306.cn", verify=False)
print(res.status_code)

# Method 2: pass a certificate path through the cert parameter
# cert expects a client-side certificate; to trust a custom CA bundle, pass its path to verify instead
res = requests.get("https://www.12306.cn", cert='PATH')

Setting proxies

import requests

my_proxies = {
    "http": "http://61.135.217.7:80",
    "https": "https://42.96.168.79:8888"
}
res = requests.get("https://www.baidu.com", proxies=my_proxies)
print(res.text)

Exception handling

All requests exceptions live in requests.exceptions
import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    res = requests.get("http://httpbin.org/get", timeout=0.1)
    print(res.status_code)
except ReadTimeout:
    print("timeout")
except ConnectionError:  # also covers a timeout while connecting (ConnectTimeout)
    print("connection error")
except RequestException:
    print("error")
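The order of the except clauses above follows from how the exception classes inherit from one another. This short sketch only inspects the class hierarchy in requests.exceptions and sends nothing.

```python
from requests.exceptions import (
    ConnectionError,
    ConnectTimeout,
    HTTPError,
    ReadTimeout,
    RequestException,
    Timeout,
)

# Every requests exception derives from RequestException, so it must come last.
for exc in (ReadTimeout, ConnectTimeout, ConnectionError, HTTPError, Timeout):
    print(exc.__name__, issubclass(exc, RequestException))  # True for each

# ConnectTimeout is both a ConnectionError and a Timeout.
print(issubclass(ConnectTimeout, ConnectionError), issubclass(ConnectTimeout, Timeout))  # True True

# ReadTimeout is a Timeout but not a ConnectionError.
print(issubclass(ReadTimeout, Timeout), issubclass(ReadTimeout, ConnectionError))  # True False
```

Because a timeout while connecting is also a ConnectionError, catching Timeout is the simplest way to handle both kinds of timeout in one clause.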