Requests-Quickstart

Latest recommended article published on 2020-12-04 14:42:57

假老练的迷妹

Category: Python. Tags: Crawler, Python

Article link: https://blog.csdn.net/weixin_36879590/article/details/96175236

####Requests - Quickstart

######Make a Request

######Sending a simple request:

######Task: use requests to send a request to the Baidu home page and fetch its data

######response = requests.get(url)

######Commonly used attributes of response:

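response.text response body as str (decoded with the encoding Requests guesses from the HTTP headers)
response.content response body as bytes
response.status_code HTTP status code of the response
response.request.headers headers of the request that was sent
response.headers response headers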

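# Import the requests module first
>>> import requests
>>> r = requests.get('http://www.baidu.com')
>>> r.status_code
200

# Check whether the request succeeded
>>> assert r.status_code == 200
# 200 means success and the assert passes silently; any other status raises AssertionError

# Response headers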
>>> r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 14 Jul 2019 19:03:14 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

######Response Content

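# URL of the request that produced this response
>>> r.request.url
'http://www.baidu.com/'

# URL of the response
>>> r.url
'http://www.baidu.com/'
# Note: if the web server redirects URL A to URL B, the URL you asked for (A) differs from the final response URL (B) in r.url; the intermediate redirect responses are kept in r.history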

# Request headers
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

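# Binary Response Content: the body as bytes
>>> r.content
# JSON Response Content: the body parsed as JSON (raises an error if the body is not valid JSON)
>>> r.json()
# Raw Response Content: the underlying raw stream (pass stream=True to requests.get to read from it)
>>> r.raw

# Decode the response body from bytes to str (decode() defaults to UTF-8)
>>> r.content.decode()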
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css>
…e.g…
</body> </html>\r\n'

######Custom Headers

######Sending a request with headers

Imitate a browser so that the server returns the same content a real browser would receive.

Headers are passed as a dict:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
Usage: requests.get(url, headers=headers)

>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
>>> r = requests.get('http://www.baidu.com',headers=headers)
>>> r.content.decode()
(output omitted)

######Passing Parameters In URLs

Sending a request with URL parameters (run as a script in an IDE)

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
payload = {'wd':'哔哩哔哩'}
url_temp = 'https://www.bilibili.com/s?'

r = requests.get(url_temp,headers=headers,params=payload)
print(r.status_code)
print(r.request.url)
404
https://www.bilibili.com/s?wd=%E5%93%94%E5%93%A9%E5%93%94%E5%93%A9
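  # The part after wd= is the URL-encoded (percent-encoded) form of 哔哩哔哩; URL-decoding it gives back the original text. See a URL-encoding table if you are curious.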
  
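# Building the URL by hand with string formatting: %s or format()
url = 'https://www.bilibili.com/s?wd={0}'.format('哔哩哔哩')
# Pass the formatted url itself; params is not needed because the query string is already in the URL
r = requests.get(url, headers=headers)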

print(r.status_code)
print(r.request.url)
404
https://www.bilibili.com/s?wd=%E5%93%94%E5%93%A9%E5%93%94%E5%93%A9
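
The percent-encoding shown in the output can be reproduced without sending any request. The following is a minimal sketch using only the standard library's `urllib.parse`; it illustrates the encoding itself and is not a description of how Requests implements it internally.

```python
from urllib.parse import quote, unquote, urlencode

keyword = '哔哩哔哩'

# Percent-encode the UTF-8 bytes of the keyword
encoded = quote(keyword)
print(encoded)

# Decode it back to the original text
print(unquote(encoded))

# Build a whole query string from a dict, similar in spirit to params=payload
query = urlencode({'wd': keyword})
print('https://www.bilibili.com/s?' + query)
```

Each Chinese character becomes three `%XX` groups because it takes three bytes in UTF-8.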

#####Exercises

1. Write a crawler for a Baidu Tieba forum and save the pages locally

import requests

class PostBar():
    def __init__(self, name):
        self.name = name
        self.tempURL = 'http://tieba.baidu.com/f?kw=' + name + '&ie=utf-8&pn={}'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

    # Build the URLs of 500 pages; pn starts at 0 and grows by 50 for each next page
    def get_url_list(self):
        return [self.tempURL.format(i * 50) for i in range(500)]

    def parse_url(self, url):
        # Send the request and get the response
        print(url)
        response = requests.get(url, headers=self.headers)
        # Return the response body decoded as UTF-8
        return response.content.decode()

    def save_html(self, html_str, page_num):
        file_path = '{} - This is The {} page.html'.format(self.name, page_num)
        # The with statement closes the file automatically
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(html_str)

    def run(self):
        # Walk the URL list from get_url_list(), numbering pages from 1
        for page_num, url in enumerate(self.get_url_list(), start=1):
            # Fetch the page for this URL
            html_str = self.parse_url(url)
            # Save the page content under its page number
            self.save_html(html_str, page_num)


if __name__ == '__main__':
    PostBar_Spider = PostBar('李洙赫')
    PostBar_Spider.run()

2. Fetch the Sina home page and compare response.text with response.content.decode()
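
As a hint for exercise 2, the difference can be illustrated offline. The sketch below uses a hand-made byte string as a stand-in for `response.content` (it is not real Sina data). It assumes the case where the server sends a `text/*` Content-Type without a charset, in which Requests falls back to ISO-8859-1 when building `response.text`.

```python
# Stand-in for response.content: UTF-8 bytes as a server might send them
body = '新浪首页'.encode('utf-8')

# Roughly what response.text looks like when the encoding falls back to ISO-8859-1
as_text_guess = body.decode('ISO-8859-1')

# What response.content.decode() does: decode as UTF-8 by default
as_utf8 = body.decode()

print(as_utf8)
print(as_text_guess == as_utf8)

# The garbled text still holds the same bytes, so it can be repaired
repaired = as_text_guess.encode('ISO-8859-1').decode('utf-8')
print(repaired == as_utf8)
```

With a real response, setting `response.encoding = 'utf-8'` before reading `response.text` is another way to get the correctly decoded text.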

#####More complicated POST requests

1. Simulating a POST request

Logging in and registering
Transferring large amounts of text (a POST request places no limit on data length)
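
A minimal sketch of a form POST with Requests follows. The URL and the form field names are hypothetical placeholders, not taken from any real site. The request is only prepared, not sent, so the encoded body can be inspected without network access.

```python
import requests

login_url = 'http://example.com/login'               # hypothetical endpoint
form = {'username': 'alice', 'password': 'secret'}   # hypothetical form fields

# Build the request without sending it, to see what would go over the wire
prepared = requests.Request('POST', login_url, data=form).prepare()
print(prepared.method)
print(prepared.headers['Content-Type'])
print(prepared.body)

# To actually send it:
# r = requests.post(login_url, data=form, timeout=5)
```

Passing a dict as `data` sends the fields form-encoded in the request body rather than in the URL, which is why POST suits logins and large payloads.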

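2. Using proxies

Prepare a batch of IP addresses to form an IP pool, and pick one at random for each request.

How to pick a proxy IP at random so that less-used addresses are more likely to be chosen:

{'ip':ip,'time':0}
[{},{},{},{},{}]: sort this list of IP records by usage count
Take the 10 least-used IPs and pick one of them at random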
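
The selection steps above can be sketched as follows. The function name `pick_proxy` and the placeholder addresses are assumptions made for illustration; the record layout `{'ip': ..., 'time': ...}` follows the text, with `time` holding the usage count.

```python
import random

def pick_proxy(pool, top_n=10):
    """Pick one record at random from the top_n least-used proxies."""
    # Sort by usage count and keep the least-used candidates
    candidates = sorted(pool, key=lambda rec: rec['time'])[:top_n]
    chosen = random.choice(candidates)
    # Record the use so this proxy becomes less likely next time
    chosen['time'] += 1
    return chosen

# Placeholder addresses, not real proxies
pool = [{'ip': '10.0.0.{}:8080'.format(i), 'time': 0} for i in range(1, 21)]
proxy = pick_proxy(pool)

# Shape of the dict that requests.get(..., proxies=proxies) accepts
proxies = {'http': 'http://' + proxy['ip'], 'https': 'http://' + proxy['ip']}
print(proxies)
```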

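Checking whether an IP is usable
Pass a timeout to requests and use it to judge the quality of the IP address
Use an online proxy-IP quality checking site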
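
A timeout-based check might look like the sketch below. The function name `proxy_is_usable`, the test URL, and the 3-second default are assumptions chosen for illustration; a real checker would likely also measure response time.

```python
import requests

def proxy_is_usable(proxy_url, test_url='http://www.baidu.com', timeout=3):
    """Return True if test_url answers with 200 through the proxy within timeout seconds."""
    proxies = {'http': proxy_url, 'https': proxy_url}
    try:
        r = requests.get(test_url, proxies=proxies, timeout=timeout)
    except requests.RequestException:
        # Covers timeouts, refused connections, and proxy errors
        return False
    return r.status_code == 200

# Nothing is expected to listen on this local port, so the check should fail
print(proxy_is_usable('http://127.0.0.1:1', timeout=1))
```

Proxies that fail the check can be dropped from the pool before the selection step runs.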

3. Handling cookies and sessions

Cookie and Session

Cookies are kept by the browser.

#####The Difference Between a Cookie and a Session

How it works:

[Figure: The Difference Between a Cookie and a Session (image not available)]

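Cookies = Client side: a file on the client that holds information.

A single cookie stores at most 4 KB of data; each website can set at most 20 cookies. Drawbacks: user privacy and security risks.

As the figure shows:

The client (web browser = client) sends a request to the web server,

the web server replies (and attaches a cookie),

the next time the client sends a request to the same web server, the cookie is sent along with it,

so the web server can recognize the user (because the request already carries that user's information).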

Session = Server side

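Drawback: lower performance (as traffic grows, sessions take up more of the server's resources). Purpose: tracking a user across a website within one session.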
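
The round trip described above can be simulated in a few lines. This is a toy sketch with no real HTTP involved: `handle_request`, the `SESSIONS` dict, and the cookie name `sessionid` are all hypothetical names chosen for illustration. The cookie holds only an id on the client side, while the user data stays in the server-side store.

```python
import uuid
from http.cookies import SimpleCookie

SESSIONS = {}  # server-side store: session id -> user data

def handle_request(cookie_header, user=None):
    """Toy server: return (reply, Set-Cookie value or None)."""
    cookie = SimpleCookie(cookie_header or '')
    morsel = cookie.get('sessionid')
    if morsel is not None and morsel.value in SESSIONS:
        # Known session id: the server recognizes the user
        return 'welcome back, ' + SESSIONS[morsel.value]['user'], None
    # Unknown client: create a session and hand its id back as a cookie
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = {'user': user or 'guest'}
    return 'hello, new visitor', 'sessionid=' + session_id

# First request carries no cookie; the reply attaches one
reply1, set_cookie = handle_request(None, user='alice')
# The client sends the cookie back on its next request
reply2, _ = handle_request(set_cookie)
print(reply1)
print(reply2)
```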

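Background: HTTP is a stateless protocol, with no built-in mechanism for maintaining state between two transactions. When a user requests one page and then another (page navigation), HTTP by itself cannot tell us that the two requests came from the same user. To work around this limitation, web designers introduced two mechanisms: cookies and sessions.

(Still being updated)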
