Introduction to requests
The requests module is a third-party Python library (install it first, e.g. with pip install requests); it is mainly used for sending HTTP requests.
Chinese documentation: http://docs.python-requests.org/zh_CN/latest/index.html
Basic usage of the requests module
Sending simple requests and reading responses with the requests module
1. requests.get()
Where we use GET requests:
Downloading web pages
Searching
1.1 Downloading a web page
import requests  # install the requests library first

response = requests.get('https://www.baidu.com/')  # send an HTTP request
response.encoding = "utf-8"  # decode the body as utf-8, otherwise it comes out garbled
print(response.text)  # print the page content
print(response.status_code)  # print the status code; 200 means OK
response.text
Type: str
Decoding: requests makes an educated guess at the text encoding from the HTTP headers and decodes with that guess
How to change the encoding: response.encoding = "gbk"

response.content
Type: bytes
Decoding: none, the raw bytes are returned
How to decode it yourself: response.content.decode("utf-8")
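A minimal sketch of the difference, assuming the page really is UTF-8 encoded (the URL is just the example used above):

import requests

response = requests.get("https://www.baidu.com/")

# response.text is a str decoded with the encoding requests guessed from the headers;
# override the guess when the result is garbled
response.encoding = "utf-8"
html_from_text = response.text

# response.content is the raw bytes; decode it yourself
html_from_bytes = response.content.decode("utf-8")

print(html_from_text == html_from_bytes)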
(Screenshots: inspecting the page's declared encoding, setting the encoding, fetching the page data, and viewing the downloaded page, opened in PyCharm for convenience, though a browser works just as well.)
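One way to save the downloaded page so it can be opened in PyCharm or a browser is simply to write response.text to a local file; a short sketch (the file name baidu.html is arbitrary):

import requests

response = requests.get("https://www.baidu.com/")
response.encoding = "utf-8"

# write the decoded HTML to disk, then open baidu.html in PyCharm or a browser
with open("baidu.html", "w", encoding="utf-8") as f:
    f.write(response.text)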
1.2 Saving an image
import requests  # install the requests library first

response = requests.get('https://www.baidu.com/img/bd_logo1.png')  # send an HTTP request
print(response.status_code)  # print the status code; 200 means OK

with open('baidu.png', 'wb') as f:  # the image is binary (byte) data
    f.write(response.content)
1.3 Searching
A note on URL parameters
Many parameters in a URL are useless. In Baidu's search URL, for example, only one field actually matters; all the others can be deleted.
Accordingly, when you later run into URLs with lots of parameters while crawling, try deleting the redundant ones.
Removing the redundant parameters
import requests

query_string = input(":")
params = {"wd": query_string}
url = "https://www.baidu.com/s"
response = requests.get(url, params=params)  # requests appends params to the query string
print(response.text)
print(response.request.headers)
Baidu's anti-crawler measures reject the default User-Agent here, so go and find a real User-Agent (there are plenty online).
# coding=utf-8
import requests

query_string = input(":")
params = {"wd": query_string}
url = "https://www.baidu.com/s"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"}
response = requests.get(url, params=params, headers=headers)
print(response.text)
print(response.request.headers)
More anti-crawler measures, and how to get around them, will be covered in a dedicated later topic.
2. requests.post()
Where we use POST requests:
Login and registration (POST is safer than GET)
When a large body of text has to be sent (POST places no limit on the length of the data)
Baidu word translation
import json

import requests


def fanyi(keyword):
    base_url = 'https://fanyi.baidu.com/sug'
    # build the request body
    data = {
        'kw': keyword
    }
    # pretend to be a browser
    header = {
        "User-Agent": "mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
        "Content-Type": "application/x-www-form-urlencoded"
    }
    req = requests.post(url=base_url, data=data, headers=header)
    # take the JSON string out of the response
    str_json = req.text
    # turn the JSON into a dict
    myjson = json.loads(str_json)
    info = myjson['data'][0]['v']
    print(info)


if __name__ == '__main__':
    while True:
        keyword = input('Enter a word to translate (q to quit): ')
        if keyword == 'q':
            break
        fanyi(keyword)
requests in depth
The Python standard library provides urllib, urllib2, httplib and other modules for making HTTP requests, but their APIs are clumsy. They were built for another era, another internet, and even the simplest tasks demand an enormous amount of work, including overriding all kinds of methods.
The request methods
requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)
# all of the methods above are built on top of this one
requests.request(method, url, **kwargs)
The requests module already wraps the common HTTP methods; you simply call the corresponding function. The full set of parameters accepted by request() is:
def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': ('filename', fileobj)}``) for multipart encoding upload.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)
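As a quick illustration of how these keyword arguments combine, here is a small sketch against httpbin.org (any echo service would do); requests.get() with the same keyword arguments is equivalent:

import requests

response = requests.request(
    "GET",
    "http://httpbin.org/get",
    params={"q": "python"},                 # appended to the query string
    headers={"User-Agent": "my-crawler"},   # custom request headers
    timeout=5,                              # give up after 5 seconds
    allow_redirects=True,
)
print(response.status_code)
print(response.json()["args"])   # httpbin echoes the query parameters back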
Advanced usage of the requests module
1. Using proxies
Why use a proxy?
So the server does not realise that all the requests come from the same client
To keep our real address from being exposed and avoid being traced
requests.get("http://www.baidu.com", proxies = proxies) requests.post("http://www.baidu.com", proxies = proxies) proxies = { "http": "http://12.34.56.78:9000", "https": "https://12.34.56.78:9000", }
Types of proxy IPs
Transparent Proxy
Anonymous Proxy
Distorting Proxy
Elite Proxy (High Anonymity Proxy)
Choosing an IP:
An elite proxy makes it impossible for the other side to tell that you are using a proxy at all; the first three types can all be detected
By protocol: proxy IPs come in HTTP, HTTPS, SOCKS and other flavours, and you should pick the one that matches the protocol of the site you are crawling (see the sketch below)
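A hedged sketch of protocol-matched proxy dictionaries; the addresses are placeholders, and the SOCKS form needs the optional dependency installed with pip install requests[socks]:

import requests

# HTTP/HTTPS proxies: the dict key decides which scheme goes through which proxy
http_proxies = {
    "http": "http://12.34.56.78:9000",
    "https": "https://12.34.56.78:9000",
}

# SOCKS proxies (requires the PySocks extra: pip install requests[socks])
socks_proxies = {
    "http": "socks5://12.34.56.78:1080",
    "https": "socks5://12.34.56.78:1080",
}

response = requests.get("http://www.baidu.com", proxies=http_proxies, timeout=5)
print(response.status_code)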
Refreshing the proxy IP pool
With purchased proxies (Beagle and the like), a large share of the IPs (often more than 60%) may simply not work, so you need a program that tests which ones are usable and drops the dead ones, as in the sketch below.
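A minimal sketch of such a checker, assuming a hypothetical list of candidate proxies; anything that fails or times out is dropped from the pool:

import requests

candidate_proxies = [
    "http://12.34.56.78:9000",   # placeholder addresses
    "http://98.76.54.32:8080",
]

def check_proxies(proxy_list, test_url="http://httpbin.org/ip", timeout=3):
    """Return only the proxies that can actually complete a request."""
    usable = []
    for proxy in proxy_list:
        proxies = {"http": proxy, "https": proxy}
        try:
            response = requests.get(test_url, proxies=proxies, timeout=timeout)
            if response.status_code == 200:
                usable.append(proxy)
        except requests.exceptions.RequestException:
            pass  # dead or too slow: drop it
    return usable

print(check_proxies(candidate_proxies))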
2. Handling cookies
Method 1
requests.utils.dict_from_cookiejar
: converts a CookieJar object into a dict
import requests

url = "http://www.baidu.com"
response = requests.get(url)
print(type(response.cookies))
cookies = requests.utils.dict_from_cookiejar(response.cookies)
print(cookies)
'''
<class 'requests.cookies.RequestsCookieJar'>
{'BDORZ': '27315'}
'''
Method 2
import os
import sqlite3

import requests
from bs4 import BeautifulSoup
from win32crypt import CryptUnprotectData


def getcookiefromchrome(host='.oschina.net'):
    """Read a site's cookies from the local Chrome profile."""
    cookiepath = os.environ['LOCALAPPDATA'] + r"\Google\Chrome\User Data\Default\Cookies"
    sql = "select host_key,name,encrypted_value from cookies where host_key='%s'" % host
    with sqlite3.connect(cookiepath) as conn:
        cu = conn.cursor()
        cookies = {name: CryptUnprotectData(encrypted_value)[1].decode()
                   for host_key, name, encrypted_value in cu.execute(sql).fetchall()}
    return cookies


def ret_soup(url, cookies_dict):
    """Return a BeautifulSoup object for the page."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    response = requests.get(url, cookies=cookies_dict, headers=headers,
                            proxies=None, timeout=10)
    response.encoding = "utf8"
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup


def start():
    """Program entry point."""
    cookies_dict = getcookiefromchrome(host='.bitinfocharts.com')
    site = 'https://bitinfocharts.com'
    baseurl = site + '/top-100-richest-bitcoin-addresses.html'
    soup = ret_soup(baseurl, cookies_dict)
    print(soup)


if __name__ == '__main__':
    start()
3. Handling certificate errors
This error is caused by an SSL certificate that cannot be verified (ssl.CertificateError).
import requests
url = "https://www.12306.cn/mormhweb/"
response = requests.get(url, verify=False)
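With verify=False requests still emits an InsecureRequestWarning on every call; a common way to silence it (a sketch, using the urllib3 package that requests itself depends on) is:

import requests
import urllib3

# suppress the InsecureRequestWarning raised when verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get("https://www.12306.cn/mormhweb/", verify=False)
print(response.status_code)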
4. The timeout parameter
While browsing the web we often run into network hiccups: a request can wait a long time and still get no result.
In a crawler, a request that hangs like this makes the whole project very inefficient, so we force the request to return a result within a fixed time or raise an error.
response = requests.get(url, timeout=5)
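When the limit is exceeded requests raises requests.exceptions.Timeout, so the error can be caught explicitly; a minimal sketch:

import requests

url = "https://www.baidu.com/"
try:
    # raises Timeout if connecting or reading takes longer than 5 seconds
    response = requests.get(url, timeout=5)
    print(response.status_code)
except requests.exceptions.Timeout:
    print("request timed out")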
5. The retrying module
The timeout above speeds up our requests overall, but when normal browsing is slow we simply refresh the page. Can we "refresh" a request in code as well?
The retrying module: https://pypi.org/project/retrying/
Using the retrying module
Use the retry decorator provided by the retrying module
It is applied as a decorator and makes the decorated function run again when it fails
retry takes the parameter stop_max_attempt_number: when the function raises, it is re-executed until the maximum number of attempts is reached. If every attempt fails, the function finally raises; if any attempt succeeds, the program carries on.
import requests
from retrying import retry

headers = {}


@retry(stop_max_attempt_number=3)  # retry up to 3 times; only if all 3 attempts fail does the error propagate
def _parse_url(url):
    response = requests.get(url, headers=headers, timeout=3)  # a timeout raises and triggers a retry
    assert response.status_code == 200  # a status code other than 200 also raises and triggers a retry
    return response


def parse_url(url):
    try:  # catch the final exception
        response = _parse_url(url)
    except Exception as e:
        print(e)
        response = None
    return response
How to get around basic anti-crawler measures
From the examples above we can see the basic anti-crawler measures:
1. Checking whether the client looks like a real browser, using the request-header fields (User-Agent, Cookie, ...)
2. Rate limiting and banning, applied to a particular IP or account
The corresponding countermeasures (see the sketch below):
1. Copy the headers from a real browser into the program's headers
2. Use several proxy IPs / several accounts
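A hedged sketch combining the two countermeasures: headers copied from a real browser plus a (placeholder) pool of proxies rotated between requests:

import random

import requests

# headers copied from a real browser's developer tools
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36",
}

# placeholder proxy pool; in practice it comes from a purchased or self-built pool
proxy_pool = [
    "http://12.34.56.78:9000",
    "http://98.76.54.32:8080",
]

def fetch(url):
    proxy = random.choice(proxy_pool)          # a different exit IP for each request
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, headers=headers, proxies=proxies, timeout=5)

# fetch("http://www.baidu.com")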
Common anti-crawler measures
Request headers
user-agent: the device the current user is using
Referer: "xxx"
content-type: application/json
host
Requiring requests to carry a cookie or token
Encryption
Detecting changes of IP
Limiting the access frequency
Captchas
Hiding part of the data on the login page
Loading data dynamically with JS, which makes analysis harder
Detecting large numbers of requests that load only the HTML and never the CSS/JS/media
Serving fake data once a crawler is detected
A strict account system
More examples
Crawling Autohome
import requests
from bs4 import BeautifulSoup  # install the beautifulsoup4 library first
response = requests.get('http://www.autohome.com.cn/news/')
response.encoding = 'gbk'
soup = BeautifulSoup(response.text,'html.parser')
tag = soup.find(id='auto-channel-lazyload-article')  # BeautifulSoup tags support chained operations
h3 = tag.find(name='h3')
print(h3)
import requests
from bs4 import BeautifulSoup
import re
# find all the news items
# title, summary, url, image
HTTPS = 'https:'  # the urls in the page have no scheme; prepend https before requesting
response = requests.get('http://www.autohome.com.cn/news/')
response.encoding = 'gbk'
soup = BeautifulSoup(response.text, 'html.parser')
li_list = soup.find(id='auto-channel-lazyload-article').find_all(name='li')[:3]  # take only 3 news items
for li in li_list:
    title = li.find('h3')
    if not title:
        continue
    summary = li.find('p').text
    # li.find('a').attrs is a dict
    url = HTTPS + li.find('a').get('href')  # equivalent to li.find('a').attrs['href']
    img = HTTPS + li.find('img').get('src')
    # download the image
    res = requests.get(img)
    re_image_name = re.match(r'.*__(.*).jpg', img)
    if re_image_name:
        image_name = re_image_name.group(1)
        file_name = "%s.jpg" % (image_name,)
        with open(file_name, 'wb') as f:
            f.write(res.content)
Logging in to GitHub
Open the browser's developer tools and look at the Form Data
Tip: in the Network panel, the entry corresponding to the current page is shown with a blue background
import requests
from bs4 import BeautifulSoup
# get the token
r1 = requests.get('https://github.com/login')
s1 = BeautifulSoup(r1.text,'html.parser')
token = s1.find(name='input',attrs={'name':'authenticity_token'}).get('value')
r1_cookie_dict = r1.cookies.get_dict()
# POST the username, password and token to the server
"""
utf8:✓
authenticity_token:ollV+avLm6Fh3ZevegPO7gOH7xUzEBL0NWdA1aOQ1IO3YQspjOHbfnaXJOtVLQ95BtW9GZlaCIYd5M6v7FGUKg==
login:
password:
commit:Sign in
"""
r2 = requests.post(
    'https://github.com/session',
    data={
        "utf8": '✓',
        "authenticity_token": token,
        'login': 'xxx',
        'password': 'xxx',
        'commit': 'Sign in'
    },
    cookies=r1_cookie_dict
)
r2_cookie_dict = r2.cookies.get_dict()
# For most sites that require login, the cookies from the login POST are enough; some sites, such as GitHub, also need the cookies from the login GET page
cookie_dict = {}
cookie_dict.update(r1_cookie_dict)
cookie_dict.update(r2_cookie_dict)
#
r3 = requests.get(
    url='https://github.com/settings/emails',
    cookies=cookie_dict
)
r4 = requests.get(
    url='https://github.com/ecithy/online-edu',
    cookies=cookie_dict
)
print(r4.text)
Logging in to Chouti (dig.chouti.com)
Method 1
# 1. log in and get the cookies
# 2. hit the target url, xxxx
import requests
from bs4 import BeautifulSoup
# 1. get the cookies
r0 = requests.get('http://dig.chouti.com/')
r0_cookie_dict = r0.cookies.get_dict()
# 2. send the username, password and cookies
r1 = requests.post(
    'http://dig.chouti.com/login',
    data={
        'phone': 'xxx',
        'password': 'xxx',
        'oneMonth': 1
    },
    cookies=r0_cookie_dict
)
r1_cookie_dict = r1.cookies.get_dict()
cookie_dict = {}
cookie_dict.update(r0_cookie_dict)
cookie_dict.update(r1_cookie_dict)
r2 = requests.post('http://dig.chouti.com/link/vote?linksId=13915601',cookies=cookie_dict)
print(r2.text)
Method 2: Session (cookies are saved automatically, sparing you the manual cookie bookkeeping)
import requests
session = requests.Session()
r1 = session.get(url="http://dig.chouti.com/")
r2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "xxx",
        'password': "xxx",
        'oneMonth': ""
    }
)
r3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=13915601"
)
print(r3.text)
Logging in to Zhihu
# -*- coding: utf-8 -*-
__author__ = 'hy'
import requests
try:
    import cookielib
except:
    import http.cookiejar as cookielib
import re
session = requests.session()
# session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")  # back the cookies with a file
try:
    session.cookies.load(ignore_discard=True)
except:
    print("could not load cookies")
agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
header = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    'User-Agent': agent
}
def is_login():
    # use the status code of a personal page to decide whether we are logged in
    inbox_url = "https://www.zhihu.com/question/56250357/answer/148534773"
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    if response.status_code != 200:
        print('please log in')
    else:
        print('you are already logged in')
def get_xsrf():
    # get the xsrf code
    response = session.get("https://www.zhihu.com", headers=header)
    match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return (match_obj.group(1))
    else:
        return ""
def zhihu_login(account, password):
    # log in to Zhihu
    if re.match(r"^1\d{10}", account):
        print("logging in with a phone number")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
        }
    else:
        if "@" in account:
            # the account looks like an email address
            print("logging in with an email address")
            post_url = "https://www.zhihu.com/login/email"
            post_data = {
                "_xsrf": get_xsrf(),
                "email": account,
                "password": password
            }
    response_text = session.post(post_url, data=post_data, headers=header)
    # session.cookies.save()
# zhihu_login("xxx", "xxx")
zhihu_login("xxx", "xxx")
is_login()