Python Crawler Study Notes (2): Using the requests Library

I. Basic Usage

The urlopen method in the urllib library actually requests pages with GET; the corresponding method in requests is get().

Here we use get() to perform the same operation as urlopen, obtaining a Response object, then print the response type, status code, response body, and cookies:

import requests

r = requests.get('https://www.baidu.com')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

<class 'requests.models.Response'>
200
<class 'str'>
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç¾åº¦ä¸ä¸ï¼ä½ å°±ç¥é</title></head> <body link=#0000cc> ... (rest of the page omitted) ... </body> </html>

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

The return type is requests.models.Response, the response body is a str, and the cookies are a RequestsCookieJar. (The title is garbled, it should read 百度一下,你就知道, because requests guessed the wrong text encoding; the Responses section below shows the fix.)

1. GET Requests

(1) A basic example

import requests

r = requests.get('http://httpbin.org/get')
print(r.text)

The result:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-60ebe5c4-6925bea60abee1c828f14eb3"
  },
  "origin": "117.139.13.214",
  "url": "http://httpbin.org/get"
}

GET parameters can be stored in a dict and passed via the params argument; cookies and the User-Agent work the same way through a headers dict. The URL is then constructed automatically with the query string appended:

import requests

dic = {
    'name' : 'alice',
    'age' : '22'
}
r = requests.get('http://httpbin.org/get',params=dic)
print(r.text)

{
  "args": {
    "age": "22",
    "name": "alice"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-60ebe721-6e0fc0b159bb912070bbc378"
  },
  "origin": "117.139.13.214",
  "url": "http://httpbin.org/get?name=alice&age=22"
}

Alternatively, instead of reading r.text, call r.json(), which parses a JSON response body into a dict. Note that this only works when the response actually is JSON; otherwise it raises an exception. A minimal sketch follows.
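
Parsing the httpbin.org response from the example above (the key lookups are just for illustration):

import requests

r = requests.get('http://httpbin.org/get')
data = r.json()  # parse the JSON body into a dict
print(type(data))  # <class 'dict'>
print(data['headers']['User-Agent'])  # e.g. python-requests/2.24.0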

(2) Scraping a web page

import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
}
r = requests.get('https://www.zhihu.com/explore', headers=headers)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)  # re.S lets . match newlines too
titles = re.findall(pattern, r.text)
print(titles)

(3) Downloading binary files

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
}
r = requests.get('https://pic3.zhimg.com/100/v2-641ebdf8fc236df74769ada646e7c7ea_hd.png', headers=headers)
with open('v2-641ebdf8fc236df74769ada646e7c7ea_hd.png', 'wb') as p:  # 'wb' opens the file for binary writing
    p.write(r.content)

2. POST Requests

import requests

data = {
    'name': 'alice',
    'age': '22'
}
r = requests.post('http://www.httpbin.org/post', data=data)
print(r.text)

Note: this endpoint only accepts POST. If you request it with requests.get(), the server answers 405 Method Not Allowed (which is what happened in my first attempt); with post(), the form field of the returned JSON echoes the submitted data.

3. Responses

Response content is available via text and content: text is an already-decoded string, while content is the raw bytes that you decode yourself. text does not always display correctly, because requests has to guess the encoding; in those cases decode content manually or set the encoding first, as sketched below. The status code, headers, and cookies are available through status_code, headers, and cookies.
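
For example, the garbled Baidu title in the first example comes from a wrongly guessed encoding. A minimal sketch of the fix, assuming the page is UTF-8 (which Baidu's meta tag declares):

import requests

r = requests.get('https://www.baidu.com')
# decode the raw bytes manually...
print(r.content.decode('utf-8'))
# ...or tell requests the correct encoding before reading r.text
r.encoding = 'utf-8'
print(r.text)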

The status code tells you whether a request succeeded. requests ships a built-in status code lookup object, requests.codes; for example, requests.codes.ok is 200.

import requests
url = 'http://www.jianshu.com'
r = requests.get(url)
if r.status_code == requests.codes.ok:
    print('Successfully')
else:
    print('error')

This printed error: jianshu.com did not return 200 here, most likely because it rejects the default python-requests User-Agent.

II. Advanced Usage

1. File Upload

import requests
url = 'http://httpbin.org/post'  # use httpbin so the upload is echoed back (jianshu.com/post does not accept it)
files = {
    'file': open('v2-641ebdf8fc236df74769ada646e7c7ea_hd.png', 'rb')  # open the file in binary-read mode
}
r = requests.post(url,files=files)
print(r.text)

The image must sit in the same folder as the script.

2. Cookies

Viewing cookies:

import requests
url = 'http://www.baidu.com'
r = requests.get(url)
print(r.cookies)
for key, value in r.cookies.items():
    print(key+':'+value)

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ:27315

Setting cookies:

import requests
headers = {
    'Cookie': '_zap=b3a366d5-cf1c-4012-9679-9161e9ab43ca; d_c0="APBVwM_ohRGPTtoe3teAlasSjQt2OiPsGH8=|1593824285"; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1625817562,1626073290,1626079791,1626080322; _ga=GA1.2.1483948118.1593824287; q_c1=064c8f5591884afdac9030d50d1d83b9|1607858480000|1595600252000; _xsrf=gnD5Rhh92uR6qxjTcYP2KbpKLWJMBlTb; __snaker__id=KbZ6SfS9e8i07e4H; gdxidpyhxdE=zk9WrUxuSv88l2mBYd0ouJ28ga%5CnzJf%2Blrt%5CKnBQE%5C46VxxdjG%2FjkHExxO5muLxnhahIAsGAE%2BmrauHbaWwL4QL5Vu4KIVa6n3IqTaMxQEtsGAX%2BocxlCssU4VLeMVhVek1MXLfcHvWfNRBlxkymUs%5C8oZIRjqZrr0iigWHs9T5ymRes%3A1617783942511; _9755xjdesxxd_=32; YD00517437729195%3AWM_NI=dExqBpUy7E82KtnEbBnS5F089FrpCyKqNhAk2HqROqUZ0DayY%2F4ZYVqB3Q96hRM89D3W6a0s374SZzNLzi7Nq6q6vpl9gasewqhwby%2FNg5f5tAdXDZOYTB1njMCC%2B3QTYXQ%3D; YD00517437729195%3AWM_NIKE=9ca17ae2e6ffcda170e2e6ee83ca74b1aebfa3c44e988e8ab2c45f969b9baef1688ea785b0d56d909da092cc2af0fea7c3b92aa89bb892c17c97a7f9a6bb3a89aca2a6e14aae8dfba7b334fb9abe91c27389af8289c461f298bfb3ce7b978600a5e4739bf1fcbbdc5eb39efdb2bc3eac8e83d9bc50f189fa82b379f1e9fcb4d9728e94b68ab646b3eabfd7f57cb0e7a982f26fb5afa898e173ed92a9d6e53fb89facb9c7528d879ca3b466aa938da4aa39938c97b9d437e2a3; YD00517437729195%3AWM_TID=vAQwTtExnEtBUUQQUAN7lMnPeV0eDbqI; z_c0="2|1:0|10:1617783074|4:z_c0|92:Mi4xTjRyYUFnQUFBQUFBOEZYQXotaUZFU1lBQUFCZ0FsVk5JcmRhWVFDZ0ViVVo5Q2ZuUGlWQmZQUnNIMl9SbHZaRDRR|ba139c1aa2f549eb3db6d08703cb78d71d0d11d8996c651d340961588c24d4c5"; KLBRSID=fe0fceb358d671fa6cc33898c8c48b48|1626080324|1626079790; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1626080322; SESSIONID=gfDRt4UGh9vVlKgCV3Xw5lFr8T2E6HoVYCZtsdLLuTn; JOID=Vl4WCkpvMMQ4SPqoK2L3WgL_2iY9ClOKYjnM-GYCC7dkLLTNbSw7KlVP_acpeCEIhNojDRnII9trN-8qth4ClAU=; osd=VlwQBE9vMsI2TfqqLWzyWgD51CM9CFWEZznO_mgHC7ViIrHNbyo1L1VN-6kseCMOit8jDx_GJttpMeEvthwEmgA=; tst=r',
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
}
url = 'https://www.zhihu.com'
r = requests.get(url,headers=headers)
print(r.text)

If the page that comes back is the logged-in version, the cookie worked.

Cookies can also be set through the cookies parameter by building a RequestsCookieJar object: split the cookie string with split(), put each key and value into the jar with set(), and pass the jar to the cookies parameter of get().

import requests

url = 'https://www.zhihu.com'
cookie = 'zap=b3a366d5-cf1c-4012-9679-9161e9ab43ca; d_c0="APBVwM_ohRGPTtoe3teAlasSjQt2OiPsGH8=|1593824285"; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1625817562,1626073290,1626079791,1626080322; _ga=GA1.2.1483948118.1593824287; q_c1=064c8f5591884afdac9030d50d1d83b9|1607858480000|1595600252000; _xsrf=gnD5Rhh92uR6qxjTcYP2KbpKLWJMBlTb; __snaker__id=KbZ6SfS9e8i07e4H; gdxidpyhxdE=zk9WrUxuSv88l2mBYd0ouJ28ga%5CnzJf%2Blrt%5CKnBQE%5C46VxxdjG%2FjkHExxO5muLxnhahIAsGAE%2BmrauHbaWwL4QL5Vu4KIVa6n3IqTaMxQEtsGAX%2BocxlCssU4VLeMVhVek1MXLfcHvWfNRBlxkymUs%5C8oZIRjqZrr0iigWHs9T5ymRes%3A1617783942511; _9755xjdesxxd_=32; YD00517437729195%3AWM_NI=dExqBpUy7E82KtnEbBnS5F089FrpCyKqNhAk2HqROqUZ0DayY%2F4ZYVqB3Q96hRM89D3W6a0s374SZzNLzi7Nq6q6vpl9gasewqhwby%2FNg5f5tAdXDZOYTB1njMCC%2B3QTYXQ%3D; YD00517437729195%3AWM_NIKE=9ca17ae2e6ffcda170e2e6ee83ca74b1aebfa3c44e988e8ab2c45f969b9baef1688ea785b0d56d909da092cc2af0fea7c3b92aa89bb892c17c97a7f9a6bb3a89aca2a6e14aae8dfba7b334fb9abe91c27389af8289c461f298bfb3ce7b978600a5e4739bf1fcbbdc5eb39efdb2bc3eac8e83d9bc50f189fa82b379f1e9fcb4d9728e94b68ab646b3eabfd7f57cb0e7a982f26fb5afa898e173ed92a9d6e53fb89facb9c7528d879ca3b466aa938da4aa39938c97b9d437e2a3; YD00517437729195%3AWM_TID=vAQwTtExnEtBUUQQUAN7lMnPeV0eDbqI; z_c0="2|1:0|10:1617783074|4:z_c0|92:Mi4xTjRyYUFnQUFBQUFBOEZYQXotaUZFU1lBQUFCZ0FsVk5JcmRhWVFDZ0ViVVo5Q2ZuUGlWQmZQUnNIMl9SbHZaRDRR|ba139c1aa2f549eb3db6d08703cb78d71d0d11d8996c651d340961588c24d4c5"; KLBRSID=fe0fceb358d671fa6cc33898c8c48b48|1626080324|1626079790; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1626080322; SESSIONID=gfDRt4UGh9vVlKgCV3Xw5lFr8T2E6HoVYCZtsdLLuTn; JOID=Vl4WCkpvMMQ4SPqoK2L3WgL_2iY9ClOKYjnM-GYCC7dkLLTNbSw7KlVP_acpeCEIhNojDRnII9trN-8qth4ClAU=; osd=VlwQBE9vMsI2TfqqLWzyWgD51CM9CFWEZznO_mgHC7ViIrHNbyo1L1VN-6kseCMOit8jDx_GJttpMeEvthwEmgA=; tst=r'
jar = requests.cookies.RequestsCookieJar()
headers = {
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
}

for item in cookie.split(';'):
    key, value = item.split('=', 1)  # split only on the first '=': values may contain '='
    jar.set(key.strip(), value)
r = requests.get(url, cookies=jar, headers=headers)
print(r.text)

3. Session Maintenance

import requests

url = 'http://www.httpbin.org/cookies/set/number/1234567'
s = requests.Session()
s.get(url)
r = s.get('http://www.httpbin.org/cookies')
print(r.text)

{
  "cookies": {
    "number": "1234567"
  }
}

We first request a test URL that sets a cookie named number with the value 1234567, then request http://www.httpbin.org/cookies, which returns the current cookies. Because both requests go through the same Session, the cookie set by the first request is carried by the second. Remember to make both requests through the session object; for contrast, see the sketch below.
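
A minimal sketch of the same two requests without a Session: each call is independent, so the cookie does not survive.

import requests

requests.get('http://www.httpbin.org/cookies/set/number/1234567')
r = requests.get('http://www.httpbin.org/cookies')
print(r.text)  # prints {"cookies": {}}: the cookie was not carried over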

4. SSL Certificate Verification

requests verifies SSL certificates when sending HTTPS requests. The verify parameter controls whether the certificate is checked; if omitted, it defaults to True and verification happens automatically.
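
When a site has a self-signed or otherwise invalid certificate, verification can be skipped. A minimal sketch (the target URL here is hypothetical):

import requests
import urllib3

urllib3.disable_warnings()  # silence the InsecureRequestWarning that verify=False triggers
r = requests.get('https://self-signed.example.com', verify=False)  # skip certificate verification
print(r.status_code)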

5. Proxy Settings

During large-scale scraping, a site may start showing captchas, redirect to a login page, or outright ban your IP, making it inaccessible. Setting up a proxy solves this, using the proxies parameter.

import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "https://10.10.1.10:1080",
}

requests.get("https://www.baidu.com",proxies=proxies)

If the proxy requires HTTP Basic Auth, set it with the http://user:password@host:port syntax:

import requests

proxies ={
    "http": "http://user:password@10.10.1.10:3128",
}

requests.get("https://www.baidu.com",proxies=proxies)

requests also supports SOCKS proxies; first install the extra dependency with pip3 install "requests[socks]".

import requests

proxies ={
    "http": "socks5//user:password@10.10.1.10:3128",
}

requests.get("https://www.baidu.com",proxies=proxies)

6. Timeout Settings

When the local network is poor, a response may take very long to arrive, or an error may be raised because none arrives at all. To avoid hanging on an unresponsive server, set a time limit with the timeout parameter: if no response arrives within it, an exception is raised.

import requests

r = requests.get('http://www.baidu.com', timeout=1)
print(r.status_code)

A single timeout value applies to both the connect phase and the read phase. To specify them separately, pass a (connect, read) tuple:

import requests

r = requests.get('http://www.baidu.com', timeout=(5, 30))  # 5 s to connect, 30 s to read
print(r.status_code)

To wait forever, simply leave timeout unset, or set it to None, as in the sketch below.
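
A minimal sketch (equivalent to omitting the parameter entirely):

import requests

r = requests.get('http://www.baidu.com', timeout=None)  # block until the server responds
print(r.status_code)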

7. Authentication

If a page requires authentication to access, you can use the authentication support built into requests.

import requests
from requests.auth import HTTPBasicAuth

r= requests.get("http://xxx.xxx.xxxxx",auth=HTTPBasicAuth('username','password'))
print(r.status_code)

You can also skip constructing the class and pass a tuple directly; requests uses HTTPBasicAuth by default:

import requests

r = requests.get("http://www.baidu.com", auth=('username', 'password'))
print(r.status_code)

requests also supports OAuth authentication, which needs an extra package: pip3 install requests_oauthlib

import requests
from requests_oauthlib import OAuth1

url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET', 'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
requests.get(url, auth=auth, timeout=1)

8. Prepared Request

A request can be represented as a data structure, with its parameters carried by a Request object.

When requests sends a request, it internally constructs such a Request object, assigns it the various parameters (url, headers, data, and so on), and sends it; a successful request yields a Response object to parse.

This Request object, once prepared, is the Prepared Request.

from requests import Request,Session

url = 'http://httpbin.org/post'
data = {
    'name': 'hellen',
}
headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
}
s = Session()
req = Request('POST',url,data=data,headers=headers)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)

We first import Request and construct a Request object from the url, data, and headers, then call Session's prepare_request() method to turn it into a PreparedRequest object, and finally send it with send().

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "hellen"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    "X-Amzn-Trace-Id": "Root=1-60f2279a-729eaf7970b9f15f6732c921"
  },
  "json": null,
  "origin": "118.114.194.5",
  "url": "http://httpbin.org/post"
}

This achieves the same effect as a direct POST. With Request objects, each request can be treated as an independent object, which is handy for queue scheduling, as the sketch below shows.
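
A minimal sketch of queue scheduling with Request objects (the URLs and parameters are illustrative, not from the original):

from queue import Queue
from requests import Request, Session

s = Session()
q = Queue()
q.put(Request('GET', 'http://httpbin.org/get', params={'page': '1'}))
q.put(Request('POST', 'http://httpbin.org/post', data={'name': 'hellen'}))

while not q.empty():
    req = q.get()
    prepped = s.prepare_request(req)  # turn the Request into a PreparedRequest
    r = s.send(prepped)
    print(req.method, r.status_code)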
