What is Requests
Requests is an HTTP library written in Python on top of urllib3, released under the Apache2 open-source license. Compared with urllib, Requests is far more convenient and saves a great deal of boilerplate, so it is the recommended library for web scraping.
Basic usage
import requests

response = requests.get("http://www.baidu.com")
# response.text is the content decoded by requests, but the charset it
# guesses is not always the one we want, so the output here may be garbled
print(response.text)
# response.content is the raw bytes, so we can decode it ourselves with
# the charset we need and avoid garbled output
print(response.content.decode("utf-8"))
# some other commonly used attributes
# the encoding used for response.text (defaults to ISO-8859-1 when the
# server does not declare a charset)
print(response.encoding)
# the URL that was requested
print(response.url)
# the HTTP status code of the response
print(response.status_code)
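If you would rather let requests pick a sensible charset than decode the bytes yourself, the library also exposes apparent_encoding, which guesses the charset from the response body. A minimal sketch of that approach:

import requests

response = requests.get("http://www.baidu.com")
# apparent_encoding is requests' guess based on the body; assigning it
# to response.encoding makes response.text decode correctly
response.encoding = response.apparent_encoding
print(response.text)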
Sending a request with query parameters is just as simple. For example, searching Baidu for 中国:
import requests

# a GET request with query parameters is also straightforward
params = {"wd": "中国"}
# the resulting request URL is: https://www.baidu.com/s?wd=中国
url = "http://www.baidu.com/s"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "Cookie": "PSTM=1580476270; BIDUPSID=F7FDB52F1E6145DF42409CA9FA5E38F1; BAIDUID=EB7EB67C61D9B56FA038900594F4482C:FG=1; H_WISE_SIDS=146312_146745_146498_142019_144427_145946_141748_145498_144986_144419_144135_145270_131247_144681_137743_144742_144251_140259_127969_146548_145805_146752_145876_145999_131423_100807_132549_145909_146002_144499_107318_144378_146135_139909_144877_144966_145607_143664_145395_143853_145441_139914_110085; BD_UPN=12314753; BDUSS=FTamp5QXpBbUs0M3l0LX54OEpSTmZRQ1VyMkRaaEtDejJtWXh4bFg2RThYREZmRVFBQUFBJCQAAAAAAAAAAAEAAAD1ChdFz~Tk7NPq0rnA77XEu-oAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADzPCV88zwlfaF; BDSFRCVID=h80OJeC62RO1P5crLYfLhe5IpxPD5kjTH6aovKWxyuaQArQIgwZrEG0PDM8g0KubteeVogKKL2OTHmCF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=tb4qoDPhtCP3j-bmKKTM-tF--fTMbJQKaDQ03Ru8Kb7VbIjt0MnkbfJBD4bnJTbWJjrrXfch2nKhf43ah5bDjJK7yajK2b0eWm79Ll6stJ5zhROeLfTpQT8r0UDOK5OibCrfXpI5ab3vOp44XpO1hftzBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksDMDtqjtjtRI8VIPXfbP_jRTT5tQ_bP0_Mfv0etJyaR3Bb4bvWJ5TMCoz54jzW-DXXh6zXtv-J6cf-R4bKD5-ShPC-tn134_eDhufWRIJbgre2bC-3l02Vb7ae-t2ynLVLt8Lq4RMW23roq7mWn6TsxA45J7cM4IseboJLfT-0bc4KKJxbnLWeIJIjjCKejcLjNueJ6nfb5kXLnTqaJQqDR3P-tnWMt_Hqxby26Pfam59aJ5nJDoWsC3Fh-bKXhFP-U7-txcjanPJoq8KQpP-HJACDh7BX60typALaMufKmj9Kl0MLn7Wbb0xyUQDjULZ5MnMBMnrteOnan6n3fAKftnOM46JehL3346-35543bRTLnLy5KJtMDF4jj_BDjbyDGRK5b30bD5JBR5bHJnhjtcNq4bohjPT54O9BtQmJJrQ-KJdQIOK8MTIjt7qjfFQhJrPq45NQg-q3R7S2RK5qPc4y4O-LTD4j4TZ0x-jLN7hVn0MW-5DhfJYL4nJyUPTD4nnBPrm3H8HL4nv2JcJbM5m3x6qLTKkQN3T-PKO5bRh_CF-tI0ahIP9e5RD-t4bMfQXeJO0aDvKWjrJabC3jKPxXU6qLT5X04PDQTQ2-eJk0qRS5C5zq-nMQ5oOMl0njxQy-fnUMT5t3xjp-fjzeP3LKxonDh8vXH7MJUntKjn-0xQO5hvv8KoO3M7VLUKmDloOW-TB5bbPLUQF5l8-sq0x0bOte-bQXH_E5bj2qRFJoD0y3H; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; H_PS_PSSID=32292_1426_31660_32348_32045_32230_32117_31322_32261; delPer=0; BD_CK_SAM=1; PSINO=1; sugstore=1; H_PS_645EC=c93bxQToACUSgkjcCnirWUqenbGxDFtFRS0CJbuQ0JWgwpadZd7bhB57Wdg"
}
# send the GET request and save the decoded page
respChina = requests.get(url, params=params, headers=header)
with open("baidu.html", "w", encoding="utf-8") as fp:
    fp.write(respChina.content.decode("utf-8"))
Using requests here is simple: there is no need to URL-encode the parameters yourself. Pass them in as-is and the library encodes them under the hood.
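You can see the automatic encoding by printing the final URL that requests built. A minimal sketch:

import requests

resp = requests.get("http://www.baidu.com/s", params={"wd": "中国"})
# requests percent-encodes the parameters for us; this prints
# something like http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD
print(resp.url)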
Sending a POST request
Just call the post method and pass a data argument. Here we reuse the earlier urllib example of scraping job listings from Lagou.
import requests

url = "https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false"
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "referer": "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
    "origin": "https://www.lagou.com",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "cookie": "user_trace_token=20200531194606-fc1f541a-e125-4963-9a9c-2b1d957e1216; _ga=GA1.2.1544368802.1590925567; LGUID=20200531194607-c7d6efeb-2ce7-4401-94bc-3e65916eb2a0; LG_LOGIN_USER_ID=b5a507f581fa09a8dca0626c07ed6a0bd39ed30461d10c2d; LG_HAS_LOGIN=1; RECOMMEND_TIP=true; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221726a8db339436-0a16f694cce861-3a65420e-1049088-1726a8db33a31%22%2C%22%24device_id%22%3A%221726a8db339436-0a16f694cce861-3a65420e-1049088-1726a8db33a31%22%2C%22props%22%3A%7B%22%24latest_utm_source%22%3A%22m_cf_cpt_baidu_pcbt%22%2C%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1594649478; gate_login_token=bc261c1a099219b9ea5417e5f508b118fcbf1b4d5f4d1e9a; JSESSIONID=ABAAABAABAGABFACB950A00325E11FB9B186A4767973E5D; WEBTJ-ID=20200713221208-1734884c952121-086ddd59679675-3a65420e-1049088-1734884c95514b; _putrc=EFB19A27B7811D99; login=true; unick=%E5%AD%94%E7%BB%B4%E6%8C%AF; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=42; privacyPolicyPopup=false; index_location_city=%E4%B8%8A%E6%B5%B7; TG-TRACK-CODE=index_search; LGSID=20200714222259-26727ec8-1214-4b9e-9b24-3ad9f63c3e14; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist%5Fpython%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; _gat=1; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1594736579; LGRID=20200714222259-a84b3e83-ac2c-4709-91c9-9957d5f3f7fd; _gid=GA1.2.842365217.1594736579; X_HTTP_TOKEN=ef0d6e7a5aae0da90856374951662facca1a5ab6a4; SEARCH_ID=ee5057c6ac5f42189cbf599cd3be9bb0"
}
# form fields expected by the endpoint: first page, keyword "python"
data = {"first": "true", "pn": 1, "kd": "python"}
response = requests.post(url, data=data, headers=header)
print(response.json())
The returned JSON contains the job listings (output omitted here). It really is that simple: prepare the parameters and make a single call.
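One caveat: sites like Lagou often answer scrapers with an HTML or plain-text anti-crawling notice instead of JSON, in which case response.json() raises a ValueError. A minimal defensive sketch (headers omitted here for brevity):

import requests

url = "https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false"
data = {"first": "true", "pn": 1, "kd": "python"}
response = requests.post(url, data=data)
if response.status_code == 200:
    try:
        print(response.json())
    except ValueError:
        # the body is not JSON, e.g. an anti-scraping notice page
        print(response.text)
else:
    print("request failed with status", response.status_code)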
Setting a proxy IP with requests
Setting a proxy is just as simple: add one more argument, as in the code below.
import requests

url = "https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false"
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "referer": "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
    "origin": "https://www.lagou.com",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "cookie": "user_trace_token=20200531194606-fc1f541a-e125-4963-9a9c-2b1d957e1216; _ga=GA1.2.1544368802.1590925567; LGUID=20200531194607-c7d6efeb-2ce7-4401-94bc-3e65916eb2a0; LG_LOGIN_USER_ID=b5a507f581fa09a8dca0626c07ed6a0bd39ed30461d10c2d; LG_HAS_LOGIN=1; RECOMMEND_TIP=true; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221726a8db339436-0a16f694cce861-3a65420e-1049088-1726a8db33a31%22%2C%22%24device_id%22%3A%221726a8db339436-0a16f694cce861-3a65420e-1049088-1726a8db33a31%22%2C%22props%22%3A%7B%22%24latest_utm_source%22%3A%22m_cf_cpt_baidu_pcbt%22%2C%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1594649478; gate_login_token=bc261c1a099219b9ea5417e5f508b118fcbf1b4d5f4d1e9a; JSESSIONID=ABAAABAABAGABFACB950A00325E11FB9B186A4767973E5D; WEBTJ-ID=20200713221208-1734884c952121-086ddd59679675-3a65420e-1049088-1734884c95514b; _putrc=EFB19A27B7811D99; login=true; unick=%E5%AD%94%E7%BB%B4%E6%8C%AF; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=42; privacyPolicyPopup=false; index_location_city=%E4%B8%8A%E6%B5%B7; TG-TRACK-CODE=index_search; LGSID=20200714222259-26727ec8-1214-4b9e-9b24-3ad9f63c3e14; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist%5Fpython%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; _gat=1; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1594736579; LGRID=20200714222259-a84b3e83-ac2c-4709-91c9-9957d5f3f7fd; _gid=GA1.2.842365217.1594736579; X_HTTP_TOKEN=ef0d6e7a5aae0da90856374951662facca1a5ab6a4; SEARCH_ID=ee5057c6ac5f42189cbf599cd3be9bb0"
}
data = {"first": "true", "pn": 1, "kd": "python"}
# the target URL is https, so the proxy must also be mapped under the
# "https" key; with only an "http" entry, requests would not route this
# request through the proxy at all
proxy = {
    "http": "http://121.232.148.167:9000",
    "https": "http://121.232.148.167:9000",
}
response = requests.post(url, data=data, headers=header, proxies=proxy)
print(response.json())
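An easy way to confirm that traffic actually flows through the proxy is to hit an IP-echo service. A minimal sketch (assuming httpbin.org is reachable; it simply reports the client IP it sees):

import requests

proxy = {
    "http": "http://121.232.148.167:9000",
    "https": "http://121.232.148.167:9000",
}
# httpbin echoes the origin IP; if the proxy is working, this prints
# the proxy's address rather than your own
print(requests.get("http://httpbin.org/ip", proxies=proxy, timeout=10).text)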
Sharing cookies in requests
If a response sets cookies, the response's cookies attribute gives you the returned cookie values.
import requests

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}
data = {"email": "18238903856", "password": "kwzan328594"}
login_url = "http://www.renren.com/PLogin.do"
res = requests.post(url=login_url, data=data, headers=header)
# the CookieJar returned with the response
print(res.cookies)
# the cookies as a plain dict
print(res.cookies.get_dict())
Earlier, when we logged into Renren with urllib to view the photo wall, we had to send several requests through an opener and share cookies between them, which was a bit cumbersome. With requests it is much simpler: use a session (multiple requests grouped into one conversation). Every request made through the same session automatically shares the same cookies.
import requests

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}
data = {"email": "18238903856", "password": "kwzan328594"}
login_url = "http://www.renren.com/PLogin.do"
# log in through a session
session = requests.session()
res = session.post(url=login_url, data=data, headers=header)
# the photo wall can then be fetched with the same session, which
# carries the login cookies automatically
picture_url = "http://www.renren.com/974784400/newsfeed/photo"
resPict = session.get(url=picture_url)
with open("resPict.html", "w", encoding="utf-8") as fp:
    fp.write(resPict.content.decode("utf-8"))
The generated HTML shows the logged-in page, so the cookie sharing worked.
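The shared state lives on the session object itself: the Set-Cookie headers from every response accumulate in session.cookies. A minimal standalone sketch (using Baidu, which sets cookies on a plain GET):

import requests

session = requests.session()
session.get("http://www.baidu.com")
# the CookieJar the session has accumulated so far, as a dict
print(session.cookies.get_dict())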
Handling untrusted SSL certificates
For sites with trusted SSL certificates, such as Baidu at https://www.baidu.com, requests returns the response normally with no extra options.
For sites with untrusted certificates, however (the browser strikes through the https in the address bar to flag the certificate as invalid), scraping may fail with an SSL error. Adding one argument to the request avoids the problem.
# pass verify=False to skip certificate validation on a site whose
# certificate is not trusted (verify only applies to https URLs)
requests.get("https://www.renren.com/PLogin.do", verify=False)
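With verify=False, urllib3 (the transport library underneath requests) emits an InsecureRequestWarning on every request. If that noise is unwanted, it can be silenced with urllib3's own API. A minimal sketch:

import requests
import urllib3

# suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
requests.get("https://www.renren.com/PLogin.do", verify=False)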