Web Scraping Basics: Using the requests Library

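Note: This article records what I learned while studying, mainly for my own review and as a reference for others. If you spot any errors, please point them out so we can discuss and improve together.
Sources:
Outline and main content: Cui Qingcai, 《Python3网络爬虫开发实战》 (Python 3 Web Crawler Development in Practice), 2nd edition
Requests Chinese documentation: https://www.w3cschool.cn/requests2/
requests is a higher-level HTTP library (built on urllib3) that is more convenient to use than the standard-library urllib. Most companies use requests in practice, so this article studies it side by side with urllib to reinforce the material.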
1. Installing requests
pip install requests
2. requests.get() compared with urllib.request.urlopen()
Input:

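import requests
res = requests.get('https://www.baidu.com/')
print(type(res))        # type of the response object
print(res.status_code)  # status code
print(type(res.text))   # type of the response body
print(res.text[:100])   # response body (first 100 characters only, otherwise too long)
print(res.headers)      # response headers
print(res.history)      # request history (redirects)
print(res.cookies)      # cookies
print(res.url)          # final URL of the response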

Output:

<class 'requests.models.Response'>
200
<class 'str'>
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charse
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 04 Oct 2022 02:40:51 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:24:33 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
[]
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
https://www.baidu.com/

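Summary: besides get(), requests also provides post(), put(), delete(), and other methods.
① GET requests
a. How do you add parameters to the request headers and the request body?
Input: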

import requests
url = 'http://www.httpbin.org/get'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0',
    'Host': 'www.httpbin.org'
}
data = {
    'username': 'jack',
    'password': 'abc123456'
}
res = requests.get(url=url, data=data, headers=headers)
print(res.text)

Output:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Content-Length": "32",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "www.httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0",
    "X-Amzn-Trace-Id": "Root=1-633ba026-08f0fed010e0327633825313"
  },
  "origin": "120.227.32.26",
  "url": "http://www.httpbin.org/get"
}

Compare this with the original response body:
[Image: original response body]
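The output above shows "args": {} because data= places the values in the request body rather than in the query string. As a small offline sketch (nothing is sent over the network), preparing the same request once with params= and once with data= shows where the values end up. The variable names are illustrative only.

```python
import requests

url = 'http://www.httpbin.org/get'
payload = {'username': 'jack', 'password': 'abc123456'}

# params= appends the values to the URL as a query string
with_params = requests.Request('GET', url, params=payload).prepare()
# data= form-encodes the values into the request body
with_data = requests.Request('GET', url, data=payload).prepare()

print(with_params.url)   # URL including the query string
print(with_params.body)  # no body
print(with_data.url)     # URL unchanged
print(with_data.body)    # form-encoded body
```

For a GET request that should populate "args" on httpbin, requests.get(url, params=payload, headers=headers) is the usual form.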
b. Fetching binary data
Input:

import requests

res = requests.get('https://scrape.center/favicon.ico')
print(res.text)
print(res.content)

Output:
[Images: printed output of res.text and res.content]
Writing the downloaded binary data to a local file named favicon.ico saves the image to disk:

import requests

res = requests.get('https://scrape.center/favicon.ico')
with open('favicon.ico','wb') as f:
    f.write(res.content)

② POST requests: these work much like GET requests; just pass the parameters.
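As a rough illustration of passing parameters with POST, the sketch below prepares two POST requests without sending them: one with data= (form-encoded body) and one with json= (JSON body). The URL and payload reuse the values from the GET example; the variable names are illustrative.

```python
import requests

url = 'http://www.httpbin.org/post'
form = {'username': 'jack', 'password': 'abc123456'}

# data= sends the dict as a URL-encoded form body
form_request = requests.Request('POST', url, data=form).prepare()
# json= serializes the dict to JSON and sets the matching Content-Type
json_request = requests.Request('POST', url, json=form).prepare()

print(form_request.headers['Content-Type'])
print(form_request.body)
print(json_request.headers['Content-Type'])
print(json_request.body)
```

To actually send the form, call requests.post(url, data=form) and inspect res.text as in the GET examples.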

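3. Responses

The commonly used response attributes were introduced above. In addition, each status code has one or more named aliases in requests.codes: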

# Informational status codes
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),

# Success status codes
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection status codes
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
      'resume_incomplete', 'resume',), # These 2 to be removed in 3.0

# Client error status codes
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server error status codes
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')

For example, to check whether the result is a 404, compare the status code against the built-in constant requests.codes.not_found.
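A minimal sketch of that comparison follows. The helper describe_status is hypothetical and only exists to show the named constants in use; it runs without any network access.

```python
import requests

def describe_status(status_code):
    """Hypothetical helper: return a short label for a few common status codes."""
    if status_code == requests.codes.ok:
        return 'ok'
    if status_code == requests.codes.not_found:
        return 'not found'
    return 'other'

print(requests.codes.not_found)
print(describe_status(404))
```

With a real response, the same check reads: if res.status_code == requests.codes.not_found: ...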

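4. Advanced usage
get() also accepts other parameters (largely the same ones as post()).
① File upload
Pass a file object through the files parameter; the file must be opened in binary mode.
Input: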

import requests
# Open the file in binary mode; the with block closes it after the upload
with open('favicon.ico', 'rb') as f:
    res = requests.post('http://www.httpbin.org/post', files={'file': f})
print(res.text)

Output:

{
  "args": {},
  "data": "",
  "files": {
    "file": "data:application/octet-stream;base64,AAABAAEAICAAAAEAIACoEAAAFgAAACgAAAAgAAAAQAAAAAEAIAAAAAAAABAAABILAAASCwAAAAAAAAAAAABXP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+
v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1
c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1hA6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WEDr/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9YQOv/Ujrq/0ox6v9LMur/SzLq/0sy6v9LMu
r/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0ox6v9SOur/WEDr/1c/6/9XP+v/Vz/r/1c/6/9XP+v/WEDr/1I66v9yXe7/n5H0/5qK8/+bi/P/m4vz/5uL8/+bi/P/m4vz/5uL8/+bi/P/m4vz/5uL8/+bi/P/m4vz/5uL8/+bi/P/m4vz/5
uL8/+bi/P/m4vz/5uL8/+ZivP/n5H0/3Jd7v9SOur/WEDr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///////////////////////////////////////////////////////////////////////////////////////////////////z8////////mozz/0sy6v9aQuv/Vz/r/1c/6/9XP+
v/Vz/r/1pC6/9LMur/mYrz///////6+f7//Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P//+vn+//////+aivP/SzLq/1pC6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0sy6v+ajPP///////z8//////////////7+///+/v///v7///
7+///+/v///v7///7+///+/v///v7///7+///+/v///v7///7+///+/v/////////////8/P///////5uL8/9LMur/WkLr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///////////////////////////////////////////////////////////////////////////////////////////
////////z8////////m4vz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P///v7///////+VhfL/dF/v/3tn7/96Zu//embv/3pm7/96Zu//embv/3pm7/96Zu//embv/3tn7/90X+7/lYXz///////+/v///Pz///////+bi/P/SzLq/1pC6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0
sy6v+ajPP///////z8///+/v///////3Rg7v9HLen/UTjq/0826v9PNur/Tzbq/0826v9PNur/Tzbq/0826v9PNur/UTjq/0cu6f90X+7///////7+///8/P///////5uL8/9LMur/WkLr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///7+////////e2jv/1E46v9aQ+v/WUHr/1lC6/9bRO
z/W0Ts/1tE7P9bROz/W0Ts/1tE7P9dRuz/VDzr/31q8P///////v7///z8////////m4vz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P///v7///////96Zu7/Tzbq/1lB6/9YQOv/VDzr/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0006v9DKen/cVzu///////+/v///Pz///
////+bi/P/SzLq/1pC6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0sy6v+ajPP///////z8///+/v///////3pm7v9PNur/WULr/1Q76/9mT+v/morz/5iI8/+YiPP/mIjz/5iI8/+YiPP/mYn0/5SD8/+toPb////////////8/P///////5uL8/9LMur/WkLr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8/////
///Pz///7+////////embu/0826v9aQ+v/Tzbr/3xq6f////7//v7///////////////////////////////////////////////////z8////////m4vz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P///v7///////96Zu7/Tzbq/1pD6/9PNuv/e2rq/////v/7+////Pz///
z8///8/P///Pz///z8///8/P///f3//////////////Pz///////+ai/P/SzLq/1pC6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0sy6v+ajPP///////z8///+/v///////3pm7v9PNur/WkPr/0826/98aur////+//39///+/v///v7///7+///+/v///v7///7+///+/v///v7///7+///7+////////5qL8/9LMu
r/WkLr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///7+////////embu/0826v9aQ+v/Tzbr/3xq6f////7//v7//////v////7////+/////v////7////+/////v////7////+//z8/v//////m4zz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P///v7///
////96Zu7/Tzbq/1lB6/9VPev/X0np/31s6f98aur/fGrq/3xq6v98aur/fGrq/3xq6v98aur/fGrq/3xq6v98aur/e2rq/39t6v9mUOr/VDzr/1hA6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0sy6v+ajPP///////z8///+/v///////3pm7v9PNur/WUHr/1c/6/9VPev/TzXr/0826/9PNuv/Tzbr/0826/9PNu
v/Tzbr/0826/9PNuv/Tzbr/0826/9PNuv/TjXr/1Q76/9YQOv/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///7+////////e2jv/1E46v9aQ+v/WUHr/1lB6/9bQ+v/WkPr/1pD6/9aQ+v/WkPr/1pD6/9aQ+v/WkPr/1pD6/9aQ+v/WkPr/1pD6/9bQ+v/WEHr/1c/6/9XP+v/Vz/r/1
c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P///v7///////90YO7/SC3p/1E46v9PNur/Tzbq/0826v9PNur/Tzbq/0826v9PNur/Tzbq/0826v9PNur/Tzbq/0826v9PNur/UDfq/0826v9UPOv/WEDr/1c/6/9XP+v/Vz/r/1c/6/9XP+v/WkLr/0sy6v+ajPP///////z8///+/v///////5WF8v9zYO
7/e2jv/3pm7/96Zu//embv/3pm7/96Zu//embv/3pm7/96Zu//embv/3pm7/96Zu//embv/3pm7/95Zu//fGnv/2RP7f9VPOv/WEDr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///////////////////////////////////////////////////////////////////////////////////
////////////////z8////////mozz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1pC6/9LMur/mozz///////8/P/////////////+/v///v7///7+///+/v///v7///7+///+/v///v7///7+///+/v///v7///7+///+/v///v7///7+///+/v///Pv///////+ai/P/SzLq/1pC6/9XP+v/Vz/r/1c/6/9XP+
v/WkLr/0sy6v+ZivP///////r5/v/8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///8/P///Pz///z8///6+f7//////5mK8/9LMur/WkLr/1c/6/9XP+v/Vz/r/1c/6/9aQuv/SzLq/5qM8////////Pz///////////////////////////////////
////////////////////////////////////////////////////////////////z8////////m4zz/0sy6v9aQuv/Vz/r/1c/6/9XP+v/Vz/r/1hA6/9SOur/cl7u/5+R9P+ZivP/mozz/5qM8/+ajPP/mozz/5qM8/+ajPP/mozz/5qM8/+ajPP/mozz/5qM8/+ajPP/mozz/5qM8/+ajPP/mozz/5qM8/+ajP
P/mYrz/5+R9P9yXe7/Ujrq/1hA6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1hA6/9SOur/SjHq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SzLq/0sy6v9LMur/SjHq/1I66v9YQOv/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1
hA6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WkLr/1pC6/9aQuv/WEDr/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+
v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1c/6/9XP+v/Vz/r/1
c/6/9XP+v/Vz/r/1c/6/9XP+v/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="
  },
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Content-Length": "4433",
    "Content-Type": "multipart/form-data; boundary=4e29f7f88aa42e12b942b3607d866316",
    "Host": "www.httpbin.org",
    "User-Agent": "python-requests/2.28.1",
    "X-Amzn-Trace-Id": "Root=1-633bba12-33dc358e435dfae534d00d8a"
  },
  "json": null,
  "origin": "120.227.32.26",
  "url": "http://www.httpbin.org/post"
}
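The "Content-Type": "multipart/form-data; boundary=..." header in the output above is generated by requests when files= is used. The offline sketch below prepares such an upload without sending it; the in-memory bytes are a placeholder standing in for the real favicon.ico, and the variable names are illustrative.

```python
import io
import requests

# In-memory stand-in for open('favicon.ico', 'rb'); the bytes are placeholders
fake_icon = io.BytesIO(b'\x00\x00\x01\x00')

prepared = requests.Request(
    'POST',
    'http://www.httpbin.org/post',
    files={'file': ('favicon.ico', fake_icon)},
).prepare()

print(prepared.headers['Content-Type'])  # multipart/form-data with a boundary
print(len(prepared.body))                # size of the encoded multipart body
```

The (filename, file object) tuple form also lets you control the filename that the server sees.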

② Cookies (compare with cookie handling in urllib)
Input:

import requests
res = requests.get('https://www.gitee.com')
print(res.cookies)
for key, value in res.cookies.items():
    print(key + ':' + value)

Output:

<RequestsCookieJar[<Cookie gitee-session-n=dllHb0lPWTR2eTkwS3gvM2dNdy9GcmZ5Q2tDQzFIb1Y1TDh3VnR6TDVHL1A5MVZJclhRLzN3UEN1Q3U2Yko4L2hIekE5b1lGSE1ETmRZL0FxYTRKNktoYnB2M25qVXJYY1RsdkFXT2NSajZnUnZGR25UbzZ0eGY2S0Y0ak5tMTFsQnc1dVRoZlRYZlBzc
HI1ZlZua3djWWNrTDZUaHRVYzBvNGVCTTJmaUZzNzhjUlBheEpYTXc3VTArc25HcTNJLS1qc2V1TXZSaDkwNVlIUzVUZ1hGQnFRPT0%3D--599965dcd2feafe5aa01b712881b026ab2eaf05a for .gitee.com/>, <Cookie user_locale=zh-CN for .gitee.com/>, <Cookie oschina_new_us
er=false for gitee.com/>]>
gitee-session-n:dllHb0lPWTR2eTkwS3gvM2dNdy9GcmZ5Q2tDQzFIb1Y1TDh3VnR6TDVHL1A5MVZJclhRLzN3UEN1Q3U2Yko4L2hIekE5b1lGSE1ETmRZL0FxYTRKNktoYnB2M25qVXJYY1RsdkFXT2NSajZnUnZGR25UbzZ0eGY2S0Y0ak5tMTFsQnc1dVRoZlRYZlBzcHI1ZlZua3djWWNrTDZUaHRVYzBv
NGVCTTJmaUZzNzhjUlBheEpYTXc3VTArc25HcTNJLS1qc2V1TXZSaDkwNVlIUzVUZ1hGQnFRPT0%3D--599965dcd2feafe5aa01b712881b026ab2eaf05a
user_locale:zh-CN
oschina_new_user:false
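The example above reads cookies from a response. Going the other way, cookies can be attached to a request with the cookies parameter. The sketch below prepares such a request offline and prints the resulting Cookie header; the cookie value is taken from the output above and the variable names are illustrative.

```python
import requests

prepared = requests.Request(
    'GET',
    'https://www.gitee.com',
    cookies={'user_locale': 'zh-CN'},   # cookies to send with the request
).prepare()

print(prepared.headers['Cookie'])
```

When sending for real, the same dict can be passed as requests.get(url, cookies={...}).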

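③ Session persistence
Calling requests.get() and requests.post() directly does simulate web requests, but each call runs in its own session, as if the two requests were made from two separate browsers. Suppose the first request logs in via POST and the second fetches the logged-in user's profile via GET. To make the second request behave like a new tab in the same browser rather than a new browser, without manually attaching cookies to every request (which is tedious), use a Session object.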

import requests
s = requests.Session()
s.get('https://www.httpbin.org/cookies/set/number/123456')
r = s.get('https://www.httpbin.org/cookies')
print(r.text)
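To see why the Session matters without touching the network, the following sketch stores a cookie in the session's cookie jar by hand (standing in for what the /cookies/set/number/123456 call would store) and then prepares, but does not send, a follow-up request. The names follow_up and standalone are illustrative.

```python
import requests

s = requests.Session()
# Stand-in for the cookie that /cookies/set/number/123456 would store
s.cookies.set('number', '123456')

# Prepare (but do not send) a follow-up request through the same session
follow_up = s.prepare_request(
    requests.Request('GET', 'https://www.httpbin.org/cookies')
)
print(follow_up.headers.get('Cookie'))

# A bare request prepared outside the session carries no cookie
standalone = requests.Request('GET', 'https://www.httpbin.org/cookies').prepare()
print(standalone.headers.get('Cookie'))
```

The session attaches its stored cookies automatically, whereas the standalone request has none.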

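④ SSL certificate verification
Some websites have no HTTPS certificate configured, or use one that is not issued by a trusted CA. Browsers then show an SSL certificate error, as below:
[Image: browser SSL certificate error]
Scraping such a site directly raises an invalid SSL certificate error, as below:
[Image: SSL error raised by requests]
The verify parameter controls this check. It defaults to True, which verifies the certificate automatically; pass verify=False to skip verification.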

import requests
res = requests.get("https://ssr2.scrape.center/", verify=False)
print(res.status_code)

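⑤ Timeouts
As with urllib, pass timeout in the request parameters. timeout=1 means an exception is raised if the server does not respond within 1 second.
A request actually has two phases, connect and read, and each can be given its own limit with a tuple such as timeout=(5, 30). A single value such as timeout=1 applies to both the connect timeout and the read timeout.
To wait indefinitely, set timeout to None or omit the timeout parameter.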
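The connect and read phases can be observed with a small local experiment. In this sketch, a listening socket on 127.0.0.1 lets the connection succeed but never replies, so the read phase times out. The silent socket, the (2, 0.5) values, and the trust_env = False setting (to bypass any proxy environment variables) are all assumptions made for the demonstration.

```python
import socket
import requests

# A local TCP socket that lets clients connect but never sends a response,
# so the connect phase succeeds and the read phase times out.
silent_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent_server.bind(('127.0.0.1', 0))   # port 0: let the OS pick a free port
silent_server.listen(1)
port = silent_server.getsockname()[1]

session = requests.Session()
session.trust_env = False              # ignore proxy environment variables

try:
    # timeout=(connect timeout, read timeout), both in seconds
    session.get(f'http://127.0.0.1:{port}/', timeout=(2, 0.5))
    outcome = 'no timeout'
except requests.exceptions.ConnectTimeout:
    outcome = 'connect timeout'
except requests.exceptions.ReadTimeout:
    outcome = 'read timeout'
finally:
    session.close()
    silent_server.close()

print(outcome)
```

Both exception classes derive from requests.exceptions.Timeout, so catching that single class is enough when the phase does not matter.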
⑥ Authentication
If a page requires a login via HTTP basic authentication, pass the auth parameter in the request.

import requests
res = requests.get('https://ssr3.scrape.center/', auth=('admin', 'admin'))
print(res.status_code)
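Passing a (username, password) tuple as auth makes requests use HTTP basic authentication. The offline sketch below prepares the request without sending it and compares the generated Authorization header with a hand-built one; the variable names are illustrative.

```python
import base64
import requests

prepared = requests.Request(
    'GET', 'https://ssr3.scrape.center/', auth=('admin', 'admin')
).prepare()

# Basic auth is "Basic " followed by base64("username:password")
expected = 'Basic ' + base64.b64encode(b'admin:admin').decode('ascii')

print(prepared.headers['Authorization'])
print(prepared.headers['Authorization'] == expected)
```

The tuple is shorthand for requests.auth.HTTPBasicAuth('admin', 'admin').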

⑦ Proxy settings
These will be covered in a dedicated chapter later.
