Python Web Crawling (Part 1) ------ Requests Lecture Notes

Requests

Purpose: send network requests and receive response data.

Official documentation: https://requests.readthedocs.io/zh_CN/latest/index.html

Requests is an HTTP library written in Python on top of urllib and released under the Apache2 License.

It is more convenient than urllib, saves a great deal of work, and fully covers the needs of HTTP testing.

In one sentence: Requests is an HTTP request library written in Python that makes it easy to simulate a browser sending HTTP requests from code.

Installation command: pip install requests
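A quick way to confirm the installation worked is to import the library and print its version (a minimal sketch; the exact version string depends on what pip installed):

import requests

print(requests.__version__)  # e.g. '2.28.1'; any version string means the install succeeded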

I. Making requests with Requests

1. A first example

# https://www.baidu.com/
import requests

response = requests.get('https://www.baidu.com/')
print(response)  # the Response object (body source + status code + URL)
print(response.text)  # view the response body content

print(type(response.text))  # data type of the response content (str)

print(response.status_code)  # view the response status code
print(response.url)
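One note on response.text: Requests guesses the text encoding from the response headers, so some pages (Baidu's homepage among them) can come back garbled. If that happens, you can reset the encoding before reading .text; a minimal sketch using the library's own apparent_encoding detection:

import requests

response = requests.get('https://www.baidu.com/')
response.encoding = response.apparent_encoding  # re-detect the encoding from the body itself
print(response.text)  # decoded with the detected encoding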

2. The various request methods

requests.get('http://httpbin.org/get')   # GET request
requests.post('http://httpbin.org/post')  # POST request
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')

3.1 GET requests

1. Basic usage

# Test site: http://httpbin.org/get
url = 'http://httpbin.org/get'  # target site
r = requests.get(url)
print(r.status_code)
print(r.text)
print(type(r.text))

2. GET requests with parameters

# Test site: http://httpbin.org/get
# First approach: build the query string into the URL yourself, e.g.:
# https://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD&pn=0&oq=%E4%B8%AD%E5%9B%BD&ie=utf-8&usm=6&fenlei=256&rsv_idx=1&rsv_pq=a1e3d64c000d24b8&rsv_t=a275rbOHpNWKXTdaXGjTAG6uADWzJfkIijwnQpMSUN4WqOcOki9o0nvbRrw
url= 'http://httpbin.org/get?age=12&name=lisi'
r = requests.get(url)
print(r.status_code)
print(r.text)
# Recommended approach:
# build the parameters in a separate dict
d = {
    'name':'lisi',
    'age':10 
}
url = 'http://httpbin.org/get'
r = requests.get(url, params=d)   # params: carries the GET query parameters
print(r.text)

3.2 POST requests

# http://httpbin.org/post
url ='http://httpbin.org/post'  
d = {
    'name':'lisi',
    'age':10 
}
r = requests.post(url, data=d)  # data: carries the POST form parameters
print(r.text)
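Besides form data via data=, Requests can also send a JSON request body with the json= keyword, which serializes the dict and sets the Content-Type header for you. A minimal sketch against the same httpbin endpoint:

import requests

url = 'http://httpbin.org/post'
payload = {'name': 'lisi', 'age': 10}
r = requests.post(url, json=payload)  # json=: send the dict as a JSON body instead of form data
print(r.json()['json'])  # httpbin echoes the parsed JSON body back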

4. Getting JSON data

import requests
import json

url = 'http://httpbin.org/get'
r = requests.get(url)
# print(r.status_code)  # view the response status code
a = r.text
# print(a)   
# print(type(a))  # view the data type (str)

dict_data = json.loads(a)
# print(dict_data)
# print(type(dict_data)) 
res = dict_data['headers']['Host']
# print(res)

json_data = r.json()  # .json(): parse the JSON response; the result is a dict
print(json_data)
print(type(json_data)) 

5. content: getting binary data

# Target site -- the Baidu logo image: https://www.baidu.com/img/baidu_jgylogo3.gif
url = 'https://www.baidu.com/img/baidu_jgylogo3.gif'
r = requests.get(url)
print(r.text)
print(type(r.text))
url = 'https://www.baidu.com/img/baidu_jgylogo3.gif'
r = requests.get(url)   # the image body is binary data
print(r.content)  # content: the raw binary (bytes) response body
print(type(r.content))

with open('bdtp.gif', 'wb') as f:
    f.write(r.content)
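For larger files it is usually better not to hold the whole body in memory at once; Requests supports streaming downloads with stream=True plus iter_content. A minimal sketch (the chunk size and output filename here are just illustrative choices):

import requests

url = 'https://www.baidu.com/img/baidu_jgylogo3.gif'
r = requests.get(url, stream=True)  # stream=True: do not download the whole body immediately
with open('bdtp_stream.gif', 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):  # read and write the body piece by piece
        f.write(chunk)
r.close()  # release the connection once the body has been consumed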


"""
bytes类型是指一堆字节的集合,在python中以b开头的字符串都是bytes类型

Bytes类型的作用:
    1, 在python中, 数据转成2进制后不是直接以0101010的形式表示的,而是用一种叫bytes(字节)的类型来表示
    2,计算机只能存储2进制, 我们的字符、图片、视频、音乐等想存到硬盘上,也必须以正确的方式编码成2进制后再存。
      记住一句话:在python中,字符串必须编码成bytes后才能存到硬盘上
"""

6. Adding headers

The headers (in particular User-Agent) identify you as a browser user; if they are missing, the server may decide that you are not a normal browser user but a crawler.

# Target site -- Zhihu: https://www.zhihu.com/explore
url ='https://www.zhihu.com/explore'
r = requests.get(url)  
print(r.status_code)
print(r.text)
# Target site -- Zhihu: https://www.zhihu.com/explore
url ='https://www.zhihu.com/explore'

# Prepare the identity (header) information
head = {
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}
r = requests.get(url, headers=head)   # headers: carries the disguised identity information
print(r.status_code)
print(r.text)

II. The Response object

1. Response attributes

# Target site: http://www.jianshu.com
import requests
url = 'http://www.jianshu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}
r = requests.get(url, headers=headers, allow_redirects=False)
print(r.status_code)  # view the response status code

# view the response headers
print(r.headers)

# view the final URL
print(r.url)

# check whether the request was redirected
print(r.history)

# allow_redirects=False disables following redirects
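A few more attributes of the same Response object are often useful alongside the ones above; continuing with the r from the request above:

print(r.encoding)      # encoding that will be used to decode r.text
print(r.cookies)       # cookies the server set (a RequestsCookieJar)
print(len(r.content))  # size of the raw response body in bytes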

2. Interpreting status codes

200  request succeeded
301, 302  the request was redirected
404  page not found
500, 502, 503  server-side errors

Requests ships its own lookup table of named status codes (requests.codes); the mapping below is taken from the library, and a short usage example follows the table.

100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
      'resume_incomplete', 'resume',), # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),
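In practice you rarely compare against raw numbers: requests.codes exposes the names above as attributes, and Response.raise_for_status() raises an HTTPError for 4xx/5xx responses. A minimal sketch:

import requests

r = requests.get('http://httpbin.org/status/404')
print(r.status_code == requests.codes.not_found)  # True: named lookup for 404

r = requests.get('http://httpbin.org/get')
if r.status_code == requests.codes.ok:            # named lookup for 200
    print('request succeeded')
r.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx; does nothing for 200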

III. Advanced usage

Session persistence

The HTTP/HTTPS protocol is stateless: it has no memory of previous transactions, so every request is handled as an independent exchange.

Purpose of session persistence: keep certain parameters alive across requests.

Cookies and sessions were introduced precisely to work around this statelessness.

Typical application scenario: simulating a login.
1. In Requests, methods such as get() and post() do let you reproduce a page request, but each call is effectively a separate conversation. You send a request, the server processes it and returns a result, and that exchange is finished; the next request is another completely independent exchange. There is no wire that stays connected between your computer and the server across all of your requests.

2. That is because every web page we visit is fetched over HTTP, and HTTP is a stateless protocol. Statelessness means the protocol keeps no memory of earlier transactions, so if a later step needs information from an earlier one, that information has to be sent again, which can increase the amount of data transferred on every connection.

3. A session object lets you keep certain parameters across requests. With plain HTTP alone, for example, even after logging in to a site successfully, the login state disappears as soon as you visit another page of that site and you would have to log in again; every page change would mean logging in over and over, which is very inconvenient.

4. So we need some way to save the relevant session information, such as the fact that a login succeeded.

5. The two most common approaches are saving session information in cookies or saving it in a server-side session. Sessions and cookies exist exactly to preserve HTTP connection state.

(1) Maintaining a session with cookies

# A cookie only preserves a piece of state: the logged-in user's identity (account + password)
'''
Pros: carrying a post-login cookie lets you reach pages that can only be viewed after logging in
Cons: it significantly raises the risk of your crawler being caught by anti-scraping measures
'''
import requests
# When building the disguised identity, the dict can hold more than one header field
head = {
    'cookie': '_ga=GA1.2.785088129.1663232728; UM_distinctid=18340655da2d57-01950da96446bf-26021c51-1fa400-18340655da312ae; _uab_collina=166323273910417053476305; __bid_n=1837d80e99661cd5874207; FPTOKEN=30$9ngowMFJIxfTt1ev3pV2NXZAR1KbxMRyvGdkwb1lSg7CK9kuoW4m9Tk64Yu2lhl15nidSlJoLkzDf2olHMDNYGwLjlHgyUfUO0wo/mE3LAvVc6V9fiJoFJVsWbl+ey1UCIIa1LpD066jfWClPLqnSi9ALdVA3OhtPKgV0FZEHXAyXI6j6WeE12B0jyXKVyrGe6R7EAVw/WzAXbu6WPEw51Pb7v+eU8Ega4xB/mzavBw11ihqg4/P0CSJH+wgvL+32DFO8SX5sSC8WG0eFKgky17uvn9ncoeELPmpK8OtYdRPNXcuIAN0CZYRI1UcjwXzeL+0pkKTBK14NBK1yOxkTjsPIvSfnGVxyOX+9Sj9py6bzOEWwGLFgOcfCq31ujs1|3xEtlkLt1BztK1OkB9Pdv9NZ6/mPKfenBOhlzlFYmnM=|10|632238864fad98e98e4f0f72baa5114a; __yadk_uid=FElYEfEw4rZzvSwnxEBuIunz7U248yoN; ssxmod_itna=Yq0xuD0CitY42Dl4iwlxjg4UOG=G8iK4EU+zDBMgd4iNDnD8x7YDv++vYi+KeqoAxaYqvFBTp4eftUChxhdg=Y4GLDmKDyQipieDxaq0rD74irDDxD3Db3dDSDWKD9D048yRvLKGWDbx=Df4DmDGYneqDgDYQDGMpLD7QDIq6dbD64X32B+l=q3lh5iqDMIeGX/BWeiLrWaPhM3eWrZqP2DB=CxBjg+qXj+eDHFdNSlLvoiOGo+73X/iPqzGGPIDxqUAg3zGd=n9hzj/honihIpGvnxDfo3+f5eD; ssxmod_itna2=Yq0xuD0CitY42Dl4iwlxjg4UOG=G8iK4EUD8unikxGXwoGaIFKGakrGx8gxKw=5N7kDULvxqPcaMhKpKlgjIt=rO12B5sH7hOyGf=MbBM+61M2TLBflRoKLtu+odVCkRjyPTjF1iV2NICf=Dg8fxOOfzeLvCSirwlODFlwNvlk=U+rvdMbb24+rUbLvr/QvFbu0+7mDUnxL0IGPzYkDNVdIoRq9emO=vFbL3jb=7DuBczq9wj35D1S2uzd3TD7j5iDEUbZ+7Q7Gy9tE5SOhX6rC7Y7DDFqD2iiD=; Hm_lvt_1c6cd6a9fd5f687d8040b44bebe5d395=1668671845,1668866044,1669442723,1669470014; CNZZDATA1279807957=801573648-1665130425-https%253A%252F%252Fwww.jianshu.com%252F%7C1669876060; _gid=GA1.2.1099201007.1673436819; FPTOKEN=4RfuzrZcNGkGUiwjAs4GFSrHaR9soX5FnEUc0wolx6zCv/uRqOWge0hit4Gb+HGv/CLyDiJovZLLBUyxkqhFFkrnAq6Pz2aKzVeE2Xl3/YAPemuqkPlqs4fzt24HQAAUQmX2lf1fA0DY2oQl/c8H1kO/nfa9zxhPjyJ/lrEy55oB3YU1mfjqVaQqA+57/y7c1G47kwBFkUfPahcb5MpHQYsMAALDm5KfVknmoEl3TqzFpgawjfN5aImZs/360CdOPPLpVkb36a7dvFZxaNaOl156AUC8Ld+OzqkUjiVCbqmFdtCxrN98F5QirLNYluVTVIGmqVtCyFUqV3ggLWqJxVpht3//wCi8WrJa5TPHp8hj+v6bX8pBnyemQgRmtssunNvpKKzWdxPg83cCuns3lw==|iS9jPE7pxhbfClB2d2NbjsWDrQz6SPBpE3Err5zO0NE=|10|685826e6ca2a68a398109d51188b2dd4; read_mode=day; default_font=font2; locale=zh-CN; Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1673436819,1673441882; remember_user_token=W1syNjQ2MDkwMF0sIiQyYSQxMSRJUi9IRks4T1BUWFo5bW14Q3BTbTIuIiwiMTY3MzQ0Mjc4Mi4xNjk5MjQ1Il0%3D--0857cd554ddbc5528cc64d8c32571f2ab488eb55; web_login_version=MTY3MzQ0Mjc4Mg%3D%3D--24038b98ea240da4d843d8275f21d52a8477429f; _m7e_session_core=1006290779d7b90dc779d8b6345b9769; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2226460900%22%2C%22first_id%22%3A%2218340655bf81169-0562c153136d52-26021c51-2073600-18340655bf911c9%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%A4%BE%E4%BA%A4%E7%BD%91%E7%AB%99%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fopen.weixin.qq.com%2F%22%2C%22%24latest_utm_source%22%3A%22recommendation%22%2C%22%24latest_utm_medium%22%3A%22seo_notes%22%2C%22%24latest_utm_campaign%22%3A%22maleskine%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%2C%22%24device_id%22%3A%2218340655bf81169-0562c153136d52-26021c51-2073600-18340655bf911c9%22%7D; Hm_lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1673442781; BAIDU_SSP_lcr=https://open.weixin.qq.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
}
response = requests.get("https://www.jianshu.com/",headers = head)
print(response.text)

(2) Maintaining a session with a Session object

import requests
'''Let the server know that you are still the same client as in the previous request
Typical scenario: captchas / login flows
'''

# Create a session object
s = requests.session()
# Send requests through the session object
s.get('https://www.baidu.com/')
response = s.get('https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=python&fenlei=256&rsv_pq=0xa7948b5d00151ad4&rsv_t=d71fImFoqkgjlERKUqeQ5yYhnF3QMpfLrwwzluexY9jqhRHMfCrJOiIDuI5I&rqlang=en&rsv_enter=1&rsv_dl=tb&rsv_sug3=7&rsv_sug1=6&rsv_sug7=101&rsv_sug2=0&rsv_btype=i&prefixsug=python&rsp=5&inputT=1278&rsv_sug4=1849&rsv_sug=2')
print(response.text)
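To see the session actually carrying state across requests, httpbin provides a pair of endpoints that set and then echo cookies; the cookie set by the first request shows up in the second only because both go through the same session object (a minimal sketch):

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')  # the server sets a cookie
r = s.get('http://httpbin.org/cookies')                          # the session sends it back automatically
print(r.json())  # {'cookies': {'sessioncookie': '123456789'}}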

Proxy settings

# Target site: https://www.baidu.com
url= 'https://www.baidu.com'
r = requests.get(url)
print(r.status_code)
url= 'https://www.baidu.com'
# Build the proxy IP information in a dict (this IP:port is only an example and may no longer be live)
p = {
    'http': 'http://121.13.252.58:41564',
    'https': 'http://121.13.252.58:41564',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
}
r = requests.get(url, headers=headers, proxies=p)  # proxies: route the request through the proxy IP
print(r.text)

Timeout settings

# Target site: https://www.baidu.com
url = 'https://www.baidu.com'
r = requests.get(url, timeout=0.0000000001)  # an absurdly small timeout that is guaranteed to expire
print(r.status_code)
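In real code the timeout is normally a few seconds, and Requests also accepts a (connect, read) tuple to limit the two phases separately; a minimal sketch:

import requests

r = requests.get('https://www.baidu.com', timeout=5)           # one number: applies to both connect and read
r = requests.get('https://www.baidu.com', timeout=(3.05, 27))  # tuple: (connect timeout, read timeout)
print(r.status_code)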

Exception handling

import requests

url = 'https://www.baidu.com'
try:
    r = requests.get(url, timeout=0.0000000001)
    print(r.status_code)
except requests.exceptions.Timeout:  # catch the specific timeout exception rather than a bare except
    print('timeout!')
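All of the library's exceptions inherit from requests.exceptions.RequestException, so error handling can stay specific to Requests while still covering connection failures and bad status codes; a minimal sketch:

import requests

try:
    r = requests.get('https://www.baidu.com', timeout=5)
    r.raise_for_status()  # turn 4xx/5xx status codes into HTTPError
except requests.exceptions.HTTPError as e:
    print('bad status:', e)
except requests.exceptions.ConnectionError as e:
    print('network problem:', e)
except requests.exceptions.RequestException as e:
    print('other requests error:', e)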
