Web Crawling Basics: Requests
Preface
This article covers the basic usage of the requests library.
1. Basic requests Operations
1.1 Introductory Example
import requests
response = requests.get('https://www.baidu.com/')
print(type(response))
print(response.status_code)
print(type(response.text))
print(response.text)
print(response.cookies)
1.2 Various Request Methods
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
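All of these helpers are thin wrappers around `requests.request(method, url)`. A small sketch below uses a prepared `Request` to show the result without sending anything over the network:

```python
import requests

# get, post, put, delete, head, options all delegate to
# requests.request(method, url); preparing a Request shows the
# final method and URL without performing any network I/O.
req = requests.Request('DELETE', 'http://httpbin.org/delete').prepare()
print(req.method)  # DELETE
print(req.url)     # http://httpbin.org/delete
```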
1.3 Requests
1.3.1 Basic GET Requests
# 1. Basic usage
import requests
response = requests.get("http://httpbin.org/get")
print(response.text)
# 2. GET request with parameters
response = requests.get("http://httpbin.org/get?name=germey&age=22")
print(response.text)
# Another way to pass the parameters.
data = {
    'name': 'germey',
    'age': 22
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)
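How the params dict is encoded into the query string can be checked without touching the network; a small sketch with a prepared request shows the final URL:

```python
import requests

# The params dict is URL-encoded into the query string; preparing the
# request (instead of sending it) reveals the URL that would be fetched.
data = {'name': 'germey', 'age': 22}
req = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()
print(req.url)  # http://httpbin.org/get?name=germey&age=22
```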
# 3. Parsing JSON
import requests
import json
response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))
# 4. Fetching binary data
response = requests.get("https://github.com/favicon.ico")
print(type(response.text), type(response.content))
print(response.text)
print(response.content)
response = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(response.content)
# 5. Adding headers
response = requests.get("https://www.zhihu.com/explore")
print(response.text)
Some pages respond with a 403 here, so the page content cannot be retrieved; most likely the site has an anti-crawler mechanism that rejects requests lacking browser-like headers.
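A common workaround is retrying with a browser-style User-Agent, which is often enough to get past a simple 403 check. The sketch below prepares the request rather than sending it, so it only demonstrates the header that would be sent; the User-Agent string is just an example value.

```python
import requests

# A browser-like User-Agent header; the exact string is an example.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36'
}
# Preparing (not sending) the request shows the header it will carry.
req = requests.Request('GET', 'https://www.zhihu.com/explore',
                       headers=headers).prepare()
print(req.headers['User-Agent'].startswith('Mozilla/5.0'))  # True
# To actually fetch the page:
# requests.get('https://www.zhihu.com/explore', headers=headers)
```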
1.3.2 Basic POST Requests
data = {'name': 'germey', 'age': '22'}
response = requests.post("http://httpbin.org/post", data=data)
print(response.text)
data = {'name': 'germey', 'age': '22'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.post("http://httpbin.org/post", data=data, headers=headers)
print(response.json())
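Besides form data, a JSON body can be sent with the `json=` keyword, which serializes the dict and sets the Content-Type header automatically. A small offline sketch with a prepared request:

```python
import requests
import json

# json= serializes the dict to a JSON body and sets Content-Type;
# preparing the request shows both without sending it.
payload = {'name': 'germey', 'age': 22}
req = requests.Request('POST', 'http://httpbin.org/post', json=payload).prepare()
print(req.headers['Content-Type'])  # application/json
print(json.loads(req.body))
```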
1.4 Responses
1.4.1 Response Attributes
response = requests.get('http://www.jianshu.com')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)
1.4.2 Checking the Status Code
import requests

response = requests.get('http://www.jianshu.com/hello.html')
exit() if not response.status_code == requests.codes.not_found else print('404 Not Found')

response = requests.get('http://www.jianshu.com')
exit() if not response.status_code == 200 else print('Request Successfully')
1.4.3 Common Status Codes
- Informational and successful responses
Status code | requests.codes aliases |
---|---|
100 | ('continue',) |
101 | ('switching_protocols',) |
102 | ('processing',) |
103 | ('checkpoint',) |
122 | ('uri_too_long', 'request_uri_too_long') |
200 | ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\o/', '✓') |
201 | ('created',) |
202 | ('accepted',) |
203 | ('non_authoritative_info', 'non_authoritative_information') |
204 | ('no_content',) |
205 | ('reset_content', 'reset') |
206 | ('partial_content', 'partial') |
207 | ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati') |
208 | ('already_reported',) |
226 | ('im_used',) |
- Redirection
Status code | requests.codes aliases |
---|---|
300 | ('multiple_choices',) |
301 | ('moved_permanently', 'moved', '\o-') |
302 | ('found',) |
303 | ('see_other', 'other') |
304 | ('not_modified',) |
305 | ('use_proxy',) |
306 | ('switch_proxy',) |
307 | ('temporary_redirect', 'temporary_moved', 'temporary') |
308 | ('permanent_redirect', 'resume_incomplete', 'resume') |
- Client Error
Status code | requests.codes aliases |
---|---|
400 | ('bad_request', 'bad') |
401 | ('unauthorized',) |
402 | ('payment_required', 'payment') |
403 | ('forbidden',) |
404 | ('not_found', '-o-') |
405 | ('method_not_allowed', 'not_allowed') |
406 | ('not_acceptable',) |
407 | ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication') |
408 | ('request_timeout', 'timeout') |
409 | ('conflict',) |
410 | ('gone',) |
411 | ('length_required',) |
412 | ('precondition_failed', 'precondition') |
413 | ('request_entity_too_large',) |
414 | ('request_uri_too_large',) |
415 | ('unsupported_media_type', 'unsupported_media', 'media_type') |
416 | ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable') |
417 | ('expectation_failed',) |
418 | ('im_a_teapot', 'teapot', 'i_am_a_teapot') |
421 | ('misdirected_request',) |
422 | ('unprocessable_entity', 'unprocessable') |
423 | ('locked',) |
424 | ('failed_dependency', 'dependency') |
425 | ('unordered_collection', 'unordered') |
426 | ('upgrade_required', 'upgrade') |
428 | ('precondition_required', 'precondition') |
429 | ('too_many_requests', 'too_many') |
431 | ('header_fields_too_large', 'fields_too_large') |
444 | ('no_response', 'none') |
449 | ('retry_with', 'retry') |
450 | ('blocked_by_windows_parental_controls', 'parental_controls') |
451 | ('unavailable_for_legal_reasons', 'legal_reasons') |
499 | ('client_closed_request',) |
- Server Error
Status code | requests.codes aliases |
---|---|
500 | ('internal_server_error', 'server_error', '/o\', '✗') |
501 | ('not_implemented',) |
502 | ('bad_gateway',) |
503 | ('service_unavailable', 'unavailable') |
504 | ('gateway_timeout',) |
505 | ('http_version_not_supported', 'http_version') |
506 | ('variant_also_negotiates',) |
507 | ('insufficient_storage',) |
509 | ('bandwidth_limit_exceeded', 'bandwidth') |
510 | ('not_extended',) |
511 | ('network_authentication_required', 'network_auth', 'network_authentication') |
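Every alias in the tables above is an attribute of `requests.codes`, and all aliases for the same status compare equal to the same integer:

```python
import requests

# requests.codes maps readable aliases to status-code integers.
print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404
print(requests.codes.teapot)     # 418
# Aliases for the same status are the same number.
print(requests.codes.ok == requests.codes.all_good)  # True
```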
2. Advanced requests Operations
2.1 File Upload
import requests

# Open the file in binary mode; the with block ensures it is closed.
with open('favicon.ico', 'rb') as f:
    files = {'file': f}
    response = requests.post("http://httpbin.org/post", files=files)
print(response.text)
2.2 Getting Cookies
import requests
response = requests.get("https://www.baidu.com")
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)
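Cookies can also be attached to an outgoing request with the `cookies=` keyword; preparing the request (rather than sending it) shows the Cookie header it produces:

```python
import requests

# A cookies dict becomes a Cookie header on the prepared request.
req = requests.Request('GET', 'http://httpbin.org/cookies',
                       cookies={'number': '123456789'}).prepare()
print(req.headers['Cookie'])  # number=123456789
```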
2.3 Simulating Login (Session Maintenance)
import requests

# Two independent requests do not share cookies, so the cookie set by
# the first call is gone by the second:
requests.get('http://httpbin.org/cookies/set/number/123456789')
response = requests.get('http://httpbin.org/cookies')
print(response.text)

# A Session keeps cookies across requests:
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')
print(response.text)
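Besides cookies, a Session also carries default headers: values set on `s.headers` are merged into every request the session sends. A minimal sketch (the User-Agent string below is a made-up example):

```python
import requests

# Headers set on the session apply to every request it makes.
s = requests.Session()
s.headers.update({'User-Agent': 'my-crawler/0.1'})  # hypothetical UA string
print(s.headers['User-Agent'])  # my-crawler/0.1
```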
2.4 Certificate Verification
# If the site's certificate cannot be verified, this raises an SSLError.
response = requests.get('https://www.12306.cn')
print(response.status_code)
import requests
from requests.packages import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False.
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
# cert takes the client certificate and key needed for authentication (the paths are placeholders).
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)
2.5 Proxy Settings
proxies = {
    "http": "http://127.0.0.1:1234",
    "https": "https://127.0.0.1:1234",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
# The connection fails here, so any proxy you supply must actually be reachable.
# A proxy that requires authentication takes user:password in the URL:
proxies = {
    "http": "http://user:password@127.0.0.1:1324/",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
2.6 Timeout Settings
import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
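The timeout can also be a `(connect, read)` tuple to bound each phase separately, and both timeout exceptions share a common `Timeout` base class, so one handler catches either. A small sketch (the timeout values are illustrative):

```python
import requests
from requests.exceptions import ConnectTimeout, ReadTimeout, Timeout

# Both timeout exceptions derive from Timeout, so a single
# `except Timeout:` clause covers connect and read timeouts alike.
print(issubclass(ConnectTimeout, Timeout))  # True
print(issubclass(ReadTimeout, Timeout))     # True
# Example call with separate connect/read limits (values are examples):
# requests.get('http://httpbin.org/get', timeout=(3.0, 5.0))
```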
2.7 Authentication
import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)
import requests
# A more concise form of the call above.
r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)
2.8 Exception Handling
import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException
try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except ConnectionError:
    print('Connection error')
except RequestException:
    print('Error')
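Related to the handlers above, `response.raise_for_status()` converts a 4xx/5xx status code into an `HTTPError`, which is itself a `RequestException`, so the broadest handler still catches it:

```python
import requests
from requests.exceptions import HTTPError, RequestException

# HTTPError (raised by raise_for_status) is a RequestException too.
print(issubclass(HTTPError, RequestException))  # True
# Typical use:
# response = requests.get(url)
# response.raise_for_status()  # raises HTTPError on 4xx/5xx
```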
Summary
This article covered the common usage of requests and how to apply it in different scenarios. Recognizing status codes, for instance, tells us how our crawler's requests are being received, and each scenario above comes with an example for reference; interested readers can dig deeper.
Off I go, my head hurts. Loading (30/100)...