Day 30: Crawler Basics with Requests


Preface

This article covers the basic usage of the requests library.


1. Basic operations with requests

1.1 A first example

import requests

response = requests.get('https://www.baidu.com/')
print(type(response))
print(response.status_code)
print(type(response.text))
print(response.text)
print(response.cookies)

1.2 Request methods

requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')

1.3 Requests

1.3.1 Basic GET requests

# 1. Basic usage
import requests

response = requests.get("http://httpbin.org/get")
print(response.text)


# 2. GET with query parameters
response = requests.get("http://httpbin.org/get?name=germey&age=22")
print(response.text)


# An alternative way to pass the parameters
data = {
    'name': 'germey',
    'age': 22
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)


# 3. Parsing JSON
import requests
import json

response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))


# 4. Fetching binary data

response = requests.get("https://github.com/favicon.ico")
print(type(response.text), type(response.content))
print(response.text)
print(response.content)

response = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(response.content)
    # no explicit f.close() needed: the with-block closes the file
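For larger binary files, reading `response.content` pulls the whole body into memory at once. A sketch of a streamed download instead, assuming the same favicon URL (the output filename is just an example):

```python
import requests

# stream=True defers downloading the body; iter_content then yields
# fixed-size chunks, so the whole file never sits in memory at once
response = requests.get("https://github.com/favicon.ico", stream=True)
with open('favicon_streamed.ico', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
print('saved favicon_streamed.ico')
```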


# 5. Adding headers
response = requests.get("https://www.zhihu.com/explore")
print(response.text)

Here the response comes back as 403 and the page content cannot be retrieved, so the site very likely has an anti-crawler mechanism rejecting requests without browser-like headers.
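A common workaround is to send a browser-like User-Agent header. A minimal sketch (the UA string here is just an example); preparing the request locally shows exactly what would be sent, without touching the network:

```python
import requests

# A browser-like User-Agent often satisfies simple anti-crawler checks
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0 Safari/537.36'
}
# Prepare the request locally to inspect the outgoing headers
prepared = requests.Request('GET', 'https://www.zhihu.com/explore',
                            headers=headers).prepare()
print(prepared.headers['User-Agent'])
# To actually send it: requests.get('https://www.zhihu.com/explore', headers=headers)
```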

1.3.2 Basic POST requests

import requests

data = {'name': 'germey', 'age': '22'}
response = requests.post("http://httpbin.org/post", data=data)
print(response.text)


data = {'name': 'germey', 'age': '22'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.post("http://httpbin.org/post", data=data, headers=headers)
print(response.json())
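Besides form fields via `data=`, a dict can also be sent as a JSON body with the `json=` keyword, which serializes it and sets the Content-Type header. Preparing both requests locally shows the difference without hitting the network:

```python
import requests

payload = {'name': 'germey', 'age': 22}

# data= sends application/x-www-form-urlencoded fields
form = requests.Request('POST', 'http://httpbin.org/post', data=payload).prepare()
print(form.headers['Content-Type'], form.body)

# json= serializes the dict and sets Content-Type: application/json
js = requests.Request('POST', 'http://httpbin.org/post', json=payload).prepare()
print(js.headers['Content-Type'], js.body)
```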

1.4 Responses

1.4.1 Response attributes

import requests

response = requests.get('http://www.jianshu.com')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)

1.4.2 Checking the status code

import requests

response = requests.get('http://www.jianshu.com/hello.html')
if response.status_code == requests.codes.not_found:
    print('404 Not Found')


import requests

response = requests.get('http://www.jianshu.com')
if response.status_code == 200:
    print('Request Successfully')
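In practice, rather than comparing status codes by hand, `Response.raise_for_status()` raises an `HTTPError` for any 4xx/5xx status. A sketch using a hand-built Response object purely for illustration (normally the object comes back from `requests.get`):

```python
import requests
from requests.exceptions import HTTPError

# Hand-built Response for illustration; requests.get() returns one of these
response = requests.models.Response()
response.status_code = 404

try:
    response.raise_for_status()  # raises HTTPError for 4xx/5xx codes
except HTTPError as e:
    print('Request failed:', e)

response.status_code = 200
response.raise_for_status()  # 2xx: returns None, nothing is raised
print('200 passes silently')
```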

1.4.3 Common status codes

  1. Informational and successful responses
Status code: alias names
100(‘continue’,)
101(‘switching_protocols’,)
102(‘processing’,)
103(‘checkpoint’,)
122(‘uri_too_long’, ‘request_uri_too_long’)
200(‘ok’, ‘okay’, ‘all_ok’, ‘all_okay’, ‘all_good’, ‘\o/’, ‘✓’)
201(‘created’,)
202(‘accepted’,)
203(‘non_authoritative_info’, ‘non_authoritative_information’)
204(‘no_content’,)
205(‘reset_content’, ‘reset’)
206(‘partial_content’, ‘partial’)
207(‘multi_status’, ‘multiple_status’, ‘multi_stati’, ‘multiple_stati’)
208(‘already_reported’,)
226(‘im_used’,)
  2. Redirection
Status code: alias names
300(‘multiple_choices’,)
301(‘moved_permanently’, ‘moved’, ‘\o-’)
302(‘found’,)
303(‘see_other’, ‘other’)
304(‘not_modified’,)
305(‘use_proxy’,)
306(‘switch_proxy’,)
307(‘temporary_redirect’, ‘temporary_moved’, ‘temporary’)
308(‘permanent_redirect’,‘resume_incomplete’, ‘resume’,), # These 2 to be removed in 3.0
  3. Client Error
Status code: alias names
400(‘bad_request’, ‘bad’)
401(‘unauthorized’,)
402(‘payment_required’, ‘payment’)
403(‘forbidden’,)
404(‘not_found’, ‘-o-’)
405(‘method_not_allowed’, ‘not_allowed’)
406(‘not_acceptable’,)
407(‘proxy_authentication_required’, ‘proxy_auth’, ‘proxy_authentication’)
408(‘request_timeout’, ‘timeout’)
409(‘conflict’,)
410(‘gone’,)
411(‘length_required’,)
412(‘precondition_failed’, ‘precondition’)
413(‘request_entity_too_large’,)
414(‘request_uri_too_large’,)
415(‘unsupported_media_type’, ‘unsupported_media’, ‘media_type’)
416(‘requested_range_not_satisfiable’, ‘requested_range’, ‘range_not_satisfiable’)
417(‘expectation_failed’,)
418(‘im_a_teapot’, ‘teapot’, ‘i_am_a_teapot’)
421(‘misdirected_request’,)
422(‘unprocessable_entity’, ‘unprocessable’)
423(‘locked’,)
424(‘failed_dependency’, ‘dependency’)
425(‘unordered_collection’, ‘unordered’)
426(‘upgrade_required’, ‘upgrade’)
428(‘precondition_required’, ‘precondition’)
429(‘too_many_requests’, ‘too_many’)
431(‘header_fields_too_large’, ‘fields_too_large’)
444(‘no_response’, ‘none’)
449(‘retry_with’, ‘retry’)
450(‘blocked_by_windows_parental_controls’, ‘parental_controls’)
451(‘unavailable_for_legal_reasons’, ‘legal_reasons’)
499(‘client_closed_request’,)
  4. Server Error
Status code: alias names
500(‘internal_server_error’, ‘server_error’, ‘/o\’, ‘✗’)
501(‘not_implemented’,)
502(‘bad_gateway’,)
503(‘service_unavailable’, ‘unavailable’)
504(‘gateway_timeout’,)
505(‘http_version_not_supported’, ‘http_version’)
506(‘variant_also_negotiates’,)
507(‘insufficient_storage’,)
509(‘bandwidth_limit_exceeded’, ‘bandwidth’)
510(‘not_extended’,)
511(‘network_authentication_required’, ‘network_auth’, ‘network_authentication’)
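All of the alias names above live in `requests.codes`, a lookup table usable both through attribute access and as dictionary keys:

```python
import requests

# Each alias in the tables above maps to its numeric status code
print(requests.codes.ok)              # 200
print(requests.codes.not_found)       # 404
print(requests.codes['bad_gateway'])  # 502
print(requests.codes.teapot)          # 418
```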

2. Advanced operations with requests

2.1 File upload

import requests

files = {'file': open('favicon.ico', 'rb')}
response = requests.post("http://httpbin.org/post", files=files)
print(response.text)

2.2 Getting cookies

import requests

response = requests.get("https://www.baidu.com")
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)

2.3 Simulating login (sessions)

import requests

requests.get('http://httpbin.org/cookies/set/number/123456789')
response = requests.get('http://httpbin.org/cookies')
print(response.text)

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')
print(response.text)

2.4 Certificate verification

import requests

response = requests.get('https://www.12306.cn')
print(response.status_code)


import requests
import urllib3

urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)


# cert takes the paths to the client certificate and key needed for verification
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

2.5 Proxy settings

import requests

proxies = {
  "http": "http://127.0.0.1:1234",
  "https": "https://127.0.0.1:1234",
}

response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
# This fails to connect as written: the proxy you supply must actually be reachable.


proxies = {
    "http": "http://user:password@127.0.0.1:1324/",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

2.6 Timeout settings

import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
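The `timeout` value can also be a `(connect, read)` tuple, separating the connection deadline from the read deadline. A sketch under the assumption that httpbin.org is reachable:

```python
import requests
from requests.exceptions import ConnectTimeout, ReadTimeout, ConnectionError

# timeout=(connect, read): fail fast if the TCP connection stalls,
# while still allowing the server time to produce the response body
try:
    response = requests.get("http://httpbin.org/get", timeout=(3.05, 10))
    print(response.status_code)
except ConnectTimeout:
    print('Connecting took too long')
except ReadTimeout:
    print('Server took too long to respond')
except ConnectionError:
    print('Could not connect')
```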

2.7 Authentication

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)


import requests

# A more concise form of the above
r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)

2.8 Exception handling

import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except ConnectionError:
    print('Connection error')
except RequestException:
    print('Error')
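The order of the except clauses above matters: `ReadTimeout` and `ConnectionError` are both subclasses of `RequestException`, so the broad handler must come last or it would shadow the specific ones. This relationship can be checked directly:

```python
import requests

# Both specific exceptions derive from the catch-all RequestException
print(issubclass(requests.exceptions.ReadTimeout,
                 requests.exceptions.RequestException))       # True
print(issubclass(requests.exceptions.ConnectionError,
                 requests.exceptions.RequestException))       # True
```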

Summary

Today covered the common usage of requests and how it applies in different scenarios. Recognizing status codes, for instance, tells us a lot about the state of our crawler, and the examples above span several distinct situations for reference; interested readers can dig deeper into each one.

That's all for now, my head hurts. Loading (30/100)...
