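I. Common ways to send requests
1. The urllib package (Python 3)
Python 2 shipped two separate modules, urllib and urllib2. urllib provided the lower-level interface, and urllib2 was a further wrapper built on top of it.
Python 3 merged the two into a single standard-library package named urllib, which is split into several submodules.
2. The urllib3 library
The Python 3 standard-library urllib is enough for basic crawling, but it lacks some key features. The third-party library urllib3 (not part of the standard library) provides them, connection pool management being one example.
3. The requests library
requests is built on top of urllib3 but offers a friendlier, more convenient API, and it is used comparatively often.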
II. Basic usage of the urllib package (Python 3)
- urllib.request: opens and reads URLs
- urllib.parse: parses URLs
- urllib.error: the exceptions raised by urllib.request
- urllib.robotparser: parses robots.txt files
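Of the four submodules, urllib.robotparser is the only one the rest of this post does not demonstrate, so here is a minimal sketch. The robots.txt rules, the example.com URLs, and the crawler name "MyCrawler" are made up for illustration. The rules are passed to parse() directly so the example runs without network access; for a live site you would typically call set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline instead of being downloaded
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse() accepts an iterable of lines

allowed_home = parser.can_fetch("MyCrawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")
print(allowed_home)     # True
print(allowed_private)  # False
```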
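1. urllib.request
1.1 The urllib.request.urlopen function
urlopen(url, data=None)
url is either a URL string or a Request object.
data is the data to submit, as bytes. For HTTP, leaving it as None sends a GET request; passing data sends a POST request.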
from urllib.request import urlopen
response = urlopen('http://www.bing.com')  # GET request
with response:  # the response supports the context manager protocol
    print(1, type(response))  # http.client.HTTPResponse, a file-like object
    print(2, response.status)  # status code
    print(3, response.reason)  # OK
    print(4, response.geturl())  # the real URL after redirects
    print(5, response.read())  # the page HTML, as bytes
# Output
1 <class 'http.client.HTTPResponse'>
2 200
3 OK
4 http://cn.bing.com/?setmkt=zh-CN
5 b'<!DOCTYPE html>......</script></html>'
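urlopen raises exceptions from urllib.error when a request fails, so a crawler usually wraps the call. The sketch below is one possible way to do that; the helper name fetch and its return convention are assumptions made for this example, not part of urllib.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return (True, body_bytes) on success or (False, error_message) on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return True, response.read()
    except HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return False, "HTTP {} {}".format(exc.code, exc.reason)
    except URLError as exc:
        # No usable response: DNS failure, refused connection, missing file, ...
        return False, "URL error: {}".format(exc.reason)
```

Calling fetch('http://www.bing.com') would return the page bytes on success. A caller can then branch on the first element of the tuple instead of handling exceptions at every call site.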
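1.2 The urllib.request.Request class
Request(url, data=None, headers={})
The constructor builds a request object. A dict of headers can be passed in. The data argument decides whether the request is a GET or a POST (both are covered below).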
from urllib.request import Request, urlopen
url = 'http://www.bing.com/'
ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",  # Chrome
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 (KHTML, like Gecko) Version/5.0.1 Safari/537.36",  # Safari
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0",  # Firefox
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"  # IE
]
request = Request(url)
request.add_header('User-Agent', ua_list[1])
response = urlopen(request, timeout=20)  # accepts a Request object or a URL string
with response:
    pass
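The statement that data decides between GET and POST can be checked without sending anything, because a Request object exposes the method it would use. The sketch below only builds objects; the User-Agent value 'demo-agent/1.0' is a made-up placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://httpbin.org/post'

get_request = Request(url, headers={'User-Agent': 'demo-agent/1.0'})
post_request = Request(url, data=urlencode({'age': '6'}).encode('utf-8'))

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
print(post_request.data)          # b'age=6'

# Request stores header names via str.capitalize(), so look them up that way
print(get_request.get_header('User-agent'))  # demo-agent/1.0
print(get_request.header_items())            # [('User-agent', 'demo-agent/1.0')]
```

This also explains why both 'User-Agent' and 'User-agent' work when adding the header: the name is normalized to 'User-agent' internally.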
2. urllib.parse
This module handles URL encoding and decoding.
urllib.parse.urlencode
The first argument of urlencode must be a dict or a sequence of two-element tuples.
from urllib import parse
u = parse.urlencode({
'url': "https://cn.bing.com/search?q=python语言"
})
print(u)
# Output
url=https%3A%2F%2Fcn.bing.com%2Fsearch%3Fq%3Dpython%E8%AF%AD%E8%A8%80
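urlencode is only one of the helpers in urllib.parse. As a small sketch of the decoding direction and of a few related functions, the following reuses the same search term; the 'page' parameter is added purely for illustration.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit

# urlencode also accepts a sequence of two-element tuples
query = urlencode([('q', 'python语言'), ('page', 2)])
print(query)  # q=python%E8%AF%AD%E8%A8%80&page=2

# parse_qs reverses urlencode; every value comes back as a list of strings
print(parse_qs(query))  # {'q': ['python语言'], 'page': ['2']}

# quote / unquote work on a single string rather than on key-value pairs
encoded = quote('python语言')
print(encoded)           # python%E8%AF%AD%E8%A8%80
print(unquote(encoded))  # python语言

# urlsplit breaks a URL into its components
parts = urlsplit('https://cn.bing.com/search?q=' + encoded)
print(parts.netloc, parts.path, parts.query)
# cn.bing.com /search q=python%E8%AF%AD%E8%A8%80
```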
3. GET requests
from urllib.request import Request, urlopen
from urllib.parse import urlencode
keyword = input('>> search for: ')
data = urlencode({
    'q': keyword
})
# Build the URL
base_url = 'http://cn.bing.com/search'
url = '{}?{}'.format(base_url, data)
# Set the User-Agent header so the request looks like it comes from a browser
userAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
request = Request(url, headers={'User-agent': userAgent})
response = urlopen(request)
with response:
    pass  # process the response here
print("=======END==========")
4. POST requests
from urllib.request import Request, urlopen
from urllib.parse import urlencode
url = 'http://httpbin.org/post'  # http://httpbin.org/ is a test site
request = Request(url)
request.add_header(
    'User-agent',
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
)
data = urlencode({'name':'张三,@=/&*', 'age':'6'})
# data can also be supplied through the Request class, e.g. Request(url, data=data.encode())
response = urlopen(request, data=data.encode())  # POST, submits the data as a form
with response:
    print(response.read())
# Output
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "age": "6", \n "name": "\\u5f20\\u4e09,@=/&*"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "47", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"\n }, \n "json": null, \n "origin": "114.250.100.128, 114.250.100.128", \n "url": "https://httpbin.org/post"\n}\n'
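The response is JSON delivered as bytes, so it is easier to read once decoded. The sketch below shows the encoded form body that was sent and then parses a trimmed copy of the response (only the "form" field is kept, which is an assumption made to keep the example short and runnable offline).

```python
import json
from urllib.parse import urlencode

data = urlencode({'name': '张三,@=/&*', 'age': '6'})
print(data)                # name=%E5%BC%A0%E4%B8%89%2C%40%3D%2F%26%2A&age=6
print(len(data.encode()))  # 47, the Content-Length echoed back by httpbin

# Trimmed copy of the response body shown above (only the "form" field kept)
body = b'{"form": {"age": "6", "name": "\\u5f20\\u4e09,@=/&*"}}'
form = json.loads(body)["form"]
print(form["name"])  # 张三,@=/&*
print(form["age"])   # 6
```

The \u5f20\u4e09 escapes in the raw output are simply the JSON encoding of 张三, so the server received the form values intact.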
III. Basic usage of the urllib3 library
import urllib3
# Open a URL and get a response object back
url = 'https://movie.douban.com/'
userAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
# Connection pool manager
with urllib3.PoolManager() as http:
    response = http.request('GET', url, headers={
        'User-Agent': userAgent
    })
    print(response.status)  # status code
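To make the connection pooling more concrete, here is a rough sketch that only creates pools and sends no request. It assumes urllib3 is installed; the timeout, retry, and num_pools values are arbitrary, and the '/explore' path is there only to show a second URL on the same host.

```python
import urllib3

# Arbitrary settings chosen for illustration; no request is sent in this sketch
http = urllib3.PoolManager(
    num_pools=10,  # maximum number of per-host pools to keep
    timeout=urllib3.Timeout(connect=2.0, read=5.0),
    retries=urllib3.Retry(total=3),
)

pool_a = http.connection_from_url('https://movie.douban.com/')
pool_b = http.connection_from_url('https://movie.douban.com/explore')
pool_c = http.connection_from_url('https://www.douban.com/')

print(pool_a is pool_b)  # True: same scheme, host and port share one pool
print(pool_a is pool_c)  # False: a different host gets its own pool
http.clear()             # discard the pools created above
```

Requests made through the same PoolManager to the same host can reuse connections from that host's pool instead of opening a new one each time.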
IV. Basic usage of the requests library (the most common choice)
import requests
url = 'https://movie.douban.com/'
userAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
response = requests.request('GET', url, headers={'User-Agent': userAgent})  # send the request
with response:
    print(response.url)  # https://movie.douban.com/
    print(response.status_code)  # 200
    print(response.request.headers)  # request headers
    print(response.headers)  # response headers
    print(response.text)  # the HTML content
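For comparison with the manual urlencode step in the urllib GET example, requests can build the query string itself through the params argument. The sketch below prepares a request without sending it, so it runs offline; it assumes requests is installed, and 'demo-agent/1.0' is a placeholder User-Agent.

```python
import requests

# Build a request without sending it, to inspect what requests would transmit
request = requests.Request(
    'GET',
    'http://cn.bing.com/search',
    params={'q': 'python语言'},
    headers={'User-Agent': 'demo-agent/1.0'},
)
prepared = request.prepare()

print(prepared.method)  # GET
print(prepared.url)     # http://cn.bing.com/search?q=python%E8%AF%AD%E8%A8%80
print(prepared.headers['User-Agent'])  # demo-agent/1.0
```

In everyday use the same params and headers arguments can be passed straight to requests.get(), which prepares and sends the request in one call.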