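Python小咖: Web Crawler Basics
To install crawler-related libraries, you normally go to the Scripts directory under the Python installation directory, hold Shift and right-click inside the folder, choose the option that opens a command window there, and then run pip install <library name>.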
HTML elements
Tags
Selectors
Attributes
CSS
Script
Installing requests
PS D:\Program Files\Python37\Scripts> pip list
Package Version
---------- -------
numpy 1.17.2
pip 19.2.3
setuptools 41.2.0
wheel 0.33.6
PS D:\Program Files\Python37\Scripts> pip install requests
Collecting requests
Downloading
………
Installing collected packages: urllib3, chardet, idna, certifi, requests
Successfully installed certifi-2019.9.11 chardet-3.0.4 idna-2.8 requests-2.22.0 urllib3-1.25.3
PS D:\Program Files\Python37\Scripts> pip list
Package Version
---------- ---------
certifi 2019.9.11
chardet 3.0.4
idna 2.8
numpy 1.17.2
pip 19.2.3
requests 2.22.0
setuptools 41.2.0
urllib3 1.25.3
wheel 0.33.6
PS D:\Program Files\Python37\Scripts>
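The second listing shows that five packages were added automatically: urllib3, chardet, idna, certifi, and requests.
Is urllib3 less convenient than requests? (requests is built on top of urllib3 and wraps it in a simpler API.)
Introducing requests, a powerful tool for crawlers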
http://docs.python-requests.org/en/master/
A simple example:
import requests
req = requests.get('http://docs.python-requests.org/en/master/')
print(type(req))
print(req.status_code)
print(req.encoding)
print(req.cookies)
<class 'requests.models.Response'>
200
ISO-8859-1
<RequestsCookieJar[<Cookie __cfduid=dda7f875baa65df649326ae94abbb14201568439405 for .2.python-requests.org/>]>
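status_code: the HTTP status code of the response
encoding: the encoding used to decode the response text
cookies: the cookies returned by the server
Status code   Meaning
200           Request succeeded
301           The resource (web page, etc.) has been permanently moved to another URL
404           The requested resource (web page, etc.) does not exist
500           Internal server error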
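The codes in the table can also be looked up programmatically. The following is a minimal sketch that uses the standard library's http.HTTPStatus rather than requests itself; the helper name describe_status is made up for this illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code, its standard reason phrase, and its broad class."""
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib does not know
    classes = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    kind = classes.get(code // 100, "other")
    return f"{code} {status.phrase} ({kind})"


# The four codes listed in the table above.
for code in (200, 301, 404, 500):
    print(describe_status(code))
```

With requests, the same number is available as req.status_code on the response object.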
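Uses of cookies
1. Session state management (such as login status, shopping carts, game scores, or anything else that needs to be remembered)
2. Personalization (such as user-defined settings and themes)
3. Browser behavior tracking (such as tracking and analyzing user behavior)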
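To make the first two uses concrete: a server typically hands the client a value in a Set-Cookie header, and the client sends it back on later requests. Below is a minimal sketch using the standard library's http.cookies module; the header strings and the cookie names sessionid and theme are invented for illustration and do not come from any real site.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values, as a server might send them.
set_cookie_headers = [
    "sessionid=abc123; Path=/; HttpOnly",  # session state (e.g. login)
    "theme=dark; Path=/; Max-Age=3600",    # personalization
]

jar = SimpleCookie()
for header in set_cookie_headers:
    jar.load(header)  # parse the header and store the cookie in the jar

# Name -> value pairs the client would send back in its Cookie header.
cookie_values = {name: morsel.value for name, morsel in jar.items()}
print(cookie_values)
print(jar["sessionid"]["path"], jar["theme"]["max-age"])
```

In requests, the cookies a server sets are exposed in a similar jar-like object as req.cookies, as in the earlier output.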
Basic requests
The requests library provides all of the basic HTTP request methods.
import requests
req= requests.get('http://docs.python-requests.org/en/master/') #200
req= requests.post('http://docs.python-requests.org/en/master/') #200
req= requests.put('http://docs.python-requests.org/en/master/') #405
req= requests.delete('http://docs.python-requests.org/en/master/') #405
req= requests.head('http://docs.python-requests.org/en/master/') #301
req= requests.options('http://docs.python-requests.org/en/master/') #405
print(req.elapsed.total_seconds()) # seconds elapsed between sending the request and receiving the response
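GET requests
Query parameters can be passed with the params argument.
import requests
payload = {'key1': 'value1', 'key2': 'value2'}  # define a dict to use as the query parameters
r = requests.get("http://docs.python-requests.org/en/master/", params=payload)  # pass the dict via params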
print(r.url)
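Output
Instructor's result: http://docs.python-requests.org/en/master/?key2=value2&key1=value1
Actual result: https://2.python-requests.org//en/master/?key1=value1&key2=value2
(See the example in the official documentation: https://2.python-requests.org//en/master/user/quickstart/#make-a-request)
With GET, the parameters appear in the URL itself, so sending an account name and password this way is too exposed and not secure.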
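How params ends up in the URL can be checked without sending anything over the network. The sketch below builds the request with requests.Request and calls prepare(), the step requests performs before sending. It assumes requests is installed and Python 3.7 or later, where dicts keep insertion order; on older Python versions dict order was not guaranteed, which may be why the instructor's output lists key2 first.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Build the request but do not send it; prepare() encodes params into the URL.
prepared = requests.Request(
    'GET',
    'http://docs.python-requests.org/en/master/',
    params=payload,
).prepare()

print(prepared.url)
```

Because no redirect is followed here, the printed URL keeps the original host instead of the 2.python-requests.org host seen in the live result.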
POST requests
Use the data argument to attach parameters to a POST request.
import requests
data_form = {'key1': 'value1', 'key2': 'value2'}
req = requests.post("http://httpbin.org/post", data=data_form)
print(req.text)
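In contrast to GET, form fields passed through data travel in the request body rather than in the URL. The following is a minimal offline sketch that prepares the request without contacting httpbin.org, assuming requests is installed. Note that the body is still sent as plain text unless the connection uses HTTPS.

```python
import requests

data_form = {'key1': 'value1', 'key2': 'value2'}

# Build the POST request without sending it, then inspect what would go out.
prepared = requests.Request('POST', 'http://httpbin.org/post', data=data_form).prepare()

print(prepared.url)                      # the URL carries no parameters
print(prepared.headers['Content-Type'])  # form encoding chosen for a dict payload
print(prepared.body)                     # the form fields, URL-encoded
```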
Uploading files
Method 1