Python Learning 31
Web Scraping (Part 2)
1. Making requests with requests
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2018/6/15 15:36
# @Author: xiexiaolong
# @File: demon1.PY
import requests

url = "https://www.qiushibaike.com/"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
}
r = requests.get(url=url, headers=header)
print(r.request)                                # the PreparedRequest that was sent
print(r.headers)                                # print the response headers
print(r.request.headers.get("User-Agent"))      # print the User-Agent we sent (it lives on the request, not the response)
print(r.encoding)                               # print the page's character set
r.encoding = r.apparent_encoding                # switch to the encoding detected from the body
print(r.cookies)                                # get the cookies
print(r.cookies.get("_xsrf"))
print(r.status_code)                            # get the status code
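Note that r.request above is a PreparedRequest. You can also build one yourself without sending it, which is a network-free way to check exactly what would go out on the wire (a small sketch reusing the URL and header from the example):

```python
import requests

url = "https://www.qiushibaike.com/"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
}

# .prepare() builds the exact PreparedRequest a session would send,
# without opening any connection.
req = requests.Request("GET", url, headers=header).prepare()
print(req.url)
print(req.headers.get("User-Agent"))
```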
2. Cookies
requests obtains cookies through a session. The five elements of a cookie are: name, value, domain, path, expires.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2018/6/15 15:36
# @Author: xiexiaolong
# @File: demon1.PY
import requests

url = "https://www.qiushibaike.com/"   # any site that sets cookies
session = requests.session()
response = session.get(url=url).text
cookies = session.cookies
print(cookies.keys())
print(cookies.values())
for cookie in cookies:
    print(cookie.name)
    print(cookie.value)
    print(cookie.domain)
    print(cookie.path)
    print(cookie.expires)
Analysis
Commonly used cookie attributes:
1. Domain - the domain
2. Path - the path
3. Expires - the expiry time
4. name - the key
5. value - the value for that key
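These five elements can be seen without hitting any site by building a cookie jar locally (a sketch; the name, value, and domain below are made up):

```python
import requests

jar = requests.cookies.RequestsCookieJar()
# set() takes the same elements listed above: name, value, domain, path
jar.set("_xsrf", "abc123", domain="example.com", path="/")

for cookie in jar:
    print(cookie.name, cookie.value, cookie.domain, cookie.path, cookie.expires)

# a jar can also be flattened into a plain dict
d = requests.utils.dict_from_cookiejar(jar)
print(d)
```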
a. Once you have the cookie information, visit the site carrying those cookies
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2018/6/15 16:15
# @Author: xiexiaolong
# @File: demon2.PY
import requests

cookie = dict(
    _ga="GA1.2.208618761.1528809975",
    _gid="GA1.2.604525626.1528979734",
    PHPSESSID="ait0b8c22ofqpo630cekpc33b6",
    _gat="1",
    Hm_lvt_0936ebcc9fa24aa610a0079314fec2d3="1528809975,1528809984,1528979734,1528980228",
    Hm_lpvt_0936ebcc9fa24aa610a0079314fec2d3="1528980228",
    ape__Session="ait0b8c22ofqpo630cekpc33b6"
)
url = "http://httpbin.org/cookies"
session = requests.session()
res = session.get(url=url, cookies=cookie)
res.encoding = res.apparent_encoding
print(res.text)
Result:
"C:\Program Files\Python36\python.exe" D:/python/0614/demon2.py
{"cookies":{"Hm_lpvt_0936ebcc9fa24aa610a0079314fec2d3":"1528980228","Hm_lvt_0936ebcc9fa24aa610a0079314fec2d3":"1528809975,1528809984,1528979734,1528980228","PHPSESSID":"ait0b8c22ofqpo630cekpc33b6","_ga":"GA1.2.208618761.1528809975","_gat":"1","_gid":"GA1.2.604525626.1528979734","ape__Session":"ait0b8c22ofqpo630cekpc33b6"}}
Process finished with exit code 0
b. Accessing pages through a proxy
When scraping, a proxy is often used to keep your IP from being banned; requests supports this through the proxies parameter.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2018/6/15 16:19
# @Author: xiexiaolong
# @File: demon3.PY
import requests

url = "http://2018.ip138.com/ic.asp"
proxy = {
    "http": "http://221.228.17.172:8181"
}
res1 = requests.get(url=url, proxies=proxy)   # via the proxy
res2 = requests.get(url=url)                  # direct
res1.encoding = res1.apparent_encoding
res2.encoding = res2.apparent_encoding
print(res1.text)
print("#########")
print(res2.text)
Result:
"C:\Program Files\Python36\python.exe" D:/python/0614/demon3.py
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=gb2312">
<title> 您的IP地址 </title>
</head>
<body style="margin:0px"><center>您的IP是:[221.228.17.172] 来自:江苏省南京市 电信</center></body></html>
#########
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=gb2312">
<title> 您的IP地址 </title>
</head>
<body style="margin:0px"><center>您的IP是:[10.15.20.15] 来自:本地局域网</center></body></html>
Process finished with exit code 0
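In practice a proxies dict usually carries an entry per scheme, and it can be attached to a session once instead of being passed on every call (a sketch reusing the placeholder proxy address from above):

```python
import requests

# requests picks the entry matching the target URL's scheme,
# so https URLs need their own key.
proxy = {
    "http": "http://221.228.17.172:8181",
    "https": "http://221.228.17.172:8181",
}

session = requests.session()
session.proxies.update(proxy)   # now every request on this session uses the proxy
print(session.proxies)
```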
3. urllib
In Python 2, urllib and urllib2 each had their own role. Although urllib2 was the newer module, it could not fully replace urllib. In Python 3 the two were merged into a single package, urllib.
urllib2 could accept a Request object, which let you set the headers for a URL, while urllib accepted only a plain URL string. That meant you could not disguise your request headers with urllib alone.
The urllib module provided the urlencode method for building GET query strings, which urllib2 lacked; likewise urllib.quote and the rest of the quote/unquote family were never added to urllib2. Sometimes urllib was needed as a helper, and that is why urllib and urllib2 were often used together.
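In Python 3 both jobs live under the one urllib package: urllib.parse.urlencode builds the query string, and urllib.request.Request carries the headers. A small sketch (the query values are made up):

```python
from urllib.parse import urlencode
from urllib.request import Request

# urlencode builds the GET query string -- the old urllib job
query = urlencode({"q": "python", "page": 1})
url = "http://httpbin.org/get?" + query

# Request carries custom headers -- the old urllib2 job
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
print(req.full_url)
print(req.get_header("User-agent"))
```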
4. Downloading images
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2018/6/15 16:29
# @Author: xiexiaolong
# @File: demon4.PY
import codecs
import requests

url = "http://httpbin.org/image/jpeg"   # placeholder: any image URL
res = requests.get(url=url, stream=True)
with codecs.open("car.jpg", "wb") as f:
    for chunk in res.iter_content(10000):
        f.write(chunk)
Analysis:
When using requests.get to download a large file, stream mode is recommended.
With stream set to False (the default), the whole body is downloaded immediately into memory; a large file can exhaust memory.
With stream set to True, the download does not start until you iterate over the content with iter_content or iter_lines, or access the content attribute. One caveat: until the body has been downloaded, the connection has to stay open.
• iter_content: iterate over the content to download, chunk by chunk
• iter_lines: iterate over the content to download, line by line
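Putting those points together, a download helper might look like this sketch; using the response as a context manager releases the connection even if writing fails (the function name and chunk size are my own choices):

```python
import requests

def download(url, path, chunk_size=10000):
    # stream=True defers the body; nothing is fetched until iter_content runs.
    with requests.get(url, stream=True) as res:
        res.raise_for_status()
        with open(path, "wb") as f:
            for chunk in res.iter_content(chunk_size):
                f.write(chunk)
    return path
```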