Main steps of a web crawler:
1 Create the request
2 Open the target page
3 Read the page
4 Decode the response
5 Locate and analyze the key markup
6 Extract the data with regular expressions based on the patterns found
7 Read and write files
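The seven steps can be sketched end to end. In this sketch the network fetch (steps 1-4) is wrapped in a helper but not called, so the example runs offline on a hard-coded HTML string; the URL, the sample HTML, and the `<a>`-tag pattern are all illustrative:

```python
import re
from urllib import request

def fetch(url):
    # Steps 1-4: create the request, open the page, read and decode it
    req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return request.urlopen(req).read().decode()

# Hard-coded page content so the sketch runs without network access
html = '<a href="/a.html">First</a><a href="/b.html">Second</a>'

# Steps 5-6: locate the key markup and extract it with a regular expression
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

# Step 7: write the results to a file
with open('links.txt', 'w', encoding='utf-8') as f:
    for href, text in links:
        f.write(href + '\t' + text + '\n')

print(links)  # [('/a.html', 'First'), ('/b.html', 'Second')]
```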
Making the request
When making a request, you can use urllib's default request behavior,
or you can build a custom request object:
request.Request()
A custom request object lets you attach request headers.
Headers are used to get around anti-crawler checks by imitating different browsers when accessing the data.
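The reason headers matter: without one, urllib's opener announces itself as Python-urllib/&lt;version&gt;, which many sites reject outright. You can inspect the default it would send:

```python
from urllib import request

# A default opener carries a User-Agent that identifies the script
opener = request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.x')]
```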
#Build the request headers
header = {
    'User-Agent': 'Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; BLA-AL00 Build/HUAWEIBLA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.9 Mobile Safari/537.36'
}
# print(header)
#Build the custom request object (url holds the target page's address)
REQ = request.Request(url, headers=header)
#Request the page
response = request.urlopen(REQ)
You can also build a list of User-Agent strings and pick one at random:
userAgent = ['Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; BLA-AL00 Build/HUAWEIBLA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.9 Mobile Safari/537.36',
'Mozilla/5.0 (Linux; Android 8.1; PAR-AL00 Build/HUAWEIPAR-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/044304 Mobile Safari/537.36 MicroMessenger/6.7.3.1360(0x26070333) NetType/WIFI Language/zh_CN Process/tools',
'Mozilla/5.0 (Linux; Android 8.1.0; ALP-AL00 Build/HUAWEIALP-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/63.0.3239.83 Mobile Safari/537.36 T7/10.13 baiduboxapp/10.13.0.11 (Baidu; P1 8.1.0)',
'Mozilla/5.0 (Linux; Android 6.0.1; OPPO A57 Build/MMB29M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/63.0.3239.83 Mobile Safari/537.36 T7/10.13 baiduboxapp/10.13.0.10 (Baidu; P1 6.0.1)',
'Mozilla/5.0 (Linux; Android 5.1.1; vivo X6S A Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/044207 Mobile Safari/537.36 MicroMessenger/6.7.3.1340(0x26070332) NetType/4G Language/zh_CN Process/tools',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER) ',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
'Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
'Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5',
'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'];
agent = random.choice(userAgent)  #requires: import random
print(agent)
header = {
    'User-Agent': agent
}
#Build the custom request object
REQ = request.Request(url, headers=header)
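The random choice above can be wrapped in a small helper (sketched here with a shortened User-Agent pool) so that every request object gets a freshly picked browser identity:

```python
import random
from urllib import request

# A shortened illustrative pool; in practice use the full userAgent list
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
]

def make_request(url):
    # Attach a randomly chosen User-Agent to each request
    agent = random.choice(USER_AGENTS)
    return request.Request(url, headers={'User-Agent': agent})

req = make_request('http://www.baidu.com')
print(req.get_header('User-agent'))
```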
Opening the target page
You can open the page with the built-in method of urllib's request module:
#Request the page
response = request.urlopen(REQ)
You can also define your own way of opening pages by building a custom Opener object.
The common handlers are:
1. An HTTP handler object (dedicated to handling HTTP requests)
2. A proxy handler object (for proxy IPs)
Using the HTTP handler object:
from urllib import request
#Build the HTTP handler object (dedicated to handling HTTP requests)
http_handler = request.HTTPHandler()
#Create a custom Opener object
opener = request.build_opener(http_handler)
#Request the page
response = opener.open(REQ)
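One practical reason to build the HTTP handler yourself: HTTPHandler accepts a debuglevel argument, and setting it to 1 prints the raw HTTP exchange to stdout, which helps when a crawler misbehaves:

```python
from urllib import request

# debuglevel=1 makes every request through this opener log its HTTP traffic
http_handler = request.HTTPHandler(debuglevel=1)
opener = request.build_opener(http_handler)
# opener.open(REQ) would now print the outgoing request and incoming response
```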
Building the proxy handler object:
#Create the request object
req = request.Request("http://www.baidu.com")
#These proxy IPs are examples and may no longer work
proxyList = [{'https': "117.88.177.101:3000"}, {'https': "117.88.177.101:3000"}]
proxy = random.choice(proxyList)
#Build the proxy handler object
proxyHandler = request.ProxyHandler(proxy)
#The proxy handler must be used through a custom Opener
opener = request.build_opener(proxyHandler)
#Open the URL
res = opener.open(req).read()
Note: the Opener can be installed as the global opener:
#Install the custom opener globally, so requests sent with urlopen also go through it
request.install_opener(opener)
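A minimal self-contained sketch of installing an opener globally (the proxy IP is a placeholder, likely dead); after install_opener, plain request.urlopen calls are routed through it:

```python
from urllib import request

#Build an opener with a proxy handler (placeholder IP)
proxy_handler = request.ProxyHandler({'https': '117.88.177.101:3000'})
opener = request.build_opener(proxy_handler)
#From here on, request.urlopen uses this opener too
request.install_opener(opener)
```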
Reading and decoding the page
response = request.urlopen(REQ)
html = response.read().decode()  #decode() assumes UTF-8; pass the page's actual charset (e.g. 'gbk') if it differs
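read() returns bytes, and decode() converts them to str assuming UTF-8. A small helper, sketched here, prefers the charset the server declares in its Content-Type header and falls back to UTF-8 (the function name is an illustrative choice):

```python
from urllib import request

def read_page(url):
    # Open the page, then decode with the declared charset, defaulting to UTF-8
    response = request.urlopen(url)
    charset = response.headers.get_content_charset() or 'utf-8'
    return response.read().decode(charset)

# The decoding step itself, shown on raw bytes:
raw = '爬虫'.encode('utf-8')   # what read() hands back
print(raw.decode('utf-8'))     # 爬虫
```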