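0. Web crawlers
0.1 The two parts of a crawler:
1. Downloading web pages
- Make maximal use of local bandwidth
- Schedule requests across different sites to reduce the load on each remote server
- DNS lookups
- Follow established conventions (such as robots.txt)
2. Processing web pages
- Fetching dynamic content
- Spider traps
- Content deduplication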
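Three of the points above (respecting robots.txt, spacing out requests per site, and content deduplication) can be sketched with the Python standard library alone. This is a minimal illustration, not how scrapy implements them: the names ROBOTS_TXT, build_robot_rules, PoliteScheduler and ContentDeduplicator are hypothetical, the robots.txt content is made up, and deduplication is assumed to mean exact matching after whitespace and case normalization.

```python
import hashlib
from urllib import robotparser
from urllib.parse import urlsplit

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without network access."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.next_allowed = {}  # host -> earliest timestamp of the next request

    def wait_time(self, url, now):
        """Seconds to wait before `url` may be requested at time `now`."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_request(self, url, now):
        """Note that `url` was requested at time `now`."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.default_delay


class ContentDeduplicator:
    """Detects exact duplicates after whitespace and case normalization."""

    def __init__(self):
        self.seen = set()

    def is_duplicate(self, text):
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return True
        self.seen.add(digest)
        return False
```

In a crawl loop one would call can_fetch() on the parsed rules before each request, sleep for wait_time() seconds, and skip any page for which is_duplicate() returns True. Hash-based matching only catches identical content; near-duplicate detection would need a different technique.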
1. scrapy
1.1 Installing scrapy
pip install scrapy
pip install service_identity
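To confirm that both packages are visible to the interpreter after installation, a small check along these lines may help. It is only a sketch: missing_modules is a hypothetical helper, and it merely tests whether the import system can locate each module, not whether the installation actually works.

```python
import importlib.util


def missing_modules(names=("scrapy", "service_identity")):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("scrapy and service_identity are both importable")
```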
If service_identity is not installed, the following warning appears:
warning:
:0: UserWarning: You do not have a working installation of the service_identity
module: 'No module named service_identity'. Please install it from
<https://pypi.python.org/pypi/service_identity> and make sure all of its
dependencies are satisfied. Without the service_identity module and a recent
enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client
hostname verification. Many valid certificate/hostname mappings may be rejected.
Traceback (most recent ca