Contents
Chapter 1 Introduction to Web Scraping
1.1 When web scraping is useful
1.2 Is web scraping legal
1.3 Background research
1.3.1 Checking robots.txt
1.3.2 Examining the sitemap
1.3.3 Estimating the size of a website
1.3.4 Identifying the technology used by a website
1.3.5 Finding the owner of a website
1.4 Writing your first web crawler
1.4.1 Downloading a web page
1.4.2 Sitemap crawler
1.4.3 ID iteration crawler
1.4.4 Link crawler
1.5 Chapter summary
Chapter 2 Scraping the Data
2.1 Analyzing a web page
2.2 Three approaches to scraping a web page
2.2.1 Regular expressions
2.2.2 Beautiful Soup
2.2.3 Lxml
2.2.4 Performance comparison
2.2.5 Conclusions
2.2.6 Adding a scrape callback to the link crawler
2.3 Chapter summary
Chapter 3 Caching Downloads
3.1 Adding cache support to the link crawler
3.2 Disk cache
3.2.1 Implementation
3.2.2 Testing the cache
3.2.3 Saving disk space
3.2.4 Expiring stale data
3.2.5 Drawbacks
3.3 Database cache
3.3.1 What is NoSQL
3.3.2 Installing MongoDB
3.3.3 Overview of MongoDB
3.3.4 MongoDB cache implementation
3.3.5 Compression
3.3.6 Testing the cache
3.4 Chapter summary
Chapter 4 Concurrent Downloading
4.1 One million web pages
4.2 Sequential crawler
4.3 Multithreaded crawler
4.3.1 How threads and processes work
4.3.2 Implementation
4.3.3 Multiprocess crawler
4.4 Performance
4.5 Chapter summary
Chapter 5 Dynamic Content
5.1 An example dynamic web page
5.2 Reverse engineering a dynamic web page
5.3 Rendering a dynamic web page
5.3.1 PyQt or PySide
5.3.2 Executing JavaScript
5.3.3 Interacting with a website using WebKit
5.3.4 Selenium
5.4 Chapter summary
Chapter 6 Interacting with Forms
6.1 The login form
6.2 Extending the login script to support content updates
6.3 Automating form handling with the Mechanize module
6.4 Chapter summary
Chapter 7 Handling CAPTCHAs
7.1 Registering an account
7.2 Optical character recognition
7.3 Handling complex CAPTCHAs
7.3.1 Using a CAPTCHA solving service
7.3.2 Getting started with 9kw
7.3.3 Integrating with the registration function
7.4 Chapter summary
Chapter 8 Scrapy
8.1 Installation
8.2 Starting a project
8.2.1 Defining a model
8.2.2 Creating a spider
8.2.3 Scraping with the shell command
8.2.4 Checking results
8.2.5 Interrupting and resuming a crawl
8.3 Writing a visual crawler with Portia
8.3.1 Installation
8.3.2 Annotation
8.3.3 Refining the spider
8.3.4 Checking results
8.4 Automated scraping with Scrapely
8.5 Chapter summary
Chapter 9 Summary