Scrapy resource roundup

yuzx2008

Tags: scrapy

Article link: https://blog.csdn.net/yuzx2008/article/details/50429747

Chinese tutorial
http://scrapy-chs.readthedocs.org/zh_CN/latest/intro/tutorial.html

http://scrapy.org/doc/

http://blog.csdn.net/ku360517703/article/details/9888945

Learning
http://snipplr.com/all/tags/scrapy/

Twisted Reactor Overview
http://twistedmatrix.com/documents/current/core/howto/reactor-basics.html

Mac


# Fix for "ImportError: cannot import name xmlrpc_client": remove the old copies of six, then reinstall it
sudo rm -rf /Library/Python/2.7/site-packages/six*
sudo rm -rf /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six*
sudo pip install six

# Another suggested fix (did not solve the problem)
pip install --upgrade six scrapy

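# Removing the second directory fails even with sudo (a permissions problem)
On OS X 10.11 El Capitan, sudo cannot write to protected system directories such as /usr and /System; the command returns "Operation not permitted", and ls -lO shows many entries flagged restricted.
This is a feature introduced in El Capitan: SIP (System Integrity Protection), also known as "rootless".
How can it be turned off?
Boot into Recovery Mode (hold CMD+R at startup), open Terminal, and run:
csrutil disable
Then reboot.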
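
To check whether the reinstall took effect, a small diagnostic along these lines may help. It is only a sketch: describe_six is a hypothetical helper name, and all it does is report where Python loads six from and whether the import that failed above now succeeds.

```python
def describe_six():
    """Return (path of the six module Python imports, whether xmlrpc_client imports)."""
    try:
        import six
    except ImportError:
        # six is not installed at all
        return None, False
    try:
        # This is the import that fails with an outdated copy of six
        from six.moves import xmlrpc_client  # noqa: F401
        has_client = True
    except ImportError:
        has_client = False
    return six.__file__, has_client


if __name__ == "__main__":
    location, ok = describe_six()
    print("six loaded from:", location)
    print("six.moves.xmlrpc_client importable:", ok)
```

If the reported path still points into the system Extras directory, the old copy is probably still being picked up first.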

Spider

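What does a spider define?

*. Which site, or sites, to crawl
*. How to crawl them, for example by following links
*. How to extract structured data from the pages, for example by scraping items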

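How to avoid getting banned

Set download_delay (the DOWNLOAD_DELAY setting)
Disable cookies (COOKIES_ENABLED = False)
Use a pool of user agents
Use a pool of IPs
One way web servers deal with crawlers is to block your IP, or the whole IP range, outright; once an IP is blocked, switch to another IP and carry on.
Scrapy + Tor + polipo can be used for this
http://pkmishra.github.io/blog/2013/03/18/how-to-run-scrapy-with-TOR-and-multiple-browser-agents-part-1-mac
Distributed crawling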
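
The first four items can be sketched as two small downloader middlewares plus a few settings. This is an illustration under stated assumptions, not the setup from the linked article: the class names, the myproject.middlewares module path, the user agent strings, and the proxy address (polipo's default port 8123 on localhost, forwarding to Tor) are all placeholders. The settings and the middlewares are shown in one listing for brevity; in a real project they would live in settings.py and middlewares.py respectively.

```python
import random

# --- would normally go in settings.py ---
DOWNLOAD_DELAY = 2          # seconds to wait between requests to the same site
COOKIES_ENABLED = False     # do not store or send cookies
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path; lower numbers run earlier on the request path
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "myproject.middlewares.RandomProxyMiddleware": 740,
}

# --- would normally go in middlewares.py ---
# Placeholder pools; replace with real values
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) ExampleBrowser/2.0",
]
PROXIES = [
    "http://127.0.0.1:8123",  # assumption: local polipo instance in front of Tor
]


class RandomUserAgentMiddleware(object):
    """Pick a random User-Agent header for every outgoing request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let Scrapy continue processing the request


class RandomProxyMiddleware(object):
    """Route every outgoing request through a randomly chosen proxy."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)
        return None
```

Scrapy's built-in proxy handling reads request.meta["proxy"], so setting that key is enough to route a request through the chosen proxy.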
