Crawler Summary (Part 1)
-
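scrapy … settings.py
- ROBOTSTXT_OBEY = False : do not obey the target site's robots.txt rules
- CONCURRENT_REQUESTS = 32 : maximum number of requests in flight at once (Scrapy is asynchronous, so these are concurrent requests, not threads)
- DOWNLOAD_DELAY = 0 : delay in seconds between consecutive requests to the same site
- CONCURRENT_REQUESTS_PER_DOMAIN = 32 : maximum number of concurrent requests to any single domain
- CONCURRENT_REQUESTS_PER_IP = 32 : maximum number of concurrent requests to any single IP; when non-zero, it is used instead of CONCURRENT_REQUESTS_PER_DOMAIN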
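
How the three concurrency limits combine can be sketched in plain Python. This is a simplified model and not Scrapy's own code: the helper `per_slot_limit` and the two dictionaries are hypothetical names used only for illustration, and the "defaults" dictionary assumes Scrapy's documented default values (16, 8, 0).

```python
def per_slot_limit(settings):
    """Simplified model: how many requests one domain (or IP) may have in flight."""
    total = settings["CONCURRENT_REQUESTS"]
    per_domain = settings["CONCURRENT_REQUESTS_PER_DOMAIN"]
    per_ip = settings.get("CONCURRENT_REQUESTS_PER_IP", 0)
    # A non-zero per-IP limit is used instead of the per-domain limit.
    limit = per_ip if per_ip else per_domain
    # No single domain or IP can exceed the global cap.
    return min(total, limit)


# Values from the notes above.
notes_settings = {
    "CONCURRENT_REQUESTS": 32,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 32,
    "CONCURRENT_REQUESTS_PER_IP": 32,
}
# Assumed Scrapy defaults, for comparison.
assumed_defaults = {
    "CONCURRENT_REQUESTS": 16,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
    "CONCURRENT_REQUESTS_PER_IP": 0,
}

print(per_slot_limit(notes_settings))    # 32
print(per_slot_limit(assumed_defaults))  # 8
```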
-
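scrapy … globals (global variables)
- A `global` variable can be used to pass data between callbacks, but when concurrency is high and responses arrive quickly, the shared value gets overwritten and the results come out wrong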
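
The failure mode can be reproduced without Scrapy. The sketch below is a pure-Python simulation, not a spider: the function names and the "btc"/"eth"/"trx" categories are made up for illustration. It assumes that all requests are issued before any response is handled, which is what typically happens under high concurrency. The second function attaches the data to each request instead, which is the idea behind Scrapy's `cb_kwargs` and `Request.meta`.

```python
current_category = None  # shared global, overwritten by every "request"


def schedule_with_global(categories):
    """Issue all requests first, keeping the category in a global."""
    global current_category
    pending = []
    for category in categories:
        current_category = category  # each new request overwrites the last
        pending.append(f"page-of-{category}")
    # The callbacks run later, so each one reads only the final value.
    return [(page, current_category) for page in pending]


def schedule_with_kwargs(categories):
    """Carry the category with each request, in the spirit of cb_kwargs."""
    pending = [(f"page-of-{c}", {"category": c}) for c in categories]
    # Each callback receives the data that belongs to its own request.
    return [(page, kwargs["category"]) for page, kwargs in pending]


broken = schedule_with_global(["btc", "eth", "trx"])
fixed = schedule_with_kwargs(["btc", "eth", "trx"])
print(broken)  # every page is paired with 'trx'
print(fixed)   # every page is paired with its own category
```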
-
scrapy … crontab scheduled tasks for scrapy on Linux
- 0 16 * * * sh /home/ubuntu/dashboard/on_chain/startup.sh (runs the script every day at 16:00)
- Contents of startup.sh:

      #!/bin/sh
      cd /home/ubuntu/dashboard/on_chain/QuarkChain/Tron/Tron
      /home/ubuntu/anaconda3/bin/scrapy crawl tron
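
If a Python launcher is preferred over the shell script, the same two steps (change into the project directory, then run `scrapy crawl`) might look like the sketch below. `run_spider` and its `runner` parameter are hypothetical names introduced here; `runner` defaults to `subprocess.run` and exists only so the function can be exercised without actually starting a crawl. With the paths from the notes, the call would be `run_spider("/home/ubuntu/dashboard/on_chain/QuarkChain/Tron/Tron", "/home/ubuntu/anaconda3/bin/scrapy", "tron")`.

```python
import subprocess


def run_spider(project_dir, scrapy_bin, spider, runner=subprocess.run):
    """Run `scrapy crawl <spider>` from inside the Scrapy project directory."""
    command = [scrapy_bin, "crawl", spider]
    # cwd replaces the `cd` in startup.sh; check=True raises
    # CalledProcessError if scrapy exits with a non-zero status.
    return runner(command, cwd=project_dir, check=True)
```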
-
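Scrapy settings for stopping the spider automatically (0 disables a condition):
CLOSESPIDER_TIMEOUT = 0  # seconds the spider may stay open
CLOSESPIDER_PAGECOUNT = 0  # number of responses crawled
CLOSESPIDER_ITEMCOUNT = 0  # number of items scraped
CLOSESPIDER_ERRORCOUNT = 16  # number of errors; the spider closes after 16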
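
The meaning of these four values can be illustrated with a small model. This is not the code of Scrapy's CloseSpider extension, only a sketch of the rule "a limit of 0 is ignored, any other limit closes the spider once it is reached". The function `close_reason` and the keys of the `stats` dictionary are hypothetical; the returned strings follow the close reasons Scrapy reports, such as `closespider_errorcount`.

```python
def close_reason(limits, stats):
    """Return the first close reason that is triggered, or None to keep running."""
    checks = [
        ("CLOSESPIDER_TIMEOUT", "elapsed_seconds", "closespider_timeout"),
        ("CLOSESPIDER_PAGECOUNT", "pages", "closespider_pagecount"),
        ("CLOSESPIDER_ITEMCOUNT", "items", "closespider_itemcount"),
        ("CLOSESPIDER_ERRORCOUNT", "errors", "closespider_errorcount"),
    ]
    for setting, stat, reason in checks:
        limit = limits.get(setting, 0)
        # A limit of 0 means the condition is disabled.
        if limit and stats.get(stat, 0) >= limit:
            return reason
    return None


# The values from the notes: only the error count is active.
limits = {
    "CLOSESPIDER_TIMEOUT": 0,
    "CLOSESPIDER_PAGECOUNT": 0,
    "CLOSESPIDER_ITEMCOUNT": 0,
    "CLOSESPIDER_ERRORCOUNT": 16,
}

print(close_reason(limits, {"pages": 10000, "errors": 15}))  # None
print(close_reason(limits, {"pages": 10000, "errors": 16}))  # closespider_errorcount
```

With these values the spider runs without a time, page, or item limit and stops only after 16 errors.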