scrapy

Latest recommended article published on 2024-08-05 20:06:05

len_pn

Tags: scrapy

Article link: https://blog.csdn.net/len_pn/article/details/103463819

I. Changing the request header (User-Agent)

1. Install fake-useragent into Python:

pip install fake-useragent

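2. In middlewares.py, add:

from fake_useragent import UserAgent  # the import name uses an underscore

class UserAgentDownloadMiddlewares(object):
    def process_request(self, request, spider):
        # Pick a random User-Agent string for each outgoing request.
        ua = UserAgent()
        request.headers['User-Agent'] = ua.random  # random is a property, not a method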

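3. In settings.py, add the middleware to 'DOWNLOADER_MIDDLEWARES', using its full import path (replace project_name with the project's package name):

'project_name.middlewares.UserAgentDownloadMiddlewares': 100,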
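
The middleware above builds a new UserAgent object for every request. A minimal sketch of the same idea with the pool built once is shown below. To stay dependency-free it draws from a small hard-coded list rather than from fake-useragent; the names RandomUserAgentMiddleware, USER_AGENTS, and FakeRequest are assumptions made for illustration and are not part of Scrapy or of the original steps.

```python
import random

# Hypothetical pool of User-Agent strings; in practice these could come from fake-useragent.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


class RandomUserAgentMiddleware:
    """Downloader-middleware-shaped class that sets a random User-Agent."""

    def __init__(self, user_agents=USER_AGENTS):
        # Build the pool once instead of on every request.
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue processing the request.
        return None


class FakeRequest:
    """Stand-in for scrapy.Request, used only to exercise the middleware here."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


request = FakeRequest("https://example.com/")
RandomUserAgentMiddleware().process_request(request, spider=None)
print(request.headers["User-Agent"])
```

In a real project the class would live in middlewares.py and be registered in DOWNLOADER_MIDDLEWARES in the same way as the middleware above.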

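II. Setting a proxy IP

Proxy providers:
1. Zhima Proxy (芝麻代理): http://http.zhimaruanjian.com/
2. Taiyang Proxy (太阳代理): http://http.taiyangruanjian.com/
3. Kuaidaili (快代理): http://www.kuaidaili.com/
4. Xdaili (讯代理): http://www.xdaili.cn/
5. Mayi Proxy (蚂蚁代理): http://www.mayidaili.com/
6. Jiguang Proxy (极光代理): http://www.jiguangdaili.com/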

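1. Use a crawler to fetch IP addresses from a proxy provider.

2. In middlewares.py, add:

class IPProxyDownloadMiddlewares(object):
    def process_request(self, request, spider):
        # ip and port are placeholders; substitute an address obtained from the provider.
        ip, port = '127.0.0.1', 8888
        proxy_url = 'http://' + ip + ':' + str(port)
        request.meta['proxy'] = proxy_url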

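3. In settings.py, add to 'DOWNLOADER_MIDDLEWARES', using the full import path (replace project_name with the project's package name):

'project_name.middlewares.IPProxyDownloadMiddlewares': 100,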
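
With several addresses available from a provider, the proxy can be rotated per request. The following is a minimal sketch under that assumption; RandomProxyMiddleware, build_proxy_url, FakeRequest, and the 203.0.113.x addresses (a range reserved for documentation) are hypothetical names and values, not something the providers or Scrapy supply.

```python
import random


def build_proxy_url(ip, port, scheme="http"):
    """Format an address the way request.meta['proxy'] expects it."""
    return f"{scheme}://{ip}:{port}"


class RandomProxyMiddleware:
    """Downloader-middleware-shaped class that picks a random proxy per request."""

    def __init__(self, proxies):
        # proxies: list of (ip, port) pairs fetched from a provider beforehand.
        if not proxies:
            raise ValueError("at least one proxy is required")
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        ip, port = random.choice(self.proxies)
        request.meta["proxy"] = build_proxy_url(ip, port)
        return None


class FakeRequest:
    """Stand-in for scrapy.Request, used only to exercise the middleware here."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


# Placeholder addresses; replace with addresses obtained from a provider.
PROXIES = [("203.0.113.10", 8080), ("203.0.113.11", 3128)]
request = FakeRequest("https://example.com/")
RandomProxyMiddleware(PROXIES).process_request(request, spider=None)
print(request.meta["proxy"])
```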

III. scrapy-redis

Installing and using scrapyd (scrapyd can deploy spiders to different machines):

1. Install scrapyd on the server:

pip install scrapyd

2. Copy 'default_scrapyd.conf' from '/usr/local/lib/python/dist-packages/scrapyd' to '/etc/scrapyd/scrapyd.conf', creating the '/etc/scrapyd' directory first.

3. In '/etc/scrapyd/scrapyd.conf', change 'bind_address' to the server's own IP address.
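
The edit in step 3 can also be scripted. Below is a minimal sketch using Python's configparser on an in-memory string; the DEFAULT_CONF excerpt is an assumed, shortened version of the shipped file, and 192.0.2.10 is a placeholder address. Note that rewriting a file through configparser drops any comments it contained.

```python
import configparser
import io

# Assumed excerpt of scrapyd.conf; the shipped default file contains more options.
DEFAULT_CONF = """\
[scrapyd]
bind_address = 127.0.0.1
http_port = 6800
"""


def set_bind_address(conf_text, address):
    """Return conf_text with the [scrapyd] bind_address replaced by address."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    parser["scrapyd"]["bind_address"] = address
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# 192.0.2.10 is a placeholder; use the server's real IP address.
print(set_bind_address(DEFAULT_CONF, "192.0.2.10"))
```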

4. Downgrade 'twisted':

pip uninstall twisted
pip install twisted==18.9.0

5. Install scrapyd-client on the client:

pip install scrapyd-client

6. On Windows, in the Scripts directory of the Python installation, rename scrapyd-deploy to scrapyd-deploy.py.

7. In the Scrapy project's scrapy.cfg file, set:
url = http://SERVER_IP:6800/

8. In the project directory, run this command to deploy the project and generate a version number: 'scrapyd-deploy default -p PROJECT_NAME'

9. On Windows, curl must be installed. Download it from 'https://curl.haxx.se/windows/' and run bin/curl.exe.

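10. Schedule the spider from cmd (the parameter naming the spider is 'spider'):

curl http://SERVER_IP:6800/schedule.json -d project=PROJECT_NAME -d spider=SPIDER_NAME

11. Stop the spider (JOB_ID is the job id reported by the server):

curl http://SERVER_IP:6800/cancel.json -d project=PROJECT_NAME -d job=JOB_ID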
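
The two curl calls in steps 10 and 11 are plain HTTP POST requests with form-encoded fields, so they can also be issued from Python. The sketch below only builds the requests and does not send them; scrapyd_request is a hypothetical helper, and the server address, project, spider, and job values are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def scrapyd_request(server, endpoint, **params):
    """Build (but do not send) a POST request for a scrapyd JSON endpoint."""
    url = f"http://{server}:6800/{endpoint}"
    data = urlencode(params).encode("utf-8")
    return Request(url, data=data, method="POST")


# Placeholder values; replace with the real server IP, project, spider, and job id.
schedule = scrapyd_request("192.0.2.10", "schedule.json",
                           project="myproject", spider="myspider")
cancel = scrapyd_request("192.0.2.10", "cancel.json",
                         project="myproject", job="JOB_ID")

print(schedule.full_url, schedule.data)
print(cancel.full_url, cancel.data)
```

Passing one of these objects to urllib.request.urlopen would send it, which requires a reachable scrapyd server.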

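12. Deploy the spider to multiple machines ('scrapyd-deploy -a' deploys to every target defined in scrapy.cfg), then schedule it on each server:

scrapyd-deploy -a

curl http://SERVER1_IP:6800/schedule.json -d project=PROJECT_NAME -d spider=SPIDER_NAME

curl http://SERVER2_IP:6800/schedule.json -d project=PROJECT_NAME -d spider=SPIDER_NAME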
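
For multi-machine deployment, each server is typically listed in scrapy.cfg as its own named target in a '[deploy:NAME]' section. The sketch below parses such a file and derives the schedule.json URL for each target; the SCRAPY_CFG contents, the target names server1 and server2, the addresses, and the schedule_urls helper are all assumptions for illustration.

```python
import configparser

# Assumed scrapy.cfg with two named deploy targets; addresses are placeholders.
SCRAPY_CFG = """\
[deploy:server1]
url = http://192.0.2.10:6800/
project = myproject

[deploy:server2]
url = http://192.0.2.11:6800/
project = myproject
"""


def schedule_urls(cfg_text):
    """Map each deploy target name to its schedule.json URL."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    urls = {}
    for section in parser.sections():
        if section.startswith("deploy:"):
            name = section.split(":", 1)[1]
            urls[name] = parser[section]["url"].rstrip("/") + "/schedule.json"
    return urls


for name, url in schedule_urls(SCRAPY_CFG).items():
    print(name, url)
```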

Modifying the Scrapy spider:
1. Change the spider class's parent class to scrapy_redis.spiders.RedisSpider.

2. Comment out start_urls and add redis_key = 'key'.

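3. In settings.py, add or change:

# Use the scrapy-redis scheduler so that all spiders share one request queue
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Make sure all spiders share the same deduplication fingerprints
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Use redis as an item pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
# Keep the scrapy-redis queues in redis instead of clearing them, so a crawl can be paused and resumed
SCHEDULER_PERSIST = True
# redis connection settings
REDIS_HOST = 'REDIS_SERVER_IP'
REDIS_PORT = 6379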
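
To illustrate why a shared duplicate filter matters, here is a simplified stand-in written without redis. It is not the scrapy-redis implementation: the fingerprint function hashes only the method and URL, and an ordinary Python set plays the role of the shared redis set. SharedDupeFilter and fingerprint are hypothetical names.

```python
import hashlib


def fingerprint(method, url):
    """Simplified request fingerprint: SHA-1 of the method and URL."""
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(url.encode("utf-8"))
    return digest.hexdigest()


class SharedDupeFilter:
    """Toy duplicate filter; store stands in for a set shared by all workers."""

    def __init__(self, store):
        self.store = store

    def request_seen(self, method, url):
        fp = fingerprint(method, url)
        if fp in self.store:
            return True
        self.store.add(fp)
        return False


shared_store = set()  # in a distributed crawl this role is played by redis
worker_a = SharedDupeFilter(shared_store)
worker_b = SharedDupeFilter(shared_store)

print(worker_a.request_seen("GET", "https://example.com/page/1"))  # False: first time seen
print(worker_b.request_seen("GET", "https://example.com/page/1"))  # True: recorded by worker_a
```

Because both workers consult the same store, a URL fetched by one machine is skipped by the others.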

4. Upload the spider via scrapyd.

5. On the machine running redis, open redis-cli and push the start URL onto the key (key is the redis_key value from step 2), for example:

 lpush key https://www.lianjia.com/city/
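
The effect of pushing a start URL onto a shared key can be pictured with a small in-memory model. FakeRedisList below is a hypothetical stand-in that mimics only the LPUSH and LPOP commands of a redis list; how scrapy-redis actually reads the key may differ between versions, so this is a sketch of the idea rather than of the library.

```python
from collections import deque


class FakeRedisList:
    """In-memory stand-in for one redis list, with LPUSH and LPOP semantics."""

    def __init__(self):
        self._items = deque()

    def lpush(self, *values):
        # Like LPUSH: each value is inserted at the head, in argument order.
        for value in values:
            self._items.appendleft(value)
        return len(self._items)

    def lpop(self):
        # Like LPOP: remove and return the head, or None when the list is empty.
        return self._items.popleft() if self._items else None


queue = FakeRedisList()
queue.lpush("https://www.lianjia.com/city/")

# Two idle workers poll the same key; only one of them receives the URL.
first = queue.lpop()
second = queue.lpop()
print(first)   # https://www.lianjia.com/city/
print(second)  # None
```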
