A Very Detailed Guide to Setting Up a Scrapy Distributed Crawler Framework

Author: qq_43058335

Category: Crawlers. Tags: scrapy, distributed crawler, crawler in practice

Article link: https://blog.csdn.net/qq_43058335/article/details/100778609

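Note: the author reposted this article from their personal website; the source URL is http://www.zhaoqiansunli.com.cn//a/meinv/20190912/192.html
If you need the source code, please [contact the site owner via the source site](http://www.zhaoqiansunli.com.cn//a/meinv/20190912/192.html).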

This article walks through setting up a distributed crawler with Scrapy, the Python crawling framework. The steps are as follows.

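1. Install Python. There are two Python lines, Python 2.7 and Python 3.x; this article uses Python 2. Download and installation: https://www.python.org/downloads
2. Install Scrapy with pip: pip install scrapy
3. Install scrapy-redis, also with pip: pip install scrapy-redis
4. Install the Redis client library for Python, also with pip: pip install redis (this installs only the Python client; the Redis server itself is installed in step 7).
5. Install scrapyd, also with pip: pip install scrapyd
6. It is best to install the MongoDB and MySQL components as well. For MongoDB: pip install pymongo. For MySQL: sudo apt-get install libmysqlclient-dev, then pip install MySQL-Python
7. Install the Redis server. Download: http://xiazai.jinyihulian.cn/xiazai/redis.tar.gz
8. Install the MongoDB server. Download: http://xiazai.jinyihulian.cn/xiazai/mongodb.rar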
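
After the installs above, it can help to confirm that the Python packages are importable before moving on to configuration. The following is a minimal sketch that is not part of the original article: the helper name check_modules is hypothetical, and the list uses the names under which the packages are imported (scrapy_redis for scrapy-redis, MySQLdb for MySQL-Python).

```python
from __future__ import print_function
import importlib

# Import names of the packages installed in the steps above.
REQUIRED_MODULES = ["scrapy", "scrapy_redis", "redis", "scrapyd", "pymongo", "MySQLdb"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Print one line per package so missing installs are easy to spot.
    for name, ok in sorted(check_modules(REQUIRED_MODULES).items()):
        print("%-14s %s" % (name, "OK" if ok else "MISSING"))
```

This only checks the Python side; it says nothing about whether the Redis or MongoDB servers are running.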
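9. Configuration.

1) The spider

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule
    from scrapy_redis.spiders import RedisCrawlSpider

    # Base class assumed here: RedisCrawlSpider is the scrapy-redis spider that supports rules.
    class HdhdspiderSpider(RedisCrawlSpider):
        name = 'hdhdspider'
        redis_key = 'hdhdspider:start_urls'  # Redis list that supplies the start URLs
        # Follow links inside elements with class="title"; parse_page is not shown in this excerpt.
        # Alternative filter: allow=r'/htm/movie\d+/[^\s]+.htm'
        rules = (
            Rule(LinkExtractor(restrict_xpaths='.//*[@class="title"]'), callback='parse_page', follow=True),
        )
        num = 0

        def __init__(self, *args, **kwargs):
            # Dynamically define the allowed domains list.
            domain = kwargs.pop('domain', '')
            self.allowed_domains = filter(None, domain.split(','))
            super(HdhdspiderSpider, self).__init__(*args, **kwargs)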
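
The constructor in the spider turns a comma-separated domain argument into the allowed_domains list. As a small illustration of that step alone, here is a sketch with a hypothetical helper named parse_allowed_domains (not in the original). It returns a real list, which matters on Python 3, where filter() returns a lazy iterator rather than a list.

```python
def parse_allowed_domains(domain_arg):
    """Split a comma-separated domain string and drop empty entries."""
    # Strip whitespace so "a.com, b.com" and "a.com,b.com" behave the same.
    return [part.strip() for part in domain_arg.split(",") if part.strip()]


if __name__ == "__main__":
    print(parse_allowed_domains("baidu.com,,news.baidu.com"))
```

With Scrapy's spider arguments, the value would typically be supplied on the command line, for example scrapy crawl hdhdspider -a domain=baidu.com,news.baidu.com.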

2) settings.py

    # Use the scrapy-redis duplicate filter and scheduler so that all workers share state in Redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Keep the request queue and the seen-set in Redis when the spider closes
    SCHEDULER_PERSIST = True
    #SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
    #SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
    #SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

    ITEM_PIPELINES = {
        #'example.pipelines.ExamplePipeline': 300,
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }
    # Address of the shared Redis server
    REDIS_HOST = "192.168.13.26"
    REDIS_PORT = 6379
    LOG_LEVEL = 'DEBUG'
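
The DUPEFILTER_CLASS setting above moves duplicate detection into Redis, so that a URL already requested by one worker is skipped by the others. The sketch below illustrates only the idea and is not the scrapy-redis implementation: the class name InMemoryDupeFilter is hypothetical, the fingerprint is simplified to a SHA-1 of method and URL, and a plain Python set stands in for the shared Redis set.

```python
import hashlib


class InMemoryDupeFilter(object):
    """Toy duplicate filter; a local set replaces the shared Redis set."""

    def __init__(self):
        self.seen = set()

    def fingerprint(self, method, url):
        # Simplified fingerprint: hash of the upper-cased method plus the URL.
        data = (method.upper() + " " + url).encode("utf-8")
        return hashlib.sha1(data).hexdigest()

    def request_seen(self, method, url):
        """Return True if this request was recorded before, else record it."""
        fp = self.fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


if __name__ == "__main__":
    dupefilter = InMemoryDupeFilter()
    print(dupefilter.request_seen("GET", "http://jian.news.baidu.com/"))
    print(dupefilter.request_seen("GET", "http://jian.news.baidu.com/"))
```

Because the real set lives in Redis, every machine pointing at the same REDIS_HOST sees the same history, which is what makes the crawl distributed.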
10. Example: crawling Baidu News

    # -*- coding: utf-8 -*-

    # Python 2 only: make utf-8 the default encoding for implicit str/unicode conversions
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    import scrapy
    import time, re
    from scrapy_redis.spiders import RedisSpider
    from mingan.items import bdtoprankitem
    from scrapy import Selector
    from spidertools import spiderTool


    class bdnewsSpider(RedisSpider):
        name = 'bdnews'
        #start_urls = ['http://tieba.baidu.com/f?ie=utf-8&kw=李晨吧&fr=search&red_tag=n3540397284']
        # Redis list that supplies the start URLs
        redis_key = "bdnews:start_urls"

        # New Baidu News crawl address
        domain = "http://jian.news.baidu.com/"

        def parse(self, response):
            # Whatever URL was pushed to Redis, request the Baidu News feed page
            urls = ["http://jian.news.baidu.com/"]
            for url in urls:
                yield scrapy.Request(url=url, meta={"flag": "bdnews"}, callback=self.parse_content1, dont_filter=True)
                #yield scrapy.Request(url=nodeqaurl,meta={"id":node,},callback=self.parse_qafenye,dont_filter=False)

        def parse_content1(self, response):
            selector = Selector(response)
            nodes = selector.xpath("//div[@id='feeds']/div[@class='feed long']")
            for node in nodes:
                # Full text of the feed entry
                keyword = node.xpath("string(.)").extract_first()
                #print keyword
                # Check for None before slicing, then keep at most 20 characters
                if keyword is not None:
                    keyword = keyword[0:20]
                    keyword = keyword.replace("评论", "").replace("不感兴趣", "")
                    item = bdtoprankitem(index="bdtoprank", keyword=keyword)
                    yield item
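
The keyword handling in parse_content1 can be tried on its own without running a crawl. The following is a small sketch under stated assumptions: the function name clean_keyword and the max_len parameter are hypothetical and do not appear in the original spider. It applies the same two steps, truncating to 20 characters and then removing the "评论" (comment) and "不感兴趣" (not interested) button labels that end up in the extracted text.

```python
# -*- coding: utf-8 -*-


def clean_keyword(text, max_len=20):
    """Truncate the extracted text, then remove the feed's button labels."""
    if text is None:
        # extract_first() returns None when the XPath matches nothing.
        return None
    text = text[:max_len]
    return text.replace(u"评论", u"").replace(u"不感兴趣", u"")


if __name__ == "__main__":
    print(clean_keyword(u"头条新闻评论"))
```

Note that truncation happens before the labels are removed, as in the spider, so the cleaned keyword can be shorter than 20 characters.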
11. To start the crawler, create a file named begin.py containing the two lines below, then run it:

    from scrapy import cmdline
    cmdline.execute("scrapy crawl bdnews".split())

12. In the Redis client, run lpush bdnews:start_urls http://www.baidu.com and the crawl begins.
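
The same seeding step can also be done from Python instead of the Redis client. This is a minimal sketch, not the author's code: the helper name seed_start_urls is hypothetical, and the client argument is assumed to be any object with an lpush(name, *values) method that returns the new list length, such as a redis-py Redis instance.

```python
def seed_start_urls(client, key, urls):
    """Push each URL onto the Redis list `key`; return the length after the last push."""
    length = 0
    for url in urls:
        # lpush prepends the value to the list and returns the new list length.
        length = client.lpush(key, url)
    return length
```

With the redis package from step 4 installed, a call might look like seed_start_urls(redis.Redis(host="192.168.13.26", port=6379), "bdnews:start_urls", ["http://www.baidu.com"]), using the host and port from settings.py.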
