Distributed crawling of the entire Dingdian Novel (顶点小说) site

Latest recommended article published on 2023-09-20 21:51:55

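This is a simple distributed crawler. The site itself is not complicated; the main purpose of this post is to write up the distributed layout.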
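As a rough picture of that layout, every worker machine points its scheduler and duplicate filter at one shared Redis instance. The fragment below is a minimal settings.py sketch using the setting names documented by scrapy-redis; the Redis address is an assumption and should be replaced with a host that all workers can reach.

```python
# settings.py (sketch): route scheduling and deduplication through Redis.

# Use the scrapy-redis scheduler so every worker pulls requests from one shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Store request fingerprints in Redis so a URL seen by one worker is skipped by the others.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the fingerprint set when a spider stops, so a crawl can resume.
SCHEDULER_PERSIST = True

# Assumed Redis location, for illustration only.
REDIS_URL = "redis://localhost:6379"
```

With settings along these lines, each machine starts the same spider and idles until a start URL is pushed into the Redis list named by the spider's redis_key, for example with redis-cli lpush followed by the key and the URL.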

First, create a spider with the command scrapy genspider -t crawl "spider_name" domain. Then add the following import:

from scrapy_redis.spiders import RedisCrawlSpider

This imports RedisCrawlSpider; make the spider inherit from it. Delete start_urls, and add regular expressions for the URLs you need to rules:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class NovelSpiderSpider(RedisCrawlSpider):
    name = 'novel_spider'

    # Redis list from which the spider reads its start URLs
    redis_key = 'novel_spider:start_urls'

    # Dots are escaped so that they match a literal "."
    list_page_lin = LinkExtractor(allow=r'https://www\.23us\.so/list/.*?html')
    novel_page = LinkExtractor(allow=r'https://www\.23us\.so/xiaoshuo/\d+\.html')
    chapter_page = LinkExtractor(allow=r'https://www\.23us\.so/files/article/html/\d+/\d+/index\.html')
    contents_page = LinkExtractor(allow=r'https://www\.23us\.so/files/article/html/\d+/\d+/\d+\.html')

    rules = (
        Rule(list_page_lin, follow=True),
        Rule(novel_page, callback='parse_intro', follow=True),
        Rule(chapter_page, follow=True),
        Rule(contents_page, callback='parse_item', follow=True),
    )
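As a quick sanity check on the four patterns, they can be tried against a few URLs with the standard re module before starting a crawl. This is only a sketch: the sample URLs are made up for illustration, and the dots are escaped here on the assumption that a literal "." was intended. LinkExtractor applies its allow patterns as a search over the URL, which re.search mirrors.

```python
import re

# Patterns taken from the spider's LinkExtractor rules, with dots escaped.
PATTERNS = {
    "list": r"https://www\.23us\.so/list/.*?html",
    "novel": r"https://www\.23us\.so/xiaoshuo/\d+\.html",
    "chapter_index": r"https://www\.23us\.so/files/article/html/\d+/\d+/index\.html",
    "contents": r"https://www\.23us\.so/files/article/html/\d+/\d+/\d+\.html",
}


def classify(url):
    """Return the names of every pattern that matches the URL."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, url)]


# Hypothetical sample URLs, one per page type.
samples = [
    "https://www.23us.so/list/1_1.html",
    "https://www.23us.so/xiaoshuo/414.html",
    "https://www.23us.so/files/article/html/0/414/index.html",
    "https://www.23us.so/files/article/html/0/414/159691.html",
]
for url in samples:
    print(url, "->", classify(url))
```

Each sample should fall under exactly one pattern; a URL that matches two would be handled only by the first matching Rule.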

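follow=True or follow=False controls whether the links found on matched pages are followed for deeper crawling. Because scrapy_redis has its own deduplication mechanism, I set all of them to True. The callbacks then parse whatever content you need. I split the storage in two: one store for the novel introductions and one for the chapters and their contents.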

def parse_item(self, response):
    item = XiaoshuoItem()
    # Marker field: the pipeline checks it to pick the storage backend; it is not saved
    item['rr'] = "21"
    item['n_name'] = response.xpath('//*[@id="amain"]/dl/dt/a[3]/text()').extract_first()
    item['c_name'] = response.xpath('//*[@id="amain"]/dl/dd[1]/h1/text()').extract_first()
    self.logger.debug('Parsed chapter: %s', item['c_name'])
    # Join the chapter's text nodes into a single string
    contents = response.xpath('//dd[@id="contents"]/text()').extract()
    item['c_contents'] = ''.join(contents)

    yield item

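Because the data goes into two separate stores, the pipeline has to tell the items apart, so I added a hidden field "rr" and check its content or type in the pipeline to decide where each item is stored. Checking with if isinstance(item, XiaoshuoItem) would also work, but I was concerned it might cause problems with a large amount of data, so I used the hidden field instead. The field is used only for this check and is not saved.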
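The two routing options can be compared side by side in a small sketch. Plain dict subclasses stand in for the Scrapy items here; IntroItem is a hypothetical name for the introduction item, and the assumption that it carries a non-string "rr" value is mine, not something the pipeline code states.

```python
class IntroItem(dict):
    """Stand-in for the item holding a novel's introduction (hypothetical name)."""


class XiaoshuoItem(dict):
    """Stand-in for the chapter item built in parse_item."""


def route_by_marker(item):
    """Route on the hidden 'rr' field: a string value marks a chapter item."""
    return "mysql" if isinstance(item.get("rr"), str) else "mongodb"


def route_by_type(item):
    """Route on the item's class instead of a marker field."""
    return "mysql" if isinstance(item, XiaoshuoItem) else "mongodb"


def strip_marker(item):
    """Return a copy without 'rr', since the marker is not meant to be stored."""
    return {key: value for key, value in item.items() if key != "rr"}


# Sample values are placeholders for illustration.
chapter = XiaoshuoItem(rr="21", n_name="some novel", c_name="chapter 1", c_contents="text")
intro = IntroItem(rr=1, n_name="some novel")

for item in (chapter, intro):
    print(type(item).__name__, route_by_marker(item), route_by_type(item))
```

Both functions give the same answer for these two items; the marker approach simply keeps the decision inside the item's data rather than in its class.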

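For storage, I put the book introductions in MongoDB and the book contents in MySQL. Storing everything in Redis would of course be somewhat faster, but I chose MySQL anyway, simply out of habit. When writing to MySQL, I create one table per novel, named after the novel.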

import pymysql

# MYSQL_HOST, MYSQL_PORT, MYSQL_USER, MYSQL_PASSWORD and MYSQL_DB are assumed
# to be imported from the project's settings.


class DingdianxiaoshuoPipeline(object):
    def __init__(self):
        self.conn = pymysql.connect(
            host=MYSQL_HOST,
            port=MYSQL_PORT,
            user=MYSQL_USER,
            password=MYSQL_PASSWORD,
            database=MYSQL_DB,
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Only chapter items carry a string in the hidden 'rr' field
        if isinstance(item['rr'], str):
            # Backtick-quote the table name so spaces or punctuation in a title do not break the SQL
            table = '`{}`'.format(item['n_name'].replace('`', '``'))
            create_table = 'CREATE TABLE IF NOT EXISTS {} (novel_chapter varchar(256),chapter_contents varchar(10000))'.format(table)
            self.cursor.execute(create_table)
            sql = 'INSERT INTO {} (novel_chapter,chapter_contents) VALUES (%s,%s)'.format(table)
            self.cursor.execute(sql, (item['c_name'], item['c_contents']))
            self.conn.commit()
        return item
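Since the table name comes straight from scraped text, it helps to see what the per-novel statements look like for a given title. The helper names below (quote_identifier, build_statements) are hypothetical and the title is a placeholder; the column definitions follow the pipeline above. This is a sketch of one way to quote the name, not the only one.

```python
def quote_identifier(name):
    """Wrap a MySQL identifier in backticks, doubling any backtick inside it."""
    return "`" + name.replace("`", "``") + "`"


def build_statements(novel_name):
    """Return the CREATE TABLE and INSERT statements for one novel's table."""
    table = quote_identifier(novel_name)
    create_sql = (
        "CREATE TABLE IF NOT EXISTS {} "
        "(novel_chapter varchar(256), chapter_contents varchar(10000))"
    ).format(table)
    insert_sql = (
        "INSERT INTO {} (novel_chapter, chapter_contents) VALUES (%s, %s)"
    ).format(table)
    return create_sql, insert_sql


create_sql, insert_sql = build_statements("My Novel")
print(create_sql)
print(insert_sql)
```

The chapter title and text still travel as query parameters (the %s placeholders); only the table name, which cannot be parameterized, is formatted into the string.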

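That is roughly the overall framework. I ran it for an afternoon and saw no errors. To be on the safe side, I still added a dynamic proxy pool and random User-Agents, out of respect for the site's creators. Because of the sheer volume of data, the crawl did not finish.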
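The proxy pool and random User-Agent code is not shown in this post, so the following is only a minimal sketch of how such a downloader middleware is commonly written. The class name, the User-Agent strings, and the proxy addresses are all hypothetical; a real pool would be fetched and refreshed from elsewhere. Setting request.meta["proxy"] is the mechanism Scrapy's built-in HttpProxyMiddleware reads.

```python
import random

# Hypothetical values for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/116.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/117.0",
]
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]


class RandomUserAgentProxyMiddleware:
    """Downloader middleware that picks a User-Agent and a proxy per request."""

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a random entry from the list.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Route the request through a randomly chosen proxy.
        request.meta["proxy"] = random.choice(PROXIES)
        # Returning None lets the request continue through the remaining middlewares.
        return None
```

A middleware like this would be enabled through the DOWNLOADER_MIDDLEWARES setting, with a priority that places it after Scrapy's default User-Agent middleware and before its proxy middleware.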