Hands-on Scrapy: Crawling the First 20 Pages of Novels from Hongxiu (红袖添香)

First, here is the final result:


[Screenshot: the final crawl results]

First, create the Scrapy project:

scrapy startproject novelcrawl  # my project is named novelcrawl
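
For reference, startproject generates roughly the following skeleton (the exact file list varies slightly between Scrapy versions). The items.py, pipelines.py, and settings.py edited below live in the inner novelcrawl package, and the spider file goes under spiders/:

novelcrawl/
    scrapy.cfg           # deploy configuration
    novelcrawl/          # the project's Python package
        __init__.py
        items.py         # item definitions (see below)
        pipelines.py     # item pipelines (see below)
        settings.py      # project settings (see below)
        spiders/
            __init__.py  # spider modules such as NovelCrawls.py go here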

Open the project in PyCharm:


[Screenshot: the project opened in PyCharm]

Here is my items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NovelcrawlItem(scrapy.Item):
    title = scrapy.Field()   # novel title (taken from the cover image's alt attribute)
    author = scrapy.Field()  # author name
    status = scrapy.Field()  # serialization status
    count = scrapy.Field()   # word count
    kind = scrapy.Field()    # genre
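
As a quick aside (my addition, not part of the project files), a scrapy.Item behaves like a dict restricted to the declared fields, which is how the spider and pipeline below pass data around:

import scrapy

class DemoItem(scrapy.Item):  # same pattern as NovelcrawlItem above
    title = scrapy.Field()

item = DemoItem()
item['title'] = 'demo'
print(item['title'])   # -> demo
# item['foo'] = 1      # would raise KeyError because 'foo' is not a declared Field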

NovelCrawls.py, the spider file (placed under the spiders directory):

# encoding:utf-8

from scrapy.spiders import CrawlSpider
import scrapy
import re
from bs4 import BeautifulSoup

from novelcrawl.items import NovelcrawlItem  # package name matches the startproject command above


class NovelCrawl(CrawlSpider):
    name = "novelcrawl"
    allowed_domains = ['www.hongxiu.com']  # domains only, not full URLs

    def start_requests(self):
        # build one request per listing page, pages 1 through 20
        requests = []
        for i in range(1, 21):
            url = "https://www.hongxiu.com/all?pageSize=10&gender=1&catId=-1&isFinish=-1&isVip=-1&size=-1&updT=-1&orderBy=0&pageNum=%d" % i
            requests.append(scrapy.Request(url))
        return requests

    def parse(self, response):
        self.logger.info("start crawling %s", response.url)
        item = NovelcrawlItem()
        soup = BeautifulSoup(response.body, 'html.parser', from_encoding='utf-8')
        # each field holds a list of matching tags; the pipeline pairs them up row by row
        item['title'] = soup.find_all('img', src=re.compile(r'//qidian.qpic.cn/qdbimg/\d+/c_\d+/\d+'))  # cover images; the title is in the alt attribute
        item['author'] = soup.find_all('a', class_='default')
        item['kind'] = soup.find_all('span', class_='org')
        item['status'] = soup.find_all('span', class_='pink')
        item['count'] = soup.find_all('span', class_='blue')
        yield item

For an explanation of why the BeautifulSoup calls above are written this way, see my other post: 实战爬虫-爬取红袖添香并存入数据库 (a hands-on crawler that scrapes Hongxiu and stores the data in a database).
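
As a rough illustration of what those find_all calls pick up, here is a self-contained sketch run against a hand-written fragment that mimics the listing markup; the HTML below is my own assumption for demonstration purposes, not copied from hongxiu.com:

import re
from bs4 import BeautifulSoup

sample = '''
<div class="book-info">
    <img src="//qidian.qpic.cn/qdbimg/349573/c_123456/150" alt="Sample Novel">
    <a class="default" href="/author/1">Some Author</a>
    <span class="org">Romance</span>
    <span class="pink">Ongoing</span>
    <span class="blue">123k words</span>
</div>
'''

soup = BeautifulSoup(sample, 'html.parser')
img = soup.find_all('img', src=re.compile(r'//qidian.qpic.cn/qdbimg/\d+/c_\d+/\d+'))[0]
print(img['alt'])                                          # -> Sample Novel (the title)
print(soup.find_all('a', class_='default')[0].get_text())  # -> Some Author
print(soup.find_all('span', class_='org')[0].get_text())   # -> Romance (the genre)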

Below is my pipelines.py file (it processes the scraped items):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class NovelcrawlPipeline(object):
    def __init__(self):
        # open a writable file to store the scraped data
        self.file = open('novel.csv', 'w+')

    def process_item(self, item, spider):
        # the spider yields parallel lists of tags; pair them up row by row
        rows = zip(item['title'], item['author'], item['status'],
                   item['count'], item['kind'])
        for title, author, status, count, kind in rows:
            fields = [title['alt'], author.get_text(), status.get_text(),
                      count.get_text(), kind.get_text()]
            self.file.write('\t'.join(f.encode('utf-8') for f in fields) + '\n')
        return item

    def close_spider(self, spider):
        # close the file once the crawl finishes
        self.file.close()

Uncomment the item pipeline setting in settings.py:

ITEM_PIPELINES = {
   'novelcrawl.pipelines.NovelcrawlPipeline': 300,
}
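
The original post does not show the run command; with the spider's name attribute set to "novelcrawl" as above, the crawl is started from the project root (the directory containing scrapy.cfg) with:

scrapy crawl novelcrawl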

Running the spider produces a novel.csv file; opening it gives the result shown at the top of this post.

Reposted from: https://www.cnblogs.com/qukingblog/p/7475284.html
