00
Installing Scrapy
In a Python environment, install it with pip:
pip install scrapy
If the installation fails with Twisted-related errors (usually because the C build toolchain is missing), you can download a prebuilt wheel and install it manually:
pip install Twisted-18.4.0-cp35-cp35m-win_amd64.whl
Once Twisted is installed, run the Scrapy install again.
01
Scrapy Commands
* The shell command is an interactive debugging tool. It is very useful: it lets you inspect and experiment with any page on the target site on the spot:
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000024830629A90>
[s]   item       {}
[s]   request    <GET http://www.cqgzfglj.gov.cn/gongzdt/>
[s]   response   <200 http://www.cqgzfglj.gov.cn/gongzdt/>
[s]   settings   <scrapy.settings.Settings object at 0x000002483063A2E8>
[s]   spider     <PubhouseSpider 'pubhouse' at 0x2483092c940>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
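Inside the shell, expressions like response.xpath('...').extract() return a list of matched strings. As a rough stand-in for that behavior using only the Python standard library (Scrapy itself uses the parsel/lxml selector library with full XPath support; the HTML snippet below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed HTML snippet so the stdlib XML parser accepts it
html = ('<html><body>'
        '<div class="book-info"><h1><em>Example Book</em></h1></div>'
        '</body></html>')

root = ET.fromstring(html)
# ElementTree supports a small XPath subset, including [@attr='value'] predicates
titles = [em.text for em in root.findall(".//div[@class='book-info']/h1/em")]
print(titles)  # a list of matches, like extract(); take [0] for the first one
```

This mirrors the pattern the spider below uses: select nodes with an XPath, get a list back, and index into it.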
02
Project Walkthrough
2.1 Create a New Project
scrapy startproject xxx
This command creates a directory named xxx under the current directory; its layout is the same as that of the books project shown in the figure below.
Enter the xxx directory and generate a spider:
cd xxx
scrapy genspider xxx xxx.com
2.2 Edit the Relevant Files
2.2.1 entrypoint.py
This file is created only to make debugging in PyCharm easier; it is just two lines (running it is equivalent to running scrapy crawl xxxxx on the command line):
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'xxxxx'])
2.2.2 items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class BooksItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # book title (used to name the output file)
    desc = scrapy.Field()   # accumulated chapter content
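A scrapy.Item exposes a dict-style interface for its declared fields, and the spider below relies on item.get('desc') returning None before that field is first set. A plain dict stands in for BooksItem in this sketch so it runs without Scrapy installed:

```python
# A plain dict mimics the dict-style access pattern of scrapy.Item here;
# BooksItem itself behaves the same way for its declared fields.
item = {}
item['title'] = 'Example Book'

print(item.get('title'))  # 'Example Book'
print(item.get('desc'))   # None: the field is unset, which the spider
                          # uses to detect the first chapter
```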
2.2.3 pipelines.py
# -*- coding: utf-8 -*-
import codecs


class BooksPipeline(object):
    def process_item(self, item, spider):
        # The spider yields a single item holding the whole book, so write
        # it out in one go; the with-block closes the file reliably.
        with codecs.open(item.get('title') + '.txt', 'w', encoding='utf-8') as f:
            f.write(item.get('desc') + '\n\f')
        return item
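What process_item does can be exercised in isolation: write the item's desc field to a .txt file named after its title. A minimal runnable sketch, with a plain dict standing in for the item and a temporary directory standing in for the project directory:

```python
import codecs
import os
import tempfile

# Stand-in for the item the spider yields (made-up values for illustration)
item = {'title': 'Example Book', 'desc': 'chapter text'}

outdir = tempfile.mkdtemp()
path = os.path.join(outdir, item.get('title') + '.txt')

# Same write logic as the pipeline: UTF-8 file named <title>.txt
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(item.get('desc') + '\n\f')

with codecs.open(path, 'r', encoding='utf-8') as f:
    print(f.read())
```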
2.2.4 books.py
# -*- coding: utf-8 -*-
import scrapy
from books.items import BooksItem
from scrapy.http import Request


class FictionSpider(scrapy.Spider):
    name = 'fiction'
    allowed_domains = ['www.qidian.com']
    start_urls = ['https://book.qidian.com/info/1005209812#Catalog']

    def parse(self, response):
        hxs = response
        # Extract the book title
        names = hxs.xpath('//div[@class="book-info "]/h1/em/text()').extract()[0]
        item = BooksItem()
        item['title'] = names
        # URL of the first chapter
        charterurl = hxs.xpath('//div[@class="volume"]/ul/li/a/@href').extract()[0]
        print(charterurl)
        # Follow the first chapter's URL, passing the item along in meta
        yield Request("https:" + charterurl, meta={'item': item},
                      callback=self.parsecharter, dont_filter=True)

    def parsecharter(self, response):
        hxs = response
        # Extract the chapter title
        titles = hxs.xpath('//h3[@class="j_chapterName"]/text()').extract()[0]
        item = response.meta['item']
        content = '\n' + str(titles) + '\n'
        # Paragraphs of the chapter body
        s = hxs.xpath('//div[@class="read-content j_readContent"]//p/text()').extract()
        for srt in s:
            srt = srt.replace("\u3000", " ")  # replace full-width indentation spaces
            content = content + srt + '\n'
        # Append this chapter to the accumulated description
        desc = item.get('desc')
        if desc is None:
            item['desc'] = content
        else:
            item['desc'] = desc + content
        chapters = hxs.xpath('//a[@id="j_chapterNext"]/@href').extract()  # next-chapter URL
        Nextt = hxs.xpath('//a[@id="j_chapterNext"]/text()').extract()[0]  # text of the "next" link
        if Nextt == '书末页':  # "end of book": this was the last chapter, so emit the item
            yield item
            return
        for chapter in chapters:
            yield Request("https:" + chapter, meta={'item': item},
                          callback=self.parsecharter, dont_filter=True)
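The core of parsecharter, cleaning each chapter's text and accumulating it into item['desc'], can be tested in isolation without Scrapy. A sketch with made-up chapter data (the real paragraphs come from the XPath queries above):

```python
# Standalone sketch of parsecharter's text handling: strip the full-width
# space (\u3000) used for paragraph indentation, join paragraphs with
# newlines, and append each chapter to the accumulated description.
def build_chapter(title, paragraphs):
    content = '\n' + str(title) + '\n'
    for p in paragraphs:
        content += p.replace('\u3000', ' ') + '\n'
    return content

item = {'title': 'Example Book'}  # plain dict standing in for BooksItem
for title, paragraphs in [('Chapter 1', ['\u3000\u3000First line.']),
                          ('Chapter 2', ['\u3000\u3000Second line.'])]:
    content = build_chapter(title, paragraphs)
    desc = item.get('desc')
    item['desc'] = content if desc is None else desc + content

print(item['desc'])
```

Because the item travels through meta from request to request, desc keeps growing until the last chapter, where the spider finally yields the item to the pipeline.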
2.3 Run the Spider
scrapy crawl fiction
After the run finishes, the output .txt file is generated in the current directory.