Scraping My Favorite Novel

Latest recommended article published 2020-07-10 13:58:18

weixin_30339457


Tags: python, web crawler, json

Original link: http://www.cnblogs.com/shaoqizhi/p/10416869.html


The ads that come with reading a novel online are annoying, so I wrote a crawler to save the novel locally.

#Start the project

scrapy startproject xt_tb

#First create a spider, here a CrawlSpider

scrapy genspider -t crawl [spider name] [domain]

#The settings.py changes are not explained here
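Since the post skips settings.py, here is a minimal sketch of the entries this kind of project usually needs. It assumes the project name xt_tb from the startproject command and the TzbzdsPipeline class defined in pipelines.py later in this post. The ROBOTSTXT_OBEY and DOWNLOAD_DELAY values are assumptions for illustration, not the author's settings.

```python
# settings.py (sketch; values marked "assumption" are not from the original post)

BOT_NAME = 'xt_tb'

# Register the pipeline so that process_item is called for every scraped item.
# The number (0-1000) sets the order when several pipelines are enabled.
ITEM_PIPELINES = {
    'xt_tb.pipelines.TzbzdsPipeline': 300,
}

# Assumption: respect robots.txt and wait between requests to be polite.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
```

Without the ITEM_PIPELINES entry the spider still runs, but no JSON file is written.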

#Crawl rules

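#The XPath expressions depend on the page being scraped; scrapy shell and Chrome's XPath tool help in working them out

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from xt_tb.items import XiaoshuoItem


class RentianSpider(CrawlSpider):
    name = 'rentian'
    allowed_domains = ['www.suimeng.com']
    start_urls = ['https://www.suimeng.com/files/article/html/6/6293/29891957.html']

    rules = (
        # Follow every link that ends in digits followed by ".html"
        Rule(LinkExtractor(allow=r'.+\d+\.html'), callback='parse_content', follow=True),
    )

    def parse_content(self, response):
        # get() returns None when nothing matches, so fall back to an empty string
        title = response.xpath("//div[@class='ctitle']/text()").get(default='').strip()
        contentList = response.xpath("//div[@class='ccontent']/text()").getall()
        content = ""

        # Remove the \r\n line breaks and join the fragments
        for contentStr in contentList:
            contentStr = contentStr.replace('\r\n', '')
            content = content + contentStr

        item = XiaoshuoItem(title=title, content=content)
        yield item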
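The allow pattern decides which links the spider follows, so it is worth checking it against a few URLs before starting a crawl. The sketch below uses only the standard re module and the pattern with the dot before "html" escaped. The helper name is_chapter_url and the index page URL are hypothetical, made up for this illustration.

```python
import re

# Same pattern as in the Rule, with the dot before "html" escaped
ALLOW_PATTERN = re.compile(r'.+\d+\.html')


def is_chapter_url(url):
    """Return True if the allow pattern matches somewhere in the URL."""
    return ALLOW_PATTERN.search(url) is not None


urls = [
    # start URL taken from the spider
    'https://www.suimeng.com/files/article/html/6/6293/29891957.html',
    # hypothetical index page, no digits directly before ".html"
    'https://www.suimeng.com/files/article/html/6/6293/index.html',
    # site root
    'https://www.suimeng.com/',
]

for url in urls:
    print(url, is_chapter_url(url))
```

Only the first URL matches, because the pattern requires at least one digit directly before ".html".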

#Define the items

import scrapy


class XiaoshuoItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()

# pipelines.py

#Pay attention to the format of the JSON that gets written, namely the [] and the commas

#Otherwise parsing it will fail

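import codecs
import json
import os


class TzbzdsPipeline(object):
    def __init__(self):
        super().__init__()  # Run the parent class constructor
        # codecs.open opens the underlying file in binary mode, which is what
        # allows close_spider to seek relative to the end of the file
        self.fp = codecs.open('xiaoshuo.json', 'w', encoding='utf-8')
        self.fp.write('[')
        self.count = 0  # Number of items written so far

    def process_item(self, item, spider):
        # Convert the item to a dict
        d = dict(item)
        # Convert the dict to a JSON string
        string = json.dumps(d, ensure_ascii=False)
        self.fp.write(string + ',\n')  # Add a comma and a newline after each record
        self.count += 1
        return item

    def close_spider(self, spider):
        if self.count:  # With no items there is no trailing comma to remove
            self.fp.seek(-2, os.SEEK_END)  # Move to the second-to-last character, the final comma
            self.fp.truncate()  # Delete the final comma and the newline after it
        self.fp.write(']')  # Close the array with a ']'
        self.fp.close()  # Close the file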
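To see why the brackets and commas matter, the same write-then-trim technique can be tried outside Scrapy and the result loaded back with json.load. This is a standalone sketch using only the standard library. The function name write_json_array and the sample records are hypothetical, and the file goes to a temporary directory rather than the project folder.

```python
import json
import os
import tempfile


def write_json_array(path, items):
    """Write items as one JSON array, one record per line."""
    # Binary mode is needed to seek relative to the end of the file
    with open(path, 'wb') as fp:
        fp.write(b'[')
        for item in items:
            line = json.dumps(item, ensure_ascii=False) + ',\n'
            fp.write(line.encode('utf-8'))
        if items:
            # Step back over the final ",\n" and cut it off
            fp.seek(-2, os.SEEK_END)
            fp.truncate()
        fp.write(b']')


# Made-up sample records in the same shape as XiaoshuoItem
sample = [
    {'title': 'Chapter 1', 'content': 'first chapter text'},
    {'title': 'Chapter 2', 'content': 'second chapter text'},
]

path = os.path.join(tempfile.mkdtemp(), 'xiaoshuo.json')
write_json_array(path, sample)

with open(path, encoding='utf-8') as fp:
    loaded = json.load(fp)

print(loaded == sample)
```

If the trailing comma were left in place, or the closing ']' were missing, json.load would raise a JSONDecodeError instead of returning the list.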

#All done. I put the scraped file into my own iOS project, and now I can read the novel there

Reposted from: https://www.cnblogs.com/shaoqizhi/p/10416869.html
