Scraping 17k.com novels with Scrapy: a worked example (method 1)


weixin_30644369


Tags: python, crawler, json

Original link: http://www.cnblogs.com/stevenshushu/p/9212854.html


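I happened to notice that 17k.com hosts some short stories, so I decided to scrape them with a crawler for my own reading. The chapter-list page at http://www.17k.com/list/271047.html is the source to crawl.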

This page lists 125 chapter titles, and each title links to a separate page holding that chapter's content.
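
Because each title links to its own chapter page, the crawl has two steps: collect the links from the list page, then follow each one. The spider treats the links as paths under http://www.17k.com, so they have to be turned into absolute URLs before they can be requested. Here is a minimal sketch of that step using only the standard library. The helper name `absolutize` and the sample hrefs are hypothetical placeholders, not real chapter paths.

```python
from urllib.parse import urljoin

LIST_PAGE_URL = "http://www.17k.com/list/271047.html"


def absolutize(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical hrefs, shaped like site-relative chapter links.
sample_hrefs = ["/chapter/0000/1.html", "/chapter/0000/2.html"]
chapter_urls = absolutize(LIST_PAGE_URL, sample_hrefs)
```

Inside a Scrapy callback, `response.urljoin(href)` performs the same resolution against the response's own URL, and it also leaves already-absolute links unchanged.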

Let's go straight to the core of the project, the code in the spiders directory:

# -*- coding: utf-8 -*-
import scrapy

from k17.items import K17Item

class A17kSpider(scrapy.Spider):
    name = '17k'
    allowed_domains = ['17k.com']
    start_urls = ['http://www.17k.com/list/271047.html']
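    def parse(self, response):
        # Each <dd> in the chapter list holds the links to individual chapters.
        for bb in response.xpath('//div[@class="Main List"]/dl[@class="Volume"]/dd'):
            # Tip: wrapping an XPath expression in normalize-space() strips \r, \n and \t from the extracted value.
            link = bb.xpath("a/@href").extract()
            for newurl in link:
                # The hrefs are site-relative, so resolve them against the response URL.
                new_url = response.urljoin(newurl)
                yield scrapy.Request(new_url, callback=self.parse_item)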


    def parse_item(self, response):
        for aa in response.xpath('//div[@class="readArea"]/div[@class="readAreaBox content"]'):
            item = K17Item()
            # Chapter title
            title = aa.xpath("h1/text()").extract()
            item['title'] = ''.join(title).replace('\n', '').strip()
            # Chapter body text
            dec = aa.xpath("div[@class='p']/text()").extract()
            # Remove newlines and full-width spaces (\u3000), then trim the ends
            item['describe'] = ''.join(dec).replace('\n', '').replace('\u3000', '').strip()
            yield item
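
The clean-up in `parse_item` can be tried in isolation. The sketch below reproduces the same steps (join the extracted text nodes, drop newlines and the full-width space `\u3000`, then strip the ends) on a made-up list of text nodes. `clean_text` is a hypothetical helper name and is not part of the original project.

```python
def clean_text(fragments):
    """Join extracted text nodes and remove layout whitespace."""
    text = "".join(fragments)
    text = text.replace("\n", "").replace("\u3000", "")
    return text.strip()


# Made-up text nodes, shaped like what .extract() returns for a chapter body.
sample = ["\n\u3000\u3000First paragraph.", "\n\u3000\u3000Second paragraph.\n  "]
cleaned = clean_text(sample)
```

Note that this joins paragraphs with no separator, exactly as the spider does. If paragraph breaks matter for reading, joining with `"\n"` after cleaning each fragment would be one alternative.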

Reposted from: https://www.cnblogs.com/stevenshushu/p/9212854.html
