Overall approach
spider
I guess interest really is the best motivator.
The ready-made txt copies of the novel I found online all had broken formatting and missing chapters, while some pirate sites actually have every chapter.
My original plan was to find the index page, scrape all the chapter links, join them into full URLs, and then crawl each page in detail, but after fiddling with it for a while I couldn't work out how.
So it turned into crawling straight from chapter one: grab the title and body with XPath, then pick up the address of the next chapter.
Following a blog post I found online, I use yield; the way I understand it, it is basically a loop done by recursion.
Since every page gets the same treatment, it all amounts to one call; a crude but simple solution where a single function,
process_item(self, item, spider), takes care of it.
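A minimal sketch of that shape, with made-up XPaths and a placeholder URL just to show the two yields (the real spider, with the real XPaths, is in novel_spider.py further down):
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from noveldownload.items import NoveldownloadItem

class SketchSpider(CrawlSpider):
    name = "sketch"
    start_urls = ["http://example.com/chapter-1.html"]  # placeholder

    def parse(self, response):
        sel = HtmlXPathSelector(response)
        item = NoveldownloadItem()
        item['title'] = sel.select('//h1/text()').extract()[0]          # made-up XPath
        item['content'] = sel.select('//div[@id="text"]').extract()[0]  # made-up XPath
        yield item  # handed to the pipeline

        # the "next chapter" link; the same parse() handles that page too
        for href in sel.select('//a[@class="next"]/@href').extract():
            yield Request(href, callback=self.parse)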
pipeline
The downloaded data is written out as HTML: the title goes inside an <h1>, and the content already comes wrapped in <p> tags, so it can be written as-is.
I had thought about also writing the body of a table of contents, <li> entries and the like,
but luckily calibre is powerful enough: it copes with irregular HTML and can generate the table of contents automatically, which saves a lot of steps.
A global counter is used to generate the file names, incremented for every chapter, so the chapter files come out in order.
The four files
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field


class NoveldownloadItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()    # chapter title
    content = Field()  # chapter body (HTML)
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for noveldownload project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/topics/settings.html

BOT_NAME = 'noveldownload'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['noveldownload.spiders']
NEWSPIDER_MODULE = 'noveldownload.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

ITEM_PIPELINES = {
    'noveldownload.pipelines.NoveldownloadPipeline': 300,
}

DOWNLOAD_DELAY = 0.2
#RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_ITEMS = 128
CONCURRENT_REQUESTS = 64
CONCURRENT_REQUESTS_PER_DOMAIN = 64

LOG_ENABLED = False
COOKIES_ENABLED = False
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html

INDEX = 100


class NoveldownloadPipeline(object):
    def process_item(self, item, spider):
        global INDEX
        # one file per chapter, named with the incrementing counter
        datapath = '../2/' + str(INDEX) + '.html'
        INDEX += 1
        fd = open(datapath, 'a')
        line = '<h1> ' + str(item['title']) + ' </h1>\n' + str(item['content']) + '\n'
        fd.write(line)
        fd.close()
        return item
novel_spider.py
# -*- coding: utf-8 -*-
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request

from noveldownload.items import NoveldownloadItem


class NovelSpider(CrawlSpider):
    name = "noveldownload"
    allowed_domains = ["lwxs520.com"]
    #start_urls = ["http://www.xxx.html"]
    start_urls = ["http://www.xxxx.html"]

    def parse(self, response):
        prefix = "http://www.xxx"
        item = NoveldownloadItem()
        sel = HtmlXPathSelector(response)
        # chapter title and body, located by XPath
        title = sel.select('//*[@id="bgdiv"]/table[2]/tbody/tr[1]/td/div/h1/text()').extract()
        content = sel.select('//*[@id="content"]/p').extract()
        item['title'] = title[0].encode('utf-8')
        item['content'] = content[0].encode('utf-8')
        yield item

        # address of the next chapter; crawl it with this same parse() method
        nexturl = sel.select('//*[@id="thumb"]/a[3]/@href').extract()
        for url in nexturl:
            url = prefix + url
            print url
            yield Request(url, callback=self.parse)
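For reference, once the project is in place the spider is started with the standard Scrapy command scrapy crawl noveldownload (the name defined above), run from the project directory.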
Summary
Uh, I am still not all that clear about the configuration side.
Another thing is that I don't know whether it was a network problem or something else,
but at first the crawl would always fail to GET after a dozen or so pages, so I had to manually change
start_urls and the global counter and restart.
Later on, though, while the speed still swings between fast and slow, it can now run a few hundred pages without breaking off, and I don't know the reason for that either.
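I never pinned down the real cause, but Scrapy does ship settings aimed at exactly this kind of flakiness; the values below are only a guess at what might help, not something taken from the original run:
# settings.py -- hypothetical tweaks, not part of the original project
DOWNLOAD_DELAY = 0.5     # slow down a little so the site is less likely to cut us off
RETRY_ENABLED = True     # retry failed requests instead of giving up on them
RETRY_TIMES = 5          # retries per failed request
DOWNLOAD_TIMEOUT = 30    # seconds before a request counts as failed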
The other thing is the data structures:
the return values are lists, or some HtmlXPathSelector kind of structure, so you need extract() or [0] before you get at the real value.
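Roughly, with this old HtmlXPathSelector API it looks like this (a small illustration, not lifted from the original code):
sel = HtmlXPathSelector(response)
nodes = sel.select('//h1/text()')   # a list-like collection of selectors
texts = nodes.extract()             # a plain list of unicode strings
title = texts[0]                    # finally the actual value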
All in all, rough as it is, the download got done,
and calibre turned it into a mobi book. It should also be easy to adjust: since every chapter is a separate file, a wrong order or a bad chapter is simple to fix, and a book that is still updating is easy to append to.
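The conversion step itself isn't spelled out above, so here is just one plausible way to script it: stitch the numbered chapter files the pipeline wrote into a single HTML file, then hand it to calibre's ebook-convert. The ../2/ directory and the numeric file names come from the pipeline above; the rest is an assumption rather than what was actually done.
# merge.py -- hypothetical helper, not part of the original project
import glob
import os

# sort 100.html, 101.html, ... numerically, not alphabetically
chapters = sorted(glob.glob('../2/*.html'),
                  key=lambda p: int(os.path.splitext(os.path.basename(p))[0]))
with open('book.html', 'w') as out:
    for path in chapters:
        with open(path) as fd:
            out.write(fd.read())

# then, with calibre installed:
#   ebook-convert book.html book.mobi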
Still, this is all very crude; it only implements the bare functionality.