A First Look at Using Scrapy-Splash to Crawl Every Novel on a Fiction Site

Latest recommended article published 2020-12-19 13:17:04

月半湾湾

Category: python

Article link: https://blog.csdn.net/qq_37566910/article/details/82024260

1 Install Scrapy

2 Install Scrapy-Splash
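
The post gives no commands for the two installation steps. Both packages are published on PyPI as `scrapy` and `scrapy-splash`, so `pip install scrapy scrapy-splash` is the usual route, and scrapy-splash also needs a running Splash server (commonly started from the `scrapinghub/splash` Docker image). As a minimal sketch, the check below reports which of the two modules cannot be imported yet; `missing_packages` is a hypothetical helper name, not something from the original project.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the module names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # scrapy-splash installs as the importable module `scrapy_splash`.
    missing = missing_packages(["scrapy", "scrapy_splash"])
    print("missing:", missing or "none")
```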

3 Connect Scrapy to Splash
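
The post does not show the configuration for this step. The sketch below follows the settings listed in the scrapy-splash README and would go in the project's `settings.py`. The `SPLASH_URL` value is an assumption: it presumes a Splash instance listening locally on port 8050, so adjust it to wherever Splash actually runs.

```python
# settings.py (fragment)

# Assumption: Splash is running locally on its default port.
SPLASH_URL = 'http://localhost:8050'

# Downloader middlewares that route SplashRequest objects through Splash.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Spider middleware that deduplicates repeated Splash arguments.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```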

5 Crawl every novel with a queue; the core code that generates the SplashRequests is as follows

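    from queue import Queue

    import requests
    from lxml import etree
    from scrapy_splash import SplashRequest

    # Method of the spider class; self.base_urls, self.story_base and
    # self.headers are attributes defined elsewhere on the spider.
    def start_requests(self):
        queue = Queue()
        queue.put(self.base_urls)
        seen = {self.base_urls}  # novel pages already queued, to avoid re-crawling
        # Loop until the queue is drained; the original `queue != None` never ends.
        while not queue.empty():
            response = requests.get(queue.get(), headers=self.headers)
            if response.status_code != 200:
                continue
            html = etree.HTML(response.text)
            # Chapter links listed on the novel's index page.
            chapter_hrefs = html.xpath('//div[@id="list"]//dd/a/@href')
            base = 'http://www.biquge.com.tw'
            for href in chapter_hrefs:
                yield SplashRequest(url=base + href, callback=self.parse)

            # Links to related novels in the page footer; as in the original,
            # the last footer link is skipped.
            for href in html.xpath('//div[@class="footer_link"]//a/@href')[:-1]:
                url = self.story_base + href
                if url not in seen:
                    seen.add(url)
                    queue.put(url)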

The first URL put into the queue is a novel's index page, which lists all of its chapters; the page source contains the link to each chapter.

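Next, each chapter is handed to the scheduler as a SplashRequest. Finally, the other novels linked from that same page are added to the queue, and the crawl continues.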
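
The traversal described above can be sketched without any network access. In this minimal, hypothetical version, `crawl_all`, `fetch_page` and the toy `SITE` dictionary are illustrative names that do not appear in the original project; the dictionary stands in for the real site, and a `seen` set keeps two novels that link to each other from being queued forever.

```python
from collections import deque


def crawl_all(start_url, fetch_page):
    """Breadth-first walk over novel index pages.

    fetch_page(url) must return (chapter_urls, related_novel_urls).
    Returns the chapter URLs in the order they were discovered.
    """
    queue = deque([start_url])
    seen = {start_url}
    chapters = []
    while queue:
        url = queue.popleft()
        chapter_urls, related = fetch_page(url)
        chapters.extend(chapter_urls)  # in the spider these become SplashRequests
        for novel in related:
            if novel not in seen:  # never queue the same novel twice
                seen.add(novel)
                queue.append(novel)
    return chapters


# Toy site: two novels whose pages link to each other.
SITE = {
    "/novel/a/": (["/novel/a/1.html", "/novel/a/2.html"], ["/novel/b/"]),
    "/novel/b/": (["/novel/b/1.html"], ["/novel/a/"]),
}

if __name__ == "__main__":
    print(crawl_all("/novel/a/", SITE.__getitem__))
```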

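Because the site's pages are simply structured, there isn't much code. Splash crawls asynchronously and is very fast: on a 100 Mbps broadband connection it can fetch 700 pages in 1 minute. Since asynchronous responses arrive out of order, also pay attention to how chapters are ordered when writing the data to the database.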
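
One way to handle the ordering issue, sketched under assumptions: record each chapter's position when its request is generated (for example with `enumerate`, passed along in the request's `meta`), then sort on that position before writing to the database. The item fields `novel`, `index` and `title` and the helper `order_chapters` are hypothetical; the original post does not show its item layout.

```python
def order_chapters(items):
    """Sort scraped chapter items by novel, then by chapter position."""
    return sorted(items, key=lambda item: (item["novel"], item["index"]))


if __name__ == "__main__":
    # Items in the order the asynchronous responses happened to arrive.
    arrived = [
        {"novel": "A", "index": 2, "title": "Chapter 3"},
        {"novel": "B", "index": 0, "title": "Chapter 1"},
        {"novel": "A", "index": 0, "title": "Chapter 1"},
        {"novel": "A", "index": 1, "title": "Chapter 2"},
    ]
    for item in order_chapters(arrived):
        print(item["novel"], item["title"])
```

Sorting in memory is only practical for small batches; storing the index as a column and ordering at query time is the alternative when the crawl is large.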

Code: https://github.com/yuebanwanwan/ScrapySplashCrawlAllStory.git
