Scraping all chapter information from the 17k novel site with Scrapy (MongoDB, pagination)

andux

Last modified 2023-10-19 11:06:25


Categories: crawlers, mongodb. Tags: crawler, scrapy, python

First published 2023-10-19 10:53:41

Article link: https://blog.csdn.net/andux/article/details/133921507


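While following a tutorial to scrape JD.com, I got results some of the time and nothing at other times; JD's anti-scraping measures are quite strong. I followed the tutorial step by step and still got no output, even though my code looked correct. When I tried the same approach on 17k, it worked without trouble, so 17k is the easier target. For now I can evidently only scrape simple sites; once a site has anti-scraping measures, I quickly run out of options.

The 17k novel site lists a lot of novels in each category, so to make the results easier to inspect I narrowed the selection of books.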

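[Image]

Today is 2023-10-19, 11:02:48 AM. CSDN image uploads are failing today: no matter which browser I use, and whether I paste from the clipboard or upload a file, the upload fails, so I cannot add images to this article.

With the narrowed selection there are only 6 pages and a little over 100 results, so the crawl is also faster.

Practice more, read more tutorials, and eventually the code makes sense. At first I could only copy the tutorial's code and swallow it whole to get a rough feel for it; now I need to work toward understanding what the code actually means. It turns out a Scrapy spider can also use a function recursively; I just didn't know how.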

app.py

import re

import scrapy

from ..items import Scrapy05Item


class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["www.17k.com"]
    # start_urls = ["https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html"]
    start_urls = ["https://www.17k.com/all/book/2_21_115_1_3_0_1_0_1.html"]


    def parse(self, response):
        # Collect the links in the td4 column of the listing page
        links = response.xpath('//td[@class="td4"]/a/@href').getall()
        for link in links:
            # The hrefs appear to be protocol-relative; urljoin adds the scheme
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_detail)
        # Pagination: schedule this same callback for the next page
        match = re.search(r'book/2_21_115_1_3_0_1_0_(\d+)\.html', response.url)
        # Stop when the page number cannot be read or the page has no links
        if match and links:
            page = int(match.group(1)) + 1
            next_url = "https://www.17k.com/all/book/2_21_115_1_3_0_1_0_{}.html".format(page)
            yield scrapy.Request(url=next_url, callback=self.parse)


    def parse_detail(self, response):
        book_type = response.xpath('//div[@class="infoPath"]/a[3]/text()').get()
        # Skip pages without the expected breadcrumb instead of yielding an empty item
        if book_type is None:
            return
        item = Scrapy05Item()
        item["book_type"] = book_type
        item["title"] = response.xpath('//div[@class="infoPath"]/a[4]/text()').get()
        item["chapter"] = response.xpath('//*[@id="readArea"]/div[1]/h1/text()').get()
        # Only the first paragraph is stored; default to "" so the slice below cannot fail on None
        item["content"] = response.xpath('//*[@id="readArea"]/div[1]/div[2]/p[1]/text()').get(default="")
        print(item["book_type"], item["title"], item["chapter"], item["content"][:10])
        yield item

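A regular expression extracts the page parameter from the current URL, that is, the number that changes from page to page; adding 1 to it gives the URL of the next page. yield scrapy.Request() can not only hand a response to another callback, it can also name the current method as its own callback, so the spider walks through the pages recursively. Why didn't I think of that before? A suitable tutorial really is a shortcut.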
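To make the page-number step concrete, here is a minimal standalone sketch of it, separate from the spider. The helper name next_page_url and the assumption that the page index is always the last number before ".html" are mine, not something the site documents.

```python
import re
from typing import Optional

# Assumption: the page index is the last underscore-separated number before ".html".
PAGE_PATTERN = re.compile(r"_(\d+)\.html$")


def next_page_url(current_url: str) -> Optional[str]:
    """Return the URL of the following listing page, or None if no page number is found."""
    match = PAGE_PATTERN.search(current_url)
    if match is None:
        return None
    next_page = int(match.group(1)) + 1
    # Replace only the captured digits and leave the rest of the URL untouched.
    return current_url[:match.start(1)] + str(next_page) + current_url[match.end(1):]


print(next_page_url("https://www.17k.com/all/book/2_21_115_1_3_0_1_0_1.html"))
# Output: https://www.17k.com/all/book/2_21_115_1_3_0_1_0_2.html
```

Returning None when the pattern does not match gives the caller a clean way to stop paginating instead of raising an IndexError.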

items.py

import scrapy


class Scrapy05Item(scrapy.Item):
    book_type = scrapy.Field()
    title = scrapy.Field()
    chapter = scrapy.Field()
    content = scrapy.Field()

pipelines.py

import pymongo


class Scrapy05Pipeline:
    def open_spider(self, spider):
        # Called once when the spider starts
        print("-" * 10, "start", "-" * 10)
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["17k"]
        self.collection = self.db["chapter"]
        self.collection.delete_many({})  # empty the collection before each run

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; release the MongoDB connection
        self.client.close()
        print("-" * 10, "end", "-" * 10)

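I increasingly find that the code in pipelines has become boilerplate: it needs almost no changes, and I can reuse what I wrote for earlier projects.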
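If the pipeline really is the same from project to project, one way to reuse it unchanged is to read the connection details from settings.py. The sketch below is only an illustration: the class name ReusableMongoPipeline and the setting names MONGO_URI, MONGO_DATABASE and MONGO_COLLECTION are hypothetical names chosen here, with defaults that mirror the values used in this article.

```python
class ReusableMongoPipeline:
    """Hypothetical pipeline: connection details come from settings, not from the code."""

    def __init__(self, mongo_uri, db_name, collection_name):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI, MONGO_DATABASE and MONGO_COLLECTION are assumed custom
        # setting names; the defaults mirror the values used in this article.
        settings = crawler.settings
        return cls(
            settings.get("MONGO_URI", "mongodb://localhost:27017"),
            settings.get("MONGO_DATABASE", "17k"),
            settings.get("MONGO_COLLECTION", "chapter"),
        )

    def open_spider(self, spider):
        import pymongo  # imported here so the class loads even without the driver

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()
```

Unlike the pipeline above, this sketch does not clear the collection at startup, so repeated runs append documents; whether that is desirable depends on the project.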

settings.py

BOT_NAME = "scrapy_05"

SPIDER_MODULES = ["scrapy_05.spiders"]
NEWSPIDER_MODULE = "scrapy_05.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

ITEM_PIPELINES = {
   "scrapy_05.pipelines.Scrapy05Pipeline": 300,
}

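Appended below is a reference on extracting substrings in Python: https://www.ycpai.cn/python/xBRw8f2K.html

1. Extracting substrings with slicing

In Python, you can extract part of a string with a slice. A slice takes a start position and an end position, with the following syntax: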

```python
string[start:end]
```

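Here start is the start index and end is the end index, both counted from 0. Slices are half-open: the character at start is included and the character at end is not. If start is omitted, the slice begins at the first character; if end is omitted, it runs to the end of the string.

Some slicing examples: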

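```python
string = "Hello, World!"

# First 5 characters
print(string[:5])  # Output: Hello

# From the 8th character (index 7) to the end
print(string[7:])  # Output: World!

# From the 3rd character to the 8th character (indices 2 to 7)
print(string[2:8])  # Output: llo, W
```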

2. Extracting substrings with split

You can also use the split method. It splits a string at a given separator and returns the pieces as a list. The syntax is:

```python
string.split(separator)
```

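Here separator is the delimiter. If no separator is given, the string is split on runs of whitespace (spaces, tabs, newlines), and leading and trailing whitespace is ignored.

Some split examples: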

```python
string = "Hello, World!"

# Split on the comma
print(string.split(","))  # Output: ['Hello', ' World!']

# Split on whitespace
print(string.split())  # Output: ['Hello,', 'World!']
```
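One detail that is easier to see in code: calling split with no argument is not the same as splitting on a single space. The short sketch below uses made-up sample strings to show the difference, plus the optional maxsplit argument.

```python
text = "  Hello,   World!  "

# No argument: split on runs of whitespace and drop empty strings
print(text.split())     # Output: ['Hello,', 'World!']

# Explicit single space: every space is a boundary, so empty strings appear
print(text.split(" "))  # Output: ['', '', 'Hello,', '', '', 'World!', '', '']

# maxsplit limits the number of splits, which helps with "key: value" text
print("title: A: B".split(": ", 1))  # Output: ['title', 'A: B']
```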

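3. Extracting substrings with regular expressions

You can also use regular expressions. A regular expression matches a pattern in a string and returns what it matched. Python's re module provides regular expression support. The syntax is: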

```python
re.findall(pattern, string)
```

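Here pattern is the regular expression and string is the text to search. re.findall returns all non-overlapping matches as a list.

Some examples of extracting substrings with regular expressions: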

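```python
import re

string = "Hello, World!"

# Match every word that starts with an uppercase letter
print(re.findall(r'\b[A-Z]\w+', string))  # Output: ['Hello', 'World']

# Match every word that starts with a lowercase letter
# (neither word here starts with one, so nothing matches)
print(re.findall(r'\b[a-z]\w+', string))  # Output: []
```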
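Two regex behaviours matter for the spider earlier in this article: word boundaries and capture groups. The following is a small sketch of both; the sample sentence is made up, and the URL is the listing URL pattern used above with page 3 filled in.

```python
import re

# \b marks a word boundary, so [a-z] must be the first letter of a word
print(re.findall(r"\b[a-z]\w*", "hello World, python"))  # Output: ['hello', 'python']

# With one capture group, findall returns only the captured text
url = "https://www.17k.com/all/book/2_21_115_1_3_0_1_0_3.html"
print(re.findall(r"_(\d+)\.html", url))  # Output: ['3']

# re.search returns a match object (or None), which is easy to check before use
match = re.search(r"_(\d+)\.html", url)
page = int(match.group(1)) if match else None
print(page)  # Output: 3
```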

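4. Extracting substrings with string methods

String methods can help as well. Python has many of them; commonly used ones include find, index, and replace. Only find and index are covered here.

The find method searches a string for a substring and returns the index of its first occurrence. If the substring is not found, it returns -1. The syntax is: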

```python
string.find(substring)
```

Here substring is the text to look for.

The index method works like find, except that it raises a ValueError when the substring is not found. The syntax is:

```python
string.index(substring)
```

Some examples using find and index:

```python
string = "Hello, World!"

# Locate a substring
print(string.find("World"))  # Output: 7
print(string.index("World"))  # Output: 7

# Look for a substring that is not present
print(string.find("Python"))  # Output: -1
# print(string.index("Python"))  # raises ValueError
```
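find and index only report a position; to actually cut out a substring they are usually combined with slicing. A minimal sketch, assuming a hypothetical helper named text_between and using this article's listing URL as sample input:

```python
def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end]


url = "https://www.17k.com/all/book/2_21_115_1_3_0_1_0_1.html"
print(text_between(url, "book/", ".html"))     # Output: 2_21_115_1_3_0_1_0_1
print(text_between(url, "chapter/", ".html"))  # Output: None
```

Using find rather than index keeps the missing-marker case as an ordinary return value instead of an exception.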

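To sum up, this section covered several ways to extract substrings in Python: slicing, the split method, regular expressions, and string methods. Each suits different situations, so choose according to the task at hand. Knowing them makes it easier to process strings and pull out the information you need.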
