Scraping a Novel with Python

6月的夕夕

Article link: https://blog.csdn.net/u013250169/article/details/112723825

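On the commute, on the subway, reading novels on your phone from one novel site or another, you keep running into inexplicable pop-ups. Annoying, isn't it? We can write a small Python tool to scrape the novel we want to read.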

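A friendly reminder: read the Cybersecurity Law, which took effect in June 2017, carefully. Personal privacy data, copyrighted material, and anything restricted by a site's robots.txt must not be scraped at will.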
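Since the reminder above mentions robots.txt, here is a minimal sketch of how one might check a URL against robots.txt rules using the standard library's `urllib.robotparser`. The robots.txt content and the `example.com` URLs are hypothetical and only there so the example runs offline; `build_parser` and `is_allowed` are illustrative helper names, not part of the original post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /member/
Allow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/book/1"))
print(is_allowed(parser, "https://example.com/member/profile"))
```

Against a live site, one would instead call `set_url()` with the site's robots.txt address and then `read()` before calling `can_fetch()`.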

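We can use bp (Burp Suite; the browser's F12 developer tools also work, though less intuitively) to see what comes back when we request the catalog page URL: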

The result is JSON. The chapterName field is the chapter title and chapterId is the chapter's ID, which is needed later to fetch each chapter's content.
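To make the catalog structure concrete, here is a small sketch that parses a made-up response. Only the names `chapterlist`, `chapterId`, and `chapterName` come from the post; the sample values and the exact nesting are assumptions, and `parse_catalog` is a hypothetical helper.

```python
import json

# Hypothetical catalog payload; the real response may carry more fields.
SAMPLE_CATALOG_JSON = """
{"chapterlist": [
  {"chapterId": 1001, "chapterName": "Chapter 1"},
  {"chapterId": 1002, "chapterName": "Chapter 2"}
]}
"""


def parse_catalog(raw_json):
    """Return a list of (chapter_id, chapter_name) pairs from the catalog JSON."""
    data = json.loads(raw_json)
    return [(node["chapterId"], node["chapterName"])
            for node in data["chapterlist"]]


chapters = parse_catalog(SAMPLE_CATALOG_JSON)
for chapter_id, chapter_name in chapters:
    print(chapter_id, chapter_name)
```

Keeping the ID and the name together in one pair avoids having to keep two parallel lists in step.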

Substituting a specific chapter ID into the content request URL returns the following:

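This result is also JSON. The content field holds the text of the novel. After splitting out the text, it is best to strip <br/> tags, spaces, and other characters that hurt readability.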
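The cleanup step described above can be sketched on its own, without any network access. The sample string is invented; the promotional marker is the one the post's code splits on. `clean_content` is a hypothetical helper name. Note that deleting all whitespace suits Chinese text; for a space-delimited language one would collapse whitespace instead of removing it.

```python
import re

# Marker after which the site appends promotional text (taken from the post's code).
PROMO_MARKER = "千万网友推荐:"


def clean_content(raw):
    """Cut promo text, delete whitespace, and turn <br/> runs into single newlines."""
    text = raw.split(PROMO_MARKER)[0]
    # \s also matches the full-width space (U+3000) used for paragraph indents.
    text = re.sub(r"\s+", "", text)
    # Whitespace is already gone, so "<br />" has become "<br/>" by this point.
    text = re.sub(r"(<br/?>)+", "\n", text)
    return text.strip("\n")


sample = "\u3000\u3000第一段。<br/><br/>\u3000\u3000第二段。<br/>千万网友推荐:广告"
cleaned = clean_content(sample)
print(cleaned)
```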

The corresponding code:

import re

import requests

# The original post does not give these URLs; fill them in before running.
# CONTENT_URI must contain "{}" where the chapter id is substituted.
CATALOG_URI = ""
CONTENT_URI = ""

def extract_catalog():
    """Return the chapter ids listed in the catalog JSON."""
    catalog_list = requests.get(CATALOG_URI).json()['chapterlist']
    return [catalog_node['chapterId'] for catalog_node in catalog_list]

def extract_one_content(content_id, story_file):
    """Fetch one chapter and append its title and text to story_file."""
    real_uri = CONTENT_URI.format(content_id)
    response_result = requests.get(real_uri).json()['result']
    title_name = response_result['chapterName']
    # Drop the promotional text that follows the chapter body.
    content = response_result['content'].split("千万网友推荐:")[0]
    # Remove all whitespace, then turn <br/> tags into line breaks.
    content = re.sub(r'\s+', '', content)
    story_file.write("{0}\n".format(title_name))
    story_file.write(content.replace("<br/>", "\n"))
    story_file.write("\n\n")

def extract_content(ids, story_file):
    for content_id in ids:
        extract_one_content(content_id, story_file)

def main():
    # The with block closes the file even if a request fails midway.
    with open("/tmp/story_guigushi.txt", "w", encoding="utf-8") as story_file:
        extract_content(extract_catalog(), story_file)

if __name__ == '__main__':
    main()

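The novel above was scraped through the site's mobile (m.) URL. After downloading it I found only 20-odd chapters; the rest require login. So I tried downloading from the site's www main site instead.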

The catalog page looks like this:

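The catalog entries are wrapped in a tags: the chapterid in each link is the chapter ID, and the title attribute is the chapter name. Substituting chapterid into the content URL fetches each chapter. The content page looks like this: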

The content turns out to be nested inside the div whose id attribute is chapterContent. So we can scrape it with Python as follows:
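To illustrate the idea of pulling text out of the div with id chapterContent without touching the network, here is a sketch that swaps in the standard library's `html.parser` in place of BeautifulSoup, so it runs with no third-party packages. The sample HTML is invented and `ChapterContentParser` is a hypothetical class; the real page layout may differ.

```python
from html.parser import HTMLParser


class ChapterContentParser(HTMLParser):
    """Collect text found inside <div id="chapterContent">, one entry per text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside the target div; counts nested divs
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == "chapterContent":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            text = "".join(data.split())  # delete all whitespace (suits Chinese text)
            if text:
                self.lines.append(text)


# Hypothetical page fragment, assumed for illustration only.
SAMPLE_HTML = """
<div id="nav">menu</div>
<div id="chapterContent">
  第一段。<br/>
  第二段。
</div>
<div id="footer">footer</div>
"""

content_parser = ChapterContentParser()
content_parser.feed(SAMPLE_HTML)
content_parser.close()
print("\n".join(content_parser.lines))
```

The depth counter is what keeps text from sibling divs such as the navigation bar or footer out of the result.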

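import re

import requests
from bs4 import BeautifulSoup

# The original post does not give these URLs; fill them in before running.
# CONTENT_URI must contain "{}" where the chapter id is substituted.
CATALOG_URI = ""
CONTENT_URI = ""
# Prefix of chapter links on the catalog page; used to filter the a tags.
TITLE_MARK = "/ChapterDetail.aspx?bookid=430178&chapterid="

def catalog_filter(tag):
    """Match plain a tags (no id or class) whose href is a chapter link."""
    return (tag.name == "a" and tag.has_attr('href')
            and not tag.has_attr('id') and not tag.has_attr('class')
            and tag.attrs['href'].startswith(TITLE_MARK))

def extract_catalog():
    """Return parallel lists of chapter ids and chapter names."""
    catalog_ids = []
    catalog_names = []
    soup = BeautifulSoup(requests.get(CATALOG_URI).text, 'html.parser')
    for tag in soup.find_all(catalog_filter):
        # The chapter id is whatever follows the fixed prefix in the href.
        catalog_ids.append(tag.attrs['href'][len(TITLE_MARK):])
        catalog_names.append(tag.attrs.get('title', ''))
    return catalog_ids, catalog_names

def extract_one_content(content_id, title_name, story_file):
    """Fetch one chapter page and append its title and text to story_file."""
    real_uri = CONTENT_URI.format(content_id)
    soup = BeautifulSoup(requests.get(real_uri).text, 'html.parser')
    # Only the first div with id="chapterContent" holds the chapter text.
    tag_content = soup.find('div', attrs={'id': 'chapterContent'})
    if tag_content is None:
        return
    story_file.write("{0}\n".format(title_name))
    # .strings yields text nodes only, so <br/> tags never appear here.
    for content in tag_content.strings:
        content = re.sub(r'\s+', '', content)
        if content:  # skip nodes that were whitespace only
            story_file.write(content + "\n")
    story_file.write("\n\n")

def extract_content(ids, title_names, story_file):
    for content_id, title_name in zip(ids, title_names):
        extract_one_content(content_id, title_name, story_file)

def main():
    with open("/tmp/story_guigushi.txt", "w", encoding="utf-8") as story_file:
        ids, title_names = extract_catalog()
        extract_content(ids, title_names, story_file)

if __name__ == '__main__':
    main()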

Here TITLE_MARK is "/ChapterDetail.aspx?bookid=430178&chapterid="; a tags are kept only if their href attribute starts with it.
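The post's code takes the chapter ID by slicing off the TITLE_MARK prefix. As an alternative sketch, the ID can also be read from the query string with `urllib.parse`, which would keep working if further parameters were appended after chapterid. The chapter ID `12345` below is made up, and `chapter_id_from_href` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlparse

TITLE_MARK = "/ChapterDetail.aspx?bookid=430178&chapterid="


def chapter_id_from_href(href):
    """Return the chapterid value of a chapter link, or None if href is not one."""
    if not href.startswith(TITLE_MARK):
        return None
    # parse_qs drops blank values by default, so an empty chapterid yields None.
    values = parse_qs(urlparse(href).query).get("chapterid")
    return values[0] if values else None


print(chapter_id_from_href(TITLE_MARK + "12345"))
print(chapter_id_from_href("/about.html"))
```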

After scraping through the main site, we find there are still only 20-odd chapters; the remaining ones require login.

So, as law-abiding citizens, we should respect intellectual property. Let's pick another novel to read!
