Scraping an Entire Novel with BeautifulSoup + a Thread Pool
Target URL: https://www.bixuege.com/7_7120/
- Open the page, view its source, and check whether the content we want is there.
"""The source of this page is delightfully cooperative: it lays out exactly what we want."""
-
First, write a helper that fetches an HTML page:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
    "Referer": "https://www.bixuege.com/xuanhuan/"
}
res = None
try:
    res = requests.get(url, headers=headers)
    res.encoding = res.apparent_encoding  # guess the charset from the body
except Exception:
    return None
else:
    return res.text
finally:
    if res is not None:  # res is unset if requests.get() itself raised
        res.close()
-
Parse the HTML with BeautifulSoup and grab the nodes we need:
soup = BeautifulSoup(html, 'lxml')
div_list = soup.find('div', attrs={"id": "list"})
dd_list = div_list.find_all('dd')
-
Open a chapter page and check whether the chapter text is in the source as well.
It's practically a giveaway: the text sits right inside the div node with id "content".
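To double-check, a minimal probe works (the chapter URL below is a placeholder; substitute any href taken from the index page):

import requests
from bs4 import BeautifulSoup

chapter_url = "https://www.bixuege.com/7_7120/1234.html"  # hypothetical URL
res = requests.get(chapter_url, headers={"User-Agent": "Mozilla/5.0"})
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text, 'lxml')
div = soup.find('div', attrs={"id": "content"})
print(div is not None)   # True if the chapter text is server-rendered
print(div.text[:100])    # preview the opening of the chapter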
-
Write a method that saves the content:
soup = BeautifulSoup(html, 'lxml')
content = soup.find('div', attrs={"id": "content"}).text
with open(path, 'w', encoding='utf-8') as f:
    f.write(content)
print('[%s] saved!' % title)
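One caveat: chapter titles can contain characters that are illegal in file names (e.g. ? or *), which would make open() fail. A small sanitizing helper, my own addition rather than part of the original script, sidesteps that:

import re

def safe_filename(title):
    # Replace characters that Windows (and some filesystems) forbid in file names.
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

# e.g. path = '笔趣阁小说/' + safe_filename(title) + '.txt'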
-
Add a thread pool to speed up the crawl:
with ThreadPoolExecutor(100) as t:
    for content in contents:
        url, title = content
        t.submit(save_contents, url, title)
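Note that t.submit discards exceptions unless the returned futures are inspected, so a failed chapter disappears silently. A sketch of how failures could be surfaced, assuming the same save_contents signature:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(100) as t:
    futures = {t.submit(save_contents, url, title): title
               for url, title in contents}
    for fut in as_completed(futures):
        try:
            fut.result()  # re-raises any exception from the worker thread
        except Exception as e:
            print('[%s] failed: %s' % (futures[fut], e))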
Run it, then take a look at the saved content.
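A quick spot check of the output, assuming the run has completed:

import os

files = os.listdir('./笔趣阁小说')
print('%d chapters saved' % len(files))
# Preview the first 200 characters of one chapter.
with open(os.path.join('笔趣阁小说', files[0]), encoding='utf-8') as f:
    print(f.read()[:200])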
Full source code:
import os
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor


def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
        "Referer": "https://www.bixuege.com/xuanhuan/"
    }
    res = None
    try:
        res = requests.get(url, headers=headers)
        res.encoding = res.apparent_encoding  # guess the charset from the body
    except Exception:
        return None
    else:
        return res.text
    finally:
        if res is not None:  # res is unset if requests.get() itself raised
            res.close()


def get_content_url(html):
    # Collect (chapter URL, chapter title) pairs from the index page.
    content_info = []
    soup = BeautifulSoup(html, 'lxml')
    div_list = soup.find('div', attrs={"id": "list"})
    dd_list = div_list.find_all('dd')
    # The first few <dd> entries are the "latest chapters" block; skip them.
    for dd in dd_list[9:]:
        href, title = dd.a['href'], dd.a.text
        content_info.append(("https://www.bixuege.com" + href, title))
    return content_info or None


def save_contents(url, title):
    # exist_ok avoids a race when 100 threads hit this check at once.
    os.makedirs('./笔趣阁小说', exist_ok=True)
    path = '笔趣阁小说/' + title + '.txt'
    html = get_html(url)
    soup = BeautifulSoup(html, 'lxml')
    content = soup.find('div', attrs={"id": "content"}).text
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)
    print('[%s] saved!' % title)


def run():
    url = "https://www.bixuege.com/7_7120/"
    html = get_html(url)
    contents = get_content_url(html)
    with ThreadPoolExecutor(100) as t:
        for content in contents:
            url, title = content
            t.submit(save_contents, url, title)


if __name__ == '__main__':
    run()
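One possible refinement, not in the original script: reusing TCP connections via requests.Session instead of opening a new connection per chapter. A single Session should not be shared across threads blindly, so a common pattern is one Session per worker thread through threading.local; get_html would then call get_session().get(url) instead of requests.get(url, headers=headers):

import threading
import requests

_local = threading.local()

def get_session():
    # Lazily create one Session per worker thread: connection pooling
    # without sharing a single Session object across threads.
    if not hasattr(_local, 'session'):
        _local.session = requests.Session()
        _local.session.headers.update({"User-Agent": "Mozilla/5.0"})
    return _local.session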