Scraping a Complete Novel with BeautifulSoup and a Thread Pool



Target URL: https://www.bixuege.com/7_7120/

  • Open the page, view its source, and check whether the content we want is in it


"""这个网页的源代码很招人喜欢,直接把我们呢想要的内容展示出来了"""
  • First, write a helper that fetches an HTML page

    def get_html(url):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
            "Referer": "https://www.bixuege.com/xuanhuan/"
        }
        res = None
        try:
            res = requests.get(url, headers=headers)
            # Let requests guess the real encoding from the body content
            res.encoding = res.apparent_encoding
        except Exception:
            return None
        else:
            return res.text
        finally:
            # res stays None if requests.get itself raised, so guard before closing
            if res is not None:
                res.close()
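A quick smoke test of the helper, peeking at the start of the returned page:

    html = get_html("https://www.bixuege.com/7_7120/")
    if html:
        print(html[:200])  # first 200 characters of the page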
    
  • Parse the HTML with BeautifulSoup and pick out the nodes we need

        soup = BeautifulSoup(html, 'lxml')

        # The chapter list lives in <div id="list">; every chapter is a <dd> inside it
        div_list = soup.find('div', attrs={"id": "list"})
        dd_list = div_list.find_all('dd')
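Note that `find` returns None when the node is missing, so the `find_all` call above would crash with a vague AttributeError if the layout ever changed. A slightly more defensive variant of the same lookup, using CSS selectors:

    # Same lookup via CSS selectors; fail loudly if the layout ever changes
    div_list = soup.select_one('div#list')
    if div_list is None:
        raise RuntimeError('chapter list not found - has the page layout changed?')
    dd_list = div_list.select('dd')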
    
  • Open one chapter page and check whether the chapter text is in its source as well


It's practically a giveaway: the chapter text sits directly in the div node whose id is content.

  • Write a method that saves the chapter content

        soup = BeautifulSoup(html, 'lxml')
        # The chapter text sits in <div id="content">
        content = soup.find('div', attrs={"id": "content"}).text
        with open(path, 'w', encoding='utf-8') as f:
            f.write(content)
            print('[%s]保存成功!!!' % title)
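One pitfall worth guarding against here: chapter titles end up as file names, and characters such as `?`, `*`, or `:` are illegal on Windows. A small hypothetical helper (`safe_filename` is not part of the original script):

    import re

    def safe_filename(title):
        # Strip characters that Windows forbids in file names
        return re.sub(r'[\\/:*?"<>|]', '', title)

    path = '笔趣阁小说/' + safe_filename(title) + '.txt'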
    
  • Add a thread pool to speed up the crawl

        # 100 worker threads; each chapter download is submitted as one task
        with ThreadPoolExecutor(100) as t:
            for content in contents:
                url, title = content
                t.submit(save_contents, url, title)
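One caveat with `submit`: an exception raised inside a worker is stored on the returned future and silently dropped if nobody calls `result()`, so failed chapters disappear without a trace. A minimal sketch that surfaces those errors:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    with ThreadPoolExecutor(100) as t:
        futures = [t.submit(save_contents, url, title) for url, title in contents]
        for future in as_completed(futures):
            if future.exception() is not None:
                print('chapter failed:', future.exception())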
    

Running it, the console prints a save confirmation for each chapter as the threads complete.

Opening one of the saved .txt files shows the chapter text intact.

Full source code:

    import os
    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor


    def get_html(url):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
            "Referer": "https://www.bixuege.com/xuanhuan/"
        }
        res = None
        try:
            res = requests.get(url, headers=headers)
            res.encoding = res.apparent_encoding
        except Exception:
            return None
        else:
            return res.text
        finally:
            # res stays None if requests.get itself raised
            if res is not None:
                res.close()


    def get_content_url(html):
        content_info = []

        soup = BeautifulSoup(html, 'lxml')

        div_list = soup.find('div', attrs={"id": "list"})
        dd_list = div_list.find_all('dd')
        # Skip the leading <dd> entries (likely the site's "latest chapters" block)
        for dd in dd_list[9:]:
            href, title = dd.a['href'], dd.a.text
            content_info.append(("https://www.bixuege.com" + href, title))
        return content_info or None


    def save_contents(url, title):
        # exist_ok avoids a race when many threads create the folder at once
        os.makedirs('./笔趣阁小说', exist_ok=True)

        path = '笔趣阁小说/' + title + '.txt'
        html = get_html(url)
        if html is None:
            return

        soup = BeautifulSoup(html, 'lxml')
        content = soup.find('div', attrs={"id": "content"}).text
        with open(path, 'w', encoding='utf-8') as f:
            f.write(content)
            print('[%s]保存成功!!!' % title)


    def run():
        url = "https://www.bixuege.com/7_7120/"
        html = get_html(url)
        contents = get_content_url(html)
        if not contents:
            return

        with ThreadPoolExecutor(100) as t:
            for content in contents:
                url, title = content
                t.submit(save_contents, url, title)


    if __name__ == '__main__':
        run()
