Scraping the Douluo Dalu Novel - CSDN Blog

Article link: https://blog.csdn.net/weixin_43901998/article/details/87705802

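Many of you are probably fans of Douluo Dalu by Tang Jia San Shao (唐家三少). Today's post walks through a crawler, written with requests and BeautifulSoup, that scrapes the first book of Douluo Dalu!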

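The figure below shows two things. The chapter title sits in the h3 child tag of div class="yuedu_index" … (for example, 第四十九章七怪战皇斗（下）), and the chapter text sits in the p child tags of div class="content". We can extract both with the select method of the BeautifulSoup library, which takes a CSS selector and returns every matching tag.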
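To make the two selectors concrete, here is a minimal sketch that runs select on a hand-written HTML fragment. The fragment is an assumption that only mimics the structure described above; it is not the real page source. The helper name extract_chapter is hypothetical, and only the class names yuedu_index and content come from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the page described above (not the real page source)
SAMPLE_HTML = """
<div class="yuedu_index"><h3>Chapter title</h3></div>
<div class="content">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""


def extract_chapter(html):
    """Return (title, paragraphs) using the same CSS selectors as the crawler."""
    soup = BeautifulSoup(html, "html.parser")
    # "A > B" matches B elements that are direct children of A
    titles = [h3.get_text() for h3 in soup.select("div.yuedu_index > h3")]
    paragraphs = [p.get_text() for p in soup.select("div.content > p")]
    title = titles[0] if titles else ""
    return title, paragraphs


title, paragraphs = extract_chapter(SAMPLE_HTML)
print(title)
print(paragraphs)
```

Because select returns a list, a page with no matching tags simply yields an empty list rather than raising an error, which is why the sketch falls back to an empty title.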

Here is our code:

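import time

import requests
from bs4 import BeautifulSoup


def get_HTML(url):
    """Download a page and return its text, or "" if the request fails."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36",
        "Referer": "http://www.85xs.cc/book/douluodalu1/",
    }
    try:
        # The timeout keeps one stalled request from hanging the whole crawl
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        # Let requests guess the encoding from the page content
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def explain_HTML(html, file):
    """Write the chapter title and body paragraphs found in html to file."""
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.select("div.yuedu_index > h3")
    paragraphs = soup.select("div.content > p")
    for title in titles:
        print(title.get_text())  # shows which chapter the program has reached
        file.write(title.get_text() + '\n')
    for paragraph in paragraphs:
        file.write(paragraph.get_text() + '\n')


if __name__ == '__main__':
    url_0 = "http://www.85xs.cc/book/douluodalu1/"
    # Open the output file once; the with block closes it even if an error occurs
    with open('F:/斗罗大陆.txt', 'a+', encoding='UTF-8') as file:
        for j in range(1, 400):
            url = url_0 + str(j) + ".html"
            html = get_HTML(url)
            if html:  # skip pages that failed to download
                explain_HTML(html, file)
            time.sleep(1)  # throttle the crawler so requests do not fail from going too fast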
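The loop above always requests pages 1 to 399 and carries on silently when a download fails. One possible variation, sketched below, stops after several consecutive empty pages and takes the download function as a parameter, so the loop can be tried without touching the network. The names crawl, fetch, handle, max_misses, and fake_fetch are hypothetical and not part of the original script, and the example.invalid URL is a placeholder.

```python
import time


def crawl(fetch, handle, base_url, first=1, last=399, delay=1.0, max_misses=3):
    """Fetch base_url + "<n>.html" for n in first..last and pass each page to handle.

    fetch(url) must return the page text, or "" on failure (like get_HTML above).
    Stops early after max_misses consecutive empty pages.
    Returns the number of pages handled.
    """
    handled = 0
    misses = 0
    for n in range(first, last + 1):
        html = fetch(base_url + str(n) + ".html")
        if html:
            handle(html)
            handled += 1
            misses = 0
        else:
            misses += 1
            if misses >= max_misses:
                break  # probably past the last page, or the site is refusing requests
        time.sleep(delay)  # pause between requests, as in the original loop
    return handled


# Offline demo with a fake fetch function: only pages 1 to 5 "exist"
def fake_fetch(url):
    n = int(url.rsplit("/", 1)[1].split(".")[0])
    return "<p>page %d</p>" % n if n <= 5 else ""


pages = []
count = crawl(fake_fetch, pages.append, "http://example.invalid/book/", delay=0)
print(count, len(pages))
```

In the real script, get_HTML could be passed as fetch and a function that writes the parsed chapter to the output file as handle; a smaller last value shortens the run.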

Because a full run takes a long time, I stopped the crawl at chapter 145:
Scraping result: