Learning Python Web Scraping (Using a Novel as the Example)

Latest recommended article published on 2024-06-24 18:45:00

zwj_?

Tags: python, learning, web scraping

Article link: https://blog.csdn.net/weixin_52058304/article/details/127180518

Target site: https://www.31xiaoshuo.com/168/168168/

The code has three parts:

Get the URL of each chapter
Fetch each chapter's title and text from those URLs
Save the fetched text to a txt file

Libraries used:

re
requests
BeautifulSoup (the bs4 package)

Implementation:

Getting the URL of each chapter

Use Inspect or F12 to view the HTML of the novel's main page.
Each chapter's link has the form

<a href="/168/168168/61321619.html">第一章：剑道</a>

Open any chapter and you will see that its full address is:

https://www.31xiaoshuo.com/168/168168/61321619.html

This is simply the string

https://www.31xiaoshuo.com

and the string inside the quotes after href on the novel's main page

/168/168168/61321619.html

joined together.
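The standard library can do this joining as well. The following is a minimal sketch using urllib.parse.urljoin, which resolves an href against the page it was found on and so avoids doubled or missing slashes. The variable names index_url and href are illustrative, not part of the original code.

```python
from urllib.parse import urljoin

# Page on which the link was found (the novel's main page).
index_url = "https://www.31xiaoshuo.com/168/168168/"
# Value of the href attribute copied from that page.
href = "/168/168168/61321619.html"

# A root-relative href (one starting with "/") is resolved against the site root.
chapter_url = urljoin(index_url, href)
print(chapter_url)
```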
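We can use a regular expression to pull every string of the form

/168/168168/61321619.html

out of the novel's main page into a list, then prepend the string

https://www.31xiaoshuo.com

to each one, which gives the complete URL of every chapter.
The code for this part is: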

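def get_titleurl(text):
    # Match only chapter paths such as /168/168168/61321619.html
    title_list = re.findall(r"/168/168168/\d+\.html", text)
    title_list1 = []
    for i in title_list:
        # The path already starts with "/", so the site root has no trailing slash
        title_list1.append("https://www.31xiaoshuo.com" + i)
    return title_list1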
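If the main page links to the same chapter more than once (for example in a "latest chapters" box, which is an assumption about the site and not something verified here), the list will contain duplicates. The sketch below shows one way to capture only href values and drop repeats while keeping page order. The function name extract_chapter_urls and the sample markup are hypothetical, and the second chapter id is invented for the demo.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.31xiaoshuo.com/168/168168/"

def extract_chapter_urls(index_html, base_url=BASE_URL):
    """Return absolute chapter URLs in page order, without duplicates."""
    # Capture only the href value of links that look like chapter pages.
    hrefs = re.findall(r'href="(/168/168168/\d+\.html)"', index_html)
    # dict.fromkeys keeps the first occurrence of each href and preserves order.
    unique_hrefs = dict.fromkeys(hrefs)
    return [urljoin(base_url, href) for href in unique_hrefs]

# Made-up sample markup; the first chapter is deliberately linked twice.
sample = (
    '<a href="/168/168168/61321619.html">第一章：剑道</a>'
    '<a href="/168/168168/61321620.html">Chapter 2</a>'
    '<a href="/168/168168/61321619.html">第一章：剑道</a>'
)
print(extract_chapter_urls(sample))
```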

Getting each chapter's title and text

Inspecting the page of the first chapter shows that the title is in the h1 tag and the text is in the div tag whose id is content.
The code that gets the title and the text is:

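def get_content(url):
    # Download the chapter page and parse it
    html = requests.get(url)
    soup = BeautifulSoup(html.text, "html.parser")
    # The chapter title is in the <h1> tag
    title = soup.find('h1').text
    # The chapter text is in the element whose id is "content"
    content = soup.find(id="content").text
    return title, content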
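Note that soup.find returns None when nothing matches, so reading .text on an error page or a page with a different layout raises AttributeError. The sketch below separates parsing from downloading and checks for missing elements. The function name parse_chapter and the sample HTML are hypothetical, and the sample only imitates the structure described above.

```python
from bs4 import BeautifulSoup

def parse_chapter(html_text):
    """Return (title, content) from a chapter page, or None if either is missing."""
    soup = BeautifulSoup(html_text, "html.parser")
    title_tag = soup.find("h1")
    content_tag = soup.find(id="content")
    # find() returns None when nothing matches, so check before reading text.
    if title_tag is None or content_tag is None:
        return None
    title = title_tag.get_text(strip=True)
    # Join the text pieces with newlines so lines split by <br> stay separate.
    content = content_tag.get_text("\n", strip=True)
    return title, content

# Made-up page that imitates the structure described above.
sample_page = """
<html><body>
<h1>第一章：剑道</h1>
<div id="content">First paragraph.<br/>Second paragraph.</div>
</body></html>
"""
print(parse_chapter(sample_page))
print(parse_chapter("<html><body><p>error page</p></body></html>"))
```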

Saving the text to a txt file

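def save(title, content):
    # Append to 1.txt so every chapter ends up in the same file
    with open('1.txt', mode='a', encoding='utf-8') as f:
        # Put the title on its own line and separate chapters with a blank line
        f.write(title + '\n')
        f.write(content + '\n\n')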
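To see how append mode accumulates chapters, here is a small sketch that writes two made-up chapters into a temporary file and reads the result back. The function name save_chapter, its path parameter, and the file name novel.txt are assumptions for illustration; the original function always writes to 1.txt.

```python
import os
import tempfile

def save_chapter(path, title, content):
    """Append one chapter to the text file at path."""
    # Mode "a" appends, so chapters written by earlier calls are kept.
    with open(path, mode="a", encoding="utf-8") as f:
        f.write(title + "\n\n")
        f.write(content + "\n\n")

# Demo in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "novel.txt")
    save_chapter(demo_path, "Chapter 1", "First text.")
    save_chapter(demo_path, "Chapter 2", "Second text.")
    with open(demo_path, encoding="utf-8") as f:
        saved_text = f.read()

print(saved_text)
```

Because the file is opened in append mode, running the scraper twice also writes every chapter twice, so delete the old file before a fresh run.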

Complete code:

import requests
import re
from bs4 import BeautifulSoup

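# Save a chapter by appending it to the text file
def save(title, content):
    with open('1.txt', mode='a', encoding='utf-8') as f:
        # Put the title on its own line and separate chapters with a blank line
        f.write(title + '\n')
        f.write(content + '\n\n')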

# Get the title and text of one chapter
def get_content(url):
    html = requests.get(url)
    soup = BeautifulSoup(html.text, "html.parser")
    title = soup.find('h1').text
    content = soup.find(id="content").text
    return title, content

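# Get the URL of every chapter from the main page's HTML
def get_titleurl(text):
    # Match only chapter paths such as /168/168168/61321619.html
    title_list = re.findall(r"/168/168168/\d+\.html", text)
    title_list1 = []
    for i in title_list:
        # The path already starts with "/", so the site root has no trailing slash
        title_list1.append("https://www.31xiaoshuo.com" + i)
    return title_list1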

# Fetch the novel's main page, then download and save every chapter in order
url = "https://www.31xiaoshuo.com/168/168168/"
html = requests.get(url)
titleurl = get_titleurl(html.text)
for i in titleurl:
    title, content = get_content(i)
    save(title, content)
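The loop above sends requests back to back and stops at the first chapter that fails. One possible refinement, sketched below, pauses between requests and records failures instead of aborting. The function crawl_chapters, its parameters, and the one-second default delay are assumptions for illustration, and the demo uses a fake fetch function so it runs without network access.

```python
import time

def crawl_chapters(chapter_urls, fetch, store, delay=1.0):
    """Fetch and store each chapter in order; return a list of (url, error) failures.

    fetch(url) must return (title, content); store(title, content) saves them.
    """
    failed = []
    for index, url in enumerate(chapter_urls):
        if index > 0:
            # Pause between requests to avoid hammering the server.
            time.sleep(delay)
        try:
            title, content = fetch(url)
        except Exception as exc:
            # Broad catch is deliberate in this sketch: note the failure, keep going.
            failed.append((url, str(exc)))
            continue
        store(title, content)
    return failed

# Stand-in for a real downloader; fails on one made-up URL.
def fake_fetch(url):
    if url.endswith("bad.html"):
        raise ValueError("simulated failure")
    return "Title of " + url, "Body of " + url

stored = []
failed = crawl_chapters(
    ["a.html", "bad.html", "c.html"],
    fetch=fake_fetch,
    store=lambda title, content: stored.append(title),
    delay=0,
)
print(stored)
print(failed)
```

With the functions from the complete code, the call would look like crawl_chapters(titleurl, get_content, save).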