Scraping Articles and Novel Content with Python

Latest recommended article published on 2024-08-07 09:00:00

x-dragon8899

Category: Python. Tags: python, web scraping

Article link: https://blog.csdn.net/m0_45234510/article/details/106054094

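I. Installing requests and bs4
II. Analysis steps
III. Practice (scraping articles)
IV. Merging into a single .txt file
V. Fixing garbled text in scraped pages
VI. Practice (scraping a novel)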

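I. Installing requests and bs4

pip install requests

pip install bs4

The code in this article passes 'lxml' to BeautifulSoup as the parser, so that package must be installed as well:

pip install lxml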

II. Analysis steps

[Figure: analysis steps (image not available)]

III. Practice (scraping articles)

1. Code:

import io
import os
import sys
import requests
from bs4 import BeautifulSoup

# Re-wrap stdout as UTF-8 so Chinese text prints correctly on the console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def urlBS(url):  # request a page and return the parsed document
    resp = requests.get(url)
    html = resp.content.decode('gbk')  # decode the raw bytes as GBK
    soup = BeautifulSoup(html, 'lxml')  # parse the page
    # print(soup)
    return soup

def main(url):
    soup = urlBS(url)  # fetch the index page
    lis = soup.find('ul', class_="i1 ico1").find_all('li')  # list items holding the article links

    # Directory for saved files (os.getcwd() returns the current working directory)
    path = os.getcwd() + '/爬取的文章/'  # directory name means "scraped articles"
    if not os.path.isdir(path):  # create the directory if it does not exist yet
        os.mkdir(path)
    for i in lis:
        newurl = i.find('a')['href']
        print(newurl)
        # Request each article
        result = urlBS(newurl)
        title = result.find('div', class_="artview").find('h1').get_text()  # title
        print(title)
        writer = result.find('div', class_="artinfo").get_text()  # author info
        print(writer)
        # Output file: <directory>/<title>.txt
        filename = path + title + '.txt'
        print(filename)

        # Write title, author and body; an explicit UTF-8 encoding avoids
        # UnicodeEncodeError on systems whose default encoding is not UTF-8
        text = result.find('div', class_="artbody").find('p').get_text()
        with open(filename, 'w', encoding='utf-8') as new:
            new.write('<<' + title + '>>\n\n')  # title
            new.write(writer + '\n\n')  # author
            new.write(text)  # body

if __name__ == '__main__':
    firsturl = 'http://www.rensheng5.com/zx/onduzhe/'  # target URL
    main(firsturl)
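
The script above builds each file name directly from the article title (filename = path + title + '.txt'). A title containing characters such as / or ? would make open() fail on Windows. The helper below is a minimal sketch of one way to guard against that; safe_filename and its replacement parameter are hypothetical names introduced for illustration and are not part of the original script.

```python
import re

def safe_filename(title, replacement="_"):
    """Return a version of `title` that is safe to use as a file name."""
    # Replace characters that Windows forbids in file names, plus control characters
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, title)
    # Drop surrounding whitespace, then any trailing dots or spaces
    cleaned = cleaned.strip().rstrip(". ")
    # Fall back to a fixed name if nothing usable is left
    return cleaned or "untitled"
```

With this helper, the assignment in main() could become filename = path + safe_filename(title) + '.txt'. Chinese characters pass through unchanged because only the listed punctuation and control characters are replaced.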

2. Result:

[Figure: output of the script (image not available)]

3. Notes:

[Figure: explanatory notes (image not available)]

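IV. Merging into a single .txt file

1. In a command prompt window, change to the directory that contains the .txt files to merge.

2. After confirming the directory is correct, run type *.txt >>e:\111.txt. This command appends the contents of every .txt file in the current directory to e:\111.txt.

3. Open the merged e:\111.txt to see that the .txt files have been combined, in order, into 111.txt on drive E.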
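
The type command only works in the Windows command prompt. As a cross-platform alternative, the same merge can be sketched in Python. The function name merge_txt_files and the example paths are assumptions made for illustration. Unlike >>, which appends, this sketch overwrites the output file, and it assumes the input files are UTF-8 encoded.

```python
from pathlib import Path

def merge_txt_files(src_dir, out_file):
    """Concatenate every .txt file in src_dir into out_file, sorted by name."""
    src_dir = Path(src_dir)
    out_file = Path(out_file)
    # Skip the output file itself in case it lives in the same directory
    sources = sorted(
        p for p in src_dir.glob("*.txt") if p.resolve() != out_file.resolve()
    )
    with out_file.open("w", encoding="utf-8") as out:
        for path in sources:
            out.write(path.read_text(encoding="utf-8"))
            out.write("\n")  # separate consecutive files with a newline
    return [p.name for p in sources]

# Example call (hypothetical paths):
# merge_txt_files("爬取的文章", "merged.txt")
```

The function returns the names of the merged files in the order they were written, which makes it easy to check the result.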

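V. Fixing garbled text in scraped pages

A general workaround:

response = requests.get("page URL")

data = bytes(response.text, response.encoding).decode("gbk", "ignore")

This re-encodes the text with the encoding that requests guessed, then decodes the resulting bytes as GBK, ignoring any bytes that cannot be decoded. Replace "gbk" with the page's actual encoding, for example "utf-8".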
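
When the page encoding is not known in advance, an alternative is to work from the raw bytes (response.content in requests) and try a few candidate encodings in turn. The following is a minimal sketch under that assumption; decode_html and its default candidate list are hypothetical and not part of the original article.

```python
def decode_html(raw, encodings=("utf-8", "gbk")):
    """Decode raw bytes by trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and drop undecodable bytes
    return raw.decode(encodings[0], errors="ignore")

# Example call (requires network access):
# html = decode_html(requests.get("page URL").content)
```

UTF-8 is tried first on purpose: GBK text is very unlikely to be valid UTF-8, whereas UTF-8 bytes can sometimes decode as GBK without an error and produce garbled text.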

VI. Practice (scraping a novel)

[Figure: image not available]

1. Code:

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.biquw.com/book/19877/'  # chapter index of the novel

response = requests.get(base_url)
# Re-encode the guessed text to bytes, then decode as UTF-8 to avoid garbled characters
response2 = bytes(response.text, response.encoding).decode("utf-8", "ignore")
# print(response2)

# Parse the index page
soup = BeautifulSoup(response2, 'lxml')

data_list = soup.find('ul')  # first <ul> on the page, treated here as the chapter list

for book in data_list.find_all('a'):
    book_url = base_url + book['href']
    print('{}:{}'.format(book.text, book_url))
    chapter = requests.get(book_url)
    chapter_soup = BeautifulSoup(chapter.text, 'lxml')  # parse the chapter page
    data = chapter_soup.find('div', {'id': 'htmlContent'})  # element holding the chapter text
    data2 = bytes(data.text, chapter.encoding).decode("utf-8", "ignore")
    print(data2)

    # File output
    # Option 1: append every chapter to the same .txt file
    with open('book2.txt', 'a', encoding='utf-8') as file:
        file.write(data2)

    # Option 2: write each chapter to its own .txt file
    # with open(book.text + '.txt', 'a', encoding='utf-8') as f:
    #     f.write(data2)
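
The loop above builds each chapter URL by concatenating the index URL and the link's href, which only works when every href is a bare relative file name. The standard library's urljoin handles relative, root-relative and absolute links alike. The sketch below shows the idea; chapter_url is a hypothetical helper and the href values are made-up examples, not links taken from the site.

```python
from urllib.parse import urljoin

base_url = 'http://www.biquw.com/book/19877/'

def chapter_url(base, href):
    """Resolve a link found on the index page against the index URL."""
    return urljoin(base, href)

# Hypothetical href values in the three forms a page might use
for href in ['123456.html',
             '/book/19877/123456.html',
             'http://www.biquw.com/book/19877/123456.html']:
    print(chapter_url(base_url, href))
```

All three forms resolve to the same absolute URL, so the scraping loop would not need to know which style the site uses.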