Scraping a novel with Python and writing it to txt: using python3 to scrape a Zongheng novel and write it to a text file

weixin_39999190

Article tags: scraping a novel with Python and writing it to txt

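Libraries used in this post:

requests

BeautifulSoup

Some methods of the requests library:

The key steps in scraping a web page are as follows.

For a GET request, fetch the page with requests.get:

response = requests.get(book_url, headers=header)

soup = BeautifulSoup(response.text, 'lxml')  # parse the page with BeautifulSoup; the result is the complete HTML document

content = soup.select('#readerFt > div > div.content > p')  # use soup.select to find the body text by tag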

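When locating elements through child tags, try not to use the full selector copied from the browser.

For example, on the chapter page the body text sits in the individual <p> tags under the element with class=content.

E.g. the selector copied for the second <p> tag looks like this: #readerFt > div > div.content > p:nth-child(2). Since we want the full text rather than a single paragraph, drop the :nth-child(2) that follows p, which leaves #readerFt > div > div.content > p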
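The difference between the two selectors can be checked on a small inline page. The following is a minimal sketch, assuming a made-up HTML fragment that imitates the structure described above (the real chapter page may be laid out differently). It uses Python's built-in html.parser instead of lxml so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates the structure described above;
# the real chapter page may be laid out differently.
sample_html = """
<div id="readerFt">
  <div>
    <div class="content">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
      <p>Third paragraph.</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Selector as copied from the browser: matches only the second <p>
second_only = soup.select('#readerFt > div > div.content > p:nth-child(2)')

# Same selector with :nth-child(2) removed: matches every <p> under div.content
all_paragraphs = soup.select('#readerFt > div > div.content > p')

print(len(second_only))
print(len(all_paragraphs))
print([p.get_text() for p in all_paragraphs])
```

On this sample, the copied selector returns one element and the shortened selector returns all three paragraphs.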

The complete code:

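# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException


def get_page(book_url):
    '''
    Request the page and check the response status code. If the request
    succeeds (200), parse the page with BeautifulSoup and return the result.
    If the status code is not 200 or the request raises an exception, return None.
    '''
    try:
        # Build a header that imitates a browser; some sites restrict access
        # and return no data when the header is missing
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
        response = requests.get(book_url, headers=header)
        if response.status_code == 200:
            # Parse the page with BeautifulSoup; the result is the complete HTML document
            soup = BeautifulSoup(response.text, 'lxml')
            print(type(soup))  # <class 'bs4.BeautifulSoup'>
            return soup
        print('Request failed, status code:', response.status_code)
        return None
    except RequestException:
        print('Request failed!')
        return None


def download():
    html = get_page(book_url)
    if html is None:  # nothing to write if the request failed
        return
    # Use select to find the body paragraphs by tag
    content = html.select('#readerFt > div > div.content > p')
    # print(content)  # the result behaves like a list
    # encoding='utf-8' so the Chinese text is written correctly regardless of the system default
    with open('E:\\pyProject\\test1\\content.txt', 'w', encoding='utf-8') as f:
        for i in content:
            i = str(i)  # convert each element to str (the <p> tags are kept)
            f.write(i + '\n')  # write each paragraph on its own line


'''
To remove the <p> tags, use the approach below: a regular expression
that captures only the text inside the <p> tags.
'''


def download1():
    html = get_page(book_url)
    if html is None:  # nothing to write if the request failed
        return
    content_html = html.select('#readerFt > div > div.content')
    # print(content_html)
    # Use a regular expression to capture the text inside the <p> tags
    content = re.findall(r'<p>(.*?)</p>', str(content_html), re.S)
    # print(content)
    with open('E:\\pyProject\\test1\\content.txt', 'w', encoding='utf-8') as f:
        for n in content:
            f.write(str(n) + '\n')


if __name__ == '__main__':
    book_url = 'http://book.zongheng.com/chapter/681832/37860473.html'
    download()
    # download1()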

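Calling download() writes each paragraph to the txt file still wrapped in its <p> tags.

Calling download1() writes only the text of each paragraph, without the <p> tags.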
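To see what the regular expression in download1() captures without hitting the site, it can be run on an inline string. This is a minimal sketch: the HTML fragment, the helper name extract_paragraphs, and the temporary output file are all made up for illustration and are not part of the original script.

```python
import os
import re
import tempfile

# Hypothetical fragment standing in for str(content_html); not taken from the real site
sample_html = '<div class="content"><p>第一段。</p>\n<p>第二段，\n跨行。</p></div>'


def extract_paragraphs(html_text):
    """Return the text between each <p> and </p> pair (hypothetical helper)."""
    # re.S lets '.' match newlines, so a paragraph that spans lines is still captured;
    # the non-greedy (.*?) stops at the first closing </p>
    return re.findall(r'<p>(.*?)</p>', html_text, re.S)


paragraphs = extract_paragraphs(sample_html)

# Write one paragraph per line; encoding='utf-8' avoids relying on the system default encoding
out_path = os.path.join(tempfile.mkdtemp(), 'content.txt')
with open(out_path, 'w', encoding='utf-8') as f:
    for text in paragraphs:
        f.write(text + '\n')

with open(out_path, encoding='utf-8') as f:
    print(f.read())
```

Note that this pattern only matches a bare <p>. Paragraphs carrying attributes, such as <p class="...">, would be skipped; in that case calling get_text() on each element returned by select is a possible alternative.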

That completes a simple novel-scraping script. Hooray!

Article URL: https://blog.csdn.net/dhr201499/article/details/107317802
