2021-7-3 Web scraping 22: scraping a novel and saving it to txt (Python 3.6, static page, requests.get, removing specific strings)


没人不认识我


Category columns: python, web crawler. Tags: python, regular expressions, web crawler

Article link: https://blog.csdn.net/weixin_42555985/article/details/118439662


1. Development environment
2. Encoding
3. Removing specific strings
4. Full code

1. Development environment

Python 3.6.0 |Anaconda 4.3.0 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)] on win32

2. Encoding

The site's encoding is gb2312, as declared in its meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

So when fetching the page, set the response encoding explicitly:

req = requests.get(url=target)
req.encoding = 'gb2312'
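
For context, `req.text` is decoded from the raw bytes in `req.content` using `req.encoding`, which is why setting the encoding matters here. The sketch below reproduces that decoding step with the standard library only, on a simulated byte string rather than a live response. `sniff_charset` and `decode_page` are hypothetical helper names (not part of requests), and the regex only handles a simple `charset=...` declaration like the one in the meta tag above.

```python
import re

# Hypothetical helper pattern: a plain charset=... declaration in the page head.
_CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)', re.IGNORECASE)

def sniff_charset(raw, default="gb2312"):
    """Return the charset named in the first 2 KB of the page, or `default`."""
    match = _CHARSET_RE.search(raw[:2048])
    return match.group(1).decode("ascii").lower() if match else default

def decode_page(raw, default="gb2312"):
    """Decode page bytes with the declared charset, replacing undecodable bytes."""
    return raw.decode(sniff_charset(raw, default), errors="replace")

# Simulated response body, standing in for req.content.
page = ('<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
        '<p>小说</p>').encode("gb2312")
html = decode_page(page)
```

If the declared charset and the real one disagree, decoding produces garbled text, so hard-coding `'gb2312'` as the author does is a reasonable choice for a single known site.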

Writing to the txt file:

with open("test.txt","a",encoding='gb2312') as f:

Some characters in the page cannot be encoded as gb2312, so writing the txt file raises an error:

UnicodeEncodeError: 'gb2312' codec can't encode character '\xa0' in position 5217: illegal multibyte sequence
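
The error reports only the first offending character, so it can take several runs to find them all. As a minimal sketch (the helper name `find_unencodable` is hypothetical), the following lists every distinct character in a string that a given codec rejects:

```python
def find_unencodable(text, encoding="gb2312"):
    """Return the distinct characters `encoding` cannot represent, in order of first appearance."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad

# Made-up sample text containing a non-breaking space and a replacement character.
sample = "你好\xa0世界\ufffd"
print([hex(ord(ch)) for ch in find_unencodable(sample)])
```

Running this on the scraped text before writing gives the full list of characters that need handling.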

So strip all of the offending characters before writing:

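with open("test.txt", "a") as f:
    # \xa0   -> &nbsp; (non-breaking space)
    # \ufffd -> U+FFFD, the Unicode replacement character
    # \u30fb -> katakana middle dot
    # Each pair of <br><br> is replaced by 2 line breaks plus a first-line paragraph indent
    f.write(text_delete_bmp.replace('\ufffd', '')
            .replace('\u30fb', '')
            .replace('\xa0', '')
            .replace('　　', "\n  ")
            .replace('\n\n', "\n  "))  # the with block closes the file automatically; no f.close() needed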
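
The replacements above target characters found by trial and error. A more general alternative, shown here only as a sketch and not as the author's method, is to let the codec drop anything it cannot represent by passing `errors="ignore"` to `open`. The helper name `append_text` is hypothetical. Note that this discards every unencodable character silently and does not perform the paragraph re-indentation.

```python
def append_text(path, text, encoding="gb2312"):
    """Append `text` to `path`, dropping characters the encoding cannot represent."""
    with open(path, "a", encoding=encoding, errors="ignore") as f:
        f.write(text)

# Example call (writes to test.txt in the current directory):
# append_text("test.txt", "第一章\xa0开始")
```

Writing with `encoding="utf-8"` instead would avoid the loss entirely, at the cost of the file no longer matching the site's gb2312 encoding.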

3. Removing specific strings

The text contains some specific strings that are not needed, for example:

{ewcMVIMAGE,MVIMAGE, !09100020_0014_1.bmp}{ewc MVIMAGE,MVIMAGE, !09100020_0015_1.bmp}

Use a regular expression to remove all of them.
String rule: starts with "{ewc" and ends with ".bmp}".

text_delete_bmp=re.sub(r'{ewc.*?\.bmp}', "", text_context[0].text)
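
To see the rule in action, here is a small self-contained sketch applied to the sample markers shown earlier. The function name `strip_bmp_markers` and the surrounding sample text are made up for illustration. The braces are escaped explicitly, and `re.DOTALL` is an added assumption so that a marker spanning a line break is still matched.

```python
import re

# Non-greedy match from "{ewc" to the nearest ".bmp}".
BMP_MARKER = re.compile(r'\{ewc.*?\.bmp\}', re.DOTALL)

def strip_bmp_markers(text):
    """Remove every {ewc ... .bmp} marker from `text`."""
    return BMP_MARKER.sub("", text)

sample = ("Chapter one{ewcMVIMAGE,MVIMAGE, !09100020_0014_1.bmp}"
          "{ewc MVIMAGE,MVIMAGE, !09100020_0015_1.bmp} continues")
cleaned = strip_bmp_markers(sample)
```

The non-greedy `.*?` matters: a greedy `.*` would run from the first `{ewc` to the last `.bmp}` and delete any story text between two markers.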

4. Full code

Download
