Handling the extra CR when downloading a txt file from the web with the Python requests library

Latest recommended article published on 2024-04-09 12:56:05

阿智智


Category column: Python. Article tags: line-ending conversion, Python, web scraping, regular expressions

Article link: https://blog.csdn.net/robertchenguangzhi/article/details/84026594


Problem description

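While reading the "Reading word lists" section of the book¹, I found that I needed to download the file words.txt from thinkpython2/code/words.txt. I did not want to build the file by copying and pasting, so I drew on the web-scraping techniques I had learned earlier and wrote the following code: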

import requests

r = requests.get('http://greenteapress.com/thinkpython2/code/words.txt')
# the site above is treated as ISO-8859-1, so set the encoding explicitly
r.encoding = 'utf-8'

# write to an external file (text mode, default newline handling)
with open('words.txt', 'w') as words:
    words.write(r.text)

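After obtaining words.txt, I found that every word was followed by a blank line. I turned on View -> Show Symbol -> Show End of Line in Notepad++, and the result is shown in the figure below:
[Figure: an extra CR]
That is the problem: how can the extra lines be removed?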
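
The same check can also be done without an editor. The following is only a minimal sketch: the helper name `count_line_endings` and the sample bytes are made up for illustration, and the sample assumes the file contains a CR followed by CR LF after every word, as the Notepad++ view suggests.

```python
import re
from collections import Counter

# Map each raw terminator to a readable name.
NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}


def count_line_endings(data):
    """Count CRLF, lone CR and lone LF terminators in raw bytes."""
    # Alternation order matters: CR LF must be tried before a lone CR.
    pattern = re.compile(rb"\r\n|\r|\n")
    return Counter(NAMES[m.group()] for m in pattern.finditer(data))


# Hypothetical sample: two words, each followed by CR, then CR LF.
sample = b"aa\r\r\naah\r\r\n"
counts = count_line_endings(sample)
print(dict(counts))
```

For a real file, the bytes would be read in binary mode, for example `open('words.txt', 'rb').read()`, so that Python does not translate the line endings before they are counted.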

Solution

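Inspecting the file, I found that Notepad++ reported it as Macintosh format and that the displayed content contained an extra CR. I therefore used Notepad++ to convert the file to Windows format, as shown in the figure below:

The result after conversion is shown in the figure below:
[Figure: after conversion to Windows format]
I confirmed with a test program that, on Windows, Python's \n in a text-mode file corresponds to CR LF. So for the words.txt file converted to Windows format, all that remains is to replace \n\n with \n. I used the following code (based on a regular expression):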

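# stripNewline.py
import re

# read the converted file; in text mode each CR LF is read as '\n'
with open('words.txt') as fi:
    text = fi.read()
#text = 'nihao\n\n'
# replace every pair of consecutive newlines with a single newline
dnewlinePattern = re.compile(r'\n\n')
outStr = re.sub(dnewlinePattern, '\n', text)
with open('wordsOut.txt', 'w') as fo:
    fo.write(outStr)
#print(repr(outStr))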
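
The extra CR can probably also be avoided when the file is first written, which would make the Notepad++ conversion and the regular expression unnecessary. The sketch below rests on an assumption the article does not verify: that the downloaded text already ends each line with CR LF. The helper names `save_text` and `read_raw` and the sample string are hypothetical. Passing `newline="\r\n"` to `open()` reproduces the Windows text-mode default on any platform, while `newline=""` turns newline translation off.

```python
import os
import tempfile


def save_text(path, text, newline=None):
    """Write text to path; newline is passed straight to open()."""
    with open(path, "w", encoding="utf-8", newline=newline) as f:
        f.write(text)


def read_raw(path):
    """Return the file's bytes exactly as stored on disk."""
    with open(path, "rb") as f:
        return f.read()


# Stand-in for r.text, assuming the server sends CR LF line endings.
downloaded = "aa\r\naah\r\n"

with tempfile.TemporaryDirectory() as tmp:
    translated = os.path.join(tmp, "translated.txt")
    verbatim = os.path.join(tmp, "verbatim.txt")

    # Every "\n" is written as "\r\n", so "\r\n" turns into "\r\r\n".
    save_text(translated, downloaded, newline="\r\n")
    # No translation: the text is written unchanged.
    save_text(verbatim, downloaded, newline="")

    translated_bytes = read_raw(translated)
    verbatim_bytes = read_raw(verbatim)

print(translated_bytes)
print(verbatim_bytes)
```

With requests, writing `r.content` to a file opened in `'wb'` mode should have the same effect, since binary mode performs no newline translation at all.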