The original data file, encoded in UTF-8:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>表格比较(B)</title>
</head>
<body>
表格比较(B)<br/>
已产生: 2021/3/24 11:32:41<br/>
<br/>
模式: 差异
<br/>
左边文件: D:\codepath\kunlun_automation_bom\bom_system\数据库元器件信息_20210324113239.xlsx
<br/>
右边文件: D:\codepath\kunlun_automation_bom\bom_system\上传元器件信息_20210324113239.xlsx
<br/>
<br/>
</body>
</html>
from bs4 import BeautifulSoup
import codecs

# Case 1: read the UTF-8 file with the wrong codec (GBK), dropping undecodable bytes
r = codecs.open('admin.html', 'r', 'gbk', errors='ignore')
soup = BeautifulSoup(r, 'html.parser')
# print(soup.prettify())
w = soup.body.encode('utf-8')
with open('1.html', 'wb') as f:
    f.write(w)
w = soup.body.encode('gbk')
with open('2.html', 'wb') as f:
    f.write(w)

# Case 2: read the file with the correct codec (UTF-8)
r2 = codecs.open('admin.html', 'r', 'utf-8')
soup = BeautifulSoup(r2, 'html.parser')
# print(soup.prettify())
w = soup.body.encode('utf-8')
with open('3.html', 'wb') as f:
    f.write(w)
w = soup.body.encode('gbk')
with open('4.html', 'wb') as f:
    f.write(w)

# Encoding explicitly works; printing the tag directly goes through the console encoding
print(soup.body.encode('utf-8'))
print(soup.body.encode('gbk'))
print(soup.body)  # UnicodeEncodeError: 'gbk' codec can't encode character '\xa0'
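To see why the codec used for reading matters, here is a minimal sketch that skips the file and BeautifulSoup and works on the bytes directly. The sample string is taken from the title of the data file; the variable names are only for illustration.

```python
original = "表格比较(B)"
utf8_bytes = original.encode("utf-8")  # the bytes as stored in the UTF-8 file

# Correct codec: the bytes round-trip back to the original text.
read_as_utf8 = utf8_bytes.decode("utf-8")

# Wrong codec: GBK groups the same bytes differently, so the text is garbled.
# errors="ignore" silently drops whatever cannot be decoded.
read_as_gbk = utf8_bytes.decode("gbk", errors="ignore")

print(read_as_utf8 == original)
print(read_as_gbk == original)
```

Once the text has been decoded with the wrong codec, encoding it again as UTF-8 or GBK cannot recover the original characters.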
Output files
1.html
2.html
3.html
4.html
Output of print
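Why print(soup.body) fails
The original file is encoded in UTF-8, and decoding it as UTF-8 reads it correctly. print, however, encodes characters with the output encoding configured for the system. The PyCharm console here defaults to GBK, so print tries to encode the text as GBK for display. The character '\xa0' (a non-breaking space) cannot be represented in GBK, which raises UnicodeEncodeError. Switching the output encoding to UTF-8 makes the text display normally. To check the encoding the terminal uses for output, inspect sys.stdout.encoding.
The encoding chosen when reading the original data was not the problem, and the later encode calls are correct as well.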
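The failure can be reproduced without a console at all. The sketch below checks whether a string containing '\xa0' can be encoded in a given codec; can_encode is a helper name made up for this example.

```python
import sys

text = "left\xa0right"  # contains U+00A0, a non-breaking space

def can_encode(s, encoding):
    """Return True if every character of s is representable in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print("stdout encoding:", sys.stdout.encoding)
print("UTF-8 ok:", can_encode(text, "utf-8"))
print("GBK ok:", can_encode(text, "gbk"))
```

If sys.stdout.encoding reports gbk (or cp936), print will hit the same error for any string that fails the GBK check.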
Setting the output encoding
import sys
sys.stdout.reconfigure(encoding='utf-8')  # requires Python 3.7+
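The effect of reconfigure can be tried safely on an in-memory stream instead of the real console. This is only a sketch: the io.TextIOWrapper below stands in for a GBK console, which is an assumption made for the demo.

```python
import io

# In-memory stand-in for a console whose encoding is GBK (demo assumption).
raw = io.BytesIO()
console = io.TextIOWrapper(raw, encoding="gbk", newline="\n")

# Writing '\xa0' fails while the stream encodes as GBK.
try:
    console.write("a\xa0b\n")
    failed_before = False
except UnicodeEncodeError:
    failed_before = True

# After switching the stream to UTF-8, the same write succeeds.
console.reconfigure(encoding="utf-8")
console.write("a\xa0b\n")
console.flush()

print(failed_before)
print(raw.getvalue())
```

sys.stdout is normally a TextIOWrapper too, which is why the same reconfigure call works on it.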
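from __future__ import print_function
import sys

def safeprint(s):
    """Print s, falling back to a lossy form if the console cannot encode it."""
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            # Python 3: reinterpret the UTF-8 bytes in the console encoding
            print(s.encode('utf8').decode(sys.stdout.encoding))
        else:
            # Python 2: print the raw UTF-8 byte string
            print(s.encode('utf8'))

safeprint(u"\N{EM DASH}")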
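When the console encoding cannot be changed, another option is to escape the characters it cannot represent. The following is a sketch of that idea, not the approach from the referenced answers: safe_print and its stream parameter are hypothetical names, and the GBK console is simulated with an in-memory stream.

```python
import io
import sys

def safe_print(s, stream=None):
    """Print s, escaping characters the stream's encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the target encoding; unencodable characters
    # become backslash escapes such as \xa0 instead of raising.
    printable = s.encode(encoding, errors="backslashreplace").decode(encoding)
    print(printable, file=stream)

# Simulate a GBK console with an in-memory stream (demo assumption).
buffer = io.BytesIO()
fake_console = io.TextIOWrapper(buffer, encoding="gbk")
safe_print("a\xa0b", stream=fake_console)
fake_console.flush()
```

The output is lossy, since the escaped character is shown as literal text, but the call never raises UnicodeEncodeError.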
References:
https://www.cnblogs.com/vocus/p/11416022.html
https://www.cnblogs.com/-qing-/p/10934261.html
https://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console/32176732#32176732
https://stackoverflow.com/questions/4374455/how-to-set-sys-stdout-encoding-in-python-3