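Exercise 17: More File Operations
How to fix a file encoding error
Running the program fails with a decoding error.
- Original code:
# Import argv from the sys module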
from sys import argv
# Import the exists function from the os.path module
from os.path import exists
# Unpack argv
script, from_file, to_file = argv
print(f"Copying from {from_file} to {to_file}")
# We could do these two on one line, how?
# Open from_file and assign the file object to in_file
in_file = open(from_file)
# Read the contents of in_file and assign them to indata
indata = in_file.read()
# Print the length of indata (len() counts decoded characters, not bytes)
print(f"The input file is {len(indata)} bytes long")
# Check whether the to_file file already exists
print(f"Does the output file exist? {exists(to_file)}")
print("Ready, hit RETURN to continue, hit CTRL-C to abort.")
input()
# Open to_file for writing and assign the file object to out_file
out_file = open(to_file, 'w')
# Write the contents of indata to out_file
out_file.write(indata)
print("Alright, all done.")
# Close the out_file and in_file files
out_file.close()
in_file.close()
PS D:\pythonp> # first make a sample file.
PS D:\pythonp> echo "This is a test file." > test17.txt
PS D:\pythonp> #then look at it.
PS D:\pythonp> cat test17.txt
This is a test file.
PS D:\pythonp> # now run our script on it.
PS D:\pythonp> python ex17.py test17.txt new_file17.txt
Copying from test17.txt to new_file17.txt
Traceback (most recent call last):
File "ex17.py", line 10, in <module>
indata = in_file.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence
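The error is raised at the read() call.
- A search suggested that this is probably a file encoding problem.
Debugging along the lines of the article below did not work; the specific errors are shown in the output that follows.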
UnicodeDecodeError: ‘gbk’ codec can’t decode byte 0xab in position 11126: illegal multibyte sequence
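- Attempt 1 (×)
Change line 9 of ex17.py (the open(from_file) call; line numbers here and in the tracebacks refer to the script without the added comments) to
# Set the encoding to gbk
in_file = open(from_file, encoding='gbk')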
PS D:\pythonp> python ex17.py test17.txt new_file17.txt
Copying from test17.txt to new_file17.txt
Traceback (most recent call last):
File "ex17.py", line 10, in <module>
indata = in_file.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence
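Still fails at the same place. The first traceback already names the 'gbk' codec, so GBK was the default here and passing it explicitly changes nothing.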
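To confirm which encoding open() falls back to on a given machine, here is a minimal sketch. It assumes a Python version where the default text encoding follows the locale; newer versions or UTF-8 mode may report UTF-8 instead. The variable name is illustrative.

```python
import locale

# Encoding that open() uses for text files when no `encoding` argument
# is passed, on Python versions where the default follows the locale.
default_encoding = locale.getpreferredencoding(False)
print(f"Default text encoding: {default_encoding}")
```

On a Chinese-language Windows installation this would typically print a GBK-family code page, which is why attempt 1 behaves exactly like the original script.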
- Attempt 2 (×)
Change line 9 to
# Set the encoding to gb18030
in_file = open(from_file, encoding='gb18030')
PS D:\pythonp> python ex17.py test17.txt new_file.txt
Copying from test17.txt to new_file.txt
Traceback (most recent call last):
File "ex17.py", line 10, in <module>
indata = in_file.read()
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xff in position 0: illegal multibyte sequence
Same error at the same place, now reported by the gb18030 codec.
- Attempt 3 (×)
Change line 9 to
# Set the encoding to gb18030 and ignore decoding errors
in_file = open(from_file, encoding='gb18030', errors='ignore')
PS D:\pythonp> python ex17.py test17.txt new_file17.txt
Copying from test17.txt to new_file17.txt
The input file is 44 bytes long
Does the output file exist? False
Ready, hit RETURN to continue, hit CTRL-C to abort.
Traceback (most recent call last):
File "ex17.py", line 19, in <module>
out_file.write(indata)
UnicodeEncodeError: 'gbk' codec can't encode character '\u2e84' in position 0: illegal multibyte sequence
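The original failure point now passes, but write() raises a new error: ignoring and mis-decoding the input produced characters that the default 'gbk' output codec cannot encode. So I went back to the references to find the root cause.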
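Why errors='ignore' moves the failure rather than removing it can be reproduced without PowerShell. The sketch below builds the bytes that the redirect presumably wrote (UTF-16 LE with a byte order mark and a CRLF line ending, both assumptions about the file) and decodes them the way attempt 3 does. The names `raw` and `garbled` are made up for illustration.

```python
text = "This is a test file.\r\n"

# Assumed file contents: UTF-16 LE byte order mark followed by the encoded text.
raw = b"\xff\xfe" + text.encode("utf-16-le")

# Attempt 3: undecodable bytes are dropped silently and the rest is misread.
garbled = raw.decode("gb18030", errors="ignore")

print(len(raw))            # file size in bytes: 46
print(repr(garbled[:6]))   # mis-decoded characters, not the original text
print(garbled == text)     # False
print(raw.decode("utf-16") == text)  # True: the right codec recovers the text
```

Note that len(indata) in the script counts decoded characters, not bytes, so the 44 in the output above is not the file size.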
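PowerShell's file output encoding
- It turns out that Windows PowerShell's output redirection (>) writes files as UTF-16 LE by default (Microsoft calls this "Unicode" encoding), whereas the file needs to be UTF-8 here. This is consistent with the tracebacks above: a UTF-16 LE file with a byte order mark starts with the bytes FF FE, and every decoding error complains about byte 0xff at position 0.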
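Checking for a byte order mark is a quick way to test this diagnosis. The following is a minimal sketch, not a general encoding detector; the helper name `sniff_bom` is hypothetical, and the sample file is a stand-in written from Python rather than by PowerShell.

```python
import codecs
import os
import tempfile

def sniff_bom(path):
    """Return the encoding implied by a leading byte order mark, or None."""
    with open(path, "rb") as f:
        head = f.read(4)
    # Check UTF-32 before UTF-16: the UTF-32 LE mark starts with the UTF-16 LE mark.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return None

# Stand-in for a file created with `echo ... > file` in Windows PowerShell.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w", encoding="utf-16", newline="") as f:
        f.write("This is a test file.\r\n")
    detected = sniff_bom(sample)
    with open(sample, encoding=detected) as f:
        content = f.read()

print(detected)       # utf-16
print(repr(content))  # 'This is a test file.\n'
```

A file without a byte order mark returns None here, in which case the encoding has to be known or guessed by other means.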
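Fixes suggested in the reference articles:
Windows PowerShell output file encoding issues
Changing the default encoding in PowerShell
Changing PowerShell's default output encoding to UTF-8
I would rather not change the default output encoding, so I tried the following instead.
- Attempt 1 (×)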
PS D:\pythonp> chcp 65001
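chcp : The term 'chcp' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ chcp 65001
+ ~~~~
+ CategoryInfo : ObjectNotFound: (chcp:String) [], CommandNotFoundException
+ FullyQualifiedErrorId : CommandNotFoundException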
This fails, so switching the current console window's code page to UTF-8 with chcp 65001 is not an option here.
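Separately from the console code page, the script itself can be told which encodings to use, which leaves PowerShell's defaults alone. This is only a sketch under the assumption that the input really is UTF-16 with a byte order mark and that UTF-8 output is wanted; `copy_text` and its parameters are illustrative names, not part of the exercise.

```python
import os
import tempfile

def copy_text(src, dst, src_encoding="utf-16", dst_encoding="utf-8"):
    """Copy a text file, decoding and encoding with explicit codecs."""
    with open(src, encoding=src_encoding) as in_file:
        data = in_file.read()
    with open(dst, "w", encoding=dst_encoding) as out_file:
        out_file.write(data)
    return len(data)  # number of characters copied

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test17.txt")
    dst = os.path.join(tmp, "new_file17.txt")
    # Stand-in for the UTF-16 file that the PowerShell redirect produced.
    with open(src, "w", encoding="utf-16") as f:
        f.write("This is a test file.\n")
    copied = copy_text(src, dst)
    with open(dst, encoding="utf-8") as f:
        result = f.read()

print(copied)        # 21
print(repr(result))  # 'This is a test file.\n'
```

Using with blocks also closes both files even when decoding fails partway through, which the explicit close() calls in the original script do not guarantee.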
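- Without changing the default output encoding, what remains is the way the file gets written, so I tried different ways of producing it.
- Attempt 1: create the txt file with echo in PowerShell (the method that caused the original error)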
PS D:\pythonp> echo "TEST one">test17.1.txt