Processing text files in Python and removing unwanted characters: reading from a file and removing non-ASCII characters

I have the following program that reads a file word by word and writes the word again to another file but without the non-ascii characters from the first file.

import unicodedata
import codecs

infile = codecs.open('d.txt','r',encoding='utf-8',errors='ignore')
outfile = codecs.open('d_parsed.txt','w',encoding='utf-8',errors='ignore')

for line in infile.readlines():
    for word in line.split():
        outfile.write(word+" ")
    outfile.write("\n")

infile.close()
outfile.close()

The only problem that I am facing is that with this code it does not print a new line to the second file (d_parsed). Any clues??

Solution

codecs.open() doesn't support universal newlines: it always opens the underlying file in binary mode, so on Windows it doesn't translate \r\n to \n while reading (or \n to \r\n while writing).
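The difference is easy to check on any platform. The following is a minimal sketch, not part of the original answer: it writes a small file with Windows-style line endings (the file name crlf.txt and the temporary directory are assumptions made for illustration) and reads it back both ways.

```python
import codecs
import io
import os
import tempfile

# Hypothetical scratch file containing Windows-style line endings, byte for byte.
path = os.path.join(tempfile.mkdtemp(), 'crlf.txt')
with open(path, 'wb') as f:
    f.write(b'first\r\nsecond\r\n')

# codecs.open() reads the file in binary mode underneath, so '\r' survives.
with codecs.open(path, 'r', encoding='utf-8') as f:
    codecs_lines = f.readlines()

# io.open() in text mode applies universal newlines: '\r\n' becomes '\n'.
with io.open(path, 'r', encoding='utf-8') as f:
    io_lines = f.readlines()

print(codecs_lines)
print(io_lines)
```

Only the io.open() result has the carriage returns removed, which is why the two functions can produce different output files from the same input.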

Use io.open() instead:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('d.txt','r',encoding='utf-8',errors='ignore') as infile, \
     io.open('d_parsed.txt','w',encoding='ascii',errors='ignore') as outfile:
    for line in infile:
        print(*line.split(), file=outfile)

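By the way, if you want to remove non-ASCII characters, use ascii rather than utf-8 as the output encoding: with errors='ignore', every character that cannot be encoded as ASCII is dropped on write.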
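The same effect can be seen on a string in memory. This is only an illustrative sketch: strip_non_ascii is a hypothetical helper name, and the sample text is made up for the demonstration.

```python
def strip_non_ascii(text):
    """Hypothetical helper: drop every character that has no ASCII encoding."""
    return text.encode('ascii', errors='ignore').decode('ascii')

# Sample text with accented Latin letters and two CJK characters.
sample = u'caf\xe9 na\xefve \u4f60\u597d ok'

stripped = strip_non_ascii(sample)

# split() followed by join() collapses runs of whitespace, similar to what
# print(*line.split()) does in the file-based example.
normalized = u' '.join(stripped.split())

print(repr(stripped))
print(repr(normalized))
```

Note that dropping the two CJK characters leaves two adjacent spaces behind in stripped; splitting and rejoining the words removes that gap.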

If the input encoding is compatible with ascii (such as utf-8) then you could open the file in binary mode and use bytes.translate() to remove non-ascii characters:

#!/usr/bin/env python
nonascii = bytearray(range(0x80, 0x100))

with open('d.txt','rb') as infile, open('d_parsed.txt','wb') as outfile:
    for line in infile: # b'\n'-separated lines (Linux, OSX, Windows)
        outfile.write(line.translate(None, nonascii))

Unlike the first code example, it doesn't normalize whitespace.
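To make that concrete, here is a small sketch that applies the same translate() call to a single line held in memory (the sample line is an assumption chosen for illustration). In UTF-8 every byte of a multi-byte character is 0x80 or above, so such characters disappear completely, while spaces, tabs and the line ending pass through untouched.

```python
# Delete table: every byte value from 0x80 to 0xFF.
nonascii = bytearray(range(0x80, 0x100))

# Made-up sample line: extra spaces, a tab, non-ASCII characters, and a
# Windows-style line ending, encoded as UTF-8 bytes.
line = u'caf\xe9   na\xefve\t\u4f60\u597d\r\n'.encode('utf-8')

# Passing None as the table keeps the remaining bytes unchanged.
cleaned = line.translate(None, nonascii)

print(repr(cleaned))
```

The three spaces, the tab and the \r\n are all still present in cleaned; only the non-ASCII bytes are gone.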
