How to read UTF-8 in Python 2: reading and writing "éèàçê" from a UTF-8 file in Python 2.7

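This article presents a Python script that fixes badly encoded characters in a UTF-8 file by replacing them with the right characters, and removes trailing whitespace at the end of lines. It addresses the problem of characters such as é in the input text showing up as small squares, makes sure the UTF-8 file is read and written correctly, and strips unnecessary trailing spaces.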

I wrote this script to remove all trailing whitespace characters and to replace all bad French characters with the right ones.

Removing the trailing whitespace works, but replacing the French characters does not.

The files to read and write are encoded in UTF-8, so I added the utf-8 declaration at the top of my script, but in the end every bad character (like \u00e9) is replaced by a little square.

Any idea why?

Script:

# --*-- encoding: utf-8 --*--
import fileinput
import sys

CRLF = "\r\n"
ACCENT_AIGU = "\\u00e9"
ACCENT_GRAVE = "\\u00e8"
C_CEDILLE = "\\u00e7"
A_ACCENTUE = "\\u00e0"
E_CIRCONFLEXE = "\\u00ea"
CURRENT_ENCODING = "utf-8"

#Getting filepath
print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"
path = str(raw_input())
path.replace("\\", "/")

#removing trailing whitespace characters
for line in fileinput.FileInput(path, inplace=1):
    if line != CRLF:
        line = line.rstrip()
        print line
        print >>sys.stderr, line
    else:
        print CRLF
        print >>sys.stderr, CRLF
fileinput.close()

#Replacing bad characters
for line in fileinput.FileInput(path, inplace=1):
    line = line.decode(CURRENT_ENCODING)
    line = line.replace(ACCENT_AIGU, "é")
    line = line.replace(ACCENT_GRAVE, "è")
    line = line.replace(A_ACCENTUE, "à")
    line = line.replace(E_CIRCONFLEXE, "ê")
    line = line.replace(C_CEDILLE, "ç")
    line.encode(CURRENT_ENCODING)
    sys.stdout.write(line) #avoid CRLF added by print
    print >>sys.stderr, line
fileinput.close()

EDIT

The input file contains text like this:

* Cette m\u00e9thode permet d'appeller le service du module de tourn\u00e9e
* rechercherTechnicien et retourne la liste repr\u00e9sentant le num\u00e9ro
* de la tourn\u00e9e ainsi que le nom et le pr\u00e9nom du technicien et la dur\u00e9e
* th\u00e9orique por se rendre au point d'intervention.
*

EDIT2

Final code, in case someone is interested: the first part replaces the badly encoded characters, and the second part removes all trailing whitespace characters.

# --*-- encoding: iso-8859-1 --*--
import fileinput
import re

CRLF = "\r\n"

print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"
path = str(raw_input())
path = path.replace("\\", "/")

def unicodize(seg):
    if re.match(r'\\u[0-9a-f]{4}', seg):
        return seg.decode('unicode-escape')
    return seg.decode('utf-8')

print "Replacing caracter badly encoded"
with open(path,"r") as f:
    content = f.read()

replaced = (unicodize(seg) for seg in re.split(r'(\\u[0-9a-f]{4})',content))

with open(path, "w") as o:
    o.write(''.join(replaced).encode("utf-8"))

print "Removing trailing whitespaces caracters"
for line in fileinput.FileInput(path, inplace=1):
    if line != CRLF:
        line = line.rstrip()
        print line
    else:
        print CRLF
fileinput.close()

print "Done!"
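
For readers on Python 3, where `str` is already Unicode and `print` is a function, the same two steps might be sketched roughly as follows. This is a port under assumptions, not the code above: the names `ESCAPE`, `clean_text` and `clean_file` are made up for illustration, the pattern also accepts uppercase hex digits, line endings are normalized to `\n` instead of being preserved, and characters written as surrogate pairs are not handled.

```python
import io
import re

# A literal backslash, 'u', then four hex digits, as found in the input file.
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def clean_text(text):
    """Decode literal escapes, then strip trailing whitespace from each line."""
    # Step 1: turn each escape into the character it names.
    text = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: drop trailing spaces, tabs and carriage returns on every line.
    return '\n'.join(line.rstrip() for line in text.split('\n'))


def clean_file(path, encoding='utf-8'):
    """Rewrite the file at path in place (hypothetical helper)."""
    # newline='' disables newline translation, so '\r' reaches rstrip().
    with io.open(path, 'r', encoding=encoding, newline='') as f:
        content = f.read()
    with io.open(path, 'w', encoding=encoding, newline='') as f:
        f.write(clean_text(content))
```

Passing the encoding explicitly to `io.open` keeps the result independent of the platform's default encoding, which is one common source of "little square" output.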

Solution

Not so quick, and mostly dirty, but...

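import re

with open("enc.txt","r") as f:
    content = f.read()

def unicodize(seg):
    # A segment is either one \uXXXX escape or a run of ordinary UTF-8 bytes.
    if re.match(r'\\u[0-9a-f]{4}', seg):
        return seg.decode('unicode-escape')
    return seg.decode('utf-8')

# The capturing group makes re.split keep the escapes as separate segments.
replaced = (unicodize(seg) for seg in re.split(r'(\\u[0-9a-f]{4})',content))
print(''.join(replaced))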

Given this input file (mixing Unicode escape sequences and properly encoded UTF-8 text):

* Cette m\u00e9thode permet d'appeller le service du module de
* tourn\u00e9e
* rechercherTechnicien et retourne la liste
* repr\u00e9sentant le num\u00e9ro
* de la tourn\u00e9e ainsi que le nom et le pr\u00e9nom du technicien
* et la dur\u00e9e
* th\u00e9orique por se rendre au point d'intervention.
*
* S'il le désire le technicien peut dormir à l'hôtel

It produces this result:

* Cette méthode permet d'appeller le service du module de
* tournée
* rechercherTechnicien et retourne la liste
* représentant le numéro
* de la tournée ainsi que le nom et le prénom du technicien
* et la durée
* théorique por se rendre au point d'intervention.
*
* S'il le désire le technicien peut dormir à l'hôtel
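
The approach relies on `re.split` keeping the delimiters in its result when the pattern contains a capturing group, so escapes and ordinary text come back as alternating segments that can be converted separately. Here is a minimal in-memory sketch of that behaviour, written for Python 3 rather than 2.7: `str` has no `.decode` there, so each escape segment is first encoded to ASCII bytes. The sample string and this adapted `unicodize` are illustrative assumptions, not the answer's exact code.

```python
import re

ESCAPE = r'\\u[0-9a-f]{4}'


def unicodize(seg):
    """Decode the segment if it is an escape, otherwise return it unchanged."""
    if re.match(ESCAPE, seg):
        # The escape is pure ASCII, so this round trip is safe.
        return seg.encode('ascii').decode('unicode-escape')
    return seg  # already proper text in Python 3


sample = "la dur\\u00e9e th\\u00e9orique, à l'hôtel"

# The parentheses make re.split return the matched escapes as well:
# ['la dur', '\\u00e9', 'e th', '\\u00e9', "orique, à l'hôtel"]
segments = re.split('(' + ESCAPE + ')', sample)

# "la durée théorique, à l'hôtel"
result = ''.join(unicodize(seg) for seg in segments)
```

Text that was already correctly encoded, such as "à l'hôtel", passes through untouched because it never matches the escape pattern.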
