python2 怎么读utf8,Python 2.7阅读和写作“éèàçê”从utf-8文件-CSDN博客

本文介绍了一个Python脚本，用于处理文件中的UTF-8编码错误，替换特殊字符，并移除行尾多余空白。它解决了输入文本中é等字符被转为小方块的问题，并确保正确读写UTF-8编码文件，同时删除了不必要的尾部空格。

摘要由CSDN通过智能技术生成

I made this script which removes every trailing whitespace characters and replace all bad french characters by the right ones.

Removing the trailing whitespace characters works but not the part about replacing the french characters.

The file to read/write are encoded in UTF-8 so I added the utf-8 declaration above my script but in the end every bad characters (like \u00e9) are being replaced by litte square.

Any idea why?

script :

# --*-- encoding: utf-8 --*--

import fileinput

import sys

CRLF = "\r\n"

ACCENT_AIGU = "\\u00e9"

ACCENT_GRAVE = "\\u00e8"

C_CEDILLE = "\\u00e7"

A_ACCENTUE = "\\u00e0"

E_CIRCONFLEXE = "\\u00ea"

CURRENT_ENCODING = "utf-8"

#Getting filepath

print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"

path = str(raw_input())

path.replace("\\", "/")

#removing trailing whitespace characters

for line in fileinput.FileInput(path, inplace=1):

if line != CRLF:

line = line.rstrip()

print line

print >>sys.stderr, line

else:

print CRLF

print >>sys.stderr, CRLF

fileinput.close()

#Replacing bad wharacters

for line in fileinput.FileInput(path, inplace=1):

line = line.decode(CURRENT_ENCODING)

line = line.replace(ACCENT_AIGU, "é")

line = line.replace(ACCENT_GRAVE, "è")

line = line.replace(A_ACCENTUE, "à")

line = line.replace(E_CIRCONFLEXE, "ê")

line = line.replace(C_CEDILLE, "ç")

line.encode(CURRENT_ENCODING)

sys.stdout.write(line) #avoid CRLF added by print

print >>sys.stderr, line

fileinput.close()

EDIT

the input file contains this type of text :

* Cette m\u00e9thode permet d'appeller le service du module de tourn\u00e9e

* rechercherTechnicien et retourne la liste repr\u00e9sentant le num\u00e9ro

* de la tourn\u00e9e ainsi que le nom et le pr\u00e9nom du technicien et la dur\u00e9e

* th\u00e9orique por se rendre au point d'intervention.

EDIT2

Final code if someone is interested, the first part replaces the badly encoded caracters, the second part removes all right trailing whitespaces caracters.

# --*-- encoding: iso-8859-1 --*--

import fileinput

import re

CRLF = "\r\n"

print "Veuillez entrer le chemin du fichier (utiliser des \\ ou /, c'est pareil) :"

path = str(raw_input())

path = path.replace("\\", "/")

def unicodize(seg):

if re.match(r'\\u[0-9a-f]{4}', seg):

return seg.decode('unicode-escape')

return seg.decode('utf-8')

print "Replacing caracter badly encoded"