I'm trying to normalize accented characters in a string in Python 3, like this:
from bs4 import BeautifulSoup
import os

def process_markup():
    # the file is utf-8 encoded
    fn = os.path.join(os.path.dirname(__file__), 'src.txt')
    markup = BeautifulSoup(open(fn), from_encoding="utf-8")
    for player in markup.find_all("div", class_="glossary-player"):
        text = player.span.string
        print(format_filename(text))  # Python console shows mangled characters not in utf-8
        player.span.string.replace_with(format_filename(text))
    dest = open("dest.txt", "w", encoding="utf-8")
    dest.write(str(markup))

def format_filename(s):
    # prepare string
    s = s.strip().lower().replace(" ", "-").strip("'")
    # transliterate accented characters to non-accented versions
    chars_in = "àèìòùáéíóú"
    chars_out = "aeiouaeiou"
    no_accented_chars = str.maketrans(chars_in, chars_out)
    return s.translate(no_accented_chars)

process_markup()
The input file src.txt is UTF-8 encoded:
Fàilte Welcome
àèìòùáéíóú aeiouaeiou
The output file dest.txt looks like this:
???
f??ilte Welcome
???¨???????????????? aeiouaeiou
I'm trying to get it to look like this:
failte Welcome
aeiouaeiou aeiouaeiou
I know there are solutions like unidecode, but I just want to find out what I'm doing wrong here.
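One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: `open(fn)` is called without an `encoding` argument, so the file is decoded with the platform's default encoding, which on some systems is not UTF-8. The text would then already be mangled before `format_filename` sees it, and `from_encoding` cannot repair a string that has already been decoded. The example below leaves BeautifulSoup out and only compares reading the same UTF-8 file with two encodings. The helper `read_and_format`, the temporary file, and the choice of `cp1252` as a stand-in for a non-UTF-8 default are assumptions made for illustration.

```python
import os
import tempfile


def format_filename(s):
    # Same logic as in the question: tidy the string, then strip accents.
    s = s.strip().lower().replace(" ", "-").strip("'")
    no_accented_chars = str.maketrans("àèìòùáéíóú", "aeiouaeiou")
    return s.translate(no_accented_chars)


def read_and_format(path, encoding):
    # Hypothetical helper: read the whole file with an explicit encoding.
    with open(path, encoding=encoding) as f:
        return format_filename(f.read())


# Create a small UTF-8 sample file standing in for src.txt.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("Fàilte")

# cp1252 is assumed here as an example of a non-UTF-8 platform default.
wrong = read_and_format(path, "cp1252")
right = read_and_format(path, "utf-8")
os.remove(path)

print(ascii(wrong))  # accented letter was decoded as two unrelated characters
print(right)         # translation table matches and the accent is removed
```

If this is indeed the cause, passing `encoding="utf-8"` to `open(fn)` in `process_markup` (or opening the file in binary mode so that `from_encoding` applies) should be enough. The mangled console output from `print` may be a separate matter of the console's own encoding.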