Special characters when reading a file in Python: reading special characters from a .txt file

The goal of this code is to find the frequency of words used in a book.

I am trying to read in the text of a book, but the following line keeps throwing my code off:

precious protégés. No, gentlemen; he'll always show 'em a clean pair

Specifically, the é character causes the problem.

I have looked at the following documentation, but I don't quite understand it: https://docs.python.org/3.4/howto/unicode.html

Here's my code:

import string

# Create word dictionary from the comprehensive word list
word_dict = {}

def create_word_dict ():
    # open words.txt and populate dictionary
    word_file = open ("./words.txt", "r")
    for line in word_file:
        line = line.strip()
        word_dict[line] = 1

# Removes punctuation marks from a string
def parseString (st):
    st = st.encode("ascii", "replace")
    new_line = ""
    st = st.strip()
    for ch in st:
        ch = str(ch)
        if (n for n in (1,2,3,4,5,6,7,8,9,0)) in ch or ' ' in ch or ch.isspace() or ch == u'\xe9':
            print (ch)
            new_line += ch
        else:
            new_line += ""
    # now remove all instances of 's or ' at end of line
    new_line = new_line.strip()
    print (new_line)
    if (new_line[-1] == "'"):
        new_line = new_line[:-1]
    new_line.replace("'s", "")
    # Conversion from ASCII codes back to useable text
    message = new_line
    decodedMessage = ""
    for item in message.split():
        decodedMessage += chr(int(item))
    print (decodedMessage)
    return new_line

# Returns a dictionary of words and their frequencies
def getWordFreq (file):
    # Open file for reading the book.txt
    book = open (file, "r")
    # create an empty set for all Capitalized words
    cap_words = set()
    # create a dictionary for words
    book_dict = {}
    total_words = 0
    # remove all punctuation marks other than '[not s]
    for line in book:
        line = line.strip()
        if (len(line) > 0):
            line = parseString (line)
            word_list = line.split()
            # add words to the book dictionary
            for word in word_list:
                total_words += 1
                if (word in book_dict):
                    book_dict[word] = book_dict[word] + 1
                else:
                    book_dict[word] = 1
    print (book_dict)
    # close the file
    book.close()

def main():
    wordFreq1 = getWordFreq ("./Tale.txt")
    print (wordFreq1)

main()

The error that I received is as follows:

Traceback (most recent call last):
  File "Books.py", line 80, in <module>
    main()
  File "Books.py", line 77, in main
    wordFreq1 = getWordFreq ("./Tale.txt")
  File "Books.py", line 60, in getWordFreq
    line = parseString (line)
  File "Books.py", line 36, in parseString
    decodedMessage += chr(int(item))
OverflowError: Python int too large to convert to C long

Solution

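When you open a text file in Python without an encoding argument, Python uses the platform's default encoding. On Windows this is usually an ANSI code page such as cp1252 rather than UTF-8, so if the file is saved as UTF-8, the é character is not decoded correctly. Pass the encoding explicitly, and do the same for the open call that reads the book in getWordFreq. Try: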

word_file = open ("./words.txt", "r", encoding='utf-8')
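
The traceback itself appears to come from a different step. In Python 3, st.encode("ascii", "replace") returns a bytes object, iterating over bytes yields integers, and str(ch) turns each one into a digit string, so new_line ends up as one long run of digits that chr(int(item)) cannot convert. Once the file is decoded correctly, that round trip is not needed at all. Below is a minimal sketch of the counting step, assuming the book is saved as UTF-8 (the question does not confirm this). The name word_frequencies, and the choice to strip ASCII punctuation from the ends of each token and drop a trailing 's, are illustrative assumptions, not the asker's code.

```python
import string
from collections import Counter

# What the original parseString does to the text in Python 3:
sample = "protégés"
encoded = sample.encode("ascii", "replace")   # bytes; each é becomes "?"
digit_run = "".join(str(b) for b in encoded)  # iterating bytes yields ints
# digit_run is one long digit string, so int(digit_run) is far outside
# the range that chr() accepts.


def word_frequencies(path, encoding="utf-8"):
    """Count words in a text file, keeping accented letters such as é.

    Hypothetical helper: the encoding default is an assumption about the file.
    """
    counts = Counter()
    with open(path, "r", encoding=encoding) as book:
        for line in book:
            for token in line.split():
                # Remove ASCII punctuation from both ends of the token only,
                # so apostrophes inside a word (he'll) are kept.
                word = token.strip(string.punctuation)
                # Drop a possessive 's at the end of the word.
                if word.endswith("'s"):
                    word = word[:-2]
                if word:
                    counts[word] += 1
    return counts

# Example call (path taken from the question):
# freq = word_frequencies("./Tale.txt")
```

Because string.punctuation includes the apostrophe, a leading apostrophe as in 'em is stripped too. If the file turns out not to be UTF-8, pass the actual encoding (for example encoding="cp1252") instead.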