I'm trying to read a CSV text file (UTF-8 without BOM, according to Notepad++) using Python. However, there seems to be a problem with encoding:
print(open(path, encoding="utf-8").read())
Codec can't decode byte 0x8f
This little character seems to be the problem: ● (full string: "●• อีเปียขี้บ่น ت •●"), however I'm sure there will be more.
If I try UTF-16, then there is a message:
#also tried with encode
print(open(path, encoding="utf-16").read().encode('utf-8'))
Illegal UTF-16 surrogate
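(As a side note on that error: "Illegal UTF-16 surrogate" doesn't mean the file is UTF-16. When a wrong codec is tried, the decoder pairs the bytes up two at a time and some pairs can land in the reserved surrogate range; other wrong guesses don't raise at all and just produce mojibake. A small illustration, using a Thai fragment from the string above:)

```python
# The same bytes mean completely different things under different codecs.
text = "อีเปีย"
data = text.encode("utf-8")        # 18 UTF-8 bytes

wrong = data.decode("utf-16-le")   # pairs the bytes up as UTF-16 code units
print(wrong == text)               # False: mojibake, not the original text
```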
Even when I try opening it with an automatic codec finder I receive the error.
import csv
import codecs

# Python 2-style recipe: csv.reader over re-encoded byte strings
def csv_unireader(f, encoding="utf-8"):
    for row in csv.reader(codecs.iterencode(codecs.iterdecode(f, encoding), "utf-8")):
        yield [e.decode("utf-8") for e in row]
What am I overlooking? The file contains Twitter texts, which certainly contain a lot of different characters. But surely just reading and printing a file can't be such a difficult task in Python?
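(One way to at least inspect the file and locate the bad bytes is to decode with errors="replace", which substitutes U+FFFD for anything undecodable instead of raising. A minimal sketch, as a diagnostic rather than a fix:)

```python
raw = b"ok \xf8 bad"                        # \xf8 is not valid UTF-8 here
text = raw.decode("utf-8", errors="replace")
print(text)                                 # the U+FFFD marker shows where the damage is

# the same idea when opening a file:
# with open("source.csv", encoding="utf-8", errors="replace") as f: ...
```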
Edit:
Just tried using the code from this answer: http://stackoverflow.com/a/14786752/45311
import csv
with open('source.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
This at least prints some rows to the screen, but it also throws an error after some rows:
File "cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 62-63:
character maps to <undefined>
It seems to automatically use cp850, which is yet another encoding... I can't make sense of all this.
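(That cp850 traceback comes from print(), not from reading the file: the Windows console stream uses the cp850 code page, which can't represent most of those Twitter characters. One workaround, a sketch and not the only possible fix, is to re-encode for the console with errors="replace" before printing:)

```python
import sys

def to_console_safe(text, encoding=None):
    """Re-encode text for a console codec (e.g. cp850) that can't
    represent every character; unmappable ones become '?'."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding, errors="replace")

print(to_console_safe("●• อีเปียขี้บ่น ت •●"))
```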
Solution
Which version of Python are you using?
If you use 2.x, try pasting this import at the beginning of your script:
from __future__ import unicode_literals
then try:
print(open(path).read().encode('utf-8'))
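(If you are actually on Python 3, as the print() calls in the question suggest, another option since Python 3.7 is to reconfigure stdout itself instead of encoding by hand — a sketch, assuming a 3.7+ interpreter:)

```python
import sys

# Switch the stdout text stream to UTF-8 with replacement, so printing
# exotic characters on a cp850 console degrades gracefully instead of
# raising UnicodeEncodeError (reconfigure() exists since Python 3.7).
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

print("●• อีเปียขี้บ่น ت •●")
```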
There is also a great tool for charset detections: chardet.
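(To use chardet, feed it raw bytes rather than decoded text and read the guess out of the dict it returns — a sketch assuming chardet is installed via pip install chardet:)

```python
import chardet  # pip install chardet

raw = "●• อีเปียขี้บ่น ت •●".encode("utf-8")  # stand-in for open(path, "rb").read()
guess = chardet.detect(raw)                    # dict with 'encoding' and 'confidence'
print(guess["encoding"], guess["confidence"])
```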
I hope it'll help you.