First of all, I found the following question, which is basically the same as mine, but it is closed, and I'm not sure I understand how the reason for closing relates to the content of the post. I also don't really see a working answer.
I have 20+ input files from 4 apps. All files are exported as .csv files. The first 19 files worked (4 others exported from the same app work) and then I ran into a file that gives me this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 5762: character maps to <undefined>
If I looked that up right, it is a < ctrl > character. The relevant lines of code are:
with open(file, newline='') as f:
    reader = csv.DictReader(f, dialect='excel')
    for line in reader:
I know I'm going to be getting a file. I know it will be a .csv. There may be some variance in what I get due to the manual generation/export of the source files. There may also be some strange characters in some of the files (e.g. Japanese, Russian, etc.). I provide this information because going back to the source to get a different file might just kick the can down the road until I have to pull updated data (or worse, someone else does).
So the question is probably multi-part:
1) Is there a way to tell the csv.DictReader to ignore undefined characters? (Hint for the codec: if I can't see it, it is of no value to me.)
2) If I do have "crazy" characters, what should I do? I've considered opening each input as a binary file, filtering out offending hex characters, writing the file back to disk and then opening the new file, but that seems like a lot of overhead for the program and even more for me. It's also a few JCL statements from being 1977 again.
3) How do I figure out what I'm getting as an input if it crashes while I'm reading it in?
4) I chose dialect = 'excel' because many of the inputs are Excel files that can be downloaded from one of the source applications. From the docs on csv.DictReader, my impression is that the dialect just defines the delimiter, quote character and EOL characters to expect/use. Therefore, I don't think this is my issue, but I'm also a Python noob, so I'm not 100% sure.
Solution
I posted the solution I went with in the comments above; it was to set the errors argument of open() to 'ignore':
with open(file, newline='', errors='ignore') as f:
This is exactly what I was looking for in my first question in the original post above (i.e. whether there is a way to tell the csv.DictReader to ignore undefined characters).
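To make the trade-off visible, here is a minimal sketch of what errors='ignore' does compared with errors='replace'. It assumes the file is being decoded as cp1252, which is likely the codec behind the 'charmap' message on a Western-locale Windows machine; the sample bytes and the helper read_rows are made up for illustration and are not from my real files.

```python
import csv
import io

# Hypothetical sample: a tiny CSV with one stray 0x8f byte, which the
# cp1252 codec leaves undefined (the same byte as in the traceback).
raw = b"name,city\r\nAl\x8fice,Oslo\r\n"


def read_rows(data, errors):
    """Decode bytes as cp1252 with the given error handler, then parse as CSV."""
    text = io.TextIOWrapper(
        io.BytesIO(data), encoding="cp1252", errors=errors, newline=""
    )
    return list(csv.DictReader(text, dialect="excel"))


# 'ignore' silently drops the undecodable byte.
ignored = read_rows(raw, "ignore")

# 'replace' substitutes U+FFFD, so the damage stays visible in the data.
replaced = read_rows(raw, "replace")

print(ignored)
print(replaced)
```

Note that 'ignore' discards data without any warning, so it only fits cases like mine where the dropped characters genuinely don't matter.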
Update: Later I did need to work with some of the Unicode characters and couldn't ignore them. For Excel-produced Unicode .csv files, the fix was to use the 'utf_8_sig' codec. It decodes the file as UTF-8 and strips the byte order mark (the UTF-8 BOM, bytes EF BB BF) that Excel writes at the start of the file to signal that the content is UTF-8.
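A small sketch of the difference the codec makes, assuming the input really is UTF-8 with a leading BOM; the sample row and the helper parse are invented for illustration.

```python
import csv
import io

# Hypothetical sample: UTF-8 bytes with a leading BOM (EF BB BF) and
# some non-ASCII values. Encoding with "utf-8-sig" prepends the BOM.
raw = "name,city\r\nЮрий,東京\r\n".encode("utf-8-sig")


def parse(data, encoding):
    """Return (header names, rows) for CSV bytes decoded with `encoding`."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    reader = csv.DictReader(text, dialect="excel")
    rows = list(reader)
    return reader.fieldnames, rows


# Plain utf-8 keeps the BOM as U+FEFF glued to the first column name.
plain_headers, _ = parse(raw, "utf-8")

# utf_8_sig strips the BOM, so the first column name comes out clean.
clean_headers, clean_rows = parse(raw, "utf_8_sig")

print(plain_headers)
print(clean_headers)
print(clean_rows)
```

With plain 'utf-8', a lookup like row['name'] fails with a KeyError because the key is actually '\ufeffname'. As far as I know, 'utf_8_sig' also reads UTF-8 files that have no BOM, so it seems a reasonable default for this kind of input.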