I need to compare a unicode string coming from a utf-8 file with a constant defined in the Python script.
I'm using Python 2.7.6 on Linux.
If I run the script below within Spyder (a Python editor), it works, but if I invoke it from a terminal, the second test fails. Do I need to import or define something in the terminal before invoking the script?
Script ("pythonscript.py"):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
some_french_deps = []
idata_raw = csv.DictReader(open("utf8_encoded_data.csv", 'rb'), delimiter=";")
for rec in idata_raw:
    depname = unicode(rec['DEP'],'utf-8')
    some_french_deps.append(depname)
test1 = "Tarn"
test2 = "Rhône-Alpes"
if test1==some_french_deps[0]:
    print "Tarn test passed"
else:
    print "Tarn test failed"
if test2==some_french_deps[2]:
    print "Rhône-Alpes test passed"
else:
    print "Rhône-Alpes test failed"
utf8_encoded_data.csv:
DEP
Tarn
Lozère
Rhône-Alpes
Aude
Run output from Spyder editor:
Tarn test passed
Rhône-Alpes test passed
Run output from terminal:
$ ./pythonscript.py
Tarn test passed
./pythonscript.py:20: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if test2==some_french_deps[2]:
Rhône-Alpes test failed
Solution
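You are comparing a byte string (type str) with a unicode value. When Python 2 compares the two types, it implicitly decodes the byte string using the default encoding, which is normally ASCII. "Tarn" is pure ASCII, so that comparison succeeds everywhere, but the UTF-8 bytes of "Rhône-Alpes" cannot be decoded as ASCII, so in the terminal Python issues the UnicodeWarning and treats the two values as unequal. Spyder has changed the default encoding from ASCII to UTF-8, and your byte strings are encoded as UTF-8, so under Spyder the comparison succeeds.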
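To make the implicit conversion visible, here is a minimal sketch that performs the decoding step by hand. The names raw, expected and decodes_with are made up for illustration, and the byte string is written with escapes so the snippet behaves the same on Python 2.7 and Python 3:

```python
# UTF-8 bytes of "Rhône-Alpes": what a Python 2 str literal holds
# when the source file is saved as UTF-8.
raw = b"Rh\xc3\xb4ne-Alpes"
# The same text as a unicode value (U+00F4 is "ô").
expected = u"Rh\xf4ne-Alpes"


def decodes_with(data, encoding):
    """Hypothetical helper: True if the byte string decodes with `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


ascii_ok = decodes_with(raw, "ascii")     # the decoding a default-ASCII setup relies on
utf8_ok = decodes_with(raw, "utf-8")      # the decoding available once the default is UTF-8
tarn_ok = decodes_with(b"Tarn", "ascii")  # pure ASCII decodes under either encoding
explicit_match = raw.decode("utf-8") == expected

print("%s %s %s %s" % (ascii_ok, utf8_ok, tarn_ok, explicit_match))
```

This should print "False True True True": only the ASCII decoding of the accented name fails, which mirrors the terminal run.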
The solution is to not use byte strings; use unicode literals for your two test values instead:
test1 = u"Tarn"
test2 = u"Rhône-Alpes"
Changing the system default encoding is, in my opinion, a terrible idea. Your code should use Unicode correctly instead of relying on implicit conversions; changing the rules of implicit conversion only increases the confusion and does not make the task any easier.
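One way to avoid depending on the default encoding is to decode every byte string explicitly where the data enters the program and to compare only unicode values afterwards. The following is a rough sketch under that assumption; to_text is a hypothetical helper, and raw_fields stands in for the undecoded DEP column of the CSV file rather than reading it from disk:

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass unicode values through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Stand-ins for the raw (undecoded) DEP values from utf8_encoded_data.csv.
raw_fields = [b"Tarn", b"Loz\xc3\xa8re", b"Rh\xc3\xb4ne-Alpes", b"Aude"]
some_french_deps = [to_text(field) for field in raw_fields]

# Both sides of each comparison are unicode, so no implicit conversion happens.
tarn_passed = some_french_deps[0] == u"Tarn"
rhone_passed = some_french_deps[2] == u"Rh\xf4ne-Alpes"

print("Tarn: %s, Rhone-Alpes: %s" % (tarn_passed, rhone_passed))
```

Because nothing here relies on the default encoding, the result should be the same in Spyder and in a terminal.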