When the input is a string like 'this is a test'
In Python 2:
# basic
s = 'this is a test'
t = u'this is a test'
print type(s) # <type 'str'>
print type(t) # <type 'unicode'>
# transform
print type(s.decode('utf-8')) # <type 'unicode'>
print type(t.encode('utf-8')) # <type 'str'>
In Python 3:
# basic
s = 'this is a test'
t = b'this is a test'
print(type(s)) # <class 'str'>
print(type(t)) # <class 'bytes'>
# transform
print(type(s.encode('utf-8'))) # <class 'bytes'>
print(type(t.decode('utf-8'))) # <class 'str'>
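The two types are easy to confuse with ASCII-only input, because the text and its bytes look the same. A short Python 3 sketch with one non-ASCII character makes the difference visible; the sample word is chosen only for illustration.

```python
# Python 3: text and bytes differ once a non-ASCII character is involved.
s = 'caf\u00e9'          # four characters, the last one is U+00E9
b = s.encode('utf-8')    # UTF-8 needs two bytes for U+00E9

print(len(s))                  # 4
print(len(b))                  # 5
print(b)                       # b'caf\xc3\xa9'
print(b.decode('utf-8') == s)  # True
```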
Differences:
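The conversion you need runs in the opposite direction. In Python 2, a plain literal is a byte string (str), and decode() turns it into text (unicode). In Python 3, a plain literal is already text (str), and encode() turns it into bytes. In both versions decode() goes from bytes to text and encode() goes from text to bytes; what changes is which of the two types the name str refers to.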
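Code that has to accept either type can hide the direction behind two small helpers. This is a minimal sketch: the names to_text and to_bytes are hypothetical (they are not standard library functions), and UTF-8 is assumed as the default encoding.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


print(type(to_text(b'this is a test')))   # <class 'str'> on Python 3
print(type(to_bytes('this is a test')))   # <class 'bytes'> on Python 3
print(to_text(to_bytes('this is a test')) == 'this is a test')  # True
```

Because bytes is an alias for str in Python 2.6 and later, the same helpers should also run there, returning unicode and str respectively.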
When the input contains Unicode escapes like '\u0074\u0068\u0069\u0073'
In Python 2:
"""
'\' is the escape character in Python. To display '\r\n' literally, write '\\r\\n'.
"""
s = '\u0074\u0068\u0069\u0073\u0020\u0069\u0073\u0020\u0061\u0020\u0074\u0065\u0073\u0074'
t = u'\u0074\u0068\u0069\u0073\u0020\u0069\u0073\u0020\u0061\u0020\u0074\u0065\u0073\u0074'
newline_str = '\r\n'
newline_uni = u'\r\n'
print type(s) # <type 'str'>
print type(t) # <type 'unicode'>
print type(newline_str) # <type 'str'>
print type(newline_uni) # <type 'unicode'>
print newline_str # prints a line break
print newline_uni # prints a line break
print s # printed literally as \u0074\u0068\u0069\u0073...; a byte str does not interpret \u escapes
print t # this is a test
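For comparison, in Python 3 a plain literal is text, so \u escapes are interpreted without a u prefix. Escapes that arrive as literal backslash sequences (for example, read from a file) can be converted with the unicode_escape codec. A small sketch, assuming the escaped text is pure ASCII:

```python
# Python 3: a plain literal is text, so \u escapes are interpreted.
s = '\u0074\u0068\u0069\u0073'
print(s)       # this
print(len(s))  # 4

# A raw literal keeps the backslashes, as escaped text read from a file would.
raw = r'\u0074\u0068\u0069\u0073'
print(len(raw))  # 24

# unicode_escape turns the literal escape sequences into characters.
decoded = raw.encode('ascii').decode('unicode_escape')
print(decoded)  # this
```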
An Example: Counting Word Frequency
This example handles special characters in the file, such as '\r\n' and the UTF-8 bytes '\xe3\x80\x80' (the ideographic space, U+3000).
The text of 'The Call of the Wild' comes from http://novel.tingroom.com/jingdian/198/
# Example
with open('TheCallofTheWild.txt') as f:
    text = f.read()

# Replace punctuation and special byte sequences with spaces.
puncs = [',', '.', ';', "'s", '-', ':', '"', '\r\n', '\xe3\x80\x80\xe3\x80\x80']
for punc in puncs:
    text = text.replace(punc, ' ')
print 'Punctuation replacement completed.'

# Sort the distinct words by frequency.
words = text.lower().split(' ')
# Drop the empty string left behind by consecutive spaces.
wordsindex = [w for w in set(words) if w]
wordsindex = sorted(wordsindex, key=lambda x: words.count(x), reverse=True)
print 'The number of distinct words in The Call of The Wild is: {}'.format(len(wordsindex))

# Compute the frequency of each word and save it to a dictionary.
wordsfrequency = {}
for word in wordsindex:
    wordsfrequency[word] = words.count(word)

# Verify that the sort is correct.
for word in wordsindex[1000:1030]:
    print '{}: {}'.format(word, wordsfrequency[word])
# Done.
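The repeated words.count() calls above rescan the whole word list for every distinct word. As an alternative sketch for Python 3 (not the method used above), collections.Counter tallies all words in one pass. The function name word_frequencies and the sample text are made up for illustration, and treating any run of ASCII letters as a word is an assumption.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs in text, ignoring case."""
    # Drop possessive endings first, as the punctuation list above does.
    text = text.lower().replace("'s", ' ')
    # Any run of ASCII letters counts as a word; everything else
    # (punctuation, '\r\n', the ideographic space U+3000) separates words.
    words = re.findall(r'[a-z]+', text)
    return Counter(words)


# Made-up sample containing '\r\n' and two ideographic spaces.
sample = "Buck's friends did not read the newspapers,\r\n\u3000\u3000or Buck would have known."
freq = word_frequencies(sample)
print(freq['buck'])         # 2
print(freq.most_common(1))  # [('buck', 2)]
```

To run this on the novel, the file could be read as text with open('TheCallofTheWild.txt', encoding='utf-8'), assuming the file is UTF-8 encoded; freq.most_common() then returns the words already sorted by frequency.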