Small tricks
1
key = ['a','b','c']
value = [1,2,3]
vocab = dict(zip(key,value))
print(vocab)
Output:
{'a': 1, 'b': 2, 'c': 3}
2
key = ['a','b','c']
vocab = dict(zip(key,range(3)))
print(vocab)
Output:
{'a': 0, 'b': 1, 'c': 2}
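Both tricks map tokens to ids. As a minimal sketch (the names `tokens`, `word2id`, and `id2word` are illustrative and do not come from the notes above), `enumerate` builds the same mapping as trick 2, and inverting it gives the id-to-word lookup that usually goes with a vocabulary:

```python
tokens = ['a', 'b', 'c']

# Same result as dict(zip(tokens, range(len(tokens))))
word2id = {word: idx for idx, word in enumerate(tokens)}

# Reverse lookup: id -> word
id2word = {idx: word for word, idx in word2id.items()}

print(word2id)  # {'a': 0, 'b': 1, 'c': 2}
print(id2word)  # {0: 'a', 1: 'b', 2: 'c'}
```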
Reading the corpus
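all_word = []
# Read the corpus and split every line into space-separated tokens
with open('./enwik9_text', 'r') as f:
    for line in f:
        all_word.extend(line.strip().split(' '))
print(len(all_word))   # total number of tokens
print(all_word[0:10])
# Deduplicate and sort to get the unique words
words = sorted(set(all_word))
print(len(words))      # number of unique tokens
print(words[0:100])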
Output:
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
833184
['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa', 'aaaaaaaaaaaa', 'aaaaaaaaaaaaaaarg', 'aaaaaaaaaaaaahhhhhhhh', 'aaaaaaaaaah', 'aaaaaaaaab', 'aaaaaaaaaghghhgh', 'aaaaaacceglllnorst', 'aaaaaaccegllnorrst', 'aaaaaad', 'aaaaaaf', 'aaaaaah', 'aaaaaalmrsstt', 'aaaaaannrstyy', 'aaaaabbcdrr', 'aaaaaction', 'aaaaae', 'aaaaaf', 'aaaaahh', 'aaaaand', 'aaaaanndd', 'aaaaargh', 'aaaab', 'aaaabb', 'aaaabbbbccc', 'aaaabbbbccccddddeeeeffff', 'aaaabbbbccccddddeeeeffffgggg', 'aaaad', 'aaaae', 'aaaaf', 'aaaah', 'aaaahnkuttie', 'aaaar', 'aaaargh', 'aaaassembly', 'aaaatataca', 'aaaax', 'aaaay', 'aaab', 'aaaba', 'aaabb', 'aaabbbccc', 'aaac', 'aaad', 'aaadietya', 'aaae', 'aaaeealqoff', 'aaaf', 'aaagaaaaaa', 'aaagaattat', 'aaagctactc', 'aaaggacggu', 'aaagh', 'aaah', 'aaahh', 'aaahs', 'aaai', 'aaaimh', 'aaajj', 'aaake', 'aaalac', 'aaam', 'aaamazzarites', 'aaan', 'aaargh', 'aaarm', 'aaarrr', 'aaas', 'aaasthana', 'aaate', 'aaathwan', 'aaavs', 'aaay', 'aab', 'aaba', 'aabaa', 'aabab', 'aababb', 'aabach', 'aabb', 'aabba', 'aabbb', 'aabbcc', 'aabbirem', 'aabbs', 'aabbtrees', 'aabc', 'aabdul', 'aabebwuvev', 'aabehlpt', 'aabel']
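The corpus has 124301826 tokens in total, which matches the count fasttext reports.
The dictionary built by fasttext contains 218316 unique words (tokens).
Using set here gives 833184 unique words (tokens). Judging from the words printed above, fasttext dropped 614868 of them (833184 - 218316), words that carry no meaning.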
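One likely reason for the gap is a minimum-count threshold: fasttext's unsupervised training has a `minCount` option (default 5) that discards rare words. Whether that alone accounts for all 614868 missing words is an assumption and has not been verified here. A minimal sketch on a toy token list, where `build_vocab` and `toy` are hypothetical names, and the kept words are sorted alphabetically as in the code above (fasttext itself orders its dictionary differently):

```python
from collections import Counter

def build_vocab(tokens, min_count=5):
    """Keep words seen at least min_count times and map them to ids."""
    counts = Counter(tokens)
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(kept)}

# Toy data: two frequent words and two rare ones
toy = ['the'] * 6 + ['of'] * 5 + ['aaaab'] * 2 + ['aabel']
vocab = build_vocab(toy, min_count=5)
print(vocab)  # {'of': 0, 'the': 1}
```

Running the same function on `all_word` would show how many of the 833184 unique words survive a given threshold.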
Removing meaningless words
enchant library - 67406
Reference: https://blog.csdn.net/hpulfc/article/details/80997252
1. Install the package
pip install pyenchant
import enchant
words = ['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa', 'aaaaaaaaaaaa', 'aaaaaaaaaaaaaaarg', 'aaaaaaaaaaaaahhhhhhhh', 'aaaaaaaaaah', 'aaaaaaaaab', 'aaaaaaaaaghghhgh', 'aaaaaacceglllnorst', 'aaaaaaccegllnorrst', 'aaaaaad', 'aaaaaaf', 'aaaaaah', 'aaaaaalmrsstt', 'aaaaaannrstyy', 'aaaaabbcdrr', 'aaaaaction', 'aaaaae', 'aaaaaf', 'aaaaahh', 'aaaaand', 'aaaaanndd', 'aaaaargh', 'aaaab', 'aaaabb', 'aaaabbbbccc', 'aaaabbbbccccddddeeeeffff', 'aaaabbbbccccddddeeeeffffgggg', 'aaaad', 'aaaae', 'aaaaf', 'aaaah', 'aaaahnkuttie', 'aaaar', 'aaaargh', 'aaaassembly', 'aaaatataca', 'aaaax', 'aaaay', 'aaab', 'aaaba', 'aaabb', 'aaabbbccc', 'aaac', 'aaad', 'aaadietya', 'aaae', 'aaaeealqoff', 'aaaf', 'aaagaaaaaa', 'aaagaattat', 'aaagctactc', 'aaaggacggu', 'aaagh', 'aaah', 'aaahh', 'aaahs', 'aaai', 'aaaimh', 'aaajj', 'aaake', 'aaalac', 'aaam', 'aaamazzarites', 'aaan', 'aaargh', 'aaarm', 'aaarrr', 'aaas', 'aaasthana', 'aaate', 'aaathwan', 'aaavs', 'aaay', 'aab', 'aaba', 'aabaa', 'aabab', 'aababb', 'aabach', 'aabb', 'aabba', 'aabbb', 'aabbcc', 'aabbirem', 'aabbs', 'aabbtrees', 'aabc', 'aabdul', 'aabebwuvev', 'aabehlpt', 'aabel']
d = enchant.Dict('en_US')
for word in words:
    if d.check(word):
        print(word)
Running this code raises an error:
Error reference: https://installati.one/install-enchant-ubuntu-20-04/
ImportError: The 'enchant' C library was not found and maybe needs to be installed.
Install the system package:
sudo apt install enchant
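To check whether the install worked without crashing the script, the import can be guarded. This is only a sketch: `load_checker` is a hypothetical helper, and it assumes the `en_US` dictionary is the one wanted. It returns None when either the C library or the dictionary is missing.

```python
def load_checker(tag='en_US'):
    """Return an enchant dictionary, or None when enchant is unavailable."""
    try:
        import enchant  # raises ImportError if the C library is missing
    except ImportError as exc:
        print('enchant is not usable:', exc)
        return None
    try:
        return enchant.Dict(tag)
    except enchant.errors.Error as exc:
        # e.g. the requested dictionary is not installed
        print('dictionary not available:', exc)
        return None

checker = load_checker()
if checker is not None:
    # Keep only the words the dictionary accepts
    print([w for w in ['a', 'aa', 'aaa'] if checker.check(w)])
```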
Run the code again:
import enchant
words = ['a', 'aa', 'aaa', 'aaaa', 'aaaaa',