In natural language processing tasks, the text is usually preprocessed first. One important part of this preprocessing is building a vocabulary. The Python code below, explained step by step afterwards, does exactly that.
def gen_vocabulary_file(input_file, output_file):
    vocabulary = {}
    # Python 3: open with an explicit encoding; this replaces the old
    # line.strip().decode('utf-8') call from the Python 2 version.
    with open(input_file, encoding='utf-8') as f:
        counter = 0
        for line in f:
            counter += 1
            tokens = [word for word in line.strip()]
            for word in tokens:
                if word in vocabulary:
                    vocabulary[word] += 1
                else:
                    vocabulary[word] = 1
    # START_VOCABULART is assumed to be defined elsewhere as the list of
    # special tokens, e.g. ['__PAD__', '__GO__', '__EOS__', '__UNK__'].
    vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
    if len(vocabulary_list) > 5000:
        vocabulary_list = vocabulary_list[:5000]
    print(input_file, " vocabulary size:", len(vocabulary_list))
    with open(output_file, "w", encoding='utf-8') as ff:
        for word in vocabulary_list:
            ff.write(word + '\n')
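For completeness, here is an end-to-end sketch of the same routine using `collections.Counter`; the `START_VOCABULART` special tokens shown are an assumption (they are not defined in this text), and the demo file names are hypothetical:

```python
import os
import tempfile
from collections import Counter

# Assumed special tokens; the original text never defines START_VOCABULART.
START_VOCABULART = ['__PAD__', '__GO__', '__EOS__', '__UNK__']

def gen_vocabulary_file(input_file, output_file, max_size=5000):
    # Count per-character frequencies (in Python 3, str is already Unicode).
    counts = Counter()
    with open(input_file, encoding='utf-8') as f:
        for line in f:
            counts.update(line.strip())
    # Special tokens first, then characters by descending frequency.
    vocabulary_list = START_VOCABULART + sorted(counts, key=counts.get, reverse=True)
    vocabulary_list = vocabulary_list[:max_size]
    with open(output_file, 'w', encoding='utf-8') as ff:
        ff.write('\n'.join(vocabulary_list) + '\n')
    return vocabulary_list

# Quick demo on a throwaway corpus.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, 'corpus.txt')
    dst = os.path.join(d, 'vocab.txt')
    with open(src, 'w', encoding='utf-8') as f:
        f.write('你好世界\n你好\n')
    vocab = gen_vocabulary_file(src, dst)
    print(vocab)  # special tokens first, then '你', '好', '世', '界'
```

Because `sorted` is stable, characters with equal frequency keep their first-seen order.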
In this code, the function takes two parameters: an input file and an output file (the vocabulary).
(1) Open the document and count the frequency of each Chinese character:
with open(input_file, encoding='utf-8') as f:
    counter = 0
    for line in f:
        counter += 1
        tokens = [word for word in line.strip()]
Accumulate the counts into the frequency dictionary:
for word in tokens:
    if word in vocabulary:
        vocabulary[word] += 1
    else:
        vocabulary[word] = 1
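The same counting step can be written more idiomatically with `collections.Counter`, which behaves like the manual dictionary above; the sample lines here are made up for illustration:

```python
from collections import Counter

# Hypothetical corpus lines.
lines = ['深度学习', '机器学习']

vocabulary = Counter()
for line in lines:
    # Iterating a str yields one Chinese character at a time.
    vocabulary.update(line.strip())

print(vocabulary['学'])  # 2
```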
Build the vocabulary list from the frequency dictionary, sorted by frequency in descending order:
vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
Keep only the 5,000 most frequent characters:
if len(vocabulary_list) > 5000:
    vocabulary_list = vocabulary_list[:5000]
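If the counts live in a `Counter`, the sort-then-truncate step collapses into `most_common(n)`; the frequencies below are invented for illustration:

```python
from collections import Counter

# Hypothetical character frequencies.
freq = Counter({'的': 10, '是': 7, '了': 5, '在': 3})

# most_common(3) returns the 3 highest-frequency entries, sorted descending.
top = [w for w, _ in freq.most_common(3)]
print(top)  # ['的', '是', '了']
```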
Finally, print the vocabulary size and write the vocabulary to the output file:
print(input_file, " vocabulary size:", len(vocabulary_list))
with open(output_file, "w", encoding='utf-8') as ff:
    for word in vocabulary_list:
        ff.write(word + '\n')
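A vocabulary file written this way (one token per line) is typically read back to build a character-to-id lookup table. This is a hypothetical follow-up sketch, not part of the original code; the special tokens assumed here match the `START_VOCABULART` assumption above:

```python
# Assumed contents of the vocabulary file, one token per line.
vocab_lines = ['__PAD__', '__GO__', '__EOS__', '__UNK__', '你', '好']

# Line number becomes the token id.
word_to_id = {w: i for i, w in enumerate(vocab_lines)}

# Unknown characters map to the __UNK__ id.
unk_id = word_to_id['__UNK__']
ids = [word_to_id.get(ch, unk_id) for ch in '你好吗']
print(ids)  # [4, 5, 3]
```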
If you hit encoding errors under Python 2, the original workaround was to add this at the top of the file:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Note that this only works in Python 2: in Python 3, reload is no longer a builtin and setdefaultencoding has been removed. The correct fix there is to pass encoding='utf-8' to open().
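A minimal Python 3 sketch of that fix, using a throwaway file: declare the encoding explicitly when opening, and choose an error policy instead of patching interpreter defaults.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'demo.txt')
    # Write and read with an explicit encoding; errors='replace' substitutes
    # undecodable bytes instead of raising UnicodeDecodeError.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('词典\n')
    with open(path, encoding='utf-8', errors='replace') as f:
        text = f.read()

print(text)
```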