Generating a Vocabulary from a Document

Latest recommended article published on 2023-09-25 04:17:26

bbzz2

Category: NLP

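In natural language processing tasks, text is usually preprocessed before anything else is done with it. An important part of that preprocessing is building a vocabulary. The Python code for doing so is given here and then explained step by step.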

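# Build the vocabulary file
# START_VOCABULART is used below but never defined in the original post.
# The special tokens listed here are an assumption; replace them with your own.
START_VOCABULART = ['__PAD__', '__GO__', '__EOS__', '__UNK__']

def gen_vocabulary_file(input_file, output_file):
    vocabulary = {}
    # Open as UTF-8 text so that every line is already a decoded str (Python 3)
    with open(input_file, encoding='utf-8') as f:
        counter = 0  # number of lines read (not used further)
        for line in f:
            counter += 1
            # Split the line into single characters; no decode() is needed in Python 3
            tokens = list(line.strip())
            for word in tokens:
                if word in vocabulary:  # already in the vocabulary: increment its count
                    vocabulary[word] += 1
                else:  # first occurrence: start at 1
                    vocabulary[word] = 1
    # Special tokens first, then characters sorted by descending frequency
    vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
    # Keep the 5000 most common entries, which should be roughly enough
    if len(vocabulary_list) > 5000:
        vocabulary_list = vocabulary_list[:5000]  # vocabulary of size 5000
    print(input_file, " vocabulary size:", len(vocabulary_list))
    with open(output_file, "w", encoding='utf-8') as ff:
        for word in vocabulary_list:
            ff.write(word + '\n')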
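
To see the whole pipeline run end to end, here is a minimal, self-contained sketch that builds a vocabulary from a tiny two-line corpus in a temporary directory. It is a compact variant and not the post's code: the names `build_vocabulary`, `SPECIAL_TOKENS`, and `max_size` are hypothetical, the special-token values are an assumption, and `collections.Counter` stands in for the manual dictionary counting.

```python
import os
import tempfile
from collections import Counter

# Hypothetical special tokens (assumption; the post never shows its own list)
SPECIAL_TOKENS = ["__PAD__", "__GO__", "__EOS__", "__UNK__"]


def build_vocabulary(input_file, output_file, max_size=5000):
    """Count characters in input_file and write the most frequent ones, one per line."""
    counts = Counter()
    with open(input_file, encoding="utf-8") as f:
        for line in f:
            counts.update(line.strip())  # a str is iterated character by character
    # most_common() sorts by descending count; ties keep first-seen order
    vocabulary_list = SPECIAL_TOKENS + [ch for ch, _ in counts.most_common()]
    vocabulary_list = vocabulary_list[:max_size]
    with open(output_file, "w", encoding="utf-8") as out:
        for token in vocabulary_list:
            out.write(token + "\n")
    return vocabulary_list


# Try it on a tiny corpus written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    corpus_path = os.path.join(tmp, "corpus.txt")
    vocab_path = os.path.join(tmp, "vocab.txt")
    with open(corpus_path, "w", encoding="utf-8") as f:
        f.write("你好你\n好你\n")
    vocab = build_vocabulary(corpus_path, vocab_path)
    with open(vocab_path, encoding="utf-8") as f:
        written = f.read().splitlines()
```

In this toy corpus 你 occurs three times and 好 twice, so 你 is written right after the special tokens, followed by 好.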



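The function takes two parameters: the input file and the output file (the vocabulary).
Open the document and split each line into characters so their frequencies can be counted: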

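    with open(input_file, encoding='utf-8') as f:
        counter = 0
        for line in f:
            counter += 1
            # The text must be decoded as UTF-8, otherwise the vocabulary ends up garbled.
            # In Python 3, open(..., encoding='utf-8') does this, so no decode() call is needed.
            tokens = list(line.strip())  # tokens is a list of single characters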
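
Why decoding matters can be shown on a single line of text. The sketch below is illustrative only (the sample string and variable names are made up): it contrasts iterating over a decoded `str` with iterating over the raw UTF-8 bytes of the same text.

```python
line = "自然语言\n"

# Python 3 str: iterating yields one Unicode character at a time
tokens = list(line.strip())

# The same text as raw UTF-8 bytes: each of these characters takes 3 bytes,
# so splitting undecoded input yields byte values instead of characters
raw = line.strip().encode("utf-8")
byte_tokens = list(raw)

# Decoding the bytes first recovers the four characters
decoded_tokens = list(raw.decode("utf-8"))
```

Here `tokens` and `decoded_tokens` both hold the four characters, while `byte_tokens` holds twelve integers.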

Update the frequency dictionary:

            for word in tokens:
                if word in vocabulary:  # already in the vocabulary: increment its count
                    vocabulary[word] += 1
                else:  # first occurrence: start at 1
                    vocabulary[word] = 1
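
The same counting can be written more compactly. As a small sketch on made-up input, the following compares a `dict.get` version of the loop with `collections.Counter` from the standard library; both produce the same counts.

```python
from collections import Counter

tokens = list("你好你")

# Manual counting with dict.get, equivalent to the if/else version
vocabulary = {}
for word in tokens:
    vocabulary[word] = vocabulary.get(word, 0) + 1

# Counter performs the same tally in one call
counted = Counter(tokens)
```

`Counter` is a `dict` subclass, so `sorted(counted, key=counted.get, reverse=True)` works on it unchanged.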

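Build the vocabulary list: the characters sorted by descending frequency, prefixed with START_VOCABULART (a list the post does not define, presumably special tokens):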

vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
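
To make the sort concrete, here is a small sketch with made-up counts. `START_TOKENS` is a hypothetical stand-in for the undefined `START_VOCABULART`. Sorting a dict iterates over its keys, `key=vocabulary.get` ranks each key by its count, and because Python's sort is stable even with `reverse=True`, keys with equal counts keep their insertion order.

```python
vocabulary = {"的": 3, "我": 5, "你": 3, "他": 1}

# Hypothetical special tokens (assumption, not from the post)
START_TOKENS = ["__PAD__", "__UNK__"]

# Keys ranked by count, highest first; ties ("的" and "你") keep insertion order
ranked = sorted(vocabulary, key=vocabulary.get, reverse=True)

# Special tokens always occupy the first positions of the vocabulary
vocabulary_list = START_TOKENS + ranked
```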

Keep the first 5000 entries:

    if len(vocabulary_list) > 5000:
        vocabulary_list = vocabulary_list[:5000]  # vocabulary of size 5000
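
One detail worth noting is that the special tokens count toward the limit. The sketch below uses a limit of 5 in place of 5000 and hypothetical names (`MAX_SIZE`, `START_TOKENS`) to show how many real characters survive the cut.

```python
MAX_SIZE = 5  # small stand-in for the 5000 used in the post

# Hypothetical special tokens (assumption, not from the post)
START_TOKENS = ["__PAD__", "__UNK__"]
ranked = ["我", "的", "你", "他", "她"]  # already sorted by frequency

# Slicing is safe even when the list is shorter than MAX_SIZE,
# so the explicit length check is optional
vocabulary_list = (START_TOKENS + ranked)[:MAX_SIZE]

# Characters that made it into the vocabulary after the special tokens
kept_chars = [t for t in vocabulary_list if t not in START_TOKENS]
```

With two special tokens and a limit of 5, only the three most frequent characters are kept.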
 1
2
 1
2

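Print the vocabulary size and write the vocabulary to the output file, one entry per line:

print(input_file, " vocabulary size:", len(vocabulary_list))
with open(output_file, "w", encoding='utf-8') as ff:  # write as UTF-8 regardless of platform default
    for word in vocabulary_list:
        ff.write(word + '\n')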
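
Because the file holds one entry per line, the line number doubles as the entry's integer id. As a sketch of how such a file might be consumed later (the post does not cover this step, and `word_to_id` is a hypothetical name), the vocabulary can be read back into a lookup table:

```python
import os
import tempfile

# Made-up vocabulary for illustration
vocabulary_list = ["__PAD__", "__UNK__", "我", "你"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")

    # Write one entry per line, as in the snippet above
    with open(path, "w", encoding="utf-8") as ff:
        for word in vocabulary_list:
            ff.write(word + "\n")

    # Read it back: the zero-based line index becomes the id
    with open(path, encoding="utf-8") as f:
        word_to_id = {line.rstrip("\n"): idx for idx, line in enumerate(f)}
```

Only the trailing newline is stripped when reading, so an entry that is itself a space character would survive the round trip.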

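If you hit encoding errors under Python 2, add the following at the top of the Python file. This workaround is Python 2 only: Python 3 has no built-in reload() and no sys.setdefaultencoding(), and the usual fix there is to pass encoding='utf-8' to open().

import sys
reload(sys)  # Python 2 only: restores sys.setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf-8')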
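
For Python 3, a hedged sketch of handling encoding problems at the point where the file is opened might look like the following. The file contents are made up: two valid UTF-8 characters followed by one invalid byte. The default strict decoding raises `UnicodeDecodeError`, while `errors="replace"` substitutes U+FFFD for the bad byte and keeps going.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # "你好" encoded as UTF-8, followed by a byte that is never valid in UTF-8
    with open(path, "wb") as f:
        f.write("你好".encode("utf-8") + b"\xff\n")

    # Strict decoding (the default) fails on the invalid byte
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors="replace" swaps the invalid byte for U+FFFD instead of raising
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
```

Whether to replace, ignore, or fail on bad bytes depends on the corpus; failing loudly is often the safer default when building a vocabulary, since replacement characters would otherwise be counted as tokens.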