Mining a Quantifier-Noun Collocation Lexicon for Text Proofreading

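Deep neural models, such as the seq2seq framework used in neural machine translation, have been applied to text error correction. They need large training corpora, however, and even then the accuracy or the throughput may fall short of requirements; end-to-end methods are also hard to interpret. Statistical and rule-based proofreading therefore still has practical value.

One proofreading task is collocation checking, and one kind of collocation checking covers quantifier (measure word) and noun pairs. In “一片猪”, for example, the quantifier is “一片” and the noun is “猪”; the pairing is wrong and should be “一头猪”. Quantifier-noun checking can be done reasonably well with statistics and rules, but this post does not cover the checking itself. Such checking depends on a quantifier-noun collocation lexicon, which can also be called a knowledge base. How to build that knowledge base is the topic of this post.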

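I. Sources of the quantifier-noun collocation lexicon

1. Part of it is compiled from Peking University's 《现代汉语语法信息词典》 (Grammatical Knowledge-base of Contemporary Chinese).

2. The rest is extracted by mining a corpus.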

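II. Mining the collocation lexicon

1. The corpus is the People's Daily (人民日报) corpus for the first half of 1998, which is already word-segmented and part-of-speech tagged.

2. The mining rules are adapted from a paper by Professor Zhang Yangsen (张仰森). They are:

(1) If the token to the right of the quantifier is an enumeration comma (、) or a conjunction, skip this quantifier.

(2) If the quantifier is followed by a noun or by a run of consecutive nouns, take the last noun as the collocate.

(3) If “的” appears after the quantifier and the token immediately to the quantifier's right is a preposition, adverb, conjunction, verb, numeral, or adjective, take the noun after “的” (or the last noun of the noun run after it) as the collocate; otherwise take the noun nearest the quantifier (or the last noun of that noun run).

(4) If no “的” appears after the quantifier and the token immediately to its right is not a preposition, an adverb, the particle “地”, a verb, or a numeral, take the noun nearest the quantifier (or the last noun of that noun run) as the collocate.

(5) Finally, filter the extracted pairs by co-occurrence frequency and mutual information to form the collocation lexicon.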
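
As a rough illustration of rules (1) to (4) on a single clause, here is a minimal sketch. It assumes tokens are written as word/pos with the tags that the program in this post checks for (q for quantifiers, the n-family tags and vn for nouns); the u tag on 的, the function names `last_noun` and `extract_pairs`, and the three toy clauses are made up for illustration and are not taken from the corpus.

```python
NOUN_TAGS = {'n', 'ns', 'nr', 'nt', 'nz', 'vn'}


def last_noun(tokens, start):
    """Return the last noun of the noun run starting at `start`, or None."""
    noun = None
    while start < len(tokens) and tokens[start][1] in NOUN_TAGS:
        noun = tokens[start][0]
        start += 1
    return noun


def extract_pairs(clause):
    """Apply rules (1)-(4) to one clause of space-separated word/pos tokens."""
    tokens = [tuple(t.rsplit('/', 1)) for t in clause.split()]
    words = [w for w, _ in tokens]
    pairs = []
    for i in range(len(tokens) - 1):
        if tokens[i][1] != 'q':
            continue
        right_word, right_pos = tokens[i + 1]
        if right_word == '、' or right_pos == 'c':             # rule (1)
            continue
        start = i + 1                                           # rules (2) and (4)
        if '的' in words[i + 1:]:
            if right_pos in {'p', 'd', 'c', 'v', 'm', 'a'}:     # rule (3)
                start = words.index('的', i + 1) + 1
        elif right_pos in {'p', 'd', 'v', 'm'} or right_word == '地':
            continue                                            # excluded by rule (4)
        noun = last_noun(tokens, start)
        if noun is not None:
            pairs.append((words[i], noun))
    return pairs


print(extract_pairs('三/m 名/q 北京/ns 学生/n'))            # [('名', '学生')]
print(extract_pairs('一/m 场/q 激烈/a 的/u 足球/n 比赛/n'))  # [('场', '比赛')]
print(extract_pairs('两/m 次/q 、/w 三/m 次/q'))             # []
```

In the second clause the adjective right after 场 triggers rule (3), so the search for the noun restarts after 的 and the last noun of the run 足球 比赛 is kept.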

III. Program

import os

NOUN_TAGS = ('n', 'ns', 'nr', 'nt', 'nz', 'vn')
# Clause-level punctuation, in both full-width and ASCII forms.
SENTENCE_BREAKS = set(',?!。:;,?!:;')


def split_token(token):
    '''
    Split a "word/pos" token into (word, pos); return None if the token has no tag.
    '''
    if '/' not in token:
        return None
    word, pos = token.rsplit('/', 1)
    # Drop compound-bracket markup such as "[word/pos" or "word/pos]nt", if present.
    if word.startswith('[') and len(word) > 1:
        word = word[1:]
    return word, pos.split(']')[0]


def last_noun(words, tags, start):
    '''
    Walk the run of consecutive nouns beginning at `start` and return
    the last noun in it, or None if the token at `start` is not a noun.
    '''
    noun = None
    while start < len(tags) and tags[start] in NOUN_TAGS:
        noun = words[start]
        start += 1
    return noun


def fetch_quantifier_noun(data_path):
    '''
    Extract quantifier-noun pairs from a segmented, POS-tagged corpus
    and print each pair as "quantifier:noun".
    :param data_path: directory containing the corpus files
    :return: None
    '''
    for root, dirs, files in os.walk(data_path):
        for file in files:
            with open(os.path.join(root, file), 'r', encoding='utf-8') as file_read:
                for line in file_read:
                    # Split the line into clauses at clause-level punctuation.
                    sents, sent = [], []
                    for token in line.split():
                        parsed = split_token(token)
                        if parsed is None:
                            continue
                        if parsed[0] in SENTENCE_BREAKS:
                            sents.append(sent)
                            sent = []
                        else:
                            sent.append(parsed)
                    sents.append(sent)
                    for sent in sents:
                        words = [word for word, _ in sent]
                        tags = [pos for _, pos in sent]
                        for index in range(len(words) - 1):
                            if tags[index] != 'q':
                                continue
                            right_word = words[index + 1]
                            right_pos = tags[index + 1]
                            # Rule (1): skip if followed by "、" or a conjunction.
                            # (The original test used `or`, which was always true.)
                            if right_pos == 'c' or right_word == '、':
                                continue
                            noun = None
                            if '的' in words[index + 1:]:
                                if right_pos in ('d', 'v', 'p', 'c', 'm', 'a'):
                                    # Rule (3): use the noun run right after the first "的".
                                    de_index = words.index('的', index + 1)
                                    noun = last_noun(words, tags, de_index + 1)
                                else:
                                    # Rule (3), otherwise: noun run nearest the quantifier.
                                    noun = last_noun(words, tags, index + 1)
                            elif right_pos not in ('p', 'd', 'v', 'm') and right_word != '地':
                                # Rule (4): no "的" after the quantifier.
                                noun = last_noun(words, tags, index + 1)
                            if noun is not None:
                                print('{}:{}'.format(words[index], noun))

import collections
import math
import os


class QuantifierNounMI:
    def __init__(self, filepath, mipath):
        self.filepath = filepath  # input: one "quantifier:noun" pair per line
        self.mipath = mipath      # output: filtered collocations with MI scores

    def build_corpus(self):
        '''
        Read the extracted pairs.
        :return: list of [quantifier, noun] pairs
        '''
        sentences = []
        with open(self.filepath, 'r', encoding='utf-8') as f_read:
            for line in f_read:
                pair = line.strip().split(':')
                if len(pair) == 2 and all(pair):  # skip blank or malformed lines
                    sentences.append(pair)
        return sentences

    def count_words(self, sentences):
        '''
        Count word frequencies, pooling quantifiers and nouns together.
        :param sentences: list of [quantifier, noun] pairs
        :return: (word -> frequency, total number of words)
        '''
        word_dict = collections.Counter()
        for sent in sentences:
            word_dict.update(sent)
        return word_dict, sum(word_dict.values())

    def count_cowords(self, sentences):
        '''
        Count, for each quantifier, how often each noun co-occurs with it.
        :param sentences: list of [quantifier, noun] pairs
        :return: quantifier -> Counter of nouns
        '''
        co_dict = collections.defaultdict(collections.Counter)
        for quantifier, noun in sentences:
            co_dict[quantifier][noun] += 1
        return co_dict

    def compute_mi(self, word_dict, co_dict, sum_tf):
        '''
        Compute the mutual information of each quantifier-noun pair.
        :param word_dict: word -> frequency
        :param co_dict: quantifier -> Counter of co-occurring nouns
        :param sum_tf: total number of words
        :return: quantifier -> list of (noun, mi), sorted by mi in descending order
        '''
        def pmi(p1, p2, p12):
            return math.log2(p12) - math.log2(p1) - math.log2(p2)

        mis_dict = dict()
        for word, co_counter in co_dict.items():
            mi_dict = {}
            for co_word, co_tf in co_counter.most_common():
                # Keep only pairs with a co-occurrence frequency >= 2.
                if co_tf < 2 or co_word == word:
                    continue
                p1 = word_dict[word] / sum_tf
                p2 = word_dict[co_word] / sum_tf
                p12 = co_tf / sum_tf
                mi_dict[co_word] = pmi(p1, p2, p12)
            mis_dict[word] = sorted(mi_dict.items(), key=lambda item: item[1], reverse=True)

        return mis_dict

    def save_mi_result(self, mis_dict):
        '''
        Write out the pairs with co-occurrence frequency >= 2 and mutual information >= 4.
        :param mis_dict: quantifier -> list of (noun, mi)
        :return: None
        '''
        with open(self.mipath, 'w', encoding='utf-8') as f_write:
            for word, co_words in mis_dict.items():
                # Keep only nouns whose mutual information is >= 4.
                co_infos = [item[0] + '@' + str(item[1]) for item in co_words if item[1] >= 4]
                if len(co_infos) != 0:
                    f_write.write(word + '\t' + ','.join(co_infos) + '\n')

    # Main entry point
    def calcute(self):
        print('step 1/5: reading the corpus ..........')
        sentences = self.build_corpus()

        print('step 2/5: counting word frequencies ..........')
        word_dict, sum_tf = self.count_words(sentences)

        print('step 3/5: counting co-occurring words ..........')
        co_dict = self.count_cowords(sentences)

        print('step 4/5: computing mutual information ..........')
        mi_data = self.compute_mi(word_dict, co_dict, sum_tf)

        print('step 5/5: saving the mutual information ..........')
        self.save_mi_result(mi_data)

        print('done!.......')


if __name__ == '__main__':
    mi_corpus_path = os.path.join(os.getcwd(), 'corpus.txt')
    data_write_path = os.path.join(os.getcwd(), 'result.txt')
    quantifierNounMI = QuantifierNounMI(mi_corpus_path, data_write_path)
    quantifierNounMI.calcute()
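
To make the scoring concrete, here is a small worked example of the mutual information log2(p(q,n) / (p(q) p(n))). It follows the same convention as compute_mi in the program above: word frequencies are pooled over quantifiers and nouns, and all three probabilities share the total word count as the denominator. The pairs and the helper name `pmi_table` are invented for illustration only.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Return {(quantifier, noun): PMI} using pooled word counts."""
    word_counts = Counter(w for pair in pairs for w in pair)
    total = sum(word_counts.values())  # two words per pair
    scores = {}
    for (q, n), co_tf in Counter(pairs).items():
        p_q = word_counts[q] / total
        p_n = word_counts[n] / total
        p_qn = co_tf / total
        scores[(q, n)] = math.log2(p_qn / (p_q * p_n))
    return scores


# Toy data (made up): 10 pairs, so 20 words in total.
pairs = [('头', '猪')] * 3 + [('头', '牛')] * 2 + [('个', '人')] * 4 + [('个', '猪')]
scores = pmi_table(pairs)
print(round(scores[('头', '猪')], 3))  # 1.585, i.e. log2(3)
```

Here 头 occurs 5 times, 猪 4 times and the pair 3 times out of 20 words, so the ratio is 0.15 / (0.25 × 0.2) = 3. The program additionally drops pairs seen fewer than twice and pairs whose score is below 4.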

IV. Results

1. Sample quantifier-noun pairs extracted by fetch_quantifier_noun:

次:代表大会
个:中国
种种:威胁
项:原则
次:代表大会
件:大事
条:战线
名:首都
名:首都
名:首都
场:题
位:音乐家
台:交响音乐会
件:事
盏:灯笼
里:长街
个:建设
个:建设
届:委员会
个:乡
个:县
次:全国
批:代表
年:时间
批:中央
支:模范

2. The lexicon mined by QuantifierNounMI using frequency and mutual information (the number after @ is the mutual information):

次 主席团@4.85493845217302,访华@4.85493845217302,盛会@4.85493845217302,发射@4.85493845217302,手术@4.85493845217302,闭幕式@4.85493845217302,来访@4.85493845217302,普查@4.85493845217302,集会@4.85493845217302,能源@4.85493845217302,审议@4.85493845217302,审查@4.85493845217302,增资@4.85493845217302,瑞环@4.85493845217302,打击@4.85493845217302,失败@4.85493845217302,集训@4.85493845217302,旅行@4.85493845217302,纪要@4.85493845217302,摄影展@4.85493845217302,采油@4.85493845217302,接见@4.85493845217302,闭幕会@4.85493845217302,印刷版@4.85493845217302,呼吁@4.85493845217302,升级@4.85493845217302,灵感@4.85493845217302,传唤@4.85493845217302,通气会@4.85493845217302,检阅@4.85493845217302,认证@4.85493845217302,出访@4.85493845217302,受伤@4.85493845217302,会面@4.85493845217302,全省性@4.85493845217302,影展@4.85493845217302,听证会@4.85493845217302,对接@4.85493845217302,库奇马@4.85493845217302,开幕会@4.85493845217302,对话会@4.85493845217302,首日封@4.85493845217302,发言台@4.85493845217302,汇编@4.85493845217302,检修@4.85493845217302,打假@4.85493845217302,改进@4.85493845217302,骨灰@4.85493845217302,评议@4.85493845217302,执法@4.85493845217302,观樱会@4.85493845217302,体操赛@4.85493845217302,绿灯@4.85493845217302,接触@4.85493845217302,解放@4.85493845217302,闪电@4.85493845217302,风暴潮@4.85493845217302,大锅饭@4.85493845217302,反击@4.85493845217302,测验@4.85493845217302,全会@4.7960447631194505,巨变@4.772476291981047,会晤@4.7506017923582835,会议@4.7471351626385045,表决@4.729407570089162,谈话@4.6850134507307075,访问@4.660435428010043,核试验@4.646351830361601,飞跃@4.632546030836572,良机@4.632546030836572,提名@4.632546030836572,创业@4.623612906066564,检查@4.591904046339225,议程@4.533010357285658,浪潮@4.533010357285658,日程@4.533010357285658,拍卖会@4.533010357285658,招待会@4.517903464895449,高潮@4.517903464895449,机遇@4.46262102939426,采访@4.439900952894176,配合@4.439900952894176,民意测验@4.439900952894176,招标@4.439900952894176,促进@4.439900952894176,考察队@4.439900952894176,演习@4.439900952894176,抽查@4.439900952894176,三等功@4.439900952894176,新月@4.439900952894176,震动@4.439900952894176,全市性@4.439900952894176,日本队@4.439900952894176,会@4.4074794752018,调整@4.3955068335357215,开幕式@4.3955068335357215,交谈@4.369511625002778,射门@4.369511625002778,立场@4.369511625002778,失误@4.369511625002778,侧记@4.340365279343262,考试@4.324423735474239,机会@4.299723294845915,核查@4.269975951451865,海湾@4.269975951451865,订货会@4.269975951451865,失利@4.269975951451865,发掘@4.269975951451865,净化@4.269975951451865,起义@4.269975951451865,盛况@4.269975951451865,会场@4.269975951451865,支援@4.269975951451865,分配@4.269975951451865,期望@4.269975951451865,竞选@4.269975951451865,进攻@4.269975951451865,冲击@4.269975951451865,平均值@4.269975951451865,装卸费@4.269975951451865,检验@4.269975951451865,热身@4.269975951451865,污染@4.269975951451864,会见@4.269975951451864,部长会议@4.269975951451864,研讨会@4.238267091724525,演讲@4.202861755593325,尝试@4.1919734394505905,列车@4.18935749124358,人数@4.176866547060381,淘汰赛@4.176866547060381,历史性@4.154498734031927,讲座@4.117972858006814,修改@4.117972858006814,拍卖@4.117972858006814,整顿@4.077330873509467,画展@4.047583530115416,征文@4.047583530115416
种种 借口@8.989974996348284,疑虑@8.989974996348284,阻力@8.989974996348284,恩惠@8.989974996348284,束缚@8.989974996348284,暴行@8.989974996348284,迹象@8.874497778928347,弊端@8.627404916963576,疑问@8.504548169178042,理由@8.504548169178042,怀疑@8.40501249562713,干扰@8.40501249562713,局限@8.40501249562713,议论@8.40501249562713,弊病@8.40501249562713,原因@8.283706199404994,诱惑@7.989974996348286,手法@7.668046901460924,罪名@7.40501249562713,限制@7.40501249562713,风险@7.405012495627128,考虑@7.182620074290682,误区@7.182620074290682,考验@6.989974996348284,因素@6.475401823518526,尝试@5.742047482904701,变化@5.716956501941867,思想@4.8606919794033185,行为@4.466413040291274,问题@4.305476822076214,经济@4.059237658785399
项 守则@6.057089192206821,注意@6.057089192206821,社会性@6.057089192206821,奥斯卡奖@6.057089192206821,裁决@6.057089192206821,公证@6.057089192206821,职权@6.057089192206821,基本功@6.057089192206821,主张@5.9575535186559065,专利@5.834696770870373,经常性@5.694519112822112,法案@5.6851204148198615,硬指标@5.642051692927977,金像奖@5.642051692927977,大奖@5.609630215235601,政策性@5.571662365036579,任务@5.544137936401054,指标@5.486773467450067,不信任案@5.472126691485666,指控@5.472126691485666,权力@5.472126691485666,判决@5.472126691485666,任命@5.472126691485666,污染物@5.472126691485666,修改案@5.472126691485666,宣言@5.472126691485666,公约@5.4721266914856646,发明@5.4721266914856646,协议@5.446131482952721,声明@5.4373612733249885,决策@5.346595809401807,议题@5.320123598040615,使命@5.320123598040615,难度@5.320123598040615,收费@5.279481613543268,修正案@5.249734270149217,内容@5.217553864400067,原则@5.200047146048863,命令@5.1826200742906785,承诺@5.139551352398794,成果@5.110528451168282,举措@5.102892881819946,工作@5.0991124071025276,实施@5.057089192206822,事业@5.057089192206821,统计@5.057089192206821,桂冠@5.057089192206821,奖励@5.057089192206821,调查@4.999756017140868,纪录@4.902366597408179,权利@4.8871641907645085,协定@4.872664621069394,要求@4.845585087013108,前提@4.834696770870373,计划@4.810928604937423,实验@4.794054786373026,制度@4.791195132233785,决定@4.787000028839076,技术@4.743931306947191,工艺@4.735161097319461,义务@4.735161097319461,课题@4.735161097319459,金奖@4.735161097319459,工程@4.713604953953011,公报@4.694519112822112,业务@4.687855382541102,备忘录@4.67857756895309,进展@4.642051692927977,高新技术@4.642051692927977,研究@4.554588851677638,服务@4.52103629196661,罪名@4.472126691485666,奖@4.4721266914856646,赛事@4.4248209767073075,空白@4.356649474065728,费用@4.340882158207412,议案@4.320123598040615,功能@4.320123598040615,条约@4.320123598040615,措施@4.26618479092981,条款@4.249734270149219,热点@4.249734270149219,考核@4.249734270149219,基础@4.249734270149215,冠军@4.223099143645749,荣誉@4.1826200742906785,禁令@4.1826200742906785,方案@4.169563921465233,活动@4.0715887619019355,遗产@4.057089192206822
件 衬衫@7.144221783932796,羊毛衫@7.144221783932796,衣物@7.144221783932796,坏事@7.144221783932796,褂子@7.144221783932796,难事@7.144221783932796,行李@7.144221783932796,皮夹克@7.144221783932796,T恤衫@7.144221783932796,运动衫@7.144221783932796,皮袄@7.144221783932795,大事@7.076658500120161,好事@6.9798349660319134,实物@6.974296782490483,新鲜事@6.921829362596347,实事@6.903213684429001,艺术品@6.881187378099001,盛事@6.854715166737809,物品@6.822293689045432,展品@6.822293689045432,事情@6.733757014443774,棉大衣@6.729184284653952,憾事@6.729184284653952,小事@6.684790165295497,事@6.670290595600382,棉衣@6.559259283211642,大衣@6.559259283211642,用具@6.559259283211642,珍品@6.33686686187519,新衣@6.144221783932798,蓝色@6.144221783932798,黑色@6.144221783932798,衣服@6.144221783932795,案子@6.144221783932795,事儿@5.921829362596348,趣事@5.822293689045436,文物@5.765710160679064,家具@5.729184284653952,礼品@5.40725618976659,提案@5.226683944124768,投诉@5.144221783932798,商标@4.974296782490486,往事@4.896294270489209,议案@4.629648611103037,精品@4.559259283211642,服装@4.559259283211642,作品@4.545184097999916
条 承运人@6.236956996896712,两用品@6.236956996896712,捷径@6.236956996896712,灾区@6.236956996896712,小路@6.236956996896712,支流@6.236956996896712,探矿权@6.236956996896712,指导价@6.236956996896712,小溪@6.236956996896712,致富路@6.236956996896712,国道@6.236956996896712,大通道@6.236956996896712,飘带@6.236956996896712,管线@6.236956996896712,死胡同@6.236956996896712,商业街@6.236956996896712,小道@6.236956996896712,小街@6.236956996896712,邮路@6.236956996896712,星河@6.236956996896712,裤子@6.236956996896712,月报@6.236956996896712,性命@6.236956996896712,销路@6.236956996896712,融资@6.236956996896712,毛毯@6.236956996896712,界线@6.236956996896712,巷道@6.236956996896712,赛道@6.236956996896712,干道@6.236956996896712,分界线@6.236956996896712,银河@6.236956996896712,采矿权@6.236956996896712,缎带@6.236956996896712,大路@6.236956996896712,黄金水道@6.236956996896712,要道@6.236956996896712,运河@6.236956996896712,生命线@6.236956996896712,排污沟@6.236956996896712,托运人@6.236956996896712,行政处罚权@6.236956996896712,消防车@6.236956996896712,主任委员@6.236956996896712,牛仔裤@6.236956996896712,保障线@6.236956996896712,人命@6.236956996896712,海路@6.236956996896712,腿@6.236956996896711,中华人民共和国@6.236956996896711,大街@6.236956996896711,巨龙@6.236956996896711,线索@6.236956996896711,规律@6.23695699689671,小河@6.23695699689671,大动脉@6.23695699689671,血站@6.23695699689671,主线@6.23695699689671,江河@6.23695699689671,公安@6.23695699689671,车道@6.23695699689671,用血@6.23695699689671,申请人@6.23695699689671,沟@6.23695699689671,球道@6.23695699689671,街@6.204535519204335,航线@6.149494155646372,河流@6.137421323345797,路子@6.1300417929802,高速公路@6.121479779476775,战线@6.060079234812631,干线@6.044311918954316,河@6.044311918954315,主干道@6.014564575560264,直线@5.973922591062916,全长@5.973922591062916,生路@5.973922591062916,纽带@5.973922591062916,专线@5.947450379701725,途径@5.915028902009349,热线@5.915028902009349,龙@5.9150289020093485,小船@5.9150289020093485,小巷@5.9150289020093485,缝@5.9150289020093485,出路@5.9150289020093485,电缆@5.9150289020093485,产业化@5.9150289020093485,江@5.821919497617868,马路@5.821919497617868,命@5.821919497617868,道路@5.821919497617867,禁令@5.777525378259413,党组织@5.751530169726469,门路@5.751530169726468,大道@5.722383824066954,线路@5.722383824066953,生产线@5.716124833595269,路线@5.696388615534009,步行街@5.651994496175558,总长@5.651994496175558,灰鲸@5.651994496175558,有效期@5.651994496175558,使用费@5.651994496175558,胡同@5.651994496175558,光缆@5.651994496175558,曲线@5.651994496175558,喜讯@5.651994496175558,消防队@5.651994496175558,秘书处@5.651994496175558,管道@5.651994496175556,纪律@5.651994496175556,标语@5.651994496175554,新闻@5.60892577428367,水泥路@5.499991402730506,水渠@5.499991402730506,铁路@5.4999914027305055,横幅@5.4296020748391065,规定@5.3936825005841635,公路@5.343872200813222,线@5.313578278499624,机关@5.310957578340488,经营者@5.282760686509836,许可证@5.236956996896714,道理@5.236956996896714,真理@5.236956996896714,通道@5.236956996896712,航道@5.236956996896712,路@5.189651282118354,信息@5.178063307843142,街道@5.149494155646372,规矩@5.067031995454398,跑道@5.067031995454398,黄河@5.014564575560264,犯罪@5.014564575560264,马克思主义@5.014564575560264,电话线@4.915028902009352,渠道@4.883320042282011,隧道@4.858445373642979,铁@4.651994496175558,条幅@4.651994496175558,鱼@4.499991402730506,船@4.482069494733243,程度@4.42960207483911,火灾@4.36248787898057,渔船@4.36248787898057,防线@4.330066401288192,消息@4.282760686509836,原则@4.236956996896711,经验@4.166567669005313,集装箱@4.1214797794767755,路段@4.067031995454402,需要@4.067031995454402,位置@4.067031995454402,货物@4.067031995454402,人类@4.067031995454402,生命@4.03532313572706
名 监督员@5.058741574750174,主犯@5.058741574750174,打工妹@5.058741574750174,巴勒斯坦人@5.058741574750174,意大利人@5.058741574750174,待业青年@5.058741574750174,示威者@5.058741574750174,驾驶员@5.058741574750174,报关员@5.058741574750174,车手@5.058741574750174,被告@5.058741574750174,华工@5.058741574750174,武士@5.058741574750172,约旦人@5.058741574750172,游击队员@5.058741574750172,公仆@5.058741574750172,保管员@5.058741574750172,男童@5.058741574750172,劳工@5.058741574750172,无家可归者@5.058741574750172,技工@5.058741574750172,采购员@5.058741574750172,报考者@5.058741574750172,主妇@5.058741574750172,囚犯@5.058741574750172,文艺工作者@5.058741574750172,夫人@5.058741574750172,金宝@5.058741574750172,技师@5.058741574750172,幼儿@5.058741574750172,贵宾@5.058741574750172,荷兰人@5.058741574750172,不法分子@5.058741574750172,人人@5.058741574750172,丹麦@5.058741574750172,务工人员@5.058741574750172,伊朗人@5.058741574750172,土耳其人@5.058741574750172,劫机者@5.058741574750172,与会者@5.058741574750172,登山队员@5.058741574750172,监票人@5.058741574750172,经济界@5.058741574750172,受助生@5.058741574750172,责任人员@5.058741574750172,代管权@5.058741574750172,毒贩@5.058741574750172,幸存者@5.058741574750172,所长@5.058741574750172,发明家@5.058741574750172,画童@5.058741574750172,指战员@5.058741574750172,军民@5.058741574750172,骑车人@5.058741574750172,德国人@5.058741574750172,特工@5.058741574750172,现年@5.058741574750172,屠夫@5.058741574750172,盲童@5.058741574750172,智残人@5.058741574750172,正职@5.058741574750172,港人@5.058741574750172,挡车工@5.058741574750172,扑火队@5.058741574750172,座次@5.058741574750172,审判员@5.058741574750172,检验员@5.058741574750172,收费员@5.058741574750172,白人@5.058741574750172,自学成才者@5.058741574750172,炊事员@5.058741574750172,师生@4.976279414558201,嫌疑人@4.976279414558201,士兵@4.906738481305124,少先队员@4.888816573307862,员工@4.888816573307861,军嫂@4.866096496807776,警察@4.855208180665042,运动员@4.840561404700639,警官@4.836349153413725,嫌疑犯@4.836349153413725,小学生@4.836349153413724,官兵@4.817733475246379,平民@4.817733475246378,民兵@4.79570716891638,女童@4.79570716891638,文艺家@4.79570716891638,男队@4.79570716891638,战俘@4.7857230803437565,中小学生@4.769234957555188,干警@4.763285691224003,飞行员@4.736813479862812,考生@4.736813479862812,观察员@4.736813479862812,营业员@4.736813479862812,劳力@4.736813479862812,常委@4.736813479862812,下岗@4.736813479862812,机关干部@4.721706587472602,职工@4.715934750186896,公务员@4.686772797363215,党员@4.680229951496444,特困生@4.680229951496443,青工@4.671718451640926,民警@4.643704075471329,伤员@4.643704075471328,船员@4.643704075471328,首都@4.643704075471328,选民@4.643704075471328,养路工@4.643704075471328,乘务员@4.643704075471328,利比亚人@4.643704075471328,案犯@4.643704075471328,本科生@4.643704075471328,警卫@4.643704075471328,青壮年@4.643704075471328,管理员@4.643704075471328,主管@4.643704075471328,犯人@4.643704075471328,缉私队员@4.643704075471328,球员@4.625782167474067,难民@4.611282597778953,议员@4.596837470433354,队员@4.580694277945528,灾民@4.573314747579932,会员@4.573314747579932,参赛者@4.573314747579932,秘书@4.573314747579932,儿童@4.573314747579931,选手@4.569157122106277,劳动模范@4.544168401920414,工人@4.542166049009267,职员@4.528226858051394,家属@4.51125377944768,军官@4.506200551721392,民工@4.473779074029018,团员@4.473779074029018,罪犯@4.473779074029018,战犯@4.473779074029018,男性@4.473779074029018,市委@4.473779074029018,处长@4.473779074029018,学子@4.473779074029018,受害者@4.473779074029018,女将@4.473779074029018,家境@4.473779074029018,留学人员@4.473779074029018,巡警@4.473779074029018,责任人@4.473779074029018,助手@4.473779074029018,保级战@4.473779074029018,洗河@4.473779074029018,犹太人@4.473779074029018,师资@4.473779074029018,入党@4.473779074029018,运管员@4.473779074029018,共青团员@4.473779074029018,参议员@4.473779074029018,港胞@4.473779074029018,中锋@4.473779074029018,塞内加尔@4.473779074029018,行人@4.473779074029018,分子@4.473779074029016,新兵@4.473779074029016,中学生@4.473779074029016,学生@4.448688093066186,外交官@4.430710352137131,矿工@4.4213116541348825,华侨@4.321775980583967,退伍军人@4.321775980583967,好手@4.321775980583967,后卫@4.321775980583967,干部@4.316893759204221,歹徒@4.306669088193758,观众@4.303854072586705,人员@4.30085532895687,爱好者@4.293206828387197,工作者@4.28113399608662,教职工@4.25138665269257,护士@4.25138665269257,患者@4.2393138203919944,老兵@4.210744668195224,听众@4.210744668195224,标兵@4.210744668195224,骨干@4.18427245683403,学员@4.177386071248791,战士@4.17421879217011,工程师@4.165656778666683,群众@4.158277248301087,游客@4.111208994644308,成员@4.099383559247518,大学生@4.089115223793692,资格@4.058741574750174,宇航员@4.058741574750174,身高@4.058741574750174,一把手@4.058741574750174,队友@4.058741574750174,售票员@4.058741574750174,侨胞@4.058741574750174,打工者@4.058741574750174,军代表@4.058741574750174,外援@4.058741574750174,工友@4.058741574750174,雇员@4.058741574750174,教练员@4.058741574750174,裁判员@4.058741574750174,将领@4.058741574750174,民办教师@4.058741574750174,应聘者@4.058741574750174,检察官@4.058741574750174,监察员@4.058741574750174,店主@4.058741574750174,特派员@4.058741574750174,得分@4.058741574750174,董事@4.058741574750174,乘客@4.058741574750173,男子@4.058741574750172,研究生@4.058741574750172,业务员@4.058741574750172,军@4.058741574750172,技术员@4.058741574750172,八一队@4.058741574750172,人民警察@4.058741574750172,女婴@4.058741574750172,教师@4.04409479878577
场 泥雨@6.852930126247385,及时雨@6.852930126247384,搏斗@6.852930126247383,降水@6.852930126247383,恶战@6.852930126247383,人民战争@6.852930126247383,暴风雨@6.852930126247383,降雨@6.852930126247383,演唱会@6.852930126247383,战乱@6.852930126247383,持久战@6.852930126247383,大雾@6.852930126247383,朗诵会@6.852930126247383,治郅@6.852930126247383,闹剧@6.852930126247383,噩梦@6.852930126247383,国安队@6.852930126247383,对攻战@6.852930126247383,冰雹@6.852930126247383,混战@6.852930126247383,友谊赛@6.852930126247383,屠杀@6.852930126247383,灾难@6.672357880605563,讨论@6.638805320894536,官司@6.630537704910936,平局@6.630537704910936,重头戏@6.531002031360023,辩论@6.531002031360023,大雪@6.531002031360022,争论@6.531002031360022,半决赛@6.505006822827078,变革@6.460612703468624,小组赛@6.43789262696854,暴风@6.437892626968539,血战@6.437892626968539,热身赛@6.437892626968539,战争@6.312361744884682,车祸@6.267967625526229,争夺战@6.267967625526229,演出@6.253468055831113,补赛@6.115964532081177,大火@6.0873953798844065,危机@6.0865207541886965,激战@6.045575204189781,音乐会@6.038485779403461,胜利@6.016428858530262,雨@6.004933219692433,风暴@5.959845330163895,球@5.95726678606212,硬仗@5.935392286439357,雪灾@5.852930126247385,游戏@5.852930126247385,生死@5.852930126247385,大战@5.852930126247385,保卫战@5.852930126247385,京剧@5.852930126247385,革命@5.852930126247384,交锋@5.852930126247384,冲突@5.852930126247384,攻坚战@5.852930126247384,对抗赛@5.852930126247383,大会战@5.852930126247383,比赛@5.73166724350466,单打@5.715426602497447,表演@5.683005124805073,运动@5.683005124805072,斗争@5.660285048304988,风波@5.531002031360023,细雨@5.531002031360023,雪@5.531002031360022,悲剧@5.474418502993653,比分@5.3934985076100865,战斗@5.267967625526229,报告会@5.267967625526229,婚礼@5.267967625526229,金融@5.152490408106292,争夺@5.152490408106292,较量@5.0753225475838315,主场@5.045575204189781,表演赛@5.045575204189781,论坛@4.852930126247385,角逐@4.852930126247384,战役@4.852930126247383,双方@4.737452908827446,竞争@4.737452908827446,竞赛@4.3934985076100865,火灾@4.3934985076100865,考试@4.152490408106292,戏@4.136723092247975,讲座@4.115964532081177
位 师傅@5.212472512934525,老大娘@5.212472512934525,友人@5.212472512934525,嘉宾@5.212472512934525,寿爷@5.212472512934525,兄长@5.212472512934525,老师傅@5.212472512934525,同窗@5.212472512934525,咨询员@5.212472512934525,朋友家@5.212472512934525,球星@5.212472512934525,前辈@5.212472512934525,老伯@5.212472512934525,老兄@5.212472512934525,外宾@5.212472512934525,老大妈@5.212472512934525,大妈@5.212472512934525,车主@5.212472512934525,哲人@5.212472512934525,保姆@5.212472512934525,老头儿@5.212472512934525,中队长@5.212472512934525,农妇@5.212472512934525,君主@5.212472512934525,评论家@5.212472512934525,志成@5.212472512934525,长辈@5.212472512934525,股民@5.212472512934525,企业管理者@5.212472512934525,包工头@5.212472512934525,写信者@5.212472512934525,国防部长@5.212472512934525,老妇@5.212472512934525,舰长@5.212472512934525,老友@5.212472512934525,好心人@5.212472512934525,服务生@5.212472512934525,建筑师@5.212472512934525,营销员@5.212472512934525,大哥@5.212472512934525,堂叔@5.212472512934525,发言者@5.212472512934525,国务卿@5.212472512934525,歌星@5.212472512934525,老年人@5.212472512934525,双目@5.212472512934525,名士@5.212472512934525,大臣@5.212472512934525,大作家@5.212472512934525,史学家@5.212472512934525,师长@5.212472512934525,农家女@5.212472512934525,天文学家@5.212472512934525,裁判@5.212472512934525,来访者@5.212472512934525,女友@5.212472512934525,中文系@5.212472512934525,面庞@5.212472512934525,爸爸@5.212472512934525,骑士@5.212472512934525,小贩@5.212472512934525,老前辈@5.212472512934525,买主@5.212472512934525,介绍人@5.212472512934525,投资人@5.212472512934525,乡长@5.212472512934525,书商@5.212472512934525,食客@5.212472512934525,首长@5.212472512934525,好心@5.212472512934525,老同志@5.212472512934525,荷兰@5.212472512934525,一体@5.212472512934523,老者@5.212472512934523,摄影师@5.212472512934523,男士@5.212472512934523,老大爷@5.212472512934523,长者@5.212472512934523,大嫂@5.212472512934523,售货员@5.212472512934523,主人公@5.212472512934523,名人@5.212472512934523,摄影家@5.212472512934523,好友@5.212472512934523,发言人@5.158024728912149,小姐@5.086941630850667,女士@5.086941630850666,编辑@5.074968989184587,同行@5.042547511492211,身材@5.042547511492211,同事@5.0060216354671,女作家@4.990080091598077,大使@4.964544999490939,朋友@4.953385291617393,大师@4.949438107100731,老师@4.94943810710073,老朋友@4.949438107100729,老工人@4.949438107100729,业主@4.949438107100729,画家@4.93236459374179,老人@4.928679546933933,诗人@4.890544418047163,经济学家@4.890544418047162,歌唱家@4.890544418047162,老农@4.890544418047162,大姐@4.890544418047162,市长@4.890544418047162,负责人@4.875437525656953,顾客@4.849902433549817,歌手@4.849902433549817,老汉@4.849902433549817,家长@4.840503735547565,汉子@4.833960889680792,姓@4.797435013655681,藏书家@4.797435013655681,银发@4.797435013655681,失业者@4.797435013655681,渔民@4.797435013655681,翻译@4.797435013655681,收藏家@4.797435013655681,自述@4.797435013655681,老太太@4.797435013655679,教授@4.757906649469043,先生@4.753040894297226,书法家@4.727045685764281,外长@4.727045685764281,将军@4.709972172405342,中年人@4.697899340104767,同志@4.671011650849956,作者@4.656079164410138,姑娘@4.643106867264387,亲属@4.627510012213371,大人@4.627510012213371,考官@4.627510012213371,业内人士@4.627510012213371,女郎@4.627510012213371,哲学家@4.627510012213371,汉族@4.627510012213371,艺人@4.627510012213371,叔叔@4.627510012213371,佼佼者@4.627510012213371,社会学家@4.627510012213371,孕妇@4.627510012213371,英国人@4.627510012213371,参赛@4.627510012213371,伤者@4.627510012213371,叔公@4.627510012213371,雇主@4.627510012213371,老红军@4.627510012213371,摊主@4.627510012213369,音乐家@4.627510012213369,伟人@4.627510012213368,英雄@4.627510012213367,特使@4.627510012213367,来宾@4.627510012213367,年龄@4.627510012213367,亲人@4.627510012213367,小将@4.627510012213367,邻居@4.627510012213367,推销员@4.627510012213367,母亲@4.595801152486031,消费者@4.575042592319232,数@4.568616323159799,小伙子@4.560395816354831,同胞@4.560395816354831,战友@4.560395816354831,读者@4.524416519249264,客商@4.475506918768319,首脑@4.475506918768319,大夫@4.475506918768319,概括@4.475506918768319,一生@4.475506918768319,院长@4.475506918768319,首相@4.475506918768319,理事@4.475506918768319,学者@4.4557436639468895,人士@4.423976618128237,官员@4.417292304823023,经纪人@4.40511759087692,总统@4.405117590876919,经理@4.3893502750186055,校长@4.3823975143768354,元首@4.364475606379573,老板@4.364475606379573,客人@4.338003395018383,同学@4.279586708793063,记者@4.232101319683456,专家@4.231837837801456,小姑娘@4.212472512934527,所在@4.212472512934527,公主@4.212472512934527,当事人@4.212472512934527,实业家@4.212472512934527,教员@4.212472512934527,民办教师@4.212472512934527,工友@4.212472512934527,男生@4.212472512934527,元老@4.212472512934527,腰@4.212472512934527,文人@4.212472512934527,子弟@4.212472512934527,硕士@4.212472512934527,纳税人@4.212472512934527,教育家@4.212472512934527,同龄人@4.212472512934527,店主@4.212472512934527,司机@4.212472512934525,市民@4.212472512934525,领袖@4.212472512934525,技术员@4.212472512934525,解放军@4.212472512934525,书画家@4.212472512934525,思想家@4.212472512934525,人民警察@4.212472512934525,服务员@4.212472512934523,知名人士@4.212472512934523,律师@4.212472512934523,老干部@4.212472512934523,女孩@4.212472512934523,博士生@4.212472512934523,老乡@4.212472512934523,作家@4.1860003015733325,书记@4.153578823880956,委员@4.146884171306948,领导@4.140322727178689,总经理@4.125009671684184,教练@4.125009671684184,科学家@4.116257197675222,女性@4.112936839383611,工程师@4.0969952955145885,伙伴@4.074968989184587,听众@4.042547511492211,标兵@4.042547511492211,女工@4.010838651764874
台 洗衣机@8.219108995563786,脱粒机@8.219108995563786,变压器@8.219108995563786,割晒机@8.219108995563786,终端机@8.219108995563786,机床@8.219108995563786,缝纫机@8.219108995563786,交响音乐会@8.219108995563786,摄影机@8.219108995563786,样机@8.219108995563786,复印机@8.219108995563786,股票机@8.219108995563786,CT机@8.219108995563786,农机@8.219108995563786,分体@8.219108995563786,粉碎机@8.219108995563786,冰箱@8.08160547181385,电视机@8.02646391762139,电脑@7.992338133716764,收割机@7.971181482120201,机械@7.897180900676425,仪器@7.8040714962849425,VCD机@7.8040714962849425,风机@7.8040714962849425,摄像机@7.8040714962849425,推土机@7.8040714962849425,计算机@7.688594278865006,机器@7.63414649484263,微机@7.63414649484263,舞剧@7.567032298984092,晚会@7.512840198620497,机车@7.482143401397581,空调@7.482143401397581,钢琴@7.411754073506183,好戏@7.411754073506183,拖拉机@7.411754073506183,彩电@7.356612519313723,锅炉@7.219108995563788,发动机@7.219108995563786,核电机组@7.219108995563786,电视@6.996716574227339,订单@6.897180900676426,发电机组@6.759677376926488,剧目@6.63414649484263,机组@6.364959862027241,小车@6.344639877647644,车辆@6.10363177814385,样品@6.049183994121476,话剧@6.049183994121474,戏@5.783009880757113,车@5.646219327143205,设备@5.63414649484263,产量@5.411754073506184,节目@5.312218399955269,音乐会@5.131646154313447,演出@4.897180900676423,价值@4.8498751858980675,仪式@4.575252805789063

3. For a higher-quality lexicon, the results can be filtered manually once more.
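
To support that manual pass, the result file can be loaded back into a dictionary for inspection. The sketch below assumes the format shown in the results above (a quantifier, whitespace, then comma-separated noun@score entries); `parse_lexicon_line` and `load_lexicon` are hypothetical helper names, and the stricter threshold of 5.0 is only an example, not a recommended value.

```python
def parse_lexicon_line(line, min_mi=4.0):
    """Parse 'quantifier noun@mi,noun@mi,...' into (quantifier, {noun: mi})."""
    quantifier, entries = line.strip().split(None, 1)
    nouns = {}
    for entry in entries.split(','):
        noun, _, score = entry.rpartition('@')
        if noun and float(score) >= min_mi:
            nouns[noun] = float(score)
    return quantifier, nouns


def load_lexicon(path, min_mi=4.0):
    """Load a whole result file into {quantifier: {noun: mi}}."""
    lexicon = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                quantifier, nouns = parse_lexicon_line(line, min_mi)
                lexicon[quantifier] = nouns
    return lexicon


# Three entries copied from the 件 line of the results.
sample = '件\t衬衫@7.144221783932796,大事@7.076658500120161,作品@4.545184097999916'
quantifier, nouns = parse_lexicon_line(sample, min_mi=5.0)
print(quantifier, sorted(nouns))  # 件 ['大事', '衬衫']
```

A reviewer could then walk through each quantifier's nouns, delete the doubtful ones, and write the dictionary back out.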
