Discovering New Words | Building a Vocabulary with Unsupervised NLP (Part 1)

I. Data Description and Preprocessing

  This experiment uses product names from the e-commerce domain as the corpus, with the goal of finding out-of-vocabulary (unregistered) words.
  First, extract the goods_name column from the JSON data and write it to a txt file.
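
  The raw data is in JSON Lines format, i.e. one JSON object per line. A hypothetical input line (the field values below are made up for illustration; the real data contains additional columns) might look like:

{"goods_name": "小米10 5G手机 钛银黑 8GB+128GB", "search_value": "小米 手机"}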

import pandas as pd

"""
    Deduplicate the goods_name and search_value columns of the data and write each to its own txt file
"""


class DataConvert(object):
    def __init__(self, file_input_name, file_corpus, file_searchValue):
        self.file_input_name = file_input_name
        self.file_corpus = file_corpus
        self.file_searchValue = file_searchValue

    def run(self):
        # Read the raw data
        # lines=True: each line of the file is a complete JSON object (instead of one big list of objects)
        input_file = pd.read_json(self.file_input_name, lines=True)
        # Select the two columns we need and deduplicate them
        goods_names = input_file.loc[:, 'goods_name'].dropna().drop_duplicates().tolist()
        search_values = input_file.loc[:, 'search_value'].dropna().drop_duplicates().tolist()
        # Write the two columns to their respective output files
        with open(self.file_corpus, "w", encoding="utf-8") as f1:
            for goods_name in goods_names:
                try:
                    f1.write(goods_name)
                    f1.write("\n")
                except Exception:
                    print(goods_name)

        with open(self.file_searchValue, "w", encoding="utf-8") as f2:
            for search_value in search_values:
                f2.write(search_value)
                f2.write("\n")

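A minimal usage sketch (the file names below are placeholders, not taken from the original project):

if __name__ == '__main__':
    # Hypothetical file names; adjust to the actual data layout.
    dc = DataConvert('goods.json', 'file_corpus.txt', 'file_searchValue.txt')
    dc.run()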

The resulting file_corpus contains one product name per line (screenshot omitted).

II. Finding Out-of-Vocabulary Words

1. Counting word statistics in the corpus

  Count the frequency of single characters and two-character strings in the corpus, together with the characters immediately preceding and following each occurrence.

#!usr/bin/env python
# -*- coding:utf-8 -*-
"""
    Count the frequency of single characters and two-character strings in the corpus, together with the characters immediately preceding and following each occurrence;
"""
import re
import codecs
import json
import os

# \u4E00-\u9FD5 matches all Chinese characters
# a-zA-Z0-9 matches the 26 English letters and the digits
# -+#&\._/ \(\) \~\' are common symbols
# \u03bc \u3001 \uff08 \uff09 \u2019 are special characters (μ, the ideographic comma, full-width parentheses, right single quote)
re_han = re.compile("([\u4E00-\u9FD5a-zA-Z0-9-+#&\._/\u03bc\u3001\(\)\uff08\uff09\~\'\u2019]+)", re.U)


class Finding(object):

    def __init__(self, file_corpus, file_count, count):
        self.file_corpus = file_corpus
        self.file_count = file_count
        self.count = count

    def split_text(self, sentence):
        """
        Find all substrings of a product name that match the regular expression and return them as a list.
        :param sentence:
        :return:
        """
        # re.findall(): return a list of all substrings matched by the regular expression
        seglist = re_han.findall(sentence)
        return seglist

    def count_word(self, seglist, k):
        """
        Slide a window of size k over each substring and yield the window word together with the characters immediately before ('S' at the start) and after ('E' at the end) it.
        :param seglist: list of substrings of a product name
        :param k: window size
        :return:
        """
        for words in seglist:
            ln = len(words)
            i = 0
            j = 0
            if words:
                while 1:
                    j = i + k
                    if j <= ln:
                        word = words[i:j]
                        if i == 0:
                            lword = 'S'
                        else:
                            lword = words[i - 1:i]
                        if j == ln:
                            rword = 'E'
                        else:
                            rword = words[j:j + 1]
                        i += 1
                        yield word, lword, rword
                    else:
                        break

    def find_word(self):
        """

        :return:
        """
        # Read the corpus
        input_data = codecs.open(self.file_corpus, 'r', encoding='utf-8')
        dataset = {}
        # enumerate yields (index, line) pairs, starting from 1
        for lineno, line in enumerate(input_data, 1):
            try:
                line = line.strip()
                # Find all substrings of the product name that match the regular expression
                seglist = self.split_text(line)
                # count_word: slide a window over each substring and yield the window word plus its left and right neighbouring characters
                # dataset records, for each word: [count, {left char: count}, {right char: count}]
                for w, lw, rw in self.count_word(seglist, self.count):
                    if w not in dataset:
                        dataset[w] = [1, {}, {}]
                    else:
                        dataset[w][0] += 1
                    if lw:
                        dataset[w][1][lw] = dataset[w][1].get(lw, 0) + 1
                    if rw:
                        dataset[w][2][rw] = dataset[w][2].get(rw, 0) + 1

            except Exception:
                pass
        self.write_data(dataset)

    def write_data(self, dataset):
        """
        Write the statistics to the output file, one word per line: word<TAB>[count, {left}, {right}] as JSON.
        :param dataset:
        :return:
        """
        output_data = codecs.open(self.file_count, 'w', encoding='utf-8')
        for word in dataset:
            output_data.write(word + '\t' + json.dumps(dataset[word], ensure_ascii=False, sort_keys=False) + '\n')
        output_data.close()
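
For example (file names are placeholders), running the counter with window sizes 1 and 2 produces the single-character and two-character statistics; each output line has the form word<TAB>[count, {left char: count}, {right char: count}]:

Finding('file_corpus.txt', 'count_one.txt', 1).find_word()
Finding('file_corpus.txt', 'count_two.txt', 2).find_word()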

This step produces two files, count_one.txt and count_two.txt (screenshots omitted).

2. Building the initial vocabulary with pointwise mutual information

  From the single-character and two-character statistics, compute the mutual information of each two-character string and add those whose score exceeds a threshold K to the vocabulary; this becomes the initial vocabulary. In computing, the more commonly used quantity is pointwise mutual information (PMI), which measures the association between two specific events. PMI is defined as:

PMI(w1, w2) = log( P(w1 w2) / ( P(w1) · P(w2) ) )

In this article the logarithm is taken to base e.
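
As a hypothetical worked example (the counts are made up for illustration): suppose the corpus contains N = 1,000,000 characters, the pair 手机 occurs 2,000 times, 手 occurs 5,000 times and 机 occurs 8,000 times. Then P(手机) = 0.002, P(手) = 0.005, P(机) = 0.008, and PMI = ln(0.002 / (0.005 × 0.008)) = ln(50) ≈ 3.9. The more strongly two characters tend to form a word, the larger the PMI; for independent characters the PMI is close to 0.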

#!usr/bin/env python
# -*- coding:utf-8 -*-
"""
    From the single-character and two-character statistics, select words whose mutual information exceeds the threshold K and add them to the vocabulary as the initial vocabulary;
"""
import codecs
import json
import math


def load_data(file_count_one):
    """
    Load the single-character count file ({word: [count, {left}, {right}]}) and return the total character count N and a dict {word: count}.
    :param file_count_one: file name
    :return:
    """
    count_one_data = codecs.open(file_count_one, 'r', encoding='utf-8')
    count_one_param = {}
    N = 0
    # Iterate over each line, accumulating the total count N and the per-character counts
    for line in count_one_data.readlines():
        line = line.strip()
        line = line.split('\t')
        try:
            word = line[0]
            value = json.loads(line[1])
            N += value[0]
            count_one_param[word] = int(value[0])
        except:
            pass
    count_one_data.close()

    return N, count_one_param


def select(file_count_one, file_count_two, file_dict, K=10.8):
    """
    For each line, compute the PMI of the two-character string; keep those whose PMI exceeds the threshold K and append them to file_dict.
    :param file_count_one: count file for window size 1
    :param file_count_two: count file for window size 2
    :param file_dict: vocabulary file
    :param K: PMI threshold for accepting a word
    :return:
    """
    count_two_data = codecs.open(file_count_two, 'r', encoding='utf-8')
    # total character count and single-character counts
    N, count_one_param = load_data(file_count_one)
    count_two_param = {}

    # For each line, compute the PMI of the two-character string
    for line in count_two_data.readlines():
        line = line.strip()
        line = line.split('\t')
        try:
            word = line[0]
            value = json.loads(line[1])
            # frequency of the two-character string / total character count
            P_w = 1.0 * value[0] / N
            # frequency of the first character / total character count
            P_w1 = 1.0 * count_one_param.get(word[0], 1) / N
            # frequency of the second character / total character count
            P_w2 = 1.0 * count_one_param.get(word[1], 1) / N
            # PMI: the more strongly the two characters tend to form a word, the larger the value; for independent characters PMI = 0
            mi = math.log(P_w / (P_w1 * P_w2))
            count_two_param[word] = mi
        except:
            pass
    select_two_param = []
    for w in count_two_param:
        mi = count_two_param[w]
        if mi > K:
            select_two_param.append(w)

    with codecs.open(file_dict, 'a', encoding='utf-8') as f:
        for w in select_two_param:
            f.write(w + '\t' + 'org' + '\n')

    count_two_data.close()
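
For example, using the count files from step 1 (file names are placeholders), the call below appends every two-character string whose PMI exceeds K to dict.txt, tagged 'org':

select('count_one.txt', 'count_two.txt', 'dict.txt', K=10.8)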

This step produces the initial vocabulary file dict.txt; words discovered later are appended to this file, forming the updated vocabulary (screenshot omitted).
In my case I obtained tens of thousands of manually curated seed words from the business side. Because the vocabulary already starts out large, the mutual-information threshold K can be set relatively high. Without such a seed list, K largely determines the final result, and whether the corpus consists of long or short texts also affects the right choice of K. A reasonable starting point is to run the program with K = 8.

3. Segmenting the corpus

  With the initial vocabulary in hand, segment the corpus with forward maximum matching, sort the remaining fragments by frequency, write them out, and record their number seg_num.

"""
    With the initial vocabulary, segment the corpus with forward maximum matching, sort the remaining fragments by frequency, write them out, and record their number seg_num
"""
from __future__ import unicode_literals
import codecs
import re

# matches Chinese characters plus the allowed symbols
re_han_cut = re.compile("([\u4E00-\u9FD5a-zA-Z0-9-+#&\._/\u03bc\u3001\(\)\uff08\uff09\~\'\u2019]+)", re.U)
# matches Chinese characters only
re_han = re.compile("([\u4E00-\u9FD5]+)", re.U)


class Cuting(object):
    def __init__(self, file_corpus, file_dict, file_segment):
        self.file_corpus = file_corpus
        self.file_dict = file_dict
        self.file_segment = file_segment
        self.wdict = {}
        self.get_dict()

    def get_dict(self):
        """
        Iterate over the dictionary file and build wdict: {first character: [dictionary words starting with that character]}
        :return:
        """
        f = codecs.open(self.file_dict, 'r', encoding='utf-8')
        # iterate over each line of the dictionary file
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            line = line.split('\t')
            w = line[0]
            if w:
                if w[0] in self.wdict:
                    value = self.wdict[w[0]]
                    value.append(w)
                    self.wdict[w[0]] = value
                else:
                    self.wdict[w[0]] = [w]

    def fmm(self, sentence):
        """
        Forward maximum matching: collect the dictionary words that occur in the sentence into result.
        :param sentence: a substring of a product name
        :return:
        """
        N = len(sentence)
        k = 0
        result = []
        while k < N:
            w = sentence[k]
            maxlen = 1
            # if the character is the first character of some dictionary word
            if w in self.wdict:
                # wdict value: the list of dictionary words starting with this character (found in the previous step)
                words = self.wdict[w]
                t = ''
                for item in words:
                    itemlen = len(item)
                    if sentence[k:k + itemlen] == item and itemlen >= maxlen:
                        t = item
                        maxlen = itemlen
                if t and t not in result:
                    result.append(t)
            k = k + maxlen
        return result

    def judge(self, words):
        """
        Check whether the string consists only of Chinese characters.
        :param words:
        :return:
        """
        flag = False
        n = len(''.join(re_han.findall(words)))
        if n == len(words):
            flag = True
        return flag

    def cut(self, sentence):
        """

        :param sentence:
        :return:
        """
        buf = []
        # find all substrings of the product name that match the regular expression
        blocks = re_han_cut.findall(sentence)
        # iterate over each substring
        for blk in blocks:
            if blk:
                # dictionary words found in the substring by forward maximum matching
                fm = self.fmm(blk)
                if fm:
                    try:
                        # if any dictionary words were found, join them (regex-escaped) with '|' to build a regular expression
                        re_split = re.compile('|'.join(re.escape(w) for w in fm))
                        # split() cuts the substring at every matched dictionary word and returns the remaining fragments
                        for s in re_split.split(blk):
                            # keep only fragments that consist solely of Chinese characters; collect them into buf
                            if s and self.judge(s):
                                buf.append(s)
                    except:
                        pass

        return buf

    def find(self):
        """

        :return:
        """
        input_data = codecs.open(self.file_corpus, 'r', encoding='utf-8')
        output_data = codecs.open(self.file_segment, 'w', encoding='utf-8')
        dataset = {}
        # iterate over each line of the corpus
        for lineno, line in enumerate(input_data, 1):
            line = line.strip()
            # iterate over the fragments produced by cut()
            for w in self.cut(line):
                if len(w) >= 2:
                    dataset[w] = dataset.get(w, 0) + 1
        # sort fragments by frequency
        data_two = sorted(dataset.items(), key=lambda d: d[1], reverse=True)
        seg_num = len(data_two)
        for key in data_two:
            output_data.write(key[0] + '\t' + str(key[1]) + '\n')

        print('Having segment %d words' % seg_num)
        input_data.close()
        output_data.close()

        return seg_num
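
For example (file names are placeholders):

cutter = Cuting('file_corpus.txt', 'dict.txt', 'file_segment.txt')
seg_num = cutter.find()  # number of distinct fragments written to file_segment.txt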

This step produces the fragment file file_segment.txt (screenshot omitted).
The fragment file changes as the vocabulary is updated in later iterations.

4. Validating new words with a search engine

  Sort the fragments produced by segmentation by frequency and query the top H = 2000 on a search engine (Baidu). If a fragment has a Baidu Baike (百度百科) entry, add it to the vocabulary as a word; if it instead appears more than 60 times in the text of the result page, also add it.

#!usr/bin/env python
# -*- coding:utf-8 -*-
"""
    Sort the fragments produced by segmentation by frequency and query the top H = 2000 on a search engine (Baidu);
    if a fragment has a Baidu Baike entry, add it to the vocabulary as a word;
    if it appears more than 60 times in the text of the result page, also add it;
"""
import requests
from lxml import etree
import codecs
import re


def search(file_segment, file_dict, H, R, iternum):
    # headers copied from the browser's developer tools
    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Encoding': 'gzip, deflate, sdch, b',
               'Accept-Language': 'zh-CN,zh;q=0.8',
               'Cache-Control': 'max-age=0',
               'Connection': 'keep-alive',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.'
               }
    # load the fragments produced by segmentation
    input_data = codecs.open(file_segment, 'r', encoding='utf-8')
    read_data = input_data.readlines()
    N = len(read_data)
    if H > N:
        H = N
    output_data = codecs.open(file_dict, 'a', encoding='utf-8')
    n = 0
    m = 1
    # iterate over the top-H fragments
    for line in read_data[:H]:
        line = line.rstrip()
        line = line.split('\t')
        # the fragment string
        word = line[0]
        try:
            # query Baidu search for the fragment
            urlbase = 'https://www.baidu.com/s?wd=' + word
            dom = requests.get(urlbase, headers=headers)
            ct = dom.text
            # number of times the fragment appears in the text of the result page
            num = ct.count(word)
            html = dom.content
            selector = etree.HTML(html)
            flag = False
            # if the fragment has a Baidu Baike entry, add it to the vocabulary
            if selector.xpath('//h3[@class="t c-gap-bottom-small"]'):
                ct = ''.join(selector.xpath('//h3[@class="t c-gap-bottom-small"]//text()'))
                lable = re.findall(u'(.*)_百度百科', ct)
                for w in lable:
                    w = w.strip()
                    if w == word:
                        flag = True
            if flag:
                output_data.write(word + '\titer_' + str(iternum) + '\n')
                n += 1
            # if the fragment appears at least R (= 60) times in the result page, also add it to the vocabulary
            else:
                if num >= R:
                    output_data.write(word + '\titer_' + str(iternum) + '\n')
                    n += 1
            m += 1
            if m % 100 == 0:
                print('having crawl %dth word\n' % m)
        except:
            pass
    print('Having add %d words to file_dict at iter_%d' % (n, iternum))
    input_data.close()
    output_data.close()
    return n
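
For example, checking the top 2000 fragments with an occurrence threshold of 60 in the first iteration (file names are placeholders; note that heavy automated querying of Baidu may be blocked or rate-limited):

search_num = search('file_segment.txt', 'dict.txt', H=2000, R=60, iternum=1)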

This step appends the newly discovered words to dict.txt.

5. Iteratively discovering new words

  After updating the vocabulary, repeat steps 3 and 4. The iteration ends when search_num = 0; when seg_num falls below the threshold Y = 3000, run step 4 one last time with H = seg_num and then stop. The final vocabulary file is the output of the program.

#!usr/bin/env python
# -*- coding:utf-8 -*-
"""
    Algorithm steps:
        1. Count the frequency of single characters and two-character strings in the corpus, plus their neighbouring characters;
        2. From these statistics, use mutual information to select words above the threshold K as the initial vocabulary;
        3. With the initial vocabulary, segment the corpus with forward maximum matching, sort the remaining fragments by frequency,
           write them out, and record their number seg_num;
        4. Query the top H = 5000 fragments on a search engine (Baidu);
           if a fragment has a Baidu Baike entry, add it to the vocabulary as a word;
           if it appears more than 60 times in the text of the result page, also add it;
        5. After updating the vocabulary, repeat steps 3 and 4; the iteration ends when search_num = 0;
           when seg_num falls below the threshold Y = 1000, run step 4 one last time with H = seg_num and then stop;
           the final vocabulary is the output of the program
"""
from __future__ import absolute_import

__version__ = '1.0'
__license__ = 'MIT'

import os
import logging
import time
import codecs
import sys

from module.corpus_count import *
from module.corpus_segment import *
from module.select_model import *
from module.words_search import *

# current working directory
medfw_path = os.getcwd()
file_corpus = medfw_path + '/data_org/file_corpus.txt'
file_dict = medfw_path + '/data_org/dict.txt'
file_count_one = medfw_path + '/data_org/count_one.txt'
file_count_two = medfw_path + '/data_org/count_two.txt'
file_segment = medfw_path + '/data_org/file_segment.txt'

# logging setup
log_console = logging.StreamHandler(sys.stderr)
default_logger = logging.getLogger(__name__)
default_logger.setLevel(logging.DEBUG)
default_logger.addHandler(log_console)


def setLogLevel(log_level):
    global logger
    default_logger.setLevel(log_level)


class MedFW(object):
    def __init__(self, K=10, H=2000, R=60, Y=5000):
        self.K = K  # mutual-information threshold
        self.H = H  # number of top fragments from file_segment.txt to check
        self.R = R  # occurrence threshold for a fragment on the search result page
        self.Y = Y  # stopping-condition parameter for the iteration
        self.seg_num = 0  # number of fragments in the fragment corpus
        self.search_num = 0  # number of words added to file_dict by the search-engine step

    # step1: count the frequency of single characters and two-character strings, plus their neighbouring characters
    def medfw_s1(self):
        for i in range(1, 3):
            if i == 1:
                file_count = file_count_one
            else:
                file_count = file_count_two
            default_logger.debug("Counting courpus to get %s...\n" % (file_count))
            t1 = time.time()
            cc = Finding(file_corpus, file_count, i)
            cc.find_word()
            default_logger.debug("Getting %s cost %.3f seconds...\n" % (file_count, time.time() - t1))

    # step2: from the single/two-character statistics, use mutual information to select words above the threshold K as the initial vocabulary
    def medfw_s2(self):
        default_logger.debug("Select stable words and  generate initial vocabulary... \n")
        select(file_count_one, file_count_two, file_dict, self.K)

    # step3: with the current vocabulary, segment the corpus with forward maximum matching and record seg_num
    def medfw_s3(self):
        t1 = time.time()
        sc = Cuting(file_corpus, file_dict, file_segment)
        self.seg_num = sc.find()
        default_logger.debug("Segment corpuscost %.3f seconds...\n" % (time.time() - t1))

    # step4: check the fragment words against the search engine
    def medfw_s4(self, H, R, iternum):
        t1 = time.time()
        self.search_num = search(file_segment, file_dict, H, R, iternum)
        default_logger.debug("Select words cost %.3f seconds...\n" % (time.time() - t1))

    # main routine
    def medfw(self):
        # default_logger.debug("Starting to find words and do step1...\n" )
        print('-----------------------------------')
        print('step1:count corpus')
        self.medfw_s1()

        print('-----------------------------------')
        print('step2:select stable words and generate initial vocabulary')
        self.medfw_s2()

        print('-----------------------------------')
        print('step3:use initial vocabulary to segment corpus')
        self.medfw_s3()

        print('-----------------------------------')
        print('step4:use search engine to select words of segment corpus')
        self.medfw_s4(H=self.H, R=self.R, iternum=0)

        print('-----------------------------------')
        print('step5:cycling iteration')
        iter_num = 1
        while True:
            if self.search_num:
                default_logger.debug("Itering %d...\n" % (iter_num))
                t1 = time.time()
                self.medfw_s3()
                print("---------------------- seg_num:%s -----------------------" % self.seg_num)
                if self.seg_num <= self.Y:
                    self.H = self.seg_num
                    self.medfw_s4(H=self.H, R=self.R, iternum=iter_num)
                    default_logger.debug("Ending the iteration ...\n")
                    break
                else:
                    self.medfw_s4(H=self.H, R=self.R, iternum=iter_num)
                    iter_num += 1
                default_logger.debug("Itering %d cost %.3f seconds...\n " % ((iter_num - 1), time.time() - t1))
            else:
                break
        with codecs.open(file_dict, 'r', encoding='utf-8') as f:
            total_num = len(f.readlines())

        print('Successfully found %d words from the corpus' % total_num)


if __name__ == '__main__':
    md = MedFW(K=10, H=3000, R=50, Y=3000)
    md.medfw()

This step runs multiple iterations to discover new words (screenshot omitted).

6. Summary

  In practice, the words obtained this way are useful to a degree, but the results still need manual review. The drawbacks are also obvious: for a long brand or product name, if several of its consecutive substrings are themselves frequent and form valid words, the brand or product name gets split apart. This still needs further work.

