Writing word2vec by Hand (2): Counting Word Frequencies

multiangle

Article link: https://blog.csdn.net/u014595019/article/details/51907294

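All posts in this series
Writing word2vec by Hand (1): Main Concepts and Workflow
Writing word2vec by Hand (2): Counting Word Frequencies
Writing word2vec by Hand (3): Building the Huffman Tree
Writing word2vec by Hand (4): The CBOW and skip-gram Models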

In the overall word2vec workflow I outlined earlier, the first step, word segmentation, is done with jieba, and the results seem quite good.

Step 2. Counting word frequencies

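Counting word frequencies is comparatively simple: it is mostly a small extension of Python's built-in Counter class. The one thing to watch is that stop words must be removed. Stop words are tokens that occur so often, such as commas and periods, that they carry no discriminating power. Stop-word lists are easy to find online; I converted one to a binary (pickle) format and saved it beforehand. The code for this part lives in WordCount.py.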
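The stop-word storage described above can be sketched as follows, assuming the list is serialized with Python's pickle module. The file name, the three-entry demo list, and the helper names `save_stop_words`, `load_stop_words`, and `remove_stop_words` are all hypothetical; the author's own list and loader are not shown in this post.

```python
import os
import pickle
import tempfile

def save_stop_words(words, path):
    # Serialize the stop-word collection in binary (pickle) form.
    with open(path, "wb") as f:
        pickle.dump(list(words), f)

def load_stop_words(path):
    # Load the list back and convert it to a set for fast membership tests.
    with open(path, "rb") as f:
        return set(pickle.load(f))

def remove_stop_words(tokens, stop_words):
    # Keep only tokens that are not stop words, preserving their order.
    return [tok for tok in tokens if tok not in stop_words]

# Hypothetical demo data; a real stop-word list has far more entries.
demo_path = os.path.join(tempfile.mkdtemp(), "stop_words.pkl")
save_stop_words(["，", "。", "的"], demo_path)
stop_words = load_stop_words(demo_path)
tokens = ["我", "的", "书", "，", "很", "好", "。"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

Holding the stop words in a set rather than a list keeps each lookup cheap when the token stream is long.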

2.1 MulCounter

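MulCounter counts word frequencies from an array of words.
It is a subclass of Counter. Counter is not used directly because, although it can count frequencies, it offers no filtering. MulCounter adds two methods, larger_than and less_than, which filter out words that occur too rarely or too often, respectively.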

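from collections import Counter
from operator import itemgetter as _itemgetter

class MulCounter(Counter):
    # A subclass of collections.Counter that adds two filtering methods,
    # larger_than and less_than.
    def __init__(self, element_list=None):
        # Default to None so that MulCounter() with no arguments yields an
        # empty counter, just as Counter() does.
        super().__init__(element_list)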

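    def larger_than(self, minvalue, ret='list'):
        # Return the items whose count is >= minvalue (inclusive), most
        # frequent first, as a list of (element, count) pairs or as a dict.
        temp = sorted(self.items(), key=_itemgetter(1), reverse=True)
        if not temp or temp[0][1] < minvalue:
            # Empty counter, or even the largest count is below the threshold.
            return {} if ret == 'dict' else []
        # Binary search for the boundary: temp[low] always meets the
        # threshold, temp[high] (if it exists) never does.
        low = 0
        high = len(temp)
        while high - low > 1:
            mid = (low + high) >> 1
            if temp[mid][1] >= minvalue:
                low = mid
            else:
                high = mid
        if ret == 'dict':
            return dict(temp[:high])
        return temp[:high]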

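    def less_than(self, maxvalue, ret='list'):
        # Return the items whose count is <= maxvalue (inclusive), least
        # frequent first, as a list of (element, count) pairs or as a dict.
        temp = sorted(self.items(), key=_itemgetter(1))
        if not temp or temp[0][1] > maxvalue:
            # Empty counter, or even the smallest count is above the threshold.
            return {} if ret == 'dict' else []
        # Binary search for the boundary: temp[low] always meets the
        # threshold, temp[high] (if it exists) never does.
        low = 0
        high = len(temp)
        while high - low > 1:
            mid = (low + high) >> 1
            if temp[mid][1] <= maxvalue:
                low = mid
            else:
                high = mid
        if ret == 'dict':
            return dict(temp[:high])
        return temp[:high]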
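For comparison, the same inclusive thresholds can be applied to a plain collections.Counter with a linear scan. This is only a sketch and not the author's method; `at_least` and `at_most` are hypothetical helper names chosen for this example.

```python
from collections import Counter

def at_least(counter, min_count):
    # Items whose count is >= min_count, most frequent first.
    return [(w, c) for w, c in counter.most_common() if c >= min_count]

def at_most(counter, max_count):
    # Items whose count is <= max_count, least frequent first.
    kept = [(w, c) for w, c in counter.items() if c <= max_count]
    return sorted(kept, key=lambda item: item[1])

# Counts: a=4, b=3, c=2, d=1
counts = Counter("a a a a b b b c c d".split())
frequent = at_least(counts, 3)
rare = at_most(counts, 2)
print(frequent)
print(rare)
```

Because larger_than and less_than sort the items on every call, the sort dominates their cost; the binary search mainly keeps the cut-off lookup itself cheap once the list is ordered.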

2.2 WordCounter

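WordCounter counts word frequencies from raw text. More precisely, it segments the full text into words, filters out stop words, and hands the preprocessed word array to MulCounter for counting. (In the code, the stop words are actually dropped from the counts after counting, which yields the same result.)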

import jieba
# FI is the author's own file-helper module; its import is not shown in this post.

class WordCounter():
    # Computes the word frequencies of a list of texts.
    #
    # Example:
    # >>> data = ['Merge multiple sorted inputs into a single sorted output',
    # ...         'The API below differs from textbook heap algorithms in two aspects']
    # >>> wc = WordCounter(data)
    # >>> print(wc.count_res)
    # MulCounter({' ': 18, 'sorted': 2, 'single': 1, 'below': 1, 'inputs': 1, 'The': 1, 'into': 1, 'textbook': 1,
    #             'API': 1, 'algorithms': 1, 'in': 1, 'output': 1, 'heap': 1, 'differs': 1, 'two': 1, 'from': 1,
    #             'aspects': 1, 'multiple': 1, 'a': 1, 'Merge': 1})

    def __init__(self, text_list):
        self.text_list = text_list
        self.stop_word = self.Get_Stop_Words()
        self.count_res = None

        self.Word_Count(self.text_list)

    def Get_Stop_Words(self):
        # Load the pickled stop-word list through the author's FI helper.
        return FI.load_pickle('./static/stop_words.pkl')

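    def Word_Count(self, text_list, cut_all=False):
        # Segment every line with jieba, count all tokens, then drop stop words.
        # Note: text_list is modified in place; each line is replaced by its tokens.
        filtered_word_list = []
        for idx, line in enumerate(text_list):
            res = list(jieba.cut(line, cut_all=cut_all))
            text_list[idx] = res
            filtered_word_list += res

        self.count_res = MulCounter(filtered_word_list)
        for word in self.stop_word:
            # pop with a default avoids a KeyError for stop words that never occurred.
            self.count_res.pop(word, None)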
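The segment, filter, count pipeline that WordCounter implements can also be sketched without any third-party dependency. In this illustration the function name `count_words`, the `tokenize` parameter, and the sample data are assumptions; whitespace splitting stands in for jieba, and stop words are filtered before counting rather than popped afterwards.

```python
from collections import Counter

def count_words(text_list, stop_words, tokenize=str.split):
    # tokenize is a stand-in for a real segmenter; for Chinese text one
    # could pass jieba.lcut instead (requires the third-party jieba package).
    counts = Counter()
    for line in text_list:
        # Skip stop words while counting, so they never enter the result.
        counts.update(tok for tok in tokenize(line) if tok not in stop_words)
    return counts

texts = ["the cat sat on the mat", "the dog sat"]
result = count_words(texts, stop_words={"the", "on"})
print(result["sat"])
```

Unlike Word_Count, this version leaves the input list untouched, which can matter if the raw lines are needed later.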