Baidu Deep Learning Bootcamp, Phase 2: Assignment 1

seek_dreamer

Category: NLP. Tags: deep learning

Article link: https://blog.csdn.net/muruibin88/article/details/104687514

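Assignment 1-1

(1) Download and install PaddlePaddle (飞桨) locally, and send a screenshot to the class supervisor.

(2) Learn to use the LAC model under PaddleNLP, or Jieba, for word segmentation. LAC model: https://github.com/PaddlePaddle/models/tree/release/1.6/PaddleNLP/lexical_analysis Jieba: https://github.com/fxsjy/jieba

(3) Segment the People's Daily (人民日报) corpus into words, then compute the information entropy from the probability of each word. Corpus: https://github.com/fangj/rmrb/tree/master/example/1946%E5%B9%B405%E6%9C%88

Assignment 1-2

(1) Think about it: given a vocabulary of N words and an input sentence of length M, what is the computational complexity of forward maximum matching?

(2) Given a sentence, how would you count the number of candidate segmentations? Can you provide a code implementation?

(3) Besides forward maximum matching and the N-gram algorithm, what other word segmentation algorithms do you know? Give a short description.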
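
For readers unfamiliar with the term in question (1), here is a minimal sketch of forward maximum matching. It is an illustration rather than a reference implementation: the function name `forward_max_match`, the `max_len` parameter, and the toy vocabulary are all assumptions made for this example. At each position it tries the longest candidate first and falls back to a single character when nothing in the vocabulary matches.

```python
def forward_max_match(sentence, vocab, max_len=None):
    """Greedy left-to-right segmentation: longest vocabulary match wins."""
    if max_len is None:
        # Longest word in the vocabulary bounds the candidate length.
        max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    i = 0
    while i < len(sentence):
        # Try candidate lengths from longest to shortest.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, for illustration only.
vocab = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", vocab))  # ['研究生', '命', '起源']
```

In this sketch the vocabulary is a Python `set`, so each position performs at most `max_len` membership lookups, and a sentence of length M needs at most M × `max_len` lookups in total.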

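Assignment 1-1:

(1) Screenshot of PaddlePaddle downloaded and installed locally

(2) Learn to use the LAC model under PaddleNLP, or Jieba, for word segmentation

Jieba usage reference

(3) Segment the People's Daily corpus into words, then compute the information entropy from the probability of each word. Corpus: https://github.com/fangj/rmrb/tree/master/example/1946%E5%B9%B405%E6%9C%88

Information entropy formula:

$H = -\sum_{w} p(w) \log_2 p(w)$, where $p(w)$ is the count of word $w$ divided by the total number of tokens in the corpus.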
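
As a quick sanity check on the entropy computation, the following sketch evaluates it on a tiny token list that can be verified by hand. The helper name `entropy_bits` and the four sample tokens are assumptions for illustration. With probabilities 0.5, 0.25, and 0.25, summing -p · log2(p) over the distinct words gives 0.5 + 0.5 + 0.5 = 1.5 bits.

```python
import math
from collections import Counter


def entropy_bits(tokens):
    """Shannon entropy (base 2) of the empirical word distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Sum -p * log2(p) over the distinct words.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sample: one word appears twice, two words appear once.
tokens = ["人民", "人民", "日报", "解放"]
print(entropy_bits(tokens))  # 1.5
```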

Code:

import os

import jieba
import numpy as np

# Punctuation and special symbols to drop after segmentation
PUNCTUATION = set("+\\.!/,$%^*()\"'[]|—?【】“”！，。？、~@#￥…（）")

# Load the corpus, segment it, and return a list of words
def loadCorpus(dirpath):
    words = []
    for file in os.listdir(dirpath):
        # Build the path from dirpath instead of a hard-coded directory
        with open(os.path.join(dirpath, file), "r", encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # Segment first, then drop whitespace and punctuation tokens
                tempwords = [word for word in jieba.cut(line, cut_all=False)
                             if word.strip() and word not in PUNCTUATION]
                words += tempwords
    return words

# Count word frequencies; also return the total number of tokens
def countwords(corpus):
    counts = len(corpus)
    wordcountdict = {}
    for word in corpus:
        if word not in wordcountdict:
            wordcountdict[word] = 0
        wordcountdict[word] += 1
    return wordcountdict, counts

# Compute the information entropy (in bits)
def informationentropy(wordcountdic, counts):
    wordfreq = np.array(list(wordcountdic.values()))
    prob = np.divide(wordfreq, counts)
    entropy = -(np.sum(prob * np.log2(prob)))
    return entropy

if __name__ == '__main__':
    # Directory containing the downloaded corpus files; adjust to your local path
    corpusWords = loadCorpus("corpus")
    wordcountdict, counts = countwords(corpusWords)
    print("First 10 word counts (unsorted):")
    for word in list(wordcountdict.keys())[:10]:
        print("{}:{}".format(word, wordcountdict[word]))
    entropy = informationentropy(wordcountdict, counts)
    print("entropy:", entropy)
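
The script above prints the first ten entries in dictionary order, not the most frequent words. If a ranked view is wanted, one possible sketch uses `collections.Counter` from the standard library. The helper name `top_words` and the sample tokens are assumptions for illustration.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)


# Hypothetical sample tokens, standing in for the segmented corpus.
sample = ["人民", "日报", "人民", "解放", "人民", "日报"]
for word, count in top_words(sample, 2):
    print("{}:{}".format(word, count))
```

`Counter` could also stand in for the manual counting loop, since `sum(counter.values())` gives the total token count.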

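Note: punctuation is removed after segmentation, not before. Punctuation strongly affects how the segmenter splits a sentence, so stripping it first could degrade the segmentation. Removing it afterwards has its own drawback: the segmenter may group a run of special symbols into a single token, such as "###", which the filter then misses, and such a token does not belong in the vocabulary.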
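
One way to handle multi-symbol tokens such as "###" is to test every character of a token instead of comparing the whole token against a punctuation list. The sketch below is an assumption about how this could be done, not the approach used in the code above. It relies on `unicodedata.category` from the standard library, whose category codes start with "P" for punctuation and "S" for symbols. The names `is_symbol_token` and `drop_symbol_tokens` are hypothetical.

```python
import unicodedata


def is_symbol_token(token):
    """True if every character is whitespace, punctuation, or a symbol."""
    return all(
        ch.isspace() or unicodedata.category(ch)[0] in "PS"
        for ch in token
    )


def drop_symbol_tokens(tokens):
    """Keep only tokens that contain at least one letter, digit, or CJK character."""
    return [t for t in tokens if not is_symbol_token(t)]


# Hypothetical segmenter output containing symbol-only tokens.
tokens = ["人民", "###", "，", "日报", "……", "1946"]
print(drop_symbol_tokens(tokens))  # ['人民', '日报', '1946']
```

Tokens that mix symbols with text, such as "C++", are kept by this rule, because at least one character is a letter.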

Run result: