NLP (Natural Language Processing) Notes

强仔fight

Article link: https://blog.csdn.net/qq_35076836/article/details/109512530

Corpus:
knowledge base

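Text processing pipeline:
pipeline
Raw text --> (raw data) web page text, news, reports
Word segmentation --> (segmentation)
Cleaning --> (cleaning) useless tags, stop words, special symbols
Normalization --> for English text
Feature extraction --> tf-idf, word2vec
Modeling --> (modeling) similarity algorithms, classification algorithms
Evaluation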

Five modules:
① word segmentation ② spell correction
③ stop words removal, stemming
④ word representation
⑤ sentence similarity

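① Word segmentation (turning a sentence into a sequence of words)
Tools: jieba, snownlp, LTP, etc. [words -> syntax -> semantics]
jieba.add_word(" ") # add a word to jieba's dictionary
[Rule-based segmentation]
---- forward maximum matching (forward-max matching)
[我们经常有]意见分歧 [经常有意见]
[我们经常] [经常有意]
[我们经]
sliding window ------------> (shrink the window until it matches a dictionary word, then move on)
---- backward maximum matching (backward-max matching)
[有意见分歧]
[意见分歧]
[见分歧]
Implementation: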

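# Reverse (backward) maximum matching
class IMM(object):
    def __init__(self, dic_path):
        self.dictionary = set()
        self.maximum = 0  # length of the longest word in the dictionary
        # Load the dictionary file: one word per line
        with open(dic_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                self.dictionary.add(line)
                if len(line) > self.maximum:
                    self.maximum = len(line)

    def cut(self, text):
        result = []
        index = len(text)
        # Scan from the end of the text towards the beginning
        while index > 0:
            word = None
            # Try the longest window first, then shrink it
            for size in range(self.maximum, 0, -1):
                if index - size < 0:
                    continue
                piece = text[(index - size):index]
                if piece in self.dictionary:
                    word = piece
                    result.append(word)
                    index -= size
                    break
            # No dictionary word ends here: skip this character (it is dropped)
            if word is None:
                index -= 1
        return result[::-1]


def main():
    text = "南京市长江大桥"
    tokenizer = IMM('./data/imm_dic.utf8')
    print(tokenizer.cut(text))


if __name__ == '__main__':
    main()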

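---- bidirectional maximum matching (run both directions; if the two results contain a different number of words, keep the one with fewer words)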
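The three rule-based strategies can be compared side by side in a small sketch. The dictionary `toy_dict`, the `max_len` window size, and the single-character fallback are assumptions made for illustration. The notes do not say what to do when both directions give the same number of words, so the tie-break used here (prefer the result with fewer single-character words) is also an assumption.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    result, index = [], 0
    while index < len(text):
        for size in range(min(max_len, len(text) - index), 0, -1):
            piece = text[index:index + size]
            # Fall back to a single character when nothing matches
            if piece in dictionary or size == 1:
                result.append(piece)
                index += size
                break
    return result


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    result, index = [], len(text)
    while index > 0:
        for size in range(min(max_len, index), 0, -1):
            piece = text[index - size:index]
            if piece in dictionary or size == 1:
                result.append(piece)
                index -= size
                break
    return result[::-1]


def bidirectional_max_match(text, dictionary, max_len):
    """Run both directions and keep the segmentation with fewer words."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    # Assumed tie-break: fewer single-character words wins, backward otherwise
    singles = lambda words: sum(1 for w in words if len(w) == 1)
    return fwd if singles(fwd) < singles(bwd) else bwd


# Hypothetical toy dictionary
toy_dict = {"南京", "南京市", "南京市长", "市长", "长江", "大桥"}
text = "南京市长江大桥"
print(forward_max_match(text, toy_dict, 4))
print(backward_max_match(text, toy_dict, 4))
print(bidirectional_max_match(text, toy_dict, 4))
```

With this toy dictionary the forward pass greedily takes 南京市长 and leaves 江 stranded, while the backward pass recovers 南京市 / 长江 / 大桥, which is why the two directions are worth comparing.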

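[Statistical segmentation]
Treat each word as a sequence of characters, the smallest units. The more often adjacent characters appear together across different texts, the more likely they form a word.

Taking semantics into account (incorporate semantic)
---- input -> generate all possible segmentations -> choose the best one with a language model
Language model principle (unigram): p(经常, 有, 意见, 分歧) = p(经常) * p(有) * p(意见) * p(分歧)
---- Viterbi algorithm (edge weights can be taken as -log p(word), so the shortest path is the most probable segmentation)
f(8): length of the shortest path from node 1 to node 8
f(7): length of the shortest path from node 1 to node 7
f(6): length of the shortest path from node 1 to node 6

f(8) = min( f(5)+3, f(6)+1.6, f(7)+20 ), the smallest of the three incoming paths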
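The recurrence for f(i) can be sketched as dynamic programming over character positions. This is a minimal sketch, assuming a unigram model, edge costs of -log p(word), and a tiny `unknown_prob` for single characters missing from the dictionary. The probabilities in `toy_prob` are invented for illustration.

```python
import math


def viterbi_segment(text, word_prob, unknown_prob=1e-8):
    """Return the segmentation with the lowest total -log p(word) cost."""
    n = len(text)
    best_cost = [0.0] + [math.inf] * n  # best_cost[i]: cheapest way to segment text[:i]
    back = [0] * (n + 1)                # back[i]: start index of the last word on that path
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in word_prob:
                p = word_prob[piece]
            elif end - start == 1:
                p = unknown_prob        # keep every position reachable
            else:
                continue
            cost = best_cost[start] - math.log(p)
            if cost < best_cost[end]:
                best_cost[end] = cost
                back[end] = start
    # Follow the back pointers from the end of the text
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Hypothetical unigram probabilities
toy_prob = {"经常": 0.1, "经": 0.05, "有": 0.1, "有意": 0.001, "意见": 0.2,
            "意": 0.05, "见": 0.05, "见分歧": 0.001, "分歧": 0.2}
print(viterbi_segment("经常有意见分歧", toy_prob))
```

Each `best_cost[end]` plays the role of f(end) in the notes: it is the minimum over all incoming edges of the cost at the edge's start plus the edge weight.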

② Spell correction
user input ---------- correction
天起 ---------- 天气
theris ---------- theirs
机器学系 ---------- 机器学习
Triggers: misspelled words; words that do not fit the context
user input ---- candidates ---- edit distance
therr ---- there ---- 1 replace, cost 1
---- their ---- 1 replace, cost 1
---- thesis ---- 2 replace + 1 add, cost 3
---- theirs
---- the
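The costs in the candidate list can be reproduced with the standard dynamic-programming edit distance. This is a sketch that assumes insert, delete, and replace each cost 1, which matches the costs listed above.

```python
def edit_distance(source, target):
    """Minimum number of inserts, deletes and replaces turning source into target."""
    m, n = len(source), len(target)
    # dist[i][j]: distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete
                dist[i][j - 1] + 1,                 # insert
                dist[i - 1][j - 1] + substitution,  # replace or keep
            )
    return dist[m][n]


for candidate in ["there", "their", "thesis", "theirs", "the"]:
    print(candidate, edit_distance("therr", candidate))
```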
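Method 1: user input -> find the dictionary words with the smallest edit distance -> return
Drawback: too many loop iterations, since every dictionary word has to be compared
Method 2: user input -> generate all strings at edit distance 1 or 2 -> filter -> return
How to filter? Compute p(c|s), the probability of candidate c given the input s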
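Method 2 can be sketched as follows. The word counts in `toy_counts` are invented, and the ranking is a simplification: it assumes p(s|c) is the same for every candidate, so p(c|s) is ranked by the word frequency p(c) alone, with distance-1 candidates preferred over distance-2 ones.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one delete, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + replaces + inserts)


def edits2(word):
    """All strings reachable by applying two single edits."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def correct(word, word_counts):
    """Return the most frequent known word within edit distance 2."""
    if word in word_counts:
        return word
    known = lambda words: [w for w in words if w in word_counts]
    # Generate first, then filter against the dictionary
    candidates = known(edits1(word)) or known(edits2(word)) or [word]
    return max(candidates, key=lambda c: word_counts.get(c, 0))


# Hypothetical word frequencies
toy_counts = {"there": 50, "their": 30, "thesis": 5, "theirs": 3, "the": 500}
print(correct("therr", toy_counts))
```

The number of generated strings depends only on the length of the input, not on the size of the dictionary, which is the advantage over method 1.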

③ Stop words removal, stemming
Filtering words is similar to feature selection.
First filter out stop words and low-frequency words.

Stemming: one way to normalize
went, go, going -> go
fly, flies -> fly
Example rules:
step 1a: sses -> ss, ies -> i
step 1b: (v)ing -> v || (v)ed -> v
step 2: for long stems
step 3: for longer stems
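Steps 1a and 1b can be sketched as simple suffix rules. The notes do not name the stemmer, so this sketch makes an assumption: it reads "(v)" as "the remaining stem contains a vowel", which is how the Porter stemmer states this condition. Steps 2 and 3 are not covered.

```python
VOWELS = set("aeiou")


def step_1a(word):
    """sses -> ss, ies -> i"""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    return word


def step_1b(word):
    """Strip 'ing' or 'ed' only if the remaining stem contains a vowel."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if any(ch in VOWELS for ch in stem):
                return stem
    return word


for w in ["caresses", "ponies", "walking", "plastered", "sing"]:
    print(w, "->", step_1b(step_1a(w)))
```

The vowel condition is what stops short words such as "sing" from being reduced to "s".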

④ Word representation
---- one-hot representation:
Dictionary: [我们, 又, 去, 爬山, 今天, 你们, 昨天, 跑步]
S1: 我们今天去爬山: [1,0,1,1,1,0,0,0]
S2: 你们昨天跑步: [0,0,0,0,0,1,1,1]
S3: 你们又去爬山又去跑步: [0,1,1,1,0,1,0,1]
---- count-based representation:
Drawback: a word that appears more often is not necessarily more important, e.g. "he"
S4: 我们今天去爬山: [1,0,1,1,1,0,0,0]
S5: 你们昨天跑步: [0,0,0,0,0,1,1,1]
S6: 你们又去爬山又去跑步: [0,2,2,1,0,1,0,1]
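Both sentence vectors can be built from a segmented sentence in a few lines. The sketch below assumes the sentences have already been split into words; the segmentation shown for each sentence is my own reading of the examples.

```python
vocab = ["我们", "又", "去", "爬山", "今天", "你们", "昨天", "跑步"]


def binary_vector(words, vocab):
    """1 if the dictionary word occurs in the sentence, else 0."""
    present = set(words)
    return [1 if w in present else 0 for w in vocab]


def count_vector(words, vocab):
    """Number of times each dictionary word occurs in the sentence."""
    return [words.count(w) for w in vocab]


s1 = ["我们", "今天", "去", "爬山"]
s3 = ["你们", "又", "去", "爬山", "又", "去", "跑步"]
print(binary_vector(s1, vocab))
print(binary_vector(s3, vocab))
print(count_vector(s3, vocab))
```

The two representations differ only for words that repeat, such as 又 and 去 in the third sentence.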
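---- distributed representation:
Introduction to word vectors:
Which of the following words are semantically closer to each other?
今天, 运动, 爬山
Problem that distributed representations address: sparsity
Distributed representation: the vector length does not depend on the dictionary size. Example: word vector
How many different words can a 100-dimensional one-hot representation express at most?
Exactly 100.
How many different words can a 100-dimensional distributed representation express at most?
Far more than 100.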

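Q: How is the distributed representation (word vector) of each word learned?
input -> deep learning model (skip-gram, cbow, rnn, mf) -> distributed representation
A word vector represents the meaning of the word.

Q: from word embedding to sentence embedding
① Averaging: sentence vector = mean of the individual word vectors
② LSTM/RNN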
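The averaging rule is simple enough to sketch in plain Python. The three-dimensional vectors in `toy_vectors` are made up for illustration; real word vectors would come from a trained model such as skip-gram or CBOW. Skipping words that have no vector is an assumption of this sketch.

```python
def average_embedding(words, word_vectors):
    """Sentence vector = element-wise mean of the vectors of its words."""
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        raise ValueError("no word in the sentence has a vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical 3-dimensional word vectors
toy_vectors = {
    "我们": [1.0, 0.0, 2.0],
    "去": [0.0, 2.0, 0.0],
    "爬山": [2.0, 1.0, 1.0],
}
print(average_embedding(["我们", "去", "爬山"], toy_vectors))
```

Averaging ignores word order, which is one reason sequence models such as LSTM/RNN are listed as the alternative.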

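⑤ Sentence similarity
---- Euclidean distance: d = |s1 - s2|
Example: d(s1, s2) = sqrt(1+1+1+1+1+1) = sqrt(6)
---- Cosine similarity: d = (s1 · s2) / (|s1| * |s2|), i.e. inner product / product of the norms
Example: d = (x1*y1 + x2*y2 + x3*y3) / ( sqrt(x1^2 + x2^2 + x3^2) * sqrt(y1^2 + y2^2 + y3^2) )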
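Both measures can be sketched with the standard library only. The two vectors used below are the one-hot vectors of S1 and S3 from the word representation section; returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Inner product divided by the product of the norms (larger = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)


s1 = [1, 0, 1, 1, 1, 0, 0, 0]
s3 = [0, 1, 1, 1, 0, 1, 0, 1]
print(euclidean_distance(s1, s3))
print(cosine_similarity(s1, s3))
```

Note that the denominator of the cosine is a product of the two norms, not a sum.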

Introduction to language models:
Purpose: judge whether a sentence is grammatically fluent
Recap: chain rule
p(A,B,C,D) = p(A) * p(B|A) * p(C|A,B) * p(D|A,B,C)
Example: the corpus contains: 今天是春节我们都休息
今天是春节我们都放假
p(休息 | 今天, 是, 春节, 我们, 都) = 1/2

Markov assumption
p(休息 | 今天, 是, 春节, 我们, 都) ≈ p(休息 | 都)
(1st order Markov assumption)
≈ p(休息 | 我们, 都) (2nd order)
unigram, bigram, N-gram (N > 2)
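Under the first-order Markov assumption, the conditional probability reduces to a ratio of counts. The sketch below estimates bigram probabilities by maximum likelihood from the two example sentences; the word segmentation of those sentences is assumed to follow the conditioning words shown in the notes.

```python
from collections import Counter

corpus = [
    ["今天", "是", "春节", "我们", "都", "休息"],
    ["今天", "是", "春节", "我们", "都", "放假"],
]


def train_bigram(corpus):
    """Count single words and adjacent word pairs."""
    unigram, bigram = Counter(), Counter()
    for sentence in corpus:
        unigram.update(sentence)
        bigram.update(zip(sentence, sentence[1:]))
    return unigram, bigram


def bigram_prob(prev, word, unigram, bigram):
    """Maximum likelihood estimate p(word | prev) = C(prev, word) / C(prev)."""
    if unigram[prev] == 0:
        return 0.0
    return bigram[(prev, word)] / unigram[prev]


unigram, bigram = train_bigram(corpus)
print(bigram_prob("都", "休息", unigram, bigram))
print(bigram_prob("今天", "是", unigram, bigram))
```

Any pair that never occurs in the corpus gets probability 0 here, which is the problem smoothing is meant to address.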

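Evaluating a language model:
Pick a specific task such as spell correction -> apply the two models A and B to this task -> compare accuracy to judge how A and B perform
Perplexity: the lower the better; perplexity = 2**(-x)
x: average log likelihood (log base 2)
今天 p(今天) = 0.002
今天天气 p(天气|今天) = 0.01
今天天气很好 p(很好|天气) = 0.1
Average: x = (sum of log likelihoods) / n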
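The perplexity of the three-word example can be worked out directly from the formula. This sketch assumes base-2 logarithms, to match the base of 2**(-x).

```python
import math


def perplexity(probabilities):
    """2 ** (-x), where x is the average base-2 log likelihood."""
    x = sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** (-x)


# p(今天), p(天气|今天), p(很好|天气) from the example
probs = [0.002, 0.01, 0.1]
print(perplexity(probs))
```

For these three probabilities the result is roughly 79.4, the cube root of 1 / (0.002 * 0.01 * 0.1). A model that assigned higher probabilities to the same words would score lower, which is why lower is better.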

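Smoothing (handles the case where some term has probability 0)
---- add-one smoothing (Laplace smoothing): numerator + 1, denominator + V, where V is the vocabulary size
Example: p(是|我们) = 0 without smoothing
p(是|我们) = (0+1) / (3+20) with add-one smoothing
---- add-k smoothing: numerator + k, denominator + kV, where V is the vocabulary size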
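Add-one and add-k smoothing share one formula, so a single function covers both. The numbers in the usage lines come from the example above: a bigram count of 0, a context count of 3, and a vocabulary of 20 words.

```python
def add_k_prob(pair_count, context_count, vocab_size, k=1.0):
    """Smoothed p(word | context) = (C(context, word) + k) / (C(context) + k * V)."""
    return (pair_count + k) / (context_count + k * vocab_size)


print(add_k_prob(0, 3, 20))         # add-one: (0 + 1) / (3 + 20)
print(add_k_prob(0, 3, 20, k=0.5))  # add-k with k = 0.5
```

Adding k * V to the denominator keeps the smoothed probabilities over the whole vocabulary summing to 1.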
---- interpolation. Core idea: when computing a trigram probability, take the unigram, bigram, and trigram frequencies into account together
p_interp(wn|wn-1,wn-2) = alpha1*p(wn|wn-1,wn-2) + alpha2*p(wn|wn-1) + alpha3*p(wn)
alpha1 + alpha2 + alpha3 = 1
Example: C(in the kitchen) = 0
C(the kitchen) = 3
C(kitchen) = 4
C(arboretum) = 0
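The kitchen / arboretum example shows why interpolation helps: both trigrams have count 0, but only "kitchen" has bigram and unigram evidence. The sketch below needs a few numbers the notes do not give, so C(the) = 10, a corpus of 100 tokens, and the weights (0.6, 0.3, 0.1) are all invented for illustration.

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram, alphas=(0.6, 0.3, 0.1)):
    """alpha1 * p(trigram) + alpha2 * p(bigram) + alpha3 * p(unigram)."""
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("the alphas must sum to 1")
    alpha1, alpha2, alpha3 = alphas
    return alpha1 * p_trigram + alpha2 * p_bigram + alpha3 * p_unigram


# Hypothetical totals that are not in the notes
count_the, total_tokens = 10, 100

# "in the kitchen": C(in the kitchen) = 0, C(the kitchen) = 3, C(kitchen) = 4
p_kitchen = interpolated_prob(0.0, 3 / count_the, 4 / total_tokens)
# "in the arboretum": every count is 0
p_arboretum = interpolated_prob(0.0, 0.0, 0.0)
print(p_kitchen, p_arboretum)
```

With a pure trigram model both phrases would get probability 0; after interpolation "kitchen" is ranked above "arboretum".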
---- Good-Turing smoothing
