Natural Language Processing with Python: Chapter 1 Study Notes & Exercises

cyberickk

Categories: NLP, Python. Tags: python, nlp, nltk, natural language processing

Article link: https://blog.csdn.net/m0_46471347/article/details/104650623

Software installation

P18

Python 3.8.2 https://www.python.org/
pip pipenv
NLTK, NLTK-Data pip install nltk http://www.nltk.org/install.html (pip installs only the library; the data is fetched separately with nltk.download())
NumPy pip install numpy
Matplotlib pip install matplotlib
NetworkX pip install networkx
Prover9

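1.1 Computing with Language: Texts and Words

IDLE (Integrated Development and Learning Environment) is the interactive development environment bundled with Python.

Python syntax:
- Exponentiation: 2 to the power of 4 is 2**4
- Strings can be written with either single or double quotes: 'sssss' and "sssss"
- In Python 3, / is already true (floating-point) division and // is floor division; from __future__ import division is only needed in Python 2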

>>> 3/4
0.75
>>> 3//4
0
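
A quick check of the operators above, as a minimal sketch assuming Python 3 (the variable names are only for illustration). The last line is an extra case not in the notes: floor division rounds toward negative infinity, not toward zero.

```python
# Arithmetic operators from the notes, evaluated under Python 3.
power = 2 ** 4            # exponentiation
true_quotient = 3 / 4     # true (floating-point) division
floor_quotient = 3 // 4   # floor division
negative_floor = -3 // 4  # floors toward negative infinity, so this is -1

print(power, true_quotient, floor_quotient, negative_floor)
```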

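Searching text
- text1.concordance('monstrous') shows every occurrence of the word together with its surrounding context (a concordance)
- text1.similar('monstrous') lists words that appear in similar contexts
- text2.common_contexts(['monstrous', 'very']) shows the contexts shared by the words in the list; the list can hold more than two words, and a message is printed when no common context is found
Dispersion plot:
- Requires NumPy and Matplotlib
- text4.dispersion_plot(['citizens', 'democracy', 'freedom', 'duties', 'America'])
Generating text:
- text3.generate()
Counting vocabulary
- Words and punctuation marks are both tokens, so punctuation is included in the counts
- len(text3)
- len(text3) / len(set(text3)) average number of times each distinct token is used (Python 3 needs no from __future__ import division here)
- text3.count('smote')
- 100 * text4.count('a') / len(text4) percentage of the text taken up by 'a'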
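
To make the idea of a concordance concrete, here is a simplified stand-in written with plain lists. It is not NLTK's implementation: the function name simple_concordance, the width parameter, and the sample sentence are all made up for illustration.

```python
def simple_concordance(tokens, target, width=3):
    """Return each occurrence of target with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [token] + right))
    return lines

# Hypothetical sentence, not taken from any NLTK corpus.
tokens = 'the whale was a monstrous size and the most monstrous of fish'.split()
for line in simple_concordance(tokens, 'monstrous'):
    print(line)
```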
Defining functions

>>> def lexical_diversity(text):
...     return len(text) / len(set(text))
...
>>> def percentage(count, total):
...     return 100 * count / total
...
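
The two functions can be tried without loading any corpus. This sketch applies them to a small hand-made token list (the list is hypothetical); as in the notes, punctuation counts as a token.

```python
def lexical_diversity(text):
    """Average number of times each distinct token is used."""
    return len(text) / len(set(text))

def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total

# Hypothetical mini-text; '.' is counted like any other token.
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.', 'the', 'end', '.']

print(len(tokens))                                   # number of tokens
print(len(set(tokens)))                              # number of distinct tokens
print(lexical_diversity(tokens))                     # tokens per distinct token
print(percentage(tokens.count('the'), len(tokens)))  # share of 'the' in percent
```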

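1.2 A Closer Look at Python: Texts as Lists of Words

Lists
- sent1 = ['Call', 'me', 'Ishmael', '.']
- ['a', 'b', 'c'] A text is stored as a list, which can be inspected and passed to functions as an argument
- Lists support addition (concatenation) and multiplication (repetition): sent1 + sent2, sent1 * 3
- Appending: sent1.append('Some')
Indexing lists
- text4[173]
- Finding the index of a word: text4.index('awaken')
- text5[16715:16735]
- text5[0:9] a half-open interval: it covers indexes 0 through 8, 9 elements in total
- text5[-2:] the last 2 elements
Strings
- Joining with spaces: ' '.join(['Monty', 'Python'])
- Splitting on whitespace: 'Monty Python'.split()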
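
The list and string operations above, collected into one runnable sketch. sent1 is the sentence from the notes; sent2 is a made-up second sentence so that concatenation has something to work with.

```python
sent1 = ['Call', 'me', 'Ishmael', '.']
sent2 = ['Hello', 'world']           # hypothetical second sentence

combined = sent1 + sent2             # concatenation builds a new list
repeated = sent2 * 3                 # repetition
sent1.append('Some')                 # append modifies sent1 in place

first_two = combined[0:2]            # half-open slice: indexes 0 and 1
last_two = combined[-2:]             # the last two elements
position = combined.index('Ishmael') # index of the first occurrence

joined = ' '.join(['Monty', 'Python'])
parts = 'Monty Python'.split()

print(combined, repeated, first_two, last_two, position, joined, parts)
```

Note that combined was built before the append, so it does not contain 'Some'.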

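1.3 Computing with Language: Simple Statistics

Frequency distributions:

from nltk.book import *  # as in the book: loads text1 ... text9 and FreqDist
fdist1 = FreqDist(text1)
fdist1.most_common(50)  # the 50 most frequent tokens with their counts (keys() is not frequency-sorted in NLTK 3)
fdist1['whale']  # number of occurrences of 'whale'
fdist1.hapaxes()  # tokens that occur only once
fdist1.freq('which')  # relative frequency, a proportion between 0 and 1
help(FreqDist)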

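Long words that are also frequent characterize a text better: the filter below excludes short frequent words (such as 'the') and long but rare words that are not representative.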

fdist5 = FreqDist(text5)
sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
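
The same ideas can be tried without downloading the book corpora. In NLTK 3, FreqDist is a subclass of collections.Counter, so this sketch uses Counter directly on a made-up token list; the hapaxes and relative frequency are computed by hand here, and the length and count thresholds are lowered to suit the tiny sample.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for text1 / text5.
tokens = ('the whale swam and the whale dived and the sailors '
          'watched the enormous whale').split()

fdist = Counter(tokens)

top3 = fdist.most_common(3)                               # most frequent tokens first
whale_count = fdist['whale']                              # count of one token
hapaxes = sorted(w for w, c in fdist.items() if c == 1)   # tokens seen exactly once
freq_the = fdist['the'] / sum(fdist.values())             # relative frequency of 'the'

# Long and frequent tokens (thresholds chosen for this small sample).
long_frequent = sorted(w for w in set(tokens) if len(w) > 4 and fdist[w] > 2)

print(top3, whale_count, hapaxes, freq_the, long_frequent)
```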

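Collocations and bigrams
- Extracting bigrams: bigrams(['more', 'is', 'said', 'than', 'done']) (in NLTK 3 this returns a generator)
- Extracting and displaying the bigrams: list(bigrams(['more', 'is', 'said', 'than', 'done']))
- Finding bigrams that occur more often than the frequencies of their individual words would predict: text4.collocation_list() (text4.collocations() prints them)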
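
A bigram is just a pair of adjacent tokens, which can be sketched with zip and no NLTK at all. The helper name simple_bigrams and the second sample sentence are hypothetical. Counting raw bigram frequency, as done at the end, is only a rough first step and is not the scoring that NLTK's collocation finder applies.

```python
from collections import Counter

def simple_bigrams(tokens):
    """Pair every token with its successor."""
    return list(zip(tokens, tokens[1:]))

pairs = simple_bigrams(['more', 'is', 'said', 'than', 'done'])
print(pairs)

# Raw bigram counts on a made-up sentence.
tokens = 'New York is big and New York is busy'.split()
bigram_counts = Counter(simple_bigrams(tokens))
print(bigram_counts[('New', 'York')])
```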

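Counting other things

Distribution of word lengths in a text: [len(w) for w in text1]
How many different word lengths occur in the text:

fdist = FreqDist([len(w) for w in text1])
fdist.N()  # total number of samples
fdist.keys()  # the distinct word lengths (not sorted by frequency in NLTK 3)
for sample, count in fdist.most_common():  # iterate in order of decreasing frequency
    print(sample, count)

Frequency of each word length

fdist.items()  # number of occurrences of each word length
fdist.max()  # the most frequent word length
fdist[3]  # number of words of length 3
fdist.freq(3)  # proportion of words of length 3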
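
A small worked version of the word-length distribution, again using collections.Counter in place of FreqDist so it runs without corpus data. The token list is hand-typed for illustration; the comments name the FreqDist method each line stands in for.

```python
from collections import Counter

# Hypothetical token list.
tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago',
          'never', 'mind', 'how', 'long']

lengths = [len(w) for w in tokens]
fdist = Counter(lengths)

total = sum(fdist.values())                      # stands in for fdist.N()
distinct_lengths = sorted(fdist)                 # which word lengths occur
most_common_length = fdist.most_common(1)[0][0]  # stands in for fdist.max()
count_len3 = fdist[3]                            # words of length 3
freq_len3 = fdist[3] / total                     # stands in for fdist.freq(3)

print(total, distinct_lengths, most_common_length, count_len3, freq_len3)
```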

Plotting

fdist.tabulate()  # frequency distribution table
fdist.plot()  # frequency distribution plot
fdist.plot(cumulative=True)  # cumulative frequency plot

1.4 Back to Python: Making Decisions and Taking Control

Conditionals

[w for w in sent7 if len(w) < 4]
sorted([w for w in set(text1) if w.endswith('ableness')])
sorted([term for term in set(text4) if 'gnt' in term])
sorted([item for item in set(text6) if item.istitle()])  # first letter uppercase, the rest lowercase
sorted([item for item in set(sent7) if item.isdigit()])  # digits only
sorted([w for w in set(sent7) if not w.islower()])
sorted([t for t in set(text2) if 'cie' in t or 'cei' in t])
s.isupper()  # all cased characters are uppercase
s.islower()  # all cased characters are lowercase
s.isalnum()  # letters or digits only
s.isalpha()  # letters only

List comprehensions

[w.upper() for w in text1]  # list of every word converted to uppercase
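
The string tests and comprehensions above can be checked on a hand-typed token list, so nothing needs to be downloaded. The list below is an illustration modeled on a tokenized news sentence. One detail worth seeing in practice: islower() is False for tokens with no cased characters, so "not w.islower()" also picks up punctuation and numbers.

```python
# Hand-typed token list for illustration.
tokens = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join',
          'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

short = [w for w in tokens if len(w) < 4]
digits = sorted(w for w in set(tokens) if w.isdigit())
titlecase = sorted(w for w in set(tokens) if w.istitle())
not_lower = sorted(w for w in set(tokens) if not w.islower())
upper_start = [w.upper() for w in tokens[:3]]

print(short)
print(digits)
print(titlecase)
print(not_lower)
print(upper_start)
```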

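Nested code blocks

word = 'cat'
if len(word) < 5:
    print('word length is less than 5')
# in the interactive interpreter, enter an extra blank line to end the block

for word in ['Call', 'me', 'Ishmael', '.']:
    print(word)  # print() ends each item with a newline by default

for word in ['Call', 'me', 'Ishmael', '.']:
    print(word, end=' ')  # end=' ' replaces the trailing newline with a space; any other string works too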

Looping with conditions

for token in sent1:
	if token.islower():
		print(token, 'is a lowercase word')
	elif token.istitle():
		print(token, 'is a titlecase word')
	else:
		print(token, 'is punctuation')
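
The loop above can be run on sent1 directly. In this sketch the branching is moved into a helper (the name describe is made up) so the result for each token can be inspected. It also exposes a limitation of the three-way test: anything that is neither lowercase nor titlecase, such as an all-caps word or a number, falls through to the punctuation branch.

```python
def describe(token):
    """Classify a token with the same three-way test as the loop in the notes."""
    if token.islower():
        return 'is a lowercase word'
    elif token.istitle():
        return 'is a titlecase word'
    else:
        return 'is punctuation'

sent1 = ['Call', 'me', 'Ishmael', '.']
for token in sent1:
    print(token, describe(token))

# An all-caps token is neither lowercase nor titlecase, so it is mislabeled.
print('NLTK', describe('NLTK'))
```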

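1.5 Automatic Natural Language Understanding

Word sense disambiguation: work out which sense of a word was intended in a given context. Automatic disambiguation relies on context, exploiting the fact that neighboring words tend to have closely related meanings.

Pronoun resolution: detect the subjects and objects of verbs ("who did what to whom").

In a sentence pair such as "The thieves stole the paintings. They were subsequently sold.", this involves finding the antecedent of the pronoun they, either thieves or paintings. Computational techniques for this problem include anaphora resolution (identifying what a pronoun or noun phrase refers to) and semantic role labeling (identifying how a noun phrase relates to the verb, e.g. as agent, patient, or instrument).

Generating language output: question answering and machine translation.

Machine translation (MT): translating back and forth between two languages several times often shifts the meaning, and the sentence may even become meaningless. Given bilingual dictionaries or parallel documents, sentences can be paired up (text alignment) to build a more effective translation model.

Spoken dialog systems: nltk.chat.chatbots()

Textual entailment: Recognizing Textual Entailment (RTE) takes a text-hypothesis pair and decides whether the text supports the hypothesis.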