Basic Text Processing Operations in Python

许可可可可

Article link: https://blog.csdn.net/xukeke12138/article/details/111233706

Basic text processing

Corpora

Basic NLTK corpus functions

Importing the NLTK corpora

from nltk.book import *
from nltk.corpus import gutenberg

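Lexical diversity of a text

The length of a text is measured in word tokens (token_number), i.e. len(text).

Define the function lexical_diversity() to compute the lexical diversity of a text, i.e. the number of tokens (token_number) divided by the number of types (type_number). This is the average number of times each distinct word is used.

def lexical_diversity(text):
    return len(text) / len(set(text))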
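
The token/type ratio can be checked by hand on a tiny example. The sketch below assumes a made-up token list in place of an NLTK text such as text1, so it runs without any corpus data.

```python
def lexical_diversity(tokens):
    """Tokens divided by types: average uses per distinct word."""
    return len(tokens) / len(set(tokens))

# Hypothetical toy text standing in for an NLTK text object
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end']

token_count = len(tokens)        # every occurrence counts
type_count = len(set(tokens))    # each distinct word counts once
diversity = lexical_diversity(tokens)

print(token_count, type_count, round(diversity, 3))
```

Here 'the' occurs three times, so the list has 8 tokens but only 6 types.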

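Word frequency statistics

# Import the corpora
from nltk.book import *
# Import the frequency distribution class
from nltk.probability import FreqDist

# Build the frequency distribution (a dict-like mapping from word to count)
fd1 = FreqDist(text1)
# List of (word, count) pairs for the 50 most frequent words
mw1 = fd1.most_common(50)
# Cumulative frequency plot of the 100 most frequent words
fd1.plot(100, cumulative=True)
# Dispersion plot showing where each word occurs in the text
text8.dispersion_plot(['crazy', 'love'])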
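
To see what most_common and the cumulative plot compute, here is a minimal sketch that uses collections.Counter as a stand-in (in NLTK 3, FreqDist is a subclass of Counter). The token list is a made-up example, not one of the NLTK texts.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical toy token list
tokens = ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']

counts = Counter(tokens)

# (word, count) pairs, most frequent first
top_two = counts.most_common(2)

# Running totals over the ranked counts: the values a cumulative plot would draw
cumulative = list(accumulate(c for _, c in counts.most_common()))

print(top_two)
print(cumulative)
```

The last cumulative value always equals the total number of tokens.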

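UTF-8 and Unicode in Python

# Import the library
import codecs

ch = '中'  # sample character, chosen here for illustration
# unicode-escape encoding and decoding; each call returns (result, length consumed)
escaped, _ = codecs.unicode_escape_encode(ch)
codecs.unicode_escape_decode(escaped)
# UTF-8 encoding and decoding; utf_8_decode expects bytes, not str
utf8_bytes, _ = codecs.utf_8_encode(ch)
codecs.utf_8_decode(utf8_bytes)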
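
The same round trips are more commonly written with str.encode and bytes.decode. A minimal sketch, assuming the sample character '中' (U+4E2D):

```python
ch = '中'  # sample character (an assumption for this example)

# str -> bytes
utf8_bytes = ch.encode('utf-8')          # three bytes for this character
escaped = ch.encode('unicode_escape')    # ASCII-only escape sequence

# bytes -> str
from_utf8 = utf8_bytes.decode('utf-8')
from_escape = escaped.decode('unicode_escape')

print(utf8_bytes, escaped, from_utf8, from_escape)
```

Both decodes return the original one-character string.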

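Reading a local file in Python

# Open the file; mind the file path
f = open('license.txt')
# Read the whole file into one string (use f.readline() to read a single line)
raw = f.read()
f.close()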
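
A with block closes the file automatically, and an explicit encoding avoids platform-dependent defaults. The sketch below first writes a throwaway file (a hypothetical sample.txt in a temporary directory) so that it is self-contained.

```python
import os
import tempfile

# Create a small sample file; the name and contents are made up for the example
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\n')

with open(path, encoding='utf-8') as f:
    first = f.readline()   # one line, including its trailing newline
    rest = f.read()        # everything after that line

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()   # all lines, newlines stripped

print(repr(first), repr(rest), lines)
```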

Exercises

Plot the dispersion of the four main characters in Sense and Sensibility (text2): Elinor, Marianne, Edward, and Willoughby.
```
text2.dispersion_plot(['Elinor', 'Marianne', 'Edward', 'Willoughby'])
```

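Compute the least common multiple of two numbers

# Compute the least common multiple of two integers
import math

def Least_common_multiple(a, b):
    # lcm(a, b) = |a * b| / gcd(a, b); by convention the LCM is 0 if either number is 0
    if a == 0 or b == 0:
        return 0
    # Integer division keeps the result an int
    return abs(a * b) // math.gcd(a, b)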

Use slice expressions to extract the last word and the last two words of text2.
```
last1 = text2[-1:]
last2 = text2[-2:]
```
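Use a combination of for and if statements to loop over the words of the movie script of Monty Python and the Holy Grail (text6), and output all the uppercase words as a list.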
```
lis = [word for word in text6 if word.isupper()]
```

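Find all the four-letter words in the Chat Corpus (text5). Using a frequency distribution (FreqDist), show these words in decreasing order of frequency.

lis = [word for word in text5 if len(word) == 4]
fd = FreqDist(lis)
# List the words from most to least frequent
print(fd.most_common())
fd.plot(100)
# If FreqDist.plot is not usable, plot the counts with matplotlib instead
import matplotlib.pyplot as plt
fq = [s for (w, s) in fd.most_common(50)]
plt.plot(fq)
plt.show()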

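Write expressions that find all the words in text6 meeting the conditions below. The result should be a list of words: ['word1', 'word2', ...].
a. Ending in ize
b. Containing the letter z
c. Containing the letter sequence pt
d. All lowercase except for an initial capital (i.e. titlecase)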
```
s1=[w for w in text6 if w.endswith('ize')]
s2=[w for w in text6 if 'z' in w]
s3=[w for w in text6 if 'pt' in w]
s4=[w for w in text6 if w.istitle()]
```
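Define sent as the word list ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']. Write code to perform the following tasks:
a. Output all the words beginning with sh
b. Output all the words longer than 4 characters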
```
s5=[w for w in sent if w.startswith('sh')]
s6=[w for w in sent if len(w)>4]
```
Define the function vocab_size(text), which returns the vocabulary size of a text.
```
def vocab_size(text):
    return len(set(text))
```

Define the function percent(word, text), which computes how often a given word occurs in a text. The result is expressed as a percentage.

def percent(word, text):
    cnt = sum([1 for w in text if word == w])
    return cnt / len(text) * 100

Compute the average word length of a text

def avgWordLen(text):
    return sum([len(w) for w in text]) / len(text)
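
The three helper functions from these exercises can be sanity-checked on the short sent list defined earlier. This is a self-contained sketch that redefines the helpers locally rather than relying on NLTK.

```python
def vocab_size(text):
    """Number of distinct words in the text."""
    return len(set(text))

def percent(word, text):
    """Share of tokens equal to word, as a percentage."""
    return text.count(word) / len(text) * 100

def avgWordLen(text):
    """Mean number of characters per token."""
    return sum(len(w) for w in text) / len(text)

sent = ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']

size = vocab_size(sent)        # 'sea' appears twice, so 7 distinct words
share = percent('sea', sent)   # 2 of 8 tokens
mean_len = avgWordLen(sent)    # 30 characters over 8 tokens

print(size, share, mean_len)
```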

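Regular expressions and their applications

A regular expression is made up of two kinds of characters:

Special characters, or metacharacters (meta_characters)
Other characters, which match themselves (literal)

The regular expression module

Import the module

import re

Function calls for regular expression matching:

re.search(pattern, string)
Returns a match object for the first matching substring only, or None if nothing matches.
re.search(pattern, string).group()
Returns the matched substring.
re.findall(pattern, string)
Returns a list of all non-overlapping substrings that match the pattern.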
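
The difference between the three calls is easiest to see on one string. A minimal sketch with a made-up sentence:

```python
import re

text = 'The rain in Spain stays mainly in the plain'

# search stops at the first match and returns a match object
m = re.search(r'ain', text)
first = m.group()   # the matched substring
span = m.span()     # (start, end) offsets of that match

# findall collects every non-overlapping match as a list of strings
all_matches = re.findall(r'\w*ain\w*', text)

# search returns None when nothing matches, so test before calling .group()
missing = re.search(r'snow', text)

print(first, span, all_matches, missing)
```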

Basic metacharacters

The examples use the following wordlist:

import nltk
# 'en' selects the English word list.
# words.words() takes 'en', whereas stopwords.words() takes 'english'.
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

$ matches the end of the string
Example: list every word in wordlist that ends in ed
```
[w for w in wordlist if re.search('ed$', w)]
```

^ matches the start of the string
Example: list the words that start with th

[w for w in wordlist if re.search('^th', w)]

. matches any single character
Example: list the words whose second character is s
```
[w for w in wordlist if re.search('^.s', w)]
```
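
The three anchors and wildcards so far can be tried without the NLTK word list. The sketch below assumes a small made-up list in place of wordlist.

```python
import re

# Hypothetical stand-in for wordlist
words = ['thread', 'bathed', 'used', 'asked', 'the', 'island']

ends_in_ed = [w for w in words if re.search(r'ed$', w)]
starts_with_th = [w for w in words if re.search(r'^th', w)]
second_char_s = [w for w in words if re.search(r'^.s', w)]

print(ends_in_ed)
print(starts_with_th)
print(second_char_s)
```

Note that 'bathed' contains th but does not start with it, so '^th' rejects it.
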
? means the preceding character occurs zero times or once, i.e. ≤1
Example: match both email and e-mail
```
re.search('^e-?mail', w)
```
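+ means the preceding character occurs one or more times, i.e. ≥1
Example: this matches google, gooogle, and so on, with any number of extra o's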
```
re.search('^goo+gle', w)
```
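* means the preceding character occurs zero or more times, i.e. ≥0
Example: '.*' matches any number of arbitrary characters
{m,n} means the preceding character occurs at least m and at most n times. In particular, {n} means exactly n times, {,n} means at most n times, and {n,} means at least n times. Write the braces without spaces: Python's re treats a form such as {m, n} as literal text.
Example: this matches 'sheet', 'sheep', and so on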
```
re.search('^she{2}', w)
```
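
The quantifiers ?, +, * and {m,n} can be compared side by side. A minimal sketch over a made-up word list; the $ anchors are added here so that each pattern must match the whole word.

```python
import re

# Hypothetical test words
words = ['email', 'e-mail', 'e--mail', 'gogle', 'google', 'gooogle',
         'she', 'sheep', 'sheet', 'shed']

optional_hyphen = [w for w in words if re.search(r'^e-?mail$', w)]   # - zero or one time
one_or_more_o = [w for w in words if re.search(r'^goo+gle$', w)]     # second o one or more times
zero_or_more_o = [w for w in words if re.search(r'^go*gle$', w)]     # o zero or more times
one_or_two_o = [w for w in words if re.search(r'^go{1,2}gle$', w)]   # o once or twice
exactly_two_e = [w for w in words if re.search(r'^she{2}', w)]       # e exactly twice

print(optional_hyphen, one_or_more_o, zero_or_more_o, one_or_two_o, exactly_two_e)
```
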
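[] matches any one of the characters listed inside it
Note that the string is scanned from left to right, so the match is decided by position in the string, not by the order of the characters in the pattern.
Example: the match returned is 'c', the first character of 'iceberg' that belongs to the set
```
re.search('[cer]', 'iceberg')
```
Inside [], a '^' that appears as the first character means negation. That is, [^...] matches any single character that is not listed in the brackets.
Example: match words that begin with a non-vowel character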
```
re.search('^[^aeiouAEIOU]', w)
```
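
A short sketch of character classes and their negation, using 'iceberg' and a made-up word list. It shows that re.search returns only the first hit, while re.findall returns every character in the set.

```python
import re

# First character of the string that belongs to the set
first = re.search(r'[cer]', 'iceberg').group()

# Every character of the string that belongs to the set, in string order
every = re.findall(r'[cer]', 'iceberg')

# Hypothetical words; a negated class accepts anything that is not a vowel,
# including digits and punctuation
words = ['apple', 'Banana', 'cherry', 'Orange', '3d']
non_vowel_start = [w for w in words if re.search(r'^[^aeiouAEIOU]', w)]

print(first, every, non_vowel_start)
```
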
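| matches either the pattern on its left or the pattern on its right, trying the left one first
Example 1: first try to match words starting with th; if that fails, match words starting with sh
```
re.search('^th|^sh', w)
```
Example 2: first try to match words starting with th; if that fails, match words ending in sh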
```
re.search('^th|sh$', w)
```
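() limits the scope of an operator
Example 1: first try to match words starting with th; if that fails, match words starting with sh
```
re.search('^(th|sh)', w)
```
Example 2: without parentheses, ^ applies only to th, so this matches words starting with th or, failing that, words containing sh anywhere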
```
re.search('^th|sh', w)
```
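
The effect of the parentheses shows up clearly when the three patterns are run over the same words. A minimal sketch with a made-up list:

```python
import re

# Hypothetical test words
words = ['this', 'ship', 'wash', 'bath', 'cat']

# ^ applies to both alternatives: starts with th or starts with sh
either_start = [w for w in words if re.search(r'^(th|sh)', w)]

# ^ applies to th only: starts with th, or contains sh anywhere
start_or_contains = [w for w in words if re.search(r'^th|sh', w)]

# starts with th, or ends in sh
start_or_end = [w for w in words if re.search(r'^th|sh$', w)]

print(either_start, start_or_contains, start_or_end)
```

'bath' matches none of them, because its th is not at the start and it has no sh.
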
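\ escapes the following character so that it is matched literally
Example: matching a web address
```
re.search(r'^w{3}\.cuc\.edu', 'www.cuc.edu.cn')
```
Note: a regular expression should reach the re engine unchanged rather than being processed first by Python's ordinary string-escape rules. It is therefore best to prefix the pattern with r (a raw string), so that Python leaves the backslashes in it untouched.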
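
The word-boundary escape \b is a convenient way to see why the r prefix matters, because in an ordinary Python string '\b' is the backspace character. A minimal sketch:

```python
import re

text = 'a cat sat'

raw_pattern = r'\bcat\b'    # backslashes reach the re engine: word boundaries
plain_pattern = '\bcat\b'   # Python turns each \b into a backspace character

raw_match = re.search(raw_pattern, text)
plain_match = re.search(plain_pattern, text)

print(len(raw_pattern), len(plain_pattern))
print(raw_match.group(), plain_match)
```

The raw pattern is 7 characters long and finds 'cat'; the plain one is 5 characters long and finds nothing.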

Example: in a line of formatted text, the part before ':' is the title and the part after ':' is the content. Separate the title from the content.

import re

# Named line rather than str, to avoid shadowing the built-in str type
line = '春晓:春眠不觉晓，处处闻啼鸟'

result1 = re.search(r'(.*):', line)
# Result: <re.Match object; span=(0, 3), match='春晓:'>
print(result1)
# Result: 春晓:
print(result1.group())
# Result: 春晓
print(result1.group(1))

result2 = re.search(r':(.*)', line)
# Result: <re.Match object; span=(2, 14), match=':春眠不觉晓，处处闻啼鸟'>
print(result2)
# Result: :春眠不觉晓，处处闻啼鸟
print(result2.group())
# Result: 春眠不觉晓，处处闻啼鸟
print(result2.group(1))
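
Both parts can also be captured in a single pass with two groups. This is one possible variant rather than the approach used above; [^:]* stops the title at the first colon, and str.partition is shown as a regex-free alternative.

```python
import re

line = '春晓:春眠不觉晓，处处闻啼鸟'

# Group 1: everything up to the first ':'; group 2: everything after it
m = re.search(r'^([^:]*):(.*)$', line)
title, content = m.groups()

# The same split without a regular expression
title2, _, content2 = line.partition(':')

print(title, content)
```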

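Applications of regular expressions in NLP

Capturing and extracting stems

For comparison:

Using string methods to strip the suffix and keep only the stem

def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word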

A stemmer based on re

# Result: the list ['ing']
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

# Result: <re.Match object; span=(0, 10), match='processing'>
re.search(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

# findall returns a list whose first (and only) element is the captured suffix.
def stem(w):
    suffixes = re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', w)
    # Return the word unchanged when no suffix matches
    return w[:-len(suffixes[0])] if suffixes else w
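
Because .* is greedy, the pattern above captures the shortest possible suffix: for 'processes' it finds 's' rather than 'es'. A common refinement, sketched here under the assumption that the longest listed suffix should be stripped, makes the stem part non-greedy and the suffix optional.

```python
import re

SUFFIXES = r'(ing|ly|ed|ious|ies|ive|es|s|ment)'

def greedy_suffix(word):
    """Suffix captured by the greedy pattern, or '' if none matches."""
    found = re.findall(r'^.*' + SUFFIXES + r'$', word)
    return found[0] if found else ''

def stem(word):
    """Non-greedy stem: .*? grows only as far as needed; the suffix is optional."""
    stem_part, _suffix = re.findall(r'^(.*?)' + SUFFIXES + r'?$', word)[0]
    return stem_part

print(greedy_suffix('processes'))   # greedy .* leaves only 's' for the group
print(stem('processes'), stem('processing'), stem('language'))
```

Like the versions above, this is still a rule-of-thumb stemmer and will mis-stem many words.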

NLTK stemmers

NLTK provides ready-made stemmers that handle irregular cases which re or string methods cannot.

word = 'lying'

porter = nltk.PorterStemmer()
# Result: lie
print(porter.stem(word))

lancaster = nltk.LancasterStemmer()
# Result: lying
print(lancaster.stem(word))

snowball = nltk.SnowballStemmer('english')
# Result: lie
print(snowball.stem(word))

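Lemmatization

Recovering the base form of a word is the first step in processing word meaning. It removes the word's inflectional variation. This process is called lemmatization ("lemmatize").

NLTK offers a lemmatizer based on WordNet.

wnl = nltk.WordNetLemmatizer()
# Result: lying (lemmatize() treats the word as a noun by default)
print(wnl.lemmatize('lying'))
# Result: foot
print(wnl.lemmatize('feet'))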

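Tokenizing with re

The function re.split(pattern, string) splits string at every match of pattern.

# Split on single spaces
re.split(r' ', string)
# Split on runs of spaces, tabs, and newlines
re.split(r'[ \t\n]+', string)
# \W matches any character other than a letter, digit, or underscore; + means one or more
re.split(r'\W+', string)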
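
The choice of split pattern changes the tokens noticeably. The sketch below applies several patterns to one made-up string; re.findall(r'\w+', ...) is included as an alternative that avoids the empty strings split can produce.

```python
import re

# Hypothetical raw text with a tab, a newline, a double space, and punctuation
raw = 'one two\tthree\nfour,  five!'

by_space = re.split(r' ', raw)             # the double space yields an empty string
by_whitespace = re.split(r'[ \t\n]+', raw) # punctuation stays attached to words
by_non_word = re.split(r'\W+', raw)        # trailing '!' leaves an empty string at the end
word_runs = re.findall(r'\w+', raw)        # collect the words instead of splitting

print(by_space)
print(by_whitespace)
print(by_non_word)
print(word_runs)
```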

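Examples of using the re library

Find sequences of two or more vowels in the word list:

[vs for w in wordlist for vs in re.findall(r'[aeiou]{2,}', w)]

Compress words by dropping their word-internal vowels:

# regexp is not defined in the original notes; this pattern is an assumption
# that follows the NLTK book: keep leading vowels, trailing vowels, and every non-vowel
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)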
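
A worked run makes the behaviour of compress concrete. The sketch assumes the vowel pattern from the NLTK book, since the notes do not show the value of regexp.

```python
import re

# Assumed pattern: leading vowels | trailing vowels | any non-vowel character
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    """Drop word-internal vowels, keeping those at the start or end."""
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

results = [compress(w) for w in ['university', 'declaration', 'area']]
print(results)
```

'area' comes back unchanged, because its vowels are all at the start or the end of the word.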

Conditional distribution of consonant-vowel pairs (phoneme distribution):

rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

Exercises

Use re to find percentages
```
re.search(r'^\d+(\.\d+)?%$', '12.34%')
```
Use a regular expression to extract the non-alphabetic tokens from a text
```
re.findall(r'[^a-zA-Z]+', string)
```
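
The percentage pattern above is anchored with ^ and $, so it tests whether a whole string is a percentage. To pull percentages out of running text, one option is to drop the anchors and make the group non-capturing, since re.findall returns only the group contents when a pattern has a capturing group. A sketch with a made-up sentence:

```python
import re

# Hypothetical sample text
text = 'Sales rose 12.34% in May, fell 5% in June, and 100 units shipped.'

# Whole-string check with the anchored pattern
is_percentage = re.search(r'^\d+(\.\d+)?%$', '12.34%') is not None
not_percentage = re.search(r'^\d+(\.\d+)?%$', '12.%') is not None

# Extraction from running text; (?:...) groups without capturing
percentages = re.findall(r'\d+(?:\.\d+)?%', text)

print(is_percentage, not_percentage, percentages)
```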
