Python Chinese Natural Language Processing, Part 1: Basic Text Processing

Latest recommended article published 2024-02-22 18:13:59

weixin_39642990


Tags: Python Chinese natural language processing

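Segmenting Chinese text into words

import jieba

text = '你好，我正在进行Python自然语言处理，有些问题需要处理,笑哈哈'

# jieba.lcut returns the tokens as a list directly; joining on spaces and
# splitting again would mangle any input that already contains spaces.
word_list = jieba.lcut(text)
print(word_list)

Output: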

['你好', '，', '我', '正在', '进行', 'Python', '自然语言', '处理', '，', '有些', '问题', '需要', '处理', ',', '笑哈哈']
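To give a feel for what dictionary-based segmentation does, here is a minimal sketch of forward maximum matching. This is a deliberately simple technique and not jieba's actual algorithm; the function name `forward_max_match`, the toy `vocab` set, and the `max_len` parameter are all assumptions made for illustration.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation against a word set (illustrative only)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real segmenter ships a far larger one.
vocab = {"你好", "正在", "进行", "自然语言", "处理"}
print(forward_max_match("你好，我正在进行自然语言处理", vocab))
# ['你好', '，', '我', '正在', '进行', '自然语言', '处理']
```

Anything missing from the dictionary falls back to single characters, so a Latin word such as "Python" would be split letter by letter in this sketch. That is one reason to rely on a full segmenter in practice.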

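Removing punctuation from the text

import re

# Matches a token that starts with ASCII or full-width punctuation.
reg = r"[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+"

# Build a new list rather than calling remove() inside the loop:
# removing items while iterating skips the element after each removal.
word_list = [w for w in word_list if re.match(reg, w) is None]
print(word_list)

Output: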

['你好', '我', '正在', '进行', 'Python', '自然语言', '处理', '有些', '问题', '需要', '处理', '笑哈哈']
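One thing to watch when filtering a token list: calling `remove()` on a list while looping over it skips the element that follows each removal, so two adjacent punctuation tokens would leave one behind. The sketch below uses hypothetical tokens and a plain set in place of the regular expression to show the difference.

```python
# Hypothetical tokens with two adjacent punctuation marks.
tokens = ['你好', '，', '。', '我']
punct = {'，', '。'}

# In-place removal during iteration: the '。' right after '，' is never visited.
buggy = list(tokens)
for tok in buggy:
    if tok in punct:
        buggy.remove(tok)

# Building a new list visits every token exactly once.
clean = [tok for tok in tokens if tok not in punct]

print(buggy)  # ['你好', '。', '我']
print(clean)  # ['你好', '我']
```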

Getting the start and end offsets of each word

# Reuse the raw-string pattern `reg` defined above to strip punctuation first.
text_no_punct = re.sub(reg, "", text)
print(list(jieba.tokenize(text_no_punct)))

Output:

[('你好', 0, 2), ('我', 2, 3), ('正在', 3, 5), ('进行', 5, 7), ('Python', 7, 13), ('自然语言', 13, 17), ('处理', 17, 19), ('有些', 19, 21), ('问题', 21, 23), ('需要', 23, 25), ('处理', 25, 27), ('笑哈哈', 27, 30)]
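Each tuple is (word, start, end), where end is exclusive and equals the start of the next word. When the tokens cover the text with no gaps, the same offsets can be recovered by accumulating token lengths. The helper `token_offsets` below is a hypothetical illustration of that bookkeeping, not how jieba computes them internally.

```python
def token_offsets(tokens):
    """Return (token, start, end) tuples, assuming tokens are contiguous in the text."""
    spans = []
    start = 0
    for tok in tokens:
        end = start + len(tok)  # end offset is exclusive
        spans.append((tok, start, end))
        start = end
    return spans


print(token_offsets(['你好', '我', '正在', '进行', 'Python']))
# [('你好', 0, 2), ('我', 2, 3), ('正在', 3, 5), ('进行', 5, 7), ('Python', 7, 13)]
```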

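Removing repeated characters

class RepeatReplacer:
    """Collapse consecutive repeated characters, e.g. '高高兴兴' -> '高兴'."""

    def __init__(self):
        # (\w)\2 matches one character immediately followed by itself.
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Drop one repeated character per pass and recurse until nothing changes.
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        return repl_word

replacer = RepeatReplacer()
print(replacer.replace("高高兴兴"))

Output:

高兴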
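Note that this pattern collapses every doubled character, including words where the repetition is legitimate, such as "谢谢". One possible safeguard is to skip words found in a list of known words. The sketch below assumes a hypothetical `known_words` argument and class name `GuardedRepeatReplacer`; neither comes from the original code.

```python
import re


class GuardedRepeatReplacer:
    """Repeat remover that leaves listed words alone (illustrative variant)."""

    def __init__(self, known_words=()):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'
        # Hypothetical whitelist of words whose repetition is intentional.
        self.known_words = set(known_words)

    def replace(self, word):
        # Stop as soon as the word, or an intermediate form, is a known word.
        if word in self.known_words:
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        return repl_word


print(GuardedRepeatReplacer().replace("谢谢"))          # 谢
print(GuardedRepeatReplacer({"谢谢"}).replace("谢谢"))  # 谢谢
print(GuardedRepeatReplacer({"谢谢"}).replace("高高兴兴"))  # 高兴
```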

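Applying Zipf's law to a text

import matplotlib.pyplot as plt
from matplotlib import rcParams
from nltk.corpus import gutenberg  # requires nltk.download('gutenberg')
from nltk.probability import FreqDist

# Only needed when plot labels contain Chinese text (assumes SimHei is installed);
# the second setting keeps minus signs rendering correctly with that font.
rcParams['font.sans-serif'] = ['SimHei']
rcParams['axes.unicode_minus'] = False

# Count every token in the Gutenberg corpus.
fd = FreqDist()
for fileid in gutenberg.fileids():
    for word in gutenberg.words(fileid):
        fd[word] += 1

# most_common() sorts by descending frequency, so each position is the rank.
ranks = []
freqs = []
for rank, (word, freq) in enumerate(fd.most_common(), start=1):
    ranks.append(rank)
    freqs.append(freq)

# Zipf's law predicts a roughly straight line on log-log axes.
plt.figure(figsize=(15, 8))
plt.loglog(ranks, freqs, '.-')
plt.xlabel('Rank (r)', fontsize=14, fontweight='bold')
plt.ylabel('Frequency (f)', fontsize=14, fontweight='bold')
plt.grid(True)
plt.show()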
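Zipf's law says frequency is roughly inversely proportional to rank, so the log-log plot should have a slope near -1. As a rough numerical check to go with the plot, the sketch below fits a least-squares line to (log rank, log frequency) using only the standard library. The helper names `rank_frequency` and `loglog_slope` are hypothetical, and the data is synthetic rather than taken from the Gutenberg corpus.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first."""
    counts = Counter(tokens)
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


print(rank_frequency(['a', 'b', 'a', 'c', 'a', 'b']))  # [(1, 3), (2, 2), (3, 1)]

# Synthetic frequencies that follow f = 1200 / r exactly, so the slope is about -1.
ideal = [(rank, 1200 / rank) for rank in range(1, 51)]
print(loglog_slope(ideal))
```

With a real corpus, `rank_frequency` could be fed the token list directly. The fitted slope will only be approximately -1, and it usually deviates most at the very top and very bottom ranks.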


Similarity measures

from nltk.metrics import accuracy, jaccard_distance, masi_distance, binary_distance

text1 = '你好，我正在使用Python自然语言处理，有些问题正在处理,嘿嘿'

# Segment the second sentence and drop punctuation tokens, as before.
word_list1 = [w for w in jieba.lcut(text1) if re.match(reg, w) is None]

print(word_list)
print(word_list1)

# Accuracy: fraction of positions where the two equal-length lists agree
print(accuracy(word_list, word_list1))

# Jaccard distance
print(jaccard_distance(set(word_list), set(word_list1)))

# MASI distance
print(masi_distance(set(word_list), set(word_list1)))

# Binary distance: 0.0 if the two sets are identical, otherwise 1.0
print(binary_distance(set(word_list), set(word_list1)))

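Output (the MASI figure depends on the installed NLTK version, because the formula has changed between releases):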

['你好', '我', '正在', '进行', 'Python', '自然语言', '处理', '有些', '问题', '需要', '处理', '笑哈哈']

['你好', '我', '正在', '使用', 'Python', '自然语言', '处理', '有些', '问题', '正在', '处理', '嘿嘿']

0.75

0.38461538461538464

0.12692307692307692

1.0
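To see where the first two numbers come from, the sketch below recomputes them with plain Python. The function names `positional_accuracy` and `jaccard_dist` are hypothetical stand-ins that follow the textbook definitions; they are not NLTK's code.

```python
def positional_accuracy(reference, test):
    """Fraction of positions where two equal-length lists hold the same item."""
    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    return sum(r == t for r, t in zip(reference, test)) / len(reference)


def jaccard_dist(a, b):
    """1 - |A ∩ B| / |A ∪ B|, assuming at least one set is non-empty."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)


a = ['你好', '我', '正在', '进行', 'Python', '自然语言', '处理', '有些', '问题', '需要', '处理', '笑哈哈']
b = ['你好', '我', '正在', '使用', 'Python', '自然语言', '处理', '有些', '问题', '正在', '处理', '嘿嘿']

# 9 of the 12 positions match.
print(positional_accuracy(a, b))     # 0.75
# The sets share 8 words out of 13 distinct words, so the distance is 5/13.
print(jaccard_dist(set(a), set(b)))  # 0.38461538461538464
```

Accuracy compares the lists position by position, so it is sensitive to word order and requires equal lengths. The Jaccard distance works on sets and ignores both order and duplicates.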
