9. Natural Language Processing

Latest recommended article published on 2024-04-08 09:58:45

木景夕

Category: Python网络爬虫权威指南（第2版）

Article link: https://blog.csdn.net/MUJINGXI_LH/article/details/117001015

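This article introduces the basics of natural language processing: summarizing text with an n-gram model, applying a Markov model to a weather system, and using Python's NLTK library for statistical analysis and part-of-speech tagging. Worked examples show how to use these tools to analyze and understand text, such as building a Markov chain and using NLTK for word-frequency counts and part-of-speech tagging.

Summary generated automatically by CSDN.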

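Understanding how text analysis works is useful in all kinds of machine learning scenarios, and it also sharpens your ability to model real-world problems with probability theory and algorithms.
1. Summarizing Data
With a small modification, the n-gram model we used in Chapter 8 can collect frequency data for 2-gram sequences and return a Counter object of 2-grams. The code is as follows: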

# -*- coding: GBK -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
from collections import Counter

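def cleanSentence(sentence):
	# Split on spaces, then strip punctuation and whitespace from each word
	sentence = sentence.split(' ')
	sentence = [word.strip(string.punctuation+string.whitespace) for word in sentence]
	# Drop empty strings and single characters, except the words 'a' and 'i'
	sentence = [word for word in sentence if len(word) > 1 or (word.lower() == 'a' or word.lower() == 'i')]
	return sentence

def cleanInput(content):
	# Normalize case so that 'The' and 'the' count as the same word
	content = content.upper()
	content = re.sub('\n', ' ', content)
	# Discard non-ASCII characters
	content = bytes(content, 'UTF-8')
	content = content.decode('ascii', 'ignore')
	# Split into sentences so that no n-gram spans a sentence boundary
	sentences = content.split('. ')
	return [cleanSentence(sentence) for sentence in sentences]

def getNgramsFromSentence(content, n):
	# Slide a window of n words across the cleaned sentence
	output = []
	for i in range(len(content)-n+1):
		output.append(content[i:i+n])
	return output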
	
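def getNgrams(content, n):
	content = cleanInput(content)
	ngrams = Counter()
	ngrams_list = []
	for sentence in content:
		# Pass n through instead of a hard-coded 2 so the parameter is honored
		newNgrams = [' '.join(ngram) for ngram in getNgramsFromSentence(sentence, n)]
		# The source listing is cut off after "ngrams_list.extend"; the rest of
		# this function is a reconstruction based on the description above
		# (it returns a Counter of 2-grams), not the author's verbatim code.
		ngrams_list.extend(newNgrams)
		ngrams.update(newNgrams)
	return ngrams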
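
To make the idea concrete, here is a minimal self-contained sketch of counting 2-grams with a Counter. It is not the article's listing: the function name `count_ngrams` and the sample text are hypothetical, and the cleaning step is a simplified version of the one above, assuming plain ASCII input and sentences separated by a period followed by a space.

```python
import re
import string
from collections import Counter


def count_ngrams(text, n=2):
    """Count n-grams sentence by sentence (hypothetical helper for illustration)."""
    counts = Counter()
    # Normalize case and collapse whitespace, then split into sentences on '. '
    text = re.sub(r'\s+', ' ', text.upper())
    for sentence in text.split('. '):
        words = [w.strip(string.punctuation + string.whitespace)
                 for w in sentence.split(' ')]
        # Drop empty strings and single characters other than 'A' and 'I'
        words = [w for w in words if len(w) > 1 or w in ('A', 'I')]
        # Slide a window of n words; n-grams never cross a sentence boundary
        for i in range(len(words) - n + 1):
            counts[' '.join(words[i:i + n])] += 1
    return counts


sample = "The cat sat on the mat. The cat ate. The dog sat on the mat."
counts = count_ngrams(sample)
# most_common() lists the 2-grams from most to least frequent
print(counts.most_common(3))
```

Because a Counter is a dict subclass, the result can be sorted, filtered, or merged with other counts, which is what makes it convenient for picking out the most frequent phrases in a longer text.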
