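1. Processing pipeline
-
Speech recognition
-
Natural language processing - semantic analysis
-
Logical analysis - combining the business scenario and context
-
Natural language processing - generating natural-language text from the analysis result
-
Speech synthesis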
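A typical natural language processing workflow:
First tokenize the training text (applying stemming and lemmatization), then count word frequencies and use term frequency-inverse document frequency (TF-IDF) to estimate how much each word contributes to the meaning of a sample. These per-word weights become the features of a supervised classification model. Passing a test sample to the trained model yields the semantic category of that sample.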
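The workflow above can be sketched in plain Python. This is only an illustration under stated assumptions: the regex tokenizer stands in for real tokenization and stemming, the IDF uses one common smoothed variant, and a nearest-centroid rule with cosine similarity stands in for whatever supervised classifier is actually used. The function names and the toy training data are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep alphabetic tokens (a stand-in for tokenizing and stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(token_lists):
    """Compute a smoothed inverse document frequency for every training word."""
    n_docs = len(token_lists)
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    return {word: math.log((1 + n_docs) / (1 + df)) + 1
            for word, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Weight each known word by count * IDF, then L2-normalize the vector."""
    counts = Counter(word for word in tokens if word in idf)
    vec = {word: count * idf[word] for word, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {word: v / norm for word, v in vec.items()}


def train(samples):
    """Sum the TF-IDF vectors of each label's samples into one centroid per label."""
    token_lists = [tokenize(text) for text, _ in samples]
    idf = fit_idf(token_lists)
    centroids = defaultdict(Counter)
    for tokens, (_, label) in zip(token_lists, samples):
        centroids[label].update(tfidf_vector(tokens, idf))
    return idf, centroids


def predict(text, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity to the text."""
    vec = tfidf_vector(tokenize(text), idf)

    def cosine(label):
        centroid = centroids[label]
        norm = math.sqrt(sum(v * v for v in centroid.values())) or 1.0
        return sum(w * centroid[word] for word, w in vec.items()) / norm

    return max(centroids, key=cosine)


# Hypothetical toy training set: (text, label) pairs.
TRAIN = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stock markets fell as rates rose", "finance"),
]

idf, centroids = train(TRAIN)
print(predict("a goal in the final match", idf, centroids))
print(predict("interest rates and stock prices", idf, centroids))
```

Words that never appear in the training text are ignored at prediction time, which is why a real system trains on a much larger corpus than this toy set.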
Natural Language Toolkit - NLTK
2. Text analysis
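# Tokenization
"""
The following setup may be required:
import nltk
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
import nltk.tokenize as tk
# Split the sample into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# Split the sample into words; word_list: list of words
word_list = tk.word_tokenize(text)
# Split the sample into words and punctuation; punctTokenizer: tokenizer object
punctTokenizer = tk.WordPunctTokenizer()
word_list = punctTokenizer.tokenize(text)
"""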
import nltk.tokenize as tk
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
text = "Are you curious about tokenization? " \
"Let's see how it works! " \
"We need to analyze a couple of sentences " \
"with punctuations to see it in action."
print(text)
sent_list = tk.sent_tokenize(text)
print(sent_list)
word_list = tk.word_tokenize(text)
print(word_list)
# Tokenizer object: splits on word characters and runs of punctuation
punctTokenizer = tk.WordPunctTokenizer()
tokens = punctTokenizer.tokenize(text)
print(tokens)
# Stemming
"""
stemmer = pt.PorterStemmer() # Porter stemmer, relatively lenient
stemmer = lc.LancasterStemmer() # Lancaster stemmer, relatively strict
stemmer = sb.SnowballStemmer('english') # Snowball stemmer, in between
"""
words = ['table', 'probably', 'wolves', 'playing',
'is', 'dog', 'the', 'beaches', 'grounded',
'dreamt', 'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer('english')
# Stem each word with all three stemmers and print the results side by side
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s %8s %8s %8s' %
          (word, pt_stem, lc_stem,