NLP

是ひま呀
Category: Course Notes # WDPS

Article link: https://blog.csdn.net/Odessa_R/article/details/103556260

Knowledge Acquisition

Definition: the process of extracting knowledge from unstructured text and turning it into other forms of data, such as a knowledge base.

Raw text -> refined text -> textual entities and relations -> knowledge bases

Going from raw text to refined text requires NLP preprocessing.
Going from refined text to textual entities and relations requires information extraction.
The final step requires disambiguation and linking.

NLP preprocessing

Structured data - databases
Unstructured data - information retrieval
Semi-structured data: text almost always has some structure, such as headings and bullets

Standard NLP preprocessing tasks

Tokenization
- Token: instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit
- Type: class of all tokens containing the same character sequence
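
The token/type distinction can be made concrete with a small sketch. The whitespace-and-punctuation rule used here (`\w+`) is an assumption made for illustration only, not the rule used by any of the tokenizers named in these notes.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into word tokens with a deliberately crude rule."""
    return re.findall(r"\w+", text)

text = "to be or not to be"
tokens = tokenize(text)   # every occurrence in the document is a token
types = Counter(tokens)   # each distinct character sequence is a type

print(len(tokens), "tokens:", tokens)
print(len(types), "types:", sorted(types))
```

The sentence has six tokens but only four types, because "to" and "be" each occur twice.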

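Queries and documents have to be preprocessed identically.
Tokenization determines which queries can match.
Identical preprocessing guarantees that a sequence in the query is the same as the corresponding sequence in the text.
Other issues:
Hyphens? Personal names and place names? Tokenization is also language-specific.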
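
A minimal sketch of why the query and the document must go through the same preprocessing, and of how the hyphen decision affects matching. The functions `preprocess` and `matches` are hypothetical helpers written for this example; treating hyphens as separators is just one possible choice.

```python
import re

def preprocess(text):
    """Lowercase, then split on anything that is not a letter or digit.
    Hyphens therefore act as separators (an assumption of this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(query, document):
    """True if the query's token sequence occurs contiguously in the document."""
    q, d = preprocess(query), preprocess(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

doc = "A state-of-the-art tokenizer from Stanford"

# Same preprocessing on both sides: the hyphenated phrase is found.
print(matches("State of the art", doc))

# If the document were only split on whitespace, "state-of-the-art" would
# stay a single token and the query token "state" would never match it.
print("state" in doc.lower().split())
```

Whichever rule is chosen for hyphens, the point is that it has to be applied to both the query and the document.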

*State-of-the-art:
Stanford Tokenizer
Apache OpenNLP
NLTK

Stemming or lemmatization
Goal of lemmatization: reduce inflectional forms (all variants of a word) to a base form
Lemmatization: reduction to the dictionary headword form
Lemma: the dictionary form of a set of words
Goal of stemming: reduce terms to their "roots"
Stemming: crude affix chopping
Stem: the root form of a set of words, not necessarily a word itself
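
The contrast between the two can be sketched with toy versions of each. The suffix list and the lemma dictionary below are hypothetical and far cruder than a real stemmer (such as Porter's) or a real lemmatizer; they only illustrate "affix chopping" versus "dictionary lookup".

```python
# Hypothetical suffix list, checked in order; not the rules of any real stemmer.
SUFFIXES = ("ies", "ing", "ed", "es", "s")

def crude_stem(word):
    """Chop the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their headword.
LEMMAS = {"ponies": "pony", "running": "run", "was": "be", "better": "good"}

def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["ponies", "running", "was", "better"]:
    print(f"{w:8} stem={crude_stem(w):8} lemma={lemmatize(w)}")
```

The stems "pon" and "runn" are not words, while the lemmas "pony" and "run" are dictionary headwords. Irregular forms such as "was" are only handled by the dictionary lookup.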
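
Stopword removal
Stopwords carry almost no meaning and are extremely common; they appear in almost every document.
Why remove stopwords: they are kept out of the IR dictionary, which saves a lot of storage space and speeds up queries.

*Trend: stopword removal is no longer performed, because better compression techniques and better query techniques are now available.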
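
The storage argument can be illustrated on a toy inverted index. The documents and the stopword list are made up for this sketch; real stopword lists are longer and real collections show the effect far more strongly.

```python
from collections import defaultdict

# Hypothetical tiny stopword list and document collection.
STOPWORDS = {"the", "a", "of", "is", "in", "and", "to"}
docs = {
    1: "the cat is in the garden",
    2: "a history of the garden",
    3: "the cat and the dog",
}

def build_index(docs, stopwords=frozenset()):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            if token not in stopwords:
                index[token].add(doc_id)
    return index

def n_postings(index):
    """Total number of (term, document) entries stored in the index."""
    return sum(len(ids) for ids in index.values())

full = build_index(docs)
reduced = build_index(docs, STOPWORDS)
print("with stopwords:   ", len(full), "terms,", n_postings(full), "postings")
print("without stopwords:", len(reduced), "terms,", n_postings(reduced), "postings")
```

In this toy collection the dictionary shrinks from 10 terms to 4 and the postings from 14 to 6, with the longest postings list ("the") removed entirely.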

POS (part-of-speech) tagging
Allows for a higher degree of abstraction when estimating likelihoods
Automatic tagging: Penn Treebank tagset
Tagging process:
Input: the tokenized words
Output: a chain of tokens with their POS tags
Goal: the most likely POS tags for the sequence

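HMM (hidden Markov model) for POS tagging
Usually used together with the Viterbi algorithm.
Once a model has been trained, given a word sequence we want to find the most likely sequence of POS tags; this task is called decoding.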
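
Decoding with the Viterbi algorithm can be sketched as follows. All probabilities below are invented for illustration (in practice they are estimated from a tagged corpus), and the three tags are a small subset of the Penn Treebank tagset.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for the observed words."""
    # best[t][s]: probability of the best tag path ending in s at position t
    best = [{s: start_p[s] * emit_p[s].get(words[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: previous tag on that best path
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s].get(words[t], 0.0)
            back[t][s] = prev
    # Start from the best final tag and follow the back-pointers.
    tags = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    return list(reversed(tags))

# Hypothetical toy model parameters.
states = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.5, "NN": 0.4, "VBZ": 0.1},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "barks": 0.1},
    "VBZ": {"barks": 0.6, "dog": 0.05},
}

words = ["the", "dog", "barks"]
print(list(zip(words, viterbi(words, states, start_p, trans_p, emit_p))))
```

With these toy numbers "barks" could be a noun or a verb, and the decoder prefers VBZ because the NN to VBZ transition is more probable than NN to NN. A practical implementation would typically work with log probabilities to avoid underflow on long sentences.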

Parsing
Construct a tree that represents the syntactic structure of the string according to some grammar.
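
As one possible illustration, the sketch below builds such a tree with the CKY algorithm. The notes do not name a parsing algorithm, so CKY is a choice made for this example, and the grammar and lexicon are hypothetical toy rules in Chomsky normal form.

```python
# Hypothetical toy grammar: (left child, right child) -> parent label.
BINARY = {("NP", "VP"): "S", ("DT", "NN"): "NP", ("VBZ", "NP"): "VP"}
# Hypothetical lexicon: word -> POS tag.
LEXICON = {"the": "DT", "a": "DT", "dog": "NN", "cat": "NN", "chases": "VBZ"}

def cky_parse(words):
    """Return a nested-tuple parse tree rooted in S, or None if there is none."""
    n = len(words)
    # chart[i][j] maps a label to a tree covering words[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        if w in LEXICON:
            chart[i][i + 1][LEXICON[w]] = (LEXICON[w], w)
    for span in range(2, n + 1):              # width of the span
        for i in range(n - span + 1):         # start of the span
            j = i + span
            for k in range(i + 1, j):         # split point
                for left, ltree in chart[i][k].items():
                    for right, rtree in chart[k][j].items():
                        parent = BINARY.get((left, right))
                        if parent:
                            chart[i][j][parent] = (parent, ltree, rtree)
    return chart[0][n].get("S")

tree = cky_parse("the dog chases a cat".split())
print(tree)
```

Each tuple is a node: the first element is the label and the remaining elements are its children, so the sentence is analysed as an NP ("the dog") followed by a VP ("chases a cat"). A word sequence the grammar cannot derive returns `None`.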