Knowledge Acquisition
Definition: the process of extracting knowledge from unstructured text or other data.
Raw text -> refined text -> textual entities and relations -> knowledge bases
- Going from raw text to refined text requires NLP preprocessing
- Going from refined text to textual entities and relations requires information extraction
- The final step requires disambiguation and linking
NLP preprocessing
- structured data - databases
- unstructured data - information retrieval
- Semi-structured data: text almost always has some structure, such as headings and bullets
Standard NLP preprocessing tasks:
Tokenization (word segmentation)
- Token: instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit
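- Type: class of all tokens containing the same character sequence
Queries and documents have to be preprocessed identically.
Tokenization determines which queries can match successfully.
Identical preprocessing guarantees that the sequences in the query are the same as the sequences in the text.
Other issues: hyphens? personal names and place names? language-specific tokenization
State-of-the-art:
Stanford Tokenizer
Apache OpenNLP
NLTK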
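The token/type distinction and the "preprocess queries and documents identically" rule can be sketched in a few lines. This is only an illustration: the regular expression, the `tokenize` helper, and the sample strings are assumptions made for these notes, not the behaviour of the Stanford, OpenNLP, or NLTK tokenizers.

```python
import re

# Assumed toy rule: lowercase, then keep runs of letters/digits,
# allowing internal hyphens and apostrophes (e.g. "state-of-the-art").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")

def tokenize(text):
    """Return every token occurrence in text, in order."""
    return TOKEN_RE.findall(text.lower())

document = "The state-of-the-art tokenizer splits the text; the text is then indexed."
query = "State-of-the-art TEXT"

doc_tokens = tokenize(document)    # tokens: every occurrence counts
doc_types = set(doc_tokens)        # types: distinct character sequences
query_tokens = tokenize(query)     # same preprocessing as the document

# A query term can only match if it is identical to some document type.
matches = [t for t in query_tokens if t in doc_types]

print(len(doc_tokens), "tokens,", len(doc_types), "types")
print("matching query terms:", matches)
```

If the query were tokenized differently (for example, splitting on hyphens while the document keeps them), "state-of-the-art" would no longer match, which is the point of preprocessing both sides the same way.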
Stemming or lemmatization
Goal of lemmatization: reduce inflectional forms (all variants of a word) to the base form
Lemmatization: reduction to the dictionary headword form
Lemma: dictionary form of a set of words
Goal of stemming: reduce terms to their “roots”
Stemming: crude affix chopping
Stem: root form of a set of words, not necessarily a word itself
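A rough sketch of the difference between the two, under heavy assumptions: the suffix list and the lemma dictionary below are made up for illustration and are far smaller than what a real stemmer or lemmatizer uses.

```python
# Assumed toy suffix list; real stemmers apply many more rules.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def crude_stem(word):
    """Chop the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Assumed toy dictionary mapping inflected forms to their headword (lemma).
LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be", "ponies": "pony"}

def lemmatize(word):
    """Look up the dictionary form; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["connected", "connecting", "connects", "ponies", "was"]:
    print(w, "-> stem:", crude_stem(w), "| lemma:", lemmatize(w))
```

In this sketch "ponies" is stemmed to "pon", which is not a word, while the dictionary lookup returns the lemma "pony"; "was" is untouched by suffix chopping but lemmatizes to "be".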
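Stopword removal
Stopwords carry almost no meaning, are extremely common, and appear in almost every document.
Why remove stopwords: they are kept out of the IR dictionary, which saves a lot of storage space and speeds up queries.
Trend: stopword removal is no longer performed, because better compression techniques and better query techniques are available.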
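A minimal sketch of stopword filtering before indexing. The stopword set here is an assumption for illustration; real lists are longer and language-specific.

```python
# Assumed toy stopword list.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def remove_stopwords(tokens):
    """Drop tokens that are in the stopword list, keeping the original order."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = "the history of the city of london".split()
content_tokens = remove_stopwords(tokens)

# Fewer tokens to index means a smaller dictionary and shorter postings.
print(content_tokens)
print(len(tokens) - len(content_tokens), "of", len(tokens), "tokens removed")
```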
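POS tagging (part-of-speech tagging)
Allows for a higher degree of abstraction when estimating likelihoods
Automatic tagging: Penn Treebank tagset
Tagging procedure:
Input: the tokenized words
Output: the chain of tokens with their POS tags
Goal: the most likely sequence of POS tags for the token sequence
HMM (hidden Markov model) for POS tagging
Usually used together with the Viterbi algorithm
Once a model has been trained, we want to find the most likely POS tags for a given word sequence; this task is called decoding.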
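Decoding with the Viterbi algorithm can be sketched as follows. Everything numeric here is an assumption: the three-tag set and the start, transition, and emission probabilities are invented for illustration and are not estimated from the Penn Treebank or any other corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for words under an HMM."""
    # best[i][t]: probability of the best tag path that ends in tag t at position i
    best = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] * trans_p[p][t]) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score * emit_p[t].get(words[i], 0.0)
            back[i][t] = prev
    # Start from the best final tag and follow the back-pointers.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Assumed toy model parameters.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "barks": 0.1, "walk": 0.4},
    "VERB": {"barks": 0.6, "walk": 0.4},
}

words = ["the", "dog", "barks"]
tagged = viterbi(words, tags, start_p, trans_p, emit_p)
print(list(zip(words, tagged)))
```

In this toy model "barks" can be emitted by both NOUN and VERB; the transition from NOUN to VERB makes the VERB reading the more probable one, so the decoded sequence is DET NOUN VERB.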
Parsing
Construct a tree that represents the syntactic structure of the string according to some grammar
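One way to build such a tree is chart parsing. The sketch below uses the CKY algorithm, which is a choice made for these notes rather than something the source prescribes. The grammar (in Chomsky normal form) and the lexicon are assumed toy examples, each word has a single category, and only one tree per category is kept in each chart cell.

```python
# Assumed toy grammar: (left child, right child) -> parent category.
BINARY_RULES = {("NP", "VP"): "S", ("DET", "N"): "NP", ("V", "NP"): "VP"}
# Assumed toy lexicon: one category per word.
LEXICON = {"the": "DET", "dog": "N", "cat": "N", "chased": "V"}

def cky_parse(words):
    """Return a parse tree as nested tuples, or None if no S covers the input."""
    n = len(words)
    if n == 0:
        return None
    # chart[(i, j)] maps a category to one tree covering words[i:j].
    chart = {}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = {LEXICON[w]: (LEXICON[w], w)} if w in LEXICON else {}
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):  # split point between the two children
                for left_cat, left_tree in chart[(i, k)].items():
                    for right_cat, right_tree in chart[(k, j)].items():
                        parent = BINARY_RULES.get((left_cat, right_cat))
                        if parent and parent not in cell:
                            cell[parent] = (parent, left_tree, right_tree)
            chart[(i, j)] = cell
    return chart[(0, n)].get("S")

tree = cky_parse("the dog chased the cat".split())
print(tree)
```

The result is a nested tuple whose first element is the category and whose remaining elements are the subtrees (or the word itself at the leaves).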