NLP

Knowledge Acquisition

Definition: the process of extracting knowledge from unstructured text into structured form (e.g. knowledge bases).

Raw text -> refined text -> textual entities and relations -> knowledge bases

  • Raw text -> refined text: NLP preprocessing
  • Refined text -> textual entities and relations: information extraction
  • Final step: disambiguation and linking

NLP preprocessing

  • structured data - databases
  • unstructured data - information retrieval
  • semi-structured data - text always has some structure, e.g. headings and bullets

Standard NLP preprocessing tasks

  1. Tokenization (分词)
    - Token: instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit
    - Type: class of all tokens containing the same character sequence

    Queries and documents have to be preprocessed identically:
    tokenization determines which queries can match successfully,
    since the token sequences in the query must match the token sequences in the text.
    Other issues:
    hyphenation? names and place names? tokenization is language-specific

    *State-of-the-art tools:
    Stanford Tokenizer
    Apache OpenNLP
    NLTK
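
    A minimal tokenization sketch, assuming NLTK (listed above) is installed; the sample sentence and downloaded resource are illustrative.

    ```python
    # Tokenize a sentence with NLTK's word_tokenize.
    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)  # tokenizer models

    text = "San Francisco-based start-ups don't tokenize themselves."
    tokens = word_tokenize(text)
    print(tokens)
    # Note: the hyphenated words stay together while the contraction "don't"
    # is split into "do" and "n't": the kind of decisions discussed above.
    ```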

  2. Stemming(词干提取) or lemmatization(词形还原)
    Goal of lemmatisation: reduce inflectional forms (all variants of a word) to base form
    lemmatisation: reduction to dictionary headword form
    lemma: dictionary form of a set of words
    Goal of stemming: reduce terms to their "roots"
    Stemming: crude affix chopping
    stem: root form of a set of words, not necessarily a word itself
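
    A short sketch contrasting the two, assuming NLTK; the example words are chosen for illustration.

    ```python
    # Compare stemming (crude affix chopping) with lemmatization
    # (reduction to the dictionary headword) using NLTK.
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # lemmatizer dictionary

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["studies", "studying", "ponies"]:
        print(word,
              "-> stem:", stemmer.stem(word),          # may not be a real word
              "| lemma:", lemmatizer.lemmatize(word))  # dictionary form
    # e.g. "studies" -> stem "studi" (not a word), lemma "study"
    ```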

  3. Stopword removal
    Stopwords carry almost no meaning, are extremely common, and appear in nearly every document.
    Benefit of removal: stopwords are excluded from the IR dictionary, saving significant storage space and speeding up queries.

    *Trend: stopword removal is often skipped nowadays, thanks to better compression and better query-processing techniques.
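
    A minimal stopword-removal sketch, assuming NLTK's English stopword list; the sentence is illustrative.

    ```python
    # Drop high-frequency function words using NLTK's stopword list.
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download("stopwords", quiet=True)
    nltk.download("punkt", quiet=True)

    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize("This is a short example of stopword removal in a document.")
    content_tokens = [t for t in tokens if t.lower() not in stop_words]
    print(content_tokens)  # only the content-bearing tokens remain
    ```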

  4. POS tagging (part-of-speech)
    Allows for a higher degree of abstraction when estimating likelihoods
    Automatic tagging: Penn Treebank tagset
    Tagging workflow:
    Input: tokenized words
    Output: the chain of tokens and their POS tags
    Goal: the most likely sequence of POS tags for the input (see the sketch below)
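
    A quick illustration of that workflow, assuming NLTK and its pre-trained Penn Treebank tagger:

    ```python
    # Input: tokenized words; output: (token, POS tag) pairs.
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("The dog barks loudly.")
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB'), ('.', '.')]
    # Tags follow the Penn Treebank tagset mentioned above.
    ```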

    HMM (Hidden Markov Model) for POS tagging
    Usually used together with the Viterbi algorithm.
    Once a model is trained, finding the most likely POS sequence for a given word sequence is called decoding (see the Viterbi sketch below).
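
    A toy Viterbi decoding sketch: the tag set, transition, and emission probabilities below are hand-made assumptions (not trained values), just to show how the most likely tag sequence is recovered.

    ```python
    # Viterbi decoding for a tiny HMM POS tagger.
    states  = ["DT", "NN", "VB"]
    start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
    trans_p = {"DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},
               "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
               "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1}}
    emit_p  = {"DT": {"the": 0.9, "dog": 0.0, "barks": 0.0},
               "NN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
               "VB": {"the": 0.0, "dog": 0.1, "barks": 0.9}}

    def viterbi(words):
        # V[t][s]: probability of the best tag path ending in state s at position t
        V = [{s: start_p[s] * emit_p[s][words[0]] for s in states}]
        back = [{}]
        for t in range(1, len(words)):
            V.append({}); back.append({})
            for s in states:
                prob, prev = max((V[t - 1][p] * trans_p[p][s] * emit_p[s][words[t]], p)
                                 for p in states)
                V[t][s], back[t][s] = prob, prev
        # Backtrack from the best final state to recover the tag sequence.
        best = max(V[-1], key=V[-1].get)
        path = [best]
        for t in range(len(words) - 1, 0, -1):
            path.insert(0, back[t][path[0]])
        return path

    print(viterbi(["the", "dog", "barks"]))  # -> ['DT', 'NN', 'VB']
    ```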

  5. Parsing
    Construct a tree that represents the syntactic structure of a string according to a grammar (see the sketch below)
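
    A minimal parsing sketch, assuming NLTK and a toy context-free grammar made up for this example:

    ```python
    # Parse a sentence into a syntax tree with a toy CFG and NLTK's chart parser.
    import nltk

    grammar = nltk.CFG.fromstring("""
      S  -> NP VP
      NP -> DT NN
      VP -> VB NP
      DT -> 'the'
      NN -> 'dog' | 'cat'
      VB -> 'chases'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(["the", "dog", "chases", "the", "cat"]):
        print(tree)  # (S (NP (DT the) (NN dog)) (VP (VB chases) (NP (DT the) (NN cat))))
    ```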
