Lecture 5 Part of Speech Tagging

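These notes discuss the importance of part of speech (POS) in information extraction, introduce POS categories such as nouns, verbs, and adjectives, and cover open and closed word classes. They also cover the ambiguity of word classes and the motivation and methods for automatic POS tagging, including rule-based, statistical, and hidden Markov model taggers. Automatic tagging is essential for processing large-scale text and for tasks such as information retrieval and sentiment analysis, but it faces the challenge of unknown words.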

Part of Speech (POS)

  • Also called word classes, morphological classes, or syntactic categories

  • E.g.: nouns, verbs, adjectives

  • POS tells us about a word and its neighbors:

    • Nouns are often preceded by determiners
    • Verbs are often preceded by nouns
    • content as a noun is pronounced /ˈkɑːntent/
    • content as an adjective is pronounced /kənˈtent/
POS application: Information Extraction
  • Given sentence: “Brasilia, the Brazilian capital, was founded in 1960”

  • Extract information:

    • capital(Brazil, Brasilia)
    • founded(Brasilia, 1960)
  • The first step of information extraction is finding all POS tags:

    • nouns: Brasilia, capital
    • adjective: Brazilian
    • verbs: founded
    • numbers: 1960
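
As a rough illustration of that first step, the sketch below groups the words of the example sentence by coarse POS category. The tags are hand-assigned Penn Treebank-style tags (an assumption for illustration, since a real pipeline would obtain them from a tagger), and `group_by_pos` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-assigned Penn Treebank-style tags for the example sentence
# (an assumption for illustration; a real system would run a tagger).
tagged_sentence = [
    ("Brasilia", "NNP"), (",", ","), ("the", "DT"), ("Brazilian", "JJ"),
    ("capital", "NN"), (",", ","), ("was", "VBD"), ("founded", "VBN"),
    ("in", "IN"), ("1960", "CD"),
]

def group_by_pos(tagged):
    """Collect words under the coarse categories used in the notes."""
    # The first two characters of a tag identify its base class (NN, JJ, VB, CD).
    coarse = {"NN": "nouns", "JJ": "adjectives", "VB": "verbs", "CD": "numbers"}
    groups = defaultdict(list)
    for word, tag in tagged:
        label = coarse.get(tag[:2])
        if label:
            groups[label].append(word)
    return dict(groups)

groups = group_by_pos(tagged_sentence)
for label, words in groups.items():
    print(label, words)
```

Note that the auxiliary "was" is also tagged as a verb here, so it appears alongside "founded" in the verb group.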

POS Open and Closed Classes

  • Open vs. closed: How readily do POS categories take on new words?

  • E.g. of open classes:

    • Nouns:
      • Proper vs. common: Australia, wombat
      • Mass vs. count: rice, bowls
    • Verbs:
      • Rich inflection: go/goes/going/gone/went
      • Auxiliary verbs: be, have, do
      • Transitivity: wait (intransitive), hit (transitive), give (ditransitive)
    • Adjectives:
      • Gradable vs. non-gradable: happy/happier/happiest, computational
    • Adverbs:
      • Manner: slowly
      • Locative: here
      • Degree: really
      • Temporal: today
  • E.g. of closed classes:

    • Prepositions:
      • in, on, with, for, of, over
    • Particles:
      • off
    • Determiners:
      • Articles: a, an, the
      • Demonstratives: this, that, these, those
      • Quantifiers: each, every, some, two
    • Pronouns:
      • Personal: I, me, she
      • Possessive: my, our
      • Interrogative: who, what
    • Conjunctions:
      • Coordinating: and, or, but
      • Subordinating: if, although, that
    • Modal verbs:
      • Ability: can, could
      • Permission: can, may
      • Possibility: may, might, could, will
      • Necessity: must

Problem of word classes: Ambiguity

  • Many word types belong to multiple classes

  • POS depends on context

  • E.g.: flies

    [Figure: two example sentences containing the word “flies”]

    • In the first sentence, flies is an inflected form of the verb “fly”
    • In the second sentence, flies is the plural form of the noun “fly”
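
To make the ambiguity concrete, the following sketch collects the set of tags observed for each word type in a tiny hand-tagged corpus and reports the types that carry more than one tag. The two sentences are a well-known illustrative pair chosen here as an assumption, not necessarily the ones shown in the figure, and the tags are hand-assigned.

```python
from collections import defaultdict

# Two hand-tagged toy sentences (illustrative; not taken from the notes).
corpus = [
    [("time", "NN"), ("flies", "VBZ"), ("like", "IN"), ("an", "DT"), ("arrow", "NN")],
    [("fruit", "NN"), ("flies", "NNS"), ("like", "VBP"), ("a", "DT"), ("banana", "NN")],
]

def tags_per_word(sentences):
    """Map each word type to the set of tags it was seen with."""
    tags = defaultdict(set)
    for sentence in sentences:
        for word, tag in sentence:
            tags[word].add(tag)
    return tags

# Word types observed with more than one tag are ambiguous.
ambiguous = {w: sorted(t) for w, t in tags_per_word(corpus).items() if len(t) > 1}
print(ambiguous)
```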

Tagsets

Tagsets

  • A compact representation of POS information

    • Usually no more than 4 capitalized characters. E.g. NN = noun
    • Often includes inflectional distinctions
  • Major English tagsets:

    • Brown: 87 tags
    • Penn Treebank: 45 tags
    • CLAWS/BNC: 61 tags
    • Universal: 12 tags
  • At least one tagset exists for every major language

Penn Treebank Tags:

  • Open classes:

    • NN: noun
    • VB: verb
    • JJ: adjective
    • RB: adverb
  • Closed classes:

    • DT: determiner
    • CD: cardinal number
    • IN: preposition
    • PRP: personal pronoun
    • MD: modal
    • CC: coordinating conjunction
    • RP: particle
    • WP: wh-pronoun
    • TO: to

Derived Tags:

  • Open classes:

    • NN (noun, singular):
      • NNS (plural)
      • NNP (proper)
      • NNPS (proper plural)
    • VB (verb, base form):
      • VBP (non-3rd person singular present)
      • VBZ (3rd person singular present)
      • VBD (past tense)
      • VBG (gerund or present participle)
      • VBN (past participle)
    • JJ (adjective):
      • JJR (comparative)
      • JJS (superlative)
    • RB (adverb):
      • RBR (comparative)
      • RBS (superlative)
  • Closed classes:

    • PRP (personal pronoun):
      • PRP$ (possessive)
    • WP (wh-pronoun):
      • WP$ (possessive)
      • WDT (wh-determiner)
      • WRB (wh-adverb)
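
When the inflectional distinctions are not needed, derived tags can be collapsed back to their base tag. The sketch below does this with a look-up table that follows the groupings listed above. `BASE_TAG` and `collapse` are hypothetical names, and the example words and tags are hand-assigned for illustration.

```python
# Derived tag -> base tag, following the groupings listed in the notes.
BASE_TAG = {
    "NNS": "NN", "NNP": "NN", "NNPS": "NN",
    "VBP": "VB", "VBZ": "VB", "VBD": "VB", "VBG": "VB", "VBN": "VB",
    "JJR": "JJ", "JJS": "JJ",
    "RBR": "RB", "RBS": "RB",
    "PRP$": "PRP",
    "WP$": "WP", "WDT": "WP", "WRB": "WP",
}

def collapse(tag):
    """Return the base tag for a derived tag; other tags pass through unchanged."""
    return BASE_TAG.get(tag, tag)

tagged = [("wombats", "NNS"), ("went", "VBD"), ("happier", "JJR"), ("the", "DT")]
coarse = [(word, collapse(tag)) for word, tag in tagged]
print(coarse)
```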

Tagged Text Example

[Figure: tagged text example]

Automatic Tagging

Reasons for automatic POS tagging

  • Important for morphological analysis, e.g. lemmatization

  • For some applications, we want to focus on certain POS

    • E.g. nouns are important for information retrieval, adjectives for sentiment analysis
  • Very useful features for certain classification tasks

    • E.g. genre attribution
  • POS tags can offer word sense disambiguation

    • E.g. cross/NN, cross/VB, cross/JJ all have different meanings
  • Can use them to create larger structures, such as parse trees

Automatic Taggers

  • Rule-based taggers
  • Statistical taggers
    • Unigram tagger
    • Classifier-based tagger
    • Hidden Markov Model tagger

Rule-Based Tagging

  • Typically starts with a list of possible tags for each word, sourced from a lexical resource or a corpus
  • Often includes other lexical information, e.g. verb subcategorization
  • Applies rules to narrow the candidates down to a single tag
  • Large systems have thousands of constraints
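
A minimal sketch of this idea, assuming a tiny hypothetical lexicon and only two hand-written rules (real systems use far more constraints): each word starts with its list of possible tags, and the rules use the previous tag to narrow the list down.

```python
# Hypothetical mini-lexicon: each word lists its possible tags.
LEXICON = {
    "the": ["DT"],
    "a": ["DT"],
    "fly": ["NN", "VB"],
    "flies": ["NNS", "VBZ"],
    "bird": ["NN"],
    "buzz": ["VBP", "NN"],
}

def rule_based_tag(words):
    """Tag left to right, narrowing each word's candidate tags with two rules."""
    tagged = []
    for word in words:
        # Words missing from the lexicon default to a noun guess (an assumption).
        candidates = list(LEXICON.get(word.lower(), ["NN"]))
        prev_tag = tagged[-1][1] if tagged else None
        if prev_tag == "DT":
            # Rule 1: after a determiner, rule out verb readings.
            narrowed = [t for t in candidates if not t.startswith("VB")]
            candidates = narrowed or candidates
        elif prev_tag is not None and prev_tag.startswith("NN"):
            # Rule 2: after a noun, keep verb readings if any exist.
            narrowed = [t for t in candidates if t.startswith("VB")]
            candidates = narrowed or candidates
        # If several candidates remain, fall back to the first one listed.
        tagged.append((word, candidates[0]))
    return tagged

print(rule_based_tag(["the", "bird", "flies"]))
print(rule_based_tag(["the", "flies", "buzz"]))
```

The two rules mirror the earlier observations that nouns are often preceded by determiners and verbs by nouns. They are deliberately crude and will mis-tag many real sentences.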

Unigram Tagger

  • Assigns the most common tag to each word type
  • Requires a corpus of tagged words
  • Just a look-up table
  • Approximately 90% accuracy
  • Often considered the baseline for more complex approaches
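
The look-up table can be built in a few lines. The sketch below assumes a toy hand-tagged corpus and a default tag of NN for words not in the table. Both are illustrative choices, and `train_unigram` and `unigram_tag` are hypothetical names.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Build a word -> most frequent tag look-up table from tagged sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def unigram_tag(table, words, default="NN"):
    """Tag each word independently; words not in the table get the default tag."""
    return [(w, table.get(w, default)) for w in words]

# Toy training corpus (hand-tagged, for illustration only).
train = [
    [("the", "DT"), ("fly", "NN"), ("flies", "VBZ")],
    [("flies", "NNS"), ("fly", "VBP")],
    [("a", "DT"), ("fly", "NN"), ("flies", "VBZ")],
]
table = train_unigram(train)
print(unigram_tag(table, ["the", "fly", "flies", "wombat"]))
```

Because context is ignored, "flies" always receives its majority tag, which is exactly the kind of error the more complex taggers try to fix.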

Classifier-Based Tagging

  • Use a standard discriminative classifier, such as logistic regression or a neural network, with features such as:

    • Target word
    • Lexical context around the word
    • Tags already assigned in the sentence
  • Can suffer from error propagation: wrong predictions from previous steps affect later ones
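
The sketch below shows one way the features listed above might be extracted, and how a greedy left-to-right loop feeds each predicted tag into the next prediction, which is where error propagation comes from. The suffix feature and the `toy_predict` function are assumptions for illustration. In practice `predict` would be a trained classifier.

```python
def extract_features(words, i, previous_tags):
    """Features for tagging words[i], given the tags already assigned to words[:i]."""
    return {
        "word": words[i].lower(),                      # target word
        "suffix3": words[i][-3:].lower(),              # extra feature (assumption)
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
        "prev_tag": previous_tags[-1] if previous_tags else "<s>",
    }

def greedy_tag(words, predict):
    """Tag left to right; each prediction becomes a feature for the next word."""
    tags = []
    for i in range(len(words)):
        tags.append(predict(extract_features(words, i, tags)))
    return list(zip(words, tags))

def toy_predict(features):
    """Stand-in for a trained classifier (hypothetical hand-written rules)."""
    if features["word"] in {"the", "a", "an"}:
        return "DT"
    if features["prev_tag"] == "DT":
        return "NN"
    if features["suffix3"].endswith("s") and features["prev_tag"] == "NN":
        return "VBZ"
    return "NN"

print(greedy_tag(["The", "bird", "flies"], toy_predict))
```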

Hidden Markov Models

  • A basic sequential model
  • Like sequential classifiers, uses both the previous tag and lexical evidence
  • Unlike classifiers, considers all possibilities for the previous tag, and treats previous-tag evidence and lexical evidence as independent of each other
    • Less sparsity
    • Fast algorithms for sequential prediction
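
One such fast algorithm is Viterbi decoding. The sketch below assumes the standard first-order HMM formulation, in which previous-tag evidence is a transition probability and lexical evidence is an emission probability. All probabilities are made-up toy values, and zero probabilities are floored to a small constant so that logarithms stay defined.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most probable tag sequence under a first-order HMM, computed in log space."""
    def lp(p):
        return math.log(max(p, floor))

    # best[i][t]: best log-probability of any tag sequence for words[:i+1] ending in t
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Consider every possible previous tag, not just one prediction.
            prev = max(tags, key=lambda p: best[i - 1][p] + lp(trans_p[p].get(t, 0)))
            best[i][t] = (best[i - 1][prev] + lp(trans_p[prev].get(t, 0))
                          + lp(emit_p[t].get(words[i], 0)))
            back[i][t] = prev
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy parameters (made up for illustration, not estimated from data).
tags = ["DT", "NN", "NNS", "VBZ"]
start_p = {"DT": 0.7, "NN": 0.2, "NNS": 0.1}
trans_p = {
    "DT": {"NN": 0.7, "NNS": 0.3},
    "NN": {"VBZ": 0.6, "NN": 0.2, "NNS": 0.2},
    "NNS": {"DT": 0.5, "NN": 0.3, "NNS": 0.1, "VBZ": 0.1},
    "VBZ": {"DT": 0.6, "NN": 0.2, "NNS": 0.2},
}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"bird": 1.0},
    "NNS": {"flies": 1.0},
    "VBZ": {"flies": 1.0},
}
print(viterbi(["the", "bird", "flies"], tags, start_p, trans_p, emit_p))
print(viterbi(["the", "flies"], tags, start_p, trans_p, emit_p))
```

With these toy values, "flies" is resolved to a verb after a noun and to a plural noun after a determiner, purely through the transition probabilities.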

Unknown Words

  • Huge problem in morphologically rich languages

  • Can use words seen only once in training to make a best guess about words never seen before

    • Unknown words tend to be nouns, followed by verbs
    • They are unlikely to be determiners
  • Can use sub-word representations to capture morphology
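
The "seen only once" heuristic can be sketched as follows: estimate a tag distribution from words that occur exactly once in the training corpus, and use it as the guess for unseen words. The corpus is a toy hand-tagged example and `hapax_tag_distribution` is a hypothetical name.

```python
from collections import Counter

def hapax_tag_distribution(tagged_sentences):
    """Tag distribution over words that occur exactly once in the corpus."""
    word_counts = Counter(w for s in tagged_sentences for w, _ in s)
    hapax_tags = Counter(
        t for s in tagged_sentences for w, t in s if word_counts[w] == 1
    )
    total = sum(hapax_tags.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in hapax_tags.items()}

# Toy training corpus (hand-tagged, for illustration only).
corpus = [
    [("the", "DT"), ("wombat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("koala", "NN"), ("eats", "VBZ"), ("the", "DT"), ("leaf", "NN")],
    [("the", "DT"), ("koala", "NN"), ("sleeps", "VBZ")],
]
dist = hapax_tag_distribution(corpus)
guess = max(dist, key=dist.get)  # tag to assign to a word never seen in training
print(dist, guess)
```

In this toy corpus the once-seen words are mostly nouns, then verbs, and never determiners, in line with the tendencies noted above.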
