EE6405 - Natural Language Processing Week 2 (LEC)

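This post covers the key techniques behind information extraction systems, including named entity recognition (NER), part-of-speech tagging and dependency parsing, along with the use of the Python library spaCy. These techniques are used to locate key information in text, organise and categorise entities, and generate structured data in support of applications such as sentiment analysis, relation identification and low-level information retrieval.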

Linguistic Analysis and Information Extraction

Information Extraction Systems

▪ Locate and comprehend pertinent sections of text.

▪ Summarise useful information across documents.

▪ Produce a structured representation of the information.

Goals:

▪ Information Organisation: To organise information in a way that is useful for human understanding.

▪ Entity and Relationship Identification: To identify entities (e.g. names/organisations) and the relationships between them, as expressed within the text.

▪ Generation of Structured Data: To put the information contained within text into a semantically precise form that allows further inferences to be made by algorithms.

Named Entity Recognition (NER)

Most of the important information in a text lies within named entities.

▪ These include names, locations, companies, dates etc.

▪ NER works by extracting the most important pieces of information from unstructured text.

▪ It delivers critical insights by picking up mentions of certain organisations/people.

NER primarily focuses on:

▪ Organisations

▪ Locations

▪ Dates

▪ Persons

▪ Events

▪ These entities can be adjusted depending on the nature of the task.

Any NER task needs to accomplish two basic goals:

Detecting a named entity

Detecting a word or a string of words that form an entity. With each word representing a token, “United Overseas Bank” is a string of three tokens representing one entity.

Categorising the entity

Entity categories need to be created based on the task at hand. Common categories include people, organisations and time. Granular rules need to be created in order to classify these entities into their respective subcategories.
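The two goals above can be sketched with a small dictionary lookup. This is only an illustration and not how spaCy's statistical tagger works: the `GAZETTEER` entries, the category names and the `find_entities` helper are all hypothetical, and tokenisation is naive whitespace splitting so that one word equals one token.

```python
# Hypothetical gazetteer: entity phrase (a tuple of tokens) -> category.
# The entries and category names are illustrative assumptions.
GAZETTEER = {
    ("United", "Overseas", "Bank"): "ORGANISATION",
    ("Singapore",): "LOCATION",
    ("Tolkien",): "PERSON",
}
MAX_LEN = max(len(phrase) for phrase in GAZETTEER)


def find_entities(text):
    """Return (entity_text, category, start_token, end_token) tuples."""
    tokens = text.split()  # naive whitespace tokenisation: one word = one token
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Goal 1 (detection): try the longest candidate span first.
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tok.strip(".,") for tok in tokens[i:i + length])
            if span in GAZETTEER:
                # Goal 2 (categorisation): look up the span's category.
                match = (" ".join(span), GAZETTEER[span], i, i + length)
                break
        if match:
            entities.append(match)
            i = match[3]  # continue after the matched span
        else:
            i += 1
    return entities


print(find_entities("United Overseas Bank opened a branch in Singapore."))
# [('United Overseas Bank', 'ORGANISATION', 0, 3), ('Singapore', 'LOCATION', 7, 8)]
```

The three tokens of “United Overseas Bank” come back as a single entity spanning tokens 0 to 3, which is the detection step; the category attached to it is the categorisation step.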

E.g. Using spaCy
import spacy
from spacy import displacy            # displaCy is spaCy's built-in visualiser; it’ll help us better visualise the data later.

# The missing spaces (e.g. "andBosworth") are part of the text as it was run; the output below reflects them.
raw_text = "From 1925 to 1945, Tolkien was the Rawlinson andBosworth Professor of Anglo-Saxon and a Fellow ofPembroke College, both at the University of Oxford.He then moved within the same university to becomethe Merton Professor of English Language andLiterature and Fellow of Merton College, and heldthese positions from 1945 until his retirement in1959. Tolkien was a close friend of C. S. Lewis, a co-member of the informal literary discussion group TheInklings. He was appointed a Commander of theOrder of the British Empire by Queen Elizabeth II on28 March 1972."

NER = spacy.load("en_core_web_sm")    # en_core_web_sm is a pretrained English pipeline that includes an NER tagger.
text1 = NER(raw_text)                 # Calling the pipeline on the text returns a processed Doc object.

for word in text1.ents:               # The tagged named entities can be accessed via the ents property.
    print(word.text, word.label_)     # Each entity is a Span (an ordered sequence of tokens) with a label.
1925 to 1945 DATE
Tolkien PERSON
Anglo NORP
Fellow ofPembroke College ORG
the University of Oxford ORG
Merton DATE
English LANGUAGE
Fellow of Merton College ORG
heldthese NORP
1945 DATE
Tolkien PERSON
C. S. Lewis PERSON
TheInklings ORG
theOrder ORG
the British Empire GPE
Elizabeth II PERSON
March 1972 DATE
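
The (text, label) pairs printed above are already close to structured data. As a minimal sketch, a few of them can be grouped by label into a record that other code can query. The `group_by_label` helper is hypothetical, and the pairs are typed in by hand from the output above rather than produced by running spaCy here.

```python
import json
from collections import defaultdict

# A few (text, label) pairs copied by hand from the output above.
entities = [
    ("Tolkien", "PERSON"),
    ("the University of Oxford", "ORG"),
    ("1945", "DATE"),
    ("Tolkien", "PERSON"),
    ("C. S. Lewis", "PERSON"),
    ("Elizabeth II", "PERSON"),
    ("March 1972", "DATE"),
]


def group_by_label(pairs):
    """Group entity texts by label, dropping repeated mentions but keeping order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


record = group_by_label(entities)
print(json.dumps(record, indent=2))
```

With a live pipeline, the same helper could be fed `[(ent.text, ent.label_) for ent in text1.ents]` instead of the hand-copied list.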

Tips: download en_core_web_sm
$ python -m spacy download en_core_web_sm

Applications

▪ Performing sentiment analysis toward a company or product.

▪ Relation extraction – A large number of information extraction (IE) relations are relations between named entities.

▪ Answer detection – Answers often come in the form of named entities.

▪ Low-level information retrieval – Assists in queries on the nature/history of an entity.


Part-Of-Speech Tagging

▪ Words found in natural languages can be categorised based on their roles and functions within a sentence. Common categories include:

• Nouns

• Verbs

• Adjectives

• Adverbs

• Conjunctions

▪ By analysing the structural and semantic context of words in speech, text can be processed more accurately.

▪ E.g.: Depending on context, the word ‘run’ can either be a noun or a verb. POS tagging allows computers to contextualise these words for more accurate processing and comprehension.

▪ POS tagging aims to automate the task of tagging parts of speech to determine the syntactic and grammatical role of each word in the context of a sentence.

▪ Tagging can be done using linguistic patterns, context and predefined dictionaries.

▪ POS tagging can also be achieved using probability models, such as Hidden Markov Models.
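
Tagging with a predefined dictionary plus context can be sketched in a few lines. The `LEXICON`, the tag names, the default tag for unknown words and the single contextual rule are all assumptions made for illustration; a practical tagger would need far more entries and rules.

```python
# Hypothetical lexicon: each word maps to its possible tags, most likely first.
LEXICON = {
    "i": ["PRON"], "she": ["PRON"],
    "the": ["DET"], "a": ["DET"],
    "went": ["VERB"], "want": ["VERB"],
    "for": ["ADP"], "to": ["PART"],
    "long": ["ADJ"],
    "run": ["VERB", "NOUN"],  # ambiguous: can be a verb or a noun
}


def tag(sentence):
    """Tag each word using the lexicon plus one contextual rule."""
    tagged = []
    for word in sentence.lower().split():
        options = LEXICON.get(word, ["NOUN"])  # assumption: unknown words default to NOUN
        choice = options[0]
        previous_tag = tagged[-1][1] if tagged else None
        # Contextual rule: after a determiner or adjective, prefer the noun reading.
        if "NOUN" in options and previous_tag in ("DET", "ADJ"):
            choice = "NOUN"
        tagged.append((word, choice))
    return tagged


print(tag("I went for a run"))  # 'run' follows a determiner, so it is tagged NOUN
print(tag("I want to run"))     # 'run' follows 'to', so it keeps its default tag VERB
```

The same word receives different tags in the two sentences purely because of the tag of the word before it, which is the kind of context that the probability models below capture with learned numbers instead of hand-written rules.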

Hidden Markov Models:

▪ HMM (Hidden Markov Model) is a stochastic technique for POS tagging.

▪ Works by leveraging state transitions and observations to determine the most likely sequence of POS tags for a sentence.

Components of an HMM include:

▪ States (Specific POS tags like ‘noun’, ‘verb’, ‘adjective’)

▪ Observations (Each observation corresponds to a word in a sentence)

Two sets of probabilities are used:

▪ Transition Probability: The probability of transitioning from one POS tag to another.

▪ Emission Probability: The probability of a certain word being emitted from a POS tag.
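
To make the two sets of probabilities concrete, here is a minimal sketch of decoding the most likely tag sequence with the Viterbi algorithm, a standard dynamic-programming decoder for HMMs (the notes above do not name a specific decoder). Every number below is invented for illustration, whereas a real tagger would estimate them from a tagged corpus. The `START` probabilities and the `UNSEEN` fallback for words a tag never emits are extra assumptions that are not part of the notes.

```python
STATES = ["DET", "NOUN", "VERB"]

# Assumed probability of each tag starting a sentence (illustrative values).
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}

# Transition probability: P(next tag | current tag). Illustrative values.
TRANSITION = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}

# Emission probability: P(word | tag). Illustrative values.
EMISSION = {
    "DET": {"the": 0.9},
    "NOUN": {"dogs": 0.5, "run": 0.2},
    "VERB": {"run": 0.6, "dogs": 0.01},
}
UNSEEN = 1e-6  # assumed fallback for words a tag was never seen to emit


def viterbi(words):
    """Return the most likely tag sequence for a list of words."""
    if not words:
        return []
    # best[t][s] = (probability of the best path ending in tag s at word t, previous tag)
    best = [{s: (START[s] * EMISSION[s].get(words[0], UNSEEN), None) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            # Pick the previous tag that maximises path probability * transition.
            prev, score = max(
                ((p, best[-1][p][0] * TRANSITION[p][s]) for p in STATES),
                key=lambda item: item[1],
            )
            column[s] = (score * EMISSION[s].get(word, UNSEEN), prev)
        best.append(column)
    # Backtrack from the best final tag to recover the whole sequence.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


print(viterbi(["the", "run"]))   # ['DET', 'NOUN']
print(viterbi(["dogs", "run"]))  # ['NOUN', 'VERB']
```

With these toy numbers, ‘run’ is tagged as a noun after ‘the’ and as a verb after ‘dogs’: the transition probabilities supply the context, and the emission probabilities tie each tag to the observed word.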

Applications

▪ Text classification: POS tagging can aid in categorising texts into various groups, for example when conducting sentiment analysis. By examining the part-of-speech tags assigned to words within a text, algorithms can better understand the text's subject matter.

▪ Machine translation: By identifying the grammatical structure and relationships between words in the source language, and mapping them to the target language, POS tagging can be used to help translate texts.

▪ Natural language generation: POS tagging can be used to generate natural-sounding text by selecting appropriate words and constructing grammatically correct sentences. This is useful for tasks such as chatbots and virtual assistants.


Dependency Parsing

Dependency parsing identifies which words are the main components (heads) and which words depend on them (modifiers).

▪ Each relationship is assigned a label (e.g., subject, object, modifier) to indicate the grammatical role of the dependent word in relation to the head word.

▪ These relationships form a tree-like structure, revealing the hierarchical organisation of the sentence.
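
The head/dependent relationships and the tree they form can be sketched with a plain list. The parse of “She reads old books” below is written by hand for illustration and is not the output of a parser; the `children_of` and `render` helpers are hypothetical, and the labels reuse the examples given above.

```python
# Hand-written parse of "She reads old books" (an illustrative assumption).
# Each entry is (word, index of its head word, label); the root has no head.
PARSE = [
    ("She", 1, "subject"),
    ("reads", None, "root"),
    ("old", 3, "modifier"),
    ("books", 1, "object"),
]


def children_of(parse):
    """Map each word index to the indices of the words that depend on it."""
    children = {i: [] for i in range(len(parse))}
    for i, (_, head, _) in enumerate(parse):
        if head is not None:
            children[head].append(i)
    return children


def render(parse, node=None, depth=0, children=None):
    """Return the tree as indented lines, starting from the root."""
    if children is None:
        children = children_of(parse)
    if node is None:
        # The root is the only word without a head.
        node = next(i for i, (_, head, _) in enumerate(parse) if head is None)
    word, _, label = parse[node]
    lines = ["  " * depth + f"{word} ({label})"]
    for child in children[node]:
        lines.extend(render(parse, child, depth + 1, children))
    return lines


print("\n".join(render(PARSE)))
# reads (root)
#   She (subject)
#   books (object)
#     old (modifier)
```

The indentation shows the hierarchy: ‘reads’ heads the sentence, ‘She’ and ‘books’ depend on it, and ‘old’ depends on ‘books’.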


Summary

▪ Information Extraction Systems, including:

  • Named Entity Recognition: The process of identifying and categorising specific named entities, such as names, dates, locations, and more, within text.

▪ Part-Of-Speech Tagging: Assigning grammatical labels to words in a sentence to indicate their syntactic roles and categories, like nouns, verbs, adjectives, etc.

▪ Dependency Parsing: A parsing technique that analyses the syntactic structure of a sentence by identifying the relationships between words, showing how they depend on one another.

▪ spaCy: A Python library that provides tools and resources for information extraction.
