spaCy noun phrases: a quick guide to tokenization, lemmatization, stop words and phrase matching with spaCy NLP

This article is a quick guide to tokenization, lemmatization, stop-word handling and phrase matching with the spaCy library, with a particular focus on using spaCy to extract noun phrases.

spaCy Noun Phrases

spaCy is designed specifically for production use. It helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. In this article, you will learn about tokenization, lemmatization, stop words and phrase matching operations using spaCy.

This is article 2 in the spaCy series. In my last article, I explained spaCy installation and basic operations. If you are new to this, I would suggest starting from article 1 for a better understanding.

Article 1 — spaCy-installation-and-basic-operations-nlp-text-processing-library/

Tokenization

Tokenization is the first step in any text processing task. It is not just breaking the text into pieces such as words and punctuation marks, known as tokens; it is more than that. spaCy's tokenizer is intelligent: it internally decides whether a "." is punctuation that should be separated into its own token, or part of an abbreviation like "U.S." that should be left intact.

spaCy applies rules specific to the language being processed. Let's understand this with an example.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("\"Next Week, We're coming from U.S.!\"")
# Print each token produced by the tokenizer
for token in doc:
    print(token.text)

  • spaCy first splits the raw text on the whitespace available in it.

  • Then it processes the text from left to right and, on each whitespace-separated substring, performs the following two checks:

  • Exception rule check: the punctuation in "U.S." should not be split into further tokens; it should remain a single token. However, "We're" should be split into "We" and "'re".

  • Prefix, suffix and infix check: punctuation such as commas, periods, hyphens or quotes is treated as separate tokens and split off.

If there is a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
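If you want to see which rule produced each token, newer spaCy versions expose a debugging helper, nlp.tokenizer.explain(), which returns (rule name, token text) pairs. A minimal sketch, assuming the nlp object loaded above and a spaCy version recent enough (roughly v2.2.3+) to ship this helper:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "\"Next Week, We're coming from U.S.!\""
# Each entry names the pattern that matched: PREFIX, SUFFIX, INFIX,
# SPECIAL-n (an exception rule) or TOKEN (a plain whitespace split).
for rule, token_text in nlp.tokenizer.explain(text):
    print(rule, "->", token_text)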

  • Prefix: look for character(s) at the beginning ▸ $ ( " ¿

  • Suffix: look for character(s) at the end ▸ mm ) , . ! "  (mm is an example of a unit)

  • Infix: look for character(s) in between ▸ - -- / ...

  • Exception: special-case rule to split a string into several tokens, or to prevent a token from being split when punctuation rules are applied ▸ St. N.Y. (a sketch of adding such a special case follows this list)
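If the tokenizer does not already know an abbreviation you care about, you can register it as an exception yourself via add_special_case. A minimal sketch, assuming the nlp object from above; the abbreviation "Eng." is only an illustrative assumption, not something from the original article:

from spacy.symbols import ORTH

# Hypothetical abbreviation, used only for illustration: keep "Eng." together
# instead of letting the suffix rule split off the trailing period.
nlp.tokenizer.add_special_case("Eng.", [{ORTH: "Eng."}])

doc = nlp(u"She holds an Eng. degree.")
print([t.text for t in doc])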

Notice that tokens are pieces of the original text. Tokens are the basic building blocks of a Doc object: everything that helps us understand the meaning of the text is derived from tokens and their relationships to one another.
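Because everything downstream is derived from tokens, it is worth knowing that a Doc behaves like a sequence of Token objects: it can be indexed and sliced (a slice gives a Span), and each token carries attributes such as its position and character offset. A small sketch with an assumed example sentence, reusing the nlp object from above:

doc_demo = nlp(u"Tokens are the building blocks of a Doc object.")

print(len(doc_demo))      # number of tokens in the Doc
print(doc_demo[0].text)   # first token
print(doc_demo[3:6])      # a Span covering tokens 3, 4 and 5
for token in doc_demo:
    # token.i is the index within the Doc, token.idx the character offset in the text
    print(token.i, token.idx, token.text)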

Prefixes, Suffixes and Infixes as Tokens

  • spaCy will separate punctuation that does not form an integral part of a word.

  • Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own tokens.

  • However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

doc2 = nlp(u"We're here to guide you! Send your query, email contact@enetwork.ai or visit us at http://www.enetwork.ai!")for t in doc2:
print(t)

Note that the exclamation points and the comma are assigned their own tokens. However, the periods and the colon inside the email address and the website URL are not split off, so both the email address and the URL are preserved as single tokens.
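One way to confirm this behaviour is to check the built-in token flags like_email and like_url, which spaCy sets on tokens that look like an email address or a URL. A small sketch reusing doc2 from the snippet above:

# doc2 was created in the snippet above; print only the tokens spaCy
# recognises as an email address or a URL.
for t in doc2:
    if t.like_email or t.like_url:
        print(t.text, "like_email:", t.like_email, "like_url:", t.like_url)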

doc3 = nlp(u'A 40km U.S. cab ride costs $100.60')
# Print each token; the unit, currency symbol and amount get handled differently
for t in doc3:
    print(t)

Here the distance unit and the dollar sign are assigned their own tokens; the dollar amount, however, is preserved as a single token, because the decimal point inside the number is not split off.
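Similarly, the token flags like_num and is_currency make it easy to check that the amount stayed in one piece and that the dollar sign became its own token. A small sketch reusing doc3 from above:

# doc3 was created in the snippet above.
for t in doc3:
    print(t.text, "like_num:", t.like_num, "is_currency:", t.is_currency)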

Exceptions in Token Generation

Punctuation that exists as part of a known abbreviation will be kept as part of the token.

doc4 = nlp(u"Let's visit the St. Louis in the U.S. next year.")
for t in doc4:
print(t)