spaCy noun phrases: a quick guide to tokenization, lemmatization, stop words and phrase matching with spaCy NLP

This article is a quick guide to tokenization, lemmatization, stop-word handling and phrase matching with the spaCy library, with a particular focus on using spaCy to extract noun phrases.

spaCy Noun Phrases

spaCy is designed specifically for production use. It helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. In this article, you will learn about tokenization, lemmatization, stop words and phrase matching operations using spaCy.

This is article 2 in the spaCy series. In my last article, I explained spaCy installation and basic operations. If you are new to this, I would suggest starting from article 1 for a better understanding.

Article 1 — spaCy-installation-and-basic-operations-nlp-text-processing-library/

Tokenization

Tokenization is the first step in any text processing task. It is not just breaking the text into pieces such as words and punctuation marks, known as tokens; it is more than that. spaCy's tokenizer is intelligent: it internally decides whether a "." is punctuation that should be separated into its own token, or part of an abbreviation like "U.S." that should be left intact.

spaCy applies rules specific to the language being processed. Let's understand this with an example.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("\"Next Week, We're coming from U.S.!\"")
# Print each token produced by the tokenizer
for token in doc:
    print(token.text)

  • spaCy first splits the raw text on the whitespace available in it.

  • Then it processes the text from left to right and, on each whitespace-separated substring, performs the following two checks:

  • Exception rule check: the punctuation in "U.S." should not be split into further tokens; it should remain a single token. However, "We're" should be split into "We" and "'re".

  • Prefix, suffix and infix check: punctuation such as commas, periods, hyphens or quotes is treated as separate tokens and split off.

If there is a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
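If you want to see which rule produced each token, newer spaCy versions expose a debugging helper, nlp.tokenizer.explain(), which returns (rule name, token text) pairs. A minimal sketch, assuming the nlp object loaded above and a spaCy version recent enough (roughly v2.2.3+) to ship this helper:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "\"Next Week, We're coming from U.S.!\""
# Each entry names the pattern that matched: PREFIX, SUFFIX, INFIX,
# SPECIAL-n (an exception rule) or TOKEN (a plain whitespace split).
for rule, token_text in nlp.tokenizer.explain(text):
    print(rule, "->", token_text)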

  • Prefix: look for character(s) at the beginning ▸ $ ( " ¿

  • Suffix: look for character(s) at the end ▸ mm ) , . ! "  (mm is an example of a unit)

  • Infix: look for character(s) in between ▸ - -- / ...

  • Exception: special-case rule to split a string into several tokens, or to prevent a token from being split when punctuation rules are applied ▸ St. N.Y. (a sketch of adding such a special case follows this list)
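If the tokenizer does not already know an abbreviation you care about, you can register it as an exception yourself via add_special_case. A minimal sketch, assuming the nlp object from above; the abbreviation "Eng." is only an illustrative assumption, not something from the original article:

from spacy.symbols import ORTH

# Hypothetical abbreviation, used only for illustration: keep "Eng." together
# instead of letting the suffix rule split off the trailing period.
nlp.tokenizer.add_special_case("Eng.", [{ORTH: "Eng."}])

doc = nlp(u"She holds an Eng. degree.")
print([t.text for t in doc])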

Notice that tokens are pieces of the original text. Tokens are the basic building blocks of a Doc object: everything that helps us understand the meaning of the text is derived from tokens and their relationships to one another.
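Because everything downstream is derived from tokens, it is worth knowing that a Doc behaves like a sequence of Token objects: it can be indexed and sliced (a slice gives a Span), and each token carries attributes such as its position and character offset. A small sketch with an assumed example sentence, reusing the nlp object from above:

doc_demo = nlp(u"Tokens are the building blocks of a Doc object.")

print(len(doc_demo))      # number of tokens in the Doc
print(doc_demo[0].text)   # first token
print(doc_demo[3:6])      # a Span covering tokens 3, 4 and 5
for token in doc_demo:
    # token.i is the index within the Doc, token.idx the character offset in the text
    print(token.i, token.idx, token.text)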

Prefixes, Suffixes and Infixes as Tokens

  • spaCy will separate punctuation that does not form an integral part of a word.

  • Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own tokens.

  • However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

doc2 = nlp(u"We're here to guide you! Send your query, email contact@enetwork.ai or visit us at http://www.enetwork.ai!")for t in doc2:
print(t)

Note that the exclamation points and the comma are assigned their own tokens. However, the periods and the colon inside the email address and the website URL are not split off, so both the email address and the URL are preserved as single tokens.
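One way to confirm this behaviour is to check the built-in token flags like_email and like_url, which spaCy sets on tokens that look like an email address or a URL. A small sketch reusing doc2 from the snippet above:

# doc2 was created in the snippet above; print only the tokens spaCy
# recognises as an email address or a URL.
for t in doc2:
    if t.like_email or t.like_url:
        print(t.text, "like_email:", t.like_email, "like_url:", t.like_url)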

doc3 = nlp(u'A 40km U.S. cab ride costs $100.60')
# Print each token; the unit, currency symbol and amount get handled differently
for t in doc3:
    print(t)

Here the distance unit and the dollar sign are assigned their own tokens; the dollar amount, however, is preserved as a single token, because the decimal point inside the number is not split off.
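Similarly, the token flags like_num and is_currency make it easy to check that the amount stayed in one piece and that the dollar sign became its own token. A small sketch reusing doc3 from above:

# doc3 was created in the snippet above.
for t in doc3:
    print(t.text, "like_num:", t.like_num, "is_currency:", t.is_currency)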

Exceptions in Token Generation

Punctuation that exists as part of a known abbreviation will be kept as part of the token.

doc4 = nlp(u"Let's visit the St. Louis in the U.S. next year.")
for t in doc4:
print(t)