Text Preprocessing: Tokenization

Tokenization is one of the most common tasks when working with text data. But what does the term "tokenization" actually mean?

Tokenization in Python is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

There are three simple types of tokenization in Python:

1. Word tokenization: splitting a sentence into individual words

2. Sentence tokenization: splitting a paragraph into individual sentences

3. Regular expression tokenization: splitting text using regular expression patterns

Next, I'll walk through six methods for performing tokenization.

1. Tokenization using Python's split() function

Let's start with the split() method, as it is the most basic one. It returns a list of strings after breaking the given string at the specified separator. By default, split() splits the string at every whitespace. We can change the separator to anything we like.

Word tokenization

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed liquid-fuel launch vehicle to orbit the Earth."""
# Splits at space 
text.split() 
Output : ['Founded', 'in', '2002,', 'SpaceX’s', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars.', 'In', '2008,', 'SpaceX’s', 'Falcon', '1', 'became', 'the', 'first', 'privately', 
 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth.']

Sentence tokenization

This is similar to word tokenization. Here we look at the structure of sentences: a sentence usually ends with a period (.), so we can use '. ' as the separator to split the string:

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Splits at '. ' 
text.split('. ') 
Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars', 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

One major drawback of Python's split() method is that we can only use one separator at a time. Another thing to note: in word tokenization, split() does not treat punctuation as separate tokens. Regular expressions, covered next, address both issues.

2. Tokenization using regular expressions

We can use the re library in Python to work with regular expressions. It comes pre-installed with Python.

Word tokenization

import re
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
tokens = re.findall(r"[\w']+", text)
print(tokens)

The re.findall() function finds all the substrings that match the pattern passed to it and stores them in a list.

"\w" means "any word character", which usually covers alphanumeric characters (letters, digits) and the underscore (_). The '+' means one or more repetitions. So [\w']+ tells the code to match runs of word characters (and apostrophes) until any other character is encountered.
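If you also want punctuation marks to come out as separate tokens, a slightly different pattern works. Here is a minimal sketch (the pattern \w+|[^\w\s] is just one common choice):

import re
text = "Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization."
# \w+ matches runs of word characters; [^\w\s] matches any single
# character that is neither a word character nor whitespace, i.e. punctuation
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
Output : ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', '.']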

Sentence tokenization

To perform sentence tokenization, we can use the re.split() function (or the split() method of a compiled pattern), which splits the text into sentences based on the pattern we pass in.

import re
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
sentences = re.compile('[.!?] ').split(text)
print(sentences)
Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.', 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

Here we have an advantage over the split() method, since we can pass multiple separators at the same time. In the code above, we used the re.compile() function and passed the pattern '[.!?] '. This means the text is split wherever any of these characters is followed by a space.
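One variation worth knowing (a minimal sketch): a lookbehind pattern keeps the terminating punctuation attached to each sentence instead of discarding it:

import re
text = "SpaceX was founded in 2002. Falcon 1 reached orbit in 2008! What comes next?"
# (?<=[.!?]) is a lookbehind: split at whitespace only when it is
# immediately preceded by '.', '!' or '?', so the punctuation stays put
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
Output : ['SpaceX was founded in 2002.', 'Falcon 1 reached orbit in 2008!', 'What comes next?']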

3. Tokenization using NLTK

NLTK, short for Natural Language ToolKit, is a library written in Python for symbolic and statistical natural language processing. It works well when you are dealing with large amounts of text data.

You need to install the library before using it:
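pip install nltk # installation command

NLTK's tokenizers also rely on the pretrained Punkt models, which have to be downloaded once:

import nltk
nltk.download('punkt') # one-time download of the Punkt tokenizer models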

NLTK contains a module called tokenize, which falls into two sub-categories:

  • Word tokenization: we use the word_tokenize() method to split a sentence into tokens or words
  • Sentence tokenization: we use the sent_tokenize() method to split a document or paragraph into sentences

Word tokenization

from nltk.tokenize import word_tokenize 
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
word_tokenize(text)

 Output: ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']

Sentence tokenization

from nltk.tokenize import sent_tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
sent_tokenize(text)

Output: ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.', 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

4. Tokenization using the spaCy library

spaCy is an open-source library for advanced natural language processing (NLP). It supports more than 49 languages and offers state-of-the-art processing speed.

Installation link: Install spaCy · spaCy Usage Documentation
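It can also be installed directly with pip:

pip install spacy # installation command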

Word tokenization

from spacy.lang.en import English

# Create a blank English pipeline (it comes with just the tokenizer)
nlp = English()

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

#  "nlp" Object is used to create documents with linguistic annotations.
my_doc = nlp(text)

# Create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)

Output : ['Founded', 'in', '2002', ',', 'SpaceX', '’s', 'mission', 'is', 'to', 'enable',  'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 
'multi', '-', 'planet', '\n', 'species', 'by', 'building', 'a', 'self', '-',  'sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’s', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', '\n', 'liquid', '-', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']
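Unlike the previous libraries, spaCy returns Token objects rather than plain strings, so we can filter on token attributes. A small sketch using the built-in is_punct and is_space flags:

from spacy.lang.en import English

nlp = English()
doc = nlp("Founded in 2002, SpaceX’s mission is bold.")

# keep only tokens that are neither punctuation nor whitespace
words = [token.text for token in doc if not token.is_punct and not token.is_space]
print(words)
Output : ['Founded', 'in', '2002', 'SpaceX', '’s', 'mission', 'is', 'bold']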

Sentence tokenization

from spacy.lang.en import English

# Create a blank English pipeline (it comes with just the tokenizer)
nlp = English()

# Add the rule-based 'sentencizer' component to the pipeline
# (spaCy 3.x syntax; on spaCy 2.x use: nlp.add_pipe(nlp.create_pipe('sentencizer')))
nlp.add_pipe('sentencizer')

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

#  "nlp" Object is used to create documents with linguistic annotations.
doc = nlp(text)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring  civilization and a multi-planet \nspecies by building a self-sustaining city on  Mars.',  'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

5. Tokenization using Keras

Keras! One of the hottest deep learning frameworks in the industry right now. It is an open-source neural network library for Python. Keras is very easy to use, and it runs on top of TensorFlow.

pip install keras # installation command

To perform word tokenization with Keras, we use the text_to_word_sequence method from the keras.preprocessing.text module.

from keras.preprocessing.text import text_to_word_sequence
# define
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# tokenize
result = text_to_word_sequence(text)
print(result)

Output : ['founded', 'in', '2002', 'spacex’s', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 
'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'mars', 'in', '2008', 'spacex’s', 'falcon', '1', 'became', 'the', 'first',  'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'earth']
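Notice that everything has been lowercased and the punctuation is gone. Those are text_to_word_sequence's defaults, which can be adjusted through its filters, lower, and split parameters; for example:

from keras.preprocessing.text import text_to_word_sequence
# keep the original casing and only strip commas and periods
result = text_to_word_sequence(
    "Founded in 2002, SpaceX’s mission is bold.",
    filters=',.',  # characters to remove (the default strips most punctuation)
    lower=False,   # keep the original casing (default is True)
    split=' '      # separator to split on (default is a space)
)
print(result)
Output : ['Founded', 'in', '2002', 'SpaceX’s', 'mission', 'is', 'bold']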

6. Tokenization using Gensim

Gensim is an open-source library for unsupervised topic modeling and natural language processing, designed to automatically extract semantic topics from a given document.

pip install gensim # installation command

Word tokenization

from gensim.utils import tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
list(tokenize(text))

Output : ['Founded', 'in', 'SpaceX', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 'planet', 
'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'Mars',  'In', 'SpaceX', 's', 'Falcon', 'became', 'the', 'first', 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth']
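Notice that the numbers ('2002' and '1') are missing: gensim's tokenize() only yields alphabetic tokens. It also accepts optional arguments such as lowercase and deacc (which removes accent marks); for example:

from gensim.utils import tokenize
# tokenize() returns a generator; lowercase=True folds case,
# deacc=True strips accents from the tokens
list(tokenize("Café Déjà Vu, 2002.", lowercase=True, deacc=True))
Output : ['cafe', 'deja', 'vu']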

Sentence tokenization

# Note: the summarization module was removed in Gensim 4.0,
# so this example needs an older release (pip install "gensim<4.0")
from gensim.summarization.textcleaner import split_sentences
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
result = split_sentences(text)
print(result)

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring  civilization and a multi-planet ', 'species by building a self-sustaining city on Mars.', 'In 2008, SpaceX’s Falcon 1 became the first privately developed ', 
'liquid-fuel launch vehicle to orbit the Earth.']
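Note that split_sentences also breaks at the line breaks inside the triple-quoted string, which is why the two sentences come back as four pieces here. Joining the lines first (or keeping each sentence on one line) avoids this.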
