Text Preprocessing: Tokenization

Tokenization is one of the most common tasks when working with text data. But what does the term "tokenization" actually mean?

Tokenization in Python is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

There are three simple types of tokenization in Python:

1. Word tokenization: splitting a sentence into individual words

2. Sentence tokenization: splitting a paragraph into individual sentences

3. Regular expression tokenization: splitting text using regular expression patterns

Next, I'll walk through six methods for performing tokenization.

1. Tokenization using Python's split() function

Let's start with the split() method, as it is the most basic one. It returns a list of strings after breaking the given string at the specified separator. By default, split() splits the string at every whitespace. We can change the separator to anything we like.

Word tokenization

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed liquid-fuel launch vehicle to orbit the Earth."""
# Splits at space 
text.split() 
Output : ['Founded', 'in', '2002,', 'SpaceX’s', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars.', 'In', '2008,', 'SpaceX’s', 'Falcon', '1', 'became', 'the', 'first', 'privately', 
 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth.']

Sentence tokenization

This is similar to word tokenization. Here we look at the structure of sentences: a sentence usually ends with a period (.), so we can use '. ' as the separator to split the string:

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Splits at '. ' 
text.split('. ') 
Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars', 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

One major drawback of Python's split() method is that we can only use one separator at a time. Another thing to note: in word tokenization, split() does not treat punctuation as separate tokens. Regular expressions, covered next, address both issues.

2. Tokenization using regular expressions

We can use the re library in Python to work with regular expressions. It comes pre-installed with Python.

Word tokenization

import re
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
tokens = re.findall(r"[\w']+", text)
print(tokens)

The re.findall() function finds all the substrings that match the pattern passed to it and stores them in a list.

"\w" means "any word character", which usually covers alphanumeric characters (letters, digits) and the underscore (_). The '+' means one or more repetitions. So [\w']+ tells the code to match runs of word characters (and apostrophes) until any other character is encountered.
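If you also want punctuation marks to come out as separate tokens, a slightly different pattern works. Here is a minimal sketch (the pattern \w+|[^\w\s] is just one common choice):

import re
text = "Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization."
# \w+ matches runs of word characters; [^\w\s] matches any single
# character that is neither a word character nor whitespace, i.e. punctuation
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
Output : ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', '.']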

Sentence tokenization

To perform sentence tokenization, we can use the re.split() function (or the split() method of a compiled pattern), which splits the text into sentences based on the pattern we pass in.

import re
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
sentences = re.compile('[.!?] ').split(text)
print(sentences)
Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.', 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

Here we have an advantage over the split() method, since we can pass multiple separators at the same time. In the code above, we used the re.compile() function and passed the pattern '[.!?] '. This means the text is split wherever any of these characters is followed by a space.
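One variation worth knowing (a minimal sketch): a lookbehind pattern keeps the terminating punctuation attached to each sentence instead of discarding it:

import re
text = "SpaceX was founded in 2002. Falcon 1 reached orbit in 2008! What comes next?"
# (?<=[.!?]) is a lookbehind: split at whitespace only when it is
# immediately preceded by '.', '!' or '?', so the punctuation stays put
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
Output : ['SpaceX was founded in 2002.', 'Falcon 1 reached orbit in 2008!', 'What comes next?']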

3. Tokenization using NLTK

NLTK, short for Natural Language ToolKit, is a library written in Python for symbolic and statistical natural language processing. It works well when you are dealing with large amounts of text data.

You need to install the library before using it:
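pip install nltk # installation command

NLTK's tokenizers also rely on the pretrained Punkt models, which have to be downloaded once:

import nltk
nltk.download('punkt') # one-time download of the Punkt tokenizer models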

NLTK contains a module called tokenize, which falls into two sub-categories:

  • Word tokenization: we use the word_tokenize() method to split a sentence into tokens or words
  • Sentence tokenization: we use the sent_tokenize() method to split a document or paragraph into sentences

Word tokenization

from nltk.tokenize import word_tokenize 
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
word_tokenize(text)

 Output: ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']

Sentence tokenization

from nltk.tokenize import sent_tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
sent_tokenize(text)

Output: ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.', 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

4. Tokenization using the spaCy library

spaCy is an open-source library for advanced natural language processing (NLP). It supports more than 49 languages and offers state-of-the-art processing speed.

Installation link: Install spaCy · spaCy Usage Documentation
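It can also be installed directly with pip:

pip install spacy # installation command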

Word tokenization

from spacy.lang.en import English

# Create a blank English pipeline (it comes with just the tokenizer)
nlp = English()

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

#  "nlp" Object is used to create documents with linguistic annotations.
my_doc = nlp(text)

# Create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)

Output : ['Founded', 'in', '2002', ',', 'SpaceX', '’s', 'mission', 'is', 'to', 'enable',  'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 
'multi', '-', 'planet', '\n', 'species', 'by', 'building', 'a', 'self', '-',  'sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’s', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', '\n', 'liquid', '-', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']
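Unlike the previous libraries, spaCy returns Token objects rather than plain strings, so we can filter on token attributes. A small sketch using the built-in is_punct and is_space flags:

from spacy.lang.en import English

nlp = English()
doc = nlp("Founded in 2002, SpaceX’s mission is bold.")

# keep only tokens that are neither punctuation nor whitespace
words = [token.text for token in doc if not token.is_punct and not token.is_space]
print(words)
Output : ['Founded', 'in', '2002', 'SpaceX', '’s', 'mission', 'is', 'bold']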

Sentence tokenization

from spacy.lang.en import English

# Create a blank English pipeline (it comes with just the tokenizer)
nlp = English()

# Add the rule-based 'sentencizer' component to the pipeline
# (spaCy 3.x syntax; on spaCy 2.x use: nlp.add_pipe(nlp.create_pipe('sentencizer')))
nlp.add_pipe('sentencizer')

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

#  "nlp" Object is used to create documents with linguistic annotations.
doc = nlp(text)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring  civilization and a multi-planet \nspecies by building a self-sustaining city on  Mars.',  'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

5. Tokenization using Keras

Keras! One of the hottest deep learning frameworks in the industry right now. It is an open-source neural network library for Python. Keras is very easy to use, and it runs on top of TensorFlow.

pip install keras # installation command

To perform word tokenization with Keras, we use the text_to_word_sequence method from the keras.preprocessing.text module.

from keras.preprocessing.text import text_to_word_sequence
# define
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# tokenize
result = text_to_word_sequence(text)
print(result)

Output : ['founded', 'in', '2002', 'spacex’s', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 
'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'mars', 'in', '2008', 'spacex’s', 'falcon', '1', 'became', 'the', 'first',  'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'earth']
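Notice that everything has been lowercased and the punctuation is gone. Those are text_to_word_sequence's defaults, which can be adjusted through its filters, lower, and split parameters; for example:

from keras.preprocessing.text import text_to_word_sequence
# keep the original casing and only strip commas and periods
result = text_to_word_sequence(
    "Founded in 2002, SpaceX’s mission is bold.",
    filters=',.',  # characters to remove (the default strips most punctuation)
    lower=False,   # keep the original casing (default is True)
    split=' '      # separator to split on (default is a space)
)
print(result)
Output : ['Founded', 'in', '2002', 'SpaceX’s', 'mission', 'is', 'bold']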

6. Tokenization using Gensim

Gensim is an open-source library for unsupervised topic modeling and natural language processing, designed to automatically extract semantic topics from a given document.

pip install gensim # installation command

Word tokenization

from gensim.utils import tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
list(tokenize(text))

Output : ['Founded', 'in', 'SpaceX', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 'planet', 
'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'Mars',  'In', 'SpaceX', 's', 'Falcon', 'became', 'the', 'first', 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth']
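Notice that the numbers ('2002' and '1') are missing: gensim's tokenize() only yields alphabetic tokens. It also accepts optional arguments such as lowercase and deacc (which removes accent marks); for example:

from gensim.utils import tokenize
# tokenize() returns a generator; lowercase=True folds case,
# deacc=True strips accents from the tokens
list(tokenize("Café Déjà Vu, 2002.", lowercase=True, deacc=True))
Output : ['cafe', 'deja', 'vu']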

Sentence tokenization

# Note: the summarization module was removed in Gensim 4.0,
# so this example needs an older release (pip install "gensim<4.0")
from gensim.summarization.textcleaner import split_sentences
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
result = split_sentences(text)
print(result)

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring  civilization and a multi-planet ', 'species by building a self-sustaining city on Mars.', 'In 2008, SpaceX’s Falcon 1 became the first privately developed ', 
'liquid-fuel launch vehicle to orbit the Earth.']
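Note that split_sentences also breaks at the line breaks inside the triple-quoted string, which is why the two sentences come back as four pieces here. Joining the lines first (or keeping each sentence on one line) avoids this.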
