① Task description:
English tokenizers generally split on whitespace, but my task is to identify connectives, and some connectives are multi-word phrases. Splitting them into single words makes the later matching step harder.
Solution:
Use NLTK's MWETokenizer with a user-defined multi-word dictionary.
import nltk
# If the punkt resource is already installed, this line can be commented out
nltk.download('punkt')
from nltk.tokenize import MWETokenizer
test="as for Lippi in China, it will be make or break against Syria on Tuesday in Nanjing."
print(nltk.word_tokenize(test))  # baseline: every word is a separate token
tokenizer = MWETokenizer([('in', 'China'), ('a', 'little', 'bit'), ('as', 'for')], separator=' ')
tokenizer.add_mwe(('in', 'spite', 'of'))
word=tokenizer.tokenize(nltk.word_tokenize(test))
print(word)
Result:
['as for', 'Lippi', 'in China', ',', 'it', 'will', 'be', 'make', 'or', 'break', 'against', 'Syria', 'on', 'Tuesday', 'in', 'Nanjing', '.']
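In practice the multi-word entries usually come from a connective lexicon rather than being typed as tuples by hand. A minimal sketch (the `connectives` list below is a hypothetical lexicon, not from the original): each phrase is split into its parts and passed to MWETokenizer, which only needs the plain list of words as input, so no downloaded resources are required here.

```python
from nltk.tokenize import MWETokenizer

# Hypothetical connective lexicon (assumption: you maintain your own list)
connectives = ["as for", "in spite of", "as a result", "on the other hand"]

# Each multi-word phrase becomes a tuple of its space-separated parts
tokenizer = MWETokenizer([tuple(c.split()) for c in connectives], separator=' ')

# A plain whitespace split is enough as input for this demonstration
words = "in spite of the rain , as a result the match was delayed .".split()
tokens = tokenizer.tokenize(words)
print(tokens)
# → ['in spite of', 'the', 'rain', ',', 'as a result', 'the', 'match', 'was', 'delayed', '.']
```

Keeping the lexicon in one place makes it easy to reuse the same phrase list for both tokenization and the later comparison step.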
② Task description:
Some abbreviations, e.g. Mr. and i.e., contain the period character ".". A naive sentence splitter treats the abbreviation's period as a sentence boundary and cuts the sentence there.
Solution:
Register the abbreviations with NLTK's Punkt sentence tokenizer.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
def cut_sentences_en(content):
    punkt_param = PunktParameters()
    # User-defined abbreviations; Punkt stores them lowercase and without the
    # trailing period, so 'i.e' rather than 'i.e.' (a multi-word phrase such as
    # 'for example' has no final period and would never match)
    abbreviation = ['i.e', 'e.g', 'vs', 'mr', 'mrs', 'prof', 'inc']
    punkt_param.abbrev_types = set(abbreviation)
    tokenizer = PunktSentenceTokenizer(punkt_param)
    sentences = tokenizer.tokenize(content)
    return sentences
sentence="I love fruits,i.e.apple,banana."
print(cut_sentences_en(sentence))
Result:
['I love fruits,i.e.apple,banana.']
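The same mechanism also prevents splitting after titles. A minimal sketch (the example sentence and the abbreviation set are illustrative, not from the original): with 'mr' registered, the period after "Mr." is treated as part of the abbreviation, so only the period after "Smith." ends a sentence.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# Entries are lowercase and written without the trailing period
punkt_param.abbrev_types = {'mr', 'mrs', 'dr', 'i.e', 'e.g'}
tokenizer = PunktSentenceTokenizer(punkt_param)

sentences = tokenizer.tokenize("I met Mr. Smith. He was late.")
print(sentences)
# → ['I met Mr. Smith.', 'He was late.']
```

Without the abbreviation set, a period followed by a capitalized word is a strong boundary signal, so "Mr. Smith" would risk being cut in two.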
③ Regex replacement
# Replace any character that is not a letter or digit with the empty string
import re
pattern = re.compile("[^a-zA-Z0-9]")  # a single leading '^' negates the whole class; repeating it is a bug
word = pattern.sub("", word)          # note: word must be a single string here, not a token list
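One caveat when combining this cleanup with task ①: a class that excludes everything but letters and digits also strips the space that MWETokenizer uses as its separator, turning 'as for' into 'asfor'. A minimal sketch (the token list is illustrative) that whitelists the space and applies the substitution per token, dropping tokens that become empty:

```python
import re

# Keep letters, digits, and the space used as the MWE separator
pattern = re.compile("[^a-zA-Z0-9 ]")

tokens = ['as for', 'Lippi', ',', 'it', 'will', '.']
cleaned = [pattern.sub("", t) for t in tokens]
cleaned = [t for t in cleaned if t]  # drop tokens that were pure punctuation
print(cleaned)
# → ['as for', 'Lippi', 'it', 'will']
```

This keeps the merged connectives intact while still removing standalone punctuation tokens.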