Problem
'''
Description: Tokenize English text and remove punctuation
Author: 365JHWZGo
Date: 2021-12-07 11:45:13
LastEditors: 365JHWZGo
LastEditTime: 2021-12-07 11:57:34
'''
Implementation
import spacy
import string
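# Test content
# Note: "\b" in this ordinary string literal is the backspace escape (\x08),
# not a backslash followed by "b"; it survives into the output.
content = "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
# Convert the content to lowercase
content = content.lower()
# Build a translation table that deletes every ASCII punctuation character
remove = str.maketrans("", "", string.punctuation)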
print(content.translate(remove))
Output
wall st bears claw back into the black reuters reuters shortsellers wall streets dwindlinand of ultracynics are seeing green again
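The output reads "dwindlinand" because "\b" inside an ordinary Python string literal is the backspace escape (\x08), and a terminal renders it by moving the cursor back one character. The following is a minimal sketch of one way to avoid this, not the author's method. It assumes the backslash in the source text was meant as a word separator, which the original post does not state. It uses a raw string so the backslash stays literal, then replaces it with a space before stripping punctuation.

```python
import string

# Raw string: the backslash stays a literal character instead of forming "\b".
raw = r"Wall Street's dwindling\band of ultra-cynics"

# Assumption: the stray backslash stands in for a space in the source data.
cleaned = raw.replace("\\", " ").lower()

# Same deletion table as in the code above.
remove = str.maketrans("", "", string.punctuation)
result = cleaned.translate(remove)
print(result)  # wall streets dwindling band of ultracynics
```

The replacement has to happen before the punctuation is stripped, because the backslash is itself part of string.punctuation and would otherwise be deleted, fusing the two words.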
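Prerequisites
# Check the spaCy installation; if the model is missing, install it with: python -m spacy download en_core_web_sm
python -m spacy info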
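# Load the spaCy model
nlp = spacy.load("en_core_web_sm")
# Tokenize the punctuation-free text; translate() returns a new string and leaves content unchanged
doc = nlp(content.translate(remove))
print([e.text for e in doc])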
['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters', 'reuters', ' ', 'shortsellers', 'wall', 'streets', 'dwindling\x08and', 'of', 'ultracynics', 'are', 'seeing', 'green', 'again']
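The ' ' entry in the token list comes from the two consecutive spaces left behind once the "-" in "reuters - short-sellers" is deleted; spaCy keeps extra whitespace as a separate token. Deleting punctuation also fuses "short-sellers" into "shortsellers". A minimal sketch of one alternative, using only the standard library and assuming it is acceptable to split words at punctuation: map each punctuation character to a space, then collapse runs of whitespace before passing the text to a tokenizer. The name `strip_punct_to_spaces` is hypothetical and does not appear in the original post.

```python
import string

# Map every ASCII punctuation character to a single space (a variant of the
# original table, which deletes punctuation instead).
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punct_to_spaces(text: str) -> str:
    """Lowercase text, replace punctuation with spaces, and collapse whitespace."""
    replaced = text.lower().translate(PUNCT_TO_SPACE)
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with " " removes the doubled spaces left behind.
    return " ".join(replaced.split())


sample = "(Reuters) Reuters - Short-sellers, Wall Street's ultra-cynics"
print(strip_punct_to_spaces(sample))
# reuters reuters short sellers wall street s ultra cynics
```

The trade-off is visible in the output: possessives such as "street's" become "street s", so this variant suits cases where hyphenated words should be split and a stray "s" token is tolerable.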