NLP: A Complete Approach to Removing Chinese and English Punctuation from Text

这般女子

Published 2019-07-25 10:23:48


Tags: NLP, Chinese and English punctuation handling, python

Article link: https://blog.csdn.net/xiaoxiaojie521/article/details/97240436


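This article describes a way to handle Chinese and English punctuation in text: removing Chinese and English punctuation separately, dealing with the leftover spaces, and concrete Python code for the removal steps.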


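When processing text, the differences between Chinese and English punctuation cause a lot of trouble. I handle the Chinese punctuation first, then the English punctuation, and finally remove the spaces left behind by the earlier steps. Note: the Chinese and English punctuation sets live in two separate libraries (zhon.hanzi and string), and both export them under the same name, punctuation. If both are imported into one module without an alias, the second import shadows the first, so the two sets have to be handled separately, not in one go.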

Removing English punctuation

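import re
from string import punctuation as en_punctuation

def preprocess_English(content):
    # content: an iterable of strings (e.g. lines or tokens)
    # re.escape keeps characters such as \ ] ^ - literal inside the character class
    pattern = re.compile(r'[{}]+'.format(re.escape(en_punctuation)))
    train_data = []
    for word in content:
        # Replace each run of English punctuation with a single space
        word = pattern.sub(' ', word)
        train_data.append(word)
    return train_data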
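
One detail worth checking when a character class is built from string.punctuation: the set contains a backslash followed by a closing bracket, and inside a regex class that pair is read as an escaped bracket, so the backslash itself is not matched unless the set goes through re.escape first. The snippet below is a small sketch of that difference; the names raw_pattern and escaped_pattern are only illustrative.

```python
import re
from string import punctuation

# Character class built directly from string.punctuation: the "\]" inside it
# is parsed as an escaped "]", so a literal backslash is NOT part of the set.
raw_pattern = re.compile(r'[{}]+'.format(punctuation))

# Same class with every special character escaped: backslash is now included.
escaped_pattern = re.compile('[{}]+'.format(re.escape(punctuation)))

sample = 'a\\b'  # the three characters: a, backslash, b

print(raw_pattern.sub(' ', sample))      # backslash survives
print(escaped_pattern.sub(' ', sample))  # backslash replaced by a space
```

For ordinary prose the two patterns behave the same; the difference only shows up for text that contains backslashes, such as file paths or escaped strings.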

Removing Chinese punctuation

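import re
from zhon.hanzi import punctuation as zh_punctuation

def preprocess_Chinese(content):
    # content: an iterable of strings (e.g. lines of text)
    # re.escape is a precaution so no character is read as regex syntax
    pattern = re.compile(r'[{}]+'.format(re.escape(zh_punctuation)))
    train = []
    for line in content:
        # Replace each run of Chinese punctuation with a single space
        line = pattern.sub(' ', line)
        train.append(line)
    return train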
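
The last step mentioned at the start, removing the spaces left behind once punctuation has been replaced, is not shown above. A minimal sketch of one way to do it follows; the function name collapse_spaces is hypothetical, and it assumes the same input shape as the two functions above (an iterable of strings).

```python
def collapse_spaces(content):
    """Collapse whitespace runs to one space and trim both ends of each string."""
    cleaned = []
    for line in content:
        # str.split() with no argument splits on any run of whitespace and
        # ignores leading/trailing whitespace, so joining with ' ' leaves
        # exactly one space between the remaining pieces.
        cleaned.append(' '.join(line.split()))
    return cleaned


print(collapse_spaces(['Hello  world ', ' 你好   世界  ']))
# ['Hello world', '你好 世界']
```

Run in the order described in the article, this would be applied last, to the output of the Chinese and then the English punctuation pass.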