Python: removing punctuation, digits, and specified strings from English or Chinese text

v-space

Article link: https://blog.csdn.net/weixin_42069606/article/details/108108721

Example 1

import re
from string import punctuation, digits

def preprocess_English(text, rm_list):
    # re.escape keeps characters such as '\', ']' and '^' literal inside the
    # character class; without it the backslash in `punctuation` is not removed.
    text = re.sub(r'[{}]+'.format(re.escape(punctuation + digits)), '', text)
    # Remove each extra substring. Note that this also matches inside words.
    for rm_item in rm_list:
        text = text.replace(rm_item, '')
    return text

# Strings to remove from the text; '\n' and '\t' must be included.
rm_list = ['pg', '\n', '\t']
text_file = 'LifeofEdwinForrest.txt'
with open(text_file, 'r', encoding='utf-8') as f:
    text = f.read()
text = text.lower()
print(preprocess_English(text, rm_list))
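
To try the same approach without the text file, it can be run on an inline string. This is a minimal sketch rather than the original code: the helper name `strip_punct_digits` and the sample string are made up for illustration, and `re.escape` is used so that every punctuation character is treated literally inside the character class.

```python
import re
from string import punctuation, digits

# Hypothetical helper for illustration: removes ASCII punctuation and digits,
# then deletes each literal substring listed in rm_list.
def strip_punct_digits(text, rm_list=()):
    pattern = '[{}]+'.format(re.escape(punctuation + digits))
    text = re.sub(pattern, '', text)
    for rm_item in rm_list:
        text = text.replace(rm_item, '')
    return text

# Made-up sample containing a backslash, brackets, digits and a newline.
sample = "Hello, World! 2023\\path [ok]\n"
result = strip_punct_digits(sample.lower(), rm_list=['\n', '\t'])
print(result)
```

Because `str.replace` deletes the newline outright, words separated only by a line break are joined together; replacing `'\n'` with a space instead is one way to avoid that if word boundaries matter.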

Example 2

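import re
from string import punctuation as en_punctuation
from zhon.hanzi import punctuation as zh_punctuation  # third-party: pip install zhon

def preprocess_Chinese(text):
    # Remove Chinese (CJK and fullwidth) punctuation.
    text = re.sub(r'[{}]+'.format(re.escape(zh_punctuation)), '', text)
    return text

def preprocess_English(text):
    # Remove ASCII punctuation.
    text = re.sub(r'[{}]+'.format(re.escape(en_punctuation)), '', text)
    return text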
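
For text that mixes Chinese and English, a standard-library alternative is to filter characters by Unicode category instead of listing punctuation explicitly. This is a different technique from the regex approach above, shown only as a sketch: the function name `strip_unicode_punctuation` and the sample string are assumptions for illustration. It drops every character whose category starts with `P` (punctuation) or `S` (symbol), which may be broader than what `string.punctuation` and `zhon.hanzi.punctuation` cover.

```python
import unicodedata

# Hypothetical helper for illustration: keeps only characters that are
# neither punctuation (categories P*) nor symbols (categories S*).
def strip_unicode_punctuation(text):
    return ''.join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ('P', 'S')
    )

# Made-up sample with fullwidth and ASCII punctuation.
mixed = "你好，世界！Hello, world! 价格：$5。"
cleaned = strip_unicode_punctuation(mixed)
print(cleaned)
```

Digits and whitespace are left in place here; they can be removed afterwards with `str.replace` or `re.sub` if needed.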