Cleaning English text in Python: removing non-English words from text

I am doing a data cleaning exercise in Python, and the text I am cleaning contains Italian words that I would like to remove. I have been searching online to find out whether I can do this in Python using a toolkit like NLTK.

For example, given some text:

"Io andiamo to the beach with my amico."

I would like to be left with:

"to the beach with my"

Does anyone know how this could be done?

Any help would be much appreciated.

Solution

You can use the words corpus from NLTK:

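import nltk

nltk.download("words")  # one-time download of the word list, if not already installed

# Set of known English words, for fast membership tests
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."

# Keep tokens that are in the word list, plus non-alphabetic tokens such as punctuation
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())

# 'Io to the beach with my' (the "." token also passes the filter, since it is not alphabetic)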

Unfortunately, Io happens to be an English word. In general, it may be hard to decide whether a word is English or not.
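
One possible way to deal with false positives like Io is to filter against the word list but also subtract a small, hand-maintained set of words you know are unwanted in your data. The sketch below is only an illustration: the function name `keep_known_words`, its `exclude` parameter, and the tiny `demo_vocab` are hypothetical and not part of NLTK. In practice you would presumably pass `set(nltk.corpus.words.words())` as the vocabulary.

```python
import re


def keep_known_words(text, vocabulary, exclude=frozenset()):
    """Keep only alphabetic tokens found in `vocabulary` and not in `exclude`.

    Matching is case-insensitive. Punctuation and digits are dropped.
    """
    vocab = {w.lower() for w in vocabulary} - {w.lower() for w in exclude}
    # Runs of Unicode letters only (no digits, underscores, or punctuation)
    tokens = re.findall(r"[^\W\d_]+", text)
    return " ".join(t for t in tokens if t.lower() in vocab)


# Hypothetical stand-in vocabulary so the example runs without any downloads
demo_vocab = {"Io", "to", "the", "beach", "with", "my"}

sent = "Io andiamo to the beach with my amico."
result = keep_known_words(sent, demo_vocab, exclude={"io"})
print(result)
```

With the stand-in vocabulary above, this yields the sentence from the question. The exclusion set still has to be curated by hand, so this only works around the ambiguity for words you already know about.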
