Python: removing special characters and spaces from a string

In [1]: import re

In [2]: text
Out[2]: "                                                                      \nALL this shows is that YOU don't know much about SCSI.\n\nSCSI-1 {with a SCSI-1 controler chip} range is indeed 0-5MB/s\nand that is ALL you have right about SCSI\nSCSI-1 {With a SCSI-2 controller chip}: 4-6MB/s with 10MB/s burst {8-bit}\n Note the INCREASE in SPEED, the Mac Quadra uses this version of SCSI-1\n so it DOES exist. Some PC use this set up too.\nSCSI-2 {8-bit/SCSI-1 mode}:  "

In [3]: re.sub(r'[^A-Za-z0-9]+',' ',text)
Out[3]: ' ALL this shows is that YOU don t know much about SCSI SCSI 1 with a SCSI 1 controler chip range is indeed 0 5MB s and that is ALL you have right about SCSI SCSI 1 With a SCSI 2 controller chip 4 6MB s with 10MB s burst 8 bit Note the INCREASE in SPEED the Mac Quadra uses this version of SCSI 1 so it DOES exist Some PC use this set up too SCSI 2 8 bit SCSI 1 mode '

In [4]: re.sub(r'\W+', ' ', text)
Out[4]: ' ALL this shows is that YOU don t know much about SCSI SCSI 1 with a SCSI 1 controler chip range is indeed 0 5MB s and that is ALL you have right about SCSI SCSI 1 With a SCSI 2 controller chip 4 6MB s with 10MB s burst 8 bit Note the INCREASE in SPEED the Mac Quadra uses this version of SCSI 1 so it DOES exist Some PC use this set up too SCSI 2 8 bit SCSI 1 mode '
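The two calls above give the same output for this text, but the patterns are not equivalent in general: `\W` treats the underscore as a word character and, for `str` patterns in Python 3, also treats non-ASCII letters as word characters unless `re.ASCII` is passed. The following is a minimal sketch of where they diverge; the sample string and the `time_pattern` helper are made up for illustration and are not part of the original session.

```python
import re
import timeit

# Made-up sample containing an underscore and a non-ASCII letter.
sample = "snake_case, café, 0-5MB/s"

explicit = re.sub(r'[^A-Za-z0-9]+', ' ', sample)           # ASCII letters and digits only
word = re.sub(r'\W+', ' ', sample)                         # Unicode-aware, keeps '_'
ascii_word = re.sub(r'\W+', ' ', sample, flags=re.ASCII)   # ASCII-only, still keeps '_'

print(explicit)    # snake case caf 0 5MB s
print(word)        # snake_case café 0 5MB s
print(ascii_word)  # snake_case caf 0 5MB s


def time_pattern(pattern, text, number=1000):
    """Hypothetical helper: seconds taken to run the substitution `number` times."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.sub(' ', text), number=number)
```

Calling `time_pattern` with each pattern on your own text is one way to check how the two compare in speed; the result depends on the input and the Python version.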

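Both approaches replace every run of special characters and whitespace with a single space; letters and digits are kept. The second is roughly twice as fast as the first, although the exact ratio will vary with the input and the Python version. The drawback is that sentence delimiters such as commas and periods are removed as well. These delimiters can be useful when processing text data, so we can also keep the common ones: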

In [5]: re.sub(r'[^a-zA-Z0-9,.\'!?]+',' ',text)
Out[5]:  " ALL this shows is that YOU don't know much about SCSI. SCSI 1 with a SCSI 1 controler chip range is indeed 0 5MB s and that is ALL you have right about SCSI SCSI 1 With a SCSI 2 controller chip 4 6MB s with 10MB s burst 8 bit Note the INCREASE in SPEED, the Mac Quadra uses this version of SCSI 1 so it DOES exist. Some PC use this set up too. SCSI 2 8 bit SCSI 1 mode "
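All three outputs still begin and end with a single space, because the leading and trailing whitespace is replaced rather than dropped. A minimal sketch of a reusable helper that also trims the ends is shown below; the name `clean_text` and the precompiled `_KEEP_PUNCT` pattern are assumptions for illustration, not part of the original post.

```python
import re

# Same character class as In [5]: keep letters, digits and , . ' ! ?
_KEEP_PUNCT = re.compile(r"[^A-Za-z0-9,.'!?]+")


def clean_text(text):
    """Replace runs of other characters with one space, then trim both ends."""
    return _KEEP_PUNCT.sub(' ', text).strip()


print(clean_text("  \nSCSI-1 {8-bit}: don't panic!  "))
# SCSI 1 8 bit don't panic!
```

Compiling the pattern once avoids looking it up again on every call, which can help when the function is applied to many documents.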

Reference:

https://stackoverflow.com/questions/5843518/remove-all-special-characters-punctuation-and-spaces-from-string
