Python Unicode usage: what is the correct way to use Unicode characters in a Python regular expression?

While scraping some documents with Python 2.7, I ran into some annoying page separators, which I've decided to remove. The separators use some unusual characters. I already asked a question here about how to make these characters reveal their UTF-8 codes. Two byte sequences show up: '\xc2\xad' (the UTF-8 encoding of U+00AD, the soft hyphen) and '\x0c' (the form feed, which is actually an ASCII control character). Now I just need to remove these characters, as well as some spaces and the page numbers.

Elsewhere on SO, I've seen Unicode characters used together with regexps, but in a format that my characters are not in, e.g. '\u00ab'. In addition, none of those examples mix ASCII and non-ASCII characters. Finally, the Python docs are very light on the subject of Unicode in regexes (something about flags, which I don't follow). Can anyone help?

Here is my current usage, which does not do what I want:

re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)

Solution

Rather than seeking out specific unwanted characters, you could remove everything that is not on a list of wanted characters:

re.sub('[^\\s!-~]', '', my_str)

This throws away every character that is not one of the following:

whitespace (spaces, tabs, newlines, etc.)

printable "normal" ASCII characters ('!' is the first visible character and '~' is the last one below decimal 128)
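To see what this character class keeps and what it drops, here is a minimal sketch. The sample string is made up for illustration, and the snippet is written for Python 3 with the input already decoded to text. In Python 2.7 the same pattern can be applied to the byte string, where both bytes of '\xc2\xad' fall outside the class and are dropped. Note that `\s` treats the form feed '\x0c' as whitespace, so this pattern leaves it in place, along with the page number.

```python
import re

# Hypothetical sample: text, then a page separator of the form
# soft hyphen, page number, soft hyphen, form feed, then more text.
sample = "end of page\n\u00ad 12 \u00ad \x0cstart of next page"

# Keep whitespace and printable ASCII ('!' through '~'); drop everything else.
cleaned = re.sub(r"[^\s!-~]", "", sample)

print(repr(cleaned))
```

The soft hyphens disappear, but the digits, the surrounding spaces, and the form feed survive, so this approach alone does not remove the whole separator.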

You could include more chars if needed - just adjust the character class.
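If the goal is to remove the whole separator, page number included, another option is to decode the bytes to text first and match the separator explicitly. The following is only a sketch under assumptions: the input is UTF-8, the separator has the layout suggested by the pattern in the question (soft hyphen, page number, soft hyphen, form feed, with optional blanks in between), and `strip_page_separators` is a hypothetical helper name. It is written for Python 3; in Python 2.7 both the pattern and the decoded string would need to be `unicode` objects.

```python
import re

# Assumed separator layout: soft hyphen (U+00AD), page number,
# soft hyphen, form feed (U+000C), with optional blanks between them.
SEPARATOR = re.compile("[ \t]*\u00ad[ \t]*\\d+[ \t]*\u00ad[ \t]*\x0c")


def strip_page_separators(raw_bytes, encoding="utf-8"):
    """Decode raw bytes and remove every page separator (hypothetical helper)."""
    text = raw_bytes.decode(encoding)
    return SEPARATOR.sub("", text)


raw = b"end of page\n\xc2\xad 12 \xc2\xad \x0cstart of next page"
print(repr(strip_page_separators(raw)))
```

Decoding first means the two bytes '\xc2\xad' become the single character U+00AD, which is why the '\u00ad' notation seen in other answers applies here as well.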
