Python Unicode usage: What is the correct way to use Unicode characters in a Python regular expression?

Latest recommended article published on 2022-08-16 09:04:28

清隳


Article link: https://blog.csdn.net/weixin_33345697/article/details/111890936


In the process of scraping some documents using Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their UTF-8 codes. There are two unusual characters used: '\xc2\xad' (the UTF-8 encoding of the soft hyphen, U+00AD) and '\x0c' (the form feed, which is actually an ASCII control character). Now I just need to remove these characters, as well as some spaces and the page numbers.

Elsewhere on SO, I've seen Unicode characters used together with regexps, but in an escape format that my characters are not in, e.g. '\u00ab'. In addition, none of those examples mix ASCII and non-ASCII characters. Finally, the Python docs are very light on the subject of Unicode in regexes... something about flags... I don't know. Can anyone help?

Here is my current usage, which does not do what I want:

re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)

Solution

Rather than seek out specific unwanted chars, you could keep only the chars you want and remove everything else:

re.sub('[^\\s!-~]', '', my_str)

This throws away every character that is not:

whitespace (spaces, tabs, newlines, etc.)

a printable "normal" ASCII character (! is the first printable non-space character and ~ is the last one below decimal 128)
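A minimal sketch of this whitelist approach, written for Python 3 rather than the Python 2.7 used in the question. The sample string and the name `strip_unwanted` are made up for illustration. One caveat worth noticing: `\s` also matches the form feed `\x0c`, so that character survives the filter, and so does the page number.

```python
import re

# Hypothetical helper: keep only whitespace and printable ASCII ('!' to '~').
def strip_unwanted(text):
    return re.sub(r'[^\s!-~]', '', text)

# Made-up sample: a soft hyphen (U+00AD) on each side of a page number,
# followed by a form feed.
sample = 'end of page\n\u00ad 12 \u00ad \x0cstart of next page'

cleaned = strip_unwanted(sample)
print(repr(cleaned))  # soft hyphens are gone; the form feed and '12' remain
```

If the form feed should go as well, it has to be handled separately, since it counts as whitespace.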

You could include more chars if needed - just adjust the character class.
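If the goal is to drop the whole separator, page number included, one alternative is to decode the bytes first and match the separator explicitly. This is a sketch under assumptions, not the answerer's method: it uses Python 3, assumes the input is UTF-8 (so the bytes `\xc2\xad` decode to the single character U+00AD), and assumes the separator layout described in the question. The names `SEPARATOR` and `remove_separators` are illustrative.

```python
import re

# Assumed layout from the question: soft hyphen, whitespace, digits,
# whitespace, soft hyphen, whitespace, form feed.
# After decoding, the bytes '\xc2\xad' become the character '\u00ad'.
SEPARATOR = re.compile(r'\u00ad\s\d+\s\u00ad\s\x0c')

def remove_separators(raw_bytes):
    text = raw_bytes.decode('utf-8')  # bytes -> str, assuming UTF-8 input
    return SEPARATOR.sub('', text)

# Made-up sample input in the same byte form the question shows.
raw = b'end of page\n\xc2\xad 12 \xc2\xad \x0cstart of next page'
result = remove_separators(raw)
print(repr(result))
```

Matching on decoded text means each non-ASCII character is a single unit in the pattern, which avoids spelling out its individual UTF-8 bytes.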
