Python: Removing \xa0 from a string?

weixin_39566387

Article link: https://blog.csdn.net/weixin_39566387/article/details/111425725

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?

I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):

EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?

Solution

\xa0 is the non-breaking space (U+00A0), which is byte 0xA0 in Latin-1 (ISO 8859-1); it is also unichr(160) in Python 2 (chr(160) in Python 3). You should replace it with a space.

string = string.replace(u'\xa0', u' ')
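As a minimal sketch of the replacement above: the sample text and the variable names are made up for illustration, and the u'' prefixes are kept so the snippet reads the same under Python 2.7 and Python 3.

```python
# Hypothetical sample text containing two non-breaking spaces (U+00A0).
text = u'Price:\xa0100\xa0USD'

# Replace every non-breaking space with an ordinary space.
# Both arguments are unicode strings, so no implicit encoding is involved.
cleaned = text.replace(u'\xa0', u' ')

print(repr(cleaned))
```

Note that replace() returns a new string and leaves the original untouched, so the result has to be assigned back.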

Calling .encode('utf-8') converts the unicode string to UTF-8 bytes, where each code point is represented by 1 to 4 bytes. In this case, \xa0 is encoded as the 2 bytes \xc2\xa0, which is why \xc2 shows up when you encode without replacing first.
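The two-byte encoding can be checked directly. This is a small illustrative sketch; the variable names are assumptions, not part of the original answer.

```python
# The non-breaking space as a unicode string (one code point, U+00A0).
nbsp = u'\xa0'

# Encoding to UTF-8 produces two bytes: 0xC2 followed by 0xA0.
encoded = nbsp.encode('utf-8')
print(repr(encoded), len(encoded))

# Decoding the bytes gives back the original single code point.
decoded = encoded.decode('utf-8')
print(decoded == nbsp)
```

So the \xc2 is not a new "weird" character: it is the first byte of the UTF-8 encoding of the same non-breaking space.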

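Please note: this answer is from 2012, and Python has moved on. You should now be able to use unicodedata.normalize instead, for example unicodedata.normalize('NFKC', string), which maps \xa0 to a regular space.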
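A hedged sketch of the unicodedata.normalize approach follows. The answer does not say which normalization form to use; NFKC is assumed here because compatibility normalization is what maps the non-breaking space to a plain space. The sample text is hypothetical.

```python
import unicodedata

# Hypothetical sample text containing two non-breaking spaces (U+00A0).
text = u'Price:\xa0100\xa0USD'

# NFKC applies compatibility decomposition, which turns U+00A0 into U+0020,
# and then recomposes the result.
normalized = unicodedata.normalize('NFKC', text)

print(repr(normalized))
```

Keep in mind that NFKC also rewrites other compatibility characters (for example, the ligature U+FB01 becomes the two letters "fi"), so a plain replace() is the safer choice when only \xa0 should change.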
