Text processing on a DataFrame column in Python: how to convert a specific text column of a DataFrame to 'utf-8' with Python 3...

I have a dataframe with multiple columns, and one column contains text scraped from various links.

I tried to convert that column to utf-8 but it didn't work.

Here is my approach:

df = pd.read_excel('data.xlsx',encoding=sys.getfilesystemencoding())

df['text'] = df['text'].apply(lambda x: x.encode('utf-8').strip())

print(df['text'])

I get text containing escaped byte codes:

b"b'#Thank you, it\xe2\x80\x99s good to be ...

df = pd.read_excel('data.xlsx',encoding=sys.getfilesystemencoding())

df['text'] = df['text']

print(df['text'])

I get the text:

b'#Thank you, it\xe2\x80\x99s good to be here....

df['text'] = df['text'].apply(lambda x: x.decode('utf-8').strip())

AttributeError: 'str' object has no attribute 'decode'

I tried 2-3 approaches but none of them worked. Any alternative?

Using Python 3.6 and Jupyter Notebook.

Solution

I'm assuming that the output you show for the example whose second line is df['text'] = df['text'] ends with a closing '. In other words, each value looks like b'#Thank you, it\xe2\x80\x99s good to be here....':

For some reason you have bytes that have been cast to a string; that is why you see AttributeError: 'str' object has no attribute 'decode' when you try to decode it. (Ideally, it would be best not to have gotten into this situation; see here for some advice that looks pertinent. Alas, going with what you have ...)
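To make the situation concrete, here is a minimal sketch of how bytes can end up stored as a string. The sample value is an assumption chosen to resemble the output shown in the question; how the cast actually happened in the original scraping step is not known.

```python
# Hypothetical reconstruction: UTF-8 bytes passed through str() instead of .decode().
raw = "it’s good".encode("utf-8")   # real bytes
as_text = str(raw)                  # the repr of those bytes, now stored as a str

print(type(as_text).__name__)       # str
print(as_text)                      # b'it\xe2\x80\x99s good'
print(hasattr(as_text, "decode"))   # False, hence the AttributeError
```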

I think at this point you can remove the b' at the start of the string and the ' at the far end, then cast back to bytes. Note that this will result in the backslashes getting escaped, so that needs to be dealt with, in addition to decoding the bytes to a string in the proper way. Using an approach based on here, you can unescape and decode the bytes.

Putting this together (sort of like how @rolf82 illustrated in the comments) with what you show as df['text'] when df['text'] = df['text'], and given that it is a string at the start, the conversion would look like this:

a = "b'#Thank you, it\xe2\x80\x99s good to be here'"

# But we only want the parts between the ''.

s = bytes(r"#Thank you, it\xe2\x80\x99s good to be here","utf-8")

import codecs

print(codecs.escape_decode(s)[0].decode("utf-8"))

That gives:

#Thank you, it’s good to be here

Which is what we want.

Integrating that with Pandas requires something extra, because we cannot mark the cell values as raw strings by adding r in front. Based on here and here, it seems the r prefix can be replaced with .encode('unicode-escape').decode(), like:

"#Thank you, it\xe2\x80\x99s good to be here".encode('unicode-escape').decode()

So pulling it all together I'd replace your second line with this:

import codecs

df['text'] = df['text'].apply(lambda x: codecs.escape_decode(bytes(x[2:-1].encode('unicode-escape').decode(), "utf-8"))[0].decode('utf-8').strip())
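One caveat may be worth checking before applying this to real data: the .encode('unicode-escape') step behaves differently depending on whether a value holds the actual characters U+00E2, U+0080 and U+0099 (as a non-raw Python literal does) or literal backslash sequences. The sketch below uses made-up sample values to show both cases.

```python
with_chars = "it\xe2\x80\x99s"          # three real characters after "it"
with_backslashes = r"it\xe2\x80\x99s"   # literal backslash-x sequences

# Real characters are turned into backslash sequences, as intended.
escaped_chars = with_chars.encode('unicode-escape').decode()
print(escaped_chars == with_backslashes)    # True

# Existing backslashes are escaped again, so each one is doubled.
escaped_twice = with_backslashes.encode('unicode-escape').decode()
print(escaped_twice == with_backslashes)    # False
print(escaped_twice.count("\\"))            # 6
```

If the cells already contain literal backslashes, the doubled backslashes would survive the later escape_decode step as visible \x sequences, so in that case it may be better to skip the unicode-escape step.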

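If that doesn't work, also try leaving off the .decode() after .encode('unicode-escape'). Since .encode() already returns bytes, the bytes(..., "utf-8") wrapper has to be dropped as well (bytes() raises a TypeError when given an encoding together with a bytes argument), which is:

```python
import codecs

df['text'] = df['text'].apply(lambda x: codecs.escape_decode(x[2:-1].encode('unicode-escape'))[0].decode('utf-8').strip())
```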
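As an alternative sketch (not part of the approach above): if each cell holds the textual form of a bytes literal with literal backslashes, as the printed output in the question suggests, the standard library's ast.literal_eval can parse it back into bytes. The helper name decode_bytes_repr is hypothetical, values that are not bytes literals are returned unchanged, and invalid UTF-8 would still raise UnicodeDecodeError.

```python
import ast

def decode_bytes_repr(value):
    """Turn a string like "b'...'" back into text; leave anything else alone."""
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        try:
            parsed = ast.literal_eval(value)  # safely evaluates the literal
        except (ValueError, SyntaxError):
            return value                      # not a valid literal, keep as-is
        if isinstance(parsed, bytes):
            return parsed.decode('utf-8').strip()
    return value

sample = r"b'#Thank you, it\xe2\x80\x99s good to be here'"
print(decode_bytes_repr(sample))  # #Thank you, it’s good to be here
```

With pandas, this could be applied as df['text'] = df['text'].apply(decode_bytes_repr).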
