Text processing on one column of a Python dataframe: how to convert a specific text column in a dataframe to 'utf-8' using Python 3...

I have a dataframe with multiple columns, and one column contains text scraped from various links.

I tried to convert that column to utf-8 but it didn't work.

Here is my approach:

df = pd.read_excel('data.xlsx',encoding=sys.getfilesystemencoding())

df['text'] = df['text'].apply(lambda x: x.encode('utf-8').strip())

print(df['text'])

I get text containing escaped byte sequences:

b"b'#Thank you, it\xe2\x80\x99s good to be ...

df = pd.read_excel('data.xlsx',encoding=sys.getfilesystemencoding())

df['text'] = df['text']

print(df['text'])

I get the text:

b'#Thank you, it\xe2\x80\x99s good to be here....

df['text'] = df['text'].apply(lambda x: x.decode('utf-8').strip())

AttributeError: 'str' object has no attribute 'decode'

I tried two or three approaches, but none of them worked. Is there an alternative?

Using Python 3.6 and jupyter notebook.

Solution

I'm assuming the output you show for the example whose second line is df['text'] = df['text'] ends with a closing '. In other words, the value is b'#Thank you, it\xe2\x80\x99s good to be here....':

For some reason you have bytes that have been cast to a string, which is why you see AttributeError: 'str' object has no attribute 'decode' when you try to decode it. (Ideally you would not have gotten into this situation; see here for some advice that looks pertinent. Alas, going with what you have ... )
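To see how such a value can arise, here is a minimal sketch (the variable names are illustrative and not taken from the question). Calling str() on a bytes object without an encoding keeps its printed representation, including the b' prefix, and the result is an ordinary str with no .decode() method:

```python
# Illustrative only: how bytes can end up stored as a "b'...'" string.
raw = "it’s good".encode("utf-8")   # real bytes: b'it\xe2\x80\x99s good'
as_text = str(raw)                  # str() without an encoding keeps the repr

print(type(as_text).__name__)       # str
print(as_text)                      # b'it\xe2\x80\x99s good'
print(hasattr(as_text, "decode"))   # False, hence the AttributeError
```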

I think at this point you can remove the b' at the start of the string and the ' at the far end, and convert the result back to bytes. Note that this leaves the backslashes escaped, so that needs to be dealt with in addition to decoding the bytes to a string in the proper way. Using an approach based on here, you can unescape and decode the bytes.

Putting this together (much as @rolf82 illustrated in the comments), and given that df['text'] is already a string when you run df['text'] = df['text'], the conversion from what you have would look like this:

a = "b'#Thank you, it\xe2\x80\x99s good to be here'"

# But we only want the parts between the ''.

s = bytes(r"#Thank you, it\xe2\x80\x99s good to be here","utf-8")

import codecs

print(codecs.escape_decode(s)[0].decode("utf-8"))

That gives:

#Thank you, it’s good to be here

Which is what we want.
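The manual step of keeping only the part between the quotes can also be done with a slice. This is a minimal sketch, assuming the cell holds literal backslash sequences (written here as a raw string); cell, inner and as_bytes are illustrative names:

```python
import codecs

# Assumption: the backslashes are literal characters in the stored text.
cell = r"b'#Thank you, it\xe2\x80\x99s good to be here'"

inner = cell[2:-1]                                          # drop b' and the closing '
as_bytes = codecs.escape_decode(inner.encode("utf-8"))[0]   # "\xe2" text -> byte 0xE2
result = as_bytes.decode("utf-8")                           # bytes -> proper str
print(result)
```

Note that codecs.escape_decode is not part of the documented codecs API, so treat it as a CPython convenience rather than a guaranteed interface.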

Integrating that with pandas requires something extra, because we cannot mark an existing value as a raw string by adding r in front; that prefix only applies to string literals. Based on here and here, it seems the r prefix can be replaced with .encode('unicode-escape').decode(), like:

"#Thank you, it\xe2\x80\x99s good to be here".encode('unicode-escape').decode()
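As a quick check of that claim, the sketch below compares the result with the equivalent raw-string literal (plain and escaped are illustrative names):

```python
# This string contains the three characters U+00E2, U+0080 and U+0099,
# not literal backslashes.
plain = "#Thank you, it\xe2\x80\x99s good to be here"
escaped = plain.encode("unicode-escape").decode()

# After the round trip the backslash sequences are literal text,
# exactly as if the literal had been written with an r prefix.
print(escaped == r"#Thank you, it\xe2\x80\x99s good to be here")  # True
```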

So pulling it all together I'd replace your second line with this:

import codecs

df['text'] = df['text'].apply(lambda x: codecs.escape_decode(bytes(x[2:-1].encode('unicode-escape').decode(), "utf-8"))[0].decode('utf-8').strip())
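The same steps can be unpacked into a named function, which is easier to debug than a long lambda. This is only a sketch: decode_cell is a hypothetical helper, and the guards for non-string cells and for text without the b'...' wrapper are assumptions about what the column might contain, not something shown in the question.

```python
import codecs

def decode_cell(cell):
    """Hypothetical helper: the one-liner's steps, plus a few guards."""
    # Leave missing values and non-text cells (e.g. None or NaN) untouched.
    if not isinstance(cell, str):
        return cell
    # Only strip the wrapper when the text really looks like b'...'.
    if len(cell) >= 3 and cell.startswith("b'") and cell.endswith("'"):
        cell = cell[2:-1]
    escaped = cell.encode("unicode-escape")     # chars such as U+00E2 -> "\xe2" text
    raw = codecs.escape_decode(escaped)[0]      # "\xe2" text -> actual byte 0xE2
    return raw.decode("utf-8").strip()

sample = ["b'#Thank you, it\xe2\x80\x99s good to be here'", None]
cleaned = [decode_cell(c) for c in sample]
print(cleaned)
```

With the dataframe from the question, this would be applied as df['text'] = df['text'].apply(decode_cell).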

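If the one-line version doesn't work, also try leaving off the .decode() after .encode('unicode-escape'). The bytes(..., "utf-8") wrapper has to go as well, because bytes() raises a TypeError when it is given an encoding together with a value that is already bytes:

```python
import codecs

# .encode('unicode-escape') already returns bytes, so pass it straight to escape_decode.
df['text'] = df['text'].apply(lambda x: codecs.escape_decode(x[2:-1].encode('unicode-escape'))[0].decode('utf-8').strip())
```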