python utf-8 codec cant decode_Python请求以utf-8编码的响应,但无法解码

I am trying to scrape my messenger.com (facebook messenger) chats using python and i have used google chromes developer tools to see the POST request for the chat history and i have copied the entire header and body into a format that requests can use.

I get HTTP code 200 implying the request at least got something but and i can print res.encoding to get the encoding it returned in which it says is utf-8. But i cannot decode it!

here is the function:

def download_thread(self, limit, offset, message_timestamp):

"""Download the specified number of messages from the

provided thread, with an optional offset

"""

data = request_data(self.thread, offset=offset,

limit=limit, group=self.group,

timestamp=message_timestamp)

res = self.ses.post(url_thread, data=data, headers=headers)

print(res.content)

thread_contents = json.loads(res.content)

print(thread_contents)

return thread_contents

yields

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

when it attempts to json.load (or loads) the data

But res.encoding does return utf-8.

I tried unzipping with gzip but that says it is is not gzipped content.

If i just try to do print(res.content) i get

Traceback (most recent call last):

File "FBChatScraper.py", line 200, in

main()

File "FBChatScraper.py", line 134, in main

fbms.run()

0f\x82\x048\xbb\xb9=\x87\xebK0.\xff\x90\xdd\xeb\xfa\x16\xc6\xbbz\x8b\x82)\xe8\xaaV\x01^\xda\x8b\xbd\x15d-\xb1\x10@\x17\\\xd43\xa8\x92w\xe8\xc0\xcdU\xc4\xff\xc7\xfa\x90\xb2\xb3\xf5\x84\x11u\x0b\t\x8f\x83r\xf3}\xe5!y$\xe6\xf6c0\xf0\xb4\x98\xcat_\x0c\x08\xb5\xdd\x8ctx\x91\xa9\x95\rB%\xe2\x93\xa52\x85_\xa6\x10\xc2\xc9\xa3\xee4SDb\xa5\x18QJ\x83X\x19)\xaa$\xf4\xb4\xb7\x0b\x84\x15&\x88\x08L\xc9iP\xa2\xb9\xf2\xaf\x96\x96N\xd8\xcf=\x05\xc1\x18\x8d\xa0\xf2Y\x8e\n\xcf\xc8\x0fE4\xd6)\xa1\xd4\xb7D\xd6{i\xc8P\x96R\x11HC\xac\xbcKyT#~}\x93\xf7@K\xc7r/\x82\xb0\xe4\xefX\xf9j\x08\xa6Hp\xfcn\x06\xfdo\x9a\xd0wJ\xb4fJ(\x89+\x1c\xf6\x0eOI\x90\xac\x9eDD\xfd,\xa5\xe9\x89\x1blh\x86Z\x98\x05\xdd9\xc7\xf4\x80\xfcY\x8e\xad\xee\x99!\x15\x13+\x9b\x07\xe8Fdj\xfc\x11\xfc\xfe7\x06h\x02\x00@>]W\x92\xc9\x02\xb1c3\x82\xcd\xa4\xefN9\x90\xe6\x81y\x9c\x84er\xd4\xc3\x06\x1c\x06\x14\xcf\xc7\x07hj\xbfH\xdc\xf5~\xf7z\x18Ce\xaf^\x8c\xab \xdfV\xce\xb8\x11\xf8\x06\x03'

Traceback (most recent call last):

File "FBChatScraper.py", line 200, in

main()

File "FBChatScraper.py", line 134, in main

fbms.run()

File "FBChatScraper.py", line 43, in run

thread_contents = self.download_thread(limit, offset, message_timestamp)

File "FBChatScraper.py", line 74, in download_thread

thread_contents = json.loads(res.content)

File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads

s = s.decode(detect_encoding(s), 'surrogatepass')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

oddly printing the content in the middle of the traceback leading me to think there are some invisible characters pushing it down.

I am unable to get the response loaded into a json format because no matter how i handle the response content it isn't properly formatted for json library to interpret.

Moreover if i just do print(res.text) i get garbage:

Traceback (most recent call last):

File "FBChatScraper.py", line 200, in

main()

File "FBChatScraper.py", line 134, in main

fbms.run()

}sP���c���f�u0���\� QZed�C��� M$x�Ҹ�H�����eǘ�]���5���^�*�ӄaM�Y��b���/ڶ�JW/���>H6z�\��l4����t=i��%Ҳu�x��%�x�

F

� f/V�i̴��_��83� �_����*��O��

������Z��i-�TVeaG54�!v�a?ǯ|gu-g��.���"J$�L`&�tΊ#s)�H����s���q���^׷0��[)���j�ॽ�T���U���J�ЁwW���!eg�#j ��r��$y���3�4��4.��M�@Kb�AX�SDb�QJ�X)�,���a� "Sp�h�����sOA0Vé|�������:%�rKdKC���@ M��.�^

� �g���SWQHӳ.��BӄG�,����@E��������

nras��L�/��ch@>]W���c3�ͤ�N9��y��er����hj�H��~�zCe�^�� �Vθ�

Traceback (most recent call last):

File "FBChatScraper.py", line 200, in

main()

File "FBChatScraper.py", line 134, in main

fbms.run()

File "FBChatScraper.py", line 43, in run

thread_contents = self.download_thread(limit, offset, message_timestamp)

File "FBChatScraper.py", line 74, in download_thread

thread_contents = json.loads(res.content)

File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads

s = s.decode(detect_encoding(s), 'surrogatepass')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

EDIT:

MWE as best i can, not sure what data from my post request is private so i left some out

using this data

url_thread = "https://www.messenger.com/api/graphqlbatch/"

request_data = {

"batch_name": "MessengerGraphQLThreadFetcher",

"__user": "",

"__a": "1",

"__dyn": "",

"__req": "9",

'__be' : '-1',

'__pc' : 'PHASED:messengerdotcom_pkg',

"fb_dtsg": "AQFni7TU2nes:AQGSC8FSDqyw",

"ttstamp": "265817254666710077746711957586581715370521181008510710777",

"__rev": "3791607",

"jazoest": "",

"queries": ''

}

headers = {

"authority": "www.messenger.com",

"method": "POST",

"path": "/api/graphqlbatch/",

"scheme": "https",

"accept": "*/*",

"accept-encoding": "gzip, deflate, br",

"accept-language": "en-US,en;q=0.9",

"cache-control": "no-cache",

"content-length": "754",

"content-type" : "application/x-www-form-urlencoded",

"cookie": "",

"origin": "https://www.messenger.com",

"pragma": "no-cache",

"referer": "https://www.messenger.com/t/",

"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"

}

You can get all the by using chrome developer tools and lookng on the network tab for a POST request to Request URL: https://www.messenger.com/api/graphqlbatch/.

Its easy to find if you scroll up to reload old messages while chrome dev tools is recording.

Then put together a simple request with python

import requests as rq

import time

ses = rq.Session()

thread =

conversation_type =

data = request_data

data['messages[{}][{}][offset]'.format(conversation_type, thread)] = 0

data['messages[{}][{}][timestamp]'.format(conversation_type, thread)] = int(time.time())

data['messages[{}][{}][limit]'.format(conversation_type, thread)] = 2000

res = ses.post(url_thread, data=data, headers=headers)

print(res.content)

thread_contents = json.loads(res.content)

print(thread_contents)

As what my dev tools got back you can see the start of the json here

解决方案

The problem is this line in your request headers:

"accept-encoding": "gzip, deflate, br",

That br requests Brotli compression, a new-ish compression standard (see RFC 7932) that Google is pushing to replace gzip on the web. Chrome is asking for Brotli because recent versions of Chrome understand it natively. You're asking for Brotli because you copied the headers from Chrome. But requests doesn't understand Brotli natively.

You can pip install brotli and register the decompresser or just call it manually on res.content. But a simpler solution is to just remove the br:

"accept-encoding": "gzip, deflate",

… and then you should get gzip, which you and requests already know how to handle.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值