A character-encoding problem when scraping pages with Python
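While scraping Weibo pages with Python, I ran into the error UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 73-73: Non-BMP character not supported in Tk. I looked up some material online, but most of it was fairly complicated. I have since solved the problem in a fairly simple way, so I am sharing what I learned through my own example.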
Original code
from urllib.parse import urlencode
from pyquery import PyQuery as pq
import requests

base_url = 'https://m.weibo.cn/api/container/getindex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/3908167020',
    'User-Agent': 'Mozilla/5.0 (Macintosh;Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    # Request one page of the timeline and return the decoded JSON (None on failure).
    params = {
        'type': 'uid',
        'value': '1989519725',
        'containerid': '1076031989519725',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)

def parse_page(json):
    # Yield one dict per post with its id, plain text, and counters.
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            weibo = {}
            weibo['id'] = item.get('id')
            # The text field is an HTML fragment; pyquery strips the tags.
            weibo['text'] = pq(item.get('text')).text()
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)
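Running this raises the UnicodeEncodeError quoted at the top of this post.
The cause of the error:
The text we scrape contains characters outside the Basic Multilingual Plane (code points above U+FFFF, emoji for example). They are valid Unicode, but the Tk-based console the script prints to (IDLE, for instance) cannot display them, so print() fails. (This is my own understanding; if I have it wrong, corrections are welcome!)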
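To check whether a given piece of text would trigger this error, it helps to list the characters whose code points lie above U+FFFF. The following is a minimal sketch; the helper name find_non_bmp and the sample string are made up for illustration and are not part of the scraper.

```python
# Hypothetical sample; the emoji stands in for what a Weibo post might contain.
sample = "Nice weather today \U0001F600"

def find_non_bmp(text):
    """Return (index, character) pairs for code points above U+FFFF."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

offenders = find_non_bmp(sample)
# Print code points as hex so the check itself never writes a non-BMP character.
print([(i, hex(ord(ch))) for i, ch in offenders])
```

An empty list means the text can be printed as it is.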
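The fix:
- Add import sys alongside the other imports.
- Define non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd) right after the imports. It maps every code point above U+FFFF to U+FFFD, the Unicode replacement character.
- Translate whatever is about to be printed (result in this example): just before printing result, add the statement result = str(result).translate(non_bmp_map).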
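The effect of these steps can be seen in isolation, without touching the network. This is a small sketch; the result dict below is a made-up record shaped like the ones parse_page yields, not real scraped data.

```python
import sys

# Map every code point above U+FFFF to U+FFFD (the replacement character).
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xFFFD)

# Hypothetical record; the emoji is the kind of character that triggers the error.
result = {'id': '1', 'text': 'hello \U0001F600', 'comments': 0, 'reposts': 0}

safe = str(result).translate(non_bmp_map)
# ascii() escapes non-ASCII characters, so this line prints on any console.
print(ascii(safe))
```

After the translation every character in safe lies within the BMP, so a Tk-based console can print it.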
New code
from urllib.parse import urlencode
from pyquery import PyQuery as pq
import requests
import sys

# Map every code point outside the BMP to U+FFFD, the replacement character.
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

base_url = 'https://m.weibo.cn/api/container/getindex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/3908167020',
    'User-Agent': 'Mozilla/5.0 (Macintosh;Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    # Request one page of the timeline and return the decoded JSON (None on failure).
    params = {
        'type': 'uid',
        'value': '1989519725',
        'containerid': '1076031989519725',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)

def parse_page(json):
    # Yield one dict per post with its id, plain text, and counters.
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            weibo = {}
            weibo['id'] = item.get('id')
            # The text field is an HTML fragment; pyquery strips the tags.
            weibo['text'] = pq(item.get('text')).text()
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo

if __name__ == '__main__':
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            # Replace non-BMP characters so the Tk-based console can print the line.
            result = str(result).translate(non_bmp_map)
            print(result)
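With that, the problem is solved! Each character outside the BMP is now printed as the replacement character U+FFFD instead of raising an error.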
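One side effect of str(result).translate(...) is that result becomes a string rather than a dict. If the individual fields are still needed afterwards, a variant along the following lines keeps the dict and translates only its string values. The helper name strip_non_bmp is hypothetical and the record is made-up data; treat this as a sketch rather than part of the original fix.

```python
import sys

non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xFFFD)

def strip_non_bmp(record):
    """Return a copy of record with non-BMP characters replaced in its string values."""
    return {
        key: value.translate(non_bmp_map) if isinstance(value, str) else value
        for key, value in record.items()
    }

# Hypothetical record shaped like the dicts yielded by parse_page.
weibo = {'id': '1', 'text': 'hello \U0001F600', 'comments': 3, 'reposts': 0}
cleaned = strip_non_bmp(weibo)
print(ascii(cleaned))
```

The original dict is left untouched, and non-string values such as the counters pass through unchanged.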
References
1: https://stackoverflow.com/questions/32442608/ucs-2-codec-cant-encode-characters-in-position-1050-1050.
2: https://www.2cto.com/kf/201805/748337.html.
3: https://blog.csdn.net/qq_16272049/article/details/79492020.