Web Scraping in Python: Fixing Chinese Text Displayed as "\uxxxx" in Scraped Responses

Snowbowღ

Article link: https://blog.csdn.net/qq_41297934/article/details/104616428

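I. Problem description

When scraping the web, we often finally get hold of the content we wanted, only to find that the Chinese text in it is displayed as "\uxxxx" escapes. Here is the case I ran into: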

<html>
<head></head>
<body>
    <pre style="word-wrap: break-word; white-space: pre-wrap;">
        {"code":200,
        "msg":"success",
        "data":{
            "recommend_card_num":0,
            "count":{ "all":8,"enable":8,"draft":0,"deleted":0,"private":0,"audit":0,"original":0},
            "list":[{"ArticleId":"104551659","Title":"Python\uff1a\u4e00\u6b21\u767b\u5f55\uff0c\u89e3\u51b3\u722c\u53d6\u6dd8\u5b9d\u5546\u54c1\u8bc4\u4ef7\u7e41\u6742\u7684\u95ee\u9898\u2014\u2014\u7b80\u8ff0 Headers \u7684\u4f7f\u7528","PostTime":"2020\u5e7402\u670828\u65e5 12:28:23","ViewCount":"13","CommentCount":"0","CommentAuth":"2","IsTop":"0","Status":"1","UserName":"qq_41297934","Type":"1","is_vip_article":false,"editor_type":0,"is_recommend":false}],
            "total":8,"list_status":"all","page":1,"size":20}}
    </pre>
</body>
</html>

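As you can see, the content inside the HTML <body> tag is JSON, and the Chinese text in the Title and PostTime fields appears as "\uxxxx" escape sequences. The data is not actually corrupted: JSON allows non-ASCII characters to be written as \uXXXX escapes, so the text only needs to be decoded. Below I present two solutions; pick whichever fits your situation.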
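
To illustrate where such escapes typically come from, here is a minimal sketch using only the standard library. It assumes the server serializes its JSON roughly the way Python's json.dumps does by default; the sample dictionary is made up for illustration.

```python
import json

# With the default ensure_ascii=True, every non-ASCII character is escaped.
escaped = json.dumps({"Title": "中文"})
print(escaped)

# Parsing the escaped text restores the original characters.
decoded = json.loads(escaped)
print(decoded["Title"])

# ensure_ascii=False keeps the characters readable when serializing again.
readable = json.dumps(decoded, ensure_ascii=False)
print(readable)
```

Passing ensure_ascii=False is also useful when you want to save the decoded data back to a file in readable form.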

II. Solutions

(1) The json.loads(str) method

1. Extract the JSON content inside the <body> tag and pass it to json.loads(str), which decodes the escapes while parsing. Example code:

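import json
import re

# The page is saved at ../data/1.html
with open('../data/1.html', 'r', encoding='utf-8') as f:
    html = f.read()

# Strip newlines and spaces so the JSON sits on one line. Note that this also
# removes spaces inside string values, as the result below shows.
compact_html = html.replace('\n', '').replace(' ', '')

# Match the JSON object between the <pre> tags with a regular expression
html_json = re.findall(r'>(\{.+?})<', compact_html)[0]

# json.loads decodes the \uxxxx escapes while parsing; print the resulting dict
print(json.loads(html_json))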

2. Result

{'code': 200, 'msg': 'success', 'data': {'recommend_card_num': 0, 'count': {'all': 8, 'enable': 8, 'draft': 0, 'deleted': 0, 'private': 0, 'audit': 0, 'original': 0}, 'list': [{'ArticleId': '104551659', 'Title': 'Python：一次登录，解决爬取淘宝商品评价繁杂的问题——简述Headers的使用', 'PostTime': '2020年02月28日12:28:23', 'ViewCount': '13', 'CommentCount': '0', 'CommentAuth': '2', 'IsTop': '0', 'Status': '1', 'UserName': 'qq_41297934', 'Type': '1', 'is_vip_article': False, 'editor_type': 0, 'is_recommend': False}], 'total': 8, 'list_status': 'all', 'page': 1, 'size': 20}}
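
The result above has lost the spaces inside the string values (for example between the date and the time in PostTime), because every space was stripped before parsing. A possible variant that keeps them is sketched below. It assumes the JSON always sits inside a single <pre> element; the helper name extract_json and the shortened sample page are made up for illustration.

```python
import json
import re

# Shortened sample of the page from the problem description (assumption:
# the JSON always sits inside a single <pre> element).
html = (
    '<html><head></head><body>'
    '<pre style="word-wrap: break-word; white-space: pre-wrap;">\n'
    '    {"code":200,\n'
    '     "data":{"list":[{"Title":"\\u7b80\\u8ff0 Headers \\u7684\\u4f7f\\u7528",\n'
    '                      "PostTime":"2020\\u5e7402\\u670828\\u65e5 12:28:23"}]}}\n'
    '</pre></body></html>'
)

def extract_json(page):
    """Return the parsed JSON found inside the first <pre> element."""
    # re.DOTALL lets "." match newlines, so the page needs no flattening.
    match = re.search(r'<pre[^>]*>(.*?)</pre>', page, re.DOTALL)
    if match is None:
        raise ValueError('no <pre> element found')
    # json.loads ignores whitespace between tokens and decodes the escapes.
    return json.loads(match.group(1))

article = extract_json(html)['data']['list'][0]
print(article['Title'])
print(article['PostTime'])
```

Since json.loads tolerates whitespace between tokens, only the whitespace inside string values survives, which is exactly the part worth keeping.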

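(2) The eval(expression) method

1. Use a regular expression to collect the "\uxxxx" escapes into a list u_list, then call eval on each element to build the list of decoded characters n_list. Finally, use replace to substitute every escape with its decoded character.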

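import re

with open('../data/2.html', 'r', encoding='utf-8') as f:
    html = f.read()
old_html = html.replace('\n', '').replace(' ', '')

# Find every '\uxxxx' escape (exactly four hex digits) and collect them in u_list
u_list = re.findall(r'(\\u[0-9a-fA-F]{4})', old_html)

# Wrap each escape in quotes so that eval() evaluates it as a string literal,
# yielding the decoded characters in n_list.
# Caution: eval() executes arbitrary code. The strict pattern above keeps it
# safe here, but never pass unfiltered scraped text to it.
n_list = []
for escape in u_list:
    n_list.append(eval("u'" + escape + "'"))

# Walk through u_list and n_list together and replace the escapes in the page
new_html = old_html
for escape, char in zip(u_list, n_list):
    new_html = new_html.replace(escape, char)
print(new_html)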

2. Result

<html><head></head><body><prestyle="word-wrap:break-word;white-space:pre-wrap;">{"code":200,"msg":"success","data":{"recommend_card_num":0,"count":{"all":8,"enable":8,"draft":0,"deleted":0,"private":0,"audit":0,"original":0},"list":[{"ArticleId":"104551659","Title":"Python：一次登录，解决爬取淘宝商品评价繁杂的问题——简述Headers的使用","PostTime":"2020年02月28日12:28:23","ViewCount":"13","CommentCount":"0","CommentAuth":"2","IsTop":"0","Status":"1","UserName":"qq_41297934","Type":"1","is_vip_article":false,"editor_type":0,"is_recommend":false}],"total":8,"list_status":"all","page":1,"size":20}}</pre></body></html>
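
If you would rather not call eval on scraped text at all, the same substitution can be done with re.sub. This is only a sketch under stated assumptions: the helper name decode_u_escapes is made up for illustration, and it handles each four-digit escape on its own, so surrogate pairs for characters outside the Basic Multilingual Plane are not combined (json.loads handles those correctly).

```python
import re

# Matches one \uXXXX escape: a backslash, "u", and exactly four hex digits.
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    """Replace each four-hex-digit u-escape in text with its character."""
    # int(..., 16) turns the hex digits into a code point, chr() into a character.
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Raw string, so the backslashes reach the function unchanged.
sample = r'"PostTime":"2020\u5e7402\u670828\u65e5 12:28:23"'
print(decode_u_escapes(sample))
```

Because the text is rewritten in a single pass, there is no need to flatten the page or to build the intermediate u_list and n_list.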