Notes on Reading and Writing Chinese Characters in JSON with Python 3

六开箱

Tags: Python, web scraping, Chinese, json

This is an original article by the author; do not repost it without the author's permission.

Article link: https://blog.csdn.net/RandomParty/article/details/80030734

Contents

1. Background
- 1.1 Context
- 1.2 Goal
2. Code Example

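1. Background

1.1 Context

I recently enrolled in Professor Song Tian's (嵩天) web scraping course on China University MOOC (中国大学MOOC) and learned a lot, so I decided to practice on a university news site. Sure enough, I ran into the notorious problem of handling Chinese characters, which I am recording here.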

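1.2 Goal

This article covers how to convert a Python dict into a JSON object and a list into a JSON array when the data contains Chinese text, how to write that data to a file correctly, and how to read it back into memory for reuse. The key point is the handling of Chinese characters. This was a persistent headache in Python 2; Python 3 improves matters, but a few settings are still needed when reading and writing. The code in the next section points out each of them.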
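To make the dict-to-object and list-to-array conversion concrete, here is a minimal sketch using only the standard json module. The sample data and variable names are made up for illustration and are not from the scraped site. It shows what the ensure_ascii flag changes in the output text.

```python
import json

# Hypothetical sample data standing in for scraped news items.
news_item = {"title": "校园新闻", "date": "2018-04-21"}
news_list = [news_item, {"title": "学术讲座", "date": "2018-04-22"}]

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(news_item)

# ensure_ascii=False: Chinese characters are kept as-is in the output text.
readable = json.dumps(news_item, ensure_ascii=False)

print(escaped)
print(readable)

# A dict is serialized as a JSON object, a list as a JSON array.
array_text = json.dumps(news_list, ensure_ascii=False)
print(array_text)

# Both forms are valid JSON and decode back to the same Python data.
assert json.loads(escaped) == json.loads(readable) == news_item
```

Either form loads back correctly; ensure_ascii=False mainly matters when the file should stay readable to a human or to tools that do not decode the escapes.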

2. Code Example

With the problem stated, let's go straight to the code.

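# -*- coding: utf-8 -*-
import json
import re

import requests
from bs4 import BeautifulSoup

newsList = []
r = requests.get('http://news.hunnu.edu.cn/sdxw.htm')
# Let requests guess the encoding from the page content
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, 'lxml')
newsInfoTags = soup.find_all(id=re.compile('lineu'))
for item in newsInfoTags:
    elem = item.contents
    # Create a new dict for every news item. Reusing one dict across
    # iterations would make every list entry refer to the same object,
    # so all entries would end up holding the last item's values.
    newsDict = {}
    # bs4.element.NavigableString is a subclass of str, so json accepts it
    # directly; wrapping the values in str() is optional.
    newsDict['tag'] = elem[3].string
    newsDict['title'] = elem[5].string
    newsDict['date'] = elem[7].string
    newsList.append(newsDict)

# Write the data to a JSON file. Pass encoding='utf-8' explicitly: the
# default file encoding depends on the platform and may fail on Chinese text.
with open('newsList.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False keeps Chinese characters readable in the file;
    # without it they are written as \uXXXX escapes (still valid JSON).
    json.dump(newsList, f, ensure_ascii=False)

# Read the data back from the JSON file, using the same encoding
with open('newsList.json', 'r', encoding='utf-8') as f:
    objs = json.load(f)
    print(len(objs))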
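
The write-then-read round trip can also be checked without any network access. The following is a minimal sketch, assuming made-up sample data in place of the scraped list and a temporary file in place of newsList.json. It uses the built-in open with an explicit encoding, and inspects the raw bytes to confirm how the Chinese text is stored.

```python
import json
import os
import tempfile

# Hypothetical sample data; in the article this would be the scraped list.
sample = [{"tag": "综合新闻", "title": "示例标题", "date": "2018-04-21"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")

    # Write with an explicit encoding so the result does not depend on
    # the platform's default file encoding.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)

    # Read the data back with the same encoding.
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

    # Read the raw bytes to see how the text was actually stored.
    with open(path, "rb") as f:
        raw = f.read()

# The loaded data equals the original, and the title is stored as UTF-8 bytes.
round_trip_ok = loaded == sample
stored_as_utf8 = "示例标题".encode("utf-8") in raw
print(round_trip_ok, stored_as_utf8)
```

If the file were opened for reading with a different encoding than it was written with, the read step would typically raise UnicodeDecodeError or produce garbled text, which is why the encoding is spelled out on both sides.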