Performance 2: Faster, more memory-efficient Python JSON parsing with msgspec

Latest recommended article published on 2025-05-28 20:48:46

李星星BruceL


Category: Automated Testing. Tags: python, json, programming languages

Article link: https://blog.csdn.net/liluo0815481/article/details/146153852

Starting point: the built-in `json` module and `orjson`

The input file, large.json, is a JSON array of GitHub event records that look like this:

[{"id":"2489651045","type":"CreateEvent","actor":{"id":665991,"login":"petroav","gravatar_id":"","url":"https://api.github.com/users/petroav","avatar_url":"https://avatars.githubusercontent.com/u/665991?"},"repo":{"id":28688495,"name":"petroav/6.828","url":"https://api.github.com/repos/petroav/6.828"},"payload":{"ref":"master","ref_type":"branch","master_branch":"master","description":"Solution to homework and assignments from MIT's 6.828 (Operating Systems Engineering). Done in my spare time.","pusher_type":"user"},"public":true,"created_at":"2015-01-01T15:00:00Z"},
...
]

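Our goal is to find out which repositories a given user interacted with.

Here is an implementation that uses the json module built into Python's standard library: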

import json

with open("large.json", "r") as f:
    data = json.load(f)

user_to_repos = {}
for record in data:
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
print(len(user_to_repos), "records")
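
As a side note, the grouping loop can be written a little more compactly with `collections.defaultdict`. The following is only a minimal sketch and is not part of the benchmarked script: the helper name `group_repos_by_user` and the inline sample records (including the repository names other than `petroav/6.828`) are made up for illustration.

```python
from collections import defaultdict


def group_repos_by_user(records):
    """Map each user login to the set of repository names they touched."""
    user_to_repos = defaultdict(set)
    for record in records:
        user_to_repos[record["actor"]["login"]].add(record["repo"]["name"])
    return dict(user_to_repos)


# Hypothetical sample records with the same shape as the entries in large.json.
sample = [
    {"actor": {"login": "petroav"}, "repo": {"name": "petroav/6.828"}},
    {"actor": {"login": "petroav"}, "repo": {"name": "petroav/other"}},
    {"actor": {"login": "someone"}, "repo": {"name": "someone/project"}},
]
result = group_repos_by_user(sample)
print(len(result), "records")
```

The behavior is the same as the explicit `if user not in user_to_repos` check; `defaultdict(set)` just creates the empty set on first access.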

Here is the orjson version, which changes only two lines:

import orjson

with open("large.json", "rb") as f:
    data = orjson.loads(f.read())

user_to_repos = {}
for record in data:
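    # Same as the standard-library code:
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
print(len(user_to_repos), "records")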

Here is how much memory and time the two approaches use:

$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python stdlib.py 
5250 records
RAM: 136464 KB, Elapsed: 0:00.42
$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python with_orjson.py 
5250 records
RAM: 113676 KB, Elapsed: 0:00.28

Memory usage is similar (about 136 MB vs. 114 MB), but orjson is faster: 280 ms instead of 420 ms.

Next, let's look at msgspec.
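
The measurements above come from `/usr/bin/time`. If that tool is not available, a rough equivalent can be taken from inside the process. The sketch below is an assumption of mine, not how the article measured: it uses the Unix-only `resource` module, the helper name `peak_rss_kb` is hypothetical, and the elapsed time excludes interpreter startup, so the numbers will not match `/usr/bin/time` exactly.

```python
import json
import resource  # Unix-only standard-library module
import sys
import time


def peak_rss_kb():
    """Peak resident set size of this process, in KB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in KB; macOS reports it in bytes.
    return peak // 1024 if sys.platform == "darwin" else peak


start = time.perf_counter()
# Tiny inline stand-in for reading and parsing large.json.
data = json.loads(
    '[{"actor": {"login": "petroav"}, "repo": {"name": "petroav/6.828"}}]'
)
elapsed = time.perf_counter() - start
print(f"RAM: {peak_rss_kb()} KB, Elapsed: {elapsed:.4f}s")
```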

`msgspec`: schema-based JSON decoding and encoding

Here is the corresponding code using msgspec; as you can see, it takes a different approach to parsing:

from msgspec.json import decode
from msgspec import Struct

class Repo(Struct):
    name: str

class Actor(Struct):
    login: str

class Interaction(Struct):
    actor: Actor
    repo: Repo

with open("large.json", "rb") as f:
    data = decode(f.read(), type=list[Interaction])

user_to_repos = {}
for record in data:
    user = record.actor.login
    repo = record.repo.name
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
print(len(user_to_repos), "records")

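This code is longer and more verbose, because msgspec lets you define a schema for the records you are parsing.

Usefully, you do not have to define a schema for every field. Each JSON record has many fields (see the sample record earlier for the full set), but we only tell msgspec about the ones we actually care about.

Here is the result of parsing with msgspec: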

$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python with_msgspec.py 
5250 records
RAM: 38612 KB, Elapsed: 0:00.09

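That is much faster (90 ms) and uses far less memory (about 39 MB).

To summarize the three options we have seen, plus a streaming solution based on ijson:

Package	Time	Memory	Fixed memory use	Schema
Stdlib `json`	420ms	136MB	❌	❌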
`orjson`	280ms	114MB	❌	❌
`ijson`	300ms	14MB	✓	❌
`msgspec`	90ms	39MB	❌	✓

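The streaming solution uses only a fixed amount of memory while parsing; memory usage for the other solutions scales with the input size. Of those three, msgspec uses significantly less memory, and it is by far the fastest solution overall.

Pros and cons of schema-based parsing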

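Because msgspec lets you specify a schema, we were able to create Python objects only for the fields we actually care about. That means lower memory usage and faster decoding: no time or memory is wasted creating thousands of Python objects we will never look at.

We also get schema validation for free. If a record is missing a field, or a value has the wrong type (an integer instead of a string, say), the parser raises an error. With the standard-library json module, schema validation has to be done as a separate step.

On the other hand: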

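Memory usage during decoding still scales with the size of the input file. A streaming JSON parser like ijson still has the advantage of fixed memory usage during parsing, no matter how large the input file is.
Specifying a schema takes more coding effort, and gives you less flexibility when dealing with imperfect data.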

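Learning more about `msgspec`

msgspec has other features too, such as encoding and support for MessagePack (a faster alternative format to JSON). If you parse JSON files regularly and are running into performance or memory problems, or if you simply want built-in schemas, consider giving it a try.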
