simdjson - A high-performance JSON parser for C++

simdjson: Parsing gigabytes of JSON per second

JSON is everywhere on the Internet. Servers spend a *lot* of time parsing it. We need a fresh approach. The simdjson library uses commonly available SIMD instructions and microparallel algorithms to parse JSON 2.5x faster than anything else out there.

Fast: Over 2.5x faster than other production-grade JSON parsers.

Easy: First-class, easy to use API.

Strict: Full JSON and UTF-8 validation, lossless parsing. Performance with no compromises.

Automatic: Selects a CPU-tailored parser at runtime. No configuration needed.

Reliable: From memory allocation to error handling, simdjson's design avoids surprises.
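The "Automatic" point above can be pictured with a small runtime-dispatch sketch. This is not simdjson's actual implementation: the names kernel_fn, implementation, cpu_has_avx2 and active_implementation are hypothetical, the "avx2" kernel is a scalar placeholder so the sketch builds everywhere, and CPU detection assumes the GCC/Clang x86 builtin __builtin_cpu_supports (other platforms simply take the fallback).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical kernel type: any routine with this signature can be swapped in.
using kernel_fn = std::size_t (*)(std::string_view);

// Portable fallback kernel: counts space characters one byte at a time.
inline std::size_t count_spaces_fallback(std::string_view s) {
  std::size_t n = 0;
  for (char c : s) n += (c == ' ');
  return n;
}

// Placeholder for a wide-vector kernel. A real one would use AVX2 intrinsics;
// here it reuses the scalar code so that the sketch compiles on any target.
inline std::size_t count_spaces_avx2(std::string_view s) {
  return count_spaces_fallback(s);
}

// One selectable implementation: a label plus the kernel it provides.
struct implementation {
  const char* name;
  kernel_fn count_spaces;
};

// Assumption: GCC or Clang on x86. Everything else reports "no AVX2".
inline bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;
#endif
}

// Detection runs once, on first use; later calls reuse the cached choice.
inline const implementation& active_implementation() {
  static const implementation chosen =
      cpu_has_avx2() ? implementation{"avx2", count_spaces_avx2}
                     : implementation{"fallback", count_spaces_fallback};
  return chosen;
}
```

A caller would write active_implementation().count_spaces("a b c") and never name a specific kernel, which is the sense in which no configuration is needed.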

This library is part of the Awesome Modern C++ list.

Quick Start

The simdjson library is easily consumable with a single .h and .cpp file.

Prerequisites: g++ (version 7 or better) or clang++ (version 6 or better), and a 64-bit system with a command-line shell (e.g., Linux, macOS, FreeBSD). We also support programming environments like Visual Studio and Xcode, but different steps are needed.

Pull simdjson.h and simdjson.cpp into a directory, along with the sample file twitter.json:

wget https://raw.githubusercontent.com/simdjson/simdjson/master/singleheader/simdjson.h https://raw.githubusercontent.com/simdjson/simdjson/master/singleheader/simdjson.cpp https://raw.githubusercontent.com/simdjson/simdjson/master/jsonexamples/twitter.json

Create quickstart.cpp:

#include <iostream>
#include "simdjson.h"

int main(void) {
  simdjson::dom::parser parser;
  // load() reads and parses the file; a failure surfaces as an exception here.
  simdjson::dom::element tweets = parser.load("twitter.json");
  std::cout << tweets["search_metadata"]["count"] << " results." << std::endl;
}

Compile and run it:

c++ -o quickstart quickstart.cpp simdjson.cpp

./quickstart
100 results.

Documentation

Usage documentation is available:

Basics is an overview of how to use simdjson and its APIs.

Performance shows some more advanced scenarios and how to tune for them.

Implementation Selection describes runtime CPU detection and how you can work with it.

API contains the automatically generated API documentation.

Performance results

The simdjson library uses three-quarters fewer instructions than the state-of-the-art parser RapidJSON and fifty percent fewer than sajson. To our knowledge, simdjson is the first fully-validating JSON parser to run at gigabytes per second (GB/s) on commodity processors. It can parse millions of JSON documents per second on a single core.

The following figure represents parsing speed in GB/s for parsing various files on an Intel Skylake processor (3.4 GHz) using the GNU GCC 9 compiler (with the -O3 flag). We compare against the best and fastest C++ libraries. The simdjson library offers full Unicode (UTF-8) validation and exact number parsing. The RapidJSON library is tested in two modes: fast and exact number parsing. The sajson library offers fast (but not exact) number parsing and partial Unicode validation. In this data set, the file sizes range from 65KB (github_events) all the way to 3.3GB (gsoc-2018). Many files consist mostly of numbers (canada, mesh.pretty, mesh, random and numbers); in such instances, we see lower JSON parsing speeds due to the high cost of number parsing. The simdjson library uses exact number parsing, which is particularly taxing.
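To make "full UTF-8 validation" concrete, here is a minimal scalar sketch of the checks a strict validator has to perform: continuation bytes, overlong encodings, surrogates, and the U+10FFFF ceiling. It is a plain byte-at-a-time loop written for illustration, not simdjson's vectorized validator, and the function name is_valid_utf8 is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Returns true when `s` is well-formed UTF-8.
inline bool is_valid_utf8(std::string_view s) {
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) {  // ASCII: one byte, always valid
      ++i;
      continue;
    }
    std::size_t len;      // total bytes in this sequence
    std::uint32_t cp;     // code point being decoded
    std::uint32_t min_cp; // smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0) {
      len = 2; cp = b0 & 0x1F; min_cp = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
      len = 3; cp = b0 & 0x0F; min_cp = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
      len = 4; cp = b0 & 0x07; min_cp = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (n - i < len) return false;  // sequence truncated at end of input
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(s[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_cp) return false;                  // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                // beyond Unicode range
    i += len;
  }
  return true;
}
```

Partial validation, as attributed to sajson above, would presumably skip some of these checks; which ones is not stated here.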

On a Skylake processor, the parsing speeds (in GB/s) of various parsers on the twitter.json file are as follows, again using GNU GCC 9.1 (with the -O3 flag). The popular JSON for Modern C++ library is particularly slow: it evidently trades parsing speed for other desirable features.

| parser                                | GB/s |
|---------------------------------------|------|
| simdjson                              | 2.5  |
| RapidJSON UTF8-validation             | 0.29 |
| RapidJSON UTF8-valid., exact numbers  | 0.28 |
| RapidJSON insitu, UTF8-validation     | 0.41 |
| RapidJSON insitu, UTF8-valid., exact  | 0.39 |
| sajson (insitu, dynamic)              | 0.62 |
| sajson (insitu, static)               | 0.88 |
| dropbox                               | 0.13 |
| fastjson                              | 0.27 |
| gason                                 | 0.59 |
| ultrajson                             | 0.34 |
| jsmn                                  | 0.25 |
| cJSON                                 | 0.31 |
| JSON for Modern C++ (nlohmann/json)   | 0.11 |
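The figures above can be turned into speedup ratios with a few lines of arithmetic. The sketch below copies the twitter.json results and computes how far simdjson is ahead of the fastest other parser. The docs_per_second helper is a back-of-the-envelope estimate only: it assumes 1 GB = 10^9 bytes and that throughput stays the same regardless of document size, which is not measured here. All names are hypothetical.

```cpp
#include <string>
#include <vector>

// One row of the twitter.json results: parser name and throughput in GB/s.
struct result {
  std::string parser;
  double gb_per_s;
};

// The measurements listed above, copied verbatim.
inline std::vector<result> twitter_results() {
  return {
      {"simdjson", 2.5},
      {"RapidJSON UTF8-validation", 0.29},
      {"RapidJSON UTF8-valid., exact numbers", 0.28},
      {"RapidJSON insitu, UTF8-validation", 0.41},
      {"RapidJSON insitu, UTF8-valid., exact", 0.39},
      {"sajson (insitu, dynamic)", 0.62},
      {"sajson (insitu, static)", 0.88},
      {"dropbox", 0.13},
      {"fastjson", 0.27},
      {"gason", 0.59},
      {"ultrajson", 0.34},
      {"jsmn", 0.25},
      {"cJSON", 0.31},
      {"JSON for Modern C++ (nlohmann/json)", 0.11},
  };
}

// Throughput of the named parser, or 0.0 if it is not in the list.
inline double throughput_of(const std::vector<result>& rows,
                            const std::string& name) {
  for (const result& r : rows)
    if (r.parser == name) return r.gb_per_s;
  return 0.0;
}

// Highest throughput among all parsers other than `name`.
inline double best_other(const std::vector<result>& rows,
                         const std::string& name) {
  double best = 0.0;
  for (const result& r : rows)
    if (r.parser != name && r.gb_per_s > best) best = r.gb_per_s;
  return best;
}

// Rough documents-per-second estimate (assumes 1 GB = 1e9 bytes).
inline double docs_per_second(double gb_per_s, double doc_bytes) {
  return gb_per_s * 1e9 / doc_bytes;
}
```

Dividing throughput_of(rows, "simdjson") by best_other(rows, "simdjson") gives 2.5 / 0.88, a little over 2.8, which is consistent with the "over 2.5x faster" claim. Under the stated assumptions, docs_per_second(2.5, 300.0) comes to roughly 8.3 million documents per second.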

The simdjson library offers high speed whether it processes tiny files (e.g., 300 bytes) or larger files (e.g., 3MB). The following plot presents parsing speed for synthetic files over various sizes generated with a script on a 3.4 GHz Skylake processor (GNU GCC 9, -O3).

Real-world usage

If you are planning to use simdjson in a product, please work from one of our releases.

Bindings and Ports of simdjson

We distinguish between "bindings" (which just wrap the C++ code) and a port to another programming language (which reimplements everything).

ZippyJSON: Swift bindings for the simdjson project.

pysimdjson: Python bindings for the simdjson project.

simdjson-rs: Rust port.

simdjson-rust: Rust wrapper (bindings).

SimdJsonSharp: C# version for .NET Core (bindings and full port).

simdjson_nodejs: Node.js bindings for the simdjson project.

simdjson_php: PHP bindings for the simdjson project.

simdjson_ruby: Ruby bindings for the simdjson project.

simdjson-go: Go port using Golang assembly.

rcppsimdjson: R bindings.

About simdjson

The simdjson library takes advantage of modern microarchitectures, parallelizing with SIMD vector instructions, reducing branch misprediction, and reducing data dependency to take advantage of each CPU's multiple execution cores.
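As a rough illustration of the branch-reducing, data-parallel style described above, the sketch below counts occurrences of one byte (for example the JSON quote character) eight bytes at a time using ordinary 64-bit arithmetic. This is the SWAR ("SIMD within a register") technique in portable C++, not real SIMD intrinsics and not code from simdjson; the names match_mask and count_byte are hypothetical.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Returns a word holding 0x80 in every byte lane of `word` that equals `c`
// and 0x00 in every other lane. No branches, no cross-lane carries.
inline std::uint64_t match_mask(std::uint64_t word, unsigned char c) {
  const std::uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;
  // XOR with `c` broadcast to all lanes: matching lanes become 0x00.
  const std::uint64_t x = word ^ (0x0101010101010101ULL * c);
  // High bit of each lane is set when its low 7 bits are nonzero.
  const std::uint64_t t = (x & low7) + low7;
  // A lane is zero exactly when neither t nor x has its high bit set.
  return ~(t | x | low7);
}

// Counts how many bytes of `s` equal `c`, processing 8 bytes per step.
inline std::size_t count_byte(std::string_view s, char c) {
  const unsigned char target = static_cast<unsigned char>(c);
  std::size_t count = 0;
  std::size_t i = 0;
  for (; i + 8 <= s.size(); i += 8) {
    std::uint64_t word;
    std::memcpy(&word, s.data() + i, 8);  // safe unaligned load
    count += std::bitset<64>(match_mask(word, target)).count();
  }
  // Tail: fewer than 8 bytes remain, handle them one at a time.
  for (; i < s.size(); ++i)
    count += (static_cast<unsigned char>(s[i]) == target);
  return count;
}
```

The inner loop contains no data-dependent branch, so its running time does not depend on where the quotes fall. Real vector instructions apply the same idea to 16, 32 or 64 bytes at once.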

Some people enjoy reading our paper: A description of the design and implementation of simdjson is in our research article: Geoff Langdale, Daniel Lemire, Parsing Gigabytes of JSON per Second, VLDB Journal 28 (6), 2019.

For the video inclined, there is a recorded talk on simdjson (it was the best voted talk, we're kinda proud of it).

Funding

The work is supported by the Natural Sciences and Engineering Research Council of Canada under grant number RGPIN-2017-03910.

Contributing to simdjson

Head over to CONTRIBUTING.md for information on contributing to simdjson, and HACKING.md for information on source, building, and architecture/design.

License

This code is made available under the Apache License 2.0.

Under Windows, we build some tools using the windows/dirent_portable.h file (which is outside our library code): it is under the liberal (business-friendly) MIT license.

For compilers that do not support C++17, we bundle the string-view library which is published under the Boost license (http://www.boost.org/LICENSE_1_0.txt). Like the Apache license, the Boost license is a permissive license allowing commercial redistribution.

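1. Wrapped a few custom functions, such as move_to_root, array_get_length and array_move_to_index, so that fewer X64Call invocations are needed.
2. Added a simple implementation of value lookup by path, for paths such as [0].A.B[0].C.

Now for the problems encountered and some impressions:

1. My test data is about 96MB and parses fine on my machine. Larger input (e.g., 128MB) crashes, with the crash located in ParsedJson.allocateCapacity; I could not work out why (friendly reminder: at this size, a SAX-style approach is the better choice).
2. Besides that, there is a known and rather subtle bug, apparently caused by the print_ function: after static compilation, if print_ recursively prints an Object such as [0] in the demo, clicking parse again crashes in iterator_free. If values are only read, there is no crash.
3. This library copies data, which is not good practice for very long inputs. It feels more like a research project; compared with the battle-hardened established libraries, speed may currently be its only advantage.
4. Owing to my machine or other limitations, I could not reach the gigabytes per second claimed in the promotional article when calling it from 易语言, though a few hundred MB/s is achievable.
5. Since the library copies data when parsing, I am not sure whether 64-bit memory addresses can be produced. For now the pointer to the text is treated as 32-bit, but I am not confident about this and hope the eWOW64Ext author can take a look when time permits... @shier2817 thanks!
6. The library is built with SDK version 10.0.17134.0 using /MT, but it can no longer support Windows XP. It does not compile with older SDK versions; I am not familiar with these instructions, so I did not investigate the cause (perhaps they are simply unsupported; see MSDN for details).
7. On build modes: building with MinSizeRel causes double values to be read incorrectly. I did not dig into the cause, so Release is used by default. I will attach the files produced by three build modes for study: RelWithDebInfo, MinSizeRel, Release.

Modules used:

1. Thanks to eWOW64Ext: https://bbs.125.la/thread-14322538-1-1.html
2. Jβec: https://bbs.125.la/thread-14069145-1-1.html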