Author: Ma, Guolai
Story
JSON data comprises a large majority of the content sent around the internet, especially for social networking sites and HTML5 games. Suppose someone runs a search (such as "search for #museum" on Twitter): the response contains a large collection of dictionary objects, each one listing the same property names alongside its values, which makes the response much larger than necessary. Looking at most data of this kind, it's apparent there is a structural regularity that we can take advantage of for compression.
JSON
What's JSON
At the beginning of the 21st century, Douglas Crockford was looking for a simple data exchange format between client and server. XML was the popular data exchange language at the time, but Douglas thought it was too complex to generate and parse. In the end, he proposed a new one: JSON.
JSON is really simple: it has a self-documenting format and is much shorter because there is no markup overhead. That is why JSON is considered a fat-free alternative to XML.
JSON is built on two structures:
· A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
· An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
Object:
An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma).
Array:
An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).
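For example, the two structures nest freely. A minimal sketch using the standard `JSON` API available in browsers and Node.js:

```javascript
// An object ({...}) whose "members" property is an ordered array ([...])
// of further objects — the two JSON structures composed together.
const text =
  '{"club":"chess","members":[{"name":"Ann","age":31},{"name":"Bob","age":27}]}';

const doc = JSON.parse(text);     // text -> in-memory structure
console.log(doc.members[1].name); // → "Bob"
```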
Advantage
JSON used to have an advantage in that it could be parsed directly by a JavaScript engine, but even that advantage is gone because of security and interoperability concerns. About the only things JSON has going for it are that it is usually more compact than the alternative, XML, and that it is well supported by many web programming languages.
Problem
Though it is one of the most used data interchange formats, there is still room for improvement.
For one, it converts everything to text. The value 3.141592653589793 takes only 8 bytes of memory, but JSON.stringify() expands it to 17.
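This is easy to verify in any JavaScript runtime:

```javascript
// A double occupies 8 bytes in memory, but serializing it as text
// expands it to 17 characters (17 bytes in UTF-8).
const s = JSON.stringify(3.141592653589793);
console.log(s);        // "3.141592653589793"
console.log(s.length); // 17
```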
A second problem is its excessive use of quotes, which add two bytes to every string.
Thirdly, it has no standard way to use a schema. When multiple objects are serialized in the same message, the key names for each property must be repeated, even though they are the same for every object.
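The repetition is easy to see when serializing an array of similar objects:

```javascript
// Serializing an array of same-shaped objects repeats every key name
// once per object — pure overhead when the schema is fixed.
const rows = [
  { id: 1, name: "a" },
  { id: 2, name: "b" },
  { id: 3, name: "c" },
];
const json = JSON.stringify(rows);
// "id" and "name" each appear three times in the output:
console.log((json.match(/"id"/g) || []).length); // 3
```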
In today's technologically inclined world, the amount of data shared via the internet grows day by day, and more compact, reliable encodings are needed so that we can transfer data faster while using less bandwidth.
Compression of JSON data is useful when large data structures must be transmitted from the web browser to the server. Certainly, compression costs extra effort. However, given the current trend of ever-faster processors and the surge of mobile devices where bandwidth is the limiting factor, CJSON should prove to be a better technique than plain JSON.
In any case, if your application makes heavy use of JSON, then you should consider what it's costing you, and figure out how to minimize that cost.
Compression Algorithms
Here you'll find an analysis of several JSON compression algorithms, and a conclusion about whether JSON compression is useful and when it should be used.
CJSON
CJSON compresses JSON with automatic type extraction. It tackles the most pressing problem: the need to constantly repeat key names over and over.
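For instance, a homogeneous array can be reduced to a shared template of key names plus rows of values. The sketch below illustrates that idea with a hypothetical `packHomogeneous` helper; CJSON's actual envelope format differs in its details.

```javascript
// Illustrative sketch of CJSON-style template extraction (not the exact
// CJSON wire format): key names are stored once in a template, and each
// object collapses to a row of values.
const input = [
  { x: 100, y: 100, width: 200, height: 150 },
  { x: 0,   y: 0,   width: 10,  height: 10  },
];

function packHomogeneous(objects) {
  const template = Object.keys(objects[0]);              // shared key set
  const values = objects.map(o => template.map(k => o[k]));
  return { t: template, v: values };                     // hypothetical envelope
}

const packed = packHomogeneous(input);
// { t: ["x","y","width","height"], v: [[100,100,200,150],[0,0,10,10]] }
console.log(JSON.stringify(packed).length < JSON.stringify(input).length); // true
```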
More details about the CJSON algorithm can be found here.
HPACK
HPack is a lossless, cross-language, performance-focused data set compressor. It can reduce the number of characters used to represent a generic homogeneous collection by up to 70%. The algorithm provides several levels of compression (from 0 to 4). Level 0 performs the most basic compression, removing the keys (property names) from the structure and placing them in a header at index 0. Higher levels reduce the size of the JSON even further by assuming that there are duplicated entries.
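For instance, applying the level-0 idea to a small homogeneous collection (an illustrative sketch of the header-at-index-0 layout described above; the real json.hpack output may differ in detail):

```javascript
// Level-0 sketch: header of property names at index 0, then one row of
// values per object.
const input = [
  { name: "Andrea", age: 31 },
  { name: "Eva",    age: 27 },
];

const header = Object.keys(input[0]);
const packed = [header, ...input.map(o => header.map(k => o[k]))];
// packed: [["name","age"],["Andrea",31],["Eva",27]]

// Unpacking restores the original objects:
const [keys, ...valueRows] = packed;
const restored = valueRows.map(r =>
  Object.fromEntries(r.map((v, i) => [keys[i], v]))
);
console.log(JSON.stringify(restored) === JSON.stringify(input)); // true
```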
More details about the hpack algorithm can be found here.
JSONPACK
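As far as I know, jsonpack works by building a dictionary of the distinct keys and values and encoding the structure as references into it. The sketch below illustrates that dictionary idea with hypothetical helpers (`ref`, `body`); jsonpack's actual output is a compact string format, not this object.

```javascript
// Dictionary-encoding sketch: each distinct key or value is stored once
// in `dict`, and the structure is rewritten as index references.
const input = [
  { city: "Paris", country: "France" },
  { city: "Lyon",  country: "France" },
];

const dict = [];
const ref = (v) => {                 // return index of v, adding it if new
  let i = dict.indexOf(v);
  if (i === -1) { i = dict.length; dict.push(v); }
  return i;
};

const body = input.map(o =>
  Object.entries(o).flatMap(([k, v]) => [ref(k), ref(v)])
);
console.log(dict); // ["city","Paris","country","France","Lyon"]
console.log(body); // [[0,1,2,3],[0,4,2,3]]
```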
More details about the jsonpack algorithm can be found here.
Experiment
Experiment environment:
· CPU - Intel Core i5
· OS - Windows 7 (64-bit)
· Memory - 4GB
Experiment algorithms:
· CJSON
· Hpack
· JSONPack
· Gzipped
· Gzipped and CJSON
· Gzipped and HPack
· Gzipped and JSONPack
Experiment data:
· 5 files exported from json-generator, varying from 3K to 5MB
· 1 file exported from a real webapp
Experiment result:
· compression ratio
· time cost
Table 1: Compressed size in bytes (smaller is better)

| Original JSON size | 3K | 50K | 1M | 2M | 5M | webapp |
|---|---|---|---|---|---|---|
| CJSON | 2517 | 45865 | 1106816 | 2018653 | 4760795 | 142224 |
| HPack | 2176 | 41403 | 1028647 | 1877062 | 3912005 | failed |
| JSONPack | 2517 | 47091 | 1129400 | 2060855 | 5037766 | 142113 |
| Gzipped | 1336 | 15459 | 290125 | 527030 | 1486170 | 51145 |
| Gzipped and CJSON | 1333 | 15116 | 286833 | 521021 | 1430745 | 51025 |
| Gzipped and HPack | 1132 | 12412 | 208088 | 376125 | 632593 | failed |
| Gzipped and JSONPack | 1450 | 17957 | 321494 | 586758 | 1765638 | 52451 |
Table 2: Time cost, pack + unpack (ms)

| Original JSON size | 3K | 50K | 1M | 2M | 5M | webapp |
|---|---|---|---|---|---|---|
| CJSON | 5 | 15 | 82 | 181 | 767 | 29 |
| HPack | 4 | 2 | 7 | 11 | 24 | failed |
| JSONPack | 18 | 77 | 1535 | 4251 | 89220 | 34 |
| Gzipped | 24 | 54 | 1023 | 1840 | 4905 | 352 |
| Gzipped and CJSON | 35 | 100 | 1081 | 1860 | 5257 | 339 |
| Gzipped and HPack | 45 | 50 | 839 | 2106 | 4826 | failed |
| Gzipped and JSONPack | 76 | 151 | 2527 | 6535 | 94362 | 350 |
Analysis and Conclusion
JSON compression algorithms considerably reduce JSON file size. HPack appears much more efficient than CJSON and JSONPack, and is also significantly faster. However, HPack only works when all dictionary objects in the array have exactly the same set of property names, which limits its use because JSON structures on the internet are often more complex than that.
Beyond doubt, gzip is the default choice of most developers, and it doesn't disappoint: when gzipping the content is allowed, it is more efficient than any of the other compression algorithms alone. Its one disadvantage is time cost, since gzip's extra processing increases the burden on both server and client.
The CJSON compression algorithm looks nice; however, when you gzip the minified and the non-minified versions, the file sizes differ by only about 5%. This is because the Huffman encoding in gzip deals quite well with repeated text structures. Still, applying CJSON before gzip can reduce gzip's workload and overall time.
Objectively speaking, there is no single best compression algorithm for all cases, only the most appropriate one for a specific domain. When all dictionary objects in the JSON array have exactly the same set of property names (which, I think, covers most requirements), HPack is the best choice. When the structure is complex but the CPU is powerful, "Gzipped + CJSON" does better than the others.
I've implemented my own JSON compression code based on the CJSON algorithm. Here is the source code.
* The copyright and/or intellectual property of this article belongs to eBay Inc. To quote it, please contact us at DL-eBay-CCOE-Tech@ebay.com. This article is intended for academic discussion and exchange. If you believe any information in it infringes your legal rights, please contact us at DL-eBay-CCOE-Tech@ebay.com, listing in your notice the information required by national laws and regulations, and we will take measures as soon as possible in accordance with the law after receiving your notice.