Parsing JSON data with Spark


Author: 一只勤奋爱思考的猪


Category: Spark massive data analysis; Tags: Everything is a tool, daily big-data learning series

Article link: https://blog.csdn.net/sinat_26566137/article/details/85639255


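1. JSON data format: definition
JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write.

2. Encoding and decoding JSON (two approaches: 2.1 and 2.2)
2.1 Encoding and decoding with the json module: json.dumps and json.loads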

Function	Description
json.dumps	Encodes a Python object as a JSON string
json.loads	Decodes a JSON string into a Python object

2.1.1 Examples of the json functions in Python

(1) json.dumps: encode a Python object as a JSON string

The json.dumps function and its docstring:
def dumps(obj, skipkeys=False, ensure_ascii=True, check_circular=True,
        allow_nan=True, cls=None, indent=None, separators=None,
        default=None, sort_keys=False, **kw):
    """Serialize ``obj`` to a JSON formatted ``str``.

    If ``skipkeys`` is true then ``dict`` keys that are not basic types
    (``str``, ``int``, ``float``, ``bool``, ``None``) will be skipped
    instead of raising a ``TypeError``.

    If ``ensure_ascii`` is false, then the return value can contain non-ASCII
    characters if they appear in strings contained in ``obj``. Otherwise, all
    such characters are escaped in JSON strings.

    If ``check_circular`` is false, then the circular reference check
    for container types will be skipped and a circular reference will
    result in an ``OverflowError`` (or worse).

    If ``allow_nan`` is false, then it will be a ``ValueError`` to
    serialize out of range ``float`` values (``nan``, ``inf``, ``-inf``) in
    strict compliance of the JSON specification, instead of using the
    JavaScript equivalents (``NaN``, ``Infinity``, ``-Infinity``).

    If ``indent`` is a non-negative integer, then JSON array elements and
    object members will be pretty-printed with that indent level. An indent
    level of 0 will only insert newlines. ``None`` is the most compact
    representation.

    If specified, ``separators`` should be an ``(item_separator, key_separator)``
    tuple.  The default is ``(', ', ': ')`` if *indent* is ``None`` and
    ``(',', ': ')`` otherwise.  To get the most compact JSON representation,
    you should specify ``(',', ':')`` to eliminate whitespace.

    ``default(obj)`` is a function that should return a serializable version
    of obj or raise TypeError. The default simply raises TypeError.

    If *sort_keys* is ``True`` (default: ``False``), then the output of
    dictionaries will be sorted by key.

    To use a custom ``JSONEncoder`` subclass (e.g. one that overrides the
    ``.default()`` method to serialize additional types), specify it with
    the ``cls`` kwarg; otherwise ``JSONEncoder`` is used.
    """
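Encoding a Python object as a JSON string (demo 1):
#!/usr/bin/env python3
import json
data = [ { 'a' : 1, 'b' : 2, 'c' : 3, 'd' : 4, 'e' : 5 } ]
# Use a name other than json for the result so the module is not shadowed
json_str = json.dumps(data)
print(json_str)
After json_str = json.dumps(data), json_str holds a string:
'[{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}]'
print(json_str) outputs:
[{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}]
On Python 3.7+ the keys appear in insertion order; older interpreters may emit them in a different order.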
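
Several of the parameters described in the docstring above are easier to see in action. The following is a minimal sketch, assuming Python 3; the sample record and the helper to_serializable are hypothetical names introduced only for illustration.

```python
import json
from datetime import date

# Hypothetical sample record: a non-ASCII string, a date, and a tuple key.
record = {"name": "数据", "created": date(2019, 1, 2), ("x", "y"): "tuple key"}

def to_serializable(obj):
    """Fallback passed as default=; this sketch only handles dates."""
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

# skipkeys=True drops the tuple key instead of raising TypeError.
# default= converts the date. ensure_ascii=True (the default) escapes non-ASCII.
escaped = json.dumps(record, skipkeys=True, default=to_serializable)
readable = json.dumps(record, skipkeys=True, default=to_serializable,
                      ensure_ascii=False)
print(escaped)
print(readable)

# allow_nan=False makes NaN an error, in line with the JSON specification.
try:
    json.dumps(float("nan"), allow_nan=False)
    nan_rejected = False
except ValueError:
    nan_rejected = True
print("NaN rejected:", nan_rejected)
```

With ensure_ascii left at its default the Chinese characters are written as \u escapes, while ensure_ascii=False keeps them readable, which is usually what you want when the output is stored as UTF-8.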

Encoding a Python object as a JSON string with formatted output, here an indent of 4 (demo 2):
print(json.dumps({'a': 'Runoob', 'b': 7}, sort_keys=True, indent=4, separators=(',', ': ')))
Output:
{
    "a": "Runoob",
    "b": 7
}
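In this call, sort_keys controls the ordering of dictionary keys: True sorts them in ascending order, while False (the default) leaves them in the dict's own order, which is insertion order on Python 3.7+ and arbitrary on older versions. indent=4 indents each nesting level by 4 spaces. separators=(',', ': ') puts no space after the comma, so the pretty-printed lines carry no trailing whitespace; since Python 3.4 this is already the default whenever indent is set.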
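
To make the effect of each formatting parameter visible on its own, here is a small sketch (assuming Python 3.7+, so that dict order is insertion order) that serializes the same dictionary four ways.

```python
import json

data = {"b": 7, "a": "Runoob"}

default_form = json.dumps(data)                          # separators default to (', ', ': ')
compact_form = json.dumps(data, separators=(",", ":"))   # no whitespace at all
sorted_form = json.dumps(data, sort_keys=True)           # keys in ascending order
pretty_form = json.dumps(data, sort_keys=True, indent=4) # one member per line

for label, text in [("default", default_form), ("compact", compact_form),
                    ("sorted", sorted_form), ("pretty", pretty_form)]:
    print(f"{label}:\n{text}")
```

The compact form is the one to prefer when the string is sent over the network or written to large files; the pretty form is mainly for people reading the output.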

(2) json.loads: decode a JSON string into a Python object

JSON type	Python type
object	dict
array	list
string	str (unicode in Python 2)
number (int)	int (int, long in Python 2)
number (real)	float
true	True
false	False
null	None
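
The mapping in the table can be checked directly. This sketch, assuming Python 3, decodes one document that contains every JSON type and prints the Python type of each value; the key names are made up for the example.

```python
import json

# One value of every JSON type; the key names are illustrative only.
doc = ('{"obj": {"k": "v"}, "arr": [1, 2], "text": "hi", "whole": 3, '
       '"real": 1.5, "yes": true, "no": false, "nothing": null}')

decoded = json.loads(doc)
for key, value in decoded.items():
    print(f"{key:8} -> {type(value).__name__}")
```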

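import json
json_data = '{"a":1,"b":2,"c":3,"d":4,"e":5}'
text = json.loads(json_data)
print(text)
The value of text is the dict: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
print outputs: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
(Python 3.7+ keeps the key order of the input; older interpreters may show a different order.)
Note: printing a non-string Python object shows it as is,
   while printing a string hides the outermost quotes:
print("{'d': 4, 'c': 3, 'a': 1, 'e': 5, 'b': 2}")
Output: {'d': 4, 'c': 3, 'a': 1, 'e': 5, 'b': 2}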
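
Because a printed dict and a printed string can look identical, it helps to inspect the type or use repr when debugging. A minimal sketch, assuming Python 3:

```python
import json

text = json.loads('{"a": 1, "b": 2}')  # a dict
as_string = str(text)                   # a string that merely looks like a dict

print(text)             # the dict, shown as is
print(as_string)        # same characters, outer quotes hidden
print(repr(as_string))  # repr keeps the quotes, revealing that it is a string
print(type(text).__name__, type(as_string).__name__)
```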

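2.2 Encoding and decoding with a third-party library: Demjson
Demjson is a third-party Python module for encoding and decoding JSON data; it also includes the formatting and validation features of JSONLint.
GitHub: https://github.com/dmeranda/demjson
Official site: http://deron.meranda.us/python/demjson/
Installation: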

$ tar -xvzf demjson-2.2.3.tar.gz
$ cd demjson-2.2.3
$ python setup.py install

Function	Description
encode	Encodes a Python object as a JSON string
decode	Decodes a JSON string into a Python object

Docstring of the encode function:
    r"""Encodes a Python object into a JSON-encoded string.

    * 'strict'    (Boolean, default False)

        If 'strict' is set to True, then only strictly-conforming JSON
        output will be produced.  Note that this means that some types
        of values may not be convertible and will result in a
        JSONEncodeError exception.

    * 'compactly'    (Boolean, default True)

        If 'compactly' is set to True, then the resulting string will
        have all extraneous white space removed; if False then the
        string will be "pretty printed" with whitespace and
        indentation added to make it more readable.

    * 'encode_namedtuple_as_object'  (Boolean or callable, default True)

        If True, then objects of type namedtuple, or subclasses of
        'tuple' that have an _asdict() method, will be encoded as an
        object rather than an array.
        It can also be a predicate function that takes a namedtuple
        object as an argument and returns True or False.

    * 'indent_amount'   (Integer, default 2)

        The number of spaces to output for each indentation level.
        If 'compactly' is True then indentation is ignored.

    * 'indent_limit'    (Integer or None, default None)

        If not None, then this is the maximum limit of indentation
        levels, after which further indentation spaces are not
        inserted.  If None, then there is no limit.

    CONCERNING CHARACTER ENCODING:

    The 'encoding' argument should be one of:

        * None - The return will be a Unicode string.
        * encoding_name - A string which is the name of a known
              encoding, such as 'UTF-8' or 'ascii'.
        * codec - A CodecInfo object, such as found by codecs.lookup().
              This allows you to use a custom codec as well as those
              built into Python.

    If an encoding is given (either by name or by codec), then the
    returned value will be a byte array (Python 3), or a 'str' string
    (Python 2); which represents the raw set of bytes.  Otherwise,
    if encoding is None, then the returned value will be a Unicode
    string.

    The 'escape_unicode' argument is used to determine which characters
    in string literals must be \u escaped.  Should be one of:

        * True  -- All non-ASCII characters are always \u escaped.
        * False -- Try to insert actual Unicode characters if possible.
        * function -- A user-supplied function that accepts a single
             unicode character and returns True or False; where True
             means to \u escape that character.

    Regardless of escape_unicode, certain characters will always be
    \u escaped. Additionally any characters not in the output encoding
    repertoire for the encoding codec will be \u escaped as well.

    """
Using the encode function (demo):
import demjson
data = [ { 'a' : 1, 'b' : 2, 'c' : 3, 'd' : 4, 'e' : 5 } ]
json_str = demjson.encode(data)
print(json_str)
The value of json_str is the string: '[{"a":1,"b":2,"c":3,"d":4,"e":5}]'
print outputs: [{"a":1,"b":2,"c":3,"d":4,"e":5}]

Docstring of the decode function:
"""Decodes a JSON-encoded string into a Python object.

    == Optional arguments ==

    * 'encoding'  (string, default None)

       This argument provides a hint regarding the character encoding
       that the input text is assumed to be in (if it is not already a
       unicode string type).

       If set to None then autodetection of the encoding is attempted
       (see discussion above). Otherwise this argument should be the
       name of a registered codec (see the standard 'codecs' module).

    * 'strict'    (Boolean, default False)

        If 'strict' is set to True, then those strings that are not
        entirely strictly conforming to JSON will result in a
        JSONDecodeError exception.

    * 'return_errors'    (Boolean, default False)

        Controls the return value from this function. If False, then
        only the Python equivalent object is returned on success, or
        an error will be raised as an exception.

        If True then a 2-tuple is returned: (object, error_list). The
        error_list will be an empty list [] if the decoding was
        successful, otherwise it will be a list of all the errors
        encountered.  Note that it is possible for an object to be
        returned even if errors were encountered.

    * 'return_stats'    (Boolean, default False)

        Controls whether statistics about the decoded JSON document
        are returned (an instance of decode_statistics).

        If True, then the stats object will be added to the end of the
        tuple returned.  If return_errors is also set then a 3-tuple
        is returned, otherwise a 2-tuple is returned.

    * 'write_errors'    (Boolean OR File-like object, default False)

        Controls what to do with errors.

        - If False, then the first decoding error is raised as an exception.
        - If True, then errors will be printed out to sys.stderr.
        - If a File-like object, then errors will be printed to that file.

        The write_errors and return_errors arguments can be set
        independently.

    * 'filename_for_errors'   (string or None)

        Provides a filename to be used when writing error messages.

    * 'allow_xxx', 'warn_xxx', and 'forbid_xxx'    (Booleans)

        These arguments allow for fine-adjustments to be made to the
        'strict' argument, by allowing or forbidding specific
        syntaxes.

        There are many of these arguments, named by replacing the
        "xxx" with any number of possible behavior names (See the JSON
        class for more details).

        Each of these will allow (or forbid) the specific behavior,
        after the evaluation of the 'strict' argument.  For example,
        if strict=True then by also passing 'allow_comments=True' then
        comments will be allowed.  If strict=False then
        forbid_comments=True will allow everything except comments.

    Unicode decoding:
    -----------------
    The input string can be either a python string or a python unicode
    string (or a byte array in Python 3).  If it is already a unicode
    string, then it is assumed that no character set decoding is
    required.

    However, if you pass in a non-Unicode text string (a Python 2
    'str' type or a Python 3 'bytes' or 'bytearray') then an attempt
    will be made to auto-detect and decode the character encoding.
    This will be successful if the input was encoded in any of UTF-8,
    UTF-16 (BE or LE), or UTF-32 (BE or LE), and of course plain ASCII
    works too.
    
    Note though that if you know the character encoding, then you
    should convert to a unicode string yourself, or pass it the name
    of the 'encoding' to avoid the guessing made by the auto
    detection, as with

        python_object = demjson.decode( input_bytes, encoding='utf8' )
    
    Callback hooks:
    ---------------
    You may supply callback hooks by using the hook name as the
    named argument, such as:
        decode_float=decimal.Decimal

    See the hooks documentation on the JSON.set_hook() method.

    """
Using the decode function (demo):
import demjson
json_str = '{"a":1,"b":2,"c":3,"d":4,"e":5}'
text = demjson.decode(json_str)
print(text)
The value of text is a dict object; on Python 3 it is {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
print outputs: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
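
The decode docstring mentions callback hooks such as decode_float=decimal.Decimal and a return_errors option. Since demjson may not be installed everywhere, the sketch below uses the standard library instead, which is a different API: json.loads takes parse_float for a similar purpose, and the hypothetical helper safe_loads only loosely imitates the idea of returning an error instead of raising (demjson itself returns a list of errors and may still return a partial object).

```python
import json
from decimal import Decimal

raw = '{"price": 0.1, "qty": 3}'

# Standard-library counterpart of a float hook: parse_float receives the
# original number text, so Decimal avoids binary floating-point rounding.
as_float = json.loads(raw)
as_decimal = json.loads(raw, parse_float=Decimal)
print(type(as_float["price"]).__name__, type(as_decimal["price"]).__name__)

def safe_loads(text):
    """Hypothetical helper: return (object, error), with error None on success."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, exc

obj, err = safe_loads('{"a": 1,}')  # a trailing comma is invalid in strict JSON
print(obj, "|", err)
```

Parsing into Decimal is mainly useful for money or other values where the exact decimal text must be preserved; integers are unaffected by parse_float.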