头哥: Implementing number and string parsing in a JSON parser

孤儿屯鼠鼠之友

Last modified 2023-03-31 01:32:36

First published 2023-03-28 14:00:23

Article link: https://blog.csdn.net/m0_62124085/article/details/129814677

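This article presents an implementation of a JSON parser, including the functions that read numbers and strings. The read_num function walks the string until the number ends and returns the number token and the end position. The read_str function handles strings, taking the content between the double quotes as the string value. The overall task is to convert a JSON string into a Python dict or list, using tokenizer to produce tokens and parse_json to parse them recursively.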

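Level 1: Implementing number and string parsing in a JSON parser

(A thoroughly annoying lab; here I'm sharing how to write the report.)

Task description

Task for this level: write a JSON parser that can parse numbers and strings.

Programming requirements

Following the hints, complete the code in the editor on the right: implement two functions that, together with the functions already provided, parse numbers and strings.

Test notes

The platform tests the code you write: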

Test input: {"name":"小明","age":14,"gender":true,"grade":null,"skills":["JavaScript","Java"]} Expected output: {'name': '小明', 'age': 14, 'gender': True, 'grade': None, 'skills': ['JavaScript', 'Java']}

Test input: {"name":} Expected output: Exception: Unexpected Token at position 3
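
As a cross-check on the first expected output, Python's standard json module can serve as a reference: for valid input, the hand-written parser should build the same structure that json.loads does. This is only a sanity check under that assumption, not part of the lab code.

```python
import json

# First test input from the task description.
raw = '{"name":"小明","age":14,"gender":true,"grade":null,"skills":["JavaScript","Java"]}'

# json.loads maps true/false/null to True/False/None, as the lab parser should.
reference = json.loads(raw)
print(reference)

# The second test input is invalid JSON, so the reference parser rejects it too.
try:
    json.loads('{"name":}')
    rejected = False
except json.JSONDecodeError:
    rejected = True
print(rejected)
```

The error message from json.loads differs from the one the lab expects; only the accept/reject decision is comparable.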

Complete code

from typing import List
from enum import Enum

"""
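Global constants representing the token types that JSON defines
(students may instead represent this structure with a dictionary):
BEGIN_OBJECT ({)
END_OBJECT (})
BEGIN_ARRAY ([)
END_ARRAY (])
NULL (null)
NUMBER (number)
STRING (string)
BOOLEAN (true/false)
SEP_COLON (:)
SEP_COMMA (,)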
"""

# Signal token
BEGIN_OBJECT = 1
BEGIN_ARRAY = 2
END_OBJECT = 4
END_ARRAY = 8

# variable token
NULL_TOKEN = 16
NUMBER_TOKEN = 32
STRING_TOKEN = 64
BOOL_TOKEN = 128

# separator token
COLON_TOKEN = 256
COMMA_TOKEN = 512

# end signal
END_JSON = 65536

# json index
json_index = 0

def token_parse(json_str: str, json_index: int) -> (tuple, int):
    """
    Lexical analysis: read one token
    :param json_str: the input JSON string
    :param json_index: current position in the JSON string
    :return: the token that was read and the position just past it
    """
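    def read_num(json_index: int):
        """
        Read a number
        :param json_index: position in the JSON string where the number starts
        :return: the number token and the position just past the number
        """
        ## your code goes here ##
        # Accept an optional leading minus sign, then consume digits only,
        # so a following ',', ']' or '}' is left for the next token.
        end = json_index
        if json_str[end] == '-':
            end += 1
        while end < len(json_str) and json_str[end].isdigit():
            end += 1
        rem = json_str[json_index:end]
        if rem in ('', '-'):
            raise Exception('Unexpected Token at position %d' % json_index)
        return (NUMBER_TOKEN, rem), end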
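    def read_str(json_index: int):
        """
        Read a string
        :param json_index: position of the opening double quote
        :return: the string token and the position just past the closing quote
        """
        ## your code goes here ##
        # Find the closing quote; escape sequences are not handled here.
        end = json_str.find('"', json_index + 1)
        if end == -1:
            raise Exception('Unexpected Token at position %d' % json_index)
        rem = json_str[json_index + 1:end]
        return (STRING_TOKEN, rem), end + 1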
        
    def read_null():
        """
        Read null
        :return: the null token and the position just past it
        """
        rem = json_str[json_index: json_index + 4]
        return (NULL_TOKEN, rem), json_index + 4

    def read_bool(s: str):
        """
        Read true or false
        :param s: first character of the literal ('t' or 'f')
        :return: the boolean token and the position just past it
        """
        if s == 't':
            rem = json_str[json_index: json_index + 4]
            return (BOOL_TOKEN, rem), json_index + 4
        else:
            rem = json_str[json_index: json_index + 5]
            return (BOOL_TOKEN, rem), json_index + 5


    if json_index == len(json_str):
        return (END_JSON, None), json_index
    elif json_str[json_index] == '{':
        return (BEGIN_OBJECT, json_str[json_index]), json_index + 1
    elif json_str[json_index] == '}':
        return (END_OBJECT, json_str[json_index]), json_index + 1
    elif json_str[json_index] == '[':
        return (BEGIN_ARRAY, json_str[json_index]), json_index + 1
    elif json_str[json_index] == ']':
        return (END_ARRAY, json_str[json_index]), json_index + 1
    elif json_str[json_index] == ',':
        return (COMMA_TOKEN, json_str[json_index]), json_index + 1
    elif json_str[json_index] == ':':
        return (COLON_TOKEN, json_str[json_index]), json_index + 1
    elif json_str[json_index] == 'n':
        return read_null()
    elif json_str[json_index] == 't' or json_str[json_index] == 'f':
        return read_bool(json_str[json_index])
    elif json_str[json_index] == '"':
        return read_str(json_index)
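    elif json_str[json_index].isdigit() or json_str[json_index] == '-':
        return read_num(json_index)
    else:
        # Without this branch the function would return None on unknown input.
        raise Exception('Unexpected Token at position %d' % json_index)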


def tokenizer(json_str: str) -> list:
    """
    Build the token sequence
    :param json_str: the input JSON string
    :return: the list of tokens, ending with the END_JSON token
    """
    json_index = 0
    tk, cur_index = token_parse(json_str, json_index)
    token_list = []
    generate_tokenlist(token_list, tk)
    while tk[0] != END_JSON:
        tk, cur_index = token_parse(json_str, cur_index)
        generate_tokenlist(token_list, tk)
    return token_list


def generate_token(tokentype: int, tokenvalue: str) -> tuple:
    """
    Build a token
    :param tokentype: the token's type
    :param tokenvalue: the token's value
    :return: the token
    """
    token = (tokentype, tokenvalue)
    return token


def generate_tokenlist(tokenlist: list, token: tuple) -> list:

    tokenlist.append(token)
    return tokenlist


def parse_json(tokenlist: list):

    def check_token(expected: int, actual: int):
        if expected & actual == 0:
            raise Exception('Unexpected Token at position %d' % json_index)

    def parse_json_array():
        """
        Parse a JSON array
        :return: the parsed array as a Python list
        """
        global json_index
        # Use a fresh list per call so nested arrays do not share one list.
        array = []
        expected = BEGIN_ARRAY | END_ARRAY | BEGIN_OBJECT | NULL_TOKEN | NUMBER_TOKEN | BOOL_TOKEN | STRING_TOKEN

        while json_index != len(tokenlist):
            json_index += 1
            token = tokenlist[json_index]
            # token_type -> TokenEnum
            token_type = token[0]
            token_value = token[1]
            check_token(expected, token_type)

            # check through each condition
            if token_type == BEGIN_OBJECT:
                array.append(parse_json_object())
                expected = COMMA_TOKEN | END_ARRAY
            elif token_type == BEGIN_ARRAY:
                array.append(parse_json_array())
                expected = COMMA_TOKEN | END_ARRAY
            elif token_type == END_ARRAY:
                return array
            elif token_type == NULL_TOKEN:
                array.append(None)
                expected = COMMA_TOKEN | END_ARRAY
            elif token_type == NUMBER_TOKEN:
                array.append(int(token_value))
                expected = COMMA_TOKEN | END_ARRAY
            elif token_type == STRING_TOKEN:
                # print("array-------------array")
                array.append(token_value)
                expected = COMMA_TOKEN | END_ARRAY
            elif token_type == BOOL_TOKEN:
                token_value = token_value.lower().capitalize()
                array.append({'True': True, 'False': False}[token_value])
                expected = COMMA_TOKEN | END_ARRAY
            elif token_type == COMMA_TOKEN:
                expected = BEGIN_ARRAY | BEGIN_OBJECT | STRING_TOKEN | BOOL_TOKEN | NULL_TOKEN | NUMBER_TOKEN
            elif token_type == END_JSON:
                return array
            else:
                raise Exception('Unexpected Token at position %d' % json_index)

    def parse_json_object():
        """
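        Parse a JSON object
        :return: the parsed object as a Python dict
        """
        global json_index
        expected = STRING_TOKEN | END_OBJECT
        key = None
        # Use a fresh dict per call so nested objects do not share one dict.
        obj = {}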
        while json_index != len(tokenlist):
            json_index += 1
            token = tokenlist[json_index]
            token_type = token[0]
            token_value = token[1]
            # print("expected: ", expected, "token_type: ", token_type, "token_value: ", token_value)
            check_token(expected, token_type)
            if token_type == BEGIN_OBJECT:
                obj.update({key: parse_json_object()})
                expected = COMMA_TOKEN | END_OBJECT
            elif token_type == END_OBJECT:
                return obj
            elif token_type == BEGIN_ARRAY:
                # print("join array")
                obj.update({key: parse_json_array()})
                expected = COMMA_TOKEN | END_OBJECT
            elif token_type == NULL_TOKEN:
                obj.update({key: None})
                expected = COMMA_TOKEN | END_OBJECT
            elif token_type == STRING_TOKEN:
                pre_token = tokenlist[json_index - 1]
                pre_token_value = pre_token[0]
                # print(pre_token_value)
                if pre_token_value == COLON_TOKEN:
                    value = token[1]
                    obj.update({key: value})
               #      print("----------")
                    expected = COMMA_TOKEN | END_OBJECT
                else:
                    key = token[1]
                    expected = COLON_TOKEN
               #     print("+++++++++")

            elif token_type == NUMBER_TOKEN:
                obj.update({key: int(token_value)})
                expected = COMMA_TOKEN | END_OBJECT
            elif token_type == BOOL_TOKEN:
                token_value = token_value.lower().capitalize()
                obj.update({key: {'True': True, 'False': False}[token_value]})
                expected = COMMA_TOKEN | END_OBJECT
            elif token_type == COLON_TOKEN:
                expected = NULL_TOKEN | NUMBER_TOKEN | BOOL_TOKEN | STRING_TOKEN | BEGIN_ARRAY | BEGIN_OBJECT
            elif token_type == COMMA_TOKEN:
                expected = STRING_TOKEN
            elif token_type == END_JSON:
                return obj
            else:
                raise Exception('Unexpected Token at position %d' % json_index)
    array = []
    obj = {}
    global json_index
    if tokenlist[0][0] == BEGIN_OBJECT:
        return parse_json_object()
    elif tokenlist[0][0] == BEGIN_ARRAY:
        return parse_json_array()
    else:
        raise Exception('Illegal Token at position %d' % json_index)


if __name__ == "__main__":
    raw_data = input()
    try:
        # Tokenize inside the try block so lexical errors are reported too.
        jlist = tokenizer(raw_data)
        jdict = parse_json(jlist)
        print(jdict)
    except Exception as result:
        print(result)

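How to write the lab report

First, the two functions you are asked to fill in:

read_num handles numbers in the JSON string. It starts at the number's first character and scans forward until the number ends, then slices that span out of json_str. It does not convert the text itself: it returns a (NUMBER_TOKEN, text) token together with the position just past the number, and parse_json later converts the text with int(). A robust version also accepts a leading minus sign and stops at the first non-digit character.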
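
The scanning step described for read_num can be sketched on its own. This is a minimal illustration, assuming integers only; scan_number and its parameters are hypothetical names, not part of the lab skeleton.

```python
def scan_number(text: str, start: int) -> tuple:
    """Return (number_text, end), where end is the index just past the number."""
    end = start
    # Optional leading minus sign.
    if end < len(text) and text[end] == '-':
        end += 1
    digits_start = end
    # Consume ASCII digits only, so ',', ']' or '}' stays for the next token.
    while end < len(text) and text[end] in '0123456789':
        end += 1
    if end == digits_start:
        raise ValueError('expected a digit at position %d' % digits_start)
    return text[start:end], end


print(scan_number('{"age":14,"n":-7}', 7))   # ('14', 9)
print(scan_number('{"age":14,"n":-7}', 14))  # ('-7', 16)
```

Stopping at the first non-digit, rather than at the next comma, matters for the last element of an array or object, where the number is followed by ] or }.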

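read_str handles strings in the JSON string. token_parse only calls it when the current character is a double quote, so it starts at the opening quote, scans forward to the next double quote, and takes the characters between the two quotes as the string value. It returns a (STRING_TOKEN, text) token and the position just past the closing quote.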
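
The same idea for read_str, as a standalone sketch. scan_string is a hypothetical name, and the sketch assumes the string contains no escape sequences such as an escaped double quote.

```python
def scan_string(text: str, start: int) -> tuple:
    """Return (content, end) for the string whose opening quote is at start."""
    if text[start] != '"':
        raise ValueError('expected an opening quote at position %d' % start)
    # Look for the closing quote after the opening one.
    close = text.find('"', start + 1)
    if close == -1:
        raise ValueError('unterminated string starting at position %d' % start)
    return text[start + 1:close], close + 1


print(scan_string('{"name":"Tom"}', 1))  # ('name', 7)
print(scan_string('{"name":"Tom"}', 8))  # ('Tom', 13)
```

Checking for a missing closing quote keeps the returned position inside the input, so the caller never indexes past the end of the string.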

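The main function parse_json(tokenlist: list) takes the token list produced by tokenizer() and returns a Python dict or list. Its core idea is recursion: parse_json_object and parse_json_array call each other whenever a value is itself an object or an array, as described below.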
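
Inside parse_json, the listing decides which token may come next with a bitmask: every token type is a distinct power of two, so a set of allowed types can be combined with | and tested with &. A small sketch of that check, where is_expected is a hypothetical helper (the listing's check_token raises instead of returning a bool):

```python
# Token types as distinct powers of two, mirroring the constants in the listing.
END_OBJECT = 4
STRING_TOKEN = 64
COLON_TOKEN = 256
COMMA_TOKEN = 512


def is_expected(expected: int, actual: int) -> bool:
    """True if the actual token type is one of the types OR-ed into expected."""
    return (expected & actual) != 0


# After a key, only ':' may follow.
after_key = COLON_TOKEN
# After a value, either ',' or '}' may follow.
after_value = COMMA_TOKEN | END_OBJECT

print(is_expected(after_key, COLON_TOKEN))   # True
print(is_expected(after_key, END_OBJECT))    # False
print(is_expected(after_value, END_OBJECT))  # True
```

This is why the second test input fails at position 3: after the colon the parser expects a value, and the } token is not in that mask.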

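tokenizer(): the entry point that turns the input JSON string into a token sequence. It first calls token_parse() to get the first token and appends it to token_list, then loops, calling token_parse() for the next token and appending it, until it reaches the end of the JSON string. Finally it returns token_list.

generate_token(): builds a token from a token type and a token value and returns it as a tuple.

generate_tokenlist(): appends a token to the token list. Its parameters are the token list and the token to add, and it returns the updated list.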
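
The loop that tokenizer() runs can be sketched in isolation. To keep it short, this version only recognizes punctuation; next_token is a hypothetical stand-in for token_parse, and the numeric codes mirror the constants in the listing.

```python
END_JSON = 65536
PUNCTUATION = {'{': 1, '[': 2, '}': 4, ']': 8, ':': 256, ',': 512}


def next_token(text: str, index: int) -> tuple:
    """Hypothetical stand-in for token_parse that only knows punctuation."""
    if index == len(text):
        return (END_JSON, None), index
    ch = text[index]
    if ch not in PUNCTUATION:
        raise ValueError('unexpected character %r at position %d' % (ch, index))
    return (PUNCTUATION[ch], ch), index + 1


def tokenize(text: str) -> list:
    """Collect (type, value) tokens until the END_JSON token has been appended."""
    tokens = []
    index = 0
    while True:
        token, index = next_token(text, index)
        tokens.append(token)
        if token[0] == END_JSON:
            return tokens


print(tokenize('[{},{}]'))
```

Each call returns both the token and the next position, so the loop needs no shared state beyond the index it threads through.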

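The sub-function parse_value(json_str: str) is recursive: it looks at the first character of the string to decide how to parse, and returns a Python object. (The listing above does not define parse_value; the same first-character dispatch appears there in token_parse, which returns tokens rather than Python objects.) The rules are as follows.

If the string starts with '{', it is a JSON object; call parse_object() to parse it.

If the string starts with '[', it is a JSON array; call parse_array() to parse it.

If the string starts with '"', it is a JSON string; call parse_string() to parse it.

If the string starts with 't', it is the JSON boolean true; return True.

If the string starts with 'f', it is the JSON boolean false; return False.

If the string starts with 'n', it is the JSON null value; return None.

If the string starts with a digit, it is a JSON number; call parse_number() to parse it.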
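
The dispatch rules above can be sketched as a small recursive-descent parser that works directly on the string instead of on a token list. This is an illustration under stated assumptions, not the lab's code: the extra index parameter and return value are additions, strings and numbers are handled inline rather than in parse_string() and parse_number(), and whitespace, escapes and non-integer numbers are not supported.

```python
def parse_value(s: str, i: int = 0) -> tuple:
    """Parse one JSON value starting at s[i]; return (python_value, next_index)."""
    if i >= len(s):
        raise ValueError('unexpected end of input')
    ch = s[i]
    if ch == '{':
        return parse_object(s, i)
    if ch == '[':
        return parse_array(s, i)
    if ch == '"':
        end = s.find('"', i + 1)  # no escape handling in this sketch
        if end == -1:
            raise ValueError('unterminated string at position %d' % i)
        return s[i + 1:end], end + 1
    for literal, value in (('true', True), ('false', False), ('null', None)):
        if s.startswith(literal, i):
            return value, i + len(literal)
    if ch == '-' or ch in '0123456789':
        end = i + 1
        while end < len(s) and s[end] in '0123456789':
            end += 1
        return int(s[i:end]), end  # integers only, like the listing
    raise ValueError('unexpected character at position %d' % i)


def parse_array(s: str, i: int) -> tuple:
    """Parse '[' value (',' value)* ']' starting at s[i]."""
    items = []
    i += 1  # skip '['
    if i < len(s) and s[i] == ']':
        return items, i + 1
    while True:
        value, i = parse_value(s, i)
        items.append(value)
        if i < len(s) and s[i] == ',':
            i += 1
        elif i < len(s) and s[i] == ']':
            return items, i + 1
        else:
            raise ValueError('expected , or ] at position %d' % i)


def parse_object(s: str, i: int) -> tuple:
    """Parse '{' key ':' value (',' key ':' value)* '}' starting at s[i]."""
    result = {}
    i += 1  # skip '{'
    if i < len(s) and s[i] == '}':
        return result, i + 1
    while True:
        if i >= len(s) or s[i] != '"':
            raise ValueError('expected a string key at position %d' % i)
        key, i = parse_value(s, i)
        if i >= len(s) or s[i] != ':':
            raise ValueError('expected : at position %d' % i)
        value, i = parse_value(s, i + 1)
        result[key] = value
        if i < len(s) and s[i] == ',':
            i += 1
        elif i < len(s) and s[i] == '}':
            return result, i + 1
        else:
            raise ValueError('expected , or } at position %d' % i)


value, end = parse_value('{"a":[1,-2,{"b":null}],"ok":true}')
print(value)  # {'a': [1, -2, {'b': None}], 'ok': True}
```

Each function creates its own list or dict, so nested arrays and objects never share a container with their parent.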

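The sub-function parse_object(json_str: str) parses a JSON object and returns a Python dict (in the listing above, the corresponding function is parse_json_object). It works as follows.

First, create an empty dict.

Then use a while loop to parse the JSON string piece by piece.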