A Brief Look at JSON, Part 2

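The first post in my "A Brief Look at JSON" series is here:
https://blog.csdn.net/dongguanting/article/details/115267289
If you are interested in me or in NLP-related topics, you are welcome to follow the personal blog I have just set up:
https://dongguanting.github.io/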



Preface

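In the previous post I covered JSON with very complex nesting. Today I want to talk about another kind of JSON, one that people working in NLP will probably run into often: the data volume is huge, but every dict has the same keys. The dataset is: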

The 5.8 million Baidu Zhidao Q&A corpus.

A few words about the data:
The dataset comes from Baidu Zhidao. By crawling Baidu Zhidao, the project built a Q&A database with millions of entries. In detail:

1. Number of questions: 5.83 million.
2. Number of Q&A pairs: 9.83 million.
3. Average number of answers per question: 1.7.
4. Number of question tags: 5,824.

Distribution of answers per question (more than 1 means a question has several answers):
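A tally like this can be computed in a single pass over the file. The following is a minimal sketch, assuming one JSON object per line with an `answers` list as in the sample records; the function name `answer_count_distribution` and the in-memory sample are made up for illustration.

```python
import json
from collections import Counter


def answer_count_distribution(lines):
    """Count how many questions have 1, 2, 3, ... answers.

    `lines` is any iterable of strings, each holding one JSON object
    with an "answers" list (assumed layout, matching the sample records).
    """
    dist = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        record = json.loads(line)
        dist[len(record["answers"])] += 1
    return dist


# Tiny in-memory stand-in for the real file (made-up records).
sample = [
    '{"question": "q1", "answers": ["a"], "tags": []}',
    '{"question": "q2", "answers": ["a", "b"], "tags": []}',
    '{"question": "q3", "answers": ["a"], "tags": []}',
]
dist = answer_count_distribution(sample)
print(sorted(dist.items()))
```

Because the function accepts any iterable of strings, an open file object can be passed in directly, so the whole corpus never has to sit in memory.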

1. A First Look at the Data

Sample JSON records:

 {
        "_id" : ObjectId("5d36e599bc54f451543da02b"),
        "url" : "http://zhidao.baidu.com/question/2207667243516878988.html",
        "answers" : [
            "这与当时的历史背景有关。卡萨布兰卡属于法国的殖民地,而当时的法国是与纳粹德国合作的。但在法国人中又分为合作和抵抗两派。警长的立场早先是在双方之间摇摆不定。后来与Rick站在了一起。拿酒瓶又扔掉象征其抛弃过去,走上了义无反顾的抵抗道路。----------------------------------------------------------------------------------------卡萨布兰卡的剧情简介   · · · · · · 二战期间,卡萨布兰卡是欧洲逃往美国的必经之地,那里鱼龙混杂,局势紧张。里克(亨佛莱?鲍嘉 Humphrey Bogart 饰)是一个神秘的商人,他在卡萨布兰卡开了一家人气很旺的夜总会,并拥有两张宝贵的通行证。一天,反纳粹人士维克多和妻子伊尔莎(英格丽?褒曼 Ingrid Bergman 饰)来到夜总会,原来他们正在逃避纳粹的追捕。碰巧的是,里克发现,伊尔莎竟然是他的旧日情人。那段爱曾经刻骨铭心,却因为一个误会而终止。而当误会消解时,伊尔莎和里克的感情还是不可避免的重燃了。里克手上的两张通行证能帮助维克多度过难关,但这样一来,伊尔莎是决定留下,还是离去,他们的爱情在政治和伦理的推波逐流中走向何方。"
        ],
        "question" : "卡萨布兰卡为什么是欧洲逃往美国的必经之地",
        "tags" : [
            "美国"
        ]
    }
    {
        "_id" : ObjectId("5d36e599bc54f451543da02c"),
        "url" : "http://zhidao.baidu.com/question/1929874578384929307。html",
        "answers" : [
            "你好是的!现在办理的身份证都是2代身份证!都是有磁性的"
        ],
        "question" : "2017年办的身份证是二代身份证吗",
        "tags" : [
            "公务办理"
        ]
    }
    {
        "_id" : ObjectId("5d36e599bc54f451543da02d"),
        "url" : "http://zhidao.baidu.com/question/941683273505984492.html",
        "answers" : [
            "龙凤汤是一道色香味俱全的传统名肴,属于闽系。此菜汤色微红,清澈见底,是一道上等滋补药膳。此菜由泰西宾馆一级厨师孙业富创作,被泰安市地方名吃评审会评为一等奖,受到海内外宾客的好评。龙凤汤主要食材:鲤鱼 ,口    味:鲜香 ,辅    料:香菇主料鲤鱼1条﹐鸡(大雏鸡)1/2只﹐香菇5个﹐大枣﹑栗子各10个﹐切好的葱2大勺﹐蒜1头﹐香油﹐胡椒粉鸡肉佐料切好的葱1大勺﹐捣好的蒜1大勺﹐胡椒面1/4小勺﹐香油1小勺调料鸡蛋﹐辣椒丝做法(1) 鸡要准备大雏鸡﹐去掉头和瓜﹐除去内脏洗净。(2) 要准备活鲤鱼﹐去尾放血后刮鳞﹐并切成块洗净。(3) 把香菇泡在水里除去香菇柱﹐大枣去核﹐栗子去皮。(4) 鸡蛋煎出来﹐切成丝。(5) 往平锅里倒水﹐开锅时放鸡﹑香菇﹑栗子﹑大枣﹑蒜煮熟。营养价值编辑鸡完全煮熟时﹐捞取撕肉与葱﹑蒜﹑香油﹑胡椒粉一起拌。在煮鸡的汤里放鲤鱼块儿﹐重煮一遍。有龙凤汤的真味儿出来时﹐盛在碗里﹐并在上面放拌的鸡肉和辣椒丝。",
            "因为香菇具有特殊的香味,会相互影响口味,所以不适合放。"
        ],
        "question" : "为什么炖龙凤汤不能放香菇呢",
        "tags" : [
            "美食",
            "花鸟鱼虫",
            "香菇"
        ]
    }

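It is easy to see that every dict in this JSON data has the same keys. The difficulty is that there are a huge number of dicts, and each one stands alone as a separate record. Since we will later do QA-pair matching (NLP folks will know what QA means), the goal is to find the dicts whose "answers" list contains exactly one item (there are about 360,000 of them), keep them, and write 300,000 of them to a new JSON file. The task is clear, so let's get to work!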

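2. Implementation

1. Approach

My approach is simple: read the JSON records, use an if statement to select those where len(data['answers']) == 1, and write them to a new JSON file. A counter goes up by 1 for every dict that matches, and the process stops at 300,000.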
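The filter-and-count idea can also be written as a generator, which keeps the "select" step separate from the "stop after N" step. This is only a sketch under the same assumptions as the post (one JSON object per line, an `answers` list in each); the names `single_answer_records` and `take_single_answer` are hypothetical and not part of the original code.

```python
import json
from itertools import islice


def single_answer_records(lines, key="answers"):
    """Lazily yield parsed records whose `key` list has exactly one item."""
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        if len(record.get(key, [])) == 1:
            yield record


def take_single_answer(lines, limit):
    """Return at most `limit` single-answer records, reading no further than needed."""
    return list(islice(single_answer_records(lines), limit))


# Made-up records standing in for lines of the real file.
sample = [
    '{"question": "q1", "answers": ["a"]}',
    '{"question": "q2", "answers": ["a", "b"]}',
    '{"question": "q3", "answers": ["c"]}',
    '{"question": "q4", "answers": ["d"]}',
]
picked = take_single_answer(sample, 2)
print([r["question"] for r in picked])
```

With `islice`, iteration stops as soon as the limit is reached, so the rest of the file is never parsed.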

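2. A Few Pitfalls

I ran into a few pitfalls while actually writing the code:

  1. Roughly 5 million JSON records is simply too much, so the file must be read line by line. That way the line_data variable only ever holds the current record instead of all 5 million; otherwise you get an out-of-virtual-memory error.

Thanks to this blogger for the pointer:
https://blog.csdn.net/Threeyearsago/article/details/104763329.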

line_data = f.readline()  # read one line at a time
data = json.loads(line_data)  # parse that single record
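  2. write() only accepts strings and does not add a newline automatically, so you can use the following template when writing:
with open(path2, "a", encoding='utf-8') as fo:
    fo.write(json.dumps(data, ensure_ascii=False) + "\n")  # one record per line; json.dumps (not str) keeps the output valid JSON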
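One caveat about the write step is worth spelling out: converting a dict with `str()` produces Python's repr (single quotes), which `json.loads` cannot read back, whereas `json.dumps` produces valid JSON. A small sketch of the difference follows; the record and the helper `is_valid_json` are made up for illustration.

```python
import json

# Made-up record in the same shape as the corpus entries.
record = {"question": "2017年办的身份证是二代身份证吗", "answers": ["是的"]}

as_repr = str(record)                             # Python repr, single-quoted
as_json = json.dumps(record, ensure_ascii=False)  # real JSON, Chinese kept readable


def is_valid_json(text):
    """Return True if `text` can be parsed by json.loads."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(as_repr))
print(is_valid_json(as_json))
```

Passing `ensure_ascii=False` is optional; without it the Chinese text is written as `\uXXXX` escapes, which is still valid JSON but harder to read.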

3. Full Code

The code is as follows (example):

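import json

path1 = r"D:\json\zhidao_qa.json"  # input file, one JSON record per line
path2 = r"D:\json\res.json"        # output file
count = 0
# Read line by line so that only the current record is held in memory;
# the output file is opened once instead of once per record.
with open(path1, 'r', encoding='utf-8') as f, \
        open(path2, "a", encoding='utf-8') as fo:
    for line_data in f:
        if count >= 300000:  # 300,000 records are enough, stop reading
            break
        line_data = line_data.strip()
        if not line_data:  # skip blank lines
            continue
        try:
            data = json.loads(line_data)  # parse the current line
        except json.JSONDecodeError as e:
            print(e)  # report the bad line and carry on
            continue
        if len(data['answers']) == 1:  # keep only questions with exactly one answer
            fo.write(json.dumps(data, ensure_ascii=False) + "\n")  # one record per line
            count += 1  # increment the counter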

The comments in the code cover the details.


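Summary

That is everything I wanted to share. Besides JSON with complex nesting of lists and dicts, there is also JSON with a huge number of dicts that all share the same keys. NLP students will inevitably have to deal with very large text resources, and I hope everyone becomes comfortable with them through practice.
My personal blog again: https://dongguanting.github.io/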
salute!
