Processing JSON data with pandas
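There are three main ways to parse a JSON string into a DataFrame:
- Parse the string directly with pandas' built-in read_json
- Use json.loads together with pandas' json_normalize
- Use json.loads and build the frame directly with the pandas DataFrame constructor (this requires manually reshaping the dict or list returned by loads)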
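Because read_json parses the string directly, it is the fastest of the three, but it is also the strictest about its input: the JSON must match one of the layouts it supports. The supported layouts are listed in the read_json page of the official pandas documentation. json_normalize, by contrast, works on the dict built from the JSON string, so it is far more flexible than read_json. Surprisingly, though, it turned out to be slower than parsing by hand (when parsing by hand, a list comprehension is much faster than an ordinary for loop). The most flexible option is still manual parsing, which lets you do some simple data processing before constructing the DataFrame.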
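To make the flexibility comparison above concrete, here is a minimal sketch. The nested records are invented for this illustration and are not taken from the original data; it also assumes pandas 1.0 or later, where json_normalize is available as pd.json_normalize.

```python
import json

import pandas as pd

# Hypothetical nested records, made up for this example.
raw = ('[{"id": 1, "user": {"name": "a", "city": "x"}},'
       ' {"id": 2, "user": {"name": "b", "city": "y"}}]')
records = json.loads(raw)

# json_normalize flattens nested dicts into dotted column names.
flat = pd.json_normalize(records)
print(sorted(flat.columns))

# Manual construction: select and transform fields before building the frame.
rows = [[r["id"], r["user"]["name"].upper()] for r in records]
manual = pd.DataFrame(rows, columns=["id", "name"])
print(manual)
```

The manual route costs a few more lines, but any per-record cleanup (here, upper-casing the name) happens before the DataFrame exists.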
# -*- coding: UTF-8 -*-
from io import StringIO
import json
import time
import pandas as pd
# Read the input data
with open('data.json') as f:
    data_str = f.read()
print(data_str)
# Time json_normalize (pd.json_normalize replaces the older pandas.io.json import)
start_time = time.time()
for i in range(0, 300):
    data_list = json.loads(data_str)
    df = pd.json_normalize(data_list)
end_time = time.time()
print(end_time - start_time)
# Time manual construction
start_time = time.time()
for i in range(0, 300):
    data_list = json.loads(data_str)
    data = [[d['timestamp'], d['value']] for d in data_list]
    df = pd.DataFrame(data, columns=['timestamp', 'value'])
end_time = time.time()
print(end_time - start_time)
# Time read_json
start_time = time.time()
for i in range(0, 300):
    # Wrap the string in StringIO: recent pandas versions deprecate passing a literal JSON string
    df = pd.read_json(StringIO(data_str), orient='records')
end_time = time.time()
print(end_time - start_time)
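The script above depends on a data.json file that is not included here. As a self-contained variant, the sketch below generates synthetic records with the same two keys (timestamp and value) and times the three approaches with timeit. The record count, the number of repetitions, and the helper function names are arbitrary choices for illustration, and the timings will differ by machine and pandas version.

```python
import io
import json
import timeit

import pandas as pd

# Synthetic stand-in for data.json (assumed layout: a list of records
# with "timestamp" and "value" keys).
data_str = json.dumps(
    [{"timestamp": 1500000000 + i, "value": i * 0.5} for i in range(1000)]
)

def with_json_normalize():
    return pd.json_normalize(json.loads(data_str))

def with_manual_build():
    data_list = json.loads(data_str)
    data = [[d["timestamp"], d["value"]] for d in data_list]
    return pd.DataFrame(data, columns=["timestamp", "value"])

def with_read_json():
    # convert_dates=False keeps "timestamp" as plain integers,
    # so all three functions return comparable frames.
    return pd.read_json(io.StringIO(data_str), orient="records",
                        convert_dates=False)

for func in (with_json_normalize, with_manual_build, with_read_json):
    elapsed = timeit.timeit(func, number=50)
    print(f"{func.__name__}: {elapsed:.3f} s for 50 runs")
```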
The pandas read_json function converts JSON data into a DataFrame. The signature of pandas.read_json is:
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
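The first parameter is the path to a JSON file or a JSON-formatted string (recent pandas versions expect a literal string to be wrapped in io.StringIO).
The second parameter, orient, indicates the expected layout of the JSON string. orient accepts the following values: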
(1).'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
This layout consists of an index, the column names, and a data matrix. The keys must be exactly index, columns, and data.
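from io import StringIO
import pandas as pd
s='{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
# StringIO wrapper: recent pandas versions deprecate literal JSON strings
print(pd.read_json(StringIO(s),orient='split'))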
Output:
   a  b
1  1  3
2  2  8
3  3  9
(2). 'records' : list like [{column -> value}, ... , {column -> value}]
This is a list whose members are dicts. Each dict maps column names to values, and each dict becomes one row of the DataFrame.
from io import StringIO
import pandas as pd
s='[{"name":"xiaomaimiao","age":20},{"name":"xxt","age":18},{"name":"xmm","age":1}]'
print(pd.read_json(StringIO(s),orient='records'))
Output:
   age         name
0   20  xiaomaimiao
1   18          xxt
2    1          xmm
Another example:
# coding=utf-8
from io import StringIO
import pandas as pd
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)
s=open('a.json', encoding='UTF-8').read()
df=pd.read_json(StringIO(s),orient='records')
print(df.head(5))
# df.to_excel('pandas处理json1.xlsx', index=False, columns=["Company", "Job", "Location", "Name", "MajorTag","University"])
df.to_excel('pandas处理json1.xlsx', index=False)
Data: a.json.zip
or: https://raw.githubusercontent.com/lhrbest/Python/master/xxt_test_json.json
(3). 'index' : dict like {index -> {column -> value}}
The keys are index labels, and each value is a dict mapping column names to values. For example:
from io import StringIO
import pandas as pd
s='{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
print(pd.read_json(StringIO(s),orient='index'))
Output:
   a   b
0  1   2
1  9  11
(4). 'columns' : dict like {column -> {index -> value}}
Here each key is a column name mapped to a dict of values, and that dict is keyed by index label. For example:
from io import StringIO
import pandas as pd
s='{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
print(pd.read_json(StringIO(s),orient='columns'))
Output:
   a   b
0  1   2
1  9  11
(5). 'values' : just the values array
The values layout is the most familiar one: a nested list whose members are themselves lists, two levels deep.
from io import StringIO
import pandas as pd
s='[["a",1],["b",2]]'
print(pd.read_json(StringIO(s),orient='values'))
Output:
   0  1
0  a  1
1  b  2
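One way to see how the five orient values relate is to serialize a single DataFrame with DataFrame.to_json in each layout and read it back. The sketch below assumes a small two-column frame chosen for illustration; it prints the JSON text for each layout so the shapes can be compared side by side.

```python
import io

import pandas as pd

# Small illustrative frame; the values mirror the 'index' and 'columns' examples.
df = pd.DataFrame({"a": [1, 9], "b": [2, 11]})

results = {}
for orient in ("split", "records", "index", "columns", "values"):
    text = df.to_json(orient=orient)
    print(orient, "->", text)
    # Read the text back with the matching orient.
    results[orient] = pd.read_json(io.StringIO(text), orient=orient)
```

Every layout restores the same cell values, but records and values do not store the index, and values also drops the column names, so those come back as default integer labels.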
The JSON string to process:
strtext='[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'
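A minimal sketch of one way to load a string like this, shown with only the first two records to keep it short. Treating time as epoch milliseconds and splitting code into separate digit columns are assumptions about what is useful here, not steps taken from the original text.

```python
from io import StringIO

import pandas as pd

# First two records of the string above, kept short for illustration.
strtext = ('[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9",'
           '"code1":"297734529","code2":null,"time":1013395466000},'
           '{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2",'
           '"code1":"298058212","code2":null,"time":1013395406000}]')

# Each dict in the list becomes one row, so orient='records' applies.
df = pd.read_json(StringIO(strtext), orient='records')

# Assumption: "time" holds epoch milliseconds.
df['time'] = pd.to_datetime(df['time'], unit='ms')

# Assumption: "code" is a comma-separated list of digits; expand it into columns.
digits = df['code'].str.split(',', expand=True)

print(df[['issue', 'time']])
print(digits)
```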