1. Background
The data arrives in JSON format in AWS S3. We download it from S3 to a local machine, parse it, and load it into the database. The target schema is a single wide table with more than 70 columns: a column is populated when the corresponding field exists in the JSON, and set to NULL otherwise.
2. Data Format
{
"dId": "204083",
"ccn": "24474728",
"version": "000001",
"D": [{
"N": "Compressor Status Word",
"V": -27898
}, {
"N": "Compressor Discharge Pressure",
"V": 8
}, {
"N": "2nd Stage Inlet Pressure",
"V": 2
}, {
"N": "2nd Stage Discharge Pressure",
"V": 8
}, {
"N": "Inlet Vacuum",
"V": 0
}, {
"N": "Oil Filter Outlet Pressure",
"V": 2
}, {
"N": "2nd Stage Inlet Temperature",
"V": 33
}, {
"N": "1st Stage Discharge Temperature",
"V": 174
}, {
"N": "2nd Stage Discharge Temperature",
"V": 174
}, {
"N": "Oil Filter Outlet Temperature",
"V": 55
}, {
"N": "Compressor Discharge Temperature",
"V": 30
}, {
"N": "1st Stage Inlet Temperature",
"V": 15
}, {
"N": "Cooling Motor Speed",
"V": 100
}, {
"N": "Vsd Motor Speed",
"V": 2070
}, {
"N": "Driver Motor Current",
"V": 256
}, {
"N": "Vsd AC Input Voltage",
"V": 386
}, {
"N": "Vsd DC Bus Voltge",
"V": 524
}, {
"N": "Remote Pressure",
"V": 0
}, {
"N": "Vsd Motor Voltage",
"V": 330
}, {
"N": "Package Power",
"V": 130
}, {
"N": "Vsd Motor Power",
"V": 129
}, {
"N": "Target Pressure",
"V": 8
}, {
"N": "First Stage Temperature",
"V": 250
}],
"fre": 70,
"iTs": 1608269180,
"type": "Timing",
"gSN": "NIR5012047000191"
}
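To make the wide-table mapping from Section 1 concrete, here is a minimal Python sketch that turns one record like the above into one row, with None (NULL) for sensors that are absent. The function name `flatten` and the column subset are illustrative assumptions, not part of the actual pipeline:

```python
import json

# Illustrative subset of the 70+ wide-table columns; real column names assumed.
COLUMNS = ["dId", "ccn", "iTs", "type", "gSN",
           "Compressor Discharge Pressure", "Inlet Vacuum",
           "Some Absent Sensor"]

def flatten(record):
    """Map one parsed record to one wide-table row dict.
    Columns with no matching entry in D stay None, i.e. NULL on insert."""
    row = {col: None for col in COLUMNS}
    for key in ("dId", "ccn", "iTs", "type", "gSN"):
        row[key] = record.get(key)
    for item in record.get("D", []):
        if item["N"] in row:          # keep only columns the table defines
            row[item["N"]] = item["V"]
    return row

sample = json.loads('{"dId": "204083", "iTs": 1608269180,'
                    ' "D": [{"N": "Compressor Discharge Pressure", "V": 8}]}')
row = flatten(sample)
print(row["Compressor Discharge Pressure"])  # 8
print(row["Some Absent Sensor"])             # None
```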
As the screenshot shows, the data consists of plain files stored as objects in S3, with JSON as the file content. Many small files make up one folder, and many such folders in turn make up one large top-level folder.
3. Data Preprocessing
Kettle can only recognize files whose names end in .json before it can go on to parse them, so as a preprocessing step we append a .json suffix to these files.
The following Python script appends .json to every file in a directory:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
'''
@Project :cn.com.softland.IOTHadoop
@File    :json.py
@IDE     :PyCharm
@Author  :Cayon_L
@Date    :2021/6/4 15:15
@User    :liuky
'''
import os


def batch_rename(dir_path, suffix):
    """Append `suffix` to every file in `dir_path`."""
    for file in os.listdir(dir_path):
        if file.endswith(suffix):
            continue  # already renamed; makes the script safe to re-run
        old_name = os.path.join(dir_path, file)
        # Append the suffix directly rather than splitting on ".",
        # so a file name that happens to contain a dot is not truncated.
        new_name = os.path.join(dir_path, file + suffix)
        os.rename(old_name, new_name)


dir_path = 'C:\\Users\\liuky\\Desktop\\10\\NIR5012047000201'
suffix = '.json'
batch_rename(dir_path, suffix)
The data after preprocessing:
4. Data Processing
4.1 Overall Flow
4.2 Parsing the First JSON Layer
A folder is made up of many small JSON files, so we fetch the file names and batch-process all JSON files under one folder.
We then parse the first JSON layer:
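In Python terms, this batch step amounts to something like the following sketch; `parse_folder` is a hypothetical helper for illustration, not part of the Kettle job:

```python
import json
import os

def parse_folder(dir_path):
    """Yield the first-layer fields of every .json file in dir_path.
    The nested D array is passed through untouched for the next step."""
    for name in sorted(os.listdir(dir_path)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(dir_path, name), encoding="utf-8") as f:
            record = json.load(f)
        yield {
            "dId": record.get("dId"),
            "ccn": record.get("ccn"),
            "version": record.get("version"),
            "fre": record.get("fre"),
            "iTs": record.get("iTs"),
            "type": record.get("type"),
            "gSN": record.get("gSN"),
            "D": record.get("D", []),
        }
```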
4.3 Parsing the Second JSON Layer
The D field inside each record is the second JSON layer.
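Expanding the second layer can be sketched in a few lines: each element of D becomes one (iTs, N, V) row. The tuple layout and helper name are assumptions for illustration:

```python
def explode_d(record):
    """Turn the nested D array into flat (iTs, N, V) rows."""
    return [(record["iTs"], item["N"], item["V"])
            for item in record.get("D", [])]

record = {"iTs": 1608269180,
          "D": [{"N": "Inlet Vacuum", "V": 0},
                {"N": "Target Pressure", "V": 8}]}
print(explode_d(record))
# [(1608269180, 'Inlet Vacuum', 0), (1608269180, 'Target Pressure', 8)]
```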
4.4 Extracting Fields
Select the fields to be loaded into the database.
4.5 Transformation
Group the fields by iTs.
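Grouping by iTs pivots the flat (iTs, N, V) rows back into one wide row per timestamp, much like a row-denormalise step keyed on iTs. A sketch, with the helper name assumed:

```python
from collections import defaultdict

def group_by_its(rows):
    """Collect (iTs, N, V) rows into one {name: value} dict per iTs."""
    grouped = defaultdict(dict)
    for its, name, value in rows:
        grouped[its][name] = value
    return dict(grouped)

rows = [(1608269180, "Inlet Vacuum", 0),
        (1608269180, "Target Pressure", 8),
        (1608269250, "Inlet Vacuum", 1)]
print(group_by_its(rows)[1608269180])
# {'Inlet Vacuum': 0, 'Target Pressure': 8}
```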
4.6 Extracting the Fields to Load
4.7 Loading into the Database
Finally, the result set is loaded into the database. To improve job throughput, the number of rows inserted per commit can be increased appropriately.
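As a rough illustration of this batch-size knob, here is a sketch using sqlite3; the table name and columns are made up (the real job inserts 70+ columns). A larger batch_size means fewer commits and usually higher throughput:

```python
import sqlite3

def load_rows(conn, rows, batch_size=1000):
    """Insert rows in batches, committing once per batch."""
    sql = "INSERT INTO wide_table (dId, iTs, inlet_vacuum) VALUES (?, ?, ?)"
    for i in range(0, len(rows), batch_size):
        conn.executemany(sql, rows[i:i + batch_size])
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wide_table (dId TEXT, iTs INTEGER, inlet_vacuum REAL)")
load_rows(conn, [("204083", 1608269180, 0.0)] * 5, batch_size=2)
print(conn.execute("SELECT COUNT(*) FROM wide_table").fetchone()[0])  # 5
```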