Drill是查询JSON数据的绝佳工具。 但是Drill并不是神奇的东西,有时它会遇到一些尚无法完全处理的数据。 这篇文章介绍了这种情况的示例,以及如何使用少量Python代码解决此问题。
情境
您有更改架构的数据。 在此示例中,我们看您有一个字段从单个值更改为列表,反之亦然的情况。
样本数据
此数据分为两个文件。
{
"name": "Vince",
"favorite_foods": [
{
"name": "ice-cream",
"flavors": "vanilla"
},
{
"name": "cheeseburgers",
"flavors": [
"animal style",
"black label"
]
}
]
}
{
"name": "Nikki",
"favorite_foods": [
{
"name": "ice-cream",
"flavors": [
"chocolate",
"dulce de leche"
]
},
{
"name": "cheeseburgers",
"flavors": [
"black label"
]
}
]
}
询问
这两个文件都是有效的JSON,这很好,但是事实证明架构已更改。 文斯只喜欢香草冰淇淋,并且将其存储为单个值。 Nikki喜欢巧克力和dulce de leche(Vince不会拒绝它们,只是不喜欢他的口味),因此将它们存储为列表。
Drill宁愿您不要这样组织数据,也不要害羞地告诉您:
0: jdbc:drill:zk=local> select * from `/Users/vince/src/drill-data-prep-example/schema-change/data`;
java.lang.RuntimeException: java.sql.SQLException: DATA_READ ERROR: You tried to start when you are using a ValueWriter of type NullableVarCharWriterImpl.
File /Users/vince/src/drill-data-prep-example/schema-change/data/vince.json
Record 1
Line 1
Column 127
Field flavors
Fragment 0:0
[Error Id: 2c9020c1-cfcf-42e2-926c-d00962ceb7f9 on 192.168.56.1:31010]
at sqlline.IncrementalRows.hasNext(IncrementalRows.java:73)
at sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
at sqlline.SqlLine.print(SqlLine.java:1583)
at sqlline.Commands.execute(Commands.java:852)
at sqlline.Commands.sql(Commands.java:751)
at sqlline.SqlLine.dispatch(SqlLine.java:738)
at sqlline.SqlLine.begin(SqlLine.java:612)
at sqlline.SqlLine.start(SqlLine.java:366)
at sqlline.SqlLine.main(SqlLine.java:259)
如果我们反转这些对象在输入中的顺序怎么办? 当我们从列表开始,然后更改为单个值时,Drill是否更喜欢它?
0: jdbc:drill:zk=local> select * from `/Users/vince/src/drill-data-prep-example/schema-change/data`;
Error: DATA_READ ERROR: You tried to write a VarChar type when you are using a ValueWriter of type SingleListWriter.
File /Users/vince/src/mapr/drill-data-prep/schema-change/sample_data.json
Record 2
Line 3
Column 84
Field flavors
Fragment 0:0
[Error Id: 93ceda84-8710-46de-8d5f-6b2290914703 on 192.168.56.1:31010] (state=,code=0)
否。演习仍然令人不悦。
在Drill 1.6中,可以启用联合类型(实验性) 。 这将允许将多种数据类型存储在一个字段中,这可能有助于我们解决此问题。 因此,让我们启用它,然后尝试一下。
ALTER SESSION SET `exec.enable_union_type` = true;
0: jdbc:drill:zk=local> select * from `/Users/vince/src/drill-data-prep-example/schema-change/data`;
Error: Unexpected RuntimeException: java.lang.IllegalArgumentException: The field $offsets$(UINT4:REQUIRED) doesn't match the provided metadata major_type {
minor_type: MAP
mode: REQUIRED
}
. (state=,code=0)
可悲的是,这也不起作用。 因此,让我们禁用联合类型并继续。
有什么问题?
当您提供数据类型类型从单个值更改为列表,反之亦然时,Drill不喜欢它。
我们如何解决呢?
我们对数据进行处理,以使有时但并非总是使用列表的字段成为始终使用列表的字段。 在样本数据中,问题之一就是flavors
。 Drill可以帮助您准确指出问题所在的行和列,因此您可以在开始工作之前先加以关注,然后开始编写代码以对其进行修复。
修复数据的一种方法是使用python。 您可以简单地读取JSON对象并重写它们,从而将单个值转换为单个值列表:
import json
import types
for filename in ("data/vince.json", "data/nikki.json"):
# Read the file into memory.
with file(filename) as infile:
data = json.loads(infile.read())
# Open and truncate the input file.
with file(filename, "w") as outfile:
# enumerate returns an incrementing integer with each iteration.
# convenient way to keep track of the index into the list
# we need to modify.
for i,f in enumerate(data["favorite_foods"]):
# Check if the value pointed to by "flavors" is of type
# ListType.
if type(f["flavors"]) != types.ListType:
# If it's not, then overwrite the value as a list.
data["favorite_foods"][i]["flavors"] = [ f["flavors"] ]
outfile.write(json.dumps(data))
现在,查询经过按摩的数据后,Drill变得更快乐:
0: jdbc:drill:zk=local> select * from `/Users/vince/src/drill-data-prep-example/schema-change/data`;
+---------------------------------------------------------------------------------------------------------------------+--------+
| favorite_foods | name |
+---------------------------------------------------------------------------------------------------------------------+--------+
| [{"flavors":["chocolate","dulce de leche"],"name":"ice-cream"},{"flavors":["black label"],"name":"cheeseburgers"}] | Nikki |
| [{"flavors":["vanilla"],"name":"ice-cream"},{"flavors":["animal style","black label"],"name":"cheeseburgers"}] | Vince |
+---------------------------------------------------------------------------------------------------------------------+--------+
2 rows selected (0.119 seconds)
更满意的查询:
0: jdbc:drill:zk=local> select t.name, t.yum.name as fave_food, flatten(t.yum.flavors) as fave_flave from (select t.name, flatten(t.favorite_foods) as yum from (select name,favorite_foods from `/Users/vince/src/drill-data-prep-example/schema-change/data`) t) t;
+--------+----------------+-----------------+
| name | fave_food | fave_flave |
+--------+----------------+-----------------+
| Nikki | ice-cream | chocolate |
| Nikki | ice-cream | dulce de leche |
| Nikki | cheeseburgers | black label |
| Vince | ice-cream | vanilla |
| Vince | cheeseburgers | animal style |
| Vince | cheeseburgers | black label |
+--------+----------------+-----------------+
6 rows selected (0.153 seconds)
有趣的笔记
查询JSON文件的目录与查询文件时,将得到不同的错误输出样式。
翻译自: https://www.javacodegeeks.com/2016/06/resolving-json-schema-changes-drill-python.html