使用Drill和Python解决JSON模式更改

最新推荐文章于 2024-09-03 09:36:48 发布

danpu0978

最新推荐文章于 2024-09-03 09:36:48 发布

阅读量563

点赞数

文章标签： java python json mysql linux

Drill是查询JSON数据的绝佳工具。但是Drill并不是神奇的东西，有时它会遇到一些尚无法完全处理的数据。这篇文章介绍了这种情况的示例，以及如何使用少量Python代码解决此问题。

情境

您有更改架构的数据。在此示例中，我们看您有一个字段从单个值更改为列表，反之亦然的情况。

样本数据

此数据分为两个文件。

{
  "name": "Vince",
  "favorite_foods": [
    {
      "name": "ice-cream",
      "flavors": "vanilla"
    },
    {
      "name": "cheeseburgers",
      "flavors": [
        "animal style",
        "black label"
      ]
    }
  ]
}
{
  "name": "Nikki",
  "favorite_foods": [
    {
      "name": "ice-cream",
      "flavors": [
        "chocolate",
        "dulce de leche"
      ]
    },
    {
      "name": "cheeseburgers",
      "flavors": [
        "black label"
      ]
    }
  ]
}

询问

这两个文件都是有效的JSON，这很好，但是事实证明架构已更改。文斯只喜欢香草冰淇淋，并且将其存储为单个值。 Nikki喜欢巧克力和dulce de leche（Vince不会拒绝它们，只是不喜欢他的口味），因此将它们存储为列表。

Drill宁愿您不要这样组织数据，也不要害羞地告诉您：

0: jdbc:drill:zk=local> select * from `/Users/vince/src/drill-data-prep-example/schema-change/data`;
java.lang.RuntimeException: java.sql.SQLException: DATA_READ ERROR: You tried to start when you are using a ValueWriter of type NullableVarCharWriterImpl.

File  /Users/vince/src/drill-data-prep-example/schema-change/data/vince.json
Record  1
Line  1
Column  127
Field  flavors
Fragment 0:0

[Error Id: 2c9020c1-cfcf-42e2-926c-d00962ceb7f9 on 192.168.56.1:31010]
    at sqlline.IncrementalRows.hasNext(IncrementalRows.java:73)
    at sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
    at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
    at sqlline.SqlLine.print(SqlLine.java:1583)
    at sqlline.Commands.execute(Commands.java:852)
    at sqlline.Commands.sql(Commands.java:751)
    at sqlline.SqlLine.dispatch(SqlLine.java:738)
    at sqlline.SqlLine.begin(SqlLine.java:612)
    at sqlline.SqlLine.start(SqlLine.java:366)
    at sqlline.SqlLine.main(SqlLine.java:259)

如果我们反转这些对象在输入中的顺序怎么办？当我们从列表开始，然后更改为单个值时，Drill是否更喜欢它？

0: jdbc:drill:zk=local> select * from `/Users/vince/src/drill-data-prep-example/schema-change/data`;
Error: DATA_READ ERROR: You tried to write a VarChar type when you are using a ValueWriter of type SingleListWriter.

File  /Users/vince/src/mapr/drill-data-prep/schema-change/sample_data.json
Record  2
Line  3
Column  84
Field  flavors
Fragment 0:0

[Error Id: 93ceda84-8710-46de-8d5f-6b2290914703 on 192.168.56.1:31010] (state=,code=0)

否。演习仍然令人不悦。

在Drill 1.6中，可以启用联合类型（实验性）。这将允许将多种数据类型存储在一个字段中，这可能有助于我们解决此问题。因此，让我们启用它，然后尝试一下。

ALTER SESSION SET `exec.enable_union_type` = true;
0: jdbc:drill:zk=local> select * from `/Users/vince/src/drill-data-prep-example/schema-change/data`;
Error: Unexpected RuntimeException: java.lang.IllegalArgumentException: The field $offsets$(UINT4:REQUIRED) doesn't match the provided metadata major_type {
  minor_type: MAP
  mode: REQUIRED
}

. (state=,code=0)

可悲的是，这也不起作用。因此，让我们禁用联合类型并继续。

有什么问题？

当您提供数据类型类型从单个值更改为列表，反之亦然时，Drill不喜欢它。

我们如何解决呢？

我们对数据进行处理，以使有时但并非总是使用列表的字段成为始终使用列表的字段。在样本数据中，问题之一就是flavors 。 Drill可以帮助您准确指出问题所在的行和列，因此您可以在开始工作之前先加以关注，然后开始编写代码以对其进行修复。

修复数据的一种方法是使用python。您可以简单地读取JSON对象并重写它们，从而将单个值转换为单个值列表：

import json
import types

for filename in ("data/vince.json", "data/nikki.json"):
    # Read the file into memory.
    with file(filename) as infile:
        data = json.loads(infile.read())

    # Open and truncate the input file.
    with file(filename, "w") as outfile:
        # enumerate returns an incrementing integer with each iteration.
        # convenient way to keep track of the index into the list
        # we need to modify.
        for i,f in enumerate(data["favorite_foods"]):
            # Check if the value pointed to by "flavors" is of type
            # ListType.
            if type(f["flavors"]) != types.ListType:
                # If it's not, then overwrite the value as a list.
                data["favorite_foods"][i]["flavors"] = [ f["flavors"] ]
        outfile.write(json.dumps(data))

现在，查询经过按摩的数据后，Drill变得更快乐：

0: jdbc:drill:zk=local> select * from `/Users/vince/src/drill-data-prep-example/schema-change/data`;
+---------------------------------------------------------------------------------------------------------------------+--------+
|                                                   favorite_foods                                                    |  name  |
+---------------------------------------------------------------------------------------------------------------------+--------+
| [{"flavors":["chocolate","dulce de leche"],"name":"ice-cream"},{"flavors":["black label"],"name":"cheeseburgers"}]  | Nikki  |
| [{"flavors":["vanilla"],"name":"ice-cream"},{"flavors":["animal style","black label"],"name":"cheeseburgers"}]      | Vince  |
+---------------------------------------------------------------------------------------------------------------------+--------+
2 rows selected (0.119 seconds)

更满意的查询：

0: jdbc:drill:zk=local> select t.name, t.yum.name as fave_food, flatten(t.yum.flavors) as fave_flave from (select t.name, flatten(t.favorite_foods) as yum from (select name,favorite_foods from `/Users/vince/src/drill-data-prep-example/schema-change/data`) t) t;
+--------+----------------+-----------------+
|  name  |   fave_food    |   fave_flave    |
+--------+----------------+-----------------+
| Nikki  | ice-cream      | chocolate       |
| Nikki  | ice-cream      | dulce de leche  |
| Nikki  | cheeseburgers  | black label     |
| Vince  | ice-cream      | vanilla         |
| Vince  | cheeseburgers  | animal style    |
| Vince  | cheeseburgers  | black label     |
+--------+----------------+-----------------+
6 rows selected (0.153 seconds)