Limitation of Hive Data Validation

In the big data world, Hive is one of the most popular data warehouse tools. It comes with convenient and flexible features, including an SQL-like data manipulation language and an easy data-import mechanism: a user can simply copy a data file to the specified HDFS location. But it also has some disadvantages, which is why a “data validation” mechanism is necessary.

Hive Data Validation Mechanism Limitation

When you create a table schema in Hive, you must specify each column name along with its data type, just as in the relational databases that are widely used today. You might think that you can then simply import your data into the table’s HDFS location and use a simple query such as the following:

SELECT * FROM [tablename];

But that’s not the case. In this scenario you won’t get an error message when your data doesn’t match the data type you specified in the schema, as you would in a relational database. Instead, Hive converts any value that is not compatible with the specified data type to NULL, which is sometimes undesirable.

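For illustration, here is a minimal HiveQL sketch of that behaviour; the table name, column names and file path are hypothetical, not taken from the original post:

    CREATE TABLE user_events (
        user_id    INT,
        event_name STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Suppose /tmp/user_events.csv contains a malformed row such as:  abc,login
    LOAD DATA LOCAL INPATH '/tmp/user_events.csv' INTO TABLE user_events;

    -- No error is raised; the incompatible value simply comes back as NULL:
    SELECT user_id, event_name FROM user_events;
    -- NULL    login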

So far there is no tool designed to fix this issue, but there are some methods you can use as a workaround.

  1. Preprocess your data before moving it into Hive. You can do this by writing an ETL process in MapReduce or by using Pig (an open-source tool that implements a higher-level MapReduce mechanism). However, if you have many tables it becomes inconvenient to write a separate MapReduce program to handle the validation process for each table. With that in mind, this leads to the second method.

  2. Create a staging table with all STRING data types, then use a custom Hive function to validate your data types before loading the data into another persistent table.

Creating a staging table and applying a custom Hive function for data validation

You can follow the steps below to create a temporary staging table for storing your raw data, then apply a custom Hive function to validate it before storing it in your persistent Hive table.

  1. Create a staging table for storing your raw data.

2. Copy your text data file to your destination table’s HDFS location.

3. Use the CAST and NVL functions when selecting your data to convert invalid values to a default value (a combined sketch of steps 1-3 is shown after this list).

4. (Optional) Create your own SerDe, the serializer/deserializer interface that Hive uses to read and write data at its specified HDFS location, to validate your data types. You can find out how to write a custom Hive SerDe in my other blog post.

With a custom SerDe, an error will be displayed if there is any data type mismatch.

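The following HiveQL sketch illustrates steps 1-3 with a hypothetical user_events table; the column names, file path and default values are placeholders, not part of the original post:

    -- Step 1: staging table with every column declared as STRING,
    -- so nothing is silently converted to NULL at this stage.
    CREATE TABLE user_events_staging (
        user_id    STRING,
        event_name STRING,
        event_time STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Step 2: move the raw text file into the staging table's HDFS location
    -- (LOAD DATA shown here; copying the file with `hdfs dfs -put` also works).
    LOAD DATA INPATH '/data/raw/user_events.csv' INTO TABLE user_events_staging;

    -- Step 3: validate while loading into the persistent table.
    -- CAST turns invalid values into NULL, and NVL replaces NULL with a default.
    CREATE TABLE user_events (
        user_id    INT,
        event_name STRING,
        event_time TIMESTAMP
    );

    INSERT OVERWRITE TABLE user_events
    SELECT
        NVL(CAST(user_id AS INT), -1)                                                 AS user_id,
        NVL(event_name, 'unknown')                                                    AS event_name,
        NVL(CAST(event_time AS TIMESTAMP), CAST('1970-01-01 00:00:00' AS TIMESTAMP))  AS event_time
    FROM user_events_staging;

Rows that fail the CAST end up with the chosen default value rather than NULL, so you can spot and count them afterwards instead of losing them silently.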

Conclusion

In conclusion, using Hive as a data warehouse tool gives you a lot of benefits, but there are some issues you should be aware of, including “data validation on read”, in order to handle your data properly.

In contrast, with “data validation on write”, Hive handles the data type validation process for you automatically when you call an INSERT command, so you don’t have to worry about it. If you want more detail about how to create a custom Hive SerDe, you can go to my next blog post. I hope this blog helps you understand some limitations of Hive and overcome them with an appropriate solution.

Translated from: https://medium.com/analytics-vidhya/limitation-of-hive-data-validation-1eec015e5ca6
