hive将list转成string_Hive:将逗号分隔的字符串转换为数组以便生成表生成函数

I am creating a Hive table on Amazon's Elastic MapReduce by using a gzipped JSON encoded file. I am using this JSON SerDe: http://code.google.com/p/hive-json-serde/

The unencoded file looks like this:

{"id":"101", "items":"A:231,234,119,12"}

{"id":"102", "items":"B:13,89,121"}

...

I'd like to create an array of the "items" column for user with a table generating function. The array I want would be the "exploded" CSV of ints ignoring the ":" and the letter before it. I want to be able to GROUP BY the item ints (231,234,etc.) or JOIN them against other tables. Changing around the format of this file would be difficult and all my old data is stored in this format so I'd like to solve this in Hive without making changes to how to store my data.

Is there a way for me to do this?

Edit:

Thanks @mark-grover, this worked!

I did something like this:

SELECT id, item FROM my_table LATERAL VIEW explode(split(substr(items, 3, length(items) - 2), ',')) t1 AS item

to get:

101 231

101 234

101 119

101 12

102 13

102 89

102 121

In doing this I found out I'd like to do a little more. I'd like to also select the position that each item was in in the CSV. So I'd like the output to look like:

101 231 0

101 234 1

101 119 2

101 12 3

102 13 1

102 89 2

102 121 3

Any ideas here?

Edit again:

I came up with something for my followup question doing something like this:

SELECT id, item, find_in_set(item, substr(items, 3, length(items) - 2)) AS position

FROM my_table LATERAL VIEW explode(split(substr(items, 3, length(items) - 2), ',')) t1 AS item

解决方案

You can do that use Hive UDFs.

You would want to use the split UDF to split the string on commas to get an array and then use Lateral view along with explode UDTF to "explode" the contents of the array.

To answer your second question:

This is what I can think of right now. There might be a better/easier/less hackier way to do it.

Create a UDF say split_extended that would return as a struct with 2 elements, first one being the actual contents of the array element, second one being the index. When you explode this the type of exploded column will be a struct. One of the entries of this struct would the actual element from the array, the other would be the index corresponding to the element.

If you think about it, we are augmenting split UDF to return 2 pieces of information - the element and the index. These are several ways of going about returning such information as a single object - structure is one of them, array being another (first element of the array being the actual element from the array, second element being the index). Another hackier way to return 2 pieces of information is to concat them with something (say a ':') that you know doesn't exist in your data so then you can have your UDF return '231:0', '234:0', '119:2', etc. and in the end instead of accessing them members of a struct, you can split the resulting string out based on ':' (using split UDF) to get the desired two pieces.

For creating your own UDF, you will be able to leverage most, if not all of split UDF code available here.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值