Combining Hive and Python

Querying large datasets with Hive is trivial. However, there are times when Hive’s built in functions are insufficient. For example, I recently needed to extract the ordering of elements in an array.

The common advice for such problems is to write a custom Java UDF (User Defined Function). However, since I do not know Java, that would take a lot of time. Luckily, it’s easy to create a script in another language and feed it into a hive query using the function TRANSFORM.

For example, the query below runs a python script on an array of ids seen in search and several other columns.

SELECT
TRANSFORM (hosting_ids, user_id, d) 
USING 'python combine_arrays.py' AS (hosting_ranks_array, user_id, d)
FROM 
s_table;

The python script below takes in a set of lines in which table columns are delimited by tabs. It splits the lines and transforms the first column to include the search rank as a subfield. Note, the script must also handle the the other fields and return them in the proper order. Getting the formatting correct is important because Hive is fickle and not very verbose when returning errors.

import sys

### First columns looks like [123, 213, 212312]
### First column returns [123:0,  213:1,  212312:2]. 

for line in sys.stdin:
    ### Format string
    line = line.strip()
    line = str(line).replace("[", '').replace(']', '').replace('{', '').replace('}', '').replace('(', '').replace(')', '')
    line = line.split('\t')

    ### Format array
    array1 = str(line[0]).split(',')
    num_items = len(array1)
    combined_array = []
    for i in range(num_items):
        this_item = str(array1[i]) + ':' + str(i)
        combined_array.append(this_item)

    ### Remove array characters from resulting string.
    combined_array = str(combined_array).replace('[', '').replace(']', '')
    combined_array = combined_array.replace('\'', '').replace('\"', '')

    ## Add additional fields if they exist and print.
    if len(line) == 1:
        print combined_array
    else:
        rest = '\t'.join(([str(i) for i in line[1:]]))
        print '\t'.join([combined_array, rest])
Please email comments to: afrad (at) stanford (dot) edu
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值