Preface
I recently started experimenting with TensorFlow, but my data lives in Hive, so reading it out became a real problem. There are plenty of examples online, but they came with plenty of pitfalls of their own. After the proverbial eighty-one tribulations, I finally managed to connect.
Let me start with the pitfalls.
Pitfall 1
Connecting through Thrift directly. Most of the material I found online looked quite dated, but I gave it a try anyway. Reference code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import traceback

sys.path.append('/home/hadoop/hive/lib/py')

from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol


def hiveExe(sql):
    try:
        transport = TSocket.TSocket('127.0.0.1', 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = ThriftHive.Client(protocol)
        transport.open()
        client.execute(sql)
        print "The return value is : "
        print client.fetchAll()
        print "............"
        transport.close()
    except Thrift.TException, tx:
        traceback.print_exc()
        print 'exception %s' % (tx.message)


if __name__ == '__main__':
    hiveExe("select * from temp.")
Installing thrift with pip install went fine, but hive_service was another story: it cannot be installed via pip, and I could not find it anywhere via Google either, so I gave up on this route.
Pitfall 2
By chance I discovered pyhive on GitHub and for a moment felt like I had seen the light. Reference code:
from pyhive import hive
from TCLIService.ttypes import TOperationState

cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM my_awesome_data LIMIT 10', async=True)

status = cursor.poll().operationState
while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
    logs = cursor.fetch_logs()
    for message in logs:
        print message
    # If needed, an asynchronous query can be cancelled at any time with:
    # cursor.cancel()
    status = cursor.poll().operationState

print cursor.fetchall()
But the pitfalls kept coming. The line from pyhive import hive itself failed with No module named sasl, and pip install sasl then failed too, so I had to chase the errors down one by one. As far as I remember that meant installing python-dev and upgrading some gcc-related packages; there was too much to keep track of. Once sasl was finally installed, reading data still failed, this time with the error TSocket read 0 bytes. I asked about it in the project's GitHub Issues but got no good resolution, so I switched to different code.
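For what it's worth, a TSocket read 0 bytes error often means the client's authentication mechanism does not match the server's hive.server2.authentication setting. Here is a minimal sketch (host, port, and database are placeholders, and the server is assumed to be configured for NOSASL) of forcing the matching mechanism on the pyhive side; the import is deferred so the snippet loads even where pyhive is not installed:

```python
def connect_nosasl(host='localhost', port=10000, database='default'):
    """Open a HiveServer2 connection forcing NOSASL authentication.

    Sketch only: 'TSocket read 0 bytes' is frequently an auth mismatch,
    so the auth argument here must agree with the server's
    hive.server2.authentication setting (NOSASL in this sketch).
    """
    from pyhive import hive  # deferred import; pyhive assumed installed
    return hive.connect(host=host, port=port,
                        auth='NOSASL', database=database)
```

If the server actually runs the default PLAIN SASL, forcing NOSASL will fail in exactly the same way, so it is worth checking the server-side setting first.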
Working code
# -*- coding: utf-8 -*-
import pyhs2


class HiveClient:
    def __init__(self, db_host, database, port=10000, authMechanism="NOSASL"):
        """
        create connection to hive server2
        """
        self.conn = pyhs2.connect(host=db_host,
                                  port=port,
                                  authMechanism=authMechanism,
                                  database=database,
                                  )

    def query(self, sql):
        """
        query
        """
        with self.conn.cursor() as cursor:
            cursor.execute(sql)
            return cursor.fetch()

    def close(self):
        """
        close connection
        """
        self.conn.close()


class connecthive():
    def __init__(self, database):
        self.hive_client = HiveClient(db_host='localhost', port=10000,
                                      database=database, authMechanism='NOSASL')

    def query(self, sql):
        """
        query
        """
        result = self.hive_client.query(sql)
        self.hive_client.close()
        return result
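Since the original goal was feeding this data into TensorFlow, one more step is turning the plain row lists that cursor.fetch() returns into model inputs. A hypothetical helper for that is sketched below; the column layout (all numeric columns, label in the last position) is my assumption, not something the working code above guarantees:

```python
def split_features_labels(rows, label_index=-1):
    """Split query result rows (a list of lists) into (features, labels).

    Hypothetical helper: assumes every row is a flat list of numeric
    values, with the label stored at label_index (last column by default).
    """
    n = len(rows[0]) if rows else 0
    features = [[float(v) for i, v in enumerate(row) if i != label_index % n]
                for row in rows]
    labels = [float(row[label_index]) for row in rows]
    return features, labels
```

From there, the features and labels lists can go straight into numpy arrays or a TensorFlow feed as usual.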
This code finally let me see the data in Hive. All in all, connecting to Hive from Python involves far too many pitfalls, so proceed with care!