Connecting to Hive from Python

董云龙

Article link: https://blog.csdn.net/dongyunlon/article/details/79085189

1. HiveServer1 & HiveServer2

1.1 HiveServer1

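HiveServer is an optional service that allows remote clients to submit requests to Hive and retrieve the results, using a variety of programming languages. HiveServer is built on Apache Thrift, so it is sometimes called the Thrift server, although this can cause confusion because HiveServer2 is also built on Thrift. HiveServer is also known as HiveServer1.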

1.1.1 Limitations of HiveServer1

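HiveServer cannot handle concurrent requests from more than one client. This limitation is imposed by the Thrift interface that HiveServer exports, and it cannot be resolved by modifying the HiveServer code.
HiveServer2 is a rewrite of HiveServer that addresses these problems, and it is available starting with Hive 0.11.0. Using HiveServer2 is recommended. HiveServer1 was removed from Hive releases starting with Hive 1.0.0 (formerly called 0.14.1).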

1.2 HiveServer2

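HiveServer2 (HS2) likewise lets clients execute queries against Hive. It is the successor to the deprecated HiveServer1. HS2 supports multi-client concurrency and authentication, and it is designed to provide better support for open API clients such as JDBC and ODBC.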

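1.2.1 HS2 Architecture

The Thrift-based Hive service is the core of HS2 and is responsible for servicing Hive queries (e.g., from Beeline). Thrift is an RPC framework for building cross-platform services. Its stack consists of four layers: Server, Transport, Protocol, and Processor. See the Apache Thrift documentation for details.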

1.2.1.1 Server

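HS2 uses TThreadPoolServer (from Thrift) in TCP mode and a Jetty server in HTTP mode.
TThreadPoolServer allocates one worker thread per TCP connection. Each thread stays associated with its connection even when the connection is idle, so a large number of concurrent connections produces a large number of threads, which is a potential performance problem. In the future HS2 might switch to another server type for TCP mode, for example TThreadedSelectorServer.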

1.2.1.2 Transport

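HTTP mode is required when a proxy is needed between the client and the server (for example, for load balancing or security reasons). That is why it is supported in addition to TCP mode. The transport mode of the Thrift service is set through the Hive configuration property hive.server2.transport.mode.
hive.server2.transport.mode accepts binary (TCP) and http, and defaults to binary. In http mode the default listening port becomes 10001 and the connection URL changes as well; see "Connection URL When HiveServer2 Is Running in HTTP Mode" for details.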
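
As a rough illustration of how the transport mode affects the client side, the sketch below builds a connection URL for each mode. The helper name hs2_url is hypothetical. The port 10000 for binary mode matches the examples later in this article, and the transportMode/httpPath session variables with the path cliservice follow the Hive documentation's defaults; a real deployment may configure different values.

```python
# Default ports per transport mode: 10001 for http is the value given in the
# text; 10000 for binary is the port used in the examples in this article.
DEFAULT_PORTS = {"binary": 10000, "http": 10001}


def hs2_url(host, mode="binary", db="default", http_path="cliservice"):
    """Return a JDBC-style HS2 URL for a transport mode (hypothetical helper)."""
    if mode not in DEFAULT_PORTS:
        raise ValueError("mode must be 'binary' or 'http'")
    url = "jdbc:hive2://%s:%d/%s" % (host, DEFAULT_PORTS[mode], db)
    if mode == "http":
        # Session variables that select the HTTP transport and its endpoint path.
        # 'cliservice' is assumed here as the server's HTTP path.
        url += ";transportMode=http;httpPath=%s" % http_path
    return url


print(hs2_url("localhost"))
print(hs2_url("localhost", mode="http"))
```

Keeping the port and the extra session variables in one place makes it harder to switch one without the other when a cluster moves behind a proxy.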

1.2.1.3 Protocol

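The Protocol layer is responsible for serialization and deserialization. HS2 currently uses TBinaryProtocol as its Thrift serialization protocol. Other protocols, such as TCompactProtocol, may be considered in the future based on performance evaluation.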

1.2.1.4 Processor

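The Processor is the application logic that handles requests. For example, the ThriftCLIService.ExecuteStatement() method implements the logic to compile and execute a Hive query.

1.2.2 HS2 Dependencies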

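Metastore
The metastore can be configured as embedded (in the same process as HS2) or as a remote, Thrift-based service. HS2 talks to the metastore to get the metadata it needs to compile queries.
Hadoop cluster
HS2 prepares physical execution plans for the various execution engines (MapReduce/Tez/Spark) and submits jobs to the Hadoop cluster for execution.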

2. JDBC Client

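JDBC is the recommended way for client applications to interact with HS2. Note that some use cases (e.g., Hadoop Hue) use the Thrift client directly and bypass JDBC. The API calls proceed as follows:

A JDBC client (e.g., Beeline) creates a HiveConnection by initiating a transport connection (e.g., a TCP connection) and then making an OpenSession API call to obtain a SessionHandle. The session is created on the server side.
The HiveStatement is executed (following the JDBC standard), and the Thrift client makes an ExecuteStatement API call. The SessionHandle is passed to the server along with the query in that call.
The HS2 server receives the request and asks the driver (a CommandProcessor) to parse and compile the query. The driver kicks off a background job that talks to Hadoop and then immediately returns a response to the client. This is the asynchronous design of the ExecuteStatement API. The response contains an OperationHandle created on the server side.
The client uses the OperationHandle to talk to HS2 and polls for the status of the query execution.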
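
The call sequence above can be sketched with an in-memory stand-in for the server. Everything in this sketch is hypothetical: FakeHS2, its method names, and the handle strings only mimic the shape of the OpenSession / ExecuteStatement / polling flow and are not the real HiveServer2 Thrift API.

```python
import itertools
import threading
import time


class FakeHS2:
    """In-memory stand-in for the server side (not the real HiveServer2 API)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._sessions = set()
        self._operations = {}  # operation handle -> {"state": ..., "rows": ...}

    def open_session(self):
        # The session is created on the server; the client only gets a handle.
        handle = "session-%d" % next(self._ids)
        self._sessions.add(handle)
        return handle

    def execute_statement(self, session_handle, query):
        if session_handle not in self._sessions:
            raise ValueError("unknown session handle")
        op_handle = "operation-%d" % next(self._ids)
        self._operations[op_handle] = {"state": "RUNNING", "rows": None}
        # Start the work in the background and return the handle immediately.
        threading.Thread(target=self._run, args=(op_handle, query)).start()
        return op_handle

    def _run(self, op_handle, query):
        time.sleep(0.05)  # pretend the job is running on the cluster
        op = self._operations[op_handle]
        op["rows"] = [("result of: " + query,)]
        op["state"] = "FINISHED"

    def get_operation_status(self, op_handle):
        return self._operations[op_handle]["state"]

    def fetch_results(self, op_handle):
        return self._operations[op_handle]["rows"]


def run_query(server, query, poll_interval=0.01):
    """Client side: open a session, submit the query, poll, then fetch."""
    session = server.open_session()
    operation = server.execute_statement(session, query)
    while server.get_operation_status(operation) == "RUNNING":
        time.sleep(poll_interval)
    return server.fetch_results(operation)


print(run_query(FakeHS2(), "SELECT 1"))
```

The point of the design is that execute_statement returns before the work is done, so the client stays responsive and can decide how often to poll.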

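2.1 Beeline - Command Line Shell

Beeline is a command-line shell that works with HiveServer2. It is recommended over Hive CLI. Hive CLI has two main use cases:

A "thick client" for SQL on Hadoop
A command-line tool for the Hive server

Hive CLI is deprecated as of Hive 1.0.0. Ideally it would be dropped outright in favor of Beeline plus HiveServer2, but Hive CLI is too widely used for that. The fallback is to change the implementation of Hive CLI so that it becomes an alias for Beeline, with the internals handled entirely by Beeline, which keeps the changes required of users to a minimum. However, because some existing Hive CLI features are not supported in the new Hive CLI, the old Hive CLI is still used by default. The following setting enables the Beeline-based Hive CLI: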

export USE_DEPRECATED_CLI=false

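Note: with this setting, the log4j configuration file changes to "beeline-log4j.properties".
Like Hive CLI, Beeline has an embedded mode and a remote mode. Embedded mode resembles the embedded mode of Hive CLI, while remote mode connects over Thrift. Remote mode is recommended: it does not grant users direct HDFS/metastore access and is therefore more secure. A usage example: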

% bin/beeline 
Hive version 0.11.0-SNAPSHOT by Apache
beeline> !connect jdbc:hive2://localhost:10000 scott tiger
!connect jdbc:hive2://localhost:10000 scott tiger 
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-SNAPSHOT)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show tables;
show tables;
+-------------------+
|     tab_name      |
+-------------------+
| primitives        |
| src               |
| src1              |
| src_json          |
| src_sequencefile  |
| src_thrift        |
| srcbucket         |
| srcbucket2        |
| srcpart           |
+-------------------+
9 rows selected (1.079 seconds)

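As shown above, first run bin/beeline to enter the Beeline shell, then use the !connect command to connect to HiveServer2; "scott" and "tiger" are the username and password. Alternatively, connect to HS2 directly with the following command: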

% beeline -u jdbc:hive2://localhost:10000/default -n username -p password
Hive version 0.11.0-SNAPSHOT by Apache

Connecting to jdbc:hive2://localhost:10000/default

To exit Beeline, the !quit command is recommended, although CTRL+C also works.

2.2 JDBC

2.2.1 Connection URL Format

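The connection URL has the following format:

jdbc:hive2://<host1>:<port1>,<host2>:<port2>/dbName;initFile=<file>;sess_var_list?hive_conf_list#hive_var_list

<host1>:<port1>,<host2>:<port2> is the server instance to connect to, or a comma-separated list of server instances (if dynamic service discovery is enabled). If empty, the embedded server is used.
dbName is the name of the initial database.
<file> is the path of an init script file (Hive 2.2.0 and later). The script is written in SQL and is executed automatically after connecting. This option can be empty.
sess_var_list is a semicolon-separated list of key=value pairs of session variables (e.g., user=foo;password=bar).
hive_conf_list is a semicolon-separated list of key=value pairs of Hive configuration variables for this session.
hive_var_list is a semicolon-separated list of key=value pairs of Hive variables for this session.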
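
The pieces of the connection URL described above can be assembled programmatically. The following is a minimal sketch, not part of any Hive client library: build_hive2_url and its parameter names are hypothetical, and the variable names in the usage example (hive.exec.parallel, run_date) are only sample values.

```python
def _join(pairs):
    """Render (key, value) pairs as a semicolon-separated key=value list."""
    return ";".join("%s=%s" % (key, value) for key, value in pairs)


def build_hive2_url(hosts, db_name="default", init_file=None,
                    sess_vars=(), hive_conf=(), hive_vars=()):
    """Assemble a jdbc:hive2 URL (hypothetical helper).

    hosts is a sequence of (host, port) tuples; the three variable lists
    are sequences of (key, value) tuples so that their order is preserved.
    """
    url = "jdbc:hive2://" + ",".join("%s:%d" % (h, p) for h, p in hosts)
    url += "/" + db_name
    if init_file:
        url += ";initFile=" + init_file
    if sess_vars:
        url += ";" + _join(sess_vars)   # session variables follow ';'
    if hive_conf:
        url += "?" + _join(hive_conf)   # Hive configuration follows '?'
    if hive_vars:
        url += "#" + _join(hive_vars)   # Hive variables follow '#'
    return url


print(build_hive2_url(
    [("localhost", 10000)],
    sess_vars=[("user", "foo"), ("password", "bar")],
    hive_conf=[("hive.exec.parallel", "true")],
    hive_vars=[("run_date", "2018-01-01")],
))
```

Taking the variables as ordered pairs rather than a single string keeps the three separators (';', '?', '#') in one place, which is where hand-written URLs tend to go wrong.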

2.2.2 Python HiveClient

1. ThriftHive

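The following method comes from the HiveClient page of the official wiki. It is the first and oldest approach. Go to the Hive installation directory and copy all the folders under $HIVE_HOME/lib/py into Python's library directory (site-packages), or simply place your code in the same directory as those py libraries, and then call Hive through the Thrift interface they provide. Whether the import is

from hive import ThriftHive
OR
from hive_service import ThriftHive

depends on the installed Hive version; look in the py directory to see whether it contains a hive or a hive_service directory.
Use the following commands to find the path of Python's site-packages directory: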

>>> from distutils.sysconfig import get_python_lib
>>> print(get_python_lib())
C:\Python27\Lib\site-packages
>>> exit()

A sample program:

#!/usr/bin/env python

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
    # Buffered TCP transport to the Hive server on port 10000
    transport = TSocket.TSocket('localhost', 10000)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = ThriftHive.Client(protocol)
    transport.open()

    client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
    client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")

    # Fetch the rows one at a time
    client.execute("SELECT * FROM r")
    while True:
        row = client.fetchOne()
        if row is None:
            break
        print(row)

    # Fetch all rows in a single call
    client.execute("SELECT * FROM r")
    print(client.fetchAll())

    transport.close()

except Thrift.TException as tx:
    print('%s' % (tx.message))

2. pyhs2 driver

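The Setting Up HiveServer2 page of the official wiki offers another method: using pyhs2 directly. It appears to be maintained by an individual, and according to the notice on GitHub, pyhs2 has not been maintained since 2016-01-05. The library requires Python 2.6+. Install pyhs2 with: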

pip install pyhs2

Installing it directly may fail, however: pyhs2 needs SASL and other dependencies, which are troublesome to install on Windows. Sample code:

import pyhs2

with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="PLAIN",
                   user='root',
                   password='test',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Show databases
        print(cur.getDatabases())

        # Execute query
        cur.execute("select * from table")

        # Return column info from query
        print(cur.getSchema())

        # Fetch table results
        for i in cur.fetch():
            print(i)

3. PyHive
The pyhs2 README recommends PyHive as a better alternative; it can connect to both Hive and Presto. Install it with:
+ pip install pyhive[hive] for the Hive interface and
+ pip install pyhive[presto] for the Presto interface.

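Installation requires quite a few system-level dependencies. The following notes, written for pyhs2, show what is involved: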

To install pyhs2 on a clean CentOS 6.4 64-bit desktop....

(as root or with sudo)

get ez_setup.py from https://pypi.python.org/pypi/ez_setup
python ez_setup.py
easy_install pip
yum install gcc-c++
yum install cyrus-sasl-devel.x86_64
yum install python-devel.x86_64
pip install pyhs2

Asynchronous example:

from pyhive import hive
from TCLIService.ttypes import TOperationState

cursor = hive.connect('localhost').cursor()
# Recent PyHive releases spell the keyword async_, because async is reserved in Python 3.7+
cursor.execute('SELECT * FROM my_awesome_data LIMIT 10', async_=True)

status = cursor.poll().operationState
while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
    logs = cursor.fetch_logs()
    for message in logs:
        print(message)

    # If needed, an asynchronous query can be cancelled at any time with:
    # cursor.cancel()

    status = cursor.poll().operationState

print(cursor.fetchall())

Synchronous example:

from pyhive import presto  # or import hive
cursor = presto.connect('localhost').cursor()
cursor.execute('SELECT * FROM my_awesome_data LIMIT 10')
print(cursor.fetchone())
print(cursor.fetchall())
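
PyHive cursors follow the Python DB-API, so fetched rows are plain tuples and the column names are available from cursor.description. The sketch below pairs the two. It uses sqlite3 purely as a stand-in DB-API driver so that it runs without a Hive or Presto server; rows_as_dicts is a hypothetical helper, and the table contents are made up for illustration.

```python
import sqlite3  # stand-in DB-API driver; a PyHive cursor would be used the same way


def rows_as_dicts(cursor):
    """Pair each remaining row with the column names in cursor.description."""
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE my_awesome_data (id INTEGER, name TEXT)")
cursor.execute("INSERT INTO my_awesome_data VALUES (1, 'a')")
cursor.execute("INSERT INTO my_awesome_data VALUES (2, 'b')")
cursor.execute("SELECT id, name FROM my_awesome_data ORDER BY id LIMIT 10")
print(rows_as_dicts(cursor))
conn.close()
```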

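The problem is that pyhs2 and PyHive are hard to install on Windows. In practice they seem usable only on Linux, since setting up the required libraries on Windows is too much trouble.
In summary, the first approach is untested, but it appears to be the only one that might work on Windows, so on Windows it is best to avoid connecting to Hive from Python.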