基础环境
OS:win10
Python:3.9
IDE工具:pycharm
Hive:2.1.1+cdh6.1.1
Impala:3.1.0+cdh6.1.1
Hadoop:3.0.0+cdh6.1.1
pip command
# 安装本地文件
pip install sasl-0.3.1-cp39-cp39-win_amd64.whl
#默认源安装
pip install package_name==version
#指定从清华源安装
pip install thrift -i https://pypi.tuna.tsinghua.edu.cn/simple
#指定从清华源安装指定版本
pip install thrift_sasl==0.3.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
#卸载指定包
pip uninstall package_name
#查询已经安装的版本
pip show package_name
同时支持Hive与Impala版本
#需要安装的依赖
pip install pyhive==0.6.5
pip install SQLAlchemy==1.3.24
pip install pandas==1.2.5
pip install thrift_sasl==0.4.2
pip install impyla==0.16.2
经测试:
- thrift_sasl-0.2.1 #仅支持impala,连接hive报错
-
thrift_sasl-0.3.0 #仅支持hive,连接impala报错
Python连接Hive
from pyhive import hive
# 建立数据库连接
conn = hive.Connection(host='IP', port=10000, database="dmp",username='hiveTest')
cursor = conn.cursor()
query_sql = "show tables"
cursor.execute(query_sql)
result = cursor.fetchall()
cursor.close()
conn.close()
for res in result:
print(res)
Python连接Impala(也支持连接Hive)
from impala.dbapi import connect
# 建立数据库连接
#connect impala
conn = connect(host='IP', port=21050, auth_mechanism='NOSASL')
# connect hive
# conn = connect(host='IP', port=10000,database='default',user="dmp", auth_mechanism='PLAIN')
cursor = conn.cursor()
cursor.execute('show tables')
result = cursor.fetchall()
conn.close()
for res in result:
print(res)
Python基于Sqlalchemy与pandas访问Hive
from sqlalchemy import create_engine
import pandas as pd
# 读取数据 : jdbc:hive2://ip:10000 dmp
hive_engine = create_engine('hive://hiveTest@ip:10000/database?auth=NONE')
data = pd.read_sql("show tables",con=hive_engine)
print(data)