1. Integrating HBase and Hive
(1) Create an HBase table and add data
#'test' is the table name, 'name' is the column family
#one HBase column family can hold multiple qualifiers (columns)
create 'test','name'
#insert data: put 'table','row-key','family:qualifier','value'
put 'test','1','name:t1','1'
put 'test','1','name:t2','2'
#scan the whole table
scan 'test'
#read a single cell: get 'table','row-key','family:qualifier'
get 'test','1','name:t1'
#drop the table (it must be disabled first)
disable 'test'
drop 'test'
#describe the table
desc 'test'
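The shell commands above can be mirrored with a tiny in-memory model of HBase's layout (row key → 'family:qualifier' → value). This is only an illustrative sketch, not an HBase client:

```python
# In-memory sketch of the HBase data model used above:
# table -> row key -> 'family:qualifier' -> value.
table = {}

def put(table, row_key, column, value):
    """Mirror of the shell's put: store one cell under its row key."""
    table.setdefault(row_key, {})[column] = value

# mirrors: put 'test','1','name:t1','1' / put 'test','1','name:t2','2'
put(table, '1', 'name:t1', '1')
put(table, '1', 'name:t2', '2')

# mirrors: get 'test','1','name:t1'
print(table['1']['name:t1'])  # -> '1'

# mirrors: scan 'test' (walk every cell in the table)
for row_key, cells in table.items():
    for column, value in cells.items():
        print(row_key, column, value)
```

The nesting makes the point of the earlier comment concrete: both t1 and t2 live inside the single column family 'name', keyed by the same row key '1'.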
(2) Create an external table in Hive that maps to the HBase table
Note: Hive does not accept a ROW FORMAT DELIMITED clause together with STORED BY, so the original FIELDS TERMINATED BY line is dropped here.
CREATE EXTERNAL TABLE test(key string, t1 int, t2 int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,name:t1,name:t2")
TBLPROPERTIES ("hbase.table.name" = "test", "hbase.mapred.output.outputtable" = "test");
Test that the two platforms see the same data and that updates propagate: a put in the HBase shell should show up when querying the Hive table, and an insert through Hive should show up in an HBase scan.
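The "hbase.columns.mapping" property pairs up with the Hive column list position by position: the first Hive column binds to ':key' (the HBase row key), the rest to 'family:qualifier' entries. A quick sketch of that pairing (the helper name column_pairs is hypothetical, for illustration only):

```python
def column_pairs(hive_columns, mapping):
    """Pair each Hive column with its HBase counterpart, by position.

    ':key' denotes the HBase row key; every other mapping entry is a
    'family:qualifier' pair.
    """
    return list(zip(hive_columns, mapping.split(',')))

# the columns and mapping from the DDL above
pairs = column_pairs(['key', 't1', 't2'], ':key,name:t1,name:t2')
print(pairs)
# -> [('key', ':key'), ('t1', 'name:t1'), ('t2', 'name:t2')]
```

Because the binding is positional, the number of entries in the mapping string must match the number of Hive columns exactly, or the table creation fails.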
2. Connecting to Hive and reading data with pandas
(1) Configure hive-site.xml (bind address and port for HiveServer2's Thrift service)
<property>
<name>hive.server2.thrift.bind.host</name>
<value>192.168.99.250</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
(2) Start the Hive metastore and HiveServer2
hive --service metastore &
hiveserver2 &
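Before connecting from Python, it can help to confirm the Thrift port is actually reachable. A minimal sketch, assuming the host/port values configured in hive-site.xml above (the function name thrift_port_open is hypothetical):

```python
import socket

def thrift_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. thrift_port_open('192.168.99.250', 10000)
```

If this returns False, check that hiveserver2 finished starting (it can take a while after launch) and that no firewall blocks port 10000.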
(3) Read data
from pyhive import hive
import pandas as pd

#host: the HiveServer2 IP (the thrift bind host configured above),
#port: the thrift port, username: the connecting user; a database
#argument can also be passed to select the working database
conn = hive.Connection(host='192.168.99.250', port=10000, username='hive')
cursor = conn.cursor()
cursor.execute('show databases')
# print the results
for result in cursor.fetchall():
    print(result)

Or read straight into a pandas DataFrame:
sql = 'select * from default.employees'
df = pd.read_sql(sql, conn)
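Under the hood, pd.read_sql does roughly what the cursor loop does: fetch the row tuples and attach the column names. An offline sketch of that conversion, using simulated rows (stand-ins for a real Hive result set, so it runs without a server):

```python
import pandas as pd

# Simulated result set: fetchall() returns a list of row tuples,
# and the column names come from cursor.description. The values
# below mimic one row of the 'test' table mapped earlier.
rows = [('1', 1, 2)]              # stand-in for cursor.fetchall()
columns = ['key', 't1', 't2']     # stand-in for cursor.description names
df = pd.DataFrame(rows, columns=columns)
print(df)
```

This is handy when you want the cursor's richer control (parameters, batching via fetchmany) but still end up with a DataFrame for analysis.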