Reading and writing HBase 2.1.0 from PySpark 2.3.0 on CDH 6.2.0
1. Using Hive
First, create the HBase table and insert a few rows:
create 'books', 'info', 'analytics'
put 'books', 'In Search of Lost Time', 'info:author', 'Marcel Proust'
put 'books', 'In Search of Lost Time', 'info:year', '1922'
put 'books', 'In Search of Lost Time', 'analytics:views', '3298'
put 'books', 'Godel, Escher, Bach', 'info:author', 'Douglas Hofstadter'
put 'books', 'Godel, Escher, Bach', 'info:year', '1979'
put 'books', 'Godel, Escher, Bach', 'analytics:views', '820'
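To verify the rows landed, a quick scan in the HBase shell:
scan 'books'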
Then create the Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`books_ext` (
  `title` string,
  `author` string,
  `year` int,
  `views` double
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,info:author,info:year,analytics:views'
)
TBLPROPERTIES (
  'hbase.mapred.output.outputtable'='books',
  'hbase.table.name'='books'
);
(Note: Hive does not allow ROW FORMAT SERDE together with STORED BY; the storage handler supplies the SerDe itself, so only STORED BY is used here.)
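The fields in hbase.columns.mapping map positionally onto the Hive columns: :key becomes title, info:author becomes author, and so on. A minimal smoke test of the mapping, runnable once the handler jar from the next step is on Spark's classpath (it assumes a Hive-enabled SparkSession named spark, as built in the script further down):

# Hypothetical quick check that the Hive-to-HBase mapping works;
# both rows inserted above should come back.
spark.sql("select title, author, year, views from default.books_ext").show(truncate=False)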
Download hive-hbase-handler-2.1.1-cdh6.2.0.jar and put it into /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/jars so that Spark can load the Hive-HBase storage handler.
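As an alternative to copying the jar into the parcel directory, Spark's standard spark.jars option can point at it directly; a sketch, with an illustrative path:

# Sketch: put the Hive-HBase handler jar on the classpath via spark.jars
# instead of copying it into the parcel directory (path is illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Hive2HbaseAltJars") \
    .config("spark.jars", "/path/to/hive-hbase-handler-2.1.1-cdh6.2.0.jar") \
    .enableHiveSupport() \
    .getOrCreate()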
#! /usr/bin/python3.6
# -*- encoding:utf-8 -*-
"""
@author: xuan
@file:.py
@time:20-3-19
"""
from pyspark.sql import SparkSession
import os
if __name__ == "__main__":
    os.environ["PYSPARK_PYTHON"] = "/home/venv/bin/python3.6"
    warehouse_location = "/user/hive/warehouse/"
    spark = SparkSession.builder \
        .master("local[2]") \
        .appName("Hive2Hbase") \
        .enableHiveSupport() \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .getOrCreate()
    # Only print logs at WARN level and above
    spark.sparkContext.setLogLevel("WARN")
    '''
    When performing scans, tuning the cache parameters on the session can
    help performance, e.g.:
    spark.sql('SET hbase.scan.cache=10000')
    spark.sql('SET hbase.client.scanner.cache=10000')
    '''
    '''
    When querying books_ext directly from Hive, EXPLAIN shows that Hive
    optimizes simple predicates into the right HBase operations:
    - where title = 'something' becomes an HBase get;
    - where title >= 'something' and title < 'something' is correctly
      converted into a range scan.
    This is not the case for, e.g., the BETWEEN operator.
    While Hive's HBaseStorageHandler understands and correctly translates
    simple query predicates, the Spark engine is not that smart: the query
    is immediately turned into a full table scan and then filtered inside
    Spark, so Hive never gets the chance to optimize and push down the
    filter. DataFrame.explain() shows the Spark physical plan.
    The latter point means that accessing HBase from Spark through Hive is
    a good choice only when the operation touches the whole table anyway
    (e.g., a full table scan).
    '''
spark.sql("select* from default.books_ext where title='In Search of Lost Time'").show(truncate=False)
spark.sql("select* from default.books_ext where title='In Search of Lost Time'").explain()
    # spark.table("default.books_ext").show(truncate=False)
    # Landing the data as a Parquet file first and reading that is faster
    # spark.sql("create table default.books_ext_parquet stored as parquet as select * from default.books_ext")
    spark.sql("select * from default.books_ext").write.parquet("%s/books_ext_parquet" % warehouse_location, mode="overwrite")
    # spark.sql("select * from default.books_ext").write.format("parquet").mode("overwrite").save("%s/books_ext_parquet" % warehouse_location)
    spark.read.parquet("%s/books_ext_parquet" % warehouse_location).show()
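    # Optional sketch: roughly time the two read paths. On this toy table the
    # gap is negligible, but on a large table the Parquet copy avoids running
    # a full HBase scan for every query.
    import time
    t0 = time.perf_counter()
    spark.sql("select * from default.books_ext").count()  # full scan via Hive/HBase
    t1 = time.perf_counter()
    spark.read.parquet("%s/books_ext_parquet" % warehouse_location).count()  # Parquet copy
    t2 = time.perf_counter()
    print("hbase via hive: %.2fs, parquet: %.2fs" % (t1 - t0, t2 - t1))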
    spark.stop()
2. Using SHC, Hortonworks' open-source framework: building from source
Software to prepare:
- The SHC source package; choose the release that matches your environment's Spark version.
  Download from: https://github.com/hortonworks-spark/shc/releases
  I downloaded v1.1.3-2.3-SystemTest, which targets Spark 2.3+.
- Open the v1.1.3-2.3-SystemTest folder in IDEA and modify pom.xml so the relevant dependencies get downloaded.
1. In v1.1.3-2.3-SystemTest/pom.xml, modify the individual versions as needed:
<groupId>com.hortonworks</groupId>
<artifactId>shc</artifactId>
<version>spark-2.3.0-hbase-2.1.0</version>
<packaging>pom</packaging>
<name>HBase Spark Connector Project Parent POM</name>
<url>https://github.co