Using Spark from Python: how do I connect HBase and Spark using Python?

I have an embarrassingly parallel task for which I use Spark to distribute the computations. These computations are in Python, and I use PySpark to read and preprocess the data. The input data to my task is stored in HBase. Unfortunately, I've yet to find a satisfactory (i.e., easy to use and scalable) way to read/write HBase data from/to Spark using Python.

What I've explored previously:

Connecting from within my Python processes using happybase. This package allows connecting to HBase from Python by using HBase's Thrift API. This way, I basically skip Spark for data reading/writing and am missing out on potential HBase-Spark optimizations. Read speeds seem reasonably fast, but write speeds are slow. This is currently my best solution.
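For reference, a minimal sketch of what writing through happybase can look like when the puts are grouped with Table.batch(). The host name, table name, column name and batch size are placeholder assumptions, and whether batching helps with the slow writes described above depends on the cluster and the Thrift server setup.

```python
def write_rows(table, rows, column=b"cf:col1", batch_size=1000):
    """Write (row_key, value) pairs through a happybase batch.

    `table` is expected to be a happybase.Table. The column name and
    batch size are illustrative assumptions, not tuned values.
    """
    count = 0
    # Table.batch() buffers mutations, sends them every `batch_size` puts,
    # and sends the remainder when the `with` block exits.
    with table.batch(batch_size=batch_size) as batch:
        for row_key, value in rows:
            batch.put(row_key, {column: value})
            count += 1
    return count


def main():
    # Imported here so that write_rows() itself has no hard dependency.
    import happybase

    connection = happybase.Connection("localhost")  # hypothetical Thrift host
    try:
        table = connection.table("testtable")  # hypothetical table name
        print(write_rows(table, [(b"a", b"1.0"), (b"b", b"2.0")]))
    finally:
        connection.close()
```

Calling main() requires a running HBase Thrift server. write_rows() only relies on the batch() and put() methods, so it could also be called from inside a Spark mapPartitions function with one connection per partition.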

Using SparkContext's newAPIHadoopRDD and saveAsNewAPIHadoopDataset, which make use of HBase's MapReduce interface. Examples for this were once included in the Spark code base (see here). However, these are now considered outdated in favor of HBase's Spark bindings (see here). I've also found this method slow and cumbersome for reading (writing worked well): for example, the strings returned from newAPIHadoopRDD had to be parsed and transformed in various ways to end up with the Python objects I wanted.
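To illustrate the kind of parsing involved, here is a small sketch. It assumes that the value converter passed to newAPIHadoopRDD returns one JSON object per cell, separated by newlines, with the keys columnFamily, qualifier and value. To my knowledge that is how the converter from the former Spark examples behaved, but treat the format as an assumption. The helper name parse_cells is made up for this example.

```python
import json


def parse_cells(result_string):
    """Turn one converter output string into {"cf:qualifier": value}.

    Assumes one JSON object per line, each describing a single cell.
    """
    cells = {}
    for line in result_string.split("\n"):
        if not line.strip():
            continue  # skip empty lines
        cell = json.loads(line)
        cells[cell["columnFamily"] + ":" + cell["qualifier"]] = cell["value"]
    return cells


# Hand-made sample in the assumed format, standing in for one RDD value.
sample = "\n".join([
    json.dumps({"row": "a", "columnFamily": "cf", "qualifier": "col1", "value": "1.0"}),
    json.dumps({"row": "a", "columnFamily": "cf", "qualifier": "col2", "value": "x"}),
])
print(parse_cells(sample))
```

On a real RDD of (row key, string) pairs this would be applied as hbase_rdd.mapValues(parse_cells). All values are still strings afterwards, so any numeric conversion is a further step.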

Alternatives that I'm aware of:

I'm currently using Cloudera's CDH and version 5.7.0 offers hbase-spark (CDH release notes, and a detailed blog post). This module (formerly known as SparkOnHBase) will officially be a part of HBase 2.0. Unfortunately, this wonderful solution seems to work only with Scala/Java.

Huawei's Spark-SQL-on-HBase / Astro (I don't see a difference between the two...). It does not look as robust and well-supported as I'd like my solution to be.

Solution

I found this comment by one of the makers of hbase-spark, which seems to suggest there is a way to use PySpark to query HBase using Spark SQL.

And indeed, the pattern described here can be applied to query HBase with Spark SQL using PySpark, as the following example shows:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

data_source_format = 'org.apache.hadoop.hbase.spark'

df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])

# ''.join(string.split()) in order to write a multi-line JSON string here.
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"testtable"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())

# Writing
# (alternatively: .option('catalog', catalog) instead of .options(catalog=catalog))
df.write\
    .options(catalog=catalog)\
    .format(data_source_format)\
    .save()

# Reading
df = sqlc.read\
    .options(catalog=catalog)\
    .format(data_source_format)\
    .load()
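As a side note, the catalog string can also be built from a Python dict with json.dumps instead of stripping whitespace from a literal. This is only a sketch of an alternative: the helper name make_catalog and its argument layout are made up for this example, while the resulting JSON has the same structure as the catalog in the script.

```python
import json


def make_catalog(table, rowkey, columns, namespace="default"):
    """Build the catalog JSON string for the data source.

    `columns` maps a DataFrame column name to a
    (column family, qualifier, type) tuple.
    """
    catalog = {
        "table": {"namespace": namespace, "name": table},
        "rowkey": rowkey,
        "columns": {
            name: {"cf": cf, "col": col, "type": dtype}
            for name, (cf, col, dtype) in columns.items()
        },
    }
    # Compact separators give a single-line string without extra spaces.
    return json.dumps(catalog, separators=(",", ":"))


catalog = make_catalog(
    "testtable",
    "key",
    {"col0": ("rowkey", "key", "string"), "col1": ("cf", "col1", "string")},
)
print(catalog)
```

The result is passed to .options(catalog=catalog) exactly as before. Compared with ''.join(...split()), this leaves any whitespace inside names untouched and fails early if the structure is malformed.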

I've tried hbase-spark-1.2.0-cdh5.7.0.jar (as distributed by Cloudera) for this, but ran into trouble (org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select when writing, java.util.NoSuchElementException: None.get when reading). As it turns out, the present version of CDH does not include the changes to hbase-spark that allow Spark SQL-HBase integration.

What does work for me is the shc Spark package, found here. The only change I had to make to the above script is to change:

data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

Here's how I submit the above script on my CDH cluster, following the example from the shc README:

spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /opt/cloudera/parcels/CDH/lib/hbase/conf/hbase-site.xml example.py

Most of the work on shc seems to already be merged into the hbase-spark module of HBase, for release in version 2.0. With that, Spark SQL querying of HBase is possible using the above-mentioned pattern (see: https://hbase.apache.org/book.html#_sparksql_dataframes for details). My example above shows what it looks like for PySpark users.

Finally, a caveat: my example data above contains only strings. Python data conversion is not supported by shc, so integers and floats either did not show up in HBase at all or were stored as weird values.
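One possible workaround, sketched here as an assumption rather than something shc documents, is to convert every value to a string before building the DataFrame and to convert back after reading, keeping "type":"string" for all columns in the catalog. The helper names below are made up for this example.

```python
def to_string_row(row):
    """Convert every value in a row to str before it goes into the DataFrame."""
    # repr() is used for floats so that float() recovers the same value.
    return tuple(repr(v) if isinstance(v, float) else str(v) for v in row)


def from_string_row(row, types):
    """Convert string values read back from HBase to the given Python types."""
    return tuple(convert(value) for convert, value in zip(types, row))


rows = [("a", 1, 1.5), ("b", 2, 0.1)]
encoded = [to_string_row(r) for r in rows]
decoded = [from_string_row(r, (str, int, float)) for r in encoded]
print(encoded)
print(decoded)
```

On the Spark side the same idea would be applied with rdd.map(to_string_row) before toDF(), or by casting the columns of an existing DataFrame to string with df[c].cast('string').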
