I have an embarrassingly parallel task for which I use Spark to distribute the computations. These computations are in Python, and I use PySpark to read and preprocess the data. The input data to my task is stored in HBase. Unfortunately, I've yet to find a satisfactory (i.e., easy to use and scalable) way to read/write HBase data from/to Spark using Python.
What I've explored previously:
Connecting from within my Python processes using happybase. This package allows connecting to HBase from Python via HBase's Thrift API. This way, I basically skip Spark for data reading/writing and miss out on potential HBase-Spark optimizations. Read speeds seem reasonably fast, but write speeds are slow. This is currently my best solution (a minimal happybase sketch follows below).
Using SparkContext's newAPIHadoopRDD and saveAsNewAPIHadoopDataset, which make use of HBase's MapReduce interface. Examples for this were once included in the Spark code base (see here), but they are now considered outdated in favor of HBase's Spark bindings (see here). I've also found this method slow and cumbersome for reading (writing worked well): for example, the strings returned from newAPIHadoopRDD had to be parsed and transformed in various ways to end up with the Python objects I wanted. A sketch of this approach also follows below.
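For reference, here's roughly what my happybase approach (the first item above) looks like. The Thrift server host and the table/column names are placeholders; adjust them for your cluster:

import happybase

# Placeholder host: point this at your HBase Thrift server.
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('testtable')

# Reading a single row is straightforward and reasonably fast.
row = table.row(b'row-key')
print(row[b'cf:col1'])

# Writes go through the Thrift server; batching reduces round trips,
# but this is still where the slowness shows up.
with table.batch(batch_size=1000) as batch:
    batch.put(b'row-key-1', {b'cf:col1': b'1.0'})
    batch.put(b'row-key-2', {b'cf:col1': b'2.0'})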
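And a sketch of the newAPIHadoopRDD approach (the second item above), modeled on the now-removed Spark examples. The ZooKeeper quorum is a placeholder, and the converter classes assume the Spark examples jar is on the classpath:

from pyspark import SparkContext

sc = SparkContext()

conf = {
    'hbase.zookeeper.quorum': 'zk-host',  # placeholder
    'hbase.mapreduce.inputtable': 'testtable',
}
rdd = sc.newAPIHadoopRDD(
    'org.apache.hadoop.hbase.mapreduce.TableInputFormat',
    'org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    'org.apache.hadoop.hbase.client.Result',
    keyConverter='org.apache.spark.examples.pythonconverters.'
                 'ImmutableBytesWritableToStringConverter',
    valueConverter='org.apache.spark.examples.pythonconverters.'
                   'HBaseResultToStringConverter',
    conf=conf)

# The values arrive as strings that still need to be parsed and
# transformed into the Python objects I actually want.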
Alternatives that I'm aware of:
I'm currently using Cloudera's CDH and version 5.7.0 offers hbase-spark (CDH release notes, and a detailed blog post). This module (formerly known as SparkOnHBase) will officially be a part of HBase 2.0. Unfortunately, this wonderful solution seems to work only with Scala/Java.
Huawei's Spark-SQL-on-HBase / Astro (I don't see a difference between the two...). It does not look as robust and well-supported as I'd like my solution to be.
Solution
I found this comment by one of the makers of hbase-spark, which seems to suggest there is a way to use PySpark to query HBase using Spark SQL.
And indeed, the pattern described here can be applied to query HBase with Spark SQL using PySpark, as the following example shows:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

data_source_format = 'org.apache.hadoop.hbase.spark'

# A small all-string DataFrame to round-trip through HBase.
df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])

# ''.join(string.split()) in order to write a multi-line JSON string here.
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"testtable"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())

# Writing (.option('catalog', catalog) works as an alternative to .options(catalog=catalog))
df.write \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .save()

# Reading
df = sqlc.read \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .load()
I've tried hbase-spark-1.2.0-cdh5.7.0.jar (as distributed by Cloudera) for this, but ran into trouble: when writing, org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select; when reading, java.util.NoSuchElementException: None.get. As it turns out, this version of CDH does not include the changes to hbase-spark that allow Spark SQL-HBase integration.
What does work for me is the shc Spark package, found here. The only change I had to make to the above script was the data source format:
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
Here's how I submit the above script on my CDH cluster, following the example from the shc README:
spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /opt/cloudera/parcels/CDH/lib/hbase/conf/hbase-site.xml example.py
Most of the work on shc seems to have already been merged into the hbase-spark module of HBase, for release in version 2.0. With that, Spark SQL querying of HBase is possible using the above-mentioned pattern (see https://hbase.apache.org/book.html#_sparksql_dataframes for details). My example above shows what this looks like for PySpark users.
Finally, a caveat: my example data above contains only strings. shc does not handle Python-side data conversion, so when I tried integers and floats they either did not show up in HBase at all or showed up as weird values.
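The workaround I'd suggest, given that limitation: cast everything to strings on the Spark side before writing. A sketch, where df_numeric is a hypothetical DataFrame with a float column, and catalog/data_source_format are as defined above:

from pyspark.sql.functions import col

# Hypothetical numeric DataFrame; cast non-string columns to strings
# before handing them to shc, matching the all-string catalog above.
df_numeric = sqlc.createDataFrame([('a', 1.0), ('b', 2.0)], ['col0', 'col1'])
df_strings = df_numeric.withColumn('col1', col('col1').cast('string'))

df_strings.write \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .save()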