python saveas: Spark Streaming data to HBase with Python blocked on saveAsNewAPIHadoopDataset

Latest recommended article published on 2021-04-25 11:36:33

weixin_39683858


Tags: python saveas

I'm using Spark Streaming with Python to read from Kafka and write to HBase, and I found that the job gets blocked very easily at the saveAsNewAPIHadoopDataset stage, as in the picture below:

You can see that this stage has been running for 8 hours. Does Spark write the data through the HBase API, or does it write directly through the HDFS API?

Solution

A bit late, but here is a similar example.

To save an RDD to HBase:

Consider an RDD containing a single line:

{"id":3,"name":"Moony","color":"grey","description":"Monochrome kitty"}

Transform the RDD

We need to transform the RDD into a (key, value) pair with the following contents:

( rowkey , [ row key , column family , column name , value ] )

datamap = rdd.map(lambda x: (str(json.loads(x)["id"]),[str(json.loads(x)["id"]),"cfamily","cats_json",x]))
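The same transformation can be written as a named function that parses each JSON line only once. This is just a sketch: the function name to_hbase_pair and its default arguments are made up for illustration, and the column family "cfamily" and column name "cats_json" are the ones from the line above.

```python
import json

def to_hbase_pair(line, column_family="cfamily", qualifier="cats_json"):
    """Turn one JSON line into (rowkey, [rowkey, column family, column name, value])."""
    record = json.loads(line)      # parse once instead of twice
    rowkey = str(record["id"])     # the row key must be a string
    return (rowkey, [rowkey, column_family, qualifier, line])

# The sample line from above, checked without a Spark context
sample = '{"id":3,"name":"Moony","color":"grey","description":"Monochrome kitty"}'
pair = to_hbase_pair(sample)

# With a real RDD of JSON strings this would be used as:
# datamap = rdd.map(to_hbase_pair)
```

The whole JSON line is stored as the cell value, so the row key is the only field pulled out of it.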

Save to HBase

We can use the RDD.saveAsNewAPIHadoopDataset function, as in the PySpark HBase example, to save the RDD to HBase:

datamap.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
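The call above uses conf, keyConv and valueConv without defining them. A minimal sketch of what they could look like, modeled on the hbase_outputformat.py example that ships with Spark, is below. The helper name build_hbase_write_conf, the ZooKeeper host "localhost" and the table name "cats" are assumptions for illustration, and the two converter classes come from the Spark examples jar, which has to be on the classpath. Since the output format here is TableOutputFormat, the rows should go to HBase as Put operations through the HBase client rather than being written as files through the HDFS API.

```python
def build_hbase_write_conf(zk_quorum, table):
    """Hadoop job configuration for writing to an HBase table via TableOutputFormat."""
    return {
        "hbase.zookeeper.quorum": zk_quorum,
        "hbase.mapred.outputtable": table,
        "mapreduce.outputformat.class":
            "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class":
            "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class":
            "org.apache.hadoop.io.Writable",
    }

# Converters from the Spark examples jar (assumed to be on the classpath)
keyConv = ("org.apache.spark.examples.pythonconverters."
           "StringToImmutableBytesWritableConverter")
valueConv = ("org.apache.spark.examples.pythonconverters."
             "StringListToPutConverter")

# Hypothetical host and table name; replace with your own
conf = build_hbase_write_conf("localhost", "cats")

# With the datamap RDD from the previous step:
# datamap.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
```

StringListToPutConverter expects the value as a list of [row key, column family, column name, value], which is why the RDD was shaped that way in the transform step.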

You can refer to my blog post, pyspark-sparkstreaming hbase, for the complete code of the working example.
