Sparse Matrix Computation Between PySpark and Java via a Custom UDF


续汉冕


Categories: pyspark, java. Tags: pyspark, udf, sparse matrix, java

Article link: https://blog.csdn.net/u010430471/article/details/90444354


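This article shows how to process sparse matrices in PySpark by calling into Java through a custom UDF. The steps are: prepare the data by building a csc_matrix with scipy.sparse and converting it into a PySpark DataFrame; write a custom Java UDF that adds the la4j library, takes ArrayList<Integer> parameters, builds a sparse matrix, and adds 1 to it; upload the jar file to HDFS; and finally call the UDF from PySpark and retrieve the result.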


Background

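A recent project required running some algorithms on sparse-matrix data stored in a Hive table. PySpark was the natural choice for the distributed side, since I know Python well, but the algorithm code is written in Java, so I had to package it as a UDF and call it from PySpark. This post looks at how to develop a UDF that operates on sparse-matrix data and how to call it from PySpark.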

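To keep this post simple, the main goal is just to get the whole pipeline working end to end.
Problem definition: pass sparse-matrix data from PySpark into a UDF, add 1 to the matrix inside the jar, and return the first element of the matrix.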
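As a reference for what the Java UDF is expected to compute, here is a minimal pure-Python sketch. It rests on two assumptions that the problem definition does not spell out: that "add 1" means adding 1 to every entry of the matrix (zeros included), and that the "first element" is the entry at row 0, column 0. The helper names `csc_to_dense` and `add_one_and_take_first` are hypothetical, and the input arrays are the example CSC arrays used in the data preparation step.

```python
def csc_to_dense(data, indices, indptr, shape):
    """Expand CSC arrays into a dense row-major list of lists."""
    n_rows, n_cols = shape
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Entries of column `col` live in the slice indptr[col]:indptr[col + 1]
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = float(data[k])
    return dense


def add_one_and_take_first(data, indices, indptr, shape):
    """Assumed semantics: add 1 to every entry, then return entry (0, 0)."""
    dense = csc_to_dense(data, indices, indptr, shape)
    shifted = [[value + 1.0 for value in row] for row in dense]
    return shifted[0][0]


result = add_one_and_take_first(
    data=[1, 2, 3, 4, 5, 6],
    indices=[0, 2, 2, 0, 1, 2],
    indptr=[0, 2, 3, 6],
    shape=[3, 3],
)
print(result)  # prints 2.0
```

A function like this is handy as a cross-check: the value returned by the Java UDF for a given row can be compared against it on a small sample.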

Steps

1. Data preparation

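First, prepare the sparse-matrix data in PySpark. scipy.sparse is the natural choice. It offers two commonly used compressed formats: csr_matrix (CSR: Compressed Sparse Row matrix) and csc_matrix (CSC: Compressed Sparse Column matrix). This post uses the latter.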
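To make the difference between the two formats concrete, the sketch below stores one small matrix both ways and prints the underlying arrays. The 2x3 matrix is a made-up illustration, not data from the project.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

# Hypothetical example matrix, chosen so that CSR and CSC arrays differ visibly
dense = np.array([[1, 0, 2],
                  [0, 3, 0]])

csr = csr_matrix(dense)
csc = csc_matrix(dense)

# CSR: `indices` holds column indices, `indptr` marks where each row starts
csr_arrays = (csr.data.tolist(), csr.indices.tolist(), csr.indptr.tolist())
# CSC: `indices` holds row indices, `indptr` marks where each column starts
csc_arrays = (csc.data.tolist(), csc.indices.tolist(), csc.indptr.tolist())

print("CSR data, indices, indptr:", csr_arrays)
print("CSC data, indices, indptr:", csc_arrays)
```

In CSR the nonzero values are listed row by row, in CSC column by column, so `indptr` has one more entry than the number of rows (CSR) or columns (CSC).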

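A PySpark DataFrame does not support the scipy sparse-matrix type directly (it can be used inside an RDD), so the sparse matrix stored in the Hive table is actually a csc_matrix broken down into four arrays: data, indices, indptr, and shape.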

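from scipy.sparse import csc_matrix

# CSC layout: column j holds the values data[indptr[j]:indptr[j+1]],
# located at the rows indices[indptr[j]:indptr[j+1]]
indices = [0, 2, 2, 0, 1, 2]
indptr = [0,2,3,6]
data = [1, 2, 3, 4, 5, 6]
shape = [3,3]
# Rebuild the sparse matrix from the four arrays, then densify it for display only
sp_mat = csc_matrix((data, indices, indptr), shape=shape).todense()
print(sp_mat)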

[[1 0 4]
 [0 0 5]
 [2 3 6]]
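The opposite direction, taking an existing csc_matrix apart into the four arrays that get stored, can be sketched as follows. Converting to plain Python lists is an assumption made here so that the values fit ordinary array columns; the dictionary keys are illustrative names only.

```python
import numpy as np
from scipy.sparse import csc_matrix

dense = np.array([[1, 0, 4],
                  [0, 0, 5],
                  [2, 3, 6]])
sp_mat = csc_matrix(dense)

# Decompose into the four arrays, as plain Python lists rather than NumPy arrays
row = {
    "data": sp_mat.data.tolist(),
    "indices": sp_mat.indices.tolist(),
    "indptr": sp_mat.indptr.tolist(),
    "shape": list(sp_mat.shape),
}
print(row)

# Round trip: the four arrays are enough to rebuild the original matrix
rebuilt = csc_matrix(
    (row["data"], row["indices"], row["indptr"]), shape=tuple(row["shape"])
)
round_trip_ok = bool((rebuilt.toarray() == dense).all())
print(round_trip_ok)
```

The round-trip check confirms that no information is lost by storing the four arrays instead of the matrix object.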

The sparse matrices below are prepared as the same four arrays and then converted into a PySpark DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,FloatType,IntegerType,ArrayType

# create sparse matrix
indices = [0, 2, 2, 0, 1, 2]
indptr = [0,2,3,6]
data = [1, 2, 3, 4, 5, 6]
shape = [3,3]

sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
sp_data = [(0,12.1,indices,indptr,data,shape)
    ,(1,21.32,indices,indptr,data,shape)
    ,(2,21.2,indices,indptr,data,shape)]

schema = StructType([StructField("name",IntegerType(), nullable=True)
                    ,StructField("id",FloatType(), nullable=True)
                    ,StructField("in