Hand-Writing the Apriori Algorithm in PySpark

 

[Figure: pyspark-structure, the PySpark process architecture]

The white boxes in the figure are the newly added Python processes. On the Driver side, Py4j lets Python call Java methods, which effectively "maps" the user's PySpark program into the JVM: for example, when the user instantiates a Python SparkContext in PySpark, a Scala SparkContext object is ultimately instantiated in the JVM. On the Executor side, Py4j is not needed, because the Task logic an Executor runs arrives from the Driver as serialized bytecode. That bytecode may contain user-defined Python functions or lambda expressions, and Py4j cannot call Python methods from Java; so to run them, a separate Python process is started for each Task, and the Python functions or lambdas are sent to that process over a socket for execution. The overall language-level interaction flow is shown in the figure below, where solid lines denote method calls and dashed lines denote returned results.

[Figure: pyspark-call, the language-level call flow; solid lines are method calls, dashed lines are returned results]
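To make the Executor-side mechanism concrete, here is a minimal sketch of shipping a lambda as bytes, the way PySpark does with cloudpickle (PySpark vendors its own copy; the sketch uses the standalone cloudpickle package and elides the socket plumbing):

import cloudpickle

func = lambda x: x * x
payload = cloudpickle.dumps(func)      # the bytes that would travel over the socket to a worker
restored = cloudpickle.loads(payload)  # what the worker-side Python process does on receipt
print(restored(7))                     # 49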

 

PythonRunner's entry main function does two things (a Python-side sketch of the resulting connection follows this list):

  • start a Py4j GatewayServer
  • launch the user-submitted Python script via Java's Process API
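From the script's point of view, the connection back to that GatewayServer looks roughly like the sketch below. This is plain py4j; the PYSPARK_GATEWAY_PORT environment variable name follows pyspark/java_gateway.py and is an internal detail that can vary across Spark versions:

import os
from py4j.java_gateway import JavaGateway, GatewayParameters

# Connect to the GatewayServer that PythonRunner started in the JVM;
# the port is handed to the script through an environment variable.
port = int(os.environ['PYSPARK_GATEWAY_PORT'])
gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))
# Any JVM class is now reachable through the gateway, for example:
print(gateway.jvm.java.lang.System.getProperty('java.version'))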

Once the user's Python script is running, it first instantiates the Python-side SparkContext object; during instantiation it does two things (see the sketch after this list):

  • instantiate a Py4j GatewayClient and connect to the Py4j GatewayServer in the JVM; every subsequent Java call from Python goes through this Py4j gateway
  • instantiate the JVM-side SparkContext object through the Py4j gateway
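The two-layer structure is visible from an interactive session: the Python SparkContext keeps references to the Py4j gateway and to its JVM counterpart. The underscore-prefixed attributes below are PySpark internals, stable in practice but not public API:

from pyspark import SparkContext

sc = SparkContext('local', 'gateway-demo')
print(type(sc._gateway))  # the Py4j gateway client described above
print(sc._jsc)            # the JavaSparkContext object living in the JVM
print(sc._jvm.java.lang.System.currentTimeMillis())  # an arbitrary JVM call via Py4j
sc.stop()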

After these two steps the SparkContext is fully initialized: the Driver is up, starts requesting Executor resources, and begins scheduling tasks. The chain of processing logic defined in the user's Python script triggers a Job submission once an action method is reached. The submission is a direct Py4j call to the Java method PythonRDD.runJob, which inside the JVM delegates to sparkContext.runJob. When the Job finishes, the JVM opens a local socket and waits for the Python process to pull the results; correspondingly, after calling PythonRDD.runJob the Python process fetches the results through that socket.
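A small driver script makes the trigger point visible: transformations accumulate lazily, and only the action at the end causes the Py4j call into the JVM and the socket fetch described above:

from pyspark import SparkContext

sc = SparkContext('local', 'action-demo')
rdd = sc.parallelize(range(10)).map(lambda x: x * x)  # lazy: no job submitted yet
result = rdd.collect()  # action: submits the job via Py4j, then pulls the
                        # results back from the JVM over a local socket
print(result)           # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
sc.stop()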

#
# Apriori in PySpark: frequent-itemset mining with an iterative loop and parallel support counting
from pyspark import SparkContext

myDat = [[1, 3, 4, 5], [2, 3, 5], [1, 2, 3, 4, 5], [2, 3, 4, 5]]
sc = SparkContext('local', 'pyspark')
myDat = sc.parallelize(myDat)  # input data as an RDD; myDat.collect(): [[1, 3, 4, 5], [2, 3, 5], [1, 2, 3, 4, 5], [2, 3, 4, 5]]
C1 = myDat.flatMap(lambda x: set(x)).distinct().collect()  # distinct() deduplicates; the 1-item candidates (C1 = createC1(myDat)): [1, 2, 3, 4, 5]
C1 = [frozenset([var]) for var in C1]  # frozensets are needed: they are hashable and support the set operations used below
D = myDat.map(lambda x: set(x)).collect()  # transactions as a list of sets: [{1, 3, 4, 5}, {2, 3, 5}, {1, 2, 3, 4, 5}, {2, 3, 4, 5}]
D_bc = sc.broadcast(D)  # broadcast the transactions so every task can read them
length = len(D)  # number of transactions
# support(x) = fraction of transactions containing x; keep candidates meeting the 0.75 minimum support
suppData = sc.parallelize(C1) \
    .map(lambda x: (x, sum(1 for var in D_bc.value if x.issubset(var)) / length)) \
    .filter(lambda x: x[1] >= 0.75).collect()
L = []
L1 = [frozenset(var) for var in map(lambda x: x[0], suppData)]  # frequent 1-itemsets that passed the minimum-support filter
L.append(L1)
k = 2
while len(L[k-2]) > 0:
    # candidate generation: join two frequent (k-1)-itemsets whose first k-2 elements
    # agree; sorting first makes the prefix comparison independent of set iteration order
    Ck = [var1 | var2 for index, var1 in enumerate(L[k-2]) for var2 in L[k-2][index+1:]
          if sorted(var1)[:k-2] == sorted(var2)[:k-2]]
    # The candidates in Ck are distributed across tasks, possibly on many machines;
    # D_bc is the broadcast copy of D, so each task can count the transactions
    # containing a candidate without D being shipped with every task.
    suppData_temp = sc.parallelize(Ck) \
        .map(lambda x: (x, sum(1 for var in D_bc.value if x.issubset(var)) / length)) \
        .filter(lambda x: x[1] >= 0.75).collect()
    suppData += suppData_temp
    L.append([var[0] for var in suppData_temp])  # the last level may be empty; filtered out after the loop
    k += 1
L = [var for var in L if var]  # drop the empty trailing level(s)
print(L)
print(suppData)
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    """Keep each consequent conseq in H whose rule (freqSet - conseq) --> conseq meets minConf."""
    prunedH = []
    # Hard to parallelize with Spark: freqSet and supportData are driver-local,
    # and broadcasting them for every call would create many copies.
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]  # conf(A --> B) = support(A | B) / support(A)
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH
def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    """Recursively grow the consequent side of rules derived from freqSet."""
    m = len(H[0])
    if len(freqSet) > m + 1:  # only recurse while a non-empty antecedent remains
        Hmp1 = [var1 | var2 for index, var1 in enumerate(H) for var2 in H[index+1:]
                if sorted(var1)[:m-1] == sorted(var2)[:m-1]]  # merge size-m consequents into size m+1
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if len(Hmp1) > 1:
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
def generateRules(L, supportData, minConf=0.7):
    """Build association rules from all frequent itemsets of size >= 2."""
    bigRuleList = []
    for i in range(1, len(L)):  # L[0] holds 1-itemsets, which cannot form rules
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if i > 1:
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList
suppData_dict = {}
suppData_dict.update(suppData)  # dict.update also accepts a list of (key, value) pairs
sD_bc = sc.broadcast(suppData_dict)
rules = generateRules(L, sD_bc.value, minConf=0.9)
print('rules:\n', rules)
Output:

[[frozenset({3}), frozenset({4}), frozenset({5}), frozenset({2})], [frozenset({3, 4}), frozenset({3, 5}), frozenset({2, 3}), frozenset({4, 5}), frozenset({2, 5})], [frozenset({3, 4, 5}), frozenset({2, 3, 5})]]
[(frozenset({3}), 1.0), (frozenset({4}), 0.75), (frozenset({5}), 1.0), (frozenset({2}), 0.75), (frozenset({3, 4}), 0.75), (frozenset({3, 5}), 1.0), (frozenset({2, 3}), 0.75), (frozenset({4, 5}), 0.75), (frozenset({2, 5}), 0.75), (frozenset({3, 4, 5}), 0.75), (frozenset({2, 3, 5}), 0.75)]
frozenset({4}) --> frozenset({3}) conf: 1.0
frozenset({5}) --> frozenset({3}) conf: 1.0
frozenset({3}) --> frozenset({5}) conf: 1.0
frozenset({2}) --> frozenset({3}) conf: 1.0
frozenset({4}) --> frozenset({5}) conf: 1.0
frozenset({2}) --> frozenset({5}) conf: 1.0
frozenset({4}) --> frozenset({3, 5}) conf: 1.0
frozenset({2}) --> frozenset({3, 5}) conf: 1.0
rules:
 [(frozenset({4}), frozenset({3}), 1.0), (frozenset({5}), frozenset({3}), 1.0), (frozenset({3}), frozenset({5}), 1.0), (frozenset({2}), frozenset({3}), 1.0), (frozenset({4}), frozenset({5}), 1.0), (frozenset({2}), frozenset({5}), 1.0), (frozenset({4}), frozenset({3, 5}), 1.0), (frozenset({2}), frozenset({3, 5}), 1.0)]
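As a quick sanity check on the support values above (plain Python, no Spark): {3, 5} is contained in all four transactions, so its support should be 4/4 = 1.0, matching the (frozenset({3, 5}), 1.0) entry in suppData:

data = [{1, 3, 4, 5}, {2, 3, 5}, {1, 2, 3, 4, 5}, {2, 3, 4, 5}]
support = sum(1 for t in data if {3, 5} <= t) / len(data)
print(support)  # 1.0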