I am using PySpark, and I get a result RDD from the following code:
import numpy
from pyspark.mllib.fpm import PrefixSpan

model = PrefixSpan.train(input_rdd, minSupport=0.1)
result = model.freqSequences().filter(lambda x: x.freq >= 50).filter(lambda x: len(x.sequence) >= 2).cache()
input_rdd looks fine when I check it with input_rdd.take(5). The code above creates an RDD named result, which prints as:
PythonRDD[103] at RDD at PythonRDD.scala:48
I do have numpy installed, but when I try result.take(5) or result.count(), I keep getting the following error:
Py4JJavaError Traceback (most recent call last)
in ()
----> 1 result.take(5)
/usr/local/spark-latest/python/pyspark/rdd.py in take(self, num)
1308
1309 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1310 res = self.context.runJob(self, takeUpToNumLeft, p)
1311
1312 items += res
/usr/local/spark-latest/python/pyspark/context.py in runJob(self, rd