pyspark: DecisionTreeModel cannot be used directly inside an RDD transformation

I trained a DecisionTreeModel and then tried to validate it on an RDD:


from pyspark.mllib.tree import DecisionTree

dtModel = DecisionTree.trainClassifier(data, 2, {}, impurity="entropy",
                                       maxDepth=maxTreeDepth)

# Calling predict on the whole RDD of feature vectors works fine:
predictions = dtModel.predict(data.map(lambda lp: lp.features))

# Calling predict per point inside a transformation is what breaks:
def GetDtLabel(x):
    return 1 if dtModel.predict(x.features) > 0.5 else 0

dtTotalCorrect = data.map(
    lambda point: 1 if GetDtLabel(point) == point.label else 0).sum()
</pre><pre class="python" name="code" snippet_file_name="blog_20160713_7_9485148" code_snippet_id="1760790">     
Running this raises:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

The equivalent Scala code has no such problem, so I assumed dtModel simply needed to be broadcast, but the error persisted:

dtModelBroadcast = sc.broadcast(dtModel)
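
For reference, this is roughly how the broadcast would have been consumed (a hypothetical sketch; in practice it is never reached, since serializing the model for the broadcast presumably already trips over the wrapper's SparkContext reference, as the source quoted later makes clear):

# Hypothetical usage if the broadcast had worked (it does not):
# pickling dtModel reaches its SparkContext attribute and raises
# the same SPARK-5063 exception.
predictions = data.map(
    lambda lp: 1 if dtModelBroadcast.value.predict(lp.features) > 0.5 else 0)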

It was only via the StackOverflow threads below that I realized this is a pyspark-specific limitation:

http://stackoverflow.com/questions/31684842/how-to-use-java-scala-function-from-an-action-or-a-transformation

http://stackoverflow.com/questions/36838024/combining-spark-streaming-mllib
  
The docstring of DecisionTreeModel.predict in pyspark states:
"In Python, predict cannot currently be used within an RDD transformation or action. Call predict directly on the RDD instead."
    def predict(self, x):
        """
        Predict the label of one or more examples.

        Note: In Python, predict cannot currently be used within an RDD
              transformation or action.
              Call predict directly on the RDD instead.

        :param x:  Data point (feature vector),
                   or an RDD of data points (feature vectors).
        """
        if isinstance(x, RDD):
            return self.call("predict", x.map(_convert_to_vector))
        else:
            return self.call("predict", _convert_to_vector(x))

This call goes through self._sc, which is what makes the model depend on the SparkContext:

class JavaModelWrapper(object):
    """
    Wrapper for the model in JVM
    """
    def __init__(self, java_model):
        self._sc = SparkContext.getOrCreate()
        self._java_model = java_model

    def __del__(self):
        self._sc._gateway.detach(self._java_model)

    def call(self, name, *a):
        """Call method of java_model"""
        return callJavaFunc(self._sc, getattr(self._java_model, name), *a)

The root cause is that call invokes the underlying java_model ("org.apache.spark.mllib.tree.model.DecisionTreeModel") through py4j, and that requires the SparkContext, which exists only on the driver, never on the workers.
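
The workaround is exactly what the docstring recommends: call predict once on the entire RDD of feature vectors from the driver, then zip the predictions back with the labels. A minimal sketch reusing the variable names from above (the final accuracy line is my addition):

predictions = dtModel.predict(data.map(lambda lp: lp.features))

# Both RDDs derive from the same parent with the same partitioning,
# so zip() pairs each label with its corresponding prediction.
labelsAndPreds = data.map(lambda lp: lp.label).zip(predictions)

dtTotalCorrect = labelsAndPreds.map(
    lambda lp: 1 if (1 if lp[1] > 0.5 else 0) == lp[0] else 0).sum()
dtAccuracy = dtTotalCorrect / float(data.count())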
 
 
