[SparkAPI Java Edition] JavaPairRDD: countByValue and countByValueApprox (Part 13)

JavaPairRDD's countByValue method explained
Official documentation
/**
 * Return the count of each unique value in this RDD as a map of (value, count) pairs. The final
 * combine step happens locally on the master, equivalent to running a single reduce task.
 */
Description

Returns the count of each unique value in this RDD, as a map of (value, count) pairs. Because this is a JavaPairRDD, each "value" is the whole Tuple2<K, V> element, so the result maps every distinct (key, value) pair to the number of times it occurs.

The result is a java.util.Map materialized on the driver, so the method is only suitable when the number of distinct values is small.

Method signature
// java
public static java.util.Map<T,Long> countByValue()
// scala
def countByValue(): Map[(K, V), Long]
Example
import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Map;

public class CountByValue {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Build a pair RDD with 3 partitions; the element ("cat", "11") appears twice.
        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("dog", "22"),
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("pig", "44"),
                new Tuple2<String, String>("duck", "55"), new Tuple2<String, String>("cat", "66")), 3);

        // countByValue maps each distinct (key, value) tuple to the number of times it occurs.
        Map<Tuple2<String, String>, Long> value = javaPairRDD1.countByValue();
        for (Map.Entry<Tuple2<String, String>, Long> entry : value.entrySet()) {
            System.out.println(entry.getKey() + "->" + entry.getValue());
        }

        sc.stop();
    }
}
Result
19/03/20 17:15:31 INFO DAGScheduler: Job 0 finished: countByValue at CountByValue.java:23, took 1.093040 s
19/03/20 17:15:31 INFO SparkContext: Invoking stop() from shutdown hook
(duck,55)->1
(dog,22)->1
(pig,44)->1
(cat,66)->1
(cat,11)->2
19/03/20 17:15:31 INFO SparkUI: Stopped Spark web UI at http://10.124.209.6:4040
JavaPairRDD's countByValueApprox method explained
Official documentation
/**
 * Approximate version of countByValue().
 *
 * The confidence is the probability that the error bounds of the result will
 * contain the true value. That is, if countApprox were called repeatedly
 * with confidence 0.9, we would expect 90% of the results to contain the
 * true count. The confidence must be in the range [0,1] or an exception will
 * be thrown.
 *
 * @param timeout maximum time to wait for the job, in milliseconds
 * @param confidence the desired statistical confidence in the result
 * @return a potentially incomplete result, with error bounds
 */
Description

An approximate version of countByValue().

The confidence is the probability that the error bounds of the result contain the true value: if the method were called repeatedly with confidence 0.9, about 90% of the results would be expected to contain the true count. The confidence must be in the range [0, 1], otherwise an exception is thrown.

@param timeout    maximum time to wait for the job, in milliseconds
@param confidence the desired statistical confidence in the result
@return a potentially incomplete result, with error bounds

Method signature
// java
public static PartialResult<java.util.Map<T,BoundedDouble>> countByValueApprox(long timeout)
public static PartialResult<java.util.Map<T,BoundedDouble>> countByValueApprox(long timeout,
                                                                               double confidence)
// scala
def countByValueApprox(timeout: Long): PartialResult[Map[(K, V), BoundedDouble]]
def countByValueApprox(timeout: Long, confidence: Double): PartialResult[Map[(K, V), BoundedDouble]]
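Example

The original post stops at the signatures, so the following is only a minimal sketch of how countByValueApprox could be used, reusing the same local setup and data as the countByValue example above. The timeout of 1000 ms and confidence of 0.95 are arbitrary values chosen for illustration.

import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.partial.BoundedDouble;
import org.apache.spark.partial.PartialResult;
import scala.Tuple2;

import java.util.Map;

public class CountByValueApprox {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Same data as the countByValue example above.
        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("dog", "22"),
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("pig", "44"),
                new Tuple2<String, String>("duck", "55"), new Tuple2<String, String>("cat", "66")), 3);

        // Wait at most 1000 ms for the job; the error bounds should contain the
        // true counts with 95% confidence (illustrative values, not from the post).
        PartialResult<Map<Tuple2<String, String>, BoundedDouble>> approx =
                javaPairRDD1.countByValueApprox(1000, 0.95);

        // getFinalValue() blocks until the job has fully completed; each BoundedDouble
        // carries the estimated count (mean) plus its low/high error bounds.
        Map<Tuple2<String, String>, BoundedDouble> result = approx.getFinalValue();
        for (Map.Entry<Tuple2<String, String>, BoundedDouble> entry : result.entrySet()) {
            BoundedDouble bound = entry.getValue();
            System.out.println(entry.getKey() + "->" + bound.mean()
                    + " [" + bound.low() + ", " + bound.high() + "]");
        }

        sc.stop();
    }
}

On a dataset this small the job should finish well within the timeout, in which case the final result is exact and the low/high bounds collapse onto the true counts.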