Big Data Spark Processing Algorithms 001 - Top10

犇犇.

Category: Big Data. Tags: big data, spark, data algorithms, Top10

Article link: https://blog.csdn.net/qq_23528653/article/details/90730597

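Goal: find the top 10 cats.

Approach:

1. Initialize the connection to the Spark master.

2. Create a JavaRDD (lines) by reading a txt file from HDFS (the Hadoop Distributed File System); this example uses Top10.txt.

3. Create a JavaPairRDD from lines.

4. Build a local top 10 for each partition.

5. Collect all the local top-10 lists and build the final top-10 list.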
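The two-phase idea in steps 4 and 5 can be sketched in plain Java without Spark. This is only an illustration under assumptions: the class name TwoPhaseTopN, its helper methods, and the two hand-made "partitions" are hypothetical and are not part of the article's program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-Java sketch of the two-phase top-N idea; no Spark involved.
class TwoPhaseTopN {

    // Phase 1: keep only the n largest scores seen in one partition.
    static SortedMap<Integer, String> localTopN(List<String> partition, int n) {
        SortedMap<Integer, String> top = new TreeMap<>();
        for (String line : partition) {
            String[] tokens = line.split(",");
            top.put(Integer.parseInt(tokens[1].trim()), tokens[0].trim());
            if (top.size() > n) {
                top.remove(top.firstKey()); // evict the smallest score
            }
        }
        return top;
    }

    // Phase 2: merge the per-partition results into one global top-N.
    static SortedMap<Integer, String> globalTopN(List<SortedMap<Integer, String>> locals, int n) {
        SortedMap<Integer, String> top = new TreeMap<>();
        for (SortedMap<Integer, String> local : locals) {
            for (Map.Entry<Integer, String> e : local.entrySet()) {
                top.put(e.getKey(), e.getValue());
                if (top.size() > n) {
                    top.remove(top.firstKey());
                }
            }
        }
        return top;
    }

    public static void main(String[] args) {
        // Two hypothetical partitions built from a few rows of the sample data.
        List<String> p1 = Arrays.asList("cat1,12", "cat100,100", "cat2000,2000");
        List<String> p2 = Arrays.asList("cat5,10", "cat666,666", "cat37,37");
        List<SortedMap<Integer, String>> locals = new ArrayList<>();
        locals.add(localTopN(p1, 2));
        locals.add(localTopN(p2, 2));
        System.out.println(globalTopN(locals, 2)); // {666=cat666, 2000=cat2000}
    }
}
```

Each partition only ever holds n entries, so the data moved to the final merge is at most n entries per partition regardless of input size.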

Step 1: Initialize the connection to the Spark master

SparkConf conf = new SparkConf().setAppName("Top10");
JavaSparkContext ctx = new JavaSparkContext(conf);

Step 2: Create the JavaRDD (lines)

JavaRDD<String> lines = ctx.textFile("hdfs://spark01:9000/Top10.txt"); // test data: see Appendix 1

Step 3: Create a JavaPairRDD from lines

JavaPairRDD<String,Integer> pairs = lines.mapToPair(new PairFunction<String,String,Integer>(){
           public Tuple2<String,Integer> call(String s){
               String[] tokens = s.split(",");
               return new Tuple2<String,Integer>(tokens[0],Integer.parseInt(tokens[1]));
           }
       });
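The parsing above assumes every line has exactly the form name,count; a blank or malformed line would throw inside call. A more defensive variant might look like the sketch below. It is written in plain Java so it can run without Spark, and the class LineParser and method parseLine are hypothetical names, not part of the original program; Map.Entry stands in for Tuple2.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical defensive parser for "name,count" lines (plain Java, no Spark).
class LineParser {

    // Returns the parsed (name, count) pair, or empty if the line is malformed.
    static Optional<Map.Entry<String, Integer>> parseLine(String line) {
        if (line == null) {
            return Optional.empty();
        }
        String[] tokens = line.split(",");
        if (tokens.length != 2) {
            return Optional.empty(); // wrong number of fields
        }
        try {
            int count = Integer.parseInt(tokens[1].trim());
            Map.Entry<String, Integer> entry =
                    new AbstractMap.SimpleImmutableEntry<>(tokens[0].trim(), count);
            return Optional.of(entry);
        } catch (NumberFormatException e) {
            return Optional.empty(); // count is not an integer
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLine("cat1,12").isPresent());  // true
        System.out.println(parseLine("cat1").isPresent());     // false
        System.out.println(parseLine("cat1,abc").isPresent()); // false
    }
}
```

In a Spark job, the same check could be applied in a filter step before mapToPair, so that bad lines are skipped instead of failing the task.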

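Step 4: Build a local top 10

// Build the local top-10 list for each partition.
// Note: returning Iterable matches the Spark 1.x API used here (see Appendix 2);
// from Spark 2.0 on, FlatMapFunction.call returns an Iterator instead.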
JavaRDD<SortedMap<Integer,String>> partitions = pairs.mapPartitions(
new FlatMapFunction<Iterator<Tuple2<String,Integer>>, SortedMap<Integer,String>>(){

                   @Override
                   public Iterable<SortedMap<Integer, String>> call(Iterator<Tuple2<String, Integer>> iter)
                           throws Exception {
                       SortedMap<Integer, String> top10 = new TreeMap<Integer,String>();
                       while(iter.hasNext()){
                           Tuple2<String,Integer> tuple = iter.next();
                           top10.put(tuple._2, tuple._1);
                           if(top10.size()>10){
                               top10.remove(top10.firstKey());
                           }
                       }
                       return Collections.singletonList(top10);
                   }
       });
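One property of the code above is worth keeping in mind: the TreeMap is keyed by the count, so two cats with the same count overwrite each other and only one survives. The sample data has unique counts, so the issue does not show up in this example. If ties matter, one possible alternative is a bounded min-heap, sketched below in plain Java. The class TieSafeTopN and the Scored type are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a tie-safe top-N using a bounded min-heap (plain Java, no Spark).
class TieSafeTopN {

    // Hypothetical (name, score) pair used only in this sketch.
    static final class Scored {
        final String name;
        final int score;
        Scored(String name, int score) {
            this.name = name;
            this.score = score;
        }
    }

    // Returns up to n items with the largest scores, highest first.
    static List<Scored> topN(List<Scored> items, int n) {
        PriorityQueue<Scored> heap =
                new PriorityQueue<>(Comparator.comparingInt((Scored s) -> s.score));
        for (Scored s : items) {
            heap.add(s);
            if (heap.size() > n) {
                heap.poll(); // drop the current smallest score
            }
        }
        List<Scored> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingInt((Scored s) -> s.score).reversed());
        return result;
    }

    public static void main(String[] args) {
        // Keying by score loses one of the tied entries.
        SortedMap<Integer, String> byScore = new TreeMap<>();
        byScore.put(10, "catA");
        byScore.put(10, "catB");
        System.out.println(byScore.size()); // 1

        // The heap keeps both tied entries.
        List<Scored> items = new ArrayList<>();
        items.add(new Scored("catA", 10));
        items.add(new Scored("catB", 10));
        items.add(new Scored("catC", 5));
        System.out.println(topN(items, 2).size()); // 2
    }
}
```

Which of several tied items is evicted at the boundary is not specified in this sketch; a secondary comparison on the name would make it deterministic.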

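Step 5: Collect the local lists and build the final top-10 list

// Build the final top-10 list, approach one: collect() and merge on the driver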
       SortedMap<Integer,String> finaltop10one = new TreeMap<Integer,String>();
       List<SortedMap<Integer,String>> alltop10 = partitions.collect();
       for(SortedMap<Integer,String> localtop10 : alltop10){
           for(Map.Entry<Integer, String> entry : localtop10.entrySet()){
               finaltop10one.put(entry.getKey(), entry.getValue());
               if(finaltop10one.size()>10){
                   finaltop10one.remove(finaltop10one.firstKey());
               }
           }
       }
       // Approach two: merge the local lists pairwise with reduce()
       SortedMap<Integer,String> finaltop10 = partitions.reduce(new Function2<
               SortedMap<Integer,String>,
               SortedMap<Integer,String>,
               SortedMap<Integer,String>
               >(){
                   private static final long serialVersionUID = 1L;

                   @Override
                   public SortedMap<Integer, String> call(SortedMap<Integer, String> m1,
                           SortedMap<Integer, String> m2) throws Exception {

                       SortedMap<Integer,String> top10 = new TreeMap<Integer,String>();

                       for(Map.Entry<Integer, String> entry : m1.entrySet()){
                           top10.put(entry.getKey(),entry.getValue());
                           if(top10.size()>10){
                               top10.remove(top10.firstKey());
                           }
                       }
                       for(Map.Entry<Integer, String> entry : m2.entrySet()){
                           top10.put(entry.getKey(),entry.getValue());
                           if(top10.size()>10){
                               top10.remove(top10.firstKey());
                           }
                       }
                       return top10;
                   }
       });
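The function passed to reduce repeats the same bounded-insert loop for m1 and m2. As a sketch, the merge could be pulled into a small helper like the one below. It is plain Java and assumes nothing about Spark; the class TopNMerge and its merge method are hypothetical names. Like the original, it lets an entry from m2 replace an entry from m1 when both have the same count.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of the pairwise merge used in approach two (plain Java, no Spark).
class TopNMerge {

    // Combines two score-keyed maps and keeps only the n largest scores.
    static SortedMap<Integer, String> merge(SortedMap<Integer, String> m1,
                                            SortedMap<Integer, String> m2,
                                            int n) {
        TreeMap<Integer, String> merged = new TreeMap<>(m1);
        merged.putAll(m2);
        while (merged.size() > n) {
            merged.pollFirstEntry(); // remove the smallest score
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedMap<Integer, String> m1 = new TreeMap<>();
        m1.put(1, "a");
        m1.put(5, "e");
        m1.put(9, "i");
        SortedMap<Integer, String> m2 = new TreeMap<>();
        m2.put(3, "c");
        m2.put(7, "g");
        System.out.println(merge(m1, m2, 3)); // {5=e, 7=g, 9=i}
    }
}
```

When all counts are distinct, the result does not depend on the order in which the maps are merged, which is what reduce needs, since Spark may combine partition results in any order.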

Final step: output

System.out.println("===tpo-10 list one====");
for(Map.Entry<Integer, String> entry : finaltop10one.entrySet()){
    System.out.println(entry.getKey() + "--" + entry.getValue());
}

System.out.println("===tpo-10 list ====");
for(Map.Entry<Integer, String> entry : finaltop10.entrySet()){
    System.out.println(entry.getKey() + "--" + entry.getValue());
}
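Iterating a TreeMap visits keys in ascending order, so the loops above print the smallest of the ten counts first. If a highest-first listing is preferred, one option is to iterate a descending view, as in the sketch below. The class DescendingOutput and method formatDescending are hypothetical names; the map is copied into a TreeMap because the SortedMap interface itself has no descendingMap method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: format a score-keyed map with the highest score first (plain Java).
class DescendingOutput {

    static List<String> formatDescending(SortedMap<Integer, String> top) {
        List<String> lines = new ArrayList<>();
        // Copy into a TreeMap to get access to descendingMap().
        for (Map.Entry<Integer, String> e : new TreeMap<>(top).descendingMap().entrySet()) {
            lines.add(e.getKey() + "--" + e.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        SortedMap<Integer, String> top = new TreeMap<>();
        top.put(88, "cat88");
        top.put(2000, "cat2000");
        top.put(100, "cat100");
        for (String line : formatDescending(top)) {
            System.out.println(line);
        }
    }
}
```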

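After compiling the code, upload it to the cluster and run the script (./Top.sh) to see the result.

The ./Top.sh script is listed in Appendix 2.

The result is shown in Appendix 3.

Appendix 1: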

cat1,12
cat2,13
cat3,14
cat4,15
cat5,10
cat100,100
cat200,200
cat300,300
cat1001,1001
cat67,67
cat22,22
cat23,23
cat1000,1000
cat2000,2000
cat400,400
cat500,500
cat34,34
cat78,78
cat21,21
cat37,37
cat39,39
cat88,88
cat66,66
cat666,666

Save the data above as a txt file and upload it to HDFS with the command (hadoop fs -put Top10.txt /Top10.txt).

Appendix 2:

/usr/local/spark1.5/bin/spark-submit \
--class cn.spark.study.core.Top10 \
--num-executors 3 \
--driver-memory 100m \
--executor-memory 100m \
--executor-cores 3 \
/usr/local/spark-text/java/Top10/jtop10.jar

Appendix 3:

===tpo-10 list one====

88--cat88
100--cat100
200--cat200
300--cat300
400--cat400
500--cat500
666--cat666
1000--cat1000
1001--cat1001
2000--cat2000

===tpo-10 list ====
88--cat88
100--cat100
200--cat200
300--cat300
400--cat400
500--cat500
666--cat666
1000--cat1000
1001--cat1001
2000--cat2000