Spark overview, plus Java and Scala wordcount

1. What is Spark

[1]. Official site: http://spark.apache.org/

[2]. Definition

Apache Spark™ is a fast and general engine for large-scale data processing.

Apache Spark is an open source cluster computing system that aims to make data analytics fast

both fast to run and fast to write

[3]. Speed

Running in memory, Spark can be 100+ times faster than Hadoop MapReduce.

Running on disk, it can be 10+ times faster.

[4]. Ease of Use

Write applications quickly in Java, Scala, Python, R, and SQL.

[5]. Generality

Combine SQL, streaming, and complex analytics.

[6]. Runs Everywhere

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.

[7]. One stack to rule them all

2. Technology stack

3. Spark can run in four modes

[1]. Local

Mostly used for local testing, e.g. writing and running programs in Eclipse or IDEA.

[2]. Standalone

Standalone is Spark's built-in resource scheduling framework; it supports a fully distributed deployment.

[3]. Yarn

A resource scheduling framework from the Hadoop ecosystem; Spark can also run its computations on Yarn.

To schedule resources on Yarn, a framework must implement the ApplicationMaster interface; Spark implements it, which is why it can run on Yarn.

[4]. Mesos

A general-purpose resource scheduling framework. (See the sketch after this list for how each mode is selected.)
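
As a rough sketch of how the mode is chosen (the host/port values below are placeholders, not taken from this post), the mode comes down to the master URL handed to SparkConf, or in practice to spark-submit's --master flag for cluster modes:

import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlSketch {
  def main(args: Array[String]): Unit = {
    // Typical master URLs (Spark 1.6-era values; cluster modes are normally
    // passed via spark-submit --master rather than hard-coded like this):
    //   "local" / "local[*]"           -> Local mode, e.g. testing inside Eclipse/IDEA
    //   "spark://host:7077"            -> Standalone cluster
    //   "yarn-client" / "yarn-cluster" -> Yarn
    //   "mesos://host:5050"            -> Mesos
    val conf = new SparkConf().setMaster("local").setAppName("MasterUrlSketch")
    val sc = new SparkContext(conf)
    println(sc.master) // prints which master the context is actually using
    sc.stop()
  }
}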

4. Spark vs. MapReduce

[1]. MapReduce

[2]. Spark. Besides running in memory, another important reason Spark is fast is the DAG scheduler: it handles the logical scheduling of a job by splitting it into stages, i.e. batches of tasks with dependencies between them (as sketched below).
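
As a minimal sketch of that idea (assuming a local run and the same ./words file used below), the lineage is built lazily from narrow transformations and only cut into a new stage at the shuffle introduced by reduceByKey; RDD.toDebugString prints the lineage the DAG scheduler works from:

import org.apache.spark.{SparkConf, SparkContext}

object DagSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("DagSketch")
    val sc = new SparkContext(conf)
    val counts = sc.textFile("./words")   // narrow: read lines
      .flatMap(_.split(" "))              // narrow: one line -> many words
      .map((_, 1))                        // narrow: word -> (word, 1)
      .reduceByKey(_ + _)                 // wide: shuffle, so the DAG is cut into a new stage here
    println(counts.toDebugString)         // shows the lineage/stage boundaries the scheduler uses
    sc.stop()
  }
}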

5. Wordcount

[0]. Setup

①. Data

bamor java
scala kotlin groovy
android w4xj h5 java
hadoop groovy kotlin griffon
java bamor

②. Jar

spark-assembly-1.6.0-hadoop2.6.0.jar
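
The post links this prebuilt assembly jar directly into the project. Purely as an alternative sketch (not from the post; coordinates are assumed from Maven Central), the equivalent dependency could be declared in an sbt build:

// build.sbt -- hypothetical sbt equivalent of putting spark-assembly-1.6.0-hadoop2.6.0.jar on the classpath
name := "spark-wordcount"
scalaVersion := "2.10.6"   // Spark 1.6 was built against Scala 2.10/2.11
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"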

[1]. Scala version

①. Code

package com.w4xj.scala.spark.wordcount

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

/**
 * @author w4xj
 * @date 2019-06-07 10:32:38
 * Scala-based Spark wordcount
 */
object ScalaSparkWordcount {
  def main(args: Array[String]): Unit = {
//    val conf = new SparkConf()
//    conf.setMaster("local").setAppName("ScalaSparkWordcount")
//    val sc = new SparkContext(conf)
//    val lines: RDD[String] = sc.textFile("./words")
//    val words = lines.flatMap(line => {
//      line.split(" ")
//    })
//    val pairWords: RDD[(String, Int)] = words.map(word => {
//      new Tuple2(word, 1)
//    })
//
//    val result: RDD[(String, Int)] = pairWords.reduceByKey((v1: Int, v2: Int) => {
//      v1 + v2
//    })
//
//    // Use sortByKey here to sort by count
//    // First swap the positions of count and word in the tuple
//    val resultTmp1: RDD[(Int, String)] = result.map(tuple => {tuple.swap})
//    // Sort (ascending)
//    val resultTmp2: RDD[(Int, String)] = resultTmp1.sortByKey(true)
//    // Swap the tuple back to its original order
//    val resultFinal: RDD[(String, Int)] = resultTmp2.map(tuple => {tuple.swap})
//
//    resultFinal.foreach(tuple => {
//      println(tuple)
//    })
//
//    sc.stop()

    // Short version
    val conf = new SparkConf().setMaster("local").setAppName("ScalaSparkWordcount")
    val sc = new SparkContext(conf)
    // sortBy() here sorts by count (ascending)
    sc.textFile("./words").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(tuple => {tuple._2}, true).foreach(println)
    sc.stop()
  }
}

②. Output

(scala,1)
(hadoop,1)
(h5,1)
(android,1)
(griffon,1)
(w4xj,1)
(kotlin,2)
(groovy,2)
(bamor,2)
(java,3)
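
The Scala version above sorts ascending by count, while the Java version below sorts descending. As a minimal variant (same ./words input; the object name is hypothetical), passing false as sortBy's ascending flag reproduces the descending order:

import org.apache.spark.{SparkConf, SparkContext}

object ScalaSparkWordcountDesc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("ScalaSparkWordcountDesc"))
    // ascending = false: highest counts first, matching the Java version's output order
    sc.textFile("./words").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false).foreach(println)
    sc.stop()
  }
}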

[2]. Java version

①. Code

package com.w4xj.java.spark.wordcount;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

/**
 * @author w4xj
 * @date 2019-06-07 10:32:38
 * Java-based Spark wordcount
 */
public class JavaSparkWordCount {
    public static void main(String[] args) {
        /**
         * conf
         *   1. Sets the Spark run mode.
         *   2. Sets the application name shown in the web UI.
         *   3. Sets the resources the application runs with: memory + cores
         *      (a 4-core/8-thread CPU counts as 8 cores).
         *
         * Spark run modes:
         *   1. local      -- local mode used when developing in Eclipse/IDEA, mostly for testing
         *   2. standalone -- Spark's built-in resource scheduler, supports a distributed setup
         *   3. yarn       -- the Hadoop ecosystem's resource scheduler; Spark can also schedule resources on Yarn
         *   4. mesos      -- a general-purpose resource scheduler
         */
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("JavaSparkWordCount");

        /**
         * SparkContext is the single gateway to the cluster.
         */
        JavaSparkContext sc = new JavaSparkContext(conf);

        /**
         * sc.textFile reads the input file.
         */
        JavaRDD<String> lines = sc.textFile("./words");

        /**
         * flatMap: one input record produces many output records (one-to-many).
         */
//      JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
//
//          private static final long serialVersionUID = 1L;
//
//          @Override
//          public Iterable<String> call(String line) throws Exception {
//              return Arrays.asList(line.split(" "));
//          }
//      });

        // Lambda version
        JavaRDD<String> words = lines.flatMap((String line) -> {
            return Arrays.asList(line.split(" "));
        });

        /**
         * In the Java API, use the xxxToPair methods to turn an RDD into a K,V RDD.
         * A K,V RDD has the type JavaPairRDD<String, Integer>.
         */
//      JavaPairRDD<String, Integer> pairWords = words.mapToPair(new PairFunction<String, String, Integer>() {
//
//          private static final long serialVersionUID = 1L;
//
//          @Override
//          public Tuple2<String, Integer> call(String word) throws Exception {
//              return new Tuple2<String, Integer>(word, 1);
//          }
//      });

        // Lambda version
        JavaPairRDD<String, Integer> pairWords = words.mapToPair((String word) -> {
            return new Tuple2<String, Integer>(word, 1);
        });

        /**
         * reduceByKey
         * 1. Groups records with the same key.
         * 2. Applies your function to the values of each group.
         */
//      JavaPairRDD<String, Integer> result = pairWords.reduceByKey(new Function2<Integer, Integer, Integer>() {
//
//          private static final long serialVersionUID = 1L;
//
//          @Override
//          public Integer call(Integer v1, Integer v2) throws Exception {
//              return v1 + v2;
//          }
//      });

        // Lambda version
        JavaPairRDD<String, Integer> result = pairWords.reduceByKey((Integer v1, Integer v2) -> {
            return v1 + v2;
        });

        // The Java Spark API has no sortBy, only sortByKey, so we need to
        // 1. swap key and value in the tuple, 2. sort, 3. swap back.
        // 1. Turn each Tuple2<String, Integer> into an Integer, String pair
//      JavaPairRDD<Integer, String> resultTmp1 = result.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
//
//          private static final long serialVersionUID = 1L;
//
//          @Override
//          public Tuple2<Integer, String> call(Tuple2<String, Integer> tuple) throws Exception {
//              //return new Tuple2<Integer, String>(tuple._2, tuple._1);
//              return tuple.swap();
//          }
//      });

        // Lambda version
        JavaPairRDD<Integer, String> resultTmp1 = result.mapToPair((Tuple2<String, Integer> tuple) -> {
            //return new Tuple2<Integer, String>(tuple._2, tuple._1);
            return tuple.swap();
        });

        // 2. Sort (descending)
        JavaPairRDD<Integer, String> resultTmp2 = resultTmp1.sortByKey(false);

        // 3. Swap back to String, Integer tuples
//      JavaPairRDD<String, Integer> resultFinal = resultTmp2.mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
//
//          private static final long serialVersionUID = 1L;
//
//          @Override
//          public Tuple2<String, Integer> call(Tuple2<Integer, String> tuple) throws Exception {
//              return new Tuple2<String, Integer>(tuple._2, tuple._1);
//              //return tuple.swap();
//          }
//      });

        // Lambda version
        JavaPairRDD<String, Integer> resultFinal = resultTmp2.mapToPair((Tuple2<Integer, String> tuple) -> {
            return new Tuple2<String, Integer>(tuple._2, tuple._1);
            //return tuple.swap();
        });

//      resultFinal.foreach(new VoidFunction<Tuple2<String, Integer>>() {
//
//          private static final long serialVersionUID = 639147013282066178L;
//
//          @Override
//          public void call(Tuple2<String, Integer> tuple) throws Exception {
//              System.out.println(tuple);
//          }
//      });

        // Lambda version
        resultFinal.foreach((Tuple2<String, Integer> tuple) -> {
            System.out.println(tuple);
        });

        sc.stop();
    }
}

②. Output

(java,3)
(kotlin,2)
(groovy,2)
(bamor,2)
(scala,1)
(hadoop,1)
(h5,1)
(android,1)
(griffon,1)
(w4xj,1)

 
