1. Spark RDD
Creating an RDD
val conf : SparkConf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
//create the SparkContext
val sc = new SparkContext(conf)
//create an RDD from an in-memory collection; makeRDD is backed by parallelize
val arrayRDD : RDD[Int] = sc.makeRDD(Array(1,2,3))
//create an RDD from memory with parallelize
val listRdd : RDD[Int] = sc.parallelize(Array(1,2,3,4))
//create an RDD from external storage
val fileRDD : RDD[String] = sc.textFile("in")
listRdd.collect().foreach(println)
Reading and saving RDDs
val fileRDD : RDD[String] = sc.textFile("in")
fileRDD.saveAsTextFile("output")
Driver and Executor
Driver: the application that creates the SparkContext is the Driver.
Executor: receives tasks and executes them.
The operators (the functions passed to them) run inside the Executors.
All of the other code above runs in the Driver.
What travels over the network is serialized data: characters, strings, or numbers.
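A minimal sketch of where each piece runs (the app name and values are just illustrative):
//runs in the Driver: build the conf, the context, and the RDD lineage
val conf = new SparkConf().setMaster("local[*]").setAppName("driverVsExecutor")
val sc = new SparkContext(conf)
val dataRDD = sc.makeRDD(Array(1, 2, 3, 4))
//the function num => num * 10 is serialized and shipped to the Executors, which is where it runs
val result = dataRDD.map(num => num * 10).collect()
//collect() sends the computed values back over the network to the Driver
println(result.mkString(","))
sc.stop()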
2. RDD Operators
map operator
map is the most general operator; it applies a function to every element of the RDD.
val sparkConf = new SparkConf().setAppName("map").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val number = Array(1,2,3,4,5)
val numberRDD = sc.parallelize(number)
val multipleRdd = numberRDD.map(num => num *2)
multipleRdd.foreach(num => println(num))
reduce operator
reduce is an action operator; it aggregates the elements of the RDD.
def reduce(): Unit = {
  val conf = new SparkConf().setMaster("local").setAppName("reduce")
  val sc = new SparkContext(conf)
  val number = Array(1, 2, 3, 4, 5, 5, 6, 7, 8)
  val numberRDD = sc.parallelize(number)
  val sum = numberRDD.reduce(_ + _)
  println(sum)
  sc.stop()
}
flatMap operator
Requirement: given the word list ["Hello", "World"], return the list of characters ["H","e","l","l","o","W","o","r","l","d"].
flatMap fits this well: it maps one element to many (one-to-many or many-to-many).
val conf = new SparkConf().setAppName("flatMap").setMaster("local[*]")
val sc = new SparkContext(conf)
val lineArray = Array("hello you", "hello me","hello world")
val lineRDD = sc.parallelize(lineArray)
lineRDD.foreach(line => println(line))
val words = lineRDD.flatMap( line => line.split(" "))
words.foreach(line => println(line))
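For the character-level requirement stated above, a minimal sketch (reusing the same sc) could be:
val wordArray = Array("Hello", "World")
//one word maps to many characters, duplicates included
val charsRDD = sc.parallelize(wordArray).flatMap(word => word.toCharArray)
charsRDD.foreach(c => println(c))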
groupByKey operator
Groups a key-value (K, V) RDD by key.
val conf = new SparkConf().setMaster("local[*]").setAppName("groupByKey")
val sc = new SparkContext(conf)
val score = Array(Tuple2("class1",80),Tuple2("class2",70),Tuple2("class2",20),Tuple2("class2",90),Tuple2("class1",80))
val scoreRDD = sc.parallelize(score)
scoreRDD.groupByKey().foreach(a => println((a._2, a._1)))
// .foreach(score => { println(score._1);score._2.foreach(singlescore => println(singlescore))})
// println("===============")
sc.stop()
reduceByKey operator
Reduces a key-value RDD by key (here, the values of identical keys are summed).
val conf = new SparkConf().setAppName("reduceByKey").setMaster("local[*]")
val sc = new SparkContext(conf)
val score = Array(Tuple2("class1",80),Tuple2("class2",70),Tuple2("class2",20),Tuple2("class2",90),Tuple2("class1",80))
val scoreRDD = sc.parallelize(score)
scoreRDD.reduceByKey(_+_).foreach(num => println(num._1+ " : " + num._2) )
sc.stop()
sortByKey operator
Sorts a key-value RDD by key.
val conf = new SparkConf().setMaster("local").setAppName("sortByKey")
val sc = new SparkContext(conf)
val score = Array(Tuple2(80,"class1"),Tuple2(70,"class3"),Tuple2(20,"class4"),Tuple2(90,"class5"),Tuple2(80,"class6"))
val scoreRDD = sc.parallelize(score)
scoreRDD.sortByKey().foreach(num => println(num._1 + " : " + num._2))
sc.stop()
join operator
Joins two key-value RDDs on their keys.
val conf = new SparkConf().setAppName("join").setMaster("local")
val sc = new SparkContext(conf)
val score = Array(Tuple2(1,80),Tuple2(1,70),Tuple2(2,30),Tuple2(3,50))
val student = Array(Tuple2(1,"xiaohua"),Tuple2(2,"xiaoming"),Tuple2(3,"xiaochen"))
val scoreRDD = sc.parallelize(score)
val studentRDD = sc.parallelize(student)
scoreRDD.join(studentRDD).foreach(stu => println("id :"+ stu._1 + " , name: " + stu._2._2 + " , score : " + stu._2._1))
filter operator
filter keeps only the elements of an RDD that satisfy a predicate.
Example: keep the even numbers
val conf = new SparkConf()
.setMaster("local")
.setAppName("filter")
val sc = new SparkContext(conf)
val number = Array(1,2,3,4,5,6,7,8,9)
val numberRDD = sc.parallelize(number)
val evenNumberRDD = numberRDD.filter(num => num % 2 == 0)
evenNumberRDD.foreach(num => println(num))
mapPartitionsWithIndex(func) operator
This operator is mainly used to see which partition each element belongs to.
val arrayRDD : RDD[Int] = sc.makeRDD(1 to 10)
val tupleIndexRDD : RDD[(Int,String)] = arrayRDD.mapPartitionsWithIndex {
  case (num, datas) => {
    datas.map((_, "partition: " + num))
  }
}
tupleIndexRDD.foreach(println)
Output:
(4,partition: 2)
(5,partition: 2)
(1,partition: 0)
(6,partition: 3)
(7,partition: 4)
(8,partition: 4)
(2,partition: 1)
(3,partition: 1)
(9,partition: 5)
(10,partition: 5)
glom operator
Puts the elements of each partition into an array, producing a new RDD of type RDD[Array[T]].
val arrayRDD : RDD[Int]= sc.makeRDD(List(1,2,3,4,5,6,7,8),4 )
val glomRDD:RDD[Array[Int]] = arrayRDD.glom()
glomRDD.collect().foreach(array => {
  println(array.mkString(","))
})
take operator
Takes the first N elements of the RDD.
val conf = new SparkConf().setMaster("local").setAppName("take")
val sc = new SparkContext(conf)
val number = Array(1,2,3,4,5,5,6,7,8)
val numberRDD = sc.parallelize(number)
val top = numberRDD.take(3)
for (num <- top) {
  println(num)
}
sc.stop()
takeOrdered operator
Sorts the RDD first, then takes the first N elements.
take, by contrast, simply returns the first N elements without sorting.
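A minimal sketch of the difference, assuming the same local setup as the take example above:
val conf = new SparkConf().setMaster("local").setAppName("takeOrdered")
val sc = new SparkContext(conf)
val numberRDD = sc.parallelize(Array(5, 1, 8, 3, 2))
//takeOrdered sorts first (ascending by default), then takes the first N: 1, 2, 3
numberRDD.takeOrdered(3).foreach(println)
//take simply returns the first N elements in partition order: 5, 1, 8
numberRDD.take(3).foreach(println)
sc.stop()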
broadcast variables
Broadcasting is mainly about efficiency: a read-only value is shipped once to the executors and shared by every task that processes the RDD.
val conf = new SparkConf().setAppName("broadcast").setMaster("local")
val sc = new SparkContext(conf)
val factor = 3
val factorBroadcast = sc.broadcast(factor)
val number = Array(1,2,3,4,5)
val numberRDD = sc.parallelize(number)
numberRDD.map(num => num *factorBroadcast.value).foreach(num => println(num))
sc.stop()
3. Operating on MySQL from RDDs
Query
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object spark_scala_test_delete {
def main(args: Array[String]): Unit = {
val conf : SparkConf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
//create the SparkContext
val sc = new SparkContext(conf)
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"
val userName = "root"
val password = "root"
val sql = "select CNO from course where tno >= ? and tno <= ?"
val jdbcRDD = new JdbcRDD(
sc,()=>{
Class.forName(driver)
java.sql.DriverManager.getConnection(url,userName,password)
},
sql,
0,1000,2,
(rs)=>{
println(rs.getString(1))
}
)
jdbcRDD.collect()
}
}
The question marks in the SQL are bound to the lower and upper bounds (0 and 1000); 2 is the number of partitions.
Insert
val dataRDD = sc.makeRDD(List(("1-111","摩登教育","999"),("2-222","摩登家庭","888")))
dataRDD.foreach {
case(a,b,c)=> {
Class.forName(driver)
val connection = java.sql.DriverManager.getConnection(url, userName, password)
val sql = "insert into course (CNO,CNAME,TNO) value (?,?,?)"
val statement = connection.prepareStatement(sql)
statement.setString(1,a)
statement.setString(2,b)
statement.setString(3,c)
statement.executeUpdate()
statement.close()
connection.close()
}
}
The snippet above iterates over the RDD element by element, opening a new connection for every element.
Rewritten with foreachPartition:
datas is the iterator over all the data in one partition, which is looped through locally,
so no extra network transfer is involved per element.
This is more efficient than the version above because it works one partition at a time:
if there are two partitions, the logic below runs only twice,
so only two connection objects are created, far fewer than before, which improves efficiency.
Drawback: OOM.
Because a whole partition is processed at once, loading too much data into memory can cause an out-of-memory error.
dataRDD.foreachPartition(datas=> {
Class.forName(driver)
val connection = java.sql.DriverManager.getConnection(url, userName, password)
datas.foreach {
case (a, b, c) => {
val sql = "insert into course (CNO,CNAME,TNO) value (?,?,?)"
val statement = connection.prepareStatement(sql)
statement.setString(1, a)
statement.setString(2, b)
statement.setString(3, c)
statement.executeUpdate()
statement.close()
}
}
connection.close()
}
)
4. Spark Core Development in the IDE
pom.xml configuration
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.7.0</version>
<configuration>
<encoding>UTF-8</encoding>
<source>1.8</source>
<target>1.8</target>
<showWarnings>true</showWarnings>
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.1.6</version>
<configuration>
<scalaVersion>2.11.8</scalaVersion>
</configuration>
<executions>
<execution>
<id>scala-compile</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>cn.strong.leke.bigdata.app.attendance.TeacherAttandanceStatApp</mainClass>
</transformer>
</transformers>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
Build the jar by running package from the Maven Project panel.
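The same step can also be run from a terminal (assuming Maven is installed and on the PATH):
mvn clean package
The shaded jar configured above then appears under the target directory.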
Creating the SparkContext (Java API)
SparkConf conf = new SparkConf().setAppName("conf_test").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
WordCount example
val conf = new SparkConf().setMaster("local").setAppName("wordcount")
val sc = new SparkContext(conf)
val lines = sc.textFile("file:///data/people.txt")
val words = lines.flatMap(line => line.split(" "))
// words.foreach(num => println(num))
val pairs = words.map(num => (num,1))
val wordCount = pairs.reduceByKey(_+_)
// wordCount.foreach(num => println(num))
val countWords = wordCount.map(num => (num._2,num._1))
val sortCount = countWords.sortByKey(false)
val sortedCount = sortCount.map(num=> (num._2,num._1))
sortedCount.foreach(num => println(num))
sc.stop()
Command to run the jar after uploading it to the cluster
bin/spark-submit \
--class com.csz.bigdata.spark.spark_wordcount \
/opt/jarPackage/spark-bigdata-1.0-shaded.jar
5. Spark Standalone Mode
Installation and usage
1. Go to the conf directory and prepare the config files
cd /opt/module/spark/conf
mv slaves.template slaves
mv spark-env.sh.template spark-env.sh
2. Edit slaves
master
slave1
slave2
3. Add the configuration
vim spark-env.sh
SPARK_MASTER_HOST=master
SPARK_MASTER_PORT=7077
Also configure spark-config.sh under sbin
and add:
export JAVA_HOME=/opt/module/jdk
4. Distribute the installation to the other nodes
scp -r /opt/module/spark root@slave1:/opt/module/
scp -r /opt/module/spark root@slave2:/opt/module/
5. Start the cluster
sbin/start-all.sh
jps
6. Run the official SparkPi example
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://master:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.4.4.jar \
100
Java IO
Input and output streams.
Byte streams (rar, zip, dat, png, jpeg) vs. character streams (txt).
//file input stream
InputStream in = new FileInputStream("xxxxx");
//buffered stream
InputStream bufferIn = new BufferedInputStream(new FileInputStream("xxxx"));
//read one line of text with a character stream
BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
String line = reader.readLine();