1. Spark RDD
Creating an RDD
val conf : SparkConf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
//create the SparkContext
val sc = new SparkContext(conf)
//create an RDD from an in-memory collection; makeRDD is backed by parallelize
val arrayRDD : RDD[Int] = sc.makeRDD(Array(1,2,3))
//create an RDD from memory with parallelize
val listRdd : RDD[Int] = sc.parallelize(Array(1,2,3,4))
//create an RDD from external storage
val fileRDD : RDD[String] = sc.textFile("in")
listRdd.collect().foreach(println)
Reading and saving RDDs
val fileRDD : RDD[String] = sc.textFile("in")
fileRDD.saveAsTextFile("output")
Driver and Executor
Driver: the application that creates the SparkContext is the Driver.
Executor: receives tasks and executes them.
The operators (the functions passed to them) run inside the Executors.
All of the other code above runs in the Driver.
What travels over the network is serialized data: characters, strings, or numbers.
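A minimal sketch of where each piece runs (the app name and values are just illustrative):
//runs in the Driver: build the conf, the context, and the RDD lineage
val conf = new SparkConf().setMaster("local[*]").setAppName("driverVsExecutor")
val sc = new SparkContext(conf)
val dataRDD = sc.makeRDD(Array(1, 2, 3, 4))
//the function num => num * 10 is serialized and shipped to the Executors, which is where it runs
val result = dataRDD.map(num => num * 10).collect()
//collect() sends the computed values back over the network to the Driver
println(result.mkString(","))
sc.stop()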
2. RDD Operators
map operator
map is the most general operator; it applies a function to every element of the RDD.
val sparkConf = new SparkConf().setAppName("map").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val number = Array(1,2,3,4,5)
val numberRDD = sc.parallelize(number)
val multipleRdd = numberRDD.map(num => num *2)
multipleRdd.foreach(num => println(num))
reduce operator
reduce is an action operator; it aggregates the elements of the RDD.
def reduce(): Unit = {
  val conf = new SparkConf().setMaster("local").setAppName("reduce")
  val sc = new SparkContext(conf)
  val number = Array(1, 2, 3, 4, 5, 5, 6, 7, 8)
  val numberRDD = sc.parallelize(number)
  val sum = numberRDD.reduce(_ + _)
  println(sum)
  sc.stop()
}
flatMap operator
Requirement: given the word list ["Hello", "World"], return the list of characters ["H","e","l","l","o","W","o","r","l","d"].
flatMap fits this well: it maps one element to many (one-to-many or many-to-many).
val conf = new SparkConf().setAppName("flatMap").setMaster("local[*]")
val sc = new SparkContext(conf)
val lineArray = Array("hello you", "hello me","hello world")
val lineRDD = sc.parallelize(lineArray)
lineRDD.foreach(line => println(line))
val words = lineRDD.flatMap( line => line.split(" "))
words.foreach(line => println(line))
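For the character-level requirement stated above, a minimal sketch (reusing the same sc) could be:
val wordArray = Array("Hello", "World")
//one word maps to many characters, duplicates included
val charsRDD = sc.parallelize(wordArray).flatMap(word => word.toCharArray)
charsRDD.foreach(c => println(c))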
groupByKey operator
Groups a key-value (K, V) RDD by key.
val conf = new SparkConf().setMaster("local[*]").setAppName("groupByKey")
val sc = new SparkContext(conf)
val score = Array(Tuple2("class1",80),Tuple2("class2",70),Tuple2("class2",20),Tuple2("class2",90),Tuple2("class1",80))
val scoreRDD = sc.parallelize(score)
scoreRDD.groupByKey().foreach(a => println((a._2, a._1)))
// .foreach(score => { println(score._1);score._2.foreach(singlescore => println(singlescore))})
// println("===============")
sc.stop()
reduceByKey operator
Reduces a key-value RDD by key (here, the values of identical keys are summed).
val conf = new SparkConf().setAppName("reduceByKey").setMaster("local[*]")
val sc = new SparkContext(conf)
val score = Array(Tuple2("class1",80),Tuple2("class2",70),Tuple2("class2",20),Tuple2("class2",90),Tuple2("class1",80))
val scoreRDD = sc.parallelize(score)
scoreRDD.reduceByKey(_+_).foreach(num => println(num._1+ " : " + num._2) )
sc.stop()
sortByKey operator
Sorts a key-value RDD by key.
val conf = new SparkConf().setMaster("local").setAppName("sortByKey")
val sc = new SparkContext(conf)
val score = Array(Tuple2(80,"class1"),Tuple2(70,"class3"),Tuple2(20,"class4"),Tuple2(90,"class5"),Tuple2(80,"class6"))
val scoreRDD = sc.parallelize(score)
scoreRDD.sortByKey().foreach(num => println(num._1 + " : " + num._2))
sc.stop()
join operator
Joins two key-value RDDs on their keys.
val conf = new SparkConf().setAppName("join").setMaster("local")
val sc = new SparkContext(conf)
val score = Array(Tuple2(1,80),Tuple2(1,70),Tuple2(2,30),Tuple2(3,50))
val student = Array(Tuple2(1,"xiaohua"),Tuple2(2,"xiaoming"),Tuple2(3,"xiaochen"))
val scoreRDD = sc.parallelize(score)
val studentRDD = sc.parallelize(student)
scoreRDD.join(studentRDD).foreach(stu => println("id :"+ stu._1 + " , name: " + stu._2._2 + " , score : " + stu._2._1))
filter operator
filter keeps only the elements of an RDD that satisfy a predicate.
Example: keep the even numbers
val conf = new SparkConf()
.setMaster("local")
.setAppName("filter")
val sc = new SparkContext(conf)
val number = Array(1,2,3,4,5,6,7,8,9)
val numberRDD = sc.parallelize(number)
val evenNumberRDD = numberRDD.filter(num => num % 2 == 0)
evenNumberRDD.foreach(num => println(num))
mapPartitionsWithIndex(func) operator
This operator is mainly used to see which partition each element belongs to.
val arrayRDD : RDD[Int] = sc.makeRDD(1 to 10)
val tupleIndexRDD : RDD[(Int,String)] = arrayRDD.mapPartitionsWithIndex {
  case (num, datas) => {
    datas.map((_, "partition: " + num))
  }
}
tupleIndexRDD.foreach(println)
Output:
(4,partition: 2)
(5,partition: 2)
(1,partition: 0)
(6,partition: 3)
(7,partition: 4)
(8,partition: 4)
(2,partition: 1)
(3,partition: 1)
(9,partition: 5)
(10,partition: 5)
glom operator
Puts the elements of each partition into an array, producing a new RDD of type RDD[Array[T]].
val arrayRDD : RDD[Int]= sc.makeRDD(List(1,2,3,4,5,6,7,8),4 )
val glomRDD:RDD[Array[Int]] = arrayRDD.glom()
glomRDD.collect().foreach(array => {
  println(array.mkString(","))
})
take operator
Takes the first N elements of the RDD.
val conf = new SparkConf().setMaster("local").setAppName("take")
val sc = new SparkContext(conf)
val number = Array(1,2,3,4,5,5,6,7,8)
val numberRDD = sc.parallelize(number)
val top = numberRDD.take(3)
for (num <- top) {
  println(num)
}
sc.stop()
takeOrdered operator
Sorts the RDD first, then takes the first N elements.
take, by contrast, simply returns the first N elements without sorting.
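A minimal sketch of the difference, assuming the same local setup as the take example above:
val conf = new SparkConf().setMaster("local").setAppName("takeOrdered")
val sc = new SparkContext(conf)
val numberRDD = sc.parallelize(Array(5, 1, 8, 3, 2))
//takeOrdered sorts first (ascending by default), then takes the first N: 1, 2, 3
numberRDD.takeOrdered(3).foreach(println)
//take simply returns the first N elements in partition order: 5, 1, 8
numberRDD.take(3).foreach(println)
sc.stop()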
broadcast variables
Broadcasting is mainly about efficiency: a read-only value is shipped once to the executors and shared by every task that processes the RDD.
val conf = new SparkConf().setAppName("broadcast").setMaster("local")
val sc = new SparkContext(conf)
val factor = 3
val factorBroadcast = sc.broadcast(factor)
val number = Array(1,2,3,4,5)
val numberRDD = sc.parallelize(number)
numberRDD.map(num => num *factorBroadcast.value).foreach(num => println(num))
sc.stop()
3. Operating on MySQL from RDDs
Query
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object spark_scala_test_delete {
def main(args: Array[String]): Unit = {
val conf : SparkConf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
//create the SparkContext
val sc = new SparkContext(conf)
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/test"
val userName = "root"
val password = "root"
val sql = "select CNO from course where tno >= ? and tno <= ?"
val jdbcRDD = new JdbcRDD(
sc,()=>{
Class.forName(driver)
java.sql.DriverManager.getConnection(url,userName,password)
},
sql,
0,1000,2,
(rs)=>{
println(rs.getString(1))
}
)
jdbcRDD.collect()
}
}
The question marks in the SQL are bound to the lower and upper bounds (0 and 1000); 2 is the number of partitions.
Insert
val dataRDD = sc.makeRDD(List(("1-111","摩登教育","999"),("2-222","摩登家庭","888")))
dataRDD.foreach {
case(a,b,c)=> {
Class.forName(driver)
val connection = java.sql.DriverManager.getConnection(url, userName, password)
val sql = "insert into course (CNO,CNAME,TNO) value (?,?,?)"
val statement = connection.prepareStatement(sql)
statement.setString(1,a)
statement.setString(2,b)
statement.setString(3,c)
statement.executeUpdate()
statement.close()
connection.close()
}
}
The snippet above iterates over the RDD element by element, opening a new connection for every element.
Rewritten with foreachPartition:
datas is the iterator over all the data in one partition, which is looped through locally,
so no extra network transfer is involved per element.
This is more efficient than the version above because it works one partition at a time:
if there are two partitions, the logic below runs only twice,
so only two connection objects are created, far fewer than before, which improves efficiency.
Drawback: OOM.
Because a whole partition is processed at once, loading too much data into memory can cause an out-of-memory error.
dataRDD.foreachPartition(datas=> {
Class.forName(driver)
val connection = java.sql.DriverManager.getConnection(url, userName, password)
datas.foreach {
case (a, b, c) => {
val sql = "insert into course (CNO,CNAME,TNO) value (?,?,?)"
val statement = connection.prepareStatement(sql)
statement.setString(1, a)
statement.setString(2, b)
statement.setString(3, c)
statement.executeUpdate()
statement.close()
}
}
connection.close()
}
)
4. Spark Core Development in the IDE
pom.xml configuration
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.7.0</version>
<configuration>
<encoding>UTF-8</encoding>
<source>1.8</source>
<target>1.8</target>
<showWarnings>true</showWarnings>
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.1.6</version>
<configuration>
<scalaVersion>2.11.8</scalaVersion>
</configuration>
<executions>
<execution>
<id>scala-compile</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>cn.strong.leke.bigdata.app.attendance.TeacherAttandanceStatApp</mainClass>
</transformer>
</transformers>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
Build the jar by running package from the Maven Project panel.
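The same step can also be run from a terminal (assuming Maven is installed and on the PATH):
mvn clean package
The shaded jar configured above then appears under the target directory.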
Creating the SparkContext (Java API)
SparkConf conf = new SparkConf().setAppName("conf_test").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
WordCount example
val conf = new SparkConf().setMaster("local").setAppName("wordcount")
val sc = new SparkContext(conf)
val lines = sc.textFile("file:///data/people.txt")
val words = lines.flatMap(line => line.split(" "))
// words.foreach(num => println(num))
val pairs = words.map(num => (num,1))
val wordCount = pairs.reduceByKey(_+_)
// wordCount.foreach(num => println(num))
val countWords = wordCount.map(num => (num._2,num._1))
val sortCount = countWords.sortByKey(false)
val sortedCount = sortCount.map(num=> (num._2,num._1))
sortedCount.foreach(num => println(num))
sc.stop()
Command to run the jar after uploading it to the cluster
bin/spark-submit \
--class com.csz.bigdata.spark.spark_wordcount \
/opt/jarPackage/spark-bigdata-1.0-shaded.jar
5. Spark Standalone Mode
Installation and usage
1. Go to the conf directory and prepare the config files
cd /opt/module/spark/conf
mv slaves.template slaves
mv spark-env.sh.template spark-env.sh
2. Edit slaves
master
slave1
slave2
3. Add the configuration
vim spark-env.sh
SPARK_MASTER_HOST=master
SPARK_MASTER_PORT=7077
Also configure spark-config.sh under sbin
and add:
export JAVA_HOME=/opt/module/jdk
4. Distribute the installation to the other nodes
scp -r /opt/module/spark root@slave1:/opt/module/
scp -r /opt/module/spark root@slave2:/opt/module/
5. Start the cluster
sbin/start-all.sh
jps
6. Run the official SparkPi example
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://master:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.4.4.jar \
100
Java IO
Input and output streams.
Byte streams (rar, zip, dat, png, jpeg) vs. character streams (txt).
//file input stream
InputStream in = new FileInputStream("xxxxx");
//buffered stream
InputStream bufferIn = new BufferedInputStream(new FileInputStream("xxxx"));
//read one line of text with a character stream
BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
String line = reader.readLine();