Spark Usage Notes

Installing IDEA and Packaging: Common Issues

Reference: http://www.cnblogs.com/seaspring/p/5615976.html
https://yq.aliyun.com/articles/60346?spm=5176.8251999.569296.68
Version compatibility matters: after changing the Scala/Spark version, make sure the version selected when creating a new project matches it.
Reference: https://www.zhihu.com/question/34099679
1. Install the Scala plugin
2. Create a new project: choose Scala, then select the JDK and the Scala SDK
3. Project Structure (shortcut F4): design the package layout
4. Libraries: add the Spark jar(s)

Running locally in IDEA

  1. Build the code: Build -> Make Project
  2. Edit the run configuration and program arguments: Run -> Edit Configurations (Application)
  3. Run -> Run, or Alt+Shift+F10
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}
/**
  * Created by yuyin on 17-1-3.
  * Runs locally
  */
object SparkWord2 {
  def main(args: Array[String]) {
    //The input file can be a local Linux file or come from another source such as HDFS
    if (args.length == 0) {
      System.err.println("Usage: SparkWordCount <inputfile>")
      System.exit(1)
    }
    //Run with local threads; the number of threads can be specified,
    //e.g. .setMaster("local[2]") uses two threads.
    //The line below runs with a single thread.
    val conf = new SparkConf().setAppName("SparkWord2").setMaster("local")
    val sc = new SparkContext(conf)

    //Count the number of lines in the file that contain "ex"
    val count=sc.textFile(args(0)).filter(line => line.contains("ex")).count()
    //Print the result
    println("count="+count)
    sc.stop()
  }

}

Packaging and submitting to a cluster

  1. Select the project, press F4 to open Project Structure, and choose Artifacts
  2. Choose Jar -> From modules with dependencies
  3. For Main Class, select SparkWord3
  4. After confirming, delete jars such as spark-assembly-1.5.0-hadoop2.4.0.jar to reduce the size of the artifact
  5. Confirm, then run Build -> Build Artifacts
  6. The generated jar is saved under ~/IdeaProjects/Spark02/out/artifacts/Spark02_jar
  7. Submit it to the cluster and run:
./spark-submit --master spark://sparkmaster:7077 --class SparkWord3 --executor-memory 1g /home/yuyin/IdeaProjects/Spark02/out/artifacts/Spark02_jar/Spark02.jar hdfs://ns1/README.md hdfs://ns1/SparkWordCountResult

Code

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by yuyin on 17-1-3.
  * Submitting to a cluster
  */
object SparkWord3 {
  def main(args: Array[String]) {
    //The input file can be a local Linux file or come from another source such as HDFS
    if (args.length < 2) {
      System.err.println("Usage: SparkWordCount <inputfile> <outputfile>")
      System.exit(1)
    }

    val conf = new SparkConf().setAppName("SparkWordCount")
    val sc = new SparkContext(conf)

    //rdd2 holds all lines that contain "Spark"
    val rdd2=sc.textFile(args(0)).filter(line => line.contains("Spark"))
    //Save the result; in this example it is written to HDFS
    rdd2.saveAsTextFile(args(1))
    sc.stop()
  }

}

Commonly used RDD transformations

union (set union)

union merges the elements of two RDDs, like a set union (duplicates are kept)

val rdd1=sc.parallelize(1 to 5)
val rdd2=sc.parallelize(4 to 8)
rdd1.union(rdd2).collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)

intersection (set intersection)

rdd1.intersection(rdd2).collect
res1: Array[Int] = Array(4, 5)

distinct (remove duplicate elements)

rdd1.union(rdd2).distinct.collect
res2: Array[Int] = Array(8, 1, 2, 3, 4, 5, 6, 7)

groupByKey([numTasks]) (group values with the same key)

Takes a dataset of (K, V) pairs and returns (K, Iterable[V]) pairs; the optional numTasks argument sets the number of tasks

rdd1.union(rdd2).map((_,1)).groupByKey.collect
res3: Array[(Int, Iterable[Int])] = Array((8,CompactBuffer(1)), (1,CompactBuffer(1)), (2,CompactBuffer(1)), (3,CompactBuffer(1)), (4,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)), (6,CompactBuffer(1)), (7,CompactBuffer(1)))

reduceByKey(func, [numTasks]) (aggregate by key)

reduceByKey takes a dataset of (K, V) pairs and also returns (K, V) pairs, where each V is the result of aggregating all values of that key with func

rdd1.union(rdd2).map((_,1)).reduceByKey(_+_).collect
res4: Array[(Int, Int)] = Array((8,1), (1,1), (2,1), (3,1), (4,2), (5,2), (6,1), (7,1))

sortByKey([ascending], [numTasks]) (sort by key)

Sorts the input dataset by key; true means ascending, false descending

var data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(7,9),(2,4)))
data.sortByKey(true).collect
res5: Array[(Int, Int)] = Array((1,3), (1,2), (1,4), (2,3), (2,4), (7,9))
data.sortByKey(false).collect
res7: Array[(Int, Int)] = Array((7,9), (2,3), (2,4), (1,3), (1,2), (1,4))

join(otherDataset, [numTasks])

For RDDs of type (K, V) and (K, W), join returns (K, (V, W)). There are three join variants:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

val rdd1=sc.parallelize(Array((1,2),(1,3)))
val rdd2=sc.parallelize(Array((1,3)))
rdd1.join(rdd2).collect
res10: Array[(Int, (Int, Int))] = Array((1,(2,3)), (1,(3,3)))

def leftOuterJoin[W](
other: RDD[(K, W)],
partitioner: Partitioner): RDD[(K, (V, Option[W]))]

rdd1.leftOuterJoin(rdd2).collect
res12: Array[(Int, (Int, Option[Int]))] = Array((1,(2,Some(3))), (1,(3,Some(3))))
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Option[V], W))]
rdd1.rightOuterJoin(rdd2).collect
res13: Array[(Int, (Option[Int], Int))] = Array((1,(Some(2),3)), (1,(Some(3),3)))

cogroup(otherDataset, [numTasks])

For input RDDs of type (K, V) and (K, W), the result is of type (K, (Iterable[V], Iterable[W])). This operation is equivalent to groupWith

rdd1.cogroup(rdd2).collect
res14: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))
rdd1.groupWith(rdd2).collect
res15: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))

cartesian(otherDataset) (Cartesian product)

Computes the Cartesian product of the two RDDs

val rdd1=sc.parallelize(Array(1,2,3,4))
val rdd2=sc.parallelize(Array(5,6))
rdd1.cartesian(rdd2).collect
res16: Array[(Int, Int)] = Array((1,5), (1,6), (2,5), (2,6), (3,5), (3,6), (4,5), (4,6))

coalesce(numPartitions) (reduce the number of partitions)

Reduces the number of partitions of the RDD to numPartitions

val rdd1=sc.parallelize(1 to 100,3)
val rdd2=rdd1.coalesce(2)

repartition(numPartitions) does the same job as coalesce; internally it simply calls coalesce with shuffle = true, which means it may cause a lot of network traffic.
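
A small sketch for comparison, continuing from the rdd1 defined above:

//coalesce(2) merges partitions without a shuffle; repartition(4) reshuffles the data into more partitions
val rdd3=rdd1.repartition(4)
rdd3.partitions.length   // 4
rdd1.coalesce(2).partitions.length   // 2 (no shuffle)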

repartitionAndSortWithinPartitions

repartitionAndSortWithinPartitions is a variant of repartition; unlike repartition, it sorts the records within each partition of the given partitioner, which is more efficient than repartitioning and then sorting.

val data = sc.parallelize(List((1,3),(1,2),(5,4),(1, 4),(2,3),(2,4)),3)
import org.apache.spark.HashPartitioner
data.repartitionAndSortWithinPartitions(new HashPartitioner(3)).collect
res3: Array[(Int, Int)] = Array((1,4), (1,3), (1,2), (2,3), (2,4), (5,4))

RDD actions

reduce() (aggregate)

reduce aggregates the elements of the RDD with a commutative and associative function (for example, summation)

val data=sc.parallelize(1 to 3)
data.reduce((x,y)=>x+y)
res20: Int = 6
data.reduce(_+_)
res21: Int = 6  

count() (count the elements)

data.count
res23: Long = 3

first() (first element)

data.first
res24: Int = 1

take(n)

data.take(2)
res25: Array[Int] = Array(1, 2)

takeSample(withReplacement, num, [seed]) (sampling)

Samples elements from the RDD, with or without replacement

val data=sc.parallelize(1 to 9)
data.takeSample(false,5)
res28: Array[Int] = Array(2, 6, 1, 5, 7)
data.takeSample(true,5,2)
res32: Array[Int] = Array(8, 8, 6, 8, 9)

takeOrdered(n, [ordering]) (take the n smallest elements)

Uses an implicit ordering by default

sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(2) 
res34: Array[Int] = Array(2, 3)
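
takeOrdered also accepts an explicit Ordering; a small sketch using the reverse ordering to take the largest elements instead of the smallest:

sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(2)(Ordering[Int].reverse)
// Array(12, 10)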

saveAsTextFile(path)

Saves the RDD to text files: on the local filesystem in local mode, or on HDFS when running on a Hadoop-based cluster
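
A minimal sketch (the output path below is just an example); each partition is written as one part-XXXXX file under the given directory:

val data=sc.parallelize(1 to 5)
//use an hdfs:// path instead when writing to HDFS
data.saveAsTextFile("file:/tmp/saveAsTextFileDemo")
//the directory can be read back with textFile
sc.textFile("file:/tmp/saveAsTextFileDemo").count()   // 5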

countByKey()

Counts the elements of the RDD by key

val data = sc.parallelize(List((1,3),(1,2),(5,4),(1, 4),(2,3),(2,4)),3)
data.countByKey()
res37: scala.collection.Map[Int,Long] = Map(1 -> 3, 5 -> 1, 2 -> 2)

foreach(func)

foreach iterates over all elements of the RDD

val data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(2,4)))
data.foreach(x=>println("key="+x._1+",value="+x._2))
key=1,value=2
key=1,value=4
key=2,value=3
key=2,value=4
key=1,value=3

See the API docs: http://spark.apache.org/docs/latest/api/scala/index.html

spark-submit parameters

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Local mode
./spark-submit --master local \
--class SparkWordCount \
--executor-memory 1g \
/root/IdeaProjects/SparkWordCount/out/artifacts/SparkWordCount_jar/SparkWordCount.jar \
file:/hadoopLearning/spark-1.5.0-bin-hadoop2.4/README.md \
file:/SparkWordCountResult

Standalone cluster mode
./spark-submit --master spark://sparkmaster:7077 \
--class SparkWordCount --executor-memory 1g \
/root/IdeaProjects/SparkWordCount/out/artifacts/SparkWordCount_jar/SparkWordCount.jar \
file:/hadoopLearning/spark-1.5.0-bin-hadoop2.4/README.md \
file:/SparkWordCountResult2

YARN mode
./spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
--executor-memory 1g \
/root/IdeaProjects/SparkWordCount/out/artifacts/SparkWordCount_jar/SparkWordCount.jar

Spark execution process

References:
http://blog.csdn.net/book_mmicky/article/details/25714419?spm=5176.100239.blogcont60342.5.Tn4nZG
https://yq.aliyun.com/articles/60342?spm=5176.100239.blogcont60343.9.hHw25F

Spark SQL and DataFrames

The internals of Spark SQL are described at:
http://blog.csdn.net/book_mmicky/article/details/39956809
The DataFrame API vs. registering a temporary table and querying it with SQL
Reading JSON data, e.g. a people.json file containing:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

//Create the RDD ourselves
case class Person(name:String,age:Int)
val data = sc.parallelize(List(('a',18),('b',21)))
//data.map(x=>Person(x._1.toString,x._2.toInt)).collect
val df=data.map(x=>Person(x._1.toString,x._2.toInt)).toDF()
//Or load the data from JSON
val df = sqlContext.read.json("/data/people.json")
//Inspect the DataFrame schema
df.printSchema()
//Return all values of one column of the DataFrame
df.select("name").show()
//Filter DataFrame rows
df.filter(df("age") > 19).show()
//Group by age
df.groupBy("age").count().show()
//Register the DataFrame as a temporary table
df.registerTempTable("people")
//Run Spark SQL
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 29")
//Format and print the result
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

//Show the table
df.show()
//Show the first 2 rows
df.show(2)
//Print the schema as a tree
df.printSchema()
//Select one column
df.select("name").show()
//Select all names, with the age column incremented by 1
df.select(df("name"), df("age") + 1).show()
//Rows matching a condition -- the same can be done by registering a table and using SQL
df.where('age >= 10).where('age <= 39).select('name).show()
df.where("age >= 10").where("age <= 39").select("name").show()
//Rows matching a condition, using filter
df.filter(df("age") > 21).show()
//Count per group -- the same can be done by registering a table and using SQL
df.groupBy("age").count().show()
//Left join (note the triple equals!); df2 is assumed to be another DataFrame with a name column
df.join(df2, df("name") === df2("name"), "left").show()

Example: using the DataFrame API vs. a temporary table

case class Person(author:String,commit:Int)
val data = sc.parallelize(List(('a',18),('b',4),('b',1),('b',2),('a',20),('b',10)))
val df=data.map(x=>Person(x._1.toString,x._2.toInt)).toDF()
//Show the first two rows
df.show(2)
//Count the total number of rows
df.count
//Sort authors by number of commits, descending
df.groupBy("author").count.sort($"count".desc).show

Using a DataFrame registered as a temporary table

//Register the DataFrame as the table commitlog
df.registerTempTable("commitlog")
//Show the first 2 rows
sqlContext.sql("SELECT * FROM commitlog").show(2)
//Count the total number of rows
sqlContext.sql("SELECT count(*) as TotalCommitNumber  FROM commitlog").show
//Sort by number of commits, descending
sqlContext.sql("SELECT author,count(*) as CountNumber  FROM commitlog GROUP BY author ORDER BY CountNumber DESC").show

Spark SQL case study

Date.txt format:

//Date.txt classifies each date, assigning it the month, week, quarter, etc. it belongs to
//date, year-month, year, month, day, day of week, week of year, quarter, ten-day period, half month
2014-12-24,201412,2014,12,24,3,52,4,36,24

Stock.txt format:

//Stock.txt defines the order header
//order number, location, transaction date
ZYSL00014630,ZY,2009-5-7

StockDetail.txt format:

//order number, line number, item, quantity, price, amount
HMJSL00006421,9,QY524266010101,1,80,80

Case study: number of orders and total sales per year, across all orders

//Define case classes that will serve as the DataFrame schemas
//For Date.txt
case class DateInfo(dateID:String,theyearmonth :String,theyear:String,themonth:String,thedate :String,theweek:String,theweeks:String,thequot :String,thetenday:String,thehalfmonth:String) 
//For Stock.txt
case class StockInfo(ordernumber:String,locationid :String,dateID:String)
//For StockDetail.txt
case class StockDetailInfo(ordernumber:String,rownum :Int,itemid:String,qty:Int,price:Double,amount:Double) 

//Load the data and convert it to a DataFrame
val DateInfoDF = sc.textFile("/data/Date.txt").map(_.split(",")).map(d => DateInfo(d(0), d(1),d(2),d(3),d(4),d(5),d(6),d(7),d(8),d(9))).toDF()
//Load the data and convert it to a DataFrame
val StockInfoDF= sc.textFile("/data/Stock.txt").map(_.split(",")).map(s => StockInfo(s(0), s(1),s(2))).toDF()
//Load the data and convert it to a DataFrame
val StockDetailInfoDF = sc.textFile("/data/StockDetail.txt").map(_.split(",")).map(s => StockDetailInfo(s(0), s(1).trim.toInt,s(2),s(3).trim.toInt,s(4).trim.toDouble,s(5).trim.toDouble)).toDF()

//Register the DataFrames as temporary tables
DateInfoDF.registerTempTable("tblDate")
StockInfoDF.registerTempTable("tblStock")
StockDetailInfoDF.registerTempTable("tblStockDetail")

//Run the SQL
//Number of orders and total sales per year, across all orders
//Join the three tables; count(distinct a.ordernumber) gives the number of orders and sum(b.amount) the total sales
sqlContext.sql("select c.theyear,count(distinct a.ordernumber),sum(b.amount) from tblStock a join tblStockDetail b on a.ordernumber=b.ordernumber join tblDate c on a.dateid=c.dateid group by c.theyear order by c.theyear").collect().foreach(println)

Case study: the sales amount of the largest order in each year:

sqlContext.sql("select c.theyear,max(d.sumofamount) from tblDate c join (select a.dateid,a.ordernumber,sum(b.amount) as sumofamount from tblStock a join tblStockDetail b on a.ordernumber=b.ordernumber group by a.dateid,a.ordernumber ) d  on c.dateid=d.dateid group by c.theyear sort by c.theyear").collect().foreach(println)  

Spark Streaming

Reference: https://yq.aliyun.com/articles/60316?spm=5176.8251999.569296.76
Word count

import org.apache.spark.SparkConf
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming._
/**
  * Created by yuyin on 17/1/5.
  * Argument: /Users/yuyin/Downloads/software/spark/streaming
  * After starting, run in the streaming directory: echo "A B C D" >> test12.txt; echo "A B" >> test12.txt
  */
object SparkStreaming {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[4]")
    //Process a batch every second
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    //Read new files appearing in the local ~/streaming directory
    val lines = ssc.textFileStream(args(0))
    val words = lines.flatMap(_.split(" "))
    val wordMap = words.map(x => (x, 1))
    val wordCounts=wordMap.reduceByKey(_ + _)
    val filteredWordCounts=wordCounts.filter(_._2>1)
    val numOfCount=filteredWordCounts.count()
    val countByValue=words.countByValue()
    val union=words.union(words)
    val transform=words.transform(x=>x.map(x=>(x,1)))
    //Print the original lines
    lines.print()
//    A B C D
//    A B
    //Print the flatMap result
    words.print()
//    A
//    B
//    C
//    D
//    A
//    B
    //Print the map result
    wordMap.print()
//    (A,1)
//    (B,1)
//    (C,1)
//    (D,1)
//    (A,1)
//    (B,1)
    //Print the reduceByKey result
    wordCounts.print()
//    (D,1)
//    (A,2)
//    (B,2)
//    (C,1)
    //Print the filter result
    filteredWordCounts.print()
//    (A,2)
//    (B,2)
    //Print the count result
    numOfCount.print()
//    2
    //Print the countByValue result
    countByValue.print()
//    (D,1)
//    (A,2)
//    (B,2)
//    (C,1)
    //Print the union result
    union.print()
//    A
//    B
//    C
//    D
//    A
//    B
//    A
//    B
//    C
//    D
//    ...
    //Print the transform result
    transform.print()
//    (A,1)
//    (B,1)
//    (C,1)
//    (D,1)
//    (A,1)
//    (B,1)
    ssc.start()
    ssc.awaitTermination()
  }
}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming._

/**
  * Created by yuyin on 17/1/5.
  * Arguments: localhost 9999
  * Start a netcat server with: nc -lk 9999
  * Then type, e.g., hello / world
  */
object SparkStreamingWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    //Function literal: add the current values to the previously accumulated state
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum

      val previousCount = state.getOrElse(0)

      Some(currentCount + previousCount)
    }

    //The input type is (K, V, S) and the return type (K, S)
    //V are the values to be summed and S is the previous state
    val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
    }

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[4]")

    //Process a batch every second
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    //Use the current directory for checkpoints; checkpointing in Spark Streaming is covered later
    ssc.checkpoint(".")

    //Initial state RDD
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))


    //Use a socket as the input source; here the host is localhost and the port 9999
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    //flatMap
    val words = lines.flatMap(_.split(" "))
    //map
    val wordDstream = words.map(x => (x, 1))

    //Use updateStateByKey to maintain the running counts
    val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
      new HashPartitioner (ssc.sparkContext.defaultParallelism), true, initialRDD)
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Spark Streaming window operations (DStream windows)

Every time the window slides, the RDDs falling inside it are processed together to produce a windowed DStream. Window operations take two parameters:
(1) window length: the duration of the window, e.g. 3 batch intervals
(2) sliding interval: how often the window operation is executed, e.g. 2 batch intervals
Both parameters must be integer multiples of the source DStream's batch interval.
Reference: https://yq.aliyun.com/articles/60316?spm=5176.8251999.569296.76
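
Besides the reduce-style window operations shown below, the plain window method simply returns a windowed DStream; a minimal sketch, assuming the words DStream and the 5-second batch interval from the WindowWordCount example that follows:

//window length 30s, sliding interval 10s -- both are multiples of the 5s batch interval
val windowedWords = words.window(Seconds(30), Seconds(10))
windowedWords.print()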

reduceByKeyAndWindow

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
/**
  * Created by yuyin on 17/1/9.
  * 1. Arguments: localhost 9999 30 10 (window length 30 seconds, sliding interval 10 seconds)
  * 2. Start a netcat server: nc -lk 9999
  *    and type, e.g.: Spark is a fast and general cluster computing system for Big Data. It provides
  */
object WindowWordCount {
  def main(args: Array[String]) {
    //Arguments: localhost 9999 30 10
    if (args.length != 4) {
      System.err.println("Usage: WindowWorldCount <hostname> <port> <windowDuration> <slideDuration>")
      System.exit(1)
    }
//    StreamingExamples.setStreamingLogLevels()

    val conf = new SparkConf().setAppName("WindowWordCount").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Create the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sc, Seconds(5))


    //Socket as the data source
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_ONLY_SER)

    val words = lines.flatMap(_.split(" "))

    // Window operation: count the words inside each window
    val wordCounts = words.map(x => (x , 1)).reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(args(2).toInt), Seconds(args(3).toInt))

    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Using countByWindow

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
/**
  * Created by yuyin on 17/1/9.
  * 1. Arguments: localhost 9999 30 10 (window length 30 seconds, sliding interval 10 seconds)
  * 2. Start a netcat server: nc -lk 9999
  *    and type, e.g.: Spark is a fast and general cluster computing system for Big Data. It provides
  */
object WindowWordCount2 {
  def main(args: Array[String]) {
    if (args.length != 4) {
      System.err.println("Usage: WindowWorldCount <hostname> <port> <windowDuration> <slideDuration>")
      System.exit(1)
    }
//    StreamingExamples.setStreamingLogLevels()

    val conf = new SparkConf().setAppName("WindowWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Create the StreamingContext
    val ssc = new StreamingContext(sc, Seconds(5))
    // Use the current directory as the checkpoint directory
    ssc.checkpoint(".")


    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_ONLY_SER)
    val words = lines.flatMap(_.split(" "))

    //countByWindow counts the number of elements in the DStream over a sliding window
    val countByWindow=words.countByWindow(Seconds(args(2).toInt), Seconds(args(3).toInt))

    countByWindow.print()
    ssc.start()
    ssc.awaitTermination()
  }

}

Using reduceByWindow

reduceByWindow aggregates the elements of the source DStream over a sliding window and returns a new DStream containing a single element.

//reduceByWindow aggregates the elements of the source DStream over a sliding window, returning a single-element DStream
 val reduceByWindow=words.map(x=>1).reduceByWindow(_ + _, _ - _, Seconds(args(2).toInt), Seconds(args(3).toInt))

The following two calls produce the same result; only the efficiency differs, and the second form is more efficient:

//With a 5-second window sliding every 1 second, this recomputes the WordCount of each of the past 5 seconds
//and adds them up to get the counts for the window: the "recompute from scratch" form
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, Seconds(5), Seconds(1))
//To get the counts for the 5-second window at time t+4, take the counts of the window at t+3, add the counts
//for [t+3, t+4] and subtract those for [t-2, t-1]; reusing the middle three seconds makes this more efficient.
//This is the incremental form (it requires checkpointing to be enabled).
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(5), Seconds(1))

Combining Spark SQL and DataFrames with Spark Streaming

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Time, Seconds, StreamingContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel
/**
  * Created by yuyin on 17/1/9.
  * 1. Arguments: localhost 9999
  * 2. Start a netcat server: nc -lk 9999
  */
object SqlNetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

//    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 2 second batch size
    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    //Socket as the data source
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    //words DStream
    val words = lines.flatMap(_.split(" "))

    // Convert RDDs of the words DStream to DataFrame and run SQL query
    //Use foreachRDD to process each RDD of the DStream
    words.foreachRDD((rdd: RDD[String], time: Time) => {
      // Get the singleton instance of SQLContext
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.implicits._

      // Convert RDD[String] to RDD[case class] to DataFrame
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      // Register as table
      wordsDataFrame.registerTempTable("words")

      // Do word count on table using SQL and print it
      val wordCountsDataFrame =
      sqlContext.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDataFrame.show()
    })

    ssc.start()
    ssc.awaitTermination()
  }
}


/** Case class for converting RDD to DataFrame */
case class Record(word: String)


/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {

  @transient  private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}

Streaming caching and checkpointing

Caching a DStream

A DStream is made up of a sequence of RDDs and, just like an ordinary RDD, its streaming data can be persisted in memory with the same persist method; calling it persists every RDD of the DStream. This is especially useful for DStreams that are recomputed several times or whose data is reused repeatedly.
Storage-level parameters are described at:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
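
A minimal sketch, assuming a words DStream like the ones in the word-count examples above (window and state operations such as reduceByKeyAndWindow and updateStateByKey persist their RDDs automatically, so an explicit call is mainly useful when a DStream is reused by several outputs):

import org.apache.spark.storage.StorageLevel

val wordCounts = words.map((_, 1)).reduceByKey(_ + _)
//keep the counts cached because two outputs below reuse them
wordCounts.persist(StorageLevel.MEMORY_ONLY_SER)
wordCounts.print()
wordCounts.filter(_._2 > 1).print()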

Checkpointing

Two kinds of data can be checkpointed:

(1) Metadata checkpointing
Saves the information that defines the streaming computation to fault-tolerant storage such as HDFS. Metadata checkpointing allows recovery when the node running the application's driver fails. The metadata includes:
Configuration - the configuration used to create the streaming application
DStream operations - the DStream operations defined in the streaming application
Incomplete batches - jobs that are queued but have not finished

(2) Data checkpointing
Saves generated RDDs to reliable external storage. For stateful transformations that combine data across multiple batches this is essential: the RDDs they produce depend on RDDs from earlier batches, so the dependency chain keeps growing over time. Checkpointing cuts the chain by periodically saving intermediate RDDs to reliable storage, so that recovery can restart from the checkpoint instead of replaying the whole lineage.

In short, metadata checkpointing is mainly for recovering from driver failures, while data checkpointing is needed for stateful transformations.

Checkpointing is enabled with the following call:

//checkpointDirectory is the directory where checkpoint files are stored
streamingContext.checkpoint(checkpointDirectory)

Example

On the first run the checkpoint directory is created.
Stop the program manually and run it again: it recovers from the checkpoint directory.

import java.io.File
import java.nio.charset.Charset

import com.google.common.io.Files

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Time, Seconds, StreamingContext}

/**
  * Counts words in text encoded with UTF8 received from the network every second.
  *
  * Usage: RecoverableNetworkWordCount <hostname> <port> <checkpoint-directory> <output-file>
  *   <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive
  *   data. <checkpoint-directory> directory to HDFS-compatible file system which checkpoint data
  *   <output-file> file to which the word counts will be appended
  *
  * <checkpoint-directory> and <output-file> must be absolute paths
  *
  * To run this on your local machine, you need to first run a Netcat server
  *
  *      `$ nc -lk 9999`
  *
  * and run the example as
  *
  *      `$ ./bin/run-example org.apache.spark.examples.streaming.RecoverableNetworkWordCount \
  *              localhost 9999 ~/checkpoint/ ~/out`
  *
  * If the directory ~/checkpoint/ does not exist (e.g. running for the first time), it will create
  * a new StreamingContext (will print "Creating new context" to the console). Otherwise, if
  * checkpoint data exists in ~/checkpoint/, then it will create StreamingContext from
  * the checkpoint data.
  *
  * Refer to the online documentation for more details.
  */
/**
  * Created by yuyin on 17/1/9.
  * 1. Arguments: localhost 9999 /Users/yuyin/Downloads/software/scala/checkpoint/ /Users/yuyin/Downloads/software/scala/out
  * 2. Start a netcat server: nc -lk 9999
  */
object RecoverableNetworkWordCount {

  def createContext(ip: String, port: Int, outputPath: String, checkpointDirectory: String)
  : StreamingContext = {


    //This println runs only when a new context is created on the first run; if the application recovers from the checkpoint, this code is not executed
    println("Creating new context")
    val outputFile = new File(outputPath)
    if (outputFile.exists()) outputFile.delete()
    val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount").setMaster("local[4]")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(checkpointDirectory)

    //Use a socket as the data source
    val lines = ssc.socketTextStream(ip, port)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
      val counts = "Counts at time " + time + " " + rdd.collect().mkString("[", ", ", "]")
      println(counts)
      println("Appending to " + outputFile.getAbsolutePath)
      Files.append(counts + "\n", outputFile, Charset.defaultCharset())
    })
    ssc
  }
  //Extractor that converts a String to an Int
  private object IntParam {
    def unapply(str: String): Option[Int] = {
      try {
        Some(str.toInt)
      } catch {
        case e: NumberFormatException => None
      }
    }
  }
  def main(args: Array[String]) {
    if (args.length != 4) {
      System.err.println("You arguments were " + args.mkString("[", ", ", "]"))
      System.err.println(
        """
          |Usage: RecoverableNetworkWordCount <hostname> <port> <checkpoint-directory>
          |     <output-file>. <hostname> and <port> describe the TCP server that Spark
          |     Streaming would connect to receive data. <checkpoint-directory> directory to
          |     HDFS-compatible file system which checkpoint data <output-file> file to which the
          |     word counts will be appended
          |
          |In local mode, <master> should be 'local[n]' with n > 1
          |Both <checkpoint-directory> and <output-file> must be absolute paths
        """.stripMargin
      )
      System.exit(1)
    }
    val Array(ip, IntParam(port), checkpointDirectory, outputPath) = args
    //getOrCreate either recreates the StreamingContext from the checkpoint data or creates a new one
    val ssc = StreamingContext.getOrCreate(checkpointDirectory,
      () => {
        createContext(ip, port, outputPath, checkpointDirectory)
      })
    ssc.start()
    ssc.awaitTermination()
  }
}

Spark Streaming with Kafka

Reference

import org.apache.kafka.clients.producer.{ProducerConfig, KafkaProducer, ProducerRecord}
import org.apache.log4j.{Level, Logger}

import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.{Logging, SparkConf}
//Arguments: sparkmaster:2181 test-consumer-group kafkatopictest 1 -- word count on the messages received from Kafka
object KafkaWordCount {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
   // StreamingExamples.setStreamingLogLevels()

    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("checkpoint")

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    //Create the ReceiverInputDStream
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Spark MLlib

Dense vectors and sparse matrices

Notes on the CSC representation of sparse matrices

import org.apache.spark.mllib.linalg.{Vector, Vectors}

//Dense vector; zero values are stored as well
scala> val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
dv: org.apache.spark.mllib.linalg.Vector = [1.0,0.0,3.0]

// Create a sparse vector by specifying its size, the indices and the non-zero values (array form)
scala> val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
sv1: org.apache.spark.mllib.linalg.Vector = (3,[0,2],[1.0,3.0])

// Create a sparse vector by specifying its size and a sequence of (index, value) pairs
scala> val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
sv2: org.apache.spark.mllib.linalg.Vector = (3,[0,2],[1.0,3.0])
//Storing a dense matrix
scala> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
//Create a dense matrix (values are given in column-major order)
scala> val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
dm: org.apache.spark.mllib.linalg.Matrix = 
1.0  2.0  
3.0  4.0  
5.0  6.0  
//The following matrix:
    1.0 0.0 4.0
    0.0 3.0 5.0
    2.0 0.0 6.0
stored as a sparse matrix has the following representation:
stored values: values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
row index of each stored value: rowIndices = [0, 2, 1, 0, 1, 2]
column start offsets: colPointers = [0, 2, 3, 6]
This is the CSC format: rowIndices gives the row of each stored value within its column, and colPointers marks where each column begins in the value array -- column 1 starts at offset 0 (value 1.0), column 2 at offset 2 (value 3.0), column 3 at offset 3 (value 4.0), and the final 6 is the total number of stored values.
http://www.tuicool.com/articles/A3emmqi?spm=5176.100239.blogcont60351.3.QnIY01

scala> val sparseMatrix= Matrices.sparse(3, 3, Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2), Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
sparseMatrix: org.apache.spark.mllib.linalg.Matrix = 
3 x 3 CSCMatrix
(0,0) 1.0
(2,0) 2.0
(1,1) 3.0
(0,2) 4.0
(1,2) 5.0
(2,2) 6.0

Labeled points (feature vectors with a class label)

scala> import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LabeledPoint

// The first argument of LabeledPoint is the label, the second the corresponding feature vector
//Here using a dense vector
scala> val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
pos: org.apache.spark.mllib.regression.LabeledPoint = (1.0,[1.0,0.0,3.0])

 // The same LabeledPoint using a sparse vector
scala> val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
neg: org.apache.spark.mllib.regression.LabeledPoint = (0.0,(3,[0,2],[1.0,3.0]))

In practice the sparse representation is the most common; labels and features are stored and read using the LIBSVM format: label index1:value1 index2:value2 …

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "/data/sample_data.txt")
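
For illustration, a hypothetical /data/sample_data.txt in this format could contain lines like the following (indices in the file are 1-based and become 0-based vector indices when loaded):

1.0 1:1.0 3:3.0
0.0 2:2.0

Loading such a file with the call above yields labeled points like (1.0,(3,[0,2],[1.0,3.0])).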

Distributed matrices: RowMatrix and CoordinateMatrix

package cn.ml.datastruct

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Matrices, Matrix, SingularValueDecomposition, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary


object RowMatrixDemo extends App {
  val sparkConf = new SparkConf().setAppName("RowMatrixDemo").setMaster("spark://sparkmaster:7077")
  val sc = new SparkContext(sparkConf)
  // Create an RDD[Vector]
  val rdd1= sc.parallelize(
      Array(
          Array(1.0,2.0,3.0,4.0),
          Array(2.0,3.0,4.0,5.0),
          Array(3.0,4.0,5.0,6.0)
          )
      ).map(f => Vectors.dense(f))
   //Create the RowMatrix
   val rowMatirx = new RowMatrix(rdd1)
   //Compute the similarities between columns; the result is a CoordinateMatrix, which stores
   //its entries as case class MatrixEntry(i: Long, j: Long, value: Double)
   var coordinateMatrix:CoordinateMatrix= rowMatirx.columnSimilarities()
   //Number of rows and columns of the matrix
   println(coordinateMatrix.numCols())
   println(coordinateMatrix.numRows())
   //Inspect the result: the similarity between each pair of columns
   //Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] 
   //= Array(MatrixEntry(2,3,0.9992204753914715), 
   //MatrixEntry(0,1,0.9925833339709303), 
   //MatrixEntry(1,2,0.9979288897338914), 
   //MatrixEntry(0,3,0.9746318461970762), 
   //MatrixEntry(1,3,0.9946115458726394), 
   //MatrixEntry(0,2,0.9827076298239907))
   coordinateMatrix.entries.collect().foreach(println)

   //Convert to a BlockMatrix (covered in detail in a later section)
   coordinateMatrix.toBlockMatrix()
   //Convert to an IndexedRowMatrix (covered in detail in a later section)
   coordinateMatrix.toIndexedRowMatrix()
   //Convert back to a RowMatrix
   coordinateMatrix.toRowMatrix()

   //Column summary statistics
    var mss:MultivariateStatisticalSummary=rowMatirx.computeColumnSummaryStatistics()
   //Mean of each column, org.apache.spark.mllib.linalg.Vector = [2.0,3.0,4.0,5.0]
   mss.mean
   //Maximum of each column, org.apache.spark.mllib.linalg.Vector = [3.0,4.0,5.0,6.0]
   mss.max
   //Minimum of each column, org.apache.spark.mllib.linalg.Vector = [1.0,2.0,3.0,4.0]
   mss.min
   //Number of non-zero elements in each column, org.apache.spark.mllib.linalg.Vector = [3.0,3.0,3.0,3.0]
   mss.numNonzeros
   //L1 norm of each column, ||x||1 = sum(abs(xi));
   //org.apache.spark.mllib.linalg.Vector = [6.0,9.0,12.0,15.0]
   mss.normL1
   //L2 norm of each column, ||x||2 = sqrt(sum(xi.^2));
   // org.apache.spark.mllib.linalg.Vector = [3.7416573867739413,5.385164807134504,7.0710678118654755,8.774964387392123]
   mss.normL2
   //Variance of each column
   //org.apache.spark.mllib.linalg.Vector = [1.0,1.0,1.0,1.0]
   mss.variance
   //Covariance matrix
   //covariance: org.apache.spark.mllib.linalg.Matrix = 
   //1.0  1.0  1.0  1.0  
   //1.0  1.0  1.0  1.0  
   //1.0  1.0  1.0  1.0  
   //1.0  1.0  1.0  1.0  
   var covariance:Matrix=rowMatirx.computeCovariance()
    //Compute the Gram matrix rowMatirx^T * rowMatirx (T denotes transpose)
   //gramianMatrix: org.apache.spark.mllib.linalg.Matrix = 
    //14.0  20.0  26.0  32.0  
    //20.0  29.0  38.0  47.0  
    //26.0  38.0  50.0  62.0  
    //32.0  47.0  62.0  77.0  
   var gramianMatrix:Matrix=rowMatirx.computeGramianMatrix()
   //Principal component analysis; the argument is the number of principal components to return
   //PCA is a classic dimensionality-reduction algorithm
   //principalComponents: org.apache.spark.mllib.linalg.Matrix = 
  //-0.5000000000000002  0.8660254037844388    
  //-0.5000000000000002  -0.28867513459481275  
  //-0.5000000000000002  -0.28867513459481287  
  //-0.5000000000000002  -0.28867513459481287  
   var principalComponents=rowMatirx.computePrincipalComponents(2)

/**
   * Singular value decomposition of the matrix A (m x n): computes three matrices U, S, V
   * such that A ~= U * S * V', where S holds the requested k singular values and U, V the corresponding singular vectors
   */
  //   svd: org.apache.spark.mllib.linalg.SingularValueDecomposition[org.apache.spark.mllib.linalg.distributed.RowMatrix,org.apache.spark.mllib.linalg.Matrix] = 
  //SingularValueDecomposition(org.apache.spark.mllib.linalg.distributed.RowMatrix@688884e,[13.011193721236575,0.8419251442105343,7.793650306633694E-8],-0.2830233037672786  -0.7873358937103356  -0.5230588083704528  
  //-0.4132328277901395  -0.3594977469144485  0.5762839813994667   
  //-0.5434423518130005  0.06834039988143598  0.4166084623124157   
  //-0.6736518758358616  0.4961785466773299   -0.4698336353414313  )
   var svd:SingularValueDecomposition[RowMatrix, Matrix]=rowMatirx.computeSVD(3,true)


   //Matrix multiplication
   var multiplyMatrix:RowMatrix=rowMatirx.multiply(Matrices.dense(4, 1, Array(1.0, 2.0, 3.0, 4.0)))
}

IndexedRowMatrix

An IndexedRowMatrix is a RowMatrix whose rows carry indices:
each row is an IndexedRow(index, vector), where index is the row index and vector holds the row's values.
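
A minimal sketch, reusing the kind of data from the RowMatrix example above:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

//each row carries an explicit Long index
val indexedRows = sc.parallelize(Array(
  IndexedRow(0L, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
  IndexedRow(1L, Vectors.dense(2.0, 3.0, 4.0, 5.0)),
  IndexedRow(2L, Vectors.dense(3.0, 4.0, 5.0, 6.0))))
val indexedRowMatrix = new IndexedRowMatrix(indexedRows)
indexedRowMatrix.numRows()   // 3
indexedRowMatrix.numCols()   // 4
//dropping the indices gives a plain RowMatrix back
val asRowMatrix = indexedRowMatrix.toRowMatrix()
//it can also be converted to a CoordinateMatrix or a BlockMatrix
indexedRowMatrix.toCoordinateMatrix()
indexedRowMatrix.toBlockMatrix()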


Using BlockMatrix

import org.apache.spark.mllib.linalg.distributed.BlockMatrix
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.SparkConf
/**
  * Created by yuyin on 17/1/9.
  */
object BlockMatrixDemo extends App{
  val sparkConf = new SparkConf().setAppName("BlockMatrixDemo").setMaster("spark://sparkmaster:7077") //这里指在本地运行,2个线程
  val sc = new SparkContext(sparkConf)

  implicit def double2long(x:Double)=x.toLong
  val rdd1= sc.parallelize(
    Array(
      Array(1.0,20.0,30.0,40.0),
      Array(2.0,50.0,60.0,70.0),
      Array(3.0,80.0,90.0,100.0)
    )
  ).map(f => IndexedRow(f.take(1)(0),Vectors.dense(f.drop(1))))
  val indexRowMatrix = new IndexedRowMatrix(rdd1)
  //Convert the IndexedRowMatrix to a BlockMatrix, specifying the number of rows and columns per block
  val blockMatrix:BlockMatrix=indexRowMatrix.toBlockMatrix(2, 2)

  //Output of the println below:
  //Index:(0,0)MatrixContent:2 x 2 CSCMatrix
  //(1,0) 20.0
  //(1,1) 30.0
  //Index:(1,1)MatrixContent:2 x 1 CSCMatrix
  //(0,0) 70.0
  //(1,0) 100.0
  //Index:(1,0)MatrixContent:2 x 2 CSCMatrix
  //(0,0) 50.0
  //(1,0) 80.0
  //(0,1) 60.0
  //(1,1) 90.0
  //Index:(0,1)MatrixContent:2 x 1 CSCMatrix
  //(1,0) 40.0
  //As the output shows, each block is stored as a sparse matrix in CSC format
  blockMatrix.blocks.foreach(f=>println("Index:"+f._1+"MatrixContent:"+f._2))

  //Convert to a local matrix
  //0.0   0.0   0.0
  //20.0  30.0  40.0
  //50.0  60.0  70.0
  //80.0  90.0  100.0
  //As the result shows, rows without a corresponding IndexedRow (here row 0, since the
  //row indices above start at 1) are filled with zeros
  blockMatrix.toLocalMatrix()

  //Add two block matrices
  blockMatrix.add(blockMatrix)

  //Multiply block matrices: blockMatrix * blockMatrix^T (T denotes transpose)
  blockMatrix.multiply(blockMatrix.transpose)

  //Convert to a CoordinateMatrix
  blockMatrix.toCoordinateMatrix()

  //Convert to an IndexedRowMatrix
  blockMatrix.toIndexedRowMatrix()

  //Validate the block matrix
  blockMatrix.validate()

}

The org.apache.spark.mllib.stat package and its subpackages: basic statistics

http://spark.apache.org/docs/latest/mllib-statistics.html#kernel-density-estimation

Column-wise summary statistics of a matrix

Per-column statistics such as the maximum, minimum and mean of each column

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary

object StatisticsDemo extends App {
  val sparkConf = new SparkConf().setAppName("StatisticsDemo").setMaster("spark://sparkmaster:7077") 
  val sc = new SparkContext(sparkConf)

  val rdd1= sc.parallelize(
      Array(
          Array(1.0,2.0,3.0,4.0),
          Array(2.0,3.0,4.0,5.0),
          Array(3.0,4.0,5.0,6.0)
          )
      ).map(f => Vectors.dense(f))
  //The MultivariateStatisticalSummary class already appeared in the first section, via
  // var mss:MultivariateStatisticalSummary=rowMatirx.computeColumnSummaryStatistics()
  // Here the same statistics are obtained through the Statistics object; internally both work the same way
  // and ultimately return a MultivariateOnlineSummarizer instance (covered in the next subsection)
  //The source of Statistics.colStats is:
  //  def colStats(X: RDD[Vector]): MultivariateStatisticalSummary = {
  //  new RowMatrix(X).computeColumnSummaryStatistics()
  //}
  //So Statistics.colStats simply calls RowMatrix.computeColumnSummaryStatistics
  val mss:MultivariateStatisticalSummary=Statistics.colStats(rdd1)
  //The results below are therefore the same as those returned by
  //computeColumnSummaryStatistics in the first section
  mss.max
  mss.min
  mss.normL1
  //plus other statistics such as normL2
}

Kernel density estimation

Only the Gaussian kernel is implemented in Spark

import org.apache.spark.mllib.stat.KernelDensity
val sample = sc.parallelize(Seq(0.0, 1.0, 4.0, 4.0))
  val kernelDensity=new KernelDensity()
                          .setSample(sample) //the sample from which to estimate the density
                          .setBandwidth(3.0) //the bandwidth; for the Gaussian kernel this is the standard deviation
  //Estimate the probability density at the given points
  //densities: Array[Double] = 
  //Array(0.07464879256673691, 0.1113106036883375, 0.08485447240456075)
  val densities = kernelDensity.estimate(Array(-1.0, 2.0, 5.0))

Hypothesis testing

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult
val land1 = Vectors.dense(1000.0, 1856.0)
val land2 = Vectors.dense(400, 560)
//Pearson's chi-squared goodness-of-fit test of the observed vector land1 against the expected vector land2
val c1: ChiSqTestResult = Statistics.chiSqTest(land1, land2)
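
The returned ChiSqTestResult exposes the test statistic, the degrees of freedom, the p-value and the method used; a small sketch:

println(c1.statistic)
println(c1.degreesOfFreedom)
println(c1.pValue)
println(c1.method)
//printing the result itself gives a readable summary, including the null hypothesis
println(c1)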

Correlation

Spark implements two correlation methods: Pearson and Spearman

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat._
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.{Matrix, Vector}

object CorrelationDemo extends App {
   val sparkConf = new SparkConf().setAppName("StatisticsDemo").setMaster("spark://sparkmaster:7077") 
   val sc = new SparkContext(sparkConf)

   val rdd1:RDD[Double] = sc.parallelize(Array(11.0, 21.0, 13.0, 14.0))
   val rdd2:RDD[Double] = sc.parallelize(Array(11.0, 20.0, 13.0, 16.0))
   //Correlation between two RDDs
   //Result: correlation: Double = 0.959034501397483
   //The value lies in [-1, 1]; the closer to 1, the stronger the correlation
   val correlation:Double = Statistics.corr(rdd1, rdd2, "pearson")


   val rdd3:RDD[Vector]= sc.parallelize(
      Array(
          Array(1.0,2.0,3.0,4.0),
          Array(2.0,3.0,4.0,5.0),
          Array(3.0,4.0,5.0,6.0)
          )
      ).map(f => Vectors.dense(f))
  //correlation3: org.apache.spark.mllib.linalg.Matrix = 
  //1.0  1.0  1.0  1.0  
  //1.0  1.0  1.0  1.0  
  //1.0  1.0  1.0  1.0  
  //1.0  1.0  1.0  1.0  
   val correlation3:Matrix = Statistics.corr(rdd3, "pearson")
}
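
Spearman (rank-based) correlation uses the same API with a different method name; a small sketch continuing with the rdd1, rdd2 and rdd3 defined in CorrelationDemo above:

//rank correlation between two RDD[Double]
val spearman: Double = Statistics.corr(rdd1, rdd2, "spearman")
//pairwise rank correlation between the columns of an RDD[Vector]
val spearmanMatrix: Matrix = Statistics.corr(rdd3, "spearman")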

Stratified sampling

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.SparkConf

object StratifiedSampleDemo extends App {

 val sparkConf = new SparkConf().setAppName("StatisticsDemo").setMaster("spark://sparkmaster:7077") 
 val sc = new SparkContext(sparkConf)
 //Read the README.md file from HDFS
val textFile = sc.textFile("/README.md")
//Word count, returning (K, V) pairs
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

//Sample the key "Spark" with fraction 0.5
val fractions: Map[String, Double] = Map("Spark"->0.5)

//Sample using sampleByKey
val approxSample = wordCounts.sampleByKey(false, fractions)
//sampleByKeyExact is more expensive than sampleByKey, but the resulting sample size
//is closer to the expected size, with 99.99% confidence
val exactSample = wordCounts.sampleByKeyExact(false, fractions)
}
// an RDD[(K, V)] of any key value pairs
val data = sc.parallelize(
  Seq((1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')))

// specify the exact fraction desired from each key
val fractions = Map(1 -> 0.1, 2 -> 0.6, 3 -> 0.3)

// Get an approximate sample from each stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)
// Get an exact sample from each stratum
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)

Random data generation

scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.mllib.random.RandomRDDs._

//Generate an RDD of 100 values drawn from the standard normal distribution N(0,1), in 10 partitions
scala> val u = normalRDD(sc, 100L, 10)
u: org.apache.spark.rdd.RDD[Double] = RandomRDD[26] at RDD at RandomRDD.scala:38

//Transform the values so that they follow the normal distribution N(1,4)
scala> val v = u.map(x => 1.0 + 2.0 * x)
v: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[27] at map at <console>:27
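
RandomRDDs provides generators for other distributions as well; a small sketch using the same sc:

//100 values drawn uniformly from [0.0, 1.0], in 10 partitions
scala> val w = uniformRDD(sc, 100L, 10)

//100 values drawn from a Poisson distribution with mean 2.0
scala> val p = poissonRDD(sc, 2.0, 100L, 10)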

Spark MLlib algorithms

See my other article.
