一.Introduction
Apache Spark is a fast, general-purpose engine for large-scale data processing. For background, see:
https://en.wikipedia.org/wiki/Apache_Spark
二.Installation & Test
1. tar -xzvf spark-1.6.1-bin-hadoop2.6.tgz
2. mv spark-1.6.1-bin-hadoop2.6 spark-1.6.1
3. vi /etc/profile (example below)
4. vi slaves (example below)
5. vi spark-env.sh
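A minimal sketch for steps 3 and 4, assuming Spark was moved to /usr/local/spark-1.6.1 and the workers are named slave1 and slave2 (adjust both to your environment):

/etc/profile:
export SPARK_HOME=/usr/local/spark-1.6.1
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
(run source /etc/profile afterwards)

conf/slaves, one worker hostname per line:
slave1
slave2

The spark-env.sh settings for step 5 depend on the deployment mode: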
Standalone cluster (spark-env.sh):
export JAVA_HOME=/usr/java
export SPARK_MASTER_IP=192.168.16.100
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g
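With spark-env.sh and conf/slaves in place, the master and workers can be started with the bundled scripts (a sketch, run from the Spark install directory):

./sbin/start-all.sh    # starts the master plus the workers listed in conf/slaves

The master web UI is then reachable on port 8080 of the master host.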
Submit the SparkPi example to the standalone master to verify the installation:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://192.168.16.100:7077 --executor-memory 1G --total-executor-cores 1 ./lib/spark-examples-1.6.1-hadoop2.6.0.jar 100
YARN cluster (environment variables):
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_HOME=/usr/local/spark-1.6.1
export SPARK_JAR=/usr/local/spark-1.6.1/lib/spark-assembly-1.6.1-hadoop2.6.0.jar
export PATH=$SPARK_HOME/bin:$PATH
YARN client mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --executor-memory 1G --num-executors 1 ./lib/spark-examples-1.6.1-hadoop2.6.0.jar 100
YARN cluster mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --executor-memory 1G --num-executors 1 ./lib/spark-examples-1.6.1-hadoop2.6.0.jar 100
*If YARN kills your application's containers for exceeding memory limits, add the following properties to yarn-site.xml:
<!-- Whether to run a thread that checks the physical memory each task is using and kills the task if it exceeds its allocation -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<!-- Whether to run a thread that checks the virtual memory each task is using and kills the task if it exceeds its allocation -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
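The change takes effect after restarting YARN, for example (assuming Hadoop's sbin directory is on the PATH):

stop-yarn.sh
start-yarn.sh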
三.RDD (Spark Core)
Resilient Distributed Dataset
Five characteristics (as documented in the RDD source):
1. A list of partitions
2. A function for computing each partition
3. A list of dependencies on other RDDs
4. Optionally, a Partitioner for key-value RDDs (e.g. hash-partitioned)
5. Optionally, a list of preferred locations to compute each partition on (e.g. HDFS block locations)
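Some of these can be inspected directly from a driver program; a small sketch (RddInfo is a new illustrative name, and the printed values depend on your data and cluster):

import org.apache.spark.{SparkConf, SparkContext}
object RddInfo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddInfo").setMaster("local[2]"))
    // a key-value RDD built from a parallelized collection with 4 partitions
    val pairs = sc.parallelize(1 to 100, 4).map(i => (i % 10, i)).reduceByKey(_ + _)
    println(pairs.partitions.length)   // 1. the list of partitions (4 here)
    println(pairs.dependencies)        // 3. dependency on the parent RDD (a ShuffleDependency)
    println(pairs.partitioner)         // 4. Some(HashPartitioner(4)) for this key-value RDD
    println(pairs.preferredLocations(pairs.partitions(0))) // 5. usually empty for parallelized collections
    sc.stop()
  }
}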
WordCount:
import org.apache.spark.{SparkConf, SparkContext}
object WC {
  def main(args: Array[String]): Unit = {
    // Build the conf; setMaster("local[2]") is only needed when running locally
    val conf = new SparkConf().setAppName("WC").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Load the data; lines is an RDD
    val lines = sc.textFile("D://felqi//feiq//Recv Files//student.txt")
    // Flatten the lines and split them into words
    val words = lines.flatMap(_.split(" "))
    // Turn each word into a (word, 1) pair
    val pairs = words.map((_, 1))
    //println(pairs.collect().toBuffer)
    // Do the word count on the pairs
    val res = pairs.reduceByKey(_ + _)
    //println(res.collect().toBuffer)
    // To sort, the key and value could be swapped first:
    //val sort = res.map(x => (x._2, x._1))
    //val sort = res.sortByKey(false) // descending
    // Sort by count in descending order and save the output
    val sort = res.sortBy(_._2, false)
    sort.saveAsTextFile("d://sparkout")
    // println(sort.collect().toBuffer)
    sc.stop()
  }
}
WordCount-Cluster:
import org.apache.spark.{SparkConf, SparkContext}
object WcCluster {
  def main(args: Array[String]): Unit = {
    // Build the conf; no setMaster here because the master is supplied by spark-submit
    val conf = new SparkConf().setAppName("WcCluster")
    val sc = new SparkContext(conf)
    // Load the data from HDFS, count the words, sort descending and write the result back to HDFS
    sc.textFile("hdfs://192.168.16.100:9000//sparkwc//student.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, false)
      .saveAsTextFile("hdfs://192.168.16.100:9000//sparkwc//out")
    sc.stop()
  }
}
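To run this on the cluster, package the class into a jar and submit it, for example (the jar path is a placeholder for however you package the project):

./bin/spark-submit --class WcCluster --master spark://192.168.16.100:7077 --executor-memory 1G --total-executor-cores 1 /path/to/wc.jar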
Others:
import org.apache.spark.{SparkConf, SparkContext}
object SumDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SumDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Create an RDD by parallelizing a Scala collection
    val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))
    // Multiply every element by 100
    val res0 = rdd1.map(_ * 100)
    // Keep only the elements greater than 600
    val res1 = res0.filter(_ > 600)
    // Check how many partitions the RDD has
    val res2 = rdd1.partitions.length
    println(res2)
    sc.stop()
  }
}
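Note that res0 and res1 above are transformations, so nothing is computed until an action runs. A small follow-up sketch (SumDemo2 is a new name, not from the original) showing how collect and reduce trigger the actual work, including the sum the class name hints at:

import org.apache.spark.{SparkConf, SparkContext}
object SumDemo2 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SumDemo2").setMaster("local[2]"))
    val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))
    // collect() is an action: it runs the computation and returns the results to the driver
    println(rdd1.map(_ * 100).filter(_ > 600).collect().toBuffer)
    // reduce(_ + _) is another action; it sums the elements (78 for 1..12)
    println(rdd1.reduce(_ + _))
    sc.stop()
  }
}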