Spark (Day 03) -- Installation, RDD (WordCount)

I. Introduction

https://en.wikipedia.org/wiki/Apache_Spark


II. Installation & Test

1. tar -xzvf spark-1.6.1-bin-hadoop2.6.tgz
2. mv spark-1.6.1-bin-hadoop2.6 spark-1.6.1
3. vi /etc/profile        (add SPARK_HOME and PATH; see the sketch below)
4. vi conf/slaves         (list the worker nodes; see the sketch below)
5. vi conf/spark-env.sh   (cluster settings; see the sections below)

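Steps 3 and 4 only name the files; here is a minimal sketch of what they might contain, assuming Spark is unpacked to /usr/local/spark-1.6.1 (the same SPARK_HOME used further below) and two placeholder worker addresses:

# /etc/profile (append, then run: source /etc/profile)
export SPARK_HOME=/usr/local/spark-1.6.1
export PATH=$SPARK_HOME/bin:$PATH

# conf/slaves -- one worker hostname or IP per line (placeholder addresses)
192.168.16.101
192.168.16.102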

Standalone cluster (conf/spark-env.sh):

export JAVA_HOME=/usr/java
export SPARK_MASTER_IP=192.168.16.100
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1

export SPARK_WORKER_MEMORY=1g


Start:
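Before submitting, the master and worker daemons need to be running. A quick sketch, assuming SPARK_HOME is /usr/local/spark-1.6.1:

cd /usr/local/spark-1.6.1
./sbin/start-all.sh
jps   # the master node should now show a Master process (workers show Worker)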

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://192.168.16.100:7077 --executor-memory 1G --total-executor-cores 1 ./lib/spark-examples-1.6.1-hadoop2.6.0.jar 100
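The cluster can also be checked interactively; a sketch using the bundled shell against the same master URL:

./bin/spark-shell --master spark://192.168.16.100:7077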


YARN cluster (additional environment variables):

export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_HOME=/usr/local/spark-1.6.1
export SPARK_JAR=/usr/local/spark-1.6.1/lib/spark-assembly-1.6.1-hadoop2.6.0.jar

export PATH=$SPARK_HOME/bin:$PATH


YARN client mode:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --executor-memory 1G --num-executors 1 ./lib/spark-examples-1.6.1-hadoop2.6.0.jar 100



YARN cluster mode:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --executor-memory 1G --num-executors 1 ./lib/spark-examples-1.6.1-hadoop2.6.0.jar 100


*Note: if YARN kills your containers for exceeding memory limits, add the following properties to yarn-site.xml:

<!-- Whether to start a thread that checks the physical memory used by each task and kills the container if it exceeds the allocation -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<!-- Whether to start a thread that checks the virtual memory used by each task and kills the container if it exceeds the allocation -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>


III. RDD (Spark Core)

Resilient Distributed Dataset

Five main properties (as described in the RDD source documentation):

1. A list of partitions
2. A function for computing each partition (compute)
3. A list of dependencies on other RDDs
4. Optionally, a Partitioner for key-value RDDs (e.g. hash-partitioned)
5. Optionally, a list of preferred locations for computing each partition (e.g. HDFS block locations)


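These properties can be inspected directly on an RDD. A minimal sketch (the object name RddProps and the sample data are made up for illustration):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RddProps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddProps").setMaster("local[2]"))
    // A small key-value RDD, reduced with an explicit HashPartitioner
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 3)
    val reduced = pairs.reduceByKey(new HashPartitioner(2), _ + _)

    println(reduced.partitions.length)   // 1) a list of partitions (2, from the HashPartitioner)
    // 2) compute() runs on each partition; it is called by the scheduler, not by user code
    println(reduced.dependencies)        // 3) dependencies on parent RDDs (a ShuffleDependency here)
    println(reduced.partitioner)         // 4) Some(HashPartitioner) for this key-value RDD
    reduced.partitions.foreach(p =>
      println(reduced.preferredLocations(p)))  // 5) preferred locations (empty for parallelized data)
    sc.stop()
  }
}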

WordCount:

import org.apache.spark.{SparkConf, SparkContext}

object WC {
  def main(args: Array[String]): Unit = {
    // Configure the SparkConf; when running locally, add setMaster("local[2]")
    val conf = new SparkConf().setAppName("WC").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Load the data; lines is an RDD of the file's lines
    val lines = sc.textFile("D://felqi//feiq//Recv Files//student.txt")
    // Flatten and split each line into words
    val words = lines.flatMap(_.split(" "))
    // Turn each word into a (word, 1) pair
    val pairs = words.map((_,1))
    //println(pairs.collect().toBuffer)
    // Reduce by key to get the word counts
    val res = pairs.reduceByKey(_+_)
    // Sort the result
    //println(res.collect().toBuffer)
    // Alternative: swap key and value, then sort by key
    //val sort = res.map(x=>(x._2,x._1))
    //val sort = res.sortByKey(false)   // descending
    val sort = res.sortBy(_._2, false)  // sort by count, descending
    sort.saveAsTextFile("d://sparkout")
  //  println(sort.collect().toBuffer)
  }
}

WordCount-Cluster:

import org.apache.spark.{SparkConf, SparkContext}

object WcCluster {
  def main(args: Array[String]): Unit = {
    // No setMaster here; the master is supplied via spark-submit on the cluster
    val conf = new SparkConf().setAppName("WcCluster")
    val sc = new SparkContext(conf)
    // Load from HDFS, count the words, sort descending by count, write the result back to HDFS
    sc.textFile("hdfs://192.168.16.100:9000//sparkwc//student.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, false)
      .saveAsTextFile("hdfs://192.168.16.100:9000//sparkwc//out")
  }
}
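To run WcCluster it has to be packaged into a jar and submitted with spark-submit; a sketch, reusing the standalone master URL from above:

# /path/to/wc.jar below is a placeholder for the packaged application jar
./bin/spark-submit --class WcCluster --master spark://192.168.16.100:7077 --executor-memory 1G --total-executor-cores 1 /path/to/wc.jar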

Other examples:

object SumDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SumDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Create an RDD by parallelizing a Scala collection
    val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10,11,12))
    // Multiply every element by 100
    val res0 = rdd1.map(_*100)
    // Keep only the elements greater than 600
    val res1 = res0.filter(_>600)
    // Check how many partitions the RDD has
    val res2 = rdd1.partitions.length
    println(res2)
  }
}
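The partition count printed above comes from the default parallelism (2 under local[2]). It can be set explicitly by passing a second argument to parallelize; a small sketch with a made-up object name:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionsDemo").setMaster("local[2]"))
    // Ask for 4 partitions explicitly instead of the default
    val rdd = sc.parallelize(1 to 12, 4)
    println(rdd.partitions.length)   // 4
    sc.stop()
  }
}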

 
