Submitting WordCount via spark-shell / spark-submit on a Spark cluster, plus a Spark SQL demo


1: Environment configuration
My conf/spark-env.sh is:
export SPARK_MASTER_IP=node1.cluster.local
export SPARK_WORKER_CORES=20
export SPARK_WORKER_MEMORY=12g
export SPARK_WORKER_DIR=/scratch/cperez/spark
export SPARK_WORKER_INSTANCES=1
export STANDALONE_SPARK_MASTER_HOST=node1.cluster.local
2: Logs
Spark log directory: export SPARK_LOG_DIR=/var/log/spark
Spark configuration file: vi /etc/spark/conf.dist/spark-env.sh

3: Submitting WordCount from spark-shell
1. The cluster has two master nodes; connect with: spark-shell --master spark://10.1.0.141:7077 --executor-memory 2g
2. Read a file (start HDFS first with sbin/start-dfs.sh; NameNode web UI at hadoop1:50070):
   scala> val rdd1 = sc.textFile("hdfs://master:8020/net_bi/people.txt")
   Other example inputs:
   scala> val rdd1 = sc.textFile("hdfs://master:8020/net_bi/adm_apps_price.txt")
   scala> val rdd1 = sc.textFile("hdfs://master:8020/net_bi/web-private-4_20141011_10.log")
   scala> val rdd1 = sc.textFile("hdfs://master:8020/tmp/hive-hive/hive_2015-01-06_14-16-26_035_8767682183007088983-1")
   scala> val rdd1 = sc.textFile("hdfs://UCloudcluster/umr-0zdao0/uhivefkb0jb/warehouse/test_output/test3/_logs")
   scala> val rdd1 = sc.textFile("/home/username/test.txt")
   Another HDFS namenode used in testing: hdfs://launcher-17.build.lewatek.com:8020/
   spark-shell binary: /usr/bin/spark-shell
3. Cache the RDD: scala> rdd1.cache()
4. Build the word-count transformation; nothing is executed yet:
   scala> val rdd2 = rdd1.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
   (The variant below ends in collect, which is an action, so it runs immediately:)
   scala> rdd1.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect.foreach(println)

5. Take the first ten results; only at this point does the job actually run: scala> rdd2.take(10). Transformations are lazy, and work is only performed when an action is called, as in the sketch below.
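
A minimal sketch of this lazy behaviour (same word-count pipeline as above; the path is just a placeholder):

// Transformations only record the lineage; no cluster work happens here.
val rdd1 = sc.textFile("hdfs://master:8020/net_bi/people.txt")
val rdd2 = rdd1.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// Actions trigger actual jobs.
rdd2.take(10).foreach(println)   // runs a job, prints the first ten (word, count) pairs
println(rdd2.count())            // runs another job over the same lineage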

4: Submitting WordCount with spark-submit

/opt/cloudera/parcels/CDH/bin/spark-submit --master spark://10.1.0.141:7077 --class com.username.WorkCount --executor-memory 200M /home/username/WorkCount.jar
Source (the WordCount example):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
class WordCount {}

object WordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("test")
    // val sparkConf = new SparkConf().setAppName("test").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val lines = sc.textFile("hdfs://master.dw.lewatek.com:8020/net_bi/people.txt")
    // val lines = sc.textFile("people.txt")
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect.foreach(println)
    sc.stop()
  }
}
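
To build the jar passed to spark-submit above, one option is sbt. A minimal build.sbt sketch, assuming the Scala 2.10.4 / Spark 1.1.0 (CDH 5.2) stack mentioned elsewhere in this post; spark-core is marked "provided" because the cluster supplies it:

name := "WordCount"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"

Running sbt package then produces the jar under target/scala-2.10/.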

5: Spark SQL

Run the following in spark-shell.
// RDD approach
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name:String,age:Int)
val people=sc.textFile("hdfs://master:8020/net_bi/people.txt").map(_.split(",")).map(p=>Person(p(0),p(1).trim.toInt))
people.registerAsTable("people")
val teenagers = sqlContext.sql("SELECT * FROM people")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
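
The query above selects every row even though the result is named teenagers. A sketch of the actual age-filtered query (same bounds as the Parquet example further down):

val teenagers_filtered = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers_filtered.map(t => "Name: " + t(0)).collect().foreach(println)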


// DSL approach
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name:String,age:Int)
val people=sc.textFile("hdfs://master:8020/net_bi/people.txt").map(_.split(",")).map(p=>Person(p(0),p(1).trim.toInt))
people.registerAsTable("people")
val teenagers_dsl = people.where('age >= 10).where('age <= 19).select('name)
teenagers_dsl.map(t => "Name: " + t(0)).collect().foreach(println)

// Parquet approach
import sqlContext.createSchemaRDD
people.saveAsParquetFile("hdfs://master.dw.lewatek.com:8020/net_bi/people.parquet")
val parquetFile = sqlContext.parquetFile("hdfs://master:8020/net_bi/people.parquet")
parquetFile.registerAsTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)


// Join approach
val jointbls = sqlContext.sql("SELECT people.name FROM people join parquetFile where people.name=parquetFile.name")
jointbls.map(t => "Name: " + t(0)).collect().foreach(println)
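
The same join can also be written with an explicit ON condition, which reads a little more clearly (a sketch; both tables were registered above):

val jointbls_on = sqlContext.sql("SELECT people.name FROM people JOIN parquetFile ON people.name = parquetFile.name")
jointbls_on.map(t => "Name: " + t(0)).collect().foreach(println)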

// Another Parquet example
//val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val wikiData = sqlContext.parquetFile("hdfs://master:8020/net_bi/people.parquet")
wikiData.count()
wikiData.registerAsTable("wikiData")
val countResult = sqlContext.sql("SELECT COUNT(*) FROM wikiData").collect()
sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)
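
Note that username is a column from the Wikipedia dataset this query was originally written against; the people.parquet file registered here only has name and age. A sketch of the equivalent top-N query over its actual columns:

sqlContext.sql("SELECT name, COUNT(*) AS cnt FROM wikiData WHERE name <> '' GROUP BY name ORDER BY cnt DESC LIMIT 10").collect().foreach(println)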


Enabling Hive (copy hive-site.xml into Spark's conf directory, /etc/spark/conf):
##./spark-shell --driver-library-path :/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/assembly/lib/  --executor-memory 3g     
// Start the Hive metastore service
nohup bin/hive --service metastore > metastore.log 2>&1 &
Start spark-shell: spark-shell --executor-memory 3g
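
Once the metastore is up and spark-shell picks up hive-site.xml, a minimal sanity check with HiveContext looks like this (a sketch; in older 1.0.x releases the method was hql rather than sql):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SHOW TABLES").collect().foreach(println)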

Exception
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
error: bad symbolic reference. A signature in HiveContext.class refers to term hive
in package org.apache.hadoop which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling HiveContext.class.
error:
     while compiling: <console>
        during phase: erasure
     library version: version 2.10.4
    compiler version: version 2.10.4
  reconstructed args: 

Exception
scala> val rdd2 = rdd1.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_+_).collect.foreach(println)
15/01/19 16:15:50 ERROR GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
        at java.lang.Runtime.loadLibrary0(Runtime.java:849)
        at java.lang.System.loadLibrary(System.java:1088)
        at com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
        at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
Fix: cp /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/lib/native/libhadoop.so /usr/local/java/jre/lib/amd64/
     cp /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/lib/native/libsnappy.so /usr/local/java/jre/lib/amd64/
Add a symlink: ln -s /opt/cloudera/parcels/CDH/bin .
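
To confirm which directories the JVM actually searches for native libraries (and therefore where the .so files need to land), print java.library.path from the shell:

scala> println(System.getProperty("java.library.path"))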




6: Running the official example
Build:
org.apache.spark.examples.SparkPi
./make-distribution.sh --hadoop 2.2.0 --with-yarn --with-hive --tgz
The official example:
./bin/spark-submit --master spark://spark113:7077 \
  --class org.apache.spark.examples.SparkPi \
  --name Spark-Pi --executor-memory 400M --driver-memory 512M \
  /home/hadoop/spark-1.0.0/examples/target/scala-2.10/spark-examples-1.0.0-hadoop2.0.0-cdh4.5.0.jar
/opt/cloudera/parcels/CDH/bin/spark-submit --master spark://10.1.0.141:7077 --class org.apache.spark.examples.SparkPi --name Spark-Pi --executor-memory 44M --driver-memory 44M /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/jars/spark-examples-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar 1000
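
For reference, SparkPi estimates π by Monte Carlo: it scatters random points in the unit square and counts how many land inside the unit circle. A simplified sketch of the idea (not the exact shipped source; the trailing 1000 argument above is the number of slices):

import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}

object SparkPiSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Pi"))
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    // Count points falling inside the unit circle.
    val inside = sc.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * inside / n)
    sc.stop()
  }
}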

A good write-up on Spark parameter configuration: http://database.51cto.com/art/201407/445881.htm




