1: Environment configuration
My conf/spark-env.sh is:
export SPARK_MASTER_IP=node1.cluster.local
export SPARK_WORKER_CORES=20
export SPARK_WORKER_MEMORY=12g
export SPARK_WORKER_DIR=/scratch/cperez/spark
export SPARK_WORKER_INSTANCES=1
export STANDALONE_SPARK_MASTER_HOST=node1.cluster.local
Spark log directory: export SPARK_LOG_DIR=/var/log/spark
Spark config file location: vi /etc/spark/conf.dist/spark-env.sh
1. Connect to the master node: spark-shell --master spark://10.1.0.141:7077 --executor-memory 2g
2. Read a file (start HDFS with sbin/start-dfs.sh; check the web UI at hadoop1:50070):
scala> val rdd1 = sc.textFile("hdfs://master:8020/net_bi/people.txt")
scala> val rdd1 = sc.textFile("hdfs://master:8020/net_bi/adm_apps_price.txt")
scala> val rdd1 = sc.textFile("hdfs://master:8020/net_bi/web-private-4_20141011_10.log")
scala> val rdd1 = sc.textFile("hdfs://master:8020/tmp/hive-hive/hive_2015-01-06_14-16-26_035_8767682183007088983-1")
scala> val rdd1 = sc.textFile("hdfs://UCloudcluster/umr-0zdao0/uhivefkb0jb/warehouse/test_output/test3/_logs")
scala> val rdd1 = sc.textFile("/home/username/test.txt")
Another HDFS endpoint: hdfs://launcher-17.build.lewatek.com:8020/
Shell binary: /usr/bin/spark-shell
3. Cache it: scala> rdd1.cache()
4. The word-count transformation; nothing runs yet: scala> val rdd2 = rdd1.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
   Or, collecting and printing in one step: scala> rdd1.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect.foreach(println)
5. Take the first ten results; execution happens only now: scala> rdd2.take(10). A job is submitted only when an action is called.
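The lazy/eager split can be verified from the shell; nothing below runs a job until the action on the last line. (A small sketch reusing rdd1 from step 2; toDebugString only prints the planned lineage.)
scala> val rdd2 = rdd1.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)   // returns immediately, no job runs
scala> println(rdd2.toDebugString)                                                  // shows the lineage that will execute later
scala> rdd2.take(10).foreach(println)                                               // an action: only now is a job submitted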
4: Submitting WordCount with spark-submit
/opt/cloudera/parcels/CDH/bin/spark-submit \
  --master spark://10.1.0.141:7077 \
  --class com.username.WorkCount \
  --executor-memory 200M \
  /home/username/WorkCount.jar
Source: the WordCount example
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

class WordCount {}

object WordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("test")
    //val sparkConf = new SparkConf().setAppName("test").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val lines = sc.textFile("hdfs://master.dw.lewatek.com:8020/net_bi/people.txt")
    // val lines = sc.textFile("people.txt")
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect.foreach(println)
    sc.stop()
  }
}
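To produce the WorkCount.jar passed to spark-submit above, a minimal build.sbt could look like this (a sketch; the Spark version is an assumption based on the Scala 2.10.4 / CDH environment mentioned in this post, so adjust it to match the cluster):
// build.sbt (hypothetical)
name := "WorkCount"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
Running sbt package then produces the jar under target/scala-2.10/.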
6: Spark SQL
Getting started
//RDD style
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name:String,age:Int)
val people=sc.textFile("hdfs://master:8020/net_bi/people.txt").map(_.split(",")).map(p=>Person(p(0),p(1).trim.toInt))
people.registerAsTable("people")
val teenagers = sqlContext.sql("SELECT * FROM people")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
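For reference, the age-filtered form of the same query (the same predicate used in the Parquet example below) would be:
scala> val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
scala> teenagers.map(t => "Name: " + t(0)).collect().foreach(println)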
//DSL style (reuses sqlContext, the Person case class, and the registered people table from above)
val teenagers_dsl = people.where('age >= 10).where('age <= 19).select('name)
teenagers_dsl.map(t => "Name: " + t(0)).collect().foreach(println)
//Parquet style
import sqlContext.createSchemaRDD
people.saveAsParquetFile("hdfs://master.dw.lewatek.com:8020/net_bi/people.parquet")
val parquetFile = sqlContext.parquetFile("hdfs://master:8020/net_bi/people.parquet")
parquetFile.registerAsTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
//Join style
val jointbls = sqlContext.sql("SELECT people.name FROM people join parquetFile where people.name=parquetFile.name")
jointbls.map(t => "Name: " + t(0)).collect().foreach(println)
//Another Parquet example (adapted from the official wikiData demo; note the columns in people.parquet are name and age)
//val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val wikiData = sqlContext.parquetFile("hdfs://master:8020/net_bi/people.parquet")
wikiData.count()
wikiData.registerAsTable("wikiData")
val countResult = sqlContext.sql("SELECT COUNT(*) FROM wikiData").collect()
sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)
Enabling Hive (copy hive-site.xml into Spark's conf directory, /etc/spark/conf)
//Start the Hive metastore service
nohup bin/hive --service metastore > metastore.log 2>&1 &
Start spark-shell: spark-shell --executor-memory 3g
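Once the metastore is up, a minimal HiveContext session looks like this (a sketch assuming Spark was built with Hive support and hive-site.xml is on the classpath; the table name people is only an example):
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.sql("SHOW TABLES").collect().foreach(println)
scala> hiveContext.sql("SELECT COUNT(*) FROM people").collect().foreach(println)   // hypothetical table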
Exception (HiveContext cannot be constructed; this typically means the Spark build on the classpath was not compiled with Hive support, see the --with-hive build flag below):
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
error: bad symbolic reference. A signature in HiveContext.class refers to term hive
in package org.apache.hadoop which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling HiveContext.class.
error:
while compiling: <console>
during phase: erasure
library version: version 2.10.4
compiler version: version 2.10.4
reconstructed args:
Exception
scala> val rdd2 = rdd1.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_+_).collect.foreach(println)
15/01/19 16:15:50 ERROR GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at java.lang.System.loadLibrary(System.java:1088)
at com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
Solution: cp /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/lib/native/libhadoop.so /usr/local/java/jre/lib/amd64/
cp /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/lib/native/libsnappy.so /usr/local/java/jre/lib/amd64/
Add a symbolic link: ln -s /opt/cloudera/parcels/CDH/bin .
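The same effect, getting the native libraries onto java.library.path without copying files into the JRE, can also be had via Spark's extraLibraryPath settings in spark-defaults.conf (a sketch; the parcel path is an assumption based on the CDH layout used above, so verify it on the cluster):
spark.driver.extraLibraryPath    /opt/cloudera/parcels/CDH/lib/hadoop/lib/native
spark.executor.extraLibraryPath  /opt/cloudera/parcels/CDH/lib/hadoop/lib/native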
Building Spark from source
org.apache.spark.examples.SparkPi
./make-distribution.sh --hadoop 2.2.0 --with-yarn --with-hive --tgz
The official example:
./bin/spark-submit --master spark://spark113:7077 \
  --class org.apache.spark.examples.SparkPi \
  --name Spark-Pi \
  --executor-memory 400M \
  --driver-memory 512M \
  /home/hadoop/spark-1.0.0/examples/target/scala-2.10/spark-examples-1.0.0-hadoop2.0.0-cdh4.5.0.jar
A good write-up on Spark parameter tuning: http://database.51cto.com/art/201407/445881.htm