Configure Scala:
vi /etc/profile
export SCALA_HOME=/home/bigdata/scala
export PATH=$PATH:$SCALA_HOME/bin
source /etc/profile
# scala -version
Scala code runner version 2.9.3 -- Copyright 2002-2011, LAMP/EPFL
Error:
WARN component.AbstractLifeCycle: FAILED org.eclipse.jetty.server.Server@655f272d: java.net.BindException: Address already in use
java.net.BindException: Address already in use
Fix: kill the leftover zombie spark-submit process. Find it with ps -ef | grep defunct, then kill its parent process (kill -9 <parent_pid>); a defunct (zombie) process cannot be killed directly and only disappears once its parent exits.
A program to verify that spark-shell runs correctly:
object HelloWorld {
  def main(args: Array[String]) {
    println("HelloWorld")
  }
}
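Once spark-shell is running (next step), the object can be pasted at the prompt and invoked directly to confirm the Scala side works:

scala> HelloWorld.main(Array[String]())
HelloWorld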
Run spark-shell:
# ./bin/spark-shell --master yarn-client
Press Ctrl+D to exit spark-shell.
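While the shell is up, a small distributed job is a quick sanity check that the YARN executors are actually doing work (any small RDD action will do; this should return 5050):

scala> sc.parallelize(1 to 100).reduce(_ + _)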
The package installed here is spark-1.0.0-bin-cdh4.tgz.
Edit spark-env.sh:
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_EXECUTOR_INSTANCES=3
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=500m
export SPARK_DRIVER_MEMORY=512m
export SPARK_YARN_APP_NAME=Spark
export SPARK_YARN_QUEUE=default
export SCALA_HOME=/usr/lib/spark/scala-2.9.3
spark-defaults.conf:
spark.yarn.applicationMaster.waitTries 10
spark.yarn.submit.file.replication 3
spark.yarn.preserve.staging.files true
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.max.executor.failures 6
spark.yarn.historyServer.address 192.168.10.224:18080
(spark.yarn.max.executor.failures takes a concrete number; the documented default is 2 * numExecutors, so 6 for the 3 executors configured above. spark.yarn.historyServer.address expects host:port; 18080 is the history server's default port.)
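For a standalone application, the same properties can also be set in code rather than in spark-defaults.conf. A minimal sketch (the values mirror the file above and are illustrative; adjust for your cluster):

import org.apache.spark.{SparkConf, SparkContext}

// Carry the YARN settings over programmatically instead of via spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("Spark")
  .set("spark.yarn.submit.file.replication", "3")
  .set("spark.yarn.preserve.staging.files", "true")
  .set("spark.yarn.max.executor.failures", "6")
val sc = new SparkContext(conf)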
SparkSQL
1. Prepare the data file employee.txt:
1001,sophia,1
1002,cindy,2
1003,angela,3
1004,kimi,4
1005,tiny,5
Put the data into HDFS:
# hdfs dfs -put employee.txt /user/
Show the data:
# hdfs dfs -cat /user/employee.txt
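The file can also be read back from spark-shell to confirm the path resolves (product is assumed to be the HDFS nameservice here, matching the script in step 3):

scala> sc.textFile("hdfs://product/user/employee.txt").collect().foreach(println)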
2. Start spark-shell
3. Write the script:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Employee(employeeId: Int, name: String, departmentId: Int)
// Create an RDD of Employee objects and register it as a table.
val employees = sc.textFile("hdfs://product/user/employee.txt").map(_.split(",")).map(p => Employee(p(0).trim.toInt, p(1), p(2).trim.toInt))
employees.registerAsTable("employee")
// SQL statements can be run by using the sql methods provided by sqlContext.
val fsis = sql("SELECT name FROM employee WHERE departmentId = 1")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
fsis.map(t => "Name: " + t(0)).collect().foreach(println)
4. Result:
Took 0.268434462
Name: sophia
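The employee table stays registered for the rest of the session, so further queries can be run the same way; for example, a sketch with a different predicate:

val senior = sql("SELECT name FROM employee WHERE departmentId >= 3")
// With the data above this prints Name: angela, Name: kimi, Name: tiny, one per line.
senior.map(t => "Name: " + t(0)).collect().foreach(println)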