Spark Installation and Deployment
Installing the Scala Environment
wget https://downloads.lightbend.com/scala/2.12.1/scala-2.12.1.tgz
tar -zxvf scala-2.12.1.tgz
mv scala-2.12.1 /home
sudo vim /etc/profile
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export JRE_HOME=$JAVA_HOME/jre
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export SCALA_HOME=/home/scala-2.12.1
export HIVE_HOME=/home/hadoop/apache-hive-1.2.1-bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$PATH
source /etc/profile
Type scala to test.
It fails with an error:
cat: /usr/lib/jvm/java-7-openjdk-amd64/release: No such file or directory
Exception in thread "main" java.lang.UnsupportedClassVersionError: scala/tools/nsc/MainGenericRunner : Unsupported major.minor version 52.0
at java.lang.ClassLoader.findBootstrapClass(Native Method)
at java.lang.ClassLoader.findBootstrapClassOrNull(ClassLoader.java:1073)
at java.lang.ClassLoader.loadClass(ClassLoader.java:414)
at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
We checked, and /usr/lib/jvm/java-7-openjdk-amd64 under JDK 1.7 indeed has no release file. More to the point, Scala 2.12 requires Java 8 (class-file version 52.0), so we upgrade to JDK 1.8.
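As an aside, the class-file major version in that error maps directly to a Java release (major version = Java version + 44). A small stdlib-only sketch, not part of the original setup, illustrating the mapping:

```scala
object ClassVersion {
  // A class file's major version equals the Java release plus 44,
  // so "Unsupported major.minor version 52.0" means the class needs Java 8,
  // while JDK 1.7 can only load classes up to major version 51.
  def javaVersionFor(major: Int): Int = major - 44

  def main(args: Array[String]): Unit = {
    println(s"major 52 -> Java ${javaVersionFor(52)}") // Scala 2.12 classes
    println(s"major 51 -> Java ${javaVersionFor(51)}") // JDK 1.7's limit
  }
}
```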
Download from the Oracle site (here via a local mirror):
wget http://192.168.97.99/cache/3/02/oracle.com/94f24c6b67f3e2aaed2a018eb4b2ea5e/jdk-8u121-linux-x64.tar.gz
sudo tar xvf jdk-8u121-linux-x64.tar.gz
Since changing all the environment variables would touch many places, we simply replace the contents of the /usr/lib/jvm/java-7-openjdk-amd64 directory with the JDK 1.8 files.
java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
The version is now 1.8.
Test scala again:
Welcome to Scala 2.12.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121).
Type in expressions for evaluation. Or try :help.
scala>
Success.
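With the REPL up, any small expression serves as a smoke test; the body of the following (a sketch of mine, not from the original log) can be typed directly at the scala> prompt:

```scala
object ReplSmokeTest {
  def main(args: Array[String]): Unit = {
    // double each number 1..10 and sum: 2 + 4 + ... + 20
    val doubledSum = (1 to 10).map(_ * 2).sum
    println(doubledSum) // 110
  }
}
```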
Installing Spark
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.6.tgz
tar zxvf spark-2.1.0-bin-hadoop2.6.tgz -C /home
sudo vim /etc/profile
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export JRE_HOME=$JAVA_HOME/jre
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export SPARK_HOME=/home/spark-2.1.0-bin-hadoop2.6
export SCALA_HOME=/home/scala-2.12.1
export HIVE_HOME=/home/hadoop/apache-hive-1.2.1-bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin:$PATH
source /etc/profile
Test with spark-shell:
It works.
cd sbin
./start-all.sh
Modify the configuration:
cd spark-2.1.0-bin-hadoop2.6/conf
Rename the template files to spark-env.sh and slaves (i.e. drop the .template suffix from spark-env.sh.template and slaves.template).
spark-env.sh contents:
export SCALA_HOME=/home/scala-2.12.1
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export STANDALONE_SPARK_MASTER_HOST=master
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
### Let's run everything with JVM runtime, instead of Scala
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_WEBUI_PORT=18080
#export SPARK_MASTER_PORT=7077
#export SPARK_WORKER_PORT=7078
#export SPARK_WORKER_WEBUI_PORT=18081
#export SPARK_WORKER_DIR=/var/run/spark/work
#export SPARK_LOG_DIR=/var/log/spark
#export SPARK_PID_DIR='/var/run/spark/'
if [ -n "$HADOOP_HOME" ]; then
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
slaves contents:
slave1
slave2
Running an Example
From the bin directory:
./run-example SparkPi
The computed result:
17/04/06 16:53:23 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.709367 s
Pi is roughly 3.1399556997784988
17/04/06 16:53:23 INFO server.ServerConnector: Stopped ServerConnector@156d6753{HTTP/1.1}{0.0.0.0:4040}
17/04/06 16:53:23 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7f811d00{/stages/stage/kill,null,UNAVAILABLE}
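For reference, SparkPi's estimate comes from Monte Carlo sampling: draw random points in the unit square and count how many fall inside the unit circle. A serial, Spark-free sketch of the same idea (the bundled example distributes the sampling across executors):

```scala
object LocalPi {
  // Estimate Pi by sampling n points in [-1, 1] x [-1, 1]:
  // the fraction landing inside the unit circle approaches Pi / 4.
  def estimate(n: Int, seed: Long = 42L): Double = {
    val rnd = new scala.util.Random(seed)
    val inside = (1 to n).count { _ =>
      val x = rnd.nextDouble() * 2 - 1
      val y = rnd.nextDouble() * 2 - 1
      x * x + y * y <= 1
    }
    4.0 * inside / n
  }

  def main(args: Array[String]): Unit =
    println(s"Pi is roughly ${estimate(1000000)}")
}
```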
Starting spark-shell on the Cluster
spark-shell --master spark://master:7077
Web UI at http://master:4040
spark-shell is Spark's bundled interactive shell, convenient for interactive programming: you can write Spark programs in Scala directly at its prompt.
spark-shell --master spark://master:7077 --executor-memory 2g --total-executor-cores 2
Parameter notes:
--master spark://master:7077 specifies the Master address.
--executor-memory 2g gives each worker 2 GB of usable memory; on our current cluster the job will not start with that much, so change it to 512m.
--total-executor-cores 2 sets the total number of CPU cores the job uses to 2.
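These flags map to standard Spark configuration properties, so a standalone application could set them in code instead. A configuration-only sketch (the property keys are Spark's standard ones; the values mirror the working setup above, and the app name is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical programmatic equivalent of the spark-shell flags above.
val conf = new SparkConf()
  .setAppName("shell-equivalent")       // --name
  .setMaster("spark://master:7077")     // --master
  .set("spark.executor.memory", "512m") // --executor-memory (2g won't fit here)
  .set("spark.cores.max", "2")          // --total-executor-cores
val sc = new SparkContext(conf)
```

Note that spark-shell creates its own SparkContext, so this style only applies to applications submitted with spark-submit.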
Create an HDFS directory:
hadoop fs -mkdir /input
Put a file into HDFS:
hadoop fs -put /root/file/file1.txt /input
List the HDFS files:
hadoop fs -ls /input
Spark WordCount
Read a file, either from the local filesystem or from HDFS:
scala> val file = sc.textFile("file:///usr/local/spark/README.md")
scala> val file = sc.textFile("hdfs://master:55555/input/file1.txt")
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> count.collect()
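The RDD chain above behaves like the same operations on plain Scala collections; a Spark-free sketch of the word-count logic (reduceByKey has no direct collections counterpart, so groupBy plus a per-key sum stands in for it):

```scala
object LocalWordCount {
  // flatMap and map work identically on RDDs and Scala collections;
  // reduceByKey(_ + _) is emulated by groupBy on the word
  // followed by summing each group's counts.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("spark hadoop spark", "hive spark")))
}
```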
Loading Spark Properties Dynamically
In some situations you may want to avoid hard-coding certain settings in SparkConf, for example to run the same application with a different master or a different amount of memory. Spark lets you simply create an empty conf:
val sc = new SparkContext(new SparkConf())
Then set the values at runtime:
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
spark-submit --master spark://master:7077 --name JsonTanZhen --class JsonTanZhen --executor-memory 1G --total-executor-cores 2 --jars /home/examples/mysql.jar /home/examples/JsonTanZhen.jar hdfs://master:55555/input/data*.txt
spark-submit --master spark://master:7077 --name RDDToMysql --class RDDtoMysql --executor-memory 1G --total-executor-cores 2 --jars /home/examples/mysql.jar /home/examples/RDDtoMysql.jar hdfs://master:55555/input/spark.txt
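Note that everything after the application jar on those spark-submit lines is passed straight to the program's main() as args. A hypothetical skeleton showing how the trailing HDFS path arrives (the real JsonTanZhen/RDDtoMysql classes are not shown in this log):

```scala
object SubmitArgs {
  // spark-submit forwards anything after the application jar to main();
  // here that is a single trailing HDFS input path.
  def inputPath(args: Array[String]): String =
    args.headOption.getOrElse(sys.error("usage: <app> <hdfs input path>"))

  def main(args: Array[String]): Unit =
    println(s"reading from ${inputPath(args)}")
}
```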
2017-04-29: tripped up again today.
I had been submitting with --master spark://master:55555 the whole time, and it kept failing. Frustrating.
After some searching, I modified yarn-site.xml and added the following:
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Then re-formatted and restarted.
Still broken, because the problem was never there in the first place.
The submission port should be 7077:
--master spark://master:7077