Installing Scala
Unpack the archive: tar -zxvf scala-2.9.2.tgz
Add the following lines to ~/.bashrc or ~/.profile:
export SCALA_HOME="/opt/scala"
export PATH="${SCALA_HOME}/bin:${JAVA_HOME}/bin:${PATH}"
Then run: $ source ~/.bashrc
Test whether Scala was installed successfully:
$ scala
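Beyond starting the REPL, a quick way to confirm the whole toolchain works is to run a small standalone script. The snippet below is a minimal sketch (the object name is illustrative) that prints the version reported by the Scala runtime:

```scala
// Sanity check for the Scala installation: print the runtime's version string.
// Save as VersionCheck.scala and run with `scala VersionCheck.scala`.
object VersionCheck {
  def main(args: Array[String]): Unit = {
    // versionString looks like "version 2.9.2"
    println("Scala " + scala.util.Properties.versionString)
  }
}
```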
Installing Spark 0.6.1
Spark requires Scala 2.9.2. You will need to have Scala's bin directory in your PATH, or you will need to set the SCALA_HOME environment variable to point to where you've installed Scala. Scala must also be accessible through one of these methods on slave nodes on your cluster.
Spark uses Simple Build Tool, which is bundled with it. To compile the code, go into the top-level Spark directory and run
sbt/sbt package
Testing the Build
Spark comes with a number of sample programs in the examples directory. To run one of the samples, use ./run <class> <params> in the top-level Spark directory (the run script sets up the appropriate paths and launches that program). For example, ./run spark.examples.SparkPi will run a sample program that estimates Pi. Each of the examples prints usage help if no params are given.
Note that all of the sample programs take a <master> parameter specifying the cluster URL to connect to. This can be a URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.
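The SparkPi example mentioned above estimates Pi by Monte Carlo sampling: it throws random points into the unit square and counts how many land inside the quarter circle. A self-contained plain-Scala sketch of that idea, without Spark's parallelism (the object and method names here are illustrative, not from the Spark source):

```scala
import scala.util.Random

// Monte Carlo Pi estimate: the fraction of uniform random points in the
// unit square that fall inside the quarter circle approaches Pi/4.
object PiSketch {
  def estimatePi(samples: Int, seed: Long = 42L): Double = {
    val rng = new Random(seed)
    var inside = 0
    for (_ <- 1 to samples) {
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      if (x * x + y * y < 1.0) inside += 1
    }
    4.0 * inside / samples
  }

  def main(args: Array[String]): Unit = {
    println("Pi is roughly " + estimatePi(100000))
  }
}
```

The Spark version distributes the sampling loop across workers; the estimate itself is the same arithmetic.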
Finally, Spark can be used interactively from a modified version of the Scala interpreter that you can start through ./spark-shell. This is a great way to learn Spark.
Running Spark on Mesos
Spark can run on private clusters managed by the Apache Mesos resource manager. Follow the steps below to install Mesos and Spark:
- Download and build Spark using the instructions here.
- Download Mesos 0.9.0-incubating from a mirror.
- Configure Mesos using the configure script, passing the location of your JAVA_HOME using --with-java-home. Mesos comes with "template" configure scripts for different platforms, such as configure.macosx, that you can run. See the README file in Mesos for other options. Note: If you want to run Mesos without installing it into the default paths on your system (e.g. if you don't have administrative privileges to install it), you should also pass the --prefix option to configure to tell it where to install. For example, pass --prefix=/home/user/mesos. By default the prefix is /usr/local.
- Build Mesos using make, and then install it using make install.
- Create a file called spark-env.sh in Spark's conf directory, by copying conf/spark-env.sh.template, and add the following lines to it:
  - export MESOS_NATIVE_LIBRARY=<path to libmesos.so>. This path is usually <prefix>/lib/libmesos.so (where the prefix is /usr/local by default). Also, on Mac OS X, the library is called libmesos.dylib instead of .so.
  - export SCALA_HOME=<path to Scala directory>.
- Copy Spark and Mesos to the same paths on all the nodes in the cluster (or, for Mesos, make install on every node).
- Configure Mesos for deployment:
  - On your master node, edit <prefix>/var/mesos/deploy/masters to list your master and <prefix>/var/mesos/deploy/slaves to list the slaves, where <prefix> is the prefix where you installed Mesos (/usr/local by default).
  - On all nodes, edit <prefix>/var/mesos/conf/mesos.conf and add the line master=HOST:5050, where HOST is your master node.
  - Run <prefix>/sbin/mesos-start-cluster.sh on your master to start Mesos. If all goes well, you should see Mesos's web UI on port 8080 of the master machine.
  - See Mesos's README file for more information on deploying it.
- To run a Spark job against the cluster, when you create your SparkContext, pass the string mesos://HOST:5050 as the first parameter, where HOST is the machine running your Mesos master. In addition, pass the location of Spark on your nodes as the third parameter, and a list of JAR files containing your JAR's code as the fourth (these will automatically get copied to the workers). For example:
new SparkContext("mesos://HOST:5050", "My Job Name", "/home/user/spark", List("my-job.jar"))
Running the SparkKMeans algorithm on Mesos
Start the Mesos services on every node and check in the web UI that all the slaves have registered. Start Hadoop alongside, upload kmeansdata.txt to HDFS, then enter the Spark directory on the master and run the KMeans algorithm:
./run spark.examples.SparkKMeans 192.168.1.130:5050 hdfs://master:9000/user/liu/testdata/kmeansdata.txt 8 2.0
Remember to set the following environment variables:
export JAVA_HOME=$HOME/jdk1.7.0_05
export HADOOP_VERSION=1.0.4
export HADOOP_HOME=$HOME/hadoop-$HADOOP_VERSION
export SCALA_HOME=$HOME/scala-2.9.2
export MESOS_HOME=$HOME/mesos-0.9.0
export MESOS_NATIVE_LIBRARY=$MESOS_HOME/src/.libs/libmesos.so
export SPARK_HOME=$HOME/spark-0.6.1
export LD_LIBRARY_PATH=$MESOS_HOME/src/.libs
export CLASSPATH=/home/hadoop/spark-0.6.1/core/target/spark-core-assembly-0.6.1.jar:.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$SCALA_HOME/bin
Note:
The step above is not very specific, so here we use the SparkKMeans.scala bundled with Spark as an example of how to compile and run a program. The following steps only need to be performed on the master node. See also the Spark programming guide.
First build Spark and its dependencies into a jar (core/target/spark-core-assembly-0.6.1.jar):
sbt/sbt assembly
Add this jar to the CLASSPATH:
export CLASSPATH=/home/hadoop/spark-0.6.1/core/target/spark-core-assembly-0.6.1.jar:.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
Add the following statements to your Scala program file:
import spark.SparkContext
import SparkContext._
Compile the Scala program:
scalac SparkKMeans.scala
Run the compiled SparkKMeans program:
scala spark.examples.SparkKMeans mesos://192.168.1.130:5050 hdfs://192.168.1.130:9000/dataset/Square-10m.txt 8 2.0
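The core of SparkKMeans is Lloyd's iteration: assign each point to its closest center, recompute each center as the mean of its assigned points, and stop once the centers move less than the convergence threshold (the 2.0 argument above). A self-contained sketch of that loop in plain Scala, without Spark (the object and method names are illustrative, not the Spark example's exact code):

```scala
// Lloyd's k-means iteration on in-memory points, mirroring the structure
// of the SparkKMeans example: closest-center assignment + mean update,
// repeated until the total squared movement falls below convergeDist.
object KMeansSketch {
  type Point = Array[Double]

  def squaredDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def closest(p: Point, centers: Array[Point]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  def kmeans(points: Array[Point], initial: Array[Point], convergeDist: Double): Array[Point] = {
    var centers = initial.map(_.clone)
    var moved = Double.MaxValue
    while (moved > convergeDist) {
      // Group points by their nearest center.
      val groups = points.groupBy(p => closest(p, centers))
      val newCenters = centers.clone
      for ((i, ps) <- groups) {
        // New center = coordinate-wise mean of the assigned points.
        newCenters(i) = ps.transpose.map(col => col.sum / ps.length)
      }
      moved = centers.zip(newCenters).map { case (a, b) => squaredDist(a, b) }.sum
      centers = newCenters
    }
    centers
  }
}
```

In the Spark version the grouping and mean computation run as distributed map/reduce operations over an RDD loaded from HDFS, but the per-iteration logic is the same.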
How to Write a Spark Program
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. This is done through the following constructor:
new SparkContext(master, jobName, [sparkHome], [jars])
The master parameter is a string specifying a Mesos cluster to connect to, or a special "local" string to run in local mode, as described below. jobName is a name for your job, which will be shown in the Mesos web UI when running on a cluster. Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described later.
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work. You can set which master the context connects to using the MASTER environment variable. For example, to run on four cores, use
$ MASTER=local[4] ./spark-shell
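You can mimic the shell's behavior in your own programs: read MASTER from the environment and fall back to local. A minimal sketch (the defaulting logic is an assumption for illustration, not Spark's exact implementation):

```scala
// Pick the cluster URL the way spark-shell does: from the MASTER
// environment variable, defaulting to "local" for a single-threaded run.
object MasterFromEnv {
  def masterUrl: String = sys.env.getOrElse("MASTER", "local")

  def main(args: Array[String]): Unit = {
    println("Using master: " + masterUrl)
    // In a real Spark program you would then construct the context:
    // new SparkContext(masterUrl, "My Job Name")
  }
}
```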