Personal notes: supplementary items I had missed.
Spark setup
http://spark.apache.org/docs/latest/cluster-overview.html
wget http://www.scala-lang.org/files/archive/scala-2.11.6.tgz
tar xvf scala-2.11.6.tgz
sudo mv scala-2.11.6 /usr/local/scala
sudo gedit ~/.bashrc
#SCALA Variables
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
#SCALA Variables
source ~/.bashrc
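To confirm the Scala variables took effect, a quick check after sourcing ~/.bashrc (this only prints the installed version):
scala -version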
wget https://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.6.tgz
tar zxf spark-1.4.0-bin-hadoop2.6.tgz
sudo mv spark-1.4.0-bin-hadoop2.6 /usr/local/spark
sudo gedit ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
source ~/.bashrc
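A similar sanity check that spark-submit is now on the PATH; it prints the Spark version banner and exits:
spark-submit --version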
cd /usr/local/spark/conf
cp log4j.properties.template log4j.properties
sudo gedit log4j.properties
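The edit usually wanted here is to cut spark-shell's INFO chatter. In the copied log4j.properties, change the root logger line from INFO to WARN:
# before: log4j.rootCategory=INFO, console
log4j.rootCategory=WARN, console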
spark-shell --master local[4]
val textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count
val textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count
val textFile=sc.textFile("hdfs://master:9000/user/hduser/tests/5000-8.txt")
SPARK_JAR=/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client /usr/local/spark/bin/spark-shell
cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
sudo gedit /usr/local/spark/conf/spark-env.sh
export SPARK_MASTER_IP=master        # hostname the standalone master binds to
export SPARK_WORKER_CORES=1          # CPU cores each worker may use
export SPARK_WORKER_MEMORY=800m      # memory each worker may use
export SPARK_WORKER_INSTANCES=2      # worker processes to launch per node
ssh data1
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data1:/usr/local
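data2 and data3 need the same directory and the same copy. A sketch, assuming passwordless ssh for hduser and a tty for the remote sudo prompt (ssh -t):
for host in data2 data3; do
  ssh -t hduser@$host "sudo mkdir /usr/local/spark && sudo chown hduser:hduser /usr/local/spark"
  sudo scp -r /usr/local/spark hduser@$host:/usr/local
done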
sudo gedit /usr/local/spark/conf/slaves
data1
data2
data3
/usr/local/spark/sbin/start-all.sh
http://master:8080/
/usr/local/spark/sbin/stop-all.sh
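With the standalone cluster running, spark-shell can attach to the master directly (7077 is the standalone master's default port):
spark-shell --master spark://master:7077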
package com.taobao.moxing

class Demo {
  def doStart(name: String): Unit =
    println("hello Scala " + name)
}
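A minimal entry point to exercise the class (the object name DemoApp is my own):
object DemoApp {
  def main(args: Array[String]): Unit =
    new Demo().doStart("Spark")
}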
1. Pick one host machine and download the virtual machine software onto it.
2. In the virtual machines, create one master host (master) and three data nodes (data1, data2, data3), then build the Hadoop and Spark stack across the four VMs.
3. Submit the medical big-data Python script to the Spark cluster with spark-submit, choosing standalone mode for fully distributed computation (see the sketch below).
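For step 3, a sketch of the submit command in standalone mode; the script name medical_analysis.py is a placeholder for the actual file:
spark-submit --master spark://master:7077 medical_analysis.py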