Mastering Spark Cluster Setup and Testing
etc/hadoop/core-site.xml
# filesystem entry point (fs.defaultFS)
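A minimal core-site.xml sketch, assuming the NameNode runs on the host Master on port 9000 (the same hdfs://Master:9000 URI used for the Spark event-log directory below):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value>
  </property>
</configuration>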
hdfs-site.xml
dfs.replication
# number of block replicas
dfs.namenode.secondary.http-address
# HTTP address of the secondary (checkpoint) NameNode; it merges edit logs into the fsimage and is not a standby mirror
dfs.namenode.name.dir
# local directory for NameNode metadata
dfs.datanode.data.dir
# local directories for DataNode block storage
dfs.namenode.checkpoint.dir
# local directory for secondary NameNode checkpoints
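A matching hdfs-site.xml sketch; the local paths are placeholders, so substitute directories that exist on your nodes:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Master:50090</value>  <!-- default secondary NameNode HTTP port -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/dfs/name</value>  <!-- placeholder path -->
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/dfs/data</value>  <!-- placeholder path -->
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/usr/local/hadoop/dfs/namesecondary</value>  <!-- placeholder path -->
  </property>
</configuration>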
mapred-site.xml
mapreduce.framework.name
yarn
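The corresponding mapred-site.xml entry, telling MapReduce jobs to run on YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>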
yarn-site.xml
yarn.resourcemanager.hostname
Master
yarn.nodemanager.aux-services
mapreduce_shuffle
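A yarn-site.xml sketch with the two properties above (Master is the ResourceManager host used throughout these notes):

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>Master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>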
hadoop-env.sh
JAVA_HOME
HADOOP_HOME
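In hadoop-env.sh, set the paths explicitly; the JDK and install locations below are placeholders:

export JAVA_HOME=/usr/local/jdk1.8.0_65   # placeholder JDK path
export HADOOP_HOME=/usr/local/hadoop      # placeholder install path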
vim ~/.bashrc
HADOOP_HOME
HADOOP_CONF_DIR
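A ~/.bashrc sketch (the install path is a placeholder); run source ~/.bashrc afterwards:

export HADOOP_HOME=/usr/local/hadoop   # placeholder install path
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin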
bin/
# client commands, e.g. hadoop jar ....
sbin/
# administrative shell scripts, e.g. start-dfs.sh
Configure the workers: list one hostname per line in etc/hadoop/slaves, as shown below.
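Assuming two worker nodes (Worker1 appears in the scp step later; Worker2 is a hypothetical second node), etc/hadoop/slaves would read:

Worker1
Worker2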
Before formatting the filesystem, Hadoop must also be installed, with the same configuration, on the other machines.
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
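After starting, a quick sanity check with jps; the daemon lists below are what this configuration should produce:

jps
# expected on Master: NameNode, SecondaryNameNode, ResourceManager
# expected on each Worker: DataNode, NodeManager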
Spark installation and configuration
conf/spark-env.sh
JAVA_HOME
SCALA_HOME
HADOOP_HOME
HADOOP_CONF_DIR
SPARK_MASTER_IP
SPARK_WORKER_MEMORY=4G
SPARK_EXECUTOR_MEMORY=4G
SPARK_DRIVER_MEMORY=4G
SPARK_WORKER_CORES=8
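A conf/spark-env.sh sketch pulling the variables together; the JDK/Scala/Hadoop paths are placeholders:

export JAVA_HOME=/usr/local/jdk1.8.0_65   # placeholder
export SCALA_HOME=/usr/local/scala        # placeholder
export HADOOP_HOME=/usr/local/hadoop      # placeholder
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_IP=Master
export SPARK_WORKER_MEMORY=4G
export SPARK_EXECUTOR_MEMORY=4G
export SPARK_DRIVER_MEMORY=4G
export SPARK_WORKER_CORES=8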
conf/slaves
# Worker hostnames, one per line (same hosts as the Hadoop slaves file)
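For example, with the two nodes assumed above:

Worker1
Worker2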
conf/spark-defaults.conf
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.eventLog.enabled true
spark.eventLog.dir hdfs://Master:9000/historyserverforSpark
spark.yarn.historyServer.address Master:18080
spark.history.fs.logDirectory hdfs://Master:9000/historyserverforSpark
#spark.default.parallelism 100
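spark.eventLog.dir must point at an existing directory, so create it in HDFS before enabling event logging:

hdfs dfs -mkdir -p /historyserverforSpark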
Sync to the other machines:
scp -r ./spark-1.6.0-bin-hadoop2.6/ root@Worker1:/usr/local/spark
Add SCALA_HOME and SPARK_HOME to .bashrc, and put $SPARK_HOME/bin and $SPARK_HOME/sbin on the PATH, for example:
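(the Scala path below is a placeholder; /usr/local/spark matches the scp target above)

export SCALA_HOME=/usr/local/scala   # placeholder
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin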
sbin/start-all.sh
start-history-server.sh   # serves the recorded event logs (the Spark history UI at Master:18080)
Run:
spark-submit --class org.apache.spark.examples.SparkPi --master spark://Master:7077 ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 1000
# --class takes the fully qualified class name; the trailing 1000 is SparkPi's slice count (number of parallel tasks)
spark-submit --class org.apache.spark.examples.SparkPi --master spark://Master:7077 $SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar 10
In standalone mode, resources are allocated before the job starts and reused across its tasks: coarse-grained resource scheduling.
Spark example:
./spark-shell --master spark://Master:7077
scala> sc.textFile("/library/wordcount/input/Data").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_+_).map(pair => (pair._2, pair._1)).sortByKey(false,1).map(pair => (pair._2, pair._1)).saveAsTextFile("/library/wordcount/output/dt_spark_......")
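The same word count reformatted for readability (enter it via the shell's :paste mode); it counts words, swaps each pair to (count, word) so sortByKey can order by frequency descending, then swaps back before saving. The output path here is a placeholder:

sc.textFile("/library/wordcount/input/Data")
  .flatMap(_.split(" "))            // split each line into words
  .map(word => (word, 1))           // pair each word with a count of 1
  .reduceByKey(_ + _)               // sum the counts per word
  .map(pair => (pair._2, pair._1))  // swap to (count, word)
  .sortByKey(false, 1)              // sort by count, descending, into 1 partition
  .map(pair => (pair._2, pair._1))  // swap back to (word, count)
  .saveAsTextFile("/library/wordcount/output/result")  // placeholder output path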