关闭

[置顶] Spark1.5.2 on Hadoop2.4.0 安装配置

标签: spark环境搭建大数据hadoop
855人阅读 评论(0) 收藏 举报
分类:

一、           前言

 本文简单介绍yarn安装,主要介绍spark1.5.2on yarn模式安装,仅供参考。

 

二、           yarn配置

1.     yarne.xml

yarne.xml需要添加的配置如下:

    <property>

         <name>yarn.nodemanager.aux-services</name>

         <value>mapreduce_shuffle,spark_shuffle</value>

     </property>

     <property>

          <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>

          <value>org.apache.spark.network.yarn.YarnShuffleService</value>

      </property>

 

  如果配置了yarn.log-aggregation-enable,一定要设置yarn.log.server.url,不然spark历史任务无法查看:

      

<property> 

      <name>yarn.log-aggregation-enable</name> 

       <value>true</value>

</property>

<property> 

   <name>yarn.log.server.url</name> 

   <value>http://namenode1:19888/jobhistory/logs</value> 

     </property>

 

2.     spark-1.5.2-yarn-shuffle.jar

将spark-1.5.2-yarn-shuffle.jar拷贝所有nodemanager的classpath下面:

3.     重启所有nodemanager

三、           Spark

1.     安装hadoop 客户端

  添加如下内容到.bashrc

JAVA_HOME=/home/spark/software/java

HADOOP_HOME=/home/bigdata/software/hadoop

SPARK_HOME=/home/spark/software/spark

R_HOME=/home/spark/software/R

PATH=$R_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

export LANG JAVA_HOME HADOOP_HOME SPARK_HOME HIVE_HOME R_HOME PATH CLASSPATH

 

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export YARN_CONF_DIR=/home/yarn/software/hadoop/etc/hadoop

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_LIB_NATIVE_DIR

export PATH="/home/spark/software/anaconda/bin:$PATH"

export HADOOP_COMMON_LIB_NATIVE_DIR=/home/bigdata/software/hadoop/lib/native/Linux-amd64-64

export HADOOP_CLIENT_OPTS="-Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR"

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_LIB_NATIVE_DIR

 

 

 

2.     配置spark-defaults.conf

# This is useful for setting default environmental settings.

spark.serializer                  org.apache.spark.serializer.KryoSerializer

spark.eventLog.enabled              true

#spark.eventLog.dir                  hdfs://SuningHadoop2/sparklogs/sparklogs1.4.0

spark.eventLog.dir                   hdfs:///sparklogs/sparklogshistorylogpre

spark.driver.cores                   1

spark.driver.memory                 4096m

# Tuning parameters

spark.shuffle.consolidateFiles      true

spark.sql.shuffle.partitions        40

spark.default.parallelism           20

#spark.cores.max                     2

spark.shuffle.consolidateFiles      true

 

[spark@spark-pre1 conf]$ vim spark-defaults.conf

# Default system properties included when running spark-submit.

# This is useful for setting default environmental settings.

 

# Example:

# spark.master                     spark://master:7077

# spark.eventLog.enabled           true

# spark.eventLog.dir               hdfs://namenode:8021/directory

spark.serializer                  org.apache.spark.serializer.KryoSerializer

# spark.driver.memory              5g

# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

 

spark.eventLog.enabled              true

spark.eventLog.dir                   hdfs:///sparklogs/sparklogshistorylogpre

spark.driver.cores                  1

spark.driver.memory                 4096m

# Tuning parameters

spark.shuffle.consolidateFiles      true

spark.sql.shuffle.partitions        40

spark.default.parallelism           20

    spark.shuffle.consolidateFiles      true

 

3.     配置spark-env.sh

export SPARK_LOCAL_DIRS=/data/spark/sparkLocalDir

export HADOOP_CONF_DIR=/home/bigdata/software/hadoop/etc/hadoop

export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/bigdata/software/hadoop/lib/native/Linux-amd64-64/

export SPARK_HISTORY_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/spark/spark/logs/historyserver.hprof -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/home/spark/spark/logs/historyserver.gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=512M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=19229 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false   -Dspark.history.fs.logDirectory=hdfs://SuningHadoop2/sparklogs/sparklogshistorylogpre  -Dspark.history.ui.port=8078

 

4.     启动HistoryServer

start-history-server.sh


四、           测试脚本

yarnmode=yarn-cluster

mytime=`date '+%s'`

datasizeG=100G

inputfiles=/user/spark/TeraGen$mytime

outputfiles=/user/spark/TeraSort$mytime

Validatefiles=/user/spark/TeraValidate$mytime

numexecutors=20

executormemory=8G

export HADOOP_USER_NAME=spark

sparkqueue=spark

sparkconf="spark.default.parallelism=2000"

classTeraGen=com.github.ehiggs.spark.terasort.TeraGen

classTeraSort=com.github.ehiggs.spark.terasort.TeraSort

classTeraValidate=com.github.ehiggs.spark.terasort.TeraValidate

sourcejar=/home/spark/workspace/tersort/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar

 

spark-submit --master $yarnmode              \

             --supervise                      \

             --num-executors  $numexecutors    \

             --executor-memory $executormemory  \

             --queue  $sparkqueue               \

             --conf $sparkconf                   \

              --class $classTeraGen  $sourcejar  $datasizeG $inputfiles

beginTime=`date '+%s'`

spark-submit   --master $yarnmode              \

               --supervise                      \

               --queue  $sparkqueue              \

               --executor-memory $executormemory \

               --num-executors $numexecutors      \

               --conf $sparkconf  --class $classTeraSort $sourcejar  $inputfiles $outputfiles

endTime=`date '+%s'`

speadtime=$((endTime-beginTime))

spark-submit --master $yarnmode              \

             --supervise                      \

             --executor-memory $executormemory \

             --queue  $sparkqueue              \

             --num-executors $numexecutors      \

             --conf $sparkconf                   \

             --class $classTeraValidate $sourcejar $outputfiles $classTeraValidate

echo times:$speadtime

hadoop fs -rm -r $inputfiles

hadoop fs -rm -r $outputfiles

hadoop fs -rm -r $Validatefiles

 

1
0

查看评论
* 以上用户言论只代表其个人观点,不代表CSDN网站的观点或立场
    个人资料
    • 访问:54071次
    • 积分:1072
    • 等级:
    • 排名:千里之外
    • 原创:49篇
    • 转载:1篇
    • 译文:6篇
    • 评论:7条
    最新评论