PySpark Installation and Common Problems

Set up a JDK, Scala, Hadoop, Spark, Hive, MySQL and PySpark cluster (on YARN).

See http://blog.csdn.net/bailu66/article/details/53863693
See https://www.cnblogs.com/K-artorias/p/7141479.html
See https://www.cnblogs.com/boshen-hzb/p/5889633.html
See http://blog.csdn.net/w12345_ww/article/details/51910030
See http://www.cnblogs.com/lonenysky/p/6775876.html
See https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/

Passwordless SSH login (omitted here)

Two hosts, named slave1 and slave2 in /etc/hosts.

Install Python

wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda2-4.2.0-Linux-x86_64.sh
bash Anaconda2-4.2.0-Linux-x86_64.sh

Configure environment variables

vi /etc/profile

Add the following:

export JAVA_HOME=/usr/local/jdk1.8.0_151
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

export HADOOP_HOME=/usr/local/hadoop-2.8.2
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native/"
export CLASSPATH=$CLASSPATH:/usr/local/hadoop-2.8.2/lib/*:. 
export HADOOP_CONF_DIR=/usr/local/hadoop-2.8.2/etc/hadoop
export LD_LIBRARY_PATH=/usr/local/hadoop-2.8.2/lib/native:$LD_LIBRARY_PATH

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

export SPARK_HOME=/usr/local/spark-2.1.2
export PYSPARK_PYTHON=/root/anaconda2/bin/python
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

export SCALA_HOME=/usr/local/scala-2.11.12
export PATH=$PATH:$SCALA_HOME/bin

export HIVE_HOME=/usr/local/hadoop-2.8.2/hive  
export PATH=$PATH:$HIVE_HOME/bin   
export CLASSPATH=$CLASSPATH:/usr/local/hadoop-2.8.2/hive/lib/*:. 

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH

Then apply the changes:

source /etc/profile

Server-side configuration

Prepare the files

1)jdk8u151
2)hive2.3.2 wget http://mirrors.shuosc.org/apache/hive/stable-2/apache-hive-2.3.2-bin.tar.gz
3)spark2.1.2 wget http://mirrors.shuosc.org/apache/spark/spark-2.1.2/spark-2.1.2-bin-hadoop2.7.tgz
4)mysql-connector-java-5.1.41-bin.jar
5)scala2.11.12 wget https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
6)hadoop2.8.2 wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz

Extract the archives

1)tar -C /usr/local/ -xzf jdk-8u151-linux-x64.tar.gz
2)tar -C /usr/local/ -xzf hadoop-2.8.2.tar.gz
3)tar zxvf spark-2.1.2-bin-hadoop2.7.tgz
  mv spark-2.1.2-bin-hadoop2.7  /usr/local/spark-2.1.2
4)tar zxvf scala-2.11.12.tgz
  mv scala-2.11.12  /usr/local/
5)tar zxvf apache-hive-2.3.2-bin.tar.gz
  mv apache-hive-2.3.2-bin /usr/local/hadoop-2.8.2/hive
6)mv mysql-connector-java-5.1.41-bin.jar /usr/local/hadoop-2.8.2/hive/lib/

Install MySQL

sudo apt-get install mysql-server
sudo apt-get install mysql-client
sudo apt-get install libmysqlclient-dev

The installation will prompt you to set a password for the MySQL root user; do not skip it. Then check that the installation succeeded:

sudo netstat -tap | grep mysql

Log in to verify:

mysql -uroot -p

Create the Hive user, database, etc. in MySQL

create user 'hivespark' identified by 'hivespark';
create database hivespark;
grant all on hivespark.* to hivespark@'%'  identified by 'hivespark';
grant all on hivespark.* to hivespark@'localhost'  identified by 'hivespark';
flush privileges;

Modify the Hadoop configuration files
See https://www.cnblogs.com/ggjucheng/archive/2012/04/17/2454590.html

cd /usr/local/hadoop-2.8.2/etc/hadoop
vim core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- NameNode address -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
<!-- Directory where Hadoop stores files generated at runtime -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:///data/hadoop/data/tmp</value>
 </property>        
</configuration>
vim hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- HDFS replication factor -->
  <property>
        <name>dfs.replication</name>
        <value>2</value>
  </property>
<!-- Disable HDFS permission checking -->
  <property>
         <name>dfs.permissions</name>
         <value>false</value>
  </property>
<!-- Secondary NameNode web UI port -->
  <property>
         <name>dfs.namenode.secondary.http-address</name>
         <value>master:50090</value>
  </property>
 <!-- NameNode web UI port -->

  <property>
    <name>dfs.namenode.http-address</name>
    <value>master:50070</value>
  </property>
<!-- Where the DataNode actually stores its data blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/data/datanode</value>
  </property>
<!-- Where the NameNode stores its metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/data/namenode</value>
  </property>
<!-- Where edit files are stored -->
  <property>
    <name>dfs.namenode.edits.dir</name>
    <value>file:///data/hadoop/data/edits</value>
  </property>
<!-- Directory where the Secondary NameNode stores checkpoint files -->
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///data/hadoop/data/checkpoints</value>
  </property>
<!-- Directory where the Secondary NameNode stores edits files -->
  <property>
    <name>dfs.namenode.checkpoint.edits.dir</name>
    <value>file:///data/hadoop/data/checkpoints/edits</value>
  </property>


</configuration>
vi mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- Run MapReduce on YARN -->
  <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
  </property>
<!-- JobHistory server web UI address -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
  </property>
<!-- JobHistory server IPC address -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
  </property>
<!-- Uber task mode -->
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>false</value>
  </property>

<!-- Staging directory used while jobs are running -->
    <property>
        <name>yarn.app.mapreduce.am.staging-dir</name>
        <value>hdfs://master:9000/tmp/hadoop-yarn/staging</value>
        <description>The staging dir used while submitting jobs.</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
    </property>
    <!-- Where logs managed by the MR JobHistory Server are kept -->
    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
    </property>

<property>
        <name>mapreduce.map.memory.mb</name>
        <value>2048</value>
</property>

<property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>4096</value>
</property>
</configuration>
vi yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Disable YARN ACLs and enable the MapReduce shuffle auxiliary service on the NodeManagers -->
  <property>
         <name>yarn.acl.enable</name>
         <value>0</value>
  </property>
  <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
  </property>
<!-- Which host runs the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property> 
 <!-- ResourceManager web UI address -->
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
 <!-- Enable log aggregation -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
<!-- How long (in seconds) aggregated logs are kept on HDFS -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>86400</value>
  </property> 


<property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>2648</value>
</property>

<property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
</property>

<property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
</property>
<property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>4096</value>
</property>
<property>
        <name>yarn.scheduler.increment-allocation-mb</name>
        <value>512</value>
</property>
<property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
</property>
<property>  
    <name>yarn.nodemanager.pmem-check-enabled</name>  
    <value>false</value>  
</property>  
<property>  
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/var/log/hadoop-yarn/apps</value>
</property>
</configuration>
vi slaves
slave2
slave1
vi hadoop-env.sh
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
# JDK installation to use
export JAVA_HOME=/usr/local/jdk1.8.0_151

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options.  Empty by default.
#export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native"  
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by 
#       the user that will run the hadoop daemons.  Otherwise there is the
#       potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

Copy the configuration to all nodes

scp -r /usr/local/hadoop-2.8.2/* root@slave2:/usr/local/hadoop-2.8.2/

Verify

hadoop namenode -format
start-dfs.sh
http://192.81.212.100:50070/dfshealth.html#tab-overview    (view HDFS status)
start-yarn.sh
yarn node -list
yarn application -list
http://192.81.212.100:8088/cluster    (view YARN status)
hdfs dfs -mkdir /books
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8
wget -O frankenstein.txt https://www.gutenberg.org/ebooks/84.txt.utf-8
hdfs dfs -put alice.txt holmes.txt /books/
hdfs dfs -put frankenstein.txt /books/
hdfs dfs -ls /books
yarn jar /usr/local/hadoop-2.8.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount "/books/*" output
hdfs dfs -cat output/part-r-00000
stop-yarn.sh
stop-dfs.sh


start-all.sh    Start all Hadoop daemons: NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker
stop-all.sh    Stop all Hadoop daemons: NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker
start-dfs.sh    Start the Hadoop HDFS daemons: NameNode, SecondaryNameNode and DataNode
stop-dfs.sh    Stop the Hadoop HDFS daemons: NameNode, SecondaryNameNode and DataNode
hadoop-daemons.sh start namenode    Start only the NameNode daemon
hadoop-daemons.sh stop namenode    Stop only the NameNode daemon
hadoop-daemons.sh start datanode    Start only the DataNode daemon
hadoop-daemons.sh stop datanode    Stop only the DataNode daemon
hadoop-daemons.sh start secondarynamenode    Start only the SecondaryNameNode daemon
hadoop-daemons.sh stop secondarynamenode    Stop only the SecondaryNameNode daemon
start-mapred.sh    Start the Hadoop MapReduce daemons: JobTracker and TaskTracker
stop-mapred.sh    Stop the Hadoop MapReduce daemons: JobTracker and TaskTracker
hadoop-daemons.sh start jobtracker    Start only the JobTracker daemon
hadoop-daemons.sh stop jobtracker    Stop only the JobTracker daemon
hadoop-daemons.sh start tasktracker    Start only the TaskTracker daemon
hadoop-daemons.sh stop tasktracker    Stop only the TaskTracker daemon
(JobTracker/TaskTracker and start-mapred.sh belong to Hadoop 1.x; on Hadoop 2.x with YARN the equivalents are the ResourceManager/NodeManager and start-yarn.sh/stop-yarn.sh.)

Once everything has started successfully, you can open the web UI at http://localhost:50070 to view NameNode and DataNode information and browse files in HDFS.

Modify the Spark configuration files (server side)

cd /usr/local/spark-2.1.2/conf
mv spark-env.sh.template spark-env.sh
vi spark-env.sh
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of executors to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
export JAVA_HOME=/usr/local/jdk1.8.0_151
export SCALA_HOME=/usr/local/scala-2.11.12
export SPARK_HOME=/usr/local/spark-2.1.2
# Spark master IP (standalone mode only; not needed on YARN)
#export SPARK_MASTER_IP=master
# Spark master port (standalone mode only)
#export SPARK_MASTER_PORT=7077
#export SPARK_MASTER_WEBUI_PORT=18080
export HADOOP_CONF_DIR=/usr/local/hadoop-2.8.2/etc/hadoop
#export SPARK_LOCAL_IP=slave1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/hadoop-2.8.2/lib/native

Modify spark-defaults.conf

spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master:9000/usr/spark/eventLogging
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory  512m
spark.yarn.am.memory    2048m
spark.executor.memory          1536m
spark.yarn.jars hdfs://master:9000/spark-jars/*
spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory     hdfs://master:9000/usr/spark/eventLogging
spark.history.fs.update.interval  10s
spark.history.ui.port             18080
vi slaves
slave1
slave2

Upload the jars to HDFS

hdfs dfs -mkdir -p /usr/spark/eventLogging
Copy all jars under $SPARK_HOME/jars to hdfs://master:9000/spark-jars/ .
Also download kryo-3.0.3.jar, asm-5.0.3.jar, minlog-1.3.0.jar, objenesis-2.1.jar and reflectasm-1.10.1.jar and place them alongside the Spark jars before uploading.
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put $SPARK_HOME/jars/* /spark-jars

Copy the configuration to all nodes

scp -r spark-2.1.2/* root@slave2:/usr/local/spark-2.1.2/

Verify

Prerequisite: start-dfs.sh and start-yarn.sh have already been run.
Check with jps that the NameNode, DataNode and the other daemons are up.
$SPARK_HOME/sbin/start-history-server.sh
spark-submit --deploy-mode client \
               --class org.apache.spark.examples.SparkPi \
               $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.2.jar 10
http://master:18080    (view the Spark history server)
yarn logs -applicationId application_xxxxx    (view the logs: output, errors, etc.)
yarn node -list    (check the number of nodes)

Hive configuration files

cd /usr/local/hadoop-2.8.2/hive/conf
cp hive-env.sh.template hive-env.sh

Create directories on HDFS to hold Hive data, and grant them 777 permissions

start-dfs.sh
hdfs dfs -mkdir -p /usr/hive/warehouse
hdfs dfs -mkdir -p /usr/hive/tmp
hdfs dfs -mkdir -p /usr/hive/log
hdfs dfs -mkdir -p /usr/hive/download
hdfs dfs -chmod -R 777 /usr/hive/warehouse
hdfs dfs -chmod -R 777 /usr/hive/tmp 
hdfs dfs -chmod -R 777 /usr/hive/log
hdfs dfs -chmod -R 777 /usr/hive/download

Edit hive-env.sh

export JAVA_HOME=/usr/local/jdk1.8.0_151
export HADOOP_HOME=/usr/local/hadoop-2.8.2
export HIVE_HOME=/usr/local/hadoop-2.8.2/hive
export HIVE_CONF_DIR=/usr/local/hadoop-2.8.2/hive/conf 
export  HIVE_AUX_JARS_PATH=/usr/local/hadoop-2.8.2/hive/lib

vi hive-site.xml and add the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
       <name>hive.metastore.uris</name>
       <value>thrift://master:9083</value>
    </property>
    <property>
       <name>hive.execution.engine</name>
       <value>spark</value>
    </property>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>/usr/hive/tmp</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/usr/hive/warehouse</value>
    </property>
    <property>
        <name>hive.querylog.location</name>
        <value>/usr/hive/log</value>
    </property>
    <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/usr/hive/download</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://master:3306/hivespark?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hivespark</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hivespark</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
</configuration>
cp /usr/local/spark-2.1.2/jars/spark-* /usr/local/hadoop-2.8.2/hive/lib/
cp /usr/local/spark-2.1.2/jars/scala-* /usr/local/hadoop-2.8.2/hive/lib/
cp /usr/local/hadoop-2.8.2/hive/conf/hive-site.xml /usr/local/spark-2.1.2/conf/

Copy to every machine

scp -r hadoop-2.8.2/* root@slave2:/usr/local/hadoop-2.8.2/
scp -r spark-2.1.2/* root@slave2:/usr/local/spark-2.1.2/
scp -r hive/* root@slave2:/usr/local/hadoop-2.8.2/hive/

Initialize Hive. For Hive 2.0 and later, the initialization commands are (on the server):

Edit /etc/mysql/mysql.conf.d/mysqld.cnf (the path may differ on your system) and change
bind-address            = 127.0.0.1
to
bind-address            = x.x.x.x    (your own IP)
/etc/init.d/mysql restart
schematool -initSchema -dbType mysql 

Once initialization succeeds you can run Hive; check that it works:

hive --service metastore &    (start the Hive metastore service on the server; HDFS must be running first)
hive
use default;
show tables;
create table test(key string);
select count(*) from test;
exit;
hdfs dfs -ls /usr/hive/warehouse
mysql -uhivespark -phivespark
use hivespark;
select TBL_NAME from TBLS;

Test PySpark

pyspark
data=[1,2,3,4,5]
distData=sc.parallelize(data)
distData.first()
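
If the /books data uploaded during the Hadoop verification step is still in HDFS, a slightly larger smoke test can be run from the same shell (a sketch; it assumes the cluster and paths configured above):

lines = sc.textFile("hdfs://master:9000/books/*")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))   # ten most frequent words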

Test spark-submit with a Python script

./bin/spark-submit examples/src/main/python/pi.py
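
pi.py ships with Spark. For a script of your own, the only difference from the interactive shell is that the SparkContext has to be created explicitly; a minimal sketch (the file name estimate_pi.py is hypothetical, and it is submitted the same way as pi.py):

# estimate_pi.py -- Monte Carlo estimate of pi, submitted with:
#   ./bin/spark-submit estimate_pi.py
from random import random
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("estimate-pi"))

n = 100000
def inside(_):
    x, y = random(), random()
    return 1 if x * x + y * y < 1 else 0

count = sc.parallelize(range(n)).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()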

Connecting to Hive from PySpark

View Hive's data on HDFS through the web UI:
http://master:50070/explorer.html#/usr/hive/warehouse/
Check with jps that DFS, YARN and the history server are running.
kill -9 PID(history-server)
kill -9 PID(RunJar)
stop-yarn.sh
stop-dfs.sh
start-dfs.sh
start-yarn.sh
bash /usr/local/spark-2.1.2/sbin/start-history-server.sh
hive --service metastore &    (start the Hive metastore)
Test:
pyspark
from pyspark.sql import HiveContext 
sqlContext = HiveContext(sc) 
my_dataframe = sqlContext.sql("Select count(*) from test") 
my_dataframe.show() 
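
In Spark 2.x the pyspark shell also exposes a ready-made SparkSession named spark, and HiveContext is kept mainly for backward compatibility. In a standalone script the equivalent would be built explicitly (a minimal sketch, assuming the metastore and hive-site.xml setup above):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-smoke-test")
         .enableHiveSupport()          # picks up hive-site.xml from $SPARK_HOME/conf
         .getOrCreate())
spark.sql("show tables").show()
spark.sql("select count(*) from test").show()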

Spark tutorials

https://spark.apache.org/docs/latest/rdd-programming-guide.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
https://spark.apache.org/mllib/

Using Hive

See http://blog.csdn.net/strongyoung88/article/details/53743937

YARN cluster parameter tuning

See http://blog.csdn.net/youngqj/article/details/47315167

Containers = minimum of (2*CORES, 1.8*DISKS, (Total available RAM) / MIN_CONTAINER_SIZE)
RAM-per-Container = maximum of (MIN_CONTAINER_SIZE, (Total Available RAM) / Containers))
Example:
Cluster nodes have 12 CPU cores, 48 GB RAM, and 12 disks.
Reserved Memory = 6 GB reserved for system memory + (if HBase) 8 GB for HBase
Min Container size = 2 GB
If there is no HBase:
# of Containers = minimum of (2*12, 1.8* 12, (48-6)/2) = minimum of (24, 21.6, 21) = 21
RAM-per-Container = maximum of (2, (48-6)/21) = maximum of (2, 2) = 2

The resources used by a single Spark application:
client mode:
cores = spark.yarn.am.cores + spark.executor.cores * spark.executor.instances
memory = spark.yarn.am.memory + spark.yarn.am.memoryOverhead + (spark.executor.memory + spark.yarn.executor.memoryOverhead) * spark.executor.instances + --driver-memory
cluster mode:
cores = spark.driver.cores + spark.executor.cores * spark.executor.instances
memory = spark.driver.memory + spark.yarn.driver.memoryOverhead + (spark.executor.memory + spark.yarn.executor.memoryOverhead) * spark.executor.instances
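
As a worked example of the client-mode formula above (hypothetical values; the YARN memory overheads default to roughly max(384 MB, 10% of the corresponding heap), and YARN additionally rounds each container up to its minimum/increment allocation):

# Client-mode resource estimate with made-up settings, following the formula above.
def overhead(mb):
    return max(384, int(mb * 0.10))   # default spark.yarn.*.memoryOverhead

am_memory       = 2048   # spark.yarn.am.memory (MB)
executor_memory = 1536   # spark.executor.memory (MB)
executor_cores  = 1      # spark.executor.cores
executors       = 2      # spark.executor.instances
am_cores        = 1      # spark.yarn.am.cores

cores  = am_cores + executor_cores * executors
memory = (am_memory + overhead(am_memory)
          + (executor_memory + overhead(executor_memory)) * executors)
print("cores requested from YARN: %d" % cores)        # 3
print("memory requested from YARN: %d MB" % memory)   # 6272, plus --driver-memory on the client itself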
Runtime parameters:
1.num-executors
Sets how many executor processes the Spark job uses in total.
2.executor-memory
3.executor-cores
4.driver-memory
Driver memory usually does not need to be set, or about 1 GB is enough. The one thing to watch: if you use the collect operator to pull all of an RDD's data back to the driver for processing, the driver memory must be large enough, otherwise you will hit an OOM error.
5.spark.default.parallelism
Setting this to 2-3 times num-executors * executor-cores works well; for example, with 300 executor CPU cores in total, around 1000 tasks is reasonable.
6.spark.storage.memoryFraction
Sets the fraction of executor memory that can hold persisted RDD data; the default is 0.6, i.e. 60% of executor memory can be used for persisted RDDs. Depending on the persistence level chosen, when memory is insufficient the data may simply not be cached, or may spill to disk.
7.spark.shuffle.memoryFraction
Sets the fraction of executor memory a task can use for aggregation after pulling the previous stage's output during a shuffle; the default is 0.2. (Note that since Spark 1.6 these two fractions belong to the legacy memory manager and only take effect when spark.memory.useLegacyMode=true.)
8.total-executor-cores
Here is a sample spark-submit command:

./bin/spark-submit \
  --master spark://192.168.1.1:7077 \
  --num-executors 100 \
  --executor-memory 6G \
  --executor-cores 4 \
  --total-executor-cores 400 \
  --driver-memory 1G \
  --conf spark.default.parallelism=1000 \
  --conf spark.storage.memoryFraction=0.5 \
  --conf spark.shuffle.memoryFraction=0.3 \
  your_app.py
(The application jar or Python script, here the placeholder your_app.py, always comes last. --total-executor-cores applies to standalone mode and defaults to all available cores; on YARN --num-executors is used instead.)
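
The same tuning values can also be set from inside a PySpark program through SparkConf instead of spark-submit flags (a sketch using the same illustrative numbers; spark.executor.instances is the configuration equivalent of --num-executors on YARN):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tuned-job")
        .set("spark.executor.instances", "100")     # --num-executors
        .set("spark.executor.memory", "6g")         # --executor-memory
        .set("spark.executor.cores", "4")           # --executor-cores
        .set("spark.default.parallelism", "1000")
        .set("spark.storage.memoryFraction", "0.5")
        .set("spark.shuffle.memoryFraction", "0.3"))
sc = SparkContext(conf=conf)   # driver memory must still come from spark-submit/spark-defaults,
                               # because the driver JVM is already running at this point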

Common problems

1) no module named pyspark

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
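
If the py4j dependency is still missing after this, $SPARK_HOME/python/lib/py4j-*-src.zip usually has to be added to PYTHONPATH as well. An alternative that avoids editing profile files is the third-party findspark package (a sketch, assuming it was installed with pip install findspark):

import findspark
findspark.init("/usr/local/spark-2.1.2")   # puts pyspark and py4j on sys.path

from pyspark import SparkContext
sc = SparkContext(appName="pythonpath-check")
print(sc.parallelize([1, 2, 3]).count())   # should print 3
sc.stop()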

2)WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

Add the LD_LIBRARY_PATH variable to spark-env.sh:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/hadoop-2.8.2/lib/native

3)WARN [Thread-378] 2015-06-11 13:41:39,712 ExternalLogger.java (line 73) SparkWorker: Your hostname, myhost1.somedomain.com resolves to a loopback address: 127.0.0.1; using 10.1.2.1 instead (on interface bond1)
WARN [Thread-378] 2015-06-11 13:41:39,714 ExternalLogger.java (line 73) SparkWorker: Set SPARK_LOCAL_IP if you need to bind to another address

export SPARK_LOCAL_IP="<IP address>"

4) java.lang.UnsatisfiedLinkError: no hadoop in java.library.path

export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native/"

5)cannot access /usr/local/spark/lib/spark-assembly-*.jar: No such file or directory
See http://blog.csdn.net/Gpwner/article/details/73457108

Edit the hive launcher script under /usr/local/apache-hive-1.2.1-bin/bin (i.e. $HIVE_HOME/bin/hive) and change
sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`
to
sparkAssemblyPath=`ls ${SPARK_HOME}/jars/*.jar`

6)Error: ERROR: relation "BUCKETING_COLS" already exists (state=42P07,code=0)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
* schemaTool failed *

The cause is that the Hive metastore database already exists in MySQL; drop it and re-run the initialization.

7) Duplicate SLF4J/log4j bindings when starting Hive

SLF4J: Class path contains multiple SLF4J bindings.  
SLF4J: Found binding in [jar:file:/D:/software/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]  
SLF4J: Found binding in [jar:file:/D:/software/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]  
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.  
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]  

The key information is slf4j-log4j12-1.7.2.jar, log4j-slf4j-impl-2.4.1.jar and "Class path contains multiple SLF4J bindings": the two jars provide duplicate bindings and one of them should be removed. After removing log4j-slf4j-impl-2.4.1.jar the startup is normal again.

8)Could not create ServerSocket on address 0.0.0.0/0.0.0.0:9083

Run jps.
Kill the RunJar process, then restart hive --service metastore.

9) hadoop namenode -format

Delete the DataNode/NameNode data directories before every re-format:
rm -r /data/hadoop/

10) Call From /127.0.0.1 to :36682 failed on connection exception: java.net.ConnectException: ubuntu

On the master machine:
hostname master    (temporary change)
mv /etc/hostname /etc/hostname.bak    (for a permanent change)
vi /etc/hostname
slave1
hostname -i
On the slave machine:
hostname slave2    (temporary change)
mv /etc/hostname /etc/hostname.bak    (for a permanent change)
vi /etc/hostname
slave2
hostname -i

11) Exception in thread "main" org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:9083.

If you hit this, it is because a Hive metastore is already running. Run jps,
then kill the RunJar process with kill -9 PID.