Typical configuration:
- Spark runs on Linux in standalone mode, configured as in the "spark-env configuration parameters" listing below.
- Code is edited in IDEA on Windows; the driver runs locally and connects to the remote master, so the program runs against the cluster while you watch the logs in real time and single-step debug.
- To connect to Hive, configure hive.metastore.uris in a hive-site.xml placed under src on the local machine (see the hive-site.xml example at the end).
- Configure a local HADOOP_HOME, download winutils.exe, and copy it into HADOOP_HOME/bin.
- Follow steps 7-9 of the VM steps below.
- If the machine has multiple NICs, set SPARK_LOCAL_IP on the local machine to its IP on the cluster's network.
- Running IDEA on a low-spec machine can run out of memory; give the program explicit JVM memory settings in the run configuration: -Xms128m -Xmx512m -XX:PermSize=250m -XX:MaxPermSize=512m
On Windows, the environment variables HADOOP_HOME and HADOOP_USER_NAME must also be set; these can alternatively be set from driver code, as sketched below.
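A minimal driver-side sketch of the Windows settings above, assuming the Spark 1.x APIs used in this post; C:\hadoop, hdfs, and 192.168.56.1 are placeholder values for your own environment, and spark.driver.host is used here as the configuration-property counterpart of the SPARK_LOCAL_IP note:

import org.apache.spark.{SparkConf, SparkContext}

object WindowsDriverBootstrap {
  def main(args: Array[String]): Unit = {
    // Hadoop's Shell utility accepts the hadoop.home.dir system property in
    // place of the HADOOP_HOME env var; winutils.exe must sit in bin\ under it.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")
    // With simple authentication, Hadoop also reads HADOOP_USER_NAME from a
    // system property when the env var is not set.
    System.setProperty("HADOOP_USER_NAME", "hdfs")

    val conf = new SparkConf()
      .setAppName("WindowsDriverBootstrap")
      .setMaster("spark://master1:7077")
      // On a multi-NIC machine, advertise the address on the cluster's network.
      .set("spark.driver.host", "192.168.56.1")

    val sc = new SparkContext(conf)
    println(s"Connected; default parallelism = ${sc.defaultParallelism}")
    sc.stop()
  }
}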
spark-env configuration parameters:
# Where the pid file is stored. (Default: /tmp) Used when running Spark in the background.
export SPARK_PID_DIR=/var/run/spark
# A string representing this instance of spark.(Default: $USER)
SPARK_IDENT_STRING=$USER
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/current/hadoop-client/conf}
# The java implementation to use.
export JAVA_HOME=/usr/java/jdk1.7.0_67
export SPARK_MASTER_IP=master1
# Should be less than the number of CPU cores on the master node; if it is set to the node's full core count, jobs fail with: WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources (an application-side alternative is sketched after this listing)
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=2"
if [ -d "/etc/tez/conf/" ]; then
export TEZ_CONF_DIR=/etc/tez/conf
else
export TEZ_CONF_DIR=
fi
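Note that spark.deploy.defaultCores only sets the default for applications that do not request a core limit themselves; a single application can also cap what it asks from the standalone master with spark.cores.max, which is another way to avoid the "Initial job has not accepted any resources" warning when several drivers share the cluster. A minimal sketch against the master1 setup above:

import org.apache.spark.{SparkConf, SparkContext}

object CappedCoresApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CappedCoresApp")
      .setMaster("spark://master1:7077")
      // Overrides spark.deploy.defaultCores for this application only:
      // never hold more than 2 cores across the whole cluster.
      .set("spark.cores.max", "2")

    val sc = new SparkContext(conf)
    // Trivial job to confirm that executors were actually granted.
    println(sc.parallelize(1 to 100).reduce(_ + _))
    sc.stop()
  }
}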
Steps to run in the virtual machine:
Software: HDP 2.3.2 VM, VMware, GNOME, IDEA 15
- Download the HDP VM.
- Change the VM's DNS: edit /etc/resolv.conf and add a local DNS server; the default is Google DNS, with which many sites are unstable to ping or connect to.
- Switch to the NetEase (163) yum repo; see "Centos配置国内yum源" (zhuzusong, ChinaUnix blog).
- Install GNOME.
- Add a dev user.
- Download IDEA, upload it to the VM, extract, and run it.
- Install the Scala plugin, create a Scala project, and set up the project jars; all jars under spark/lib must be imported.
- Set the Spark master URL: it must be spark://sandbox.hortonworks.com:7077, taken from the Spark worker startup log (or from the worker process arguments via ps -ef). Using the IP instead of the hostname fails with an OneForOneStrategy error in the log.
- Copy hive-site.xml into src; otherwise queries fail with table-not-found errors.
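The test program below exercises steps 7-9 end to end: it connects to the standalone master and queries an existing Hive table (test_part) through HiveContext.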
Test code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

/**
 * Created by czp on 2015/11/2.
 */
object SparkHiveRun {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("HiveFromSpark").setMaster("spark://master1:7077")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)
    // import hiveContext.implicits._
    import hiveContext.sql

    // Queries are expressed in HiveQL
    println("Result of 'SELECT *': ")
    sql("SELECT * FROM test_part").collect().foreach(println)

    // Aggregation queries are also supported.
    val count = sql("SELECT COUNT(*) FROM test_part").collect().head.getLong(0)
    println(s"COUNT(*): $count")
  }
}
hive-site.xml:
<!--Sat Oct 10 23:00:44 2015-->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://node6:9083</value>
  </property>
</configuration>
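With this file on the driver's classpath (under src), HiveContext connects to the remote metastore at thrift://node6:9083. Without it, Spark falls back to a local embedded metastore, which is why the table-not-found errors mentioned above appear.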