- SparkSession configuration (the most important part): first, set the user name that the local IDEA machine uses to access the Hadoop cluster, otherwise the job fails with a "user has no permission" error; second, set the location of the Hive warehouse on HDFS so that SparkSQL and Hive share the same data; third, add a config entry on the SparkSession with the metastore address to connect to, reached over thrift.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// Access HDFS as a user that has permission on the warehouse directory
System.setProperty("HADOOP_USER_NAME", "jack")

val sparkConf: SparkConf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("sparkSQL")
  // Hostname of the local machine running the driver
  .set("spark.driver.host", "myword")
  // Point SparkSQL at the Hive warehouse location on HDFS
  .set("spark.sql.warehouse.dir", "hdfs://hadoop101:8020/user/hive/warehouse")

val sparksession: SparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .enableHiveSupport()
  // Address of the Hive metastore service, reached over thrift
  .config("hive.metastore.uris", "thrift://hadoop101:9083")
  .getOrCreate()

val sc: SparkContext = sparksession.sparkContext

sparksession.sql("show databases").show()
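If `show databases` lists the cluster's Hive databases, the connection works. As a further sanity check, here is a minimal sketch that queries a table and writes a DataFrame back through the same metastore; `test_db` and `user_info` are hypothetical names, substitute whatever exists on your cluster:

// Hypothetical names: replace test_db/user_info with a database and table from your cluster
sparksession.sql("use test_db")
sparksession.sql("select * from user_info limit 10").show()

// Writing back also goes through the metastore and the warehouse dir configured above
import sparksession.implicits._
val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
df.write.mode("overwrite").saveAsTable("test_db.spark_write_check")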
- To run SparkSQL locally in IDEA against the cluster's Hive, the cluster configuration files are essential: hdfs-site.xml, hive-site.xml, and core-site.xml, placed under src/main/resources so they are picked up from the classpath. core-site.xml and hdfs-site.xml are largely boilerplate, so I will only share hive-site.xml here; it is the one that matters because SparkSQL connects through the metastore over thrift, and those entries must be configured (if you have not set this up yet, online tutorials cover it, for example the Shang Gui Gu (尚硅谷) Hive videos on Bilibili). The file follows; adjust passwords, ports, and hostnames to match your own cluster.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC URL for the metastore database -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop101:3306/metastore?useSSL=false</value>
    </property>
    <!-- JDBC driver for the metastore database -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- JDBC username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!-- JDBC password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
    </property>
    <!-- Skip verification of the metastore schema version -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- Metastore event notification authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <!-- Address of the metastore service -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop101:9083</value>
    </property>
    <!-- Host that hiveserver2 binds to -->
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop101</value>
    </property>
    <!-- Port that hiveserver2 listens on -->
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <!-- Hive's default warehouse directory on HDFS -->
    <!-- <property>-->
    <!--     <name>hive.metastore.warehouse.dir</name>-->
    <!--     <value>/user/hive/warehouse</value>-->
    <!-- </property>-->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.zookeeper.quorum</name>
        <value>hadoop101,hadoop102,hadoop103</value>
        <description>The list of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
    </property>
    <property>
        <name>hive.zookeeper.client.port</name>
        <value>2181</value>
        <description>The port of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
    </property>
</configuration>
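One thing the XML does not make obvious: the metastore service that hive.metastore.uris points to must actually be running on the cluster (typically started with `hive --service metastore &`) before the SparkSession above can connect. Before launching the Spark job, a quick, dependency-free sketch like the following can confirm the thrift port is reachable from the development machine:

import java.net.{InetSocketAddress, Socket}

// Minimal reachability check for the metastore thrift endpoint configured above
object MetastorePortCheck {
  def main(args: Array[String]): Unit = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress("hadoop101", 9083), 3000) // 3-second timeout
      println("metastore port 9083 is reachable")
    } finally {
      socket.close()
    }
  }
}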
- pom file (I use the Aliyun Maven mirrors here; it took a fair amount of searching to find a mirror that downloads these dependencies reliably)
Aliyun mirror configuration in Maven's settings.xml (Maven uses the first mirror whose mirrorOf matches a repository, so the general-purpose public repository should be listed first):
<mirror>
    <id>aliyunmaven</id>
    <mirrorOf>*</mirrorOf>
    <name>Aliyun public repository</name>
    <url>https://maven.aliyun.com/repository/public</url>
</mirror>
<mirror>
    <id>aliyun</id>
    <mirrorOf>*</mirrorOf>
    <name>spring-plugin</name>
    <url>https://maven.aliyun.com/repository/spring-plugin</url>
</mirror>
The pom.xml:
<dependencies>
    <!-- hadoop-lzo is not in the public repositories; reference a local jar via system scope -->
    <dependency>
        <groupId>com.hadoop.compression</groupId>
        <artifactId>com.hadoop.compression</artifactId>
        <version>1.0</version>
        <scope>system</scope>
        <systemPath>${project.basedir}/src/main/resources/libs/hadoop-lzo-0.4.20.jar</systemPath>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
    <!-- <dependency>-->
    <!--     <groupId>org.apache.spark</groupId>-->
    <!--     <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>-->
    <!--     <version>2.1.1</version>-->
    <!-- </dependency>-->
    <!-- <dependency>-->
    <!--     <groupId>org.apache.spark</groupId>-->
    <!--     <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>-->
    <!--     <version>2.1.1</version>-->
    <!-- </dependency>-->
    <!-- <dependency>-->
    <!--     <groupId>mysql</groupId>-->
    <!--     <artifactId>mysql-connector-java</artifactId>-->
    <!--     <version>5.1.27</version>-->
    <!-- </dependency>-->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.2.1</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.11</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>1.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>1.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.2</version>
        <scope>compile</scope>
    </dependency>
</dependencies>
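Note that every Spark artifact above carries the _2.11 suffix, so the project must compile against Scala 2.11.x (spark-core_2.11 pulls in a matching scala-library transitively, which the scala-maven-plugin below picks up); mixing in a different Scala version leads to NoSuchMethodError-style failures at runtime.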
<build>
    <plugins>
        <!-- This plugin compiles Scala sources to class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <!-- Bind to Maven's compile phase; without the compile goal, only test sources would be compiled -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.1.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
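With this build section in place, `mvn package` produces both the regular jar and a jar-with-dependencies fat jar. For local debugging you simply run the main class from IDEA; the fat jar is what you would hand to `spark-submit` once the job moves from the local[*] master onto the cluster.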