1. Data Preparation
1.1 Copy hive-site.xml into the spark/conf directory:
Analysis: the error message makes it clear that Spark has no way to find the location of Hive's metadata, so it cannot instantiate the corresponding metastore client.
The fix is to copy hive-site.xml into the spark/conf directory.
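For reference, the setting Spark needs most from hive-site.xml is the metastore address. A minimal sketch of the relevant property, assuming a metastore service on a host named master (hostname and port are placeholders for your environment):
<configuration>
    <property>
        <!-- tells Spark where the Hive metastore service is running; placeholder host/port -->
        <name>hive.metastore.uris</name>
        <value>thrift://master:9083</value>
    </property>
</configuration>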
1.2 If the test code does not call sc.stop(), the following error appears:
ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
Adding sc.stop() as the last line of the code resolves the problem.
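A minimal skeleton of the fix, with the stop call as the last statement of main:
import org.apache.spark.{SparkConf, SparkContext}
object StopExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StopExample"))
    // ... job logic ...
    sc.stop() // shuts down the listener bus cleanly, avoiding the EventLoggingListener error
  }
}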
1.3 Create the database spark_sql_hive in the Hive warehouse in advance:
hive (default)> create database spark_sql_hive;
OK
Time taken: 5.502 seconds
Prepare two text files, people.txt and peoplescores.txt (fields separated by a tab, matching the '\t' delimiter in the table definitions below).
Contents of people.txt (person table: name, age):
Michael 29
Andy 30
Justin 19
Contents of peoplescores.txt (score table: name, score):
Michael 99
Andy 97
Justin 68
1.4 Add the dependencies to pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>sparkandhive</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <spark.version>2.4.4</spark.version>
        <scala.version>2.11.12</scala.version>
        <hadoop.version>2.6.0</hadoop.version>
        <hive.version>2.3.0</hive.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <!-- spark-hive is a Spark module, so it must use the Spark version, not the Hive version -->
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <configuration>
                    <recompileMode>incremental</recompileMode>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.2.1</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
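With the assembly plugin bound to the package phase, mvn package produces both the regular jar and a *-jar-with-dependencies.jar suitable for spark-submit; marking the Spark dependencies as provided keeps them out of that assembly, since the cluster supplies them at runtime.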
2. Loading Data into Hive with Spark
Note: a Hive-enabled SparkSession is built as follows:
val spark = SparkSession
  .builder() //.master("local[*]")
  .appName("spark_readOrwrite_hive")
  // Enable Hive support
  .enableHiveSupport()
  // Configure the Hive warehouse location
  .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
  // If metastore authentication is set to CUSTOM, a username and password must be supplied; with NONE they can be omitted
  .config("username", "sibat")
  .config("password", "sibat706")
  .getOrCreate()
2.1 Filter the Hive table student and save the result into a table named person
package com.sibat.applications
import org.apache.spark.sql.{SaveMode, SparkSession}
object SparkWriteHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder() //.master("local[*]")
      .appName("spark_write_hive")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse") // this line is optional
      .enableHiveSupport()
      .getOrCreate()
    // student is an existing Hive table; register it as a temporary view named person
    spark.table("student").createOrReplaceTempView("person")
    val tt = spark.sql("select * from person")
    tt.show()
    tt.write.mode(SaveMode.Overwrite).saveAsTable("person")
    spark.stop() // see section 1.2
  }
}
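saveAsTable above creates (or overwrites) the table definition itself. If the target table already exists and only new rows should be added, insertInto is an alternative; note that it matches columns by position, not by name. A minimal sketch under the same session:
// append rows into the existing Hive table, matching its column order
tt.write.mode(SaveMode.Append).insertInto("person")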
2.2 Load people.txt and peoplescores.txt into the Hive tables people and peoplescores, join the two tables, and select everyone who scored above 90
package com.sibat.applications
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Using the Hive data warehouse from a Spark cluster.
 */
object SparkSQLOnHive { // submit to the cluster to run
  def main(args: Array[String]): Unit = {
    // Create the SparkConf object
    val sparkConf = new SparkConf()
      .setMaster("local[*]") // for local testing; drop this when submitting to the cluster
      .setAppName("SparkSQLOnHive") // set the Spark application name
    // Create the SparkContext
    val sparkContext = new SparkContext(sparkConf)
    /**
     * First: a DataFrame can be saved directly into the Hive warehouse via saveAsTable.
     * Second: HiveContext.table can load a Hive table directly into a DataFrame.
     */
    val hiveContext = new HiveContext(sparkContext)
    hiveContext.sql("CREATE DATABASE IF NOT EXISTS spark_sql_hive") // the test database created in section 1.3
    hiveContext.sql("use spark_sql_hive")
    hiveContext.sql("drop table if exists people") // drop any existing table with the same name
    hiveContext.sql("create table if not exists people(name STRING,age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'") // create the custom table
    /**
     * Loading local data into the Hive warehouse actually copies the data behind the scenes.
     * LOAD DATA INPATH (without LOCAL) can also load data already on HDFS into Hive; in that case the data is moved, not copied.
     */
    hiveContext.sql("load data local inpath '/home/pro/people.txt' into table people")
    hiveContext.sql("drop table if exists peoplescores") // drop peoplescores
    hiveContext.sql("create table if not exists peoplescores(name STRING,score INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'")
    hiveContext.sql("load data local inpath '/home/pro/peoplescores.txt' into table peoplescores")
    /**
     * Join the two Hive tables directly through HiveContext to get the
     * name, age, and score of everyone who scored above 90.
     */
    val resultDF = hiveContext.sql("select pi.name,pi.age,ps.score from people pi join peoplescores ps on pi.name=ps.name" +
      " where ps.score > 90")
    /**
     * resultDF holds the records (name, age, score) of everyone above 90. Save it to the Hive
     * table peopleinformationresult through the DataFrameWriter.
     */
    hiveContext.sql("drop table if exists peopleinformationresult")
    // Insert the data; this is the Spark 2.0.x+ style (Spark 1.x: resultDF.saveAsTable("peopleinformationresult"))
    resultDF.write.mode(SaveMode.Overwrite).saveAsTable("peopleinformationresult")
    val dataFrameHive = hiveContext.table("peopleinformationresult")
    dataFrameHive.show()
    sparkContext.stop()
  }
}
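The same join can also be written with the DataFrame API instead of a SQL string. A minimal sketch, assuming a Hive-enabled SparkSession named spark and the two tables created above:
import org.apache.spark.sql.functions.col
val people = spark.table("spark_sql_hive.people")
val scores = spark.table("spark_sql_hive.peoplescores")
val result = people
  .join(scores, Seq("name"))   // inner join on the shared name column
  .filter(col("score") > 90)   // keep only scores above 90
  .select("name", "age", "score")
result.show()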
3. Reading Hive Data with Spark
3.1 Read the rows of table ajm_idcard in database ods where fileDate is 20210608
package com.sibat.applications
import org.apache.spark.sql.{DataFrame, SparkSession}
object SparkReadHive {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("spark_read_hive").enableHiveSupport().getOrCreate()
    val df: DataFrame = spark.sql("SELECT * FROM ods.ajm_idcard WHERE fileDate='20210608'")
      .select("serverReceiveTimestamp", "idno", "capTime", "gbNo")
    df.show(10, false)
    spark.stop() // see section 1.2
  }
}
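Equivalently, the filter can be expressed through the DataFrame API rather than a SQL string; Spark pushes the predicate down either way. A minimal sketch against the same table, assuming the same spark session:
import org.apache.spark.sql.functions.col
val df = spark.table("ods.ajm_idcard")
  .filter(col("fileDate") === "20210608") // column/partition filter
  .select("serverReceiveTimestamp", "idno", "capTime", "gbNo")
df.show(10, false)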
3.2 Using HiveContext
(1) Option 1:
package com.sibat.applications
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
object SparkUseHiveText {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("SparkHiveText")
    val sc = new SparkContext(conf)
    val hc = new HiveContext(sc)
    hc.sql("select * from ods.sibat_face").show()
    sc.stop()
  }
}
(2) Option 2:
package com.sibat.applications
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("test").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    sqlContext.table("person") // reference the table in db.table format
      .registerTempTable("person") // register it as a temporary table
    sqlContext.sql(
      """
        | select *
        | from person
        | limit 10
      """.stripMargin).show()
    sc.stop()
  }
}
(3) HiveContext is deprecated as of Spark 2.0.0 with the message "Use SparkSession.builder.enableHiveSupport instead"; the relevant source:
package org.apache.spark.sql.hive
import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.internal.Logging
import org.apache.spark.sql.{SparkSession, SQLContext}
/**
* An instance of the Spark SQL execution engine that integrates with data stored in Hive.
* Configuration for Hive is read from hive-site.xml on the classpath.
*/
@deprecated("Use SparkSession.builder.enableHiveSupport instead", "2.0.0")
class HiveContext private[hive](_sparkSession: SparkSession)
  extends SQLContext(_sparkSession) with Logging {
  self =>

  def this(sc: SparkContext) = {
    this(SparkSession.builder().sparkContext(HiveUtils.withHiveExternalCatalog(sc)).getOrCreate())
  }

  def this(sc: JavaSparkContext) = this(sc.sc)

  /**
   * Returns a new HiveContext as new session, which will have separated SQLConf, UDF/UDAF,
   * temporary tables and SessionState, but sharing the same CacheManager, IsolatedClientLoader
   * and Hive client (both of execution and metadata) with existing HiveContext.
   */
  override def newSession(): HiveContext = {
    new HiveContext(sparkSession.newSession())
  }

  /**
   * Invalidate and refresh all the cached the metadata of the given table. For performance reasons,
   * Spark SQL or the external data source library it uses might cache certain metadata about a
   * table, such as the location of blocks. When those change outside of Spark SQL, users should
   * call this function to invalidate the cache.
   *
   * @since 1.3.0
   */
  def refreshTable(tableName: String): Unit = {
    sparkSession.catalog.refreshTable(tableName)
  }
}
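Following that deprecation notice, option 2 above maps directly onto SparkSession. A minimal sketch of the migrated version:
import org.apache.spark.sql.SparkSession
object AppOnSparkSession {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("test")
      .master("local[2]")
      .enableHiveSupport() // replaces new HiveContext(sc)
      .getOrCreate()
    spark.table("person").createOrReplaceTempView("person") // replaces registerTempTable
    spark.sql("select * from person limit 10").show()
    spark.stop()
  }
}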