Spark SQL Hive Data Source: A Comprehensive Hands-On Case Study

张章章Sam

Article link: https://blog.csdn.net/qq_16103331/article/details/53562349

Spark SQL Hive Data Source: A Comprehensive Hands-On Case Study (Reading and Writing Partitioned Tables)

Background on the Hive Data Source

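Besides SQLContext, Spark SQL also supports HiveQL syntax through HiveContext. HiveContext extends SQLContext and adds the ability to look up databases and tables in the Hive metastore; it also supports HiveQL (through the hql method). HiveQL is considerably more capable than the plain SQL dialect of SQLContext.
With HiveContext you can use most of Hive's functionality, including creating tables, loading data into tables, and querying tables with SQL statements. A query returns a DataFrame, and collecting it yields an array of Row objects.
We covered how Hive and Spark are integrated in an earlier lesson; please review that document for the details.
In short: copy hive-site.xml into the spark/conf directory, and copy the MySQL connector JAR into the spark/lib directory.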

Modify spark-submit.sh

The following configuration works in standalone cluster mode:

/opt/spark/bin/spark-submit \
--class $1 \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--files /opt/hive/conf/hive-site.xml \
--driver-class-path /opt/hive/lib/mysql-connector-java-5.1.39.jar \
/opt/jars/spark/spark-hive.jar
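
The submit command above depends on three local files: hive-site.xml, the MySQL connector JAR, and the application JAR. A missing file usually only shows up as a failure at runtime, so it can help to check the paths first. Below is a minimal sketch of such a check in plain Java. The class name `SubmitPreflight` and the method `missingFiles` are hypothetical names introduced for illustration; the paths are the ones used in the command above and will differ on other installations.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: verifies that the files referenced by spark-submit exist.
class SubmitPreflight {

    /** Returns the subset of the given paths that are not readable regular files. */
    public static List<Path> missingFiles(List<Path> required) {
        List<Path> missing = new ArrayList<>();
        for (Path p : required) {
            if (!Files.isRegularFile(p) || !Files.isReadable(p)) {
                missing.add(p);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Paths copied from the spark-submit command; adjust them to your environment.
        List<Path> required = new ArrayList<>();
        required.add(Paths.get("/opt/hive/conf/hive-site.xml"));
        required.add(Paths.get("/opt/hive/lib/mysql-connector-java-5.1.39.jar"));
        required.add(Paths.get("/opt/jars/spark/spark-hive.jar"));

        List<Path> missing = missingFiles(required);
        if (missing.isEmpty()) {
            System.out.println("All files referenced by spark-submit are present.");
        } else {
            for (Path p : missing) {
                System.out.println("Missing or unreadable: " + p);
            }
        }
    }
}
```

Running this on the machine that launches spark-submit only confirms that the files are visible there; it says nothing about whether the worker nodes can see them.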

Saving Data to the Hive Data Warehouse

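Spark SQL can also save data into Hive tables. Calling saveAsTable on a DataFrame writes the DataFrame's data into a Hive table.
How it differs from registerTempTable:
saveAsTable materializes the DataFrame's data into a Hive table and also creates the table's metadata in the Hive metastore.
saveAsTable creates a Hive managed table, which means the location of the data is controlled by the information in the metastore. When a managed table is dropped, its data is physically deleted as well.
registerTempTable only registers a temporary table; as soon as the Spark application stops or restarts, the table is gone. A table created by saveAsTable is materialized, so it continues to exist whether or not the Spark application stops or restarts.
In addition, calling HiveContext.table() creates a DataFrame directly from an existing Hive table, which is convenient for later operations.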

Code Examples

Scala version

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

/**
  * Integrating Spark with Hive (Spark 1.x HiveContext API).
  */
object SparkHiveOps extends App {
  val conf = new SparkConf().setAppName("SparkHiveOps")
  val sc = new SparkContext(conf)
  val hiveContext = new HiveContext(sc)

  /**
    * Query data that already exists in Hive.
    */
  val df = hiveContext.table("word")
  df.show()

  /**
    * Write data to Hive.
    * teacher_info
    *     name,height
    * teacher_basic
    *     name,age,married,children
    * The two tables are related through the name column.
    * Requirement:
    *   join teacher_info with teacher_basic and output each teacher's
    *   name, age, height, married and children, keeping only height > 180.
    */
  // Create the first table in Hive
  hiveContext.sql("DROP TABLE IF EXISTS teacher_basic")
  hiveContext.sql("CREATE TABLE teacher_basic(" +
    "name string, " +
    "age int, " +
    "married boolean, " +
    "children int) " +
    "row format delimited " +
    "fields terminated by ','")
  // Load data into teacher_basic
  hiveContext.sql("LOAD DATA LOCAL INPATH '/opt/data/spark/teacher_basic.txt' INTO TABLE teacher_basic")

  // Create the second table and load its data
  hiveContext.sql("DROP TABLE IF EXISTS teacher_info")
  hiveContext.sql("CREATE TABLE teacher_info(name string, height int) row format delimited fields terminated by ','")
  hiveContext.sql("LOAD DATA LOCAL INPATH '/opt/data/spark/teacher_info.txt' INTO TABLE teacher_info")

  // Join the two tables. The WHERE filter on i.height discards rows that have
  // no match in teacher_info, so this left join behaves like an inner join.
  val joinDF = hiveContext.sql("select b.name, b.age, b.married, b.children, i.height from teacher_basic b left join teacher_info i on b.name = i.name where i.height > 180")
  joinDF.show()

  // Drop any previous result table, then persist the join result as a managed Hive table
  hiveContext.sql("DROP TABLE IF EXISTS teacher")
  joinDF.write.saveAsTable("teacher")
  sc.stop()
}

Java version

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class SparkHiveJava {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName(SparkHiveJava.class.getSimpleName());
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc);

        // Query data that already exists in Hive
        DataFrame df = hiveContext.table("word");
        df.show();

        /**
         * Write data to Hive.
         * teacher_info
         *     name,height
         * teacher_basic
         *     name,age,married,children
         * The two tables are related through the name column.
         * Requirement:
         *   join teacher_info with teacher_basic and output each teacher's
         *   name, age, height, married and children, keeping only height > 180.
         */
        // Create the first table in Hive
        hiveContext.sql("DROP TABLE IF EXISTS teacher_basic");
        hiveContext.sql("CREATE TABLE teacher_basic(" +
                "name string, " +
                "age int, " +
                "married boolean, " +
                "children int) " +
                "row format delimited " +
                "fields terminated by ','");
        // Load data into teacher_basic
        hiveContext.sql("LOAD DATA LOCAL INPATH '/opt/data/spark/teacher_basic.txt' INTO TABLE teacher_basic");

        // Create the second table and load its data
        hiveContext.sql("DROP TABLE IF EXISTS teacher_info");
        hiveContext.sql("CREATE TABLE teacher_info(name string, height int) row format delimited fields terminated by ','");
        hiveContext.sql("LOAD DATA LOCAL INPATH '/opt/data/spark/teacher_info.txt' INTO TABLE teacher_info");

        // Join the two tables. The WHERE filter on i.height discards rows that have
        // no match in teacher_info, so this left join behaves like an inner join.
        DataFrame joinDF = hiveContext.sql("select b.name, b.age, b.married, b.children, i.height from teacher_basic b left join teacher_info i on b.name = i.name where i.height > 180");
        joinDF.show();

        // Drop any previous result table, then persist the join result as a managed Hive table
        hiveContext.sql("DROP TABLE IF EXISTS teacher");
        joinDF.write().saveAsTable("teacher");
        sc.stop();
    }
}
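
The article does not show the contents of teacher_basic.txt and teacher_info.txt. Because both tables are declared with `fields terminated by ','`, each line of those files is expected to be one comma-separated record. The sketch below, in plain Java without Spark, uses hypothetical sample rows (the names and values are made up for illustration) to show that file layout and to reproduce what the join query computes, including why the filter `i.height > 180` removes teachers that have no row in teacher_info. The class and method names are assumptions introduced here, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the join-and-filter logic on comma-delimited lines.
class TeacherJoinSketch {

    /** Parses teacher_info lines of the form "name,height" into a name -> height map. */
    public static Map<String, Integer> parseInfo(List<String> lines) {
        Map<String, Integer> heights = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            heights.put(fields[0], Integer.parseInt(fields[1]));
        }
        return heights;
    }

    /**
     * Left-joins teacher_basic lines ("name,age,married,children") against the heights,
     * then applies the filter height > minHeight. A teacher without a matching height
     * has a null height, fails the filter, and is dropped, as in the SQL query.
     */
    public static List<String> joinAndFilter(List<String> basicLines,
                                             Map<String, Integer> heights,
                                             int minHeight) {
        List<String> result = new ArrayList<>();
        for (String line : basicLines) {
            String[] f = line.split(",");
            Integer height = heights.get(f[0]); // null when there is no match
            if (height != null && height > minHeight) {
                // Column order follows the query: name, age, married, children, height
                result.add(f[0] + "," + f[1] + "," + f[2] + "," + f[3] + "," + height);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data in the layout the CREATE TABLE statements expect
        List<String> basic = Arrays.asList(
                "zhangsan,23,false,0",
                "lisi,24,false,0",
                "wangwu,25,true,1");
        List<String> info = Arrays.asList(
                "zhangsan,175",
                "lisi,183");

        for (String row : joinAndFilter(basic, parseInfo(info), 180)) {
            System.out.println(row); // prints: lisi,24,false,0,183
        }
    }
}
```

In this sample, zhangsan is filtered out because 175 is not greater than 180, and wangwu is filtered out because there is no height to compare.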

Common Errors

O audit: ugi=kkk    ip=unknown-ip-addr  cmd=get_table : db=default tbl=word

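Submitting the job like this does not work (note that neither hive-site.xml nor the MySQL connector is passed in):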
/opt/spark/bin/spark-submit \
--class $1 \
--master spark://master:7077 \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
/opt/jars/spark/spark-hive.jar

The following configuration works in standalone cluster mode:

/opt/spark/bin/spark-submit \
--class $1 \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--files /opt/hive/conf/hive-site.xml \
--driver-class-path /opt/hive/lib/mysql-connector-java-5.1.39.jar \
/opt/jars/spark/spark-hive.jar

/opt/spark/bin/spark-submit \
--class $1 \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--files /opt/hive/conf/hive-site.xml \
--driver-class-path /opt/hive/lib/mysql-connector-java-5.1.39.jar \
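/opt/jars/spark/spark-hive.jar

So far I have not managed to get this working in YARN cluster mode. A likely cause is that in cluster mode the driver runs inside the YARN cluster, so a local --driver-class-path entry may not exist on that node; the Spark 1.x SQL guide suggests shipping the datanucleus JARs with --jars and hive-site.xml with --files in this mode.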