Spark SQL

lucasmaluping

Categories: Spark, hive

Article link: https://blog.csdn.net/lucasmaluping/article/details/103155395

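Creating DataFrames
1) spark-shell version
In spark-shell, the SparkContext and SQLContext objects have already been created.
2) Code:
spark-shell --master spark://hdp-1:7077 --executor-memory 512m --total-executor-cores 2
// Create a local data set and parallelize it into an RDD
val seq = Seq(("1","xiaoming",15),("2","xiaohong",20),("3","xiaobi",10))

val rdd1 = sc.parallelize(seq)

Convert the RDD into a DataFrame (a DataFrame stores both the data and its schema)
// _1: string, _2: string, _3: int
// When toDF is called with no arguments, the default column names are "_" plus a number, starting from 1
rdd1.toDF
val df = rdd1.toDF("id","name","age")

In "_1: string", _1 is the column name and string is the data type of that column
// Print the data with show; show is an action
df.show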

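DSL-style syntax
1. Queries:

df.select("name").show
df.select("name","age").show
// Filter by a condition
df.select("name","age").filter("age >10").show
// Here the column names are passed as strings, and the filter expression is a string as well

// 2. The filter argument can also be a Column built with col("column name")
// (in a standalone program, import org.apache.spark.sql.functions.col first)

df.select("name","age").filter(col("age") >10).show

// 3. Group and count

df.groupBy("age").count().show()

// 4. Print the DataFrame schema

df.printSchema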
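The DSL calls above can also be tried outside spark-shell. The following is a minimal sketch, assuming Spark 2.x with a local master; the object name `DslSketch` and the helper `olderThan` are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DslSketch {
  // Column-based filter; selects the same rows as filter(s"age > $minAge")
  def olderThan(df: DataFrame, minAge: Int): DataFrame =
    df.select("name", "age").filter(col("age") > minAge)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, only for trying the queries out
    val spark = SparkSession.builder().appName("DslSketch").master("local[*]").getOrCreate()
    import spark.implicits._ // needed for toDF on a local Seq

    val df = Seq(("1", "xiaoming", 15), ("2", "xiaohong", 20), ("3", "xiaobi", 10))
      .toDF("id", "name", "age")

    olderThan(df, 10).show()          // rows whose age is greater than 10
    df.groupBy("age").count().show()  // number of rows per distinct age
    df.printSchema()                  // column names and types

    spark.stop()
  }
}
```

Keeping the filter in a small function makes it easy to reuse the same condition on other DataFrames that have `name` and `age` columns.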

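SQL-style syntax:
1. Register the DataFrame as a temporary table; the table is kept in the current session
// registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
df.registerTempTable("t_person")

Queries:
spark.sqlContext.sql("select name,age from t_person where age > 10").show
spark.sqlContext.sql("select name,age from t_person order by age desc limit 2").show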
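On Spark 2.x the same SQL-style queries can be written against a temporary view. This is a sketch under the assumption that a `SparkSession` is used instead of `SQLContext`; the object name `TempViewSketch` and the helper `olderThan` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TempViewSketch {
  // Assumes a view named t_person with name and age columns is already registered
  def olderThan(spark: SparkSession, minAge: Int): DataFrame =
    spark.sql(s"select name, age from t_person where age > $minAge")

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, only for trying the queries out
    val spark = SparkSession.builder().appName("TempViewSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("1", "xiaoming", 15), ("2", "xiaohong", 20), ("3", "xiaobi", 10))
      .toDF("id", "name", "age")

    // Session-scoped view; it is gone once the session ends
    df.createOrReplaceTempView("t_person")

    olderThan(spark, 10).show()
    spark.sql("select name, age from t_person order by age desc limit 2").show()

    spark.stop()
  }
}
```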

In IDEA:
Dependency (make sure the version matches your Spark installation):

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

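Code: approach 1

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Spark SQL is all about querying
  */
object SparkSQLDemo1 {
  def main(args: Array[String]): Unit = {
    // In spark-shell the SparkContext and SQLContext are already created,
    // but in a standalone program we have to create them ourselves
    val conf = new SparkConf().setAppName("SparkSQLDemo1").setMaster("local")
    val sc = new SparkContext(conf)
    // Create the SQLContext
    val sqlc = new SQLContext(sc)
    // Read the data and build an RDD
    val lineRDD: RDD[Array[String]] = sc.textFile("C:\\Users\\S\\Desktop\\内民大实训\\person.txt").map(_.split(" "))
    //lineRDD.foreach(x => println(x.toList))

    // Map each record to the Person case class
    val personRDD: RDD[Person] = lineRDD.map(x => Person(x(0).toInt,x(1),x(2).toInt))
    import sqlc.implicits._
    // toDF works through reflection on the case class; the implicits above must be imported to use it
    /**
      * DataFrame [_1:int,_2:String,_3:Int]
      * In spark-shell the data was a parallelized collection of tuples, with no case class holding it,
      * so calling toDF directly used the default column names: "_" plus a number, starting from 1.
      * Column names can be passed to toDF (passing more names than there are columns throws an error).
      *
      * The number of names must be neither more nor fewer than the number of columns;
      * names and columns must correspond one to one.
      *
      * In a standalone program the data is stored in a case class, and the constructor parameters
      * of the case class become the column names, so toDF picks up the field names directly.
      * Custom column names can still be supplied.
      *
      */
    val personDF: DataFrame = personRDD.toDF()
    //val personDF: DataFrame = personRDD.toDF("ID","NAME","AGE")
    personDF.show()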

    // Use SQL syntax
    // Register a temporary table; it lives in the SQLContext that created the DataFrame
    personDF.registerTempTable("t_person")
    val sql = "select  * from t_person where age > 20 order by age"
    // Run the query
    val res = sqlc.sql(sql)
    //  def show(numRows: Int, truncate: Boolean): Unit = println(showString(numRows, truncate))
    // show() prints 20 rows by default
    res.show()

    // Persist the data
    // Write the data to files: mode selects how to write, the final call selects the file format
    /**
      * def mode(saveMode: String): DataFrameWriter = {
      *     this.mode = saveMode.toLowerCase match {
      * case "overwrite" => SaveMode.Overwrite  -- overwrite
      * case "append" => SaveMode.Append -- append
      * case "ignore" => SaveMode.Ignore
      * case "error" | "default" => SaveMode.ErrorIfExists
      * case _ => throw new IllegalArgumentException(s"Unknown save mode: $saveMode. " +
      * "Accepted modes are 'overwrite', 'append', 'ignore', 'error'.")
      *
      */
    //    res.write.mode("append").json("out3")
    //    hdfs://hadoop2:8020/out111")
    // Besides these, csv and json output are also available
    // csv needs a third-party plugin in Spark 1.6.3; it is built in from 2.0 on
    // Do not use this method, because it will be removed in 2.0
    // save() with no explicit format writes Parquet by default
    res.write.mode("append").save("C:\\Users\\S\\Desktop\\内民大实训\\out111")
  }
  case class Person(id:Int,name:String,age:Int)
}
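The mapping to `Person` above assumes every line has exactly three space-separated fields and that the first and third are integers; a malformed line makes `toInt` throw and fails the job. A more defensive parser might look like the sketch below. The object name `PersonParsing` and the function `parsePerson` are hypothetical and not part of the original code.

```scala
import scala.util.Try

case class Person(id: Int, name: String, age: Int)

object PersonParsing {
  // Returns None for lines that do not have exactly three fields
  // or whose id/age fields are not integers.
  def parsePerson(line: String): Option[Person] =
    line.trim.split("\\s+") match {
      case Array(id, name, age) =>
        Try(Person(id.toInt, name, age.toInt)).toOption
      case _ => None
    }

  // Possible use with an RDD of raw lines (requires a SparkContext named sc):
  //   val personRDD = sc.textFile(path).flatMap(line => parsePerson(line).toList)
}
```

With this approach bad lines are silently dropped; whether that is acceptable, or whether they should be counted or logged, depends on the data.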

Approach 2:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}

object SparkSQLStructTypeDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkSQLStructTypeDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlcontext = new SQLContext(sc)

    // Read the data and split each line
    val lineRDD =  sc.textFile("C:\\Users\\S\\Desktop\\内民大实训\\person.txt").map(_.split(" "))
    // Create a StructType that describes the data structure (similar to a table schema)
    val structType: StructType = StructType {
      List(
        // column name, data type, whether null is allowed
        StructField("id", IntegerType, false),
        StructField("name", StringType, true),
        StructField("age", IntegerType, false)

        // The columns must match the data, but with StructType:
        /**
          * if the schema has more columns than the data, the values of the extra columns should be null
          * the schema must not have fewer columns than the data, otherwise an exception is thrown
          *  StructField("oop", IntegerType, false)
          *   StructField("poo", IntegerType, false)
          */
      )
    }
    // Map each split line to a Row
    val rowRDD: RDD[Row] = lineRDD.map(arr => Row(arr(0).toInt,arr(1),arr(2).toInt))
    // Convert the RDD to a DataFrame
    val personDF: DataFrame = sqlcontext.createDataFrame(rowRDD,structType)
    personDF.show()
  }
}
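As an alternative to building `Row` objects by hand, an explicit schema can also be handed to the file reader. The sketch below assumes Spark 2.x, a space-delimited text file, and a path passed as the first program argument (falling back to a hypothetical `person.txt`); `SchemaReadSketch` and `personSchema` are names made up for this illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SchemaReadSketch {
  // Same three columns as the StructType in approach 2
  val personSchema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = false)
  ))

  def main(args: Array[String]): Unit = {
    // Hypothetical default path; pass the real file as the first argument
    val path = args.headOption.getOrElse("person.txt")
    val spark = SparkSession.builder().appName("SchemaReadSketch").master("local[*]").getOrCreate()

    // The csv reader splits on the given separator and casts each field to the schema type
    val personDF = spark.read.schema(personSchema).option("sep", " ").csv(path)
    personDF.printSchema()
    personDF.show()

    spark.stop()
  }
}
```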

Writing the data to a MySQL database:

import java.util.Properties

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}

object DataFormeInputJDBC {
  /*  def createSC(AppName:String,Master:String):SparkContext = {

    }
    def createSC(AppName:String,Master:String,sc:SparkContext):SQLContext = {

    }*/
  def main(args: Array[String]): Unit = {
    val conf = new  SparkConf().setAppName("DataFormeInputJDBC").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Read the data and split each line
    val lines = sc.textFile("C:\\Users\\S\\Desktop\\内民大实训\\person.txt").map(_.split(" "))

    // The StructType holds the table schema
    val structType: StructType = StructType(
      Array(
        StructField("id", IntegerType, false),
        StructField("name", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Map each split line to a Row
    val rowRDD: RDD[Row] = lines.map(arr => Row(arr(0).toInt,arr(1),arr(2).toInt))
    // Convert the RDD to a DataFrame
    val personDF: DataFrame = sqlContext.createDataFrame(rowRDD,structType)

    // Connection properties used for writing to MySQL
    val prop = new Properties()
    prop.put("user","root")
    prop.put("password","lucas")
    prop.put("driver","com.mysql.cj.jdbc.Driver")
    // The MySQL JDBC URL

    val jdbcurl = "jdbc:mysql://localhost/classonedb?characterEncoding=utf-8&serverTimezone=UTC"

    // Table name
    val table = "person1"
    // The database must exist; if the table does not exist it is created automatically
    // Write the data through JDBC
    // Properties is implemented on top of Hashtable
    personDF.write.mode("append").jdbc(jdbcurl,table,prop)
    println("Data inserted successfully")
    sc.stop()
  }
}
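To check that the rows arrived, the table can be read back through the same JDBC settings. This is a sketch assuming Spark 2.x and that the MySQL driver jar is on the classpath; the object name `JdbcReadSketch`, the helper `connectionProps`, and the `MYSQL_PASSWORD` environment variable are hypothetical, while the URL and table name are the ones used above.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

object JdbcReadSketch {
  // Builds the JDBC connection properties for the given credentials
  def connectionProps(user: String, password: String): Properties = {
    val prop = new Properties()
    prop.setProperty("user", user)
    prop.setProperty("password", password)
    prop.setProperty("driver", "com.mysql.cj.jdbc.Driver")
    prop
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JdbcReadSketch").master("local[*]").getOrCreate()

    val jdbcurl = "jdbc:mysql://localhost/classonedb?characterEncoding=utf-8&serverTimezone=UTC"
    // Hypothetical: take the password from the environment instead of hard-coding it
    val prop = connectionProps("root", sys.env.getOrElse("MYSQL_PASSWORD", ""))

    // Load the whole table as a DataFrame
    val personDF = spark.read.jdbc(jdbcurl, "person1", prop)
    personDF.show()

    spark.stop()
  }
}
```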

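The predecessor of Spark SQL is Shark. Hive was created to give engineers who knew RDBMSs but not MapReduce a tool they could pick up quickly, and at the time it was the only SQL-on-Hadoop tool running on Hadoop. However, MapReduce writes large amounts of intermediate data to disk during computation, which consumes a lot of I/O and lowers efficiency. Shark was created to make SQL-on-Hadoop faster, but it depended too heavily on Hive (for example, it used Hive's syntax parser and query optimizer), so in 2014 the Spark team stopped developing Shark and moved all resources to the Spark SQL project.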

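Accessing Hive from spark-sql:

spark-shell must be started on the machine where Hive is installed.
hive-site.xml and core-site.xml must be added to the conf directory under the Spark installation path.
The MySQL driver jar must be placed in the jars directory under the Spark installation path: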

bin/spark-shell --master spark://hdp-1:7077 --executor-memory 512m --total-executor-cores 2

[root@hdp-4 spark-2.2.0-bin-hadoop2.7]# bin/spark-shell --master spark://hdp-1:7077 --executor-memory 512m --total-executor-cores 2
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/03 15:02:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/03 15:02:52 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/03/03 15:03:00 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.81.132:4041
Spark context available as 'sc' (master = spark://hdp-1:7077, app id = app-20200324071246-0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@28551755

scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> val hc = new HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@308d4981

scala> val rows = hc.sql("select * from zpark.users")
rows: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> rows.first
res1: org.apache.spark.sql.Row = [jack,20]

scala>
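The session above triggers a deprecation warning because `HiveContext` is deprecated in Spark 2.x. A standalone program can request Hive support on the `SparkSession` instead. The sketch below assumes the `spark-hive` module is on the classpath, that `hive-site.xml` is visible to Spark as described above, and that the job is launched with spark-submit so the master comes from the command line; the object name `HiveQuerySketch` is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object HiveQuerySketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport connects the session to the Hive metastore
    val spark = SparkSession.builder()
      .appName("HiveQuerySketch")
      .enableHiveSupport()
      .getOrCreate()

    // Same table as in the shell session above
    val rows = spark.sql("select * from zpark.users")
    rows.show()

    spark.stop()
  }
}
```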
