Spark SQL


这个超人不会灰


Article link: https://blog.csdn.net/weixin_41143582/article/details/78632495


DataFrame

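A DataFrame is a distributed data set built on top of RDDs, similar to a two-dimensional table in a traditional database. It is essentially an RDD that carries schema information, and it serves as a high-level abstraction over structured data.
Difference between a DataFrame and an RDD: a DataFrame carries schema metadata, so every column of the two-dimensional table it represents has a name and a type. This lets Spark SQL see more of the data's structure, so it can apply targeted optimizations both to the data source behind the DataFrame and to the transformations applied to it, which ultimately improves runtime efficiency substantially. An RDD, by contrast, knows nothing about the internal structure of its elements, so Spark Core can only perform simple, generic pipeline optimizations at the stage level.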
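The difference can be made concrete with a small sketch. It assumes the Spark 1.x-style `SQLContext` API used in the rest of this article; the object name `SchemaVsRdd`, the helper `build`, and the sample rows are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SchemaVsRdd {

  // Hypothetical helper: builds a two-column DataFrame from made-up rows.
  def build(sqlContext: SQLContext): DataFrame = {
    val rowRDD = sqlContext.sparkContext.parallelize(Seq(Row("libai", 30), Row("dufu", 25)))
    val schema = StructType(Array(
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)))
    sqlContext.createDataFrame(rowRDD, schema)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SchemaVsRdd").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    val df = build(sqlContext)

    // The DataFrame knows the name and type of every column.
    df.printSchema()

    // The filter names a column, so Spark SQL can see which field is involved.
    df.filter(df("age") > 26).select("name").collect().foreach(println)

    // On the plain RDD the same logic is an opaque function over whole Row objects.
    df.rdd.filter(row => row.getInt(1) > 26).map(_.getString(0)).collect().foreach(println)

    sc.stop()
  }
}
```

Both queries return the same rows; the point is only that the first one is expressed against named, typed columns, while the second hides the logic inside a closure.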

Spark DataFrame

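A Spark DataFrame:
1. Is a distributed data set.
2. Resembles a table in a relational database, or a sheet in Excel.
3. Has a rich set of operations, similar to the operators on an RDD.
4. Can be registered as a table and then queried with SQL.
5. Can be created in many ways, from:
an existing RDD
a structured data file
a JSON data set
a Hive table
an external database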
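Points 3 to 5 above can be sketched together as follows. This is a minimal example, assuming Spark 1.4 or later (for `sqlContext.read`) and the same `SQLContext` style as the rest of the article. The object name, the table name `poets`, and the in-memory JSON records are made up; a real job would more likely read a JSON file by path. Newer Spark versions prefer `createOrReplaceTempView` over `registerTempTable`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object JsonDataFrameExample {

  // Made-up JSON records standing in for a JSON data set.
  val records = Seq(
    """{"name": "libai", "age": 30}""",
    """{"name": "dufu", "age": 25}""")

  // Creates a DataFrame from JSON strings; the schema is inferred from the data.
  def load(sqlContext: SQLContext): DataFrame =
    sqlContext.read.json(sqlContext.sparkContext.parallelize(records))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonDataFrame").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    val df = load(sqlContext)

    // RDD-like operators, expressed on named columns.
    df.filter(df("age") > 26).select("name").show()

    // Register the DataFrame as a table and query it with SQL.
    df.registerTempTable("poets")
    sqlContext.sql("SELECT name FROM poets WHERE age > 26").collect().foreach(println)

    sc.stop()
  }
}
```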

Notes:

(1) To use a Hive data source, copy hive-site.xml into Spark's conf directory, for example: scp conf/hive-site.xml root@node22:/usr/hadoop/spark/conf

(2) Files on HDFS are accessed through port 8020.

Interoperating between DataFrames and RDDs

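Spark SQL's Scala interface supports converting an RDD into a DataFrame. A case class defines the schema of the table: the names of the case class's parameters are read via reflection and become the column names. Case classes can also be nested or contain complex types such as sequences or arrays. Such an RDD can be implicitly converted into a DataFrame and then registered as a table, which can be used in subsequent SQL statements.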
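A minimal sketch of the nested and complex-type case, assuming the same Spark 1.x-style API as the rest of the article. The case classes `Address` and `Person`, the object name, the table name, and the sample data are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical case classes, defined at top level so their schema can be derived by reflection.
case class Address(city: String, zip: String)
case class Person(name: String, age: Int, address: Address, tags: Seq[String])

object NestedCaseClassExample {

  // Converts an RDD of Person objects into a DataFrame via the implicit toDF().
  def build(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    sqlContext.sparkContext.parallelize(Seq(
      Person("libai", 30, Address("Chengdu", "610000"), Seq("poet", "tang")),
      Person("dufu", 25, Address("Xi'an", "710000"), Seq("poet"))
    )).toDF()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("NestedCaseClass").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    val df = build(sqlContext)

    // address appears as a struct column and tags as an array column.
    df.printSchema()

    // Nested fields can be reached with dot notation in SQL.
    df.registerTempTable("person")
    sqlContext.sql("SELECT name, address.city FROM person WHERE age > 26").collect().foreach(println)

    sc.stop()
  }
}
```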

UDF

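import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object UDF {

  // The function to expose to SQL: returns the length of a string.
  val myUDF = (str: String) => {
    str.length
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UDF").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Build an RDD of Rows and attach a one-column schema to it.
    val names = Array("libai", "dufu", "baijuyi", "wangchanlin", "hezhizang")
    val namesRDD = sc.parallelize(names, 4)
    val namesRowRDD = namesRDD.map(name => Row(name))
    val structType = StructType(Array(StructField("name", StringType, true)))
    val namesDF = sqlContext.createDataFrame(namesRowRDD, structType)

    namesDF.registerTempTable("names")

    // Register the function under the SQL name strLength.
    // Equivalent inline form:
    // sqlContext.udf.register("strLength", (str: String) => str.length)
    sqlContext.udf.register("strLength", myUDF)

    sqlContext.sql("select name, strLength(name) from names").collect().foreach(println)
  }
}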
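The same kind of function can also be applied through the DataFrame API instead of a SQL string. The sketch below assumes `org.apache.spark.sql.functions.udf` (available since Spark 1.3); the object name `UDFColumnExample`, the helper `withLength`, and the column alias `len` are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object UDFColumnExample {

  // Wrap the Scala function so it can be applied to DataFrame columns.
  val strLength = udf((str: String) => str.length)

  // Hypothetical helper: returns each name together with its length.
  def withLength(sqlContext: SQLContext): DataFrame = {
    val rows = sqlContext.sparkContext.parallelize(Seq("libai", "dufu")).map(name => Row(name))
    val schema = StructType(Array(StructField("name", StringType, true)))
    val namesDF = sqlContext.createDataFrame(rows, schema)
    namesDF.select(namesDF("name"), strLength(namesDF("name")).as("len"))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("UDFColumn").setMaster("local"))
    withLength(new SQLContext(sc)).collect().foreach(println)
    sc.stop()
  }
}
```

Unlike `sqlContext.udf.register`, this form does not give the function a SQL name, so it is only usable from DataFrame expressions.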

Creating a DataFrame

1. Reflection:

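import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// The case class is defined at top level, above the object that uses it.
case class Age(id: Int, age: Int)

object SSQL02 {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("Age").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Brings the implicit RDD-to-DataFrame conversion (toDF) into scope.
    import sqlContext.implicits._
    val lines = sc.textFile("D:\\***")
    // Each line holds two tab-separated integers: id and age.
    val df = lines.map(_.split("\\t")).map(line => Age(line(0).trim.toInt, line(1).trim.toInt)).toDF()
    df.registerTempTable("Age")

    // Print every row as-is.
    sqlContext.sql("select * from Age").collect().foreach(println)

    val allAge1 = sqlContext.sql("select * from Age")
    // Access columns by position.
    allAge1.map(row => "id : " + row(0) + " age: " + row(1)).collect().foreach(println)
    // Access columns by name; getAs needs an explicit type parameter.
    allAge1.map(row => "id : " + row.getAs[Int]("id") + " age: " + row.getAs[Int]("age")).collect().foreach(println)
  }

}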

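2. Programmatic (dynamic) schema. Note: in the code below, Row(p(0), p(1).trim) must not call toInt, because every field in the schema is declared as StringType.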

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SSQL03 {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("ssql03").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val people = sc.textFile("D:\\***")
    val schemaString = "name age"

    // Import Row.
    import org.apache.spark.sql.Row

    // Import Spark SQL data types
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Generate the schema based on the string of schema
    val schema =
      StructType(
        schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

    // Convert records of the RDD (people) to Rows.
    val rowRDD = people.map(_.split("\\t")).map(p => Row(p(0), p(1).trim))

    // Apply the schema to the RDD.
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

    // Register the DataFrames as a table.
    peopleDataFrame.registerTempTable("people")

    // SQL statements can be run by using the sql methods provided by sqlContext.
    val results = sqlContext.sql("SELECT * FROM people")

    // Map over the underlying RDD of Rows; on Spark 2.x and later, mapping the DataFrame directly would need an implicit Encoder.
    results.rdd.map(t => "Name: " + t(0) + " Age: " + t(1)).collect().foreach(println)

  }

}
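If the age column should be numeric rather than a string, one option is to declare it as IntegerType in the schema and convert the value when building each Row, so that the Row contents match the declared types. The following is a sketch under that assumption, not part of the original example; the object name `TypedSchemaExample`, the table name `people_typed`, and the in-memory sample lines are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object TypedSchemaExample {

  // Hypothetical helper: builds a DataFrame whose age column is an integer.
  def build(sqlContext: SQLContext): DataFrame = {
    // Made-up tab-separated records standing in for the text file.
    val lines = sqlContext.sparkContext.parallelize(Seq("libai\t30", "dufu\t25"))

    // The schema declares age as IntegerType instead of StringType.
    val schema = StructType(Array(
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)))

    // Because the schema says IntegerType, the Row must hold an Int, so toInt is required here.
    val rowRDD = lines.map(_.split("\t")).map(p => Row(p(0), p(1).trim.toInt))
    sqlContext.createDataFrame(rowRDD, schema)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TypedSchema").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    build(sqlContext).registerTempTable("people_typed")

    // Numeric comparisons now work on age without casting.
    sqlContext.sql("SELECT name FROM people_typed WHERE age > 26").collect().foreach(println)
    sc.stop()
  }
}
```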

Problems

1. SparkSQL: no typetag available for xxxx

The case class must be defined above the object that uses it, at top level, rather than inside it.
