Spark SQL: Thoughts and Hands-On Development (Part 1)

Latest recommended article published 2024-05-10 20:03:52

熊怪叔叔

Category: Big Data / Spark (大数据-spark)

Article link: https://blog.csdn.net/king_hui719/article/details/105007092

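Spark SQL: data is read into Spark SQL, processed or run through an algorithm, and the processed result is written to the appropriate output sink. In real development work, the following points need to be considered:

1. How many data sources there are. Spark SQL supports many of them: Hive, JSON, .txt, ORC files, and JDBC.

2. How fields are mapped, which depends on the types defined by the programming language.

3. Once the data is available, how to organize it: data -> data structure -> operations on the data.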

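Spark SQL in practice

2. SQLContext and HiveContext

SQLContext handles many kinds of data sources, while HiveContext handles Hive data. HiveContext extends SQLContext and adds two methods:

analyze(tableName: String): analyzes the given table to collect statistics that the query optimizer uses; it does not parse SQL.

refreshTable(tableName: String): when the storage location of a table used in Spark SQL has changed while its metadata is still cached in memory, call this method to invalidate the cache so that the metadata is reloaded.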

3.1 Reading data

JSON file person.txt:

{"id":1,"name":"bob","total":10}

{"id":2,"name":"alice","total":15}

{"id":3,"name":"alice","total":40}

{"id":4,"name":"jerry","total":25}

JSON file personloan, where total is the loan amount issued:

{"id":1,"name":"bob","total":40}

{"id":2,"name":"alice","total":65}

{"id":3,"name":"alice","total":20}

{"id":4,"name":"jerry","total":25}

eg:

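val sql = new SQLContext(sc) // create a SQLContext object for working with the data

val shangpin = sql.read.json("<file path>")

// shangpin is an org.apache.spark.sql.DataFrame = [id: bigint, name: string, total: bigint];

// the data is now available as a DataFrame (the file is scanned here to infer the schema, rows are loaded lazily)

Description: Spark locates the file, parses it as JSON, and builds the schema from the JSON keys. The column order need not match the order of the keys in the file, because inferred fields are sorted by name.

Integers are inferred as bigint and quoted values as string. The result is a DataFrame.

A DataFrame holds its data as Row objects, one row at a time, so processing works at row granularity. It is the data structure into which Spark SQL reads data.

Its structure resembles a database table: the data is stored by row,

and there is also a schema, equivalent to the table definition, that records which field each value belongs to.

shangpin.schema // inspect the schema

Result: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,true), StructField(name,StringType,true), StructField(total,LongType,true))

This shows how shangpin stores its data and how SQL types map to Scala types: bigint -> Long, int -> Int,

float -> Float, double -> Double, string -> String, and so on. A DataFrame consists of a schema plus row-stored data.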
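The schema inference described above can be tried without a file. This is a minimal sketch, assuming Spark 2.2 or later (SparkSession and read.json on a Dataset[String]) in local mode; the session settings and the names personLines, idType, and nameType are illustrative and not part of the original post.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.2+ running locally; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[1]").appName("read-json-sketch").getOrCreate()
import spark.implicits._

// The person records from above, held in memory instead of in person.txt.
val personLines = Seq(
  """{"id":1,"name":"bob","total":10}""",
  """{"id":2,"name":"alice","total":15}""",
  """{"id":3,"name":"alice","total":40}""",
  """{"id":4,"name":"jerry","total":25}"""
).toDS()

// Parse each string as one JSON record and infer the schema from the keys.
val person = spark.read.json(personLines)

person.printSchema() // prints the inferred schema as a tree

val fieldNames = person.schema.fieldNames.toSeq           // inferred fields, sorted by name
val idType = person.schema("id").dataType.simpleString     // "bigint": integers become LongType
val nameType = person.schema("name").dataType.simpleString // "string": quoted values become StringType
```

Reading from a Dataset[String] keeps the sketch self-contained; with a real file, spark.read.json(path) goes through the same inference.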

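Other read interfaces

json(path: String): reads a JSON file

table(tableName: String): reads a table from the database

jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): connects to a database over JDBC

orc(path: String): reads a file stored in ORC format

parquet(path: String): reads a file stored in Parquet format

schema(schema: StructType): supplies the schema for the data being read, so Spark does not have to infer it again, which is an optimization.

In most cases this optimization is not needed.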
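To illustrate the schema(...) reader option, here is a small sketch that supplies the schema up front instead of letting Spark infer it. It assumes Spark 2.2 or later; personSchema and the in-memory rows are made up for the illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: Spark 2.2+ running locally.
val spark = SparkSession.builder().master("local[1]").appName("explicit-schema-sketch").getOrCreate()
import spark.implicits._

// Hypothetical schema: id and total as Int rather than the inferred Long.
val personSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("total", IntegerType, nullable = true)
))

val personLines = Seq(
  """{"id":1,"name":"bob","total":10}""",
  """{"id":2,"name":"alice","total":15}"""
).toDS()

// With a user-supplied schema the inference pass is skipped.
val person = spark.read.schema(personSchema).json(personLines)

val columnTypes = person.schema.fields.map(_.dataType).toSeq
val ids = person.collect().map(_.getInt(0)).toSeq
```

Besides skipping inference, an explicit schema fixes the column types, which matters when the inferred type (bigint here) is wider than the data needs.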

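3.2 Reading data with SQL:

When you only need some of a table's fields rather than all of them, reading the data this way has an advantage.

val shangpin = sql.sql("""select * from shangpin""".stripMargin)

// stripMargin does not parse the SQL; it removes the leading whitespace and the "|" margin from each line of a multi-line string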
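What stripMargin does is easiest to see in plain Scala, without Spark. In this sketch the query text and the column names are only an example.

```scala
// A multi-line query; "|" marks where each continued line really starts.
val query =
  """select id, name
    |from shangpin
    |where total >= 20""".stripMargin

// The indentation and the "|" markers are gone; only the SQL text remains.
println(query)
```

On a single-line string such as the one in the post, stripMargin changes nothing, so it is only useful once the query spans several lines.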

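3.3 Writing data: what gets written is generally a DataFrame, so data of any other type first has to be assembled into one

Building it from a case class:

import org.apache.spark.sql.SQLContext

val sql = new SQLContext(sc) // create a SQLContext object

import sql.implicits._ // sql is the SQLContext defined above; this import is needed for toDF

val person = sc.textFile("person.txt") // read with plain Spark as an RDD[String], which carries no schema (a SQLContext read would)

case class Person(id: Int, name: String, total: Int) // define the record class

val personInfo = person.map(line => line.split(",")).map(p => Person(p(0).trim.toInt, p(1), p(2).trim.toInt)).toDF()

// this builds a DataFrame by reflection on the case class; it requires import sql.implicits._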

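Constructing the schema directly, which requires converting the data to Row type:

import org.apache.spark.sql.{Row, SQLContext}

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sql = new SQLContext(sc)

val person = sc.textFile("person.txt").map(line => line.split(",")) // => introduces an anonymous function

val personRow = person.map(p => Row(p(0).trim.toInt, p(1), p(2).trim.toInt)) // convert the RDD to an RDD[Row]; the values must match the schema types

val schema = StructType(StructField("id", IntegerType, true) ::

StructField("name", StringType, true) ::

StructField("total", IntegerType, true) :: Nil)

val personInfo = sql.createDataFrame(personRow, schema)

// the Row length and value types must match the schema: StructField(field name, type, whether the field may be null)

Constructing the schema by hand is constrained by the number of fields, since every field has to be listed explicitly.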

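Writing the data:

import org.apache.spark.sql.{Row, SQLContext}

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sql = new SQLContext(sc)

val person = sc.textFile("person.txt").map(line => line.split(",")) // => introduces an anonymous function

val personRow = person.map(p => Row(p(0).trim.toInt, p(1), p(2).trim.toInt)) // convert the RDD to an RDD[Row]

val schema = StructType(StructField("id", IntegerType, true) ::

StructField("name", StringType, true) ::

StructField("total", IntegerType, true) :: Nil)

val personInfo = sql.createDataFrame(personRow, schema)

personInfo.registerTempTable("tempTable") // register an intermediate temporary table

sql.sql("""insert overwrite table targetTable select * from tempTable""".stripMargin)

sql() is the method that executes a SQL statement.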
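The register-then-query pattern can be sketched end to end as follows. This assumes Spark 2.x or later, where createOrReplaceTempView is the current name for registerTempTable; the rows are built in memory, and the insert into a Hive target table is left out because it needs a Hive-enabled session.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ running locally, without Hive support.
val spark = SparkSession.builder().master("local[1]").appName("temp-view-sketch").getOrCreate()
import spark.implicits._

// Same rows as the person data above, built in memory for the sketch.
val personInfo = Seq((1, "bob", 10), (2, "alice", 15), (3, "alice", 40), (4, "jerry", 25))
  .toDF("id", "name", "total")

// Make the DataFrame visible to SQL under a temporary name.
personInfo.createOrReplaceTempView("tempTable")

// Select only the fields and rows that are needed.
val selected = spark.sql(
  """select id, name
    |from tempTable
    |where total >= 20""".stripMargin)

val selectedColumns = selected.columns.toSeq
val selectedIds = selected.collect().map(_.getInt(0)).sorted.toSeq
```

The temporary view lives only in the current session, which is why it works well as an intermediate step before an insert into a permanent table.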

3.4 Operating on data through DataFrame methods

1> Schema operations

val sql = new SQLContext(sc)

val person = sql.read.json("person.txt") // person is a DataFrame

person.schema // returns the schema as a StructType

person.dtypes // returns the schema as an array: Array[(String, String)] = Array((id,LongType), (name,StringType), (total,LongType))

person.columns // returns the field names of the schema

person.printSchema // prints the schema as a tree

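2> Table operations (calling SQL through sql() is more flexible and practical):

1. In the methods below, sql = new SQLContext(sc) and df is a DataFrame object.

2. sql.read.table(tableName): reads the data of a table

3. df.where(), df.filter(): filter conditions, equivalent to the WHERE clause in SQL. Usage: select the rows whose total is at least 20.

Return type: DataFrame. df.where("total>=20"), df.filter("total>=20")

4. df.limit(): limits the number of rows returned, like LIMIT in SQL. Usage: limit the output to one hundred rows with df.limit(100). Return type: DataFrame.

5. df.join: join operation, equivalent to JOIN in SQL

6. df.groupBy(): grouping, equivalent to GROUP BY in SQL. Usage: group the rows by one or more columns.

Return type: GroupedData (RelationalGroupedDataset from Spark 2.0 on), not a DataFrame; calling agg() on it yields a DataFrame.

7. df.agg(): computes aggregate functions

8. df.intersect(other: DataFrame): the intersection of two DataFrames

9. df.except(other: DataFrame): the rows that are in df but not in other

10. df.withColumn(colName: String, col: Column): adds a column

11. df.withColumnRenamed(existingName, newName): renames a column

12. df.map(); df.flatMap(); df.mapPartitions(); df.foreach(); df.foreachPartition(); df.collect(); df.collectAsList();

df.repartition(); df.distinct(); df.count(): these mirror Spark's RDD operations, and the DataFrame class wraps them as well. Pay attention to their return types:

in Spark 1.x, map(), flatMap(), and mapPartitions() return an RDD rather than a DataFrame, repartition() and distinct() return a DataFrame, and collect(), collectAsList(), and count() return local values.

13. df.select: selects certain columns, equivalent to SELECT in SQL; returns a DataFrame

eg: df.select("id","name")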
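Several of the methods in the list above can be exercised on a small in-memory DataFrame. This is a sketch assuming Spark 2.x or later; the second DataFrame other and the column names total_x2 and amount are invented for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: Spark 2.x+ running locally.
val spark = SparkSession.builder().master("local[1]").appName("df-methods-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "bob", 10), (2, "alice", 15), (3, "alice", 40), (4, "jerry", 25))
  .toDF("id", "name", "total")

// Hypothetical second DataFrame sharing one row with df.
val other = Seq((1, "bob", 10), (5, "tom", 30)).toDF("id", "name", "total")

val atLeast20 = df.where("total >= 20")                    // same result as df.filter("total >= 20")
val firstTwo = df.limit(2)                                 // at most two rows
val doubled = df.withColumn("total_x2", col("total") * 2)  // adds a derived column
val renamed = df.withColumnRenamed("total", "amount")      // renames a column
val idAndName = df.select("id", "name")                    // keeps two columns
val common = df.intersect(other)                           // rows present in both
val onlyInDf = df.except(other)                            // rows in df but not in other
```

Each call returns a new DataFrame and leaves df unchanged, so the calls can be chained freely.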

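3.5 Join operations in Spark SQL

1. Inner join (equi-join): the join columns are merged into a single column

eg: val sql = new SQLContext(sc)

val personInfo = sql.read.json("person.txt")

val personloan = sql.read.json("loan.txt")

val info_loan = personInfo.join(personloan, "id") // inner join on id

val info_loan0 = personInfo.join(personloan, Seq("id", "name")) // inner join on id and name

/* join also supports left and right outer joins, with much the same meaning as in SQL */

eg:

// join on a single field

val left = personInfo.join(personloan, personInfo("id") === personloan("id"), "left_outer")

// join on multiple fields

val left0 = personInfo.join(personloan, personInfo("id") === personloan("id") && personInfo("name") === personloan("name"), "left_outer")

As the results above show, joining on a column expression has a drawback: the result contains duplicate column names. One fix is to register each DataFrame as a temporary table with registerTempTable() and write the join as SQL through sql.sql(""),

selecting only the columns you need, which resolves the duplicate fields.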
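The duplicate-column behaviour is easy to observe by comparing the column lists of the two join styles. This sketch assumes Spark 2.x or later and builds both tables in memory; loan_total is a made-up name used to keep the two total columns apart.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ running locally.
val spark = SparkSession.builder().master("local[1]").appName("join-columns-sketch").getOrCreate()
import spark.implicits._

val personInfo = Seq((1, "bob", 10), (2, "alice", 15), (3, "alice", 40), (4, "jerry", 25))
  .toDF("id", "name", "total")
val personloan = Seq((1, "bob", 40), (2, "alice", 65), (3, "alice", 20), (4, "jerry", 25))
  .toDF("id", "name", "total")

// Joining on a column expression keeps the columns of both sides.
val byExpr = personInfo.join(personloan, personInfo("id") === personloan("id"), "left_outer")
val exprColumns = byExpr.columns.toSeq

// Joining on a Seq of names merges the join columns into one copy each.
val bySeq = personInfo.join(personloan, Seq("id", "name"), "left_outer")
val seqColumns = bySeq.columns.toSeq

// Renaming the remaining clash before the join leaves no duplicates at all.
val tidy = personInfo.join(
  personloan.withColumnRenamed("total", "loan_total"), Seq("id", "name"), "left_outer")
val tidyColumns = tidy.columns.toSeq
```

The Seq form removes duplicates only for the join keys; any other column that exists on both sides, such as total here, still has to be renamed or dropped.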

eg:

import org.apache.spark.{SparkConf, SparkContext}

import org.apache.spark.sql.SQLContext

// Personloan is declared at the top level: when the case class is defined inside main,

// the toDF call below typically fails to compile (no TypeTag or implicit conversion is available for it)

case class Personloan(id: Int, name: String, total: Int)

object RDD2DataFrame {

def main(args: Array[String]): Unit = {

val conf = new SparkConf().setMaster("local").setAppName("RDD2DataFrame")

val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)

// Converting an RDD to a DataFrame by reflection in Scala requires importing the implicit conversions by hand

import sqlContext.implicits._

// This is just an ordinary RDD whose elements are case class instances;

// calling toDF on it converts it to a DataFrame

val personloanDF = sc.textFile("./student.txt", 1)

.map(line => line.split(","))

.map(arr => Personloan(arr(0).trim.toInt, arr(1), arr(2).trim.toInt))

.toDF()

personloanDF.registerTempTable("personloan") // register the DataFrame as a temporary table

val perloanDF = sqlContext.sql("select * from personloan where total<=20")

val perloanRDD = perloanDF.rdd

perloanRDD.map(row => Personloan(row(0).toString.toInt, row(1).toString, row(2).toString.toInt))

.collect()

.foreach(per => println(per.id + ":" + per.name + ":" + per.total))

}

}

3.6 agg in Spark SQL

agg is generally used together with groupBy():

val personloan = new SQLContext(sc).read.json("personloan.txt")

val agg1 = personloan.groupBy("name").agg("total" -> "avg")

val agg2 = personloan.groupBy("id", "name").agg("total" -> "avg")

val agg3 = personloan.groupBy("name").agg(Map("id" -> "avg", "total" -> "max"))

When agg is given a Map, each column can be aggregated only once: the argument is a key-value Map, so a repeated key overwrites the earlier entry.
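The Map limitation, and one way around it, can be sketched as follows. This assumes Spark 2.x or later; passing Column expressions built with avg and max from org.apache.spark.sql.functions is an alternative that the original post does not mention.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

// Assumption: Spark 2.x+ running locally.
val spark = SparkSession.builder().master("local[1]").appName("agg-sketch").getOrCreate()
import spark.implicits._

val personloan = Seq((1, "bob", 40), (2, "alice", 65), (3, "alice", 20), (4, "jerry", 25))
  .toDF("id", "name", "total")

// Plain Scala behaviour: a repeated key keeps only the last value.
val spec = Map("total" -> "avg", "total" -> "max")

// Only max(total) survives, because the Map holds a single entry for "total".
val viaMap = personloan.groupBy("name").agg(spec)
val viaMapColumns = viaMap.columns.toSeq

// Column expressions allow several aggregates over the same column.
val viaColumns = personloan.groupBy("name").agg(avg("total"), max("total"))
val viaColumnsNames = viaColumns.columns.toSeq

val alice = viaColumns.filter($"name" === "alice").collect().head
```

The column form also makes it possible to name the results with .as(...), which the Map form does not offer.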

3.7 na operations in Spark SQL

....

Dataset<Row> nums =

spark.createDataset(rowList, RowEncoder.apply(DataTypes.createStructType

(new StructField[]{

DataTypes.createStructField("col_1", DataTypes.IntegerType, true),

DataTypes.createStructField("col_2", DataTypes.IntegerType, true),

})));

nums.show(false);

DataFrameNaFunctions dataFrameNaFunctions = nums.na();

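// drop a row when any of its columns is null or NaN

dataFrameNaFunctions.drop("any").show();

// drop a row only when all of its columns are null or NaN

dataFrameNaFunctions.drop("all").show();

// drop a row when any of the specified columns is null or NaN

dataFrameNaFunctions.drop(new String[]{"col_1", "col_2"}).show();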

....

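The main methods are drop(), fill(), and replace().

This is a summary of notes from real development work; comments and discussion are welcome.

Addendum: to get rid of duplicate columns, you can also pick the columns you need with select, remove the duplicate with drop, or join on a Seq of column names.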
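For comparison with the Java fragment, here is the same idea as a Scala sketch covering drop, fill, and replace. It assumes Spark 2.x or later; the three sample rows and the replacement value 100 are invented for the illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ running locally.
val spark = SparkSession.builder().master("local[1]").appName("na-sketch").getOrCreate()
import spark.implicits._

// None becomes null in the resulting nullable integer columns.
val nums = Seq[(Option[Int], Option[Int])](
  (Some(1), Some(2)),
  (Some(3), None),
  (None, None)
).toDF("col_1", "col_2")

val dropAny = nums.na.drop("any")                      // drops rows with at least one null
val dropAll = nums.na.drop("all")                      // drops rows where every column is null
val dropCol1 = nums.na.drop(Seq("col_1"))              // drops rows where col_1 is null
val filled = nums.na.fill(0)                           // replaces nulls in numeric columns with 0
val replaced = nums.na.replace("col_1", Map(1 -> 100)) // replaces the value 1 in col_1 with 100
```

All three families return a new DataFrame, so they can be combined, for example filling some columns and then dropping rows that are still incomplete.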
