Spark SQL Source-Level Function Walkthrough and UDF/UDAF Examples (Spark Study Series, Part 6)

IV. Spark SQL Source-Level Function Walkthrough

1. Spark SQL Built-in Functions: Internals and Practice

Spark SQL's DataFrame ships with a large number of built-in functions. Most of them support code generation (CG), so they are heavily optimized at both compile time and execution time.

Question: is Spark SQL operating on Hive the same thing as Hive on Spark?

=> No. When Spark SQL operates on Hive, Hive only acts as the source of the data warehouse; the compute engine is Spark SQL itself. Hive on Spark is a subproject of Hive whose core idea is to swap Hive's execution engine for Spark. Hive's current compute engine is MapReduce, and because of its poor performance the Hive community has been looking to replace it.

Running Spark SQL against data stored in Hive is called Spark on Hive, whereas Hive on Spark keeps Hive at the center and merely replaces the compute engine, MapReduce, with Spark.
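To make "Spark on Hive" concrete, here is a minimal sketch (not from the original text) that queries an existing Hive table through a HiveContext; the table name src is only an assumption:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Minimal "Spark on Hive" sketch: Hive only provides the metastore and the data,
// while the actual computation is done by Spark SQL. Table name "src" is an assumption.
object SparkOnHiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkOnHiveSketch"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT key, value FROM src LIMIT 10").show()
    sc.stop()
  }
}
```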

DataFrame API docs on the Spark website:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package

Experimental 
A distributed collection of data organized into named columns. 
A DataFrame is equivalent to a relational table in Spark SQL. The following example creates a DataFrame by pointing Spark SQL to a Parquet data set.

 
 
```scala
val people = sqlContext.read.parquet("...") // in Scala
```

```java
DataFrame people = sqlContext.read().parquet("...") // in Java
```

Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions.

To select a column from the data frame, use the apply method in Scala and col in Java.

```scala
val ageCol = people("age") // in Scala
```

```java
Column ageCol = people.col("age"); // in Java
```

Note that the Column type can also be manipulated through its various functions.

```scala
// The following creates a new column that increases everybody's age by 10.
people("age") + 10 // in Scala
```

```java
// The same in Java
people.col("age").plus(10);
```

A more concrete example in Scala:

```scala
// To create DataFrame using SQLContext
val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), "gender")
  .agg(avg(people("salary")), max(people("age")))
```

and in Java:

```java
// To create DataFrame using SQLContext
DataFrame people = sqlContext.read().parquet("...");
DataFrame department = sqlContext.read().parquet("...");

people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), "gender")
  .agg(avg(people.col("salary")), max(people.col("age")));
```

The join, groupBy, and agg calls above are all Spark SQL built-in functions.
Starting with Spark 1.5.x a large number of built-in functions were introduced; by a rough count there are more than a hundred.
Below is a hands-on example of an aggregation operation:

 
 
```scala
package com.dt.spark

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._

/**
 * Analyze data with Spark SQL's built-in functions. Unlike the ordinary Spark SQL API, the built-in
 * functions operating on a DataFrame return Column objects, and a DataFrame is by nature
 * "A distributed collection of data organized into named columns.", which gives a solid foundation
 * and great convenience for complex analysis: we can call built-in functions at any point while
 * operating on a DataFrame, which greatly reduces unnecessary work when building complex business
 * logic (it essentially maps straight onto the real model) and lets us focus on the analysis itself,
 * a big productivity win for engineers.
 * Spark 1.5.x added a large number of built-in functions, for example agg:
 *   def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame = {
 *     groupBy().agg(aggExpr, aggExprs : _*)
 *   }
 * as well as max, mean, min, sum, avg, explode, size, sort_array, day, to_date, abs, acos, asin, atan.
 * Overall the built-in functions fall into the following categories:
 * 1. aggregate functions, e.g. countDistinct, sumDistinct;
 * 2. collection functions, e.g. sort_array, explode;
 * 3. date/time functions, e.g. hour, quarter, next_day;
 * 4. math functions, e.g. asin, atan, sqrt, tan, round;
 * 5. window functions, e.g. rowNumber;
 * 6. string functions, e.g. concat, format_number, regexp_extract;
 * 7. other functions, e.g. isNaN, sha, randn, callUDF.
 */
object SparkSQLAgg {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "G:/datarguru spark/tool/hadoop-2.6.0")
    val conf = new SparkConf()
    conf.setAppName("SparkSQLInnerFunctions")
    //conf.setMaster("spark://master:7077")
    conf.setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc) // build the SQL context
    // To use Spark SQL's built-in functions, the implicits of the SQLContext must be imported
    import sqlContext.implicits._
    // Simulated e-commerce access data (real data is far more complex); it ends up as an RDD
    val userData = Array(
      "2016-3-27,001,http://spark.apache.org/,1000",
      "2016-3-27,001,http://Hadoop.apache.org/,1001",
      "2016-3-27,002,http://fink.apache.org/,1002",
      "2016-3-28,003,http://kafka.apache.org/,1020",
      "2016-3-28,004,http://spark.apache.org/,1010",
      "2016-3-28,002,http://hive.apache.org/,1200",
      "2016-3-28,001,http://parquet.apache.org/,1500",
      "2016-3-28,001,http://spark.apache.org/,1800"
    )
    val userDataRDD = sc.parallelize(userData) // create the distributed collection
    // Preprocess the data into a DataFrame as the business requires. To turn an RDD into a DataFrame,
    // first map its elements to Row, and at the same time describe the metadata of the DataFrame columns.
    val userDataRDDRow = userDataRDD.map(row => {
      val splited = row.split(",")
      Row(splited(0), splited(1).toInt, splited(2), splited(3).toInt)
    })
    val structType = StructType(Array(
      StructField("time", StringType, true),
      StructField("id", IntegerType, true),
      StructField("url", StringType, true),
      StructField("amount", IntegerType, true)
    ))
    val userDataDF = sqlContext.createDataFrame(userDataRDDRow, structType)
    // Use Spark SQL's built-in functions on the DataFrame. Note: the built-in functions
    // return Column objects and code generation is applied automatically.
    userDataDF.groupBy("time").agg('time, countDistinct('id))
      .map(row => Row(row(1), row(2))).collect().foreach(println)
    userDataDF.groupBy("time").agg('time, sum('amount))
      .map(row => Row(row(1), row(2))).collect().foreach(println)
  }
}
```
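As a side note, the agg(aggExpr: (String, String), ...) signature quoted in the comment block above can also be driven with column-name/function-name pairs. A minimal sketch against the same userDataDF (illustrative only):

```scala
// Sketch: the same aggregations expressed with the (String, String) form of agg quoted
// in the comment block above; column names reuse the schema defined earlier.
userDataDF.groupBy("time")
  .agg("amount" -> "sum", "id" -> "count")
  .show()
```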

2. Spark SQL Window Functions: Deep Dive and Practice

Window functions include:
ranking functions, analytic functions, and aggregate functions.
A fairly complete introduction to window functions:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html

The most important window function is row_number. row_number sorts within a group, i.e. it ranks rows on top of a grouping.
Below, the TopNGroup.scala program is rewritten with Spark SQL and executed:

 
 
```scala
package com.dt.spark

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object SparkSQLWindowFunctionOps {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.setMaster("spark://master:7077")
    conf.setAppName("SparkSQLWindowFunctionOps")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("DROP TABLE IF EXISTS scores")
    hiveContext.sql("CREATE TABLE IF NOT EXISTS scores(name STRING, score INT) "
      + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\\n'")
    // Load the data to be processed into the Hive table
    hiveContext.sql("LOAD DATA LOCAL INPATH 'G://datarguru spark/tool/topNGroup.txt' INTO TABLE scores")
    //hiveContext.sql("LOAD DATA LOCAL INPATH '/opt/spark-1.4.0-bin-hadoop2.6/dataSource' INTO TABLE scores")
    /**
     * Use a subquery to extract the target data and apply the window function row_number
     * inside it to rank within each group:
     * PARTITION BY: the key the window function groups by;
     * ORDER BY: the sort order inside each group.
     */
    val result = hiveContext.sql("SELECT name, score FROM ("
      + "SELECT name, score, row_number() OVER (PARTITION BY name ORDER BY score DESC) rank FROM scores) sub_scores "
      + "WHERE rank <= 4")
    result.show() // print the result on the Driver console
    // Save the result into the Hive warehouse
    hiveContext.sql("DROP TABLE IF EXISTS sortedResultScores")
    result.saveAsTable("sortedResultScores")
  }
}
```

The original version of this program reported:

```
ERROR metadata.Hive: NoSuchObjectException(message:default.scores table not found)
Exception in thread "main" org.apache.spark.sql.AnalysisException: missing BY at '' '' near '<EOF>'; line 1 pos 96
```

The AnalysisException comes from the missing BY keyword after FIELDS TERMINATED in the original CREATE TABLE statement (together with a missing space before WHERE in the query string); both are fixed in the code above. The NoSuchObjectException is just Hive logging that the scores table did not exist yet when DROP TABLE IF EXISTS first ran.

Reference:
http://blog.csdn.net/slq1023/article/details/51138709
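The same group-wise top-N can also be expressed with the DataFrame window API instead of a SQL string. A minimal sketch, assuming scores has been loaded as above (on Spark 1.4/1.5 the function is rowNumber(); from 1.6 it is also available as row_number()):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

// Sketch: group-wise top-N with the window API rather than a SQL subquery.
// "scores" with columns (name, score) is assumed to exist in the Hive metastore as above.
val scoresDF = hiveContext.table("scores")
val windowSpec = Window.partitionBy("name").orderBy(scoresDF("score").desc)
scoresDF
  .withColumn("rank", rowNumber().over(windowSpec))
  .filter("rank <= 4")
  .drop("rank")
  .show()
```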

3. Spark SQL UDF and UDAF: Deep Dive and Practice

UDAF = User Defined Aggregate Function.
A hands-on look at how UDFs and UDAFs are used in Spark SQL:
* UDF: User Defined Function. It takes a single data record as input; in implementation terms it is just an ordinary Scala function.
* UDAF: User Defined Aggregation Function. It operates on a collection of data and lets you customize the computation performed on top of the aggregation.
* Under the hood, a UDF is wrapped by Catalyst into an Expression, and its eval method is eventually called to compute over the input Row (this Row has nothing to do with the Row in DataFrame).

1) Writing a UDF and a UDAF:
 
 
```scala
package com.dt.spark

import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object SparkSQLUDFUDAF {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "G:/datarguru spark/tool/hadoop-2.6.0")
    val conf = new SparkConf()
    conf.setAppName("SparkSQLUDFUDAF")
    conf.setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Simulated input data
    val bigData = Array("Spark", "Spark", "Hadoop", "Spark", "Hadoop", "Spark", "Spark", "Hadoop", "Spark", "Hadoop")
    // Build a DataFrame from the data
    val bigDataRDD = sc.parallelize(bigData)
    val bigDataRow = bigDataRDD.map(item => Row(item))
    val structType = StructType(Array(StructField("word", StringType, true)))
    val bigDataDF = sqlContext.createDataFrame(bigDataRow, structType)
    bigDataDF.registerTempTable("bigDataTable") // register as a temporary table
    // Register a UDF through the SQLContext; with Scala 2.10.x a UDF can take at most 22 input parameters
    sqlContext.udf.register("computeLength", (input: String) => input.length)
    // Use the UDF directly in SQL, just like a built-in function
    sqlContext.sql("select word, computeLength(word) as length from bigDataTable").show()
    sqlContext.udf.register("wordCount", new MyUDAF)
    sqlContext.sql("select word, wordCount(word) as count, computeLength(word) " +
      "as length from bigDataTable group by word").show()
    while (true) {} // keep the application alive so the web UI can be inspected
  }
}

class MyUDAF extends UserDefinedAggregateFunction { // Ctrl+I to generate the methods to override
  /**
   * Specifies the type of the input data.
   */
  override def inputSchema: StructType = StructType(Array(StructField("input", StringType, true)))

  /**
   * The type of the intermediate result used while aggregating.
   */
  override def bufferSchema: StructType = StructType(Array(StructField("count", IntegerType, true)))

  /**
   * The type of the final result returned by the UDAF.
   */
  override def dataType: DataType = IntegerType

  override def deterministic: Boolean = true

  /**
   * The initial value of each group's buffer before aggregation starts.
   */
  override def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0 }

  /**
   * Called for each new input value of a group: the local (per-partition) aggregation,
   * analogous to a Combiner in the Hadoop MapReduce model (this Row is unrelated to DataFrame's Row).
   */
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Int](0) + 1
  }

  /**
   * After the local reduce on each node, merge the partial results globally.
   */
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Int](0) + buffer2.getAs[Int](0)
  }

  /**
   * Return the final result of the UDAF.
   */
  override def evaluate(buffer: Row): Any = buffer.getAs[Int](0)
}
```
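For comparison, the same length logic can also be wrapped with org.apache.spark.sql.functions.udf and used directly in the DataFrame DSL, without registering it by name for SQL. A minimal sketch against the bigDataDF built above:

```scala
import org.apache.spark.sql.functions.udf

// Sketch: the length logic wrapped with functions.udf, used directly in the DSL
// instead of going through sqlContext.udf.register plus a SQL string.
val computeLengthUdf = udf((input: String) => input.length)
bigDataDF.select(bigDataDF("word"), computeLengthUdf(bigDataDF("word")).as("length")).show()
```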
2) The source of UDFRegistration:
 
 
```scala
/**
 * Functions for registering user-defined functions. Use [[SQLContext.udf]] to access this.
 *
 * @since 1.3.0
 */
class UDFRegistration private[sql] (sqlContext: SQLContext) extends Logging {

  private val functionRegistry = sqlContext.functionRegistry

  protected[sql] def registerPython(name: String, udf: UserDefinedPythonFunction): Unit = {
    log.debug(
      s"""
         | Registering new PythonUDF:
         | name: $name
         | command: ${udf.command.toSeq}
         | envVars: ${udf.envVars}
         | pythonIncludes: ${udf.pythonIncludes}
         | pythonExec: ${udf.pythonExec}
         | dataType: ${udf.dataType}
       """.stripMargin)
    functionRegistry.registerFunction(name, udf.builder)
  }

  /**
   * Register a user-defined aggregate function (UDAF).
   *
   * @param name the name of the UDAF.
   * @param udaf the UDAF needs to be registered.
   * @return the registered UDAF.
   */
  def register(
      name: String,
      udaf: UserDefinedAggregateFunction): UserDefinedAggregateFunction = {
    def builder(children: Seq[Expression]) = ScalaUDAF(children, udaf)
    functionRegistry.registerFunction(name, builder)
    udaf
  }

  // scalastyle:off

  /* register 0-22 were generated by this script

    (0 to 22).map { x =>
      val types = (1 to x).foldRight("RT")((i, s) => {s"A$i, $s"})
      val typeTags = (1 to x).map(i => s"A${i}: TypeTag").foldLeft("RT: TypeTag")(_ + ", " + _)
      val inputTypes = (1 to x).foldRight("Nil")((i, s) => {s"ScalaReflection.schemaFor[A$i].dataType :: $s"})
      println(s"""
        /**
         * Register a Scala closure of ${x} arguments as user-defined function (UDF).
         * @tparam RT return type of UDF.
         * @since 1.3.0
         */
        def register[$typeTags](name: String, func: Function$x[$types]): UserDefinedFunction = {
          val dataType = ScalaReflection.schemaFor[RT].dataType
          val inputTypes = Try($inputTypes).getOrElse(Nil)
          def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e, inputTypes)
          functionRegistry.registerFunction(name, builder)
          UserDefinedFunction(func, dataType, inputTypes)
        }""")
    }

    (1 to 22).foreach { i =>
      val extTypeArgs = (1 to i).map(_ => "_").mkString(", ")
      val anyTypeArgs = (1 to i).map(_ => "Any").mkString(", ")
      val anyCast = s".asInstanceOf[UDF$i[$anyTypeArgs, Any]]"
      val anyParams = (1 to i).map(_ => "_: Any").mkString(", ")
      println(s"""
        |/**
        | * Register a user-defined function with ${i} arguments.
        | * @since 1.3.0
        | */
        |def register(name: String, f: UDF$i[$extTypeArgs, _], returnType: DataType) = {
        |  functionRegistry.registerFunction(
        |    name,
        |    (e: Seq[Expression]) => ScalaUDF(f$anyCast.call($anyParams), returnType, e))
        |}""".stripMargin)
    }
  */

  /**
   * Register a Scala closure of 0 arguments as user-defined function (UDF).
   * @tparam RT return type of UDF.
   * @since 1.3.0
   */
  def register[RT: TypeTag](name: String, func: Function0[RT]): UserDefinedFunction = {
    val dataType = ScalaReflection.schemaFor[RT].dataType
    val inputTypes = Try(Nil).getOrElse(Nil)
    def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e, inputTypes)
    functionRegistry.registerFunction(name, builder)
    UserDefinedFunction(func, dataType, inputTypes)
  }
```
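As the generated register overloads show, register returns a UserDefinedFunction, so the returned handle can also be applied directly to Columns. A small illustrative sketch (the function name prefixed is made up for the example):

```scala
// Sketch: the UserDefinedFunction handle returned by register can also be applied
// directly to Columns in the DSL; "prefixed" is an illustrative name.
val prefixed = sqlContext.udf.register("prefixed", (s: String) => "w_" + s)
bigDataDF.select(prefixed(bigDataDF("word"))).show()
```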

The source of FunctionRegistry is as follows:

 
 
```scala
object FunctionRegistry {

  type FunctionBuilder = Seq[Expression] => Expression

  val expressions: Map[String, (ExpressionInfo, FunctionBuilder)] = Map(
    // misc non-aggregate functions
    expression[Abs]("abs"),
    expression[CreateArray]("array"),
    expression[Coalesce]("coalesce"),
    expression[Explode]("explode"),
    expression[Greatest]("greatest"),
    expression[If]("if"),
    expression[IsNaN]("isnan"),
    expression[IsNull]("isnull"),
    expression[IsNotNull]("isnotnull"),
    expression[Least]("least"),
    expression[Coalesce]("nvl"),
    expression[Rand]("rand"),
    expression[Randn]("randn"),
    expression[CreateStruct]("struct"),
    expression[CreateNamedStruct]("named_struct"),
    expression[Sqrt]("sqrt"),
    expression[NaNvl]("nanvl"),

    // math functions
    expression[Acos]("acos"),
    expression[Asin]("asin"),
    expression[Atan]("atan"),
    expression[Atan2]("atan2"),
    expression[Bin]("bin"),
    expression[Cbrt]("cbrt"),
    expression[Ceil]("ceil"),
    expression[Ceil]("ceiling"),
    expression[Cos]("cos"),
    expression[Cosh]("cosh"),
    expression[Conv]("conv"),
    expression[EulerNumber]("e"),
    expression[Exp]("exp"),
    expression[Expm1]("expm1"),
    expression[Floor]("floor"),
    expression[Factorial]("factorial"),
    expression[Hypot]("hypot"),
    expression[Hex]("hex"),
    expression[Logarithm]("log"),
    expression[Log]("ln"),
    expression[Log10]("log10"),
    expression[Log1p]("log1p"),
    expression[Log2]("log2"),
    expression[UnaryMinus]("negative"),
    expression[Pi]("pi"),
    expression[Pow]("pow"),
    expression[Pow]("power"),
    expression[Pmod]("pmod"),
    expression[UnaryPositive]("positive"),
    expression[Rint]("rint"),
    expression[Round]("round"),
    expression[ShiftLeft]("shiftleft"),
    expression[ShiftRight]("shiftright"),
    expression[ShiftRightUnsigned]("shiftrightunsigned"),
    expression[Signum]("sign"),
    expression[Signum]("signum"),
    expression[Sin]("sin"),
    expression[Sinh]("sinh"),
    expression[Tan]("tan"),
    expression[Tanh]("tanh"),
    expression[ToDegrees]("degrees"),
    expression[ToRadians]("radians"),

    // aggregate functions
    expression[HyperLogLogPlusPlus]("approx_count_distinct"),
    expression[Average]("avg"),
    expression[Corr]("corr"),
    expression[Count]("count"),
    expression[First]("first"),
    expression[First]("first_value"),
    expression[Last]("last"),
    expression[Last]("last_value"),
    expression[Max]("max"),
    expression[Average]("mean"),
    expression[Min]("min"),
    expression[StddevSamp]("stddev"),
    expression[StddevPop]("stddev_pop"),
    expression[StddevSamp]("stddev_samp"),
    expression[Sum]("sum"),
    expression[VarianceSamp]("variance"),
    expression[VariancePop]("var_pop"),
    expression[VarianceSamp]("var_samp"),
    expression[Skewness]("skewness"),
    expression[Kurtosis]("kurtosis"),

    // string functions
    expression[Ascii]("ascii"),
    expression[Base64]("base64"),
    expression[Concat]("concat"),
    expression[ConcatWs]("concat_ws"),
    expression[Encode]("encode"),
    expression[Decode]("decode"),
    expression[FindInSet]("find_in_set"),
    expression[FormatNumber]("format_number"),
    expression[GetJsonObject]("get_json_object"),
    expression[InitCap]("initcap"),
    expression[JsonTuple]("json_tuple"),
    expression[Lower]("lcase"),
    expression[Lower]("lower"),
    expression[Length]("length"),
    expression[Levenshtein]("levenshtein"),
    expression[RegExpExtract]("regexp_extract"),
    expression[RegExpReplace]("regexp_replace"),
    expression[StringInstr]("instr"),
    expression[StringLocate]("locate"),
    expression[StringLPad]("lpad"),
    expression[StringTrimLeft]("ltrim"),
    expression[FormatString]("format_string"),
    expression[FormatString]("printf"),
    expression[StringRPad]("rpad"),
    expression[StringRepeat]("repeat"),
    expression[StringReverse]("reverse"),
    expression[StringTrimRight]("rtrim"),
    expression[SoundEx]("soundex"),
    expression[StringSpace]("space"),
    expression[StringSplit]("split"),
    expression[Substring]("substr"),
    expression[Substring]("substring"),
    expression[SubstringIndex]("substring_index"),
    expression[StringTranslate]("translate"),
    expression[StringTrim]("trim"),
    expression[UnBase64]("unbase64"),
    expression[Upper]("ucase"),
    expression[Unhex]("unhex"),
    expression[Upper]("upper"),
    ...
```

As you can see, Spark SQL's built-in functions are registered in exactly the same way as UDFs.
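Because built-in functions and UDFs end up in the same FunctionRegistry, they are invoked identically by name in SQL. A small sketch reusing the temporary table and the computeLength UDF registered earlier:

```scala
// Sketch: a built-in function (length) and the UDF registered earlier (computeLength)
// are both resolved by name through the same FunctionRegistry at analysis time.
sqlContext.sql("SELECT word, length(word) AS builtin_len, computeLength(word) AS udf_len FROM bigDataTable").show()
```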

4. Spark SQL Thrift Server in Practice

The Thrift JDBC/ODBC server implemented here corresponds to the HiveServer2 in Hive 1.2.1. You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.
Start the JDBC/ODBC server:

 
 
```bash
ps -aux | grep hive
hive --service metastore &    # start the Hive metastore first
[1] 28268
./sbin/start-thriftserver.sh
# Now you can use beeline to test the Thrift JDBC/ODBC server:
./bin/beeline
# Connect to the JDBC/ODBC server in beeline with:
beeline> !connect jdbc:hive2://master:10000
# user: root
# password: (empty)
# then run Hive commands as usual
```
Accessing the Thrift Server from Java via JDBC:
 
 
```java
package com.dt.sparksql;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/**
 * Demonstrates accessing the Thrift Server from Java via JDBC, which goes through Spark SQL
 * to reach Hive. This is the most common setup in enterprise development.
 * @author dt_sparl
 */
public class SparkSQLJDBC2ThriftServer {
    public static void main(String[] args) throws SQLException {
        String sqlTest = "select name from people where age = ?";
        Connection conn = null;
        ResultSet resultSet = null;
        try {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            conn = DriverManager.getConnection("jdbc:hive2://<master>:<10001>/<default>?"
                    + "hive.server2.transport.mode=http;hive.server2.thrift.http.path=<cliserver>",
                    "root", "");
            PreparedStatement preparedStatement = conn.prepareStatement(sqlTest);
            preparedStatement.setInt(1, 30);
            resultSet = preparedStatement.executeQuery();
            while (resultSet.next()) {
                System.out.println(resultSet.getString(1)); // the underlying data could be stored as Parquet
            }
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } finally {
            if (resultSet != null) resultSet.close();
            if (conn != null) conn.close();
        }
    }
}
```
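To compile and run this JDBC client, the Hive JDBC driver must be on the classpath. A hedged build.sbt sketch; the versions are assumptions and should be aligned with your cluster (1.2.1 corresponds to the Hive version the Thrift server targets):

```scala
// build.sbt sketch (versions are assumptions; align them with your cluster)
libraryDependencies ++= Seq(
  "org.apache.hive" % "hive-jdbc" % "1.2.1",
  "org.apache.hadoop" % "hadoop-common" % "2.6.0"
)
```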
