spark-sql-catalyst
@(spark)[sql][catalyst]
In short, this part does the optimizer's work. There is a paper on it that explains things very clearly and can be read as the high-level design. There is also a blog post covering roughly the same content.
Overall, what Catalyst does here is basically what a traditional relational database does:
1. parse (turn the SQL statement into a valid syntax tree)
2. resolve (verify that the referenced columns, tables, and so on actually exist, and bind the table/column schemas to the concrete names)
3. generate the concrete logical plan; for the details see catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala, with typical operators such as filter, project, sort, union, and so on
4. optimize the plan with a rule-based optimizer; the code is in catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (a small sketch of the rule idea follows this list)

Strictly speaking, Catalyst has no hard dependency on Spark itself; it can be viewed as a standalone SQL optimizer.
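To make the rule-based idea concrete, here is a minimal, self-contained Scala sketch; Expr, Lit, Attr, Add and ConstantFolding are made-up names for illustration, not Catalyst's real classes. It rewrites an expression tree by pattern matching on its nodes, which is exactly the shape of an optimizer rule:

```scala
// A tiny expression tree, standing in for Catalyst's TreeNode hierarchy.
sealed trait Expr
case class Lit(value: Int) extends Expr
case class Attr(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // One "rule": fold Add(Lit, Lit) into a single Lit, bottom-up.
  def apply(e: Expr): Expr = e match {
    case Add(l, r) =>
      (apply(l), apply(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)   // the actual rewrite
        case (fl, fr)         => Add(fl, fr)  // children already folded
      }
    case other => other
  }
}

object RuleDemo extends App {
  val plan = Add(Add(Lit(1), Lit(2)), Attr("x"))
  println(ConstantFolding(plan)) // Add(Lit(3),Attr(x))
}
```

Catalyst's real rules do the same kind of rewrite, except the trees are LogicalPlan / Expression nodes, the matching goes through the transform methods on TreeNode, and Optimizer.scala groups the rules into batches that are applied repeatedly until the plan stops changing.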
types
One thing worth mentioning here is UserDefinedType:
```scala
/**
 * ::DeveloperApi::
 * The data type for User Defined Types (UDTs).
 *
 * This interface allows a user to make their own classes more interoperable with SparkSQL;
 * e.g., by creating a [[UserDefinedType]] for a class X, it becomes possible to create
 * a `DataFrame` which has class X in the schema.
 *
 * For SparkSQL to recognize UDTs, the UDT must be annotated with
 * [[SQLUserDefinedType]].
 *
 * The conversion via `serialize` occurs when instantiating a `DataFrame` from another RDD.
 * The conversion via `deserialize` occurs when reading from a `DataFrame`.
 */
@DeveloperApi
abstract class UserDefinedType[UserType] extends DataType with Serializable {
```
Let's look at an example:

```scala
class PointUDT extends UserDefinedType[Point] {
  def dataType = StructType(Seq( // Our native structure
    StructField("x", DoubleType),
    StructField("y", DoubleType)
  ))
  def serialize(p: Point) = Row(p.x, p.y)
  def deserialize(r: Row) =
    Point(r.getDouble(0), r.getDouble(1))
}
```
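For Spark SQL to actually pick this UDT up, the user class itself has to carry the SQLUserDefinedType annotation mentioned in the scaladoc above. A minimal sketch, assuming the Point case class goes with the PointUDT from the example (and that PointUDT implements all of UserDefinedType's abstract members):

```scala
import org.apache.spark.sql.types.SQLUserDefinedType

// The annotation tells Spark SQL which UDT converts this class to and from rows.
@SQLUserDefinedType(udt = classOf[PointUDT])
case class Point(x: Double, y: Double)
```

With that annotation in place, a collection of Points can become a DataFrame whose schema contains the point type; serialize runs when the DataFrame is built and deserialize when rows are read back, exactly as the scaladoc describes.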