Spark Dataset

Many of the follow-on operations in Spark SQL, such as select() and filter(), are defined on Dataset. For example, select() produces a new Project logical plan and filter() produces a Filter logical plan. A Dataset has two broad kinds of sources: the format() path, which reads data through DataSource subclasses (csv, json, text, and so on), and the sql() path, which starts from a SQL statement that is parsed and planned into a SparkPlan. In both cases the data is ultimately retrieved through an iterator over RDD[InternalRow].
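A minimal sketch of the two entry points (the file path, view name, and column names are hypothetical; assumes a local SparkSession):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dataset-demo").getOrCreate()

// format()/load() path: the DataFrameReader resolves a DataSource and wraps it in a LogicalRelation.
val df = spark.read.format("json").load("/tmp/people.json")   // hypothetical path

// select()/filter() put Project / Filter nodes on top of the logical plan.
val adults = df.select("name", "age").filter("age > 18")
adults.explain(true)   // shows the logical plans and the final physical plan

// sql() path: the statement is parsed into a logical plan and then planned into a SparkPlan.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 18").explain(true)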

format() returns a DataFrameReader, and load() then returns a DataFrame. The core implementation of load():

@scala.annotation.varargs
  def load(paths: String*): DataFrame = {
    if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
      throw new AnalysisException("Hive data source can only be used with tables, you can not " +
        "read files of Hive data source directly.")
    }

    sparkSession.baseRelationToDataFrame(
      DataSource.apply(
        sparkSession,
        paths = paths,
        userSpecifiedSchema = userSpecifiedSchema,
        className = source,
        options = extraOptions.toMap).resolveRelation())
  }

 

In SparkSession:

  def baseRelationToDataFrame(baseRelation: BaseRelation): DataFrame = {
    Dataset.ofRows(self, LogicalRelation(baseRelation))
  }

 

    1. BaseRelation

An abstract class representing a relation over an underlying data source; subclasses generally return their data as RDD[Row] through scan methods. Data sources such as json, csv, and text each have a corresponding BaseRelation subclass.

Some of the scan traits that subclasses mix in are defined in sql/sources/interfaces.scala:

/**
 * A BaseRelation that can produce all of its tuples as an RDD of Row objects.
 *
 * @since 1.3.0
 */
@InterfaceStability.Stable
trait TableScan {
  def buildScan(): RDD[Row]
}

/**
 * A BaseRelation that can eliminate unneeded columns before producing an RDD
 * containing all of its tuples as Row objects.
 *
 * @since 1.3.0
 */
@InterfaceStability.Stable
trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}

/**
 * A BaseRelation that can eliminate unneeded columns and filter using selected
 * predicates before producing an RDD containing all matching tuples as Row objects.
 *
 * The actual filter should be the conjunction of all `filters`,
 * i.e. they should be "and" together.
 *
 * The pushed down filters are currently purely an optimization as they will all be evaluated
 * again.  This means it is safe to use them with methods that produce false positives such
 * as filtering partitions based on a bloom filter.
 *
 * @since 1.3.0
 */
@InterfaceStability.Stable
trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
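As a concrete illustration of the scan contract, here is a minimal sketch of a custom relation mixing in TableScan; the class name, schema, and hard-coded rows are illustrative and not part of Spark:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Illustrative relation that exposes two hard-coded rows through buildScan().
class TwoRowRelation(override val sqlContext: SQLContext) extends BaseRelation with TableScan {

  // The schema this relation exposes to the analyzer.
  override def schema: StructType =
    StructType(StructField("name", StringType) :: StructField("age", IntegerType) :: Nil)

  // What the planner ultimately calls to obtain the data as RDD[Row].
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
}

A real source would additionally provide a RelationProvider so that format(...) can locate it; the resulting BaseRelation is then wrapped in a LogicalRelation by baseRelationToDataFrame, exactly as shown above.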

 

    2. LogicalRelation

The logical relation node, a basic LeafNode of the logical plan. A LogicalRelation is created from a BaseRelation.

 

    3. SparkStrategy

Strategies turn a LogicalPlan (for example, the LogicalRelation produced from a BaseRelation) into a SparkPlan. A series of strategies is defined here that decides how the various data sources and operators are handled.

The most central and basic one is BasicOperators, which defines the execution logic for the various LogicalPlan nodes, as follows:

// Can we automate these 'pass through' operations?
  object BasicOperators extends Strategy {
    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
      case r: RunnableCommand => ExecutedCommandExec(r) :: Nil

      case MemoryPlan(sink, output) =>
        val encoder = RowEncoder(sink.schema)
        LocalTableScanExec(output, sink.allData.map(r => encoder.toRow(r).copy())) :: Nil

      case logical.Distinct(child) =>
        throw new IllegalStateException(
          "logical distinct operator should have been replaced by aggregate in the optimizer")
      case logical.Intersect(left, right) =>
        throw new IllegalStateException(
          "logical intersect operator should have been replaced by semi-join in the optimizer")
      case logical.Except(left, right) =>
        throw new IllegalStateException(
          "logical except operator should have been replaced by anti-join in the optimizer")

      case logical.DeserializeToObject(deserializer, objAttr, child) =>
        execution.DeserializeToObjectExec(deserializer, objAttr, planLater(child)) :: Nil
      case logical.SerializeFromObject(serializer, child) =>
        execution.SerializeFromObjectExec(serializer, planLater(child)) :: Nil
      case logical.MapPartitions(f, objAttr, child) =>
        execution.MapPartitionsExec(f, objAttr, planLater(child)) :: Nil
      case logical.MapPartitionsInR(f, p, b, is, os, objAttr, child) =>
        execution.MapPartitionsExec(
          execution.r.MapPartitionsRWrapper(f, p, b, is, os), objAttr, planLater(child)) :: Nil
      case logical.FlatMapGroupsInR(f, p, b, is, os, key, value, grouping, data, objAttr, child) =>
        execution.FlatMapGroupsInRExec(f, p, b, is, os, key, value, grouping,
          data, objAttr, planLater(child)) :: Nil
      case logical.MapElements(f, _, _, objAttr, child) =>
        execution.MapElementsExec(f, objAttr, planLater(child)) :: Nil
      case logical.AppendColumns(f, _, _, in, out, child) =>
        execution.AppendColumnsExec(f, in, out, planLater(child)) :: Nil
      case logical.AppendColumnsWithObject(f, childSer, newSer, child) =>
        execution.AppendColumnsWithObjectExec(f, childSer, newSer, planLater(child)) :: Nil
      case logical.MapGroups(f, key, value, grouping, data, objAttr, child) =>
        execution.MapGroupsExec(f, key, value, grouping, data, objAttr, planLater(child)) :: Nil
      case logical.FlatMapGroupsWithState(
          f, key, value, grouping, data, output, _, _, _, timeout, child) =>
        execution.MapGroupsExec(
          f, key, value, grouping, data, output, timeout, planLater(child)) :: Nil
      case logical.CoGroup(f, key, lObj, rObj, lGroup, rGroup, lAttr, rAttr, oAttr, left, right) =>
        execution.CoGroupExec(
          f, key, lObj, rObj, lGroup, rGroup, lAttr, rAttr, oAttr,
          planLater(left), planLater(right)) :: Nil

      case logical.Repartition(numPartitions, shuffle, child) =>
        if (shuffle) {
          ShuffleExchange(RoundRobinPartitioning(numPartitions), planLater(child)) :: Nil
        } else {
          execution.CoalesceExec(numPartitions, planLater(child)) :: Nil
        }
      case logical.Sort(sortExprs, global, child) =>
        execution.SortExec(sortExprs, global, planLater(child)) :: Nil
      case logical.Project(projectList, child) =>
        execution.ProjectExec(projectList, planLater(child)) :: Nil
      case logical.Filter(condition, child) =>
        execution.FilterExec(condition, planLater(child)) :: Nil
      case f: logical.TypedFilter =>
        execution.FilterExec(f.typedCondition(f.deserializer), planLater(f.child)) :: Nil
      case e @ logical.Expand(_, _, child) =>
        execution.ExpandExec(e.projections, e.output, planLater(child)) :: Nil
      case logical.Window(windowExprs, partitionSpec, orderSpec, child) =>
        execution.window.WindowExec(windowExprs, partitionSpec, orderSpec, planLater(child)) :: Nil
      case logical.Sample(lb, ub, withReplacement, seed, child) =>
        execution.SampleExec(lb, ub, withReplacement, seed, planLater(child)) :: Nil
      case logical.LocalRelation(output, data) =>
        LocalTableScanExec(output, data) :: Nil
      case logical.LocalLimit(IntegerLiteral(limit), child) =>
        execution.LocalLimitExec(limit, planLater(child)) :: Nil
      case logical.GlobalLimit(IntegerLiteral(limit), child) =>
        execution.GlobalLimitExec(limit, planLater(child)) :: Nil
      case logical.Union(unionChildren) =>
        execution.UnionExec(unionChildren.map(planLater)) :: Nil
      case g @ logical.Generate(generator, join, outer, _, _, child) =>
        execution.GenerateExec(
          generator, join = join, outer = outer, g.qualifiedGeneratorOutput,
          planLater(child)) :: Nil
      case logical.OneRowRelation =>
        execution.RDDScanExec(Nil, singleRowRdd, "OneRowRelation") :: Nil
      case r: logical.Range =>
        execution.RangeExec(r) :: Nil
      case logical.RepartitionByExpression(expressions, child, numPartitions) =>
        exchange.ShuffleExchange(HashPartitioning(
          expressions, numPartitions), planLater(child)) :: Nil
      case ExternalRDD(outputObjAttr, rdd) => ExternalRDDScanExec(outputObjAttr, rdd) :: Nil
      case r: LogicalRDD =>
        RDDScanExec(r.output, r.rdd, "ExistingRDD", r.outputPartitioning, r.outputOrdering) :: Nil
      case h: ResolvedHint => planLater(h.child) :: Nil
      case _ => Nil
    }
  }
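Extra strategies can be plugged in alongside BasicOperators through the experimental-methods hook; a minimal sketch (the strategy name is illustrative):

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Illustrative strategy: returning Nil means "not handled here", so planning
// falls through to the built-in strategies and eventually to BasicOperators.
object PassThroughStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case _ => Nil
  }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.experimental.extraStrategies = Seq(PassThroughStrategy)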

 

    4. DataSource

The base class for data sources; csv, json, text, and so on are extensions of it, and the common resolution methods live here.

 

    5. DataSourceScanExec

Defines how a DataSource is physically executed to produce InternalRow data.

Its main subclasses/implementations:

FileSourceScanExec: reads files and produces an RDD[InternalRow]. It calls readFile on the concrete DataSource/BaseRelation to build an iterator of InternalRow over the source, and internally uses FileScanRDD to read the data partition by partition.

RowDataSourceScanExec: operates on rows, e.g. Project-like operations that select the specified output columns.

 

 

These are the two main implementations.

      1. Handling of the JDBC data source

The JDBC source first defines JDBCRelation, which extends BaseRelation and overrides buildScan. buildScan returns an RDD[Row] by creating a JDBCRDD, which uses the JDBC driver to build and run a SQL query against the database. When generating the query, JDBCRDD assembles the SQL string, including the translation of pushed-down filter conditions.

JDBCRDD defines a compute() method:

override def compute(thePart: Partition, context: TaskContext): Iterator[InternalRow]

Its final statement is:

  CompletionIterator[InternalRow, Iterator[InternalRow]](
      new InterruptibleIterator(context, rowsIterator), close())

It wraps the Iterator produced by the JDBC driver in a new Iterator that calls close() once the last row has been consumed; close() closes the JDBC connection and releases its resources.
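From the user side this whole path is triggered through the jdbc format; a minimal sketch with placeholder URL, table, and credentials, assuming a SparkSession named spark as in the earlier sketch:

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")   // placeholder URL
  .option("dbtable", "public.users")                          // placeholder table
  .option("user", "test")
  .option("password", "secret")
  .load()

// Column pruning and supported filters are pushed into the SQL that JDBCRelation/JDBCRDD
// generate, so this runs roughly as: SELECT id, name FROM public.users WHERE id > 100
jdbcDF.select("id", "name").filter("id > 100").show()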

      2. UnsafeRow

UnsafeRow is a subclass of InternalRow that represents a row as a byte array. For numeric fields such as integers and floating-point values, the corresponding slot in the UnsafeRow stores the value directly; for large values such as strings and arrays, the slot stores the location of the data.

An excerpt from the UnsafeRow class documentation:

/**
 * An Unsafe implementation of Row which is backed by raw memory instead of Java objects.
 *
 * Each tuple has three parts: [null bit set] [values] [variable length portion]
 *
 * The bit set is used for null tracking and is aligned to 8-byte word boundaries.  It stores
 * one bit per field.
 *
 * In the `values` region, we store one 8-byte word per field. For fields that hold fixed-length
 * primitive types, such as long, double, or int, we store the value directly in the word. For
 * fields with non-primitive or variable-length values, we store a relative offset (w.r.t. the
 * base address of the row) that points to the beginning of the variable-length field, and length
 * (they are combined into a long).
 *
 * Instances of `UnsafeRow` act as pointers to row data stored in this format.
 */
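A small sketch of that layout, using the internal (and not API-stable) UnsafeProjection to encode a generic InternalRow into an UnsafeRow; the field types and values are illustrative:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StringType}
import org.apache.spark.unsafe.types.UTF8String

val fields: Array[DataType] = Array(LongType, IntegerType, StringType)
val toUnsafe = UnsafeProjection.create(fields)

// Layout: 8-byte null bit set + 3 * 8-byte fixed slots + variable-length region for the string.
val unsafeRow = toUnsafe(InternalRow(42L, 7, UTF8String.fromString("spark")))

println(unsafeRow.getSizeInBytes)    // 8 + 3*8 = 32 fixed bytes, plus the padded string bytes
println(unsafeRow.getUTF8String(2))  // the string slot stores offset and length into the variable region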

 

      3. HadoopFsRelation

All file-based DataSource reads go through HadoopFsRelation, which extends BaseRelation.

 

    6. Aggregation operations

Ordinary SQL operations act on single rows, so it is enough to nest operators layer by layer on top of the Iterator. Simple aggregations such as sum and average are also easy: adding a global accumulator variable on top of the iterator does the job. Some aggregations are not that simple, for example sorting and joins. The join data path is worth studying closely to see how it is implemented internally.
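A plain-Scala sketch of that accumulator-on-an-iterator idea (illustrative, not Spark code):

// Sum and average over an Iterator[Long] in a single pass, the way a physical operator
// could fold a global accumulator over its child's row iterator.
def sumAndAvg(rows: Iterator[Long]): (Long, Double) = {
  var sum = 0L
  var count = 0L
  while (rows.hasNext) {
    sum += rows.next()
    count += 1
  }
  (sum, if (count == 0) 0.0 else sum.toDouble / count)
}

sumAndAvg(Iterator(1L, 2L, 3L, 4L))   // (10, 2.5)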

      1. dropDuplicates

Removes rows that are duplicated on the specified columns, keeping a single copy.

A Deduplicate logical node is defined; keys specifies which columns are used to decide whether rows are duplicates:

case class Deduplicate(
    keys: Seq[Attribute],
    child: LogicalPlan,
    streaming: Boolean) extends UnaryNode {
  override def output: Seq[Attribute] = child.output
}
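From the API side (column names are illustrative), dropDuplicates just names the key columns, which become the keys of the Deduplicate node:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 1), ("alice", 2), ("bob", 3)).toDF("name", "n")

df.dropDuplicates("name").show()   // keeps one row per distinct name
df.dropDuplicates().show()         // all columns are keys, equivalent to distinct()

For a batch Dataset the optimizer replaces Deduplicate with an Aggregate over the key columns; the streaming = true case is the one that reaches StreamingDeduplicateExec below.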

The corresponding physical plan is produced in SparkStrategies:

/**
   * Used to plan the streaming deduplicate operator.
   */
  object StreamingDeduplicationStrategy extends Strategy {
    override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
      case Deduplicate(keys, child, true) =>
        StreamingDeduplicateExec(keys, planLater(child)) :: Nil

      case _ => Nil
    }
  }

Following the thread, here is the key method of StreamingDeduplicateExec; the code below is in sql/execution/streaming/statefulOperators.scala:

override protected def doExecute(): RDD[InternalRow] = {
    metrics // force lazy init at driver

    child.execute().mapPartitionsWithStateStore(
      getStateId.checkpointLocation,
      getStateId.operatorId,
      getStateId.batchId,
      keyExpressions.toStructType,
      child.output.toStructType,
      sqlContext.sessionState,
      Some(sqlContext.streams.stateStoreCoordinator)) { (store, iter) =>
      val getKey = GenerateUnsafeProjection.generate(keyExpressions, child.output)
      val numOutputRows = longMetric("numOutputRows")
      val numTotalStateRows = longMetric("numTotalStateRows")
      val numUpdatedStateRows = longMetric("numUpdatedStateRows")

      val baseIterator = watermarkPredicateForData match {
        case Some(predicate) => iter.filter(row => !predicate.eval(row))
        case None => iter
      }

      val result = baseIterator.filter { r =>
        val row = r.asInstanceOf[UnsafeRow]
        val key = getKey(row)
        val value = store.get(key)
        if (value.isEmpty) {
          store.put(key.copy(), StreamingDeduplicateExec.EMPTY_ROW)
          numUpdatedStateRows += 1
          numOutputRows += 1
          true
        } else {
          // Drop duplicated rows
          false
        }
      }

      CompletionIterator[InternalRow, Iterator[InternalRow]](result, {
        watermarkPredicateForKeys.foreach(f => store.remove(f.eval _))
        store.commit()
        numTotalStateRows += store.numKeys()
      })
    }
  }

Key classes and methods used here:

1. mapPartitionsWithStateStore

This brings in StateStore. Let's follow the code to see what StateStore is for.

      2. StateStore

sql/execution/streaming/state/package.scala defines the StateStoreOps class, which provides the mapPartitionsWithStateStore method; its code:

/** Map each partition of an RDD along with data in a [[StateStore]]. */
    def mapPartitionsWithStateStore[U: ClassTag](
        sqlContext: SQLContext,
        checkpointLocation: String,
        operatorId: Long,
        storeVersion: Long,
        keySchema: StructType,
        valueSchema: StructType)(
        storeUpdateFunction: (StateStore, Iterator[T]) => Iterator[U]): StateStoreRDD[T, U] = {

      mapPartitionsWithStateStore(
        checkpointLocation,
        operatorId,
        storeVersion,
        keySchema,
        valueSchema,
        sqlContext.sessionState,
        Some(sqlContext.streams.stateStoreCoordinator))(
        storeUpdateFunction)
    }

    /** Map each partition of an RDD along with data in a [[StateStore]]. */
    private[streaming] def mapPartitionsWithStateStore[U: ClassTag](
        checkpointLocation: String,
        operatorId: Long,
        storeVersion: Long,
        keySchema: StructType,
        valueSchema: StructType,
        sessionState: SessionState,
        storeCoordinator: Option[StateStoreCoordinatorRef])(
        storeUpdateFunction: (StateStore, Iterator[T]) => Iterator[U]): StateStoreRDD[T, U] = {

      val cleanedF = dataRDD.sparkContext.clean(storeUpdateFunction)
      val wrappedF = (store: StateStore, iter: Iterator[T]) => {
        // Abort the state store in case of error
        TaskContext.get().addTaskCompletionListener(_ => {
          if (!store.hasCommitted) store.abort()
        })
        cleanedF(store, iter)
      }
      new StateStoreRDD(
        dataRDD,
        wrappedF,
        checkpointLocation,
        operatorId,
        storeVersion,
        keySchema,
        valueSchema,
        sessionState,
        storeCoordinator)
    }
  }

 

sql/execution/streaming/state/StateStore.scala defines the StateStore abstraction. It is used when processing a streaming Iterator, for example for operations such as dropping duplicate rows.

 

      3. HDFSBackedStateStoreProvider

An implementation of StateStore.

 

      4. Distinct

The implementation of the distinct operation.

I have not read through all of it yet, so here is the programming idea first.

A StateStore object is defined to manage intermediate storage for the data source. As each row is saved into the StateStore, duplicates can be eliminated; an Iterator is then produced from the StateStore as the result data, which completes aggregation-like operations such as Distinct. Performance optimization and extension happen inside the StateStore, which keeps the framework itself stable.

 

      5. BytesToBytesMap

A mechanism for holding <key, value> data in memory, operating directly on byte arrays.

/**
 * An append-only hash map where keys and values are contiguous regions of bytes.
 *
 * This is backed by a power-of-2-sized hash table, using quadratic probing with triangular numbers,
 * which is guaranteed to exhaust the space.
 *
 * The map can support up to 2^29 keys. If the key cardinality is higher than this, you should
 * probably be using sorting instead of hashing for better cache locality.
 *
 * The key and values under the hood are stored together, in the following format:
 *   Bytes 0 to 4: len(k) (key length in bytes) + len(v) (value length in bytes) + 4
 *   Bytes 4 to 8: len(k)
 *   Bytes 8 to 8 + len(k): key data
 *   Bytes 8 + len(k) to 8 + len(k) + len(v): value data
 *   Bytes 8 + len(k) + len(v) to 8 + len(k) + len(v) + 8: pointer to next pair
 *
 * This means that the first four bytes store the entire record (key + value) length. This format
 * is compatible with {@link org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter},
 * so we can pass records from this map directly into the sorter to sort records in place.
 */

 

      6. UnsafeExternalSorter

The engine's external (memory plus spill) sorter, called wherever sorting is needed.

You hand it a comparator; it maintains a memory region that holds the records to be sorted. insertRecord inserts one record at a time, UnsafeExternalSorter keeps the data ordered as it is inserted, and it finally returns an Iterator over the sorted records. In memory the data is stored as <key, value> byte arrays.

Its main concern is memory management; the sorting itself reuses the logic in Sorter.java and TimSort.java.

 

      7. UnsafeInMemorySorter

The in-memory sorter, which uses AlphaSort-style prefix comparison. It relies first on SortComparator.

SortComparator

It holds the two records being compared via the memoryManager, passing each record's base object and offset to the comparator; a result of 1, 0, or -1 means greater than, equal to, or less than.

A comparison involves the record's memory address plus a prefix. If the prefixes differ, the prefix comparison alone determines the result; if they are equal, the base address of the memory block and the record's offset within it are resolved through the memoryManager and the two memory blocks are compared. The concrete record format inside a block is defined by the caller, and how a block is interpreted is supplied through the recordComparator parameter.

      8. Agg

The implementation of the family of aggregate operations.

Dataset.agg is mostly implemented by delegating to dataset.groupBy().agg().

groupByKey

First, the groupByKey method:

def groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T] = {
    val inputPlan = logicalPlan
    val withGroupingKey = AppendColumns(func, inputPlan)
    val executed = sparkSession.sessionState.executePlan(withGroupingKey)

    new KeyValueGroupedDataset(
      encoderFor[K],
      encoderFor[T],
      executed,
      inputPlan.output,
      withGroupingKey.newColumns)
  }

It appends a column computed from the grouping function to the plan and builds a KeyValueGroupedDataset, which represents the Dataset after grouping.
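For example (the case class and grouping key are illustrative; assumes a local SparkSession and spark.implicits._ as in the earlier sketches):

case class Employee(department: String, salary: Double)

val ds = Seq(Employee("eng", 100.0), Employee("eng", 80.0), Employee("sales", 60.0)).toDS()

// groupByKey appends the key column (AppendColumns) and returns a KeyValueGroupedDataset.
val grouped = ds.groupByKey(_.department)

grouped.count().show()                // one (department, count) row per group
grouped.mapGroups { (dept, rows) =>   // arbitrary per-group processing
  (dept, rows.map(_.salary).sum)
}.show()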

groupBy

The grouping function.

@scala.annotation.varargs
  def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
    val colNames: Seq[String] = col1 +: cols
    RelationalGroupedDataset(
      toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
  }

 

RelationalGroupedDataset

groupBy operations all turn the Dataset into a RelationalGroupedDataset and manipulate the data through it. Here are the key definitions and methods of RelationalGroupedDataset.

For a GroupByType groupBy, RelationalGroupedDataset converts the operation into an Aggregate logical plan for execution. The conversion code:

private[this] def toDF(aggExprs: Seq[Expression]): DataFrame = {
    val aggregates = if (df.sparkSession.sessionState.conf.dataFrameRetainGroupColumns) {
      groupingExprs ++ aggExprs
    } else {
      aggExprs
    }

    val aliasedAgg = aggregates.map(alias)

    groupType match {
      case RelationalGroupedDataset.GroupByType =>
        Dataset.ofRows(
          df.sparkSession, Aggregate(groupingExprs, aliasedAgg, df.logicalPlan))
      case RelationalGroupedDataset.RollupType =>
        Dataset.ofRows(
          df.sparkSession, Aggregate(Seq(Rollup(groupingExprs)), aliasedAgg, df.logicalPlan))
      case RelationalGroupedDataset.CubeType =>
        Dataset.ofRows(
          df.sparkSession, Aggregate(Seq(Cube(groupingExprs)), aliasedAgg, df.logicalPlan))
      case RelationalGroupedDataset.PivotType(pivotCol, values) =>
        val aliasedGrps = groupingExprs.map(alias)
        Dataset.ofRows(
          df.sparkSession, Pivot(aliasedGrps, pivotCol, values, aggExprs, df.logicalPlan))
    }
  }

Next, the implementation of one of its key methods, agg:

/**
   * (Scala-specific) Compute aggregates by specifying the column names and
   * aggregate methods. The resulting `DataFrame` will also contain the grouping columns.
   *
   * The available aggregate methods are `avg`, `max`, `min`, `sum`, `count`.
   * {{{
   *   // Selects the age of the oldest employee and the aggregate expense for each department
   *   df.groupBy("department").agg(
   *     "age" -> "max",
   *     "expense" -> "sum"
   *   )
   * }}}
   *
   * @since 1.3.0
   */
  def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame = {
    toDF((aggExpr +: aggExprs).map { case (colName, expr) =>
      strToExpr(expr)(df(colName).expr)
    })
  }

The toDF method converts this DataFrame into the target DataFrame, with the aggregation placed in the conversion. The other aggregates such as count, avg, min, and max are all executed through toDF as well. toDF describes the logical plan (LogicalPlan) with an Aggregate node, so studying the aggregate functions really comes down to studying how Aggregate is implemented.
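From the user side, every call below ends up as an Aggregate logical plan built through toDF (the data and column names are illustrative; assumes a local SparkSession):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, sum}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("eng", 35, 100.0),
  ("eng", 42, 80.0),
  ("sales", 29, 60.0)
).toDF("department", "age", "expense")

// String-to-function form, as in the scaladoc example above.
df.groupBy("department").agg("age" -> "max", "expense" -> "sum").show()

// Column-expression form; both produce Aggregate(groupingExprs, aggExprs, child).
df.groupBy("department").agg(max("age"), sum("expense"), avg("expense")).show()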

Aggregate

Let's look at Aggregate's main structure. Its full name is org.apache.spark.sql.catalyst.plans.logical.Aggregate, defined in catalyst/plans/logical/basicLogicalOperators.scala.

Its definition:

case class Aggregate(
    groupingExpressions: Seq[Expression],
    aggregateExpressions: Seq[NamedExpression],
    child: LogicalPlan)
  extends UnaryNode {
  ...
}

Now let's see how the corresponding physical plan for Aggregate is defined.

object Aggregation extends Strategy

Defined in SparkStrategies, it decides how an Aggregate logical plan is handled:

val aggregateOperator =
          if (functionsWithDistinct.isEmpty) {
            aggregate.AggUtils.planAggregateWithoutDistinct(
              groupingExpressions,
              aggregateExpressions,
              resultExpressions,
              planLater(child))
          } else {
            aggregate.AggUtils.planAggregateWithOneDistinct(
              groupingExpressions,
              functionsWithDistinct,
              functionsWithoutDistinct,
              resultExpressions,
              planLater(child))
          }

        aggregateOperator

 

AggUtils.planAggregateWithoutDistinct (or planAggregateWithOneDistinct) builds the corresponding SparkPlan physical plan. The main job of these methods is deciding which kind of aggregation operator to use: HashAggregateExec, ObjectHashAggregateExec, or SortAggregateExec.

Looking at one example, HashAggregateExec, is enough; the others are very similar.
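One way to see which operator AggUtils chose is to inspect the executed plan; a quick sketch that reuses the df from the previous snippet and the internal exec classes, for inspection only:

import org.apache.spark.sql.execution.aggregate.{HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec}

val aggPlan = df.groupBy("department").agg(sum("expense")).queryExecution.executedPlan

// With a fixed-width aggregation buffer (sum over a double), hash aggregation is normally chosen.
aggPlan.collect {
  case _: HashAggregateExec       => "hash"
  case _: ObjectHashAggregateExec => "object-hash"
  case _: SortAggregateExec       => "sort"
}.foreach(println)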

HashAggregateExec

The definition and main methods of HashAggregateExec. HashAggregateExec relies on TungstenAggregationIterator.

TungstenAggregationIterator

An excerpt from its source-code documentation:

/**
 * An iterator used to evaluate aggregate functions. It operates on [[UnsafeRow]]s.
 *
 * This iterator first uses hash-based aggregation to process input rows. It uses
 * a hash map to store groups and their corresponding aggregation buffers. If
 * this map cannot allocate memory from memory manager, it spills the map into disk
 * and creates a new one. After processed all the input, then merge all the spills
 * together using external sorter, and do sort-based aggregation.
 *
 * The process has the following step:
 *  - Step 0: Do hash-based aggregation.
 *  - Step 1: Sort all entries of the hash map based on values of grouping expressions and
 *            spill them to disk.
 *  - Step 2: Create an external sorter based on the spilled sorted map entries and reset the map.
 *  - Step 3: Get a sorted [[KVIterator]] from the external sorter.
 *  - Step 4: Repeat step 0 until no more input.
 *  - Step 5: Initialize sort-based aggregation on the sorted iterator.
 * Then, this iterator works in the way of sort-based aggregation.
 *
 * The code of this class is organized as follows:
 *  - Part 1: Initializing aggregate functions.
 *  - Part 2: Methods and fields used by setting aggregation buffer values,
 *            processing input rows from inputIter, and generating output
 *            rows.
 *  - Part 3: Methods and fields used by hash-based aggregation.
 *  - Part 4: Methods and fields used when we switch to sort-based aggregation.
 *  - Part 5: Methods and fields used by sort-based aggregation.
 *  - Part 6: Loads input and process input rows.
 *  - Part 7: Public methods of this iterator.
 *  - Part 8: A utility function used to generate a result when there is no
 *            input and there is no grouping expression.
 *
 * @param groupingExpressions
 *   expressions for grouping keys
 * @param aggregateExpressions
 * [[AggregateExpression]] containing [[AggregateFunction]]s with mode [[Partial]],
 * [[PartialMerge]], or [[Final]].
 * @param aggregateAttributes the attributes of the aggregateExpressions'
 *   outputs when they are stored in the final aggregation buffer.
 * @param resultExpressions
 *   expressions for generating output rows.
 * @param newMutableProjection
 *   the function used to create mutable projections.
 * @param originalInputAttributes
 *   attributes of representing input rows from `inputIter`.
 * @param inputIter
 *   the iterator containing input [[UnsafeRow]]s.
 */

 

    7. Summary

In the end, execution always comes down to a subclass either producing an RDD[Row] through buildScan or producing an iterator of InternalRow through readFile.

Reposted from: https://my.oschina.net/u/778683/blog/1831150
