RDD:
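RDD has no head function.
take(num: Int): Array[T]: returns the first num elements of the RDD to the driver as an array. Throws an exception if called on an RDD[Nothing] or RDD[Null].
first(): T: returns the first element of the RDD. Throws an exception if the RDD is empty, or if it is an RDD[Nothing] or RDD[Null].
isEmpty(): Boolean: tests whether the RDD contains no elements at all. An RDD with no partitions is empty, and so is an RDD that has one or more partitions but no elements. Throws an exception if called on an RDD[Nothing] or RDD[Null] (internally it calls take(1)).
Note: `parallelize(Seq())` has type `RDD[Nothing]`; avoid it by writing `parallelize(Seq[T]())` instead.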
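The behaviour described above can be checked with a small local run. This is only a sketch: the object name `RddEmptyDemo`, the helper `probe`, and the `local[2]` master are illustrative choices, not part of the original notes.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object RddEmptyDemo {
  // Returns (number of elements taken, isEmpty result, whether first() failed).
  def probe(spark: SparkSession): (Int, Boolean, Boolean) = {
    val sc = spark.sparkContext
    // Typed empty RDD: the element type is Int, not Nothing.
    val empty = sc.parallelize(Seq.empty[Int])
    val taken = empty.take(3).length               // take yields an empty array
    val isEmpty = empty.isEmpty()                  // true even though partitions exist
    val firstFails = Try(empty.first()).isFailure  // first() throws on an empty RDD
    (taken, isEmpty, firstFails)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("rdd-empty-demo")
      .getOrCreate()
    try println(probe(spark)) finally spark.stop()
  }
}
```

With a typed empty sequence, take and isEmpty behave normally and only first() fails.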
Dataset:
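head(n: Int): Array[T]: returns the first n rows; the data is pulled to the driver. On an empty Dataset it returns an empty array rather than throwing.
def head(): T: returns the first row, equivalent to head(1).head. Throws an exception (NoSuchElementException) if the Dataset is empty.
def first(): T: alias for head(). Throws an exception if the Dataset is empty.
take(n: Int): Array[T]: alias for head(n); likewise returns an empty array for an empty Dataset.
isEmpty: Boolean: available only since Spark 2.4.0.
As an emptiness check, count() is less efficient than foreachPartition.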
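A minimal sketch of how head, take, and first behave on an empty Dataset. The names `DatasetHeadDemo` and `probe` are hypothetical, and the local master is an assumption made only so the example runs on its own.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object DatasetHeadDemo {
  // Returns (head(2) length, take(2) length, head() failed, first() failed).
  def probe(spark: SparkSession): (Int, Int, Boolean, Boolean) = {
    import spark.implicits._ // provides the Encoder[Int] needed by emptyDataset
    val empty = spark.emptyDataset[Int]
    (
      empty.head(2).length,        // array-returning variants yield an empty array
      empty.take(2).length,
      Try(empty.head()).isFailure, // head() is head(1).head, which fails on an empty array
      Try(empty.first()).isFailure // first() is an alias for head()
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("dataset-head-demo")
      .getOrCreate()
    try println(probe(spark)) finally spark.stop()
  }
}
```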
/**
 * Returns the first `n` rows.
 *
 * @note this method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 *
 * @group action
 * @since 1.6.0
 */
def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)
/**
 * Returns the first row.
 * @group action
 * @since 1.6.0
 */
def head(): T = head(1).head
/**
 * Returns the first row. Alias for head().
 * @group action
 * @since 1.6.0
 */
def first(): T = head()
def take(n: Int): Array[T] = head(n)
scala> val r2=Seq()
r2: Seq[Nothing] = List()
scala> val d3=r2.toDS()
<console>:36: error: value toDS is not a member of Seq[Nothing]
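The error above goes away once the sequence carries an element type for which an encoder exists. A minimal sketch, assuming `Int` as the element type; `TypedEmptyDs` and `build` are illustrative names only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object TypedEmptyDs {
  // Builds an empty Dataset[Int] from a typed empty Seq.
  def build(spark: SparkSession): Dataset[Int] = {
    import spark.implicits._ // brings toDS() and Encoder[Int] into scope
    Seq.empty[Int].toDS()    // Seq[Int], not Seq[Nothing], so the conversion applies
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("typed-empty-ds")
      .getOrCreate()
    try build(spark).printSchema() finally spark.stop()
  }
}
```

`spark.emptyDataset[Int]` is another way to get the same result without going through a Seq.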
Efficiency compared with count() and foreachPartition():
ds.rdd.isEmpty performs best.
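For reference, here is a sketch of three emptiness checks that work on versions without Dataset.isEmpty. The object and method names are hypothetical, and the sketch only shows that the checks agree; it does not measure their relative speed.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object EmptinessChecks {
  // Counts every row just to compare the total with zero.
  def viaCount[T](ds: Dataset[T]): Boolean = ds.count() == 0L

  // Fetches at most one row to the driver.
  def viaHead[T](ds: Dataset[T]): Boolean = ds.head(1).length == 0

  // Converts to an RDD and reuses RDD.isEmpty, which relies on take(1).
  def viaRdd[T](ds: Dataset[T]): Boolean = ds.rdd.isEmpty()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("emptiness-checks")
      .getOrCreate()
    import spark.implicits._
    try {
      val empty = spark.emptyDataset[String]
      val nonEmpty = Seq("a", "b").toDS()
      println((viaCount(empty), viaHead(empty), viaRdd(empty)))
      println((viaCount(nonEmpty), viaHead(nonEmpty), viaRdd(nonEmpty)))
    } finally spark.stop()
  }
}
```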
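DataFrame:
head(int n): also pulls the rows to the driver; performance is very poor.
head(): equivalent to head(1), returning its single row.
first(): equivalent to head().
take(int n): equivalent to head(int n).
There is no isEmpty function in this version (Dataset.isEmpty was only added in Spark 2.4.0).
df.rdd.isEmpty performs best.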
public Row[] head(int n)
{
  return limit(n).collect();
}
public Row head()
{
  return (Row)Predef$.MODULE$.refArrayOps((Object[])head(1)).head();
}
public Row first()
{
  return head();
}
public Row[] take(int n)
{
  return head(n);
}
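The decompiled source shows head(n) delegating to limit(n).collect(). A small sketch, assuming a five-row DataFrame with a hypothetical column `n`, confirms that both calls return the same number of rows; `DataFrameHeadDemo` and `probe` are illustrative names.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameHeadDemo {
  // Returns (rows via head(3), rows via limit(3).collect(), df.rdd.isEmpty result).
  def probe(spark: SparkSession): (Int, Int, Boolean) = {
    import spark.implicits._ // enables toDF on a local Seq
    val df = Seq(1, 2, 3, 4, 5).toDF("n")
    val viaHead = df.head(3)              // Array[Row] on the driver
    val viaLimit = df.limit(3).collect()  // the form head(n) delegates to
    (viaHead.length, viaLimit.length, df.rdd.isEmpty())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("dataframe-head-demo")
      .getOrCreate()
    try println(probe(spark)) finally spark.stop()
  }
}
```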