Apache Spark: Effectively using mapPartitions in Java

In the early-release textbook High Performance Spark, the developers of Spark note that:

To allow Spark the flexibility to spill some records to disk, it is important to represent your functions inside of mapPartitions in such a way that your functions don't force loading the entire partition in-memory (e.g. implicitly converting to a list). Iterators have many methods we can write functional style transformations on, or you can construct your own custom iterator. When a transformation directly takes and returns an iterator without forcing it through another collection, we call these iterator-to-iterator transformations.

However, the textbook lacks good examples using mapPartitions or similar variations of the method, and there are few good code examples online, most of which are in Scala. For example, here is Scala code using mapPartitions, written by zero323 in answer to How to add columns into org.apache.spark.sql.Row inside of mapPartitions:

def transformRows(iter: Iterator[Row]): Iterator[Row] = iter.map(transformRow)

sqlContext.createDataFrame(df.rdd.mapPartitions(transformRows), newSchema).show

Unfortunately, Java doesn't provide anything as nice as iter.map(...) for iterators. So the question is: how can one effectively use iterator-to-iterator transformations with mapPartitions, without materializing the entire partition in memory as a list?

JavaRDD<OutObj> collection = prevCollection.mapPartitions((Iterator<InObj> iter) -> {
    ArrayList<OutObj> out = new ArrayList<>();
    while (iter.hasNext()) {
        InObj current = iter.next();
        out.add(someChange(current));
    }
    return out.iterator();
});

This seems to be the general syntax for using mapPartitions in Java examples, but I don't see how it could be the most efficient approach if you have a JavaRDD with tens of thousands of records (or even more, since Spark is made for big data). You'd end up building a list of every object from the iterator, just to turn it back into an iterator, which suggests that some kind of map function would be much more efficient here.

Note: while these 8 lines of code using mapPartitions could be written in one line with a map or flatMap, I'm intentionally using mapPartitions to take advantage of the fact that it operates over each partition rather than over each element in the RDD, as sketched below.
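For reference, here is the one-line map equivalent, followed by a sketch of the kind of per-partition setup that motivates mapPartitions in the first place (ExpensiveResource and its process method are made up for illustration, just like InObj, OutObj, and someChange above):

// One-line equivalent with map: simple, but no hook for per-partition work.
JavaRDD<OutObj> viaMap = prevCollection.map(current -> someChange(current));

// A typical reason to use mapPartitions instead: pay a setup cost once per
// partition rather than once per record (ExpensiveResource is hypothetical).
JavaRDD<OutObj> viaPartitions = prevCollection.mapPartitions((Iterator<InObj> iter) -> {
    ExpensiveResource resource = new ExpensiveResource();
    ArrayList<OutObj> out = new ArrayList<>(); // still materializes the partition: the problem at hand
    while (iter.hasNext()) {
        out.add(resource.process(iter.next()));
    }
    return out.iterator();
});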

Any ideas, please?

Solution

One way to prevent forcing the "materialization" of the entire partition is to convert the Iterator into a Stream, and then use Stream's functional API (e.g. the map function).

How to convert an iterator to a stream? suggests a few good ways to convert an Iterator into a Stream, so taking one of the options suggested there we can end up with:

rdd.mapPartitions((Iterator<Row> iter) -> {
    Iterable<Row> iterable = () -> iter;
    return StreamSupport.stream(iterable.spliterator(), false)
        .map(s -> transformRow(s)) // or whatever transformation
        .iterator();
});
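Another of the conversions suggested there builds the Spliterator directly from the iterator, skipping the intermediate Iterable; Spliterators and StreamSupport live in java.util and java.util.stream respectively, and transformRow is the same placeholder as above:

rdd.mapPartitions((Iterator<Row> iter) ->
        StreamSupport.stream(
                // wraps the iterator lazily, without consuming it up front
                Spliterators.spliteratorUnknownSize(iter, Spliterator.ORDERED), false)
            .map(s -> transformRow(s)) // or whatever transformation
            .iterator());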

This should be an "iterator-to-iterator" transformation, because all the intermediate APIs used (Iterable, Stream) are lazily evaluated.
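Alternatively, as the High Performance Spark quote points out, you can construct your own custom iterator. A minimal sketch of such a lazy mapping iterator (again reusing the hypothetical transformRow), which pulls one source element per call and never materializes the partition:

import java.util.Iterator;
import java.util.function.Function;

// A lazy mapping iterator: applies f to one element at a time.
class MappingIterator<A, B> implements Iterator<B> {
    private final Iterator<A> source;
    private final Function<A, B> f;

    MappingIterator(Iterator<A> source, Function<A, B> f) {
        this.source = source;
        this.f = f;
    }

    @Override public boolean hasNext() { return source.hasNext(); }
    @Override public B next() { return f.apply(source.next()); }
}

// Usage: rdd.mapPartitions((Iterator<Row> iter) -> new MappingIterator<>(iter, s -> transformRow(s)));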

EDIT: I haven't tested it myself, but the OP commented, and I quote, that "there is no efficiency increase by using a Stream over a list". I don't know why that is, or whether it holds in general, but it seemed worth mentioning.
