Reprinted from:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
- Avoid `reduceByKey` when the input and output value types are different. For example, consider writing a transformation that finds all the unique strings corresponding to each key. One way would be to use `map` to transform each element into a `Set` and then combine the `Set`s with `reduceByKey`:
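A sketch of that approach (the `SparkSession` setup and sample data are illustrative, not from the article):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unique-strings").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(
  Seq(("fruit", "apple"), ("fruit", "apple"), ("fruit", "pear")))

// Inefficient: every record allocates a fresh single-element Set, and
// each reduceByKey merge allocates yet another Set for the union.
val uniquesByKey = rdd
  .map { case (k, v) => (k, Set(v)) }
  .reduceByKey(_ ++ _)
```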
This code results in tons of unnecessary object creation because a new set must be allocated for each record. It's better to use `aggregateByKey`, which performs the map-side aggregation more efficiently:
- Avoid the `flatMap-join-groupBy` pattern. When two datasets are already grouped by key and you want to join them and keep them grouped, you can just use `cogroup`. That avoids all the overhead associated with unpacking and repacking the groups; see the sketch below.
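A minimal sketch contrasting the two patterns; the datasets and names are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cogroup-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext
val clicks = sc.parallelize(Seq(("alice", "home"), ("alice", "cart"), ("bob", "home")))
val orders = sc.parallelize(Seq(("alice", 42), ("bob", 7)))

// Anti-pattern: ungroup two grouped datasets with flatMap, join, then
// groupByKey to rebuild the groups -- an extra shuffle for nothing.
val slow = clicks.groupByKey()
  .flatMap { case (u, pages) => pages.map(p => (u, p)) }
  .join(orders.groupByKey().flatMap { case (u, xs) => xs.map(x => (u, x)) })
  .groupByKey()

// Better: cogroup joins and groups in a single pass, yielding
// (key, (all clicks for the key, all orders for the key)).
val fast = clicks.cogroup(orders)
```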