[Original] Optimize map performance with mapPartitions
As we saw in the previous article "CSV Parser", we may need to create a new object for each record of an RDD, as in `def mLine(line: String) = { val parser =`
2015-01-26 13:55:18 281
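The idea in the title can be illustrated without a cluster. The following is a minimal sketch in plain Scala, not the post's own code: partitions are modeled as ordinary collections, and `TabParser` and the `constructed` counter are hypothetical names introduced only to make the number of object constructions visible.

```scala
object MapPartitionsSketch {
  // Hypothetical stand-in for an expensive-to-build parser; counts constructions.
  var constructed = 0

  class TabParser {
    constructed += 1
    def parse(line: String): Array[String] = line.split("\t", -1)
  }

  // Per-record style: a new parser is built for every line.
  def perRecord(lines: Iterator[String]): Iterator[Array[String]] =
    lines.map(line => new TabParser().parse(line))

  // Per-partition style: one parser is shared by all lines of the partition.
  // This has the Iterator[T] => Iterator[U] shape that RDD.mapPartitions takes.
  def perPartition(lines: Iterator[String]): Iterator[Array[String]] = {
    val parser = new TabParser()
    lines.map(parser.parse)
  }
}
```

With a real `RDD[String]`, the second function could presumably be passed as `rdd.mapPartitions(MapPartitionsSketch.perPartition)`, so that construction cost is paid once per partition instead of once per record.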
[Original] CSV Parser
Most of our data files are in CSV format. Although the String.split('\t') approach can handle a lot of cases, there are CSV files that contain quotes. In that case, if a delimiter character is in between o
2015-01-24 14:18:31 369
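To make the quoting problem concrete, here is a minimal sketch of a quote-aware splitter in plain Scala. It is an assumption about what such a parser needs to do, not the parser used in the post (which may rely on a library); `QuotedSplit` and `splitLine` are hypothetical names.

```scala
object QuotedSplit {
  // Splits one line on `delim`, ignoring delimiters inside double quotes.
  // A doubled quote ("") inside a quoted field is read as a literal quote.
  def splitLine(line: String, delim: Char = ','): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val cur = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val ch = line.charAt(i)
      if (ch == '"') {
        if (inQuotes && i + 1 < line.length && line.charAt(i + 1) == '"') {
          cur += '"'   // escaped quote: emit one quote and skip the second
          i += 1
        } else {
          inQuotes = !inQuotes   // opening or closing quote
        }
      } else if (ch == delim && !inQuotes) {
        fields += cur.toString   // field boundary outside quotes
        cur.clear()
      } else {
        cur += ch
      }
      i += 1
    }
    fields += cur.toString   // last field
    fields.result()
  }
}
```

A plain `split` would cut the quoted field `"b,c"` in two, whereas `QuotedSplit.splitLine("a,\"b,c\",d")` keeps it as a single field.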
[Original] Partition by Hash on Keys
When an RDD object is created, it is partitioned into multiple pieces for parallel processing. If we have to join the RDD with other RDDs many times on some key, we’d better partition the RDDs by the
2015-01-24 13:49:55 452
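Why hash partitioning helps repeated joins can be seen in a small plain-Scala sketch: if two datasets are bucketed with the same function and the same number of partitions, records with equal keys always land in the same bucket. `HashPartitionSketch`, `partitionOf`, and `partitionByKey` are hypothetical names for illustration, not Spark API.

```scala
object HashPartitionSketch {
  // Non-negative modulo, so negative hash codes still map to a valid partition.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Groups key/value records into buckets indexed by the hash of the key.
  def partitionByKey[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => partitionOf(k, numPartitions) }
}
```

In Spark itself, the corresponding step on a pair RDD is `rdd.partitionBy(new HashPartitioner(n))`, usually followed by `persist()` so that the partitioned layout is reused across joins.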
[Original] Sample by a Hash Function (Scala)
It’s really common in Big Data ad hoc analysis that we need to downsample the data. However, in most cases, we need to downsample based on some hash function of a key of the data. For example, to
2015-01-23 13:22:34 737
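A minimal sketch of key-based downsampling, assuming string keys and a percentage-style sampling rate: a record is kept when the hash of its key falls into the first `percent` of 100 buckets, so all records that share a key are kept or dropped together. The choice of `MurmurHash3` and the names `HashSample`, `keep`, and `sampleByKey` are assumptions for illustration; the post may use a different hash.

```scala
import scala.util.hashing.MurmurHash3

object HashSample {
  // Maps a key to a bucket in [0, 100) and keeps it if the bucket is below `percent`.
  def keep(key: String, percent: Int): Boolean = {
    val bucket = ((MurmurHash3.stringHash(key) % 100) + 100) % 100
    bucket < percent
  }

  // Filters key/value records; the decision depends only on the key.
  def sampleByKey[V](records: Seq[(String, V)], percent: Int): Seq[(String, V)] =
    records.filter { case (k, _) => keep(k, percent) }
}
```

Because the decision is a pure function of the key, the same predicate can be applied to several datasets and the sampled subsets will still join on that key.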
[Original] Histogram with Spark (2) – Implicit class
In the previous post, we studied how to calculate the histogram on an RDD[String]. By using implicit type conversion, we can add the helper method to the Map class and make the code look better.
2015-01-22 14:22:16 424
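As a rough illustration of the approach, an implicit class can attach a counting helper to an immutable `Map`. This is a sketch under assumptions: `HistogramSyntax`, `CountMapOps`, and `increment` are hypothetical names, and the helper in the original post may look different.

```scala
object HistogramSyntax {
  // Adds a hypothetical `increment` helper to any immutable Map[K, Int].
  implicit class CountMapOps[K](val m: Map[K, Int]) extends AnyVal {
    // Returns a new map with the count for `key` raised by one.
    def increment(key: K): Map[K, Int] = m.updated(key, m.getOrElse(key, 0) + 1)
  }
}

object HistogramSyntaxDemo {
  import HistogramSyntax._

  // The fold body now reads as a single method call on the map.
  def histogram(items: Seq[String]): Map[String, Int] =
    items.foldLeft(Map.empty[String, Int])((m, c) => m.increment(c))
}
```

Once `HistogramSyntax._` is imported, `increment` is available on every `Map[K, Int]` in scope, which keeps the fold free of `getOrElse` bookkeeping.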
[Original] Histogram in Spark (1)
Spark’s DoubleRDDFunctions provides a histogram function for RDD[Double]. However, there is no histogram function for RDD[String]. Here is a quick exercise for doing it. We will use immutable Map in th
2015-01-21 08:47:39 711
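One way to do this exercise with immutable maps is to count within each partition and then merge the partial counts. The sketch below models partitions as plain Scala collections; `StringHistogram`, `seqOp`, and `combOp` are hypothetical names, and this is not necessarily the approach the post takes.

```scala
object StringHistogram {
  type Hist = Map[String, Int]

  // Adds one observation to a histogram (the per-partition step).
  def seqOp(h: Hist, s: String): Hist =
    h + (s -> (h.getOrElse(s, 0) + 1))

  // Merges two partial histograms (the cross-partition step).
  def combOp(a: Hist, b: Hist): Hist =
    b.foldLeft(a) { case (acc, (k, n)) => acc + (k -> (acc.getOrElse(k, 0) + n)) }

  // Counts each partition independently, then merges the partial results.
  def histogram(partitions: Seq[Seq[String]]): Hist =
    partitions
      .map(_.foldLeft(Map.empty[String, Int])(seqOp))
      .foldLeft(Map.empty[String, Int])(combOp)
}
```

On an actual `RDD[String]`, the same two functions fit the shape of `rdd.aggregate(Map.empty[String, Int])(seqOp, combOp)`.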
[Original] Histogram in Scala
```scala
scala> val hist = Array("aa", "bb", "aa").foldLeft(Map[String, Int]()) {
     |   (m, c) => m + (c -> (m.getOrElse(c, 0) + 1))
     | }
```
or use the updated method of mutable Map
```scala
scala>
```
2015-01-15 09:11:21 571