bzhangusc - CSDN Blog

Original: Optimize map performance with mapPartitions

As we saw in the previous article "CSV Parser", we may need to create a new object for each record of an RDD, as in def mLine(line:String)={ val parser=

2015-01-26 13:55:18 281

Original: CSV Parser

Most of our data files are in CSV format. Although the String.split('\t') approach can handle a lot of cases, some CSV files contain quotes. In that case, if a delimiter character is in between o

2015-01-24 14:18:31 369
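
The quoting problem raised in the CSV Parser excerpt can be illustrated with a small hand-rolled splitter. This is only a sketch and not the parser the post itself uses: the object `QuotedSplit` and its `split` method are hypothetical names, and the handling of doubled quotes is simplified.

```scala
object QuotedSplit {
  // Split one line on `delim`, ignoring delimiters that appear inside quotes.
  // Simplification (assumption): quote characters are dropped, so a doubled
  // quote ("") inside a field is removed rather than unescaped to one quote.
  def split(line: String, delim: Char = ',', quote: Char = '"'): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val cur = new StringBuilder
    var inQuotes = false
    for (ch <- line) {
      if (ch == quote) {
        inQuotes = !inQuotes            // toggle quoted state, drop the quote
      } else if (ch == delim && !inQuotes) {
        fields += cur.toString          // unquoted delimiter ends the field
        cur.clear()
      } else {
        cur.append(ch)                  // ordinary character, or quoted delimiter
      }
    }
    fields += cur.toString              // flush the last field
    fields.result()
  }
}
```

With this sketch, a plain `String.split(',')` breaks the quoted field in `a,"b,c",d` into two pieces, while `QuotedSplit.split` keeps `b,c` together as one field.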

Original: Partition by Hash on Keys

When an RDD object is created, it is partitioned into multiple pieces for parallel processing. If we have to join the RDD with other RDDs many times on some key, we’d better partition the RDDs by the

2015-01-24 13:49:55 452

Original: Sample by a Hash Function (Scala)

In Big Data ad hoc analysis, we very often need to downsample the data. In most cases, however, we need to downsample based on some hash function of a key of the data. For example, to

2015-01-23 13:22:34 737
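
The idea of sampling by a hash of the key can be sketched in plain Scala as follows. The choice of `MurmurHash3.stringHash` and the names `HashSample`, `inSample`, `keep`, and `buckets` are assumptions made for illustration; the post may use a different hash function and interface.

```scala
import scala.util.hashing.MurmurHash3

object HashSample {
  // Keep a record when the hash of its key falls into the first `keep`
  // of `buckets` buckets. Because the decision depends only on the key,
  // every record with the same key is kept or dropped together.
  def inSample(key: String, keep: Int, buckets: Int): Boolean = {
    val h = MurmurHash3.stringHash(key)
    // floorMod keeps the bucket index non-negative even when h is negative.
    java.lang.Math.floorMod(h, buckets) < keep
  }

  // Downsample tab-separated records whose first column is the key.
  def sample(records: Seq[String], keep: Int, buckets: Int): Seq[String] =
    records.filter(r => inSample(r.split('\t')(0), keep, buckets))
}
```

The same predicate could be passed to a `filter` on an RDD; the point of hashing the key instead of drawing a random number per record is that all records for a sampled key survive.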

Original: Histogram with Spark (2) – Implicit class

In the previous post we studied how to calculate the histogram on an RDD[String]. By using implicit type conversion, we can add the helper method to the Map class and make the code look better.

2015-01-22 14:22:16 424
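
A minimal sketch of what adding a helper method to Map through an implicit class might look like, in plain Scala without Spark. The names `HistogramSyntax`, `HistMap`, and `incr` are hypothetical and not taken from the post.

```scala
object HistogramSyntax {
  // Hypothetical helper: gives an immutable Map[String, Int] an `incr`
  // method that returns a new map with the count for `key` raised by one.
  implicit class HistMap(val m: Map[String, Int]) {
    def incr(key: String): Map[String, Int] =
      m + (key -> (m.getOrElse(key, 0) + 1))
  }

  // With the implicit class in scope, the fold body reads as a method call.
  def histogram(xs: Seq[String]): Map[String, Int] =
    xs.foldLeft(Map.empty[String, Int])((m, c) => m.incr(c))
}
```

For example, `HistogramSyntax.histogram(Seq("aa", "bb", "aa"))` yields a map with `aa -> 2` and `bb -> 1`.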

Original: Histogram in Spark (1)

Spark’s DoubleRDDFunctions provides a histogram function for RDD[Double]. However, there is no histogram function for RDD[String]. Here is a quick exercise for doing it. We will use immutable Map in th

2015-01-21 08:47:39 711

Original: Histogram in Scala

```scala
scala> val hist = Array("aa","bb","aa").foldLeft(Map[String,Int]()){
     |   (m,c) => m + (c -> (m.getOrElse(c,0)+1))
     | }
```
or use the updated method of mutable Map
```scala
scala>
```

2015-01-15 09:11:21 571
