In Big Data ad hoc analysis, it is very common to need to down-sample the data. In most cases, the sampling should be based on a hash function of a key in the data. For example, when processing credit card data, we want the sampling to be consistent across all the data files that contain the account ID as the key. The pseudocode for this is:
if Hash(account_ID) % 100 < 5: Keep
else: Drop
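The consistency property can be sketched in Python. The account IDs below are hypothetical, and CRC32 from the standard library stands in for a generic stable hash:

```python
import zlib

def keep(account_id: str, rate_percent: int = 5) -> bool:
    # Bucket the key into 0..99 with a stable hash; keep the lowest rate_percent buckets.
    return zlib.crc32(account_id.encode()) % 100 < rate_percent

# The decision depends only on the key, so the same account is kept (or dropped)
# consistently in every file it appears in.
ids = [f"ACCT{i:06d}" for i in range(10_000)]
kept = [i for i in ids if keep(i)]
print(len(kept))  # roughly 5% of 10,000
```

Because the decision is a pure function of the key, two different jobs reading two different files will agree on which accounts are in the sample, which a random per-row sample would not guarantee.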
In Spark, we can use the MurmurHash3 function in Scala to do the sampling:
import scala.util.hashing.{MurmurHash3 => MH3}
.....
val seed = 12345
val rate = 0.05
// Mask off the sign bit, then compare against the rate scaled to the 31-bit range.
val sample = data.filter(line => (MH3.stringHash(line.take(10), seed) & 0x7FFFFFFF) < (rate * 0x7FFFFFFF))
Here I take the first 10 characters of each line as the key and apply the MH3 hash. Since MH3.stringHash returns an Int, which can be negative, I apply a bitwise AND with 0x7FFFFFFF to clear the sign bit, then compare the result against the sample rate scaled to the same 31-bit range.
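The same mask-and-compare trick can be sketched in Python. As an assumption for illustration, MD5 truncated to 32 bits substitutes for MurmurHash3; any well-mixed 32-bit hash behaves the same way here:

```python
import hashlib

SEED = 12345          # mirrors the Scala seed
RATE = 0.05
MASK = 0x7FFFFFFF     # clears the sign bit of a 32-bit value

def hash31(key: str, seed: int = SEED) -> int:
    # Stand-in for MH3.stringHash: hash the seeded key, take 4 bytes, drop the sign bit.
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") & MASK

def sample(lines):
    # Keep a line when its 10-character key hashes below rate * 2^31, like the Spark filter.
    return [ln for ln in lines if hash31(ln[:10]) < RATE * MASK]

lines = [f"{i:010d},some,payload" for i in range(20_000)]
print(len(sample(lines)))  # roughly 5% of 20,000
```

Since the masked hash is roughly uniform on [0, 2^31), comparing it against rate * 2^31 keeps about rate of the keys, and rerunning the filter on any file yields the same decision for each key.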