In Big Data ad hoc analysis, it is very common to need to down-sample the data. In most cases, the sampling should be based on a hash function of a key in the data. For example, when processing credit card data, we want the sampling to be consistent across all the data files that contain the account ID as the key. The pseudocode for this is:
if Hash(account_ID) % 100 < 5: Keep
else: Drop
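The consistency property can be sketched in Python. The account IDs below are hypothetical, and CRC32 from the standard library stands in for a generic stable hash:

```python
import zlib

def keep(account_id: str, rate_percent: int = 5) -> bool:
    # Bucket the key into 0..99 with a stable hash; keep the lowest rate_percent buckets.
    return zlib.crc32(account_id.encode()) % 100 < rate_percent

# The decision depends only on the key, so the same account is kept (or dropped)
# consistently in every file it appears in.
ids = [f"ACCT{i:06d}" for i in range(10_000)]
kept = [i for i in ids if keep(i)]
print(len(kept))  # roughly 5% of 10,000
```

Because the decision is a pure function of the key, two different jobs reading two different files will agree on which accounts are in the sample, which a random per-row sample would not guarantee.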
In Spark, we can use the MurmurHash3 function in Scala to do the sampling:
import scala.util.hashing.{MurmurHash3 => MH3}
.....
val seed = 12345
val rate = 0.05
// Mask off the sign bit, then compare against the rate scaled to the 31-bit range.
val sample = data.filter(line => (MH3.stringHash(line.take(10), seed) & 0x7FFFFFFF) < (rate * 0x7FFFFFFF))
Here I take the first 10 characters of each line as the key and apply the MH3 hash. Since MH3.stringHash returns an Int, which can be negative, I apply a bitwise AND with 0x7FFFFFFF to clear the sign bit, then compare the result against the sample rate scaled to the same 31-bit range.
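The same mask-and-compare trick can be sketched in Python. As an assumption for illustration, MD5 truncated to 32 bits substitutes for MurmurHash3; any well-mixed 32-bit hash behaves the same way here:

```python
import hashlib

SEED = 12345          # mirrors the Scala seed
RATE = 0.05
MASK = 0x7FFFFFFF     # clears the sign bit of a 32-bit value

def hash31(key: str, seed: int = SEED) -> int:
    # Stand-in for MH3.stringHash: hash the seeded key, take 4 bytes, drop the sign bit.
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") & MASK

def sample(lines):
    # Keep a line when its 10-character key hashes below rate * 2^31, like the Spark filter.
    return [ln for ln in lines if hash31(ln[:10]) < RATE * MASK]

lines = [f"{i:010d},some,payload" for i in range(20_000)]
print(len(sample(lines)))  # roughly 5% of 20,000
```

Since the masked hash is roughly uniform on [0, 2^31), comparing it against rate * 2^31 keeps about rate of the keys, and rerunning the filter on any file yields the same decision for each key.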