Overview
The leaked CSDN user data has the following format:
aaaaaaa # bbbbbb # xxxxxx@hotmail.com
aaaaaaa # bbbbbb # xxxxxx@hotmail.com
aaaaaaa # bbbbbb # xxxxxx@hotmail.com
aaaaaaa # bbbbbb # xxxxxx@hotmail.com
aaaaaaa # bbbbbb # xxxxxx@hotmail.com
Each line holds three fields: username, password, and email. The fields are separated by " # " (a hash sign with one space on each side).
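The line format described above can be sketched in plain Scala, without Spark, as a small parser. This is only an illustration: the names `CsdnLine`, `Record`, and `parse` are hypothetical and do not appear in the original post, and treating lines with fewer than three fields as invalid is an assumption.

```scala
// Hypothetical helper for illustration; not part of the original post.
object CsdnLine {
  // One parsed line of the dump: username, password, email.
  case class Record(username: String, password: String, email: String)

  // Split on the " # " separator into at most three fields.
  // Lines that do not yield exactly three fields are treated as malformed (assumption).
  def parse(line: String): Option[Record] =
    line.split(" # ", 3) match {
      case Array(u, p, e) => Some(Record(u.trim, p.trim, e.trim))
      case _              => None
    }
}
```

Returning an `Option` lets the caller drop malformed lines with `flatMap` instead of failing on an out-of-range index.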
Finding the top N most commonly used passwords
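// One record per line of the dump: username, password, email.
case class User(username: String, password: String, email: String)
val filePath = "/data/www.csdn.net.sql"
val linesRDD = sc.textFile(filePath)
// Fields are separated by " # " (not a comma), so split on that and trim each field.
val partsRDD = linesRDD.map(l => l.split(" # ").map(_.trim))
// Skip malformed lines that do not have all three fields.
val csdnRDD = partsRDD.filter(_.length >= 3).map(r => User(username = r(0), password = r(1), email = r(2)))
// toDF() relies on sqlContext.implicits._, which spark-shell imports automatically.
val csdnDF = csdnRDD.toDF()
csdnDF.printSchema()
csdnDF.count()
csdnDF.registerTempTable("csdn")
// Triple quotes let the SQL string span several lines.
val pwdSet = sqlContext.sql("""SELECT password, COUNT(password) AS password_cnt
  FROM csdn GROUP BY password ORDER BY password_cnt DESC LIMIT 50""")
pwdSet.map(r => "Password: " + r(0) + " Count: " + r(1)).collect().foreach(println)
// DataFrame API version of the grouping; show() prints only the first 20 rows and does not sort them.
csdnDF.groupBy("password").count().show()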
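The group, count, sort, and limit steps of the query can also be sketched on an ordinary Scala collection, which is handy for checking the logic on a small sample before running the Spark job. This is a minimal sketch, not the author's code: `TopPasswords` and `topN` are hypothetical names, and breaking ties alphabetically is an assumption that the SQL query itself does not make.

```scala
// Hypothetical helper for illustration; not part of the original post.
object TopPasswords {
  // Return the n most frequent passwords with their counts, most frequent first.
  // Ties are broken alphabetically (assumption) so the result is deterministic.
  def topN(passwords: Seq[String], n: Int): Seq[(String, Int)] =
    passwords
      .groupBy(identity)                                          // GROUP BY password
      .map { case (pwd, occurrences) => (pwd, occurrences.size) } // COUNT(password)
      .toSeq
      .sortBy { case (pwd, count) => (-count, pwd) }              // ORDER BY count DESC
      .take(n)                                                    // LIMIT n
}
```

For example, `TopPasswords.topN(Seq("123456", "abc", "123456"), 1)` yields a single pair for `"123456"` with a count of 2.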