A few days ago, a junior friend of mine ran into an interview question along these lines: given a page-access log, find the Top 10 most-visited pages for each user.
So how should this SQL be written?
Test data:
// Each row is (date, userId, page, visitTime)
val rdd = spark.sparkContext.makeRDD(
  List(
    ("2018-01-01", 1, "www.baidu.com", "10:01"),
    ("2018-01-01", 2, "www.baidu.com", "10:01"),
    ("2018-01-01", 1, "www.sina.com", "10:01"),
    ("2018-01-01", 3, "www.baidu.com", "10:01"),
    ("2018-01-01", 3, "www.baidu.com", "10:01"),
    ("2018-01-01", 1, "www.sina.com", "10:01")
  )
)
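The queries below run against a temporary view named t_log, but the snippet above only builds an RDD. A minimal sketch of the missing glue might look like the following (the column names date and visitTime are my assumption; userId and page are required by the queries):

// Convert the RDD to a DataFrame and expose it to Spark SQL
// as the temporary view "t_log" that the queries below read from.
import spark.implicits._

val df = rdd.toDF("date", "userId", "page", "visitTime")
df.createOrReplaceTempView("t_log")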
Approach:
- Step 1: group by userId and page, so that each user's visits to the same page fall into one group, then count how many times each user visited each page.
执行结果:spark .sql( """ |select | userId, | page, | count(page) as page_count | from | t_log | group by | userId,page |""".stripMargin) .show()
+------+-------------+----------+
|userId|         page|page_count|
+------+-------------+----------+
|     2|www.baidu.com|         1|
|     3|www.baidu.com|         2|
|     1|www.baidu.com|         1|
|     1| www.sina.com|         2|
+------+-------------+----------+
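Step 1 can be sanity-checked without Spark. The following plain-Scala sketch (the names log and pageCounts are mine) reproduces the GROUP BY userId, page count on the same test data:

```scala
// The test data as plain Scala tuples: (date, userId, page, time)
val log = List(
  ("2018-01-01", 1, "www.baidu.com", "10:01"),
  ("2018-01-01", 2, "www.baidu.com", "10:01"),
  ("2018-01-01", 1, "www.sina.com", "10:01"),
  ("2018-01-01", 3, "www.baidu.com", "10:01"),
  ("2018-01-01", 3, "www.baidu.com", "10:01"),
  ("2018-01-01", 1, "www.sina.com", "10:01")
)

// Equivalent of: select userId, page, count(page) ... group by userId, page
val pageCounts: Map[(Int, String), Int] =
  log.groupBy { case (_, userId, page, _) => (userId, page) }
     .map { case (key, rows) => key -> rows.size }
```

The counts match the table above: user 1 visited www.sina.com twice, and so on.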
- Step 2: use the rank() window function, partitioning the rows by user and ordering each partition by page-visit count in descending order.
执行结果:spark .sql( """ |select | * |from | ( | select | userId, | page, | page_count, | rank() | over(partition by userId order by page_count desc) as rank | from | ( | select | userId, | page, | count(page) as page_count | from | t_log | group by | userId,page | ) | ) |""".stripMargin) .show()
+------+-------------+----------+----+
|userId|         page|page_count|rank|
+------+-------------+----------+----+
|     1| www.sina.com|         2|   1|
|     1|www.baidu.com|         1|   2|
|     3|www.baidu.com|         2|   1|
|     2|www.baidu.com|         1|   1|
+------+-------------+----------+----+
- Finally, keep only the rows ranked in the Top 10 (rank <= 10).
Query and result:

spark.sql(
  """
    |select
    |  *
    |from
    |  (
    |    select
    |      userId,
    |      page,
    |      page_count,
    |      rank() over(partition by userId order by page_count desc) as rank
    |    from
    |      (
    |        select
    |          userId,
    |          page,
    |          count(page) as page_count
    |        from
    |          t_log
    |        group by
    |          userId, page
    |      )
    |  )
    |where
    |  rank <= 10
    |""".stripMargin)
  .show()
+------+-------------+----------+----+
|userId|         page|page_count|rank|
+------+-------------+----------+----+
|     1| www.sina.com|         2|   1|
|     1|www.baidu.com|         1|   2|
|     3|www.baidu.com|         2|   1|
|     2|www.baidu.com|         1|   1|
+------+-------------+----------+----+
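As a cross-check of steps 2 and 3, the ranking and filtering can also be sketched in plain Scala on collections (no Spark; the names are mine). Note the rank() semantics being imitated: tied counts share a rank, and the next rank is skipped.

```scala
// (date, userId, page, time) test rows, as above
val log = List(
  ("2018-01-01", 1, "www.baidu.com", "10:01"),
  ("2018-01-01", 2, "www.baidu.com", "10:01"),
  ("2018-01-01", 1, "www.sina.com", "10:01"),
  ("2018-01-01", 3, "www.baidu.com", "10:01"),
  ("2018-01-01", 3, "www.baidu.com", "10:01"),
  ("2018-01-01", 1, "www.sina.com", "10:01")
)

// Step 1: count visits per (userId, page)
val pageCounts: List[(Int, String, Int)] =
  log.groupBy { case (_, u, p, _) => (u, p) }
     .map { case ((u, p), rows) => (u, p, rows.size) }
     .toList

// Step 2: rank() over (partition by userId order by page_count desc).
// rank = 1 + number of pages of the same user with a strictly larger count,
// which matches SQL rank(): ties share a rank, the following rank is skipped.
val ranked: List[(Int, String, Int, Int)] =
  pageCounts.groupBy(_._1).values.toList.flatMap { userRows =>
    userRows.map { case (u, p, c) =>
      (u, p, c, userRows.count(_._3 > c) + 1)
    }
  }

// Step 3: keep the Top 10 pages per user
val top10: List[(Int, String, Int, Int)] = ranked.filter(_._4 <= 10)
```

On this small data set every row survives the filter, since no user has more than two distinct pages.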
Summary:
This question mainly tests basic SQL syntax and the use of window functions: aggregate with GROUP BY first, then rank within each user's partition with rank(). (Depending on how ties should be treated, dense_rank() or row_number() could be used instead; rank() leaves gaps after ties.)