作者:周志湖
放假了,终于能抽出时间更新博客了…….
1. 获取数据
本文通过将github上的Spark项目git日志作为数据,对SparkSQL的内容进行详细介绍
数据获取命令如下:
[root@master spark]# git log --pretty=format:'{
"commit":"%H","author":"%an","author_email":"%ae","date":"%ad","message":"%f"}' > sparktest.json
格式化日志内容输出如下:
[root@master spark]# head -1 sparktest.json
{"commit":"30b706b7b36482921ec04145a0121ca147984fa8","author":"Josh Rosen","author_email":"joshrosen@databricks.com","date":"Fri Nov 6 18:17:34 2015 -0800","message":"SPARK-11389-CORE-Add-support-for-off-heap-memory-to-MemoryManager"}
</