Working with JSON data stored on HDFS is very convenient in Spark. This article gives a brief introduction using two approaches: the Spark Shell and a standalone Scala application.
Spark Shell
Start the Spark Shell:
hadoop@master:~$ spark-1.6.0-bin-hadoop2.4/bin/spark-shell
You can make up your own test data; a few records are enough, so I won't paste mine here.
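For illustration, here are two hypothetical records consistent with the schema printed further below. Note that Spark's JSON reader expects one complete JSON object per line (JSON Lines format), not a pretty-printed array. All values here are made up:

```json
{"course": "Math", "courseId": "c001", "date": 1459440000000, "grade": "Grade 7", "gradeId": "g07", "name": "student-01", "school": "No.1 Middle School", "schoolId": "s001", "score": 92.5}
{"course": "English", "courseId": "c002", "date": 1459440000000, "grade": "Grade 7", "gradeId": "g07", "name": "student-02", "school": "No.1 Middle School", "schoolId": "s001", "score": 85.0}
```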
scala> val kg = sqlContext.read.json("hdfs://master:9000/test/data/knowledgeGraph/knowledgeGraph-11M.json")
The schema of the JSON data looks like this:
scala> kg.printSchema()
root
|-- course: string (nullable = true)
|-- courseId: string (nullable = true)
|-- date: long (nullable = true)
|-- grade: string (nullable = true)
|-- gradeId: string (nullable = true)
|-- name: string (nullable = true)
|-- school: string (nullable = true)
|-- schoolId: string (nullable = true)
 |-- score: double (nullable = true)
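Once the DataFrame is loaded, the usual DataFrame operations and SQL queries work directly in the shell. A minimal sketch (the column names come from the schema above; `registerTempTable` is the Spark 1.6 API, replaced by `createOrReplaceTempView` in Spark 2.x):

```scala
// Select a few columns and keep only high scores
kg.select("name", "course", "score").filter(kg("score") > 90).show()

// Or register the DataFrame as a temporary table and query it with SQL
kg.registerTempTable("kg")
sqlContext.sql("SELECT school, AVG(score) AS avg_score FROM kg GROUP BY school").show()
```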