Start the interactive shell
[root@hdp-1 bin]# ./spark-shell --master spark://hdp-2:7077 --executor-memory 500m --total-executor-cores 1
--master spark://hdp-2:7077      the address of the Spark master node
--executor-memory 500m           memory to allocate per executor
--total-executor-cores 1         total number of CPU cores across all executors
(spark-shell has already created the SparkContext and SQLContext objects for you.)
Load the data
vi datajson.txt
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
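The file is read from HDFS in the next step, so it first has to be uploaded there. A minimal sketch, assuming the NameNode address and the /spark directory match the path used in the read.json call:

```shell
# Create the target directory on HDFS (path assumed from the read.json call)
hdfs dfs -mkdir -p hdfs://hdp-0:9000/spark
# Upload the local JSON file
hdfs dfs -put datajson.txt hdfs://hdp-0:9000/spark/
# Verify the upload
hdfs dfs -ls hdfs://hdp-0:9000/spark
```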
scala> val jsondf=spark.sqlContext.read.json("hdfs://hdp-0:9000/spark/datajson.txt")
jsondf: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala>
scala> jsondf.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
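Note in the output above that Spark inferred the column types from the JSON itself (age as bigint, with null for the missing value). Besides show(), the inferred schema can be inspected directly; a short sketch, assuming the same jsondf from the session above:

```scala
// Print the schema that Spark inferred from the JSON file
jsondf.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)
```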
Using the DataFrame
A DataFrame is a distributed collection of data organized into named columns.
scala> jsondf.select(jsondf("name"),jsondf("age")).show()
+-------+----+
|   name| age|
+-------+----+
|Michael|null|
|   Andy|  30|
| Justin|  19|
+-------+----+
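Beyond select(), the same named-column style supports filtering and aggregation. A sketch of two common operations, assuming the jsondf created earlier in the session:

```scala
// Keep only rows where age is greater than 20 (rows with null age are dropped)
jsondf.filter(jsondf("age") > 20).show()

// Count the number of rows for each distinct age value
jsondf.groupBy("age").count().show()
```

Both calls return a new DataFrame; nothing is computed until an action such as show() is invoked.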