1. Reading data sources
(1) Read JSON with spark.read. Note: paths are resolved against HDFS by default; to read a local file, add the file:// prefix, as follows:
scala> val people = spark.read.format("json").load("file:///opt/software/data/people.json")
people: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> people.show
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
spark.read.format("json").load("file:///opt/software/data/people.json")
is equivalent to spark.read.json("file:///opt/software/data/people.json").
To read files in other formats, just change format("json") accordingly, e.g. format("parquet").
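For example, a sketch of reading Parquet and CSV this way (the local paths here are hypothetical; CSV typically also needs options such as a header row):

```scala
// Parquet carries its own schema, so load alone is enough:
val parquetDF = spark.read.format("parquet")
  .load("file:///opt/software/data/people.parquet")

// CSV usually needs extra options, e.g. treating the first line
// as a header and inferring column types:
val csvDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file:///opt/software/data/people.csv")
```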
(2) Read a Hive table with spark.sql. Here the Hive database is default (the default database name can be omitted) and the table is people:
scala> val peopleDF=spark.sql("select * from default.people")
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
scala> peopleDF.show
+--------+---+--------+
| name|age| address|
+--------+---+--------+
|zhangsan| 22| chengdu|
| wangwu| 33| beijing|
| lisi| 28|shanghai|
+--------+---+--------+
scala> peopleDF.printSchema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- address: string (nullable = true)
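Note that spark-shell provides a preconfigured `spark` session. In a standalone application, a minimal sketch of building a Hive-enabled SparkSession yourself would look like this (the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() is required for spark.sql to see Hive tables
val spark = SparkSession.builder()
  .appName("HiveRead")          // hypothetical application name
  .enableHiveSupport()
  .getOrCreate()

val peopleDF = spark.sql("select * from default.people")
```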
2. Selecting columns
The three ways to select columns are shown below:
scala> peopleDF.select("name","age").show
+--------+---+
| name|age|
+--------+---+
|zhangsan| 22|
| wangwu| 33|
| lisi| 28|
+--------+---+
scala> peopleDF.select($"name",$"age").show
+--------+---+
| name|age|
+--------+---+
|zhangsan| 22|
| wangwu| 33|
| lisi| 28|
+--------+---+
scala> peopleDF.select(peopleDF.col("name"),peopleDF.col("age")).show
+--------+---+
| name|age|
+--------+---+
|zhangsan| 22|
| wangwu| 33|
| lisi| 28|
+--------+---+
Note: if you write this code in IDEA, using $ requires adding the statement import spark.implicits._; otherwise expressions such as $"name" will fail to compile. spark-shell imports it by default.
$"colName" is syntactic sugar that returns a Column object.
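Because $"colName" yields a Column, it can be combined with operators and methods, not just passed to select. A sketch using the people table above (column aliases here are made up for illustration):

```scala
// Arithmetic on a Column, renamed with alias:
peopleDF.select($"name", ($"age" + 1).alias("age_next_year")).show

// A Column can also appear in a filter predicate:
peopleDF.filter($"age" > 25).show
```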