Creating DataFrames
Official docs: https://spark.apache.org/docs/latest/sql-getting-started.html
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
As an example, the following creates a DataFrame based on the content of a JSON file:
Start the Spark shell:
[hadoop@hadoop001 spark-2.4.0-bin-2.6.0-cdh5.7.0]$ cd bin
[hadoop@hadoop001 bin]$ ./spark-shell
Locate the sample JSON file that ships with the Spark distribution:
[hadoop@hadoop001 resources]$ pwd
/home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources
[hadoop@hadoop001 resources]$ cat people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
scala> val df = spark.read.json("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
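The quickstart also mentions creating a DataFrame from an existing RDD. A minimal sketch in spark-shell, using a hypothetical Person case class (spark-shell imports spark.implicits._ automatically, which provides toDF on an RDD of case classes):

scala> case class Person(name: String, age: Long)
scala> val rdd = spark.sparkContext.parallelize(Seq(Person("Michael", 29), Person("Andy", 30)))
scala> val df2 = rdd.toDF()
scala> df2.show()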
Processing data with Spark SQL is very convenient; under the hood this is implemented through Spark's external data source API.
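For example, the json shorthand used above is equivalent to the generic format/load form of that API; a sketch against the same sample file:

scala> val df = spark.read.format("json").load("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")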
Going further
scala> spark.table("ruoze_emp").show
This reads a Hive table whose files live on HDFS, so HDFS must be started before running it.
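The same table can also be queried with SQL; a sketch, assuming the ruoze_emp table from above exists in the Hive metastore:

scala> spark.sql("SELECT * FROM ruoze_emp").show()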
Doing the same in IDEA
The pom must pull in the Spark-Hive dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
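The ${spark.version} property is assumed to be defined in the pom's <properties> section; for example, matching the 2.4.0 distribution used above:

<properties>
    <spark.version>2.4.0</spark.version>
</properties>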
Then write the application:
package g5.learning

import org.apache.spark.sql.SparkSession

object SparkSessionApp {

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .appName("SparkSessionApp")
      .master("local[2]")
      .enableHiveSupport() // must be enabled whenever Hive is accessed
      .getOrCreate()

    // sparkSession.sparkContext.parallelize(Array(1, 2, 3, 4)).collect().foreach(println)
    sparkSession.table("ruoze_emp").show()

    sparkSession.stop()
  }
}
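Inside the application, the same query can also be issued as SQL; a minimal sketch equivalent to the table() call above (again assuming the ruoze_emp table exists in the metastore):

sparkSession.sql("SELECT * FROM ruoze_emp").show()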
Note: .enableHiveSupport() must be called whenever the application uses Hive.
Running Hive-backed code on Windows is still quite troublesome; it requires many extra steps and fetching additional files.
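Two things are typically needed (both describe a common setup, not steps from the original): hive-site.xml must be on the application classpath (for example under src/main/resources) so the metastore can be located, and winutils.exe must be installed, with Hadoop pointed at it before the SparkSession is created:

// hypothetical local path: the directory must contain bin\winutils.exe
System.setProperty("hadoop.home.dir", "C:\\winutils")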