DataFrames were not invented by Spark SQL; they existed earlier in R and Python. This means existing DataFrame work in R can be carried over to Spark SQL quickly.
The Spark RDD API uses a functional programming style: the data is turned into a distributed dataset (an RDD), and the program is expressed as functions applied to that RDD, as in the sketch below.
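For reference, a minimal sketch of that RDD style (a standalone word count; the names here are illustrative, not from the original project):

import org.apache.spark.sql.SparkSession

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddWordCount").master("local[2]").getOrCreate()
    // With the RDD API, each transformation is an ordinary function passed to the RDD
    val lines = spark.sparkContext.parallelize(Seq("spark sql", "spark rdd"))
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    spark.stop()
  }
}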
DataFrames Overview
Dataset is a distributed collection of data, added in Spark 1.6.
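As a minimal sketch (the Person case class is illustrative, chosen to match the people.json file used later in this post), a DataFrame can be turned into a typed Dataset with as[T]:

import org.apache.spark.sql.SparkSession

// Illustrative case class; its fields match people.json below
case class Person(name: String, age: Long)

object DatasetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DatasetDemo").master("local[2]").getOrCreate()
    import spark.implicits._
    // as[Person] adds compile-time types; in Spark 2.x a DataFrame is just Dataset[Row]
    val ds = spark.read.json("file:///G:\\desktop\\people.json").as[Person]
    ds.filter(_.age > 30).show() // typed lambda instead of a Column expression
    spark.stop()
  }
}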
A DataFrame is a distributed dataset organized into columns (each with a name, a type, and values): in effect, an RDD with a schema. An RDD by itself is just a distributed collection and has no notion of column names, types, or values. You can think of a DataFrame as a table in a relational database. A DataFrame can be created from structured data files (such as JSON), Hive tables, other external data sources (MySQL, NoSQL stores), or an existing RDD (see the sketch below).
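Of these, creating a DataFrame from an existing RDD is worth a quick sketch (the names here are illustrative): toDF attaches the schema the RDD lacks:

import org.apache.spark.sql.SparkSession

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToDataFrame").master("local[2]").getOrCreate()
    import spark.implicits._
    // An RDD of tuples carries no schema of its own
    val rdd = spark.sparkContext.parallelize(Seq(("zhangsan", 30), ("lisi", 31)))
    // toDF assigns column names; the types come from the tuple's element types
    val df = rdd.toDF("name", "age")
    df.printSchema()
    df.show()
    spark.stop()
  }
}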
With RDDs, Java/Scala programs run on the JVM while Python runs on the Python runtime, so performance differs across languages.
With DataFrames, queries written in Java, Scala, or Python are all translated into the same logical plan before execution, so performance is the same across languages.
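You can inspect this shared plan yourself: explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan for a query. A minimal sketch, assuming the same SparkSession and people.json used later in this post:

val df = spark.read.format("json").load("file:///G:\\desktop\\people.json")
// Prints the logical plans (parsed, analyzed, optimized) and the physical plan;
// the same query built from Java, Scala, or Python yields the same optimized plan
df.select("name").explain(true)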
Basic DataFrame API Operations
Project layout
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.sid.com</groupId>
  <artifactId>sparksqltrain</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.0</spark.version>
  </properties>
  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <dependencies>
    <!-- Scala dependency -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <!-- Spark SQL dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <!-- This dependency is required for HiveContext / Hive support -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>1.2.1.spark2</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <!-- Target Java 8: Spark 2.2 requires Java 8, and the archetype's old jvm-1.5 target is obsolete for Scala 2.11 -->
            <arg>-target:jvm-1.8</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
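With this pom.xml in place, the project builds like any standard Maven project, for example:

mvn clean package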
Code
package com.sid.com

import org.apache.spark.sql.SparkSession

/**
  * Basic DataFrame API operations
  */
object DataFrameApi {

  def main(args: Array[String]): Unit = {
    // The entry point for Spark SQL is SparkSession
    val spark = SparkSession.builder().appName("DataFrameApi").master("local[2]").getOrCreate()

    // Quoted from the Spark scaladoc for DataFrameReader.load:
    // "Loads input in as a `DataFrame`, for data sources that require a path (e.g. data backed by
    //  a local or distributed file system)."
    // @since 1.4.0

    // Load the JSON file into a DataFrame
    val peopleDF = spark.read.format("json").load("file:///G:\\desktop\\people.json")

    // Print the schema to the console
    peopleDF.printSchema()

    // Show the records in the DataFrame (first 20 rows by default)
    peopleDF.show()

    // Select a single column; the MySQL equivalent: select name from t
    peopleDF.select("name").show()

    // peopleDF.col("name") returns a Column object
    // select name, age + 10 as age2 from t
    // (peopleDF.col("age") + 10).as("age2") gives the derived column the alias age2
    peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 10).as("age2")).show()

    // select * from t where age > 30
    peopleDF.filter(peopleDF.col("age") > 30).show()

    // select age, count(*) from t group by age
    peopleDF.groupBy("age").count().show()

    // select * from t where name = 'sid' or name = 'lisi'
    peopleDF.filter("name='sid' or name='lisi'").show()

    // select * from t where substr(name,0,1) = 's'
    peopleDF.filter("substr(name,0,1)='s'").show()

    // select * from t order by name, age
    peopleDF.sort("name", "age").show()

    // select * from t order by name asc, age desc
    peopleDF.sort(peopleDF("name").asc, peopleDF("age").desc).show()

    val peopleDF2 = spark.read.format("json").load("file:///G:\\desktop\\people.json")

    // The join condition uses three equals signs (===), not ==
    // select * from t1 inner join t2 on t1.age = t2.age
    peopleDF.join(peopleDF2, peopleDF.col("age") === peopleDF2.col("age"), "inner").show()

    spark.stop()
  }
}
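Since every comment above maps a DataFrame call to its SQL equivalent, note that the same queries can also be run as literal SQL by registering the DataFrame as a temporary view. A minimal sketch, assuming these lines are placed before spark.stop() in the program above:

// Register the DataFrame as a temporary view named t, matching the SQL comments above
peopleDF.createOrReplaceTempView("t")
// The same age2 query, now as actual SQL text
spark.sql("select name, age + 10 as age2 from t").show()
spark.sql("select age, count(*) from t group by age").show()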
people.json
{"name":"zhangsan","age":30}
{"name":"lisi","age":31}
{"name":"wangwu","age":32}
{"name":"sid","age":32}
Output
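For reference, with the four records in people.json, printSchema() prints the following (Spark's JSON schema inference orders columns alphabetically and infers whole numbers as long):

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)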