一、Reading files. First prepare a people.json file and a people.csv file with the following contents:
The JSON file:
{"name":"Michael"}
{"name":"Andy","age":30}
{"name":"Justin","age":19}
The CSV file:
name,age
Michael,
Andy,30
Justin,19
1、Reading and writing JSON files
(1)To read from a local file, first run import spark.implicits._, then execute val df=spark.read.json("file:///sunxj/work/spark/people.json") to read the file, and finally call df.show() to display the contents, as shown below:
It can also be read with: val df1=spark.read.format("json").load("file:///sunxj/work/spark/people.json")
(2)To write a JSON file, execute df.write.json("file:///sunxj/work/spark/people1.json") or df.write.format("json").save("file:///sunxj/work/spark/people2.json") (note: both calls write to a directory, and the output file is inside that directory), as shown below:
(3)Selected columns can be written with df.select("name","age").write.json("file:///sunxj/work/spark/newpeople1.json") or df.select("name","age").write.format("json").save("file:///sunxj/work/spark/newpeople.json") (note: both calls write to a directory, and the output file is inside that directory), as shown below:
2、Reading and writing CSV files
(1)To read from a local file, first run import spark.implicits._, then execute val csvdf=spark.read.csv("file:///sunxj/work/spark/people.csv") or val csvdf1=spark.read.format("csv").load("file:///sunxj/work/spark/people.csv") to read the file, and finally call show() to display the contents, as shown below:
(2)To write a CSV file, execute csvdf.select("_c0","_c1").write.format("csv").save("file:///sunxj/work/spark/newcsv/") or csvdf.select("_c0","_c1").write.csv("file:///sunxj/work/spark/newcsv1/") (note: both calls write to a directory, and the output file is inside that directory), as shown below:
3、Both the JSON and the CSV DataFrames can also be saved via rdd.saveAsTextFile(), e.g. csvdf.rdd.saveAsTextFile("file:///sunxj/work/spark/rddcsv/") or df.rdd.saveAsTextFile("file:///sunxj/work/spark/rddjson/"), as shown below:
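saveAsTextFile writes each Row's default toString, i.e. bracketed lines such as [null,Michael]. If plain delimited lines are wanted instead, the rows can be formatted before saving; a sketch (the output path here is made up):

```scala
import org.apache.spark.sql.SparkSession

object RddSave {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read.json("file:///sunxj/work/spark/people.json")
    // Row.mkString joins the row's fields without the surrounding brackets
    df.rdd.map(row => row.mkString(","))
      .saveAsTextFile("file:///sunxj/work/spark/rddjson2/")
  }
}
```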
4、Use printSchema() to print the DataFrame's schema information, as shown below:
5、select() extracts a subset of columns from the DataFrame, optionally renaming them, as shown below:
6、filter() filters rows by a condition on a column, as shown below:
7、groupBy() groups rows by a column; combined with count() it tallies the rows per group, as shown below:
8、sort() orders the rows, e.g. descending by the age column, and for equal ages ascending by name, as shown below:
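Items 4 through 8 can be collected into one sketch against the df read from people.json (column references use the $ syntax enabled by import spark.implicits._):

```scala
import org.apache.spark.sql.SparkSession

object DfOps {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    val df = spark.read.json("file:///sunxj/work/spark/people.json")
    df.printSchema()                             // print the schema
    df.select($"name", $"age" + 1).show()        // select columns, one derived
    df.select($"name".as("username")).show()     // rename a column
    df.filter($"age" > 20).show()                // keep rows with age > 20
    df.groupBy("age").count().show()             // row count per distinct age
    df.sort($"age".desc, $"name".asc).show()     // age descending, then name ascending
  }
}
```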
9、Inferring the RDD schema via reflection (define a case class to describe the table, then query it with SQL)
(1)First create a people.txt file with the following contents:
Michael,29
Andy,30
Justin,19
(2)Add the following to the pom.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>comprehensive-example</groupId>
<artifactId>comprehensiveexample</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.1.1</version>
<configuration>
<archive>
<manifest>
<mainClass>PeopleSql</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
(3)Create a PeopleSql.scala file with the following contents:
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.SparkSession

case class People(name:String,age:Long)

object PeopleSql {
  def main(args:Array[String]): Unit ={
    val spark=SparkSession.builder().getOrCreate()
    import spark.implicits._
    // Read the data from the file
    val lines=spark.sparkContext.textFile("file:///sunxj/work/spark/people.txt")
    // To convert an RDD to a DataFrame this way, a case class must be defined:
    // split each line, map it to a People instance, then call toDF() on the result
    val pdf=lines.map(line=>line.split(",")).map(x=>People(x(0),x(1).trim.toLong)).toDF()
    // Register the DataFrame as a temporary view named people
    pdf.createOrReplaceTempView("people")
    // Then run a SQL query, which produces another DataFrame
    val personDF=spark.sql("select name,age from people where age>20")
    // Print each row of the DataFrame
    personDF.map(x=>"Name:"+x(0)+","+"Age:"+x(1)).show()
  }
}
(4)Package it into a jar with mvn package, then execute: spark-submit comprehensive-example/target/comprehensiveexample-1.0-SNAPSHOT-jar-with-dependencies.jar. The output is shown below:
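The parsing step of the program above can be sketched and checked in plain Scala, without a Spark cluster; the sample lines mirror people.txt:

```scala
case class People(name: String, age: Long)

object ParseCheck {
  // Mirrors the map(line => line.split(",")).map(...) chain from PeopleSql
  def parse(lines: Seq[String]): Seq[People] =
    lines.map(_.split(",")).map(a => People(a(0), a(1).trim.toLong))

  def main(args: Array[String]): Unit = {
    val people = parse(Seq("Michael,29", "Andy,30", "Justin,19"))
    // The SQL filter age > 20 would keep Michael and Andy
    println(people.count(_.age > 20))
  }
}
```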
10、Defining the RDD schema programmatically (i.e., turning the file contents into a table (header + rows) that can then be queried with SQL)
(1)Create a PeopleSqlProgramming.scala file with the following contents:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object PeopleSqlProgramming {
  def main(args:Array[String]): Unit ={
    // Define the header fields: the first argument is the column name,
    // the second its type, and true means the column is nullable
    val fields=Array(StructField("name",StringType,true),StructField("age",IntegerType,true))
    // Build the schema, i.e. the actual table header
    val schema=StructType(fields)
    val spark=SparkSession.builder().getOrCreate()
    import spark.implicits._
    // Read the data from the file
    val peopleRDD=spark.sparkContext.textFile("file:///sunxj/work/spark/people.txt")
    // Convert each line of data into a Row
    val rowRDD=peopleRDD.map(_.split(",")).map(x=>Row(x(0),x(1).trim.toInt))
    // Combine the schema with the Row RDD
    val peopleDF=spark.createDataFrame(rowRDD,schema)
    // Register it as a temporary view
    peopleDF.createOrReplaceTempView("people")
    // Run the SQL query
    val result=spark.sql("select name,age from people")
    // Print the result
    result.map(x=>"Name:"+x(0)+","+"Age:"+x(1)).show()
  }
}
(2)Change <mainClass>PeopleSql</mainClass> to <mainClass>PeopleSqlProgramming</mainClass>, package it into a jar with mvn package, then execute: spark-submit comprehensive-example/target/comprehensiveexample-1.0-SNAPSHOT-jar-with-dependencies.jar. The output is shown below:
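The advantage of the programmatic approach is that the schema need not be known at compile time. When the column list only arrives at runtime, the same StructType can be built from a header string; a sketch (requires spark-sql on the classpath, and the header string here is just the columns used above):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object SchemaFromHeader {
  def main(args: Array[String]): Unit = {
    // Build one nullable StructField per column name in the header line
    val header = "name age"
    val schema = StructType(header.split(" ").map(f => StructField(f, StringType, true)))
    println(schema.fieldNames.mkString(","))
  }
}
```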
11、Spark SQL reading and writing a database via JDBC. First create the database and table:
(1)Log in to the database with mysql -u root -p and enter the password 123456
(2)Create the database with: create database spark; as shown below:
(3)Execute in order:
use spark;
create table student(id int(4),name varchar(20),gender varchar(4),age int(4));
insert into student values (1,'Xueqian','F',23);
insert into student values (2,'Weiliang','M',24);
as shown below:
(4)The queried records are:
(5)Create a new project and configure its pom.xml as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>spark-sql-mysql-hive</groupId>
<artifactId>spark-sql-mysql-hive</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.17</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.1.1</version>
<configuration>
<archive>
<manifest>
<mainClass>SparkSqlOperateMysql</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
(6)Create a SparkSqlOperateMysql.scala file with the following contents:
import org.apache.spark.sql.SparkSession

object SparkSqlOperateMysql {
  def main(args:Array[String]): Unit ={
    val spark=SparkSession.builder().getOrCreate()
    val jdbcDF=spark.read.format("jdbc")
      .option("url","jdbc:mysql://localhost:3306/spark")
      // With mysql-connector-java 8.x the driver class is com.mysql.cj.jdbc.Driver;
      // the old com.mysql.jdbc.Driver name still works but logs a deprecation warning
      .option("driver","com.mysql.cj.jdbc.Driver")
      .option("dbtable","student")
      .option("user","root")
      .option("password","123456")
      .load()
    jdbcDF.show()
  }
}
(7)Package the jar with mvn package, then execute: spark-submit spark-sql-mysql-hive/target/spark-sql-mysql-hive-1.0-SNAPSHOT-jar-with-dependencies.jar. The output is shown below:
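Equivalently, the connection options can be collected into a java.util.Properties object and passed to spark.read.jdbc; a sketch using the same URL and credentials as above:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

object JdbcReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    // Bundle the JDBC connection parameters
    val prop = new Properties()
    prop.put("user", "root")
    prop.put("password", "123456")
    prop.put("driver", "com.mysql.cj.jdbc.Driver")
    // read.jdbc(url, table, properties) mirrors the write.jdbc call used for inserting
    val jdbcDF = spark.read.jdbc("jdbc:mysql://localhost:3306/spark", "student", prop)
    jdbcDF.show()
  }
}
```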
(8)Modify the code so that it can insert records, as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.util.Properties

object SparkSqlOperateMysql {
  val spark=SparkSession.builder().getOrCreate()
  import spark.implicits._
  def getDF(databaseName:String,tablename:String,user:String,password:String): DataFrame ={
    spark.read.format("jdbc")
      .option("url","jdbc:mysql://localhost:3306/"+databaseName)
      .option("driver","com.mysql.cj.jdbc.Driver")
      .option("dbtable",tablename)
      .option("user",user)
      .option("password",password)
      .load()
  }
  def printStudent(databaseName:String,tablename:String,user:String,password:String): Unit ={
    val jdbcDF=getDF(databaseName,tablename,user,password)
    jdbcDF.show()
  }
  def getStudentCount(databaseName:String,tablename:String,user:String,password:String): Long ={
    val jdbcDF=getDF(databaseName,tablename,user,password)
    jdbcDF.count()
  }
  def main(args:Array[String]): Unit ={
    val dbName="spark"
    val tabName="student"
    val user="root"
    val pwd="123456"
    var count=getStudentCount(dbName,tabName,user,pwd)
    println("Record count before insert: %d".format(count))
    println("Records before insert:")
    printStudent(dbName,tabName,user,pwd)
    // Build two student records
    val studentRDD=spark.sparkContext.parallelize(Array("%d Rongcheng%d M 26".format(count+1,count+1),"%d Guanhua%d M 27".format(count+2,count+2)))
      .map(_.split(" "))
    // Define the schema, i.e. the column names and types
    val schema=StructType(List(StructField("id",IntegerType,true),StructField("name",StringType,true),StructField("gender",StringType,true),StructField("age",IntegerType,true)))
    // Create a Row object for each line of studentRDD
    val rowRDD=studentRDD.map(p=>Row(p(0).toInt,p(1).trim,p(2).trim,p(3).toInt))
    // Tie the Row objects to the schema, i.e. attach the data to the structure
    val studentDF=spark.createDataFrame(rowRDD,schema)
    // Create a Properties object holding the JDBC connection parameters
    val prop=new Properties()
    prop.put("user",user)
    prop.put("password",pwd)
    prop.put("driver","com.mysql.cj.jdbc.Driver")
    // Connect in append mode, i.e. append the records to the student table of the spark database
    studentDF.write.mode("append").jdbc("jdbc:mysql://localhost:3306/"+dbName,"spark."+tabName,prop)
    count=getStudentCount(dbName,tabName,user,pwd)
    println("Record count after insert: %d".format(count))
    println("Records after insert:")
    printStudent(dbName,tabName,user,pwd)
  }
}
(9)Package it again and run it with spark-submit. The output is shown below:
(10)Query the database directly, as shown below:
12、Spark operating on a Hive database
(1)First enter hive and create the database spark_test and the table student, as shown below:
(2)Insert data, as shown below:
As can be seen, hive translates its SQL statements into MapReduce jobs for execution.
(3)Create a SprkSqlOperateHive.scala file with the following contents:
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

case class Record(key:Int,value:String)

object SprkSqlOperateHive {
  val warehouseLocation="hdfs://master:9000/user/hive/warehouse"
  def main(args:Array[String]): Unit ={
    val spark=SparkSession.builder()
      .appName("Spark Hive Example")
      .config("hive.metastore.uris","thrift://master:9083")// set the hive metastore URI
      //.config("spark.sql.warehouse.dir",warehouseLocation)// already configured in hive-site.xml
      .enableHiveSupport()// enable hive support
      .getOrCreate()
    spark.sql("select * from spark_test.student").show()
  }
}
The pom.xml file contents are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>spark-sql-mysql-hive</groupId>
<artifactId>spark-sql-mysql-hive</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.17</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.4.0</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.1.1</version>
<configuration>
<archive>
<manifest>
<mainClass>SprkSqlOperateHive</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
(4)Change <mainClass>SparkSqlOperateMysql</mainClass> in pom.xml to <mainClass>SprkSqlOperateHive</mainClass>, repackage, and run it. The output is as follows:
(5)Modify the code so that it can write records, as follows:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object SprkSqlOperateHive {
  val spark=SparkSession.builder()
    .appName("Spark Hive Example")
    .config("hive.metastore.uris","thrift://master:9083")// set the hive metastore URI
    .enableHiveSupport()// enable hive support
    .getOrCreate()
  def printStudent(): Unit ={
    spark.sql("select * from spark_test.student").show()
  }
  def getStudentCount(): Long ={
    spark.sql("select * from spark_test.student").count()
  }
  def main(args:Array[String]): Unit ={
    var count=getStudentCount()
    println("Record count before insert: %d".format(count))
    println("Records before insert:")
    printStudent()
    // Build two student records
    val studentRDD=spark.sparkContext.parallelize(Array("%d Rongcheng_%d_hive M 26".format(count+1,count+1),"%d Guanhua_%d_hive M 27".format(count+2,count+2)))
      .map(_.split(" "))
    // Define the schema, i.e. the column names and types
    val schema=StructType(List(StructField("id",IntegerType,true),StructField("name",StringType,true),StructField("gender",StringType,true),StructField("age",IntegerType,true)))
    // Create a Row object for each line of studentRDD
    val rowRDD=studentRDD.map(p=>Row(p(0).toInt,p(1).trim,p(2).trim,p(3).toInt))
    // Tie the Row objects to the schema
    val studentDF=spark.createDataFrame(rowRDD,schema)
    // Register a temporary view
    studentDF.createOrReplaceTempView("tempTable")
    spark.sql("insert into spark_test.student select * from tempTable")
    count=getStudentCount()
    println("Record count after insert: %d".format(count))
    println("Records after insert:")
    printStudent()
  }
}
The output is as follows:
Then check hive again, as shown below: