Part 1: Getting the sample files
Official docs: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
Spark itself ships with sample/test files under examples/src/main/resources:
[hadoop@hadoop001 resources]$ ls
employees.json kv1.txt people.json user.avsc users.orc
full_user.avsc people.csv people.txt users.avro users.parquet
[hadoop@hadoop001 resources]$ cat people.txt
Michael, 29
Andy, 30
Justin, 19
[hadoop@hadoop001 resources]$ pwd
/home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources
Working in the spark-shell, this is the standard format(...).load(...) way to read a data source:
scala> spark.read.format("text").load("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.txt").show(false)
+-----------+
|value |
+-----------+
|Michael, 29|
|Andy, 30 |
|Justin, 19 |
+-----------+
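The same directory also has people.json; reading it with the json format yields named columns (name, age) instead of a single value column. A minimal spark-shell sketch (output omitted):
scala> spark.read.format("json").load("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json").printSchema()
scala> spark.read.json("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json").show(false)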
Part 2: Working in IDEA
package g5.learning

import org.apache.spark.sql.{SaveMode, SparkSession}

object DataSourceAPIApp {

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().appName("DataSourceAPIApp")
      .master("local[2]")
      .getOrCreate()

    // Standard read: format(...).load(...). format accepts text/json/parquet/jdbc/orc/csv;
    // load can also take several paths, passed as comma-separated arguments.
    sparkSession.read.format("text").load("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.txt").show(false)

    // Shorthand: read.text/json/parquet/... (Spark SQL's default format is parquet).
    // people.json is used here so the DataFrame has a "name" column for the writes below.
    val df = sparkSession.read.json("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")

    // This also works: "path" is a fixed option key, then call load() with no argument.
    sparkSession.read.format("text").option("path", "file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.txt").load().show(false)

    // For CSV, mind the separator, whether there is a header row, and schema inference.
    sparkSession.read.format("csv").option("sep", ";").option("header", "true").option("inferSchema", "true")
      .load("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.csv").show(false)

    // Standard write: take the name column, write it as JSON under this path,
    // overwriting anything that is already there.
    df.select("name").write.format("json").mode(SaveMode.Overwrite).save("file:///home/hadoop/tmp/jsonfile")

    // Write into a Hive table.
    df.select("name").write.saveAsTable("ruoze_saprk")

    sparkSession.stop()
  }
}
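The format list above includes jdbc, which takes connection options instead of a file path. A minimal sketch, assuming a hypothetical MySQL database ruozedata with a table emp (the URL, table name and credentials are placeholders, not from the original notes):

val jdbcDF = sparkSession.read.format("jdbc")
  .option("url", "jdbc:mysql://hadoop001:3306/ruozedata")  // placeholder JDBC URL
  .option("dbtable", "emp")                                // placeholder table name
  .option("user", "root")                                  // placeholder credentials
  .option("password", "root")
  .load()
jdbcDF.show(false)

Writing back over JDBC is symmetric: df.write.format("jdbc") with the same options, then save(). The matching JDBC driver jar must be on the classpath.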
Part 3: Summary
Developer: build libraries for various data sources
Users: easy loading/saving of DataFrames
Read: spark.read.format("...")
Built-in data sources: json, parquet, jdbc, csv (csv since 2.x)
External data sources: https://spark-packages.org/
Spark SQL's default format is parquet
Write: people.write.format("...").save("...")
Working with Hive table data (see the sketch below):
spark.table(tableName)
df.write.saveAsTable(tableName)
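A minimal sketch of those two Hive-table calls, assuming Hive support is enabled on the session and that a hypothetical table emp already exists:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("HiveTableApp")
  .master("local[2]")
  .enableHiveSupport()  // needed to read/write real Hive tables
  .getOrCreate()

// Read an existing Hive table into a DataFrame (emp is a placeholder name).
val empDF = spark.table("emp")

// Save a DataFrame back as a table in the current database.
empDF.write.mode(SaveMode.Overwrite).saveAsTable("emp_backup")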