Pipeline: file studentinfo --> Linux --> HDFS --> Hive --> read with Spark
1. Upload the local file to the Linux machine
Run the command rz -E and select the file studentinfo to upload it to the /dataset/ directory on Linux.
2. Upload the studentinfo file from /dataset/ on Linux to HDFS
hdfs dfs -mkdir -p /dataset
hdfs dfs -put studentinfo /dataset/
3. Using hive or beeline, run SQL to create the Hive table student
CREATE DATABASE IF NOT EXISTS spark_integrition;
USE spark_integrition;
CREATE EXTERNAL TABLE student
(
name STRING,
age INT,
gpa STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/dataset/hive';
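The ROW FORMAT clause above tells Hive to split each text line on tab characters to produce the three columns. A minimal pure-Scala sketch of that mapping (the sample row and the names `RowFormatDemo`/`parseLine` are illustrative, not part of Hive's API):

```scala
object RowFormatDemo {
  // Mirrors the columns of the student table.
  case class Student(name: String, age: Int, gpa: String)

  // Split one tab-delimited line into the three columns,
  // the same way FIELDS TERMINATED BY '\t' does.
  def parseLine(line: String): Student = {
    val fields = line.split("\t")
    Student(fields(0), fields(1).toInt, fields(2))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample row in the studentinfo format.
    println(parseLine("ulysses thompson\t64\t1.90"))
  }
}
```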
4. Load the HDFS data into Hive
LOAD DATA INPATH '/dataset/studentinfo' OVERWRITE INTO TABLE student;
5. Query the Hive table through SparkSQL
scala> spark.sql("use spark_integrition")
scala> val resultDF = spark.sql("select * from student limit 10")
scala> resultDF.show()
Part Two: Creating the Hive table through SparkSQL
val createTableStr =
"""
|CREATE EXTERNAL TABLE student
|(
| name STRING,
| age INT,
| gpa STRING
|)
|ROW FORMAT DELIMITED
| FIELDS TERMINATED BY '\t'
| LINES TERMINATED BY '\n'
|STORED AS TEXTFILE
|LOCATION '/dataset/hive'
""".stripMargin
spark.sql("CREATE DATABASE IF NOT EXISTS spark_integrition1")
spark.sql("USE spark_integrition1")
spark.sql(createTableStr)
spark.sql("LOAD DATA INPATH '/dataset/studenttab10k' OVERWRITE INTO TABLE student")
spark.sql("select * from student limit 10").show()
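`stripMargin` strips everything up to and including the leading `|` on each line, so `createTableStr` reaches Hive as plain SQL without the margin characters. A small self-contained sketch of that behavior (the object name and SQL text are illustrative):

```scala
object StripMarginDemo {
  // stripMargin drops the leading whitespace and the '|' marker,
  // turning a readable triple-quoted literal into raw SQL text.
  val sql: String =
    """|SELECT name, gpa
       |FROM student
       |LIMIT 10""".stripMargin

  def main(args: Array[String]): Unit = println(sql)
}
```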
Currently, the file formats SparkSQL supports in STORED AS are sequencefile, rcfile, orc, parquet, textfile, and avro; a SerDe name can also be specified.