Loading External Data Sources |
Spark SQL can load many external data source formats, such as CSV, text, JSON, and Parquet. Here we walk through JSON and Parquet.
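All of these formats go through the same DataFrameReader API, either via a shorthand method per format or the generic format/load form. A minimal sketch (the paths here are placeholders, not files from this walkthrough):

val csvDF  = spark.read.option("header", "true").csv("file:///d:/data.csv")  // shorthand reader
val jsonDF = spark.read.format("json").load("file:///d:/data.json")          // equivalent generic form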
JSON:
Code:
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("sparkSQL").master("local").getOrCreate()
  import spark.implicits._
  val df_json = spark.read.json("file:///d:/测试数据/sample1.json") // read the JSON file into a DataFrame
  df_json.printSchema() // print the field names and field types
}
After loading the JSON file, the DataFrame infers the schema automatically, as shown below:
root
 |-- city: string (nullable = true)
 |-- count: string (nullable = true)
 |-- id: string (nullable = true)
 |-- info: string (nullable = true)
 |-- infocode: string (nullable = true)
 |-- pois: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- adcode: string (nullable = true)
 |    |    |-- address: string (nullable = true)
 |    |    |-- adname: string (nullable = true)
 |    |    |-- alias: string (nullable = true)
 |    |    |-- biz_ext: struct (nullable = true)
 |    |    |    |-- cost: string (nullable = true)
 |    |    |    |-- hotel_ordering: string (nullable = true)
 |    |    |    |-- lowest_price: string (nullable = true)
 |    |    |    |-- meal_ordering: string (nullable = true)
 |    |    |    |-- rating: string (nullable = true)
 |    |    |    |-- seat_ordering: string (nullable = true)
 |    |    |    |-- star: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- ticket_ordering: string (nullable = true)
 |    |    |-- biz_type: string (nullable = true)
 |    |    |-- business_area: string (nullable = true)
 |    |    |-- children: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- citycode: string (nullable = true)
 |    |    |-- cityname: string (nullable = true)
 |    |    |-- discount_num: string (nullable = true)
 |    |    |-- distance: string (nullable = true)
 |    |    |-- email: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- entr_location: string (nullable = true)
 |    |    |-- event: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- exit_location: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- gridcode: string (nullable = true)
 |    |    |-- groupbuy_num: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- importance: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- indoor_data: struct (nullable = true)
 |    |    |    |-- cmsid: string (nullable = true)
 |    |    |    |-- cpid: string (nullable = true)
 |    |    |    |-- floor: string (nullable = true)
 |    |    |    |-- truefloor: string (nullable = true)
 |    |    |-- indoor_map: string (nullable = true)
 |    |    |-- location: string (nullable = true)
 |    |    |-- match: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- navi_poiid: string (nullable = true)
 |    |    |-- pcode: string (nullable = true)
 |    |    |-- photos: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- title: string (nullable = true)
 |    |    |    |    |-- url: string (nullable = true)
 |    |    |-- pname: string (nullable = true)
 |    |    |-- poiweight: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- postcode: string (nullable = true)
 |    |    |-- recommend: string (nullable = true)
 |    |    |-- shopid: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- shopinfo: string (nullable = true)
 |    |    |-- tag: string (nullable = true)
 |    |    |-- tel: string (nullable = true)
 |    |    |-- timestamp: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- typecode: string (nullable = true)
 |    |    |-- website: string (nullable = true)
 |-- school_name: string (nullable = true)
 |-- status: string (nullable = true)
 |-- suggestion: struct (nullable = true)
 |    |-- cities: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- keywords: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
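If you would rather not rely on inference, the reader also accepts an explicit schema, which skips the inference pass. A minimal sketch (the two fields are an illustrative subset of the schema above):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("city", StringType, nullable = true),
  StructField("school_name", StringType, nullable = true)
))
val df_typed = spark.read.schema(schema).json("file:///d:/测试数据/sample1.json")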
Running a query:
df_json.createOrReplaceTempView("sample")
val df1 = spark.sql("select school_name from sample")
df1.show(10)

+------------+
| school_name|
+------------+
|三明市尤溪县城南中心小学|
| 三亚市第七小学|
| 天涯小学|
| 三亚市榆红小学|
| 三明市泰宁县下渠学校|
| 三亚丰和学校|
| 铜川学校|
| 三明市三元区岩前小学|
| 三亚市天涯区桶井小学|
| 三亚市吉阳区红沙小学|
+------------+
Note: if the JSON is not strictly well-formed, Spark SQL can still load the file; the problem records are surfaced in the _corrupt_record column:
+--------------------+----+-----+------+----+--------+--------------------+------------+------+--------------------+
|     _corrupt_record|city|count|    id|info|infocode|                pois| school_name|status|          suggestion|
+--------------------+----+-----+------+----+--------+--------------------+------------+------+--------------------+
| null| 三明市| 186|216938| OK| 10000|[[350426,城关镇解放路3号...|三明市尤溪县城南中心小学| 1|[WrappedArray(),W...|
|{"id":"419293","s...|null| null| null|null| null| null| null| null| null|
| null| 三亚市| 25|420278| OK| 10000|[[460204,225国道与31...| 天涯小学| 1|[WrappedArray(),W...|
| null| 三亚市| 73|420132| OK| 10000|[[460203,15路;17路;...| 三亚市榆红小学| 1|[WrappedArray(),W...|
| null| 三明市| 578|216867| OK| 10000|[[350429,松山寺悟旺法师,...| 三明市泰宁县下渠学校| 1|[WrappedArray(),W...|
| null| 三亚市| 745|419818| OK| 10000|[[460203,月北路附近,吉阳...| 三亚丰和学校| 1|[WrappedArray(),W...|
| null| 上海市| 891| 4121| OK| 10000|[[310107,62路;323路...| 铜川学校| 1|[WrappedArray(),W...|
| null| 三明市| 900|216812| OK| 10000|[[350403,星桥大路线,三元...| 三明市三元区岩前小学| 1|[WrappedArray(),W...|
| null| 三亚市| 81|420277| OK| 10000|[[460204,凤凰镇桶井小学对...| 三亚市天涯区桶井小学| 1|[WrappedArray(),W...|
| null| 三亚市| 70|419991| OK| 10000|[[460203,2路,吉阳区,[...| 三亚市吉阳区红沙小学| 1|[WrappedArray(),W...|
+--------------------+----+-----+------+----+--------+--------------------+------------+------+--------------------+
only showing top 10 rows
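How such records are handled is controlled by the reader's mode option. A minimal sketch of the three standard modes:

// PERMISSIVE (default): keep bad records in the _corrupt_record column, as above
// DROPMALFORMED: silently drop bad records
// FAILFAST: throw an exception on the first bad record
val df_strict = spark.read.option("mode", "DROPMALFORMED").json("file:///d:/测试数据/sample1.json")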
Parquet:
Code:
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("sparkSQL").master("local").getOrCreate()
  import spark.implicits._
  // read a Parquet file directly into a DataFrame
  val df_parquet = spark.read.parquet("file:///D:/测试数据/sogou/parquet/part-00000-30a2dca2-91f2-4ea6-8993-c22870048d6f.snappy.parquet")
  df_parquet.show(10)
}
Output (a single value column, because this Parquet file was written from a plain-text Dataset, as shown in the next section):
+--------------------+
| value|
+--------------------+
|20111230000005 57...|
|20111230000005 66...|
|20111230000007 b9...|
|20111230000008 69...|
|20111230000008 f2...|
|20111230000009 96...|
|20111230000009 69...|
|20111230000009 59...|
|20111230000010 f5...|
|20111230000010 28...|
+--------------------+
only showing top 10 rows
Writing to External Data Sources |
1. Parquet format
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("sparkSQL").master("local").getOrCreate()
  import spark.implicits._
  // read the raw text, then write it back out as a single Parquet file
  val df = spark.read.textFile("file:///D:/测试数据/sogou/SogouQ3.txt/*")
  df.repartition(1).write.parquet("file:///D:/测试数据/sogou/parquet/")
}
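By default the writer fails if the target directory already exists. A minimal sketch of controlling that with a save mode (compression is shown explicitly here, though snappy is already the Parquet default):

import org.apache.spark.sql.SaveMode

df.repartition(1)
  .write
  .mode(SaveMode.Overwrite)        // replace any existing output instead of failing
  .option("compression", "snappy")
  .parquet("file:///D:/测试数据/sogou/parquet/")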
2. ORC format
This requires adding the Hive dependency (a usage sketch follows the dependency block):
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.2</version>
<scope>provided</scope>
</dependency>
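With the dependency in place, ORC uses the same DataFrame reader/writer API. A minimal sketch, assuming an output path of .../sogou/orc/ (on Spark 2.1 the session needs enableHiveSupport(), since ORC support there is backed by Hive):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sparkSQL")
  .master("local")
  .enableHiveSupport() // required for ORC on Spark 2.1
  .getOrCreate()

val df = spark.read.textFile("file:///D:/测试数据/sogou/SogouQ3.txt/*")
df.repartition(1).write.orc("file:///D:/测试数据/sogou/orc/")  // write ORC
val df_orc = spark.read.orc("file:///D:/测试数据/sogou/orc/")   // read it back
df_orc.show(10)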
Fixing Garbled Characters When Loading Data |
Note: if the data source is not encoded in UTF-8, loading it directly will produce garbled characters. The fix is as follows:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.sql.SparkSession

// Case class describing one Sogou log record; the field names here are illustrative
case class sogou(ts: String, uid: String, keyword: String, rank: String, order: String, url: String)

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("sparkSQL").master("local").getOrCreate()
  import spark.implicits._
  // read raw bytes with hadoopFile, then decode them as GBK instead of the default UTF-8
  val rdd = spark.sparkContext.hadoopFile("file:///D:/测试数据/sogou/SogouQ3.txt/*", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 1)
    .map(p => new String(p._2.getBytes, 0, p._2.getLength, "GBK"))
    .map(x => x.split("\t"))
    .map(x => sogou(x(0), x(1), x(2), x(3), x(4), x(5)))
  val df = spark.createDataFrame(rdd)
  df.printSchema()
  df.show(10)
}
This is necessary because textFile calls hadoopFile under the hood, which in turn uses TextInputFormat, and that always decodes bytes as UTF-8. By reading the raw Text bytes ourselves and decoding them as GBK, the data loads without garbled characters.