SparkSQL: Loading from and Saving to External Data Sources

Loading External Data Sources

  Spark SQL can load many external data source formats, for example CSV, plain text, JSON, and Parquet. This post covers the JSON and Parquet formats.
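
  All of these formats go through the same DataFrameReader API. As a quick reference, a minimal sketch (assuming a SparkSession named spark like the one built in the code below; the paths are placeholders):

      val df_csv  = spark.read.option("header", "true").csv("file:///d:/path/to/data.csv")   // CSV with a header row
      val df_text = spark.read.text("file:///d:/path/to/data.txt")                           // single string column named "value"
      val df_any  = spark.read.format("json").load("file:///d:/path/to/data.json")           // generic format(...).load(...) form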

  JSON:

    Code:

      import org.apache.spark.sql.SparkSession

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sparkSQL").master("local").getOrCreate()
        import spark.implicits._
        val df_json = spark.read.json("file:///d:/测试数据/sample1.json")  // read the JSON file into a DataFrame
        df_json.printSchema()                                              // print the field names and types
      }

    After loading the JSON file, the DataFrame's schema is inferred automatically:

 

df_json.printSchema()  -->
        root
         |-- city: string (nullable = true)
         |-- count: string (nullable = true)
         |-- id: string (nullable = true)
         |-- info: string (nullable = true)
         |-- infocode: string (nullable = true)
         |-- pois: array (nullable = true)
         |    |-- element: struct (containsNull = true)
         |    |    |-- adcode: string (nullable = true)
         |    |    |-- address: string (nullable = true)
         |    |    |-- adname: string (nullable = true)
         |    |    |-- alias: string (nullable = true)
         |    |    |-- biz_ext: struct (nullable = true)
         |    |    |    |-- cost: string (nullable = true)
         |    |    |    |-- hotel_ordering: string (nullable = true)
         |    |    |    |-- lowest_price: string (nullable = true)
         |    |    |    |-- meal_ordering: string (nullable = true)
         |    |    |    |-- rating: string (nullable = true)
         |    |    |    |-- seat_ordering: string (nullable = true)
         |    |    |    |-- star: array (nullable = true)
         |    |    |    |    |-- element: string (containsNull = true)
         |    |    |    |-- ticket_ordering: string (nullable = true)
         |    |    |-- biz_type: string (nullable = true)
         |    |    |-- business_area: string (nullable = true)
         |    |    |-- children: array (nullable = true)
         |    |    |    |-- element: string (containsNull = true)
         |    |    |-- citycode: string (nullable = true)
         |    |    |-- cityname: string (nullable = true)
         |    |    |-- discount_num: string (nullable = true)
         |    |    |-- distance: string (nullable = true)
         |    |    |-- email: array (nullable = true)
         |    |    |    |-- element: string (containsNull = true)
         |    |    |-- entr_location: string (nullable = true)
         |    |    |-- event: array (nullable = true)
         |    |    |    |-- element: string (containsNull = true)
         |    |    |-- exit_location: array (nullable = true)
         |    |    |    |-- element: string (containsNull = true)
         |    |    |-- gridcode: string (nullable = true)
         |    |    |-- groupbuy_num: string (nullable = true)
         |    |    |-- id: string (nullable = true)
         |    |    |-- importance: array (nullable = true)
         |    |    |    |-- element: string (containsNull = true)
         |    |    |-- indoor_data: struct (nullable = true)
         |    |    |    |-- cmsid: string (nullable = true)
         |    |    |    |-- cpid: string (nullable = true)
         |    |    |    |-- floor: string (nullable = true)
         |    |    |    |-- truefloor: string (nullable = true)
         |    |    |-- indoor_map: string (nullable = true)
         |    |    |-- location: string (nullable = true)
         |    |    |-- match: string (nullable = true)
         |    |    |-- name: string (nullable = true)
         |    |    |-- navi_poiid: string (nullable = true)
         |    |    |-- pcode: string (nullable = true)
         |    |    |-- photos: array (nullable = true)
         |    |    |    |-- element: struct (containsNull = true)
         |    |    |    |    |-- title: string (nullable = true)
         |    |    |    |    |-- url: string (nullable = true)
         |    |    |-- pname: string (nullable = true)
         |    |    |-- poiweight: array (nullable = true)
         |    |    |    |-- element: string (containsNull = true)
         |    |    |-- postcode: string (nullable = true)
         |    |    |-- recommend: string (nullable = true)
         |    |    |-- shopid: array (nullable = true)
         |    |    |    |-- element: string (containsNull = true)
         |    |    |-- shopinfo: string (nullable = true)
         |    |    |-- tag: string (nullable = true)
         |    |    |-- tel: string (nullable = true)
         |    |    |-- timestamp: array (nullable = true)
         |    |    |    |-- element: string (containsNull = true)
         |    |    |-- type: string (nullable = true)
         |    |    |-- typecode: string (nullable = true)
         |    |    |-- website: string (nullable = true)
         |-- school_name: string (nullable = true)
         |-- status: string (nullable = true)
         |-- suggestion: struct (nullable = true)
         |    |-- cities: array (nullable = true)
         |    |    |-- element: string (containsNull = true)
         |    |-- keywords: array (nullable = true)
         |    |    |-- element: string (containsNull = true)

 

  Run a query:

          df_json.createOrReplaceTempView("sample")
          val df1 = spark.sql("select school_name from sample")
          df1.show(10)
        +------------+
        | school_name|
        +------------+
        |三明市尤溪县城南中心小学|
        |     三亚市第七小学|
        |        天涯小学|
        |     三亚市榆红小学|
        |  三明市泰宁县下渠学校|
        |      三亚丰和学校|
        |        铜川学校|
        |  三明市三元区岩前小学|
        |  三亚市天涯区桶井小学|
        |  三亚市吉阳区红沙小学|
        +------------+
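
  The same query can also be expressed with the DataFrame API instead of a temporary view. A sketch against the df_json loaded above (the nested field names come from the inferred schema shown earlier):

          df_json.select("school_name").show(10)        // equivalent to the SQL query above

          // nested fields inside the pois array can be reached after exploding it
          import org.apache.spark.sql.functions.explode
          df_json.select(explode($"pois").as("poi"))
                 .select("poi.name", "poi.address")
                 .show(10)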

  Note: even if some records are not well-formed JSON, Spark SQL still loads the file; the problematic records are surfaced (by default in a _corrupt_record column), as shown below:

 +--------------------+----+-----+------+----+--------+--------------------+------------+------+--------------------+
        |     _corrupt_record|city|count|    id|info|infocode|                pois| school_name|status|          suggestion|
        +--------------------+----+-----+------+----+--------+--------------------+------------+------+--------------------+
        |                null| 三明市|  186|216938|  OK|   10000|[[350426,城关镇解放路3号...|三明市尤溪县城南中心小学|     1|[WrappedArray(),W...|
        |{"id":"419293","s...|null| null|  null|null|    null|                null|        null|  null|                null|
        |                null| 三亚市|   25|420278|  OK|   10000|[[460204,225国道与31...|        天涯小学|     1|[WrappedArray(),W...|
        |                null| 三亚市|   73|420132|  OK|   10000|[[460203,15路;17路;...|     三亚市榆红小学|     1|[WrappedArray(),W...|
        |                null| 三明市|  578|216867|  OK|   10000|[[350429,松山寺悟旺法师,...|  三明市泰宁县下渠学校|     1|[WrappedArray(),W...|
        |                null| 三亚市|  745|419818|  OK|   10000|[[460203,月北路附近,吉阳...|      三亚丰和学校|     1|[WrappedArray(),W...|
        |                null| 上海市|  891|  4121|  OK|   10000|[[310107,62路;323路...|        铜川学校|     1|[WrappedArray(),W...|
        |                null| 三明市|  900|216812|  OK|   10000|[[350403,星桥大路线,三元...|  三明市三元区岩前小学|     1|[WrappedArray(),W...|
        |                null| 三亚市|   81|420277|  OK|   10000|[[460204,凤凰镇桶井小学对...|  三亚市天涯区桶井小学|     1|[WrappedArray(),W...|
        |                null| 三亚市|   70|419991|  OK|   10000|[[460203,2路,吉阳区,[...|  三亚市吉阳区红沙小学|     1|[WrappedArray(),W...|
        +--------------------+----+-----+------+----+--------+--------------------+------------+------+--------------------+
        only showing top 10 rows
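
  How malformed records are handled is controlled by the JSON reader's mode option. A minimal sketch of the three modes, using the same sample file:

        // PERMISSIVE (default): keep bad records and put their raw text into a corrupt-record column
        spark.read.option("mode", "PERMISSIVE")
                  .option("columnNameOfCorruptRecord", "_corrupt_record")
                  .json("file:///d:/测试数据/sample1.json")

        // DROPMALFORMED: silently drop records that cannot be parsed
        spark.read.option("mode", "DROPMALFORMED").json("file:///d:/测试数据/sample1.json")

        // FAILFAST: throw an exception as soon as a malformed record is encountered
        spark.read.option("mode", "FAILFAST").json("file:///d:/测试数据/sample1.json")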

  Parquet:

    Code:

      import org.apache.spark.sql.SparkSession

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sparkSQL").master("local").getOrCreate()
        import spark.implicits._
        // read a single Parquet part file into a DataFrame
        val df_parquet = spark.read.parquet("file:///D:/测试数据/sogou/parquet/part-00000-30a2dca2-91f2-4ea6-8993-c22870048d6f.snappy.parquet")
        df_parquet.show(10)
      }

  Output:

      +--------------------+
      |               value|
      +--------------------+
      |20111230000005 57...|
      |20111230000005 66...|
      |20111230000007 b9...|
      |20111230000008 69...|
      |20111230000008 f2...|
      |20111230000009 96...|
      |20111230000009 69...|
      |20111230000009 59...|
      |20111230000010 f5...|
      |20111230000010 28...|
      +--------------------+
        only showing top 10 rows
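
  Hard-coding a single part file name is brittle, since Parquet output is a directory of part files. The reader also accepts the directory itself; a sketch pointing at the whole output directory:

      val df_all = spark.read.parquet("file:///D:/测试数据/sogou/parquet/")   // picks up every part file in the directory
      df_all.show(10)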

Saving to External Data Sources

  1. Saving in Parquet format
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("sparkSQL").master("local").getOrCreate()
      import spark.implicits._
      // read the raw Sogou query log as a Dataset[String]
      val df = spark.read.textFile("file:///D:/测试数据/sogou/SogouQ3.txt/*")
      // collapse to one partition so a single Parquet part file is written
      df.repartition(1).write.parquet("file:///D:/测试数据/sogou/parquet/")
    }
  2. ORC format
    The Hive dependency must be added first (a write sketch follows the dependency block):
      <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-hive_2.11</artifactId>
       <version>2.1.2</version>
       <scope>provided</scope>
      </dependency>
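
    With the dependency on the classpath, writing ORC looks much like the Parquet case. A minimal sketch for Spark 2.1 (Hive support enabled on the session; the output directory is a placeholder):

      import org.apache.spark.sql.SparkSession

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sparkSQL").master("local")
          .enableHiveSupport()   // the ORC data source in Spark 2.1 needs Hive support
          .getOrCreate()
        val df = spark.read.textFile("file:///D:/测试数据/sogou/SogouQ3.txt/*")
        df.repartition(1).write.orc("file:///D:/测试数据/sogou/orc/")   // placeholder output directory
      }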

Fixing Garbled Text When Loading Data

 

Note: if the source data is not encoded in UTF-8, loading it directly will produce garbled text. The workaround is as follows:
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.TextInputFormat
  import org.apache.spark.sql.SparkSession

  // one Sogou log record; the six field names here are illustrative, not from the original post
  case class sogou(time: String, uid: String, keyword: String, rank: String, order: String, url: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparkSQL").master("local").getOrCreate()
    import spark.implicits._
    // read raw bytes via hadoopFile, then decode each line with the GBK charset
    val rdd = spark.sparkContext.hadoopFile("file:///D:/测试数据/sogou/SogouQ3.txt/*", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 1)
                  .map(p => new String(p._2.getBytes, 0, p._2.getLength, "GBK"))
                  .map(x => x.split("\t"))
                  .map(x => new sogou(x(0), x(1), x(2), x(3), x(4), x(5)))
    val df = spark.createDataFrame(rdd)
    df.printSchema()
    df.show(10)
  }

  This is necessary because textFile is built on top of hadoopFile with TextInputFormat, and that path always decodes lines as UTF-8. Calling hadoopFile directly and decoding the raw bytes with the correct charset (GBK here) is what loads the data without garbling.
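
  For reference, this is roughly what SparkContext.textFile does internally (a simplified sketch, not the exact Spark source):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // a simplified re-creation of SparkContext.textFile
    def textFileLikeSpark(sc: SparkContext, path: String, minPartitions: Int): RDD[String] =
      sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions)
        .map(pair => pair._2.toString)   // Text.toString always decodes the bytes as UTF-8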

 

Reposted from: https://www.cnblogs.com/lyr999736/p/10224774.html
