Reading External Files into a Spark DataFrame and Parsing the Data Format

A Spark DataFrame is actually a special case of Dataset, and DataFrame applies many optimizations to the SQL execution process. DataFrame is now about as convenient to use as Pandas in Python, so this post records how to read an external file into a DataFrame and parse it.

type DataFrame = Dataset[Row]

Spark exposes many options when reading a CSV file. For example, "inferSchema" controls whether automatic type inference is enabled, and "header" indicates whether the file has a header row. All the available options and their meanings are listed below; a short sketch using a few of them follows the table.

| Parameter | Default | Description |
| --- | --- | --- |
| sep | , | sets the single character as a separator for each field and value |
| encoding | UTF-8 | decodes the CSV files by the given encoding type |
| quote | " | sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv |
| escape | \ | sets the single character used for escaping quotes inside an already quoted value |
| comment | (disabled) | sets the single character used for skipping lines beginning with this character. By default, it is disabled |
| header | false | uses the first line as names of columns |
| inferSchema | false | infers the input schema automatically from data. It requires one extra pass over the data |
| ignoreLeadingWhiteSpace | false | defines whether or not leading whitespaces from values being read should be skipped |
| ignoreTrailingWhiteSpace | false | defines whether or not trailing whitespaces from values being read should be skipped |
| nullValue | (empty string) | sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type |
| nanValue | NaN | sets the string representation of a "non-number" value |
| positiveInf | Inf | sets the string representation of a positive infinity value |
| negativeInf | -Inf | sets the string representation of a negative infinity value |
| dateFormat | yyyy-MM-dd | sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type |
| timestampFormat | yyyy-MM-dd'T'HH:mm:ss.SSSZZ | sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type |
| maxColumns | 20480 | defines a hard limit of how many columns a record can have |
| maxCharsPerColumn | -1 | defines the maximum number of characters allowed |
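
As a quick illustration of a few of these options, here is a minimal sketch (not part of the original read code; the nullValue, dateFormat, and comment values are arbitrary examples, and optionsDF is just an illustrative name):

    val optionsDF = spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .option("nullValue", "NA")            // treat the literal string "NA" as null
      .option("dateFormat", "yyyy-MM-dd")   // pattern used when parsing date columns
      .option("comment", "#")               // skip lines that start with '#'
      .load("C:\\Users\\xxx\\Desktop\\sample.csv")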

Suppose our sample.csv looks like this:

item_id,month,total,distinct
2638,201801,1684,142
4120,201801,93,24
949976,201801,46,5
457,201801,4051,98
871603,201801,167,28
317120,201801,61,2

Code to read it:

    val csvDF = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("C:\\Users\\xxx\\Desktop\\sample.csv")
    csvDF.printSchema()

The schema prints as follows:

root
 |-- item_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- total: integer (nullable = true)
 |-- distinct: integer (nullable = true)

If we want to change the data type of total and distinct to double, we need to use withColumn(colName: String, col: Column) and import the types Spark SQL supports via import org.apache.spark.sql.types._ (IntegerType, DoubleType, StringType, and so on). To use the col column function, we also need import org.apache.spark.sql.functions._

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    val csvDF = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("C:\\Users\\xxx\\Desktop\\sample.csv")
      .withColumn("total", col("total").cast(DoubleType))
      .withColumn("distinct", col("distinct").cast(DoubleType))
    csvDF.printSchema()

Now the schema becomes:

root
 |-- item_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- total: double (nullable = true)
 |-- distinct: double (nullable = true)
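
As a side note, the two withColumn calls above can also be written as a single select. A minimal equivalent sketch, assuming the same csvDF and imports as above (castDF is just an illustrative name):

    val castDF = csvDF.select(
      col("item_id"),
      col("month"),
      col("total").cast(DoubleType),
      col("distinct").cast(DoubleType)
    )
    castDF.printSchema()  // same schema as shown above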

The automatically inferred types sometimes do not meet our needs, in which case a conversion is required. There are two approaches: the first is to convert the DataFrame back to an RDD and do the processing there; the other is to convert directly within the DataFrame. Below we compare the runtime of the two approaches. We add a case class Person to parse the rows of the DataFrame; the fields of Person must have the same names as the columns in the data, otherwise you get an error like:

Exception in thread "main" org.apache.spark.sql.AnalysisException:
 cannot resolve '`id`' given input columns: [item_id, month, total, distinct];

How many columns you actually need depends on your data, but the field names must stay consistent with the column names; if they differ, you can rename the columns first, as sketched below.
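
For example (not from the original post), if the file had a column named id while the case class expects item_id, one hedged option is to rename the column before the conversion using withColumnRenamed:

    // Hypothetical: align a mismatched column name with the case class field name
    val renamedDF = csvDF.withColumnRenamed("id", "item_id")

The complete comparison program is below.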

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object DataFrameTest {

  case class Person(item_id: String, month: String, total: Double, distinct: Double) {
    def info(): String = {
      "item_id:" + this.item_id + " total:" + total
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    val startTime = System.currentTimeMillis()

    import spark.implicits._
    import org.apache.spark.sql.functions._
    val csvDF = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("C:\\Users\\xxx\\Desktop\\sample_1.csv")
      .withColumn("total", col("total").cast(DoubleType))
      .withColumn("distinct", col("distinct").cast(DoubleType))
    csvDF.printSchema()
    csvDF.as[Person].map(p => p.info())  // Approach 1: convert directly in the DataFrame
    // Approach 2: convert in the RDD by building a Person from each Row
    // csvDF.rdd.map(row => Person(row.getInt(0).toString, row.getInt(1).toString, row.getDouble(2), row.getDouble(3)).info())

    val endTime = System.currentTimeMillis()
    println("Run" + (endTime - startTime) + "ms")
  }
}

Here the data type of the two columns was changed from the inferred IntegerType to DoubleType.

Comparing the two approaches on 2,000,000 rows, approach 1 (working directly in the DataFrame) takes 2290 ms while approach 2 takes 3411 ms, which shows that converting to an RDD carries a noticeable cost, so approach 1 is recommended. Note that because the case class is used for the implicit conversion of the data types, import spark.implicits._ is required.

Another way to read a CSV file

Parsing the format with a case class, as above, is one way to read the file. Another way is to define the schema in advance, including column names and types. If the provided schema has more columns than the data file, the extra columns are all null; if it has fewer columns than the data file, only the first few columns are read. In this case there is no need to enable automatic inference (the inferSchema option).

    import org.apache.spark.sql.types._
    val schema = StructType(
      StructField("item_id", IntegerType)::
        StructField("month", IntegerType)::
        StructField("total", DoubleType)::
        StructField("distinct", DoubleType)::
        Nil
    )
    val df = spark.read.schema(schema)
      .option("header", true)
      .csv("C:\\Users\\xxx\\Desktop\\sample_1.csv")

    df.printSchema()
    df.show()
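
As a side note (not in the original post), since Spark 2.3 the schema can also be passed as a DDL-formatted string instead of building a StructType by hand. A minimal sketch with the same columns (df2 is just an illustrative name):

    val df2 = spark.read
      .schema("item_id INT, month INT, total DOUBLE, distinct DOUBLE")  // DDL-style schema string
      .option("header", true)
      .csv("C:\\Users\\xxx\\Desktop\\sample_1.csv")
    df2.printSchema()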
