Reading External Files into a Spark DataFrame and Parsing the Data Format

A Spark DataFrame is actually a special case of Dataset, and DataFrame applies many optimizations to the SQL execution process. DataFrame is now about as convenient to use as Pandas in Python, so this post records how to read an external file into a DataFrame and parse it.

type DataFrame = Dataset[Row]

Spark exposes many options when reading a CSV file. For example, "inferSchema" controls whether automatic type inference is enabled, and "header" indicates whether the file has a header row. All the available options and their meanings are listed below; a short sketch using a few of them follows the table.

| Parameter | Default | Description |
| --- | --- | --- |
| sep | , | sets the single character as a separator for each field and value |
| encoding | UTF-8 | decodes the CSV files by the given encoding type |
| quote | " | sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv |
| escape | \ | sets the single character used for escaping quotes inside an already quoted value |
| comment | (disabled) | sets the single character used for skipping lines beginning with this character. By default, it is disabled |
| header | false | uses the first line as names of columns |
| inferSchema | false | infers the input schema automatically from data. It requires one extra pass over the data |
| ignoreLeadingWhiteSpace | false | defines whether or not leading whitespaces from values being read should be skipped |
| ignoreTrailingWhiteSpace | false | defines whether or not trailing whitespaces from values being read should be skipped |
| nullValue | (empty string) | sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type |
| nanValue | NaN | sets the string representation of a "non-number" value |
| positiveInf | Inf | sets the string representation of a positive infinity value |
| negativeInf | -Inf | sets the string representation of a negative infinity value |
| dateFormat | yyyy-MM-dd | sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type |
| timestampFormat | yyyy-MM-dd'T'HH:mm:ss.SSSZZ | sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type |
| maxColumns | 20480 | defines a hard limit of how many columns a record can have |
| maxCharsPerColumn | -1 | defines the maximum number of characters allowed |
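
As a quick illustration of a few of these options, here is a minimal sketch (not part of the original read code; the nullValue, dateFormat, and comment values are arbitrary examples, and optionsDF is just an illustrative name):

    val optionsDF = spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .option("nullValue", "NA")            // treat the literal string "NA" as null
      .option("dateFormat", "yyyy-MM-dd")   // pattern used when parsing date columns
      .option("comment", "#")               // skip lines that start with '#'
      .load("C:\\Users\\xxx\\Desktop\\sample.csv")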

Suppose our sample.csv looks like this:

item_id,month,total,distinct
2638,201801,1684,142
4120,201801,93,24
949976,201801,46,5
457,201801,4051,98
871603,201801,167,28
317120,201801,61,2

Code to read it:

    val csvDF = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("C:\\Users\\xxx\\Desktop\\sample.csv")
    csvDF.printSchema()

The schema prints as follows:

root
 |-- item_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- total: integer (nullable = true)
 |-- distinct: integer (nullable = true)

If we want to change the data type of total and distinct to double, we need to use withColumn(colName: String, col: Column) and import the types Spark SQL supports via import org.apache.spark.sql.types._ (IntegerType, DoubleType, StringType, and so on). To use the col column function, we also need import org.apache.spark.sql.functions._

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    val csvDF = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("C:\\Users\\xxx\\Desktop\\sample.csv")
      .withColumn("total", col("total").cast(DoubleType))
      .withColumn("distinct", col("distinct").cast(DoubleType))
    csvDF.printSchema()

Now the schema becomes:

root
 |-- item_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- total: double (nullable = true)
 |-- distinct: double (nullable = true)
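
As a side note, the two withColumn calls above can also be written as a single select. A minimal equivalent sketch, assuming the same csvDF and imports as above (castDF is just an illustrative name):

    val castDF = csvDF.select(
      col("item_id"),
      col("month"),
      col("total").cast(DoubleType),
      col("distinct").cast(DoubleType)
    )
    castDF.printSchema()  // same schema as shown above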

The automatically inferred types sometimes do not meet our needs, in which case a conversion is required. There are two approaches: the first is to convert the DataFrame back to an RDD and do the processing there; the other is to convert directly within the DataFrame. Below we compare the runtime of the two approaches. We add a case class Person to parse the rows of the DataFrame; the fields of Person must have the same names as the columns in the data, otherwise you get an error like:

Exception in thread "main" org.apache.spark.sql.AnalysisException:
 cannot resolve '`id`' given input columns: [item_id, month, total, distinct];

How many columns you actually need depends on your data, but the field names must stay consistent with the column names; if they differ, you can rename the columns first, as sketched below.
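
For example (not from the original post), if the file had a column named id while the case class expects item_id, one hedged option is to rename the column before the conversion using withColumnRenamed:

    // Hypothetical: align a mismatched column name with the case class field name
    val renamedDF = csvDF.withColumnRenamed("id", "item_id")

The complete comparison program is below.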

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object DataFrameTest {

  case class Person(item_id: String, month: String, total: Double, distinct: Double) {
    def info(): String = {
      "item_id:" + this.item_id + " total:" + total
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    val startTime = System.currentTimeMillis()

    import spark.implicits._
    import org.apache.spark.sql.functions._
    val csvDF = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("C:\\Users\\xxx\\Desktop\\sample_1.csv")
      .withColumn("total", col("total").cast(DoubleType))
      .withColumn("distinct", col("distinct").cast(DoubleType))
    csvDF.printSchema()
    csvDF.as[Person].map(p => p.info())  // Approach 1: convert directly in the DataFrame
    // Approach 2: convert in the RDD by building a Person from each Row
    // csvDF.rdd.map(row => Person(row.getInt(0).toString, row.getInt(1).toString, row.getDouble(2), row.getDouble(3)).info())

    val endTime = System.currentTimeMillis()
    println("Run" + (endTime - startTime) + "ms")
  }
}

Here the data type of the two columns was changed from the inferred IntegerType to DoubleType.

Comparing the two approaches on 2,000,000 rows, approach 1 (working directly in the DataFrame) takes 2290 ms while approach 2 takes 3411 ms, which shows that converting to an RDD carries a noticeable cost, so approach 1 is recommended. Note that because the case class is used for the implicit conversion of the data types, import spark.implicits._ is required.

Another way to read a CSV file

Parsing the format with a case class, as above, is one way to read the file. Another way is to define the schema in advance, including column names and types. If the provided schema has more columns than the data file, the extra columns are all null; if it has fewer columns than the data file, only the first few columns are read. In this case there is no need to enable automatic inference (the inferSchema option).

    import org.apache.spark.sql.types._
    val schema = StructType(
      StructField("item_id", IntegerType)::
        StructField("month", IntegerType)::
        StructField("total", DoubleType)::
        StructField("distinct", DoubleType)::
        Nil
    )
    val df = spark.read.schema(schema)
      .option("header", true)
      .csv("C:\\Users\\xxx\\Desktop\\sample_1.csv")

    df.printSchema()
    df.show()
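
As a side note (not in the original post), since Spark 2.3 the schema can also be passed as a DDL-formatted string instead of building a StructType by hand. A minimal sketch with the same columns (df2 is just an illustrative name):

    val df2 = spark.read
      .schema("item_id INT, month INT, total DOUBLE, distinct DOUBLE")  // DDL-style schema string
      .option("header", true)
      .csv("C:\\Users\\xxx\\Desktop\\sample_1.csv")
    df2.printSchema()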
