[Spark] Spark SQL Data Type Conversion

云祁

Article link: https://blog.csdn.net/BeiisBei/article/details/104441059

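Preface

Data type conversion comes up in every language and framework. It looks simple, but getting comfortable with all of the data types takes some practice.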

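Spark SQL data types

Numeric types

ByteType: a 1-byte signed integer. Range: -128 to 127.
ShortType: a 2-byte signed integer. Range: -32768 to 32767.
IntegerType: a 4-byte signed integer. Range: -2147483648 to 2147483647.
LongType: an 8-byte signed integer. Range: -9223372036854775808 to 9223372036854775807.
FloatType: a 4-byte single-precision floating-point number.
DoubleType: an 8-byte double-precision floating-point number.
DecimalType: an arbitrary-precision decimal number, backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary-precision integer unscaled value and a 32-bit integer scale.

String, binary, and boolean types

StringType: a string value.
BinaryType: a sequence of bytes.
BooleanType: a boolean value.

Datetime types

TimestampType: a value made up of the fields year, month, day, hour, minute, and second.
DateType: a value made up of the fields year, month, and day.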
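
The integer ranges and the BigDecimal layout described above can be checked in plain Scala, without Spark. This is a minimal sketch; the object name NumericRangesSketch and the sample value 123.45 are chosen only for illustration.

```scala
object NumericRangesSketch {
  // Scala's primitive types have the same bounds as the Spark SQL integer types.
  val byteRange: (Byte, Byte) = (Byte.MinValue, Byte.MaxValue)       // ByteType
  val shortRange: (Short, Short) = (Short.MinValue, Short.MaxValue)  // ShortType
  val intRange: (Int, Int) = (Int.MinValue, Int.MaxValue)            // IntegerType
  val longRange: (Long, Long) = (Long.MinValue, Long.MaxValue)       // LongType

  // A BigDecimal is an unscaled integer plus a 32-bit scale:
  // 123.45 is stored as unscaled value 12345 with scale 2.
  val price: java.math.BigDecimal = new java.math.BigDecimal("123.45")
  val unscaled: java.math.BigInteger = price.unscaledValue()
  val scale: Int = price.scale()

  def main(args: Array[String]): Unit = {
    println(s"ByteType range:    $byteRange")
    println(s"ShortType range:   $shortRange")
    println(s"IntegerType range: $intRange")
    println(s"LongType range:    $longRange")
    println(s"$price = $unscaled x 10^-$scale")
  }
}
```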

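Complex types

ArrayType(elementType, containsNull): a sequence of elements of type elementType. containsNull indicates whether the elements of the array may be null.
MapType(keyType, valueType, valueContainsNull): a set of key-value pairs. keyType is the data type of the keys and valueType is the data type of the values. valueContainsNull indicates whether the values of the map may be null.
StructType(fields): a value whose structure is described by a sequence of StructFields (fields).
StructField(name, dataType, nullable): a single field of a StructType. name is the field name, dataType is the field's data type, and nullable indicates whether the field's value may be null.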
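
To make the complex types concrete, here is a minimal sketch of a schema that combines them. The field names (id, name, tags, scores) and the object name SchemaSketch are made up for illustration; only the type constructors come from org.apache.spark.sql.types. It assumes the spark-sql library is on the classpath, but it does not need a running SparkSession.

```scala
import org.apache.spark.sql.types._

object SchemaSketch {
  // A hypothetical user schema; the field names are illustrative only.
  val userSchema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true),
    // An array of strings whose elements are never null.
    StructField("tags", ArrayType(StringType, containsNull = false), nullable = true),
    // A map from string keys to double values, where a value may be null.
    StructField("scores", MapType(StringType, DoubleType, valueContainsNull = true), nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    // Print the schema as an indented tree.
    println(userSchema.treeString)
  }
}
```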

Spark SQL data types compared with Scala types

Spark SQL data type	Scala value type
ByteType	Byte
ShortType	Short
IntegerType	Int
LongType	Long
FloatType	Float
DoubleType	Double
DecimalType	java.math.BigDecimal
StringType	String
BinaryType	Array[Byte]
BooleanType	Boolean
TimestampType	java.sql.Timestamp
DateType	java.sql.Date
ArrayType	scala.collection.Seq
MapType	scala.collection.Map
StructType	org.apache.spark.sql.Row
StructField	The value type in Scala of the data type of this field (For example, Int for a StructField with the data type IntegerType)
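
The mapping in the table shows up when reading values out of a Row. The sketch below builds a Row by hand and reads each value back with the typed getter for its Scala type. The values and the object name RowTypesSketch are invented for illustration, and the example assumes spark-sql is on the classpath.

```scala
import org.apache.spark.sql.Row

object RowTypesSketch {
  // A hand-built row: Int, String, Seq, Map, and java.sql.Date values.
  val row: Row = Row(
    1,
    "tom",
    Seq("a", "b"),
    Map("math" -> 90.5),
    java.sql.Date.valueOf("2020-02-22")
  )

  val id: Int = row.getInt(0)                                   // IntegerType -> Int
  val name: String = row.getString(1)                           // StringType  -> String
  val tags: scala.collection.Seq[String] = row.getSeq[String](2) // ArrayType  -> Seq
  val scores: scala.collection.Map[String, Double] =
    row.getMap[String, Double](3)                               // MapType     -> Map
  val day: java.sql.Date = row.getDate(4)                       // DateType    -> java.sql.Date

  def main(args: Array[String]): Unit = {
    println(s"id=$id name=$name tags=$tags scores=$scores day=$day")
  }
}
```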

Spark SQL data type conversion examples

In one sentence: call the cast method of the Column class.

How to obtain a Column

I covered this in an earlier post:

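import org.apache.spark.sql.functions.col // provides col(...)
import spark.implicits._                   // provides $"..."; assumes a SparkSession named `spark`

df("columnName")            // On a specific `df` DataFrame.
col("columnName")           // A generic column not yet associated with a DataFrame.
col("columnName.field")     // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName"               // Scala short hand for a named column.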

Preparing the test data (saved as ./data/user)

1,tom,23
2,jack,24
3,lily,18
4,lucy,19

Spark entry point

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("test")
  .master("local[*]")
  .getOrCreate()

Checking the default data types

import spark.implicits._ // encoders needed by map(...) and toDF(...)

spark.read
  .textFile("./data/user")
  .map(_.split(","))
  .map(x => (x(0), x(1), x(2)))
  .toDF("id", "name", "age")
  .dtypes
  .foreach(println)

Result:

(id,StringType)
(name,StringType)
(age,StringType)

Every column defaults to StringType, because textFile reads each line as a string and split returns strings.

Casting the numeric columns to IntegerType

import spark.implicits._

spark.read
  .textFile("./data/user")
  .map(_.split(","))
  .map(x => (x(0), x(1), x(2)))
  .toDF("id", "name", "age")
  .select($"id".cast("int"), $"name", $"age".cast("int"))
  .dtypes
  .foreach(println)

Result:

(id,IntegerType)
(name,StringType)
(age,IntegerType)
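
One thing worth knowing about cast is what happens when a string cannot be parsed as a number. The sketch below uses an in-memory Seq instead of the ./data/user file, and the second row ("bob", "unknown") is invented for illustration. It explicitly sets spark.sql.ansi.enabled to false, in which case an unparsable value becomes null. With ANSI mode enabled, the same cast may raise an error instead, so treat this as a sketch of the non-ANSI behavior rather than a universal rule.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CastNullSketch {
  val spark: SparkSession = SparkSession
    .builder()
    .appName("cast-null-sketch")
    .master("local[*]")
    // Assumption: ANSI mode is turned off so that invalid casts yield null.
    .config("spark.sql.ansi.enabled", "false")
    .getOrCreate()

  import spark.implicits._

  // All three columns start out as StringType.
  val raw: DataFrame = Seq(
    ("1", "tom", "23"),
    ("5", "bob", "unknown") // hypothetical row with a non-numeric age
  ).toDF("id", "name", "age")

  // Cast id and age to IntegerType; name stays a string.
  val typed: DataFrame = raw.select($"id".cast("int"), $"name", $"age".cast("int"))

  def main(args: Array[String]): Unit = {
    typed.printSchema()
    typed.show()
    spark.stop()
  }
}
```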

The two overloads of Column.cast

First overload
def cast(to: String): Column
Casts the column to a different data type, using the canonical string representation of the type. The supported types are:
string, boolean, byte, short, int, long, float, double, decimal, date, timestamp.

// Casts colA to integer.
df.select(df("colA").cast("int"))
Since 1.3.0

Second overload
def cast(to: DataType): Column
Casts the column to a different data type.

// Casts colA to IntegerType.
import org.apache.spark.sql.types.IntegerType
df.select(df("colA").cast(IntegerType))
// equivalent to
df.select(df("colA").cast("int"))
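
The two overloads can be compared side by side on a parameterized type. In this sketch the column names (id, price), the sample values, and the object name CastOverloadsSketch are illustrative. It assumes that the string form accepts the same type syntax as Spark SQL DDL, such as decimal(10,2).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DecimalType, LongType}

object CastOverloadsSketch {
  val spark: SparkSession = SparkSession
    .builder()
    .appName("cast-overloads-sketch")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // Hypothetical input: both columns are strings.
  val df: DataFrame = Seq(("1", "19.99"), ("2", "5.50")).toDF("id", "price")

  // Overload 1: the target type is given as a string.
  val byName: DataFrame =
    df.select($"id".cast("long"), $"price".cast("decimal(10,2)"))

  // Overload 2: the target type is given as a DataType object.
  val byType: DataFrame =
    df.select($"id".cast(LongType), $"price".cast(DecimalType(10, 2)))

  def main(args: Array[String]): Unit = {
    byName.dtypes.foreach(println)
    byType.dtypes.foreach(println)
    spark.stop()
  }
}
```

The DataType overload is checked by the compiler, while a misspelled type string only fails at run time, which is one reason to prefer it in larger code bases.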