Spark – How to Concatenate DataFrame columns

Using the concat() or concat_ws() Spark SQL functions, we can concatenate one or more DataFrame columns into a single column. In this article, you will learn how to use these functions, and also how to concatenate columns using raw SQL, with Scala examples.

Related: Concatenate PySpark (Python) DataFrame columns

1. Preparing Data & DataFrame


  val data = Seq(("James","A","Smith","2018","M",3000),
    ("Michael","Rose","Jones","2010","M",4000),
    ("Robert","K","Williams","2010","M",4000),
    ("Maria","Anne","Jones","2005","F",4000),
    ("Jen","Mary","Brown","2010","",-1)
  )

  val columns = Seq("fname","mname","lname","dob_year","gender","salary")
  import spark.implicits._
  val df = data.toDF(columns:_*)
  df.show(false)

Note that we need to import the implicits on the “spark” object, which is an instance of SparkSession, in order to use toDF() on a Seq collection. df.show() yields the output below.


+-------+-----+--------+--------+------+------+
|fname  |mname|lname   |dob_year|gender|salary|
+-------+-----+--------+--------+------+------+
|James  |A    |Smith   |2018    |M     |3000  |
|Michael|Rose |Jones   |2010    |M     |4000  |
|Robert |K    |Williams|2010    |M     |4000  |
|Maria  |Anne |Jones   |2005    |F     |4000  |
|Jen    |Mary |Brown   |2010    |      |-1    |
+-------+-----+--------+--------+------+------+

2. Using concat() Function to Concatenate DataFrame Columns

Spark SQL functions provide concat() to concatenate two or more DataFrame columns into a single Column.

Syntax


concat(exprs: Column*): Column

It can also take columns of different data types and concatenate them into a single column; for example, it supports String, Int, Boolean, and array columns.
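Conceptually, concatenating mixed-type columns amounts to rendering each value as a string before joining, much like Spark implicitly casts Int or Boolean columns in concat(). A plain-Scala sketch of the idea (no Spark required):

```scala
// Plain-Scala sketch of concat() over mixed types: each value is
// rendered as a string and the pieces are joined together.
val mixed: Seq[Any] = Seq("James", 3000, true)
val joined = mixed.map(_.toString).mkString(",")
println(joined) // James,3000,true
```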


  df.select(concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")).as("FullName"))
      .show(false)

This statement creates a “FullName” column by concatenating the fname, mname, and lname columns, separated by a comma delimiter. To add the delimiter, we used the lit() function. This yields output with just the concatenated column.


+------------------+
|FullName          |
+------------------+
|James,A,Smith     |
|Michael,Rose,Jones|
|Robert,K,Williams |
|Maria,Anne,Jones  |
|Jen,Mary,Brown    |
+------------------+

3. Using concat() Function on withColumn()

Let’s see another example using concat() with withColumn(); here, we add a new column FullName by concatenating the name columns.


  df.withColumn("FullName",concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")))
    .show(false)

The above snippet also keeps the individual name columns; if you do not need them, you can drop them using the statement below.


  df.withColumn("FullName",concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
    .show(false)

This yields the below output.


+--------+------+------+------------------+
|dob_year|gender|salary|FullName          |
+--------+------+------+------------------+
|2018    |M     |3000  |James,A,Smith     |
|2010    |M     |4000  |Michael,Rose,Jones|
|2010    |M     |4000  |Robert,K,Williams |
|2005    |F     |4000  |Maria,Anne,Jones  |
|2010    |      |-1    |Jen,Mary,Brown    |
+--------+------+------+------------------+

4. Using concat_ws() Function to Concatenate with Delimiter

Adding a delimiter while concatenating DataFrame columns can be done more easily with another function, concat_ws().

Syntax


concat_ws(sep: String, exprs: Column*): Column

The concat_ws() function takes the delimiter as its first argument, followed by the columns to concatenate.


  df.withColumn("FullName",concat_ws(",",col("fname"),col("mname"),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
      .show(false)

This yields the below output.


+--------+------+------+------------------+
|dob_year|gender|salary|FullName          |
+--------+------+------+------------------+
|2018    |M     |3000  |James,A,Smith     |
|2010    |M     |4000  |Michael,Rose,Jones|
|2010    |M     |4000  |Robert,K,Williams |
|2005    |F     |4000  |Maria,Anne,Jones  |
|2010    |      |-1    |Jen,Mary,Brown    |
+--------+------+------+------------------+
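One practical difference worth knowing: in Spark, concat() returns null if any input column is null, while concat_ws() simply skips null values. The contrast can be sketched in plain Scala (no Spark required), modeling a nullable column value as Option[String]:

```scala
// Sketch of Spark's null handling: concat() propagates nulls,
// concat_ws() skips them. A null column value is modeled as None.
def concatLike(parts: Option[String]*): Option[String] =
  if (parts.contains(None)) None            // any null makes the whole result null
  else Some(parts.flatten.mkString)

def concatWsLike(sep: String, parts: Option[String]*): Option[String] =
  Some(parts.flatten.mkString(sep))         // nulls are dropped before joining

println(concatLike(Some("Jen"), None, Some("Brown")))         // None
println(concatWsLike(",", Some("Jen"), None, Some("Brown")))  // Some(Jen,Brown)
```

This is why concat_ws() is usually the safer choice when some of the columns being joined may contain nulls.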

5. Using Raw SQL

If you are coming from a SQL background, don’t be disappointed: Spark SQL also provides a way to concatenate using raw SQL syntax. To use it, you first need to create a temporary view with df.createOrReplaceTempView("EMP"). This creates a temporary view named "EMP".


  df.createOrReplaceTempView("EMP")

  spark.sql("select CONCAT(fname,' ',lname,' ',mname) as FullName from EMP")
    .show(false)

We can also use the concat_ws() function in raw SQL, for example: spark.sql("select concat_ws(' ', fname, lname, mname) as FullName from EMP").


+------------------+
|FullName          |
+------------------+
|James Smith A     |
|Michael Jones Rose|
|Robert Williams K |
|Maria Jones Anne  |
|Jen Brown Mary    |
+------------------+

So far we have been using the concat() function; now let’s see another way to concatenate strings, using the “||” operator.


  spark.sql("select fname ||' '|| lname ||' '|| mname as FullName from EMP")
    .show(false)

This yields the same output as the above statement.

6. Complete Example


package com.sparkbyexamples.spark.dataframe.functions.string

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object ConcatExample extends App {

  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()

  val data = Seq(("James","A","Smith","2018","M",3000),
    ("Michael","Rose","Jones","2010","M",4000),
    ("Robert","K","Williams","2010","M",4000),
    ("Maria","Anne","Jones","2005","F",4000),
    ("Jen","Mary","Brown","2010","",-1)
  )

  val columns = Seq("fname","mname","lname","dob_year","gender","salary")
  import spark.implicits._
  val df = data.toDF(columns:_*)
  df.printSchema()
  df.show(false)

  df.select(concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")).as("FullName"))
      .show(false)

  df.withColumn("FullName",concat(col("fname"),lit(','),
    col("mname"),lit(','),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
    .show(false)

  df.withColumn("FullName",concat_ws(",",col("fname"),col("mname"),col("lname")))
    .drop("fname")
    .drop("mname")
    .drop("lname")
      .show(false)

  df.createOrReplaceTempView("EMP")

  spark.sql("select CONCAT(fname,' ',lname,' ',mname) as FullName from EMP")
    .show(false)

  spark.sql("select fname ||' '|| lname ||' '|| mname as FullName from EMP")
    .show(false)
}

This example is also available at GitHub for reference.

Conclusion

In this article, you have learned different ways to concatenate two or more string DataFrame columns into a single column using the Spark SQL concat() and concat_ws() functions, and finally how to concatenate using raw SQL syntax, along with several Scala examples. Hope you like it.

For more Spark SQL functions, please refer to SQL Functions.

Happy Learning !!
