Using the concat() or concat_ws() Spark SQL functions, we can concatenate one or more DataFrame columns into a single column. In this article, you will learn how to use these functions, and also how to concatenate columns with raw SQL, with Scala examples.
Related: Concatenate PySpark (Python) DataFrame column
1. Preparing Data & DataFrame
val data = Seq(("James","A","Smith","2018","M",3000),
("Michael","Rose","Jones","2010","M",4000),
("Robert","K","Williams","2010","M",4000),
("Maria","Anne","Jones","2005","F",4000),
("Jen","Mary","Brown","2010","",-1)
)
val columns = Seq("fname","mname","lname","dob_year","gender","salary")
import spark.implicits._
val df = data.toDF(columns:_*)
df.show(false)
Note that we need to import implicits on the “spark” object, which is an instance of SparkSession, in order to use toDF() on a Seq collection. df.show() yields the below output.
+-------+-----+--------+--------+------+------+
|fname |mname|lname |dob_year|gender|salary|
+-------+-----+--------+--------+------+------+
|James |A |Smith |2018 |M |3000 |
|Michael|Rose |Jones |2010 |M |4000 |
|Robert |K |Williams|2010 |M |4000 |
|Maria |Anne |Jones |2005 |F |4000 |
|Jen |Mary |Brown |2010 | |-1 |
+-------+-----+--------+--------+------+------+
2. Using concat() Function to Concatenate DataFrame Columns
Spark SQL functions provide concat() to concatenate two or more DataFrame columns into a single Column.
Syntax
concat(exprs: Column*): Column
It can also take columns of different data types and concatenate them into a single column; for example, it supports String, Int, Boolean, and also array columns.
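As a quick sketch of the mixed-type behavior (reusing the df defined above, and the hypothetical column alias "info"), concat() can combine a string column with the Int salary column; Spark coerces the non-string input to a string:

```scala
// Sketch: concat() coerces non-string inputs (here the Int column salary) to strings.
// Assumes the `df` DataFrame defined above is in scope.
df.select(concat(col("fname"), lit(" earns "), col("salary")).as("info"))
  .show(false)
// First row would read: James earns 3000
```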
df.select(concat(col("fname"),lit(','),
col("mname"),lit(','),col("lname")).as("FullName"))
.show(false)
This statement creates the “FullName” column by concatenating the fname, mname, and lname columns, separated by a comma delimiter. To add the delimiter, we used the lit() function. This yields output with just the concatenated column.
+------------------+
|FullName |
+------------------+
|James,A,Smith |
|Michael,Rose,Jones|
|Robert,K,Williams |
|Maria,Anne,Jones |
|Jen,Mary,Brown |
+------------------+
3. Using concat() Function on withColumn()
Let’s see another example using the concat() function with withColumn(); here we will add a new column, FullName, by concatenating the name columns.
df.withColumn("FullName",concat(col("fname"),lit(','),
col("mname"),lit(','),col("lname")))
.show(false)
The above snippet also keeps the individual name columns; if you do not need them, you can drop them using the below statement.
df.withColumn("FullName",concat(col("fname"),lit(','),
col("mname"),lit(','),col("lname")))
.drop("fname")
.drop("mname")
.drop("lname")
.show(false)
This yields the below output.
+--------+------+------+------------------+
|dob_year|gender|salary|FullName |
+--------+------+------+------------------+
|2018 |M |3000 |James,A,Smith |
|2010 |M |4000 |Michael,Rose,Jones|
|2010 |M |4000 |Robert,K,Williams |
|2005 |F |4000 |Maria,Anne,Jones |
|2010 | |-1 |Jen,Mary,Brown |
+--------+------+------+------------------+
4. Using concat_ws() Function to Concatenate with Delimiter
Adding a delimiter while concatenating DataFrame columns can be easily done using another function, concat_ws().
Syntax
concat_ws(sep: String, exprs: Column*): Column
The concat_ws() function takes the delimiter as its first argument, followed by the columns to be concatenated.
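One difference worth noting, shown here as a minimal sketch against the df defined above: concat() returns null when any of its inputs is null, while concat_ws() simply skips null inputs.

```scala
// Sketch: concat() yields null when any input is null; concat_ws() skips nulls.
// lit(null).cast("string") is used here only to inject a null column for illustration.
df.select(
  concat(col("fname"), lit(null).cast("string"), col("lname")).as("concat_result"),    // null
  concat_ws(",", col("fname"), lit(null).cast("string"), col("lname")).as("ws_result") // fname,lname
).show(false)
```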
df.withColumn("FullName",concat_ws(",",col("fname"),col("mname"),col("lname")))
.drop("fname")
.drop("mname")
.drop("lname")
.show(false)
This yields the below output.
+--------+------+------+------------------+
|dob_year|gender|salary|FullName |
+--------+------+------+------------------+
|2018 |M |3000 |James,A,Smith |
|2010 |M |4000 |Michael,Rose,Jones|
|2010 |M |4000 |Robert,K,Williams |
|2005 |F |4000 |Maria,Anne,Jones |
|2010 | |-1 |Jen,Mary,Brown |
+--------+------+------+------------------+
5. Using Raw SQL
If you are coming from a SQL background, don’t get disappointed; Spark SQL also provides a way to concatenate using raw SQL syntax. But in order to use it, you first need to create a temporary view using df.createOrReplaceTempView("EMP"). This creates a temporary view "EMP".
df.createOrReplaceTempView("EMP")
spark.sql("select CONCAT(fname,' ',lname,' ',mname) as FullName from EMP")
.show(false)
As shown above, we can use the CONCAT() function directly in raw SQL statements. This yields the below output.
+------------------+
|FullName |
+------------------+
|James Smith A |
|Michael Jones Rose|
|Robert Williams K |
|Maria Jones Anne |
|Jen Brown Mary |
+------------------+
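concat_ws() is also available in raw SQL; a sketch using the "EMP" temporary view created above:

```scala
// Sketch: concat_ws() in raw SQL, querying the temporary view EMP created earlier.
// Assumes the `spark` SparkSession from the example above is in scope.
spark.sql("select concat_ws(' ', fname, mname, lname) as FullName from EMP")
  .show(false)
```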
So far we have been using the concat() function; now let’s see another way to concatenate strings, using the “||” operator.
spark.sql("select fname ||' '|| lname ||' '|| mname as FullName from EMP")
.show(false)
This yields the same output as the above statement.
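If you prefer to stay in the DataFrame API, the same “||” expression can be embedded with the expr() function. This is a sketch assuming the df defined above is in scope:

```scala
import org.apache.spark.sql.functions.expr

// Sketch: the SQL || operator embedded in the DataFrame API via expr().
df.select(expr("fname || ' ' || lname || ' ' || mname").as("FullName"))
  .show(false)
```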
6. Complete Example
package com.sparkbyexamples.spark.dataframe.functions.string
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object ConcatExample extends App {
val spark = SparkSession.builder()
.appName("SparkByExamples.com")
.master("local")
.getOrCreate()
val data = Seq(("James","A","Smith","2018","M",3000),
("Michael","Rose","Jones","2010","M",4000),
("Robert","K","Williams","2010","M",4000),
("Maria","Anne","Jones","2005","F",4000),
("Jen","Mary","Brown","2010","",-1)
)
val columns = Seq("fname","mname","lname","dob_year","gender","salary")
import spark.implicits._
val df = data.toDF(columns:_*)
df.printSchema()
df.show(false)
df.select(concat(col("fname"),lit(','),
col("mname"),lit(','),col("lname")).as("FullName"))
.show(false)
df.withColumn("FullName",concat(col("fname"),lit(','),
col("mname"),lit(','),col("lname")))
.drop("fname")
.drop("mname")
.drop("lname")
.show(false)
df.withColumn("FullName",concat_ws(",",col("fname"),col("mname"),col("lname")))
.drop("fname")
.drop("mname")
.drop("lname")
.show(false)
df.createOrReplaceTempView("EMP")
spark.sql("select CONCAT(fname,' ',lname,' ',mname) as FullName from EMP")
.show(false)
spark.sql("select fname ||' '|| lname ||' '|| mname as FullName from EMP")
.show(false)
}
This example is also available at GitHub for reference.
Conclusion
In this article, you have learned different ways to concatenate two or more string DataFrame columns into a single column using the Spark SQL concat() and concat_ws() functions, and finally how to concatenate by leveraging raw SQL syntax, along with several Scala examples. Hope you like it.
For more Spark SQL functions, please refer to SQL Functions.
Happy Learning !!