How can I select multiple columns of dataset ds in Spark 2.3 Java by passing a list argument?
For example, this works fine:
ds.select("col1","col2","col3").show();
However, this fails:
List<String> columns = Arrays.asList("col1","col2","col3");
ds.select(columns.toString()).show();
Solution
With Spark 2.4.0, convert the List to a Scala Seq and pass it to selectExpr, as described in the Spark documentation.
If you want to use select instead, remove the first column from your list and pass it as the first argument to select, followed by the Seq of the remaining columns.
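To make the point about select concrete, here is a minimal plain-Java sketch (no Spark needed) of why columns.toString() fails and how the list can be split into a first column plus the rest. The class ColumnListSketch and its head and tail methods are hypothetical names used only for illustration, and the sketch assumes the list is non-empty.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; not part of Spark.
final class ColumnListSketch {

    // First column name, to be passed as the leading argument of select.
    // Assumes the list is non-empty.
    static String head(List<String> columns) {
        return columns.get(0);
    }

    // Remaining column names as an array, usable where Java varargs (String...) are accepted.
    static String[] tail(List<String> columns) {
        return columns.subList(1, columns.size()).toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("col1", "col2", "col3");

        // toString() yields one bracketed string, so select would look for
        // a single column with that literal name.
        System.out.println(columns.toString()); // prints [col1, col2, col3]

        System.out.println(head(columns));                  // prints col1
        System.out.println(Arrays.toString(tail(columns))); // prints [col2, col3]

        // With a Dataset<Row> ds in scope, the pieces would be used roughly as:
        //   ds.select(head(columns), tail(columns)).show();
    }
}
```

Since select(String, String...) is exposed to Java as varargs, passing the tail as a String[] should work as an alternative to building a Seq, though I have only checked the list handling here and not run it against a Spark session.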
Both versions are shown below. Suppose you have the following .csv file:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
You can use this code to solve your issue:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

import scala.collection.JavaConverters;
import scala.collection.Seq;

public class SparkJavaTest {

    public static SparkSession spark = SparkSession
            .builder()
            .appName("JavaSparkTest")
            .master("local")
            .getOrCreate();

    // Converts a Java List into a Scala Seq, which select and selectExpr accept.
    public static Seq<String> convertListToSeq(List<String> inputList) {
        return JavaConverters.asScalaIteratorConverter(inputList.iterator()).asScala().toSeq();
    }

    public static void main(String[] args) {
        Dataset<Row> ds = spark.read().option("header", true).csv("spark-file.csv");

        // Using selectExpr: pass all column names as a single Seq.
        List<String> columns = Arrays.asList("InvoiceNo", "StockCode", "Description");
        ds.selectExpr(convertListToSeq(columns)).show(false);

        // Using select: the first column is passed separately, the rest as a Seq.
        List<String> columns2 = Arrays.asList("StockCode", "Description");
        ds.select("InvoiceNo", convertListToSeq(columns2)).show(false);
    }
}
Hope it helps :)