Method 1: use createDataFrame; the new column is added while building the RDD and the schema.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StringType

// Build an RDD[Row] carrying the original value plus the new flag,
// then attach a schema extended with the "flag" field.
val trdd = input.select(targetColumns).rdd.map { x =>
  val v = x.get(0).toString.toDouble
  if (v > critValueR || v < critValueL) Row(v, "F") else Row(v, "T")
}
val schema = input.select(targetColumns).schema.add("flag", StringType, nullable = true)
val sample3 = ss.createDataFrame(trdd, schema).distinct().withColumnRenamed(targetColumns, "idx")
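For reference, here is a minimal, self-contained context in which this snippet (and the ones below) can run. The SparkSession setup, column name, thresholds, and sample values are assumptions made for illustration, not part of the original code:

import org.apache.spark.sql.SparkSession

val ss = SparkSession.builder().master("local[*]").appName("add-column-demo").getOrCreate()
import ss.implicits._

// Hypothetical inputs assumed by the snippets in this section.
val targetColumns = "value"       // name of the numeric column to flag
val targetColname = targetColumns // method 3 refers to the same column
val critValueL = 2.0              // lower critical value
val critValueR = 8.0              // upper critical value
val input = Seq(1.0, 3.0, 5.0, 9.0).toDF(targetColumns)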
Method 2: use withColumn; the new column is computed inside a UDF.
import org.apache.spark.sql.functions.udf
// The column holds doubles (see method 1), so the UDF takes a Double;
// the original declared Int, which would silently truncate the value.
val code: Double => String = arg => if (arg > critValueR || arg < critValueL) "F" else "T"
val addCol = udf(code)
val sample3 = input.select(targetColumns).withColumn("flag", addCol(input(targetColumns)))
  .withColumnRenamed(targetColumns, "idx")
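As a design note, the same flag can also be expressed with Spark's built-in when/otherwise functions, which avoids the UDF and lets Catalyst optimize the expression. This is a sketch under the assumed context above, not part of the original method:

import org.apache.spark.sql.functions.{col, when}

// Built-in conditional expression instead of a UDF (hypothetical variant).
val sample3Expr = input.select(targetColumns)
  .withColumn("flag", when(col(targetColumns) > critValueR || col(targetColumns) < critValueL, "F").otherwise("T"))
  .withColumnRenamed(targetColumns, "idx")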
Method 3: use SQL; the new column is defined directly in the SQL statement.
input.select(targetColumns).createOrReplaceTempView("tmp")
val sample3 = ss.sql(
  s"""select distinct $targetColname as idx,
     |       case when $targetColname > $critValueR then 'F'
     |            when $targetColname < $critValueL then 'F'
     |            else 'T' end as flag
     |from tmp""".stripMargin)
Method 4: the three methods above add a column derived from a condition. To add a column of unique sequence numbers instead, use monotonically_increasing_id.
import org.apache.spark.sql.functions.monotonically_increasing_id
// Assigns each row a unique, monotonically increasing 64-bit id.
val inputnew = input.withColumn("idx", monotonically_increasing_id())
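Note that the generated ids are unique and increasing but not necessarily consecutive, since they encode the partition id. If a gap-free 1, 2, 3, ... numbering is needed, one alternative is row_number over a window; ordering by the target column here is an assumption, and a window without partitionBy moves all rows into a single partition, so it only suits small data:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Consecutive row numbers (hypothetical alternative, see the note above).
val w = Window.orderBy(targetColumns)
val inputSeq = input.withColumn("idx", row_number().over(w))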