Some functionality in Spark SQL needs to be customized, which is where UDFs (user-defined functions) come in.
In my SQL, I need to derive an array from three fields and use that array in later computations. Below is the Scala implementation:
sparkSession.udf.register("fvptopic", (op: String, why: String, ext4: String) => op match {
  case "1" | "2" | "3" | "4" => Array(deliveredTopic, "3")
  case "5" | "6"             => Array(onlineTopic, "6")
  case "7" | "8" | "9" | "0" => Array(deliveringTopic, "4")
  case "33" | "70" | "77"    => Array(exceptionTopic, "5")
  case "11" =>
    if ("2".equals(ext4)) Array(backTopic, "1")
    else Array("other", "7")
  case "12" | "13"        => Array(backTopic, "1")
  case "14" | "15" | "16" => Array(switchTopic, "2")
  case "17" =>
    if ("4".equals(why)) Array(backTopic, "1")
    else if ("1".equals(why) || "7".equals(why)) Array(deliveredTopic, "3")
    else Array("other", "7")
  case _ => Array("other", "7")
})
The code above registers a UDF named fvptopic. It takes three parameters and returns a string array (all elements of the returned array must be of the same type); you can also return a single value if that is all you need. Next, call the UDF from Spark SQL:
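To make the branch logic easier to read and to unit-test without a Spark session, the same mapping can be pulled out into a named function and registered afterwards. This is only a sketch: the *Topic values here are placeholders, since the originals are defined elsewhere in the real code.

```scala
// Placeholder topic names -- in the real code these come from configuration.
val deliveredTopic = "delivered"; val onlineTopic = "online"
val deliveringTopic = "delivering"; val exceptionTopic = "exception"
val backTopic = "back"; val switchTopic = "switch"

// Same mapping as the inline lambda, but as a plain function that can be
// called and tested directly, e.g. fvpTopic("11", "", "2") => Array(backTopic, "1").
def fvpTopic(op: String, why: String, ext4: String): Array[String] = op match {
  case "1" | "2" | "3" | "4" => Array(deliveredTopic, "3")
  case "5" | "6"             => Array(onlineTopic, "6")
  case "7" | "8" | "9" | "0" => Array(deliveringTopic, "4")
  case "33" | "70" | "77"    => Array(exceptionTopic, "5")
  case "11" =>
    if ("2".equals(ext4)) Array(backTopic, "1") else Array("other", "7")
  case "12" | "13"        => Array(backTopic, "1")
  case "14" | "15" | "16" => Array(switchTopic, "2")
  case "17" =>
    if ("4".equals(why)) Array(backTopic, "1")
    else if ("1".equals(why) || "7".equals(why)) Array(deliveredTopic, "3")
    else Array("other", "7")
  case _ => Array("other", "7")
}

// Registration then becomes a one-liner:
// sparkSession.udf.register("fvptopic", fvpTopic _)
```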
val fvptopic = sparkSession.sql(
  "select *, fvptopic(opcode, staywhycode, ext4)[0] as topic, " +
  "fvptopic(opcode, staywhycode, ext4)[1] as topicnum from fvpbase")
  .where("topic <> 'other'")
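One caveat with the query above: because fvptopic appears twice in the select list, Spark may evaluate the UDF twice per row. A sketch of an alternative (assuming, as above, that fvpbase is registered as a temp view) evaluates the UDF once into an array column, indexes it, then drops the intermediate column:

```scala
// Evaluate the UDF once per row, then extract both elements from the array.
val withTopicArr = sparkSession.sql(
  "select *, fvptopic(opcode, staywhycode, ext4) as topicarr from fvpbase")

val fvptopic = withTopicArr
  .selectExpr("*", "topicarr[0] as topic", "topicarr[1] as topicnum")
  .drop("topicarr")          // intermediate array column no longer needed
  .where("topic <> 'other'")
```

The result is the same set of columns as before; whether the single evaluation actually matters depends on how expensive the UDF is and on what the optimizer does with your query, so it is worth checking the physical plan with explain().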