Dataset groupBy/agg example:
Dataset<Row> resultDs = dsParsed
    .groupBy("enodeb_id", "ecell_id")
    .agg(
        functions.first("scan_start_time").alias("scan_start_time1"),
        functions.first("insert_time").alias("insert_time1"),
        functions.first("mr_type").alias("mr_type1"),
        functions.first("mr_ltescphr").alias("mr_ltescphr1"),
        functions.first("mr_ltescpuschprbnum").alias("mr_ltescpuschprbnum1"),
        functions.count("enodeb_id").alias("rows1"))
    .selectExpr(
        "ecell_id",
        "enodeb_id",
        "scan_start_time1 as scan_start_time",
        "insert_time1 as insert_time",
        "mr_type1 as mr_type",
        "mr_ltescphr1 as mr_ltescphr",
        "mr_ltescpuschprbnum1 as mr_ltescpuschprbnum",
        "rows1 as rows");
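The alias-then-rename round trip above can be avoided by aliasing each aggregate to its final name directly in agg(). A minimal sketch under the same assumed input columns (note that first() picks an arbitrary row per group unless the data is ordered, the same caveat as the original code):

```java
import java.util.*;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;
import static org.apache.spark.sql.functions.*;

public class GroupByAggExample {
    // Equivalent aggregation: alias each aggregate to its final name
    // inside agg(), so no selectExpr rename pass is needed afterwards.
    public static Dataset<Row> aggregate(Dataset<Row> dsParsed) {
        return dsParsed
            .groupBy("enodeb_id", "ecell_id")
            .agg(
                // first() is non-deterministic without an explicit sort
                // within each group -- same behavior as the original.
                first("scan_start_time").alias("scan_start_time"),
                first("insert_time").alias("insert_time"),
                first("mr_type").alias("mr_type"),
                first("mr_ltescphr").alias("mr_ltescphr"),
                first("mr_ltescpuschprbnum").alias("mr_ltescpuschprbnum"),
                count("enodeb_id").alias("rows"));
    }
}
```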
Dataset join example:
Dataset<Row> ncRes = sparkSession.read()
    .option("delimiter", "|")
    .option("header", true)
    .csv("/user/csv");
Dataset<Row> mro = sparkSession.sql("...");
Dataset<Row> ncJoinMro = ncRes
    .join(mro,
          mro.col("id").equalTo(ncRes.col("id"))
             .and(mro.col("calid").equalTo(ncRes.col("calid"))),
          "left_outer")
    .select(ncRes.col("id").as("int_id"),
            mro.col("vendor_id"),
            ...);
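When both sides share column names (id and calid here), aliasing the Datasets lets the join condition be written as a single SQL string and keeps the duplicated columns addressable in select(). A sketch under the same assumed schema (the nc/m aliases are illustrative, not from the original):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.expr;

public class AliasJoinExample {
    // Alias both sides so the join condition reads as plain SQL and
    // the same-named join columns stay distinguishable afterwards.
    public static Dataset<Row> joinOnIdAndCalid(Dataset<Row> ncRes, Dataset<Row> mro) {
        return ncRes.alias("nc")
            .join(mro.alias("m"),
                  expr("nc.id = m.id AND nc.calid = m.calid"),
                  "left_outer")
            .select(col("nc.id").as("int_id"),
                    col("m.vendor_id"));
    }
}
```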
Another way to write the join condition (Scala):
leftDfWithWatermark.join(
  rightDfWithWatermark,
  expr("""
    leftDfId = rightDfId AND
    leftDfTime >= rightDfTime AND
    leftDfTime <= rightDfTime + interval 1 hour"""),
  joinType = "leftOuter")
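The same interval-join condition can be expressed in Java, matching the rest of these notes. The column names (leftDfId, leftDfTime, and so on) follow the snippet above and are assumed to exist on the inputs; for stream-stream outer joins, Spark additionally requires watermarks on both sides:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.expr;

public class IntervalJoinExample {
    // Left outer join keeping only right rows whose event time falls
    // within one hour before the left row's event time.
    public static Dataset<Row> joinWithinOneHour(Dataset<Row> left, Dataset<Row> right) {
        return left.join(right,
            expr("leftDfId = rightDfId AND " +
                 "leftDfTime >= rightDfTime AND " +
                 "leftDfTime <= rightDfTime + interval 1 hour"),
            "leftOuter");
    }
}
```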