- Table schema
SparkSession spark = SparkSession.builder().appName("app-train").master("local[*]").getOrCreate();
Dataset<Row> trainData = spark.read().json("src/main/resource/train_stopover.json").orderBy("duration_date", "station_sequence");
trainData.printSchema();
|-- arrive_time: string (nullable = true)
|-- duration_date: string (nullable = true)
|-- leave_time: string (nullable = true)
|-- station_code: string (nullable = true)
|-- station_name: string (nullable = true)
|-- station_sequence: long (nullable = true)
|-- stopover_time: string (nullable = true)
|-- train_no: string (nullable = true)
|-- transport_code: string (nullable = true)
- Sample data
+-----------+-------------+----------+------------+------------+----------------+-------------+------------+--------------+
|arrive_time|duration_date|leave_time|station_code|station_name|station_sequence|stopover_time| train_no|transport_code|
+-----------+-------------+----------+------------+------------+----------------+-------------+------------+--------------+
| | 20180515| 20:00| BXP| 北京西| 1| |2400000Z210A| Z21|
| 22:33| 20180515| 22:37| VVP| 石家庄北| 2| 4|2400000Z210A| Z21|
| 00:19| 20180515| 00:25| TYV| 太原| 3| 6|2400000Z210A| Z21|
| 07:05| 20180515| 07:16| ZWJ| 中卫| 4| 11|2400000Z210A| Z21|
| 12:17| 20180515| 12:33| LZJ| 兰州| 5| 16|2400000Z210A| Z21|
| 15:01| 20180515| 15:21| XNO| 西宁| 6| 20|2400000Z210A| Z21|
| 19:23| 20180515| 19:25| DHO| 德令哈| 7| 2|2400000Z210A| Z21|
| 22:10| 20180515| 22:35| GRO| 格尔木| 8| 25|2400000Z210A| Z21|
| 08:18| 20180515| 08:24| NQO| 那曲| 9| 6|2400000Z210A| Z21|
| 12:20| 20180515| 12:20| LSO| 拉萨| 10| |2400000Z210A| Z21|
| | 20180516| 20:00| BXP| 北京西| 1| |2400000Z210A| Z21|
| 22:33| 20180516| 22:37| VVP| 石家庄北| 2| 4|2400000Z210A| Z21|
| 00:19| 20180516| 00:25| TYV| 太原| 3| 6|2400000Z210A| Z21|
| 07:05| 20180516| 07:16| ZWJ| 中卫| 4| 11|2400000Z210A| Z21|
| 12:17| 20180516| 12:33| LZJ| 兰州| 5| 16|2400000Z210A| Z21|
| 15:01| 20180516| 15:21| XNO| 西宁| 6| 20|2400000Z210A| Z21|
| 19:23| 20180516| 19:25| DHO| 德令哈| 7| 2|2400000Z210A| Z21|
| 22:10| 20180516| 22:35| GRO| 格尔木| 8| 25|2400000Z210A| Z21|
| 08:18| 20180516| 08:24| NQO| 那曲| 9| 6|2400000Z210A| Z21|
| 12:20| 20180516| 12:20| LSO| 拉萨| 10| |2400000Z210A| Z21|
+-----------+-------------+----------+------------+------------+----------------+-------------+------------+--------------+
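Data notes: this is the train stopover table. transport_code is the train number,
duration_date is the ticket sale date,
station_sequence (1, 2, 3, …, 10) is the order in which the train passes each station,
station_name is the station name,
train_no is the train's internal code.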
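The goal is to split each train's stops into segment records covering every forward pair of stations: 1->2, 1->3, …, 1->10, 2->3, 2->4, and so on.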
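For a single train on a single day, the expansion can be sketched in plain Java, without Spark. This is only an illustration of the pairing logic: the class name SegmentExpansion and the method expand are hypothetical, and the "from->to" string output is an assumed format, not the one used in the original job. It assumes the stops are already sorted by station_sequence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original job.
public class SegmentExpansion {

    // Expands stops (already ordered by station_sequence) into every
    // forward pair "from->to": 1->2, 1->3, ..., 2->3, 2->4, ...
    static List<String> expand(List<String> orderedStationCodes) {
        List<String> segments = new ArrayList<>();
        for (int i = 0; i < orderedStationCodes.size(); i++) {
            // Only pair a stop with the stops that come after it.
            for (int j = i + 1; j < orderedStationCodes.size(); j++) {
                segments.add(orderedStationCodes.get(i) + "->" + orderedStationCodes.get(j));
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // First four station codes of Z21 from the sample data.
        List<String> stops = Arrays.asList("BXP", "VVP", "TYV", "ZWJ");
        for (String segment : expand(stops)) {
            System.out.println(segment);
        }
    }
}
```

A train with n stops yields n(n-1)/2 segments, so the 10 stops of Z21 produce 45 segments per day.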
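flatMap comes to mind here. However, the rows first have to be grouped by day, train number, and internal train code, and only then split with flatMap. On a Dataset, this is done with groupByKey followed by flatMapGroups.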
Encoder<Tuple3<String, String, String>> tuple3Encoder = Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING());
Dataset<Row> result = trainData.groupByKey((MapFuncti