1. Data Preparation
This example uses the well-known movie rating dataset. It contains three files — users.dat, movies.dat, and ratings.dat — corresponding to user data, movie data, and rating data respectively.
Bean
The Bean examples below omit getters, setters, toString, and constructors. Missing these methods can break the program at runtime (for example, a missing getter leaves the Dataset's columns empty, so select fails); use your IDE to generate them.
- User
/**
 * User
 * userId::gender::age::occupation::zipCode
 */
public class User implements Serializable {
private String userId;
private String gender;
private String age;
private String occupation;
private String zipCode;
}
- Rate
/**
 * Rating
 * userId::movieId::rate::timestamp
 */
public class Rate implements Serializable {
private String userId;
private String movieId;
private String rate;
private String timestamp;
}
- Movie
/**
 * Movie
 * movieId::title::genres
 */
public class Movie implements Serializable {
private String movieId;
private String title;
private String genres;
}
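For reference, here is what one of the beans looks like with the omitted boilerplate filled in — a sketch of the Movie class above. Note that Encoders.bean also needs a public no-arg constructor in addition to the getters and setters:

```java
import java.io.Serializable;

/** Movie bean with the boilerplate the examples above omit. */
public class Movie implements Serializable {
    private String movieId;
    private String title;
    private String genres;

    // no-arg constructor, required by Encoders.bean
    public Movie() {}

    public Movie(String movieId, String title, String genres) {
        this.movieId = movieId;
        this.title = title;
        this.genres = genres;
    }

    public String getMovieId() { return movieId; }
    public void setMovieId(String movieId) { this.movieId = movieId; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getGenres() { return genres; }
    public void setGenres(String genres) { this.genres = genres; }

    @Override
    public String toString() {
        return "Movie{movieId=" + movieId + ", title=" + title
                + ", genres=" + genres + "}";
    }
}
```

The User and Rate beans follow the same pattern with their own fields.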
2. Dataset Operation Example
public static void main(String[] args) {
String dataPath ="E:\\code\\rex\\simple-spark-example\\src\\main\\resources\\";
SparkSession spark = SparkSession
.builder()
.appName("JavaWordCount")
.master("local")
.getOrCreate();
SparkContext sc = spark.sparkContext();
sc.setLogLevel("warn");
Dataset<String> userDataset = spark.read().textFile(dataPath + "users.dat");
Dataset<String> movieDataset = spark.read().textFile(dataPath + "movies.dat");
Dataset<String> ratingDataset = spark.read().textFile(dataPath + "ratings.dat");
System.out.println("Movies with the highest average rating (best word of mouth):");
Encoder<Movie> movieEncoder = Encoders.bean(Movie.class);
MapFunction<String, Movie> stringMovieMapFunction = line -> {
String[] arr = line.split("::", -1);
Movie movie = new Movie(arr[0], arr[1], arr[2]);
return movie;
};
Dataset<Movie> movieBeanDataset = movieDataset.map(stringMovieMapFunction, movieEncoder).cache();
Encoder<Rate> rateEncoder = Encoders.bean(Rate.class);
MapFunction<String, Rate> stringRateMapFunction = line -> {
String[] arr = line.split("::", -1);
Rate rate = new Rate(arr[0], arr[1], arr[2], arr[3]);
return rate;
};
Dataset<Rate> rateBeanDataset = ratingDataset.map(stringRateMapFunction, rateEncoder).cache();
/* Encoder<User> userEncoder = Encoders.bean(User.class);
MapFunction<String, User> stringUserMapFunction = line -> {
String[] arr = line.split("::", -1);
User user = new User(arr[0], arr[1], arr[2], arr[3], arr[4]);
return user;
};
Dataset<User> userBeanDataset = userDataset.map(stringUserMapFunction, userEncoder).cache();*/
// Top 10 movies by average rating
// first compute, per movie, the total rating and the number of raters
Dataset<Row> rateSumDataset= rateBeanDataset.select("movieId","userId","rate")
.groupBy("movieId")
.agg(functions.count("userId").as("users")
,functions.sum(functions.col("rate").cast("double")).as("rateSum")
)/*.filter(functions.col("users").$greater(0))*/.cache();
rateSumDataset.show();
// compute the average rating and sort in descending order
Dataset<Row> rateAvgDataset = rateSumDataset.select(functions.col("movieId")
, functions.col("rateSum")
, functions.col("users")
, functions.col("rateSum").divide(functions.col("users")).as("rateAvg")
).orderBy(functions.col("rateAvg").desc());
rateAvgDataset.show();
Dataset<Row> rateAvgJoinDataset = rateAvgDataset.join(movieBeanDataset
,"movieId"
).select(functions.col("movieId")
, functions.col("rateSum")
, functions.col("users")
, functions.col("rateAvg")
, functions.col("title")
, functions.col("genres")
).orderBy(functions.col("rateAvg").desc()).cache(); // the cache() call discussed in the "food for thought" note below
rateAvgJoinDataset.show(100);
// print the top 10 rows
rateAvgJoinDataset.takeAsList(10).forEach(System.out::println);
spark.stop();
}
To make the intermediate steps easy to show, the code is somewhat redundant and not fully polished; please bear with it.
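What the groupBy/agg step plus the division computes is simply a per-movie average rating. The same arithmetic can be sketched in plain Java with toy data — this is an illustration only (the class name RateAvgSketch and the sample ratings are made up, and no Spark is involved):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class RateAvgSketch {
    /** Group (movieId, rate) pairs by movieId and average the rates,
     *  mirroring groupBy("movieId") + sum("rate") / count("userId"). */
    static Map<String, Double> computeAvg(List<String[]> rates) {
        return rates.stream()
            .collect(Collectors.groupingBy(r -> r[0], TreeMap::new,
                Collectors.averagingDouble(r -> Double.parseDouble(r[1]))));
    }

    public static void main(String[] args) {
        // toy stand-in for rateBeanDataset: {movieId, rate}
        List<String[]> rates = List.of(
            new String[]{"1", "5"},
            new String[]{"1", "3"},
            new String[]{"2", "4"});
        System.out.println(computeAvg(rates)); // {1=4.0, 2=4.0}
    }
}
```

Spark performs the same grouping distributed across partitions, which is why the rate column must be cast to double before summing, just as parseDouble is needed here.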
Food for thought:
rateAvgJoinDataset.show(100);
If you comment out the cache() call on the line before this one, do you notice the difference between the println output and the show output? Why does that happen?
Because without cache(), the operators that define rateAvgJoinDataset are all lazy Transformations, so each Action (show and takeAsList) re-runs the whole pipeline from scratch — and rows that tie on rateAvg may come back in a different order between the two runs.
PS
Spark programs have two kinds of operators: Transformations and Actions.
● Transformation operators are assembled into pipelines by the DAGScheduler; they are lazy and do not trigger execution.
● Action operators trigger a Job that runs the computation in the pipeline.
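Spark itself is needed to see this end to end, but java.util.stream exhibits the same lazy/eager split in the standard library, which makes a convenient analogy (an analogy only, not Spark code): intermediate ops like map play the role of Transformations, a terminal op plays the role of an Action, and re-running the "action" without any caching re-executes the whole chain — which is exactly why dropping cache() above makes show and takeAsList each recompute the join.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    // counts how many times the "transformation" lambda actually runs
    static final AtomicInteger evaluations = new AtomicInteger();

    // map() is a lazy intermediate op: defining it runs nothing
    static Stream<Integer> pipeline() {
        return List.of(1, 2, 3).stream().map(x -> {
            evaluations.incrementAndGet();
            return x * 2;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> s = pipeline();
        System.out.println(evaluations.get()); // 0: no element touched yet

        s.collect(Collectors.toList()); // the terminal op is the "Action"
        System.out.println(evaluations.get()); // 3: every element processed

        // a second "action" needs a fresh stream and redoes all the work,
        // just as show() and takeAsList() each recompute when cache() is removed
        pipeline().collect(Collectors.toList());
        System.out.println(evaluations.get()); // 6
    }
}
```

Spark's cache() breaks this re-execution by materializing the Dataset after the first Action, so later Actions read the cached result instead of re-running the pipeline.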