Spark SQL Website Search: A Hands-On Case Study

Using JD.com as an example, find the top 5 products searched by users on the platform each day, i.e. the hottest products.

  When users log in to the JD.com website and type into the search bar, we want to list the top 5 products each user searches for per day.

 

 

Part 1: Generate test data simulating JD.com user searches

- The test data file SparkSQLUserlogsHottest.log contains 10,000 records with the fields: date, user ID, item, city, and device.
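
For reference, each record is one tab-separated line. An illustrative example (field values are drawn from the value pools used by the generator in Part 3; the date is arbitrary):

2016-04-18	98415b9c-f3d4-45c3-bc7f-dce3126c6c0b	小米	北京	iphone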

 

Part 2: Using the test data, find the top 5 products searched by users each day

- Create the JavaSparkContext and HiveContext.

- Use sc.textFile to read the test data file into a JavaRDD<String> named lines0.

- Use sc.broadcast to define a broadcast variable used to filter the test data.

- Use lines0.filter with an anonymous Function (its call method) to keep only the records containing the broadcast value, producing a JavaRDD<String> named lines (sketched below).

Verification point:

Did the filter using the broadcast variable succeed? Print the contents of lines to verify.
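
A minimal sketch of these first steps, adapted from the full listing in Part 3 (the local master setting and input path follow that listing):

SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkSQLUserlogsHottest");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new HiveContext(sc);

JavaRDD<String> lines0 = sc.textFile(
        "G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\SparkSQLUserlogsHottest.test.log");

// Broadcast the device keyword used for filtering
final Broadcast<String> broadcastdevice = sc.broadcast("iphone");

// Keep only the records that contain the broadcast value
JavaRDD<String> lines = lines0.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) throws Exception {
        return s.contains(broadcastdevice.value());
    }
});

// Verification: print the filtered records
for (String row : lines.collect()) {
    System.out.println(row);
}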

 

- Use lines.mapToPair with an anonymous PairFunction (its call method) to split each record on "\t", concatenate the three fields (date#Item#userID) into a key, and set the value to 1, forming key/value pairs. The result, pairs, has type JavaPairRDD<String, Integer>. This is analogous to the map phase in Hadoop MapReduce: each click of an item by a user on a given day is counted as 1 (sketched after the verification note below).

Verification point:

Were the three fields (date#Item#userID) correctly combined into the key? Do the key/value pairs look as expected?
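
The corresponding fragment, adapted from the listing in Part 3:

// Build the key (date#Item#userID) and pair it with the value 1
JavaPairRDD<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String line) throws Exception {
        String[] splitedLine = line.split("\t");
        String dateItemUserID = splitedLine[0] + "#" + splitedLine[2] + "#" + splitedLine[1];
        return new Tuple2<String, Integer>(dateItemUserID, 1);
    }
});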

 

- Use pairs.reduceByKey with an anonymous Function2 (its call method) to aggregate the pairs: values sharing the same key are summed. The result, reduceedPairs, has type JavaPairRDD<String, Integer>. This is analogous to the reduce phase in Hadoop MapReduce: all clicks of an item by a user on a given day are summed (sketched after the verification note below).

Verification point:

Did the reduceByKey aggregation over the (date#Item#userID) key/value pairs succeed?
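
The aggregation itself, adapted from the listing in Part 3:

// Sum the counts of identical (date#Item#userID) keys
JavaPairRDD<String, Integer> reduceedPairs = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});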

Split each row of reduceedRow: break the key into its three fields (date, user ID, item), take the accumulated click value as count, then assemble these four fields (date, user ID, item, count) into a JSON string appended to the peopleInformations list, and print it for verification. peopleInformations has type List<String>, where each element is a JSON document (a sketch follows).
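
A slightly simplified sketch of the JSON assembly (the full listing in Part 3 also emits a Username field):

// Split each key back into its fields and build one JSON document per row
List<String> peopleInformations = new ArrayList<String>();
for (Tuple2<String, Integer> row : reduceedPairs.collect()) {
    String[] fields = row._1.split("#");   // [0]=date, [1]=item, [2]=userID
    String json = "{\"Date\":\"" + fields[0]
            + "\", \"UserID\":\"" + fields[2]
            + "\", \"Item\":\"" + fields[1]
            + "\", \"count\":" + row._2 + " }";
    peopleInformations.add(json);
    System.out.println(json);
}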

 

- sc.parallelize(peopleInformations) converts the local (driver-side) collection peopleInformations into a Spark RDD, peopleInformationsRDD, of type JavaRDD<String>.

- sqlContext.read().json(peopleInformationsRDD) builds peopleInformationsDF, a DataFrame, from the RDD of JSON strings.

- Register the DataFrame as a temporary table (these steps are sketched together below):

peopleInformationsDF.registerTempTable("peopleInformations")
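
Taken together, these steps appear in the listing in Part 3 as:

// Local list -> RDD of JSON strings -> DataFrame -> temporary table
JavaRDD<String> peopleInformationsRDD = sc.parallelize(peopleInformations);
DataFrame peopleInformationsDF = sqlContext.read().json(peopleInformationsRDD);
peopleInformationsDF.registerTempTable("peopleInformations");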

Window function: use a subquery to extract the target data, and inside the subquery apply the window function row_number to rank within each group. PARTITION BY specifies the grouping key of the window function; ORDER BY sorts within each group.

        String sqlText = "SELECT UserID, Item, count "
                + "FROM ("
                + "  SELECT UserID, Item, count, "
                + "  row_number() OVER (PARTITION BY UserID ORDER BY count DESC) rank "
                + "  FROM peopleInformations "
                + ") sub_peopleInformations "
                + "WHERE rank <= 3";

The SQL query:

- Using the window function, query the peopleInformations table, group by user, and rank each user's daily item click counts, forming the subquery sub_peopleInformations.

- Then, from the ranked subquery sub_peopleInformations, select the top-three items clicked by each user together with their counts.
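
For readability, the assembled query string is equivalent to the following SQL:

SELECT UserID, Item, count
FROM (
    SELECT UserID, Item, count,
           row_number() OVER (PARTITION BY UserID ORDER BY count DESC) rank
    FROM peopleInformations
) sub_peopleInformations
WHERE rank <= 3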

Verification point:

Did the window-function SQL statement execute successfully?

sqlContext.sql(sqlText) runs the query; execellentNameAgeDF.show() displays the result.

 

Save execellentNameAgeDF as a JSON file:

execellentNameAgeDF.write().format("json").save("G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\Result20140419_2");

Part 3: Source code: SparkSQLUserlogsHottestDataManually.java and SparkSQLUserlogsHottest.java

 

1. SparkSQLUserlogsHottestDataManually.java

 

package com.dt.imf.zuoye001;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Random;
import java.util.UUID;

public class SparkSQLUserlogsHottestDataManually {

    public static void main(String[] args) {
        long numberItems = 10000;
        ganerateUserLogs(numberItems, "G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\");
    }

    /**
     * Generate the user logs.
     * @param numberItems number of records to generate
     * @param path        output directory
     */
    private static void ganerateUserLogs(long numberItems, String path) {
        StringBuffer userLogBuffer = new StringBuffer();
        // String filename = getCountDate(null, "yyyyMMddHHmmss", -1) + ".log";
        String filename = "SparkSQLUserlogsHottest.log";
        // Record fields: Date, UserID, Item, City, Device
        for (int i = 0; i < numberItems; i++) {
            String date = getCountDate(null, "yyyy-MM-dd", -1);
            // String timestamp = getCountDate(null, "yyyy-MM-dd HH:mm:ss", -1);
            String userID = ganerateUserID();
            String ItemID = ganerateItemID();
            String CityID = ganerateCityIDs();
            String Device = ganerateDevice();
            /* userLogBuffer.append(date + "\t" + timestamp + "\t" + userID
                    + "\t" + pageID + "\t" + channelID + "\t" + action + "\n"); */
            userLogBuffer.append(date + "\t" + userID
                    + "\t" + ItemID + "\t" + CityID + "\t" + Device + "\n");
            System.out.println(userLogBuffer);
            WriteLog(path, filename, userLogBuffer + "");
        }
    }

    public static void WriteLog(String path, String filename, String strUserLog) {
        FileWriter fw = null;
        PrintWriter out = null;
        try {
            File writeFile = new File(path + filename);
            if (!writeFile.exists())
                writeFile.createNewFile();
            else {
                writeFile.delete();
            }
            fw = new FileWriter(writeFile, true);
            out = new PrintWriter(fw);
            out.print(strUserLog);
        } catch (Exception e) {
            e.printStackTrace();
            try {
                if (out != null)
                    out.close();
                if (fw != null)
                    fw.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        } finally {
            try {
                if (out != null)
                    out.close();
                if (fw != null)
                    fw.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Get a date offset by step days from the given date (or from today if date is null).
     * @param date   base date string, may be null
     * @param patton date format pattern
     * @param step   offset in days
     * @return formatted date string
     */
    public static String getCountDate(String date, String patton, int step) {
        SimpleDateFormat sdf = new SimpleDateFormat(patton);
        Calendar cal = Calendar.getInstance();
        if (date != null) {
            try {
                cal.setTime(sdf.parse(date));
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
        cal.add(Calendar.DAY_OF_MONTH, step);
        return sdf.format(cal.getTime());
    }

    /**
     * Randomly pick a user ID.
     * @return user ID
     */
    private static String ganerateUserID() {
        Random random = new Random();
        String[] userID = { "98415b9c-f3d4-45c3-bc7f-dce3126c6c0b", "7371b4bd-8535-461f-a5e2-c4814b2151e1",
                "49852bfa-a662-4060-bf68-0dddde5feea1", "8768f089-f736-4346-a83d-e23fe05b0ecd",
                "a76ff021-049c-4a1a-8372-02f9c51261d5", "8d5dc011-cbe2-4332-99cd-a1848ddfd65d",
                "a2bccbdf-f0e9-489c-8513-011644cb5cf7", "89c79413-a7d1-462c-ab07-01f0835696f7",
                "8d525daa-3697-455e-8f02-ab086cda7851", "c6f57c89-9871-4a92-9cbe-a2d76cd79cd0",
                "19951134-97e1-4f62-8d5c-134077d1f955", "3202a063-4ebf-4f3f-a4b7-5e542307d726",
                "40a0d872-45cc-46bc-b257-64ad898df281", "b891a528-4b5e-4ba7-949c-2a32cb5a75ec",
                "0d46d52b-75a2-4df2-b363-43874c9503a2", "c1e4b8cf-0116-46bf-8dc9-55eb074ad315",
                "6fd24ac6-1bb0-4ea6-a084-52cc22e9be42", "5f8780af-93e8-4907-9794-f8c960e87d34",
                "692b1947-8b2e-45e4-8051-0319b7f0e438", "dde46f46-ff48-4763-9c50-377834ce7137" };
        return userID[random.nextInt(20)];
    }

    /**
     * Randomly pick an item.
     * @return item name
     */
    private static String ganerateItemID() {
        Random random = new Random();
        // String[] ItemIDs = { "xiyiji", "binxiang", "kaiguan", "reshuiqi", "ranqizao", "dianshiji", "kongtiao" };
        String[] ItemIDs = { "小米", "休闲鞋", "洗衣机", "显示器", "显卡", "洗衣液", "行车记录仪" };
        return ItemIDs[random.nextInt(7)];
    }

    /**
     * Randomly pick a city.
     * @return city name
     */
    private static String ganerateCityIDs() {
        Random random = new Random();
        /* String[] CityNames = { "shanghai", "beijing", "ShenZhen", "HangZhou", "Tianjin", "Guangzhou",
                "Nanjing", "Changsha", "WuHan", "jinan" }; */
        String[] CityNames = { "上海", "北京", "深圳", "广州", "纽约", "伦敦", "东京", "首尔", "莫斯科", "巴黎" };
        return CityNames[random.nextInt(10)];
    }

    /**
     * Randomly pick a device type.
     * @return device name
     */
    private static String ganerateDevice() {
        Random random = new Random();
        String[] Devices = { "android", "iphone", "ipad" };
        return Devices[random.nextInt(3)];
    }

    /**
     * Generate num user GUIDs (used to produce the userID array above).
     * @param num number of GUIDs to generate
     * @return comma-separated, quoted GUIDs
     */
    private static String ganerateUserID(int num) {
        StringBuffer userid = new StringBuffer();
        for (int i = 0; i < num; i++) {
            UUID uuid = UUID.randomUUID();
            userid.append("\"" + uuid + "\",");
        }
        System.out.println(userid);
        return userid + "";
    }

}

 

 

2. SparkSQLUserlogsHottest.java

 

 

package com.dt.imf.zuoye001;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.hive.HiveContext;

import scala.Tuple2;

public class SparkSQLUserlogsHottest {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkSQLUserlogsHottest");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new HiveContext(sc);

        JavaRDD<String> lines0 = sc.textFile(
                "G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\SparkSQLUserlogsHottest.test.log");

        /* Record fields: Date, UserID, Item, City, Device
           Key format:    (date#Item#userID) */

        // Define the broadcast variable
        String devicebd = "iphone";
        final Broadcast<String> broadcastdevice = sc.broadcast(devicebd);

        // Filter: keep only the records that contain the broadcast value
        JavaRDD<String> lines = lines0.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) throws Exception {
                return s.contains(broadcastdevice.value());
            }
        });

        // Verification: print the filtered records
        List<String> listRow000 = lines.collect();
        for (String row : listRow000) {
            System.out.println(row);
        }

        // Build the key (date#Item#userID) and the KV pair ((date#Item#userID), 1)
        JavaPairRDD<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String line) throws Exception {
                String[] splitedLine = line.split("\t");
                int one = 1;
                String dataanditemanduserid = splitedLine[0] + "#" + splitedLine[2] + "#"
                        + String.valueOf(splitedLine[1]);
                return new Tuple2<String, Integer>(String.valueOf(dataanditemanduserid), Integer.valueOf(one));
            }
        });

        // Verification: print the KV pairs
        List<Tuple2<String, Integer>> listRow = pairs.collect();
        for (Tuple2<String, Integer> row : listRow) {
            System.out.println(row._1);
            System.out.println(row._2);
        }

        // reduceByKey: sum the counts per key
        JavaPairRDD<String, Integer> reduceedPairs = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        List<Tuple2<String, Integer>> reduceedRow = reduceedPairs.collect();

        // Dynamically assemble the JSON documents
        List<String> peopleInformations = new ArrayList<String>();
        for (Tuple2<String, Integer> row : reduceedRow) {
            // Split the key back into its three fields
            String[] rowSplitedLine = row._1.split("#");
            String rowuserID = rowSplitedLine[2];
            String rowitemID = rowSplitedLine[1];
            String rowdateID = rowSplitedLine[0];
            // Assemble the JSON document: Date, UserID, Username, Item, count
            String jsonzip = "{\"Date\":\"" + rowdateID
                    + "\", \"UserID\":\"" + rowuserID
                    + "\", \"Username\":\"" + rowuserID
                    + "\", \"Item\":\"" + rowitemID
                    + "\", \"count\":" + row._2 + " }";
            peopleInformations.add(jsonzip);
        }

        // Verification: print peopleInformations
        for (String row : peopleInformations) {
            System.out.println(row);
        }

        // Build the DataFrame from the RDD of JSON strings
        JavaRDD<String> peopleInformationsRDD = sc.parallelize(peopleInformations);
        DataFrame peopleInformationsDF = sqlContext.read().json(peopleInformationsRDD);

        // Register it as a temporary table
        peopleInformationsDF.registerTempTable("peopleInformations");

        /* Use a subquery to extract the target data; inside it, the window function row_number
         * ranks the rows within each group:
         *   PARTITION BY: the grouping key of the window function
         *   ORDER BY:     sort order within each group
         */
        String sqlText = "SELECT UserID, Item, count "
                + "FROM ("
                + "  SELECT UserID, Item, count, "
                + "  row_number() OVER (PARTITION BY UserID ORDER BY count DESC) rank "
                + "  FROM peopleInformations "
                + ") sub_peopleInformations "
                + "WHERE rank <= 3";

        DataFrame execellentNameAgeDF = sqlContext.sql(sqlText);
        execellentNameAgeDF.show();
        execellentNameAgeDF.write().format("json")
                .save("G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\Result20140419_2");
    }

}

  

 

 
