1. For each day, compute the distinct visitor count, member count, and session count.
Sample data is shown below. The real file is roughly 500 MB and holds only one day of data, so the final result is a single line.
Link: https://pan.baidu.com/s/15_8m-kn-_cYmmpNSsMcutA
Extraction code: wjhu
190.164.178.204 1531362217 www.qianfeng.com /?en=e_l&ver=0.0.1&pl=website&sdk=???&u_ud=8594CA46-632A-4C51-A505-62815E9A3FAF&u_mid=智刚&u_sd=0BB14C4F-29BB-4A84-A27F-E893C67E383C&c_time=1531362385&l=zh-CN&b_iev=Opera/9.80 (Windows NT 10.0; U; zh-cn) Presto/2.9.168 Version/10.50&b_rst=b_rst=1366*768
190.164.178.204 1531362217 www.qianfeng.com /?en=e_pv&p_url=http%3A%2F%2Flocalhost%3A8080%2Fdemo.jsp&tt=%E6%B5%8B%E8%AF%95%E9%A1%B5%E9%9D%A21&ver=0.0.1&pl=website&sdk=???&u_ud=8594CA46-632A-4C51-A505-62815E9A3FAF&u_mid=智刚&u_sd=0BB14C4F-29BB-4A84-A27F-E893C67E383C&c_time=1531362385&l=zh-CN&b_iev=Opera/9.80 (Windows NT 10.0; U; zh-cn) Presto/2.9.168 Version/10.50&b_rst=b_rst=1366*768
250.75.31.244 1531394907 www.qianfeng.com /?en=e_l&ver=0.0.7&pl=website&sdk=???&u_ud=74F48D0B-8D1E-4CA9-B1BA-17C6915F2AB9&u_mid=鸿朗&u_sd=A2363DE0-9C4D-4FAE-93C2-DD148862F5D0&c_time=1531394974&l=zh-CN&b_iev=Opera/9.80 (Windows NT 10.0; U; zh-cn) Presto/2.9.168 Version/11.50&b_rst=b_rst=1366*768
250.75.31.244 1531394907 www.qianfeng.com /?en=e_pv&p_url=http%3A%2F%2Flocalhost%3A8080%2Fdemo2.jsp&p_ref=http%3A%2F%2Flocalhost%3A8080%2Fdemo.jsp&tt=%E6%B5%8B%E8%AF%95%E9%A1%B5%E9%9D%A22&ver=0.0.7&pl=website&sdk=???&u_ud=74F48D0B-8D1E-4CA9-B1BA-17C6915F2AB9&u_mid=鸿朗&u_sd=A2363DE0-9C4D-4FAE-93C2-DD148862F5D0&c_time=1531394974&l=zh-CN&b_iev=Opera/9.80 (Windows NT 10.0; U; zh-cn) Presto/2.9.168 Version/11.50&b_rst=b_rst=1366*768
250.75.31.244 1531394907 www.qianfeng.com /?en=e_cr&oid=31739B-F90C-E24FEB&on=order31739B-F90C-E24FEB&cua=1454&cut=?&pt=银行卡&ver=0.0.7&pl=website&sdk=???&u_ud=74F48D0B-8D1E-4CA9-B1BA-17C6915F2AB9&u_mid=鸿朗&u_sd=A2363DE0-9C4D-4FAE-93C2-DD148862F5D0&c_time=1531394974&l=zh-CN&b_iev=Opera/9.80 (Windows NT 10.0; U; zh-cn) Presto/2.9.168 Version/11.50&b_rst=b_rst=1366*768
250.75.31.244 1531394907 www.qianfeng.com /?en=e_cs&oid=31739B-F90C-E24FEB&ver=0.0.7&pl=website&sdk=???&u_ud=74F48D0B-8D1E-4CA9-B1BA-17C6915F2AB9&u_mid=鸿朗&u_sd=A2363DE0-9C4D-4FAE-93C2-DD148862F5D0&c_time=1531394974&l=zh-CN&b_iev=Opera/9.80 (Windows NT 10.0; U; zh-cn) Presto/2.9.168 Version/11.50&b_rst=b_rst=1366*768
139.73.50.157 1531366383 www.qianfeng.com /?en=e_l&ver=0.0.2&pl=website&sdk=???&u_ud=DBA6A6A6-6DBE-448C-B3AD-7581522B3DE1&u_mid=金鹏&u_sd=0CF915C1-22DB-4F5D-8A8C-31A5F161C0D4&c_time=1531366497&l=zh-CN&b_iev=Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 10.0; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)&b_rst=b_rst=1366*768
139.73.50.157 1531366383 www.qianfeng.com /?en=e_pv&p_url=http%3A%2F%2Flocalhost%3A8080%2Fdemo2.jsp&p_ref=http%3A%2F%2Flocalhost%3A8080%2Fdemo.jsp&tt=%E6%B5%8B%E8%AF%95%E9%A1%B5%E9%9D%A22&ver=0.0.2&pl=website&sdk=???&u_ud=DBA6A6A6-6DBE-448C-B3AD-7581522B3DE1&u_mid=金鹏&u_sd=0CF915C1-22DB-4F5D-8A8C-31A5F161C0D4&c_time=1531366497&l=zh-CN&b_iev=Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 10.0; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)&b_rst=b_rst=1366*768
139.73.50.157 1531366383 www.qianfeng.com /?en=e_cr&oid=6521E8-A790-D14B90&on=order6521E8-A790-D14B90&cua=986&cut=£&pt=银行卡&ver=0.0.2&pl=website&sdk=???&u_ud=DBA6A6A6-6DBE-448C-B3AD-7581522B3DE1&u_mid=金鹏&u_sd=0CF915C1-22DB-4F5D-8A8C-31A5F161C0D4&c_time=1531366497&l=zh-CN&b_iev=Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 10.0; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)&b_rst=b_rst=1366*768
139.73.50.157 1531366383 www.qianfeng.com /?en=e_cs&oid=6521E8-A790-D14B90&ver=0.0.2&pl=website&sdk=???&u_ud=DBA6A6A6-6DBE-448C-B3AD-7581522B3DE1&u_mid=金鹏&u_sd=0CF915C1-22DB-4F5D-8A8C-31A5F161C0D4&c_time=1531366497&l=zh-CN&b_iev=Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 10.0; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)&b_rst=b_rst=1366*768
1. The data is grouped by day, so the key is the date.
The second column is a Unix timestamp and has to be converted to a date
(converting a timestamp to a date):
String time = "1531362217";
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");
String time1 = simpleDateFormat.format(new Date(1000L * Long.parseLong(time)));
which prints 2018-07-12 (SimpleDateFormat formats in the JVM's default timezone, here China time).
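A self-contained version of that conversion, using java.time and pinning the zone to Asia/Shanghai so the result does not depend on the JVM's default timezone (the class and method names here are made up for the example):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class EpochDay {
    // Convert an epoch-seconds string (column two of the log) to yyyy-MM-dd.
    // Asia/Shanghai is an assumption based on the log's locale.
    static String toDay(String epochSeconds) {
        return Instant.ofEpochSecond(Long.parseLong(epochSeconds))
                .atZone(ZoneId.of("Asia/Shanghai"))
                .format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));
    }

    public static void main(String[] args) {
        System.out.println(toDay("1531362217")); // prints 2018-07-12
    }
}
```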
The fourth column is one long string from which u_ud, u_mid and u_sd can be cut out; those are the three values we need (presumably the visitor ID, member ID and session ID, matching the three counts asked for).
Then it is just a matter of deduplicating each one separately. Simple: at this data volume, a Set handles the deduplication fine.
My first attempt got this wrong: the map emitted the three values concatenated into a single string, and the reduce merely split that string and counted, without checking whether the pieces were valid. The result was wrong; there may be dirty data, and that case has to be handled.
Also, I had not looked at the data closely enough: the line should first be split on "???", and only then on "&". Splitting directly on "&" grabs the wrong fields.
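To check that the "???"-first split really isolates the fields, here is a standalone sketch of the extraction on a shortened sample request (FieldExtract and its method name are invented for the example):

```java
import java.util.HashMap;
import java.util.Map;

public class FieldExtract {
    // Split the request on the literal "???" marker, then on "&" and "=",
    // keeping only the three fields we care about.
    static Map<String, String> extract(String request) {
        Map<String, String> out = new HashMap<>();
        String[] halves = request.split("\\?\\?\\?"); // "?" must be escaped: split() takes a regex
        if (halves.length < 2) return out;            // guard against lines without the marker
        for (String kv : halves[1].split("&")) {
            String[] pair = kv.split("=", 2);
            if (pair.length == 2
                    && (pair[0].equals("u_ud") || pair[0].equals("u_mid") || pair[0].equals("u_sd"))) {
                out.put(pair[0], pair[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String req = "/?en=e_l&ver=0.0.1&pl=website&sdk=???"
                + "&u_ud=8594CA46-632A-4C51-A505-62815E9A3FAF&u_mid=智刚&u_sd=0BB14C4F";
        System.out.println(extract(req).get("u_ud")); // prints 8594CA46-632A-4C51-A505-62815E9A3FAF
    }
}
```

Splitting directly on "&" would instead return pieces at shifting positions, since e_l, e_pv, e_cr and e_cs events carry different leading parameters.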
2. First attempt: the map splits the line and emits the three values concatenated:
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String[] s = value.toString().split("\t");
    String time = s[1];
    String[] s1 = s[3].split("&");
    String uid = s1[4];
    String umid = s1[5];
    String usid = s1[6];
    String time1 = simpleDateFormat.format(new Date(1000L * Long.parseLong(time)));
    context.write(new Text(time1), new Text(uid + ":" + umid + ":" + usid));
}
Then the reduce:
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    Set<String> uidSet = new TreeSet<>();
    Set<String> umidSet = new TreeSet<>();
    Set<String> usidSet = new TreeSet<>();
    for (Text t : values) {
        //split on ":" and add straight into the sets; with dirty data these
        //pieces may not actually be IDs at all, since the map only split on "&"
        String[] ss = t.toString().split(":");
        String uid1 = ss[0];
        String umid1 = ss[1];
        String usid1 = ss[2];
        uidSet.add(uid1);
        umidSet.add(umid1);
        usidSet.add(usid1);
    }
    context.write(new Text(key), new Text("u_ud:" + uidSet.size() + "\tu_mid:" + umidSet.size() + "\tu_sd:" + usidSet.size()));
}
The output:
2018-07-12	u_ud:400025	u_mid:213582	u_sd:480253
Those counts include dirty data. The map should instead emit each field separately, only when the piece actually matches one of the three keys. With that change the result becomes:
2018-07-12	u_ud:400000	u_mid:238	u_sd:400001
The corrected code:
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String[] s = value.toString().split("\t");
    String time = s[1];
    String time1 = simpleDateFormat.format(new Date(1000L * Long.parseLong(time)));
    // String[] s1 = s[3].split("/?/?/?"); // wrong: split() takes a regex, so "?" must be escaped
    String[] s1 = s[3].split("\\?\\?\\?");
    String[] s2 = s1[1].split("&");
    for (String str : s2) { //walk every "&"-separated piece, keeping only matching fields
        String[] tmp_value = str.split("=");
        //e.g. "u_mid=金鹏": split it on "=" (checking contains("u_mid") would also work)
        if (tmp_value[0].equals("u_ud")) {
            context.write(new Text(time1), new Text("u_ud:" + tmp_value[1]));
        } else if (tmp_value[0].equals("u_mid")) {
            context.write(new Text(time1), new Text("u_mid:" + tmp_value[1]));
        } else if (tmp_value[0].equals("u_sd")) {
            context.write(new Text(time1), new Text("u_sd:" + tmp_value[1]));
        }
    }
}
//three Sets, one per field, since a Set deduplicates automatically
//reduce() groups by key, so all values with the same date arrive in a single call, which gives the per-day counts directly
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    //create the sets per call so counts cannot leak between keys (days)
    Set<String> uudSet = new TreeSet<>();
    Set<String> umidSet = new TreeSet<>();
    Set<String> usdSet = new TreeSet<>();
    for (Text t : values) { //sort each tagged value into the matching set
        String[] ss = t.toString().split(":");
        if (ss[0].equals("u_ud")) {
            uudSet.add(ss[1]);
        } else if (ss[0].equals("u_mid")) {
            umidSet.add(ss[1]);
        } else if (ss[0].equals("u_sd")) {
            usdSet.add(ss[1]);
        }
    }
    context.write(key, new Text("u_ud:" + uudSet.size() + "\tu_mid:" + umidSet.size() + "\tu_sd:" + usdSet.size()));
}
The complete program:
/**
 * Author: Shishuai
 * File: MyTest01
 * Date: 2019/9/7 16:46
 */
package test01;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Set;
import java.util.TreeSet;
public class UpdateLiuLiangMapReduce {
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            //four columns in total; split on \t first
            String[] s = value.toString().split("\t");
            //column two is a Unix timestamp; convert it to a date
            String time = s[1];
            String time1 = simpleDateFormat.format(new Date(1000L * Long.parseLong(time)));
            //split on "???" first: only the part after "???" yields u_ud/u_mid/u_sd when split on "&"
            //note that "?" must be escaped, since split() takes a regex
            // String[] s1 = s[3].split("/?/?/?"); // wrong
            String[] s1 = s[3].split("\\?\\?\\?");
            //s1[1] then looks like:
            // &u_ud=DBA6A6A6-6DBE-448C-B3AD-7581522B3DE1&u_mid=金鹏&u_sd=0CF915C1-22DB-4F5D-8A8C-31A5F161C0...
            String[] s2 = s1[1].split("&");
            //the first pieces are usually the ones we want, but not always, so check
            //each one; and since each field is deduplicated separately, emit them
            //separately instead of concatenated into one value as in my first attempt
            //(that would also work, but this is cleaner and more direct)
            for (String str : s2) { //walk every "&"-separated piece, keeping only matching fields
                String[] tmp_value = str.split("=");
                //e.g. "u_mid=金鹏": split it on "=" (checking contains("u_mid") would also work)
                if (tmp_value[0].equals("u_ud")) {
                    context.write(new Text(time1), new Text("u_ud:" + tmp_value[1]));
                } else if (tmp_value[0].equals("u_mid")) {
                    context.write(new Text(time1), new Text("u_mid:" + tmp_value[1]));
                } else if (tmp_value[0].equals("u_sd")) {
                    context.write(new Text(time1), new Text("u_sd:" + tmp_value[1]));
                }
                //output looks like:  2018-07-12  u_mid:智刚
                //                    2018-07-12  u_sd:0BB14C4F-...
                //the reduce then counts the distinct u_ud / u_mid / u_sd values
            }
        }
    }
    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        //reduce() groups by key, so all values with the same date arrive in a
        //single call, which gives the per-day counts directly
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            //three Sets, one per field, since a Set deduplicates automatically;
            //created inside reduce() so counts cannot leak between keys (days)
            Set<String> uudSet = new TreeSet<>();
            Set<String> umidSet = new TreeSet<>();
            Set<String> usdSet = new TreeSet<>();
            for (Text t : values) { //sort each tagged value into the matching set
                String[] ss = t.toString().split(":");
                if (ss[0].equals("u_ud")) {
                    uudSet.add(ss[1]);
                } else if (ss[0].equals("u_mid")) {
                    umidSet.add(ss[1]);
                } else if (ss[0].equals("u_sd")) {
                    usdSet.add(ss[1]);
                }
            }
            context.write(key, new Text("u_ud:" + uudSet.size() + "\tu_mid:" + umidSet.size() + "\tu_sd:" + usdSet.size()));
        }
    }
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        //optional HDFS HA settings for running against a cluster:
        // conf.set("fs.defaultFS", "hdfs://qf");
        // conf.set("dfs.nameservices", "qf");
        // conf.set("dfs.ha.namenodes.qf", "nn1, nn2");
        // conf.set("dfs.namenode.rpc-address.qf.nn1", "hadoop01:9000");
        // conf.set("dfs.namenode.rpc-address.qf.nn2", "hadoop02:9000");
        // conf.set("dfs.client.failover.proxy.provider.qf", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        Job job = Job.getInstance(conf, "LiuLiangMapReduce");
        job.setJarByClass(UpdateLiuLiangMapReduce.class);
        job.setMapperClass(UpdateLiuLiangMapReduce.MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(UpdateLiuLiangMapReduce.MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("C:\\Users\\HP\\Desktop\\logdata.log"));
        FileOutputFormat.setOutputPath(job, new Path("D:/BigDataTestData/test00005"));
        int success = job.waitForCompletion(true) ? 0 : 1;
        System.exit(success);
    }
}
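Spinning up a Hadoop job for every tweak is slow, so the parse-then-dedup core can first be sanity-checked locally with plain Java (the two input lines and the U1/M1/S1 IDs below are synthetic):

```java
import java.util.Set;
import java.util.TreeSet;

public class LocalSanityCheck {
    // Returns {distinct u_ud, distinct u_mid, distinct u_sd} over the given
    // log lines, using the same parse-then-dedup steps as the job above.
    static int[] counts(String[] lines) {
        Set<String> uud = new TreeSet<>(), umid = new TreeSet<>(), usd = new TreeSet<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            String[] halves = cols[3].split("\\?\\?\\?");
            if (halves.length < 2) continue; // skip lines without the "???" marker
            for (String kv : halves[1].split("&")) {
                String[] p = kv.split("=", 2);
                if (p.length < 2) continue;
                if (p[0].equals("u_ud")) uud.add(p[1]);
                else if (p[0].equals("u_mid")) umid.add(p[1]);
                else if (p[0].equals("u_sd")) usd.add(p[1]);
            }
        }
        return new int[]{uud.size(), umid.size(), usd.size()};
    }

    public static void main(String[] args) {
        // two events from the same user, member and session: each should count once
        String[] lines = {
            "1.2.3.4\t1531362217\twww.qianfeng.com\t/?en=e_l&sdk=???&u_ud=U1&u_mid=M1&u_sd=S1",
            "1.2.3.4\t1531362300\twww.qianfeng.com\t/?en=e_pv&sdk=???&u_ud=U1&u_mid=M1&u_sd=S1"
        };
        int[] c = counts(lines);
        System.out.println("u_ud:" + c[0] + "\tu_mid:" + c[1] + "\tu_sd:" + c[2]); // all 1
    }
}
```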