Background
In daily life people are used to watching movies, but everyone's tastes differ: some like war films, some prefer art films, others love romance films, and so on. We collected some data about customers and movies, with the goal of estimating the ratings customers would give to particular films, so that we can predict which movies a customer is likely to enjoy and recommend them. This big-data exercise uses word counting and a user-based collaborative filtering algorithm.
Analysis and Prediction Techniques
Analysis tool: MapReduce on Hadoop
Data preprocessing: word counting is used to filter out some duplicate and useless data
Algorithm: user-based collaborative filtering
Data visualization: ECharts bar charts and parallel-coordinates charts
User-Based Collaborative Filtering
Recommendations for the target user are generated from the opinions of other users: if users rate some items similarly, they tend to rate other items similarly as well. A collaborative-filtering recommender uses statistical techniques to search for a number of nearest neighbors of the target user, predicts the target user's ratings for unrated items from the neighbors' ratings, and returns the items with the highest predicted ratings as the recommendation list.
Implementation:
- Collect information that represents the users' interests.
- Nearest-neighbor search: compute the similarity between two users, using the cosine similarity between user i and user j (written out below).
- Generate predictions: user u's predicted rating for an item is derived from the ratings given to that item by u's nearest-neighbor set NBS.
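A compact statement of these two formulas, under the notation assumed here (r_{u,c} is user u's rating of item c, NBS(u) is u's nearest-neighbor set, and the sums in the similarity run over the items both users have rated):

sim(i,j) = \frac{\sum_{c} r_{i,c}\, r_{j,c}}{\sqrt{\sum_{c} r_{i,c}^{2}} \, \sqrt{\sum_{c} r_{j,c}^{2}}}

P_{u,i} = \frac{\sum_{n \in NBS(u)} sim(u,n)\, r_{n,i}}{\sum_{n \in NBS(u)} \lvert sim(u,n) \rvert}

The second formula is the plain similarity-weighted average (no mean-centering), which is what the UserCF4 job below computes.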
Case Study
Recommend movies to users based on basic movie information and the users' ratings of movies.
Data collection: on the order of one hundred thousand user-movie ratings, taken from the latest MovieLens release.
Using the data in movies.dat, a word count is run to analyze the public's preferences across the different movie genres:
package org.bigdata.util;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/* Counts how many movies belong to each genre listed in movies.dat. */
public class classify {

    private static class classifyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // movies.dat format: movieId::title::genre1|genre2|...
            String[] strs = value.toString().split("::");
            String[] classes = strs[2].split("\\|");
            for (String str : classes) {
                context.write(new Text(str), new IntWritable(1));
            }
        }
    }

    private static class classifyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text value, Iterable<IntWritable> datas,
                Reducer<Text, IntWritable, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable data : datas) {
                count = count + data.get();
            }
            context.write(value, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration cfg = HadoopCfg.getCfg();
        Job job = Job.getInstance(cfg);
        job.setJobName("classify Count");
        job.setJarByClass(classify.class);
        job.setMapperClass(classifyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(classifyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input/movies.dat"));
        FileOutputFormat.setOutputPath(job, new Path("/output/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Data preprocessing: preprocess the data in ratings.dat and keep only the highly rated records (the filter in the code below keeps ratings of 3 and above):
package org.bigdata.util;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/* Filters the rating records, keeping only sufficiently high ratings. */
public class favor {

    private static class favorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // expected record format: userId,movieId,rating,...
            String[] strs = value.toString().split(",");
            int mvote = (Float.valueOf(strs[2])).intValue();
            if (mvote >= 3) {
                // emit: movieId \t userId \t rating
                context.write(new Text(strs[1] + "\t" + strs[0] + "\t" + strs[2]), new IntWritable(1));
            }
        }
    }

    private static class favorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text value, Iterable<IntWritable> datas,
                Reducer<Text, IntWritable, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable data : datas) {
                count = count + data.get();
            }
            context.write(value, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration cfg = HadoopCfg.getCfg();
        Job job = Job.getInstance(cfg);
        job.setJobName("favor Count");
        job.setJarByClass(favor.class);
        job.setMapperClass(favorMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(favorReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/select/"));
        FileOutputFormat.setOutputPath(job, new Path("/output/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Data processing: apply the user-based collaborative filtering algorithm to the filtered data to produce the predictions.
Main driver:
package com;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

/* Driver: runs the six MapReduce stages of the user-based CF pipeline in order. */
public class UserCF {
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new UserCF1(), args);
        ToolRunner.run(new Configuration(), new UserCF2(), args);
        ToolRunner.run(new Configuration(), new UserCF3(), args);
        ToolRunner.run(new Configuration(), new UserCF4(), args);
        ToolRunner.run(new Configuration(), new UserCF5(), args);
        ToolRunner.run(new Configuration(), new UserCF6(), args);
    }
}
Link up the users who have rated the same movie:
package com;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.bigdata.util.HadoopCfg;

/* Stage 1: for every movie, pair up the users who rated it and emit the
 * partial terms of the cosine similarity (numerator product and the two
 * squared terms of the denominator). */
public class UserCF1 extends Configured implements Tool {

    public static class Mapper1 extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // keep only the user id, movie id and rating from the input line
            String[] values = value.toString().split("\t");
            // the movie id is the key
            context.write(new Text(values[1]), new Text(values[0] + "\t" + values[2]));
        }
    }

    public static class Reducer1 extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> tmp_list = new ArrayList<String>();
            for (Text tmp : values) {
                tmp_list.add(tmp.toString());
            }
            for (int i = 0; i < tmp_list.size(); i++) {
                String[] tmp1 = tmp_list.get(i).split("\t");
                int tmp11 = (Float.valueOf(tmp1[1])).intValue();
                int down1 = tmp11 * tmp11;
                for (int j = 0; j < tmp_list.size(); j++) {
                    String[] tmp2 = tmp_list.get(j).split("\t");
                    int tmp21 = (Float.valueOf(tmp2[1])).intValue();
                    int up = tmp11 * tmp21;
                    int down2 = tmp21 * tmp21;
                    // link users who rated the same movie
                    context.write(new Text(tmp1[0] + " " + tmp2[0]),
                            new Text(up + " " + down1 + " " + down2));
                }
            }
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = HadoopCfg.getCfg();
        Job job = Job.getInstance(conf, "UserCF1");
        job.setJarByClass(UserCF1.class);
        job.setMapperClass(Mapper1.class);
        job.setReducerClass(Reducer1.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/userCF/train"));
        Path table_path = new Path("/userCF/tmp");
        FileSystem.get(conf).delete(table_path, true);
        FileOutputFormat.setOutputPath(job, table_path);
        job.waitForCompletion(true);
        return 0;
    }
}
Cosine similarity:
package com;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.bigdata.util.HadoopCfg;

/* Stage 2: sum the partial terms per user pair and compute the cosine similarity. */
public class UserCF2 extends Configured implements Tool {

    public static class Mapper2 extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            String[] tmp = values[0].split(" ");
            // the user pair is the key
            context.write(new Text(tmp[0] + "\t" + tmp[1]), new Text(values[1]));
        }
    }

    public static class Reducer2 extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int up = 0;
            int down1 = 0;
            int down2 = 0;
            float simi = 0;
            for (Text tmp : values) {
                String[] tmp_list = tmp.toString().split(" ");
                up = up + Integer.parseInt(tmp_list[0]);
                down1 = down1 + Integer.parseInt(tmp_list[1]);
                down2 = down2 + Integer.parseInt(tmp_list[2]);
            }
            // cosine similarity
            double down = Math.sqrt(down1) * Math.sqrt(down2);
            simi = (float) (up / down);
            context.write(key, new Text(simi + " si"));
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = HadoopCfg.getCfg();
        Job job = Job.getInstance(conf, "UserCF2");
        job.setJarByClass(UserCF2.class);
        job.setMapperClass(Mapper2.class);
        job.setReducerClass(Reducer2.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // the paths must match stage 1's output and stage 3's input
        FileInputFormat.addInputPath(job, new Path("/userCF/tmp"));
        Path table_path = new Path("/userCF/simi");
        FileSystem.get(conf).delete(table_path, true);
        FileOutputFormat.setOutputPath(job, table_path);
        job.waitForCompletion(true);
        return 0;
    }
}
Join the existing ratings with the user similarities:
package com;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.bigdata.util.HadoopCfg;

/* Stage 3: join each user's known ratings with the similarities to that user. */
public class UserCF3 extends Configured implements Tool {

    public static class Mapper3 extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // either "user \t movie \t rating" (train) or "user1 \t user2 \t similarity" (simi)
            String[] values = value.toString().split("\t");
            context.write(new Text(values[0]), new Text(values[1] + "\t" + values[2]));
        }
    }

    public static class Reducer3 extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> user_list = new ArrayList<String>();
            List<String> item_list = new ArrayList<String>();
            for (Text tmp : values) {
                String[] tmp1 = tmp.toString().split("\t");
                String[] tmp2 = tmp1[1].split(" ");
                // decide which input file this record came from
                if (tmp2.length == 2) {
                    user_list.add(tmp1[0] + "\t" + tmp2[0]);   // neighbor user \t similarity
                } else {
                    item_list.add(tmp1[0] + "\t" + tmp2[0]);   // movie \t rating
                }
            }
            // join the ratings with the similarities
            for (int i = 0; i < user_list.size(); i++) {
                String[] tmp1 = user_list.get(i).split("\t");
                for (int j = 0; j < item_list.size(); j++) {
                    String[] tmp2 = item_list.get(j).split("\t");
                    context.write(new Text(tmp1[0] + " " + tmp2[0]),
                            new Text(tmp1[1] + " " + tmp2[1]));
                }
            }
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = HadoopCfg.getCfg();
        Job job = Job.getInstance(conf, "UserCF3");
        job.setJarByClass(UserCF3.class);
        job.setMapperClass(Mapper3.class);
        job.setReducerClass(Reducer3.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/userCF/train"));
        FileInputFormat.addInputPath(job, new Path("/userCF/simi"));
        Path table_path = new Path("/userCF/tmp2");
        FileSystem.get(conf).delete(table_path, true);
        FileOutputFormat.setOutputPath(job, table_path);
        job.waitForCompletion(true);
        return 0;
    }
}
Predict each user's rating for every movie:
package com;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.bigdata.util.HadoopCfg;

/* Stage 4: predict ratings as the similarity-weighted average of the neighbors' ratings. */
public class UserCF4 extends Configured implements Tool {

    public static class Mapper4 extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            String[] tmp = values[0].split(" ");
            // key: user \t movie, value: "similarity rating"
            context.write(new Text(tmp[0] + "\t" + tmp[1]), new Text(values[1]));
        }
    }

    public static class Reducer4 extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double up = 0;
            double down = 0;
            for (Text tmp : values) {
                String[] tmp1 = tmp.toString().split(" ");
                up = up + Double.parseDouble(tmp1[0]) * Double.parseDouble(tmp1[1]);
                down = down + Math.abs(Double.parseDouble(tmp1[0]));
            }
            double score = up / down;
            context.write(key, new Text(score + ""));
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = HadoopCfg.getCfg();
        Job job = Job.getInstance(conf, "UserCF4");
        job.setJarByClass(UserCF4.class);
        job.setMapperClass(Mapper4.class);
        job.setReducerClass(Reducer4.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/userCF/tmp2"));
        Path table_path = new Path("/userCF/score");
        FileSystem.get(conf).delete(table_path, true);
        FileOutputFormat.setOutputPath(job, table_path);
        job.waitForCompletion(true);
        return 0;
    }
}
Measure the deviation between the users' actual ratings and the predictions:
package com;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.bigdata.util.HadoopCfg;

/* Stage 5: compute the absolute error between each predicted rating and the
 * corresponding actual rating in the test set. */
public class UserCF5 extends Configured implements Tool {

    public static class Mapper5 extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // both input files are expected to look like: user \t movie \t rating
            String[] values = value.toString().split("\t");
            context.write(new Text(values[0] + "\t" + values[1]), new Text(values[2]));
        }
    }

    public static class Reducer5 extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int i = 0;
            double tmp1 = 0, tmp2 = 0;
            for (Text tmp : values) {
                if (i == 0) {
                    tmp1 = Double.parseDouble(tmp.toString());
                } else {
                    tmp2 = Double.parseDouble(tmp.toString());
                }
                i++;
            }
            // only user/movie pairs that appear in both files contribute to the error
            if (i == 2) {
                context.write(new Text("mae"), new Text(Math.abs(tmp1 - tmp2) + ""));
            }
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = HadoopCfg.getCfg();
        Job job = Job.getInstance(conf, "UserCF5");
        job.setJarByClass(UserCF5.class);
        job.setMapperClass(Mapper5.class);
        job.setReducerClass(Reducer5.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/userCF/score/part-r-00000"));
        FileInputFormat.addInputPath(job, new Path("/userCF/test"));
        Path table_path = new Path("/userCF/tmp3");
        FileSystem.get(conf).delete(table_path, true);
        FileOutputFormat.setOutputPath(job, table_path);
        job.waitForCompletion(true);
        return 0;
    }
}
package com;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.bigdata.util.HadoopCfg;

/* Stage 6: average the absolute errors to obtain the MAE. */
public class UserCF6 extends Configured implements Tool {

    public static class Mapper6 extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            context.write(new Text(values[0]), new Text(values[1]));
        }
    }

    public static class Reducer6 extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int num = 0;
            double sum = 0;
            for (Text tmp : values) {
                sum = sum + Double.parseDouble(tmp.toString());
                num = num + 1;
            }
            context.write(new Text("mae"), new Text(sum / num + ""));
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = HadoopCfg.getCfg();
        Job job = Job.getInstance(conf, "UserCF6");
        job.setJarByClass(UserCF6.class);
        job.setMapperClass(Mapper6.class);
        job.setReducerClass(Reducer6.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/userCF/tmp3"));
        Path table_path = new Path("/userCF/MAE");
        FileSystem.get(conf).delete(table_path, true);
        FileOutputFormat.setOutputPath(job, table_path);
        job.waitForCompletion(true);
        return 0;
    }
}
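For reference, the quantity that UserCF5 and UserCF6 compute together is the mean absolute error over the N user-movie pairs present in both the prediction output and the test set, with p_i the predicted rating and r_i the actual one:

MAE = \frac{1}{N} \sum_{i=1}^{N} \lvert p_i - r_i \rvert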
Data visualization: the results are converted to JSON and visualized with an ECharts bar chart and a parallel-coordinates chart.
http://echarts.baidu.com/demo.html#mix-zoom-on-value
http://echarts.baidu.com/demo.html#parallel-aqi
Conclusions and Takeaways
- The data was analyzed and predictions were produced with a user-based collaborative filtering algorithm implemented in Hadoop MapReduce.
- The bar chart shows that the public is most fond of comedies, action films, and romance films.
- The parallel-coordinates chart shows that people can be linked through their movie ratings, and data-driven recommendation helps users find the movies they like more quickly.
- We live in the era of big data; through analysis and prediction we come to understand ourselves and our needs better.
Problems
Because the data set is fairly large, this run produced a huge number of intermediate files, which took up a large share of the Hadoop cluster's storage and eventually kept the jobs from completing normally. (Why does this happen? If one user has rated 10,000 movies, then every user who has watched even one of those movies becomes linked to that user, and a possible rating must then be computed for every movie that user has not yet seen; as the numbers of users and movies grow, the intermediate data explodes.)
Searching online showed that the storage was almost completely used up, with nearly no free space left. The remedy is to give the Hadoop DataNodes more disk capacity: add disks and register the new directories in the dfs.datanode.data.dir property in hdfs-site.xml, separating the paths with commas.
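A minimal hdfs-site.xml sketch of that setting; the mount points /data/disk1 and /data/disk2 are placeholders and should be replaced with the actual directories on the newly added disks. The DataNode must be restarted for the new directories to take effect.

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/disk1/hdfs/data,/data/disk2/hdfs/data</value>
</property>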
http://www.makaidong.com/%E5%8D%9A%E5%AE%A2%E5%9B%AD%E6%8E%92%E8%A1%8C/20013.shtml