A Hadoop-Based Product Recommendation System
Recommendation result = user purchase vector × product similarity matrix
Product similarity: the co-occurrence count of two products (Euclidean distance or other measures could be used instead)
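A toy illustration of this formula (the 3×3 matrix and the purchase vector below are made-up example values, not data from this project): multiplying the co-occurrence matrix by a user's purchase vector yields a recommendation score for every product.

```java
import java.util.Arrays;

public class RecommendDemo {
    // score = cooccurrence matrix × purchase vector:
    // for each candidate product i, sum cooccurrence[i][j] over the products j the user bought.
    static int[] recommend(int[][] cooccurrence, int[] bought) {
        int n = bought.length;
        int[] score = new int[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                score[i] += cooccurrence[i][j] * bought[j];
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Pretend there are only 3 products; the matrix entries are invented for the demo.
        int[][] c = {
            {3, 2, 2},
            {2, 4, 2},
            {2, 2, 3},
        };
        int[] bought = {1, 0, 1};  // the user bought products 0 and 2
        System.out.println(Arrays.toString(recommend(c, bought)));  // [5, 4, 5]
    }
}
```

The two MapReduce jobs described below compute exactly these two inputs: the purchase vectors (step 8) and the co-occurrence matrix (step 9).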
Preliminary setup
1. Project name: GRMS
2. Add the Maven dependencies: pom.xml
3. Create the packages:
com.briup.bigdata.project.grms
|--step1
|--step2
|--...
|--utils
4. Copy the four xml configuration files from the cluster into the resources directory.
5. Create the following directory layout under the HDFS root:
/grms
|--rawdata/matrix.txt
|--step1
|--...
6. Initial data: matrix.txt
10001 20001 1
10001 20002 1
10001 20005 1
10001 20006 1
10001 20007 1
10002 20003 1
10002 20004 1
10002 20006 1
10003 20002 1
10003 20007 1
10004 20001 1
10004 20002 1
10004 20005 1
10004 20006 1
10005 20001 1
10006 20004 1
10006 20007 1
Here, IDs starting with 1000 are user IDs, IDs starting with 2000 are product IDs, and the last column is the purchase count.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.briup.bigdata.project.grms</groupId>
    <artifactId>GRMS</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.8.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.8.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.8.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
            <version>2.8.3</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.2</version>
        </dependency>
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>commons-configuration</groupId>
            <artifactId>commons-configuration</artifactId>
            <version>1.9</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>1.2.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.6</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>grms</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
Adjust the versions above to match the ones you actually use.
8. Compute the list of products each user purchased
File: UserBuyGoodsList.java
Classes:
UserBuyGoodsList
UserBuyGoodsListMapper
UserBuyGoodsListReducer
Implementation:
package com.briup.bigdata.project.grms;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
import java.io.IOException;
import java.util.Iterator;
public class UserBuyGoodsList extends Configured implements Tool {
    static class UserBuyGoodsListMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Input line: userId \t productId \t count
            String[] tuple = value.toString().split("\t");
            context.write(new Text(tuple[0]), new Text(tuple[1]));
        }
    }

    static class UserBuyGoodsListReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Join all products bought by this user with commas.
            Iterator<Text> iterator = values.iterator();
            StringBuilder builder = new StringBuilder();
            while (iterator.hasNext()) {
                builder.append(iterator.next().toString()).append(',');
            }
            // Drop the trailing comma.
            String result = builder.substring(0, builder.length() - 1);
            context.write(key, new Text(result));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Path in = new Path(conf.get("in"));
        Path out = new Path(conf.get("out"));
        Job job = Job.getInstance(conf, this.getClass().getSimpleName());
        job.setJarByClass(UserBuyGoodsList.class);
        job.setMapperClass(UserBuyGoodsListMapper.class);
        job.setReducerClass(UserBuyGoodsListReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job, in);
        TextOutputFormat.setOutputPath(job, out);
        return job.waitForCompletion(true) ? 0 : -1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new UserBuyGoodsList(), args));
    }
}
Walkthrough: the map phase reads the raw data line by line and splits each line on "\t", so the user ID lands in tuple[0] and the product ID in tuple[1]. These pairs are handed to the reduce phase, where an iterator appends every value sharing the same key to a StringBuilder, separated by commas. (StringBuilder is preferred over String here: each String + operation creates a brand-new string, which wastes memory at scale.) Before writing to the context, substring drops the final character, the trailing comma, which carries no meaning.
The run method configures the job. Doing this by hand for every new class is tedious; the end of this article introduces a utility class that centralizes the job configuration and makes it easier to read and reuse. Because the input and output paths are read from the configuration (conf.get("in"), conf.get("out")), they can be supplied at submit time as -Din=... and -Dout=... through ToolRunner.
Result:
10001 20001,20005,20006,20007,20002
10002 20006,20003,20004
10003 20002,20007
10004 20001,20002,20005,20006
10005 20001
10006 20004,20007
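The map-shuffle-reduce flow above can be sketched as a plain in-memory simulation (for illustration only; the class and method names here are not part of the project):

```java
import java.util.*;

public class UserBuyGoodsLocal {
    // Local sketch of step 8: group (user, product) pairs by user,
    // then join each user's products with commas, as the reducer does.
    static Map<String, String> buildLists(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();   // stands in for the shuffle
        for (String[] p : pairs) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            out.put(e.getKey(), String.join(",", e.getValue()));  // stands in for the reducer
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[]{"10003", "20002"},
                new String[]{"10003", "20007"},
                new String[]{"10005", "20001"});
        System.out.println(buildLists(pairs));  // {10003=20002,20007, 10005=20001}
    }
}
```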
9. Compute product co-occurrence
File: GoodsCooccurrenceList.java
Classes:
GoodsCooccurrenceList
GoodsCooccurrenceListMapper
GoodsCooccurrenceListReducer
Data source: the output of the previous step
Implementation:
package com.briup.bigdata.project.grms;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import java.io.IOException;
public class GoodsCooccurrenceList extends Configured implements Tool {
    private final static Text K = new Text();
    private final static IntWritable V = new IntWritable(1);

    static class GoodsCooccurrenceListMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Input line: userId \t product1,product2,...
            String[] tokens = value.toString().split("\t");
            String[] items = tokens[1].split(",");
            // Emit every ordered pair (including a product with itself) with count 1.
            for (String itemA : items) {
                for (String itemB : items) {
                    K.set(itemA + "\t" + itemB);
                    context.write(K, V);
                }
            }
        }
    }

    static class GoodsCooccurrenceListReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the 1s to get the co-occurrence count of this pair.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            V.set(sum);
            context.write(key, V);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Path in = new Path(conf.get("in"));
        Path out = new Path(conf.get("out"));
        Job job = Job.getInstance(conf, this.getClass().getSimpleName());
        job.setJarByClass(GoodsCooccurrenceList.class);
        job.setMapperClass(GoodsCooccurrenceListMapper.class);
        job.setReducerClass(GoodsCooccurrenceListReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job, in);
        TextOutputFormat.setOutputPath(job, out);
        return job.waitForCompletion(true) ? 0 : -1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new GoodsCooccurrenceList(), args));
    }
}
Walkthrough: for convenience I defined two static final fields, K and V, although the results could just as well be written to the context directly. The mapper reads the output of UserBuyGoodsList line by line and splits it on "\t"; since the co-occurrence matrix does not need the user column, only the product list is used: it is split on commas, every ordered pair of products is emitted with a count of 1, and the reducer sums those counts per pair.
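The pair-counting logic of this mapper and reducer can likewise be checked with a small in-memory sketch (for illustration only; not part of the actual job):

```java
import java.util.*;

public class CooccurrenceLocal {
    // For each user's product list, emit every ordered pair
    // (including a product with itself) and count how often each
    // pair appears, mirroring the mapper's double loop and the
    // reducer's sum.
    static Map<String, Integer> cooccur(List<String[]> baskets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] items : baskets) {
            for (String a : items) {
                for (String b : items) {
                    counts.merge(a + "\t" + b, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Product lists of users 10003 and 10006 from the step-8 output.
        List<String[]> baskets = Arrays.asList(
                new String[]{"20002", "20007"},
                new String[]{"20004", "20007"});
        System.out.println(cooccur(baskets).get("20007\t20007"));  // 2
    }
}
```

Note that the diagonal entries (a product paired with itself) simply count how many users bought that product.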