Hadoop Project Case Study: Predicting User Gender on a Movie Website

Latest recommended article published on 2023-06-04 01:12:30

sukeeper

Category: Hadoop project practice. Tags: algorithms, hdfs, mapreduce, big data, hadoop

Article link: https://blog.csdn.net/keeper567/article/details/127969894

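This article shows how to implement the KNN algorithm with Hadoop MapReduce to predict the gender of a movie website's users. Using user ratings, user information, and movie data, it walks through data transformation, cleaning, and splitting, then builds and evaluates a classifier, ultimately predicting gender with an accuracy of 0.7.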

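Learning objectives:

(1) Understand the principles of the KNN algorithm

(2) Implement the KNN algorithm with MapReduce

(3) Implement evaluation of the KNN classifier with MapReduce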

The Hadoop workflow implemented here is as follows:

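Task background

XX is a popular movie website. It offers a large number of film introductions and reviews, as well as lookup and ticket purchase for films currently showing. Users can record films they want to see or have seen, rate them, and write reviews. To improve user experience and satisfaction, the site plans to offer more accurate and more personalized movie recommendations.

What is a personalized movie recommendation service? A simple example: people of different genders tend to prefer different movies. Most men may prefer crime or action films, while most women may prefer romantic love stories. The site can therefore use gender to recommend movies a user is more likely to enjoy. If a member is female, for instance, the site can recommend the latest romance releases when she logs in. Compared with conventional recommendations aimed at the whole audience, such as top-rated lists or trending movies, this personalized approach fits users' real needs better, which improves the user experience and user stickiness. Of course, truly personalized recommendation in a production service does not rely on gender alone; it requires a large amount of real user-related data to build the recommendation model.

Because users generate a large amount of browsing history when visiting movies on the site, one approach worth trying is to predict a user's gender from the genres of the movies they have browsed. The rough steps are:

(1) Count the genres of all movies each user has watched, then build a classifier from the genre data of users whose gender is known.

(2) Feed the genre statistics of a user of unknown gender into the classifier to obtain that user's gender classification.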

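As shown in the figure below, the data holds each user's gender together with statistics on the genres of movies the user has watched. UserID is the user ID; Gender is the user's gender, where 1 means female and 0 means male; Age is the user's age; Occupation is the user's occupation; Zip-code is the user's area code; the columns from Action to Western are the different movie genres. For example, if the Action field of a record is 4, the user has watched 4 action movies.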
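To make the per-user genre statistics concrete, here is a minimal in-memory sketch in plain Java (not the MapReduce job itself). It assumes each input row is a `{UserID, Genres}` pair and that multiple genres in one field are separated by `|`; both the class name `GenreCountSketch` and the sample rows are hypothetical and only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch: counts how many movies of each genre every user has watched. */
class GenreCountSketch {

    /**
     * @param rows each row is {userId, genres}; genres are assumed to be joined by "|"
     * @return userId -> (genre -> number of watched movies of that genre)
     */
    static Map<String, Map<String, Integer>> countGenres(String[][] rows) {
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (String[] row : rows) {
            Map<String, Integer> perUser =
                    counts.computeIfAbsent(row[0], id -> new LinkedHashMap<>());
            // "|" is a regex metacharacter, so it has to be escaped for split()
            for (String genre : row[1].split("\\|")) {
                perUser.merge(genre, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from the real dataset
        String[][] rows = {
                {"1", "Action|Comedy"},
                {"1", "Action"},
                {"2", "Romance"}
        };
        System.out.println(countGenres(rows));
    }
}
```

Each inner map corresponds to one row of the table described above: a missing genre simply means a count of 0 for that user.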

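Here we use MapReduce programming and the KNN algorithm to build classifiers from the genre statistics of users of known gender, evaluate their classification results, and pick the best-performing classifier to classify users of unknown gender.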
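One simple way to compare classifiers is accuracy, the share of test users whose predicted gender matches the known one. The sketch below is an assumption about how such a score could be computed locally; the class name `AccuracySketch` and the label arrays are made up for illustration and are not the article's MapReduce evaluation code.

```java
/** Illustrative sketch: accuracy = correctly predicted samples / all samples. */
class AccuracySketch {

    static double accuracy(int[] actual, int[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("Label arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    public static void main(String[] args) {
        // Made-up labels: 1 = female, 0 = male, following the encoding described above
        int[] actual    = {1, 0, 1, 0, 1, 1, 0, 0, 1, 0};
        int[] predicted = {1, 0, 1, 1, 1, 0, 0, 0, 0, 0};
        System.out.println(accuracy(actual, predicted));
    }
}
```

Running the same measurement for several values of K and keeping the K with the highest score is one straightforward way to select the best-performing classifier.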

Getting to know the KNN algorithm

Introduction to the KNN algorithm

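The KNN algorithm, also known as the k-nearest neighbors classification algorithm, is theoretically mature and one of the simplest classification algorithms. "K nearest neighbors" means exactly that: each sample can be represented by its K closest neighbors. The idea is that if the majority of the K most similar samples in the feature space belong to one class, the sample in question is assigned to that class as well. The neighbors used in KNN are all samples that have already been correctly classified, and the classification decision depends only on the classes of the nearest one or few samples.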
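The majority-vote idea can be sketched in a few lines of plain Java. This is a minimal single-machine illustration, not the article's MapReduce implementation; it assumes Euclidean distance as the similarity measure, 0/1 labels, and that ties go to label 0. The class name `KnnSketch` and the toy vectors are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative in-memory KNN sketch with Euclidean distance and majority vote. */
class KnnSketch {

    /** Euclidean distance between two feature vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Predicts a 0/1 label for query from the k nearest labeled training samples. */
    static int classify(double[][] trainX, int[] trainY, double[] query, int k) {
        Integer[] idx = new Integer[trainX.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // Sort sample indices by distance to the query, nearest first
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(trainX[i], query)));

        int neighbours = Math.min(k, idx.length);
        int votesForOne = 0;
        for (int i = 0; i < neighbours; i++) {
            if (trainY[idx[i]] == 1) {
                votesForOne++;
            }
        }
        // Strict majority wins; a tie falls back to label 0 in this sketch
        return (2 * votesForOne > neighbours) ? 1 : 0;
    }

    public static void main(String[] args) {
        // Made-up two-genre count vectors, e.g. {Action, Romance}
        double[][] trainX = { {4, 0}, {5, 1}, {0, 4}, {1, 5} };
        int[] trainY = { 0, 0, 1, 1 };
        System.out.println(classify(trainX, trainY, new double[] {1, 4}, 3));
    }
}
```

Sorting all training samples is fine for a toy example; on large data the distance computation is the part that a MapReduce job would distribute.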

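For a more detailed explanation, here is a well-written article: KNN算法原理_一骑走烟尘的博客-CSDN博客_knn算法原理

This process uses three datasets: the users' movie ratings ratings.dat, the user data with known gender users.dat, and the movie information movies.dat.

Part of the ratings data ratings.dat is shown in the figure below. It has four fields: UserID (user ID), MovieID (movie ID), Rating, and Timestamp. UserID ranges from 1 to 6040, MovieID ranges from 1 to 3925, and Rating uses a five-point scale, with 5 the highest score and 1 the lowest.

Part of the user data with known gender, users.dat, is shown in the figure below. It has five fields: UserID (user ID), Gender, Age, Occupation, and Zip-code (area code). The Occupation field represents 21 different occupation types. Age records an age bracket rather than the user's actual age; for example, 1 means under 18.

Part of the movie data movies.dat is shown in the figure below. It has three fields: MovieID (movie ID), Title, and Genres. The Title field records when the movie was released as well as its name. The data covers 18 movie genres in total, including comedy, action, crime, and romance.

step2: Data transformation

Our goal is to predict a user's gender from movie genres. In other words, we need to know which genres of movies the user has watched most, so we count the genres of the movies each user has watched. No single dataset holds this directly, so the required information has to be extracted from the three datasets, as shown below:

Data transformation is the process of converting data from one representation to another; it is mainly about finding a feature representation of the data. Here the site's user information and viewing records are transformed into per-user statistics of watched movie genres. The approach is: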

(1) Join ratings.dat and users.dat on the UserID field, producing data that contains UserID (user ID), Gender, Age, Occupation, Zip-code (area code), and MovieID (movie ID).
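Before looking at the Hadoop classes, the join logic can be tried out locally. The sketch below loads the smaller table (users) into a hash map and looks up each rating by UserID, which is the same map-side join idea in plain Java. It assumes `::` as the field separator; the class name `UserRatingJoinSketch` and the sample lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of a hash join between user lines and rating lines on UserID. */
class UserRatingJoinSketch {

    static List<String> join(List<String> userLines, List<String> ratingLines) {
        // Build side: UserID -> full user line
        Map<String, String> usersById = new HashMap<>();
        for (String userLine : userLines) {
            usersById.put(userLine.split("::")[0], userLine);
        }

        // Probe side: fields[0] = UserID, fields[1] = MovieID
        List<String> joined = new ArrayList<>();
        for (String ratingLine : ratingLines) {
            String[] fields = ratingLine.split("::");
            String userLine = usersById.get(fields[0]);
            if (userLine != null) {
                // Keep the user fields and append only the MovieID
                joined.add(userLine + "::" + fields[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not taken from the real dataset
        List<String> users = Arrays.asList("1::F::1::10::00000");
        List<String> ratings = Arrays.asList("1::100::5::0", "9::200::3::0");
        System.out.println(join(users, ratings));
    }
}
```

Ratings whose UserID has no match are dropped in this sketch, so the output contains only complete user records.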

Code implementation

GlobalUtility (custom class)

import org.apache.hadoop.conf.Configuration;

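// Builds one shared Configuration that points at the cluster's HDFS
public class GlobalUtility
{
    private static Configuration conf = null;
    private static final String DFS = "fs.defaultFS";
    private static final String DFS_IMPL = "fs.hdfs.impl";

    // Lazily creates the Configuration on first use
    public static synchronized Configuration getConf()
    {
        if (conf == null)
        {
            conf = new Configuration();
            // NameNode address of the cluster used in this article
            conf.set(DFS,"hdfs://master:8020");
            conf.set(DFS_IMPL,"org.apache.hadoop.hdfs.DistributedFileSystem");
        }
        return conf;
    }
}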

UserAndRatingMapper


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class UserAndRatingMapper extends Mapper<LongWritable, Text,Text, NullWritable> {
    // UserID -> full users.dat line, loaded once per map task from the cache file
    private final Map<String,String> userInfoMap = new HashMap<>();
    private final Text keyOut = new Text();
    private String splitter = null;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        splitter = conf.get("SPLITTER");

        URI[] uris = context.getCacheFiles();
        if (uris == null)
        {
            return;
        }
        for (URI path:uris)
        {
            if (path.getPath().endsWith("users.dat"))
            {
                // try-with-resources closes the reader and the underlying HDFS stream
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(new Path(path)),"utf-8")))
                {
                    String line;
                    while ((line = reader.readLine())!= null)
                    {
                        String[] strs = line.split(splitter);
                        userInfoMap.put(strs[0],line);
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // ratings.dat line: strs[0] = UserID, strs[1] = MovieID
        String[] strs = value.toString().split(splitter);
        String userInfo = userInfoMap.get(strs[0]);
        if (userInfo == null)
        {
            // No matching user: skip instead of emitting "null::MovieID"
            return;
        }
        keyOut.set(userInfo+"::"+strs[1]);
        context.write(keyOut,NullWritable.get());
    }
}

UserAndRatingReducer


import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class UserAndRatingReducer extends Reducer<Text, NullWritable,Text,NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        context.write(key,NullWritable.get());
    }
}

UserAndRatingDriver

import MovieUserPredict.preTreat.mapper.UserAndRatingMapper;
import MovieUserPredict.preTreat.reducer.UserAndRatingReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;

import java.net.URI;

public class UserAndRatingDriver extends Configured implements Tool {
    @Override
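    public int run(String[] args) throws Exception {
        // args: [0] task name, [1] input path (ratings.dat), [2] URI of the cached users.dat,
        //       [3] output path, [4] field separator
        Configuration conf = getConf();
        conf.set("SPLITTER",args[4]);
        Job job = Job.getInstance(conf,"user and rating link mission");
        job.addCacheFile(new URI(args[2]));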
        job.setJarByClass(UserAndRatingDriver.class);
        job.setMapperClass(UserAndRatingMapper.class);
        job.setReducerClass(UserAndRatingReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Written as a SequenceFile so the next job can read the Text keys directly
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        Path inputPath = new Path(args[1]);
        FileInputFormat.addInputPath(job,inputPath);
        Path outputPath = new Path(args[3]);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputPath))
        {
            fs.delete(outputPath,true);
        }
        FileOutputFormat.setOutputPath(job,outputPath);

        return job.waitForCompletion(true)?0:2;
    }
}

MainEntrence (unified entry point)


import MovieUserPredict.preTreat.drivers.UserAndRatingDriver;
import MovieUserPredict.preTreat.drivers.User_Rating_movies_Driver;
import org.apache.hadoop.util.ToolRunner;
import sccc.utilities.GlobalUtility;

public class MainEntrence {
    public static void main(String[] args) throws Exception {
        if (args.length < 5)
        {
            System.err.println("Parameters are not correct.");
            System.exit(1);
        }
        // args[0] selects which preprocessing job to run; all args are passed on to the driver
        int exitCode;
        if (args[0].equals("PreTreat_one"))
        {
            exitCode = ToolRunner.run(GlobalUtility.getConf(),new UserAndRatingDriver(),args);
        }else if (args[0].equals("PreTreat_two"))
        {
            exitCode = ToolRunner.run(GlobalUtility.getConf(),new User_Rating_movies_Driver(),args);
        }else
        {
            System.err.println("Unknown task: " + args[0]);
            exitCode = 1;
        }
        // Propagate the job's status to the shell
        System.exit(exitCode);
    }
}

Result:

(2) Join movies.dat with the output of the previous step on MovieID, producing data that contains UserID (user ID), Gender, Age, Occupation, Zip-code (area code), MovieID (movie ID), and Genres.

Code implementation

User_Rating_movies_Mapper

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class User_Rating_movies_Mapper extends Mapper<Text, NullWritable,NullWritable,Text>
{
    private String splitter = null;
    private FileSystem fs = null;
    private FSDataInputStream is = null;
    private Map<String,String> movieInfoMap = new HashMap<>();
    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);
        splitter = conf.get("SPLITTER");
        URI[] uris = context.getCacheFiles();
        for (URI uri :uris)
        {
            if (uri.getPath().endsWith("movies.dat"))
            {
                is = fs.open(new Path(uri));
                BufferedReader reader = new BufferedReader(new InputStreamReader(is,"utf-8"));
                String line = null;
                while ((line = reader.readLine())!=null)
                {
                    // movies.dat line: strs[0] = MovieID, strs[2] = Genres
                    String[] strs = line.split(splitter);
                    movieInfoMap.put(strs[0],strs[2]);
                }
            }
        }
    }

    @Override
    protected void map(Text key, NullWritable value, Context context) throws IOException, InterruptedException
    {
        String line = key.toString();
        String[] strs = line.split(splitter);
        String movieId = strs[strs.length-1];
        Text value_out = new Text(line+splitter+movieI