Finding the Most "Similar" Actors with the k-means Algorithm: A Java Implementation

aGreySky

Category: Data Mining. Tags: k-means, algorithm, data mining, clustering, Java implementation

Article link: https://blog.csdn.net/agreysky/article/details/99565175

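While working on a project today, I ran into the following requirement:

Select several features and use them to represent an actor. Using a reasonable similarity measure, find the two actors of the same gender whose film genres and acting styles are most similar. When a director is casting and actor A cannot take part, actor B with a similar style can be recommended.

The first approach that came to mind was clustering, and the k-means algorithm in particular.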

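1. The data on hand

The most important part of k-means is choosing good features. From the available data, I chose age (derived from birthday), number of works (works_count), number of awards (awards_count), number of fans (fans_count), and genres as features.

The genres feature is text, so it cannot be analyzed directly and needs further processing.

2. Normalizing the data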

public ClustersAndGenreSetList getClusters() throws IOException {
          // 1. Fetch the genre strings of all movies from the database
          ArrayList<String> genresArrayList = (ArrayList<String>) movieRepository.findAllMovieGenres();
          // A HashSet keeps each genre only once
          HashSet<String> genreSet = new HashSet<>();
          for (String genres : genresArrayList) {
               // Skip empty entries and reality shows ("真人秀")
               if (genres.equals("") || genres.equals("真人秀"))
                    continue;
               // Split the comma-separated string into single genres and add them to the set
               String[] genreArray = genres.split(",");
               for (String genre : genreArray) {
                    genreSet.add(genre);
               }
          }
          FEATURE_NUMBER = 4 + genreSet.size();
          // Copy the set into a list so that every genre gets a fixed index
          ArrayList<String> genreSetList = new ArrayList<>(genreSet);

          // 2. Load all actors and build their feature vectors (including per-genre counts)
          DataList dataList = new DataList();
          ArrayList<Actor> actorArrayList = (ArrayList<Actor>) actorRepository.findAll();
          for (int i = 0; i < actorArrayList.size(); i++) {
               // Build the data of one actor
               Data data = getActorData(i, actorArrayList.get(i), genreSetList);
               dataList.getDatas().add(data);
          }
          ClusterList clusterList = main.run(dataList);
          return new ClustersAndGenreSetList(clusterList.getClusters(), genreSetList);
     }

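First, the number of works in each genre has to be counted for every actor.

(1) Get all genres in the database

There are 25 film genres:
Comedy, Erotica, Sci-Fi, Sports, Horror, Disaster, LGBT, Crime, Animation, Biography, Documentary, Thriller, Adventure, Fantasy, History, Mystery, Costume Drama, Music, Drama, Short Film, Wuxia, Romance, Family, War, Action

This matches the expected result.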
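
The genre-collection step can be tried without a database. The following is a minimal sketch, not the project's code: the class name GenreSetSketch, the method distinctGenres, and the sample strings are made up for illustration. It swaps the HashSet for a LinkedHashSet so that the genre order, and therefore the meaning of each vector dimension, is the same on every run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class GenreSetSketch {
    /** Splits comma-separated genre strings and returns the distinct genres in first-seen order. */
    static List<String> distinctGenres(List<String> rawGenres) {
        Set<String> seen = new LinkedHashSet<>();
        for (String raw : rawGenres) {
            if (raw == null || raw.isEmpty()) {
                continue; // nothing to split
            }
            for (String genre : raw.split(",")) {
                String trimmed = genre.trim();
                if (!trimmed.isEmpty()) {
                    seen.add(trimmed); // duplicates are ignored by the set
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, one string per movie
        List<String> raw = Arrays.asList("Comedy,Romance", "", "Action,Comedy", "Drama");
        System.out.println(distinctGenres(raw)); // prints [Comedy, Romance, Action, Drama]
    }
}
```

With a plain HashSet the iteration order is unspecified, so a genre list saved from one run should not be assumed to match the vectors of another.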

(2) Collect the data for each actor

// Build the feature vector of one actor
     private Data getActorData(int j, Actor actor, ArrayList<String> genreSetList) {
          // Dimension = number of genres + 4
          Vector<Double> vector = new Vector<>(FEATURE_NUMBER);
          ArrayList<String> genresArrayList = (ArrayList<String>) movieRepository.findMovieGenresByActorId(actor.getId());
          // Basic attributes, rescaled to comparable magnitudes
          vector.add(((double) actor.getWorksCount()) / 10);
          vector.add((double) actor.getAwardsCount());
          vector.add(((double) actor.getFansCount()) / 1000);
          // Age = current year - birth year (birthday is formatted as "yyyy-...")
          vector.add(Calendar.getInstance().get(Calendar.YEAR) - Double.valueOf(actor.getBirthday().split("-")[0]));
          // Fill the genre dimensions with zeros
          for (int i = 4; i < FEATURE_NUMBER; i++) {
               vector.add(0.0);
          }
          for (String genres : genresArrayList) {
               // Skip empty entries and reality shows ("真人秀")
               if (genres.equals("") || genres.equals("真人秀"))
                    continue;
               // Split into single genres
               String[] genreArray = genres.split(",");
               for (String genre : genreArray) {
                    int index = genreSetList.indexOf(genre);
                    // indexOf returns -1 for an unknown genre; without this check
                    // index + 4 would be 3 and the age dimension would be incremented
                    if (index < 0)
                         continue;
                    vector.set(index + 4, vector.get(index + 4) + 1);
               }
          }
          return new Data(j, vector, actor.getId());
     }

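Each actor has 29 features (25 genres + 4 basic attributes), stored in a Vector.

The first 4 dimensions of the vector are number of works (works_count), number of awards (awards_count), number of fans (fans_count), and age (derived from birthday).

To keep any single feature from dominating the analysis, the first 4 features are rescaled: number of works / 10, number of awards (unchanged), number of fans / 1000, and birth date -> age.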
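
The divisors 10 and 1000 are picked by hand. One common alternative, shown here only as a sketch and not as what this project does, is min-max scaling, which maps every column to [0, 1] based on the values actually present. The class and method names are hypothetical; the three sample rows are the raw works, awards, fans, and age values recovered from the first three printed vectors in this post.

```java
import java.util.Arrays;

class MinMaxScalingSketch {
    /** Rescales each column of rows to [0, 1]; a constant column becomes all zeros. */
    static double[][] minMaxScale(double[][] rows) {
        int n = rows.length;
        if (n == 0) {
            return new double[0][0];
        }
        int d = rows[0].length;
        double[][] scaled = new double[n][d];
        for (int j = 0; j < d; j++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : rows) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            for (int i = 0; i < n; i++) {
                // Guard against division by zero when all values in the column are equal
                scaled[i][j] = range == 0.0 ? 0.0 : (rows[i][j] - min) / range;
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Columns: works, awards, fans, age
        double[][] rows = {
            {87, 3, 4875, 45},
            {81, 14, 18124, 62},
            {104, 10, 4202, 56}
        };
        for (double[] row : minMaxScale(rows)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Min-max scaling depends on the data set, so adding or removing actors changes the scaled values; fixed divisors do not have that property.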

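Genre counting works as follows: fetch the genres of all of the actor's films and scan them; for each genre, look up its index in the genre list and update the corresponding position in the actor's vector (the last 25 dimensions follow the order of the genre list). Every time a genre is scanned, the matching dimension is incremented by 1.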
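
The counting idea can be illustrated in isolation. This is a minimal sketch with hypothetical names (GenreCountSketch, genreCounts) and made-up sample data; it returns only the genre part of the vector and skips genres that are missing from the index.

```java
import java.util.Arrays;
import java.util.List;

class GenreCountSketch {
    /**
     * Counts how often each indexed genre appears in an actor's movies.
     * genreIndex fixes the position of every genre; movieGenres holds one
     * comma-separated genre string per movie.
     */
    static double[] genreCounts(List<String> genreIndex, List<String> movieGenres) {
        double[] counts = new double[genreIndex.size()];
        for (String raw : movieGenres) {
            if (raw == null || raw.isEmpty()) {
                continue;
            }
            for (String genre : raw.split(",")) {
                int idx = genreIndex.indexOf(genre.trim());
                if (idx >= 0) {
                    counts[idx] += 1.0; // genres not in the index are ignored
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("Comedy", "Drama", "Action");
        List<String> movies = Arrays.asList("Comedy,Drama", "Drama", "Musical,Action");
        System.out.println(Arrays.toString(genreCounts(index, movies))); // prints [1.0, 2.0, 1.0]
    }
}
```

List.indexOf is a linear scan; with 25 genres that is negligible, but a Map from genre to index would be the usual choice for a much larger vocabulary.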

Printing the data of one actor gives:

Data: 0, actorId: 1000525,vector:[8.7, 3.0, 4.875, 45.0, 3.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 2.0, 0.0, 3.0, 4.0]

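Here 8.7 means 87 works, 3.0 means 3 awards, 4.875 means 4,875 fans, 45.0 means age 45, the next 3.0 means 3 comedies, and so on.

This matches the expected result, so the analysis can finally begin.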

3. Running the k-means algorithm

public static ClusterList run(DataList dataList) throws IOException {
    // Distance metric: Euclidean distance
    DistanceMetric distanceMetric = new EuclideanDistance();
    // The clusterer takes the distance metric and the maximum number of iterations
    Clusterer clusterer = new KMeansClusterer(distanceMetric, ITER);
    // Core step: pass in the data and the number of clusters, get back the finished clusters
    ClusterList clusterList = clusterer.runKMeansClustering(dataList, K);
    System.out.println(clusterList);
    // Write the result to a file
    OutPutFile.outputClusterAndContent("result/cluster"+K,clusterList);
    return clusterList;
}

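This project uses the Euclidean distance to measure the distance between vectors.

Euclidean distance formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Computing it for two vectors: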

// Compute the Euclidean distance between two vectors of equal length
    public static double getEuclideanDistance(Vector<Double> vector1, Vector<Double> vector2){
        double euclideanDistance = 0;
        for (int i = 0; i < vector1.size(); i++) {
            euclideanDistance += Math.pow(vector1.get(i) - vector2.get(i), 2);
        }
        return Math.sqrt(euclideanDistance);
    }
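
As a quick sanity check of the distance computation, here is a standalone sketch that uses plain double arrays instead of Vector<Double>. The class name is hypothetical, and the length check is an addition that the listing above does not have.

```java
class EuclideanDistanceSketch {
    /** Euclidean distance between two vectors of the same length. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Classic 3-4-5 right triangle
        System.out.println(distance(new double[]{0, 0}, new double[]{3, 4})); // prints 5.0
    }
}
```

Because every dimension enters the sum with the same weight, a dimension with a large numeric range contributes more to the distance, which is why the features are rescaled before clustering.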

With the data and the distance function in place, k-means can be run:

public ClusterList runKMeansClustering(DataList dataList, int k) {
        ClusterList clusterList = new ClusterList();
        // 1. Reset the allocation state: mark every data point as unallocated
        dataList.clearIsAllocated();

        // 2. Pick one random data point as the center of the first cluster
        int randomDataIndex = new Random().nextInt(dataList.getDatas().size());
        Cluster initCluster = new Cluster(dataList.getDatas().get(randomDataIndex));
        clusterList.getClusters().add(initCluster);
        // 3. Create the remaining k-1 clusters.
        // Each new center is the data point whose distance to its nearest existing center is largest
        while(clusterList.getClusters().size()<k){
            // Create and add the cluster
            clusterList.getClusters().add(createClusterBasedFurthestData(dataList, clusterList));
        }
        // Centers from before the update, used to detect whether an iteration changed them
        ArrayList<Vector> oldCenters = new ArrayList<>();
        // Start iterating
        for (int i = 0; i < iterNum; i++) {
            // Assign every unallocated data point to the cluster with the nearest center
            assignUnallocatedDataPoints(dataList, clusterList);
            // Discard the snapshot of the previous iteration
            oldCenters.clear();
            // Snapshot the current centers
            for (int j =0;j<k;j++)
                oldCenters.add((Vector) clusterList.getClusters().get(j).getCenterVector().clone());
            // Update each center to the per-dimension mean of the cluster's data points
            clusterList.updateCentroids();
            // Centers unchanged: converged, stop iterating
            if (clusterList.sameCenter(oldCenters)){
                return clusterList;
            }
            if (i < iterNum - 1) {
                // Empty the clusters so the data can be reassigned in the next iteration
                clusterList.clearDatas();
            }
            System.out.println("Finished iteration " + (i+1));
        }
        return clusterList;
    }
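
The sameCenter check used in the loop is not shown in this post. A minimal sketch of such a convergence test follows, under the assumption that centers are compared element by element. The names ConvergenceSketch and sameCenters and the tolerance parameter eps are hypothetical; the project may well compare for exact equality.

```java
import java.util.Arrays;
import java.util.List;

class ConvergenceSketch {
    /** Returns true if every coordinate of every center moved by at most eps. */
    static boolean sameCenters(List<double[]> oldCenters, List<double[]> newCenters, double eps) {
        if (oldCenters.size() != newCenters.size()) {
            return false;
        }
        for (int c = 0; c < oldCenters.size(); c++) {
            double[] before = oldCenters.get(c);
            double[] after = newCenters.get(c);
            if (before.length != after.length) {
                return false;
            }
            for (int i = 0; i < before.length; i++) {
                if (Math.abs(before[i] - after[i]) > eps) {
                    return false; // this center moved
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<double[]> before = Arrays.asList(new double[]{1.0, 2.0}, new double[]{3.0, 4.0});
        List<double[]> unchanged = Arrays.asList(new double[]{1.0, 2.0}, new double[]{3.0, 4.0});
        List<double[]> moved = Arrays.asList(new double[]{1.0, 2.5}, new double[]{3.0, 4.0});
        System.out.println(sameCenters(before, unchanged, 1e-9)); // prints true
        System.out.println(sameCenters(before, moved, 1e-9));     // prints false
    }
}
```

A small tolerance avoids extra iterations caused by floating-point noise; setting eps to 0 gives an exact comparison.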

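1. Randomly pick the vector of one data point from the data set (dataList) as the center of the first cluster, create that cluster, and add it to the cluster list.

2. Initialize the remaining k-1 clusters (k is the predefined number of clusters, 4 in this example).

The centers of the remaining clusters are chosen by the following rule:

max(min(dis1, dis2, ...), min(dis1, dis2, ...), ...)

That is, the vector of the data point that is farthest from the existing cluster centers becomes the center of the new cluster.

The "distance" in "farthest" is the minimum of the point's distances to the existing centers.

Finding the farthest point:

(1) Method that returns the "distance" between a vector and the cluster list (the distance to its nearest center)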

double calcDistance(Data data, ClusterList clusterList){
    double distance = Double.MAX_VALUE;
    for (Cluster cluster:clusterList.getClusters()){
        distance=Math.min(distance,calcDistance(data,cluster));
    }
    return distance;
}

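(2) Method that iterates over all vectors and returns the farthest one (this refers to the cluster list)

public Data findFurthestData(DistanceMetric distanceMetric, DataList dataList) {
        // Double.MIN_VALUE is the smallest positive double, so a distance of 0 could never
        // win the comparison; start from negative infinity instead
        double furthestDistance = Double.NEGATIVE_INFINITY;
        Data furthestData = null;
        for (Data data : dataList.getDatas()){
            if(!data.isAllocated()){
                // Distance to the nearest existing center
                double dataDistance = distanceMetric.calcDistance(data, this);
                if (furthestDistance<dataDistance){
                    furthestDistance = dataDistance;
                    furthestData = data;
                }
            }
        }
        return furthestData;
    }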
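
The max-min seeding rule can be tried on plain arrays. This is a minimal sketch with hypothetical names (FarthestFirstSketch, chooseSeeds); it takes the first seed index as a parameter instead of drawing it at random so that the result is reproducible, and it assumes k is not larger than the number of points.

```java
import java.util.Arrays;

class FarthestFirstSketch {
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the indices of k seed points. The first seed is given; each further
     * seed is the point whose distance to its nearest already-chosen seed is largest.
     */
    static int[] chooseSeeds(double[][] points, int k, int firstIndex) {
        int[] seeds = new int[k];
        seeds[0] = firstIndex;
        for (int s = 1; s < k; s++) {
            int best = -1;
            double bestDistance = Double.NEGATIVE_INFINITY;
            for (int p = 0; p < points.length; p++) {
                // min(dis1, dis2, ...): distance to the nearest chosen seed
                double nearest = Double.MAX_VALUE;
                for (int t = 0; t < s; t++) {
                    nearest = Math.min(nearest, distance(points[p], points[seeds[t]]));
                }
                // max(...): keep the point with the largest such distance
                if (nearest > bestDistance) {
                    bestDistance = nearest;
                    best = p;
                }
            }
            seeds[s] = best;
        }
        return seeds;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {1, 0}, {10, 0}, {5, 0}};
        System.out.println(Arrays.toString(chooseSeeds(points, 3, 0))); // prints [0, 2, 3]
    }
}
```

A point that is already a seed has distance 0 to itself, so it is only picked again if every remaining point coincides with an existing seed.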

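3. Assign every data point to a cluster. Assignment rule:

min(dis1, dis2, ...)

Each point goes to the nearest cluster (distance = Euclidean distance between the cluster center and the vector).

(1) Compute the distance to every cluster and return the nearest one (iterate over the clusters)

public Cluster findNearestCluster(DistanceMetric distanceMetric, Data data) {
        Cluster nearestCluster = null;
        double nearestDistance = Double.MAX_VALUE;
        for (Cluster cluster:this.getClusters()){
            // Distance between the data point and this cluster's center
            double clusterDistance = distanceMetric.calcDistance(data,cluster);
            if (clusterDistance<nearestDistance){
                nearestDistance = clusterDistance;
                nearestCluster = cluster;
            }
        }
        return nearestCluster;
    }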

(2) Add every data point to a cluster (iterate over the data): each unallocated data point is passed to findNearestCluster from (1) and added to the cluster it returns.
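
The assignment loop itself (assignUnallocatedDataPoints in runKMeansClustering) is not listed in this post. A minimal sketch of what such a step does, using plain arrays and hypothetical names (AssignmentSketch, nearestCenter, assign) in place of the project's Data and Cluster classes:

```java
import java.util.ArrayList;
import java.util.List;

class AssignmentSketch {
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Index of the center closest to point; ties go to the lower index. */
    static int nearestCenter(double[] point, double[][] centers) {
        int nearest = -1;
        double nearestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = distance(point, centers[c]);
            if (d < nearestDistance) {
                nearestDistance = d;
                nearest = c;
            }
        }
        return nearest;
    }

    /** Returns, for each center, the indices of the points assigned to it. */
    static List<List<Integer>> assign(double[][] points, double[][] centers) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int c = 0; c < centers.length; c++) {
            clusters.add(new ArrayList<Integer>());
        }
        for (int p = 0; p < points.length; p++) {
            clusters.get(nearestCenter(points[p], centers)).add(p);
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {1, 0}, {9, 0}, {10, 0}};
        double[][] centers = {{0, 0}, {10, 0}};
        System.out.println(assign(points, centers)); // prints [[0, 1], [2, 3]]
    }
}
```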

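4. Update the cluster centers

Update rule: average the vectors of all data in the cluster and use the result as the new center.

(1) Iterate over the cluster list and update each center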

public void updateCentroids() {
        for (int i = 0;i<this.getClusters().size();i++) {
            this.getClusters().get(i).updateCentroid();
        }
    }

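(2) Iterate over the data in the cluster, sum the vectors, and divide by the cluster size

/** Update the centroid of this cluster */
    public void updateCentroid() {
        int size = dataList.getDatas().size();
        // An empty cluster has no mean; keep the previous center instead of dividing by zero
        if (size == 0) {
            return;
        }
        centerVector = new Vector<Double>(featureNum);
        for (int i = 0; i < featureNum; i++) {
            centerVector.add(0.0);
        }
        // Sum the vectors of all data in this cluster
        for (Data data : dataList.getDatas()) {
            centerVector = VectorUtil.sum(centerVector,data.getVector());
        }
        // Divide by the cluster size to get the per-dimension mean
        centerVector = VectorUtil.divide(centerVector, size);
    }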
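
The averaging step can be checked on a tiny example. This sketch uses plain arrays and hypothetical names (CentroidSketch, mean); the fallback parameter stands in for the previous center and is returned when the cluster has no members.

```java
import java.util.Arrays;

class CentroidSketch {
    /** Per-dimension mean of members, or a copy of fallback if members is empty. */
    static double[] mean(double[][] members, double[] fallback) {
        if (members.length == 0) {
            return fallback.clone();
        }
        double[] sum = new double[members[0].length];
        for (double[] member : members) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += member[i];
            }
        }
        for (int i = 0; i < sum.length; i++) {
            sum[i] /= members.length;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] members = {{1, 2}, {3, 6}};
        System.out.println(Arrays.toString(mean(members, new double[]{0, 0}))); // prints [2.0, 4.0]
    }
}
```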

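5. Check whether the cluster centers have changed. If they have, iterate again from step 3 until the centers no longer change (or the iteration limit is reached). The algorithm is then complete and yields 4 clusters.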

Cluster 0
Data: 0, actorId: 1000525,vector:[8.7, 3.0, 4.875, 45.0, 3.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 2.0, 0.0, 3.0, 4.0]
Data: 1, actorId: 1000905,vector:[8.1, 14.0, 18.124, 62.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 7.0, 0.0, 0.0, 2.0, 1.0, 0.0, 5.0]
Data: 3, actorId: 1025194,vector:[10.4, 10.0, 4.202, 56.0, 3.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 1.0, 2.0, 3.0, 1.0, 1.0, 0.0, 0.0, 5.0, 0.0, 2.0, 3.0, 0.0, 0.0, 12.0]
Data: 5, actorId: 1040990,vector:[7.3, 4.0, 4.466, 46.0, 1.0, 0.0, 4.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 3.0, 1.0, 1.0, 1.0, 2.0, 0.0, 0.0, 4.0, 0.0, 0.0, 2.0, 0.0, 1.0, 7.0]
Data: 11, actorId: 1138320,vector:[11.7, 8.0, 18.223, 43.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 4.0, 0.0, 0.0, 1.0, 0.0, 6.0, 0.0, 1.0, 3.0, 0.0, 0.0, 3.0]
Data: 21, actorId: 1275317,vector:[10.4, 1.0, 8.838, 51.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 6.0, 1.0, 0.0, 3.0, 0.0, 3.0, 2.0]
Cluster 1
Data: 2, actorId: 1025141,vector:[3.6, 4.0, 27.128, 40.0, 5.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 7.0, 1.0, 0.0, 8.0, 1.0, 0.0, 1.0]
Data: 4, actorId: 1027798,vector:[11.0, 12.0, 29.32, 45.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0, 10.0, 0.0, 0.0, 5.0, 0.0, 1.0, 0.0]
Data: 6, actorId: 1048026,vector:[12.4, 5.0, 37.569, 57.0, 10.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 6.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 3.0, 1.0, 0.0, 3.0]
Cluster 2
Data: 7, actorId: 1050059,vector:[13.5, 7.0, 9.996, 38.0, 2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 3.0, 1.0, 1.0, 0.0, 0.0, 6.0, 1.0, 0.0, 1.0, 0.0, 2.0, 6.0]
Data: 8, actorId: 1052359,vector:[13.2, 14.0, 8.364, 33.0, 4.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 9.0, 0.0, 1.0, 8.0, 0.0, 0.0, 2.0]
Data: 12, actorId: 1259866,vector:[4.4, 1.0, 4.798, 37.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 1.0, 4.0, 1.0, 0.0, 7.0, 1.0, 0.0, 8.0, 0.0, 0.0, 4.0]
Data: 13, actorId: 1274223,vector:[9.0, 0.0, 2.828, 27.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 7.0, 2.0, 0.0, 2.0, 1.0, 0.0, 0.0]
Data: 14, actorId: 1274224,vector:[7.9, 4.0, 9.179, 27.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 1.0, 7.0, 1.0, 1.0, 3.0, 0.0, 0.0, 1.0]
Data: 15, actorId: 1274225,vector:[3.7, 0.0, 2.553, 31.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 2.0, 0.0, 3.0, 0.0, 11.0, 0.0, 0.0, 6.0, 1.0, 2.0, 0.0]
Data: 16, actorId: 1274235,vector:[8.5, 6.0, 6.704, 40.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 1.0, 6.0, 1.0, 1.0, 3.0, 0.0, 1.0, 3.0]
Data: 17, actorId: 1274388,vector:[8.0, 3.0, 2.545, 35.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 4.0, 0.0, 1.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0]
Data: 18, actorId: 1274514,vector:[3.2, 0.0, 4.533, 35.0, 8.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 1.0, 2.0, 0.0, 6.0, 0.0, 1.0, 9.0, 0.0, 0.0, 0.0]
Data: 19, actorId: 1274628,vector:[5.3, 2.0, 3.351, 30.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 5.0, 0.0, 2.0, 1.0, 0.0, 5.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
Data: 20, actorId: 1275243,vector:[5.5, 1.0, 5.841, 31.0, 2.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 3.0, 1.0, 0.0, 5.0, 0.0, 0.0, 3.0, 0.0, 0.0, 1.0]
Data: 22, actorId: 1275721,vector:[11.0, 2.0, 3.043, 41.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 2.0, 4.0, 0.0, 2.0, 3.0, 0.0, 6.0, 0.0, 0.0, 6.0, 0.0, 0.0, 2.0]
Data: 23, actorId: 1314535,vector:[5.7, 0.0, 4.791, 31.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 0.0, 2.0, 2.0, 1.0, 4.0, 2.0, 1.0, 2.0, 0.0, 0.0, 5.0]
Data: 24, actorId: 1315861,vector:[2.8, 2.0, 8.475, 31.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 1.0, 1.0, 0.0, 4.0, 1.0, 0.0, 4.0, 0.0, 1.0, 3.0]
Data: 25, actorId: 1325700,vector:[5.5, 0.0, 2.939, 40.0, 11.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0]
Cluster 3
Data: 9, actorId: 1054424,vector:[28.2, 13.0, 12.947, 58.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 4.0, 1.0, 0.0, 2.0, 0.0, 0.0, 5.0, 0.0, 0.0, 1.0, 0.0, 0.0, 9.0]
Data: 10, actorId: 1054531,vector:[30.7, 8.0, 8.486, 65.0, 11.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 3.0, 5.0, 0.0, 0.0, 0.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 3.0, 1.0, 0.0, 1.0, 0.0, 0.0, 11.0]

The cluster list is returned and the algorithm is complete.
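
The post stops at the clusters. To get from a cluster to a concrete recommendation for actor A, one plausible last step, shown here purely as an assumption about how the result could be used, is to pick the member of A's cluster with the smallest Euclidean distance to A. The names MostSimilarSketch and mostSimilar, the ids, and the vectors are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MostSimilarSketch {
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the id of the cluster member closest to targetId, or null if the
     * target is not in the cluster or is its only member.
     */
    static Long mostSimilar(Map<Long, double[]> cluster, long targetId) {
        double[] target = cluster.get(targetId);
        if (target == null) {
            return null;
        }
        Long bestId = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<Long, double[]> entry : cluster.entrySet()) {
            if (entry.getKey().longValue() == targetId) {
                continue; // do not compare the actor with itself
            }
            double d = distance(target, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                bestId = entry.getKey();
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: actor id -> feature vector
        Map<Long, double[]> cluster = new LinkedHashMap<>();
        cluster.put(1L, new double[]{0, 0});
        cluster.put(2L, new double[]{5, 5});
        cluster.put(3L, new double[]{1, 1});
        System.out.println(mostSimilar(cluster, 1L)); // prints 3
    }
}
```

The same-gender constraint from the requirement is not handled here; it would need the actors to be filtered before clustering or before this lookup.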