Days 51-60: kNN and NB

This post covers machine learning basics: the kNN classifier, the kMeans clustering algorithm, and their use in recommender systems. kNN finds the nearest neighbors by computing distances between instances, while kMeans iteratively searches for cluster centers. The post also presents a recommender system based on M-distance, as well as Naive Bayes classifiers for nominal and numerical data. The algorithms are tested on real datasets and their accuracies are computed.

Day 51: kNN classifier
This code is about 300 lines and is spread over three days. Today, copy the code and get it running; the next two days involve modifying the program. Mastery is required.

Two distance measures.
Random splitting of the data.
Flexible use of indirection: trainingSet and testingSet are both integer arrays holding indices.
Reading an arff file; the weka.jar package is required.
Finding the neighbors.
Voting.
 

package machinelearning.knn;

import java.io.FileReader;
import java.util.Arrays;
import java.util.Random;

import weka.core.*;

/**
 * kNN classification.
 * 
 * @author Fan Min minfanphd@163.com.
 */
public class KnnClassification {

	/**
	 * Manhattan distance.
	 */
	public static final int MANHATTAN = 0;

	/**
	 * Euclidean distance.
	 */
	public static final int EUCLIDEAN = 1;

	/**
	 * The distance measure.
	 */
	public int distanceMeasure = EUCLIDEAN;

	/**
	 * A random instance;
	 */
	public static final Random random = new Random();

	/**
	 * The number of neighbors.
	 */
	int numNeighbors = 7;

	/**
	 * The whole dataset.
	 */
	Instances dataset;

	/**
	 * The training set. Represented by the indices of the data.
	 */
	int[] trainingSet;

	/**
	 * The testing set. Represented by the indices of the data.
	 */
	int[] testingSet;

	/**
	 * The predictions.
	 */
	int[] predictions;

	/**
	 *********************
	 * The first constructor.
	 * 
	 * @param paraFilename
	 *            The arff filename.
	 *********************
	 */
	public KnnClassification(String paraFilename) {
		try {
			FileReader fileReader = new FileReader(paraFilename);
			dataset = new Instances(fileReader);
			// The last attribute is the decision class.
			dataset.setClassIndex(dataset.numAttributes() - 1);
			fileReader.close();
		} catch (Exception ee) {
			System.out.println("Error occurred while trying to read \'" + paraFilename
					+ "\' in KnnClassification constructor.\r\n" + ee);
			System.exit(0);
		} // Of try
	}// Of the first constructor

	/**
	 *********************
	 * Get random indices for data randomization.
	 * 
	 * @param paraLength
	 *            The length of the sequence.
	 * @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
	 *********************
	 */
	public static int[] getRandomIndices(int paraLength) {
		int[] resultIndices = new int[paraLength];

		// Step 1. Initialize.
		for (int i = 0; i < paraLength; i++) {
			resultIndices[i] = i;
		} // Of for i

		// Step 2. Randomly swap.
		int tempFirst, tempSecond, tempValue;
		for (int i = 0; i < paraLength; i++) {
			// Generate two random indices.
			tempFirst = random.nextInt(paraLength);
			tempSecond = random.nextInt(paraLength);

			// Swap.
			tempValue = resultIndices[tempFirst];
			resultIndices[tempFirst] = resultIndices[tempSecond];
			resultIndices[tempSecond] = tempValue;
		} // Of for i

		return resultIndices;
	}// Of getRandomIndices

	/**
	 *********************
	 * Split the data into training and testing parts.
	 * 
	 * @param paraTrainingFraction
	 *            The fraction of the training set.
	 *********************
	 */
	public void splitTrainingTesting(double paraTrainingFraction) {
		int tempSize = dataset.numInstances();
		int[] tempIndices = getRandomIndices(tempSize);
		int tempTrainingSize = (int) (tempSize * paraTrainingFraction);

		trainingSet = new int[tempTrainingSize];
		testingSet = new int[tempSize - tempTrainingSize];

		for (int i = 0; i < tempTrainingSize; i++) {
			trainingSet[i] = tempIndices[i];
		} // Of for i

		for (int i = 0; i < tempSize - tempTrainingSize; i++) {
			testingSet[i] = tempIndices[tempTrainingSize + i];
		} // Of for i
	}// Of splitTrainingTesting

	/**
	 *********************
	 * Predict for the whole testing set. The results are stored in predictions.
	 * @see predictions
	 *********************
	 */
	public void predict() {
		predictions = new int[testingSet.length];
		for (int i = 0; i < predictions.length; i++) {
			predictions[i] = predict(testingSet[i]);
		} // Of for i
	}// Of predict

	/**
	 *********************
	 * Predict for the given instance.
	 * 
	 * @param paraIndex
	 *            The index of the given instance in the dataset.
	 * @return The prediction.
	 *********************
	 */
	public int predict(int paraIndex) {
		int[] tempNeighbors = computeNearests(paraIndex);
		int resultPrediction = simpleVoting(tempNeighbors);

		return resultPrediction;
	}// Of predict

	/**
	 *********************
	 * The distance between two instances.
	 * 
	 * @param paraI
	 *            The index of the first instance.
	 * @param paraJ
	 *            The index of the second instance.
	 * @return The distance.
	 *********************
	 */
	public double distance(int paraI, int paraJ) {
		double resultDistance = 0;
		double tempDifference;
		switch (distanceMeasure) {
		case MANHATTAN:
			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
				tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
				if (tempDifference < 0) {
					resultDistance -= tempDifference;
				} else {
					resultDistance += tempDifference;
				} // Of if
			} // Of for i
			break;

		case EUCLIDEAN:
			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
				tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
				resultDistance += tempDifference * tempDifference;
			} // Of for i
			break;
		default:
			System.out.println("Unsupported distance measure: " + distanceMeasure);
		}// Of switch

		return resultDistance;
	}// Of distance

	/**
	 *********************
	 * Get the accuracy of the classifier.
	 * 
	 * @return The accuracy.
	 *********************
	 */
	public double getAccuracy() {
		// Dividing a double by an int yields a double.
		double tempCorrect = 0;
		for (int i = 0; i < predictions.length; i++) {
			if (predictions[i] == dataset.instance(testingSet[i]).classValue()) {
				tempCorrect++;
			} // Of if
		} // Of for i

		return tempCorrect / testingSet.length;
	}// Of getAccuracy

	/**
	 ************************************
	 * Compute the nearest k neighbors. Select one neighbor in each scan. In
	 * fact we can scan only once. You may implement it by yourself.
	 * 
	 * @param paraCurrent
	 *            The index of the current instance. We are comparing it with
	 *            all others.
	 * @return The indices of the nearest instances.
	 ************************************
	 */
	public int[] computeNearests(int paraCurrent) {
		int[] resultNearests = new int[numNeighbors];
		boolean[] tempSelected = new boolean[trainingSet.length];
		double tempDistance;
		double tempMinimalDistance;
		int tempMinimalIndex = 0;

		// Select the nearest paraK indices.
		for (int i = 0; i < numNeighbors; i++) {
			tempMinimalDistance = Double.MAX_VALUE;

			for (int j = 0; j < trainingSet.length; j++) {
				if (tempSelected[j]) {
					continue;
				} // Of if

				tempDistance = distance(paraCurrent, trainingSet[j]);
				if (tempDistance < tempMinimalDistance) {
					tempMinimalDistance = tempDistance;
					tempMinimalIndex = j;
				} // Of if
			} // Of for j

			resultNearests[i] = trainingSet[tempMinimalIndex];
			tempSelected[tempMinimalIndex] = true;
		} // Of for i

		System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(resultNearests));
		return resultNearests;
	}// Of computeNearests

	/**
	 ************************************
	 * Voting using the instances.
	 * 
	 * @param paraNeighbors
	 *            The indices of the neighbors.
	 * @return The predicted label.
	 ************************************
	 */
	public int simpleVoting(int[] paraNeighbors) {
		int[] tempVotes = new int[dataset.numClasses()];
		for (int i = 0; i < paraNeighbors.length; i++) {
			tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()]++;
		} // Of for i

		int tempMaximalVotingIndex = 0;
		int tempMaximalVoting = 0;
		for (int i = 0; i < dataset.numClasses(); i++) {
			if (tempVotes[i] > tempMaximalVoting) {
				tempMaximalVoting = tempVotes[i];
				tempMaximalVotingIndex = i;
			} // Of if
		} // Of for i

		return tempMaximalVotingIndex;
	}// Of simpleVoting

	/**
	 *********************
	 * The entrance of the program.
	 * 
	 * @param args
	 *            Not used now.
	 *********************
	 */
	public static void main(String args[]) {
		KnnClassification tempClassifier = new KnnClassification("D:/data/iris.arff");
		tempClassifier.splitTrainingTesting(0.8);
		tempClassifier.predict();
		System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
	}// Of main

}// Of class KnnClassification

 

Day 52: kNN classifier (continued)

Today we modify yesterday's code:

Re-implement computeNearests so that the k neighbors are obtained in a single scan of the training set. Hint: combine the existing code with the idea of insertion sort.
Add a setDistanceMeasure() method to select the distance measure.
Add a setNumNeighbors() method to set the number of neighbors.

public int[] computeNearests(int paraCurrent) {
   int[] resultNearests = new int[numNeighbors];
   double tempDistance;

   // Straight insertion sort on (training set position, distance) pairs.
   double[][] tempDistanceArray = new double[trainingSet.length][2];
   tempDistanceArray[0][0] = 0;
   tempDistanceArray[0][1] = distance(paraCurrent, trainingSet[0]);
   int j;
   for (int i = 1; i < trainingSet.length; i++) {
       tempDistance = distance(paraCurrent, trainingSet[i]);
       for (j = i - 1; j >= 0; j--) {
           if (tempDistance < tempDistanceArray[j][1]) {
               // Shift backward. Copy the values instead of assigning the
               // row reference; otherwise the rows would alias each other.
               tempDistanceArray[j + 1][0] = tempDistanceArray[j][0];
               tempDistanceArray[j + 1][1] = tempDistanceArray[j][1];
           } else {
               break;
           }
       }
       tempDistanceArray[j + 1][0] = i;
       tempDistanceArray[j + 1][1] = tempDistance;
   }

   // The first numNeighbors rows now hold the smallest distances.
   for (int i = 0; i < numNeighbors; i++) {
       resultNearests[i] = trainingSet[(int) tempDistanceArray[i][0]];
   }

   System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(resultNearests));
   return resultNearests;
}
public void setDistanceMeasure(int paraType) {
   if (paraType == 0) {
       distanceMeasure = MANHATTAN;
   } else if (paraType == 1) {
       distanceMeasure = EUCLIDEAN;
   } else {
       System.out.println("Wrong Distance Measure.");
   }
}

public void setNumNeighbors(int paraNumNeighbors) {
   if (paraNumNeighbors > dataset.numInstances()) {
       System.out.println("The number of neighbors is too big.");
       return;
   }

   numNeighbors = paraNumNeighbors;
}


/**
*********************
* The entrance of the program.
* 
* @param args
*            Not used now.
*********************
*/
public static void main(String args[]) {
	KnnClassification tempClassifier = new KnnClassification("D:/data/iris.arff");
	tempClassifier.setDistanceMeasure(0);
	tempClassifier.setNumNeighbors(5);
	tempClassifier.splitTrainingTesting(0.8);
	tempClassifier.predict();
	System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
}

Day 53: kNN classifier (continued)

  1. Add a weightedVoting() method: the shorter the distance, the larger the say. Support at least two weighting schemes (a second scheme is sketched after the code).
  2. Implement leave-one-out testing.
    public int weightedVoting(int paraCurrent, int[] paraNeighbors) {
         double[] tempVotes = new double[dataset.numClasses()];
    
         double tempDistance;
         // Weight = b / (distance + a); larger a flattens the weights.
         // Note that b rescales all votes uniformly, so it cannot change
         // the voting result.
         int a = 2, b = 1;
        
         for (int i = 0; i < paraNeighbors.length; i++) {
             tempDistance = distance(paraCurrent, paraNeighbors[i]);
             tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()]
                     += getWeightedNum(a, b, tempDistance);
         }
    
         int tempMaximalVotingIndex = 0;
         double tempMaximalVoting = 0;
         for (int i = 0; i < dataset.numClasses(); i++) {
             if (tempVotes[i] > tempMaximalVoting) {
                 tempMaximalVoting = tempVotes[i];
                 tempMaximalVotingIndex = i;
             }
         }
    
         return tempMaximalVotingIndex;
     }
    
     public double getWeightedNum(int a, int b, double paraDistance) {
         return b / (paraDistance + a);
     }
    
     public void leave_one_out() {
     	// Leave-one-out cross validation.
         int tempSize = dataset.numInstances();
         int[] tempIndices = getRandomIndices(tempSize);
         double tempCorrect = 0;
         for (int i = 0; i < tempSize; i++) {
             trainingSet = new int[tempSize - 1];
             testingSet = new int[1];
    
             int tempIndex = 0;
             for (int j = 0; j < tempSize; j++) {
                 if (j == i) {
                     continue;
                 }
                 trainingSet[tempIndex++] = tempIndices[j];
             }
    
             testingSet[0] = tempIndices[i];
    
             this.predict();
    
             if (predictions[0] == dataset.instance(testingSet[0]).classValue()) {
                 tempCorrect++;
             }
         }
    
         System.out.println("The accuracy is:" + tempCorrect / tempSize);
     }
    
     public static void main(String[] args) {
         KnnClassification tempClassifier = new KnnClassification("D:\\data\\iris.arff");
         tempClassifier.setDistanceMeasure(0);
         tempClassifier.setNumNeighbors(5);
         tempClassifier.splitTrainingTesting(0.8);
         tempClassifier.predict();
         System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
    
         // Test leave-one-out.
         System.out.println("\r\n-------leave_one_out-------");
         tempClassifier.leave_one_out();
     }
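
     The task asks for at least two weighting schemes, while the code above implements only one (weight = b / (distance + a)). Below is a minimal sketch of a second scheme, a Gaussian kernel weight; the method name getGaussianWeight and the paraSigma parameter are my own additions, and the method could be called inside weightedVoting() in place of getWeightedNum():

     public double getGaussianWeight(double paraDistance, double paraSigma) {
         // The weight decays exponentially with the squared distance;
         // paraSigma is a free parameter controlling the decay rate.
         return Math.exp(-paraDistance * paraDistance / (2 * paraSigma * paraSigma));
     }// Of getGaussianWeight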
    

     Day 54: Recommendation based on M-distance

     The M-distance between two items is the difference between their average ratings; the neighbors of an item are the items whose average rating differs from its own by less than a given radius (delta). For example, if item A's average rating is 3.2 and item B's is 3.4, their M-distance is 0.2, so they are neighbors under radius 0.3.

    package machinelearning.knn;
    
    /**
     * Recommendation with M-distance.
     * @author Fan Min minfanphd@163.com.
     */
    
    import java.io.*;
    
    public class MBR {
    
    	/**
    	 * Default rating for 1-5 points.
    	 */
    	public static final double DEFAULT_RATING = 3.0;
    
    	/**
    	 * The total number of users.
    	 */
    	private int numUsers;
    
    	/**
    	 * The total number of items.
    	 */
    	private int numItems;
    
    	/**
    	 * The total number of ratings (non-zero values)
    	 */
    	private int numRatings;
    
    	/**
    	 * The predictions.
    	 */
    	private double[] predictions;
    
    	/**
    	 * Compressed rating matrix. User-item-rating triples.
    	 */
    	private int[][] compressedRatingMatrix;
    
    	/**
     	 * The degree of each user (how many items the user has rated).
    	 */
    	private int[] userDegrees;
    
    	/**
    	 * The average rating of the current user.
    	 */
    	private double[] userAverageRatings;
    
    	/**
     	 * The degree of each item (how many users have rated it).
    	 */
    	private int[] itemDegrees;
    
    	/**
    	 * The average rating of the current item.
    	 */
    	private double[] itemAverageRatings;
    
    	/**
     	 * The first user starts from 0. If the first user has x ratings, the
     	 * second user starts from x.
    	 */
    	private int[] userStartingIndices;
    
    	/**
    	 * Number of non-neighbor objects.
    	 */
    	private int numNonNeighbors;
    
    	/**
    	 * The radius (delta) for determining the neighborhood.
    	 */
    	private double radius;
    
    	/**
    	 ************************* 
    	 * Construct the rating matrix.
    	 * 
     	 * @param paraFilename
    	 *            the rating filename.
    	 * @param paraNumUsers
    	 *            number of users
    	 * @param paraNumItems
    	 *            number of items
    	 * @param paraNumRatings
    	 *            number of ratings
    	 ************************* 
    	 */
    	public MBR(String paraFilename, int paraNumUsers, int paraNumItems, int paraNumRatings) throws Exception {
    		// Step 1. Initialize these arrays
    		numItems = paraNumItems;
    		numUsers = paraNumUsers;
    		numRatings = paraNumRatings;
    
    		userDegrees = new int[numUsers];
    		userStartingIndices = new int[numUsers + 1];
    		userAverageRatings = new double[numUsers];
    		itemDegrees = new int[numItems];
    		compressedRatingMatrix = new int[numRatings][3];
    		itemAverageRatings = new double[numItems];
    
    		predictions = new double[numRatings];
    
    		System.out.println("Reading " + paraFilename);
    
    		// Step 2. Read the data file.
    		File tempFile = new File(paraFilename);
    		if (!tempFile.exists()) {
    			System.out.println("File " + paraFilename + " does not exists.");
    			System.exit(0);
    		} // Of if
    		BufferedReader tempBufReader = new BufferedReader(new FileReader(tempFile));
    		String tempString;
    		String[] tempStrArray;
    		int tempIndex = 0;
    		userStartingIndices[0] = 0;
    		userStartingIndices[numUsers] = numRatings;
    		while ((tempString = tempBufReader.readLine()) != null) {
    			// Each line has three values
    			tempStrArray = tempString.split(",");
    			compressedRatingMatrix[tempIndex][0] = Integer.parseInt(tempStrArray[0]);
    			compressedRatingMatrix[tempIndex][1] = Integer.parseInt(tempStrArray[1]);
    			compressedRatingMatrix[tempIndex][2] = Integer.parseInt(tempStrArray[2]);
    
    			userDegrees[compressedRatingMatrix[tempIndex][0]]++;
    			itemDegrees[compressedRatingMatrix[tempIndex][1]]++;
    
    			if (tempIndex > 0) {
    				// Starting to read the data of a new user.
    				if (compressedRatingMatrix[tempIndex][0] != compressedRatingMatrix[tempIndex - 1][0]) {
    					userStartingIndices[compressedRatingMatrix[tempIndex][0]] = tempIndex;
    				} // Of if
    			} // Of if
    			tempIndex++;
    		} // Of while
    		tempBufReader.close();
    
    		double[] tempUserTotalScore = new double[numUsers];
    		double[] tempItemTotalScore = new double[numItems];
    		for (int i = 0; i < numRatings; i++) {
    			tempUserTotalScore[compressedRatingMatrix[i][0]] += compressedRatingMatrix[i][2];
    			tempItemTotalScore[compressedRatingMatrix[i][1]] += compressedRatingMatrix[i][2];
    		} // Of for i
    
    		for (int i = 0; i < numUsers; i++) {
    			userAverageRatings[i] = tempUserTotalScore[i] / userDegrees[i];
    		} // Of for i
    		for (int i = 0; i < numItems; i++) {
    			itemAverageRatings[i] = tempItemTotalScore[i] / itemDegrees[i];
    		} // Of for i
    	}// Of the first constructor
    
    	/**
    	 ************************* 
    	 * Set the radius (delta).
    	 * 
    	 * @param paraRadius
    	 *            The given radius.
    	 ************************* 
    	 */
    	public void setRadius(double paraRadius) {
    		if (paraRadius > 0) {
    			radius = paraRadius;
    		} else {
    			radius = 0.1;
    		} // Of if
    	}// Of setRadius
    
    	/**
    	 ************************* 
    	 * Leave-one-out prediction. The predicted values are stored in predictions.
    	 * 
    	 * @see predictions
    	 ************************* 
    	 */
    	public void leaveOneOutPrediction() {
    		double tempItemAverageRating;
    		// Make each line of the code shorter.
    		int tempUser, tempItem, tempRating;
    		System.out.println("\r\nLeaveOneOutPrediction for radius " + radius);
    
    		numNonNeighbors = 0;
    		for (int i = 0; i < numRatings; i++) {
    			tempUser = compressedRatingMatrix[i][0];
    			tempItem = compressedRatingMatrix[i][1];
    			tempRating = compressedRatingMatrix[i][2];
    
    			// Step 1. Recompute average rating of the current item.
    			tempItemAverageRating = (itemAverageRatings[tempItem] * itemDegrees[tempItem] - tempRating)
    					/ (itemDegrees[tempItem] - 1);
    
    			// Step 2. Recompute neighbors, at the same time obtain the ratings
    			// Of neighbors.
    			int tempNeighbors = 0;
    			double tempTotal = 0;
    			int tempComparedItem;
    			for (int j = userStartingIndices[tempUser]; j < userStartingIndices[tempUser + 1]; j++) {
    				tempComparedItem = compressedRatingMatrix[j][1];
    				if (tempItem == tempComparedItem) {
    					continue;// Ignore itself.
    				} // Of if
    
    				if (Math.abs(tempItemAverageRating - itemAverageRatings[tempComparedItem]) < radius) {
    					tempTotal += compressedRatingMatrix[j][2];
    					tempNeighbors++;
    				} // Of if
    			} // Of for j
    
    			// Step 3. Predict as the average value of neighbors.
    			if (tempNeighbors > 0) {
    				predictions[i] = tempTotal / tempNeighbors;
    			} else {
    				predictions[i] = DEFAULT_RATING;
    				numNonNeighbors++;
    			} // Of if
    		} // Of for i
    	}// Of leaveOneOutPrediction
    
    	/**
    	 ************************* 
    	 * Compute the MAE based on the deviation of each leave-one-out.
    	 * 
    	 * @author Fan Min
    	 ************************* 
    	 */
    	public double computeMAE() throws Exception {
    		double tempTotalError = 0;
    		for (int i = 0; i < predictions.length; i++) {
    			tempTotalError += Math.abs(predictions[i] - compressedRatingMatrix[i][2]);
    		} // Of for i
    
    		return tempTotalError / predictions.length;
    	}// Of computeMAE
    
    	/**
    	 ************************* 
     	 * Compute the RMSE based on the deviation of each leave-one-out.
    	 * 
    	 * @author Fan Min
    	 ************************* 
    	 */
    	public double computeRSME() throws Exception {
    		double tempTotalError = 0;
    		for (int i = 0; i < predictions.length; i++) {
    			tempTotalError += (predictions[i] - compressedRatingMatrix[i][2])
    					* (predictions[i] - compressedRatingMatrix[i][2]);
    		} // Of for i
    
    		double tempAverage = tempTotalError / predictions.length;
    
    		return Math.sqrt(tempAverage);
    	}// Of computeRSME
    
    	/**
    	 ************************* 
    	 * The entrance of the program.
    	 * 
    	 * @param args
    	 *            Not used now.
    	 ************************* 
    	 */
    	public static void main(String[] args) {
    		try {
    			MBR tempRecommender = new MBR("D:/data/movielens-943u1682m.txt", 943, 1682, 100000);
    
    			for (double tempRadius = 0.2; tempRadius < 0.6; tempRadius += 0.1) {
    				tempRecommender.setRadius(tempRadius);
    
    				tempRecommender.leaveOneOutPrediction();
    				double tempMAE = tempRecommender.computeMAE();
    				double tempRSME = tempRecommender.computeRSME();
    
    				System.out.println("Radius = " + tempRadius + ", MAE = " + tempMAE + ", RSME = " + tempRSME
    						+ ", numNonNeighbors = " + tempRecommender.numNonNeighbors);
    			} // Of for tempRadius
    		} catch (Exception ee) {
    			System.out.println(ee);
    		} // Of try
    	}// Of main
    }// Of class MBR
    

     

    Day 55: Recommendation based on M-distance (continued)

    Yesterday's code implements item-based recommendation. Today, implement user-based recommendation yourself. Only additions to the existing code are needed.

    public class MBR {
    
    	/**
     	 * Default rating for 1-5 points.
    	 */
    	public static final double DEFAULT_RATING = 3.0;
    
    	/**
     	 * The total number of users.
    	 */
    	private int numUsers;
    
    	/**
     	 * The total number of items.
    	 */
    	private int numItems;
    
    	/**
     	 * The total number of ratings (non-zero values).
    	 */
    	private int numRatings;
    
    	/**
     	 * The predictions.
    	 */
    	private double[] predictions;
    
    	/**
     	 * Compressed rating matrix. User-item-rating triples.
    	 */
    	private int[][] compressedRatingMatrix;
    
    	/**
     	 * The degree of each user (how many items the user has rated).
    	 */
    	private int[] userDegrees;
    
    	/**
     	 * The average rating of each user.
    	 */
    	private double[] userAverageRatings;
    
    	/**
     	 * The degree of each item (how many users have rated it).
    	 */
    	private int[] itemDegrees;
    
    	/**
     	 * The average rating of each item.
    	 */
    	private double[] itemAverageRatings;
    
    	/**
     	 * The first user starts from 0. If the first user has x ratings, the second user starts from x.
    	 */
    	private int[] userStartingIndices;
    
    	/**
     	 * Number of objects without neighbors.
    	 */
    	private int numNonNeighbors;
    
    	/**
     	 * The radius (delta) for determining the neighborhood.
    	 */
    	private double radius;
    
    	/**
    	 ************************* 
     	 * Construct the rating matrix.
    	 * 
    	 * @param paraRatingFilename
    	 *            the rating filename.
    	 * @param paraNumUsers
    	 *            number of users
    	 * @param paraNumItems
    	 *            number of items
    	 * @param paraNumRatings
    	 *            number of ratings
    	 ************************* 
    	 */
    	public MBR(String paraFilename, int paraNumUsers, int paraNumItems, int paraNumRatings) throws Exception {
     		// Step 1. Initialize the arrays.
    		numItems = paraNumItems;
    		numUsers = paraNumUsers;
    		numRatings = paraNumRatings;
    
    		userDegrees = new int[numUsers];
    		userStartingIndices = new int[numUsers + 1];
    		userAverageRatings = new double[numUsers];
    		itemDegrees = new int[numItems];
    		compressedRatingMatrix = new int[numRatings][3];
    		itemAverageRatings = new double[numItems];
    
    		predictions = new double[numRatings];
    
    		System.out.println("Reading " + paraFilename);
    
     		// Step 2. Read the data file.
    		File tempFile = new File(paraFilename);
    		if (!tempFile.exists()) {
    			System.out.println("File " + paraFilename + " does not exists.");
    			System.exit(0);
    		}
    		BufferedReader tempBufReader = new BufferedReader(new FileReader(tempFile));
    		String tempString;
    		String[] tempStrArray;
    		int tempIndex = 0;
    		userStartingIndices[0] = 0;
    		userStartingIndices[numUsers] = numRatings;
    		while ((tempString = tempBufReader.readLine()) != null) {
     			// Each line has three values.
    			tempStrArray = tempString.split(",");
     			compressedRatingMatrix[tempIndex][0] = Integer.parseInt(tempStrArray[0]);// User
     			compressedRatingMatrix[tempIndex][1] = Integer.parseInt(tempStrArray[1]);// Item
     			compressedRatingMatrix[tempIndex][2] = Integer.parseInt(tempStrArray[2]);// Rating
    
    			userDegrees[compressedRatingMatrix[tempIndex][0]]++;
    			itemDegrees[compressedRatingMatrix[tempIndex][1]]++;
    
    			if (tempIndex > 0) {
     				// Starting to read the data of a new user.
    				if (compressedRatingMatrix[tempIndex][0] != compressedRatingMatrix[tempIndex - 1][0]) {
    					userStartingIndices[compressedRatingMatrix[tempIndex][0]] = tempIndex;
    				}
    			}
    			tempIndex++;
    		}
    		tempBufReader.close();
    
    		double[] tempUserTotalScore = new double[numUsers];
    		double[] tempItemTotalScore = new double[numItems];
    		for (int i = 0; i < numRatings; i++) {
    			tempUserTotalScore[compressedRatingMatrix[i][0]] += compressedRatingMatrix[i][2];
    			tempItemTotalScore[compressedRatingMatrix[i][1]] += compressedRatingMatrix[i][2];
    		}
    
    		for (int i = 0; i < numUsers; i++) {
    			userAverageRatings[i] = tempUserTotalScore[i] / userDegrees[i];
    		}
    		for (int i = 0; i < numItems; i++) {
    			itemAverageRatings[i] = tempItemTotalScore[i] / itemDegrees[i];
    		}
    	}
    
    	/**
    	 ************************* 
     	 * Set the radius (delta).
    	 * 
    	 * @param paraRadius
    	 *            The given radius.
    	 ************************* 
    	 */
    	public void setRadius(double paraRadius) {
    		if (paraRadius > 0) {
    			radius = paraRadius;
    		} else {
    			radius = 0.1;
    		}
    	}
    
    	/**
    	 ************************* 
     	 * Leave-one-out prediction. The predicted values are stored in predictions.
    	 * 
    	 * @see predictions
    	 ************************* 
    	 */
    	public void leaveOneOutPrediction() {
    		double tempUserAverageRating;
    		int tempUser, tempItem, tempRating;
    		System.out.println("\r\nLeaveOneOutPrediction for radius " + radius);
    
    		numNonNeighbors = 0;
    		for (int i = 0; i < numRatings; i++) {
    			tempUser = compressedRatingMatrix[i][0];
    			tempItem = compressedRatingMatrix[i][1];
    			tempRating = compressedRatingMatrix[i][2];
    
     			// Recompute the current user's average rating, excluding the left-out rating.
    			tempUserAverageRating = (userAverageRatings[tempUser] * userDegrees[tempUser] - tempRating)
    					/ (userDegrees[tempUser] - 1);
    
     			// Recompute neighbors: among the other users who rated the
     			// current item, those whose average rating is within the
     			// radius. The compressed matrix is sorted by user, so all
     			// ratings are scanned to find those of the current item.
     			// (The original loop scanned only the current user's own
     			// ratings, so it never found a neighbor.)
     			int tempNeighbors = 0;
     			double tempTotal = 0;
     			int tempComparedUser;
     			for (int j = 0; j < numRatings; j++) {
     				if (compressedRatingMatrix[j][1] != tempItem) {
     					continue;// Only ratings of the current item count.
     				}
     
     				tempComparedUser = compressedRatingMatrix[j][0];
     				if (tempUser == tempComparedUser) {
     					continue;// Ignore the current user.
     				}
     
     				if (Math.abs(tempUserAverageRating - userAverageRatings[tempComparedUser]) < radius) {
     					tempTotal += compressedRatingMatrix[j][2];
     					tempNeighbors++;
     				}
     			}
    
     			// Predict as the average value of neighbors.
    			if (tempNeighbors > 0) {
    				predictions[i] = tempTotal / tempNeighbors;
    			} else {
    				predictions[i] = DEFAULT_RATING;
    				numNonNeighbors++;
    			}
    		}
    	}
    
    	/**
    	 ************************* 
     	 * Compute the MAE based on the deviation of each leave-one-out.
    	 ************************* 
    	 */
    	public double computeMAE() throws Exception {
    		double tempTotalError = 0;
    		for (int i = 0; i < predictions.length; i++) {
    			tempTotalError += Math.abs(predictions[i] - compressedRatingMatrix[i][2]);
    		}
    
    		return tempTotalError / predictions.length;
    	}
    
    	/**
    	 ************************* 
     	 * Compute the RMSE based on the deviation of each leave-one-out.
    	 ************************* 
    	 */
    	public double computeRSME() throws Exception {
    		double tempTotalError = 0;
    		for (int i = 0; i < predictions.length; i++) {
    			tempTotalError += (predictions[i] - compressedRatingMatrix[i][2])
    					* (predictions[i] - compressedRatingMatrix[i][2]);
    		}
    
    		double tempAverage = tempTotalError / predictions.length;
    
    		return Math.sqrt(tempAverage);
    	}
    
    	/**
    	 ************************* 
    	 * The entrance of the program.
    	 * 
    	 * @param args
    	 *            Not used now.
    	 ************************* 
    	 */
    	public static void main(String[] args) {
    		try {
    			MBR tempRecommender = new MBR("D:/data/movielens943u1682m.txt", 10000, 1682, 1000000);
    
    			for (double tempRadius = 0.2; tempRadius < 0.6; tempRadius += 0.1) {
    				tempRecommender.setRadius(tempRadius);
    
    				tempRecommender.leaveOneOutPrediction();
    				double tempMAE = tempRecommender.computeMAE();
    				double tempRSME = tempRecommender.computeRSME();
    
    				System.out.println("Radius = " + tempRadius + ", MAE = " + tempMAE + ", RSME = " + tempRSME
    						+ ", numNonNeighbors = " + tempRecommender.numNonNeighbors);
    			}
    		} catch (Exception ee) {
    			System.out.println(ee);
    		}
    	}
    }
    

    Day 56: kMeans clustering
    kMeans is the most commonly used clustering algorithm.

    kMeans terminates when the centers converge; for laziness, Arrays.equals() is used for the test.
    The dataset is iris, so the last attribute is not used. A dataset without a decision attribute would need corresponding modifications.
    The data are not normalized.
    getRandomIndices() is identical to the kNN version and is copied over. It really belongs in a SimpleTools.java, but the code is short, so it is kept here for independence.
    distance() is similar to the kNN version. Be careful not to use the decision attribute, and note the different parameters: the second one is a real-valued vector, because a center may be virtual, with no actual data point located there.
     

    package machinelearning.kmeans;
    
    import java.io.FileReader;
    import java.util.Arrays;
    import java.util.Random;
    import weka.core.Instances;
    
    /**
     * kMeans clustering.
     * @author Fan Min minfanphd@163.com.
     */
     public class KMeans {
    
    	/**
    	 * Manhattan distance.
    	 */
    	public static final int MANHATTAN = 0;
    
    	/**
    	 * Euclidean distance.
    	 */
    	public static final int EUCLIDEAN = 1;
    
    	/**
    	 * The distance measure.
    	 */
    	public int distanceMeasure = EUCLIDEAN;
    
    	/**
    	 * A random instance;
    	 */
    	public static final Random random = new Random();
    
    	/**
    	 * The data.
    	 */
    	Instances dataset;
    
    	/**
    	 * The number of clusters.
    	 */
    	int numClusters = 2;
    
    	/**
    	 * The clusters.
    	 */
    	int[][] clusters;
    
    	/**
    	 ******************************* 
    	 * The first constructor.
    	 * 
    	 * @param paraFilename
    	 *            The data filename.
    	 ******************************* 
    	 */
    	public KMeans(String paraFilename) {
    		dataset = null;
    		try {
    			FileReader fileReader = new FileReader(paraFilename);
    			dataset = new Instances(fileReader);
    			fileReader.close();
    		} catch (Exception ee) {
    			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
    			System.exit(0);
    		} // Of try
    	}// Of the first constructor
    
    	/**
    	 ******************************* 
    	 * A setter.
    	 ******************************* 
    	 */
    	public void setNumClusters(int paraNumClusters) {
    		numClusters = paraNumClusters;
    	}// Of the setter
    
    	/**
    	 *********************
    	 * Get random indices for data randomization.
    	 * 
    	 * @param paraLength
    	 *            The length of the sequence.
    	 * @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
    	 *********************
    	 */
    	public static int[] getRandomIndices(int paraLength) {
    		int[] resultIndices = new int[paraLength];
    
    		// Step 1. Initialize.
    		for (int i = 0; i < paraLength; i++) {
    			resultIndices[i] = i;
    		} // Of for i
    
    		// Step 2. Randomly swap.
    		int tempFirst, tempSecond, tempValue;
    		for (int i = 0; i < paraLength; i++) {
    			// Generate two random indices.
    			tempFirst = random.nextInt(paraLength);
    			tempSecond = random.nextInt(paraLength);
    
    			// Swap.
    			tempValue = resultIndices[tempFirst];
    			resultIndices[tempFirst] = resultIndices[tempSecond];
    			resultIndices[tempSecond] = tempValue;
    		} // Of for i
    
    		return resultIndices;
    	}// Of getRandomIndices
    
    	/**
    	 *********************
    	 * The distance between two instances.
    	 * 
    	 * @param paraI
    	 *            The index of the first instance.
    	 * @param paraArray
    	 *            The array representing a point in the space.
    	 * @return The distance.
    	 *********************
    	 */
    	public double distance(int paraI, double[] paraArray) {
    		double resultDistance = 0;
    		double tempDifference;
    		switch (distanceMeasure) {
    		case MANHATTAN:
    			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
    				tempDifference = dataset.instance(paraI).value(i) - paraArray[i];
    				if (tempDifference < 0) {
    					resultDistance -= tempDifference;
    				} else {
    					resultDistance += tempDifference;
    				} // Of if
    			} // Of for i
    			break;
    
    		case EUCLIDEAN:
    			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
    				tempDifference = dataset.instance(paraI).value(i) - paraArray[i];
    				resultDistance += tempDifference * tempDifference;
    			} // Of for i
    			break;
    		default:
    			System.out.println("Unsupported distance measure: " + distanceMeasure);
    		}// Of switch
    
    		return resultDistance;
    	}// Of distance
    
    	/**
    	 ******************************* 
    	 * Clustering.
    	 ******************************* 
    	 */
    	public void clustering() {
    		int[] tempOldClusterArray = new int[dataset.numInstances()];
    		tempOldClusterArray[0] = -1;
    		int[] tempClusterArray = new int[dataset.numInstances()];
    		Arrays.fill(tempClusterArray, 0);
    		double[][] tempCenters = new double[numClusters][dataset.numAttributes() - 1];
    
    		// Step 1. Initialize centers.
    		int[] tempRandomOrders = getRandomIndices(dataset.numInstances());
    		for (int i = 0; i < numClusters; i++) {
    			for (int j = 0; j < tempCenters[0].length; j++) {
    				tempCenters[i][j] = dataset.instance(tempRandomOrders[i]).value(j);
    			} // Of for j
    		} // Of for i
    
    		int[] tempClusterLengths = null;
    		while (!Arrays.equals(tempOldClusterArray, tempClusterArray)) {
    			System.out.println("New loop ...");
    			tempOldClusterArray = tempClusterArray;
    			tempClusterArray = new int[dataset.numInstances()];
    
    			// Step 2.1 Minimization. Assign cluster to each instance.
    			int tempNearestCenter;
    			double tempNearestDistance;
    			double tempDistance;
    
    			for (int i = 0; i < dataset.numInstances(); i++) {
    				tempNearestCenter = -1;
    				tempNearestDistance = Double.MAX_VALUE;
    
    				for (int j = 0; j < numClusters; j++) {
    					tempDistance = distance(i, tempCenters[j]);
    					if (tempNearestDistance > tempDistance) {
    						tempNearestDistance = tempDistance;
    						tempNearestCenter = j;
    					} // Of if
    				} // Of for j
    				tempClusterArray[i] = tempNearestCenter;
    			} // Of for i
    
    			// Step 2.2 Mean. Find new centers.
    			tempClusterLengths = new int[numClusters];
    			Arrays.fill(tempClusterLengths, 0);
    			double[][] tempNewCenters = new double[numClusters][dataset.numAttributes() - 1];
    			// Arrays.fill(tempNewCenters, 0);
    			for (int i = 0; i < dataset.numInstances(); i++) {
    				for (int j = 0; j < tempNewCenters[0].length; j++) {
    					tempNewCenters[tempClusterArray[i]][j] += dataset.instance(i).value(j);
    				} // Of for j
    				tempClusterLengths[tempClusterArray[i]]++;
    			} // Of for i
    
    			// Step 2.3 Now average
    			for (int i = 0; i < tempNewCenters.length; i++) {
    				for (int j = 0; j < tempNewCenters[0].length; j++) {
    					tempNewCenters[i][j] /= tempClusterLengths[i];
    				} // Of for j
    			} // Of for i
    
    			System.out.println("Now the new centers are: " + Arrays.deepToString(tempNewCenters));
    			tempCenters = tempNewCenters;
    		} // Of while
    
    		// Step 3. Form clusters.
    		clusters = new int[numClusters][];
    		int[] tempCounters = new int[numClusters];
    		for (int i = 0; i < numClusters; i++) {
    			clusters[i] = new int[tempClusterLengths[i]];
    		} // Of for i
    
    		for (int i = 0; i < tempClusterArray.length; i++) {
    			clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
    			tempCounters[tempClusterArray[i]]++;
    		} // Of for i
    
    		System.out.println("The clusters are: " + Arrays.deepToString(clusters));
    	}// Of clustering
    
    	/**
    	 ******************************* 
    	 * Test clustering.
    	 ******************************* 
    	 */
    	public static void testClustering() {
    		KMeans tempKMeans = new KMeans("D:/data/iris.arff");
    		tempKMeans.setNumClusters(3);
    		tempKMeans.clustering();
    	}// Of testClustering
    
    	/**
    	 ************************* 
    	 * A testing method.
    	 ************************* 
    	 */
    	public static void main(String args[]) {
    		testClustering();
    	}// Of main
    
    }// Of class KMeans
    

    New loop ...
    Now the new centers are: [[6.017142857142856, 2.7971428571428567, 4.545714285714286, 1.5214285714285716], [6.964285714285715, 3.089285714285714, 5.932142857142857, 2.107142857142857], [5.005769230769231, 3.3807692307692316, 1.5288461538461537, 0.2749999999999999]]
    New loop ...
    Now the new centers are: [[6.022666666666666, 2.804, 4.544, 1.5333333333333332], [6.980000000000001, 3.0759999999999996, 5.991999999999998, 2.1039999999999996], [5.005999999999999, 3.4180000000000006, 1.464, 0.2439999999999999]]
    New loop ...
    Now the new centers are: [[6.022666666666666, 2.804, 4.544, 1.5333333333333332], [6.980000000000001, 3.0759999999999996, 5.991999999999998, 2.1039999999999996], [5.005999999999999, 3.4180000000000006, 1.464, 0.2439999999999999]]
    The clusters are: [[50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 110, 111, 112, 113, 114, 115, 116, 119, 121, 123, 126, 127, 133, 137, 138, 139, 141, 142, 145, 146, 147, 148, 149], [100, 102, 103, 104, 105, 107, 108, 109, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132, 134, 135, 136, 140, 143, 144], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]]
    
    

    Day 57: kMeans clustering (continued)

    After obtaining the virtual centers, replace each one with its nearest actual point, then cluster again (a sketch follows).
    Today is mainly about controlling the pace; after all, kMeans deserves two days of work.
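
    No code was given for this day. Below is a minimal sketch under the assumption that it is added to the KMeans class above: it reuses the existing dataset field and the distance(int, double[]) method, and the method name snapCentersToInstances is my own. clustering() would also need a small refactor to accept initial centers before it can re-cluster with the snapped centers.

    	/**
    	 * Replace each virtual center by the attribute vector of its nearest
    	 * actual instance. A sketch for the Day 57 task.
    	 */
    	public double[][] snapCentersToInstances(double[][] paraCenters) {
    		double[][] resultCenters = new double[paraCenters.length][dataset.numAttributes() - 1];
    		for (int i = 0; i < paraCenters.length; i++) {
    			// Find the instance nearest to the i-th virtual center.
    			int tempNearestIndex = 0;
    			double tempNearestDistance = Double.MAX_VALUE;
    			for (int j = 0; j < dataset.numInstances(); j++) {
    				double tempDistance = distance(j, paraCenters[i]);
    				if (tempDistance < tempNearestDistance) {
    					tempNearestDistance = tempDistance;
    					tempNearestIndex = j;
    				} // Of if
    			} // Of for j

    			// Copy its attribute values as the new, actual center.
    			for (int j = 0; j < resultCenters[0].length; j++) {
    				resultCenters[i][j] = dataset.instance(tempNearestIndex).value(j);
    			} // Of for j
    		} // Of for i

    		return resultCenters;
    	}// Of snapCentersToInstances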

     

    Day 58: the NB algorithm for nominal data
    Naive Bayes is an algorithm derived from the posterior probability formula. It relies on an independence assumption that looks unreliable mathematically, but it works well in machine learning practice.

    All of the code is listed today, but only classification of nominal data is studied today. You may copy just the methods related to nominal data (start from main() and copy selectively), and copy the numerical part tomorrow.
    You must work through a small example of your own (e.g., 10 objects, 3 condition attributes, 2 classes) to aid understanding.
    Consult the relevant background material as needed.
    You need to understand the meaning of each dimension of the three-dimensional array: the conditional probabilities for all classes over all attributes on all values. Note that the array is ragged; for instance, different attributes may have different numbers of values (see the sketch after this list).
    The same data are used here for both training and testing. To split training and testing sets, refer to the kNN code.
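
    A minimal standalone sketch of the ragged three-dimensional array; the class and value counts below are made up for illustration:

    public class RaggedArrayDemo {
    	public static void main(String[] args) {
    		// Suppose 2 classes and 3 condition attributes with 3, 2 and 4 values.
    		int tempNumClasses = 2;
    		int[] tempNumValues = { 3, 2, 4 };
    		double[][][] tempProbabilities = new double[tempNumClasses][tempNumValues.length][];
    		for (int i = 0; i < tempNumClasses; i++) {
    			for (int j = 0; j < tempNumValues.length; j++) {
    				// Rows have different lengths: the array is ragged.
    				tempProbabilities[i][j] = new double[tempNumValues[j]];
    			} // Of for j
    		} // Of for i

    		// tempProbabilities[i][j][k] will store P(attribute j = value k | class i).
    		System.out.println("Row lengths for class 0: " + tempProbabilities[0][0].length
    				+ ", " + tempProbabilities[0][1].length + ", " + tempProbabilities[0][2].length);
    	}// Of main
    }// Of class RaggedArrayDemo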
     

    package machinelearning.bayes;
    
    import java.io.FileReader;
    import java.util.Arrays;
    
    import weka.core.*;
    
    /**
     * The Naive Bayes algorithm.
     * 
     * @author Fan Min minfanphd@163.com.
     */
    
    public class NaiveBayes {
    	/**
    	 ************************* 
    	 * An inner class to store parameters.
    	 ************************* 
    	 */
    	private class GaussianParamters {
    		double mu;
    		double sigma;
    
    		public GaussianParamters(double paraMu, double paraSigma) {
    			mu = paraMu;
    			sigma = paraSigma;
    		}// Of the constructor
    
    		public String toString() {
    			return "(" + mu + ", " + sigma + ")";
    		}// Of toString
    	}// Of GaussianParamters
    
    	/**
    	 * The data.
    	 */
    	Instances dataset;
    
    	/**
    	 * The number of classes. For binary classification it is 2.
    	 */
    	int numClasses;
    
    	/**
    	 * The number of instances.
    	 */
    	int numInstances;
    
    	/**
    	 * The number of conditional attributes.
    	 */
    	int numConditions;
    
    	/**
    	 * The prediction, including queried and predicted labels.
    	 */
    	int[] predicts;
    
    	/**
    	 * Class distribution.
    	 */
    	double[] classDistribution;
    
    	/**
    	 * Class distribution with Laplacian smooth.
    	 */
    	double[] classDistributionLaplacian;
    
    	/**
    	 * The conditional probabilities for all classes over all attributes on all
    	 * values.
    	 */
    	double[][][] conditionalProbabilities;
    
    	/**
    	 * The conditional probabilities with Laplacian smooth.
    	 */
    	double[][][] conditionalProbabilitiesLaplacian;
    
    	/**
     	 * The Gaussian parameters.
    	 */
    	GaussianParamters[][] gaussianParameters;
    
    	/**
    	 * Data type.
    	 */
    	int dataType;
    
    	/**
    	 * Nominal.
    	 */
    	public static final int NOMINAL = 0;
    
    	/**
    	 * Numerical.
    	 */
    	public static final int NUMERICAL = 1;
    
    	/**
    	 ********************
    	 * The constructor.
    	 * 
    	 * @param paraFilename
    	 *            The given file.
    	 ********************
    	 */
    	public NaiveBayes(String paraFilename) {
    		dataset = null;
    		try {
    			FileReader fileReader = new FileReader(paraFilename);
    			dataset = new Instances(fileReader);
    			fileReader.close();
    		} catch (Exception ee) {
    			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
    			System.exit(0);
    		} // Of try
    
    		dataset.setClassIndex(dataset.numAttributes() - 1);
    		numConditions = dataset.numAttributes() - 1;
    		numInstances = dataset.numInstances();
    		numClasses = dataset.attribute(numConditions).numValues();
    	}// Of the constructor
    
    	/**
    	 ********************
    	 * Set the data type.
    	 ********************
    	 */
    	public void setDataType(int paraDataType) {
    		dataType = paraDataType;
    	}// Of setDataType
    
    	/**
    	 ********************
    	 * Calculate the class distribution with Laplacian smooth.
    	 ********************
    	 */
    	public void calculateClassDistribution() {
    		classDistribution = new double[numClasses];
    		classDistributionLaplacian = new double[numClasses];
    
    		double[] tempCounts = new double[numClasses];
    		for (int i = 0; i < numInstances; i++) {
    			int tempClassValue = (int) dataset.instance(i).classValue();
    			tempCounts[tempClassValue]++;
    		} // Of for i
    
    		for (int i = 0; i < numClasses; i++) {
    			classDistribution[i] = tempCounts[i] / numInstances;
    			classDistributionLaplacian[i] = (tempCounts[i] + 1) / (numInstances + numClasses);
    		} // Of for i
    
    		System.out.println("Class distribution: " + Arrays.toString(classDistribution));
    		System.out.println(
    				"Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
    	}// Of calculateClassDistribution
    
    	/**
    	 ********************
     	 * Calculate the conditional probabilities with Laplacian smooth. ONLY scan
     	 * the dataset once. A simpler implementation existed, but it was removed
     	 * because its time complexity was higher.
    	 ********************
    	 */
    	public void calculateConditionalProbabilities() {
    		conditionalProbabilities = new double[numClasses][numConditions][];
    		conditionalProbabilitiesLaplacian = new double[numClasses][numConditions][];
    
    		// Allocate space
    		for (int i = 0; i < numClasses; i++) {
    			for (int j = 0; j < numConditions; j++) {
    				int tempNumValues = (int) dataset.attribute(j).numValues();
    				conditionalProbabilities[i][j] = new double[tempNumValues];
    				conditionalProbabilitiesLaplacian[i][j] = new double[tempNumValues];
    			} // Of for j
    		} // Of for i
    
    		// Count the numbers
    		int[] tempClassCounts = new int[numClasses];
    		for (int i = 0; i < numInstances; i++) {
    			int tempClass = (int) dataset.instance(i).classValue();
    			tempClassCounts[tempClass]++;
    			for (int j = 0; j < numConditions; j++) {
    				int tempValue = (int) dataset.instance(i).value(j);
    				conditionalProbabilities[tempClass][j][tempValue]++;
    			} // Of for j
    		} // Of for i
    
    		// Now for the real probability with Laplacian
    		for (int i = 0; i < numClasses; i++) {
    			for (int j = 0; j < numConditions; j++) {
    				int tempNumValues = (int) dataset.attribute(j).numValues();
    				for (int k = 0; k < tempNumValues; k++) {
     					// Laplacian smoothing: the denominator adds the number of
     					// values of attribute j, not the number of classes.
     					conditionalProbabilitiesLaplacian[i][j][k] = (conditionalProbabilities[i][j][k]
     							+ 1) / (tempClassCounts[i] + tempNumValues);
    				} // Of for k
    			} // Of for j
    		} // Of for i
    
    		System.out.println(Arrays.deepToString(conditionalProbabilities));
    	}// Of calculateConditionalProbabilities
    
    	/**
    	 ********************
     	 * Calculate the Gaussian parameters (mu and sigma) for each class-attribute pair.
    	 ********************
    	 */
    	public void calculateGausssianParameters() {
    		gaussianParameters = new GaussianParamters[numClasses][numConditions];
    
    		double[] tempValuesArray = new double[numInstances];
    		int tempNumValues = 0;
    		double tempSum = 0;
    
    		for (int i = 0; i < numClasses; i++) {
    			for (int j = 0; j < numConditions; j++) {
    				tempSum = 0;
    
    				// Obtain values for this class.
    				tempNumValues = 0;
    				for (int k = 0; k < numInstances; k++) {
    					if ((int) dataset.instance(k).classValue() != i) {
    						continue;
    					} // Of if
    
    					tempValuesArray[tempNumValues] = dataset.instance(k).value(j);
    					tempSum += tempValuesArray[tempNumValues];
    					tempNumValues++;
    				} // Of for k
    
    				// Obtain parameters.
    				double tempMu = tempSum / tempNumValues;
    
    				double tempSigma = 0;
    				for (int k = 0; k < tempNumValues; k++) {
    					tempSigma += (tempValuesArray[k] - tempMu) * (tempValuesArray[k] - tempMu);
    				} // Of for k
    				tempSigma /= tempNumValues;
    				tempSigma = Math.sqrt(tempSigma);
    
    				gaussianParameters[i][j] = new GaussianParamters(tempMu, tempSigma);
    			} // Of for j
    		} // Of for i
    
    		System.out.println(Arrays.deepToString(gaussianParameters));
    	}// Of calculateGausssianParameters
    
    	/**
    	 ********************
    	 * Classify all instances, the results are stored in predicts[].
    	 ********************
    	 */
    	public void classify() {
    		predicts = new int[numInstances];
    		for (int i = 0; i < numInstances; i++) {
    			predicts[i] = classify(dataset.instance(i));
    		} // Of for i
    	}// Of classify
    
    	/**
    	 ********************
     	 * Classify an instance.
    	 ********************
    	 */
    	public int classify(Instance paraInstance) {
    		if (dataType == NOMINAL) {
    			return classifyNominal(paraInstance);
    		} else if (dataType == NUMERICAL) {
    			return classifyNumerical(paraInstance);
    		} // Of if
    
    		return -1;
    	}// Of classify
    
    	/**
    	 ********************
     	 * Classify an instance with nominal data.
    	 ********************
    	 */
    	public int classifyNominal(Instance paraInstance) {
    		// Find the biggest one
    		double tempBiggest = -10000;
    		int resultBestIndex = 0;
    		for (int i = 0; i < numClasses; i++) {
    			double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
    			for (int j = 0; j < numConditions; j++) {
    				int tempAttributeValue = (int) paraInstance.value(j);
    
    				// Laplacian smooth.
     				tempPseudoProbability += Math
     						.log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
    			} // Of for j
    
    			if (tempBiggest < tempPseudoProbability) {
    				tempBiggest = tempPseudoProbability;
    				resultBestIndex = i;
    			} // Of if
    		} // Of for i
    
    		return resultBestIndex;
    	}// Of classifyNominal
    
    	/**
    	 ********************
     	 * Classify an instance with numerical data.
    	 ********************
    	 */
    	public int classifyNumerical(Instance paraInstance) {
    		// Find the biggest one
    		double tempBiggest = -10000;
    		int resultBestIndex = 0;
    
    		for (int i = 0; i < numClasses; i++) {
    			double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
    			for (int j = 0; j < numConditions; j++) {
    				double tempAttributeValue = paraInstance.value(j);
    				double tempSigma = gaussianParameters[i][j].sigma;
    				double tempMu = gaussianParameters[i][j].mu;
    
    				tempPseudoProbability += -Math.log(tempSigma) - (tempAttributeValue - tempMu)
    						* (tempAttributeValue - tempMu) / (2 * tempSigma * tempSigma);
    			} // Of for j
    
    			if (tempBiggest < tempPseudoProbability) {
    				tempBiggest = tempPseudoProbability;
    				resultBestIndex = i;
    			} // Of if
    		} // Of for i
    
    		return resultBestIndex;
    	}// Of classifyNumerical
    
    	/**
    	 ********************
    	 * Compute accuracy.
    	 ********************
    	 */
    	public double computeAccuracy() {
    		double tempCorrect = 0;
    		for (int i = 0; i < numInstances; i++) {
    			if (predicts[i] == (int) dataset.instance(i).classValue()) {
    				tempCorrect++;
    			} // Of if
    		} // Of for i
    
    		double resultAccuracy = tempCorrect / numInstances;
    		return resultAccuracy;
    	}// Of computeAccuracy
    
    	/**
    	 ************************* 
    	 * Test nominal data.
    	 ************************* 
    	 */
    	public static void testNominal() {
    		System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
    		String tempFilename = "D:/data/mushroom.arff";
    
    		NaiveBayes tempLearner = new NaiveBayes(tempFilename);
    		tempLearner.setDataType(NOMINAL);
    		tempLearner.calculateClassDistribution();
    		tempLearner.calculateConditionalProbabilities();
    		tempLearner.classify();
    
    		System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
    	}// Of testNominal
    
    	/**
    	 ************************* 
    	 * Test numerical data.
    	 ************************* 
    	 */
    	public static void testNumerical() {
    		System.out.println(
    				"Hello, Naive Bayes. I only want to test the numerical data with Gaussian assumption.");
    		String tempFilename = "D:/data/iris.arff";
    
    		NaiveBayes tempLearner = new NaiveBayes(tempFilename);
    		tempLearner.setDataType(NUMERICAL);
    		tempLearner.calculateClassDistribution();
    		tempLearner.calculateGausssianParameters();
    		tempLearner.classify();
    
    		System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
     	}// Of testNumerical
    
    	/**
    	 ************************* 
    	 * Test this class.
    	 * 
    	 * @param args
    	 *            Not used now.
    	 ************************* 
    	 */
    	public static void main(String[] args) {
    		testNominal();
    		testNumerical();
    	}// Of main
    
    }// Of class NaiveBayes
    

    Day 59: the NB algorithm for numerical data

    Today the code for handling numerical data is added.
    Assume that the values of every attribute follow a Gaussian distribution. Other assumptions are also possible.
    The probability density is used directly in place of a probability in the Bayes formula (a short derivation follows).
    As can be seen, handling numerical data is no more complex than handling nominal data.
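
    Why classifyNumerical computes -log(sigma) - (x - mu)^2 / (2 * sigma^2): take the logarithm of the Bayes decision rule and drop the class-independent constant. In LaTeX notation:

    $$\hat{c} = \arg\max_c P(c) \prod_j p(x_j \mid c) = \arg\max_c \Big[ \log P(c) + \sum_j \log p(x_j \mid c) \Big].$$

    With the Gaussian density $p(x_j \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{cj}} \exp\big( -\frac{(x_j - \mu_{cj})^2}{2\sigma_{cj}^2} \big)$,

    $$\log p(x_j \mid c) = -\log \sigma_{cj} - \frac{(x_j - \mu_{cj})^2}{2\sigma_{cj}^2} - \tfrac{1}{2} \log 2\pi,$$

    and the last term is identical for every class, so it can be dropped from the comparison.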

/**
 ********************
 * Classify an instance with numerical data.
 ********************
 */
public int classifyNumerical(Instance paraInstance) {
	// Find the biggest one
	double tempBiggest = -10000;
	int resultBestIndex = 0;

	for (int i = 0; i < numClasses; i++) {
		double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
		for (int j = 0; j < numConditions; j++) {
			double tempAttributeValue = paraInstance.value(j);
			double tempSigma = gaussianParameters[i][j].sigma;
			double tempMu = gaussianParameters[i][j].mu;

			tempPseudoProbability += -Math.log(tempSigma) - (tempAttributeValue - tempMu)
					* (tempAttributeValue - tempMu) / (2 * tempSigma * tempSigma);
		} // Of for j

		if (tempBiggest < tempPseudoProbability) {
			tempBiggest = tempPseudoProbability;
			resultBestIndex = i;
		} // Of if
	} // Of for i

	return resultBestIndex;
}// Of classifyNumerical

/**
 ************************* 
 * Test numerical data.
 ************************* 
 */
public static void testNumerical() {
	System.out.println(
			"Hello, Naive Bayes. I only want to test the numerical data with Gaussian assumption.");
	String tempFilename = "D:/data/iris.arff";

	NaiveBayes tempLearner = new NaiveBayes(tempFilename);
	tempLearner.setDataType(NUMERICAL);
	tempLearner.calculateClassDistribution();
	tempLearner.calculateGausssianParameters();
	tempLearner.classify();

	System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
}// Of testNumerical

public static void main(String[] args) {
	testNominal();
	testNumerical();
}// Of main

 Day 60: Summary

1. kNN, i.e., K-Nearest Neighbor, is one of the simplest classification methods in data mining. To decide which known class a test sample belongs to, find the K samples nearest to (most similar to) it among all samples, and assign the test sample to the class that occurs most often among those K. How do we measure the similarity of two samples? It can be defined with the p-norm of a vector; a sketch follows.
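
A minimal sketch of the p-norm (Minkowski) distance in the style of the kNN code above. The method minkowskiDistance is my own addition, assumed to live in KnnClassification; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.

	/**
	 * Minkowski (p-norm) distance between two instances.
	 * 
	 * @param paraI The index of the first instance.
	 * @param paraJ The index of the second instance.
	 * @param paraP The order of the norm, p >= 1.
	 * @return The distance.
	 */
	public double minkowskiDistance(int paraI, int paraJ, double paraP) {
		double tempSum = 0;
		for (int i = 0; i < dataset.numAttributes() - 1; i++) {
			double tempDifference = Math.abs(
					dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i));
			tempSum += Math.pow(tempDifference, paraP);
		} // Of for i

		return Math.pow(tempSum, 1 / paraP);
	}// Of minkowskiDistance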

2. Machine learning is mainly divided into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

3. Supervised learning:
In: labeled data
Out: feedback available
Goal: predict outcomes
Example: learning to recognize characters
Algorithms: classification (categories), regression (numbers)

4. Unsupervised learning:
In: unlabeled data
Out: no feedback
Goal: discover latent structure
Example: automatic clustering
Algorithms: clustering, dimensionality reduction

5. Semi-supervised learning:
Known: the training data and the classes to be assigned
Unknown: training samples may or may not be labeled
Application: when labeled training data is limited and supervised learning alone
cannot meet the requirement, it is used to enhance the effect.

6. Reinforcement learning:
In: a decision process and a reward system
Out: a sequence of actions
Goal: maximize long-term benefit; the reward function is a delayed signal that only indicates whether you are moving toward the goal
Example: learning to play chess
Algorithms: Markov decision processes, dynamic programming

7. kMeans, also known as the k-means algorithm: k is the number of clusters, and "means" refers to taking the mean of the values in each cluster as that cluster's center, or centroid; that is, each cluster is described by its centroid.

8. The four main points in implementing kMeans:
          (1) choosing the number of clusters k;
          (2) computing the distance from each sample point to the cluster centers;
          (3) updating the cluster centers according to the new partition;
          (4) repeating steps 2 and 3 until the cluster centers no longer move.
Pros and cons: it is easy to implement, but it may converge to a local minimum
and converges slowly on large-scale data.

9. A clustering algorithm automatically partitions unlabeled data into several classes. It is an unsupervised learning method and must ensure that data in the same class share similar features.

10. This covers the basic content of machine learning; a deeper understanding is still needed.
