Days 51-60: kNN and NB

This post covers machine learning basics: the kNN classifier, the kMeans clustering algorithm, and their use in recommender systems. kNN finds the nearest neighbors by computing distances between instances, while kMeans iteratively searches for cluster centers. The post also presents a recommender system based on M-distance, as well as Naive Bayes classifiers for nominal and numerical data. The algorithms are tested on real datasets and their accuracies are computed.

Day 51: kNN classifier
This code is about 300 lines and is spread over three days. Today, copy the code and get it running; the next two days involve modifying the program. Mastery is required.

Two distance measures.
Random splitting of the data.
Flexible use of indirection: trainingSet and testingSet are both integer arrays holding indices.
Reading an arff file; the weka.jar package is required.
Finding the neighbors.
Voting.
 

package machinelearning.knn;

import java.io.FileReader;
import java.util.Arrays;
import java.util.Random;

import weka.core.*;

/**
 * kNN classification.
 * 
 * @author Fan Min minfanphd@163.com.
 */
public class KnnClassification {

	/**
	 * Manhattan distance.
	 */
	public static final int MANHATTAN = 0;

	/**
	 * Euclidean distance.
	 */
	public static final int EUCLIDEAN = 1;

	/**
	 * The distance measure.
	 */
	public int distanceMeasure = EUCLIDEAN;

	/**
	 * A random instance;
	 */
	public static final Random random = new Random();

	/**
	 * The number of neighbors.
	 */
	int numNeighbors = 7;

	/**
	 * The whole dataset.
	 */
	Instances dataset;

	/**
	 * The training set. Represented by the indices of the data.
	 */
	int[] trainingSet;

	/**
	 * The testing set. Represented by the indices of the data.
	 */
	int[] testingSet;

	/**
	 * The predictions.
	 */
	int[] predictions;

	/**
	 *********************
	 * The first constructor.
	 * 
	 * @param paraFilename
	 *            The arff filename.
	 *********************
	 */
	public KnnClassification(String paraFilename) {
		try {
			FileReader fileReader = new FileReader(paraFilename);
			dataset = new Instances(fileReader);
			// The last attribute is the decision class.
			dataset.setClassIndex(dataset.numAttributes() - 1);
			fileReader.close();
		} catch (Exception ee) {
			System.out.println("Error occurred while trying to read \'" + paraFilename
					+ "\' in KnnClassification constructor.\r\n" + ee);
			System.exit(0);
		} // Of try
	}// Of the first constructor

	/**
	 *********************
	 * Get random indices for data randomization.
	 * 
	 * @param paraLength
	 *            The length of the sequence.
	 * @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
	 *********************
	 */
	public static int[] getRandomIndices(int paraLength) {
		int[] resultIndices = new int[paraLength];

		// Step 1. Initialize.
		for (int i = 0; i < paraLength; i++) {
			resultIndices[i] = i;
		} // Of for i

		// Step 2. Randomly swap.
		int tempFirst, tempSecond, tempValue;
		for (int i = 0; i < paraLength; i++) {
			// Generate two random indices.
			tempFirst = random.nextInt(paraLength);
			tempSecond = random.nextInt(paraLength);

			// Swap.
			tempValue = resultIndices[tempFirst];
			resultIndices[tempFirst] = resultIndices[tempSecond];
			resultIndices[tempSecond] = tempValue;
		} // Of for i

		return resultIndices;
	}// Of getRandomIndices

	/**
	 *********************
	 * Split the data into training and testing parts.
	 * 
	 * @param paraTrainingFraction
	 *            The fraction of the training set.
	 *********************
	 */
	public void splitTrainingTesting(double paraTrainingFraction) {
		int tempSize = dataset.numInstances();
		int[] tempIndices = getRandomIndices(tempSize);
		int tempTrainingSize = (int) (tempSize * paraTrainingFraction);

		trainingSet = new int[tempTrainingSize];
		testingSet = new int[tempSize - tempTrainingSize];

		for (int i = 0; i < tempTrainingSize; i++) {
			trainingSet[i] = tempIndices[i];
		} // Of for i

		for (int i = 0; i < tempSize - tempTrainingSize; i++) {
			testingSet[i] = tempIndices[tempTrainingSize + i];
		} // Of for i
	}// Of splitTrainingTesting

	/**
	 *********************
	 * Predict for the whole testing set. The results are stored in predictions.
	 * @see predictions
	 *********************
	 */
	public void predict() {
		predictions = new int[testingSet.length];
		for (int i = 0; i < predictions.length; i++) {
			predictions[i] = predict(testingSet[i]);
		} // Of for i
	}// Of predict

	/**
	 *********************
	 * Predict for the given instance.
	 * 
	 * @param paraIndex
	 *            The index of the given instance in the dataset.
	 * @return The prediction.
	 *********************
	 */
	public int predict(int paraIndex) {
		int[] tempNeighbors = computeNearests(paraIndex);
		int resultPrediction = simpleVoting(tempNeighbors);

		return resultPrediction;
	}// Of predict

	/**
	 *********************
	 * The distance between two instances.
	 * 
	 * @param paraI
	 *            The index of the first instance.
	 * @param paraJ
	 *            The index of the second instance.
	 * @return The distance.
	 *********************
	 */
	public double distance(int paraI, int paraJ) {
		double resultDistance = 0;
		double tempDifference;
		switch (distanceMeasure) {
		case MANHATTAN:
			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
				tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
				if (tempDifference < 0) {
					resultDistance -= tempDifference;
				} else {
					resultDistance += tempDifference;
				} // Of if
			} // Of for i
			break;

		case EUCLIDEAN:
			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
				tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
				resultDistance += tempDifference * tempDifference;
			} // Of for i
			break;
		default:
			System.out.println("Unsupported distance measure: " + distanceMeasure);
		}// Of switch

		return resultDistance;
	}// Of distance

	/**
	 *********************
	 * Get the accuracy of the classifier.
	 * 
	 * @return The accuracy.
	 *********************
	 */
	public double getAccuracy() {
		// Dividing a double by an int yields a double.
		double tempCorrect = 0;
		for (int i = 0; i < predictions.length; i++) {
			if (predictions[i] == dataset.instance(testingSet[i]).classValue()) {
				tempCorrect++;
			} // Of if
		} // Of for i

		return tempCorrect / testingSet.length;
	}// Of getAccuracy

	/**
	 ************************************
	 * Compute the nearest k neighbors. Select one neighbor in each scan. In
	 * fact we can scan only once. You may implement it by yourself.
	 * 
	 * @param paraCurrent
	 *            The index of the current instance. We are comparing it with
	 *            all others.
	 * @return The indices of the nearest instances.
	 ************************************
	 */
	public int[] computeNearests(int paraCurrent) {
		int[] resultNearests = new int[numNeighbors];
		boolean[] tempSelected = new boolean[trainingSet.length];
		double tempDistance;
		double tempMinimalDistance;
		int tempMinimalIndex = 0;

		// Select the nearest paraK indices.
		for (int i = 0; i < numNeighbors; i++) {
			tempMinimalDistance = Double.MAX_VALUE;

			for (int j = 0; j < trainingSet.length; j++) {
				if (tempSelected[j]) {
					continue;
				} // Of if

				tempDistance = distance(paraCurrent, trainingSet[j]);
				if (tempDistance < tempMinimalDistance) {
					tempMinimalDistance = tempDistance;
					tempMinimalIndex = j;
				} // Of if
			} // Of for j

			resultNearests[i] = trainingSet[tempMinimalIndex];
			tempSelected[tempMinimalIndex] = true;
		} // Of for i

		System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(resultNearests));
		return resultNearests;
	}// Of computeNearests

	/**
	 ************************************
	 * Voting using the instances.
	 * 
	 * @param paraNeighbors
	 *            The indices of the neighbors.
	 * @return The predicted label.
	 ************************************
	 */
	public int simpleVoting(int[] paraNeighbors) {
		int[] tempVotes = new int[dataset.numClasses()];
		for (int i = 0; i < paraNeighbors.length; i++) {
			tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()]++;
		} // Of for i

		int tempMaximalVotingIndex = 0;
		int tempMaximalVoting = 0;
		for (int i = 0; i < dataset.numClasses(); i++) {
			if (tempVotes[i] > tempMaximalVoting) {
				tempMaximalVoting = tempVotes[i];
				tempMaximalVotingIndex = i;
			} // Of if
		} // Of for i

		return tempMaximalVotingIndex;
	}// Of simpleVoting

	/**
	 *********************
	 * The entrance of the program.
	 * 
	 * @param args
	 *            Not used now.
	 *********************
	 */
	public static void main(String args[]) {
		KnnClassification tempClassifier = new KnnClassification("D:/data/iris.arff");
		tempClassifier.splitTrainingTesting(0.8);
		tempClassifier.predict();
		System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
	}// Of main

}// Of class KnnClassification

 

Day 52: kNN classifier (continued)

Today we modify yesterday's code:

Re-implement computeNearests so that the k neighbors are obtained in a single scan of the training set. Hint: combine the existing code with the idea of insertion sort.
Add a setDistanceMeasure() method to select the distance measure.
Add a setNumNeighbors() method to set the number of neighbors.

public int[] computeNearests(int paraCurrent) {
   int[] resultNearests = new int[numNeighbors];
   double tempDistance;

   // Straight insertion sort on (training set position, distance) pairs.
   double[][] tempDistanceArray = new double[trainingSet.length][2];
   tempDistanceArray[0][0] = 0;
   tempDistanceArray[0][1] = distance(paraCurrent, trainingSet[0]);
   int j;
   for (int i = 1; i < trainingSet.length; i++) {
       tempDistance = distance(paraCurrent, trainingSet[i]);
       for (j = i - 1; j >= 0; j--) {
           if (tempDistance < tempDistanceArray[j][1]) {
               // Shift backward. Copy the values instead of assigning the
               // row reference; otherwise the rows would alias each other.
               tempDistanceArray[j + 1][0] = tempDistanceArray[j][0];
               tempDistanceArray[j + 1][1] = tempDistanceArray[j][1];
           } else {
               break;
           }
       }
       tempDistanceArray[j + 1][0] = i;
       tempDistanceArray[j + 1][1] = tempDistance;
   }

   // The first numNeighbors rows now hold the smallest distances.
   for (int i = 0; i < numNeighbors; i++) {
       resultNearests[i] = trainingSet[(int) tempDistanceArray[i][0]];
   }

   System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(resultNearests));
   return resultNearests;
}
public void setDistanceMeasure(int paraType) {
   if (paraType == 0) {
       distanceMeasure = MANHATTAN;
   } else if (paraType == 1) {
       distanceMeasure = EUCLIDEAN;
   } else {
       System.out.println("Wrong Distance Measure.");
   }
}

public void setNumNeighbors(int paraNumNeighbors) {
   if (paraNumNeighbors > dataset.numInstances()) {
       System.out.println("The number of neighbors is too big.");
       return;
   }

   numNeighbors = paraNumNeighbors;
}


/**
*********************
* The entrance of the program.
* 
* @param args
*            Not used now.
*********************
*/
public static void main(String args[]) {
	KnnClassification tempClassifier = new KnnClassification("D:/data/iris.arff");
	tempClassifier.setDistanceMeasure(0);
	tempClassifier.setNumNeighbors(5);
	tempClassifier.splitTrainingTesting(0.8);
	tempClassifier.predict();
	System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
}

Day 53: kNN classifier (continued)

  1. Add a weightedVoting() method: the shorter the distance, the larger the say. Support at least two weighting schemes (a second scheme is sketched after the code).
  2. Implement leave-one-out testing.
    public int weightedVoting(int paraCurrent, int[] paraNeighbors) {
         double[] tempVotes = new double[dataset.numClasses()];
    
         double tempDistance;
         // Weight = b / (distance + a); larger a flattens the weights.
         // Note that b rescales all votes uniformly, so it cannot change
         // the voting result.
         int a = 2, b = 1;
        
         for (int i = 0; i < paraNeighbors.length; i++) {
             tempDistance = distance(paraCurrent, paraNeighbors[i]);
             tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()]
                     += getWeightedNum(a, b, tempDistance);
         }
    
         int tempMaximalVotingIndex = 0;
         double tempMaximalVoting = 0;
         for (int i = 0; i < dataset.numClasses(); i++) {
             if (tempVotes[i] > tempMaximalVoting) {
                 tempMaximalVoting = tempVotes[i];
                 tempMaximalVotingIndex = i;
             }
         }
    
         return tempMaximalVotingIndex;
     }
    
     public double getWeightedNum(int a, int b, double paraDistance) {
         return b / (paraDistance + a);
     }
    
     public void leave_one_out() {
     	// Leave-one-out cross validation.
         int tempSize = dataset.numInstances();
         int[] tempIndices = getRandomIndices(tempSize);
         double tempCorrect = 0;
         for (int i = 0; i < tempSize; i++) {
             trainingSet = new int[tempSize - 1];
             testingSet = new int[1];
    
             int tempIndex = 0;
             for (int j = 0; j < tempSize; j++) {
                 if (j == i) {
                     continue;
                 }
                 trainingSet[tempIndex++] = tempIndices[j];
             }
    
             testingSet[0] = tempIndices[i];
    
             this.predict();
    
             if (predictions[0] == dataset.instance(testingSet[0]).classValue()) {
                 tempCorrect++;
             }
         }
    
         System.out.println("The accuracy is:" + tempCorrect / tempSize);
     }
    
     public static void main(String[] args) {
         KnnClassification tempClassifier = new KnnClassification("D:\\data\\iris.arff");
         tempClassifier.setDistanceMeasure(0);
         tempClassifier.setNumNeighbors(5);
         tempClassifier.splitTrainingTesting(0.8);
         tempClassifier.predict();
         System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
    
         // Test leave-one-out.
         System.out.println("\r\n-------leave_one_out-------");
         tempClassifier.leave_one_out();
     }
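
     The task asks for at least two weighting schemes, while the code above implements only one (weight = b / (distance + a)). Below is a minimal sketch of a second scheme, a Gaussian kernel weight; the method name getGaussianWeight and the paraSigma parameter are my own additions, and the method could be called inside weightedVoting() in place of getWeightedNum():

     public double getGaussianWeight(double paraDistance, double paraSigma) {
         // The weight decays exponentially with the squared distance;
         // paraSigma is a free parameter controlling the decay rate.
         return Math.exp(-paraDistance * paraDistance / (2 * paraSigma * paraSigma));
     }// Of getGaussianWeight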
    

     Day 54: Recommendation based on M-distance

     The M-distance between two items is the difference between their average ratings; the neighbors of an item are the items whose average rating differs from its own by less than a given radius (delta). For example, if item A's average rating is 3.2 and item B's is 3.4, their M-distance is 0.2, so they are neighbors under radius 0.3.

    package machinelearning.knn;
    
    /**
     * Recommendation with M-distance.
     * @author Fan Min minfanphd@163.com.
     */
    
    import java.io.*;
    
    public class MBR {
    
    	/**
    	 * Default rating for 1-5 points.
    	 */
    	public static final double DEFAULT_RATING = 3.0;
    
    	/**
    	 * The total number of users.
    	 */
    	private int numUsers;
    
    	/**
    	 * The total number of items.
    	 */
    	private int numItems;
    
    	/**
    	 * The total number of ratings (non-zero values)
    	 */
    	private int numRatings;
    
    	/**
    	 * The predictions.
    	 */
    	private double[] predictions;
    
    	/**
    	 * Compressed rating matrix. User-item-rating triples.
    	 */
    	private int[][] compressedRatingMatrix;
    
    	/**
     	 * The degree of each user (how many items the user has rated).
    	 */
    	private int[] userDegrees;
    
    	/**
    	 * The average rating of the current user.
    	 */
    	private double[] userAverageRatings;
    
    	/**
     	 * The degree of each item (how many users have rated it).
    	 */
    	private int[] itemDegrees;
    
    	/**
    	 * The average rating of the current item.
    	 */
    	private double[] itemAverageRatings;
    
    	/**
     	 * The first user starts from 0. If the first user has x ratings, the
     	 * second user starts from x.
    	 */
    	private int[] userStartingIndices;
    
    	/**
    	 * Number of non-neighbor objects.
    	 */
    	private int numNonNeighbors;
    
    	/**
    	 * The radius (delta) for determining the neighborhood.
    	 */
    	private double radius;
    
    	/**
    	 ************************* 
    	 * Construct the rating matrix.
    	 * 
     	 * @param paraFilename
    	 *            the rating filename.
    	 * @param paraNumUsers
    	 *            number of users
    	 * @param paraNumItems
    	 *            number of items
    	 * @param paraNumRatings
    	 *            number of ratings
    	 ************************* 
    	 */
    	public MBR(String paraFilename, int paraNumUsers, int paraNumItems, int paraNumRatings) throws Exception {
    		// Step 1. Initialize these arrays
    		numItems = paraNumItems;
    		numUsers = paraNumUsers;
    		numRatings = paraNumRatings;
    
    		userDegrees = new int[numUsers];
    		userStartingIndices = new int[numUsers + 1];
    		userAverageRatings = new double[numUsers];
    		itemDegrees = new int[numItems];
    		compressedRatingMatrix = new int[numRatings][3];
    		itemAverageRatings = new double[numItems];
    
    		predictions = new double[numRatings];
    
    		System.out.println("Reading " + paraFilename);
    
    		// Step 2. Read the data file.
    		File tempFile = new File(paraFilename);
    		if (!tempFile.exists()) {
    			System.out.println("File " + paraFilename + " does not exists.");
    			System.exit(0);
    		} // Of if
    		BufferedReader tempBufReader = new BufferedReader(new FileReader(tempFile));
    		String tempString;
    		String[] tempStrArray;
    		int tempIndex = 0;
    		userStartingIndices[0] = 0;
    		userStartingIndices[numUsers] = numRatings;
    		while ((tempString = tempBufReader.readLine()) != null) {
    			// Each line has three values
    			tempStrArray = tempString.split(",");
    			compressedRatingMatrix[tempIndex][0] = Integer.parseInt(tempStrArray[0]);
    			compressedRatingMatrix[tempIndex][1] = Integer.parseInt(tempStrArray[1]);
    			compressedRatingMatrix[tempIndex][2] = Integer.parseInt(tempStrArray[2]);
    
    			userDegrees[compressedRatingMatrix[tempIndex][0]]++;
    			itemDegrees[compressedRatingMatrix[tempIndex][1]]++;
    
    			if (tempIndex > 0) {
    				// Starting to read the data of a new user.
    				if (compressedRatingMatrix[tempIndex][0] != compressedRatingMatrix[tempIndex - 1][0]) {
    					userStartingIndices[compressedRatingMatrix[tempIndex][0]] = tempIndex;
    				} // Of if
    			} // Of if
    			tempIndex++;
    		} // Of while
    		tempBufReader.close();
    
    		double[] tempUserTotalScore = new double[numUsers];
    		double[] tempItemTotalScore = new double[numItems];
    		for (int i = 0; i < numRatings; i++) {
    			tempUserTotalScore[compressedRatingMatrix[i][0]] += compressedRatingMatrix[i][2];
    			tempItemTotalScore[compressedRatingMatrix[i][1]] += compressedRatingMatrix[i][2];
    		} // Of for i
    
    		for (int i = 0; i < numUsers; i++) {
    			userAverageRatings[i] = tempUserTotalScore[i] / userDegrees[i];
    		} // Of for i
    		for (int i = 0; i < numItems; i++) {
    			itemAverageRatings[i] = tempItemTotalScore[i] / itemDegrees[i];
    		} // Of for i
    	}// Of the first constructor
    
    	/**
    	 ************************* 
    	 * Set the radius (delta).
    	 * 
    	 * @param paraRadius
    	 *            The given radius.
    	 ************************* 
    	 */
    	public void setRadius(double paraRadius) {
    		if (paraRadius > 0) {
    			radius = paraRadius;
    		} else {
    			radius = 0.1;
    		} // Of if
    	}// Of setRadius
    
    	/**
    	 ************************* 
    	 * Leave-one-out prediction. The predicted values are stored in predictions.
    	 * 
    	 * @see predictions
    	 ************************* 
    	 */
    	public void leaveOneOutPrediction() {
    		double tempItemAverageRating;
    		// Make each line of the code shorter.
    		int tempUser, tempItem, tempRating;
    		System.out.println("\r\nLeaveOneOutPrediction for radius " + radius);
    
    		numNonNeighbors = 0;
    		for (int i = 0; i < numRatings; i++) {
    			tempUser = compressedRatingMatrix[i][0];
    			tempItem = compressedRatingMatrix[i][1];
    			tempRating = compressedRatingMatrix[i][2];
    
    			// Step 1. Recompute average rating of the current item.
    			tempItemAverageRating = (itemAverageRatings[tempItem] * itemDegrees[tempItem] - tempRating)
    					/ (itemDegrees[tempItem] - 1);
    
    			// Step 2. Recompute neighbors, at the same time obtain the ratings
    			// Of neighbors.
    			int tempNeighbors = 0;
    			double tempTotal = 0;
    			int tempComparedItem;
    			for (int j = userStartingIndices[tempUser]; j < userStartingIndices[tempUser + 1]; j++) {
    				tempComparedItem = compressedRatingMatrix[j][1];
    				if (tempItem == tempComparedItem) {
    					continue;// Ignore itself.
    				} // Of if
    
    				if (Math.abs(tempItemAverageRating - itemAverageRatings[tempComparedItem]) < radius) {
    					tempTotal += compressedRatingMatrix[j][2];
    					tempNeighbors++;
    				} // Of if
    			} // Of for j
    
    			// Step 3. Predict as the average value of neighbors.
    			if (tempNeighbors > 0) {
    				predictions[i] = tempTotal / tempNeighbors;
    			} else {
    				predictions[i] = DEFAULT_RATING;
    				numNonNeighbors++;
    			} // Of if
    		} // Of for i
    	}// Of leaveOneOutPrediction
    
    	/**
    	 ************************* 
    	 * Compute the MAE based on the deviation of each leave-one-out.
    	 * 
    	 * @author Fan Min
    	 ************************* 
    	 */
    	public double computeMAE() throws Exception {
    		double tempTotalError = 0;
    		for (int i = 0; i < predictions.length; i++) {
    			tempTotalError += Math.abs(predictions[i] - compressedRatingMatrix[i][2]);
    		} // Of for i
    
    		return tempTotalError / predictions.length;
    	}// Of computeMAE
    
    	/**
    	 ************************* 
     	 * Compute the RMSE based on the deviation of each leave-one-out.
    	 * 
    	 * @author Fan Min
    	 ************************* 
    	 */
    	public double computeRSME() throws Exception {
    		double tempTotalError = 0;
    		for (int i = 0; i < predictions.length; i++) {
    			tempTotalError += (predictions[i] - compressedRatingMatrix[i][2])
    					* (predictions[i] - compressedRatingMatrix[i][2]);
    		} // Of for i
    
    		double tempAverage = tempTotalError / predictions.length;
    
    		return Math.sqrt(tempAverage);
    	}// Of computeRSME
    
    	/**
    	 ************************* 
    	 * The entrance of the program.
    	 * 
    	 * @param args
    	 *            Not used now.
    	 ************************* 
    	 */
    	public static void main(String[] args) {
    		try {
    			MBR tempRecommender = new MBR("D:/data/movielens-943u1682m.txt", 943, 1682, 100000);
    
    			for (double tempRadius = 0.2; tempRadius < 0.6; tempRadius += 0.1) {
    				tempRecommender.setRadius(tempRadius);
    
    				tempRecommender.leaveOneOutPrediction();
    				double tempMAE = tempRecommender.computeMAE();
    				double tempRSME = tempRecommender.computeRSME();
    
    				System.out.println("Radius = " + tempRadius + ", MAE = " + tempMAE + ", RSME = " + tempRSME
    						+ ", numNonNeighbors = " + tempRecommender.numNonNeighbors);
    			} // Of for tempRadius
    		} catch (Exception ee) {
    			System.out.println(ee);
    		} // Of try
    	}// Of main
    }// Of class MBR
    

     

    Day 55: Recommendation based on M-distance (continued)

    Yesterday's code implements item-based recommendation. Today, implement user-based recommendation yourself. Only additions to the existing code are needed.

    public class MBR {
    
    	/**
     	 * Default rating for 1-5 points.
    	 */
    	public static final double DEFAULT_RATING = 3.0;
    
    	/**
     	 * The total number of users.
    	 */
    	private int numUsers;
    
    	/**
     	 * The total number of items.
    	 */
    	private int numItems;
    
    	/**
     	 * The total number of ratings (non-zero values).
    	 */
    	private int numRatings;
    
    	/**
     	 * The predictions.
    	 */
    	private double[] predictions;
    
    	/**
     	 * Compressed rating matrix. User-item-rating triples.
    	 */
    	private int[][] compressedRatingMatrix;
    
    	/**
     	 * The degree of each user (how many items the user has rated).
    	 */
    	private int[] userDegrees;
    
    	/**
     	 * The average rating of each user.
    	 */
    	private double[] userAverageRatings;
    
    	/**
     	 * The degree of each item (how many users have rated it).
    	 */
    	private int[] itemDegrees;
    
    	/**
     	 * The average rating of each item.
    	 */
    	private double[] itemAverageRatings;
    
    	/**
     	 * The first user starts from 0. If the first user has x ratings, the second user starts from x.
    	 */
    	private int[] userStartingIndices;
    
    	/**
     	 * Number of objects without neighbors.
    	 */
    	private int numNonNeighbors;
    
    	/**
     	 * The radius (delta) for determining the neighborhood.
    	 */
    	private double radius;
    
    	/**
    	 ************************* 
     	 * Construct the rating matrix.
    	 * 
    	 * @param paraRatingFilename
    	 *            the rating filename.
    	 * @param paraNumUsers
    	 *            number of users
    	 * @param paraNumItems
    	 *            number of items
    	 * @param paraNumRatings
    	 *            number of ratings
    	 ************************* 
    	 */
    	public MBR(String paraFilename, int paraNumUsers, int paraNumItems, int paraNumRatings) throws Exception {
     		// Step 1. Initialize the arrays.
    		numItems = paraNumItems;
    		numUsers = paraNumUsers;
    		numRatings = paraNumRatings;
    
    		userDegrees = new int[numUsers];
    		userStartingIndices = new int[numUsers + 1];
    		userAverageRatings = new double[numUsers];
    		itemDegrees = new int[numItems];
    		compressedRatingMatrix = new int[numRatings][3];
    		itemAverageRatings = new double[numItems];
    
    		predictions = new double[numRatings];
    
    		System.out.println("Reading " + paraFilename);
    
     		// Step 2. Read the data file.
    		File tempFile = new File(paraFilename);
    		if (!tempFile.exists()) {
    			System.out.println("File " + paraFilename + " does not exists.");
    			System.exit(0);
    		}
    		BufferedReader tempBufReader = new BufferedReader(new FileReader(tempFile));
    		String tempString;
    		String[] tempStrArray;
    		int tempIndex = 0;
    		userStartingIndices[0] = 0;
    		userStartingIndices[numUsers] = numRatings;
    		while ((tempString = tempBufReader.readLine()) != null) {
     			// Each line has three values.
    			tempStrArray = tempString.split(",");
     			compressedRatingMatrix[tempIndex][0] = Integer.parseInt(tempStrArray[0]);// User
     			compressedRatingMatrix[tempIndex][1] = Integer.parseInt(tempStrArray[1]);// Item
     			compressedRatingMatrix[tempIndex][2] = Integer.parseInt(tempStrArray[2]);// Rating
    
    			userDegrees[compressedRatingMatrix[tempIndex][0]]++;
    			itemDegrees[compressedRatingMatrix[tempIndex][1]]++;
    
    			if (tempIndex > 0) {
     				// Starting to read the data of a new user.
    				if (compressedRatingMatrix[tempIndex][0] != compressedRatingMatrix[tempIndex - 1][0]) {
    					userStartingIndices[compressedRatingMatrix[tempIndex][0]] = tempIndex;
    				}
    			}
    			tempIndex++;
    		}
    		tempBufReader.close();
    
    		double[] tempUserTotalScore = new double[numUsers];
    		double[] tempItemTotalScore = new double[numItems];
    		for (int i = 0; i < numRatings; i++) {
    			tempUserTotalScore[compressedRatingMatrix[i][0]] += compressedRatingMatrix[i][2];
    			tempItemTotalScore[compressedRatingMatrix[i][1]] += compressedRatingMatrix[i][2];
    		}
    
    		for (int i = 0; i < numUsers; i++) {
    			userAverageRatings[i] = tempUserTotalScore[i] / userDegrees[i];
    		}
    		for (int i = 0; i < numItems; i++) {
    			itemAverageRatings[i] = tempItemTotalScore[i] / itemDegrees[i];
    		}
    	}
    
    	/**
    	 ************************* 
     	 * Set the radius (delta).
    	 * 
    	 * @param paraRadius
    	 *            The given radius.
    	 ************************* 
    	 */
    	public void setRadius(double paraRadius) {
    		if (paraRadius > 0) {
    			radius = paraRadius;
    		} else {
    			radius = 0.1;
    		}
    	}
    
    	/**
    	 ************************* 
     	 * Leave-one-out prediction. The predicted values are stored in predictions.
    	 * 
    	 * @see predictions
    	 ************************* 
    	 */
    	public void leaveOneOutPrediction() {
    		double tempUserAverageRating;
    		int tempUser, tempItem, tempRating;
    		System.out.println("\r\nLeaveOneOutPrediction for radius " + radius);
    
    		numNonNeighbors = 0;
    		for (int i = 0; i < numRatings; i++) {
    			tempUser = compressedRatingMatrix[i][0];
    			tempItem = compressedRatingMatrix[i][1];
    			tempRating = compressedRatingMatrix[i][2];
    
     			// Recompute the current user's average rating, excluding the left-out rating.
    			tempUserAverageRating = (userAverageRatings[tempUser] * userDegrees[tempUser] - tempRating)
    					/ (userDegrees[tempUser] - 1);
    
     			// Recompute neighbors: among the other users who rated the
     			// current item, those whose average rating is within the
     			// radius. The compressed matrix is sorted by user, so all
     			// ratings are scanned to find those of the current item.
     			// (The original loop scanned only the current user's own
     			// ratings, so it never found a neighbor.)
     			int tempNeighbors = 0;
     			double tempTotal = 0;
     			int tempComparedUser;
     			for (int j = 0; j < numRatings; j++) {
     				if (compressedRatingMatrix[j][1] != tempItem) {
     					continue;// Only ratings of the current item count.
     				}
     
     				tempComparedUser = compressedRatingMatrix[j][0];
     				if (tempUser == tempComparedUser) {
     					continue;// Ignore the current user.
     				}
     
     				if (Math.abs(tempUserAverageRating - userAverageRatings[tempComparedUser]) < radius) {
     					tempTotal += compressedRatingMatrix[j][2];
     					tempNeighbors++;
     				}
     			}
    
     			// Predict as the average value of neighbors.
    			if (tempNeighbors > 0) {
    				predictions[i] = tempTotal / tempNeighbors;
    			} else {
    				predictions[i] = DEFAULT_RATING;
    				numNonNeighbors++;
    			}
    		}
    	}
    
    	/**
    	 ************************* 
     	 * Compute the MAE based on the deviation of each leave-one-out.
    	 ************************* 
    	 */
    	public double computeMAE() throws Exception {
    		double tempTotalError = 0;
    		for (int i = 0; i < predictions.length; i++) {
    			tempTotalError += Math.abs(predictions[i] - compressedRatingMatrix[i][2]);
    		}
    
    		return tempTotalError / predictions.length;
    	}
    
    	/**
    	 ************************* 
     	 * Compute the RMSE based on the deviation of each leave-one-out.
    	 ************************* 
    	 */
    	public double computeRSME() throws Exception {
    		double tempTotalError = 0;
    		for (int i = 0; i < predictions.length; i++) {
    			tempTotalError += (predictions[i] - compressedRatingMatrix[i][2])
    					* (predictions[i] - compressedRatingMatrix[i][2]);
    		}
    
    		double tempAverage = tempTotalError / predictions.length;
    
    		return Math.sqrt(tempAverage);
    	}
    
    	/**
    	 ************************* 
    	 * The entrance of the program.
    	 * 
    	 * @param args
    	 *            Not used now.
    	 ************************* 
    	 */
    	public static void main(String[] args) {
    		try {
    			MBR tempRecommender = new MBR("D:/data/movielens943u1682m.txt", 10000, 1682, 1000000);
    
    			for (double tempRadius = 0.2; tempRadius < 0.6; tempRadius += 0.1) {
    				tempRecommender.setRadius(tempRadius);
    
    				tempRecommender.leaveOneOutPrediction();
    				double tempMAE = tempRecommender.computeMAE();
    				double tempRSME = tempRecommender.computeRSME();
    
    				System.out.println("Radius = " + tempRadius + ", MAE = " + tempMAE + ", RSME = " + tempRSME
    						+ ", numNonNeighbors = " + tempRecommender.numNonNeighbors);
    			}
    		} catch (Exception ee) {
    			System.out.println(ee);
    		}
    	}
    }
    

    Day 56: kMeans clustering
    kMeans is the most commonly used clustering algorithm.

    kMeans terminates when the centers converge; for laziness, Arrays.equals() is used for the test.
    The dataset is iris, so the last attribute is not used. A dataset without a decision attribute would need corresponding modifications.
    The data are not normalized.
    getRandomIndices() is identical to the kNN version and is copied over. It really belongs in a SimpleTools.java, but the code is short, so it is kept here for independence.
    distance() is similar to the kNN version. Be careful not to use the decision attribute, and note the different parameters: the second one is a real-valued vector, because a center may be virtual, with no actual data point located there.
     

    package machinelearning.kmeans;
    
    import java.io.FileReader;
    import java.util.Arrays;
    import java.util.Random;
    import weka.core.Instances;
    
    /**
     * kMeans clustering.
     * @author Fan Min minfanphd@163.com.
     */
     public class KMeans {
    
    	/**
    	 * Manhattan distance.
    	 */
    	public static final int MANHATTAN = 0;
    
    	/**
    	 * Euclidean distance.
    	 */
    	public static final int EUCLIDEAN = 1;
    
    	/**
    	 * The distance measure.
    	 */
    	public int distanceMeasure = EUCLIDEAN;
    
    	/**
    	 * A random instance;
    	 */
    	public static final Random random = new Random();
    
    	/**
    	 * The data.
    	 */
    	Instances dataset;
    
    	/**
    	 * The number of clusters.
    	 */
    	int numClusters = 2;
    
    	/**
    	 * The clusters.
    	 */
    	int[][] clusters;
    
    	/**
    	 ******************************* 
    	 * The first constructor.
    	 * 
    	 * @param paraFilename
    	 *            The data filename.
    	 ******************************* 
    	 */
    	public KMeans(String paraFilename) {
    		dataset = null;
    		try {
    			FileReader fileReader = new FileReader(paraFilename);
    			dataset = new Instances(fileReader);
    			fileReader.close();
    		} catch (Exception ee) {
    			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
    			System.exit(0);
    		} // Of try
    	}// Of the first constructor
    
    	/**
    	 ******************************* 
    	 * A setter.
    	 ******************************* 
    	 */
    	public void setNumClusters(int paraNumClusters) {
    		numClusters = paraNumClusters;
    	}// Of the setter
    
    	/**
    	 *********************
    	 * Get random indices for data randomization.
    	 * 
    	 * @param paraLength
    	 *            The length of the sequence.
    	 * @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
    	 *********************
    	 */
    	public static int[] getRandomIndices(int paraLength) {
    		int[] resultIndices = new int[paraLength];
    
    		// Step 1. Initialize.
    		for (int i = 0; i < paraLength; i++) {
    			resultIndices[i] = i;
    		} // Of for i
    
    		// Step 2. Randomly swap.
    		int tempFirst, tempSecond, tempValue;
    		for (int i = 0; i < paraLength; i++) {
    			// Generate two random indices.
    			tempFirst = random.nextInt(paraLength);
    			tempSecond = random.nextInt(paraLength);
    
    			// Swap.
    			tempValue = resultIndices[tempFirst];
    			resultIndices[tempFirst] = resultIndices[tempSecond];
    			resultIndices[tempSecond] = tempValue;
    		} // Of for i
    
    		return resultIndices;
    	}// Of getRandomIndices
    
    	/**
    	 *********************
    	 * The distance between two instances.
    	 * 
    	 * @param paraI
    	 *            The index of the first instance.
    	 * @param paraArray
    	 *            The array representing a point in the space.
    	 * @return The distance.
    	 *********************
    	 */
    	public double distance(int paraI, double[] paraArray) {
    		double resultDistance = 0;
    		double tempDifference;
    		switch (distanceMeasure) {
    		case MANHATTAN:
    			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
    				tempDifference = dataset.instance(paraI).value(i) - paraArray[i];
    				if (tempDifference < 0) {
    					resultDistance -= tempDifference;
    				} else {
    					resultDistance += tempDifference;
    				} // Of if
    			} // Of for i
    			break;
    
    		case EUCLIDEAN:
    			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
    				tempDifference = dataset.instance(paraI).value(i) - paraArray[i];
    				resultDistance += tempDifference * tempDifference;
    			} // Of for i
    			break;
    		default:
    			System.out.println("Unsupported distance measure: " + distanceMeasure);
    		}// Of switch
    
    		return resultDistance;
    	}// Of distance
    
    	/**
    	 ******************************* 
    	 * Clustering.
    	 ******************************* 
    	 */
    	public void clustering() {
    		int[] tempOldClusterArray = new int[dataset.numInstances()];
    		tempOldClusterArray[0] = -1;
    		int[] tempClusterArray = new int[dataset.numInstances()];
    		Arrays.fill(tempClusterArray, 0);
    		double[][] tempCenters = new double[numClusters][dataset.numAttributes() - 1];
    
    		// Step 1. Initialize centers.
    		int[] tempRandomOrders = getRandomIndices(dataset.numInstances());
    		for (int i = 0; i < numClusters; i++) {
    			for (int j = 0; j < tempCenters[0].length; j++) {
    				tempCenters[i][j] = dataset.instance(tempRandomOrders[i]).value(j);
    			} // Of for j
    		} // Of for i
    
    		int[] tempClusterLengths = null;
    		while (!Arrays.equals(tempOldClusterArray, tempClusterArray)) {
    			System.out.println("New loop ...");
    			tempOldClusterArray = tempClusterArray;
    			tempClusterArray = new int[dataset.numInstances()];
    
    			// Step 2.1 Minimization. Assign cluster to each instance.
    			int tempNearestCenter;
    			double tempNearestDistance;
    			double tempDistance;
    
    			for (int i = 0; i < dataset.numInstances(); i++) {
    				tempNearestCenter = -1;
    				tempNearestDistance = Double.MAX_VALUE;
    
    				for (int j = 0; j < numClusters; j++) {
    					tempDistance = distance(i, tempCenters[j]);
    					if (tempNearestDistance > tempDistance) {
    						tempNearestDistance = tempDistance;
    						tempNearestCenter = j;
    					} // Of if
    				} // Of for j
    				tempClusterArray[i] = tempNearestCenter;
    			} // Of for i
    
    			// Step 2.2 Mean. Find new centers.
    			tempClusterLengths = new int[numClusters];
    			Arrays.fill(tempClusterLengths, 0);
    			double[][] tempNewCenters = new double[numClusters][dataset.numAttributes() - 1];
    			// Arrays.fill(tempNewCenters, 0);
    			for (int i = 0; i < dataset.numInstances(); i++) {
    				for (int j = 0; j < tempNewCenters[0].length; j++) {
    					tempNewCenters[tempClusterArray[i]][j] += dataset.instance(i).value(j);
    				} // Of for j
    				tempClusterLengths[tempClusterArray[i]]++;
    			} // Of for i
    
    			// Step 2.3 Now average
    			for (int i = 0; i < tempNewCenters.length; i++) {
    				for (int j = 0; j < tempNewCenters[0].length; j++) {
    					tempNewCenters[i][j] /= tempClusterLengths[i];
    				} // Of for j
    			} // Of for i
    
    			System.out.println("Now the new centers are: " + Arrays.deepToString(tempNewCenters));
    			tempCenters = tempNewCenters;
    		} // Of while
    
    		// Step 3. Form clusters.
    		clusters = new int[numClusters][];
    		int[] tempCounters = new int[numClusters];
    		for (int i = 0; i < numClusters; i++) {
    			clusters[i] = new int[tempClusterLengths[i]];
    		} // Of for i
    
    		for (int i = 0; i < tempClusterArray.length; i++) {
    			clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
    			tempCounters[tempClusterArray[i]]++;
    		} // Of for i
    
    		System.out.println("The clusters are: " + Arrays.deepToString(clusters));
    	}// Of clustering
    
    	/**
    	 ******************************* 
    	 * Test clustering.
    	 ******************************* 
    	 */
    	public static void testClustering() {
    		KMeans tempKMeans = new KMeans("D:/data/iris.arff");
    		tempKMeans.setNumClusters(3);
    		tempKMeans.clustering();
    	}// Of testClustering
    
    	/**
    	 ************************* 
    	 * A testing method.
    	 ************************* 
    	 */
    	public static void main(String args[]) {
    		testClustering();
    	}// Of main
    
    }// Of class KMeans
    

    New loop ...
    Now the new centers are: [[6.017142857142856, 2.7971428571428567, 4.545714285714286, 1.5214285714285716], [6.964285714285715, 3.089285714285714, 5.932142857142857, 2.107142857142857], [5.005769230769231, 3.3807692307692316, 1.5288461538461537, 0.2749999999999999]]
    New loop ...
    Now the new centers are: [[6.022666666666666, 2.804, 4.544, 1.5333333333333332], [6.980000000000001, 3.0759999999999996, 5.991999999999998, 2.1039999999999996], [5.005999999999999, 3.4180000000000006, 1.464, 0.2439999999999999]]
    New loop ...
    Now the new centers are: [[6.022666666666666, 2.804, 4.544, 1.5333333333333332], [6.980000000000001, 3.0759999999999996, 5.991999999999998, 2.1039999999999996], [5.005999999999999, 3.4180000000000006, 1.464, 0.2439999999999999]]
    The clusters are: [[50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 110, 111, 112, 113, 114, 115, 116, 119, 121, 123, 126, 127, 133, 137, 138, 139, 141, 142, 145, 146, 147, 148, 149], [100, 102, 103, 104, 105, 107, 108, 109, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132, 134, 135, 136, 140, 143, 144], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]]
    
    

    Day 57: kMeans clustering (continued)

    After obtaining the virtual centers, replace each one with its nearest actual point, then cluster again (a sketch follows).
    Today is mainly about controlling the pace; after all, kMeans deserves two days of work.
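
    No code was given for this day. Below is a minimal sketch under the assumption that it is added to the KMeans class above: it reuses the existing dataset field and the distance(int, double[]) method, and the method name snapCentersToInstances is my own. clustering() would also need a small refactor to accept initial centers before it can re-cluster with the snapped centers.

    	/**
    	 * Replace each virtual center by the attribute vector of its nearest
    	 * actual instance. A sketch for the Day 57 task.
    	 */
    	public double[][] snapCentersToInstances(double[][] paraCenters) {
    		double[][] resultCenters = new double[paraCenters.length][dataset.numAttributes() - 1];
    		for (int i = 0; i < paraCenters.length; i++) {
    			// Find the instance nearest to the i-th virtual center.
    			int tempNearestIndex = 0;
    			double tempNearestDistance = Double.MAX_VALUE;
    			for (int j = 0; j < dataset.numInstances(); j++) {
    				double tempDistance = distance(j, paraCenters[i]);
    				if (tempDistance < tempNearestDistance) {
    					tempNearestDistance = tempDistance;
    					tempNearestIndex = j;
    				} // Of if
    			} // Of for j

    			// Copy its attribute values as the new, actual center.
    			for (int j = 0; j < resultCenters[0].length; j++) {
    				resultCenters[i][j] = dataset.instance(tempNearestIndex).value(j);
    			} // Of for j
    		} // Of for i

    		return resultCenters;
    	}// Of snapCentersToInstances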

     

    Day 58: the NB algorithm for nominal data
    Naive Bayes is an algorithm derived from the posterior probability formula. It relies on an independence assumption that looks unreliable mathematically, but it works well in machine learning practice.

    All of the code is listed today, but only classification of nominal data is studied today. You may copy just the methods related to nominal data (start from main() and copy selectively), and copy the numerical part tomorrow.
    You must work through a small example of your own (e.g., 10 objects, 3 condition attributes, 2 classes) to aid understanding.
    Consult the relevant background material as needed.
    You need to understand the meaning of each dimension of the three-dimensional array: the conditional probabilities for all classes over all attributes on all values. Note that the array is ragged; for instance, different attributes may have different numbers of values (see the sketch after this list).
    The same data are used here for both training and testing. To split training and testing sets, refer to the kNN code.
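
    A minimal standalone sketch of the ragged three-dimensional array; the class and value counts below are made up for illustration:

    public class RaggedArrayDemo {
    	public static void main(String[] args) {
    		// Suppose 2 classes and 3 condition attributes with 3, 2 and 4 values.
    		int tempNumClasses = 2;
    		int[] tempNumValues = { 3, 2, 4 };
    		double[][][] tempProbabilities = new double[tempNumClasses][tempNumValues.length][];
    		for (int i = 0; i < tempNumClasses; i++) {
    			for (int j = 0; j < tempNumValues.length; j++) {
    				// Rows have different lengths: the array is ragged.
    				tempProbabilities[i][j] = new double[tempNumValues[j]];
    			} // Of for j
    		} // Of for i

    		// tempProbabilities[i][j][k] will store P(attribute j = value k | class i).
    		System.out.println("Row lengths for class 0: " + tempProbabilities[0][0].length
    				+ ", " + tempProbabilities[0][1].length + ", " + tempProbabilities[0][2].length);
    	}// Of main
    }// Of class RaggedArrayDemo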
     

    package machinelearning.bayes;
    
    import java.io.FileReader;
    import java.util.Arrays;
    
    import weka.core.*;
    
    /**
     * The Naive Bayes algorithm.
     * 
     * @author Fan Min minfanphd@163.com.
     */
    
    public class NaiveBayes {
    	/**
    	 ************************* 
    	 * An inner class to store parameters.
    	 ************************* 
    	 */
    	private class GaussianParamters {
    		double mu;
    		double sigma;
    
    		public GaussianParamters(double paraMu, double paraSigma) {
    			mu = paraMu;
    			sigma = paraSigma;
    		}// Of the constructor
    
    		public String toString() {
    			return "(" + mu + ", " + sigma + ")";
    		}// Of toString
    	}// Of GaussianParamters
    
    	/**
    	 * The data.
    	 */
    	Instances dataset;
    
    	/**
    	 * The number of classes. For binary classification it is 2.
    	 */
    	int numClasses;
    
    	/**
    	 * The number of instances.
    	 */
    	int numInstances;
    
    	/**
    	 * The number of conditional attributes.
    	 */
    	int numConditions;
    
    	/**
    	 * The prediction, including queried and predicted labels.
    	 */
    	int[] predicts;
    
    	/**
    	 * Class distribution.
    	 */
    	double[] classDistribution;
    
    	/**
    	 * Class distribution with Laplacian smooth.
    	 */
    	double[] classDistributionLaplacian;
    
    	/**
    	 * The conditional probabilities for all classes over all attributes on all
    	 * values.
    	 */
    	double[][][] conditionalProbabilities;
    
    	/**
    	 * The conditional probabilities with Laplacian smooth.
    	 */
    	double[][][] conditionalProbabilitiesLaplacian;
    
    	/**
     	 * The Gaussian parameters.
    	 */
    	GaussianParamters[][] gaussianParameters;
    
    	/**
    	 * Data type.
    	 */
    	int dataType;
    
    	/**
    	 * Nominal.
    	 */
    	public static final int NOMINAL = 0;
    
    	/**
    	 * Numerical.
    	 */
    	public static final int NUMERICAL = 1;
    
    	/**
    	 ********************
    	 * The constructor.
    	 * 
    	 * @param paraFilename
    	 *            The given file.
    	 ********************
    	 */
    	public NaiveBayes(String paraFilename) {
    		dataset = null;
    		try {
    			FileReader fileReader = new FileReader(paraFilename);
    			dataset = new Instances(fileReader);
    			fileReader.close();
    		} catch (Exception ee) {
    			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
    			System.exit(0);
    		} // Of try
    
    		dataset.setClassIndex(dataset.numAttributes() - 1);
    		numConditions = dataset.numAttributes() - 1;
    		numInstances = dataset.numInstances();
    		numClasses = dataset.attribute(numConditions).numValues();
    	}// Of the constructor
    
    	/**
    	 ********************
    	 * Set the data type.
    	 ********************
    	 */
    	public void setDataType(int paraDataType) {
    		dataType = paraDataType;
    	}// Of setDataType
    
    	/**
    	 ********************
    	 * Calculate the class distribution with Laplacian smooth.
    	 ********************
    	 */
    	public void calculateClassDistribution() {
    		classDistribution = new double[numClasses];
    		classDistributionLaplacian = new double[numClasses];
    
    		double[] tempCounts = new double[numClasses];
    		for (int i = 0; i < numInstances; i++) {
    			int tempClassValue = (int) dataset.instance(i).classValue();
    			tempCounts[tempClassValue]++;
    		} // Of for i
    
    		for (int i = 0; i < numClasses; i++) {
    			classDistribution[i] = tempCounts[i] / numInstances;
    			classDistributionLaplacian[i] = (tempCounts[i] + 1) / (numInstances + numClasses);
    		} // Of for i
    
    		System.out.println("Class distribution: " + Arrays.toString(classDistribution));
    		System.out.println(
    				"Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
    	}// Of calculateClassDistribution
    
    	/**
    	 ********************
     	 * Calculate the conditional probabilities with Laplacian smooth. ONLY scan
     	 * the dataset once. A simpler implementation existed, but it was removed
     	 * because its time complexity was higher.
    	 ********************
    	 */
    	public void calculateConditionalProbabilities() {
    		conditionalProbabilities = new double[numClasses][numConditions][];
    		conditionalProbabilitiesLaplacian = new double[numClasses][numConditions][];
    
    		// Allocate space
    		for (int i = 0; i < numClasses; i++) {
    			for (int j = 0; j < numConditions; j++) {
    				int tempNumValues = (int) dataset.attribute(j).numValues();
    				conditionalProbabilities[i][j] = new double[tempNumValues];
    				conditionalProbabilitiesLaplacian[i][j] = new double[tempNumValues];
    			} // Of for j
    		} // Of for i
    
    		// Count the numbers
    		int[] tempClassCounts = new int[numClasses];
    		for (int i = 0; i < numInstances; i++) {
    			int tempClass = (int) dataset.instance(i).classValue();
    			tempClassCounts[tempClass]++;
    			for (int j = 0; j < numConditions; j++) {
    				int tempValue = (int) dataset.instance(i).value(j);
    				conditionalProbabilities[tempClass][j][tempValue]++;
    			} // Of for j
    		} // Of for i
    
    		// Now for the real probability with Laplacian
    		for (int i = 0; i < numClasses; i++) {
    			for (int j = 0; j < numConditions; j++) {
    				int tempNumValues = (int) dataset.attribute(j).numValues();
    				for (int k = 0; k < tempNumValues; k++) {
     					// Laplacian smoothing: the denominator adds the number of
     					// values of attribute j, not the number of classes.
     					conditionalProbabilitiesLaplacian[i][j][k] = (conditionalProbabilities[i][j][k]
     							+ 1) / (tempClassCounts[i] + tempNumValues);
    				} // Of for k
    			} // Of for j
    		} // Of for i
    
    		System.out.println(Arrays.deepToString(conditionalProbabilities));
    	}// Of calculateConditionalProbabilities
    
    	/**
    	 ********************
     	 * Calculate the Gaussian parameters (mu and sigma) for each class-attribute pair.
    	 ********************
    	 */
    	public void calculateGausssianParameters() {
    		gaussianParameters = new GaussianParamters[numClasses][numConditions];
    
    		double[] tempValuesArray = new double[numInstances];
    		int tempNumValues = 0;
    		double tempSum = 0;
    
    		for (int i = 0; i < numClasses; i++) {
    			for (int j = 0; j < numConditions; j++) {
    				tempSum = 0;
    
    				// Obtain values for this class.
    				tempNumValues = 0;
    				for (int k = 0; k < numInstances; k++) {
    					if ((int) dataset.instance(k).classValue() != i) {
    						continue;
    					} // Of if
    
    					tempValuesArray[tempNumValues] = dataset.instance(k).value(j);
    					tempSum += tempValuesArray[tempNumValues];
    					tempNumValues++;
    				} // Of for k
    
    				// Obtain parameters.
    				double tempMu = tempSum / tempNumValues;
    
    				double tempSigma = 0;
    				for (int k = 0; k < tempNumValues; k++) {
    					tempSigma += (tempValuesArray[k] - tempMu) * (tempValuesArray[k] - tempMu);
    				} // Of for k
    				tempSigma /= tempNumValues;
    				tempSigma = Math.sqrt(tempSigma);
    
    				gaussianParameters[i][j] = new GaussianParamters(tempMu, tempSigma);
    			} // Of for j
    		} // Of for i
    
    		System.out.println(Arrays.deepToString(gaussianParameters));
    	}// Of calculateGausssianParameters
    
    	/**
    	 ********************
    	 * Classify all instances, the results are stored in predicts[].
    	 ********************
    	 */
    	public void classify() {
    		predicts = new int[numInstances];
    		for (int i = 0; i < numInstances; i++) {
    			predicts[i] = classify(dataset.instance(i));
    		} // Of for i
    	}// Of classify
    
    	/**
    	 ********************
     	 * Classify an instance.
    	 ********************
    	 */
    	public int classify(Instance paraInstance) {
    		if (dataType == NOMINAL) {
    			return classifyNominal(paraInstance);
    		} else if (dataType == NUMERICAL) {
    			return classifyNumerical(paraInstance);
    		} // Of if
    
    		return -1;
    	}// Of classify
    
    	/**
    	 ********************
     	 * Classify an instance with nominal data.
    	 ********************
    	 */
    	public int classifyNominal(Instance paraInstance) {
    		// Find the biggest one
    		double tempBiggest = -10000;
    		int resultBestIndex = 0;
    		for (int i = 0; i < numClasses; i++) {
    			double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
    			for (int j = 0; j < numConditions; j++) {
    				int tempAttributeValue = (int) paraInstance.value(j);
    
    				// Laplacian smooth.
     				tempPseudoProbability += Math
     						.log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
    			} // Of for j
    
    			if (tempBiggest < tempPseudoProbability) {
    				tempBiggest = tempPseudoProbability;
    				resultBestIndex = i;
    			} // Of if
    		} // Of for i
    
    		return resultBestIndex;
    	}// Of classifyNominal
    
    	/**
    	 ********************
     	 * Classify an instance with numerical data.
    	 ********************
    	 */
    	public int classifyNumerical(Instance paraInstance) {
    		// Find the biggest one
    		double tempBiggest = -10000;
    		int resultBestIndex = 0;
    
    		for (int i = 0; i < numClasses; i++) {
    			double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
    			for (int j = 0; j < numConditions; j++) {
    				double tempAttributeValue = paraInstance.value(j);
    				double tempSigma = gaussianParameters[i][j].sigma;
    				double tempMu = gaussianParameters[i][j].mu;
    
    				tempPseudoProbability += -Math.log(tempSigma) - (tempAttributeValue - tempMu)
    						* (tempAttributeValue - tempMu) / (2 * tempSigma * tempSigma);
    			} // Of for j
    
    			if (tempBiggest < tempPseudoProbability) {
    				tempBiggest = tempPseudoProbability;
    				resultBestIndex = i;
    			} // Of if
    		} // Of for i
    
    		return resultBestIndex;
    	}// Of classifyNumerical
    
    	/**
    	 ********************
    	 * Compute accuracy.
    	 ********************
    	 */
    	public double computeAccuracy() {
    		double tempCorrect = 0;
    		for (int i = 0; i < numInstances; i++) {
    			if (predicts[i] == (int) dataset.instance(i).classValue()) {
    				tempCorrect++;
    			} // Of if
    		} // Of for i
    
    		double resultAccuracy = tempCorrect / numInstances;
    		return resultAccuracy;
    	}// Of computeAccuracy
    
    	/**
    	 ************************* 
    	 * Test nominal data.
    	 ************************* 
    	 */
    	public static void testNominal() {
    		System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
    		String tempFilename = "D:/data/mushroom.arff";
    
    		NaiveBayes tempLearner = new NaiveBayes(tempFilename);
    		tempLearner.setDataType(NOMINAL);
    		tempLearner.calculateClassDistribution();
    		tempLearner.calculateConditionalProbabilities();
    		tempLearner.classify();
    
    		System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
    	}// Of testNominal
    
    	/**
    	 ************************* 
    	 * Test numerical data.
    	 ************************* 
    	 */
    	public static void testNumerical() {
    		System.out.println(
    				"Hello, Naive Bayes. I only want to test the numerical data with Gaussian assumption.");
    		String tempFilename = "D:/data/iris.arff";
    
    		NaiveBayes tempLearner = new NaiveBayes(tempFilename);
    		tempLearner.setDataType(NUMERICAL);
    		tempLearner.calculateClassDistribution();
    		tempLearner.calculateGausssianParameters();
    		tempLearner.classify();
    
    		System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
     	}// Of testNumerical
    
    	/**
    	 ************************* 
    	 * Test this class.
    	 * 
    	 * @param args
    	 *            Not used now.
    	 ************************* 
    	 */
    	public static void main(String[] args) {
    		testNominal();
    		testNumerical();
    	}// Of main
    
    }// Of class NaiveBayes
    

    Day 59: the NB algorithm for numerical data

    Today the code for handling numerical data is added.
    Assume that the values of every attribute follow a Gaussian distribution. Other assumptions are also possible.
    The probability density is used directly in place of a probability in the Bayes formula (a short derivation follows).
    As can be seen, handling numerical data is no more complex than handling nominal data.
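
    Why classifyNumerical computes -log(sigma) - (x - mu)^2 / (2 * sigma^2): take the logarithm of the Bayes decision rule and drop the class-independent constant. In LaTeX notation:

    $$\hat{c} = \arg\max_c P(c) \prod_j p(x_j \mid c) = \arg\max_c \Big[ \log P(c) + \sum_j \log p(x_j \mid c) \Big].$$

    With the Gaussian density $p(x_j \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{cj}} \exp\big( -\frac{(x_j - \mu_{cj})^2}{2\sigma_{cj}^2} \big)$,

    $$\log p(x_j \mid c) = -\log \sigma_{cj} - \frac{(x_j - \mu_{cj})^2}{2\sigma_{cj}^2} - \tfrac{1}{2} \log 2\pi,$$

    and the last term is identical for every class, so it can be dropped from the comparison.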

/**
 ********************
 * Classify an instance with numerical data.
 ********************
 */
public int classifyNumerical(Instance paraInstance) {
	// Find the biggest one
	double tempBiggest = -10000;
	int resultBestIndex = 0;

	for (int i = 0; i < numClasses; i++) {
		double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
		for (int j = 0; j < numConditions; j++) {
			double tempAttributeValue = paraInstance.value(j);
			double tempSigma = gaussianParameters[i][j].sigma;
			double tempMu = gaussianParameters[i][j].mu;

			tempPseudoProbability += -Math.log(tempSigma) - (tempAttributeValue - tempMu)
					* (tempAttributeValue - tempMu) / (2 * tempSigma * tempSigma);
		} // Of for j

		if (tempBiggest < tempPseudoProbability) {
			tempBiggest = tempPseudoProbability;
			resultBestIndex = i;
		} // Of if
	} // Of for i

	return resultBestIndex;
}// Of classifyNumerical

/**
 ************************* 
 * Test numerical data.
 ************************* 
 */
public static void testNumerical() {
	System.out.println(
			"Hello, Naive Bayes. I only want to test the numerical data with Gaussian assumption.");
	String tempFilename = "D:/data/iris.arff";

	NaiveBayes tempLearner = new NaiveBayes(tempFilename);
	tempLearner.setDataType(NUMERICAL);
	tempLearner.calculateClassDistribution();
	tempLearner.calculateGausssianParameters();
	tempLearner.classify();

	System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
}// Of testNumerical

public static void main(String[] args) {
	testNominal();
	testNumerical();
}// Of main

 Day 60: Summary

1. kNN, i.e., K-Nearest Neighbor, is one of the simplest classification methods in data mining. To decide which known class a test sample belongs to, find the K samples nearest to (most similar to) it among all samples, and assign the test sample to the class that occurs most often among those K. How do we measure the similarity of two samples? It can be defined with the p-norm of a vector; a sketch follows.
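
A minimal sketch of the p-norm (Minkowski) distance in the style of the kNN code above. The method minkowskiDistance is my own addition, assumed to live in KnnClassification; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.

	/**
	 * Minkowski (p-norm) distance between two instances.
	 * 
	 * @param paraI The index of the first instance.
	 * @param paraJ The index of the second instance.
	 * @param paraP The order of the norm, p >= 1.
	 * @return The distance.
	 */
	public double minkowskiDistance(int paraI, int paraJ, double paraP) {
		double tempSum = 0;
		for (int i = 0; i < dataset.numAttributes() - 1; i++) {
			double tempDifference = Math.abs(
					dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i));
			tempSum += Math.pow(tempDifference, paraP);
		} // Of for i

		return Math.pow(tempSum, 1 / paraP);
	}// Of minkowskiDistance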

2. Machine learning is mainly divided into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

3. Supervised learning:
In: labeled data
Out: feedback available
Goal: predict outcomes
Example: learning to recognize characters
Algorithms: classification (categories), regression (numbers)

4. Unsupervised learning:
In: unlabeled data
Out: no feedback
Goal: discover latent structure
Example: automatic clustering
Algorithms: clustering, dimensionality reduction

5. Semi-supervised learning:
Known: the training data and the classes to be assigned
Unknown: training samples may or may not be labeled
Application: when labeled training data is limited and supervised learning alone
cannot meet the requirement, it is used to enhance the effect.

6. Reinforcement learning:
In: a decision process and a reward system
Out: a sequence of actions
Goal: maximize long-term benefit; the reward function is a delayed signal that only indicates whether you are moving toward the goal
Example: learning to play chess
Algorithms: Markov decision processes, dynamic programming

7. kMeans, also known as the k-means algorithm: k is the number of clusters, and "means" refers to taking the mean of the values in each cluster as that cluster's center, or centroid; that is, each cluster is described by its centroid.

8. The four main points in implementing kMeans:
          (1) choosing the number of clusters k;
          (2) computing the distance from each sample point to the cluster centers;
          (3) updating the cluster centers according to the new partition;
          (4) repeating steps 2 and 3 until the cluster centers no longer move.
Pros and cons: it is easy to implement, but it may converge to a local minimum
and converges slowly on large-scale data.

9. A clustering algorithm automatically partitions unlabeled data into several classes. It is an unsupervised learning method and must ensure that data in the same class share similar features.

10. This covers the basic content of machine learning; a deeper understanding is still needed.
