Supervised learning: learn a function (model parameters) from a given training set, then use that function to predict the result when new data arrive. The training set must contain both inputs and outputs, i.e., features and targets, and the targets are labeled by humans.
PCA and many deep learning algorithms are unsupervised learning.
kNN (k-nearest neighbors):
It is a supervised learning algorithm. The idea: a sample is most similar to the k samples in the dataset closest to it, and if the majority of those k samples belong to one class, the sample is assigned to that class as well.
(Figure omitted: for some reason it would not upload.)
So to decide which class an instance belongs to, we rely on the training set. For example, if class A appears most often among the nearest training instances, we guess the instance belongs to class A (my own understanding).
The following explanation of how to choose k is taken from another blogger:
If k is small, prediction uses a small neighborhood of training instances. The approximation error decreases, because only training instances close to the input affect the prediction; but the drawback is that the estimation error increases, making the prediction highly sensitive to the nearby points. If a neighboring point happens to be noise, the prediction will be wrong. In other words, a smaller k means a more complex overall model, with fuzzier decision boundaries, which is prone to overfitting.
If k is large, prediction uses a large neighborhood. The advantage is a smaller estimation error, but the approximation error grows, so predictions for individual inputs become less accurate. A larger k means a simpler overall model.
**Approximation error:** roughly, the training error on the existing training set.
**Estimation error:** roughly, the test error on the test set.
Approximation error concerns the training set: a small k tends to overfit, predicting the existing training set very well but deviating badly on unseen test samples; the model itself is then not close to the best model.
Estimation error concerns the test set: a small estimation error means good predictive power on unseen data, i.e., the model is close to the best model.
In practice, k is usually set to a fairly small number, and cross-validation is typically used to pick the best k.
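To make the cross-validation idea concrete, here is a minimal leave-one-out sketch on made-up 1-D data (a toy helper of my own, unrelated to the Weka code below):

```java
import java.util.*;

public class ChooseK {
    // Leave-one-out accuracy of kNN with the given k on 1-D data.
    static double looAccuracy(double[] x, int[] y, int k) {
        int correct = 0;
        for (int i = 0; i < x.length; i++) {
            // Sort all indices by distance to the held-out point x[i].
            Integer[] idx = new Integer[x.length];
            for (int j = 0; j < x.length; j++) idx[j] = j;
            final int q = i;
            Arrays.sort(idx, (a, b) -> Double.compare(
                Math.abs(x[a] - x[q]), Math.abs(x[b] - x[q])));
            // Vote among the k nearest neighbors, skipping the point itself.
            int votes0 = 0, votes1 = 0, taken = 0;
            for (int j = 0; j < x.length && taken < k; j++) {
                if (idx[j] == i) continue;
                if (y[idx[j]] == 0) votes0++; else votes1++;
                taken++;
            }
            if ((votes0 > votes1 ? 0 : 1) == y[i]) correct++;
        }
        return (double) correct / x.length;
    }

    public static void main(String[] args) {
        double[] x = {0.0, 0.1, 0.2, 0.9, 1.0, 1.1};
        int[] y = {0, 0, 0, 1, 1, 1};
        int bestK = 1;
        double bestAcc = -1;
        for (int k : new int[] {1, 3, 5}) {
            double acc = looAccuracy(x, y, k);
            System.out.println("k=" + k + " accuracy=" + acc);
            if (acc > bestAcc) { bestAcc = acc; bestK = k; }
        }
        System.out.println("best k = " + bestK);
    }
}
```

With these six points, k = 5 forces every vote to include the whole opposite cluster, so its accuracy collapses; a small k wins, matching the discussion above.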
Procedure:
1) Compute the distance between every point in the labeled dataset and the current point
2) Sort by distance in ascending order
3) Take the k points closest to the current point
4) Count the frequency of each class among those k points
5) Return the most frequent class among the k points as the prediction for the current point
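The five steps can be sketched in plain Java on toy 2-D points (made-up data; the Weka-based version follows below):

```java
import java.util.*;

public class TinyKnn {
    // Steps 1-5: compute distances, sort, take k nearest, count labels, return the majority.
    static int classify(double[][] train, int[] labels, double[] query, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < train.length; i++) order[i] = i;
        // Steps 1 + 2: sort indices by Euclidean distance to the query point.
        Arrays.sort(order, Comparator.comparingDouble(
            (Integer i) -> Math.hypot(train[i][0] - query[0], train[i][1] - query[1])));
        // Steps 3 + 4: tally the labels of the k nearest points.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[order[i]], 1, Integer::sum);
        // Step 5: the most frequent label wins.
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[][] train = {{0, 0}, {0, 1}, {1, 0}, {5, 5}, {5, 6}, {6, 5}};
        int[] labels = {0, 0, 0, 1, 1, 1};
        System.out.println(classify(train, labels, new double[]{0.5, 0.5}, 3)); // near the first cluster
    }
}
```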
In one sentence: a k that is too large or too small is bad. As k grows there is a sweet spot at which both training and test errors are fairly small, much like finding the extremum of a quadratic function.
The code below originally came from my teacher. By now I understand each method and parameter (some parameters in that machine-learning extended jar are still unclear, but I can roughly guess what they mean).
package knn;
import java.io.FileReader;
import java.util.Arrays;
import java.util.Random;
import weka.core.*;
/**
* kNN classification.
*
* @author Fan Min minfanphd@163.com.
*/
public class KnnClassification {
/**
* Manhattan distance.
*/
public static final int MANHATTAN = 0;// Manhattan distance
/**
* Euclidean distance.
*/
public static final int EUCLIDEAN = 1;// Euclidean distance
/**
* The distance measure.
*/
public int distanceMeasure = EUCLIDEAN;// Selects one of the two distance measures above.
/**
* A random number generator.
*/
public static final Random random = new Random();
/**
* The number of neighbors.
*/
int numNeighbors = 7;
/**
* The whole dataset.
*/
Instances dataset;
/**
* The training set. Represented by the indices of the data.
*/
int[] trainingSet;
/**
* The testing set. Represented by the indices of the data.
*/
int[] testingSet;
/**
* The predictions.
*/
int[] predictions;
/**
 * Set the distance measure. (Added setDistanceMeasure() method.)
 */
public void setDistanceMeasure(int n) {
switch (n) {
case EUCLIDEAN:
distanceMeasure = EUCLIDEAN;
break;
case MANHATTAN:
distanceMeasure = MANHATTAN;
break;
default:
// Leave the current measure unchanged instead of storing an invalid value,
// which would break distance() later.
System.out.println("Unsupported distance measure: " + n);
}
}
/**
 * Set the number of neighbors. (Added setNumNeighbors() method.)
 */
public void setNumNeighbors(int n) {
numNeighbors = n;
}
/**
 *********************
 * The first constructor.
 *
 * @param paraFilename
 *            The arff filename.
 *********************
 */
public KnnClassification(String paraFilename) {
try {
FileReader fileReader = new FileReader(paraFilename);
dataset = new Instances(fileReader);
// The last attribute is the decision class.
dataset.setClassIndex(dataset.numAttributes() - 1);
fileReader.close();
} catch (Exception ee) {
System.out.println("Error occurred while trying to read \'" + paraFilename
+ "\' in KnnClassification constructor.\r\n" + ee);
System.exit(0);
} // Of try
}// Of the first constructor
/**
*********************
* Get random indices for data randomization.
*
* @param paraLength
* The length of the sequence.
* @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
*********************
*/
public static int[] getRandomIndices(int paraLength) {// Shuffling the indices keeps the later training/testing split from being biased by the order of the arff file.
int[] resultIndices = new int[paraLength];
// Step 1. Initialize.
for (int i = 0; i < paraLength; i++) {
resultIndices[i] = i;
} // Of for i
// Step 2. Randomly swap.
int tempFirst, tempSecond, tempValue;
for (int i = 0; i < paraLength; i++) {
// Generate two random indices.
tempFirst = random.nextInt(paraLength);
tempSecond = random.nextInt(paraLength);
// Swap.
tempValue = resultIndices[tempFirst];
resultIndices[tempFirst] = resultIndices[tempSecond];
resultIndices[tempSecond] = tempValue;
} // Of for i
return resultIndices;
}// Of getRandomIndices
/**
*********************
* Split the data into training and testing parts.
*
* @param paraTrainingFraction
*            The fraction of the training set, e.g., 0.8 means 80% for training.
*********************
*/
public void splitTrainingTesting(double paraTrainingFraction) {
int tempSize = dataset.numInstances();
int[] tempIndices = getRandomIndices(tempSize);
int tempTrainingSize = (int) (tempSize * paraTrainingFraction);// paraTrainingFraction is a fraction: e.g., 0.1 means 10% of the data is used for training and 90% for testing.
trainingSet = new int[tempTrainingSize];
testingSet = new int[tempSize - tempTrainingSize];
for (int i = 0; i < tempTrainingSize; i++) {
trainingSet[i] = tempIndices[i];
} // Of for i
for (int i = 0; i < tempSize - tempTrainingSize; i++) {
testingSet[i] = tempIndices[tempTrainingSize + i];
} // Of for i
}// Of splitTrainingTesting
/**
*********************
* Predict for the whole testing set. The results are stored in predictions.
* @see #predictions
*********************
*/
public void predict() {
predictions = new int[testingSet.length];
for (int i = 0; i < predictions.length; i++) {
predictions[i] = predict(testingSet[i]);
} // Of for i
}// Of predict
/**
*********************
* Predict for a given instance.
*
* @param paraIndex
*            The index of the instance in the dataset.
* @return The prediction.
*********************
*/
public int predict(int paraIndex) {
int[] tempNeighbors = MycomputeNearests(paraIndex);// Find the nearest neighbors of instance paraIndex and return them as an array.
int resultPrediction = simpleVoting(tempNeighbors);
return resultPrediction;
}// Of predict
/**
*********************
* The distance between two instances.
*
* @param paraI
*            The index of the first instance.
* @param paraJ
* The index of the second instance.
* @return The distance.
*********************
*/
// Are the instances 2-D here? No: each instance holds four attribute values plus its class.
public double distance(int paraI, int paraJ) {
double resultDistance = 0;
double tempDifference;
switch (distanceMeasure) {
case MANHATTAN:
for (int i = 0; i < dataset.numAttributes() - 1; i++) {// numAttributes() returns the number of attributes; the last one is the class.
tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
if (tempDifference < 0) {
resultDistance -= tempDifference;
} else {
resultDistance += tempDifference;
} // Of if
} // Of for i
break;
case EUCLIDEAN:
for (int i = 0; i < dataset.numAttributes() - 1; i++) {
tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
resultDistance += tempDifference * tempDifference;// The teacher skips the square root here: the squared distance preserves the ordering, so comparisons still work.
} // Of for i
break;
default:
System.out.println("Unsupported distance measure: " + distanceMeasure);
}// Of switch
return resultDistance;
}// Of distance
/**
*********************
* Get the accuracy of the classifier.
*
* @return The accuracy.
*********************
*/
public double getAccuracy() {
// A double divided by an int gives another double.
double tempCorrect = 0;
for (int i = 0; i < predictions.length; i++) {
if (predictions[i] == dataset.instance(testingSet[i]).classValue()) {// classValue() is the class label from the arff file, encoded as a number.
tempCorrect++;
} // Of if
} // Of for i
return tempCorrect / testingSet.length;
}// Of getAccuracy
/**
 ************************************
 * Compute the nearest k neighbors. Select one neighbor in each scan. In
 * fact we can scan only once. You may implement it by yourself.
 *
 * @param paraCurrent
 *            The index of the current instance. We compare it with all others.
 * @return The indices of the nearest instances.
 ************************************
 */
public int[] computeNearests(int paraCurrent) {
int[] resultNearests = new int[numNeighbors];// Pick the numNeighbors (here 7) nearest points to vote on the class.
boolean[] tempSelected = new boolean[trainingSet.length];
double tempDistance;
double tempMinimalDistance;
int tempMinimalIndex = 0;
// Select the nearest paraK indices.
for (int i = 0; i < numNeighbors; i++) {
tempMinimalDistance = Double.MAX_VALUE;
for (int j = 0; j < trainingSet.length; j++) {
if (tempSelected[j]) {
continue;
} // Of if
tempDistance = distance(paraCurrent, trainingSet[j]);
if (tempDistance < tempMinimalDistance) {
tempMinimalDistance = tempDistance;
tempMinimalIndex = j;
} // Of if
} // Of for j
resultNearests[i] = trainingSet[tempMinimalIndex];// trainingSet holds instance indices that were shuffled by getRandomIndices.
tempSelected[tempMinimalIndex] = true;
} // Of for i
System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(resultNearests));
return resultNearests;
}// Of computeNearests
/*
 * Below is my own rewrite of computeNearests, modeled on the code above.
 */
// My first imitation was far less accurate than the teacher's version; see the note after the class.
public int[] MycomputeNearests(int paraCurrent) {// This uses the training data (trainingSet); the result is applied to the testing data (testingSet).
int[] result = new int[numNeighbors];
boolean[] ifSelected = new boolean[trainingSet.length];
//double[] computeDistance = new double[numNeighbors];// to hold the minimal distance found in each pass
double MiniDistance;// the minimal distance
double tempDistance;
int MiniDistanceIndex = 0;// the index of the minimal distance
for(int i = 0;i < result.length;i++) {
MiniDistance = Double.MAX_VALUE;
MiniDistanceIndex = 0;
for(int j = 0;j < trainingSet.length;j++) {
if (ifSelected[j]) {
continue;
}
tempDistance = distance(paraCurrent, trainingSet[j]);
if(MiniDistance>tempDistance) {
MiniDistance = tempDistance;
MiniDistanceIndex = j;// just record the index of the smallest distance.
}
}
result[i] = trainingSet[MiniDistanceIndex];
ifSelected[MiniDistanceIndex] = true;
}
System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(result));
return result;
}
/**
************************************
* Voting using the instances.
*
* @param paraNeighbors
* The indices of the neighbors.
* @return The predicted label.
************************************
*/
public int simpleVoting(int[] paraNeighbors) {
int[] tempVotes = new int[dataset.numClasses()];// The number of class labels; in the arff file, the trailing string of each row is the class.
for (int i = 0; i < paraNeighbors.length; i++) {
tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()]++;
} // Of for i
int tempMaximalVotingIndex = 0;
int tempMaximalVoting = 0;
for (int i = 0; i < dataset.numClasses(); i++) {
if (tempVotes[i] > tempMaximalVoting) {
tempMaximalVoting = tempVotes[i];
tempMaximalVotingIndex = i;
} // Of if
} // Of for i
return tempMaximalVotingIndex;
}// Of simpleVoting
/**
*********************
* The entrance of the program.
*
* @param args
* Not used now.
*********************
*/
public static void main(String args[]) {
KnnClassification tempClassifier = new KnnClassification("D:\\MathineLearning\\iris.arff");
tempClassifier.splitTrainingTesting(0.8);
tempClassifier.predict();
System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
}// Of main
}// Of class KnnClassification
Tasks on the code above:
Reimplement computeNearests so that a single scan of the training set yields the k neighbors. Hint: combine the current code with the idea of insertion sort. The time complexity is O(kn): O(n) to scan the training set, O(k) per insertion.
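The single-scan idea in the hint could look like this as a standalone sketch (plain Java, no Weka; kNearest takes precomputed distances instead of the class's fields, which is my own simplification, and assumes at least k candidates):

```java
import java.util.*;

public class SingleScanKnn {
    // One pass over the candidates; keep a sorted buffer of the k smallest
    // distances, inserting each new candidate insertion-sort style: O(kn).
    static int[] kNearest(double[] distances, int k) {
        int[] nearest = new int[k];     // indices of the current k best
        double[] best = new double[k];  // their distances, kept ascending
        Arrays.fill(best, Double.MAX_VALUE);
        for (int i = 0; i < distances.length; i++) {
            if (distances[i] >= best[k - 1]) continue; // not among the k smallest
            // Shift larger entries right and insert, as in insertion sort.
            int pos = k - 1;
            while (pos > 0 && best[pos - 1] > distances[i]) {
                best[pos] = best[pos - 1];
                nearest[pos] = nearest[pos - 1];
                pos--;
            }
            best[pos] = distances[i];
            nearest[pos] = i;
        }
        return nearest;
    }

    public static void main(String[] args) {
        double[] d = {4.0, 0.5, 3.0, 0.2, 9.0, 1.0};
        System.out.println(Arrays.toString(kNearest(d, 3))); // indices of 0.2, 0.5, 1.0
    }
}
```

Like the teacher's version, the returned neighbors come out sorted by distance, so rank-based weighting can be applied to them directly.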
My first imitation was far less accurate because it ignored the training set; the second attempt, which works on the training set, is the MycomputeNearests method already shown in the class above. Since each pass selects the single closest remaining instance, the returned neighbors naturally come out sorted by distance.
Add a setDistanceMeasure() method.
Add a setNumNeighbors() method.
Both setters are already included in the class above; each simply sets the corresponding field.
Add a weightedVoting() method: the shorter the distance, the larger the say in the vote. Support at least two weighting schemes.
public int weightVoting(int[] paraNeighbors) {
int[] result = new int[dataset.numClasses()];
int[] weightSet = new int[paraNeighbors.length];
for (int i = paraNeighbors.length - 1; i >= 0; i--) {
weightSet[paraNeighbors.length - 1 - i] = i;
}// Assign rank weights: the neighbors arrive sorted by distance, so the nearest one gets the largest weight.
for (int i = 0; i < paraNeighbors.length; i++) {
result[(int) dataset.instance(paraNeighbors[i]).classValue()] += weightSet[i];
}// result is indexed by class code; each neighbor adds its weight to its class, and the class with the largest total weight is the prediction.
int max = result[0];
int maxindex = 0;
for(int i = 0;i<result.length;i++) {
if(max<result[i]) {
max = result[i];
maxindex = i;
}
}
return maxindex;
}
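The version above implements only one scheme (linear rank weights). A standalone sketch of a voter supporting two schemes, rank and inverse distance (my own helper, with labels and distances passed in explicitly rather than read from the dataset):

```java
public class WeightedVote {
    public static final int RANK = 0;             // weight = k - rank of the neighbor
    public static final int INVERSE_DISTANCE = 1; // weight = 1 / (distance + epsilon)

    // labels[i] and distances[i] belong to the i-th nearest neighbor (ascending by distance).
    static int weightedVoting(int[] labels, double[] distances, int numClasses, int scheme) {
        double[] votes = new double[numClasses];
        for (int i = 0; i < labels.length; i++) {
            double w = (scheme == RANK)
                ? labels.length - i            // closest neighbor gets the largest rank weight
                : 1.0 / (distances[i] + 1e-9); // closer neighbors dominate even more
            votes[labels[i]] += w;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        int[] labels = {1, 0, 0, 0};               // the single nearest neighbor is class 1
        double[] distances = {0.1, 2.0, 2.5, 3.0};
        System.out.println(weightedVoting(labels, distances, 2, RANK));             // 0: three rank votes beat one
        System.out.println(weightedVoting(labels, distances, 2, INVERSE_DISTANCE)); // 1: the very close neighbor wins
    }
}
```

The two schemes can disagree, as in the example: rank weights only encode order, while inverse distance lets one extremely close neighbor outvote several farther ones.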