WEKA Notes

WEKA’s dataset

Format: .arff


comments start with %

declarations (relation, attribute, data) start with @

e.g.

@relation Glass

@attribute 'Si' numeric

@attribute 'Type' {***,***,***}
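For reference, an ARFF file can also be loaded programmatically through WEKA's Java API. A minimal sketch, assuming weka.jar is on the classpath; the file path is only illustrative:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Read the ARFF file; DataSource also handles CSV and other formats.
        Instances data = DataSource.read("/path/to/glass.arff");
        System.out.println("@relation " + data.relationName());
        // Print each attribute declared with @attribute in the header.
        for (int i = 0; i < data.numAttributes(); i++) {
            System.out.println(data.attribute(i));
        }
        System.out.println(data.numInstances() + " instances in the @data section");
    }
}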



lesson 1.4

Classifier

J48 Tree

pruned & unpruned tree

number of leaves (constraint)

visualisation of tree


J48 Tree  (C4.5)

C4.5 chooses the attribute that most effectively splits the set of samples into subsets enriched in one class or the other. The splitting criterion is the normalised information gain (the difference in entropy). The attribute with the highest normalised information gain (i.e. the greatest reduction in entropy) is chosen to make the decision. The algorithm then recurses on the smaller sublists.
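A minimal sketch of building J48 (C4.5) through WEKA's Java API and comparing the pruned and unpruned trees (weka.jar on the classpath; the file path is illustrative):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Tree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/glass.arff");
        data.setClassIndex(data.numAttributes() - 1);   // assume the last attribute is the class

        J48 pruned = new J48();                         // pruning is on by default
        pruned.buildClassifier(data);
        System.out.println(pruned);                     // prints the tree, number of leaves, tree size

        J48 unpruned = new J48();
        unpruned.setUnpruned(true);                     // same as setting 'unpruned' to true in the GUI
        unpruned.buildClassifier(data);
        System.out.println(unpruned);                   // usually a larger tree
    }
}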



lesson 1.5

filter

1. supervised & unsupervised

2. attribute & instance


lesson 1.6

visualize panel



lesson 2.1

User Classifier

Classify panel -> Supplied test set -> open file (segment-test) -> Choose UserClassifier in trees -> Start

Choose X and Y and select instances with a rectangle or polygon in the Data View. In the Tree View, the decision tree is built. The higher the Correctly Classified Instances, the better the tree.


lesson 2.2

Training data is different from Test data

Basic assumption: train and test sets produced by independent sampling from an infinite population.

If I have only one dataset, split it by a specific percentage, like 66%: 66% of the data is used for training and the rest for testing. With this option, if the percentage setting is not changed, the Correctly Classified Instances result will not change, because WEKA wants results to be repeatable: it initialises the random number generator before each run, so you get the same result when you repeat the experiment tomorrow.

The classifier algorithm is J48
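A minimal sketch of the 66% percentage split done through the Java API, with the random number generator seeded before the split so the run is repeatable (file path illustrative):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/segment-challenge.arff");
        data.setClassIndex(data.numAttributes() - 1);

        data.randomize(new Random(1));                 // fixed seed -> repeatable result
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);                   // train on 66% of the data

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);                // test on the remaining 34%
        System.out.println("Correctly classified: " + eval.pctCorrect() + " %");
    }
}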


lesson 2.3

How do you get different Correctly Classified Instances results with the same percentage split of one dataset into training and test data?

Set the random-number seed   (Random seed for XVal / % Split  ***)

Evaluate J48 on segment-challenge by calculating the sample mean, variance and standard deviation of the accuracy over several runs.
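A minimal sketch of that experiment: repeat the split with different random seeds and compute the sample mean and standard deviation of the accuracy (file path and the 90% split size are illustrative assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SeedExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/segment-challenge.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int runs = 10;
        double sum = 0, sumSq = 0;
        for (int seed = 1; seed <= runs; seed++) {
            Instances copy = new Instances(data);
            copy.randomize(new Random(seed));          // a different seed gives a different split
            int trainSize = (int) Math.round(copy.numInstances() * 0.9);
            Instances train = new Instances(copy, 0, trainSize);
            Instances test  = new Instances(copy, trainSize, copy.numInstances() - trainSize);

            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            double acc = eval.pctCorrect();
            System.out.println("seed " + seed + ": " + acc);
            sum += acc;
            sumSq += acc * acc;
        }
        double mean = sum / runs;
        double variance = (sumSq - runs * mean * mean) / (runs - 1);  // sample variance
        System.out.println("mean = " + mean + ", std dev = " + Math.sqrt(variance));
    }
}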


lesson 2.4

baseline accuracy

e.g. rules -> ZeroR

The baseline is not always worse than other classifiers such as J48, NaiveBayes, IBk…
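A minimal sketch comparing the ZeroR baseline against a few other classifiers under 10-fold cross-validation (file path illustrative):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaselineComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new ZeroR(), new J48(), new NaiveBayes(), new IBk() };
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));   // 10-fold CV with a fixed seed
            System.out.println(c.getClass().getSimpleName() + ": " + eval.pctCorrect() + " %");
        }
    }
}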


lesson 2.5

cross-validation

improve upon repeated holdout

Divide the data into 10 pieces, take 9 pieces for training and 1 for testing, and repeat 10 times.

Average the results.

Each data point is used once for testing and 9 times for training.

This reduces the variance of the estimate.

CONCLUSION:

cross-validation is better than repeated holdout, and stratified is even better.

If the dataset is large (e.g. 10k instances with 2 classes), cross-validation is unnecessary.

With 100 different classes, you need a larger dataset.

With fewer than 1k instances, 10-fold cross-validation won't take much longer.
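A minimal sketch of what 10-fold cross-validation does internally: stratify the data, hold out each fold once for testing and train on the other nine, then pool the results (file path illustrative; Evaluation.crossValidateModel does the same thing in one call):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ManualCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int folds = 10;
        data.randomize(new Random(1));
        data.stratify(folds);                          // keep class proportions similar in each fold

        Evaluation eval = new Evaluation(data);        // accumulates results over all folds
        for (int i = 0; i < folds; i++) {
            Instances train = data.trainCV(folds, i);  // 9 folds for training
            Instances test  = data.testCV(folds, i);   // 1 fold for testing
            J48 tree = new J48();
            tree.buildClassifier(train);
            eval.evaluateModel(tree, test);
        }
        System.out.println(eval.toSummaryString("=== 10-fold cross-validation ===", false));
    }
}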


lesson 2.6

the number of folds used in cross-validation

why is 10 folds better than 20 folds?

With 20 folds, each run trains on 95% of the data and tests on 5%, and WEKA does the 20 runs plus a final 21st run.

In that final run, the entire dataset is used for training, producing the classifier that is output.


The J48 classifier's performance does not always increase as the training set size increases.

But in the segment-challenge example, the performance does keep increasing.


lesson 3.1

OneR

It classifies using only one attribute (a single rule).

It is a simple classifier.

Sometimes a very simple classifier performs well on many common datasets.


lesson 3.2

overfitting

OneR

Changing the minBucketSize parameter can make this classifier overfit, e.g. changing it from the default value of 6 to 1. This can lower the correctly classified instances on the test data, even below the baseline accuracy.

Overfitting is a big problem in every ML algorithm.
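A minimal sketch of this experiment: OneR with the default minBucketSize of 6 versus 1, comparing accuracy on the training data itself with cross-validated accuracy (file path illustrative):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneROverfitting {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int bucket : new int[] { 6, 1 }) {
            OneR trained = new OneR();
            trained.setMinBucketSize(bucket);       // 1 allows very fine, overfitted rules
            trained.buildClassifier(data);
            Evaluation onTrain = new Evaluation(data);
            onTrain.evaluateModel(trained, data);   // accuracy on the training data itself

            OneR fresh = new OneR();
            fresh.setMinBucketSize(bucket);
            Evaluation cv = new Evaluation(data);
            cv.crossValidateModel(fresh, data, 10, new Random(1));

            System.out.println("minBucketSize=" + bucket
                    + "  training accuracy=" + onTrain.pctCorrect()
                    + "  cross-validated accuracy=" + cv.pctCorrect());
        }
    }
}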


lesson 3.3

Naive Bayes

Assumptions: the attributes are (1) equally important and (2) statistically independent.

In Bayes' rule, the hypothesis H is the class of an instance and the evidence E is its attribute values.

P(H|E)=P(E|H) * P(H)/P(E)


P(H) is the prior (a priori) probability of H (the baseline probability),

which is the probability before the evidence is seen.

P(H|E) is the posterior (a posteriori) probability of H,

which is the probability after the evidence is seen.

The evidence is the attribute values of the unknown instance.

P(H|E)=P(E1|H)*P(E2|H)…*P(En|H)*P(H)/P(E)

∴ P(yes|E) = P(outlook=sunny|yes) * P(temp=cool|yes) * P(humidity=high|yes) * P(windy=true|yes) * P(yes) / P(E).


zero-frequency problem.

When an attribute value never occurs together with a class, its count is 0, which makes the whole product 0 (anything multiplied by 0 is 0). WEKA uses a simple fix: add 1 to every count, so no count is ever 0.


Naive Bayes: all attributes contribute equally and independently, even when the independence assumption is clearly violated. This works because classification does not need accurate probability estimates; it only needs the greatest probability to be assigned to the correct class.


Adding redundant attributes can cause problems. An extreme case of dependence is two identical attributes. WEKA provides attribute selection to pick a subset of fairly independent attributes.
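A minimal sketch of running NaiveBayes and printing the class probability distribution it assigns to one instance (file path illustrative; weather.nominal.arff is assumed to be a local copy of the course's weather data):

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesProbabilities {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);                        // prints the per-class, per-attribute counts

        // P(class | E) for the first instance, normalised so the values sum to 1
        double[] dist = nb.distributionForInstance(data.instance(0));
        for (int c = 0; c < dist.length; c++) {
            System.out.println("P(" + data.classAttribute().value(c) + " | E) = " + dist[c]);
        }
    }
}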


lesson 3.4

decision tree. 

J48: top-down, recursive divide-and-conquer

  1. select an attribute for the root node and create a branch for each possible attribute value.
  2. split the instances into subsets, one for each branch extending from the node.
  3. repeat recursively for each branch, using only the instances that reach the branch.
  4. stop if all instances have the same class.

How to choose the root node?

Splitting on an attribute produces several branches. The more pure branches an attribute produces (branches in which every instance has the same class), the better that attribute is as the root node.

If a branch is not pure, it needs a further split, which works against our goal of reaching the smallest decision tree.

We need to calculate the degree of purity.


Aim: To get the smallest tree, we need to use some heuristic methods.

— choose the attribute that produces the purest nodes. A simple heuristic is the information-theory-based one: choose the split with the greatest information gain (the largest reduction in entropy).

Information gain = entropy of the distribution before splitting - entropy of the distribution after splitting.

It can be defined as the amount of information gained by knowing the value of the attribute.

In the weather data, outlook has the greatest information gain.

Then keep splitting within the first branch of outlook (sunny) by calculating the information gain of temperature, windy and humidity; humidity has the greatest information gain, so it becomes the next node within the sunny branch.

Keep splitting until all the nodes are pure, then stop.
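A worked sketch of the information-gain calculation for the outlook split, assuming the standard 14-instance weather data (9 yes / 5 no overall; sunny = 2 yes / 3 no, overcast = 4 / 0, rainy = 3 / 2):

public class InformationGain {
    // Entropy of a two-class distribution, in bits.
    static double entropy(double yes, double no) {
        double total = yes + no;
        double e = 0;
        for (double count : new double[] { yes, no }) {
            if (count > 0) {
                double p = count / total;
                e -= p * (Math.log(p) / Math.log(2));
            }
        }
        return e;
    }

    public static void main(String[] args) {
        double before = entropy(9, 5);                         // entropy of the whole set
        // Weighted entropy of the subsets after splitting on outlook.
        double after = (5.0 / 14) * entropy(2, 3)              // sunny
                     + (4.0 / 14) * entropy(4, 0)              // overcast (pure)
                     + (5.0 / 14) * entropy(3, 2);             // rainy
        System.out.println("entropy before = " + before);      // about 0.940
        System.out.println("entropy after  = " + after);       // about 0.694
        System.out.println("gain(outlook)  = " + (before - after));  // about 0.247
    }
}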

J48 is a common algorithm in data mining.

There are many different criteria for attribute selection but they rarely make a large difference.


lesson 3.5

pruning decision tree


Do not continue splitting if the nodes get very small

J48 minNumObj parameter, default value 2 (The minimum number of instances per leaf)

confidenceFactor (the confidence factor used for pruning; a smaller value incurs more pruning)

subtreeRaising: pruning an interior node and raising its sub-tree up one level. It increases the complexity of the algorithm, so switching it off makes the algorithm quicker.

The reason for pruning is to prevent the original, unpruned tree from overfitting the dataset.

The default setting in WEKA is to prune. You can switch it off by setting unpruned to true.
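A minimal sketch of adjusting these pruning parameters through the Java API, mirroring the GUI options above (file path and the chosen values are illustrative):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PruningOptions {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setMinNumObj(2);             // minimum number of instances per leaf (default 2)
        tree.setConfidenceFactor(0.1f);   // smaller value -> more pruning (default 0.25)
        tree.setSubtreeRaising(false);    // switch off subtree raising to speed things up
        // tree.setUnpruned(true);        // or switch pruning off entirely
        tree.buildClassifier(data);
        System.out.println(tree);         // compare tree size against the default settings
    }
}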


lesson 3.6

nearest neighbour or instance-based

linear decision boundary 

Euclidean distance: the square root of the sum of the squared differences between attribute values.

Manhattan (city-block) distance: the sum of the absolute differences between attribute values.
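A small sketch of the two distance measures over numeric attribute vectors:

public class Distances {
    // Euclidean distance: square root of the sum of squared differences.
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Manhattan (city-block) distance: sum of absolute differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p = { 1.0, 2.0, 3.0 };
        double[] q = { 4.0, 6.0, 3.0 };
        System.out.println(euclidean(p, q));   // 5.0
        System.out.println(manhattan(p, q));   // 7.0
    }
}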


K-nearest-neighbors

IBk (instance-based learning) -> in the lazy category

k value: the KNN parameter


Nearest neighbour is often very accurate but slow, because it has to scan the entire training data to make each prediction. More sophisticated data structures can make this faster.

IBK assumes all the attributes are equally important. 

The bigger the dataset and the larger k is, the better the results from IBk tend to be (a larger k smooths out noise).
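A minimal sketch evaluating IBk for several values of k (file path and the chosen k values are illustrative):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KNearestNeighbours {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int k : new int[] { 1, 5, 20 }) {
            IBk knn = new IBk();
            knn.setKNN(k);                                     // the KNN parameter in the GUI
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.println("k=" + k + ": " + eval.pctCorrect() + " %");
        }
    }
}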



lesson 4.1

boundary visualisation

Visualisation is restricted to numeric attributes and 2D plots.


lesson 4.2

linear regression

regression problem(predict numeric value) vs classification problem(predict nominal value)

x=w0+w1a1+w2a2+…+wkak   k attributes  (a0=1)

Calculate the weights from the training data. a1, a2, …, ak are the attribute values of one instance.

predicted value for the first training instance a(1):

w0a0(1)+w1a1(1)+w2a2(1)+…+wkak(1)=Σ(j=0–k)wjaj(1)

Σ(i=1–n) (x(i)-Σ(j=0–k) wjaj(i))^2

The method for choosing weights w is to minimise squared error on training data.

x(i) is the actual value of the ith training instance.

Σ(j=0–k) wjaj(i) is the predicted value for the ith training instance.

Standard matrix problem.

This works well if the number of instances n is greater than the number of attributes k (n > k).

If the number of attributes k is greater than n, there is not enough data to determine the weights w.

If an attribute has only two values (binary), we can convert them to 0 and 1.

What about multi-valued nominal attributes?


functions -> LinearRegression


Nonlinear regression

Model tree

Each leaf has a linear regression model.

Linear patches approximate a continuous function.

trees->M5P


Practical problems usually need nonlinear solutions.
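A minimal sketch comparing LinearRegression with the M5P model tree on a dataset with a numeric class (cpu.arff from WEKA's data folder is assumed here; the file path is illustrative):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RegressionComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/cpu.arff");
        data.setClassIndex(data.numAttributes() - 1);          // numeric class

        Classifier[] models = { new LinearRegression(), new M5P() };
        for (Classifier m : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(m, data, 10, new Random(1));
            System.out.println(m.getClass().getSimpleName()
                    + "  RMSE=" + eval.rootMeanSquaredError()
                    + "  correlation=" + eval.correlationCoefficient());
        }
    }
}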


lesson 4.3

regression techniques for classification

two-class problem:

Training stage: call the classes 0 and 1

Prediction stage: set a threshold for predicting class 0 or 1 (if the output is lower than the threshold, predict 0; otherwise predict 1)

Multi-class problem: multi-response linear regression 

separate regression for each class

Training stage: set the output to 1 for training instances that belong to the class and 0 for instances that don't. Come up with a separate regression line for each class.

Prediction stage: choose the class with the largest output.

n regression problems for n different classes.

OR: pairwise linear regression, n(n-1)/2 pairs of classes

a linear regression line for each pair, discriminating one class of the pair from the other.


If the dataset has a nominal class, LinearRegression appears greyed out: even if it can be selected, the Start button cannot be pressed. So we need to convert the nominal data into numeric data.

Filter: unsupervised attribute filter, NominalToBinary, applied to the specific attribute. (This attribute filter does not operate on the class, so the class must first be set to No class.) Then LinearRegression can run and produce the equation.

add a new attribute (classification) that gives the regression output.

Supervised attribute filter -> AddClassification -> set classifier to functions -> LinearRegression -> set outputClassification to True -> Apply

The classification attribute contains the numbers predicted by linear regression.

(i.e. take the ML algorithm's predictions and add them to the dataset as an attribute)

Convert the class back to nominal, because ZeroR/OneR work on a nominal class.

Filter: unsupervised attribute filter, NumericToNominal, applied to the specific attribute.

Set class to the class attribute.

Remove all the other attributes.

Predict the nominal class using OneR.

Change the minBucketSize parameter to something larger to prevent overfitting (so that only a single threshold/split point is found).
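A minimal sketch of the two conversion filters used in this workflow, applied through the Java API (file path illustrative; the AddClassification step from the GUI is omitted here):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class ConversionFilters {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(-1);                               // "No class", so the filter may touch every attribute

        NominalToBinary toBinary = new NominalToBinary();     // nominal values -> 0/1 indicator attributes
        toBinary.setInputFormat(data);
        Instances numeric = Filter.useFilter(data, toBinary);

        NumericToNominal toNominal = new NumericToNominal();  // convert back to nominal for OneR/ZeroR
        toNominal.setInputFormat(numeric);
        Instances nominal = Filter.useFilter(numeric, toNominal);

        System.out.println(numeric.numAttributes() + " attributes after NominalToBinary");
        System.out.println(nominal.numAttributes() + " attributes after NumericToNominal");
    }
}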


lesson 4.4

Probability prediction rather than hard classification.

Probabilities are often useful anyway.

e.g.

probability: tomorrow will be rainy with probability 95% and sunny with probability 5%.

classification: tomorrow will definitely be rainy (or definitely sunny).

Naive Bayes

ZeroR or J48 can also produce probability predictions.

ZeroR: the probability distribution is always (0.648, 0.352).

Reason: with 10-fold cross-validation we use a 90% training fold and a 10% test fold.

The 90% training fold contains 448 negative and 243 positive instances.

With the Laplace correction, (448+1)/(448+1+243+1) ≈ 0.648.


The idea of logistic regression is to make linear regression produce probabilities.

(In linear regression we use a linear function and a threshold to do classification.)

In logistic regression, however, we take the linear regression expression, which is

w0a0 + w1a1 + w2a2 + … + wkak = Σ(j=0–k) wjaj, and use it as a component in a new formula:

P(1|a1,a2,…,ak) = 1/(1 + exp(-(w0a0 + w1a1 + w2a2 + … + wkak))).

The reason for this transformation is that the raw value of (w0a0 + w1a1 + … + wkak) is sometimes mistaken for a probability, but it is not one: it can be greater than 1 or negative, which is not allowed for a probability. After the "exp" transformation above (the logistic, or logit, transform), the output is guaranteed to sit in the range (0, 1).
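A small sketch of that logistic transform, showing how arbitrary linear outputs are squashed into the range (0, 1):

public class LogisticTransform {
    // Maps any real-valued linear output z = w0*a0 + ... + wk*ak into (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        for (double z : new double[] { -5.0, -1.0, 0.0, 1.0, 5.0 }) {
            // The raw linear output can be negative or greater than 1;
            // the transformed value is always a valid probability.
            System.out.println("z=" + z + "  P(1|a)=" + sigmoid(z));
        }
    }
}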

In linear regression we try to minimise Σ(i=1–n) (x(i) - Σ(j=0–k) wjaj(i))^2;

the method for choosing the weights w is to minimise the squared error on the training data.

In logistic regression, however, we maximise the log-likelihood (rather than minimising the squared error):

Σ(i=1–n) [ (1 - x(i)) * log(1 - P(1|a1(i),a2(i),…,ak(i))) + x(i) * log(P(1|a1(i),a2(i),…,ak(i))) ]

Application: 

open diabetes.arff -> choose functions -> Logistic -> 10-fold cross-validation
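A minimal sketch of that application step through the Java API (file path illustrative):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LogisticOnDiabetes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Logistic logistic = new Logistic();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(logistic, data, 10, new Random(1));  // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
    }
}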


lesson 4.5


support vector machines

one critical member (support vector) in each class; the perpendicular bisector of the line connecting the two support vectors forms the class boundary (a line or a plane)

all the other members can be removed

x = b + a weighted sum of dot products with the support vectors; points other than the support vectors do not matter

linearly separable

not linearly separable


Linear decision boundary

The kernel trick can produce more complex boundaries.

SVMs tend to avoid overfitting, even with a large number of attributes, because the boundary depends only on the support vectors, not on every point.


functions -> SMO (solves two-class problems)

functions -> LibSVM: an external library with a WEKA interface

Download the library and put it on the right Java classpath.

It is faster than SMO and has more sophisticated options.
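A minimal sketch of running SMO (file path illustrative; LibSVM would be used the same way once its jar is on the classpath):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SupportVectorMachine {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO svm = new SMO();                           // default: polynomial kernel of degree 1, i.e. linear
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));
        System.out.println(eval.pctCorrect() + " % correctly classified");
    }
}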


lesson 4.6

Ensemble learning

We can improve predictive performance by producing several classifiers for the same problem (using different ML methods or different training conditions) and letting them vote when classifying an unknown test instance.

The drawback is that it produces complex output.

Four methods:

Bagging 

Randomization

Boosting

Stacking


Bagging:

We need to produce several different decision structures.

We use J48 to produce several different decision trees, so we need several independent training sets of the same size, obtained by sampling from the original training set.

In fact, we sample the set with replacement, so the same instance can appear more than once in a sample.

Bagging is suitable for "unstable" learning schemes such as J48.

Unstable: a small change in the training data can make a big change in the model, e.g. we can get a totally different decision tree by changing the dataset a little. Naive Bayes and instance-based learning, in contrast, are stable learning methods.

meta -> Bagging   bagSizePercent: Size of each bag, as a percentage of the training set size.
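A minimal sketch of bagging J48, with the bagSizePercent option mentioned above (file path and parameter values illustrative):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggedTrees {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Bagging bagging = new Bagging();
        bagging.setClassifier(new J48());        // the unstable base learner
        bagging.setNumIterations(10);            // number of bags / trees
        bagging.setBagSizePercent(100);          // each bag as large as the training set, sampled with replacement

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(bagging, data, 10, new Random(1));
        System.out.println(eval.pctCorrect() + " % correctly classified");
    }
}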


Randomization: random forests

we randomize the algorithm, not the training data.

-Random forests

  attribute selection for the J48 decision tree: do not always pick the very best attribute, but consider a few of the best candidates and randomly select one of them. (In standard J48 we always choose the attribute with the largest information gain as the split node.)

Randomly building several such trees and bagging their results gives better performance.

trees -> RandomForest

numFeatures : The number of attributes to be used in random selection (see RandomTree)
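A minimal sketch of running a random forest (file path illustrative; the numFeatures option from the GUI is left at its default here):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest forest = new RandomForest();      // bags randomized trees
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(forest, data, 10, new Random(1));
        System.out.println(eval.pctCorrect() + " % correctly classified");
    }
}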


Boosting

It is iterative: new models are influenced by the performance of previously built ones.

Put extra weight on instances that are misclassified (wrongly classified instances) and build the next model on the re-weighted training set in each iteration.

The new model becomes an "expert" for instances misclassified by earlier models.

meta -> AdaBoostM1

numIterations: The number of iterations to be performed.
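A minimal sketch of boosting with AdaBoostM1 and the numIterations parameter (file path illustrative; DecisionStump, AdaBoostM1's default weak base learner, is used here):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostedStumps {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new DecisionStump());  // weak base learner
        boost.setNumIterations(10);                // number of boosting rounds

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(boost, data, 10, new Random(1));
        System.out.println(eval.pctCorrect() + " % correctly classified");
    }
}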


Stacking

Use a meta learner (rather than voting) to combine the predictions of the base learners.

— base learners: level-0 models

— meta learner:  level-1 model

— predictions of base learners are input to meta learner.

We use cross-validation-like scheme on training data to generate data for meta learner.

meta -> Stacking
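A minimal sketch of stacking several base learners with a meta learner (file path illustrative; the choice of J48/NaiveBayes/IBk as level-0 models and Logistic as the level-1 model is just an example):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackedLearners {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/to/dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Stacking stacking = new Stacking();
        stacking.setClassifiers(new Classifier[] { new J48(), new NaiveBayes(), new IBk() });  // level-0 models
        stacking.setMetaClassifier(new Logistic());                                            // level-1 model

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stacking, data, 10, new Random(1));
        System.out.println(eval.pctCorrect() + " % correctly classified");
    }
}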


To recap:

Diversity helps, especially with "unstable" learners.

How to create diversity?

Bagging: resampling the training set.

Random forests: alternative branches in decision trees.

Boosting: focus on where the existing model makes errors and put extra weight there.

Stacking: combine results using another learner(instead of voting).


