Artificial Intelligence -- Chapter 12 Intro to Machine Learning

One thing to point out: the same feature can appear multiple times in a decision tree, as long as the occurrences are on separate paths. (Look at the feature "Alternate" in this tree.)

 

Purer means that after splitting on the attribute, we are clearer about which decision to make.

Recall that a green node means "Yes, we will wait" and a red node means "No, we won't wait".

If we split based on "type of restaurant", we can see that each child node is still 50% True and 50% False, so we have no preference on waiting or not.

If we look at "Patrons", the results lead to clearer decisions. So this attribute is the one we want, because it gives more information.

We can see that for a binary variable, if P(+x) = 0 or P(+x) = 1, then H(X) = 0. Zero entropy means the random variable effectively takes only one value.

When the binary variable has P(+x) = 0.5, H(X) reaches its maximum. In this case X is uniformly distributed, and we are completely uncertain about its outcome.

So the higher the entropy of a RV, the closer the RV is to a uniform distribution, and the more uncertainty there is in it.

We can also talk about entropy for RVs with more than two possible values. Since the formula sums over the values, the more classes we have, the higher the entropy can be.
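The entropy behavior described above is easy to check numerically. Below is a minimal sketch of the Shannon entropy formula H(X) = -Σ p log₂ p (the function name `entropy` is ours, not from the slides):

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution.

    Terms with p = 0 are skipped, since lim p->0 of p*log(p) is 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A deterministic variable has zero entropy:
assert entropy([1.0, 0.0]) == 0.0
# A fair coin (the uniform binary case) has the maximal 1 bit:
assert entropy([0.5, 0.5]) == 1.0
# More equally likely classes -> higher entropy (4 classes -> 2 bits):
assert entropy([0.25] * 4) == 2.0
```

This confirms the three cases in the text: entropy is 0 at P(+x) = 0 or 1, maximal at P(+x) = 0.5, and grows with the number of equally likely classes.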

We always want lower entropy, because it means lower uncertainty.

H(\text{Full}) = -\frac{1}{3}\log_2 \frac{1}{3} - \frac{2}{3}\log_2 \frac{2}{3} \approx 0.918

and when we take the expected value, we weight each term by the number of examples in that node. (This is also called the weighted average.)

Gain(Type) = 1 - 1 = 0. Remember that we want to maximize expected information gain, so "Patrons" is the better choice. The Gain(A) formula tells us that the more an attribute lowers the entropy, the higher the gain.
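The two gains above can be reproduced with a short sketch. This assumes the standard restaurant example counts (6 True / 6 False at the root; "Patrons" splits them as None → 0/2, Some → 4/0, Full → 2/4, which matches the H(Full) ≈ 0.918 computed earlier, while "Type" leaves every child 50/50):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gain(parent_pos, parent_neg, children):
    """Information gain of a split.

    children: list of (pos, neg) example counts, one pair per child node.
    Gain = H(parent) - weighted average of the children's entropies.
    """
    total = parent_pos + parent_neg
    h_parent = entropy([parent_pos / total, parent_neg / total])
    h_children = sum(
        (p + n) / total * entropy([p / (p + n), n / (p + n)])
        for p, n in children
    )
    return h_parent - h_children

# Splitting on Patrons: None -> (0,2), Some -> (4,0), Full -> (2,4)
print(round(gain(6, 6, [(0, 2), (4, 0), (2, 4)]), 3))           # 0.541
# Splitting on Type: every child stays 50/50, so no gain at all.
print(round(gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]), 3))   # 0.0
```

Only the Full child contributes any entropy to the Patrons split (6/12 × 0.918 ≈ 0.459), which is why its gain is so much higher than Type's.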

We can see that it's pretty close to a uniform distribution. Now, if we're building a decision tree, what we need to do is go over each attribute and see which one gives us the highest information gain:

We would compute the gain for the other attributes the same way. We won't show the rest here, but the conclusion is that "Fav Language" gives the highest gain among these attributes.

The next thing to do is decide the next attribute once we know the applicant's favorite language. We use the same procedure to find the attribute with the highest information gain in each branch: Objective C and Java.

Suppose we use the attribute "Experience". With this attribute we don't even need to test any other features, because the split on "Experience" is completely pure. So "Experience" is all we need to look at after knowing that the applicant uses Objective C.

So the branch of the tree starting from Objective C is almost done. Let's look at the other branch, starting from Java:

After trying the different attributes, we find that "Degree" gives us the highest gain. (Again, we skip the computation for each attribute here.)

You might think it's not so great that we're making a decision based solely on one data point in the PhD branch, but that's the training dataset we have. We might conclude something else with more data, but here this is all we get for training.

So here we can build the tree. It's a simple tree, but it gives us 100% accuracy on the training data. This is because we only stop growing a branch when we reach a pure split.

In this case, we would say it's a good tree for the training data, because it is small and accurate on the training set. Also notice that not all attributes have to be used. For example, "Visa" is not used, which tells us that "Visa" doesn't actually give us any information beyond what the first three attributes already provide. This can always happen.

Let's try another model:

Given the same dataset, we can also build a naive Bayes model.

If we now classify the same data instance (PhD, mobile, Java) here, we are actually calculating a product of probabilities.

For Hire(PhD, mobile, Java) = yes: P(hire) · P(PhD | +h) · P(mobile | +h) · P(Java | +h)

For Hire(PhD, mobile, Java) = no: (1 - P(hire)) · P(PhD | -h) · P(mobile | -h) · P(Java | -h)

We compare these two numbers and pick the class with the higher likelihood. In the Bayes net we would actually hire this person, which is a completely different decision from what we get with the decision tree model.
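The comparison above can be sketched in a few lines. Note that the probability values below are made up for illustration; the real conditional probabilities would be estimated from the counts in the training table shown in the slides:

```python
def nb_score(prior, conds):
    """Unnormalized naive Bayes score: prior * product of conditionals."""
    score = prior
    for p in conds:
        score *= p
    return score

# Hypothetical probabilities (NOT from the slides), for the query
# (PhD, mobile, Java):
p_hire = 0.5
score_yes = nb_score(p_hire,     [0.2, 0.4, 0.6])  # P(PhD|+h), P(mobile|+h), P(Java|+h)
score_no  = nb_score(1 - p_hire, [0.1, 0.3, 0.2])  # P(PhD|-h), P(mobile|-h), P(Java|-h)

# Pick whichever class has the higher (unnormalized) likelihood:
decision = "hire" if score_yes > score_no else "no hire"
```

Because we only compare the two scores, we never need to normalize by the evidence P(PhD, mobile, Java); the argmax is the same either way.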

So one thing to keep in mind is that different models have different properties (pros and cons), and they may give different results.

If we stop early, we simply take the majority vote at that node and use it as our decision. This generally introduces some error on the training data, but a little training error is usually fine, because it tends to help us generalize better to test (unseen) data, which is ultimately what we want. We are always happy to sacrifice a little performance on the training data if it translates to an improvement on test data.


From the perspective of linear classifiers, we treat inputs consisting of features as high-dimensional vectors; each feature corresponds to a dimension in the vector space.

For example, in the graph we see a two-dimensional space, which means our data consists of two features (X1, X2), and we plot each data point according to the values its features take on. The classes we want to pick out essentially correspond to distinct regions of the space.

So when we learn the classifier, we are actually learning how to distinguish those regions of the space, i.e., how to divide it. The lines we see in the graph (H1, H2, H3) are all potential dividers.

Mathematically, we're just computing a weighted sum over a set of features. The weight values come from the training process: we use the training data to find which features are more important than others.

In the graph, once we learn the vector w, this vector geometrically determines the boundary. A data point on the boundary gets a score of 0, because the dot product of two orthogonal vectors is always 0. Any data point on the same side as w gets a positive value and is classified as +1, while a data point on the opposite side gets a negative value and is classified as -1.
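The decision rule described above is just the sign of a dot product. A minimal sketch, using the example weight vector w = (3, 2, 2, 0) discussed below (the feature vectors here are invented for illustration):

```python
def classify(w, f):
    """Binary linear classifier: the sign of the dot product w . f."""
    score = sum(wi * fi for wi, fi in zip(w, f))
    if score > 0:
        return +1
    if score < 0:
        return -1
    return 0  # exactly on the boundary: a tie, broken by convention

# Degree matters most; the last feature (visa) is ignored entirely.
w = (3, 2, 2, 0)
print(classify(w, (1, 1, 1, 1)))   # score 7  -> +1
print(classify(w, (-1, 0, 0, 1)))  # score -3 -> -1
```

Note that the visa feature can take any value without changing the score, since its weight is 0.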

We need to transform any qualitative features into numbers for the linear classifier. The encoding can be an arbitrary mapping (e.g., Bachelor = 0, Master = 1, PhD = -1 for the feature "degree").

If we come up with w = (3, 2, 2, 0), it means we pay the most attention to degree, a little less to experience and language, and no attention at all to visa.


Let's now talk about how we can learn those weights to form the weight vector.

y* here is either +1 or -1 for a binary classifier. What we are doing here: whenever the predicted class is -1 but the true label is +1 (or vice versa), we use the red term to rotate the vector w. When w changes, the class boundary changes with it, so the feature vector f can end up on the other side of the boundary. Note that f itself doesn't move during updates; it's the boundary that keeps moving.


We can see that f was previously classified as -1 based on the blue boundary, and after the update it is classified as +1 based on the black boundary.

From x1 to x4, f·w = 0 because w is the zero vector. A score of 0 is a tie, so by our convention we predict hire (+1). These predictions happen to match the truth, so there is no update yet.

When we come to x5, we predict hire (+1) but the truth is -1, so we update w:

w \leftarrow w + \alpha \, y^* f = (0,0,0,0) + (-1)(-1,1,-1,1) = (1,-1,1,-1), since the truth is y^* = -1 (taking \alpha = 1)

And we keep computing with the updated w... When, after several updates, all 14 cases agree with the truth under the dot product with w, we stop.
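The training loop described above can be sketched as follows (a minimal perceptron, with the text's tie-breaking convention that a score of 0 predicts +1; `perceptron_train` and its parameters are our names, not from the slides):

```python
def perceptron_train(data, dim, alpha=1.0, max_epochs=100):
    """Binary perceptron. data: list of (feature_vector, label), label in {+1, -1}.

    Repeats passes over the data until every example is classified
    correctly (convergence) or max_epochs is reached.
    """
    w = [0.0] * dim
    for _ in range(max_epochs):
        converged = True
        for f, y_true in data:
            score = sum(wi * fi for wi, fi in zip(w, f))
            y_pred = 1 if score >= 0 else -1   # ties predict +1 (hire)
            if y_pred != y_true:
                # Mistake: rotate w toward the correct side, w <- w + alpha*y*f
                w = [wi + alpha * y_true * fi for wi, fi in zip(w, f)]
                converged = False
        if converged:
            break
    return w

# Reproducing the single update from the text: starting from w = 0,
# a mistake on f = (-1, 1, -1, 1) with true label -1 gives (1, -1, 1, -1).
w = perceptron_train([((-1, 1, -1, 1), -1)], dim=4, max_epochs=1)
print(w)   # [1.0, -1.0, 1.0, -1.0]
```

If the data is linearly separable, this loop is guaranteed to converge; otherwise it will keep cycling, which is why a `max_epochs` cap is needed in practice.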


Suppose we have f1 and f2 in the graph. Geometrically, w1 · f1 is larger than w2 · f1 and w3 · f1, so f1 is classified as the class corresponding to w1.

For f2, it lies on the boundary between w1 and w3: w1 · f2 = w3 · f2, a tie between classes 1 and 3.

What we do when updating both y and y*:

we move w_y away from f, because it was too close to f and produced too large a value of f · w_y;

simultaneously, we move w_{y*} closer to f, to produce a larger value of f · w_{y*}.

There may be other weight vectors w_z, but we don't touch them in this update, because each iteration only cares about the wrong prediction y and the true answer y*.

Suppose now we have 3 classes for hire: +1 hire, 0 intern, -1 don't hire.

When we say x1 to x4 are correct, we mean that f(x1) · w_1 gives the highest value among w_-1, w_0, w_1; that f(x2) · w_0 gives the highest value among w_-1, w_0, w_1; and so on.

For x5, we have w_-1 · f(x5) = 0, w_0 · f(x5) = -2, w_1 · f(x5) = 2, so we pick w_1 and predict class 1 for x5, but the true class is -1. So we make an update:

the wrong prediction is 1, so we update w_1: w_1 = w_1 - f(x5) = (1,-1,0,0)

the true answer is -1, so we update w_-1: w_-1 = w_-1 + f(x5) = (-1,1,-1,1)

and w_0 is not involved in this update.

The same idea applies to the examples that follow...
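The multiclass update step can be sketched as below. The feature vector f(x5) = (-1, 1, -1, 1) and the pre-update weight vectors are our assumptions, chosen only so that the dot products match the text (w_-1 · f = 0, w_0 · f = -2, w_1 · f = 2):

```python
def multiclass_update(ws, f, y_true):
    """One multiclass perceptron step. ws: dict mapping class -> weight vector.

    Predict the class whose weight vector scores highest; on a mistake,
    move w_pred away from f and w_true toward f. Other classes are untouched.
    """
    scores = {c: sum(wi * fi for wi, fi in zip(w, f)) for c, w in ws.items()}
    y_pred = max(scores, key=scores.get)  # note: ties broken arbitrarily here
    if y_pred != y_true:
        ws[y_pred] = [wi - fi for wi, fi in zip(ws[y_pred], f)]
        ws[y_true] = [wi + fi for wi, fi in zip(ws[y_true], f)]
    return ws

# Reproducing the x5 update from the text (initial weights are assumed):
ws = {-1: [0, 0, 0, 0], 0: [0, 0, 1, -1], 1: [0, 0, -1, 1]}
ws = multiclass_update(ws, (-1, 1, -1, 1), y_true=-1)
print(ws[1])    # [1, -1, 0, 0]   <- moved away from f(x5)
print(ws[-1])   # [-1, 1, -1, 1]  <- moved toward f(x5)
print(ws[0])    # [0, 0, 1, -1]   <- unchanged
```

Both results match the hand-computed update above, and w_0 is indeed left alone.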

We would prefer a boundary with a wider margin on both sides of the classes, like L2.

If we use L1 as our boundary, classifying test (unseen) data is more likely to go wrong (overfitting), because the boundary is too tight against the +1 class (L1 sits right on some +1 data points).


The following slide is not our main topic, but you can take a look if you're interested.

 
