Machine Learning Notes (Washington University) - Classification Specialization - Week Three & Week Four

1. Quality metric

The quality metric for a decision tree is the classification error:

error = number of incorrect predictions / number of examples
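
As a quick illustration, a minimal Python sketch of this metric (the function and variable names are my own):

def classification_error(y_true, y_pred):
    # fraction of examples whose predicted label differs from the true label
    mistakes = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / len(y_true)

# example: 2 mistakes out of 5 examples -> error = 0.4
print(classification_error([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))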

 

2. Greedy algorithm

Procedure

Step 1: Start with an empty tree

Step 2: Select a feature to split data

Explanation:

  Split the data on each feature

  Calculate the classification error of the resulting decision stump

  Choose the feature with the lowest error

For each split of the tree:

  Step 3: If all data points in the node have the same y value,

      or if we have already used up all the features, stop.

  Step 4: Otherwise, go to Step 2 and recurse on this split (see the sketch below)
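
Putting the procedure together, a rough Python sketch of the greedy tree-building loop (the data layout and helper names such as best_split are my own, not from the course):

def classification_error_count(labels):
    # number of mistakes made by predicting the majority class
    if not labels:
        return 0
    majority = max(set(labels), key=labels.count)
    return sum(1 for y in labels if y != majority)

def best_split(data, labels, features):
    # Step 2: pick the feature whose decision stump has the lowest error
    best_feature, best_error = None, float('inf')
    for f in features:
        error = 0
        for value in set(x[f] for x in data):
            subset = [y for x, y in zip(data, labels) if x[f] == value]
            error += classification_error_count(subset)
        if error < best_error:
            best_feature, best_error = f, error
    return best_feature

def build_tree(data, labels, features):
    # Step 3: stop if all labels agree or all features are used up
    if len(set(labels)) == 1 or not features:
        return {'leaf': True, 'prediction': max(set(labels), key=labels.count)}
    feature = best_split(data, labels, features)
    remaining = [f for f in features if f != feature]
    children = {}
    # Step 4: recurse on each split of the chosen feature
    for value in set(x[feature] for x in data):
        idx = [i for i, x in enumerate(data) if x[feature] == value]
        children[value] = build_tree([data[i] for i in idx],
                                     [labels[i] for i in idx], remaining)
    return {'leaf': False, 'feature': feature, 'children': children}

Here each data point is a dict of feature -> value, and labels is a parallel list of class labels.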

Prediction algorithm

predict(tree_node, input)

if current tree_node is a leaf:

  return majority class of data points in leaf

else:

  next_node = child node of tree_node whose feature value agrees with input

  return predict(next_node, input)
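
A runnable Python version of this traversal, assuming the tree dictionaries produced by the build_tree sketch above:

def predict(tree_node, x):
    # if the current tree_node is a leaf, return the majority class stored in it
    if tree_node['leaf']:
        return tree_node['prediction']
    # otherwise descend into the child whose feature value agrees with the input
    value = x[tree_node['feature']]
    return predict(tree_node['children'][value], x)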

3. Threshold split

Threshold splits are used for continuous inputs: we pick a threshold value for the feature and split the data on it.

Procedure:

Step 1: Sort the values of the feature hj(x): {v1, v2, ..., vN}

Step 2: For i = 1, ..., N-1 (all adjacent pairs of sorted values):

      Consider the split ti = (vi + vi+1)/2

      Compute the classification error of this split

    Choose the ti with the lowest classification error
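
A minimal sketch of this scan over candidate thresholds, reusing the classification_error_count helper from the greedy-algorithm sketch above:

def best_threshold(values, labels):
    # Step 1: sort the examples by the feature value
    pairs = sorted(zip(values, labels))
    sorted_values = [v for v, _ in pairs]
    best_t, best_error = None, float('inf')
    # Step 2: consider the midpoint between each pair of consecutive values
    for i in range(len(pairs) - 1):
        t = (sorted_values[i] + sorted_values[i + 1]) / 2.0
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        error = classification_error_count(left) + classification_error_count(right)
        if error < best_error:
            best_t, best_error = t, error
    return best_t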

 

4. Overfitting

As the depth of the tree increases, overfitting can occur.

Remedies

1. Early Stopping

Stop the learning algorithm before the tree becomes too complex.

For example:

  • Limit the depth of the tree (it is difficult to choose a good depth value)
  • Stop when splitting no longer reduces the classification error (can be dangerous, e.g. for XOR-like data where no single split helps)
  • Stop if the number of data points in an intermediate node is too small
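
These early-stopping rules show up as hyperparameters in off-the-shelf implementations; for instance, scikit-learn's DecisionTreeClassifier exposes limits along these lines (a sketch, not part of the course notes):

from sklearn.tree import DecisionTreeClassifier

# each argument corresponds to one early-stopping rule above
clf = DecisionTreeClassifier(
    max_depth=5,                 # limit the depth of the tree
    min_impurity_decrease=0.01,  # require each split to improve the splitting criterion
    min_samples_split=20,        # stop if too few data points reach a node
)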

2. Pruning

Simplify the tree after the learning algorithm terminates.

Consider a specific total cost:

Total cost = classification error + λ * number of leaf nodes

Start at the bottom of tree T and traverse up, applying prune_split(T, M) to each decision node M.

prune_split(T,M):

1. Compute the total cost of tree T using the formula above,

C(T) = Error(T) + λ·L(T)

2. Let Tsmaller be the tree after pruning the subtree below M

3. Compute the total cost of Tsmaller: C(Tsmaller) = Error(Tsmaller) + λ·L(Tsmaller)

4. If C(Tsmaller) < C(T), prune to Tsmaller
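
The costs in steps 1 and 3 can be computed with small helpers over the tree dictionaries from the earlier sketches (lam plays the role of λ; these helper names are my own):

def count_leaves(tree):
    # L(T): number of leaf nodes
    if tree['leaf']:
        return 1
    return sum(count_leaves(child) for child in tree['children'].values())

def tree_error(tree, data, labels):
    # Error(T): classification error of the tree on the given data
    mistakes = sum(1 for x, y in zip(data, labels) if predict(tree, x) != y)
    return mistakes / len(labels)

def total_cost(tree, data, labels, lam):
    # C(T) = Error(T) + lambda * L(T)
    return tree_error(tree, data, labels) + lam * count_leaves(tree)

# prune_split(T, M) then keeps the smaller tree only when its cost is lower:
# if total_cost(T_smaller, data, labels, lam) < total_cost(T, data, labels, lam), use T_smaller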

 

5. Missing data

1. Purification: skipping data points or skipping features

Cons:

1. Removing data points or features may remove important info from data

2. Unclear when it is better to remove data points versus features

3. Does not help if data is missing at prediction time

 

2. Imputation

Fill in the missing values.

1. Categorical feature

Fill in the most popular value of xi

2. Numerical feature 

Fill in the average or median value of xi
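
A minimal sketch of both imputation rules in plain Python (the feature lists are made-up examples):

from statistics import mean, mode

# numerical feature: replace missing values (None) with the mean (or median)
heights = [1.7, None, 1.6, 1.8]
observed = [v for v in heights if v is not None]
heights_filled = [v if v is not None else mean(observed) for v in heights]

# categorical feature: replace missing values with the most common value
colors = ['red', None, 'red', 'blue']
most_common = mode([v for v in colors if v is not None])
colors_filled = [v if v is not None else most_common for v in colors]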

Cons:

May result in systematic error

 

3. Adding missing value choice to every decision node

We use the classification error to decide which branch the unknown (missing) values should be sent down, as sketched below.

Cons:

Requires modifying the learning algorithm
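
A minimal sketch of that idea: when building a node, tentatively route the examples with missing feature values down each branch in turn and keep the assignment with the lowest classification error (reusing classification_error_count from the greedy sketch; the helper name is my own):

def best_branch_for_missing(children_labels, missing_labels):
    # children_labels: dict mapping branch value -> labels already routed there
    # missing_labels: labels of examples whose feature value is missing
    best_value, best_error = None, float('inf')
    for value in children_labels:
        # send the missing examples down this branch and measure the stump error
        error = sum(
            classification_error_count(labels + (missing_labels if v == value else []))
            for v, labels in children_labels.items())
        if error < best_error:
            best_value, best_error = value, error
    return best_value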

 

Reposted from: https://www.cnblogs.com/climberclimb/p/6848037.html
