Labs
How a Decision Tree Is Built
1. Two things are key when building a decision tree: first, choosing which feature to use as the decision node so that the split yields the highest information gain; second, deciding when to stop splitting (e.g., return immediately once the maximum depth is reached or the node is 100% pure).
2. Computing information gain: (a) compute the information gain obtained when each feature is used as the decision node, recording the best feature and the corresponding subtrees; (b) information gain equals the parent node's entropy minus the weighted average entropy of its children; (c) entropy follows the binary entropy formula H(p) = -p*log2(p) - (1-p)*log2(1-p), where p is the fraction of positive examples at the node (its purity).
3. (The complete version of this code has been uploaded to CSDN.)
Note: the check if len(y) == 0: is essential; without it, np.sum(y)/len(y) divides by zero whenever a split leaves one branch empty.
import numpy as np

def compute_entropy(y):
    """
    Compute the entropy at a node.
    Args:
        y (ndarray): shape (m,), the binary target values at the node
    Returns:
        entropy (float): entropy at the node
    """
    # An empty node carries no information; this check also prevents
    # a division by zero in the purity computation below
    if len(y) == 0:
        return 0.
    fraction = np.sum(y) / len(y)  # fraction of positive examples (purity)
    # A pure node (all 0s or all 1s) has zero entropy; returning early
    # also avoids calling log2(0)
    if fraction == 0 or fraction == 1:
        return 0.
    entropy = -fraction * np.log2(fraction) - (1 - fraction) * np.log2(1 - fraction)
    return entropy
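A quick sanity check with a made-up label array (my own toy input, not from the lab): three positives out of five gives a purity of 0.6, so the entropy should be about 0.971.

y = np.array([1, 1, 0, 0, 1])  # toy data: 3 positives out of 5, purity p = 0.6
print(compute_entropy(y))      # -0.6*log2(0.6) - 0.4*log2(0.4) ≈ 0.971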
4. Splitting a node into left and right subtrees is the core step of the decision-tree recursion.
def split_dataset(X, node_indices, feature):
    """
    Split the examples at a node into left and right branches.
    Args:
        X (ndarray): shape (m, n), m examples with n binary features
        node_indices (list): the indices of the examples at this node
        feature (int): the feature to split on
    Returns:
        left_indices (list): indices where X[:, feature] == 1
        right_indices (list): indices where X[:, feature] == 0
    """
    left_indices = []
    right_indices = []
    for index in node_indices:
        if X[index, feature] == 1:
            left_indices.append(index)
        else:
            right_indices.append(index)
    return left_indices, right_indices
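For example, on a made-up three-example dataset, splitting on feature 0 sends the examples whose feature value is 1 to the left branch:

X = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # toy data, not from the lab
left, right = split_dataset(X, [0, 1, 2], feature=0)
print(left, right)               # [0, 2] [1]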
# UNQ_C3
# GRADED FUNCTION: compute_information_gain
def compute_information_gain(X, y, node_indices, feature):
    """
    Compute the information gain of splitting the node on a given feature.
    Args:
        X (ndarray): shape (m, n), m examples with n binary features
        y (ndarray): shape (m,), m target values
        node_indices (list): the indices of the examples at this node
        feature (int): the feature to split on
    Returns:
        information_gain (float): the information gain of the split
    """
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    # Some useful variables
    y_left, y_right = y[left_indices], y[right_indices]
    # You need to return the following variables correctly
    information_gain = 0
    ### START CODE HERE ###
    # Weights: the fraction of the node's examples sent to each branch
    w_left = len(y_left) / len(node_indices)
    w_right = len(y_right) / len(node_indices)
    # Entropy at the parent node and at each child
    root_entropy = compute_entropy(y[node_indices])
    entropy_left = compute_entropy(y_left)
    entropy_right = compute_entropy(y_right)
    # Information gain = parent entropy - weighted average child entropy
    information_gain = root_entropy - (w_left * entropy_left + w_right * entropy_right)
    ### END CODE HERE ###
    return information_gain
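As a sanity check on toy data of my own: a perfect split drops the parent entropy of 1.0 to 0 in both children, so the information gain is exactly 1.0.

X = np.array([[1], [1], [0], [0]])   # toy data: the single feature matches y exactly
y = np.array([1, 1, 0, 0])
print(compute_information_gain(X, y, [0, 1, 2, 3], feature=0))   # 1.0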
def get_best_split(X, y, node_indices):
    """
    Find the feature that yields the largest information gain.
    Args:
        X (ndarray): shape (m, n), m examples with n binary features
        y (ndarray): shape (m,), m target values
        node_indices (list): the indices of the examples at this node
    Returns:
        best_feature (int): the feature with the largest information gain,
                            or -1 if no split improves on the parent
    """
    num_features = X.shape[1]
    best_information_gain = 0.
    best_feature = -1
    for i in range(num_features):
        information_gain = compute_information_gain(X, y, node_indices, i)
        if information_gain > best_information_gain:
            best_feature = i
            best_information_gain = information_gain
    return best_feature
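Continuing with toy data: when only the first of two features is informative, the function should pick feature 0.

X = np.array([[1, 1],
              [1, 0],
              [0, 1],
              [0, 0]])              # toy data: feature 0 matches y, feature 1 is noise
y = np.array([1, 1, 0, 0])
print(get_best_split(X, y, [0, 1, 2, 3]))   # 0 (feature 1 gives zero gain)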
def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build the tree recursively, printing each split.
    Args:
        X (ndarray): shape (m, n), m examples with n binary features
        y (ndarray): shape (m,), m target values
        node_indices (list): the indices of the examples at this node
        branch_name (string): one of 'root', 'left', 'right'
        max_depth (int): maximum depth of the tree
        current_depth (int): depth of the current node
    Returns:
        None; the tree structure is printed
    """
    # Stopping criterion 1: maximum depth reached
    if current_depth == max_depth:
        formatting = " " * current_depth + "-" * current_depth + branch_name
        print(formatting + f" node with indices {node_indices}")
        return
    best_feature = get_best_split(X, y, node_indices)
    # Stopping criterion 2: the node is pure, so no feature improves the
    # information gain and get_best_split returns -1
    if best_feature == -1:
        formatting = " " * current_depth + "-" * current_depth + branch_name
        print(formatting + f" node with indices {node_indices}")
        return
    formatting = "-" * current_depth
    print(formatting + f"Depth {current_depth}, {branch_name}: split on feature {best_feature}")
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    build_tree_recursive(X, y, left_indices, 'left', max_depth, current_depth + 1)
    build_tree_recursive(X, y, right_indices, 'right', max_depth, current_depth + 1)
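Putting it all together on the same toy dataset (my own example; the output format follows the print statements above). Both children of the root are pure, so the purity check stops the recursion before max_depth is reached:

X = np.array([[1, 1],
              [1, 0],
              [0, 1],
              [0, 0]])
y = np.array([1, 1, 0, 0])
build_tree_recursive(X, y, [0, 1, 2, 3], 'root', max_depth=2, current_depth=0)
# Depth 0, root: split on feature 0
#  -left node with indices [0, 1]
#  -right node with indices [2, 3]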
Mood Log
It looks like I won't manage to finish within the planned 18 days, which is a little disappointing. Still, my studying has gone well overall this time, so I have decided to give myself a reasonable extension of a few days: finishing by August 31 is fine. That makes 21 days, i.e., three weeks, the same as the original course schedule.
Having finished this part of machine learning, I now have a basic understanding of AI. Next, I plan to continue with Andrew Ng's deep learning series, then learn the PyTorch framework and try to reproduce a paper.