Decision Tree in Python

Part1. Introduction

This article records the process of building a decision tree (ID3), which is my first course project.

The work is to load and process the data first, then use the training data to build the decision tree, and finally use the test data to predict whether people can go out to play.

The data set:

id  outlook   temperature  humidity  wind    play
1   sunny     hot          high      weak    no
2   sunny     hot          high      strong  no
3   overcast  hot          high      weak    yes
4   rainy     mild         high      weak    yes
5   rainy     cool         normal    weak    yes
6   rainy     cool         normal    strong  no
7   overcast  cool         normal    strong  yes
8   sunny     mild         high      weak    no
9   sunny     cool         normal    weak    yes
10  rainy     mild         normal    weak    yes
11  sunny     mild         normal    strong  yes
12  overcast  mild         high      strong  yes
13  overcast  hot          normal    weak    yes
14  rainy     mild         high      strong  no

In this data set there are 4 features, each taking a varying number of discrete values. I will use entropy as the splitting criterion to build the tree (sklearn also supports Gini).
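
As a quick check of what these criteria measure: the play column has 9 yes and 5 no, so its entropy is about 0.940 bits and its Gini impurity about 0.459. A minimal sketch of the two measures (my own illustration, not part of the project code):

import math

def entropy(counts):
    # Shannon entropy of a class distribution, in bits
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    # Gini impurity of the same distribution
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(entropy([9, 5]))  # ~0.940: the play column, 9 yes vs 5 no
print(gini([9, 5]))     # ~0.459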

Part2. Implementation

1. load data:

        I save the data set as a txt file, so the first step is to load the data into a numpy ndarray and reshape it.

import numpy as np

'''
function: load a txt file
input: the txt file path
output: the data (ndarray) from the txt file
'''
def Myread_data(path):
    data = []
    feature = []
    with open(path, 'r') as f:       # the file handle is closed automatically
        f_data = f.readlines()       # f_data -> list of lines
    for row in f_data:
        row = row.strip('\n')
        data.append(row.split(' '))  # split each line on spaces
    for i in range(len(data)):
        for j in data[i]:
            feature.append(j)        # flatten into one list of tokens
    array = np.array(feature)
    array = array.reshape(len(f_data), int(array.size / len(f_data)))
    return array

# # read and process data
file_path = 'play.txt'
words = Myread_data(file_path)  # get data
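
        For reference, Myread_data above assumes play.txt is space-separated, one record per line, with the header on the first line. Reconstructed from the data set (the original file itself is not shown), its first lines would look like:

id outlook temperature humidity wind play
1 sunny hot high weak no
2 sunny hot high strong no

        A quick shape check after loading:

print(words.shape)  # expect (15, 6): one header row plus 14 records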

2. process data:

        Because the feature values are strings, I wrote a function to convert each feature value into a number.

'''
function: 1. turn feature values (words) into numbers
          2. strip off the header row and the id column
input: data (words)
output: features (numbers)
'''
def Myword2num(word):
    dic = {}
    data = word[1:, 1:].copy()  # copy, so the original words array is not overwritten in place
    for j in range(data.shape[1]):
        count = 0               # codes restart at 0 in every column
        for i in range(data.shape[0]):
            if data[i, j] not in dic:
                dic[data[i, j]] = count
                count += 1
    # dic is shared across columns, which works here because no value repeats between columns
    for j in range(data.shape[1]):
        for i in range(data.shape[0]):
            if data[i, j] in dic:
                data[i, j] = dic[data[i, j]]
    return data.astype(int)     # return numeric codes, which is what sklearn expects


numbers = Myword2num(words)   # encode once, then split into features and labels
features = numbers[:, :-1]
labels = numbers[:, -1]
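
        As a side note, sklearn ships an encoder that does the same per-column mapping; a minimal sketch of the equivalent call (my own alternative, not the original code):

from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()                      # maps each column's values to 0, 1, 2, ... independently
encoded = enc.fit_transform(words[1:, 1:])  # skip the header row and id column, as Myword2num does
# note: OrdinalEncoder orders categories alphabetically, so the exact codes may differ from Myword2num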

        Next, I set the feature names and label names and divide the data into training and test sets (both containing features and labels).

from sklearn.model_selection import train_test_split

feature_names = words[0, 1:-1]  # ['outlook' 'temperature' 'humidity' 'wind']
label_names = ['No', 'Yes']
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.3)  # split into training and test sets
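
        With only 14 rows, a 0.3 test split keeps just 5 test samples, so the score can swing a lot between runs. Fixing the split with a random_state makes runs comparable (the seed value below is my own choice, not from the original code):

X_train, X_test, Y_train, Y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)  # any fixed seed works; 42 is a hypothetical choice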

3. build decision tree

        The sklearn library provides several functions that help the user build a tree:

                1) DecisionTreeClassifier():

                        criterion: chooses 'entropy' or 'gini' as the impurity measure used to decide how to split the data.

                        random_state: something like a random seed, used to keep the model reproducible between runs.

                        splitter: 'best' means each split favors the more important features, while 'random' makes the choice of split more random.

                        max_depth: the maximum depth the tree is allowed to grow to.

                2) fit(): fits the model to the features and labels.

                3) score(): grades the model on the test data; the closer the score is to 1, the better the model.

        For training, I run a loop 7 times, increasing max_depth by 1 in each iteration; I think this helps show how well the model fits at each depth.

# # build the decision tree
from sklearn import tree

test_score = []  # save the score for every run
# train 7 times; the depth starts at 1 and grows by 1 each loop
for i in range(7):
    clf = tree.DecisionTreeClassifier(criterion="entropy",
                                      random_state=30,
                                      splitter='best',
                                      max_depth=i + 1)
    clf = clf.fit(X_train, Y_train)    # fit features and labels
    score = clf.score(X_test, Y_test)  # test accuracy; closer to 1 is better
    test_score.append(score)
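
        To read the best depth off programmatically instead of from the plot (my own addition, not in the original code):

best_depth = int(np.argmax(test_score)) + 1  # +1 because depth starts at 1, not 0
print(best_depth, max(test_score))           # depth with the highest test accuracy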

4. show the tree and scores

        As described in the last section, the loop runs 7 times, which means there are 7 scores. I keep the scores and plot them with matplotlib. Meanwhile, I use the plot_tree function to draw the decision tree from the final training run.

# # show the train scores
import matplotlib.pyplot as plt

plt.plot(range(1, 8), test_score, color="red", label="max_depth")  # one score per depth
plt.legend()  # needed, otherwise the label above never appears
plt.show()

# # show the decision tree
tree.plot_tree(clf,
               feature_names=feature_names,
               class_names=label_names,
               filled=True,
               rounded=True)
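
        One practical tweak (my addition, not part of the original code): plot_tree output is cramped at the default figure size, so enlarging the canvas before drawing makes the node text readable:

plt.figure(figsize=(12, 8))  # bigger canvas for the tree drawing
tree.plot_tree(clf, feature_names=feature_names, class_names=label_names,
               filled=True, rounded=True)
plt.show()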

5. results analysis

        The scores:

[Figure: test accuracy for max_depth = 1 to 7]

         As we can see, across the 7 runs the scores for depths 1 and 2 are lower, meaning those shallow trees fit poorly, while all the remaining scores are 0.8. Now let's look at the final tree.

[Figure: the decision tree from the final training run]

         Obviously, the feature wind carries the most weight in the splits. At the second layer, outlook and humidity have the same entropy, and it is a little strange that the third layer contains another outlook node; this happens because the samples reaching that node still have different entropies across outlook's values, so splitting on it again is informative.
