Homework 2 - Security Analytics - Decision Tree

Summary: Training and testing a default decision tree classifier on the malicious URL dataset with a 30% train / 70% test split gives an accuracy of 0.948. As max_depth increases, accuracy first rises and then levels off, with the best score around max_depth = 60. A random forest classifier under the same setup scores slightly higher with a stable trend. On the credit card fraud dataset, the decision tree's accuracy declines as max_depth grows, while the random forest stays relatively stable.

1. Use a decision tree classifier (default) to train and test the malicious URL dataset. (2pt)

2. Explore how the tree depth number can affect the accuracy score (3 pt)

  • Make a loop to set max_depth to 10, 20, 30…until 100, and observe the accuracy score change
  • What’s the tree depth number that can get the best accuracy score?

3. Use random forest to repeat step 2. What’s your observation? (3 pt)

4. Try both the decision tree and random forests with the credit card fraud dataset. What’s your observation on accuracy score change? (2 pt)

creditcard.csv
dataset.csv
hw2_Kyle Wang.ipynb

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score

urls = pd.read_csv("dataset.csv")

X = urls.iloc[:, 1:30]
y = urls['Result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1) # 30% training and 70% test 
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9484429512856958
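The notebook imports confusion_matrix and classification_report but only ever prints accuracy. A minimal sketch of how those two would round out the evaluation, using a synthetic stand-in for dataset.csv so the snippet runs on its own:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Synthetic stand-in for dataset.csv (29 features, binary label).
X, y = make_classification(n_samples=2000, n_features=29, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1)

clf = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Per-class precision/recall/F1, which a single accuracy number hides.
print(classification_report(y_test, y_pred))
```

On the real URL dataset the same two calls would apply unchanged to the `y_test, y_pred` pair already computed above.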

# Set max_depth to 10, 20, ..., 100 and observe the accuracy change
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

max_depth: 10
Accuracy: 0.9381056984106474
max_depth: 20
Accuracy: 0.9488305982685101
max_depth: 30
Accuracy: 0.9477968729810053
max_depth: 40
Accuracy: 0.9505104018607056
max_depth: 50
Accuracy: 0.9477968729810053
max_depth: 60
Accuracy: 0.9481845199638196
max_depth: 70
Accuracy: 0.9489598139294483
max_depth: 80
Accuracy: 0.9461170693888099
max_depth: 90
Accuracy: 0.9497351078950769
max_depth: 100
Accuracy: 0.9490890295903863

max_depth = 10 consistently gives the lowest accuracy. The trend is for accuracy to rise first and then level off. Across repeated runs, max_depth = 60 most often gives the best score (in the single run shown above, max_depth = 40 happened to score highest).
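Rather than eyeballing the printout, the best depth can be picked mechanically by collecting the scores in a dict and taking the argmax. A sketch on synthetic stand-in data (the real notebook would reuse its existing `X_train`/`X_test` split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=29, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1)

# Record accuracy for each candidate depth instead of just printing it.
scores = {}
for depth in range(10, 101, 10):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    scores[depth] = accuracy_score(y_test, clf.predict(X_test))

# The depth with the highest recorded accuracy.
best_depth = max(scores, key=scores.get)
print("best max_depth:", best_depth, "accuracy:", round(scores[best_depth], 4))
```

Fixing `random_state` on the classifier also makes the "best depth" answer reproducible from run to run, which the unseeded trees above are not.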


from sklearn.ensemble import RandomForestClassifier

# Set max_depth to 10, 20, ..., 100 and observe the accuracy change
for depth in range(10, 101, 10):
    print(depth)
    clf = RandomForestClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

10
Accuracy: 0.9483137356247577
20
Accuracy: 0.9594262824654348
30
Accuracy: 0.9591678511435586
40
Accuracy: 0.9598139294482492
50
Accuracy: 0.9589094198216824
60
Accuracy: 0.960847654735754
70
Accuracy: 0.9605892234138778
80
Accuracy: 0.960847654735754
90
Accuracy: 0.9603307920920016
100
Accuracy: 0.9592970668044967

The overall accuracy improves slightly with the random forest, and the overall trend is unchanged: accuracy rises first and then levels off.
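Because an unseeded random forest gives slightly different scores each run, the ranking between depths can flip between executions. One way to hedge against that is to average a few seeded fits per configuration; a sketch on synthetic stand-in data:

```python
import statistics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1500, n_features=29, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1)

# Average several seeds so one lucky/unlucky forest doesn't decide the ranking.
scores = [
    accuracy_score(
        y_test,
        RandomForestClassifier(max_depth=20, n_estimators=50, random_state=seed)
        .fit(X_train, y_train)
        .predict(X_test),
    )
    for seed in range(3)
]
print("mean accuracy:", round(statistics.mean(scores), 4))
```

The same averaging wrapped around the depth loop above would make the "slightly improved, trend unchanged" observation quantitative rather than a single-run impression.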


cards = pd.read_csv("creditcard.csv")

X = cards.iloc[:, 0:30]
y = cards['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1) # 30% training and 70% test 
# Set max_depth to 10, 20, ..., 100 and observe the accuracy change
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

max_depth: 10
Accuracy: 0.9993027863466506
max_depth: 20
Accuracy: 0.999117197100795
max_depth: 30
Accuracy: 0.999132244877486
max_depth: 40
Accuracy: 0.9991121811752314
max_depth: 50
Accuracy: 0.9990770696962857
max_depth: 60
Accuracy: 0.999102149324104
max_depth: 70
Accuracy: 0.999102149324104
max_depth: 80
Accuracy: 0.9991372608030497
max_depth: 90
Accuracy: 0.9990770696962857
max_depth: 100
Accuracy: 0.9990319263662127

With DecisionTreeClassifier, the overall accuracy trends slightly downward as max_depth increases.


# Set max_depth to 10, 20, ..., 100 and observe the accuracy change
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = RandomForestClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

max_depth: 10
Accuracy: 0.9994181526346149
max_depth: 20
Accuracy: 0.9994382163368696
max_depth: 30
Accuracy: 0.9994332004113059
max_depth: 40
Accuracy: 0.9994382163368696
max_depth: 50
Accuracy: 0.9994231685601785
max_depth: 60
Accuracy: 0.9994332004113059
max_depth: 70
Accuracy: 0.9994332004113059
max_depth: 80
Accuracy: 0.9994181526346149
max_depth: 90
Accuracy: 0.9994131367090512
max_depth: 100
Accuracy: 0.9994281844857422

With RandomForestClassifier, accuracy stays relatively stable across all tested depths.
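One caveat worth noting: the credit card fraud dataset is extremely imbalanced (fraud is a tiny fraction of transactions), so ~0.999 accuracy is close to what always predicting "not fraud" would score, and the small differences between depths say little. The roc_auc_score imported at the top of the notebook (but never used) is more informative here. A sketch on imbalanced synthetic data mimicking the fraud setting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Heavily imbalanced synthetic data: ~1% positives, mimicking rare fraud.
X, y = make_classification(n_samples=5000, n_features=29, weights=[0.99], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=1, stratify=y
)

clf = RandomForestClassifier(max_depth=10, n_estimators=50, random_state=1)
clf.fit(X_train, y_train)

# Accuracy is dominated by the majority class; AUC ranks the rare positives.
acc = accuracy_score(y_test, clf.predict(X_test))
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"accuracy={acc:.4f}  roc_auc={auc:.4f}")
```

On the real creditcard.csv the same two metric calls would apply to the existing `y_test` and classifier, and would likely separate the models far more than the fourth decimal place of accuracy does.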

