Homework 2 - Security Analytics - Decision Tree

Summary: Training and testing a default decision tree classifier on the malicious URL dataset with a 30% train / 70% test split gives an accuracy of 0.948. As max_depth increases, accuracy first rises and then levels off, with the best score around max_depth = 60. A random forest classifier under the same setup scores slightly higher with a stable trend. On the credit card fraud dataset, the decision tree's accuracy declines as max_depth grows, while the random forest stays relatively stable.

1. Use a decision tree classifier (default) to train and test the malicious URL dataset. (2pt)

2. Explore how the tree depth number can affect the accuracy score (3 pt)

  • Make a loop to set max_depth to 10, 20, 30…until 100, and observe the accuracy score change
  • What’s the tree depth number that can get the best accuracy score?

3. Use random forest to repeat step 2. What’s your observation? (3 pt)

4. Try both the decision tree and random forests with the credit card fraud dataset. What’s your observation on accuracy score change? (2 pt)

creditcard.csv
dataset.csv
hw2_Kyle Wang.ipynb

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score

urls = pd.read_csv("dataset.csv")

X = urls.iloc[:, 1:30]
y = urls['Result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1) # 30% training and 70% test 
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9484429512856958
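The notebook imports confusion_matrix and classification_report but only ever prints accuracy. A minimal sketch of how those two would round out the evaluation, using a synthetic stand-in for dataset.csv so the snippet runs on its own:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Synthetic stand-in for dataset.csv (29 features, binary label).
X, y = make_classification(n_samples=2000, n_features=29, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1)

clf = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Per-class precision/recall/F1, which a single accuracy number hides.
print(classification_report(y_test, y_pred))
```

On the real URL dataset the same two calls would apply unchanged to the `y_test, y_pred` pair already computed above.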

# Set max_depth to 10, 20, ..., 100 and observe the accuracy change
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

max_depth: 10
Accuracy: 0.9381056984106474
max_depth: 20
Accuracy: 0.9488305982685101
max_depth: 30
Accuracy: 0.9477968729810053
max_depth: 40
Accuracy: 0.9505104018607056
max_depth: 50
Accuracy: 0.9477968729810053
max_depth: 60
Accuracy: 0.9481845199638196
max_depth: 70
Accuracy: 0.9489598139294483
max_depth: 80
Accuracy: 0.9461170693888099
max_depth: 90
Accuracy: 0.9497351078950769
max_depth: 100
Accuracy: 0.9490890295903863

max_depth = 10 consistently gives the lowest accuracy. The trend is for accuracy to rise first and then level off. Across repeated runs, max_depth = 60 most often gives the best score (in the single run shown above, max_depth = 40 happened to score highest).
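Rather than eyeballing the printout, the best depth can be picked mechanically by collecting the scores in a dict and taking the argmax. A sketch on synthetic stand-in data (the real notebook would reuse its existing `X_train`/`X_test` split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=29, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1)

# Record accuracy for each candidate depth instead of just printing it.
scores = {}
for depth in range(10, 101, 10):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    scores[depth] = accuracy_score(y_test, clf.predict(X_test))

# The depth with the highest recorded accuracy.
best_depth = max(scores, key=scores.get)
print("best max_depth:", best_depth, "accuracy:", round(scores[best_depth], 4))
```

Fixing `random_state` on the classifier also makes the "best depth" answer reproducible from run to run, which the unseeded trees above are not.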


from sklearn.ensemble import RandomForestClassifier

# Set max_depth to 10, 20, ..., 100 and observe the accuracy change
for depth in range(10, 101, 10):
    print(depth)
    clf = RandomForestClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

10
Accuracy: 0.9483137356247577
20
Accuracy: 0.9594262824654348
30
Accuracy: 0.9591678511435586
40
Accuracy: 0.9598139294482492
50
Accuracy: 0.9589094198216824
60
Accuracy: 0.960847654735754
70
Accuracy: 0.9605892234138778
80
Accuracy: 0.960847654735754
90
Accuracy: 0.9603307920920016
100
Accuracy: 0.9592970668044967

The overall accuracy improves slightly with the random forest, and the overall trend is unchanged: accuracy rises first and then levels off.
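Because an unseeded random forest gives slightly different scores each run, the ranking between depths can flip between executions. One way to hedge against that is to average a few seeded fits per configuration; a sketch on synthetic stand-in data:

```python
import statistics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1500, n_features=29, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1)

# Average several seeds so one lucky/unlucky forest doesn't decide the ranking.
scores = [
    accuracy_score(
        y_test,
        RandomForestClassifier(max_depth=20, n_estimators=50, random_state=seed)
        .fit(X_train, y_train)
        .predict(X_test),
    )
    for seed in range(3)
]
print("mean accuracy:", round(statistics.mean(scores), 4))
```

The same averaging wrapped around the depth loop above would make the "slightly improved, trend unchanged" observation quantitative rather than a single-run impression.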


cards = pd.read_csv("creditcard.csv")

X = cards.iloc[:, 0:30]
y = cards['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1) # 30% training and 70% test 
# Set max_depth to 10, 20, ..., 100 and observe the accuracy change
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

max_depth: 10
Accuracy: 0.9993027863466506
max_depth: 20
Accuracy: 0.999117197100795
max_depth: 30
Accuracy: 0.999132244877486
max_depth: 40
Accuracy: 0.9991121811752314
max_depth: 50
Accuracy: 0.9990770696962857
max_depth: 60
Accuracy: 0.999102149324104
max_depth: 70
Accuracy: 0.999102149324104
max_depth: 80
Accuracy: 0.9991372608030497
max_depth: 90
Accuracy: 0.9990770696962857
max_depth: 100
Accuracy: 0.9990319263662127

With DecisionTreeClassifier, the overall accuracy trends slightly downward as max_depth increases.


# Set max_depth to 10, 20, ..., 100 and observe the accuracy change
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = RandomForestClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

max_depth: 10
Accuracy: 0.9994181526346149
max_depth: 20
Accuracy: 0.9994382163368696
max_depth: 30
Accuracy: 0.9994332004113059
max_depth: 40
Accuracy: 0.9994382163368696
max_depth: 50
Accuracy: 0.9994231685601785
max_depth: 60
Accuracy: 0.9994332004113059
max_depth: 70
Accuracy: 0.9994332004113059
max_depth: 80
Accuracy: 0.9994181526346149
max_depth: 90
Accuracy: 0.9994131367090512
max_depth: 100
Accuracy: 0.9994281844857422

With RandomForestClassifier, accuracy stays relatively stable across all tested depths.
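One caveat worth noting: the credit card fraud dataset is extremely imbalanced (fraud is a tiny fraction of transactions), so ~0.999 accuracy is close to what always predicting "not fraud" would score, and the small differences between depths say little. The roc_auc_score imported at the top of the notebook (but never used) is more informative here. A sketch on imbalanced synthetic data mimicking the fraud setting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Heavily imbalanced synthetic data: ~1% positives, mimicking rare fraud.
X, y = make_classification(n_samples=5000, n_features=29, weights=[0.99], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=1, stratify=y
)

clf = RandomForestClassifier(max_depth=10, n_estimators=50, random_state=1)
clf.fit(X_train, y_train)

# Accuracy is dominated by the majority class; AUC ranks the rare positives.
acc = accuracy_score(y_test, clf.predict(X_test))
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"accuracy={acc:.4f}  roc_auc={auc:.4f}")
```

On the real creditcard.csv the same two metric calls would apply to the existing `y_test` and classifier, and would likely separate the models far more than the fourth decimal place of accuracy does.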

