(11-3-04) Detecting Illicit Accounts on the Ethereum Blockchain: Train-Test Split (Splitting the Dataset)

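11.3.4  Train-Test Split (Splitting the Dataset)

Train-test split is a dataset-partitioning method widely used in machine learning and data analysis to evaluate a model's performance and generalization ability. Its main purpose is to divide the original dataset into two mutually exclusive subsets: a training set and a test set.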

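(1) Import the train_test_split function from the sklearn (Scikit-Learn) library and display the first few rows of the dataset. train_test_split is a common tool for splitting a dataset into a training set and a test set in a given proportion, so that a machine learning model can be trained on one part and evaluated on the other. The code is as follows.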

from sklearn.model_selection import train_test_split
dataset.head()

After execution, the output is:

	Address	FLAG	Avg min between sent tnx	Avg min between received tnx	Time Diff between first and last (Mins)	Sent tnx	Received Tnx	Number of Created Contracts	Unique Received From Addresses	Unique Sent To Addresses	...	max val sent to contract	total Ether sent	total ether balance	Total ERC20 tnxs	ERC20 total Ether received	ERC20 total ether sent	ERC20 total Ether sent contract	ERC20 uniq sent addr.1	ERC20 uniq rec contract addr	ERC20 min val rec
0	0x00009277775ac7d0d59eaad8fee3d10ac6c805e8	0	844.26	1093.71	704785.63	721	89	0	40	118	...	0.0	865.691093	-279.224419	265.0	3.558854e+07	3.560317e+07	0.0	0.0	58.0	0.0
1	0x0002b44ddb1476db43c868bd494422ee4c136fed	0	12709.07	2958.44	1218216.73	94	8	0	5	14	...	0.0	3.087297	-0.001819	8.0	4.034283e+02	2.260809e+00	0.0	0.0	7.0	0.0
2	0x0002bda54cb772d040f779e88eb453cac0daa244	0	246194.54	2434.02	516729.30	2	10	0	10	2	...	0.0	3.588616	0.000441	8.0	5.215121e+02	0.000000e+00	0.0	0.0	8.0	0.0
3	0x00038e6ba2fd5c09aedb96697c8d7b8fa6632e5e	0	10219.60	15785.09	397555.90	25	9	0	7	13	...	0.0	1750.045862	-854.646303	14.0	1.711105e+04	1.141223e+04	0.0	0.0	11.0	0.0
4	0x00062d1dd1afb6fb02540ddad9cdebfe568e0d89	0	36.61	10707.77	382472.42	4598	20	1	7	19	...	0.0	104.318883	-50.896986	42.0	1.628297e+05	1.235399e+05	0.0	0.0	27.0	0.0
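
To make the behaviour of train_test_split concrete, here is a minimal sketch on toy data. The arrays X_demo and y_demo are made up for illustration and are not the Ethereum dataset; random_state is set only so that the shuffle is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (hypothetical, not the Ethereum dataset)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Hold out 20% of the rows as the test set; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The two subsets are disjoint and together cover the whole dataset
train_rows = {tuple(row) for row in X_tr}
test_rows = {tuple(row) for row in X_te}
print(X_tr.shape, X_te.shape)
print(train_rows & test_rows)
```

With 10 rows and test_size=0.2, eight rows go to the training set and two to the test set, and no row appears in both.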

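(2) Store the target (response) variable in y and the feature variables in X, dropping the "FLAG" and "Address" columns from the features. Then define a function named train_val_test_split that splits the dataset into training, validation, and test sets by calling train_test_split twice. Finally, use train_val_test_split to split the dataset into a training set (80%), a validation set (10%), and a test set (10%), stored in X_train, X_val, X_test, y_train, y_val, and y_test. The code is as follows.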

# Put the response variable in y and the feature variables in X
y = dataset['FLAG']
X = dataset.drop(['FLAG', 'Address'], axis=1)

# Helper that splits the data into training, validation, and test sets
def train_val_test_split(X, y, train_size, val_size, test_size):
    # First hold out the test set
    X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=test_size)
    # Share of the remaining rows that goes to training, e.g. 0.8 / (0.1 + 0.8)
    relative_train_size = train_size / (val_size + train_size)
    X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
                                                      train_size=relative_train_size, test_size=1-relative_train_size)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Split the dataset into training (80%), validation (10%), and test (10%) sets
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y, 0.8, 0.1, 0.1)
X_train.shape, y_train.shape, X_test.shape, y_test.shape, X_val.shape, y_val.shape

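The last line displays the shapes of the six resulting objects. These shapes can be used to check that every split has the expected dimensions, and they serve as a reference during training, validation, and testing.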
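
The arithmetic behind the two-stage split is worth spelling out: after 10% of the rows are held out for testing, 90% remain, and 0.8 / 0.9 (about 0.889) of that remainder equals 80% of the original data. The sketch below reproduces this on synthetic data. It is a hypothetical variant, not the function used in this section: the name stratified_train_val_test_split and the seed argument are assumptions made for illustration, while stratify and random_state are standard train_test_split parameters that keep the class ratio similar in every split and make the result repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_train_val_test_split(X, y, train_size, val_size, test_size, seed=0):
    """Hypothetical variant of train_val_test_split with stratification and a fixed seed."""
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Fraction of the remaining rows that should become the validation set
    relative_val_size = val_size / (train_size + val_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=relative_val_size,
        stratify=y_train_val, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# Synthetic stand-in data: 1000 rows, 20% positive labels (not the Ethereum dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
y_demo = pd.Series([1] * 200 + [0] * 800, name="FLAG")

X_tr, X_va, X_te, y_tr, y_va, y_te = stratified_train_val_test_split(
    X_demo, y_demo, 0.8, 0.1, 0.1
)
print(X_tr.shape, X_va.shape, X_te.shape)   # roughly 800 / 100 / 100 rows
print(y_tr.mean(), y_va.mean(), y_te.mean())  # positive rate stays close to 0.2
```

Because the split sizes are computed from floating-point fractions, the training and validation counts may differ from 800 and 100 by one row; the three parts always add up to the full dataset.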

(3) Get the column names of the training set X_train. The code is as follows.

X_train.columns

After execution, this returns the names of the feature columns in the training set (the target column is not included):

Index([' Total ERC20 tnxs', ' ERC20 uniq rec contract addr',
       'total ether balance', 'Time Diff between first and last (Mins)',
       'max value received ', 'avg val received',
       ' ERC20 total Ether received', ' ERC20 min val rec',
       'Unique Received From Addresses', 'Received Tnx',
       'Avg min between received tnx', 'min value received',
       'Avg min between sent tnx', 'total Ether sent', 'avg val sent',
       'max val sent', 'Sent tnx', 'Unique Sent To Addresses'],
      dtype='object')
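
Several of the column names above carry leading or trailing spaces (for example ' Total ERC20 tnxs' and 'max value received '), so a column has to be selected with exactly that spelling. The sketch below illustrates this on a tiny made-up frame; the values are arbitrary, and stripping the names is an optional clean-up that this section does not perform.

```python
import pandas as pd

# Tiny hypothetical frame that reuses two of the column names shown above
demo = pd.DataFrame({
    " Total ERC20 tnxs": [1.0, 2.0],
    "max value received ": [3.0, 4.0],
})

# Lookups must match the stored name exactly, including the whitespace
print(" Total ERC20 tnxs" in demo.columns)
print("Total ERC20 tnxs" in demo.columns)

# Optional clean-up: strip surrounding whitespace from every column name
cleaned = demo.rename(columns=str.strip)
print(list(cleaned.columns))
```

If the names were cleaned this way, every later reference to the columns would have to use the stripped spelling as well.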

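(4) Use mutual information to estimate how important each feature is for predicting the target, and plot the importance of the 18 features with the largest mutual information (information gain). The code is as follows.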

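# Notebook shell command kept from the original; skfeature is not used in this snippet
!pip install skfeature-chappers
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif
# Estimate the mutual information between each feature and the target
importance = mutual_info_classif(X_train, y_train)
feat_importances = pd.Series(importance, index=X_train.columns)
# Horizontal bar chart of the 18 highest-scoring features
plt.figure(figsize=[30, 15])
feat_importances.nlargest(18).plot(kind='barh', color='teal')
plt.show()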
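
As a rough illustration of what mutual_info_classif measures, the sketch below scores two synthetic features: one that is built from the label plus a little noise, and one that is pure noise. The data and the column names "informative" and "noise" are made up for this example; random_state is passed only to make the estimate repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Binary labels and two synthetic features (not the Ethereum dataset)
y_demo = rng.integers(0, 2, size=n)
X_demo = pd.DataFrame({
    "informative": y_demo + rng.normal(scale=0.3, size=n),  # depends on the label
    "noise": rng.normal(size=n),                            # independent of the label
})

# Estimated mutual information between each feature and the label
scores = mutual_info_classif(X_demo, y_demo, random_state=0)
mi = pd.Series(scores, index=X_demo.columns).sort_values(ascending=False)
print(mi)
```

The feature that depends on the label receives a clearly higher score than the noise feature, whose score stays near zero. Mutual information is estimated from the data, so the exact values change with the sample and the seed.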

(5) Get the column names of the 18 features with the largest mutual information and store them in a variable named col_x. The code is as follows.

col_x=feat_importances.nlargest(18).index
col_x

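After execution, this returns the names of these important features, that is, the features that are most informative about the target variable.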

Index([' Total ERC20 tnxs', ' ERC20 uniq rec contract addr',
       'total ether balance', 'Time Diff between first and last (Mins)',
       'max value received ', 'avg val received',
       ' ERC20 total Ether received', ' ERC20 min val rec',
       'Unique Received From Addresses', 'Received Tnx',
       'Avg min between received tnx', 'min value received',
       'Avg min between sent tnx', 'total Ether sent', 'avg val sent',
       'max val sent', 'Sent tnx', 'Unique Sent To Addresses'],
      dtype='object')

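(6) Keep only these 18 features in the training set X_train, the validation set X_val, and the test set X_test, storing the reduced data back in the same variables. The last line displays the full feat_importances series. The code is as follows.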

X_train=X_train[col_x]
X_val=X_val[col_x]
X_test=X_test[col_x]
feat_importances

After execution, the output is:

Avg min between sent tnx                   0.096649
Avg min between received tnx               0.102166
Time Diff between first and last (Mins)    0.237711
Sent tnx                                   0.068052
Received Tnx                               0.109679
####### part of the output omitted
 ERC20 total Ether sent contract           0.005287
 ERC20 uniq sent addr.1                    0.001419
 ERC20 uniq rec contract addr              0.254201
 ERC20 min val rec                         0.141128
dtype: float64
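
The selection pattern in steps (5) and (6) can be sketched on a toy example: rank the scores with nlargest, take the resulting index, and apply that same index to every split so that all of them end up with identical columns in identical order. The scores and frames below are made up for illustration.

```python
import pandas as pd

# Hypothetical importance scores for four features
scores = pd.Series({"a": 0.30, "b": 0.05, "c": 0.20, "d": 0.01})

# Names of the two highest-scoring features, in descending order of score
top_cols = scores.nlargest(2).index

# Toy training and test frames with the same four columns
X_tr_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
X_te_demo = pd.DataFrame({"a": [9], "b": [10], "c": [11], "d": [12]})

# Apply the same column selection to every split
X_tr_sel = X_tr_demo[top_cols]
X_te_sel = X_te_demo[top_cols]
print(list(X_tr_sel.columns), list(X_te_sel.columns))
```

Note that the ranking is computed from the training data only and then reused for the validation and test sets, which keeps information from the held-out data out of the feature selection.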

To be continued
