Genetic algorithms:
Genetic algorithms are usually applied to parameter optimization and feature selection rather than to feature extraction itself. Feature extraction means deriving or generating new features from raw data, whereas feature selection picks a subset of the existing features.
Steps for GA-based feature selection:
Feature engineering: first extract candidate features from the raw data. For a heartbeat dataset these might include heart rate, R-peak intervals, and frequency-domain features of the ECG signal.
Feature selection: use the genetic algorithm to choose the best feature subset, i.e. the most representative or most relevant group of features, to improve model performance. The fitness function can incorporate model performance metrics (e.g. classification accuracy or regression error).
Encoding: represent each feature subset as a chromosome, where each gene indicates whether the corresponding feature is selected.
Fitness function: define a fitness function that evaluates each subset, typically by training a machine-learning model (a classifier or regressor) on the selected features and returning its score on a training or cross-validation set.
Genetic operators: design selection, crossover, and mutation operators to update the chromosomes. These operators should balance retaining useful features against exploring new ones, while guarding against overfitting.
Iterative optimization: repeat the genetic operations to generate new feature subsets, score them with the fitness function, and converge toward the best subset as the generations progress.
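The steps above can be sketched end to end. This is a minimal illustration, not the lab's implementation: it uses synthetic data from `make_classification` as a stand-in for engineered heartbeat features, a logistic-regression fitness, and hand-rolled tournament selection, one-point crossover, and bit-flip mutation; all names and parameter values here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data: 10 features, only 3 informative (a hypothetical stand-in
# for engineered heartbeat features such as heart rate or R-R intervals).
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

def fitness(mask):
    """Cross-validated accuracy on the selected feature subset,
    with a small penalty per feature to favour compact subsets."""
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[:, mask.astype(bool)], y, cv=3).mean()
    return acc - 0.01 * mask.sum()

def run_ga(pop_size=20, n_gen=15, p_mut=0.1):
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))   # binary chromosomes
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])

        def pick():  # tournament selection: fitter of two random parents
            i, j = rng.integers(0, pop_size, 2)
            return pop[i] if scores[i] >= scores[j] else pop[j]

        children = []
        for _ in range(pop_size):
            a, b = pick(), pick()
            cut = rng.integers(1, n)               # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < p_mut           # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        pop = np.array(children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()], scores.max()

best_mask, best_score = run_ga()
print('selected features:', np.flatnonzero(best_mask))
```

In practice the fitness would train the same model family used downstream (here, the MLP or SVM) and score it on a validation split rather than the test set.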
Objective 1: classify the heartbeat dataset with MLP and SVM models
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
from sko.GA import GA
from sko.DE import DE
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
# Data loading and preprocessing
data = pd.read_csv(r'D:\My Python\Computational Intelligence\实验二\data.csv')
new_data = pd.DataFrame(columns=['heartbeat_signals', 'label'])
for i in range(4):
    d = data[data['label'] == i]
    train_data = d.sample(n=1000, axis=0)   # draw 1000 rows per class
    new_data = pd.concat([new_data, train_data], ignore_index=True, join='inner')
X = new_data['heartbeat_signals'].values
Y = new_data['label'].values
X_n = []
for i in X:                                 # each signal is a comma-separated string
    X_n.append(i.split(','))
X_n = np.array(X_n).astype(np.float64)
Y_n = np.array(Y).astype(np.float64)
X_train, X_test, y_train, y_test = train_test_split(X_n, Y_n, test_size=0.3)
# Classification and prediction with an MLP
# clf = MLPClassifier(verbose=True, hidden_layer_sizes=(100, ), max_iter=1000)
# clf.fit(X_train, y_train)
# print('training accuracy:', clf.score(X_train, y_train))
# print('test accuracy:', clf.score(X_test, y_test))
# Hyperparameter search with GA
def mlp_fitness(p):
    # sko's GA minimizes, so return the negated accuracy.
    # Note: scoring on the test set inside the search leaks it into model
    # selection; a separate validation split would be cleaner.
    hidden_layer_sizes = tuple(int(i) for i in p)
    clf = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, max_iter=1000)
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    return -accuracy
ga = GA(func=mlp_fitness, n_dim=1, size_pop=4, max_iter=50, lb=[1], ub=[100], precision=1)  # precision=1 makes the variable integer-valued
start_time = time.time()
best_x, best_y = ga.run()
print('GA search time (s):', time.time() - start_time)
print('best_x:', best_x, '\n', 'best_y:', -best_y)
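After the search, one would normally refit a final model with the size the GA found. A minimal, self-contained sketch follows; the `best_x` value and the `make_classification` data are hypothetical stand-ins for the GA result and the heartbeat split above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data (in the lab script this would be the heartbeat split).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# sko's GA returns best_x as a float array; the value here is hypothetical.
best_x = np.array([37.4])
best_hidden = (int(round(best_x[0])),)      # round to an integer layer size
clf = MLPClassifier(hidden_layer_sizes=best_hidden, max_iter=1000,
                    random_state=0)
clf.fit(X_train, y_train)
print('final test accuracy:', clf.score(X_test, y_test))
```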
# Hyperparameter search with DE
# def obj_func(p):
#     hidden_layer_sizes = tuple(int(i) for i in p)
#     clf = MLPClassifier(verbose=True, hidden_layer_sizes=hidden_layer_sizes, max_iter=1000)
#     clf.fit(X_train, y_train)
#     accuracy = clf.score(X_test, y_test)
#     return -accuracy
#
# de = DE(func=obj_func, n_dim=1, size_pop=50, max_iter=2, lb=[1], ub=[100])
# start_time = time.time()
# best_x, best_y = de.run()
# print('DE search time (s):', time.time() - start_time)
# print('best_x:', best_x, '\n', 'best_y:', -best_y)
from sklearn import neural_network as nn
import time
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer,f1_score,accuracy_score, precision_score,recall_score
from sklearn.svm import SVC
start = time.time()
data = pd.read_csv('data.csv')
data = data.drop(columns=['id'])  # drop the 'id' column
new_data = pd.DataFrame(columns=['heartbeat_signals', 'label'])
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min