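Machine learning project workflow
- Import the data
- Inspect the data features
- Visualize the data
- Evaluate the algorithm
- Make predictions
- Analyze the results
I. Import libraries and data
1. Import libraries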
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt
2. Import the data
names = ['sepal-length','sepal-width','petal-length','petal-width','class']
data = pd.read_csv(r'D:/iris.csv',names = names)
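If the CSV file is not available locally, a roughly equivalent DataFrame can be built from the copy of the Iris dataset bundled with scikit-learn. This is only a sketch under assumptions: the column names mirror the tutorial's naming, and the class labels come out as `setosa`, `versicolor`, `virginica`, which may differ from the labels stored in `iris.csv`.

```python
import pandas as pd
from sklearn import datasets

# Column names follow the tutorial's naming (an assumption, not part of scikit-learn)
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

iris = datasets.load_iris()

# The four measurement columns come from iris.data (150 rows x 4 columns)
data = pd.DataFrame(iris.data, columns=names[:4])

# Map the integer targets (0, 1, 2) to their species names
data['class'] = iris.target_names[iris.target]

print(data.shape)
print(data.groupby('class').size())
```

The remaining steps should work unchanged on this DataFrame, since the first four columns are the features and the fifth is the class.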
II. Data features
1. Data dimensions: check the number of rows and columns, and the column names.
data.shape
data.columns
2. View the first 5 rows
data.head()
3. Summary statistics
data.describe()
4. Group the data by class and count the rows in each class
data.groupby('class').size()
III. Data visualization
1. Box plots
data.plot(kind = 'box',subplots = True, layout = (2,2),sharex = False, sharey = False)
plt.show()
2. Histograms
data.hist()
plt.show()
3. Scatter matrix
pd.plotting.scatter_matrix(data,color = 'r')
plt.show()
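IV. Evaluate the algorithm
1. Split the dataset: X_train and Y_train are used to train the algorithm and build the model; X_validation and Y_validation are held out to evaluate the model.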
from sklearn.model_selection import train_test_split
array = data.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X,Y, test_size = validation_size, random_state = seed)
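As a quick sanity check on the split, the shapes of the resulting arrays can be printed. The sketch below assumes the standard 150-row Iris dataset and loads it from scikit-learn so that it runs on its own; with `test_size = 0.2` this gives 120 rows for training and 30 for validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled 150-row Iris data stands in for iris.csv
X, Y = load_iris(return_X_y=True)

validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

# 80% of the rows go to training, 20% to validation
print(X_train.shape, X_validation.shape)
```

Passing `stratify=Y` to `train_test_split` is an optional variation that keeps the class proportions equal in both parts; the tutorial itself does not use it.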
2. Create the model and estimate its accuracy with 10-fold cross-validation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
models = {}
kfold = KFold(n_splits = 10, shuffle = True, random_state = seed)  # random_state requires shuffle = True in recent scikit-learn versions
models['LDA'] = LinearDiscriminantAnalysis()
results = []
cv_results = cross_val_score(models['LDA'],X_train, Y_train, cv = kfold, scoring = 'accuracy')
print('Linear Discriminant Analysis (LDA): accuracy %f' % (cv_results.mean()))
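The `models` dictionary makes it easy to cross-validate several algorithms in the same loop. The sketch below is an illustrative extension, not part of the original walkthrough: the three extra classifiers (`KNN`, `CART`, `NB`) are assumptions chosen for comparison, and the data is loaded from scikit-learn so the block is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

seed = 7
X, Y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.2, random_state=seed)

# LDA is the tutorial's model; the others are added here only for comparison
models = {
    'LDA': LinearDiscriminantAnalysis(),
    'KNN': KNeighborsClassifier(),
    'CART': DecisionTreeClassifier(random_state=seed),
    'NB': GaussianNB(),
}

results = {}
for name, model in models.items():
    # A fresh KFold per model keeps the folds identical across models
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results[name] = cv_results
    print('%s: mean accuracy %f (std %f)' % (name, cv_results.mean(), cv_results.std()))
```

Comparing means together with standard deviations gives a rough idea of which model is both accurate and stable before committing to one for the final prediction step.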
V. Make predictions and analyze the results
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
lda = LinearDiscriminantAnalysis()
lda.fit(X = X_train, y = Y_train)
predictions = lda.predict(X_validation)
print(accuracy_score(Y_validation,predictions))
print(confusion_matrix(Y_validation,predictions))
print(classification_report(Y_validation,predictions))
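The results show an accuracy of about 0.97. The confusion matrix shows that only one sample was misclassified. Finally, classification_report prints a report with the precision, recall, and F1 score (f1-score) for each class.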
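The link between the confusion matrix and the accuracy can be made explicit: the diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the total. The matrix below is hypothetical, chosen only to match the description (30 validation samples, one error); it is not the actual output of the run above, and `accuracy_from_confusion` is a helper name introduced here for illustration.

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Return accuracy = correctly classified (diagonal) / all samples."""
    cm = np.asarray(cm)
    return float(np.trace(cm) / cm.sum())

# Hypothetical 3-class confusion matrix: rows are true classes,
# columns are predicted classes; 30 samples, one off-diagonal error
cm = [[7, 0, 0],
      [0, 11, 1],
      [0, 0, 11]]

acc = accuracy_from_confusion(cm)
print(round(acc, 2))  # 29 / 30, rounded to 0.97
```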