1. Task Background
This dataset comes from UCI and describes evaluations of a car. Each car has six attributes: buying, maint, doors, persons, lug_boot, and safety, and its rating falls into one of four classes: unacc, acc, good, and vgood. The goal of this machine-learning exercise is to train a model that can automatically rate how good a car is.
2. Loading the Data
# Load the data (pandas/seaborn/matplotlib are used throughout)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_excel('car_data1.xlsx')
data.head()  # preview the data
| | Unnamed: 0 | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |
# Inspect the distinct values in each column
for i in data.columns:
    print(data[i].unique(), "\t", data[i].nunique())
[ 0 1 2 ... 1725 1726 1727] 1728
['vhigh' 'high' 'med' 'low'] 4
['vhigh' 'high' 'med' 'low'] 4
['2' '3' '4' '5more'] 4
['2' '4' 'more'] 3
['small' 'med' 'big'] 3
['low' 'med' 'high'] 3
['unacc' 'acc' 'vgood' 'good'] 4
sns.countplot(data['class'])
<AxesSubplot:xlabel='class', ylabel='count'>
The plot shows that the class labels are highly imbalanced.
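The imbalance can also be quantified directly with `value_counts`. A minimal sketch on a stand-in series (the counts below are the published class distribution of the UCI Car Evaluation dataset; in the notebook `data['class']` would be used instead):

```python
import pandas as pd

# Stand-in for data['class']: the UCI car dataset's published class counts
labels = pd.Series(['unacc'] * 1210 + ['acc'] * 384 + ['good'] * 69 + ['vgood'] * 65)

# Relative frequency of each class, sorted from most to least common
ratios = labels.value_counts(normalize=True)
print(ratios)
```

About 70% of the samples fall in a single class (unacc), which is worth keeping in mind when reading raw accuracy scores later.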
Next, a correlation analysis:
fig = plt.figure(figsize=(10, 6))
sns.heatmap(data.corr(), annot=True)  # note: corr() only uses numeric columns, so the categorical columns must be encoded first
<AxesSubplot:>
The heatmap shows that essentially every attribute is only weakly correlated with the rating.
3. Data Preprocessing
# Encode the string categories as integers
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in data.columns:
    data[i] = le.fit_transform(data[i])
data
| | Unnamed: 0 | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 3 | 0 | 0 | 2 | 1 | 2 |
| 1 | 1 | 3 | 3 | 0 | 0 | 2 | 2 | 2 |
| 2 | 2 | 3 | 3 | 0 | 0 | 2 | 0 | 2 |
| 3 | 3 | 3 | 3 | 0 | 0 | 1 | 1 | 2 |
| 4 | 4 | 3 | 3 | 0 | 0 | 1 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1723 | 1723 | 1 | 1 | 3 | 2 | 1 | 2 | 1 |
| 1724 | 1724 | 1 | 1 | 3 | 2 | 1 | 0 | 3 |
| 1725 | 1725 | 1 | 1 | 3 | 2 | 0 | 1 | 2 |
| 1726 | 1726 | 1 | 1 | 3 | 2 | 0 | 2 | 1 |
| 1727 | 1727 | 1 | 1 | 3 | 2 | 0 | 0 | 3 |
1728 rows × 8 columns
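One caveat with the loop above: reusing a single `LabelEncoder` overwrites its fitted mapping on every column, so the category-to-integer mapping can no longer be recovered afterwards. A minimal sketch (on a toy frame, not the car data) that keeps one encoder per column so the mapping stays inspectable and invertible:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for two of the car attributes
df = pd.DataFrame({'buying': ['vhigh', 'high', 'med', 'low'],
                   'safety': ['low', 'med', 'high', 'low']})

# Keep one fitted encoder per column
encoders = {}
for col in df.columns:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

# classes_ records which integer each category received (alphabetical order)
print(dict(enumerate(encoders['buying'].classes_)))
# → {0: 'high', 1: 'low', 2: 'med', 3: 'vhigh'}
```

Note that `LabelEncoder` assigns integers alphabetically, so the encoded values do not follow the natural order low < med < high < vhigh; an `OrdinalEncoder` with explicit `categories` would preserve that ordering.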
# The Pearson correlation coefficients confirm that every attribute is only weakly correlated with class
X = data[data.columns[:-1]]  # features (note: this still includes the Unnamed: 0 index column)
y = data['class']  # labels
X
| | Unnamed: 0 | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 3 | 0 | 0 | 2 | 1 |
| 1 | 1 | 3 | 3 | 0 | 0 | 2 | 2 |
| 2 | 2 | 3 | 3 | 0 | 0 | 2 | 0 |
| 3 | 3 | 3 | 3 | 0 | 0 | 1 | 1 |
| 4 | 4 | 3 | 3 | 0 | 0 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1723 | 1723 | 1 | 1 | 3 | 2 | 1 | 2 |
| 1724 | 1724 | 1 | 1 | 3 | 2 | 1 | 0 |
| 1725 | 1725 | 1 | 1 | 3 | 2 | 0 | 1 |
| 1726 | 1726 | 1 | 1 | 3 | 2 | 0 | 2 |
| 1727 | 1727 | 1 | 1 | 3 | 2 | 0 | 0 |
1728 rows × 7 columns
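The Unnamed: 0 column is just the row index left over from the spreadsheet, so it carries no information about the car and only adds noise as a feature. A minimal sketch (on a hypothetical toy frame) of dropping it before splitting features and labels:

```python
import pandas as pd

# Toy frame with a leftover row-index column, as in the loaded data
data = pd.DataFrame({'Unnamed: 0': [0, 1, 2],
                     'buying': [3, 3, 1],
                     'class': [2, 2, 1]})

# Drop the row-index column so it cannot leak into the features
X = data.drop(columns=['Unnamed: 0', 'class'])
y = data['class']
print(list(X.columns))  # → ['buying']
```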
4. Model Training
# Split into training and test sets with sklearn's built-in helper
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)  # 70/30 split
# Train a logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
logreg = LogisticRegression(solver='newton-cg', multi_class='multinomial')
logreg.fit(X_train, y_train)
LogisticRegression(multi_class='multinomial', solver='newton-cg')
pred=logreg.predict(X_test)
logreg.score(X_test,y_test)
0.7225433526011561
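A single hold-out score can be noisy, especially with imbalanced classes; `cross_val_score` (imported above) gives a more stable estimate by averaging over several folds. A minimal sketch on synthetic data (the car data is not loaded here, so `make_classification` stands in for it):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 4-class stand-in for the encoded car data
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)

logreg = LogisticRegression(solver='newton-cg', max_iter=1000)
scores = cross_val_score(logreg, X, y, cv=5)  # accuracy on each of 5 folds
print(scores.mean())
```

Given that roughly 70% of the samples belong to one class, an accuracy of 0.72 is barely better than always predicting the majority class, so the hold-out score should be read with that baseline in mind.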
5. Conclusion
# Logistic regression may not be a particularly strong method here
from sklearn.model_selection import learning_curve
lc = learning_curve(logreg, X_train, y_train, cv=10, n_jobs=-1)
size = lc[0]
train_score = [lc[1][i].mean() for i in range(len(size))]
test_score = [lc[2][i].mean() for i in range(len(size))]
fig = plt.figure(figsize=(12, 8))
plt.plot(size, train_score)
plt.plot(size, test_score)
[<matplotlib.lines.Line2D at 0x1fd0bb8dd00>]
The learning curve above shows that, with logistic regression, the training accuracy keeps dropping as the training set grows while the cross-validation score stays low. This suggests the model underfits the task, so follow-up work should focus on trying more expressive machine-learning methods to train a better classifier.
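As one possible direction for that follow-up work, a tree-based ensemble handles integer-encoded categories and non-linear feature interactions well. A hedged sketch on synthetic data (`RandomForestClassifier` is a suggestion here, not part of the original notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-class stand-in for the encoded car data
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=10)

# An ensemble of decision trees; each tree can model non-linear splits
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
acc = rf.score(X_test, y_test)
print(acc)
```

On the real car data the same pattern would apply: fit on `X_train`/`y_train` from the 70/30 split above and compare the test accuracy against the 0.72 logistic-regression baseline.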
6. Code Link
Link: https://pan.baidu.com/s/1DqkpRh_ZvK0W3cZVxJbEQA
Access code: 1qly