A small data-analysis example (the Wine dataset)

A small exercise from when I was first learning data analysis, copied over from a notebook and kept as a memento.

The data is the Wine dataset downloaded from the UCI Machine Learning Repository. It is a multi-class classification problem with class labels 1, 2, and 3.

First, a quick look at the data:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression   # logistic regression, linear classifier
from sklearn.linear_model import SGDClassifier        # stochastic-gradient-descent classifier
from sklearn.svm import LinearSVC                     # linear support vector machine
from sklearn.naive_bayes import MultinomialNB         # multinomial naive Bayes
from sklearn.neighbors import KNeighborsClassifier    # k-nearest neighbors
from sklearn.tree import DecisionTreeClassifier       # decision tree
from sklearn.ensemble import RandomForestClassifier   # random forest
from sklearn.ensemble import GradientBoostingClassifier   # gradient-boosted decision trees
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.preprocessing import MinMaxScaler      # min-max scaling
from sklearn.preprocessing import StandardScaler    # standardization
from scipy.stats import pearsonr                    # Pearson correlation coefficient
from sklearn.model_selection import train_test_split   # train/test split
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
# for enumerating and counting combinations
from itertools import combinations
from scipy.special import comb

columns = ['0Alcohol', '1Malic acid', '2Ash', '3Alcalinity of ash',
           '4Magnesium', '5Total phenols', '6Flavanoid',
           '7Nonflavanoid phenols', '8Proanthocyanins', '9Color intensity',
           '10Hue', '11OD280/OD315 of diluted wines', '12Proline', '13category']
data= pd.read_csv("G:/feature_code/wine_data.csv",header=None,names=columns)
data.shape
(178, 14)

Show the first five rows:

data.head()
   0Alcohol  1Malic acid  2Ash  3Alcalinity of ash  4Magnesium  5Total phenols  6Flavanoid  7Nonflavanoid phenols  8Proanthocyanins  9Color intensity  10Hue  11OD280/OD315 of diluted wines  12Proline  13category
0     14.23         1.71  2.43                15.6         127            2.80        3.06                   0.28              2.29              5.64   1.04                            3.92       1065           1
1     13.20         1.78  2.14                11.2         100            2.65        2.76                   0.26              1.28              4.38   1.05                            3.40       1050           1
2     13.16         2.36  2.67                18.6         101            2.80        3.24                   0.30              2.81              5.68   1.03                            3.17       1185           1
3     14.37         1.95  2.50                16.8         113            3.85        3.49                   0.24              2.18              7.80   0.86                            3.45       1480           1
4     13.24         2.59  2.87                21.0         118            2.80        2.69                   0.39              1.82              4.32   1.04                            2.93        735           1
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
0Alcohol                          178 non-null float64
1Malic acid                       178 non-null float64
2Ash                              178 non-null float64
3Alcalinity of ash                178 non-null float64
4Magnesium                        178 non-null int64
5Total phenols                    178 non-null float64
6Flavanoid                        178 non-null float64
7Nonflavanoid phenols             178 non-null float64
8Proanthocyanins                  178 non-null float64
9Color intensity                  178 non-null float64
10Hue                             178 non-null float64
11OD280/OD315 of diluted wines    178 non-null float64
12Proline                         178 non-null int64
13category                        178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB

A note on the data: 178 records, no missing values. It is also worth calling describe() and looking mainly at the mean row, to get a rough sense of each feature's distribution.
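A minimal sketch of that describe() step (transposing is only a readability choice, not required):

# transpose so each feature becomes a row; easier to scan with 14 columns
stats = data.describe().T
print(stats[['mean', 'std', 'min', 'max']])

Next, check whether the class labels are balanced.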

data['13category'].value_counts()
2    71
1    59
3    48
Name: 13category, dtype: int64

The classes are reasonably balanced; the counts do not differ by much.
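For a quick visual confirmation, the class counts can also be plotted; a minimal sketch using the seaborn import above:

# one bar per class label; near-equal heights confirm the rough balance
sns.countplot(x='13category', data=data)
plt.show()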

for i in data.iloc[:, 0:13].columns:
    # box plot plus jittered scatter of each feature, grouped by class
    ax = sns.boxplot(x='13category', y=i, data=data)
    ax = sns.stripplot(x='13category', y=i, data=data, jitter=True, edgecolor="gray")
    plt.show()

Judging from the box plots and strip plots of each feature against the label, the class separation is not strong, so it is hard to screen features visually. Next, compute the Pearson correlation between each feature and the class label.

def pearsonar(X, y):
    # absolute Pearson correlation between each feature column and the label
    pearson = []
    for col in X.columns.values:
        pearson.append(abs(pearsonr(X[col].values, y)[0]))
    pearsonr_X = pd.DataFrame({'col': X.columns, 'corr_value': pearson})
    pearsonr_X = pearsonr_X.sort_values(by='corr_value', ascending=False)
    print(pearsonr_X)

X = data.iloc[:, :13]
y = data.iloc[:, 13]
pearsonar(X, y)

The results:

                               col  corr_value
6                       6Flavanoid    0.847498
11  11OD280/OD315 of diluted wines    0.788230
5                   5Total phenols    0.719163
12                       12Proline    0.633717
10                           10Hue    0.617369
3               3Alcalinity of ash    0.517859
8                 8Proanthocyanins    0.499130
7            7Nonflavanoid phenols    0.489109
1                      1Malic acid    0.437776
0                         0Alcohol    0.328222
9                 9Color intensity    0.265668
4                       4Magnesium    0.209179
2                             2Ash    0.049643

Only feature 2 (Ash) has a weak linear relationship with the label. Next, compute the linear correlation between every pair of features.

# absolute Pearson correlation for every pair of the 13 features
c = list(combinations(range(13), 2))
p = []
for i in range(len(c)):
    p.append(abs(pearsonr(X.iloc[:, c[i][0]], X.iloc[:, c[i][1]])[0]))
pearsonr_ = pd.DataFrame({'col': c, 'corr_value': p})
pearsonr_ = pearsonr_.sort_values(by='corr_value', ascending=False)
print(pearsonr_)
 col  corr_value
50    (5, 6)    0.864564
61   (6, 11)    0.787194
55   (5, 11)    0.699949
58    (6, 8)    0.652692
11   (0, 12)    0.643720
52    (5, 8)    0.612413
75  (10, 11)    0.565468
20   (1, 10)    0.561296
8     (0, 9)    0.546364
60   (6, 10)    0.543479
57    (6, 7)    0.537900
72   (9, 10)    0.521813
70   (8, 11)    0.519067
66   (7, 11)    0.503270
56   (5, 12)    0.498115
62   (6, 12)    0.494193
51    (5, 7)    0.449935
23    (2, 3)    0.443367
41   (3, 12)    0.440597
54   (5, 10)    0.433681
73   (9, 11)    0.428815
16    (1, 6)    0.411007
49   (4, 12)    0.393351
21   (1, 11)    0.368710
63    (7, 8)    0.365845
36    (3, 7)    0.361922
35    (3, 6)    0.351370
15    (1, 5)    0.335167
71   (8, 12)    0.330417
34    (3, 5)    0.321113
..       ...         ...
76  (10, 12)    0.236183
32   (2, 12)    0.223626
18    (1, 8)    0.220746
42    (4, 5)    0.214401
1     (0, 2)    0.211545
46    (4, 9)    0.199950
37    (3, 8)    0.197327
43    (4, 6)    0.195784
22   (1, 12)    0.192011
27    (2, 7)    0.186230
59    (6, 9)    0.172379
12    (1, 2)    0.164045
6     (0, 7)    0.155929
64    (7, 9)    0.139057
7     (0, 8)    0.136698
25    (2, 5)    0.128980
26    (2, 6)    0.115077
0     (0, 1)    0.094397
33    (3, 4)    0.083333
30   (2, 10)    0.074667
10   (0, 11)    0.072343
9    (0, 10)    0.071747
48   (4, 11)    0.066004
47   (4, 10)    0.055398
53    (5, 9)    0.055136
14    (1, 4)    0.054575
68    (8, 9)    0.025250
38    (3, 9)    0.018732
28    (2, 8)    0.009652
31   (2, 11)    0.003911

[78 rows x 2 columns]

Features 5, 6, and 11 are strongly correlated with one another, so there may be redundant features.
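The same structure shows up at a glance in a correlation heatmap; a minimal sketch over the 13 feature columns:

# absolute pairwise Pearson correlations between the 13 features
corr = data.iloc[:, :13].corr().abs()
sns.heatmap(corr, cmap='Reds')
plt.show()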

# screen features via random-forest feature importance
def randomF_importfeat(X, y):
    features_list = X.columns
    forest = RandomForestClassifier(oob_score=True, n_estimators=10000)
    forest.fit(X, y)
    feature_importance = forest.feature_importances_
    # rescale so the most important feature scores 100
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    fi_threshold = 0
    important_idx = np.where(feature_importance > fi_threshold)[0]
    important_features = features_list[important_idx]
    print("\n", important_features.shape[0], "Important features(>",
          fi_threshold, "% of max importance)...\n")
    sorted_idx = np.argsort(feature_importance[important_idx])[::-1]
    # horizontal bar chart of the ranked importances
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.subplot(1, 2, 2)
    plt.title('Feature Importance')
    plt.barh(pos, feature_importance[important_idx][sorted_idx[::-1]],
             color='r', align='center')
    plt.yticks(pos, important_features[sorted_idx[::-1]])
    plt.xlabel('Relative Importance')
    plt.draw()
    plt.show()

randomF_importfeat(X, y)
 13 Important features(> 0 % of max importance)...

Feature 2 is a candidate for removal.
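This kind of screening can also be done programmatically; a sketch with sklearn's SelectFromModel, where the threshold value 0.02 is an arbitrary illustrative choice, not something tuned here:

from sklearn.feature_selection import SelectFromModel

# keep only features whose importance exceeds the threshold
selector = SelectFromModel(RandomForestClassifier(n_estimators=100),
                           threshold=0.02)   # 0.02 is illustrative
X_sel = selector.fit_transform(X, y)
print(X_sel.shape)   # (178, number of retained features)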

First, try PCA for dimensionality reduction.

# apply PCA, then enumerate a set of models and cross-validate each
def _PCA(X, y):
    ss = MinMaxScaler()
    X = ss.fit_transform(X)
    pca = PCA(n_components='mle')   # let MLE choose the number of components
    X_new = pca.fit_transform(X)
    clfs = [LogisticRegression(), SGDClassifier(), LinearSVC(), KNeighborsClassifier(),
            DecisionTreeClassifier(), RandomForestClassifier(), GradientBoostingClassifier(), GaussianNB()]
    for model in clfs:
        print("Model and parameters:")
        print(str(model))
        print("Model accuracy:")
        print(np.mean(cross_val_score(model, X_new, y, cv=10)))

_PCA(X, y)

Model and parameters:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Model accuracy:
0.983333333333
Model and parameters:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
Model accuracy:
0.967251461988
Model and parameters:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Model accuracy:
0.983333333333
Model and parameters:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
Model accuracy:
0.971200980392
Model and parameters:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
Model accuracy:
0.927048933609
Model and parameters:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
Model accuracy:
0.971895424837
Model and parameters:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
Model accuracy:
0.972222222222
Model and parameters:
GaussianNB(priors=None)
Model accuracy:
0.977743378053

The data is very well-behaved: even a casual PCA gives scores this high.
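To see what n_components='mle' actually kept, the explained variance can be inspected; a minimal sketch that refits PCA the same way as in _PCA above:

# min-max scale the full feature matrix, then let MLE pick the dimensionality
X_scaled = MinMaxScaler().fit_transform(data.iloc[:, :13])
pca = PCA(n_components='mle')
pca.fit(X_scaled)
print(pca.n_components_)                        # components retained by MLE
print(pca.explained_variance_ratio_.cumsum())   # cumulative variance explained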

Next, see how the raw features perform with each model at default parameters:

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :13], data.iloc[:, 13], test_size=0.2, random_state=0)
# Min-max scaling is used here. StandardScaler() would also work, but a
# standardized matrix contains negative values, which MultinomialNB() cannot accept.
ss = MinMaxScaler()
# ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# list of models, all at default parameters
clfs = [LogisticRegression(), SGDClassifier(), LinearSVC(), MultinomialNB(), KNeighborsClassifier(),
        DecisionTreeClassifier(), RandomForestClassifier(), GradientBoostingClassifier(), GaussianNB(), ExtraTreesClassifier()]
# print each model with its parameters and its test accuracy
for model in clfs:
    print("Model and parameters:")
    print(str(model))
    model.fit(X_train, y_train)
    print("Model accuracy:")
    print(model.score(X_test, y_test))
Model and parameters:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Model accuracy:
0.972222222222
Model and parameters:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
Model accuracy:
1.0
Model and parameters:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Model accuracy:
1.0
Model and parameters:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Model accuracy:
0.944444444444
Model and parameters:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
Model accuracy:
0.972222222222
Model and parameters:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
Model accuracy:
0.972222222222
Model and parameters:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
Model accuracy:
1.0
Model and parameters:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
Model accuracy:
0.944444444444
Model and parameters:
GaussianNB(priors=None)
Model accuracy:
0.916666666667
Model and parameters:
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
Model accuracy:
0.972222222222

Now delete feature 2 and try again:

# keep every feature except index 2 (Ash)
params = [0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
X = data.iloc[:, params]
y = data.iloc[:, 13]
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Min-max scaling again; see the note above about StandardScaler() and MultinomialNB()
ss = MinMaxScaler()
# ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# list of models, all at default parameters
clfs = [LogisticRegression(), SGDClassifier(), LinearSVC(), MultinomialNB(), KNeighborsClassifier(),
        DecisionTreeClassifier(), RandomForestClassifier(), GradientBoostingClassifier(), GaussianNB(), ExtraTreesClassifier()]
# print each model with its parameters and its test accuracy
for model in clfs:
    print("Model and parameters:")
    print(str(model))
    model.fit(X_train, y_train)
    print("Model accuracy:")
    print(model.score(X_test, y_test))
Model and parameters:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Model accuracy:
0.916666666667
Model and parameters:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
Model accuracy:
0.944444444444
Model and parameters:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Model accuracy:
1.0
Model and parameters:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Model accuracy:
0.944444444444
Model and parameters:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
Model accuracy:
0.944444444444
Model and parameters:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
Model accuracy:
0.972222222222
Model and parameters:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
Model accuracy:
0.972222222222
Model and parameters:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
Model accuracy:
0.944444444444
Model and parameters:
GaussianNB(priors=None)
Model accuracy:
0.916666666667
Model and parameters:
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
Model accuracy:
1.0

Since the dataset is small, I also experimented with a brute-force search over every possible feature combination:

def greed(X, y):
    # exhaustively evaluate every non-empty feature subset with 10-fold CV
    ss = MinMaxScaler()
    X = ss.fit_transform(X)
    X = pd.DataFrame(X)
    jilu = pd.DataFrame(columns=['m', 'feature', 'score'])   # log of every subset tried
    params = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    model = LinearSVC()
    best_score = 0
    n = 13
    i = 1
    index = 0
    while i <= n:
        test_params = list(combinations(params, i))
        j = int(comb(n, i))
        i = i + 1
        for m in range(j):
            z = list(test_params[m])
            score = np.mean(cross_val_score(model, X[z], y, cv=10))   # mean of 10-fold CV
            if score > best_score:
                best_score = score
                best_feature = z
            jilu.loc[index, ['m']] = m
            jilu.loc[index, ['feature']] = str(z)
            jilu.loc[index, ['score']] = score
            index = index + 1
    print(jilu)
    print("best_feature=", best_feature, "best_score=", best_score)

greed(data.iloc[:, :13], data.iloc[:, 13])
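Worth noting the cost of this search: with 13 features there are 2**13 - 1 = 8191 non-empty subsets, each scored with 10-fold cross-validation, so roughly 82,000 model fits in total. A quick sanity check of that count, using the comb already imported:

# total non-empty subsets of 13 features: sum of C(13, k) for k = 1..13
n_subsets = sum(int(comb(13, k)) for k in range(1, 14))
print(n_subsets)        # 8191 == 2**13 - 1
print(n_subsets * 10)   # ~82k model fits at 10-fold CV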

 
