I set up a repo on GitHub, DataSetForMachineLearning, to hold all sorts of datasets. If you find it useful, a star is welcome.
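When you are learning machine learning, you sometimes need data to practice on. Where does that data come from? A good option is to lean on libraries such as sklearn and seaborn, which ship with a number of datasets (the familiar iris flower and Titanic datasets, among others). You can get at them as follows: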
sklearn
In [80]: from sklearn import datasets
In [81]: list(filter(lambda x: 'load' in x, dir(datasets)))
Out[81]:
['__loader__',
'load_boston',
'load_breast_cancer',
'load_diabetes',
'load_digits',
'load_files',
'load_iris',
'load_linnerud',
'load_mlcomp',
'load_sample_image',
'load_sample_images',
'load_svmlight_file',
'load_svmlight_files',
'load_wine']
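Note that the `'load' in x` filter above also picks up `__loader__`, which is a module attribute and not a dataset loader. The exact list also depends on the installed scikit-learn version (for example, `load_mlcomp` and `load_boston` were removed in later releases). Below is a minimal sketch of a stricter filter; the helper name `list_loaders` is made up for illustration and is not part of sklearn.

```python
from sklearn import datasets


def list_loaders(module=datasets):
    """Return names in `module` that start with 'load_' and are callable.

    `list_loaders` is a hypothetical helper written for this post,
    not a scikit-learn function.
    """
    return sorted(
        name
        for name in dir(module)
        if name.startswith("load_") and callable(getattr(module, name))
    )


loaders = list_loaders()
print(loaders)
```

Matching on the `load_` prefix keeps dunder attributes such as `__loader__` out of the result, and the `callable` check guards against non-function attributes that happen to share the prefix.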
Usage looks like this:
In [90]: wine = datasets.load_wine()
In [91]: wine.data.shape
Out[91]: (178, 13)
In [92]: wine.data[:10]
Out[92]:
array([[ 1.42300000e+01, 1.71000000e+00, 2.43000000e+00, ...,
         1.04000000e+00, 3.92000000e+00, 1.06500000e+03],
       [ 1.32000000e+01, 1.78000000e+00, 2.14000000e+00, ...,
         1.05000000e+00, 3.40000000e+00, 1.05000000e+03],
       [ 1.31600000e+01, 2.36000000e+00, 2.67000000e+00, ...,
         1.03000000e+00, 3.17000000e+00, 1.18500000e+03],
       ...,
       [ 1.40600000e+01, 2.15000000e+00, 2.61000000e+00, ...,
         1.06000000e+00, 3.58000000e+00, 1.29500000e+03],
       [ 1.48300000e+01, 1.64000000e+00, 2.17000000e+00, ...,
         1.08000000e+00, 2.85000000e+00, 1.04500000e+03],
       [ 1.38600000e+01, 1.35000000e+00, 2.27000000e+00, ...,
         1.01000000e+00, 3.55000000e+00, 1.04500000e+03]])
In [94]: wine.keys()
Out[94]: dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
In [95]: wine.feature_names
Out[95]:
['alcohol',
'malic_acid',
'ash',
'alcalinity_of_ash',
'magnesium',
'total_phenols',
'flavanoids',
'nonflavanoid_phenols',
'proanthocyanins',
'color_intensity',
'hue',
'od280/od315_of_diluted_wines',
'proline']
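The keys shown above (`data`, `target`, `target_names`, `feature_names`) are enough to get a first feel for the dataset. Here is a rough sketch, assuming NumPy is installed alongside sklearn; the variable names `X`, `y`, `counts` and `means` are my own choices for illustration.

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()
X, y = wine.data, wine.target

# Count how many samples carry each integer class label in `target`.
counts = np.bincount(y)
for name, n in zip(wine.target_names, counts):
    print(f"{name}: {n} samples")

# Pair each feature name with the mean of its column.
means = dict(zip(wine.feature_names, X.mean(axis=0)))
print(f"mean alcohol: {means['alcohol']:.2f}")
```

`wine.DESCR` holds a longer plain-text description of the dataset, which is worth printing once before you start modelling.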
seaborn
In [96]: import seaborn as sns
In [