01、sklearn.datasets
(1)datasets.load_*()
获取小规模数据集,数据包含在datasets里
(2)datasets.fetch_*()
获取大规模数据集,需要从网络上下载,函数的第一个参数是data_home
,表示数据集下载的目录,默认是 ~/scikit_learn_data/,
要修改默认目录,可以修改环境变量SCIKIT_LEARN_DATA
(3)datasets.make_*()
本地生成数据集
load和 fetch函数返回的数据类型是 datasets.base.Bunch,本质上是一个 dict,它的键值对可用通过对象的属性方式访问。主要包含以下属性:
-
data:特征数据数组,是 n_samples * n_features 的二维 numpy.ndarray 数组
-
target:标签数组,是 n_samples 的一维 numpy.ndarray 数组
-
DESCR:数据描述
-
feature_names:特征名
-
target_names:标签名
02、获取小数据集
用于分类
- sklearn.datasets.load_iris
In [12]: from sklearn.datasets import load_iris
...: data = load_iris()
...:
In [13]: data.target
Out[13]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [14]: data.feature_names
Out[14]:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
In [15]: data.target_names
Out[15]:
array(['setosa', 'versicolor', 'virginica'],
dtype='|S10')
In [17]: data.target[[1,10, 100]]
Out[17]: array([0, 0, 2])
名称 | 数量 |
---|---|
类别 | 3 |
特征 | 4 |
样本数量 | 150 |
每个类别数量 | 50 |
- sklearn.datasets.load_digits
In [20]: from sklearn.datasets import load_digits
In [21]: digits = load_digits()
In [22]: print(digits.data.shape)
(1797, 64)
In [23]: digits.target
Out[23]: array([0, 1, 2, ..., 8, 9, 8])
In [24]: digits.target_names
Out[24]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [25]: digits.images
Out[25]:
array([[[ 0., 0., 5., ..., 1., 0., 0.],
[ 0., 0., 13., ..., 15., 5., 0.],
[ 0., 3., 15., ..., 11., 8., 0.],
...,
[ 0., 4., 11., ..., 12., 7., 0.],
[ 0., 2., 14., ..., 12., 0., 0.],
[ 0., 0., 6., ..., 0., 0., 0.]],
[[ 0., 0., 10., ..., 1., 0., 0.],
[ 0., 2., 16., ..., 1., 0., 0.],
[ 0., 0., 15., ..., 15., 0., 0.],
...,
[ 0., 4., 16., ..., 16., 6., 0.],
[ 0., 8., 16., ..., 16., 8., 0.],
[ 0., 1., 8., ..., 12., 1., 0.]]])
名称 | 数量 |
---|---|
类别 | 10 |
特征 | 64 |
样本数量 | 1797 |
用于回归
- sklearn.datasets.load_boston
In [34]: from sklearn.datasets import load_boston
In [35]: boston = load_boston()
In [36]: boston.data.shape
Out[36]: (506, 13)
In [37]: boston.feature_names
Out[37]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'],
dtype='|S7')
In [38]:
名称 | 数量 |
---|---|
目标类别 | 5-50 |
特征 | 13 |
样本数量 | 506 |
- sklearn.datasets.load_diabetes
In [13]: from sklearn.datasets import load_diabetes
In [14]: diabetes = load_diabetes()
In [15]: diabetes.data
Out[15]:
array([[ 0.03807591, 0.05068012, 0.06169621, ..., -0.00259226,
0.01990842, -0.01764613],
[-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
-0.06832974, -0.09220405],
[ 0.08529891, 0.05068012, 0.04445121, ..., -0.00259226,
0.00286377, -0.02593034],
...,
[ 0.04170844, 0.05068012, -0.01590626, ..., -0.01107952,
-0.04687948, 0.01549073],
[-0.04547248, -0.04464164, 0.03906215, ..., 0.02655962,
0.04452837, -0.02593034],
[-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
-0.00421986, 0.00306441]])
名称 | 数量 |
---|---|
目标类别 | 25-346 |
特征 | 10 |
样本数量 | 442 |
03、获取大数据集
- sklearn.datasets.fetch_20newsgroups
In [29]: from sklearn.datasets import fetch_20newsgroups
In [30]: data_test = fetch_20newsgroups(subset='test',shuffle=True, random_sta
...: te=42)
In [31]: data_train = fetch_20newsgroups(subset='train',shuffle=True, random_s
...: tate=42)
- sklearn.datasets.fetch_20newsgroups_vectorized
In [57]: from sklearn.datasets import fetch_20newsgroups_vectorized
In [58]: bunch = fetch_20newsgroups_vectorized(subset='all')
In [59]: from sklearn.utils import shuffle
In [60]: X, y = shuffle(bunch.data, bunch.target)
...: offset = int(X.shape[0] * 0.8)
...: X_train, y_train = X[:offset], y[:offset]
...: X_test, y_test = X[offset:], y[offset:]
生成本地分类数据:
- sklearn.datasets.make_classification
from sklearn.datasets.samples_generator import make_classification
X,y= datasets.make_classification(n_samples=100000, n_features=20,n_informative=2, n_redundant=10,random_state=42)