sklearn: Loading Datasets

来路与归途

Category: Machine Learning. Tags: sklearn.datasets, sklearn datasets

Article link: https://blog.csdn.net/qq_42233538/article/details/92829584

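1. General dataset API

Depending on the type of dataset you need, there are three main kinds of dataset API interfaces for obtaining datasets:

Method 1: loaders load small standard datasets. They are described in the Toy datasets section.

Method 2: fetchers download and load larger real-world datasets. They are described in the Real-world datasets section.

Notes:

All loader and fetcher functions return a dictionary-like object containing at least two items: an array of shape n_samples * n_features stored under the key data (except for the 20 newsgroups dataset), and a numpy array of length n_samples holding the target values, stored under the key target.

By setting the return_X_y parameter to True, almost all of these functions can restrict their output to a tuple containing only the data and the target.

The datasets also contain a description under the key DESCR, and some of them also contain feature_names and target_names. See the dataset descriptions below for details.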
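The keys described above can be inspected directly. The following is a minimal sketch using `load_wine` (one of the toy loaders listed in the next section); the variable names are chosen only for illustration.

```python
from sklearn.datasets import load_wine

# Default call: returns a dictionary-like Bunch object.
wine = load_wine()
print(sorted(wine.keys()))       # includes 'DESCR', 'data', 'target', ...
print(wine.data.shape)           # (n_samples, n_features)
print(wine.target.shape)         # (n_samples,)
print(wine.feature_names[:3])    # names of the first three features
print(list(wine.target_names))   # meaning of each target label

# Fields can be read as attributes or with dictionary syntax.
same_array = wine["data"] is wine.data

# return_X_y=True restricts the output to a (data, target) tuple.
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)
```

Attribute access and key access return the same underlying array, so either style can be used.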

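Method 3: generation functions can be used to generate controlled synthetic datasets. They are described in the Sample generators section.

Note: these functions return a tuple (X, y) consisting of a numpy array X of shape n_samples * n_features and an array y of length n_samples containing the targets.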
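As a small illustration of the (X, y) return value, here is a sketch with `make_blobs`; the sample count, number of centers and seed are arbitrary choices for this example.

```python
from sklearn.datasets import make_blobs

# Generate 200 two-dimensional points grouped around 3 centers.
X, y = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)

print(X.shape)  # one row per sample, one column per feature
print(y.shape)  # one integer cluster label per sample
```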

In addition, there are miscellaneous tools for loading datasets in other formats or from other locations. They are described in the Loading other datasets section.

2. Toy datasets

Function	Description
`load_boston`([return_X_y])	Load and return the boston house-prices dataset (regression). Removed in scikit-learn 1.2.
`load_iris`([return_X_y])	Load and return the iris dataset (classification).
`load_diabetes`([return_X_y])	Load and return the diabetes dataset (regression).
`load_digits`([n_class, return_X_y])	Load and return the digits dataset (classification).
`load_linnerud`([return_X_y])	Load and return the linnerud dataset (multivariate regression).
`load_wine`([return_X_y])	Load and return the wine dataset (classification).
`load_breast_cancer`([return_X_y])	Load and return the breast cancer wisconsin dataset (classification).
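Most loaders in the table share the same call pattern, and `load_digits` additionally accepts `n_class`. The sketch below is only an illustration of those two signatures; the choice of `n_class=3` is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_digits

# n_class keeps only the first n_class digit classes (here 0, 1 and 2).
X_digits, y_digits = load_digits(n_class=3, return_X_y=True)
print(X_digits.shape)        # each 8x8 image is flattened to 64 features
print(np.unique(y_digits))   # the classes that were kept

# A regression loader follows the same pattern.
X_diab, y_diab = load_diabetes(return_X_y=True)
print(X_diab.shape, y_diab.shape)
```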

Example: loading the iris dataset

Purpose: load and return the iris dataset (classification).

The iris dataset is a classic and very simple multi-class classification dataset.

Classes	3
Samples per class	50
Samples total	150
Dimensionality	4
Features	real, positive

Parameters:

return_X_y : boolean, default=False.

If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object.

New in version 0.18.

Returns:

data : Bunch (a dictionary-like object)

Its keys are: 'data', the data to learn; 'target', the classification labels; 'target_names', the meaning of the labels; 'feature_names', the meaning of the features; 'DESCR', the full description of the dataset; 'filename', the physical location of the iris CSV dataset (added in version 0.20).

If return_X_y is set to True, a (data, target) tuple is returned instead.

New in version 0.18.

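# Import the dataset loader
from sklearn.datasets import load_iris
# 1. Create the object without specifying return_X_y
data=load_iris()
print(type(data))          # e.g. <class 'sklearn.utils.Bunch'> (module path varies by version)
# Access the fields as attributes; dictionary-style access such as data['data'] also works
x=data.data
y=data.target
print(x.shape)             # (150, 4)
print(y.shape)             # (150,)
# 2. Create the object with return_X_y=True, which returns a (data, target) tuple
data1=load_iris(return_X_y=True)
print(type(data1))         # <class 'tuple'>
x1=data1[0]
y1=data1[1]
print(x1.shape)            # (150, 4)
print(y1.shape)            # (150,)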

3. Real-world datasets

Function	Description
`fetch_olivetti_faces`([data_home, shuffle, …])	Load the Olivetti faces data-set from AT&T (classification).
`fetch_20newsgroups`([data_home, subset, …])	Load the filenames and data from the 20 newsgroups dataset (classification).
`fetch_20newsgroups_vectorized`([subset, …])	Load the 20 newsgroups dataset and vectorize it into token counts (classification).
`fetch_lfw_people`([data_home, funneled, …])	Load the Labeled Faces in the Wild (LFW) people dataset (classification).
`fetch_lfw_pairs`([subset, data_home, …])	Load the Labeled Faces in the Wild (LFW) pairs dataset (classification).
`fetch_covtype`([data_home, …])	Load the covertype dataset (classification).
`fetch_rcv1`([data_home, subset, …])	Load the RCV1 multilabel dataset (classification).
`fetch_kddcup99`([subset, data_home, shuffle, …])	Load the kddcup99 dataset (classification).
`fetch_california_housing`([data_home, …])	Load the California housing dataset (regression).

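Example: loading the Labeled Faces in the Wild (LFW) people dataset

Description: there are 5749 people in total, with ids 0 to 5748, and 13233 images in total. The documented dimensionality of each image is 5828. Because resize defaults to 0.5, each returned image has 62 * 47 = 2914 pixels as long as resize is left unchanged.

Classes	5749
Samples total	13233
Dimensionality	5828
Features	real, between 0 and 255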

Parameters:

data_home : optional, default: None

Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

funneled : boolean, optional, default: True

Download and use the funneled variant of the dataset.

resize : float, optional, default 0.5

Ratio used to resize each face picture.

min_faces_per_person : int, optional, default None

The extracted dataset only retains pictures of people that have at least min_faces_per_person different pictures.

color : boolean, optional, default False

Keep the 3 RGB channels instead of averaging them to a single gray level channel. If color is True, the shape of the data has one more dimension than with color = False.

slice_ : optional

Provide a custom 2D slice (height, width) to extract the 'interesting' part of the jpeg files and avoid picking up statistical correlation from the background.

download_if_missing : optional, True by default

If False, raise an IOError if the data is not locally available instead of trying to download it from the source site.

return_X_y : boolean, default=False

If True, returns (dataset.data, dataset.target) instead of a Bunch (dictionary-like) object.

New in version 0.20.

Returns:

dataset : dict-like object with the following attributes:

dataset.data : numpy array of shape (13233, 2914)

Each row corresponds to a ravelled face image of original size 62 x 47 pixels. Changing the slice_ or resize parameters will change the shape of the output.

dataset.images : numpy array of shape (13233, 62, 47)

Each row is a face image corresponding to one of the 5749 people in the dataset. Changing the slice_ or resize parameters will change the shape of the output.

dataset.target : numpy array of shape (13233,)

Labels associated with each face image. The labels range from 0 to 5748 and correspond to the person ids.

dataset.target_names : numpy array of shape (5749,)

Names of the people corresponding to the labels.

dataset.DESCR : string

Description of the Labeled Faces in the Wild (LFW) dataset.

(data, target) : tuple if return_X_y is True

New in version 0.20.
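To see how slice_ and resize lead to the 62 x 47 shape listed above, the arithmetic can be sketched without downloading anything. `lfw_image_shape` is a hypothetical helper written for this article, not a scikit-learn function. It assumes the default slice_ of (slice(70, 195), slice(78, 172)) from the documented signature of `fetch_lfw_people`, and assumes that resized dimensions are truncated to integers.

```python
def lfw_image_shape(slice_=(slice(70, 195), slice(78, 172)), resize=0.5, color=False):
    """Hypothetical helper: predict the per-image shape of fetch_lfw_people."""
    h_slice, w_slice = slice_
    # Height and width of the cropped region (assumed default crop: 125 x 94).
    h = h_slice.stop - h_slice.start
    w = w_slice.stop - w_slice.start
    if resize is not None:
        # Assumption: each dimension is scaled and truncated to an integer.
        h = int(resize * h)
        w = int(resize * w)
    # color=True keeps the 3 RGB channels as an extra trailing dimension.
    return (h, w, 3) if color else (h, w)


h, w = lfw_image_shape()
print(h, w, h * w)                      # 62 47 2914
print(lfw_image_shape(resize=None))     # (125, 94)
print(lfw_image_shape(color=True))      # (62, 47, 3)
```

Note that resize scales each dimension, so halving both sides reduces the pixel count to roughly a quarter of the cropped region.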

# Import the dataset loader and plotting tools
from sklearn.datasets import fetch_lfw_people
import matplotlib.pyplot as plt
import numpy as np
# 1. Create the object without specifying return_X_y (the data is downloaded on first use)
data=fetch_lfw_people()
print(type(data))
# Access the fields as attributes; dictionary-style access also works
x=data.data
image=data.images
y=data.target
name=data.target_names
descr=data.DESCR
print(x.shape)
print(image.shape)
print(y[0])
print(data.target_names)
print(name.shape)
#print(descr)
# 2.1 Show an image, method 1: plot the float array directly as grayscale
# (casting to uint8 first would zero out the image if pixels happen to be scaled to [0, 1])
plt.imshow(image[0],cmap="gray")
plt.show()
# 2.2 Show an image, method 2: cv2.imshow is used here with uint8 images
#import cv2
#scale=255.0 if image[0].max()<=1.0 else 1.0   # bring pixel values into the 0-255 range
#image_a=np.array(image[0]*scale,dtype=np.uint8)
#img_res=cv2.resize(image_a,dsize=(100,100),interpolation=cv2.INTER_CUBIC)
#cv2.imshow("a",img_res)
#cv2.waitKey()
#cv2.destroyAllWindows()

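4. Sample generators

4.1. Generators for classification and clustering: these generators produce a matrix of features and corresponding discrete targets.

4.1.1. Single label

make_blobs, make_classification, make_gaussian_quantiles, make_hastie_10_2, make_circles, make_moons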
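Two of the single-label generators listed above produce 2D shapes that are not linearly separable. A minimal sketch, with noise levels and seeds chosen arbitrarily for illustration:

```python
from sklearn.datasets import make_circles, make_moons

# Two interleaving half circles with a little Gaussian noise.
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=0)

# A small circle inside a larger one; factor is the ratio of the two radii.
X_circles, y_circles = make_circles(n_samples=100, noise=0.05, factor=0.5, random_state=0)

print(X_moons.shape, set(y_moons.tolist()))
print(X_circles.shape, set(y_circles.tolist()))
```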

4.1.2. Multilabel

make_multilabel_classification
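Unlike the single-label generators, this function returns a label matrix rather than a label vector. A short sketch, with sizes picked only for the example:

```python
from sklearn.datasets import make_multilabel_classification

# Y is a binary indicator matrix: Y[i, j] == 1 means sample i carries label j.
X, Y = make_multilabel_classification(
    n_samples=50, n_features=10, n_classes=4, random_state=0
)

print(X.shape)          # (n_samples, n_features)
print(Y.shape)          # (n_samples, n_classes)
print(Y.sum(axis=1))    # number of labels assigned to each sample
```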

4.1.3. Biclustering

Function	Description
`make_biclusters`(shape, n_clusters[, noise, …])	Generate an array with constant block diagonal structure for biclustering.
`make_checkerboard`(shape, n_clusters[, …])	Generate an array with block checkerboard structure for biclustering.
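The biclustering generators return three values rather than two. The sketch below uses `make_biclusters` with an arbitrary shape and cluster count to show what each one contains.

```python
from sklearn.datasets import make_biclusters

# data: the generated matrix; rows / columns: boolean membership indicators,
# one row of indicators per bicluster.
data, rows, columns = make_biclusters(
    shape=(30, 20), n_clusters=3, noise=0.5, random_state=0
)

print(data.shape)       # (30, 20)
print(rows.shape)       # (n_clusters, n_rows)
print(columns.shape)    # (n_clusters, n_columns)
```

With the block diagonal structure, every row and every column belongs to exactly one bicluster.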

4.2. Generators for regression

make_regression, make_sparse_uncorrelated, make_friedman1, make_friedman2, make_friedman3
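As an illustration of a regression generator, `make_regression` can also return the coefficients of the underlying linear model when `coef=True`. The sizes below are arbitrary; with the default noise of 0 and bias of 0, the targets are an exact linear combination of the features.

```python
import numpy as np
from sklearn.datasets import make_regression

# Only n_informative features get a non-zero coefficient.
X, y, coef = make_regression(
    n_samples=100, n_features=5, n_informative=2, coef=True, random_state=0
)

print(X.shape, y.shape)
print(coef)                         # ground-truth coefficients, one per feature
print(np.allclose(y, X @ coef))     # noise-free targets follow the linear model
```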

4.3. Generators for manifold learning

Function	Description
`make_s_curve`([n_samples, noise, random_state])	Generate an S curve dataset.
`make_swiss_roll`([n_samples, noise, random_state])	Generate a swiss roll dataset.
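Both manifold generators return 3D points together with the position of each point along the main dimension of the manifold. A minimal sketch with arbitrary sample counts:

```python
from sklearn.datasets import make_s_curve, make_swiss_roll

# X holds 3D coordinates; t is the univariate position along the manifold.
X_curve, t_curve = make_s_curve(n_samples=100, noise=0.0, random_state=0)
X_roll, t_roll = make_swiss_roll(n_samples=100, noise=0.0, random_state=0)

print(X_curve.shape, t_curve.shape)
print(X_roll.shape, t_roll.shape)
```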

4.4. Generators for decomposition

Function	Description
`make_low_rank_matrix`([n_samples, …])	Generate a mostly low rank matrix with bell-shaped singular values
`make_sparse_coded_signal`(n_samples, …[, …])	Generate a signal as a sparse combination of dictionary elements.
`make_spd_matrix`(n_dim[, random_state])	Generate a random symmetric, positive-definite matrix.
`make_sparse_spd_matrix`([dim, alpha, …])	Generate a sparse symmetric definite positive matrix.
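The properties named in the table can be checked numerically. The following sketch builds a small symmetric positive-definite matrix and a low-rank matrix; the dimensions and seed are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_low_rank_matrix, make_spd_matrix

# A random 4x4 symmetric positive-definite matrix.
A = make_spd_matrix(4, random_state=0)
print(np.allclose(A, A.T))                  # symmetric
print(np.linalg.eigvalsh(A).min() > 0)      # all eigenvalues positive

# A 50x20 matrix whose singular values decay quickly after about 5.
M = make_low_rank_matrix(n_samples=50, n_features=20, effective_rank=5, random_state=0)
print(M.shape)
```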

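5. Loading other datasets

5.1. Sample images

scikit-learn also embeds a couple of sample JPEG images published under Creative Commons license by their authors. These images can be useful for testing algorithms and pipelines on 2D data.

Function	Description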
`load_sample_images`()	Load sample images for image manipulation.
`load_sample_image`(image_name)	Load the numpy array of a single sample image
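A short sketch of loading one of the bundled images follows. It assumes the Pillow package is installed, since the image files have to be decoded, and uses the bundled file name "china.jpg".

```python
from sklearn.datasets import load_sample_image

# Returns the image as a numpy array of shape (height, width, channels).
china = load_sample_image("china.jpg")

print(china.dtype)
print(china.shape)
```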

5.2. Datasets in svmlight / libsvm format
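As a rough illustration of this format, the sketch below writes a tiny matrix with `dump_svmlight_file` and reads it back with `load_svmlight_file`, using an in-memory buffer instead of a file on disk. The data values are made up for the example; `zero_based` and `n_features` are set explicitly so the column indices are not guessed.

```python
from io import BytesIO

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[0.0, 1.5, 0.0],
              [2.0, 0.0, 3.0]])
y = np.array([0, 1])

# Write "<label> <index>:<value> ..." lines into a binary buffer.
buffer = BytesIO()
dump_svmlight_file(X, y, buffer, zero_based=True)
print(buffer.getvalue().decode())

# Read it back; the features come back as a scipy sparse matrix.
buffer.seek(0)
X_loaded, y_loaded = load_svmlight_file(buffer, n_features=3, zero_based=True)
print(X_loaded.toarray())
print(y_loaded)
```

Only the non-zero entries are stored in the text, which is why this format suits sparse data.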

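5.3. Downloading datasets from openml.org

openml.org is a public repository for machine learning data and experiments that allows everybody to upload open datasets.

The sklearn.datasets package can download datasets from openml.org using the function sklearn.datasets.fetch_openml.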

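5.4. Loading from external datasets

scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that can be converted to numeric arrays, such as a pandas DataFrame, are also acceptable.

Here are some recommended ways to load standard columnar data into a format usable by scikit-learn:

pandas.io provides tools to read data from common formats including CSV, Excel, JSON and SQL. A DataFrame may also be constructed from a list of tuples or dicts. Pandas handles heterogeneous data smoothly and provides tools for manipulating it and converting it into numeric data suitable for scikit-learn.
scipy.io specializes in binary formats often used in scientific computing, such as .mat and .arff.
numpy/routines.io for standard loading of columnar data into numpy arrays.
scikit-learn's datasets.load_svmlight_file for the svmlight or libSVM sparse format.
scikit-learn's datasets.load_files for directories of text files, where the name of each directory is the name of a category and each file inside a directory is one sample from that category.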
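The directory layout expected by `load_files` can be sketched with a temporary folder. The category names "ham" and "spam" and the file contents are invented for this example.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

with tempfile.TemporaryDirectory() as root:
    # One sub-directory per category, one file per sample.
    for category, text in [("ham", "see you at lunch"), ("spam", "win a prize now")]:
        folder = Path(root) / category
        folder.mkdir()
        (folder / "sample.txt").write_text(text, encoding="utf-8")

    # encoding decodes the files to str; shuffle=False keeps the folder order.
    corpus = load_files(root, encoding="utf-8", shuffle=False)

print(corpus.target_names)   # category names taken from the folder names
print(corpus.target)         # integer label of each sample
print(corpus.data)           # decoded file contents
```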

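For more miscellaneous data such as images, video and audio, you may wish to refer to:

skimage.io or Imageio for loading images and videos into numpy arrays.
scipy.misc.imread (requires the Pillow package) for loading various image file formats as pixel intensities. Note that scipy.misc.imread was removed in SciPy 1.2; imageio.imread is the usual replacement.
scipy.io.wavfile.read for reading WAV files into a numpy array.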
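For the audio case, a small round trip shows what `scipy.io.wavfile.read` returns. The 440 Hz tone, the sample rate and the file name are arbitrary choices for this sketch.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

# One second of a 440 Hz sine wave stored as 16-bit integers.
rate = 8000
t = np.arange(rate) / rate
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tone.wav")
    wavfile.write(path, rate, tone)
    # read returns the sample rate and the samples as a numpy array.
    loaded_rate, samples = wavfile.read(path)

print(loaded_rate, samples.dtype, samples.shape)
```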

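Categorical (or nominal) features stored as strings, which are common in pandas DataFrames, need to be converted to integers. Integer categorical variables are probably best exploited when encoded as one-hot variables (sklearn.preprocessing.OneHotEncoder) or similar. See Preprocessing data.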
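A minimal sketch of that conversion: a DataFrame built from a list of dicts, with its string column encoded by `OneHotEncoder` and then combined with the numeric column. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([
    {"color": "red", "size": 1.0},
    {"color": "green", "size": 2.5},
    {"color": "red", "size": 0.5},
])

# Encode the string column; the default output is sparse, so densify it.
encoder = OneHotEncoder()
color_onehot = encoder.fit_transform(df[["color"]]).toarray()
print(encoder.categories_)   # categories are sorted: green, red
print(color_onehot)

# Stack the numeric column next to the one-hot columns.
X = np.hstack([df[["size"]].to_numpy(), color_onehot])
print(X.shape)
```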

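Note: if you manage your own numerical data, it is recommended to use an optimized file format such as HDF5 to reduce data load times. Various libraries such as H5Py, PyTables and pandas provide a Python interface for reading and writing data in that format.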
