scikit-learn Datasets

kuliyang

Category: Machine Learning. Tags: Machine Learning

Article link: https://blog.csdn.net/qq_29758759/article/details/106126865

Scikit-learn Datasets

Before using the scikit-learn datasets, import the sklearn.datasets package. For details on each dataset, refer to the official documentation.

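Dataset overview

The dataset loaders. These load the small datasets bundled with scikit-learn, such as the toy datasets.
The dataset fetchers. These download and load larger datasets.

Both loaders and fetchers return a dictionary-like Bunch object holding at least two items: an array of shape n_samples * n_features under the key data, and a NumPy array of length n_samples holding the target values under the key target.

If you only need the data and targets as a tuple, set the return_X_y parameter to True.

The dataset generation functions. These generate common synthetic datasets, such as Gaussian-distributed data. They return a tuple (X, y), where X has shape n_samples * n_features and y holds the corresponding targets.
Tools for other kinds of data are also included, such as sample images and datasets in svmlight/libsvm format.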
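
The two return styles described above can be compared side by side. The following is a minimal sketch that uses load_iris only as a convenient bundled dataset; the variable names are illustrative.

```python
from sklearn.datasets import load_iris

# Default call: returns a dictionary-like Bunch object
iris = load_iris()
print(sorted(iris.keys()))
print(iris.data.shape, iris.target.shape)

# Items can be read either by key or as attributes
same_array = iris["data"] is iris.data

# return_X_y=True: returns only the (data, target) tuple
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```

The Bunch form is handy when you also want metadata such as DESCR, while the tuple form is shorter when the arrays go straight into an estimator.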

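Toy datasets

Loader	Description	Task	Size (samples*features)
load_boston([return_X_y])	Boston house prices (removed in scikit-learn 1.2)	Regression	506*13
load_iris([return_X_y])	Iris dataset	Classification	150*4
load_diabetes([return_X_y])	Diabetes dataset	Regression	442*10
load_digits([n_class, return_X_y])	Handwritten digits dataset	Classification	1797*64
load_linnerud([return_X_y])	Linnerud physical exercise dataset	Multi-output regression	20*3
load_wine([return_X_y])	Wine dataset	Classification	178*13
load_breast_cancer([return_X_y])	Breast cancer dataset	Classification	569*30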

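Diabetes dataset

About the diabetes dataset

This diabetes dataset contains 442 rows and 10 attributes: age, sex, body mass index, average blood pressure, and six blood serum measurements (S1 to S6). The target is a quantitative measure of disease progression one year after baseline.

Using the dataset

The following statements print information about the dataset: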

from sklearn import datasets
diabetes = datasets.load_diabetes()                                  # load the dataset
print(diabetes.data)                                                 # feature matrix
print(diabetes.target)                                               # target values
print(diabetes.feature_names)                                        # feature names
print('Number of rows: ', len(diabetes.data), len(diabetes.target))  # total number of rows
print('Number of features: ', len(diabetes.data[0]))                 # dimensionality of each row
print('Data shape: ', diabetes.data.shape)                           # shape of the data array
print(type(diabetes.data), type(diabetes.target))                    # types of the two arrays

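You can also read the DESCR attribute for a full description of the dataset; some datasets additionally provide feature_names and target_names.

To get only the arrays, use diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True). It returns X, the feature matrix of shape [n_samples, n_features], and y, of shape [n_samples], which holds the target value for each sample.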

# Diabetes example: import packages
from sklearn import datasets, linear_model
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Size of dataset
print(diabetes_X.shape)
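
The import of linear_model above suggests that the arrays are meant to feed a regression model. The following is a minimal sketch of that step, assuming an ordinary least-squares fit and an 80/20 split; the split ratio, the seed, and the choice of LinearRegression are assumptions, not something the original text specifies.

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

# Load the feature matrix and the target vector
X, y = datasets.load_diabetes(return_X_y=True)

# Hold out 20% of the rows for evaluation (assumed ratio and seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit ordinary least squares on the training rows
model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# One coefficient per feature, and R^2 on the held-out rows
r2 = model.score(X_test, y_test)
print(model.coef_.shape)
print(round(r2, 3))
```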

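Generating datasets from specific probability distributions

1. Single-label datasets

make_blobs generates multi-class datasets and gives fine control over the center and standard deviation of each cluster.

Signature: sklearn.datasets.make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None)

Parameter	Type	Default	Description
n_samples	int	optional (default=100)	Total number of points, divided equally among the clusters
n_features	int	optional (default=2)	Number of features for each sample
centers	int, or array of center coordinates	optional (default=3)	Number of centers to generate, or the fixed center locations
cluster_std	float or sequence of floats	optional (default=1.0)	Standard deviation of the clusters
center_box	pair of floats (min, max)	optional (default=(-10.0, 10.0))	Bounding box for each cluster center when centers are generated at random
shuffle	boolean	optional (default=True)	Shuffle the samples
random_state	int, RandomState instance or None	optional (default=None)	If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random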
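
The effect of random_state can be checked directly. The following is a minimal sketch; the seeds 42 and 7 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same integer seed twice, then a different seed
X_a, y_a = make_blobs(n_samples=50, centers=3, random_state=42)
X_b, y_b = make_blobs(n_samples=50, centers=3, random_state=42)
X_c, y_c = make_blobs(n_samples=50, centers=3, random_state=7)

# Identical seeds reproduce the data; a different seed gives different points
same_seed_matches = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
different_seed_differs = not np.array_equal(X_a, X_c)
print(same_seed_matches, different_seed_differs)
```

Fixing the seed is what makes a generated dataset reproducible across runs; leaving it as None draws new data on every call.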

Example: generate two classes of points with cluster centers at (-3, -3) and (3, 3), standard deviations of 0.5 and 0.7, 1,000 samples, and 2 features per point.

# sklearn.datasets.samples_generator was removed in scikit-learn 0.24; import from sklearn.datasets
from sklearn.datasets import make_blobs
centers = [(-3, -3), (3, 3)]
cluster_std = [0.5, 0.7]
X, y = make_blobs(n_samples=1000, centers=centers, n_features=2, random_state=0, cluster_std=cluster_std)

# %matplotlib inline  (Jupyter magic; uncomment when running in a notebook)
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20, 5))
# Left: scatter plot of the two features, colored by class
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
# Right: histogram of the class labels
plt.subplot(1, 2, 2)
plt.hist(y)
plt.show()

Example: generate three classes of points with three cluster centers, standard deviations of 0.5, 0.7, and 0.5, and 2,000 samples.

# sklearn.datasets.samples_generator was removed in scikit-learn 0.24; import from sklearn.datasets
from sklearn.datasets import make_blobs
centers = [(-3, -3), (0, 0), (3, 3)]
cluster_std = [0.5, 0.7, 0.5]
X, y = make_blobs(n_samples=2000, centers=centers, n_features=2, random_state=0, cluster_std=cluster_std)

# %matplotlib inline  (Jupyter magic; uncomment when running in a notebook)
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20, 5))
# Left: scatter plot of the two features, colored by class
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
# Right: histogram of the class labels
plt.subplot(1, 2, 2)
plt.hist(y)
plt.show()
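
Besides plotting, the three-cluster setup can be checked numerically. The following is a minimal sketch that uses 1,500 samples (an assumption, chosen so the count divides evenly by three) and compares per-class statistics with the requested centers and standard deviations.

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = [(-3, -3), (0, 0), (3, 3)]
cluster_std = [0.5, 0.7, 0.5]
X, y = make_blobs(n_samples=1500, centers=centers,
                  cluster_std=cluster_std, random_state=0)

# Number of points assigned to each class
counts = np.bincount(y)

# Empirical center and per-feature standard deviation of each class
class_means = np.array([X[y == k].mean(axis=0) for k in range(len(centers))])
class_stds = np.array([X[y == k].std(axis=0) for k in range(len(centers))])

print(counts)
print(class_means.round(2))
print(class_stds.round(2))
```

The empirical values should sit close to the requested ones, with small deviations that shrink as n_samples grows.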

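2. Datasets with noise

Both make_blobs and make_classification create multi-class datasets by allocating each class one or more normally distributed clusters of points. make_blobs provides greater control over the centers and standard deviations of each cluster and is typically used to demonstrate clustering. make_classification specializes in introducing noise by way of correlated, redundant, and uninformative features; multiple Gaussian clusters per class; and linear transformations of the feature space.

Signature: sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

Returns:
* X: array of shape [n_samples, n_features], the feature matrix
* y: array of shape [n_samples], the integer class label of each row of X

Parameter	Type	Default	Description
n_samples	int	optional (default=100)	Number of samples
n_features	int	optional (default=20)	Total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features, and n_features-n_informative-n_redundant-n_repeated useless features drawn at random
n_informative	int	optional (default=2)	Number of informative features
n_redundant	int	optional (default=2)	Number of redundant features
n_repeated	int	optional (default=0)	Number of duplicated features
n_classes	int	optional (default=2)	Number of classes (labels)
n_clusters_per_class	int	optional (default=2)	Number of clusters per class
weights	list of floats or None	(default=None)	Proportion of samples assigned to each class
flip_y	float	optional (default=0.01)	Fraction of samples whose class is assigned randomly
class_sep	float	optional (default=1.0)	Factor multiplying the hypercube size; larger values spread the clusters apart and make classification easier
hypercube	boolean	optional (default=True)	If True, the clusters are placed on the vertices of a hypercube. If False, they are placed on the vertices of a random polytope
shift	float, array of shape [n_features] or None	optional (default=0.0)	Shift features by the specified value. If None, features are shifted by a random value drawn in [-class_sep, class_sep]
scale	float, array of shape [n_features] or None	optional (default=1.0)	Multiply features by the specified value. If None, features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting
shuffle	boolean	optional (default=True)	Shuffle the samples and the features
random_state	int, RandomState instance or None	optional (default=None)	If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random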
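# sklearn.datasets.samples_generator was removed in scikit-learn 0.24; import from sklearn.datasets
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, n_classes=4, random_state=0)

# %matplotlib inline  (Jupyter magic; uncomment when running in a notebook)
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20, 5))
# Left: scatter plot of the first two of the ten features, colored by class
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
# Right: histogram of the class labels
plt.subplot(1, 2, 2)
plt.hist(y)
plt.show()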
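
The weights and flip_y parameters control class balance and label noise. The following is a minimal sketch of an imbalanced two-class problem; the 90/10 split and the seed are arbitrary assumptions, and flip_y is set to 0 so that no labels are reassigned at random.

```python
import numpy as np
from sklearn.datasets import make_classification

# Two classes with roughly 90% / 10% of the samples and no label noise
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=4,
    n_redundant=2,
    n_classes=2,
    weights=[0.9, 0.1],
    flip_y=0.0,
    random_state=0,
)

# Count how many samples landed in each class
counts = np.bincount(y)
print(X.shape, counts)
```

Raising flip_y above 0 reassigns that fraction of labels at random, which blurs these proportions and makes the task harder.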

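3. Gaussian-distributed datasets

Signature: sklearn.datasets.make_gaussian_quantiles(mean=None, cov=1.0, n_samples=100, n_features=2, n_classes=3, shuffle=True, random_state=None)

Parameter	Type	Default	Description
mean	array of shape [n_features]	optional (default=None)	Mean of the multi-dimensional normal distribution. If None, the origin (0, 0, …) is used
cov	float	optional (default=1.)	The covariance matrix is this value times the identity matrix. This dataset only produces symmetric normal distributions
n_samples	int	optional (default=100)	Total number of points, divided equally among the classes
n_features	int	optional (default=2)	Number of features for each sample
n_classes	int	optional (default=3)	Number of classes
shuffle	boolean	optional (default=True)	Shuffle the samples
random_state	int, RandomState instance or None	optional (default=None)	If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random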

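# sklearn.datasets.samples_generator was removed in scikit-learn 0.24; import from sklearn.datasets
from sklearn.datasets import make_gaussian_quantiles
X, y = make_gaussian_quantiles(mean=(1, 1), cov=1.0, n_samples=1000, n_features=2, n_classes=2, shuffle=True, random_state=None)

# %matplotlib inline  (Jupyter magic; uncomment when running in a notebook)
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20, 5))
# Left: scatter plot of the two features, colored by class
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
# Right: histogram of the class labels
plt.subplot(1, 2, 2)
plt.hist(y)
plt.show()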
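
In make_gaussian_quantiles the classes are separated by nested concentric spheres around the mean, so each class occupies a band of distances from the center. The following is a minimal sketch that checks this for the two-class case; the fixed seed and the variable names are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

mean = np.array([1.0, 1.0])
X, y = make_gaussian_quantiles(mean=mean, cov=1.0, n_samples=1000,
                               n_features=2, n_classes=2, random_state=0)

# Distance of every sample from the distribution mean
radius = np.linalg.norm(X - mean, axis=1)

# Samples are split equally between the classes
counts = np.bincount(y)

# Class 0 is the inner disc, class 1 the surrounding ring
inner_max = radius[y == 0].max()
outer_min = radius[y == 1].min()
print(counts, inner_max <= outer_min)
```

This is why the scatter plot shows a central disc of one color surrounded by a ring of the other.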

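4. Generating binary classification data

make_hastie_10_2 generates data for binary classification.

Parameter	Type	Default	Description
n_samples	int	optional (default=12000)	Number of samples
random_state	int, RandomState instance or None	optional (default=None)	If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random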

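# sklearn.datasets.samples_generator was removed in scikit-learn 0.24; import from sklearn.datasets
from sklearn.datasets import make_hastie_10_2
X, y = make_hastie_10_2(n_samples=1000, random_state=None)

# %matplotlib inline  (Jupyter magic; uncomment when running in a notebook)
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(20, 5))
# Left: scatter plot of the first two of the ten features, colored by class
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
# Right: histogram of the class labels (-1 and 1)
plt.subplot(1, 2, 2)
plt.hist(y)
plt.show()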
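
According to the scikit-learn documentation, make_hastie_10_2 draws ten standard Gaussian features and labels a sample 1 when the sum of its squared features exceeds 9.34, and -1 otherwise. The following is a minimal sketch that recomputes the labels from that rule; the fixed seed and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(n_samples=1000, random_state=0)

# Recompute the labels from the documented rule
rule = np.where((X ** 2).sum(axis=1) > 9.34, 1.0, -1.0)
labels_match_rule = np.array_equal(y, rule)

print(X.shape, np.unique(y), labels_match_rule)
```

Because the label depends on all ten features, a scatter plot of only the first two shows heavily overlapping classes.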

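Other datasets

Image datasets

scikit-learn bundles a small set of JPEG images that can be used to test algorithms on 2D data.

Loader	Description
load_sample_images()	Loads the two bundled sample images
load_sample_image(image_name)	Loads a single bundled sample image by name and returns it as a NumPy array

Images use the uint8 dtype by default. Machine learning algorithms often work best if the input is first converted to a floating-point representation. Also, if you plan to use matplotlib.pyplot.imshow, remember to scale the values to the range 0-1.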

# Authors: Robert Layton <robertlayton@gmail.com>
#          Olivier Grisel <olivier.grisel@ensta.org>
#          Mathieu Blondel <mathieu@mblondel.org>
#
# License: BSD 3 clause

print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle
from time import time

n_colors = 64

# Load the Summer Palace photo
china = load_sample_image("china.jpg")

# Convert to floats instead of the default 8-bit integer coding. Dividing by
# 255 is important so that plt.imshow works well on float data (values need to
# be in the range [0-1])
china = np.array(china, dtype=np.float64) / 255

# Load Image and transform to a 2D numpy array.
w, h, d = original_shape = tuple(china.shape)
assert d == 3
image_array = np.reshape(china, (w * h, d))

print("Fitting model on a small sub-sample of the data")
t0 = time()
image_array_sample = shuffle(image_array, random_state=0)[:1000]
kmeans = KMeans(n_clusters=n_colors, random_state=0).fit(image_array_sample)
print("done in %0.3fs." % (time() - t0))

# Get labels for all points
print("Predicting color indices on the full image (k-means)")
t0 = time()
labels = kmeans.predict(image_array)
print("done in %0.3fs." % (time() - t0))


codebook_random = shuffle(image_array, random_state=0)[:n_colors]
print("Predicting color indices on the full image (random)")
t0 = time()
labels_random = pairwise_distances_argmin(codebook_random,
                                          image_array,
                                          axis=0)
print("done in %0.3fs." % (time() - t0))


def recreate_image(codebook, labels, w, h):
    """Recreate the (compressed) image from the code book & labels"""
    d = codebook.shape[1]
    image = np.zeros((w, h, d))
    label_idx = 0
    for i in range(w):
        for j in range(h):
            image[i][j] = codebook[labels[label_idx]]
            label_idx += 1
    return image

# Display all results, alongside original image
plt.figure(1)
plt.clf()
plt.axis('off')
plt.title('Original image (96,615 colors)')
plt.imshow(china)

plt.figure(2)
plt.clf()
plt.axis('off')
plt.title('Quantized image (64 colors, K-Means)')
plt.imshow(recreate_image(kmeans.cluster_centers_, labels, w, h))

plt.figure(3)
plt.clf()
plt.axis('off')
plt.title('Quantized image (64 colors, Random)')
plt.imshow(recreate_image(codebook_random, labels_random, w, h))
plt.show()

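Datasets in svmlight/libsvm format

SVMlight is an SVM toolkit that also supports semi-supervised learning. LIBSVM is a simple, easy-to-use, fast, and effective software package for SVM classification and regression, developed by Professor Chih-Jen Lin and colleagues at National Taiwan University.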

from sklearn.datasets import load_svmlight_file, load_svmlight_files
# Load a single dataset
X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
# Load several datasets at once
X_train, y_train, X_test, y_test = load_svmlight_files(("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))
# Ensure X_test has the same number of features as X_train
X_test, y_test = load_svmlight_file("/path/to/test_dataset.txt", n_features=X_train.shape[1])
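
The paths above are placeholders. To see the format without an external file, the following minimal sketch writes a tiny array with dump_svmlight_file and reads it back; the toy data, the temporary file name, and the variable names are all made up for illustration.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Tiny, mostly-zero feature matrix with one label per row (illustrative data)
X = np.array([[0.0, 1.5, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
y = np.array([1, -1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.svmlight")
    # Write the data; only non-zero entries are stored
    dump_svmlight_file(X, y, path, zero_based=True)
    with open(path) as f:
        text = f.read()
    # Read it back; n_features is passed explicitly because the last column is all zeros
    X_loaded, y_loaded = load_svmlight_file(path, n_features=X.shape[1], zero_based=True)

print(text)
print(X_loaded.toarray())
print(y_loaded)
```

load_svmlight_file returns X as a SciPy sparse matrix, which is why toarray() is called before printing. Passing n_features matters for the same reason as in the train/test example above: trailing all-zero columns are not stored in the file.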

Downloading datasets from openml.org

openml.org is a public repository for machine learning data and experiments.

import numpy as np
from sklearn.datasets import fetch_openml
# Download the dataset from openml.org (requires network access)
mice = fetch_openml(name='miceprotein', version=4)
print(mice.data.shape) #(1080, 77)
print(mice.target.shape) #(1080,)
print(np.unique(mice.target)) #'c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'
print(mice.DESCR)
print(mice.details)
print(mice.url)
