Datasets in sklearn, Part 1 (toy_datasets)

disanda

Last modified 2022-06-10 20:07:57

First published 2021-09-07 09:23:22

Article link: https://blog.csdn.net/disanda/article/details/120150415

Toy datasets in sklearn

sklearn bundles a set of toy datasets. They are small, ship with the library, and are convenient for quick experiments.

Tips:
Install: pip install scikit-learn

1. Boston house prices

506 samples, each with 13 features. The target is the median house price (in units of $1,000). The features are:

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of black people by town
13. LSTAT: % lower status of the population
14. (target) MEDV: median value of owner-occupied homes in $1000s

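Example:

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call only works with scikit-learn < 1.2.
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)
print(X.shape)  # (506, 13)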

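2. Iris flower

The dataset contains 50 samples from each of three Iris species (Iris setosa, Iris virginica, and Iris versicolor).
Four features are measured for each sample: the length and width of the sepals and of the petals (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), in centimeters.
Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species.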

from sklearn.datasets import load_iris
iris = load_iris()
print(iris)
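Printing the whole object dumps every field at once. As a minimal sketch of how the description above maps onto the returned object, the snippet below inspects the shape, the feature and species names, and the per-species sample count. The variable name class_counts is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # number of samples and features
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # names of the three species

# Count how many samples belong to each species (labels are 0, 1, 2).
class_counts = np.bincount(y)
print(class_counts)
```

Each species should account for 50 of the 150 rows, matching the description above.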

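3. Digits (8x8 handwritten digits)

Full name: Optical Recognition of Handwritten Digits Data Set (sklearn ships a copy of the test set of this UCI dataset, not the Pen-Based Recognition dataset).

1,797 samples; each consists of an 8x8-pixel image and an integer label in [0, 9].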

from sklearn import datasets
digits = datasets.load_digits()
print(digits.keys())
# e.g. dict_keys(['data', 'target', 'target_names', 'images', 'DESCR']); newer scikit-learn versions also include 'frame' and 'feature_names'

# Visualize the first image
import matplotlib.pyplot as plt
plt.imshow(digits.images[0])
plt.show()
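The 'data' and 'images' keys hold the same pixels in two layouts. The sketch below (variable names are illustrative) checks that flattening each 8x8 image gives the corresponding 64-element row of data, and prints the range of pixel intensities.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.images.shape)  # one 8x8 array per sample
print(digits.data.shape)    # one flattened 64-element row per sample

# Flatten every image and compare with the feature matrix.
flattened = digits.images.reshape(len(digits.images), -1)
same = np.array_equal(flattened, digits.data)
print(same)

# Pixel intensities are small non-negative numbers rather than 0..255.
pixel_min, pixel_max = digits.data.min(), digits.data.max()
print(pixel_min, pixel_max)
```

Use data when feeding an estimator and images when plotting.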

4. Diabetes dataset

n = 442 samples, i.e. 442 diabetes patients, each described by 10 baseline features:

1. age: age
2. sex: sex
3. bmi: body mass index
4. bp: average blood pressure
5. s1 (tc): total serum cholesterol
6. s2 (ldl): low-density lipoproteins
7. s3 (hdl): high-density lipoproteins
8. s4 (tch): total cholesterol / HDL
9. s5 (ltg): possibly log of serum triglycerides level
10. s6 (glu): blood sugar level

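Note that all 10 features are standardized: each column is mean-centered and scaled so that its sum of squares over all samples is 1. The target y is a quantitative measure of disease progression one year after baseline. It is not a blood glucose reading, so the mg/dL-to-mmol/L conversion (mg/dL ÷ 18 = mmol/L, mmol/L × 18 = mg/dL) does not apply to it.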

Example:

from sklearn import datasets
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target
print(X.shape)  # (442, 10)
print(y)
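The scaling of the features is easy to check numerically. The following is a small sketch (variable names are illustrative) that computes the mean and the sum of squares of every feature column. The means should be close to 0 and the sums of squares close to 1, up to floating-point error.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Per-column mean: expected to be approximately 0.
col_means = X.mean(axis=0)
# Per-column sum of squares: expected to be approximately 1.
col_sq_sums = (X ** 2).sum(axis=0)

print(np.round(col_means, 6))
print(np.round(col_sq_sums, 6))

# The target is left on its original scale.
print(y.min(), y.max())
```

Because the features are already scaled this way, the raw units (years, mmHg, and so on) cannot be read directly off X.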

5. Linnerud dataset

A very small dataset of 20 records relating exercise performance to physiological measurements. It is suited to multi-output regression models.

Physiological y: Weight, Waist, and Pulse.
Exercise x: Chins (chin-ups), Situps, and Jumps.

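# Import the loader directly; a bare "import sklearn" may not load the datasets submodule.
from sklearn.datasets import load_linnerud
x, y = load_linnerud(return_X_y=True)
print(x.shape, y.shape)  # (20, 3) (20, 3)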
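To illustrate what "multi-output regression" means here, the sketch below fits a single LinearRegression that predicts all three physiological values at once from the three exercise values. Choosing LinearRegression is an assumption made for illustration. With only 20 samples the fit is a demonstration, not a useful model.

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

# X: exercise data (Chins, Situps, Jumps)
# Y: physiological data (Weight, Waist, Pulse)
X, Y = load_linnerud(return_X_y=True)

# LinearRegression accepts a 2-D target and fits one linear model per output column.
model = LinearRegression().fit(X, Y)

pred = model.predict(X)
print(pred.shape)         # one row per sample, one column per target
print(model.coef_.shape)  # (n_targets, n_features)
```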

6. Wine recognition dataset

A dataset describing three kinds of wine, with 178 samples in total. The class sizes are [59, 71, 48] and the dimensionality is 13. The features are:

Alcohol
Malic Acid
Ash
Alkalinity of Ash
Magnesium
Total Phenols
Flavanoids
Nonflavanoid Phenols
Proanthocyanins
Colour Intensity
Hue
OD280/OD315 of diluted wines
Proline

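Example:

# Import the loader directly; a bare "import sklearn" may not load the datasets submodule.
from sklearn.datasets import load_wine
x, y = load_wine(return_X_y=True)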
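The sample count, dimensionality, and class sizes quoted above can be confirmed directly from the loaded data. This is a minimal sketch; class_counts is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

print(wine.data.shape)          # samples x features
print(list(wine.feature_names)) # the 13 feature names as spelled by sklearn

# Number of samples in each of the three classes (labels 0, 1, 2).
class_counts = np.bincount(wine.target)
print(class_counts)
```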

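7. Breast cancer wisconsin (diagnostic) dataset

569 samples, each with 10 x 3 = 30 features: for each of 10 measurements, the dataset records the mean, the standard error, and the "worst" (largest) value. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. FNA is a sampling procedure, not an algorithm. The separating plane described in the dataset documentation was obtained with the Multisurface Method-Tree (MSM-T).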

radius (mean): [6.981, 28.11], mean of distances from center to points on the perimeter
texture (mean): [9.71, 39.28], standard deviation of gray-scale values
perimeter (mean): [43.79, 188.5]
area (mean): [143.5, 2501.0]
smoothness (mean): [0.053, 0.163], local variation in radius lengths
compactness (mean): [0.019, 0.345], perimeter^2 / area - 1.0
concavity (mean): [0.0, 0.427], severity of concave portions of the contour
concave points (mean): [0.0, 0.201], number of concave portions of the contour
symmetry (mean): [0.106, 0.304]
fractal dimension (mean): [0.05, 0.097], "coastline approximation" - 1

radius (standard error)
texture (standard error)
perimeter (standard error)
area (standard error)
smoothness (standard error)
compactness (standard error)
concavity (standard error)
concave points (standard error)
symmetry (standard error)
fractal dimension (standard error)

radius (worst)
texture (worst)
perimeter (worst)
area (worst)
smoothness (worst)
compactness (worst)
concavity (worst)
concave points (worst)
symmetry (worst)
fractal dimension (worst)
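No example is given for this dataset, so here is a short sketch of loading it with load_breast_cancer and checking the layout of the 30 columns: the first ten are the means, the next ten the standard errors, and the last ten the worst values. Variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(X.shape)  # samples x features

# The 30 columns come in three groups of ten: mean, standard error, worst.
names = list(cancer.feature_names)
print(names[0], "|", names[10], "|", names[20])

# Class labels and how many samples fall in each class.
class_counts = np.bincount(y)
print(list(cancer.target_names), class_counts)
```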

Reference

sklearn.datasets.load_diabetes

cn: https://sklearn.apachecn.org/
en: https://scikit-learn.org/stable/
toy_datasets: https://scikit-learn.org/stable/datasets/toy_dataset.html
Iris: https://en.wikipedia.org/wiki/Iris_flower_data_set
digits: https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
digits: https://blog.csdn.net/weixin_43893890/article/details/103355229