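sklearn ships with several built-in datasets; the handwritten digits dataset can be loaded with load_digits. The built-in loaders are ordinary functions that read data files bundled with the package and return a Bunch. Note that the listing below is the source of load_linnerud rather than load_digits; it is shown here as an example of what such a loader looks like internally: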
def load_linnerud():
    """Load and return the linnerud dataset (multivariate regression).

    Samples total: 20
    Dimensionality: 3 for both data and targets
    Features: integer
    Targets: integer

    Returns
    -------
    data : Bunch
        Dictionary-like object, the interesting attributes are: 'data' and
        'targets', the two multivariate datasets, with 'data' corresponding to
        the exercise and 'targets' corresponding to the physiological
        measurements, as well as 'feature_names' and 'target_names'.
    """
    base_dir = join(dirname(__file__), 'data/')
    # Read data
    data_exercise = np.loadtxt(base_dir + 'linnerud_exercise.csv', skiprows=1)
    data_physiological = np.loadtxt(base_dir + 'linnerud_physiological.csv',
                                    skiprows=1)
    # Read header
    with open(base_dir + 'linnerud_exercise.csv') as f:
        header_exercise = f.readline().split()
    with open(base_dir + 'linnerud_physiological.csv') as f:
        header_physiological = f.readline().split()
    with open(dirname(__file__) + '/descr/linnerud.rst') as f:
        descr = f.read()
    return Bunch(data=data_exercise, feature_names=header_exercise,
                 target=data_physiological,
                 target_names=header_physiological,
                 DESCR=descr)
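To see what a loader like this hands back, here is a minimal sketch that calls load_linnerud and inspects the returned Bunch. The variable name linnerud is only illustrative; the field names are the ones passed to Bunch in the listing above.

```python
import numpy as np
from sklearn.datasets import load_linnerud

# Call the loader; the result is a dictionary-like Bunch
linnerud = load_linnerud()

# 20 samples with 3 columns each, for both the exercise data and the targets
print(linnerud.data.shape)
print(linnerud.target.shape)

# Column names read from the CSV header lines
print(linnerud.feature_names)
print(linnerud.target_names)

# A Bunch allows both attribute access and key access to the same field
same_field = np.array_equal(linnerud["data"], linnerud.data)
print(same_field)
```

load_digits returns the same kind of object, which is why the code in the next part can call keys() on it and read fields such as data and target as attributes.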
The dataset:
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
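# Handwritten digits dataset, returned as a Bunch object that can be treated like a dictionary
digits = datasets.load_digits()
# Use keys() to see which fields the dataset provides
print(digits.keys())
# View the dataset description provided by sklearn.datasets
# The original dataset has 5620 images; the copy bundled with sklearn is a subset of 1797
# Each image has 64 pixels, i.e. 64 features (8x8 image of integer pixels)
# Each feature takes a value in 0~16, and the labels are the 10 digits
# print(digits.DESCR)
# Shape of the features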
X = digits.data
print(X.shape)
# Shape of the labels
y = digits.target
print(y.shape)
# The label classes
print(digits.target_names)
# Pick out one specific sample and look at its features and label
some_digit = X[666]
print(some_digit)
print(y[666])
# This sample can also be visualized
some_digit_image = some_digit.reshape(8, 8)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
plt.show()
Visualization result:
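The figures quoted in the comments of the walkthrough (sample count, number of features, pixel value range, label classes) can also be checked directly against the loaded arrays. The following is a minimal sketch; the helper variable names (n_samples, pixel_min, matches, and so on) are illustrative and not part of sklearn.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# Number of samples and number of features per sample
n_samples, n_features = X.shape
print(n_samples, n_features)

# Smallest and largest pixel value across the whole dataset
pixel_min, pixel_max = X.min(), X.max()
print(pixel_min, pixel_max)

# Distinct label values
classes = np.unique(y)
print(classes)

# digits.images stores the same pixels already shaped as 8x8 arrays,
# so reshaping a row of digits.data by hand gives the same image
matches = np.array_equal(digits.images[666], X[666].reshape(8, 8))
print(matches)
```

Because digits.images already holds the 8x8 arrays, digits.images[666] could be passed to plt.imshow directly instead of reshaping the flat feature row.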