Deep Learning in Practice Series (1) ----- A Detailed Guide to the notMNIST Dataset

Assignment 1

The objective of this assignment is to learn about simple data curation practices, and familiarize you with some of the data we’ll be reusing later.

This notebook uses the notMNIST dataset for Python experiments. The dataset is designed to look like the classic MNIST dataset while looking a little more like real data: it’s a harder task, and the data is a lot less ‘clean’ than MNIST.

1. Downloading the dataset

First, we’ll download the dataset to our local machine. The data consists of characters rendered in a variety of fonts on a 28x28 image. The labels are limited to ‘A’ through ‘J’ (10 classes). The training set has about 500k labeled examples and the test set about 19,000. Given these sizes, it should be possible to train models quickly on any machine.

# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import imageio
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import tarfile
from IPython.display import display, Image
from sklearn.linear_model import LogisticRegression
from six.moves.urllib.request import urlretrieve
from six.moves import cPickle as pickle

# Config the matplotlib backend as plotting inline in IPython
%matplotlib inline

url = 'https://commondatastorage.googleapis.com/books1000/'  # the download from this URL is very slow; a Baidu Cloud mirror is linked below
data_root = '.'  # directory where the downloaded archives are stored; maybe_download() below relies on it

last_percent_reported = None
def download_progress_hook(count, blockSize, totalSize):
    """A hook to report the progress of a download. This is mostly intended for users with
    slow internet connections. Reports every 5% change in download progress.
    """
    global last_percent_reported
    percent = int(count * blockSize * 100 / totalSize)

    if last_percent_reported != percent:
        if percent % 5 == 0:
            sys.stdout.write("%s%%" % percent)
            sys.stdout.flush()
        else:
            sys.stdout.write(".")
            sys.stdout.flush()

        last_percent_reported = percent
        
def maybe_download(filename, expected_bytes, force=False):
    """Download a file if not present, and make sure it's the right size."""
    dest_filename = os.path.join(data_root, filename)
    if force or not os.path.exists(dest_filename):
        print('Attempting to download:', filename) 
        filename, _ = urlretrieve(url + filename, dest_filename, reporthook=download_progress_hook)
        print('\nDownload Complete!')
    statinfo = os.stat(dest_filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', dest_filename)
    else:
        raise Exception(
          'Failed to verify ' + dest_filename + '. Can you get to it with a browser?')
    return dest_filename

train_filename = maybe_download('notMNIST_large.tar.gz', 247336696)
test_filename = maybe_download('notMNIST_small.tar.gz', 8458043)

Manual download of the notMNIST dataset from Baidu Cloud:

Baidu Cloud link:
Link: https://pan.baidu.com/s/1f-xSsm2SC-xq5dkevobQEQ
Extraction code: 4ev5

2. Extraction (if you downloaded from Baidu Cloud, extract manually with an archive tool such as 360Zip)

Extract the dataset from the compressed .tar.gz file. This should give you a set of directories, labeled A through J.

Because the extraction code block in the assignment is very slow, you can instead extract the archive with an archive tool (360Zip works fine), keep the path of the extracted folder, and use that path in the later steps.

If you would rather extract with code than with an archive tool, a tarfile-based sketch is given right below. The code block that follows it drops the extraction step from the assignment and keeps only the part that stores the class folder paths.
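
A minimal sketch of code-based extraction, along the lines of what the original assignment's maybe_extract did before the extraction step was removed here. It assumes the .tar.gz archives downloaded by maybe_download above (train_filename / test_filename); os and tarfile are already imported at the top of the notebook.

def extract_archive(filename, dest='.'):
    """Extract a .tar.gz archive into dest and return the folder it creates."""
    root = os.path.splitext(os.path.splitext(filename)[0])[0]  # strip .tar.gz
    if os.path.isdir(root):
        print('%s already present - skipping extraction.' % root)
    else:
        print('Extracting %s, this may take a while...' % filename)
        with tarfile.open(filename) as tar:
            tar.extractall(dest)
    return root

# extract_archive(train_filename)
# extract_archive(test_filename)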

num_classes = 10
np.random.seed(133)

# Collect the folder path of each class
def maybe_extract(filename, force=False):
    root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz
    data_folders = [os.path.join(root, d) for d in sorted(os.listdir(root))
                    if os.path.isdir(os.path.join(root, d))]
    if len(data_folders) != num_classes:
        raise Exception(
            'Expected %d folders, one per class. Found %d instead.' % (
                num_classes, len(data_folders)))
    return data_folders

# Local folders holding the extracted notMNIST data. Change these absolute paths to match your machine.
train_filename = 'C:\\ProgramInstall\\PythonCode\\notMNIST_large'
test_filename = 'C:\\ProgramInstall\\PythonCode\\notMNIST_small'
train_folders = maybe_extract(train_filename)
test_folders = maybe_extract(test_filename)

Problem 1: Viewing the images

Let’s take a peek at some of the data to make sure it looks sensible. Each exemplar should be an image of a character A through J rendered in a different font. Display a sample of the images that we just downloaded. Hint: you can use the package IPython.display.

Read the images with absolute paths rather than the relative paths used in some other blog posts.

# from IPython.display import display, Image
display(Image(filename = "D:/NLP/lesson09/notMNIST_small/A/Q0sgUGluay50dGY=.png"))

Output: the sample image (an 'A' rendered in one font) is displayed inline.

import random
import matplotlib.image as mpimg

#test_folders = "D:/NLP/lesson09/notMNIST_small"
# List the class sub-folders
path = "D:/NLP/lesson09/notMNIST_small"  # change this path to match your machine
test_folders = os.listdir(path)

def plot_samples(data_folders, sample_size, title=None):  # data_folders = ['A', 'B', ...], the class sub-folders of path
    fig = plt.figure()
    if title: 
        fig.suptitle(title, fontsize=16, fontweight='bold')
    
    for folder in data_folders:  # folder = 'A', 'B', 'C', ...
        newPath = os.path.join(path, folder)
        image_files = os.listdir(newPath)  # the image file names in this class folder
        image_sample = random.sample(image_files, sample_size)
        for image in image_sample:
            image_file = os.path.join(newPath, image)
            ax = fig.add_subplot(len(data_folders), sample_size, sample_size * data_folders.index(folder) +
                                 image_sample.index(image) + 1)
            image = mpimg.imread(image_file)
            ax.imshow(image)
            ax.set_axis_off()

    plt.show()
plot_samples(test_folders, 10, 'Test Folders')

Output: a grid of 10 random sample images per class, A through J.

Loading and normalizing the image data

Now let’s load the data in a more manageable format. Since, depending on your computer setup you might not be able to fit it all in memory, we’ll load each class into a separate dataset, store them on disk and curate them independently. Later we’ll merge them into a single dataset of manageable size.

We’ll convert the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and standard deviation ~0.5 to make training easier down the road.

A few images might not be readable, we’ll just skip them.

This code block does three things. First, it reads the image files in each class folder on disk into a 3-D dataset array, whose first dimension indexes the images and whose other two dimensions hold the pixel data; imageio.imread is used to load each image into a NumPy ndarray. Second, it normalizes every image with the formula described above, image_data = (pixel_value - pixel_depth / 2) / pixel_depth, which maps the original [0, 255] values into roughly [-0.5, 0.5]. Third, it pickles each per-class ndarray into the working directory, one .pickle file per class, and returns the paths of these .pickle files as train_datasets and test_datasets.

image_size = 28  # Pixel width and height.
pixel_depth = 255.0  # Number of levels per pixel.

def load_letter(folder, min_num_images):
    """Load the data for a single letter label."""
    image_files = os.listdir(folder)
    dataset = np.ndarray(shape=(len(image_files), image_size, image_size), dtype=np.float32)
    print(folder)
    num_images = 0
    for image in image_files:
        image_file = os.path.join(folder, image)
        try:
            image_data = (imageio.imread(image_file).astype(float) - 
                        pixel_depth / 2) / pixel_depth
            if image_data.shape != (image_size, image_size):
                raise Exception('Unexpected image shape: %s' % str(image_data.shape))
            dataset[num_images, :, :] = image_data
            num_images = num_images + 1
        except (IOError, ValueError) as e:
            print('Could not read:', image_file, ':', e, '- it\'s ok, skipping.')
    
    dataset = dataset[0:num_images, :, :]
    if num_images < min_num_images:
        raise Exception('Many fewer images than expected: %d < %d' %
                    (num_images, min_num_images))
    
    print('Full dataset tensor:', dataset.shape)
    print('Mean:', np.mean(dataset))
    print('Standard deviation:', np.std(dataset))
    return dataset
        
def maybe_pickle(data_folders, min_num_images_per_class, force=False):
    dataset_names = []
    for folder in data_folders:
        set_filename = folder + '.pickle'
        dataset_names.append(set_filename)
        if os.path.exists(set_filename) and not force:
            # You may override by setting force=True.
            print('%s already present - Skipping pickling.' % set_filename)
        else:
            print('Pickling %s.' % set_filename)
            dataset = load_letter(folder, min_num_images_per_class)
            try:
                with open(set_filename, 'wb') as f:
                    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
            except Exception as e:
                print('Unable to save data to', set_filename, ':', e)
  
    return dataset_names

train_datasets = maybe_pickle(train_folders, 45000)
test_datasets = maybe_pickle(test_folders, 1800)

Pickle's dump() and load() methods:
  • pickle provides a simple persistence mechanism: an object can be stored on disk as a file.
  • Here each class folder's images are converted to an ndarray and serialized into a .pickle file.
  • dump() serializes an object and writes it to a file object.
  • load() deserializes: it parses the data in a file back into a Python object; the file object must provide read() and readline().
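
A minimal round-trip sketch of dump() and load(), independent of the notMNIST code (the file name demo.pickle is arbitrary):

import numpy as np
from six.moves import cPickle as pickle

demo_array = np.arange(4, dtype=np.float32).reshape(2, 2)

with open('demo.pickle', 'wb') as f:           # serialize: write the ndarray to disk
    pickle.dump(demo_array, f, pickle.HIGHEST_PROTOCOL)

with open('demo.pickle', 'rb') as f:           # deserialize: read it back as a Python object
    restored = pickle.load(f)

print(np.array_equal(demo_array, restored))    # True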
There are two directories to process (notMNIST_small and notMNIST_large); in the block below the commented-out path handles the other one, so run the block once per directory, switching which path is commented.
image_size = 28  # Pixel width and height.
pixel_depth = 255.0  # Number of levels per pixel.

#path = "D:/NLP/lesson09/notMNIST_small"   # run this once with the small set as well

path = "D:/NLP/lesson09/notMNIST_large"   # comment this line out when running the small-set path above

def load_letter(folder, min_num_images):  # read the images under one class folder, e.g. A
    """Load the data for a single letter label."""
    newPath = os.path.join(path, folder)  # D:/NLP/lesson09/notMNIST_small/A
    image_files = os.listdir(newPath)   # the image file names in this class folder
    dataset = np.ndarray(shape=(len(image_files), image_size, image_size), dtype=np.float32)
    print(folder)
    num_images = 0
    for image in image_files:
        image_file = os.path.join(newPath, image) # D:/NLP/lesson09/notMNIST_small/A/MDEtMDEtMDAudHRm.png
        try:
            image_data = (imageio.imread(image_file).astype(float) -   # subtract the mean and normalize
                        pixel_depth / 2) / pixel_depth
            if image_data.shape != (image_size, image_size):
                raise Exception('Unexpected image shape: %s' % str(image_data.shape))
            dataset[num_images, :, :] = image_data
            num_images = num_images + 1
        except (IOError, ValueError) as e:
            print('Could not read:', image_file, ':', e, '- it\'s ok, skipping.')
    
    dataset = dataset[0:num_images, :, :]
    if num_images < min_num_images:
        raise Exception('Many fewer images than expected: %d < %d' %
                    (num_images, min_num_images))
    
    print('Full dataset tensor:', dataset.shape)
    print('Mean:', np.mean(dataset))
    print('Standard deviation:', np.std(dataset))
    return dataset
        
def maybe_pickle(data_folders, min_num_images_per_class, force=False):  # build one .pickle file per class folder
    dataset_names = []
    for folder in data_folders:
        newPath = os.path.join(path,folder) # D:/NLP/lesson09/notMNIST_small/A
        #image_files = os.listdir(newPath) # ['MDEtMDEtMDAudHRm.png', 'MDRiXzA4LnR0Zg==.png']
        set_filename = folder + '.pickle' # e.g. 'A.pickle', written to the current working directory (not under path)
        dataset_names.append(set_filename)
        print(dataset_names)
        if os.path.exists(set_filename) and not force:
            # You may override by setting force=True.
            print('%s already present - Skipping pickling.' % set_filename)
        else:
            print('Pickling %s.' % set_filename)
            dataset = load_letter(folder, min_num_images_per_class)  # call load_letter defined above
            try:
                with open(set_filename, 'wb') as f: 
                    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
            except Exception as e:
                print('Unable to save data to', set_filename, ':', e)
  
    return dataset_names

#test_folders = os.listdir( path )
#test_datasets = maybe_pickle(test_folders, 1800)  # ['A.pickle', 'B.pickle', 'C.pickle']


train_folders = os.listdir( path )
train_datasets = maybe_pickle(train_folders, 45000)

Check the directory where the pickled datasets are saved:

os.getcwd()

Output: 'C:\Users\Administrator'
This is the directory the notebook runs in, and therefore where the .pickle files end up; go and take a look at the datasets there.

Problem 2

Let’s verify that the data still looks good. Display a sample of the labels and images from the ndarray. Hint: you can use matplotlib.pyplot.

pickle_file = os.path.join('notMNIST_small', test_datasets[1])
with open(pickle_file, 'rb') as f:
    letter_set = pickle.load(f)  # deserialize B.pickle (test_datasets[1])
    sample_idx = np.random.randint(len(letter_set))  # pick a random index within this class
    sample_image = letter_set[sample_idx,:,:]  # index into the 3-D array: the image at position sample_idx
    plt.figure()
    plt.imshow(sample_image)

Output: a single random sample image from the 'B' class.

Problem 3: Checking class balance

Another check: we expect the data to be balanced across classes. Verify that.

for idx in range(len(test_datasets)):  # per-class counts for the test set
    class_idx = test_datasets[idx]
    class_idx = os.path.join('notMNIST_small', class_idx)
    with open(class_idx, 'rb') as f:
        letter_set = pickle.load(f)
        print(idx,len(letter_set))
for idx in range(len(train_datasets)):  # per-class counts for the training set
    class_idx = train_datasets[idx]
    class_idx = os.path.join('notMNIST_large', class_idx)
    print(class_idx)
    with open(class_idx, 'rb') as f:
        letter_set = pickle.load(f)
        print(idx,len(letter_set))

Merging the classes and creating the validation set

Merge and prune the training data as needed. Depending on your computer setup, you might not be able to fit it all in memory, and you can tune train_size as needed. The labels will be stored into a separate array of integers 0 through 9.

Also create a validation dataset for hyperparameter tuning.

def make_arrays(nb_rows, img_size):
    if nb_rows:
        dataset = np.ndarray((nb_rows, img_size, img_size), dtype=np.float32)
        labels = np.ndarray(nb_rows, dtype=np.int32)
    else:
        dataset, labels = None, None
    return dataset, labels

def merge_datasets(pickle_files, train_size, valid_size=0):  # be careful with the indentation in this function
    num_classes = len(pickle_files)
    valid_dataset, valid_labels = make_arrays(valid_size, image_size)
    train_dataset, train_labels = make_arrays(train_size, image_size)
    vsize_per_class = valid_size // num_classes
    tsize_per_class = train_size // num_classes

    start_v, start_t = 0, 0
    end_v, end_t = vsize_per_class, tsize_per_class
    end_l = vsize_per_class+tsize_per_class
    for label, pickle_file in enumerate(pickle_files):    
        pickle_file = os.path.join('notMNIST_large', pickle_file) 
#         pickle_file = os.path.join('notMNIST_small', pickle_file)   # the training and test pickles sit in different directories; swap these two lines when merging the test set
        try:
            with open(pickle_file, 'rb') as f:
                letter_set = pickle.load(f)
                # let's shuffle the letters to have random validation and training set
                np.random.shuffle(letter_set)
            if valid_dataset is not None:
                valid_letter = letter_set[:vsize_per_class, :, :]
                valid_dataset[start_v:end_v, :, :] = valid_letter
                valid_labels[start_v:end_v] = label
                start_v += vsize_per_class
                end_v += vsize_per_class
                    
            train_letter = letter_set[vsize_per_class:end_l, :, :]
            train_dataset[start_t:end_t, :, :] = train_letter
            train_labels[start_t:end_t] = label
            start_t += tsize_per_class
            end_t += tsize_per_class
        except Exception as e:
            print('Unable to process data from', pickle_file, ':', e)
            raise
    
    return valid_dataset, valid_labels, train_dataset, train_labels
train_size = 200000
valid_size = 10000
valid_dataset, valid_labels, train_dataset, train_labels = merge_datasets(train_datasets, train_size, valid_size)

print('Training:', train_dataset.shape, train_labels.shape)
print('Validation:', valid_dataset.shape, valid_labels.shape)
test_size = 10000
_, _, test_dataset, test_labels = merge_datasets(test_datasets, test_size)

print('Testing:', test_dataset.shape, test_labels.shape)

Next, we’ll randomize the data. It’s important to have the labels well shuffled for the training and test distributions to match.

def randomize(dataset, labels):
    permutation = np.random.permutation(labels.shape[0]) # returns a new shuffled index array; the input is not modified
    shuffled_dataset = dataset[permutation,:,:]
    shuffled_labels = labels[permutation]
    return shuffled_dataset, shuffled_labels


train_dataset, train_labels = randomize(train_dataset, train_labels)
test_dataset, test_labels = randomize(test_dataset, test_labels)
valid_dataset, valid_labels = randomize(valid_dataset, valid_labels)

Problem 4

Convince yourself that the data is still good after shuffling!

letter_label_map = {i: chr(ord('A')+i) for i in range(10)}  # map labels 0-9 to letters A-J

def plot_shuffled_samples(datasets, labels, title=None):
    fig, axes = plt.subplots(4, 4, figsize=(10, 10))
    if title: 
        fig.suptitle(title, y=1.01, fontsize=16, fontweight='bold')
    sample_idx = random.sample(range(datasets.shape[0]), 16)
    x_samples = datasets[sample_idx]
    label_samples = labels[sample_idx]
    for i, idx in enumerate(sample_idx):
        r, c = divmod(i, 4)
        axes[r, c].imshow(datasets[idx])
        axes[r, c].set_title(letter_label_map[labels[idx]])
        axes[r, c].set_axis_off()
    fig.tight_layout()
    
plot_shuffled_samples(train_dataset, train_labels, title='shuffled training sets')  # train_dataset , train_labels
plot_shuffled_samples(test_dataset, test_labels, title='shuffled test sets')
plot_shuffled_samples(valid_dataset, valid_labels, title='shuffled valid sets')

Output: 4x4 grids of random samples with their letter labels, for the shuffled training, test, and validation sets.

Finally, let’s save the data for later reuse:

data_root = os.getcwd()
pickle_file = os.path.join(data_root, 'notMNIST.pickle')  # saved in the current working directory (here on the C: drive)

try:
    f = open(pickle_file, 'wb')
    save = {
        'train_dataset': train_dataset,
        'train_labels': train_labels,
        'valid_dataset': valid_dataset,
        'valid_labels': valid_labels,
        'test_dataset': test_dataset,
        'test_labels': test_labels,
    }
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
    f.close()
except Exception as e:
    print('Unable to save data to', pickle_file, ':', e)
    raise
statinfo = os.stat(pickle_file)
print('Compressed pickle size:', statinfo.st_size)
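
To reload the combined pickle later (for example in the next assignment), a minimal sketch using the pickle_file path and the dictionary keys saved above:

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)

train_dataset = save['train_dataset']
train_labels = save['train_labels']
valid_dataset = save['valid_dataset']
valid_labels = save['valid_labels']
test_dataset = save['test_dataset']
test_labels = save['test_labels']
print('Training set:', train_dataset.shape, train_labels.shape)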

Problem 5: Data cleaning, removing similar images

By construction, this dataset might contain a lot of overlapping samples, including training data that’s also contained in the validation and test set! Overlap between training and test can skew the results if you expect to use your model in an environment where there is never an overlap, but are actually ok if you expect to see training samples recur when you use it.

Measure how much overlap there is between training, validation and test samples.
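
As a rough first measurement, exact duplicates can be counted by hashing the raw bytes of each 28x28 array. This is a minimal sketch, assuming the train_dataset, valid_dataset and test_dataset arrays from the merge step above; it only catches bit-identical images, not near-duplicates:

import hashlib

def hash_set(dataset):
    """Return the set of MD5 digests of each image's raw bytes."""
    return set(hashlib.md5(img.tobytes()).hexdigest() for img in dataset)

train_hashes = hash_set(train_dataset)
valid_hashes = hash_set(valid_dataset)
test_hashes = hash_set(test_dataset)

print('train/valid exact overlap:', len(train_hashes & valid_hashes))
print('train/test  exact overlap:', len(train_hashes & test_hashes))
print('valid/test  exact overlap:', len(valid_hashes & test_hashes))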

Optional questions:

  • What about near duplicates between datasets? (images that are almost identical)
    Once each image has a 64-bit fingerprint, the fingerprints of different images can be compared by counting how many of the 64 bits differ. If no more than about 5 bits differ, the two images are very similar; if more than about 10 differ, they are different images (a short Hamming-distance sketch follows this list).
  • Create a sanitized validation and test set, and compare your accuracy on those in subsequent assignments.
    That is, remove the images that are near-duplicates of ones in the validation and test sets.
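
A minimal sketch of the bit-count comparison mentioned above, assuming the imagehash library: subtracting two ImageHash objects gives the number of differing bits (the Hamming distance), and the threshold of 5 is the rule of thumb quoted above. The Image.fromarray(..., mode='P') conversion mirrors the one used in the code below.

import imagehash
from PIL import Image

def is_near_duplicate(img_a, img_b, max_diff_bits=5):
    """Compare two 28x28 arrays via their 64-bit pHash fingerprints."""
    hash_a = imagehash.phash(Image.fromarray(img_a, mode='P'))
    hash_b = imagehash.phash(Image.fromarray(img_b, mode='P'))
    return (hash_a - hash_b) <= max_diff_bits  # ImageHash subtraction = number of differing bits

# e.g. compare one training image against one test image:
# print(is_near_duplicate(train_dataset[0], test_dataset[0]))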

Near-duplicate detection with a perceptual hash (pHash):
import imagehash
from PIL import Image

def extract_overlap(dataset_1, dataset_2): # find the images that overlap between two datasets
    dataset_1_hash = np.array([str(imagehash.phash(Image.fromarray(img, mode='P'))) for img in dataset_1]) # pHash of every image in dataset_1
    dataset_2_hash = np.array([str(imagehash.phash(Image.fromarray(img, mode='P'))) for img in dataset_2])
    overlap = {}
    for i, img_hash in enumerate(dataset_1_hash):
        duplicates = np.where(img_hash == dataset_2_hash)[0]  # indices in dataset_2 with the same hash; there may be several
        if len(duplicates) > 0:
            overlap[i] = duplicates  # e.g. {1: [5]} means image 1 in dataset_1 matches image 5 in dataset_2
    return overlap

def plot_duplicate(overlap):
    for key, value in overlap.items():  # iterate over the argument (the original looped over a global here)
        fig, (ax1, ax2) = plt.subplots(1, 2)  # 1 row, 2 columns
        fig.suptitle('duplicate samples')
        ax1.imshow(train_dataset[key])
        ax1.set_title('sample from training sets')
        ax1.set_axis_off()
        ax2.imshow(test_dataset[value[0]])
        ax2.set_title('sample from test sets')
        ax2.set_axis_off()

testOverlap = extract_overlap(train_dataset, test_dataset)  # hashing every training image is slow
plot_duplicate(testOverlap)

Output: pairs of matching images from the training and test sets, shown side by side.


Remove from the training set the images that are similar to ones in the validation and test sets:
def sanitize_data(dataset_1, dataset_2, labels_1):
    dataset_1_hash = np.array([str(imagehash.phash(Image.fromarray(img, mode='P'))) for img in dataset_1])
    dataset_2_hash = np.array([str(imagehash.phash(Image.fromarray(img, mode='P'))) for img in dataset_2])
    overlap = []
    for i, hash_img in enumerate(dataset_1_hash):
        duplicates = np.where(hash_img == dataset_2_hash)[0]
        if len(duplicates):
            overlap.append(i)
    return np.delete(dataset_1, overlap, 0), np.delete(labels_1, overlap)  # np.delete drops the rows listed in overlap from dataset_1 along axis 0 (each row is one image)

%time train_datasets_sanitized, train_labels_sanitized = sanitize_data(train_dataset, valid_dataset, train_labels)
print('Removed {nums} training images that duplicate validation images'.format(nums=train_dataset.shape[0] - train_datasets_sanitized.shape[0]))
mp_size = train_datasets_sanitized.shape[0]
train_datasets_sanitized, train_labels_sanitized = sanitize_data(train_datasets_sanitized, test_dataset, train_labels_sanitized)

print('Removed {nums} training images that duplicate test images'.format(nums=mp_size - train_datasets_sanitized.shape[0]))

Save the sanitized data to a file:
data_root = os.getcwd()
pickle_file = os.path.join(data_root, 'notMNIST_sanitized.pickle')

try:
    f = open(pickle_file, 'wb')
    save = {
        'train_datasets_sanitized': train_datasets_sanitized,
        'train_labels_sanitized': train_labels_sanitized,
        'valid_datasets': valid_dataset,
        'valid_labels': valid_labels,
        'test_datasets': test_dataset,
        'test_labels': test_labels,
    }
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
    f.close()
except Exception as e:
    print('Unable to save data to', pickle_file, ':', e)
    raise
statinfo = os.stat(pickle_file)
print('Compressed pickle size:', statinfo.st_size)

Problem 6: Training a model

Let’s get an idea of what an off-the-shelf classifier can give you on this data. It’s always good to check that there is something to learn, and that it’s a problem that is not so trivial that a canned solution solves it.

off-the-shelf = an algorithm that has already been implemented by someone else and is available in a library

Train a simple model on this data using 50, 100, 1000 and 5000 training samples. Hint: you can use the LogisticRegression model from sklearn.linear_model.

Optional question: train an off-the-shelf model on all the data

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score

def get_performance(clf, X_, y_, sample=1000):  # sample = how many examples to evaluate on
    random_indices = np.random.choice(np.arange(len(y_)), sample)
    sub_x = X_[random_indices]  # inputs
    sub_y = y_[random_indices]  # true labels
    y_hat = clf.predict(sub_x)  # predicted labels
    y_proba = clf.predict_proba(sub_x)  # class probabilities, needed for the multi-class ROC AUC (sklearn >= 0.22)
    print('precision is: {}'.format(precision_score(sub_y, y_hat, average='macro')))  # macro average over the 10 classes
    print('recall is: {}'.format(recall_score(sub_y, y_hat, average='macro')))
    print('roc_auc is: {}'.format(roc_auc_score(sub_y, y_proba, multi_class='ovr')))
    print('confusion matrix: \n{}'.format(confusion_matrix(sub_y, y_hat, labels=list(range(10)))))
import warnings
warnings.filterwarnings('ignore')

lenSample = train_labels_sanitized.shape[0]
X_valid = valid_dataset.reshape(valid_dataset.shape[0], -1)  # flatten each 28x28 validation image into a 1x784 vector

for sample_size in (50, 100, 1000, 5000):
    idx = random.sample(range(lenSample), sample_size)  # pick sample_size random training examples
    X_c_train = train_datasets_sanitized[idx].reshape(sample_size, -1)  # flatten each 28x28 image into a 1x784 vector
    y_c_train = train_labels_sanitized[idx]
    clfLR = LogisticRegression(n_jobs=4).fit(X_c_train, y_c_train)
    #get_performance(clfLR, X_=X_c_train, y_=y_c_train, sample=50)
    print(sample_size, 'training samples -> validation accuracy:', clfLR.score(X_valid, valid_labels))
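
For the optional question (training an off-the-shelf model on all the data), a minimal sketch, assuming the full sanitized training arrays fit in memory; the saga solver and max_iter value are illustrative choices, not part of the original assignment:

X_all = train_datasets_sanitized.reshape(train_datasets_sanitized.shape[0], -1)
y_all = train_labels_sanitized

clf_all = LogisticRegression(solver='saga', n_jobs=4, max_iter=100).fit(X_all, y_all)
print('validation accuracy:', clf_all.score(valid_dataset.reshape(valid_dataset.shape[0], -1), valid_labels))
print('test accuracy:', clf_all.score(test_dataset.reshape(test_dataset.shape[0], -1), test_labels))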

That's the end of this part; please continue with the next post: Deep Learning in Practice Series (2).
