ud730 Assignment: notMNIST

Assignment link
This assignment has six sub-problems. The three code cells before Problem 1 are image preprocessing code: they load the required libraries, download the training-set archives from the given URL, and extract them. Just copy those three cells and run them to complete all the setup. Extraction takes a while, so don't kill the program partway through; if you do, the next run will throw an error. On success you should see output like the following:

Found and verified .\notMNIST_large.tar.gz
Found and verified .\notMNIST_small.tar.gz
.\notMNIST_large already present - Skipping extraction of .\notMNIST_large.tar.gz.
['.\\notMNIST_large\\A', '.\\notMNIST_large\\B', '.\\notMNIST_large\\C', '.\\notMNIST_large\\D', '.\\notMNIST_large\\E', '.\\notMNIST_large\\F', '.\\notMNIST_large\\G', '.\\notMNIST_large\\H', '.\\notMNIST_large\\I', '.\\notMNIST_large\\J']
.\notMNIST_small already present - Skipping extraction of .\notMNIST_small.tar.gz.
['.\\notMNIST_small\\A', '.\\notMNIST_small\\B', '.\\notMNIST_small\\C', '.\\notMNIST_small\\D', '.\\notMNIST_small\\E', '.\\notMNIST_small\\F', '.\\notMNIST_small\\G', '.\\notMNIST_small\\H', '.\\notMNIST_small\\I', '.\\notMNIST_small\\J']

Problem 1 asks us to write code to display some of the images. I couldn't figure out how to display a whole batch of them at once (embarrassing, I know), but showing a single image is easy; a sketch for displaying one sample per class follows the snippet below:

from PIL import Image
# the argument is path + filename; use forward slashes between directories
# rather than backslashes, since a backslash starts an escape sequence in
# Python string literals (alternatively, use a raw string like r'F:\...')
img = Image.open('F:/python/19.01.24/venv/notMNIST_small/A/MDEtMDEtMDAudHRm.png')
img.show()
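
As for displaying a batch of images, here is a minimal sketch, assuming the archive was extracted to ./notMNIST_small as in the log above; it picks one random PNG from each letter folder and shows them in a row:

import os
import random
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

root = './notMNIST_small'           # assumed extraction path; adjust to yours
letters = sorted(os.listdir(root))  # the class folders 'A' .. 'J'

plt.figure(figsize=(10, 2))
for i, letter in enumerate(letters):
    folder = os.path.join(root, letter)
    fname = random.choice(os.listdir(folder))
    plt.subplot(1, len(letters), i + 1)
    # a handful of notMNIST PNGs are corrupt; just re-run if imread fails
    plt.imshow(mpimg.imread(os.path.join(folder, fname)), cmap='gray')
    plt.axis('off')
    plt.title(letter)
plt.show()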

Problem 2: visualize the images after they have been converted to pickle files

Reference blog post

import pickle
import matplotlib.pyplot as plt

with open("F:/python/19.01.24/venv/notMNIST_large/B.pickle", 'rb') as pk_f:
    # pickle.load() deserializes the file back into a Python object
    # (here, a (num_images, 28, 28) numpy array)
    show_pickle = pickle.load(pk_f)
# the first index selects an image in the array; numpy indexing is 0-based,
# so index 2 is the third image in the file, which happens to look like the
# traditional form of the character 术, as shown in the screenshot below:
img = show_pickle[2, :, :]
plt.imshow(img)  # imshow() renders the image array
plt.show()       # show() puts the figure on screen

Result: (screenshot of the rendered image)


Problem 3: check the size of each dataset and make sure the per-letter datasets are all about the same size.
Reference blog 3
Reference blog 4
As a side note, here is how to walk a directory in Python and print the sizes of files of a given type:

import os

fileList = os.listdir("F:/python/19.01.24/venv/notMNIST_large/")
for filename in fileList:
    # join the directory and the entry name into a full path
    pathTmp = os.path.join("F:/python/19.01.24/venv/notMNIST_large/", filename)
    if os.path.isfile(pathTmp):  # keep only files (use isdir to test for directories)
        print(pathTmp)                   # the path
        print(os.path.getsize(pathTmp))  # the pickle file's size in bytes

Result: (screenshot of the printed paths and file sizes)

As you can see, the datasets for the different letters are all roughly the same size.
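
Note that file size is only a rough proxy for balance. For an exact count, a small sketch (assuming the same path as above, and that each .pickle holds a (num_images, 28, 28) array as in the course notebook) can load each pickle and print its length:

import os
import pickle

root = 'F:/python/19.01.24/venv/notMNIST_large/'
for filename in sorted(os.listdir(root)):
    if filename.endswith('.pickle'):
        with open(os.path.join(root, filename), 'rb') as f:
            data = pickle.load(f)  # a (num_images, 28, 28) array
        print(filename, data.shape[0], 'samples')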


Problem 4: label the data, with A-J mapped to 0-9, and carve out a validation set for hyperparameter tuning. Now let's verify that the resulting data is intact, and check whether it has been shuffled:

Reference blog post

import matplotlib.pyplot as plt

def showdata_dataset(data_set, num):
    if num <= 0:
        print("wrong input")
        return
    img_list = data_set[0:num, :, :]  # take the first num images
    for index, img in enumerate(img_list):
        # subplot(nrows, ncols, index): number of rows, number of columns,
        # and the (1-based) position of this image in the grid
        plt.subplot((num + 4) // 5, 5, index + 1)
        plt.imshow(img)  # render the image
    plt.show()  # put the figure on screen

showdata_dataset(train_dataset, 50)

Result: (screenshot of the first 50 training images)
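
For the second half of the problem, whether the data is shuffled, it also helps to look at the labels of those same 50 images (assuming train_labels from the notebook, with integer labels and A-J mapped to 0-9); a shuffled dataset should show a mix of letters rather than long runs of one class:

print(train_labels[:50])
# map the numeric labels back to letters for readability
print([chr(ord('A') + int(label)) for label in train_labels[:50]])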
Problem 5:
(A quick vocabulary aside: validation, as in the validation set.)
The point is that some of these samples are duplicated across the datasets, and the task is to count how many duplicates there are.
Reference blog: Udacity深度学习DeepLearning课程作业1—notMnist

import matplotlib.pyplot as plt
import numpy as np
import hashlib
import pickle
import os


# 1. extract the overlap between two datasets
def extract_overlap_from(dataset_1, dataset_2):
    # hash each image so duplicates can be found by comparing digests
    hash_dataset_1 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_1])
    hash_dataset_2 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_2])
    overlap = {}
    for index, h in enumerate(hash_dataset_1):
        duplicate = np.where(hash_dataset_2 == h)
        if len(duplicate[0]):
            overlap[index] = duplicate[0]
    return overlap

# 2. display one overlapping sample next to its duplicates
def display_overlap(overlap, source_dataset, target_dataset):
    # keep only items with at least 3 duplicates so there is something to show
    overlap = {k: v for k, v in overlap.items() if len(v) >= 3}
    item = np.random.choice(list(overlap.keys()))
    imgs = np.concatenate(([source_dataset[item]], target_dataset[overlap[item][0:3]]))
    plt.suptitle(item)  # note: suptitle, not subtitle
    for i, img in enumerate(imgs):
        plt.subplot(1, 4, i + 1)
        plt.imshow(img)
    plt.show()

# 3. sanitize: remove from dataset_1 every image that also appears in dataset_2
def sanit(dataset_1, dataset_2, Label_1):
    hash_dataset_1 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_1])
    hash_dataset_2 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_2])
    overlap = []
    for i, h in enumerate(hash_dataset_1):
        duplicates = np.where(hash_dataset_2 == h)
        if len(duplicates[0]):
            overlap.append(i)
    return np.delete(dataset_1, overlap, 0), np.delete(Label_1, overlap, None)

# extract and display the samples the test set shares with the training set
overlap_test = extract_overlap_from(test_dataset, train_dataset)
print('Number of overlap of test : ', len(overlap_test.keys()))
display_overlap(overlap_test, test_dataset, train_dataset)
# extract and display the samples the validation set shares with the training set
overlap_valid = extract_overlap_from(valid_dataset, train_dataset)
print('Number of overlap of valid : ', len(overlap_valid.keys()))
display_overlap(overlap_valid, valid_dataset, train_dataset)

test_sanit,testlabel_sanit=sanit(test_dataset,train_dataset,test_labels)
valid_sanit,validlabel_sanit=sanit(valid_dataset,train_dataset,valid_labels)

print('original testsize:',test_dataset.shape,' sanit_test:',test_sanit.shape)
print('overlapping images removed from test_dataset: ',len(test_dataset)-len(test_sanit))
print('original validsize:',valid_dataset.shape,' sanit_valid:',valid_sanit.shape)
print('overlapping images removed from valid_dataset: ',len(valid_dataset)-len(valid_sanit))
print('trainsize: ',train_dataset.shape)

pickle_file_sanit = 'notMNIST_sanit.pickle'
try:
    f = open(pickle_file_sanit, 'wb')
    save = {
        'train_dataset': train_dataset,
        'train_labels': train_labels,
        # save the deduplicated validation/test sets, not the originals
        'valid_dataset': valid_sanit,
        'valid_labels': validlabel_sanit,
        'test_dataset': test_sanit,
        'test_labels': testlabel_sanit
    }
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
    f.close()
except Exception as e:
    print('Unable to save data to ', pickle_file_sanit, ':', e)
    raise
statinfo = os.stat(pickle_file_sanit)
print('Compressed pickle size:', statinfo.st_size)

Run screenshot: (overlap counts and sample duplicate images)
The conclusion: the test and validation sets each originally contained 10,000 samples, but they share 1,324 and 1,067 samples respectively with the training set, so after deduplication they contain 8,676 and 8,933 samples.
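
As an aside, extract_overlap_from and sanit compare every hash against the whole training hash array with np.where, which is O(n·m) and slow on the 200,000-image training set. A faster variant (a sketch, taking the same inputs as sanit above) uses a Python set for O(1) membership tests:

import hashlib

def sanit_fast(dataset_1, dataset_2, labels_1):
    # hash every image in dataset_2 once; membership tests are then O(1)
    # instead of scanning the whole hash array with np.where per image
    hashes_2 = set(hashlib.sha256(img).hexdigest() for img in dataset_2)
    keep = [i for i, img in enumerate(dataset_1)
            if hashlib.sha256(img).hexdigest() not in hashes_2]
    return dataset_1[keep], labels_1[keep]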

Problem 6: train an off-the-shelf classifier (sklearn's LogisticRegression) on growing slices of the training data:

from sklearn.linear_model import LogisticRegression

def train_model(sample_size):
    print(sample_size)
    X_train = train_dataset[:sample_size].reshape(sample_size, 28 * 28)
    Y_label = train_labels[:sample_size]
    LR = LogisticRegression()
    LR.fit(X_train, Y_label)

    # note: this also truncates the *test* set to sample_size, so the
    # small-sample accuracies below are measured on only 50/100/... images
    X_test = test_sanit[:sample_size].reshape(sample_size, 28 * 28)
    Y_testlabel = testlabel_sanit[:sample_size]
    print('accuracy:', LR.score(X_test, Y_testlabel), ',when sample_size=', sample_size)

for sample_size in [50, 100, 1000, 5000, len(test_sanit)]:
    train_model(sample_size)

Results:

50
accuracy: 0.36 ,when sample_size= 50
100
accuracy: 0.67 ,when sample_size= 100
1000
accuracy: 0.839 ,when sample_size= 1000
5000
accuracy: 0.843 ,when sample_size= 5000
8676
accuracy: 0.8433609958506224 ,when sample_size= 8676

Today I increased the training set from 200,000 to 400,000 images and reran; the results:

50
accuracy: 0.78 ,when sample_size= 50
100
accuracy: 0.77 ,when sample_size= 100
1000
accuracy: 0.819 ,when sample_size= 1000
5000
accuracy: 0.821 ,when sample_size= 5000
7802
accuracy: 0.8361958472186619 ,when sample_size= 7802

As you can see, after enlarging the training set, accuracy on the small test subsets improved a lot, while accuracy on the full test set actually dropped; I still don't understand why. (Two caveats worth noting: since train_model truncates the test set to sample_size, the 50- and 100-sample accuracies are measured on only 50 and 100 images and are very noisy; and the second run's "full" test set is a different, smaller set of 7,802 images, because the larger training set overlaps with more test samples, so the two final accuracies are not measured on the same data.)
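
One way to make the numbers comparable across runs (a sketch, untested, reusing the variables from above) is to always evaluate on the full sanitized test set rather than truncating it to sample_size:

from sklearn.linear_model import LogisticRegression

# evaluate every model on the same, full deduplicated test set
X_test = test_sanit.reshape(len(test_sanit), 28 * 28)
for sample_size in [50, 100, 1000, 5000, 10000]:
    LR = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
    LR.fit(train_dataset[:sample_size].reshape(sample_size, 28 * 28),
           train_labels[:sample_size])
    print('sample_size', sample_size, 'accuracy:', LR.score(X_test, testlabel_sanit))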
