Link to the assignment
This assignment has six sub-problems. The three code cells before Problem 1 preprocess the images: they load the required libraries, download the training-set archives from the given URL, and extract them. Just run those three cells as-is to finish all the preparation. Extraction takes a while, so don't kill the program halfway through; if you do, the next run will raise an error (the half-extracted folder is treated as already present). On success you will see output like this:
Found and verified .\notMNIST_large.tar.gz
Found and verified .\notMNIST_small.tar.gz
.\notMNIST_large already present - Skipping extraction of .\notMNIST_large.tar.gz.
['.\\notMNIST_large\\A', '.\\notMNIST_large\\B', '.\\notMNIST_large\\C', '.\\notMNIST_large\\D', '.\\notMNIST_large\\E', '.\\notMNIST_large\\F', '.\\notMNIST_large\\G', '.\\notMNIST_large\\H', '.\\notMNIST_large\\I', '.\\notMNIST_large\\J']
.\notMNIST_small already present - Skipping extraction of .\notMNIST_small.tar.gz.
['.\\notMNIST_small\\A', '.\\notMNIST_small\\B', '.\\notMNIST_small\\C', '.\\notMNIST_small\\D', '.\\notMNIST_small\\E', '.\\notMNIST_small\\F', '.\\notMNIST_small\\G', '.\\notMNIST_small\\H', '.\\notMNIST_small\\I', '.\\notMNIST_small\\J']
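For reference, the three preprocessing cells roughly amount to the sketch below. This is my own condensed reconstruction of what the notebook does (download once, verify the size, extract once), not the notebook's verbatim code; the helper names and the size check are assumptions.

```python
import os
import tarfile
from urllib.request import urlretrieve

def maybe_download(url, filename, expected_bytes):
    """Download filename from url unless it is already present, then verify its size."""
    if not os.path.exists(filename):
        urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size != expected_bytes:
        raise Exception('Failed to verify ' + filename)
    print('Found and verified', filename)
    return filename

def maybe_extract(filename):
    """Extract the tarball once; re-running skips folders that already exist."""
    root = os.path.splitext(os.path.splitext(filename)[0])[0]  # strip ".tar.gz"
    if os.path.isdir(root):
        print('%s already present - Skipping extraction of %s.' % (root, filename))
    else:
        with tarfile.open(filename) as tar:
            tar.extractall()
    # one subfolder per letter class, A through J
    return sorted(os.path.join(root, d) for d in os.listdir(root))
```

The "already present - Skipping extraction" branch is what produces the log lines shown above on a second run, and it is also why killing the program mid-extraction breaks the next run: the folder exists but is incomplete.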
Problem 1. The task is to write code that displays some of the images. I couldn't figure out how to display a whole batch of them at once (so bad I could cry), so here is the single-image version:
from PIL import Image
# the argument is path + file name; use forward slashes between directories,
# since the backslash is an escape character in Python string literals
img = Image.open('F:/python/19.01.24/venv/notMNIST_small/A/MDEtMDEtMDAudHRm.png')
img.show()
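For the batch display skipped above, here is one possible sketch: walk each letter folder under the extracted dataset, pick a few random PNGs, and lay them out in a grid with plt.subplot. The root path in the usage comment is just an example; the function name is my own.

```python
import os
import random
import matplotlib.pyplot as plt
from PIL import Image

def show_samples(root_dir, num_per_class=3):
    """Display num_per_class random images from each letter folder under root_dir."""
    letters = sorted(d for d in os.listdir(root_dir)
                     if os.path.isdir(os.path.join(root_dir, d)))
    for row, letter in enumerate(letters):
        folder = os.path.join(root_dir, letter)
        files = random.sample(os.listdir(folder), num_per_class)
        for col, fname in enumerate(files):
            # one row per letter, one column per sample
            plt.subplot(len(letters), num_per_class, row * num_per_class + col + 1)
            plt.imshow(Image.open(os.path.join(folder, fname)), cmap='gray')
            plt.axis('off')
    plt.show()

# e.g. show_samples('F:/python/19.01.24/venv/notMNIST_small/', 3)
```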
Problem 2. Visualize an image from the dataset after it has been converted to a pickle file:
import pickle
import matplotlib.pyplot as plt

with open("F:/python/19.01.24/venv/notMNIST_large/B.pickle", 'rb') as pk_f:
    show_pickle = pickle.load(pk_f)  # pickle.load() deserializes the file back into a Python object
# the first index is the image's position in the array; Python indexing starts at 0,
# so index 2 selects the third image (which here looks like the traditional character "术",
# as the screenshot shows):
img = show_pickle[2, :, :]
plt.imshow(img)  # imshow() renders the 2-D array as a (grayscale) image
plt.show()       # show() puts the figure on screen
The output looks like this:
Problem 3. Check the size of each class's dataset and make sure the classes are roughly balanced.
Reference blog 3
Reference blog 4
As a side note, here is how to walk a directory in Python and print the sizes of files of a given type:
import os
fileList = os.listdir("F:/python/19.01.24/venv/notMNIST_large/")
for filename in fileList:
    pathTmp = os.path.join("F:/python/19.01.24/venv/notMNIST_large/", filename)  # join the directory and the entry into a full path
    if os.path.isfile(pathTmp):  # check it is a regular file (use isdir to check for directories)
        print(pathTmp)                   # print the path
        print(os.path.getsize(pathTmp))  # print the pickle file's size in bytes
Output:
As you can see, the datasets for the individual letters are all roughly the same size.
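File size is only a proxy for balance. A more direct check (a sketch, assuming each per-letter .pickle holds a single 3-D image array, as produced by the preprocessing cells; the function name is my own) counts the samples in each class:

```python
import os
import pickle
import numpy as np

def check_balance(pickle_dir):
    """Print the number of samples in each per-class pickle; counts should be close."""
    counts = {}
    for fname in sorted(os.listdir(pickle_dir)):
        if not fname.endswith('.pickle'):
            continue
        with open(os.path.join(pickle_dir, fname), 'rb') as f:
            data = pickle.load(f)  # a (num_images, 28, 28) array
        counts[fname] = data.shape[0]
        print(fname, data.shape[0], 'images, mean pixel', data.mean())
    return counts
```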
Problem 4. Label the data (A-J map to 0-9) and create a validation set for hyperparameter tuning. Now let's verify that the resulting data is intact and check whether it is shuffled:
def showdata_dataset(data_set, num):
    if num <= 0:
        print("wrong input")
        return
    img_list = data_set[0:num, :, :]  # take the first num images
    for index, img in enumerate(img_list):
        plt.subplot(num // 5 + 1, 5, index + 1)  # args: number of rows, number of columns, 1-based position of this subplot
        plt.imshow(img)  # render the grayscale image
    plt.show()  # put the whole grid on screen

showdata_dataset(train_dataset, 50)
Output:
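If the check above shows the data is not shuffled, a common fix is to apply one random permutation to the images and their labels together, so each image keeps its label. A minimal sketch (the function name is my own):

```python
import numpy as np

def randomize(dataset, labels):
    """Shuffle images and labels with the same permutation so each pair stays aligned."""
    permutation = np.random.permutation(labels.shape[0])
    return dataset[permutation, :, :], labels[permutation]
```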
Problem 5. Some of the samples are duplicated across the training, validation, and test sets; the task is to count how many duplicates there are.
Reference blog: Udacity DeepLearning Course Assignment 1 - notMNIST
import matplotlib.pyplot as plt
import numpy as np
import hashlib

# 1. find the overlap between two datasets by hashing each image
def extract_overlap_from(dataset_1, dataset_2):
    hash_dataset_1 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_1])
    hash_dataset_2 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_2])
    overlap = {}
    for index, hash in enumerate(hash_dataset_1):
        duplicate = np.where(hash_dataset_2 == hash)
        if len(duplicate[0]):
            overlap[index] = duplicate[0]
    return overlap
# 2. display a few of the overlapping images
def display_overlap(overlap, source_dataset, target_dataset):
    overlap = {k: v for k, v in overlap.items() if len(v) >= 3}
    item = np.random.choice(list(overlap.keys()))
    imgs = np.concatenate(([source_dataset[item]], target_dataset[overlap[item][0:3]]))
    plt.suptitle(item)  # note: suptitle, not subtitle
    for i, img in enumerate(imgs):
        plt.subplot(1, 4, i + 1)
        plt.imshow(img)
    plt.show()
# 3. sanitize the data: drop every image in dataset_1 that also appears in dataset_2
def sanit(dataset_1, dataset_2, Label_1):
    hash_dataset_1 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_1])
    hash_dataset_2 = np.array([hashlib.sha256(img).hexdigest() for img in dataset_2])
    overlap = []
    for i, hash in enumerate(hash_dataset_1):
        duplicates = np.where(hash_dataset_2 == hash)
        if len(duplicates[0]):
            overlap.append(i)
    return np.delete(dataset_1, overlap, 0), np.delete(Label_1, overlap, None)
# find and display the part of the test set that overlaps with the training set
overlap_test = extract_overlap_from(test_dataset, train_dataset)
print('Number of overlaps with test set: ', len(overlap_test.keys()))
display_overlap(overlap_test, test_dataset, train_dataset)
# find and display the part of the validation set that overlaps with the training set
overlap_valid = extract_overlap_from(valid_dataset, train_dataset)
print('Number of overlaps with validation set: ', len(overlap_valid.keys()))
display_overlap(overlap_valid, valid_dataset, train_dataset)
test_sanit, testlabel_sanit = sanit(test_dataset, train_dataset, test_labels)
valid_sanit, validlabel_sanit = sanit(valid_dataset, train_dataset, valid_labels)
print('original testsize:', test_dataset.shape, ' sanit_test:', test_sanit.shape)
print('overlapping images removed from test_dataset: ', len(test_dataset) - len(test_sanit))
print('original validsize:', valid_dataset.shape, ' sanit_valid:', valid_sanit.shape)
print('overlapping images removed from valid_dataset: ', len(valid_dataset) - len(valid_sanit))
print('trainsize: ', train_dataset.shape)
pickle_file_sanit = 'notMNIST_sanit.pickle'
try:
    f = open(pickle_file_sanit, 'wb')
    save = {
        'train_dataset': train_dataset,
        'train_labels': train_labels,
        'valid_dataset': valid_sanit,      # store the sanitized validation and test sets,
        'valid_labels': validlabel_sanit,  # not the originals
        'test_dataset': test_sanit,
        'test_labels': testlabel_sanit
    }
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
    f.close()
except Exception as e:
    print('Unable to save data to ', pickle_file_sanit, ':', e)
    raise
statinfo = os.stat(pickle_file_sanit)
print('Compressed pickle size:', statinfo.st_size)
Screenshot of the run:
Conclusion: the validation and test sets each originally contain 10000 samples, but 1324 and 1067 of them respectively overlap with the training set, so after de-duplication they contain 8676 and 8933 samples.
Problem 6:
from sklearn.linear_model import LogisticRegression

def train_model(sample_size):
    print(sample_size)
    X_train = train_dataset[:sample_size].reshape(sample_size, 28 * 28)
    Y_label = train_labels[:sample_size]
    LR = LogisticRegression()
    LR.fit(X_train, Y_label)
    # note: the test set is truncated to the same sample_size here
    X_test = test_sanit[:sample_size].reshape(sample_size, 28 * 28)
    Y_testlabel = testlabel_sanit[:sample_size]
    print('accuracy:', LR.score(X_test, Y_testlabel), ',when sample_size=', sample_size)

for sample_size in [50, 100, 1000, 5000, len(test_sanit)]:
    train_model(sample_size)

Output:
50
accuracy: 0.36 ,when sample_size= 50
100
accuracy: 0.67 ,when sample_size= 100
1000
accuracy: 0.839 ,when sample_size= 1000
5000
accuracy: 0.843 ,when sample_size= 5000
8676
accuracy: 0.8433609958506224 ,when sample_size= 8676
Today I increased the training set from 200,000 to 400,000 images; the predictions now look like this:
50
accuracy: 0.78 ,when sample_size= 50
100
accuracy: 0.77 ,when sample_size= 100
1000
accuracy: 0.819 ,when sample_size= 1000
5000
accuracy: 0.821 ,when sample_size= 5000
7802
accuracy: 0.8361958472186619 ,when sample_size= 7802
You can see that after enlarging the training set, accuracy on the small test subsets improved quite a bit, while accuracy on the full test set actually went down. I still don't fully understand why; one caveat is that the larger training set removes more duplicates during sanitization (the sanitized test set shrank from 8676 to 7802 samples), so the two full-test-set accuracies are not measured on the same data.
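Another caveat with the Problem 6 loop: each model is scored only on the first sample_size sanitized test images, so the printed accuracies come from differently sized test subsets. A variant (a sketch reusing the post's variable layout, which you would call with train_dataset, train_labels, test_sanit and testlabel_sanit) that always scores on the full test set makes the numbers directly comparable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_and_score(train_X, train_y, test_X, test_y, sample_size):
    """Fit on the first sample_size training images, score on the FULL test set."""
    clf = LogisticRegression(max_iter=1000)
    # flatten each (28, 28) image into a 784-dimensional feature vector
    clf.fit(train_X[:sample_size].reshape(sample_size, -1), train_y[:sample_size])
    return clf.score(test_X.reshape(len(test_X), -1), test_y)
```

With a fixed evaluation set, any remaining accuracy differences between the 200k and 400k runs would reflect the training data alone.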