Preparing datasets for few-shot learning with MAML

小刘同学_

Article link: https://blog.csdn.net/SweetSeven_/article/details/103477507

Few-shot learning

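Datasets

The two benchmark datasets most commonly used in few-shot learning are Omniglot and miniImagenet. However, the download links that can be found online are all hosted on Google Drive, and the miniImagenet download lacks the csv files that hold the label annotations, so the links I found are collected here.

miniImagenet

miniImagenet download links:

Baidu Cloud link: https://pan.baidu.com/s/1npRhZajLrLe6-KtSbJsa1A password: ztp5
Baidu Cloud downloads are rather slow, so you can also try Google Drive: https://drive.google.com/open?id=1HkgrkAwukzEZA0TpO7010PkAOREb2Nuk
The csv files can be obtained here: https://github.com/vieozhu/MAML-TensorFlow-1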

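I started out by running the MAML algorithm as a test and found that, in the code cbfinn provides on GitHub (https://github.com/cbfinn/maml.git), the data-processing part only works on Linux and fails on Windows. Replacing the os.system calls in proc_images.py with the corresponding os functions fixes this.
The code is posted below.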

"""
Script for converting from csv file datafiles to a directory for each image (which is how it is loaded by MAML code)

Acquire miniImagenet from Ravi & Larochelle '17, along with the train, val, and test csv files. Put the
csv files in the miniImagenet directory and put the images in the directory 'miniImagenet/images/'.
In other words: after extracting the downloaded miniImagenet archive, move its images folder into the
miniImagenet folder, next to proc_images.py, so that the script can process the data.
Then run this script from the miniImagenet directory:
    cd data/miniImagenet/
    python proc_images.py
"""
The docstring above is from Finn's own script. His original code, which works on Linux, is:
from __future__ import print_function
import csv
import glob
import os

from PIL import Image

path_to_images = 'images/'

all_images = glob.glob(path_to_images + '*')

# Resize images
for i, image_file in enumerate(all_images):
    im = Image.open(image_file)
    im = im.resize((84, 84), resample=Image.LANCZOS)
    im.save(image_file)
    if i % 500 == 0:
        print(i)

# Put in correct directory
for datatype in ['train', 'val', 'test']:
    os.system('mkdir ' + datatype)

    with open(datatype + '.csv', 'r') as f:
        reader = csv.reader(f, delimiter=',')
        last_label = ''
        for i, row in enumerate(reader):
            if i == 0:  # skip the headers
                continue
            label = row[1]
            image_name = row[0]
            if label != last_label:
                cur_dir = datatype + '/' + label + '/'
                os.system('mkdir ' + cur_dir)
                last_label = label
            os.system('mv images/' + image_name + ' ' + cur_dir)

The following version works on Windows:
from __future__ import print_function
import csv
import glob
import os

from PIL import Image

path_to_images = 'images/'

all_images = glob.glob(path_to_images + '*')

# Resize images
for i, image_file in enumerate(all_images):
    im = Image.open(image_file)
    im = im.resize((84, 84), resample=Image.LANCZOS)
    im.save(image_file)
    if i % 500 == 0:
        print(i)

# Put in correct directory
for datatype in ['train', 'val', 'test']:
    # os.mkdir replaces the shell call os.system('mkdir ...')
    os.mkdir(datatype)

    with open(datatype + '.csv', 'r') as f:
        reader = csv.reader(f, delimiter=',')
        last_label = ''
        for i, row in enumerate(reader):
            if i == 0:  # skip the headers
                continue
            label = row[1]
            image_name = row[0]
            if label != last_label:
                cur_dir = datatype + '/' + label + '/'
                os.mkdir(cur_dir)
                last_label = label
            # os.rename replaces the shell call os.system('mv ...')
            os.rename('images/' + image_name, cur_dir + image_name)
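
The Windows script stops with an error if it is run a second time, because os.mkdir fails on a directory that already exists. As a hedged sketch (Python 3 only, and not part of cbfinn/maml), the same sorting step can be written so that it is safe to re-run on any platform. The function name sort_images_by_label and its root parameter are hypothetical; the csv layout (file name in column 0, label in column 1, one header row) is taken from the script above.

```python
import csv
import os
import shutil


def sort_images_by_label(root, datatype):
    """Move images listed in <root>/<datatype>.csv into <root>/<datatype>/<label>/.

    Returns the number of images moved by this call.
    """
    csv_path = os.path.join(root, datatype + '.csv')
    moved = 0
    with open(csv_path, 'r', newline='') as f:
        reader = csv.reader(f, delimiter=',')
        next(reader, None)  # skip the header row
        for row in reader:
            image_name, label = row[0], row[1]
            cur_dir = os.path.join(root, datatype, label)
            # exist_ok=True: no error if the directory is already there
            os.makedirs(cur_dir, exist_ok=True)
            src = os.path.join(root, 'images', image_name)
            # Skip images that an earlier run has already moved
            if os.path.exists(src):
                shutil.move(src, os.path.join(cur_dir, image_name))
                moved += 1
    return moved
```

Calling sort_images_by_label('.', 'train') from the miniImagenet directory, and likewise for 'val' and 'test', would correspond to the loop in the script above.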

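Omniglot dataset

Download the whole GitHub project (94 MB), extract it, take the python version, create a new folder named data, and put all the archives into data.

Dataset overview

Omniglot is often jokingly called the transpose of MNIST. Can you see why? (MNIST has few classes with many examples each, whereas Omniglot has many classes with only a few examples each.) A brief introduction to the Omniglot dataset follows:

The Omniglot dataset contains 1623 different handwritten characters from 50 different alphabets. Each character was drawn online by 20 different people via Amazon's Mechanical Turk.

Each image is paired with stroke data: a sequence of [x, y, t] coordinates, with time (t) in milliseconds. The stroke data is only available in the matlab/ files.

Dataset citation: Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332-1338.

The Omniglot dataset contains 50 alphabets in total. They are usually split into a background set of 30 alphabets and an evaluation set of 20 alphabets.

A more challenging representation-learning task uses the smaller background sets "background small 1" and "background small 2". Each contains only 5 alphabets, which is closer to the experience an adult might have when learning characters in general.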
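
Few-shot methods such as MAML train on small N-way K-shot tasks drawn from a split like the background set. The following is a minimal sketch of such a sampler, not the sampling code of cbfinn/maml. It assumes the split has been extracted so that every class (one character, or one miniImagenet label) is a directory that directly contains its image files; the function name sample_task and all of its parameters are hypothetical.

```python
import os
import random


def sample_task(split_dir, n_way=5, k_shot=1, k_query=15, seed=None):
    """Sample one N-way K-shot task as (support, query) lists of (path, label)."""
    rng = random.Random(seed)
    # Treat every directory that directly contains files as one class.
    classes = {d: sorted(files) for d, _, files in os.walk(split_dir) if files}
    # Raises ValueError if the split has fewer than n_way classes.
    chosen = rng.sample(sorted(classes), n_way)
    support, query = [], []
    for task_label, class_dir in enumerate(chosen):
        # Draw disjoint support and query images for this class.
        picked = rng.sample(classes[class_dir], k_shot + k_query)
        paths = [os.path.join(class_dir, name) for name in picked]
        support += [(p, task_label) for p in paths[:k_shot]]
        query += [(p, task_label) for p in paths[k_shot:]]
    return support, query
```

Labels are re-assigned from 0 to n_way - 1 inside every task, so the same character can carry a different label in different tasks. With 20 images per Omniglot character, k_shot + k_query must not exceed 20.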

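To get a more intuitive feel for how Omniglot is organized, I analysed the dataset with the help of the brendenlake/omniglot source code and present the results as an .ipynb notebook. The concrete layout of the dataset can be seen in omniglot/python. The notebook 数据使用说明 (data usage notes) reads information about the dataset directly, without extracting the archives. If you prefer the command line, see dataloader.

Going further, if you want to test how well one-shot classification with the Modified Hausdorff distance works, as in the original paper, see one-shot-classification.

Finally, if you only want to browse the dataset online and do not want to download it, you can work with it on https://mybinder.org/, including running programs. The steps are:

Click Omniglot to enter the online editing mode;
The dataset is in the omniglot/ directory; the notebook 数据使用说明.ipynb (data usage notes) can be used to work with the Omniglot dataset;
The data for the one-shot test is in the omniglot/python/one-shot-classification directory. The notebook test_demo.ipynb can be used to run some tests.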
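
As an illustration of reading the dataset without extracting it, here is a minimal sketch using Python's standard zipfile module. It is not the author's notebook. It assumes an archive whose entries follow the layout <top folder>/<alphabet>/<character>/<image>.png; the function name summarize_omniglot_zip is hypothetical.

```python
import zipfile
from collections import defaultdict


def summarize_omniglot_zip(zip_path):
    """Return {alphabet: number of characters} without extracting the archive."""
    characters = defaultdict(set)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            parts = name.strip('/').split('/')
            # Assumed layout: <top folder>/<alphabet>/<character>/<image>.png
            if len(parts) == 4 and parts[3].lower().endswith('.png'):
                characters[parts[1]].add(parts[2])
    return {alphabet: len(chars) for alphabet, chars in characters.items()}
```

The number of keys in the returned dictionary is the number of alphabets in the archive, and the sum of its values is the number of characters.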
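
For readers unfamiliar with the Modified Hausdorff distance, the sketch below shows the usual definition: for each point set, average the distance from every point to its nearest point in the other set, then take the larger of the two averages. This is a plain-Python illustration rather than the code from the one-shot-classification demo; it assumes each image has already been converted to a non-empty list of (x, y) coordinates of its inked pixels, and the function name modified_hausdorff is hypothetical.

```python
import math


def modified_hausdorff(points_a, points_b):
    """Modified Hausdorff distance between two non-empty lists of (x, y) points."""

    def mean_min_dist(src, dst):
        # Average, over src, of the distance to the nearest point in dst.
        total = 0.0
        for px, py in src:
            total += min(math.hypot(px - qx, py - qy) for qx, qy in dst)
        return total / len(src)

    # Take the larger of the two directed averages so the result is symmetric.
    return max(mean_min_dist(points_a, points_b),
               mean_min_dist(points_b, points_a))
```

In a one-shot classification run, a test image would be assigned the label of the training image with the smallest distance. The nested loops make this version quadratic in the number of points, so it is meant for illustration only.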
