MIR-Flickr25K Dataset Preprocessing

HackerTom

Last modified: 2022-06-20 16:39:14

First published: 2019-08-05 09:44:21

Article link: https://blog.csdn.net/hackertom/article/details/98477506

GitHub: iTomxy/data/flickr25k

Notes

MIR-Flickr25K has 25,000 images, each with its own tags and annotations.
The tags can serve as the text description (text); 1,386 of them appear in at least 20 images.
The annotations serve as labels, 24 in total.

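There are two conventions for the number of annotations / labels:

24, as in DCMH [4], which drops the _r1.txt files. A look inside those _r1.txt files shows that their contents are already contained in the corresponding files without the _r1 suffix; for example, everything in baby_r1.txt also appears in baby.txt.
38, as in [6], which presumably treats files such as baby.txt and baby_r1.txt as separate classes.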

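I do not know what the actual difference between the files with and without _r1 is. README.txt calls the ones without the suffix POTENTIAL LABELS and the ones with it RELEVANT LABELS.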
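
The containment claim is easy to check directly. The sketch below assumes that each annotation file holds one image number per line, as the Prepare section describes; `read_ids` and `is_contained` are hypothetical helper names, not part of the dataset tooling.

```python
from os.path import join


def read_ids(path):
    """Read one annotation file into a set of image numbers (one integer per line)."""
    with open(path, "r") as f:
        return {int(line) for line in f if line.strip()}


def is_contained(lab_dir, name):
    """True if every image number in <name>_r1.txt also appears in <name>.txt."""
    relevant = read_ids(join(lab_dir, name + "_r1.txt"))
    potential = read_ids(join(lab_dir, name + ".txt"))
    return relevant <= potential  # subset test on sets
```

For example, `is_contained("mirflickr25k_annotations_v080", "baby")` would test the baby.txt / baby_r1.txt pair mentioned above.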

In the end, each image is turned into a 4096-D VGG19 feature, each text into a 1386-D BoW vector, and each label into a 24-D 0/1 vector.

[4] provides an already processed Flickr dataset, but it contains only 20,015 samples.

Prepare

Download mirflickr25k.zip and mirflickr25k_annotations_v080.zip, and extract them to mirflickr/ and mirflickr25k_annotations_v080/.

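The images are directly under mirflickr/, with numbered file names: im*.jpg.
mirflickr/doc/ contains common_tags.txt, which lists the 1,386 tags mentioned above together with their frequencies.
mirflickr/meta/tags/ holds the processed tags of each image (lowercased, spaces removed, ...); these are the tags used here.
mirflickr25k_annotations_v080/ holds the annotation files; leaving out the *_r1.txt files and one README.txt, there are 24 of them. Each file contains a list of image numbers, meaning that those images belong to this annotation.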
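
Before running the code, it can help to confirm that the extracted layout matches this description. A minimal sketch, assuming both archives were extracted under one root directory and that the tag files are named tags*.txt (which is what the sort key in the Text / Tag code implies); `count_files` is a hypothetical helper.

```python
from glob import glob
from os.path import basename, join


def count_files(root):
    """Count the files that the rest of the pipeline relies on under `root`."""
    lab_dir = join(root, "mirflickr25k_annotations_v080")
    names = [basename(f) for f in glob(join(lab_dir, "*.txt"))]
    # drop README.txt and the *_r1.txt files, as described above
    labels = [n for n in names if "README" not in n and "_r1" not in n]
    return {
        "images": len(glob(join(root, "mirflickr", "im*.jpg"))),
        "tag_files": len(glob(join(root, "mirflickr", "meta", "tags", "tags*.txt"))),
        "label_files": len(labels),
    }
```

If the description above holds, a complete download should give 25,000 images and 24 label files.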

Code

Common

from os.path import join
from os import listdir, makedirs
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
%matplotlib inline

P = "/usr/local/dataset/flickr"
BASE = join(P, 'mirflickr')
IMG_P = BASE  # image directory
TXT_P = join(BASE, 'meta/tags')  # text (tag) directory
LAB_P = join(P, 'mirflickr25k_annotations_v080')  # label directory
COM_TAG_F = join(BASE, 'doc/common_tags.txt')  # common tags file
N_DATA = 25000

Image

from imageio import imread, imwrite
# from scipy.misc import imresize
from skimage.transform import resize
from keras.applications.vgg19 import VGG19
from keras.models import Model

# load VGG19
vgg = VGG19(weights='imagenet')
# vgg.summary()
vgg = Model(vgg.input, vgg.get_layer('fc2').output)  # use the 4096-D fc2 output as the feature
vgg.summary()

# image file list
key_img = lambda s: int(s.split('.jpg')[0].split('im')[-1])  # image number, defined before it is used
fs_img = [f for f in listdir(IMG_P) if '.jpg' in f]
fs_img = sorted(fs_img, key=key_img)  # sort by image number
fs_img = [join(IMG_P, f) for f in fs_img]

N_IMG = len(fs_img)
print("#images:", N_IMG)

# extract image features
all_img = []
for i in range(N_IMG):
    im = imread(fs_img[i])
    im = resize(im, (224, 224, 3))
    im = np.expand_dims(im, axis=0)  # add the batch axis
    all_img.append(vgg.predict(im))

all_img = np.vstack(all_img)
print("images shape:", all_img.shape)

# save
np.save(join(P, "images.npy"), all_img)
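
One detail worth double-checking in the loop above: with its default arguments, skimage's `resize` returns floats in [0, 1], while the Keras ImageNet VGG19 weights are normally paired with `keras.applications.vgg19.preprocess_input`, which expects RGB pixels in [0, 255], flips them to BGR and subtracts the ImageNet channel means. The NumPy sketch below mimics that step as far as I understand it. `to_vgg_input` is a hypothetical helper, and applying it will give features that differ from the ones produced by the loop as written.

```python
import numpy as np

# ImageNet per-channel means in BGR order (Keras "caffe"-style preprocessing)
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def to_vgg_input(im01):
    """im01: (H, W, 3) RGB float array in [0, 1], e.g. the output of skimage resize."""
    x = im01.astype(np.float32) * 255.0  # back to the [0, 255] range
    x = x[..., ::-1]                     # RGB -> BGR
    x = x - IMAGENET_MEAN_BGR            # zero-center each channel
    return np.expand_dims(x, axis=0)     # add the batch axis: (1, H, W, 3)
```

In the loop, this would replace the `np.expand_dims` line, i.e. `vgg.predict(to_vgg_input(im))`.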

Text / Tag

# parse the common tags: tag <-> index mappings
tag_idx, idx_tag = {}, {}
cnt = 0
with open(COM_TAG_F, 'r') as f:
    for line in f:
        line = line.split()  # <tag> <frequency>
        tag_idx[line[0]] = cnt
        idx_tag[cnt] = line[0]
        cnt += 1
DIM_TXT = len(tag_idx)
print("text dim:", DIM_TXT)

# text file list
key_txt = lambda s: int(s.split('.txt')[0].split('tags')[-1])  # sort by image number
fs_tags = sorted(listdir(TXT_P), key=key_txt)
fs_tags = [join(TXT_P, f) for f in fs_tags]

N_TXT = len(fs_tags)
print("#texts:", N_TXT)


def get_tags(tag_f):
    """Read a tag file and return the common tags of that sample."""
    tg = []
    with open(tag_f, 'r') as f:
        for line in f:
            a = line.strip()
            if a in tag_idx:  # keep common tags only
                tg.append(a)
    return tg


# build the BoW matrix
all_txt = np.zeros((N_TXT, DIM_TXT))
for i in range(N_TXT):
    tag = get_tags(fs_tags[i])
    for s in tag:
        if s in tag_idx:  # within the common tags
            # print(i, s)
            all_txt[i][tag_idx[s]] = 1
print("texts shape:", all_txt.shape)

# save
# np.save(join(P, "texts.npy"), all_txt.astype(np.int8))
all_txt = all_txt.astype(np.uint8)
sio.savemat(join(P, "texts.mat"), {"texts": all_txt}, do_compression=True)
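
The Notes describe the common tags as those that appear in at least 20 images. A sketch for recounting them from the per-image tag files, assuming one tag per line as in `get_tags`; the function name and the default threshold are my own choices, and I have not verified that this recount reproduces common_tags.txt exactly.

```python
from collections import Counter


def common_tags(tag_files, min_images=20):
    """Return the tags that occur in at least `min_images` files, sorted by name."""
    freq = Counter()
    for path in tag_files:
        with open(path, "r") as f:
            tags = {line.strip() for line in f if line.strip()}
        freq.update(tags)  # a set, so each image counts a tag at most once
    return sorted(t for t, n in freq.items() if n >= min_images)
```

With the variables above, `len(common_tags(fs_tags))` can then be compared against `DIM_TXT`.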

Label / Annotation

key_lab = lambda s: s.split('.txt')[0]  # sort by class name, ascending lexicographic order
# label file list
fs_lab = [s for s in listdir(LAB_P) if "README" not in s]
fs_lab = [s for s in fs_lab if "_r1" not in s]  # comment this line out to get 38 classes
fs_lab = sorted(fs_lab, key=key_lab)

with open(join(P, "class-name-{}.txt".format(len(fs_lab))), "w") as f:
    # record each class name with its ID
    # format: <class name>, <class ID>
    # (used to keep the class order consistent)
    for i, c in enumerate(fs_lab):
        c = key_lab(c)
        f.write("{}, {}\n".format(c, i))

fs_lab = [join(LAB_P, s) for s in fs_lab]
N_CLASS = len(fs_lab)
print("#classes:", N_CLASS)


def sample_of_lab(lab_f):
    """Read an annotation file and return the numbers of the samples in that class."""
    samples = []
    with open(lab_f, 'r') as f:
        for line in f:
            sid = int(line)
            samples.append(sid)
    return samples


# build the label matrix
all_lab = np.zeros((N_DATA, N_CLASS))
for i in range(len(fs_lab)):
    samp_ls = sample_of_lab(fs_lab[i])
    for s in samp_ls:
        all_lab[s - 1][i] = 1  # the s-th sample (1-based) belongs to the i-th class
print("labels shape:", all_lab.shape)

# save
# np.save(join(P, "labels.{}.npy".format(N_CLASS)), all_lab.astype(np.int8))
all_lab = all_lab.astype(np.uint8)
sio.savemat(join(P, "labels.{}.mat".format(N_CLASS)), {"labels": all_lab}, do_compression=True)
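
The class-name file written above is what fixes the column order of the label matrix. A sketch for reading it back, assuming the `<class name>, <class ID>` line format used by the writer; `load_class_names` is a hypothetical helper.

```python
def load_class_names(path):
    """Parse '<class name>, <class ID>' lines into a list of names ordered by ID."""
    id_name = {}
    with open(path, "r") as f:
        for line in f:
            if not line.strip():
                continue
            name, cid = line.rsplit(",", 1)  # split on the last comma only
            id_name[int(cid)] = name.strip()
    return [id_name[i] for i in sorted(id_name)]
```

The i-th entry of the returned list then names the i-th column of the label matrix.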

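Clean Data

In the comments, coasxu pointed out that some samples have an all-zero text or label vector. Removing them leaves 20,015 samples, which matches the sample count of the data provided in [4].
Both the 24-class and the 38-class versions come to 20,015 samples after filtering.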

id_clean = []
for i in range(all_lab.shape[0]):
    if (all_txt[i].sum() > 0) and (all_lab[i].sum() > 0):
        # both text and label are non-empty
        id_clean.append(i)

id_clean = np.asarray(id_clean)
print("#clean sample:", id_clean.shape)  # (20015,)
np.save(join(P, "clean_id.npy"), id_clean)
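
The same filter can be written without the Python loop. A sketch on toy data; `clean_indices` is a hypothetical name, and in the setting above it would be called as `clean_indices(all_txt, all_lab)`.

```python
import numpy as np


def clean_indices(txt, lab):
    """Indices of the rows whose text vector and label vector are both non-zero."""
    keep = (txt.sum(axis=1) > 0) & (lab.sum(axis=1) > 0)
    return np.where(keep)[0]


# toy data: 4 samples, 3 tags, 2 classes
txt = np.array([[1, 0, 0], [0, 0, 0], [0, 1, 1], [1, 0, 0]], dtype=np.uint8)
lab = np.array([[1, 0], [0, 1], [0, 0], [1, 1]], dtype=np.uint8)
print(clean_indices(txt, lab))  # [0 3]
```

Sample 1 is dropped for its empty text and sample 2 for its empty label, so only samples 0 and 3 remain.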

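I verified that the newer data provided by DCMH [4] (the one in Cleared-Set) is in the same order as the original data, and therefore in the same order as the data built here.
The check uses the labels: restricted to the 20,015 clean samples (all others are skipped), if the label cardinality matches at every position, the orders are taken to be the same. Only the per-sample sum is compared because the class order might differ.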

# labels provided by DCMH
L_dcmh = sio.loadmat("../flickr.DCMH/labels.mat")["LAll"]
# my own labels (the labels.24.mat file saved above)
L_my = sio.loadmat("labels.24.mat")["labels"]
print(L_dcmh.shape, L_my.shape)

clean_id = np.load("clean_id.npy")
clean_id = {i for i in clean_id}

i = 0
for j in range(L_my.shape[0]):
    if (j not in clean_id):  # skip non-clean samples
        continue
    if L_dcmh[i].sum() != L_my[j].sum():  # report a cardinality mismatch
        print("DIFF:", i, j)
        break
    else:
        i = i + 1
print("DONE")

Output

(20015, 24) (25000, 24)
DONE

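A further check suggests that even the class order is the same, i.e. the two label sets are completely identical.
Note, however, that the order differs from the previously provided data (the one in FLICKR-25K.mat / FLICKR-25K.h5)!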
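
The stricter check, exact equality including class order, can be sketched as follows. The argument names are hypothetical; in the setting above they would correspond to `L_dcmh`, `L_my`, and the array loaded from clean_id.npy (before it is turned into a set).

```python
import numpy as np


def same_labels(l_ref, l_full, clean_id):
    """True if l_ref equals the clean rows of l_full, row by row and class by class."""
    idx = np.sort(np.asarray(clean_id))  # clean rows in ascending sample order
    return bool(np.array_equal(l_full[idx], l_ref))
```

`np.array_equal` also returns False when the shapes differ, so a mismatch in the number of clean samples is caught as well.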

Files

Baidu Cloud

The original files, the processing scripts and some of the processed files are on Baidu Cloud: https://pan.baidu.com/s/19Zud5NQRKQRdcpGGJtpKjg

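Cleared-Set there is taken from [4] and is his newer release. FLICKR-25K.h5 is the FLICKR-25K.mat referred to in the code (Python has to treat it as an .h5 file and read it with h5py, so I renamed it) and is his earlier release. The image value ranges of the two releases differ (one is in [0, 255], like the raw images from the official site; the other has probably been processed in some way), and the sample order also seems to differ, which can be checked with the labels.
On the drive, images.cnnf5.tf.old.npy holds the CNN-F features I extracted from FLICKR-25K.h5, while images.cnnf5.torch.npy follows the ascending image numbers of the official raw data (only the samples in clean_id). Note that the sample orders differ. Both are outputs of the pool5 layer (i.e. the last two 4096-D fully connected layers have not been forwarded and are left for fine-tuning); the tf version has shape (6, 6, 256).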