NUS-WIDE Dataset Preprocessing


NUS-WIDE [1] is a multi-label dataset with 269,648 samples and 81 classes. Download the following files from [1]:

  • Groundtruth: the labels. After extraction there are two directories, AllLabels/ and TrainTestLabels/; this article only uses the former. It contains 81 files, one per class, named like Labels_*.txt; each has 269,648 lines of 0/1 values indicating whether the corresponding sample belongs to that class.
  • Tags: usable as the text modality; the download is NUS_WID_Tags.zip. Final_Tag_List.txt contains the 5,018 tags mentioned in the paper [2]; TagList1k.txt appears to be a 1,000-tag subset of them, all in English, matching the count described in DCMH [6]; All_Tags.txt lists each sample's id and its tags (presumably drawn from the 5,018 tags, since some are non-English); AllTags1k.txt is a 269,648 x 1000 0/1 matrix indicating whether each sample carries each tag.
  • Concept List: extracts to Concepts81.txt, the names of the 81 classes.
  • Image List: Imagelist.txt specifies the image corresponding to each sample.
  • Image Urls: the download links of the image data.

The downloaded files come with a train/test split, but this article ignores it, keeps the original sample order, and leaves the splitting to be done later according to the chosen setting, as in [5] (a sketch is given below).
The image data can be downloaded from other sources, such as [3,4]. The images sit under a Flickr/ directory with 704 subdirectories whose names correspond to those in Imagelist.txt; this article ignores that directory structure and only uses the files under Groundtruth/ as labels.
This article assumes all downloaded files are placed under nuswide/, which is taken as the working directory; the generated data files are saved there too.
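As a sketch of that later splitting step, here is a minimal random query/training/retrieval split; the sizes (2,100 query and 10,500 training samples) are assumptions taken from common cross-modal retrieval protocols, not prescribed by [5]:

# a minimal split sketch (sizes are assumptions)
import numpy as np

np.random.seed(0)
clean_id = np.load("clean_id.tc21.npy")  # produced in the "Clean Data" section below
perm = np.random.permutation(len(clean_id))
query_id = clean_id[perm[:2100]]         # query set
retrieval_id = clean_id[perm[2100:]]     # retrieval database
train_id = retrieval_id[:10500]          # training subset of the database
np.save("split.query.npy", query_id)
np.save("split.retrieval.npy", retrieval_id)
np.save("split.train.npy", train_id)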

Common

import os
import numpy as np
import scipy.io as sio
import h5py


N_SAMPLE = 269648

Label

  • (2021.11.3) @Nijiayoudai pointed out in a comment that the original label files contain a bug: line 78372 of Groundtruth/TrainTestLabels/Labels_lake_Train.txt holds the value -1130179. Scan all label files once:
    import os.path as osp
    import glob
    
    P = "G:/dataset/NUSWIDE/Groundtruth"
    
    for sub_p in ["AllLabels", "TrainTestLabels"]:
        p = osp.join(P, sub_p)
        for fn in glob.glob("{}/*".format(p)):
            with open(fn, "r") as f:
                for ln, line in enumerate(f):
                    label = int(line)
                    if label not in (0, 1):
                        print("* BUG:", fn, ln, label)
    
    print("DONE")
    
    Conclusion: that should be the only occurrence. Since the labels in this article are built from Groundtruth/AllLabels/ only, they are not affected by that file.
  • The class order follows the order in Concepts81.txt.
  • The files under Groundtruth/AllLabels/ are used as the labels.
print("--- label ---")
LABEL_P = "Groundtruth/AllLabels"

# class order determined by `Concepts81.txt`
cls_id = {}
with open("Concepts81.txt", "r") as f:
    for cid, line in enumerate(f):
        cn = line.strip()
        cls_id[cn] = cid
# print("\nclass-ID:", cls_id)
id_cls = {cls_id[k]: k for k in cls_id}
# print("\nID-class:", id_cls)
N_CLASS = len(cls_id)
print("\n#classes:", N_CLASS)

class_files = os.listdir(LABEL_P)
# print("\nlabel file:", len(class_files), class_files)
label_key = lambda x: x.split(".txt")[0].split("Labels_")[-1]

labels = np.zeros([N_SAMPLE, N_CLASS], dtype=np.int8)
for cf in class_files:
    c_name = label_key(cf)
    cid = cls_id[c_name]
    print('->', cid, c_name)
    with open(os.path.join(LABEL_P, cf), "r") as f:
        for sid, line in enumerate(f):
            if int(line) > 0:
                labels[sid][cid] = 1
print("labels:", labels.shape, ", cardinality:", labels.sum())
# labels: (269648, 81) , cardinality: 503848
# np.save("labels.npy", labels.astype(np.int8))
labels = labels.astype(np.uint8)
sio.savemat("labels.mat", {"labels": labels}, do_compression=True)

Image

  • Gather all images under images/ for convenient later access; only symbolic links are placed there [7].
  • Note: if you work inside docker, run this step inside docker too; otherwise the symlinks may point to the wrong location and the data will be unreadable.
print("--- image ---")
P = "/home/dataset/nuswide"  # (mapped) path IN DOCKER
IMAGE_LIST = os.path.join(P, "ImageList/Imagelist.txt")
IMAGE_SRC = os.path.join(P, "Flickr")
IMAGE_DEST = os.path.join(os.getcwd(), "images")  # path you place `images/` in
if not os.path.exists(IMAGE_DEST):
    os.makedirs(IMAGE_DEST)

with open(IMAGE_LIST, "r") as f:
    for sid, line in enumerate(f):
        # paths in Imagelist.txt use "\" as separator, e.g. `actor\0001_2124494179.jpg`
        line = line.strip().replace("\\", os.sep).replace("/", os.sep)
        img_p = os.path.join(IMAGE_SRC, line)
        new_img_p = os.path.join(IMAGE_DEST, "{}.jpg".format(sid))
        os.system("ln -s {} {}".format(img_p, new_img_p))
        if sid % 1000 == 0:
            print(sid)
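To verify the links afterwards (especially for the docker case noted above), a quick sketch that counts symlinks whose targets do not resolve:

# count broken symlinks under images/ (os.path.exists follows the link target)
n_broken = 0
for fn in os.listdir(IMAGE_DEST):
    p = os.path.join(IMAGE_DEST, fn)
    if os.path.islink(p) and not os.path.exists(p):
        n_broken += 1
print("broken symlinks:", n_broken)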

pitfall

Reading Flickr/albatross/0213_10341804.jpg with cv2 gives None, and reading it with PIL.Image yields only a 2-D array. Opening the original
Flickr/albatross/0213_10341804.jpg to look at it does not help either; I tried to re-download it manually from the image URLs, but none of the links open anymore.
Two options:

  1. Treat it as dirty data and filter it out.
  2. Keep it as noisy data.

Current way of reading:

import numpy as np
import cv2
from PIL import Image

img = cv2.imread(img_p)  # BGR, or None if cv2 cannot decode the file
if img is None:  # fall back to PIL.Image when cv2 fails
    with Image.open(img_p) as img_f:
        img = np.asarray(img_f)
    if 2 == img.ndim:  # grayscale: missing the channel dimension
        img = np.repeat(img[:, :, np.newaxis], 3, axis=2)
else:
    # img = img[:, :, ::-1]
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
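To see whether other samples share this problem, the whole image set can be scanned; a sketch (slow over all 269,648 files):

# scan all symlinked images for files that cv2 cannot decode
bad_ids = []
for sid in range(N_SAMPLE):
    p = os.path.join(IMAGE_DEST, "{}.jpg".format(sid))
    if cv2.imread(p) is None:
        bad_ids.append(sid)
print("unreadable with cv2:", len(bad_ids), bad_ids[:10])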

duplication

(2021.11.30) The dataset is said to contain 269,648 samples, but there are actually duplicates. Check:

img_id = {}  # image path -> running sample id
# path format: `actor\0001_2124494179.jpg`
# take the `2124494179` part as the image id; some image ids are duplicated
id_img = {}  # image id -> image path

# record the indices of the duplicated pairs
# (verification below shows each duplicate appears exactly twice)
idx_a, idx_b = [], []

with open(IMAGE_LIST, "r") as f:
    for sid, line in enumerate(f):
        line = line.strip()
        if line:
            img_id[line] = sid

            img_f = line.split("\\")[1].split("_")[1]
            _id = int(img_f.split(".")[0])
            if _id in id_img:
                print("duplicated:", _id, line, id_img[_id])
                _idx_a, _idx_b = img_id[id_img[_id]], sid
                idx_a.append(_idx_a)
                idx_b.append(_idx_b)
                #print("sid pair:", _idx_a, _idx_b)
            else:
                id_img[_id] = line

print("unique id:", len(id_pool))  # 269642
idx_a = np.asarray(idx_a)
idx_b = np.asarray(idx_b)
print("duplicated index pairs:", idx_a, idx_b)
  • Output: each duplicate appears exactly twice
duplicated: 2728487708 dog\0008_2728487708.jpg animal\0001_2728487708.jpg
duplicated: 815043568 horizon\0423_815043568.jpg buildings\0648_815043568.jpg
duplicated: 2729498990 iguana\0011_2729498990.jpg close-up\0002_2729498990.jpg
duplicated: 702409954 man\0215_702409954.jpg bus\0018_702409954.jpg
duplicated: 1100787682 sunrise\0526_1100787682.jpg hawaii\0077_1100787682.jpg
duplicated: 2230197395 vegetables\0206_2230197395.jpg kitchen\0486_2230197395.jpg

unique id: 269642

duplicated index pairs:
[  6974  34189  55412  34729 122745 140861]
[ 79208 126990 132983 150166 238181 258345]

Checking the actual pictures of these duplicated pairs shows that images sharing an image id are indeed the same picture, so after de-duplication there are only 269,642 distinct images. Next, check whether labels and texts are consistent within each duplicated pair (the construction of texts is described below):

print("check the consistency between duplicated pairs")

for i in range(idx_a.shape[0]):
    _idx_a, _idx_b = idx_a[i], idx_b[i]
    
    la, lb = labels[_idx_a], labels[_idx_b]
    label_diff = (la != lb).sum()
    if 0 != label_diff:
        print("label diff:", _idx_a, _idx_b, label_diff)
        print("class set 1:", [id_cls[c] for c in range(la.shape[0]) if (la[c] > 0)])
        print("class set 2:", [id_cls[c] for c in range(lb.shape[0]) if (lb[c] > 0)])
        
    text_diff = (texts[_idx_a] != texts[_idx_b]).sum()
    if 0 != text_diff:
        print("text diff:", _idx_a, _idx_b, text_diff)
  • Output
check the consistency between duplicated pairs

label diff: 34729 150166 3
class set 1: ['person', 'window']
class set 2: ['person', 'reflection', 'road']

label diff: 122745 238181 2
class set 1: ['clouds', 'mountain', 'sky', 'sun', 'valley']
class set 2: ['clouds', 'mountain', 'sand', 'sky', 'sun']

Surprisingly, two of the pairs have inconsistent labels... It is unclear whether these samples end up in the clean data (the clean-data processing is described below); a quick check is sketched next.
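A minimal sketch of such a check, assuming the clean_id.tc21.npy file produced in the Clean Data section below has already been generated:

# check whether the duplicated samples survive the TC-21 label sieve
clean_id_21 = np.load("clean_id.tc21.npy")
print("kept in clean data (a):", np.isin(idx_a, clean_id_21))
print("kept in clean data (b):", np.isin(idx_b, clean_id_21))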

Text

  • Following the DCMH setting, the 1k tags are used; but there are two ways to build the text data, and their results differ!
  • They are compared against the data provided by DCMH later to decide which one to use.
  • First read in the 1k tags and fix the tag order, which follows the order in TagList1k.txt.
print("--- text ---")
TEXT_P = "NUS_WID_Tags"

# use 1k tags as DCMH
tag_id = {}
with open(os.path.join(TEXT_P, "TagList1k.txt"), "r", encoding='utf-8') as f:
    for tid, line in enumerate(f):
        tn = line.strip()
        tag_id[tn] = tid
id_tag = {tag_id[k]: k for k in tag_id}
print("\ntag-ID:", len(tag_id), list(tag_id)[:10])
N_TAG = len(id_tag)
print("\n#tag:", N_TAG)  # 1000

first way

  • The first way uses All_Tags.txt and keeps only the tags belonging to the 1k-tag subset.
print("- 1st: from `All_Tags.txt` -")
texts_1 = np.zeros([N_SAMPLE, N_TAG], dtype=np.int8)
with open(os.path.join(TEXT_P, "All_Tags.txt"), "r", encoding='utf-8') as f:
    for sid, line in enumerate(f):
        # format: <sample id> <tags...>
        _tags = line.split()[1:]
        # print(_tags)
        for _t in _tags:
            if _t in tag_id:  # restrict to the 1k tags
                tid = tag_id[_t]
                texts_1[sid][tid] = 1
        if sid % 1000 == 0:
            print(sid)
print("1st texts:", texts_1.shape, ", cardinality:", texts_1.sum())
# 1st texts: (269648, 1000) , cardinality: 1559503
# np.save("texts.All_Tags.npy", texts_1.astype(np.int8))
texts_1 = texts_1.astype(np.uint8)
sio.savemat("texts.All_Tags.mat", {"texts": texts_1}, do_compression=True)

second way

  • The second way reads directly from AllTags1k.txt.
  • Note that its cardinality differs from the one obtained with the first method.
print("- 2nd: from `AllTags1k.txt` -")
texts_2 = np.zeros([N_SAMPLE, N_TAG], dtype=np.int8)
with open(os.path.join(TEXT_P, "AllTags1k.txt"), "r") as f:
    for sid, line in enumerate(f):
        # format: a 1000-dim 0/1 multi-hot vector
        line = list(map(int, line.split()))
        # assert len(line) == 1000
        texts_2[sid] = np.asarray(line).astype(np.int8)
        if sid % 1000 == 0:
            print(sid)
print("2nd texts:", texts_2.shape, ", cardinality:", texts_2.sum())
# 2nd texts: (269648, 1000) , cardinality: 1559464
# np.save("texts.AllTags1k.npy", texts_2.astype(np.int8))
texts_2 = texts_2.astype(np.uint8)
sio.savemat("texts.AllTags1k.mat", {"texts": texts_2}, do_compression=True)

comparison

  • Compare the text data produced by the two methods.
  • They differ only on a few samples, where the first method yields more tags than the second.
print("- compare 2 texts -")
print("2 text:", texts_1.shape, texts_2.shape, texts_1.sum(), texts_2.sum())
n_diff_order, n_diff_card = 0, 0
with open(os.path.join(TEXT_P, "All_Tags.txt"), "r", encoding='utf-8') as f1, \
        open(os.path.join(TEXT_P, "AllTags1k.txt"), "r") as f2:
    for i in range(texts_1.shape[0]):
        n1 = texts_1[i].sum()
        n2 = texts_2[i].sum()
        line1 = next(f1)
        line2 = next(f2)
        if n1 == n2:
            diff = np.abs(texts_1[i] - texts_2[i]).sum()
            if diff != 0:
                n_diff_order += 1
                print("tag order diff:", i, diff)
            continue

        print("--- diff:", i, n1, n2)
        n_diff_card += 1
        tags1 = set([_t for _t in line1.split()[1:] if _t in tag_id])
        line2 = list(map(int, line2.split()))
        tags2 = set([id_tag[i] for i in range(len(line2)) if line2[i] > 0])
        print("tags 1:", sorted(list(tags1)))
        print("\ntags 2:", sorted(list(tags2)))

        extra1 = tags1 - tags2
        if len(extra1) > 0:
            print("\nextra 1:", extra1)
            for k in extra1:
                if k not in tag_id:
                    print("* ERROR:", k, "not it tag_id")
        extra2 = tags2 - tags1
        if len(extra2) > 0:
            print("\nextra 2:", extra2)
            for k in extra2:
                if k not in tag_id:
                    print("* ERROR:", k, "not it tag_id")

print("#tag order mismatch:", n_diff_order)  # 0
print("#tag cardinarity different:", n_diff_card)  # 16

TC-21, TC-10

  • Two common settings keep only the 21 / 10 classes with the most samples, i.e. TC-21 / TC-10.
  • Build the corresponding label data here.
print("--- TC-21, TC-10 ---")
# labels = sio.loadmat("labels.mat")["labels"]
lab_sum = labels.sum(0)
# print("label sum:", lab_sum)
class_desc = np.argsort(lab_sum)[::-1]
tc21 = np.sort(class_desc[:21])
tc10 = np.sort(class_desc[:10])
print("TC-21:", {id_cls[k]: lab_sum[k] for k in tc21})
print("TC-10:", {id_cls[k]: lab_sum[k] for k in tc10})


def make_sub_class(tc):
    n_top = len(tc)
    print("- process TC-{} -".format(n_top))
    with open("class-name-tc{}.txt".format(n_top), "w") as f:
        for i in range(n_top):
            cid = tc[i]
            cn = id_cls[cid]
            n_sample = lab_sum[cid]
            # format: <new class id> <class name> <original class id> <#sample>
            f.write("{} {} {} {}\n".format(i, cn, cid, n_sample))

    sub_labels = labels[:, tc]
    print("sub labels:", sub_labels.shape, ", cardinality:", sub_labels.sum())
    # sub labels: (269648, 21) , cardinality: 411438
    # sub labels: (269648, 10) , cardinality: 332189
    # np.save("labels.tc-{}.npy".format(n_top), sub_labels.astype(np.int8))
    sub_labels = sub_labels.astype(np.uint8)
    sio.savemat("labels.tc-{}.mat".format(n_top), {"labels": sub_labels}, do_compression=True)


make_sub_class(tc21)
make_sub_class(tc10)

Clean Data

  • Clean the data and record the indices of the clean samples.
  • There are two sieving schemes: drop only samples with an empty label, or drop samples whose label or text is empty; since text comes in two versions, there are 6 results in total.
  • For TC-21, the label-only sieve keeps 195,834 samples, matching the amount provided by DCMH; the double sieve keeps 190,421, matching SSAH [9].
  • Also record the mapping between the new indices in the clean data and the indices in the original data, written to clean-full-map.*.txt (it might be useful later).
print("--- clean data ---")
labels_21 = sio.loadmat("labels.tc-21.mat")["labels"]
labels_10 = sio.loadmat("labels.tc-10.mat")["labels"]


def pick_clean(label, text, name, double_sieve):
    clean_id = []
    on_map = {}
    new_id = 0
    for i, (l, t) in enumerate(zip(label, text)):
        # if only sieved by label (`double_sieve` = False)
        # we get 195,834 samples in TC-21, and 186,577 in TC-10
        # which matches the one DCMH provided
        if (0 == l.sum()):
            continue
        # if sieved by both label & text (`double_sieve` = True)
        # we get 190,421 samples in TC-21, and 181,365 in TC-10
        if double_sieve and (0 == t.sum()):
            continue
        on_map[new_id] = i
        new_id += 1
        clean_id.append(i)
    clean_id = np.asarray(clean_id)
    print("clean id:", clean_id.shape)
    np.save("clean_id.{}.npy".format(name), clean_id)

    with open("clean-full-map.{}.txt".format(name), "w") as f:
        for k in on_map:
            f.write("{} {}\n".format(k, on_map[k]))


for label, ln in zip([labels_21, labels_10], ["tc21", "tc10"]):
    pick_clean(label, label, ln, False)
    for text, tn in zip([texts_1, texts_2], ["All_Tags", "AllTags1k"]):
        pick_clean(label, text, "{}.{}".format(ln, tn), True)

Comparison to DCMH

  • Compare with the data provided by DCMH [8].
  • Conclusion: the sample order is the same; the label sums are equal (assuming DCMH's data is correct); the text built with the 2nd method has the same sum as DCMH's (assuming it is correct), so that one is adopted.
  • Additional check: after sorting, the row sums and column sums of the labels and texts should match element by element.
  • With DCMH's MATLAB code, the results reported in their paper can be reproduced on nuswide-tc21, so the data should be correct.
print("--- compare with the DCMH provided ---")
L_21_dcmh = sio.loadmat("nus-wide-tc21-lall.mat")["LAll"]
L_10_dcmh = sio.loadmat("nus-wide-tc10-lall.mat")["LAll"]
T_21_dcmh = sio.loadmat("nus-wide-tc21-yall.mat")["YAll"]
T_10_dcmh = h5py.File("nus-wide-tc10-yall.mat", "r")["YAll"][:].T.astype(int)
print(L_21_dcmh.shape, L_10_dcmh.shape, T_21_dcmh.shape, T_10_dcmh.shape)
# (195834, 21) (186577, 10) (195834, 1000) (186577, 1000)

clean_id_10 = np.load("clean_id.tc10.npy")
clean_id_21 = np.load("clean_id.tc21.npy")

# compare the label sums: equal
print("label 21:", L_21_dcmh.sum(), labels_21[clean_id_21].sum())
print("label 10:", L_10_dcmh.sum(), labels_10[clean_id_10].sum())
# compare the two kinds of text: the **second** one matches the DCMH data
print("text 21:", T_21_dcmh.sum(), texts_1[clean_id_21].sum(), texts_2[clean_id_21].sum())
print("text 10:", T_10_dcmh.sum(), texts_1[clean_id_10].sum(), texts_2[clean_id_10].sum())
L_21_my = labels_21[clean_id_21]
L_10_my = labels_10[clean_id_10]
T_21_my = texts_2[clean_id_21]
T_10_my = texts_2[clean_id_10]

count = lambda x: x.astype(int).sum()
# sorted row sums and column sums for TC-21
# TC-10 can be checked similarly
lrs_dcmh = np.sort(L_21_dcmh.sum(1))
lrs_my = np.sort(L_21_my.sum(1))
print("label row sum diff:", count(lrs_dcmh != lrs_my))  # 0
lcs_dcmh = np.sort(L_21_dcmh.sum(0))
lcs_my = np.sort(L_21_my.sum(0))
print("label col sum diff:", count(lcs_dcmh != lcs_my))  # 0
trs_dcmh = np.sort(T_21_dcmh.sum(1))
trs_my = np.sort(T_21_my.sum(1))
print("text row sum diff:", count(trs_dcmh != trs_my))  # 0
tcs_dcmh = np.sort(T_21_dcmh.sum(0))
tcs_my = np.sort(T_21_my.sum(0))
print("text col sum diff:", count(tcs_dcmh != tcs_my))  # 0


def check_sample_order(L_dcmh, L_my, T_dcmh, T_my):
    nc = L_dcmh.shape[1]
    print("---", nc, "---")
    has_diff = False
    for i in range(L_dcmh.shape[0]):
        l1 = L_dcmh[i].sum()
        l2 = L_my[i].sum()
        if l1 != l2:
            print("* label diff:", i, l1, l2)
            has_diff = True
            break
        t1 = T_dcmh[i].sum()
        t2 = T_my[i].sum()
        if t1 != t2:
            print("* text diff:", i, t1, t2)
            has_diff = True
            break
    if not has_diff:
        print("DONE")


# compare the sample order with DCMH's
check_sample_order(L_21_dcmh, L_21_my, T_21_dcmh, T_21_my)
check_sample_order(L_10_dcmh, L_10_my, T_10_dcmh, T_10_my)
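If memory permits, a stricter element-wise comparison can be added (a sketch; it checks exact equality rather than per-sample sums):

# element-wise comparison; 0 means the matrices are identical
print("label 21 element diff:", (L_21_dcmh.astype(int) != L_21_my.astype(int)).sum())
print("text 21 element diff:", (T_21_dcmh.astype(int) != T_21_my.astype(int)).sum())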

Cloud Drive

Baidu Cloud

https://pan.baidu.com/s/1362XGnPAp5zlL__eF5D_mw, extraction code: hf3r
NUS-WIDE

kaggle

https://www.kaggle.com/dataset/7cbbf047bc9c47b4f2c00e83531d3376ab8887bb0deed2ce2ee1596fe96aa94d

References

  1. NUS-WIDE
  2. NUS-WIDE: A Real-World Web Image Database from National University of Singapore
  3. About NUS-WIDE #8
  4. Where can the NUS-WIDE dataset be downloaded? (NUS-WIDE数据库在哪里下载?)
  5. NUS-WIDE dataset split (NUS-WIDE数据集划分)
  6. Deep Cross-Modal Hashing
  7. Creating and removing directory symlinks on Linux (linux创建、删除文件夹的软链接)
  8. DCMH-CVPR2017/DCMH_tensorflow/DCMH_tensorflow/readme.txt
  9. Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval