A Multimodal Baseline Model for the Kaggle Shopee Product Matching Competition

深度之眼

Article link: https://blog.csdn.net/weixin_42645636/article/details/118885782

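1 Competition Overview

The Shopee Price Match Guarantee competition asks us to determine, from product images and titles, which listings are the same product.

Put simply, searching for "switch" on Shopee returns a results page like the following.

Some of the results are the Switch console itself, some are the Switch bundled with Ring Fit, and others are protective cases, storage bags, and so on. The goal of this competition is to determine, using only the image and the product title, which listings are the same product. With that, Shopee could offer more accurate product recommendations, price comparison, and possibly even counterfeit detection (identical products whose prices differ too much), among other new features.
The actual data looks like this: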

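2 Task Analysis

The most important columns are image, title, and label_group.

image: the file name of the product image
title: the product title
label_group: the product group, which is the target to predict (one group can contain multiple listings)

image_phash is a basic perceptual image hash (the more similar two images are, the closer their hash values). It is the most basic baseline in this competition, but because most participants extract their own image features instead, image_phash ends up largely unused.
The prediction task is: given a new product (also with an image and a title), find which products belong to the same group.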
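
To make "closer hash values" concrete: perceptual hashes are usually compared by Hamming distance, the number of bits in which two hashes differ. The sketch below assumes the hashes are stored as equal-length hexadecimal strings; the function name and the example hashes are made up for illustration and are not taken from the dataset.

```python
def phash_hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two equal-length hexadecimal hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree; count those bits.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Hypothetical 64-bit hashes written as 16 hex characters.
hash_1 = "a1b2c3d4e5f60718"
hash_2 = "a1b2c3d4e5f60719"  # differs from hash_1 in a single bit
hash_3 = "5e4d3c2b1a09f8e7"  # bitwise complement of hash_1

print(phash_hamming_distance(hash_1, hash_2))  # 1
print(phash_hamming_distance(hash_1, hash_3))  # 64
```

A small distance suggests near-duplicate images, while exact equality of the hash is the strictest (and simplest) baseline.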

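The hardest part of this competition is extracting good features from the image and the title.

Below are some images from the data. The shooting style and quality vary widely, which is one of the difficulties in classifying product images.

The evaluation metric is the F1 score. Since it is a standard measure, it is not covered in detail here.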
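
For reference, here is a minimal sketch of how an F1 score can be computed for one posting and then averaged over all postings, which is how the cross-validation code in section 3.6 uses it (`train.f1.mean()`). The function names and the toy data are assumptions for illustration only.

```python
def row_f1(predicted, target):
    """F1 between the predicted and the true set of matching posting ids."""
    predicted, target = set(predicted), set(target)
    n_common = len(predicted & target)
    # Every posting matches at least itself, so both sets are assumed non-empty.
    return 2 * n_common / (len(predicted) + len(target))


def mean_f1(all_predicted, all_target):
    """Average the row-wise F1 over all postings."""
    scores = [row_f1(p, t) for p, t in zip(all_predicted, all_target)]
    return sum(scores) / len(scores)


targets = [["a", "b"], ["a", "b"], ["c"]]
predictions = [["a", "b"], ["a"], ["c", "a"]]

print(round(mean_f1(predictions, targets), 4))  # 0.7778
```

A missed match (second row) and a false match (third row) are penalized equally here, which is why over-aggressive similarity thresholds hurt the score as much as overly strict ones.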

3 A Multimodal Product Matching Model Based on Text and Images

3.1 Importing Packages

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import cv2, matplotlib.pyplot as plt
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated in recent tqdm releases

import cudf, cuml, cupy
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.neighbors import NearestNeighbors
from PIL import Image

import torch
torch.manual_seed(0)
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True

import torchvision.models as models
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data.dataset import Dataset

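3.2 Loading the Data

COMPUTE_CV = True

# DATA_PATH is not defined in this excerpt; it is assumed to point to the
# competition data directory and to end with a path separator.
test = pd.read_csv(DATA_PATH + 'test.csv')
# The public test.csv has only 3 rows; a larger file means the hidden test set is in use.
if len(test)>3: COMPUTE_CV = False
else: print('this submission notebook will compute CV score, but commit notebook will not')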

# COMPUTE_CV = False

if COMPUTE_CV:
    train = pd.read_csv(DATA_PATH + 'train.csv')
    train['image'] = DATA_PATH + 'train_images/' + train['image']
    tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
    train['target'] = train.label_group.map(tmp)
    train_gf = cudf.read_csv(DATA_PATH + 'train.csv')
else:
    train = pd.read_csv(DATA_PATH + 'test.csv')
    train['image'] = DATA_PATH + 'test_images/' + train['image']
    train_gf = cudf.read_csv(DATA_PATH + 'test.csv')
    
print('train shape is', train.shape )
train.head()

3.3 Extracting Image Features with ResNet18

The module below extracts image features from the product photos.

class ShopeeImageEmbeddingNet(nn.Module):
    def __init__(self):
        super(ShopeeImageEmbeddingNet, self).__init__()
              
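        # ImageNet-pretrained backbone (newer torchvision releases prefer the `weights=` argument)
        model = models.resnet18(pretrained=True)
        # replace global average pooling with global max pooling
        model.avgpool = nn.AdaptiveMaxPool2d(output_size=(1, 1))
        # drop the final fully connected layer so the network outputs 512-d features
        model = nn.Sequential(*list(model.children())[:-1])
        # inference mode: BatchNorm layers use their running statistics
        model.eval()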
        self.model = model
        
    def forward(self, img):        
        out = self.model(img)
        return out

Compute and store the image features of every image:

DEVICE = 'cuda'

imgmodel = ShopeeImageEmbeddingNet()
imgmodel = imgmodel.to(DEVICE)

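# imageloader is not defined in this excerpt; it is assumed to be a DataLoader that
# yields batches of preprocessed image tensors in the same row order as train.
imagefeat = []
with torch.no_grad():
    for data in tqdm_notebook(imageloader):
        data = data.to(DEVICE)
        feat = imgmodel(data)
        # (batch, 512, 1, 1) -> (batch, 512)
        feat = feat.reshape(feat.shape[0], feat.shape[1])
        feat = feat.cpu().numpy()
        
        imagefeat.append(feat)

# Assumed step (missing from this excerpt): stack the batches into one array and
# L2-normalize each row so the dot products in section 3.4 are cosine similarities.
imagefeat = np.vstack(imagefeat)
imagefeat = imagefeat / np.linalg.norm(imagefeat, axis=1, keepdims=True)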
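
The similarity search in the next section thresholds dot products at 0.95, which only behaves like a cosine-similarity threshold when every feature vector has unit length. The pure-Python sketch below illustrates why normalization matters; the helper names and the toy vectors are assumptions for illustration, and real features would be normalized with array operations instead.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice as long
c = [4.0, -3.0]  # orthogonal to a

print(dot(a, b))                                        # 50.0, depends on vector length
print(round(dot(l2_normalize(a), l2_normalize(b)), 6))  # 1.0, same direction
print(round(dot(l2_normalize(a), l2_normalize(c)), 6))  # 0.0, orthogonal
```

After normalization the dot product lies in [-1, 1], so a fixed cutoff such as 0.95 means the same thing for every pair of images.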

3.4 Building Image-Match Candidates with KNN

KNN = 50
if len(test)==3: KNN = 2
model = NearestNeighbors(n_neighbors=KNN)
model.fit(imagefeat)

preds = []
CHUNK = 1024*4

imagefeat = cupy.array(imagefeat)

print('Finding similar images...')
CTS = len(imagefeat)//CHUNK
if len(imagefeat)%CHUNK!=0: CTS += 1
for j in range( CTS ):
    
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b, len(imagefeat))
    print('chunk',a,'to',b)
    
    distances = cupy.matmul(imagefeat, imagefeat[a:b].T).T
    # distances = np.dot(imagefeat[a:b,], imagefeat.T)
    
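    for k in range(b-a):
        # `distances` holds dot-product similarities: keep every posting whose
        # similarity to posting a+k exceeds 0.95
        IDX = cupy.where(distances[k,]>0.95)[0]
        # IDX = np.where(distances[k,]>0.95)[0][:]
        o = train.iloc[cupy.asnumpy(IDX)].posting_id.values
        preds.append(o)

# Assumed step (not in this excerpt): store the image matches under the column
# name that combine_for_sub / combine_for_cv read in section 3.6.
train['oof_cnn'] = preds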
# del imagefeat, imgmodel
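
The loop above processes the features in chunks so that only a CHUNK x N block of similarities is held in memory at a time, instead of the full N x N matrix. Here is a minimal pure-Python sketch of the same idea, assuming unit-length feature vectors; the function name, the tiny chunk size, and the toy vectors are illustrative assumptions.

```python
def find_matches(ids, unit_vectors, threshold=0.95, chunk_size=3):
    """For every item, return the ids whose cosine similarity exceeds `threshold`."""
    matches = []
    for start in range(0, len(unit_vectors), chunk_size):
        # The last chunk may be shorter than chunk_size; slicing handles that.
        chunk = unit_vectors[start:start + chunk_size]
        for query in chunk:
            # Dot product against every vector (cosine similarity for unit vectors).
            sims = [sum(q * v for q, v in zip(query, vec)) for vec in unit_vectors]
            matches.append([ids[i] for i, s in enumerate(sims) if s > threshold])
    return matches


ids = ["p0", "p1", "p2", "p3"]
unit_vectors = [
    [1.0, 0.0],
    [0.96, 0.28],  # close to p0: similarity 0.96
    [0.0, 1.0],
    [0.6, 0.8],
]

print(find_matches(ids, unit_vectors))
# [['p0', 'p1'], ['p0', 'p1'], ['p2'], ['p3']]
```

Every item always matches itself, because the similarity of a unit vector with itself is 1.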

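3.5 Extracting Candidates with TF-IDF Vectors and Cosine Similarity

# text_embeddings is not defined in this excerpt. Assumed step: TF-IDF vectors of the
# titles, built with the cuML TfidfVectorizer imported in section 3.1 (the hyperparameters
# are illustrative). Rows are L2-normalized by default, so dot products are cosine similarities.
model = TfidfVectorizer(binary=True, max_features=25000)
text_embeddings = model.fit_transform(train_gf.title).toarray()

preds = []
CHUNK = 1024*4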

print('Finding similar titles...')
CTS = len(train)//CHUNK
if len(train)%CHUNK!=0: CTS += 1
for j in range( CTS ):
    
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b,len(train))
    print('chunk',a,'to',b)
    
    # COSINE SIMILARITY DISTANCE
    # cts = np.dot( text_embeddings, text_embeddings[a:b].T).T
    cts = cupy.matmul(text_embeddings, text_embeddings[a:b].T).T
    
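    for k in range(b-a):
        # keep every posting whose title similarity to posting a+k exceeds 0.7
        # IDX = np.where(cts[k,]>0.7)[0]
        IDX = cupy.where(cts[k,]>0.7)[0]
        o = train.iloc[cupy.asnumpy(IDX)].posting_id.values
        preds.append(o)

# Assumed step (not in this excerpt): store the title matches under the column
# name that combine_for_sub / combine_for_cv read in section 3.6.
train['oof_text'] = preds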
del model, text_embeddings
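
To show what the title matching computes, here is a small pure-Python sketch of binary TF-IDF vectors compared by cosine similarity. It assumes whitespace tokenization and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, which mirrors scikit-learn's default; the cuML vectorizer may differ in details, and the titles are made up.

```python
import math


def tfidf_vectors(titles):
    """Return L2-normalized binary TF-IDF vectors, one per title."""
    docs = [set(title.lower().split()) for title in titles]  # binary term presence
    vocab = sorted(set().union(*docs))
    n_docs = len(docs)
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in docs))) + 1
        for term in vocab
    }
    vectors = []
    for doc in docs:
        vec = [idf[term] if term in doc else 0.0 for term in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against empty titles
        vectors.append([x / norm for x in vec])
    return vectors


def cosine(u, v):
    # For unit-length vectors the dot product equals the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


titles = [
    "nintendo switch console",
    "console nintendo switch",
    "switch carrying case",
]
vecs = tfidf_vectors(titles)

print(cosine(vecs[0], vecs[1]) > 0.7)  # True: same words, different order
print(cosine(vecs[0], vecs[2]) > 0.7)  # False: only "switch" is shared
```

Words that appear in every title, such as "switch" here, get the lowest IDF weight, so sharing them contributes little to the similarity.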

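3.6 Merging the Image and Text Results

# oof_text, oof_cnn and oof_hash are per-row arrays of matching posting_ids;
# oof_hash (image_phash matches) is not computed in this excerpt.
def combine_for_sub(row):
    x = np.concatenate([row.oof_text,row.oof_cnn, row.oof_hash])
    return ' '.join( np.unique(x) )

def combine_for_cv(row):
    x = np.concatenate([row.oof_text,row.oof_cnn, row.oof_hash])
    return np.unique(x)

# getMetric is not defined in this excerpt; it is assumed here to be a row-wise F1 helper.
def getMetric(col):
    def f1score(row):
        n = len( np.intersect1d(row.target,row[col]) )
        return 2*n / (len(row.target)+len(row[col]))
    return f1score

if COMPUTE_CV: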
    tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
    train['target'] = train.label_group.map(tmp)
    train['oof'] = train.apply(combine_for_cv,axis=1)
    train['f1'] = train.apply(getMetric('oof'),axis=1)
    print('CV Score =', train.f1.mean() )

train['matches'] = train.apply(combine_for_sub,axis=1)
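
The merge step is a set union of the candidates from each source, joined into a space-separated string for submission. A minimal pure-Python sketch of that logic follows; the function name and the posting ids are made up for illustration.

```python
def combine_matches(*match_lists):
    """Union the candidate lists and return the unique ids in sorted order."""
    merged = set()
    for matches in match_lists:
        merged.update(matches)
    return sorted(merged)


oof_text = ["p1", "p2"]  # matches found from titles
oof_cnn = ["p1", "p3"]   # matches found from image features
oof_hash = ["p1"]        # matches found from identical image hashes

merged = combine_matches(oof_text, oof_cnn, oof_hash)
print(merged)            # ['p1', 'p2', 'p3']
print(" ".join(merged))  # p1 p2 p3
```

Taking the union favors recall: a posting is kept if any one of the three sources considers it a match.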

The complete code is available by sending a private message to the editor.
Author: 小李飞刀
