Kaggle Shopee Product Matching Competition: A Multimodal Baseline Model

1 Competition Overview

The Shopee - Price Match Guarantee competition asks us to decide, from product images and titles, which listings are the same product.

Put simply, if I search for "switch" on Shopee, the following page comes up.

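As the page shows, some results are the Switch console, some are the Switch bundled with Ring Fit, and others are protective cases, storage bags, and the like. The goal of this competition is to determine which listings are the same product using only the image and the product title, so that Shopee can build more accurate product recommendations, price comparison, and possibly even counterfeit detection (flagging cases where the same product shows too large a price gap), among other new features.
The actual data looks like this: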

2 Task Analysis

The three most important features are image, title, and label_group.

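  • image: the file name of the product's image
  • title: the product's title
  • label_group: the product's group, which is the target we need to predict (a single group can contain multiple listings)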

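image_phash is a basic perceptual image hash (the more similar two images are, the closer their hash values). In this competition it serves as the most basic baseline, but since most participants extract image features from scratch, image_phash ends up contributing very little.
What we need to predict is, given a new listing (which likewise has an image and a title), which other listings belong to the same group.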
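
To make the "closer hash values" idea concrete, here is a minimal sketch in plain Python. It assumes the hashes are equal-length hexadecimal strings; the posting ids and hash values are made up for illustration, and the helper names are hypothetical. It shows the exact-match baseline (identical hash means match) next to a Hamming distance that measures how close two hashes are.

```python
from collections import defaultdict

def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def group_by_exact_hash(posting_ids, hashes):
    """Baseline: listings with an identical hash are treated as matches."""
    groups = defaultdict(list)
    for pid, h in zip(posting_ids, hashes):
        groups[h].append(pid)
    # One list of matching posting ids per listing, in input order
    return [groups[h] for h in hashes]

# Made-up posting ids and 64-bit hashes, for illustration only
posting_ids = ["p1", "p2", "p3"]
hashes = ["94974f937d4c2433", "94974f937d4c2433", "94974f937d4c2432"]

matches = group_by_exact_hash(posting_ids, hashes)
near = hamming_distance(hashes[0], hashes[2])
```

Exact matching misses p3 even though its hash differs from p1's by a single bit, which hints at why learned image features tend to do better than the hash alone.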

The hardest part of this competition is how to extract features from the image and the title.

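Below are some images from the data. The shooting style and quality can vary enormously, which is one of the things that makes classifying product images difficult.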

The evaluation metric for this competition is the F1 score. Since it is a standard metric, it is not covered in detail here.
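
For readers who want to see the metric spelled out, here is a minimal sketch, assuming the score is computed per listing (F1 between the predicted and the true set of matching posting ids) and then averaged over all listings, which is how the CV score in section 3.6 is computed. The function names and the toy data are hypothetical.

```python
def row_f1(target, predicted):
    """F1 between the true and predicted sets of matching posting ids."""
    target, predicted = set(target), set(predicted)
    n_common = len(target & predicted)
    # F1 = 2 * |intersection| / (|target| + |predicted|)
    return 2 * n_common / (len(target) + len(predicted))

def mean_f1(targets, predictions):
    """Average of the per-listing F1 scores."""
    scores = [row_f1(t, p) for t, p in zip(targets, predictions)]
    return sum(scores) / len(scores)

# Toy example: three listings, the first two belong to the same group
targets = [["a", "b"], ["a", "b"], ["c"]]
predictions = [["a", "b"], ["b"], ["c", "a"]]
score = mean_f1(targets, predictions)
```

Every listing matches at least itself, so the denominator in row_f1 is never zero for valid inputs.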

3 A Multimodal Product Matching Model Based on Text and Images

3.1 Importing packages

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import cv2, matplotlib.pyplot as plt
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

import cudf, cuml, cupy
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.neighbors import NearestNeighbors
from PIL import Image

import torch
torch.manual_seed(0)
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True

import torchvision.models as models
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data.dataset import Dataset

3.2 Loading the data

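# Assumption: DATA_PATH is not defined in this excerpt; the usual Kaggle input
# directory for this competition is used here. Adjust it to your own setup.
DATA_PATH = '../input/shopee-product-matching/'

COMPUTE_CV = True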

test = pd.read_csv(DATA_PATH + 'test.csv')
if len(test)>3: COMPUTE_CV = False
else: print('this submission notebook will compute CV score, but commit notebook will not')

# COMPUTE_CV = False

if COMPUTE_CV:
    train = pd.read_csv(DATA_PATH + 'train.csv')
    train['image'] = DATA_PATH + 'train_images/' + train['image']
    tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
    train['target'] = train.label_group.map(tmp)
    train_gf = cudf.read_csv(DATA_PATH + 'train.csv')
else:
    train = pd.read_csv(DATA_PATH + 'test.csv')
    train['image'] = DATA_PATH + 'test_images/' + train['image']
    train_gf = cudf.read_csv(DATA_PATH + 'test.csv')
    
print('train shape is', train.shape )
train.head()

3.3 Extracting image features with ResNet18

The following module extracts image features from the product images.

class ShopeeImageEmbeddingNet(nn.Module):
    def __init__(self):
        super(ShopeeImageEmbeddingNet, self).__init__()
              
        model = models.resnet18(True)
        model.avgpool = nn.AdaptiveMaxPool2d(output_size=(1, 1))
        model = nn.Sequential(*list(model.children())[:-1])
        model.eval()
        self.model = model
        
    def forward(self, img):        
        out = self.model(img)
        return out

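Compute and store the features of every image (imageloader is a DataLoader over the product images; its definition is not shown here):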

DEVICE = 'cuda'

imgmodel = ShopeeImageEmbeddingNet()
imgmodel = imgmodel.to(DEVICE)

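imagefeat = []
with torch.no_grad():
    for data in tqdm_notebook(imageloader):
        data = data.to(DEVICE)
        feat = imgmodel(data)
        # (batch, 512, 1, 1) -> (batch, 512)
        feat = feat.reshape(feat.shape[0], feat.shape[1])
        feat = feat.data.cpu().numpy()
        
        imagefeat.append(feat)

# Stack the per-batch arrays into one (n_images, 512) matrix and L2-normalize each row,
# so that the dot products taken in section 3.4 are cosine similarities.
imagefeat = np.vstack(imagefeat)
imagefeat = imagefeat / np.linalg.norm(imagefeat, axis=1, keepdims=True)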

3.4 Building image-match candidates with KNN

KNN = 50
if len(test)==3: KNN = 2
model = NearestNeighbors(n_neighbors=KNN)
model.fit(imagefeat)

preds = []
CHUNK = 1024*4

imagefeat = cupy.array(imagefeat)

print('Finding similar images...')
CTS = len(imagefeat)//CHUNK
if len(imagefeat)%CHUNK!=0: CTS += 1
for j in range( CTS ):
    
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b, len(imagefeat))
    print('chunk',a,'to',b)
    
    distances = cupy.matmul(imagefeat, imagefeat[a:b].T).T
    # distances = np.dot(imagefeat[a:b,], imagefeat.T)
    
    for k in range(b-a):
        IDX = cupy.where(distances[k,]>0.95)[0]
        # IDX = np.where(distances[k,]>0.95)[0][:]
        o = train.iloc[cupy.asnumpy(IDX)].posting_id.values
        preds.append(o)
        
train['oof_cnn'] = preds  # image-based candidates, read back as row.oof_cnn in section 3.6

# del imagefeat, imgmodel
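
The chunked matching loop above can be tried on CPU with NumPy alone. The sketch below is a simplified stand-in, not the GPU code itself: find_matches is a hypothetical helper, the 2-D "embeddings" are made up, and the rows are L2-normalized first so that a dot product equals cosine similarity.

```python
import numpy as np

def find_matches(features, threshold=0.95, chunk=4):
    """For each row, return the indices of rows whose cosine similarity exceeds threshold."""
    # L2-normalize so that a dot product equals cosine similarity
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    features = features / norms
    matches = []
    # Process the rows in chunks to bound the size of the similarity block
    for start in range(0, len(features), chunk):
        stop = min(start + chunk, len(features))
        sims = features[start:stop] @ features.T  # shape (stop - start, n)
        for row in sims:
            matches.append(np.where(row > threshold)[0])
    return matches

# Toy 2-D "embeddings" (made up): rows 0 and 1 point almost the same way
features = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
matches = find_matches(features, threshold=0.95, chunk=2)
```

Chunking only limits memory use: the result is the same as thresholding the full n-by-n similarity matrix at once. An all-zero feature row would cause a division by zero in this sketch.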

3.5 Extracting candidates with TF-IDF vectors and cosine similarity

preds = []
CHUNK = 1024*4

print('Finding similar titles...')
CTS = len(train)//CHUNK
if len(train)%CHUNK!=0: CTS += 1
for j in range( CTS ):
    
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b,len(train))
    print('chunk',a,'to',b)
    
    # COSINE SIMILARITY DISTANCE
    # cts = np.dot( text_embeddings, text_embeddings[a:b].T).T
    cts = cupy.matmul(text_embeddings, text_embeddings[a:b].T).T
    
    for k in range(b-a):
        # IDX = np.where(cts[k,]>0.7)[0]
        IDX = cupy.where(cts[k,]>0.7)[0]
        o = train.iloc[cupy.asnumpy(IDX)].posting_id.values
        preds.append(o)
        
train['oof_text'] = preds  # title-based candidates, read back as row.oof_text in section 3.6

del model, text_embeddings
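
The loop above expects text_embeddings to already hold one L2-normalized TF-IDF vector per title, but the step that builds it is not shown in this excerpt. The sketch below is a simplified NumPy stand-in under stated assumptions: tfidf_embeddings is a hypothetical helper, it uses binary term presence with a smoothed IDF, and it is not claimed to reproduce the exact output of cuML's TfidfVectorizer.

```python
import numpy as np

def tfidf_embeddings(titles):
    """Binary TF-IDF with L2-normalized rows (simplified, whitespace tokenization)."""
    tokenized = [set(t.lower().split()) for t in titles]
    vocab = sorted(set().union(*tokenized))
    index = {w: i for i, w in enumerate(vocab)}
    # Binary term-presence matrix, one row per title
    tf = np.zeros((len(titles), len(vocab)))
    for r, words in enumerate(tokenized):
        for w in words:
            tf[r, index[w]] = 1.0
    # Smoothed inverse document frequency
    df = tf.sum(axis=0)
    idf = np.log((1 + len(titles)) / (1 + df)) + 1
    emb = tf * idf
    # L2-normalize so that dot products are cosine similarities
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Made-up titles, for illustration only
titles = ["nintendo switch console", "Nintendo Switch console", "switch protective case"]
text_embeddings = tfidf_embeddings(titles)
cts = text_embeddings @ text_embeddings.T
matches = [np.where(row > 0.7)[0].tolist() for row in cts]
```

The first two titles differ only in letter case and match each other, while the third shares just the common word "switch" and stays below the 0.7 threshold. An empty title would produce a zero vector and a division by zero in this sketch.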

3.6 Merging the image and text results

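# oof_text, oof_cnn and oof_hash are expected to be columns of `train`, each holding
# an array of candidate posting_ids (presumably from titles, CNN features and image_phash).
def combine_for_sub(row):
    # Submission format: unique posting_ids joined by spaces
    x = np.concatenate([row.oof_text,row.oof_cnn, row.oof_hash])
    return ' '.join( np.unique(x) )

def combine_for_cv(row):
    # Same union, kept as an array so it can be scored against the target
    x = np.concatenate([row.oof_text,row.oof_cnn, row.oof_hash])
    return np.unique(x)

if COMPUTE_CV:
    tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
    train['target'] = train.label_group.map(tmp)
    train['oof'] = train.apply(combine_for_cv,axis=1)
    # getMetric(col) is not defined in this excerpt; it is assumed to return a function
    # that computes the F1 score between row.target and row[col].
    train['f1'] = train.apply(getMetric('oof'),axis=1)
    print('CV Score =', train.f1.mean() )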

train['matches'] = train.apply(combine_for_sub,axis=1)
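
To see what the merge does for a single listing, here is a minimal NumPy sketch. The combine helper and the posting ids are hypothetical; the three arrays stand for the candidates produced by the text, CNN, and hash methods.

```python
import numpy as np

def combine(*candidate_lists):
    """Union of the candidate posting ids produced by several methods for one listing."""
    return np.unique(np.concatenate(candidate_lists))

# Made-up candidates for one listing
oof_text = np.array(["p1", "p2"])
oof_cnn = np.array(["p1", "p3"])
oof_hash = np.array(["p1"])

merged = combine(oof_text, oof_cnn, oof_hash)
matches = " ".join(merged)  # space-separated string, as in combine_for_sub
```

Taking the union favors recall: a listing counts as a match if any one of the methods proposes it, so the per-method thresholds are what keep precision in check.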

The complete code is available on request by sending the editor a private message.
Author: 小李飞刀
