Python MUFFIN for Person–Job Matching: A Recommendation Algorithm (Self-Attentional Multi-Field Features Representation and Interaction Learning)


1. Recommendation Algorithms: A Brief Introduction to Person–Job Matching

Self-Attentional Multi-Field Features Representation and Interaction Learning for Person–Job Fit

Applied to person–job matching, recommendation algorithms mainly take two forms: recommending jobs to people and recommending people to jobs. Either way, the method and overall approach should be the same. This post only gives an overview of the paper, covering the authors' data and model design; for more detail, please go read the paper itself.

In Section III-B the authors lay out their approach. In the overview, they divide features into three types: continuous (numerical) features, categorical features, and (long) text features. Both resume and job-posting information are mapped onto these three feature types.

First, the numerical features are standardized and the categorical features are one-hot encoded, while the text features are split with a tokenizer. Then nn.Embedding() embeds the numerical and categorical features, and an ALBERT model embeds the text features (see Equations 1–3 in the paper).
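To make the preprocessing concrete, here is a minimal sketch of the three paths. The data values, embedding sizes, and the ALBERT checkpoint name are hypothetical placeholders, not from the paper:

import numpy as np
import torch
from transformers import BertTokenizer

# continuous features: z-score standardization (toy values)
years = np.array([1.0, 3.0, 5.0, 10.0])
years_std = (years - years.mean()) / years.std()

# categorical features: map each category to an index, then look it up with nn.Embedding
city2idx = {'beijing': 0, 'shanghai': 1, 'shenzhen': 2}
city_emb = torch.nn.Embedding(len(city2idx) + 1, 16)
city_vec = city_emb(torch.tensor([city2idx['shanghai']]))

# text features: tokenize into ids, which are later fed into ALBERT
tokenizer = BertTokenizer.from_pretrained('voidful/albert_chinese_tiny')  # placeholder checkpoint
ids = tokenizer('responsible for recommender-system R&D', padding='max_length',
                max_length=32, truncation=True, return_tensors='pt')['input_ids']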

Next comes the interaction layer across feature fields. It is actually quite simple: for field i, the interaction takes the resume-side feature, the job-side feature, plus their element-wise difference and element-wise product, concatenates these parts, passes them through a linear layer, and pools (e.g., averages) the result. That's it.
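Written out (in my own notation, reconstructed from the description above rather than the paper's exact symbols), the local match for field i is roughly:

$$z_i = \mathrm{pool}\big(\mathrm{MLP}([\,r_i;\; j_i;\; r_i - j_i;\; r_i \odot j_i\,])\big)$$

where r_i and j_i are the resume-side and job-side embeddings of field i, and ⊙ is element-wise multiplication.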

The "external" attention over features here means attention across the different fields: treat the series of vectors h_i as a sequence, and it looks very much like the Transformer's attention mechanism, except without masking or positional encoding; only the computation is similar.

After the attention computation there is also a residual connection, i.e. \hat{h} + h; the resulting z_i is the final per-field output.
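A minimal sketch of this step, with hypothetical sizes; batch_first=True is equivalent to the transpose-based version in the code further down:

import torch
import torch.nn as nn

h = torch.randn(8, 7, 64)                     # 7 local match vectors h_i per sample (hypothetical dims)
att = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
h_hat, weights = att(h, h, h)                 # self-attention over fields: no mask, no positional encoding
z = h + h_hat                                 # residual connection: z_i = h_i + hat(h_i)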

The z_i from the different fields are concatenated into d, a vector representing the entire resume/job information, which then passes through fully connected layers with ReLU and finally a sigmoid layer, yielding the binary classification probability.

The loss function is the standard cross-entropy loss; nothing more to say there.
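One practical note: MatchNet below ends in a plain Linear(embdsize, 1) with no sigmoid, so a training loop would pair the raw logits with BCEWithLogitsLoss, which applies the sigmoid internally. A hedged sketch, with the model, batch tensors, and labels assumed to come from the project's dataloader:

criterion = nn.BCEWithLogitsLoss()
logits, att_weights = model(co_xi, co_xv, city_xi, city_xv, cat_tokens, exp_ids, jd_ids)
loss = criterion(logits.squeeze(-1), labels.float())
prob = torch.sigmoid(logits)                  # the matching probability described above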

2. Model Code

Embedding for continuous features

class CoEmbdNet(nn.Module):
    '''
    Embedding lookup for continuous (numerical) features.
    '''
    def __init__(self, config):
        super(CoEmbdNet, self).__init__()
        self.config = config
        # for the continuous features
        self.co_emb = nn.Embedding(config.co_idx+1, config.embdsize)
        nn.init.xavier_uniform_(self.co_emb.weight)

    def forward(self, xi, xv):
        # for continuous features
        co_emb = self.co_emb(xi)
        co_value = torch.mul(co_emb, xv.unsqueeze(-1))
        return co_value
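The xi/xv convention here is the feature-index/feature-value split familiar from DeepFM-style implementations: xi holds the slot index of each continuous feature and xv its standardized value, which scales the looked-up embedding. A toy smoke test (the config values are made up, and torch is assumed imported as in the complete code at the end):

class _Cfg: co_idx, embdsize = 11, 16         # hypothetical: 11 continuous slots, 16-dim embeddings
net = CoEmbdNet(_Cfg())
xi = torch.arange(11).unsqueeze(0)            # feature indices, shape (batch=1, 11)
xv = torch.rand(1, 11)                        # standardized feature values
out = net(xi, xv)                             # (1, 11, 16): embedding of slot i scaled by x_i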

Embedding for categorical features

class CaEmbdNet(nn.Module):
    '''
    Embedding lookup for category features (only the city field here).
    '''
    def __init__(self, config):
        super(CaEmbdNet, self).__init__()
        self.config = config
        self.ca_idx = self.get_cat_size()
        # for the category features
        self.ca_emb = nn.Embedding(self.ca_idx+1, config.embdsize)
        nn.init.xavier_uniform_(self.ca_emb.weight)

    def get_cat_size(self):
        config = self.config
        with open(config.feature2idx_path,'r') as file:
            # the file stores a Python-literal dict; ast.literal_eval would be safer than eval
            dicts = eval(file.read())
        return len(dicts['city'])

    def forward(self, xi, xv):
        # for category features
        ca_emb = self.ca_emb(xi)
        ca_value = torch.mul(ca_emb, xv.unsqueeze(-1))
        return ca_value
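For reference, get_cat_size expects feature2idx_path to contain a Python-literal dict. A minimal, purely hypothetical example of the file's contents:

{'city': {'beijing': 0, 'shanghai': 1, 'shenzhen': 2}}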

Text features, embedded with ALBERT

class AlbertEmbd(nn.Module):
    '''
    Embed the text features with the ALBERT model from the transformers package.
    '''

    def __init__(self, config, model_config):
        super(AlbertEmbd, self).__init__()
        self.config = config
        self.embd = AlbertModel.from_pretrained(config.albertpath, config=model_config)

    def forward(self, token_tensor,exp_tensor,jd_tensor):
        embd_all = []
        for i in range(self.config.ca_num):
            embd = self.embd(token_tensor[:, i * self.config.cat_token_len:(i + 1) * self.config.cat_token_len])
            embd_all.append(embd[1])  # use the pooled [CLS] vector as the sentence representation for downstream layers
        cat_token_embd = torch.cat(embd_all, dim=-1)
        cat_token_embd = cat_token_embd.reshape(cat_token_embd.size()[0], self.config.ca_num, -1)
        exp_embd = self.embd(exp_tensor)
        jd_embd = self.embd(jd_tensor)
        return cat_token_embd,exp_embd[1].unsqueeze(1),jd_embd[1].unsqueeze(1)
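Note: embd[1] is ALBERT's pooled output (a transform of the [CLS] hidden state); in recent transformers versions the same tensor is also available as embd.pooler_output.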

Fully connected layer for the continuous features

class Co_FC(nn.Module):
    '''
    Feed the continuous-feature embeddings through a fully connected stack.
    '''
    def __init__(self,in_size, hiden_size, out_size):
        super(Co_FC, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_size,hiden_size[0]),
            nn.ReLU(),
            nn.Linear(hiden_size[0],hiden_size[1]),
            nn.ReLU(),
            nn.Linear(hiden_size[1],out_size)
        )
    def forward(self,x):
        x = self.fc(x)
        return x
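As far as I can tell, Co_FC is defined here but never instantiated in MatchNet below; the low-dimensional projection of the text vectors is handled by MatchNet's own self.fc instead.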

The full matching network

class MatchNet(nn.Module):
    def __init__(self,config,model_config):
        super(MatchNet, self).__init__()
        self.config = config
        # embedding
        self.co_embd_net = CoEmbdNet(config)
        self.ca_embd_net = CaEmbdNet(config)
        self.albert_emd_net = AlbertEmbd(config,model_config)
        # local match each return a tensor(embdsize)
        self.match_years = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_degree = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_salary = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_city = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_industry = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        self.match_type = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        self.match_text = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        # project the text embeddings into a lower dimension
        self.fc = nn.Sequential(
            nn.Linear(config.albert_size, config.hiden_size[1]),
            nn.ReLU(),
            nn.Linear(config.hiden_size[1], config.hiden_size[2]),
            nn.ReLU(),
            nn.Linear(config.hiden_size[2], config.embdsize),
            nn.ReLU()
        )
        # MultiheadAtt
        self.matt = MultiheadAtt(config.embdsize,config.num_heads)
        # mlp
        self.mlp = nn.Sequential(
            nn.ReLU(),
            nn.Linear(7*config.embdsize, 2*config.embdsize),
            nn.ReLU(),
            nn.Linear(2*config.embdsize, config.embdsize),
            nn.ReLU(),
            nn.Linear(config.embdsize, 1),
        )

    def forward(self,co_xi, co_xv, city_xi, city_xv, cat_token_tensors, exp_tensor, jd_tensor):
        # get the embedding vector of each feature field
        co_embd = self.co_embd_net(co_xi, co_xv) # - *11*config.embdsize
        ca_embd = self.ca_embd_net(city_xi, city_xv) # - *5*config.embdsize
        # -*9*312,-*1*312,-*1*312
        ca_token_embd, exp_embd, jd_embd = self.albert_emd_net(cat_token_tensors,exp_tensor,jd_tensor)
        # years local match
        years_a = co_embd[:, [3], :]
        years_b = co_embd[:, [4], :]
        years = self.match_years(years_a,years_b) # - * config.embdsize
        # edu local match
        edu_a = co_embd[:, [5], :]
        edu_b = co_embd[:, [6], :]
        edu = self.match_degree(edu_a,edu_b)
        # salary local match: indices 9 and 10 denote the job's salary bounds
        # (only index 9 is used below; index 10 is commented out)
        salary_a = co_embd[:, [7,8], :]
        salary_b = co_embd[:, [9], :]
        # salary_b = co_embd[:, [10], :]
        l = salary_a.size()[1]
        salary_b = salary_b.repeat(1,l,1)
        salary = self.match_salary(salary_a, salary_b)
        # city local match
        city_a = ca_embd[:, [0,1,2,3], :]
        city_b = ca_embd[:, [4], :]
        l = city_a.size()[1]
        city_b = city_b.repeat(1, l, 1)
        city = self.match_city(city_a,city_b)
        # industry local match
        industry_a = ca_token_embd[:, [0, 1, 2, 3], :]
        industry_b = ca_token_embd[:, [4], :]
        l = industry_a.size()[1]
        industry_b = industry_b.repeat(1, l, 1)
        industry = self.match_industry(industry_a, industry_b)
        # type local match
        type_a = ca_token_embd[:, [5, 6, 7], :]
        type_b = ca_token_embd[:, [8], :]
        l = type_a.size()[1]
        type_b = type_b.repeat(1, l, 1)
        type_match = self.match_type(type_a, type_b)  # avoid shadowing the built-in name `type`
        # text local match
        text = self.match_text(exp_embd,jd_embd)
        # project the text vectors into a lower dimension
        text_vec = torch.cat([industry, type_match, text],1).view(-1,3,self.config.albert_size)
        text_vec = self.fc(text_vec)
        # concatenate the local match vectors
        cat_vec = torch.cat([years, edu, salary, city, ], 1).view(-1,4,self.config.embdsize)
        feat_vec = torch.cat([cat_vec,text_vec],1) # - *7*config.embdsize
        # multi-head self-attention
        matt_vec, matt_weights = self.matt(feat_vec)
        # residual layer
        in_vec = (feat_vec + matt_vec).view(-1,7*self.config.embdsize)
        # input into the mlp
        pre = self.mlp(in_vec)
        return pre,matt_weights
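A hedged sketch of how the pieces fit together at inference time, assuming the project's DefaultConfig (commented out in the imports of the complete code below) provides the fields used above and the batch tensors come from the dataloader:

config = DefaultConfig()                                    # assumed project config
model_config = AlbertConfig.from_pretrained(config.albertpath)
model = MatchNet(config, model_config)
logits, att_weights = model(co_xi, co_xv, city_xi, city_xv,
                            cat_token_tensors, exp_tensor, jd_tensor)
prob = torch.sigmoid(logits)                                # matching probability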

Multi-head attention

class MultiheadAtt(nn.Module):
    '''
    Multi-head attention built on nn.MultiheadAttention, which also returns the attention weights.
    '''
    def __init__(self,ebd_size,num_heads):
        super(MultiheadAtt,self).__init__()
        self.matt = nn.MultiheadAttention(ebd_size,num_heads)
    def forward(self,x):
        x = x.transpose(0, 1)
        matt_out, matt_weights = self.matt(x, x, x)
        return matt_out.transpose(0,1),matt_weights
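The transposes are needed because nn.MultiheadAttention expects (seq_len, batch, embed_dim) by default; on newer PyTorch versions, passing batch_first=True to the constructor would remove both transposes.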

LocalMatch also acts as a pooling layer over the match vectors, supporting max, mean, and sum pooling; the pool argument must be set by the caller and defaults to max.

class LocalMatch(nn.Module):
    def __init__(self,insize,outsize,dropout):
        super(LocalMatch, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(insize, insize),
            # nn.Dropout(dropout),
            nn.PReLU(),
            nn.Linear(insize,outsize),
            nn.PReLU(),
        )
    def forward(self,a,b, pool='max'):
        # different settings of the pool argument may give different results
        c = torch.cat([a, b, a - b, a * b], dim=-1)
        # c = torch.cat([a, b], dim=-1)
        c = self.net(c)
        # c = c.unsqueeze(-1)
        if pool.lower() == 'max':
            c = c.max(dim=1).values
        elif pool.lower() == 'mean':
            c = c.mean(dim=1)
        elif pool.lower() == 'sum':
            c = c.sum(dim=1)
        return c
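A quick shape check with hypothetical sizes, showing what pooling over dim=1 produces:

a = torch.randn(2, 4, 16)                     # four resume-side field vectors
b = torch.randn(2, 1, 16).repeat(1, 4, 1)     # job-side vector repeated to match
lm = LocalMatch(4 * 16, 16, dropout=0.1)
out = lm(a, b, pool='mean')                   # (2, 16): pooled local match vector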

Complete code:

# for our model
import numpy as np
import torch
import torch.nn as nn
from model.albertdataset import load_csv, AlBertDataset
from torch.utils.data import DataLoader
from transformers import AlbertConfig, BertTokenizer, AlbertModel
# from config import DefaultConfig
# config = DefaultConfig()

class CoEmbdNet(nn.Module):
    '''
    Embedding lookup for continuous (numerical) features.
    '''
    def __init__(self, config):
        super(CoEmbdNet, self).__init__()
        self.config = config
        # for the continuous features
        self.co_emb = nn.Embedding(config.co_idx+1, config.embdsize)
        nn.init.xavier_uniform_(self.co_emb.weight)

    def forward(self, xi, xv):
        # for continuous features
        co_emb = self.co_emb(xi)
        co_value = torch.mul(co_emb, xv.unsqueeze(-1))
        return co_value

class CaEmbdNet(nn.Module):
    '''
    Embedding lookup for category features (only the city field here).
    '''
    def __init__(self, config):
        super(CaEmbdNet, self).__init__()
        self.config = config
        self.ca_idx = self.get_cat_size()
        # for the category features
        self.ca_emb = nn.Embedding(self.ca_idx+1, config.embdsize)
        nn.init.xavier_uniform_(self.ca_emb.weight)

    def get_cat_size(self):
        config = self.config
        with open(config.feature2idx_path,'r') as file:
            # the file stores a Python-literal dict; ast.literal_eval would be safer than eval
            dicts = eval(file.read())
        return len(dicts['city'])

    def forward(self, xi, xv):
        # for category features
        ca_emb = self.ca_emb(xi)
        ca_value = torch.mul(ca_emb, xv.unsqueeze(-1))
        return ca_value

class AlbertEmbd(nn.Module):
    '''
    Embed the text features with the ALBERT model from the transformers package.
    '''

    def __init__(self, config, model_config):
        super(AlbertEmbd, self).__init__()
        self.config = config
        self.embd = AlbertModel.from_pretrained(config.albertpath, config=model_config)

    def forward(self, token_tensor,exp_tensor,jd_tensor):
        embd_all = []
        for i in range(self.config.ca_num):
            embd = self.embd(token_tensor[:, i * self.config.cat_token_len:(i + 1) * self.config.cat_token_len])
            embd_all.append(embd[1])  # use the pooled [CLS] vector as the sentence representation for downstream layers
        cat_token_embd = torch.cat(embd_all, dim=-1)
        cat_token_embd = cat_token_embd.reshape(cat_token_embd.size()[0], self.config.ca_num, -1)
        exp_embd = self.embd(exp_tensor)
        jd_embd = self.embd(jd_tensor)
        return cat_token_embd,exp_embd[1].unsqueeze(1),jd_embd[1].unsqueeze(1)

class Co_FC(nn.Module):
    '''
    Feed the continuous-feature embeddings through a fully connected stack.
    '''
    def __init__(self,in_size, hiden_size, out_size):
        super(Co_FC, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_size,hiden_size[0]),
            nn.ReLU(),
            nn.Linear(hiden_size[0],hiden_size[1]),
            nn.ReLU(),
            nn.Linear(hiden_size[1],out_size)
        )
    def forward(self,x):
        x = self.fc(x)
        return x

class MatchNet(nn.Module):
    def __init__(self,config,model_config):
        super(MatchNet, self).__init__()
        self.config = config
        # embedding
        self.co_embd_net = CoEmbdNet(config)
        self.ca_embd_net = CaEmbdNet(config)
        self.albert_emd_net = AlbertEmbd(config,model_config)
        # local match each return a tensor(embdsize)
        self.match_years = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_degree = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_salary = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_city = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_industry = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        self.match_type = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        self.match_text = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        # project the text embeddings into a lower dimension
        self.fc = nn.Sequential(
            nn.Linear(config.albert_size, config.hiden_size[1]),
            nn.ReLU(),
            nn.Linear(config.hiden_size[1], config.hiden_size[2]),
            nn.ReLU(),
            nn.Linear(config.hiden_size[2], config.embdsize),
            nn.ReLU()
        )
        # MultiheadAtt
        self.matt = MultiheadAtt(config.embdsize,config.num_heads)
        # mlp
        self.mlp = nn.Sequential(
            nn.ReLU(),
            nn.Linear(7*config.embdsize, 2*config.embdsize),
            nn.ReLU(),
            nn.Linear(2*config.embdsize, config.embdsize),
            nn.ReLU(),
            nn.Linear(config.embdsize, 1),
        )

    def forward(self,co_xi, co_xv, city_xi, city_xv, cat_token_tensors, exp_tensor, jd_tensor):
        # get the embedding vector of each feature field
        co_embd = self.co_embd_net(co_xi, co_xv) # - *11*config.embdsize
        ca_embd = self.ca_embd_net(city_xi, city_xv) # - *5*config.embdsize
        # -*9*312,-*1*312,-*1*312
        ca_token_embd, exp_embd, jd_embd = self.albert_emd_net(cat_token_tensors,exp_tensor,jd_tensor)
        # years local match
        years_a = co_embd[:, [3], :]
        years_b = co_embd[:, [4], :]
        years = self.match_years(years_a,years_b) # - * config.embdsize
        # edu local match
        edu_a = co_embd[:, [5], :]
        edu_b = co_embd[:, [6], :]
        edu = self.match_degree(edu_a,edu_b)
        # salary local match: indices 9 and 10 denote the job's salary bounds
        # (only index 9 is used below; index 10 is commented out)
        salary_a = co_embd[:, [7,8], :]
        salary_b = co_embd[:, [9], :]
        # salary_b = co_embd[:, [10], :]
        l = salary_a.size()[1]
        salary_b = salary_b.repeat(1,l,1)
        salary = self.match_salary(salary_a, salary_b)
        # city local match
        city_a = ca_embd[:, [0,1,2,3], :]
        city_b = ca_embd[:, [4], :]
        l = city_a.size()[1]
        city_b = city_b.repeat(1, l, 1)
        city = self.match_city(city_a,city_b)
        # industry local match
        industry_a = ca_token_embd[:, [0, 1, 2, 3], :]
        industry_b = ca_token_embd[:, [4], :]
        l = industry_a.size()[1]
        industry_b = industry_b.repeat(1, l, 1)
        industry = self.match_industry(industry_a, industry_b)
        # type local match
        type_a = ca_token_embd[:, [5, 6, 7], :]
        type_b = ca_token_embd[:, [8], :]
        l = type_a.size()[1]
        type_b = type_b.repeat(1, l, 1)
        type_match = self.match_type(type_a, type_b)  # avoid shadowing the built-in name `type`
        # text local match
        text = self.match_text(exp_embd,jd_embd)
        # project the text vectors into a lower dimension
        text_vec = torch.cat([industry, type_match, text],1).view(-1,3,self.config.albert_size)
        text_vec = self.fc(text_vec)
        # concatenate the local match vectors
        cat_vec = torch.cat([years, edu, salary, city, ], 1).view(-1,4,self.config.embdsize)
        feat_vec = torch.cat([cat_vec,text_vec],1) # - *7*config.embdsize
        # multi-head self-attention
        matt_vec, matt_weights = self.matt(feat_vec)
        # residual layer
        in_vec = (feat_vec + matt_vec).view(-1,7*self.config.embdsize)
        # input into the mlp
        pre = self.mlp(in_vec)
        return pre,matt_weights

class MultiheadAtt(nn.Module):
    '''
    Multi-head attention built on nn.MultiheadAttention, which also returns the attention weights.
    '''
    def __init__(self,ebd_size,num_heads):
        super(MultiheadAtt,self).__init__()
        self.matt = nn.MultiheadAttention(ebd_size,num_heads)
    def forward(self,x):
        x = x.transpose(0, 1)
        matt_out, matt_weights = self.matt(x, x, x)
        return matt_out.transpose(0,1),matt_weights
        
class LocalMatch(nn.Module):
    def __init__(self,insize,outsize,dropout):
        super(LocalMatch, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(insize, insize),
            # nn.Dropout(dropout),
            nn.PReLU(),
            nn.Linear(insize,outsize),
            nn.PReLU(),
        )
    def forward(self,a,b, pool='max'):
        # different settings of the pool argument may give different results
        c = torch.cat([a, b, a - b, a * b], dim=-1)
        # c = torch.cat([a, b], dim=-1)
        c = self.net(c)
        # c = c.unsqueeze(-1)
        if pool.lower() == 'max':
            c = c.max(dim=1).values
        elif pool.lower() == 'mean':
            c = c.mean(dim=1)
        elif pool.lower() == 'sum':
            c = c.sum(dim=1)
        return c

    

3. Summary

The paper uses the person–job matching dataset from a Tianchi competition; you can download it and experiment with it yourself.

Model strength: it makes full use of numerical, categorical, and text features.

Model weakness: it does not exploit the graph-structured information between resumes and jobs.
