Python MUFFIN for Person–Job Matching: A Recommendation Algorithm (Self-Attentional Multi-Field Features Representation and Interaction Learning)


1. Recommendation Algorithms: A Brief Introduction to Person–Job Matching

Self-Attentional Multi-Field Features Representation and Interaction Learning for Person–Job Fit

Applied to person–job matching, recommendation algorithms mainly take two forms: recommending jobs to people and recommending people to jobs. Either way, the method and overall approach should be the same. This post only gives an overview of the paper, covering the authors' data and model design; for more detail, please go read the paper itself.

In Section III-B the authors lay out their approach. In the overview, they divide features into three types: continuous (numerical) features, categorical features, and (long) text features. Both resume and job-posting information are mapped onto these three feature types.

First, the numerical features are standardized and the categorical features are one-hot encoded, while the text features are split with a tokenizer. Then nn.Embedding() embeds the numerical and categorical features, and an ALBERT model embeds the text features (see Equations 1–3 in the paper).
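To make the preprocessing concrete, here is a minimal sketch of the three paths. The data values, embedding sizes, and the ALBERT checkpoint name are hypothetical placeholders, not from the paper:

import numpy as np
import torch
from transformers import BertTokenizer

# continuous features: z-score standardization (toy values)
years = np.array([1.0, 3.0, 5.0, 10.0])
years_std = (years - years.mean()) / years.std()

# categorical features: map each category to an index, then look it up with nn.Embedding
city2idx = {'beijing': 0, 'shanghai': 1, 'shenzhen': 2}
city_emb = torch.nn.Embedding(len(city2idx) + 1, 16)
city_vec = city_emb(torch.tensor([city2idx['shanghai']]))

# text features: tokenize into ids, which are later fed into ALBERT
tokenizer = BertTokenizer.from_pretrained('voidful/albert_chinese_tiny')  # placeholder checkpoint
ids = tokenizer('responsible for recommender-system R&D', padding='max_length',
                max_length=32, truncation=True, return_tensors='pt')['input_ids']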

Next comes the interaction layer across feature fields. It is actually quite simple: for field i, the interaction takes the resume-side feature, the job-side feature, plus their element-wise difference and element-wise product, concatenates these parts, passes them through a linear layer, and pools (e.g., averages) the result. That's it.
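Written out (in my own notation, reconstructed from the description above rather than the paper's exact symbols), the local match for field i is roughly:

$$z_i = \mathrm{pool}\big(\mathrm{MLP}([\,r_i;\; j_i;\; r_i - j_i;\; r_i \odot j_i\,])\big)$$

where r_i and j_i are the resume-side and job-side embeddings of field i, and ⊙ is element-wise multiplication.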

The "external" attention over features here means attention across the different fields: treat the series of vectors h_i as a sequence, and it looks very much like the Transformer's attention mechanism, except without masking or positional encoding; only the computation is similar.

After the attention computation there is also a residual connection, i.e. \hat{h} + h; the resulting z_i is the final per-field output.
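A minimal sketch of this step, with hypothetical sizes; batch_first=True is equivalent to the transpose-based version in the code further down:

import torch
import torch.nn as nn

h = torch.randn(8, 7, 64)                     # 7 local match vectors h_i per sample (hypothetical dims)
att = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
h_hat, weights = att(h, h, h)                 # self-attention over fields: no mask, no positional encoding
z = h + h_hat                                 # residual connection: z_i = h_i + hat(h_i)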

The z_i from the different fields are concatenated into d, a vector representing the entire resume/job information, which then passes through fully connected layers with ReLU and finally a sigmoid layer, yielding the binary classification probability.

The loss function is the standard cross-entropy loss; nothing more to say there.
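One practical note: MatchNet below ends in a plain Linear(embdsize, 1) with no sigmoid, so a training loop would pair the raw logits with BCEWithLogitsLoss, which applies the sigmoid internally. A hedged sketch, with the model, batch tensors, and labels assumed to come from the project's dataloader:

criterion = nn.BCEWithLogitsLoss()
logits, att_weights = model(co_xi, co_xv, city_xi, city_xv, cat_tokens, exp_ids, jd_ids)
loss = criterion(logits.squeeze(-1), labels.float())
prob = torch.sigmoid(logits)                  # the matching probability described above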

2. Model Code

Embedding for continuous features

class CoEmbdNet(nn.Module):
    '''
    Embedding lookup for continuous (numerical) features.
    '''
    def __init__(self, config):
        super(CoEmbdNet, self).__init__()
        self.config = config
        # for the continuous features
        self.co_emb = nn.Embedding(config.co_idx+1, config.embdsize)
        nn.init.xavier_uniform_(self.co_emb.weight)

    def forward(self, xi, xv):
        # for continuous features
        co_emb = self.co_emb(xi)
        co_value = torch.mul(co_emb, xv.unsqueeze(-1))
        return co_value
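The xi/xv convention here is the feature-index/feature-value split familiar from DeepFM-style implementations: xi holds the slot index of each continuous feature and xv its standardized value, which scales the looked-up embedding. A toy smoke test (the config values are made up, and torch is assumed imported as in the complete code at the end):

class _Cfg: co_idx, embdsize = 11, 16         # hypothetical: 11 continuous slots, 16-dim embeddings
net = CoEmbdNet(_Cfg())
xi = torch.arange(11).unsqueeze(0)            # feature indices, shape (batch=1, 11)
xv = torch.rand(1, 11)                        # standardized feature values
out = net(xi, xv)                             # (1, 11, 16): embedding of slot i scaled by x_i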

Embedding for categorical features

class CaEmbdNet(nn.Module):
    '''
    Embedding lookup for category features (only the city field here).
    '''
    def __init__(self, config):
        super(CaEmbdNet, self).__init__()
        self.config = config
        self.ca_idx = self.get_cat_size()
        # for the category features
        self.ca_emb = nn.Embedding(self.ca_idx+1, config.embdsize)
        nn.init.xavier_uniform_(self.ca_emb.weight)

    def get_cat_size(self):
        config = self.config
        with open(config.feature2idx_path,'r') as file:
            # the file stores a Python-literal dict; ast.literal_eval would be safer than eval
            dicts = eval(file.read())
        return len(dicts['city'])

    def forward(self, xi, xv):
        # for category features
        ca_emb = self.ca_emb(xi)
        ca_value = torch.mul(ca_emb, xv.unsqueeze(-1))
        return ca_value
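For reference, get_cat_size expects feature2idx_path to contain a Python-literal dict. A minimal, purely hypothetical example of the file's contents:

{'city': {'beijing': 0, 'shanghai': 1, 'shenzhen': 2}}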

Text features, embedded with ALBERT

class AlbertEmbd(nn.Module):
    '''
    Embed the text features with the ALBERT model from the transformers package.
    '''

    def __init__(self, config, model_config):
        super(AlbertEmbd, self).__init__()
        self.config = config
        self.embd = AlbertModel.from_pretrained(config.albertpath, config=model_config)

    def forward(self, token_tensor,exp_tensor,jd_tensor):
        embd_all = []
        for i in range(self.config.ca_num):
            embd = self.embd(token_tensor[:, i * self.config.cat_token_len:(i + 1) * self.config.cat_token_len])
            embd_all.append(embd[1])  # use the pooled [CLS] vector as the sentence representation for downstream layers
        cat_token_embd = torch.cat(embd_all, dim=-1)
        cat_token_embd = cat_token_embd.reshape(cat_token_embd.size()[0], self.config.ca_num, -1)
        exp_embd = self.embd(exp_tensor)
        jd_embd = self.embd(jd_tensor)
        return cat_token_embd,exp_embd[1].unsqueeze(1),jd_embd[1].unsqueeze(1)
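Note: embd[1] is ALBERT's pooled output (a transform of the [CLS] hidden state); in recent transformers versions the same tensor is also available as embd.pooler_output.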

Fully connected layer for the continuous features

class Co_FC(nn.Module):
    '''
    Feed the continuous-feature embeddings through a fully connected stack.
    '''
    def __init__(self,in_size, hiden_size, out_size):
        super(Co_FC, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_size,hiden_size[0]),
            nn.ReLU(),
            nn.Linear(hiden_size[0],hiden_size[1]),
            nn.ReLU(),
            nn.Linear(hiden_size[1],out_size)
        )
    def forward(self,x):
        x = self.fc(x)
        return x
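As far as I can tell, Co_FC is defined here but never instantiated in MatchNet below; the low-dimensional projection of the text vectors is handled by MatchNet's own self.fc instead.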

The full matching network

class MatchNet(nn.Module):
    def __init__(self,config,model_config):
        super(MatchNet, self).__init__()
        self.config = config
        # embedding
        self.co_embd_net = CoEmbdNet(config)
        self.ca_embd_net = CaEmbdNet(config)
        self.albert_emd_net = AlbertEmbd(config,model_config)
        # local match each return a tensor(embdsize)
        self.match_years = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_degree = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_salary = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_city = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_industry = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        self.match_type = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        self.match_text = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        # project the text embeddings into a lower dimension
        self.fc = nn.Sequential(
            nn.Linear(config.albert_size, config.hiden_size[1]),
            nn.ReLU(),
            nn.Linear(config.hiden_size[1], config.hiden_size[2]),
            nn.ReLU(),
            nn.Linear(config.hiden_size[2], config.embdsize),
            nn.ReLU()
        )
        # MultiheadAtt
        self.matt = MultiheadAtt(config.embdsize,config.num_heads)
        # mlp
        self.mlp = nn.Sequential(
            nn.ReLU(),
            nn.Linear(7*config.embdsize, 2*config.embdsize),
            nn.ReLU(),
            nn.Linear(2*config.embdsize, config.embdsize),
            nn.ReLU(),
            nn.Linear(config.embdsize, 1),
        )

    def forward(self,co_xi, co_xv, city_xi, city_xv, cat_token_tensors, exp_tensor, jd_tensor):
        # get the embedding vector of each feature field
        co_embd = self.co_embd_net(co_xi, co_xv) # - *11*config.embdsize
        ca_embd = self.ca_embd_net(city_xi, city_xv) # - *5*config.embdsize
        # -*9*312,-*1*312,-*1*312
        ca_token_embd, exp_embd, jd_embd = self.albert_emd_net(cat_token_tensors,exp_tensor,jd_tensor)
        # years local match
        years_a = co_embd[:, [3], :]
        years_b = co_embd[:, [4], :]
        years = self.match_years(years_a,years_b) # - * config.embdsize
        # edu local match
        edu_a = co_embd[:, [5], :]
        edu_b = co_embd[:, [6], :]
        edu = self.match_degree(edu_a,edu_b)
        # salary local match: indices 9 and 10 denote the job's salary bounds
        # (only index 9 is used below; index 10 is commented out)
        salary_a = co_embd[:, [7,8], :]
        salary_b = co_embd[:, [9], :]
        # salary_b = co_embd[:, [10], :]
        l = salary_a.size()[1]
        salary_b = salary_b.repeat(1,l,1)
        salary = self.match_salary(salary_a, salary_b)
        # city local match
        city_a = ca_embd[:, [0,1,2,3], :]
        city_b = ca_embd[:, [4], :]
        l = city_a.size()[1]
        city_b = city_b.repeat(1, l, 1)
        city = self.match_city(city_a,city_b)
        # industry local match
        industry_a = ca_token_embd[:, [0, 1, 2, 3], :]
        industry_b = ca_token_embd[:, [4], :]
        l = industry_a.size()[1]
        industry_b = industry_b.repeat(1, l, 1)
        industry = self.match_industry(industry_a, industry_b)
        # type local match
        type_a = ca_token_embd[:, [5, 6, 7], :]
        type_b = ca_token_embd[:, [8], :]
        l = type_a.size()[1]
        type_b = type_b.repeat(1, l, 1)
        type_match = self.match_type(type_a, type_b)  # avoid shadowing the built-in name `type`
        # text local match
        text = self.match_text(exp_embd,jd_embd)
        # project the text vectors into a lower dimension
        text_vec = torch.cat([industry, type_match, text],1).view(-1,3,self.config.albert_size)
        text_vec = self.fc(text_vec)
        # concatenate the local match vectors
        cat_vec = torch.cat([years, edu, salary, city, ], 1).view(-1,4,self.config.embdsize)
        feat_vec = torch.cat([cat_vec,text_vec],1) # - *7*config.embdsize
        # multi-head self-attention
        matt_vec, matt_weights = self.matt(feat_vec)
        # residual layer
        in_vec = (feat_vec + matt_vec).view(-1,7*self.config.embdsize)
        # input into the mlp
        pre = self.mlp(in_vec)
        return pre,matt_weights
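A hedged sketch of how the pieces fit together at inference time, assuming the project's DefaultConfig (commented out in the imports of the complete code below) provides the fields used above and the batch tensors come from the dataloader:

config = DefaultConfig()                                    # assumed project config
model_config = AlbertConfig.from_pretrained(config.albertpath)
model = MatchNet(config, model_config)
logits, att_weights = model(co_xi, co_xv, city_xi, city_xv,
                            cat_token_tensors, exp_tensor, jd_tensor)
prob = torch.sigmoid(logits)                                # matching probability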

Multi-head attention

class MultiheadAtt(nn.Module):
    '''
    Multi-head attention built on nn.MultiheadAttention, which also returns the attention weights.
    '''
    def __init__(self,ebd_size,num_heads):
        super(MultiheadAtt,self).__init__()
        self.matt = nn.MultiheadAttention(ebd_size,num_heads)
    def forward(self,x):
        x = x.transpose(0, 1)
        matt_out, matt_weights = self.matt(x, x, x)
        return matt_out.transpose(0,1),matt_weights
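The transposes are needed because nn.MultiheadAttention expects (seq_len, batch, embed_dim) by default; on newer PyTorch versions, passing batch_first=True to the constructor would remove both transposes.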

LocalMatch also acts as a pooling layer over the match vectors, supporting max, mean, and sum pooling; the pool argument must be set by the caller and defaults to max.

class LocalMatch(nn.Module):
    def __init__(self,insize,outsize,dropout):
        super(LocalMatch, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(insize, insize),
            # nn.Dropout(dropout),
            nn.PReLU(),
            nn.Linear(insize,outsize),
            nn.PReLU(),
        )
    def forward(self,a,b, pool='max'):
        # different settings of the pool argument may give different results
        c = torch.cat([a, b, a - b, a * b], dim=-1)
        # c = torch.cat([a, b], dim=-1)
        c = self.net(c)
        # c = c.unsqueeze(-1)
        if pool.lower() == 'max':
            c = c.max(dim=1).values
        elif pool.lower() == 'mean':
            c = c.mean(dim=1)
        elif pool.lower() == 'sum':
            c = c.sum(dim=1)
        return c
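A quick shape check with hypothetical sizes, showing what pooling over dim=1 produces:

a = torch.randn(2, 4, 16)                     # four resume-side field vectors
b = torch.randn(2, 1, 16).repeat(1, 4, 1)     # job-side vector repeated to match
lm = LocalMatch(4 * 16, 16, dropout=0.1)
out = lm(a, b, pool='mean')                   # (2, 16): pooled local match vector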

Complete code:

# for our model
import numpy as np
import torch
import torch.nn as nn
from model.albertdataset import load_csv, AlBertDataset
from torch.utils.data import DataLoader
from transformers import AlbertConfig, BertTokenizer, AlbertModel
# from config import DefaultConfig
# config = DefaultConfig()

class CoEmbdNet(nn.Module):
    '''
    Embedding lookup for continuous (numerical) features.
    '''
    def __init__(self, config):
        super(CoEmbdNet, self).__init__()
        self.config = config
        # for the continuous features
        self.co_emb = nn.Embedding(config.co_idx+1, config.embdsize)
        nn.init.xavier_uniform_(self.co_emb.weight)

    def forward(self, xi, xv):
        # for continuous features
        co_emb = self.co_emb(xi)
        co_value = torch.mul(co_emb, xv.unsqueeze(-1))
        return co_value

class CaEmbdNet(nn.Module):
    '''
    Embedding lookup for category features (only the city field here).
    '''
    def __init__(self, config):
        super(CaEmbdNet, self).__init__()
        self.config = config
        self.ca_idx = self.get_cat_size()
        # for the category features
        self.ca_emb = nn.Embedding(self.ca_idx+1, config.embdsize)
        nn.init.xavier_uniform_(self.ca_emb.weight)

    def get_cat_size(self):
        config = self.config
        with open(config.feature2idx_path,'r') as file:
            # the file stores a Python-literal dict; ast.literal_eval would be safer than eval
            dicts = eval(file.read())
        return len(dicts['city'])

    def forward(self, xi, xv):
        # for category features
        ca_emb = self.ca_emb(xi)
        ca_value = torch.mul(ca_emb, xv.unsqueeze(-1))
        return ca_value

class AlbertEmbd(nn.Module):
    '''
    Embed the text features with the ALBERT model from the transformers package.
    '''

    def __init__(self, config, model_config):
        super(AlbertEmbd, self).__init__()
        self.config = config
        self.embd = AlbertModel.from_pretrained(config.albertpath, config=model_config)

    def forward(self, token_tensor,exp_tensor,jd_tensor):
        embd_all = []
        for i in range(self.config.ca_num):
            embd = self.embd(token_tensor[:, i * self.config.cat_token_len:(i + 1) * self.config.cat_token_len])
            embd_all.append(embd[1])  # use the pooled [CLS] vector as the sentence representation for downstream layers
        cat_token_embd = torch.cat(embd_all, dim=-1)
        cat_token_embd = cat_token_embd.reshape(cat_token_embd.size()[0], self.config.ca_num, -1)
        exp_embd = self.embd(exp_tensor)
        jd_embd = self.embd(jd_tensor)
        return cat_token_embd,exp_embd[1].unsqueeze(1),jd_embd[1].unsqueeze(1)

class Co_FC(nn.Module):
    '''
    Feed the continuous-feature embeddings through a fully connected stack.
    '''
    def __init__(self,in_size, hiden_size, out_size):
        super(Co_FC, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_size,hiden_size[0]),
            nn.ReLU(),
            nn.Linear(hiden_size[0],hiden_size[1]),
            nn.ReLU(),
            nn.Linear(hiden_size[1],out_size)
        )
    def forward(self,x):
        x = self.fc(x)
        return x

class MatchNet(nn.Module):
    def __init__(self,config,model_config):
        super(MatchNet, self).__init__()
        self.config = config
        # embedding
        self.co_embd_net = CoEmbdNet(config)
        self.ca_embd_net = CaEmbdNet(config)
        self.albert_emd_net = AlbertEmbd(config,model_config)
        # local match each return a tensor(embdsize)
        self.match_years = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_degree = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_salary = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_city = LocalMatch(4 * config.embdsize, config.embdsize, dropout=config.local_dropout)
        self.match_industry = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        self.match_type = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        self.match_text = LocalMatch(4 * config.albert_size, config.albert_size, dropout=config.local_dropout)
        # project the text embeddings into a lower dimension
        self.fc = nn.Sequential(
            nn.Linear(config.albert_size, config.hiden_size[1]),
            nn.ReLU(),
            nn.Linear(config.hiden_size[1], config.hiden_size[2]),
            nn.ReLU(),
            nn.Linear(config.hiden_size[2], config.embdsize),
            nn.ReLU()
        )
        # MultiheadAtt
        self.matt = MultiheadAtt(config.embdsize,config.num_heads)
        # mlp
        self.mlp = nn.Sequential(
            nn.ReLU(),
            nn.Linear(7*config.embdsize, 2*config.embdsize),
            nn.ReLU(),
            nn.Linear(2*config.embdsize, config.embdsize),
            nn.ReLU(),
            nn.Linear(config.embdsize, 1),
        )

    def forward(self,co_xi, co_xv, city_xi, city_xv, cat_token_tensors, exp_tensor, jd_tensor):
        # get the embedding vector of each feature field
        co_embd = self.co_embd_net(co_xi, co_xv) # - *11*config.embdsize
        ca_embd = self.ca_embd_net(city_xi, city_xv) # - *5*config.embdsize
        # -*9*312,-*1*312,-*1*312
        ca_token_embd, exp_embd, jd_embd = self.albert_emd_net(cat_token_tensors,exp_tensor,jd_tensor)
        # years local match
        years_a = co_embd[:, [3], :]
        years_b = co_embd[:, [4], :]
        years = self.match_years(years_a,years_b) # - * config.embdsize
        # edu local match
        edu_a = co_embd[:, [5], :]
        edu_b = co_embd[:, [6], :]
        edu = self.match_degree(edu_a,edu_b)
        # salary local match: indices 9 and 10 denote the job's salary bounds
        # (only index 9 is used below; index 10 is commented out)
        salary_a = co_embd[:, [7,8], :]
        salary_b = co_embd[:, [9], :]
        # salary_b = co_embd[:, [10], :]
        l = salary_a.size()[1]
        salary_b = salary_b.repeat(1,l,1)
        salary = self.match_salary(salary_a, salary_b)
        # city local match
        city_a = ca_embd[:, [0,1,2,3], :]
        city_b = ca_embd[:, [4], :]
        l = city_a.size()[1]
        city_b = city_b.repeat(1, l, 1)
        city = self.match_city(city_a,city_b)
        # industry local match
        industry_a = ca_token_embd[:, [0, 1, 2, 3], :]
        industry_b = ca_token_embd[:, [4], :]
        l = industry_a.size()[1]
        industry_b = industry_b.repeat(1, l, 1)
        industry = self.match_industry(industry_a, industry_b)
        # type local match
        type_a = ca_token_embd[:, [5, 6, 7], :]
        type_b = ca_token_embd[:, [8], :]
        l = type_a.size()[1]
        type_b = type_b.repeat(1, l, 1)
        type_match = self.match_type(type_a, type_b)  # avoid shadowing the built-in name `type`
        # text local match
        text = self.match_text(exp_embd,jd_embd)
        # project the text vectors into a lower dimension
        text_vec = torch.cat([industry, type_match, text],1).view(-1,3,self.config.albert_size)
        text_vec = self.fc(text_vec)
        # concatenate the local match vectors
        cat_vec = torch.cat([years, edu, salary, city, ], 1).view(-1,4,self.config.embdsize)
        feat_vec = torch.cat([cat_vec,text_vec],1) # - *7*config.embdsize
        # multi-head self-attention
        matt_vec, matt_weights = self.matt(feat_vec)
        # residual layer
        in_vec = (feat_vec + matt_vec).view(-1,7*self.config.embdsize)
        # input into the mlp
        pre = self.mlp(in_vec)
        return pre,matt_weights

class MultiheadAtt(nn.Module):
    '''
    Multi-head attention built on nn.MultiheadAttention, which also returns the attention weights.
    '''
    def __init__(self,ebd_size,num_heads):
        super(MultiheadAtt,self).__init__()
        self.matt = nn.MultiheadAttention(ebd_size,num_heads)
    def forward(self,x):
        x = x.transpose(0, 1)
        matt_out, matt_weights = self.matt(x, x, x)
        return matt_out.transpose(0,1),matt_weights
        
class LocalMatch(nn.Module):
    def __init__(self,insize,outsize,dropout):
        super(LocalMatch, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(insize, insize),
            # nn.Dropout(dropout),
            nn.PReLU(),
            nn.Linear(insize,outsize),
            nn.PReLU(),
        )
    def forward(self,a,b, pool='max'):
        # different settings of the pool argument may give different results
        c = torch.cat([a, b, a - b, a * b], dim=-1)
        # c = torch.cat([a, b], dim=-1)
        c = self.net(c)
        # c = c.unsqueeze(-1)
        if pool.lower() == 'max':
            c = c.max(dim=1).values
        elif pool.lower() == 'mean':
            c = c.mean(dim=1)
        elif pool.lower() == 'sum':
            c = c.sum(dim=1)
        return c

    

3. Summary

The paper uses the person–job matching dataset from a Tianchi competition; you can download it and experiment with it yourself.

Model strength: it makes full use of numerical, categorical, and text features.

Model weakness: it does not exploit the graph-structured information between resumes and jobs.
