Reading Notes on Tag Recommendation

czlm爱你的笑

Article link: https://blog.csdn.net/lianwaiyuwusheng/article/details/109540572

2006-Information retrieval in folksonomies: Search and ranking

Citations: 1239. Source: European Semantic Web Conference

FolkRank takes into account the folksonomy structure when ranking search requests in folksonomy-based systems. The algorithm is used for two purposes: determining an overall ranking, and computing topic-specific rankings.

The word "folksonomy" is a blend of "taxonomy" and "folk", and stands for conceptual structures created by the people. Once a user is logged in, he can add a resource to the system and assign arbitrary tags to it.
Problem: the displayed resources are usually ordered by date; a "relevance" ranking is still missing.
The basic notion is that a resource which is tagged with important tags by important users becomes important itself.
- PageRank: $w\leftarrow dAw+(1-d)p$
- Adapted PageRank:
  $w\leftarrow \alpha w+\beta Aw+\gamma p$
  $\alpha$ slows down the convergence rate.
- FolkRank uses a differential approach, which compares the resulting rankings with and without a preference vector: $w_0$ is computed with $\beta=1$, $w_1$ with $\beta<1$, and the final weight is $w=w_1-w_0$. Method: in the preference vector, a single entry or a small set of entries is set to a high value, and the remaining weight is equally distributed over the other entries.
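
The differential ranking can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function names, the toy four-node graph, the column-normalized matrix `A`, and the choice of α = 0 with γ = 1 − β are all assumptions made for the example.

```python
def adapted_pagerank(A, p, alpha, beta, gamma, iters=200):
    """Iterate w <- alpha*w + beta*A*w + gamma*p, starting from uniform weights."""
    n = len(p)
    w = [1.0 / n] * n
    for _ in range(iters):
        Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        w = [alpha * w[i] + beta * Aw[i] + gamma * p[i] for i in range(n)]
    return w


def folkrank(A, p, beta=0.7):
    """Differential ranking: weights with preference minus weights without."""
    w0 = adapted_pagerank(A, p, alpha=0.0, beta=1.0, gamma=0.0)          # no preference
    w1 = adapted_pagerank(A, p, alpha=0.0, beta=beta, gamma=1.0 - beta)  # with preference
    return [b - a for a, b in zip(w0, w1)]


# Toy undirected graph (hypothetical): a triangle 0-1-2 plus node 3 attached to node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
adj = [[0.0] * n for _ in range(n)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1.0
degree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
# Column-normalize so that each node spreads its weight evenly over its neighbours.
A = [[adj[i][j] / degree[j] for j in range(n)] for i in range(n)]

# Preference vector: most of the weight on node 3, the rest spread equally.
p = [0.1, 0.1, 0.1, 0.7]
scores = folkrank(A, p)
best = max(range(n), key=lambda i: scores[i])
```

Because both runs conserve the total weight, the differential scores sum to zero; nodes close to the preferred entry gain weight and the rest lose it.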

2010-Pairwise interaction tensor factorization for personalized tag recommendation

Citations: 662. Source: Proceedings of the Third ACM International Conference on Web Search and Data Mining

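Problem: the best tag recommendation model at the time, Tucker decomposition, has cubic time complexity. The paper proposes PITF (Pairwise Interaction Tensor Factorization), which runs in linear time while still performing well, and optimizes it with the BPR algorithm.
Personalized recommendation outperforms the upper bound of non-personalized models.
Problem statement: given historical observations $S\subseteq U\times I\times T$, a post is defined as a pair $(u, i)$, and tag recommendation means recommending a list of tags for a post. Equivalently, the model outputs a score for each $(u, i, t)$ and recommends the highest-scoring tags.
Pairwise ranking is used: $D_S:=\{(u,i,t_A,t_B):(u,i,t_A)\in S\land (u,i,t_B)\notin S\}$. Because this set is large, only a sample of it is used for training.
Factorization models:
- Tucker Decomposition: $\hat Y^{TD}:=\hat C\times\hat U\times\hat I\times\hat T$, that is, $\hat y_{u,i,t}^{TD}=\sum_{\bar u}\sum_{\bar i}\sum_{\bar t}\hat c_{\bar u,\bar i,\bar t}\cdot\hat u_{u,\bar u}\cdot\hat i_{i,\bar i}\cdot\hat t_{t,\bar t}$.
- Canonical Decomposition: $\hat Y^{CD}:=\hat U\times\hat I\times\hat T$, that is, $\hat y_{u,i,t}^{CD}=\sum_{f=1}^{k}\hat u_{u,f}\cdot\hat i_{i,f}\cdot\hat t_{t,f}$.
- PITF: $\hat y_{u,i,t}=\sum_f\hat u_{u,f}^T\cdot\hat t_{t,f}^U+\sum_f\hat i_{i,f}^T\cdot\hat t_{t,f}^I+\sum_f\hat u_{u,f}^I\cdot\hat i_{i,f}^U$. The user-item term is then dropped, since it is the same for every tag of a given post and does not change the ranking.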
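
As a rough illustration of how PITF scores and ranks tags for one post, here is a minimal Python sketch. The factor values, tag names, and helper functions are hypothetical; in the paper the factors are learned with BPR rather than set by hand.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pitf_score(user, item, tag_user, tag_item):
    """y(u, i, t) = <u, t^U> + <i, t^I>; the user-item term is dropped."""
    return dot(user, tag_user) + dot(item, tag_item)


def recommend(user, item, tag_factors, top_n=2):
    """Rank all tags for the post (user, item) and return the top_n tag names."""
    scores = {
        tag: pitf_score(user, item, t_u, t_i)
        for tag, (t_u, t_i) in tag_factors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical factors with k = 2 latent dimensions.
user = [1.0, 0.0]
item = [0.0, 1.0]
tag_factors = {
    "python": ([0.9, 0.0], [0.0, 0.8]),  # (t^U, t^I)
    "java": ([0.1, 0.0], [0.0, 0.2]),
    "web": ([0.5, 0.5], [0.5, 0.5]),
}
top_tags = recommend(user, item, tag_factors)
```

Scoring one tag costs O(k) for k latent dimensions, which is where the linear runtime comes from.
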
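Evaluation: runtime and prediction quality (F1 score).
Datasets: BibSonomy, Last.fm, ECML/PKDD Discovery Challenge 2009; one post per user is held out as the test set.
Experiments: the data is split 10 times and performance is averaged; parameters are chosen on the first split.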

2016-Hashtag recommendation using attention-based convolutional neural network

Citations: 132. Source: IJCAI

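Problem: most methods rely on hand-crafted features; this paper uses a CNN, trigger words, and attention.
Method: hashtag recommendation is treated as a multi-class classification problem. Following the trigger-word assumption, the model has a global channel and a local channel: the global channel encodes all words, the local channel encodes only the trigger words, and a convolutional layer then combines the two feature sets, $\hat h=\tanh(M*v[h_g;h_l])+b$.
- Local Attention Channel: words are represented by word embeddings. Given a window size h (5) and a threshold $\eta$, an attention layer extracts the sequence of trigger words. The score of the word at the center of a window is $s_{(2i+h-1)/2}=g(M^1*w_{i:i+h}+b)$; a word whose score exceeds the threshold is extracted as a trigger word. The threshold is $\eta=\delta\cdot\min\{s\}+(1-\delta)\cdot\max\{s\}$ (0.8). Features are then extracted from the trigger words: $z=g(M^1*folding(\hat w)+b)$, where folding sums all trigger words.
- Global Channel: a convolutional layer extracts the features $z_i=g(M^g\cdot w_{i:i+l-1}+b)$, which form the vector $z=[z_1,\dots,z_{n-l+1}]$; max pooling then keeps the single largest feature value. The paper uses multiple filters (100) and window sizes (1, 2, 3) to extract multiple features.
- Training: the two channels are trained jointly with the objective $J=\sum_{(m,a)\in D}-\log p(a|m)$, where a is a hashtag of microblog m.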
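
The trigger-word selection rule in the local channel can be illustrated with a short Python sketch. The attention scores below are made up, the function name is hypothetical, and reading the note's "0.8" as the value of δ is an assumption.

```python
def select_trigger_words(words, scores, delta=0.8):
    """Keep words whose attention score exceeds eta = delta*min(s) + (1-delta)*max(s)."""
    eta = delta * min(scores) + (1 - delta) * max(scores)
    triggers = [w for w, s in zip(words, scores) if s > eta]
    return eta, triggers


# Hypothetical attention scores for a four-word microblog.
words = ["new", "iphone", "launch", "today"]
scores = [0.1, 0.9, 0.5, 0.2]
eta, triggers = select_trigger_words(words, scores)
```

With δ close to 1 the threshold sits near the minimum score, so most words above the weakest ones are kept; a smaller δ makes the selection stricter.
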
Dataset: 110,000 microblogs. Evaluation metrics: Precision (P), Recall (R), and F-score (F1).

2016-ConTagNet: Exploiting user context for image tag recommendation

Citations: 42. Source: Proceedings of the 24th ACM International Conference on Multimedia

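Method: tag recommendation is treated as a multi-label classification problem. Two networks, a CNN (AlexNet, for image information) and a NN (ContextNet, for user context), extract their own features, which are concatenated and passed through fully connected layers to produce a prediction for each tag.
- Image features: extracted by AlexNet.
- User context features: the time and place at which the photo was taken, combined into a 6-dimensional feature vector.
[Figure: model architecture]

Experiments: YFCC100M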

2017-Hashtag Recommendation for Multimodal Microblog Using Co-Attention Network

Citations: 61. Source: IJCAI

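Problem: many methods use only text; this paper proposes a co-attention network that combines text and images.
Method: given a tweet containing an image and text, output the probability of each hashtag.
- Feature extraction: VGG16 and an LSTM produce the image and text feature sets, respectively.
  - Image: the image is resized to 224 and divided into $m=N\times N$ regions; a pretrained VGG16 extracts a feature for each region, and a single layer maps it to the same dimension as the text features: $v_I=\{v_i|v_i\in R^d,i=1\dots m\}$.
  - Text: words are one-hot encoded and passed through an embedding layer to give $x_i$; the text is $t=\{x_1,\dots,x_T\}$, where T is the maximum text length. An LSTM produces the hidden state $h_t$ at each time step, giving the text features $v_T=\{h_i|h_i\in R^d,i=1\dots T\}$.
- Co-Attention Network: text is more important than the image. Average pooling over the text features gives the tweet feature $v_t=h_{ave}$; tweet-guided visual attention then produces the weights $p_I=softmax(W_{PI}tanh(W_{vI}v_I\odot W_{v_t}v_t)+b_{PI})$ and the attended image feature $\tilde v_I=\sum_ip_iv_i$. Conversely, the new image feature drives image-guided textual attention to produce the new text feature $\tilde v_T$.
- Stacked Co-attention network: to capture more complex relations, the co-attention network is applied iteratively to the newly produced features. The new query vector (for the attention mechanism) is the newly produced feature vector plus the previous one: $q_I^k=\tilde v_I^k+q_I^{k-1},q_I^1=\tilde v_I$.
- Prediction: a single softmax layer takes $f=\tilde v_I+\tilde v_T$ as input; the predicted score of a hashtag is $P(y^d=a|h^d,\theta)=\frac{\exp(\theta^{(a)T}(W_ff+b_f))}{\sum_{j\in A} \exp(\theta^{(j)T}(W_ff+b_f))}$. The objective is $J=\sum_{(h_p,t_p)\in S}-\log p(h_p|t_p;\theta)$.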
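
The two attention steps can be sketched as follows. This is a simplified illustration: it scores each feature by a plain dot product with the guiding vector instead of the learned layers $W_{vI}$, $W_{v_t}$, $W_{PI}$ used in the paper, and all feature values and function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(features):
    """Average a list of equal-length vectors."""
    return [sum(col) / len(features) for col in zip(*features)]


def guided_attention(features, query):
    """Weight each feature by softmax(<feature, query>); return weights and weighted sum."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    pooled = [sum(w * feat[d] for w, feat in zip(weights, features))
              for d in range(len(query))]
    return weights, pooled


# Hypothetical 2-d features: three image regions and two LSTM word states.
image_regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_states = [[2.0, 0.0], [1.0, 0.0]]

v_t = mean_pool(text_states)                            # tweet feature
p_I, v_I_att = guided_attention(image_regions, v_t)     # tweet-guided visual attention
p_T, v_T_att = guided_attention(text_states, v_I_att)   # image-guided textual attention
f = [a + b for a, b in zip(v_I_att, v_T_att)]           # input to the softmax classifier
```

The image region most aligned with the tweet feature receives the largest weight, and the attended image feature then decides which words matter most.
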
Dataset: crawled from Twitter. Evaluation metrics: Precision (P), Recall (R), and F-score (F1).

2017-A survey on tag recommendation methods

Citations: 33. Source: Journal of the Association for Information Science and Technology

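Goals of recommendation: the most basic is relevance between tags and items; recommendations should also offer novelty and diversity.
- Relevance: how well a tag describes the content; issues such as synonyms can arise.
- Novelty: how different the recommended tags are from tags already seen, at two levels: how rare and how dissimilar. This can pleasantly surprise users, but it can also be noise.
  - Related to this is specificity: how precisely a tag describes the content; some works express it as an inverse function of the number of times the tag occurs in items.
  - There is also exhaustivity: how well the item's main topics are covered.
- Diversity: how different the recommended tags are from one another. Ways to evaluate it:
  - the average semantic distance between the recommended tags
  - how well the topics related to the item are covered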
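
The first diversity measure, average pairwise semantic distance, can be sketched in Python. The tag embeddings are hypothetical, and cosine distance is one possible choice of semantic distance assumed here for illustration; the survey does not prescribe it.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_pairwise_distance(tags, embedding):
    """Mean cosine distance over all unordered pairs of recommended tags."""
    pairs = list(combinations(tags, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(embedding[a], embedding[b]) for a, b in pairs) / len(pairs)


# Hypothetical 2-d tag embeddings: "cat" and "feline" are near-synonyms.
embedding = {"cat": [1.0, 0.0], "feline": [1.0, 0.0], "car": [0.0, 1.0]}
redundant = average_pairwise_distance(["cat", "feline"], embedding)
diverse = average_pairwise_distance(["cat", "feline", "car"], embedding)
```

A list made only of synonyms scores 0, while adding an unrelated tag raises the average distance.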
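Taxonomy of tag recommendation methods:
- By recommendation target: whether recommendations are tailored to a specific user.
- By objective: relevance, diversity, novelty.
- By data used: the source of candidate tags, and which item features and user features are used.
- By technique:
  - Tag co-occurrence-based: e.g. how often a candidate tag co-occurs with tags previously assigned to this user's items
  - Content-based: candidate tags are extracted from the item itself
  - Matrix factorization: models the relations among users, tags, and items
  - Graph-based: e.g. nodes are items and edges connect similar items; candidate tags are extracted from the neighborhood
  - Clustering-based: cluster items and tags, and recommend the most representative tags of the item's cluster
  - Learning-to-rank (L2R): learn a recommendation function from vectors of tag quality attributes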

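Challenges:
- Semantics: synonyms, polysemy, and so on; reducing the number of distinct word stems.
- Sparsity and cold start: users assign few tags, or some users have never tagged anything.
- Tag spam: misleading tags, noise.
Evaluation:
- Automatic: hold out part of the historical data as a test set.
- Manual: with the target users of the recommender system, e.g. online.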

2017-FolkPopularityRank: Tag Recommendation for Enhancing Social Popularity using Text Tags in Content Sharing Services.

Citations: 9. Source: International Joint Conference on Artificial Intelligence (IJCAI-17)

Problems addressed: 1) scoring text tags in terms of their influence on popularity; 2) recommending additional tags to increase that popularity.
Image and video features do not contribute much to predicting social popularity scores; text tags play a very important role.
Goal: recommend tags that enhance social popularity scores, based on existing tags. Limitation of earlier work: it focused only on tags that are semantically correct.
Method: tag scores are calculated not only from the co-occurrence of tags but also from the popularity-related numbers of the content.
Assumptions, used to extract tags with a higher level of influence on social popularity scores:
1. tags used for images and videos with high social popularity scores are important,
2. the tags co-occurring with such important tags are also important, and
3. the tags become less important when annotated with many other tags.
Formulation:
$r_{FP}=\alpha A_{FP}r_{FP}+(1-\alpha)p$, $s=r_{FP}-r_F^{tag}$

2019-User-Aware Folk Popularity Rank: User-Popularity-Based Tag Recommendation That Can Enhance Social Popularity

Citations: 2. Source: Proceedings of the 27th ACM International Conference on Multimedia

New tags can be recommended based on the tags already attached to the content.

Problem: FolkPopularityRank only considers the relationship among images, tags, and their popularity, and ignores who uploaded the content.
Use case: enhancing the social popularity of posted content using the text tags attached to it. Idea: consider both content popularity and user popularity.
Related work: Graph-based Ranking, Tag Recommendation, Social Popularity Prediction, Tag Generation, User-aware Recommendation.
Method: Tag Scoring; Adjacency Matrix Construction.

2019-Personalized hashtag recommendation for micro-videos

Citations: 10. Source: Proceedings of the 27th ACM International Conference on Multimedia. Code available.

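Problem: most existing methods model the interaction between posts and hashtags, or between users and hashtags; for micro-videos, none exploit the three-way relations among users, micro-videos, and hashtags.
Method: a GCN models the three-way relations, using a message-passing strategy to learn the node representations (users, hashtags, micro-videos), where each node's representation is derived from its connected nodes of the other types. An attention mechanism filters the information passed from micro-videos to users and hashtags.
- Nodes: $u_i\in U,v_k\in V,h_j\in H$. $\mathtt v_k$ denotes the feature vector of video $v_k$. Two nodes share an edge if they have interacted.
- User node representation (preference) $\mathtt u_i$: derived from the user's preferences over videos and over hashtags.
  - Video preference $\mathtt u_i^v$: the video content the user is interested in, from the set of neighboring videos.
  - Hashtag preference $\mathtt u_i^h$: derived from the set of hashtags connected to the user.
- Hashtag node representation $\mathtt h_j$: obtained the same way as the user node.
- Video node representation $\mathtt v_k$: the video's features.
- From $\mathtt u_i$ and $\mathtt h_j$, compute the user-specific hashtag feature $\bar{\mathtt h_j^i}$; from $\mathtt u_i$ and $\mathtt v_k$, compute the user-specific video feature $\bar{\mathtt v_k^i}$. Their dot product is the suitability score of hashtag j for video k.
Training: pairwise-based learning on triples $R=(v_k,h_j,h'_j)$, with the objective:
$\arg \min_{\Theta}\sum_{(v_k,h_j,h'_j)\in R}-\ln\phi((\bar{\mathtt h_j^i})^T\bar{\mathtt v_k^i}-(\bar{\mathtt {h'}_j^i})^T\bar{\mathtt v_k^i})+\lambda||\Theta||_2^2$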
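
The pairwise objective can be written out in a few lines of Python. This is a sketch under assumptions: $\phi$ is taken to be the logistic sigmoid, the vectors stand in for the user-specific hashtag and video features, and all values and function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neg_log_sigmoid(x):
    """Numerically stable -ln(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def bpr_loss(triples, params, lam=0.01):
    """Sum of -ln sigmoid(score_pos - score_neg) plus L2 regularization on params."""
    ranking = sum(
        neg_log_sigmoid(dot(h_pos, v) - dot(h_neg, v))
        for v, h_pos, h_neg in triples
    )
    l2 = lam * sum(x * x for vec in params for x in vec)
    return ranking + l2


video = [1.0, 0.5]
relevant_tag = [0.8, 0.4]     # hashtag actually used on the video
irrelevant_tag = [-0.2, 0.1]  # sampled negative hashtag

good = bpr_loss([(video, relevant_tag, irrelevant_tag)], params=[], lam=0.0)
bad = bpr_loss([(video, irrelevant_tag, relevant_tag)], params=[], lam=0.0)
```

The loss is small when the observed hashtag outscores the sampled negative and grows roughly linearly when the order is reversed.
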
Datasets: YFCC100M, Instagram.

2019-Hashtag recommendation for photo sharing services

Citations: 5. Source: Proceedings of the AAAI Conference on Artificial Intelligence. Code available.

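Problem: recommending hashtags for photo sharing services remains hard, because posts contain both text and images and because users have their own tagging habits. The paper models these two kinds of information with a content module and a habit module, respectively.
Adding hashtags to posts increases their engagement. Existing hashtag recommenders for photo sharing services attend to only text or only images. Tagging habits, which are shaped by personal preference or community influence, are mostly modeled implicitly through collaborative modeling of existing user-item-tag interactions; explicit habit modeling combined with content modeling is missing.
Method: the model consists of a content modeling module over image and text, which yields the content vector p, and a user tagging-habit module, which yields the influence vector t.
1. Text: each word is represented by an embedding vector $x_i$, and the text by $[x_1,\dots,x_N]$, where N is the maximum text length. An LSTM produces the output feature of each word, $h_i=LSTM(x_i,h_{i-1})$, forming the text feature $u=[h_1,\dots,h_N]$, $h_i\in R^d$, where d is the dimension of the word feature vectors.
2. Image: the output of VGG16's last pooling layer (7*7*512) serves as the local image features (M*512, M=49); a fully connected layer maps them to the same dimension as the text features (M*d), giving the image feature $v=[v_1,\dots,v_M]$.
3. Text-image relation modeling: a parallel co-attention mechanism models both modalities, so that each guides the other's attention learning through their mutual relation. This yields overall text and image features, which are summed to obtain the content feature p.
4. User habit modeling: a small number of the user's historical posts and their hashtags serve as an external memory; the model learns the user's habits from these historical posts and relates them to the content of the current post. The memory can be a random sample from one user (or from similar users).
  - The hashtag set of each historical post gives a hashtag feature vector $t_i$.
  - Features are extracted from the historical posts and their similarity to the current post's features is computed; the hashtag vectors $t_i$ in the external memory are combined in a similarity-weighted sum to obtain the influence vector t.
5. The influence vector t and the content feature p of the current post give the overall feature q; a softmax produces predictions for all hashtags, and the highest-scoring hashtags are recommended.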
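
Step 4, the similarity-weighted read from the external memory, can be sketched as below. The dot-product similarity and the softmax normalization are assumptions borrowed from common memory-network practice, and the feature values and function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def influence_vector(current, memory_posts, memory_tags):
    """Similarity-weighted sum of the hashtag vectors stored in the memory."""
    sims = [sum(a * b for a, b in zip(current, post)) for post in memory_posts]
    weights = softmax(sims)
    dim = len(memory_tags[0])
    return [sum(w * tags[d] for w, tags in zip(weights, memory_tags)) for d in range(dim)]


# Hypothetical memory of two historical posts and their hashtag vectors t_i.
current_post = [1.0, 0.0]
memory_posts = [[1.0, 0.0], [0.0, 1.0]]
memory_tags = [[1.0, 0.0], [0.0, 1.0]]
t = influence_vector(current_post, memory_posts, memory_tags)
```

The historical post that resembles the current one contributes most of the influence vector, so the hashtags the user attached to similar content in the past are favored.
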
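Training objective: $J=\frac{1}{|S|}\sum_{(p_i,T_i)\in S}\sum_{z\in T_i}-\log P(z|p_i)$, where S is the training set, $p_i$ and $T_i$ are a post and its hashtags, z is one hashtag in that set, and $P(z|p_i)$ is the softmax probability that hashtag z appears in the post.
Dataset: crawled from Instagram.
Evaluation metrics: precision (P), recall (R), and F1-score (F1).
[Figure: experimental results]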

2019-Self-Attentive Neural Network for Hashtag Recommendation

Citations: 1. Source: Journal of Engineering Science & Technology Review

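Word-level self-attention learns features of the microblog text, from which the hashtag predictions are produced.
Method: word vectors from an existing method are added to position vectors and used as input; separate parameters produce Q, K, and V. Self-attention gives an output $z_i$ for each word, and an average pooling layer gives z. A network reduces its dimension, a linear layer expands it to the number of hashtags, and a softmax gives the prediction for each hashtag. The objective is $J=-\sum_{s\in S}\sum_{t\in tags(s)}\log p(t|s)$.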
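
The self-attention and pooling steps can be illustrated in plain Python. The notes do not say which attention variant is used, so scaled dot-product attention is assumed here; the identity projection matrices and the input values are hypothetical.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the rows of X (one row per word)."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = math.sqrt(len(K[0]))
    Z = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        Z.append([sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))])
    return Z


def mean_pool(Z):
    """Average pooling over words, giving one vector z for the whole text."""
    return [sum(col) / len(Z) for col in zip(*Z)]


identity = [[1.0, 0.0], [0.0, 1.0]]       # stand-in for learned projections
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # word vector + position vector per word
Z = self_attention(X, identity, identity, identity)
z = mean_pool(Z)
```

Each output row is a convex combination of the value vectors, so the pooled vector `z` stays in the range spanned by the inputs before the dimension-reducing and linear layers are applied.
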
Dataset: crawled from Twitter.

2019-Co-attention Memory Network for Multimodal Microblog’s Hashtag Recommendation

Citations: 5. Source: IEEE Transactions on Knowledge and Data Engineering

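Treating hashtag recommendation as multi-class classification cannot handle new hashtags; this paper treats it as a matching problem instead.
Method: end-to-end memory networks combine the tweet content with historical posts that contain a hashtag. There are two modules, which model the query tweet and the candidate hashtag's preference, respectively. Because hashtag-relevant content involves only part of a tweet's text and image, a co-attention mechanism filters out the irrelevant content.
- Query tweet feature extraction: text and image features are extracted from each tweet to give the text feature set $\{u_i\}$ and the image feature set $\{v_i\}$, which an attention mechanism aggregates into the text feature $\tilde u^0$ and the image feature $\tilde v^0$.
  - Text features: each word is one-hot encoded and passed through an embedding layer, giving the sequence $\{x_i\}$ whose length is the maximum text length; a bidirectional LSTM produces the text feature set $\{u_i\}$.
  - Image features: the image is resized to 224*224 and divided into $N\times N$ regions; VGG16 gives a 512-dimensional feature for each region, and a perceptron layer maps it to the same dimension as the text features.
  - Co-attention mechanism (2-FNN + softmax):
    1. Text-based visual attention: an average pooling layer gives the sentence-level tweet representation $u$, which guides attention over the image feature set $\{v_i\}$ to produce the image feature $\tilde v^0$.
    2. Image-based textual attention: the new image feature $\tilde v^0$ guides attention over the text feature set $\{u_i\}$ to produce the text feature $\tilde u^0$.
- Hashtag history: the set of tweets containing a hashtag h forms that hashtag's history, with texts $D_T=\{t_i\}$ and images $D_I=\{e_i\}$. A hierarchical attention mechanism produces the hashtag's representation:
  - Grid-level modelling: each image in $D_I$ is processed with the same image feature extraction as above, with attention guided by the most recent image feature $\tilde v^{k-1}$, giving a feature $v_i^*$ for each image.
  - Image-level modelling: $\tilde v^{k-1}$ guides attention over all images in $D_I$ to produce the hashtag's image history preference $\tilde v_h$.
  - Word-level modelling: as in the text feature extraction above, each text in $D_T$ is attended using the most recent text feature $\tilde u^{k-1}$, giving a text feature $u_i^*$ for each tweet.
  - Tweet-level modelling: $\tilde u^{k-1}$ guides attention over all texts in $D_T$ to produce the hashtag's text history preference $\tilde u_h$.
  - Stacked history modelling network: the query tweet's text and image features are added to the hashtag history's text and image features, respectively, to give the attention-guiding features for the next round. The final text and image history features give the history feature $q_h$; the query tweet's text and image features give the tweet feature $q^0$, which is added to $q_h$ to give the overall feature $q^k$.
  - Finally, $q^k$ passes through a fully connected layer and a softmax to give the matching score between hashtag h and the query tweet.
[Figure: model architecture]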

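Experiments: tweets were crawled, with a hashtag set of 3,280; 8 tweets are randomly sampled per hashtag as its history; 334,019 tweets in total; 20% held out for testing. The text vocabulary has 259,410 words; the memory capacity is 5 (5 randomly drawn from the 8); the maximum text length is 34; the candidate hashtag set size is 10; the embedding dimension is 300; the network is stacked 5 layers deep. Learning rate 0.01, dropout rate 0.2.
Baselines:
- NB: Naive Bayes models the posterior probability of each hashtag, using only text.
- SVM: represents a tweet with pretrained word vectors, then applies an SVM.
- KNN: represents a tweet with pretrained word vectors and uses cosine similarity; recommends the hashtags that occur most among the k nearest neighbors.
- LSTM+CNN: uses text and image information; multi-class classification.
- LSTM+CNN+H: similar to the proposed model, but without attention.
- TTM (2013): topical translation model; text only.
- TOMOHA (2014): treats hashtags as topic labels and models text, hashtags, and topics.
- CNN-Attention (2016): CNN + attention, combining text information and trigger words.
- Co-Attention (2017): combines text and images.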

Related reading on memory networks: "Paper Walkthrough: Memory Network", "Memory Networks", "Paper Reading: Memory Network".

2019-Long-tail hashtag recommendation for micro-videos with graph convolutional network

Citations: 4. Source: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. Code available.

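Long-tail: hashtags that occur rarely; such hashtags make up a large share of all hashtags, and predicting them is hard.
Method:
- Hashtag Embedding: hashtag frequencies differ widely, so hashtag embedding propagation with external knowledge is used to handle the long-tail distribution: build a graph from hashtag relations $\to$ propagate over the graph (frequent hashtags share knowledge with long-tail hashtags) and update the hashtag embeddings. $Y^{l+1}=f(Y^l,A)$, where Y is the hashtag representation and A the hashtag relations.
- GCN for Hashtag Propagation: uses $Y^{l+1}=\sigma(D^{-\frac{1}{2}}AD^{-\frac{1}{2}}Y^lW_{GCN}^l)$. Four relations are defined: composition (a hashtag is made up of several words), super-subordinate (WordNet), positive (similar labels), and co-occurrence, in decreasing order of priority. This gives a relation matrix A with four edge types; each node's neighbors are found from the relations, the node itself is added, and the network updates the node representations.
- Micro-video Embedding: each modality, video (FFmpeg + ResNet), audio (Librosa), and text (Word2Vec, at most 6 words), yields sequence features. Parallel LSTMs learn the textual, acoustic, and visual representations; hashtag-guided attention produces the representation of each modality, and the three modalities are mapped into a common space.
- User Embedding: the visual features (pretrained CNN) of the micro-videos the user has posted and the text features (Word2Vec) of their hashtags are averaged, concatenated, and passed through 3 fully connected layers to give the user representation.
- Interactive Embedding Model: the user, micro-video, and hashtag representations are fed into a Bi-Interaction layer (a pooling operation that produces a single vector) and hidden layers (fully connected) to predict the score.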
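
One propagation step $Y^{l+1}=\sigma(D^{-\frac{1}{2}}AD^{-\frac{1}{2}}Y^lW_{GCN}^l)$ can be sketched in plain Python. The sketch assumes that A has no self-loops (they are added inside the function, matching "the node itself is added"), that $\sigma$ is ReLU, and that all edges have weight 1; the three-hashtag graph and the function names are hypothetical.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gcn_layer(A, Y, W):
    """One GCN step with self-loops, symmetric normalization, and ReLU."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, Y), W)
    return [[max(0.0, x) for x in row] for row in out]


# Hypothetical graph: frequent hashtag 0 is related to long-tail hashtag 1; hashtag 2 is isolated.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
Y = [[1.0, 0.0],   # well-trained embedding of the frequent hashtag
     [0.0, 0.0],   # uninformative embedding of the long-tail hashtag
     [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for readability
Y_next = gcn_layer(A, Y, W)
```

After one step the long-tail hashtag inherits half of its frequent neighbor's embedding, which is the knowledge sharing described above; the isolated hashtag is unchanged.
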
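Experiments: the INSVIDEO dataset collected from Instagram, with 213,847 videos, 6,786 users, and 13.4 hashtags per video on average. The data is split 8:1:1 with the hashtag set kept consistent across splits, and 6 negatives are sampled at random per video. Hashtags occurring fewer than 100 times are called long-tail hashtags.
- Baselines:
  - ConTagNet (2016): combines text, image, and capture information.
  - Co-Attention (2017): attention over text + image.
  - User-specific Hashtag Modeling (2018): models image, hashtag, and user embeddings.
  - V2HT w/o UP, V2HT w/o P, V2HT w/o U: variants of the model; U = user, P = propagation modeling.
[Figure: model architecture]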

2020-Adversarial Learning for Personalized Tag Recommendation

Citations: 0. Source: IEEE Transactions on Multimedia. Code available.

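Problem: current multi-label classification relies mainly on tags assigned in the lab, but people tag images differently.
- Datasets for multi-label image classification are mostly tagged by researchers based on visual content. Classifiers trained this way predict tags that do not meet the needs of user tag recommendation.
- Personalized tag recommendation methods are mainly based on tag co-occurrence, matrix factorization, and graph models, and do not scale well to large datasets.
The paper therefore uses an end-to-end deep network for personalized tag recommendation. By jointly optimizing user preference and visual encoding, the network learns user preferences in an unsupervised way; this joint training integrates visual preference and tagging behavior. Adversarial training is also used so that the network predicts tags resembling those a user would assign.
- (Visual preference) A discriminator distinguishes human-assigned tags from machine-assigned tags. Jointly optimizing this loss with several other losses lets the network predict human-like tags.
- (Tagging-history preference) For personalization, an encoder-decoder network with skip connections learns user preferences without supervision.
Related research: multi-label classification, personalized recommendation, adversarial learning.
Personalization: choice of tags (tagging history); preference over visual content (visual preference).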

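Method:
- Input: image $I$ and the user's tagging history $u_h$. Output: personalized tags $T_p$.
- The image passes through the network $F(I)$ (the output of ResNet50's activation_46) to give the visual descriptor $e$, which is combined with the latent user preference $u_p$, learned without supervision by the encoder $U_E$ and decoder $U_D$, to output the personalized tags $T_p$. In parallel, a generator $G$ takes $e$ as input and predicts generic tags $T_g$, while a discriminator $D$ acts as the adversary that distinguishes $T_g$ from the image's ground-truth tags.
- User preference ($u_p$): $U_E$ takes the user's tagging history $u_h$ as input and passes it through **four fully connected layers (1024, 512, 256, 128)** to give the 128-dimensional $u_p$; $U_D$ takes $u_p$ as input and passes it through **four fully connected layers (128, 256, 512, 1024)** to give the reconstructed tagging history $e_h$. Inspired by U-Net, skip connections (concatenation) link the encoder and decoder, and the Huber loss, which is insensitive to outliers, is used.
- Personalized tags ($T_p$): $u_p$ and $e$ are combined and passed through **two convolutional layers and two fully connected layers ($C$)** to give the personalized tag predictions. The combination can be a sum, a product, or a concatenation; concatenation works best.
- Adversarial Learning: the aim is to make the distribution of predicted tags resemble that of tags produced by real people. **$G$ (three convolutional layers and one fully connected layer)** generates tags $T_g$ from $e$, and **$D$ (fully connected layers (1024, 256, 64, 16))** distinguishes $T_g$ from the ground-truth tags $T_{gt}$. Because it is hard for the discriminator to tell user-preference tags from ground-truth tags, the discriminator is not applied to the personalized tags, which also leaves the learning of personalized tags unaffected. The full network comprises $F$, $U_E$, $U_D$, $C$, and $G$.
- Discriminator: maximize $D(T_{gt})$ and minimize $D(G(e))$. Ground-truth tags are present or absent with a value of exactly 1 or 0, whereas predicted tags are probabilities, which makes the two trivial for the discriminator to tell apart. Jitter is therefore added to the ground-truth tags: for example, with $\eta=0.7,l=0.3$, a present tag takes a value in (0.7, 1) and an absent tag a value in (0, 0.3).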
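
The label jitter and the Huber loss mentioned above can be sketched as follows. Uniform sampling within each interval and the Huber threshold `delta=1.0` are assumptions; the notes only give the intervals and name the loss.

```python
import random


def jitter_labels(labels, eta=0.7, l=0.3, rng=None):
    """Replace hard 1/0 labels by soft values in [eta, 1] and [0, l] respectively."""
    rng = rng or random.Random()
    return [rng.uniform(eta, 1.0) if y == 1 else rng.uniform(0.0, l) for y in labels]


def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)


# Ground-truth tag vector of one image: tags 0 and 2 present, tags 1 and 3 absent.
noisy = jitter_labels([1, 0, 1, 0], rng=random.Random(0))
small = huber_loss([0.0], [0.5])   # quadratic regime
large = huber_loss([0.0], [3.0])   # linear regime
```

With jittered targets the discriminator can no longer separate real from generated tag vectors just by checking whether the entries are exactly 0 or 1.
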
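Implementation: the full network and the discriminator are trained alternately. ResNet weights pretrained on ImageNet are used; convolutions are 3*3; the activation is ReLU, except for the last layer, which uses a sigmoid.
Datasets: YFCC100M, NUS-WIDE.
Evaluation metrics: overall and per-class accuracy, precision, recall, and F1-score.
[Figure: experimental results]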

2020-Personalized Tag Recommendation Based on Convolution Feature and Weighted Random Walk

Citations: 0. Source: International Journal of Computational Intelligence Systems

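Method:
- Take each spatial position of a feature map from some layer of a CNN (AlexNet) as a feature (its dimension is the number of kernels), giving the image feature set $F^l=\{f^l_{i,j}\}$; do this for several layers (features from lower layers are more effective), giving $\{F^l\}$.
- Encode the extracted features of each layer with VLAD: K-means clustering gives K (100) visual words $c_k^l$; for each visual word, accumulate the residuals of the features assigned to it, $\sum_{NN(f^l_{i,j})=c_k^l}(f_{i,j}^l-c_k^l)$; then concatenate the resulting K vectors into one image feature $x^l$.
- Nearest-neighbor images should be images with similar topics (user groups). Compute the similarity between the target image and every other image (variance of the features) and apply Min-Max normalization to obtain p; compute the similarity J between the user groups the two images belong to.
- Combine the image similarity and the user-group similarity into the relation between two images, $y=\lambda (1-p)+(1-\lambda)J$; the highest-scoring images form the neighborhood.
- Run a weighted random walk over the neighborhood images to score the tags, and choose the top-scoring tags.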
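
The VLAD encoding step and the relation score can be illustrated with a small Python sketch. The centroids are given by hand instead of being learned with K-means, and the feature values, the value of λ, and the function names are hypothetical.

```python
def vlad_encode(features, centroids):
    """Sum the residuals (f - c_k) per nearest centroid and concatenate the K sums."""
    dim = len(centroids[0])
    residual_sums = [[0.0] * dim for _ in centroids]
    for f in features:
        # Index of the nearest visual word (squared Euclidean distance).
        k = min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])))
        for d in range(dim):
            residual_sums[k][d] += f[d] - centroids[k][d]
    return [x for row in residual_sums for x in row]


def relation_score(p, J, lam=0.5):
    """y = lam * (1 - p) + (1 - lam) * J, with p normalized to [0, 1]."""
    return lam * (1.0 - p) + (1.0 - lam) * J


# Two hypothetical visual words and three local features of one image.
centroids = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
x = vlad_encode(features, centroids)
y = relation_score(p=0.2, J=0.5)
```

With K visual words of dimension d, the encoded image feature has K·d entries, here 2·2 = 4.
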
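Experiments: images with groups (an image is shared with a group) and tags were crawled from Flickr, giving 2,000 images belonging to 20 groups.
- Baselines:
  - Personalized Social Image Recommendation (PSIR) (2017): recommends images using image metadata.
  - KNN (2008): votes based on image similarity.
  - Imagga: an online tag recommendation system that uses image information.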

2020-Sentiment Enhanced Multi-Modal Hashtag Recommendation for Micro-Videos

Citations: 0. Source: IEEE Access

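Many methods ignore the sentiment expressed by the media data.
Method:
- Feature extraction: two kinds of features are extracted for each sequence unit; FFmpeg samples frames (12) and audio clips (6).
  - Visual: ResNet-152 gives the content features (2048); a CNN trained on SentiBank, followed by dimensionality reduction, gives the sentiment features (231).
  - Audio: a SoundNet CNN gives the content features (1024); Librosa gives the sentiment features (512).
  - Text: pretrained GloVe gives the content features (300); CoreNLP gives the sentiment features (5).
  - Hashtag embedding: each hashtag is split into individual words, pretrained GloVe gives the word embeddings, and their average is the hashtag embedding.
An MLP maps the sentiment features into a common space, giving the three modality features $\{h_i^{v(s)},h_i^{a(s)},h_i^{t(s)}\}$; the content features pass through a Bi-LSTM and then self-attention to map them into the common space, giving $\{h_i^{v(c)},h_i^{a(c)},h_i^{t(c)}\}$. Self-attention over the sentiment and content features gives weights; the sentiment features, content features, and hashtag are then concatenated, and an MLP produces the matching score.
[Figure: model architecture]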

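Experiments: the Vine videos crawled in "Micro tells macro: Predicting the popularity of micro-videos via a transductive model"; after removing videos with fewer than 10 hashtags, 40,049 videos and 1,935 hashtags remain. Split 8:1:1.
- Baselines:
  - RSDAE (2015): stacked denoising autoencoders for deep representation learning.
  - Co-Attention (2017): attention over text + image.
  - TMALL (2016): transductive multi-modal learning model; optimal latent space; recommends micro-videos.
  - EASTERN (2017): models with three LSTMs and uses a CNN for micro-video classification.
  - TOAST-L (last pooling), A (average pooling), H (hashtag splitting), D (no attention).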

2020-AMNN: Attention-Based Multimodal Neural Network Model for Hashtag Recommendation

Citations: 0. Source: IEEE Transactions on Computational Social Systems. Code available.

Problems:
- The co-attention network does not work well in image-only and text-scarce situations.
- Prior work uses a multi-class softmax classifier with cross-entropy loss for hashtag recommendation, which degrades performance when posts carry multiple hashtags.
- Implicit correlations between two hashtags may exist.
Method:
- An attention-based neural network framework extracts features from both images and texts and captures correlations between hashtags.
- An encoder-decoder architecture is employed for hashtag sequence prediction, using a softmax + seq2seq mechanism to achieve the expected effect.
[Figure: model architecture]

2020-User Conditional Hashtag Recommendation for Micro-Videos

Citations: 0. Source: 2020 IEEE International Conference on Multimedia and Expo (ICME)

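Method: user information is incorporated when learning video features, and hashtags are recommended by the distance between hashtag features and video features in the latent space.
1. Attend to both image-level and video-level representations of micro-videos with user side information: user metadata (gender, age, country, and historical hashtags; one-hot encoding + feedforward network) is used when learning the micro-video representation.
2. A hierarchical multi-head attention mechanism extracts effective features over image regions and video frames.
3. Hashtag representation: an embedding matrix reduces it to K dimensions.
[Figure: model architecture]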

2020-Weakly Supervised Attention for Hashtag Recommendation using Graph Data

Citations: 0. Source: Proceedings of The Web Conference 2020

User-Hashtag recommendation (UHR): for a user who wants to follow hashtags to find microblogs aligned with her interests, discovering relevant trending hashtags can be a challenge.
Microblog-Hashtag recommendation (MHR): a user attaches hashtags to a post.
The recommendation problem can be defined as finding the relevance score between a hashtag h and a pair of text and followee list $\langle T_u, RN_u\rangle$.
Previous methods: the content of a user's microblogs; historical user-hashtag interactions (CF); propagating embeddings in the network (GCN).
Problems: most users tend not to generate content; heavy computations; diverse and independent interests.
Method: a user's followee/follower links implicitly indicate her interests, so users are modeled by their links towards hub nodes. Hashtags and hub nodes are projected into a shared latent space.
Implementation: a user is represented by a weighted aggregation (weak supervision) of her followees' embeddings; content and graph data are fused.
- Embedding based on Weakly Supervised Attention: labels for the relevance of representative nodes to hashtags do not exist.
  - Co-occurrence based Supervision: the number of users who follow representative node r and use hashtag h reflects the relevance between r and h.
  - Informativeness based Supervision: additionally considers the number of users who follow r but do not use h.
- Weak labels are not fully accurate.
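
The two counts behind the weak supervision can be computed as in the sketch below. The toy data and function names are hypothetical, and the ratio used as a weak label is only an illustrative assumption; the notes do not give the paper's exact formula.

```python
def supervision_counts(follows, uses, node, hashtag):
    """Count users who follow `node` and use `hashtag`, and those who follow but do not use it."""
    followers = {u for u, nodes in follows.items() if node in nodes}
    tag_users = {u for u, tags in uses.items() if hashtag in tags}
    both = len(followers & tag_users)
    follow_only = len(followers - tag_users)
    return both, follow_only


def weak_label(both, follow_only):
    """Illustrative relevance in [0, 1]: share of the node's followers who use the hashtag."""
    total = both + follow_only
    return both / total if total else 0.0


# Hypothetical follow lists and hashtag usage.
follows = {"u1": {"nasa"}, "u2": {"nasa"}, "u3": {"nasa", "espn"}, "u4": {"espn"}}
uses = {"u1": {"#space"}, "u2": {"#space"}, "u3": {"#nba"}, "u4": {"#nba"}}

nasa_space = supervision_counts(follows, uses, "nasa", "#space")
espn_space = supervision_counts(follows, uses, "espn", "#space")
label = weak_label(*nasa_space)
```

The first count alone favors popular hub nodes that co-occur with everything; bringing in the followers who do not use the hashtag penalizes such uninformative nodes.
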
Experiments: Joint Loss Function (Supervised-Attention based Loss Function); 99 hashtags that have no interaction with the user (UHR) or the microblog (MHR) are sampled.
[Figure: model architecture]

2020-Privacy-Preserving Visual Content Tagging using Graph Transformer Networks

Citations: 1. Source: Proceedings of the 28th ACM International Conference on Multimedia.

Problem: the use of anonymisation and privacy-protection methods is desirable, but comes at the expense of task performance.
Privacy issue: objects tend to appear in correlated patterns. These correlations lay the foundation for improving tagging performance, but are susceptible to privacy issues.