TuckerDNCaching: High-quality Negative Sampling with Tucker Decomposition

Tiroshan Madushanka1,2 and Ryutaro Ichise3,2*

1SOKENDAI (The Graduate University for Advanced Studies), Tokyo, Japan.

2National Institute of Informatics, Tokyo, Japan.

3*Tokyo Institute of Technology, Tokyo, Japan.

*Corresponding author(s). E-mail(s): ichise@iee.e.titech.ac.jp; Contributing authors: tiroshan@nii.ac.jp, tiroshanm@kln.ac.lk;

Abstract

Knowledge Graph Embedding (KGE) translates entities and relations of knowledge graphs (KGs) into a low-dimensional vector space, enabling an efficient way of predicting missing facts. Generally, KGE models are trained with positive and negative examples, discriminating positives against negatives. Nevertheless, KGs contain only positive facts; KGE training requires generating negatives from non-observed ones in KGs, referred to as negative sampling. Since KGE models are sensitive to inputs, negative sampling becomes crucial, and the quality of the negatives becomes critical in KGE training. Generative adversarial networks (GAN) and self-adversarial methods have recently been utilized in negative sampling to address the vanishing gradients observed with early negative sampling methods. However, they introduce the problem of false negatives with high probability. In this paper, we extend the idea of reducing false negatives by adopting a Tucker decomposition representation, i.e., TuckerDNCaching, to better express latent relations among entities by introducing a relation feature space. TuckerDNCaching ensures the quality of generated negative samples, and the experimental results reflect that our proposed negative sampling method outperforms the existing state-of-the-art negative sampling methods.

Keywords: Negative Sampling, Knowledge Graph Embedding, Tucker Decomposition


1. Introduction

Knowledge Graphs (KGs), such as Freebase, DBpedia, WordNet, and YAGO, provide a structured representation of facts, where the textual data is in the form of (head, relation, tail), known as a triplet, e.g., (DaVinci, painted, MonaLisa). The potential of KGs has been utilized in many real-world applications, such as question-answering, recommendation, and information retrieval systems. Typically, knowledge graphs contain an extensive volume of information. However, KGs are often incomplete and sparse as the knowledge is constructed on the basis of available facts or ground truths, which are often dynamic and evolving. Therefore, it is vital to have methods to complete KGs automatically by adding missing knowledge or facts.

Recent research has revealed the potential of utilizing machine learning (ML) techniques effectively with knowledge graph completion. Knowledge graph embedding (KGE) methods are proposed to provide better inference capability and efficiency for KGs. KGE maps entities and relations into a low-dimensional vector space while preserving their semantic meaning. Moreover, KGE approaches provide an efficient solution to indicate missing facts in incomplete knowledge graphs. Recent KGE techniques have shown promising results in knowledge acquisition tasks such as link prediction and triplet classification. Conventional KGE approaches accelerate the training of the ML algorithm by extending the idea of ranking observed instances (positives) higher than unobserved instances (negatives). However, KGs contain only positive examples, so negative examples must be generated. Hence, exploring strategies to generate quality negatives that support learning better knowledge representations is critical. For instance, considering the positive (DaVinci, painted, MonaLisa), we say that the negative (DaVinci, painted, CreationOfAdam) is a quality negative as it enables the KGE model to optimize the knowledge representation, unlike a typical negative (DaVinci, painted, France). Therefore, negative sampling becomes indispensable in knowledge representation learning as the KGE model's performance relies on negative selection.

Most negative sampling methods involve randomly corrupting positives on the basis of a closed world assumption [1, 2] or exploiting the KG structure when generating negatives [3, 4]. Regardless, methods that randomly corrupt positives suffer from vanishing gradients as they generate triplets with zero gradients during training. As a solution to the vanishing gradient problem, a new direction for negative sampling has been introduced, adopting the changes in the negative sampling distribution and generating negatives with large gradients dynamically [5]. However, to the best of our knowledge, most of the state-of-the-art negative sampling methods suffer from false negatives as they do not guarantee that the generated ones are always relevant negatives, i.e., in the case of generating true or latent positives as negatives. As KGE models are sensitive to inputs, false negatives usually fool the models, losing the semantics of entities and relations. Therefore, generating quality negatives that enhance KGE representation learning is still an open and challenging task in negative sampling.

In this paper, to overcome the challenges with generating quality negatives and to address the problem of false negatives, we propose a negative sampling method that explores negatives, considering the dynamic distribution of the embedding space while eliminating latent positives from the candidate space. We introduce the idea of modeling latent relations using available positive KG elements and utilize relation predictions to remove latent positives (false negatives) from the negative candidate space. We extend the previous work, MDNCaching [6], addressing the problem with the expressiveness of relations by introducing a relation feature space instead of a relation matrix. We use the Tucker decomposition technique [7] with our latent relation model representation, updating entity and relation feature spaces. The utilization of the relation feature space further extends the ability to represent multiple relations between entity pairs. First, we train a latent relation model from positive facts utilizing Tucker decomposition. Then, we predict the latent relations and eliminate false negatives from the candidate negative sample space. We use the caching technique to effectively manage negative triplets with large gradients and update the cache considering the changes to the embedding space to overcome the vanishing gradient problem.

In summary, the major contributions of this paper are three-fold. (1) A negative sampling method is introduced that eliminates the false negatives suffered in previous dynamic distribution-based negative sampling methods.

(2) The Tucker decomposition technique is used with our novel latent relation model representation for modeling latent relations. (3) Experiments on benchmark datasets reflect the effectiveness of TuckerDNCaching using standard metrics.

The remainder of this paper is organized as follows. Section 2 discusses related work on knowledge graph embedding and negative sampling. In Section 3, we propose a new negative sampling method for generating quality negatives with large gradients, considering the dynamic distribution of the embedding space while eliminating false negatives by referring to a latent relation model trained with the Tucker decomposition technique. In Section 4, we present an experimental study in which we compare our proposed negative sampling method with baseline results on benchmark datasets and analyze the results against the state-of-the-art. In Section 5, we conclude this paper.

2. Related Work

Knowledge graph completion remains a challenging research field, and many different approaches have been introduced to make KGs machine-readable, utilizing reasoning techniques. Knowledge graph embedding, also called knowledge representation learning, projects KG elements, i.e., entities and relations, into a low-dimensional continuous vector space and utilizes the numerical representation of embeddings to perform knowledge acquisition tasks. Typically, using a scoring function, the KGE model captures the similarities between two entities on the basis of a relation. Depending on the properties of the scoring function, two main KGE approaches are found: translational distance-based models and semantic matching-based models. Translational distance-based models interpret relations as geometric transformations in the latent space, where models evaluate the distance of projected KG elements using a scoring function [1, 2, 8]. In contrast, the semantic matching-based approaches model the latent semantics represented in vectorized entities and relations through matrix decomposition [9–11]. The embeddings of both approaches are learned by solving an optimization problem that maximizes the scoring function for observed triplets (positives) while minimizing it for unobserved triplets (negatives). Thus, negative sampling is necessary for training a KGE model because negative and positive triplets must be provided during the training.
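To make the two families concrete, here is a minimal sketch of one scoring function from each: a TransE-style translational distance score and a DistMult-style semantic matching score, both over toy embedding vectors. The dimensionality and sign conventions are illustrative assumptions, not the exact formulations of the cited models.

```python
import numpy as np

def transe_score(h, r, t):
    """Translational distance score: higher (less negative) means more plausible."""
    return -np.linalg.norm(h + r - t)

def distmult_score(h, r, t):
    """Semantic matching score: trilinear product <h, r, t> = sum_d h_d * r_d * t_d."""
    return float(np.sum(h * r * t))

# Toy 8-dimensional embeddings for a single triplet
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
print(transe_score(h, r, t), distmult_score(h, r, t))
```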

2.1 Negative Sampling in KGE

Since KGE models learn knowledge representations by discriminating positives from negatives, the quality of the negatives affects the training and the performance of knowledge representation in downstream tasks. The existing works on negative sampling in KGE can be categorized into two main approaches: fixed distribution-based and dynamic distribution-based sampling.

2.1.1 Fixed distribution-based sampling

Due to its simplicity and efficiency, fixed distribution-based sampling is typically utilized with knowledge representation learning. Initially, Uniform negative sampling [1] was used, which constructs negative triplets by replacing either the head or tail entity of a positive triplet with a randomly sampled entity from an entity set. However, randomly corrupted negatives are easily distinguished and do not contribute to training because they are low in quality. In addition, Uniform sampling also generates false negatives with its random selection, such as replacing the head entity DaVinci with Michelangelo, i.e., (Michelangelo, gender, Male), which generates a negative that is actually a fact. Bernoulli negative sampling [2] is introduced with the idea of corrupting either the head entity or the tail entity by considering the statistical information of entities and relations, which enhances the chance of replacing the head entity in one-to-many relations and the tail entity in many-to-one relations. In addition, other novel approaches [12–14] can be found that analyze statistical features of knowledge graphs rather than randomly corrupting positives. However, since the fixed distribution-based methods sample from fixed distributions without considering the dynamics of the distributions, the methods suffer from vanishing gradient problems [3].
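As a minimal sketch of fixed distribution-based corruption, the following shows how uniform sampling builds a negative and why it can accidentally produce a false negative; the toy triplet format, the 0.5 head/tail choice, and the entity list are assumptions for illustration rather than the exact procedures of [1, 2].

```python
import random

def uniform_corrupt(positive, entities, observed):
    """Build one negative by uniformly corrupting the head or tail of a positive triplet."""
    h, r, t = positive
    corrupt_head = random.random() < 0.5        # choose the side to corrupt uniformly
    e = random.choice(entities)                 # uniformly sampled replacement entity
    negative = (e, r, t) if corrupt_head else (h, r, e)
    # Uniform sampling gives no quality guarantee: the result may be trivially easy
    # to discriminate, or may even be an observed fact, i.e., a false negative.
    return negative, negative in observed

# Toy usage
entities = ["DaVinci", "Michelangelo", "MonaLisa", "France"]
observed = {("DaVinci", "painted", "MonaLisa")}
neg, is_false_negative = uniform_corrupt(("DaVinci", "painted", "MonaLisa"), entities, observed)
```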

2.1.2 Dynamic distribution-based sampling

To address the drawback of the vanishing gradient problem with fixed distribution-based sampling methods, dynamic distribution-based sampling methods attempt to adopt the changes in negative sampling distributions. With the successful adaptation of the Generative Adversarial Network (GAN) [15] for modeling dynamic distributions, IGAN [16] and KBGAN [5] were introduced as negative sampling methods to generate negatives with large gradients. The GAN-based approach generates negatives by dynamically approximating a negative sampling distribution (generator), and the discriminator distinguishes positives from negatives. At the same time, training continues between the generator and the discriminator to optimize the knowledge representation. KBGAN is the first attempt to adapt GAN to negative sampling in KGE. In KBGAN, the generator produces a probability distribution over a candidate set of negatives, selects the one with the highest probability from the candidates, and then feeds it to the discriminator, which minimizes the marginal loss between positive and negative samples to improve the final embedding. The generator is selected from one of two semantic matching-based KGE models (DistMult [11], ComplEx [10]). The discriminator is selected from two translational distance-based KGE models (TransE [1], TransD [8]). By replacing the probability-based log-loss KGE generator in KBGAN, IGAN utilizes a two-layer fully connected neural network as the generator while keeping the embedding model as the discriminator. KSGAN [17] is an extension of KBGAN, an adversarial learning approach with a new component for knowledge selection. The knowledge selection filters out false triplets and selects a semantic negative triplet for a given positive triplet. However, GAN-based methods require pre-training, which impacts efficiency. Later, RotatE introduced a self-adversarial sampling approach, i.e., Self-Adv, based on a self-scoring function. However, Self-Adv does not perform consistently on other KGE models. By selecting negatives using a random walk approach that ignores non-semantically similar neighbors, Structure Aware Negative Sampling (SANS) [4] improves over Self-Adv. With the aim of generating hard negatives, NSCaching [3] introduces an approach to maintaining a cache of high-quality negatives. After evaluating the gradients, NSCaching stores negative triplets with large gradients in head/tail caches. Entity Similarity-based Negative Sampling (ESNS) [18] considers semantic similarities among entities when selecting negatives and utilizes a shift-based logistic loss function. Despite addressing the problem of vanishing gradients, dynamic distribution-based sampling methods produce false negatives with a high probability, as latent positives also reflect large gradients.

3. TuckerDNCaching

This section describes the idea behind our method. First, we experimentally analyze the challenges in generating quality negatives in KGE. Then, we introduce the proposed negative sampling method to address these challenges and generate quality negatives.


Fig. 1 Experimental analysis of the zero loss problem and the false negative problem. (a) Ratio of zero loss cases in training FB15K237 by TransE with Bernoulli negative sampling (x-axis: epoch). (b) Ratio of false negatives in ExtremeSelectCaching for FB15K237 (x-axis: epoch).

3.1 Problem Definition

Our idea is that a negative sample (h̄, r, t) is difficult to discriminate against a positive sample (h, r, t) when the corrupted entity h̄ is semantically meaningful to the original entity h concerning the semantics in a knowledge graph. For instance, given the positive (DaVinci, painted, MonaLisa), (DaVinci, painted, CreationOfAdam) could be a quality negative candidate as it is semantically correct. For knowledge representation learning, it is harder to discriminate than (DaVinci, painted, France) and (DaVinci, painted, Louvre). Next, we define the concept of a quality negative on the basis of this idea.

Definition 1 (Quality Negative) A quality negative triplet is a negative triplet in vector space close to a positive triplet.

When capturing semantically meaningful negatives, it is clearly necessary to eliminate false negatives as they frequently appear. For instance, when evaluating the probability of the candidate negatives (DaVinci, painted, CreationOfAdam), (DaVinci, painted, LadyWithAnErmine), and (DaVinci, painted, TheLastSupper), it is important to eliminate the true facts, i.e., TheLastSupper and LadyWithAnErmine.

3.2 Analysis of Challenges with Generating Quality Negatives

Since the embeddings of entities and relations are randomly initialized, negative triplets generated by a fixed distribution-based negative sampling method, i.e., by randomly corrupting the head or tail of a positive, are effective at the beginning of stochastic training for KGE. However, randomly generated negative triplets are likely to be out of the margin with a high probability after training for a few epochs and do not further contribute to KGE learning. We call this the vanishing gradient problem or zero loss problem. We conducted a zero loss problem analysis by selecting Bernoulli negative sampling as the candidate, and Figure 1(a) shows that the ratio of zero loss cases increased dramatically as the training continued with TransE for the FB15K237 dataset. This result shows that the fixed distribution-based negative sampling methods quickly lead to a vanishing gradient problem. The negative samples only contribute effectively in the first epochs. Therefore, these methods yield slow convergence and may deviate from optimal embedding learning. Hence, generating negatives with large gradients is important to provide continuous learning of the semantics in KGs.
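The zero loss measurement can be illustrated with a short sketch: under a margin ranking loss, a negative whose score falls far enough below the positive yields a loss of exactly zero and hence no gradient. The margin value, toy score distributions, and loss form below are illustrative assumptions, not the exact setup behind Figure 1(a).

```python
import numpy as np

def zero_loss_ratio(pos_scores, neg_scores, margin=1.0):
    """Fraction of pairs with max(0, margin - f(pos) + f(neg)) == 0, i.e., no gradient."""
    losses = np.maximum(0.0, margin - pos_scores + neg_scores)
    return float(np.mean(losses == 0.0))

# Toy scores mimicking a model after a few epochs: random negatives score far below positives
rng = np.random.default_rng(0)
pos = rng.normal(-1.0, 0.2, size=1000)   # plausibility scores of positive triplets
neg = rng.normal(-6.0, 1.0, size=1000)   # plausibility scores of random negatives
print(zero_loss_ratio(pos, neg))         # close to 1.0 -> most negatives no longer contribute
```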

Even though the utilization of the dynamic distribution of negatives attempts to solve the problem of vanishing gradients, it introduces the problem of false negatives with high probability compared with the fixed distribution-based negative sampling methods. We conduct our analysis of the problem of false negatives by introducing a negative sampling method, called "ExtremeSelectCaching," that evaluates all candidates, greedily selects negatives with large gradients without filtering out false negatives, and caches them. We evaluate the ratio of false negatives in the negative samples during the training of the "ExtremeSelectCaching" negative sampling method. Figure 1(b) illustrates that dynamic selection introduces the problem of false negatives with high probability, as the ratio of false negative samples increases dramatically within a few epochs of training. Typically, dynamic distribution-based negative sampling methods evaluate the gradient of a negative using an underlying KGE scoring function. When the distribution of embeddings changes while learning the knowledge representation, the appearance of false negatives increases since they have large gradients. The state-of-the-art dynamic distribution-based negative sampling methods attempt to manage the ratio of false negatives by selecting candidates from a small pool sampled from all the candidates [3, 5], introducing a trade-off between the false negatives ratio and the quality of the negative candidates. However, as knowledge graph embedding models learn to discriminate positive triplets against negative triplets, false negatives fool the models, losing the actual semantics of entities and relations.

The analysis further suggests that the proposed negative sampling method should generate a negative triplet with a large gradient that is not an apparent false negative when producing a quality negative candidate.

3.3 TuckerDNCaching

Recall the stated challenges in negative sampling: (a) adopting a dynamic distribution of negatives to avoid vanishing gradients, and (b) avoiding false negatives that lead to losing the semantics of the KG. A negative sampling method must be carefully designed to overcome these challenges.

The proposed TuckerDNCaching adopts a dynamic distribution of negative samples when selecting candidate negatives, as it is required to avoid the problem of vanishing gradients and enable the underlying KGE model to learn the semantics of the KG continuously. Since quality negative samples are scarce, we need to ensure that all possible quality negative samples are explored. Hence, TuckerDNCaching models the distribution of all candidates and selects quality negatives. When selecting candidates with large gradients, we refer to the underlying scoring function of the KGE model. However, modeling the distribution for all candidates introduces complexity in executing the steps in TuckerDNCaching. To manage this, we utilize a caching technique that maintains negatives with large gradients for each positive fact and introduce a lazy update procedure that refreshes the caches after a number of epochs rather than immediately. We aim to maintain two separate caches, i.e., a head-cache (t, r) that maintains candidates for head corruption and a tail-cache (h, r) that maintains candidates for tail corruption. We uniformly sample negatives from the cache efficiently without introducing any bias.
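A simplified sketch of how the two caches could be organized is shown below: a head-cache keyed by (t, r) and a tail-cache keyed by (h, r), each holding a fixed number of large-gradient candidates, refreshed lazily every few epochs and sampled uniformly. The cache size, refresh period, and the `candidates_fn`/`score_fn` interfaces are assumptions for illustration, not the exact implementation.

```python
import random
from collections import defaultdict

class NegativeCache:
    """Toy head/tail caches of large-gradient negative candidates (illustrative only)."""

    def __init__(self, cache_size=50, refresh_every=10):
        self.cache_size = cache_size
        self.refresh_every = refresh_every          # lazy update period (in epochs)
        self.head_cache = defaultdict(list)         # (t, r) -> candidate heads
        self.tail_cache = defaultdict(list)         # (h, r) -> candidate tails

    def maybe_refresh(self, epoch, positives, candidates_fn, score_fn):
        """Lazily rebuild both caches every `refresh_every` epochs."""
        if epoch % self.refresh_every != 0:
            return
        for (h, r, t) in positives:
            # Keep only the highest-scoring (largest-gradient) filtered candidates.
            heads = sorted(candidates_fn(h, r, t, corrupt="head"),
                           key=lambda e: score_fn(e, r, t), reverse=True)
            tails = sorted(candidates_fn(h, r, t, corrupt="tail"),
                           key=lambda e: score_fn(h, r, e), reverse=True)
            self.head_cache[(t, r)] = heads[: self.cache_size]
            self.tail_cache[(h, r)] = tails[: self.cache_size]

    def sample(self, h, r, t):
        """Uniformly sample one cached negative (head- or tail-corrupted)."""
        if random.random() < 0.5 and self.head_cache[(t, r)]:
            return (random.choice(self.head_cache[(t, r)]), r, t)
        if self.tail_cache[(h, r)]:
            return (h, r, random.choice(self.tail_cache[(h, r)]))
        return None
```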

The proposed negative sampling method introduces a Tucker decomposition-based latent relation model to predict and eliminate false negatives from the negative sample space. Modeling the distribution of all candidates and eliminating false negatives ensures that the proposed method explores all quality negatives. Projecting latent relations between entities reduces the chance that latent positives are discriminated against while the KGE model is learning the KG semantics.

Next, we describe how Tucker decomposition is used to model the KG's latent relations. Then, we provide details on our negative sampling method, TuckerDNCaching.

3.3.1 Tucker decomposition for latent relation modeling

A tensor is a multidimensional array. An Nth-order tensor is an element of the tensor product of N vector spaces, each of which has its own coordinate system. A first-order tensor is a vector, a second-order tensor is a matrix, and tensors of order three or higher are considered higher-order tensors. Interestingly, tensors can be represented compactly in decomposed forms. Several decomposition techniques are available; among them, CP [19] and Tucker [7] are popular. CP expresses the tensor as a sum of rank-one tensors, i.e., a sum of the outer products of vectors. Tucker decomposition is a generalization of CP decomposition. It decomposes the tensor into a small core tensor and factor matrices. For example, the Tucker decomposition of a third-order data tensor

$\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ can be represented as $\mathcal{X} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C$, where $\mathcal{G} \in \mathbb{R}^{X \times Y \times Z}$ is a third-order core tensor, $A \in \mathbb{R}^{I \times X}$, $B \in \mathbb{R}^{J \times Y}$, and $C \in \mathbb{R}^{K \times Z}$ are factor matrices, and $\times_n$ is the $n$-mode tensor product with a matrix. Knowledge graph data can be represented as a $\{0,1\}$-valued third-order tensor $\mathcal{Y} \in \{0,1\}^{E \times R \times E}$, where $E$ is the total number of entities and $R$ is the number of relations, with $\mathcal{Y}_{i,j,k} = 1$ if the relation $(i, j, k)$ is available.
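For concreteness, the reconstruction $\mathcal{X} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C$ can be written as a single einsum contraction, as in the sketch below; the toy dimensions are arbitrary.

```python
import numpy as np

def tucker_reconstruct(G, A, B, C):
    """X[i,j,k] = sum_{x,y,z} G[x,y,z] * A[i,x] * B[j,y] * C[k,z]  (G x1 A x2 B x3 C)."""
    return np.einsum("xyz,ix,jy,kz->ijk", G, A, B, C)

# Toy shapes: core (2, 2, 2), factors mapping to a (4, 3, 5) data tensor
rng = np.random.default_rng(0)
G = rng.normal(size=(2, 2, 2))
A, B, C = rng.normal(size=(4, 2)), rng.normal(size=(3, 2)), rng.normal(size=(5, 2))
X = tucker_reconstruct(G, A, B, C)       # shape (4, 3, 5)
```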

The previous study, i.e., MDNCaching [6], introduced the idea of eliminating false negatives from the candidate negatives while sampling the candidates from a dynamic distribution of negatives. MDNCaching utilizes the matrix decomposition technique to model a relationship between head and tail entities, i.e., let $h$ be a set of heads, $t$ be a set of tails, and $R$ be a relation matrix between $h$ and $t$ with $R \in \mathbb{R}^{|h| \times |t|}$. MDNCaching represents latent relations such that $R \approx H \times T^{\top}$, where $H$ represents the head features, and $T$ represents the tail features. However, the latent relation model is less expressive as it utilizes a relation matrix that reflects the most probable relationship between an entity pair concerning two feature spaces for head and tail entities. In addition, the matrix representation is weak in representing many-to-many relations as KG facts are interpreted in a two-dimensional tensor.

Fig. 2 Critical steps of proposed TuckerDNCaching negative sampling method.

To improve the expressiveness of the relation representations, we introduce a relation feature space in the latent relation model apart from the entity feature space, modeling KG facts in a three-dimensional tensor with the proposed method. We utilized the Tucker decomposition tensor representation, which is more general and flexible. Given the KG facts $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, Tucker decomposition outputs a weight tensor $\mathcal{W} \in \mathbb{R}^{P \times Q \times R}$ and three matrices $E_h \in \mathbb{R}^{I \times P}$, $R \in \mathbb{R}^{J \times Q}$, and $E_t \in \mathbb{R}^{K \times R}$ such that $\mathcal{X} \approx \mathcal{W} \times_1 E_h \times_2 R \times_3 E_t$ (where $\times_n$ indicates the tensor product along the $n$th mode).
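A minimal sketch of how such a latent relation model could score a candidate triplet is given below: the learned core tensor $\mathcal{W}$ is contracted with the head entity, relation, and tail entity feature vectors, and the result is passed through a sigmoid to decide whether the candidate looks like a latent fact. The sigmoid and the 0.5 threshold are illustrative assumptions rather than the exact training objective.

```python
import numpy as np

def latent_relation_score(W, e_h, r, e_t):
    """Contract W x1 e_h x2 r x3 e_t into a scalar plausibility score.

    W   : core weight tensor, shape (P, Q, R)
    e_h : head entity feature vector, shape (P,)
    r   : relation feature vector, shape (Q,)
    e_t : tail entity feature vector, shape (R,)
    """
    return np.einsum("pqr,p,q,r->", W, e_h, r, e_t)

def is_latent_positive(W, e_h, r, e_t, threshold=0.5):
    """Flag a candidate as a latent (false-negative) fact when its probability is high."""
    prob = 1.0 / (1.0 + np.exp(-latent_relation_score(W, e_h, r, e_t)))
    return prob > threshold
```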

3.3.2 Proposed framework

Our framework for the proposed negative sampling method is illustrated in Figure 2, showing the critical steps for a tail corruption scenario. In step 1, the negative sampling method, TuckerDNCaching, performs latent relation model training, referring to positives. The significant contribution of the proposed method is eradicating false negatives from the negative sample space. Since the idea of TuckerDNCaching is to model the dynamic distribution of all candidate negatives, the candidate space for a negative sample is initiated with all entities except for the given positive elements. For example, given the positive (DaVinci, painted, MonaLisa), the proposed method initializes the candidate negatives as (TheCreationOfAdam, TheLastSupper, Louvre, LadyWithAnErmine, Paris), where the entity set is (DaVinci, MonaLisa, TheCreationOfAdam, TheLastSupper, Louvre, LadyWithAnErmine, Paris). In step 2, TuckerDNCaching drops true positives to eliminate all the observed positive facts from the candidate sample space. As the candidate negatives may comprise true positives, since a KG consists of one-to-many, many-to-many, and many-to-one relations, it is essential to drop true positives from the candidate negatives. For instance, given the positive (DaVinci, painted, TheLastSupper), the candidate negative TheLastSupper is removed from the negative sample space. In step 3, TuckerDNCaching drops false negatives (latent positives) from the candidate negatives by utilizing the latent relation model, which is trained in step 1. Identifying false negatives before the importance evaluation for gradient selection is essential since latent positives comprise large gradients. Therefore, the proposed negative sampling method predicts latent relations to exclude false negatives from the candidate negatives, i.e., given the tail-corrupted triplet (DaVinci, painted, ?), the latent relation model predicts (DaVinci, painted, LadyWithAnErmine) as a latent fact and removes that from the candidate negatives. To avoid vanishing gradients while exploring quality negatives, TuckerDNCaching introduces an importance probability $p_{imp}(x)$:

head corruption: $p_{imp}(\bar{h}) = p(\bar{h} \mid (t, r)) = \dfrac{\exp(f(\bar{h}, r, t))}{\sum_{\bar{h}_i \in H_{(t,r)}} \exp(f(\bar{h}_i, r, t))}$

tail corruption: $p_{imp}(\bar{t}) = p(\bar{t} \mid (h, r)) = \dfrac{\exp(f(h, r, \bar{t}))}{\sum_{\bar{t}_i \in T_{(h,r)}} \exp(f(h, r, \bar{t}_i))}$

where $H_{(t,r)}$ is the set of head candidate negatives and $T_{(h,r)}$ is the set of tail candidate negatives. The importance probability $p_{imp}(x)$ samples essential and effective negatives from the negative candidates considering their gradients, which are evaluated with reference to the underlying scoring function $f(h, r, t)$. A higher $p_{imp}(x)$ reflects that the candidate negative is more effective and important for KGE model learning. In step 4, following the probability $p_{imp}(x)$, the proposed method evaluates the importance probability for all candidate negatives. In step 5, the quality negatives are screened, considering the probability values, and the method then directs the screened negatives, i.e., (DaVinci, painted, TheCreationOfAdam), to KGE model training. In step 6, the typical KGE model training is performed, discriminating the positive (DaVinci, painted, MonaLisa) against the generated negative, i.e., (DaVinci, painted, TheCreationOfAdam). However, modeling the distribution of all candidate negatives and the selection of quality negatives introduce complexity in executing the above steps in our proposed negative sampling method. A caching technique is adopted to handle the execution efficiency, introducing a lazy update procedure for evaluating the importance of candidate negatives and updating the caches. The integration of latent relation model training, negative cache initialization, and the cache update procedure with the existing KGE model training framework is described in Section 3.3.3 while carefully referring to the critical steps in the proposed TuckerDNCaching.
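The importance evaluation and screening of steps 4 and 5 can be sketched as a softmax over the scores of the already-filtered candidates, keeping the most probable ones for the negative cache; the cache size and the `score_fn` interface below are assumptions for illustration.

```python
import numpy as np

def importance_probabilities(scores):
    """p_imp: softmax of candidate scores under the KGE scoring function f."""
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())     # shift for numerical stability
    return weights / weights.sum()

def screen_tail_negatives(h, r, candidate_tails, score_fn, cache_size=50):
    """Steps 4-5: evaluate p_imp for each filtered candidate tail and keep the top ones."""
    scores = [score_fn(h, r, t_bar) for t_bar in candidate_tails]
    p_imp = importance_probabilities(scores)
    top = np.argsort(p_imp)[::-1][:cache_size]              # most important candidates
    return [(h, r, candidate_tails[i]) for i in top]        # screened negatives for the cache
```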
