2019.10.18 note
Quaternion Knowledge Graph Embeddings
In this work, the authors move beyond traditional complex-valued representations, introducing more expressive hypercomplex representations to model entities and relations for knowledge graph embeddings. More specifically, quaternion embeddings (hypercomplex-valued embeddings with three imaginary components) are used to represent entities, and relations are modeled as rotations in quaternion space (a minimal scoring sketch follows below). Experimental results demonstrate that their method achieves state-of-the-art performance on four well-established knowledge graph completion benchmarks.
Code: github
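As a minimal sketch of the scoring idea, assuming the QuatE-style recipe of rotating the head entity by a unit relation quaternion (Hamilton product) and then taking a quaternion inner product with the tail; the array layout and function names below are illustrative, not the authors' released code:

```python
import numpy as np

def hamilton_product(q, p):
    """Hamilton product of two quaternion embeddings.
    q, p: arrays of shape (d, 4) holding (real, i, j, k) components per dimension."""
    a1, b1, c1, d1 = q.T
    a2, b2, c2, d2 = p.T
    return np.stack([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,   # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,   # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,   # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,   # k part
    ], axis=1)

def quaternion_score(head, relation, tail):
    """Rotate the head by the unit-normalized relation quaternion,
    then take the quaternion inner product with the tail."""
    rel = relation / np.linalg.norm(relation, axis=1, keepdims=True)  # unit quaternions
    rotated = hamilton_product(head, rel)
    return np.sum(rotated * tail)

# toy usage: 8-dimensional quaternion embeddings
rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=(8, 4)) for _ in range(3))
print(quaternion_score(h, r, t))
```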
MixMatch: A Holistic Approach to Semi-Supervised Learning
In this work, they propose a semi-supervised learning algorithm that combines three methods: consistency regularization (the probabilities predicted by the model should be consistent across two runs of stochastic data augmentation, i.e., two random seeds of the augmentation), entropy minimization, and traditional regularization (weight decay).
Given a batch X of labeled examples with corresponding one-hot targets (representing one of L possible labels) and an equally sized batch U of unlabeled examples, MixMatch produces a processed batch of augmented labeled examples X’ and a batch of augmented unlabeled examples with “guessed” labels U’. These are then used to compute separate labeled and unlabeled loss terms. More formally, the combined semi-supervised loss is $\mathcal{L}=\mathcal{L}_X+\lambda_U\mathcal{L}_U$, where $\mathcal{L}_X$ is the cross-entropy between the (mixed) targets and model predictions on X’ and $\mathcal{L}_U$ is the squared $L_2$ distance between the guessed labels and predictions on U’.
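A rough sketch of that combined objective, assuming soft (mixed/sharpened) targets are already available; the tensor names and the default weight `lambda_u` are illustrative rather than taken from the official implementation:

```python
import torch
import torch.nn.functional as F

def mixmatch_loss(logits_x, targets_x, logits_u, guessed_u, lambda_u=75.0):
    """Combined MixMatch loss: cross-entropy on the augmented labeled batch X'
    plus a weighted squared-L2 term on the unlabeled batch U'.
    targets_x / guessed_u are soft label distributions (after MixUp / sharpening)."""
    # labeled term: cross-entropy against the (possibly mixed) soft targets
    loss_x = -(targets_x * F.log_softmax(logits_x, dim=1)).sum(dim=1).mean()
    # unlabeled term: L2 distance between predicted probabilities and guessed labels
    probs_u = torch.softmax(logits_u, dim=1)
    loss_u = F.mse_loss(probs_u, guessed_u)
    return loss_x + lambda_u * loss_u
```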
MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing
- GCN: $H^{(i+1)}=\sigma(\hat{A}H^{(i)}W^{(i)})$ with $\hat{A}=D^{-1/2}(A+I)D^{-1/2}$, where $D$ is the diagonal node-degree matrix and $A$ is the adjacency matrix describing the graph structure. Their proposed layer: $H^{(i+1)}=\textbf{concat}_j\big[\sigma(\hat{A}^{j}H^{(i)}W^{(i)}_{j})\big]$, i.e., outputs for different powers $j$ of $\hat{A}$ are concatenated column-wise (a dense-matrix sketch of this layer follows the list).
- They prove that vanilla GCNs are not capable of representing general layer-wise neighborhood mixing, whereas GCNs built from their proposed layer are.
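A minimal dense-matrix sketch of such a layer, using powers $j \in \{0,1,2\}$ and a sigmoid nonlinearity; the class, function, and argument names are mine, and the real implementation works with sparse adjacency matrices:

```python
import torch
import torch.nn as nn

def normalized_adjacency(A):
    """Â = D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + torch.eye(A.size(0))
    d = A_tilde.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

class MixHopLayer(nn.Module):
    """One MixHop-style layer: mixes several powers of the normalized adjacency Â
    and concatenates the per-power outputs along the feature dimension."""
    def __init__(self, in_dim, out_dim, powers=(0, 1, 2)):
        super().__init__()
        self.powers = powers
        self.linears = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in powers)

    def forward(self, A_hat, H):
        outs = []
        for j, lin in zip(self.powers, self.linears):
            HJ = H
            for _ in range(j):               # Â^j H via repeated matrix products
                HJ = A_hat @ HJ
            outs.append(torch.sigmoid(lin(HJ)))
        return torch.cat(outs, dim=1)        # column-wise concatenation over powers
```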
TransSent: Towards Generation of Structured Sentences with Discourse Marker
This paper focuses on the task of generating long structured sentences with explicit discourse markers by proposing a new task, Sentence Transfer, and a novel model architecture, TransSent. For example: "I like apples because they are sweet." head -> I like apples, relation -> because, tail -> they are sweet.
Their assumption is similar to TransE. They introduce three loss terms: a reconstruction loss, a distance loss, and a ratio loss. The distance loss encourages the prediction to be close to the tail, and the ratio loss encourages the term dis(prediction, tail)/dis(prediction, head) to be large (a minimal sketch of these terms follows below).
The dataset: github
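A minimal sketch of those two terms, assuming (as in TransE) that the prediction is modeled as head + relation over fixed sentence embeddings; the sentence encoders and the exact form of the losses in the paper are abstracted away, and the names below are mine:

```python
import torch

def transsent_distance_terms(head_emb, rel_emb, tail_emb):
    """TransE-style terms on sentence embeddings: the 'prediction' is head + relation.
    Returns (distance_loss, ratio), where ratio = dis(pred, tail) / dis(pred, head)
    as described in the note above."""
    pred = head_emb + rel_emb                          # TransE-style composition (assumption)
    dist_to_tail = torch.norm(pred - tail_emb, dim=-1)
    dist_to_head = torch.norm(pred - head_emb, dim=-1)
    distance_loss = dist_to_tail.mean()                # pull the prediction toward the tail
    ratio = (dist_to_tail / (dist_to_head + 1e-8)).mean()
    return distance_loss, ratio
```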
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow
- In Figure 1, they compare autoregressive, non-autoregressive and their proposed model. Non-autoregressive seq2seq models generate all tokens in one pass, which leads to increased efficiency through parallel processing on hardware such as GPUs.
- The model is shown in Figure 2.
- This work also utilizes these methods: variational inference (ELBO: reconstruction error and KL divergence) during training, a normal distribution for the latent variables, actnorm and invertible multi-head linear layers in the decoder, affine coupling layers, an NN for predicting the target sequence length, noisy parallel decoding, and importance weighted decoding (a generic sketch of an affine coupling layer follows this list).
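Among those components, the affine coupling layer is the core invertible block. A generic RealNVP/Glow-style sketch (not FlowSeq's exact layer), assuming an even feature dimension; all names are illustrative:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Generic affine coupling layer: one half of the features is transformed
    with a scale and shift predicted from the other half, so the layer is
    invertible and its Jacobian log-determinant is cheap to compute."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),       # outputs log-scale and shift for the second half
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)         # keep scales bounded for stability
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=-1)       # contribution to the log-likelihood / ELBO
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        log_s, t = self.net(y1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=-1)
```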
Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification
- This work first presents a heterogeneous information network (HIN) framework for modeling short texts.
- Unfortunately, GCN cannot be directly applied to the HIN for short texts due to node heterogeneity. Specifically, the HIN contains three types of nodes (documents, topics, and entities) with different feature spaces. To address this, they propose a heterogeneous graph convolution, which accounts for the differences between the types of information and projects them into an implicit common space with type-specific transformation matrices: $H^{(l+1)}=\sigma\big(\sum_{t} A_t H^{(l)}_t W^{(l)}_t\big)$, where $t$ is the node-type index (a dense sketch follows the list).
- They also propose a dual-level attention mechanism (type-level and node-level attention) and replace $A_t$ with the attention weight matrix $B_t$.
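A minimal dense sketch of the heterogeneous graph convolution above; the type count, dimensions, and names are illustrative, and the dual-level attention would replace each $A_t$ with a learned $B_t$:

```python
import torch
import torch.nn as nn

class HeteroGraphConv(nn.Module):
    """Heterogeneous graph convolution sketch: each node type t has its own
    feature matrix H_t, adjacency slice A_t (all nodes x type-t nodes), and
    projection W_t; the per-type messages are summed into a common space."""
    def __init__(self, in_dims, out_dim):
        super().__init__()
        # one transformation matrix per node type (e.g. document / topic / entity)
        self.W = nn.ModuleList(nn.Linear(d, out_dim, bias=False) for d in in_dims)

    def forward(self, A_list, H_list):
        # A_list[t]: (N_all, N_t) adjacency slice; H_list[t]: (N_t, in_dims[t]) features
        out = sum(A_t @ W_t(H_t) for A_t, W_t, H_t in zip(A_list, self.W, H_list))
        return torch.relu(out)
```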