Reinforcement Learning for Relation Classification from Noisy Data: Reading Notes


《Reinforcement Learning for Relation Classification from Noisy Data》 (link to the original paper)


Problem
  1. instance selection

    1. given a set of sentence–relation-label pairs $X = \{(x_1, r_1), (x_2, r_2), \dots, (x_n, r_n)\}$, where $x_i$ is a sentence associated with two entities $(h_i, t_i)$ and $r_i$ is a noisy relation label produced by distant supervision
    2. determine which sentences truly describe the relation and should be selected as training instances
  2. relation classification

    1. given a sentence $x_i$ and the mentioned entity pair $(h_i, t_i)$
    2. predict the semantic relation $r_i$ expressed in $x_i$
Overview

Overall process (figure taken from the original paper)

Instance selector

The agent follows a policy to decide which action to take (selecting the current sentence or not) at each state (consisting of the current sentence, the set of sentences chosen so far, and the entity pair), and then receives a reward from the relation classifier at the terminal state, once all selections have been made.

We split the training sentences $X = \{x_1, \dots, x_n\}$ into $N$ bags $B = \{B_1, B_2, \dots, B_N\}$ and compute a reward only when data selection within a bag is finished. Each bag corresponds to a distinct entity pair, and each bag $B_k$ is a sequence of sentences $x_1^k, x_2^k, \dots, x_{|B_k|}^k$ that share the same relation label $r_k$.
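As a minimal sketch of this grouping step (the helper name and tuple layout are my own illustrative assumptions, not the paper's code), bags can be built by keying sentences on their entity pair:

```python
from collections import defaultdict

def split_into_bags(instances):
    """Group (sentence, head, tail, relation) tuples into bags:
    one bag per distinct entity pair, every sentence in a bag
    sharing that pair's (noisy) relation label."""
    bags = defaultdict(list)
    for sentence, head, tail, relation in instances:
        bags[(head, tail)].append((sentence, relation))
    return list(bags.values())

# usage (toy data):
# bags = split_into_bags([
#     ("Barack Obama married Michelle Obama in 1992.", "Barack Obama", "Michelle Obama", "marriage"),
# ])
```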

policy: select or not

state: a continuous real-valued vector $F(s_i)$, composed of the following (a small sketch follows this list)

  1. the vector representation of the current sentence
    • taken from the nonlinear layer of the CNN
  2. the representation of the chosen sentence set
    • the average of the chosen sentences' vector representations
  3. the two entities in the sentence
    • looked up in a pre-trained knowledge graph embedding table (TransE)
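A minimal sketch of assembling $F(s_i)$ from the three components above (function and argument names are illustrative assumptions; the paper's exact feature layout may differ):

```python
import numpy as np

def state_vector(current_sent_vec, chosen_sent_vecs, head_emb, tail_emb):
    """F(s_i): concatenation of the current sentence vector, the average of
    the already-chosen sentence vectors, and the TransE embeddings of the
    two entities."""
    if len(chosen_sent_vecs) > 0:
        chosen_repr = np.mean(chosen_sent_vecs, axis=0)
    else:
        # nothing chosen yet: use a zero vector of the same size
        chosen_repr = np.zeros_like(current_sent_vec)
    return np.concatenate([current_sent_vec, chosen_repr, head_emb, tail_emb])
```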

action: whether or not to select the current sentence, sampled from the policy function (logistic regression)
(policy-function equation image in the original paper; a reconstruction follows)
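The equation itself is not reproduced in these notes; a standard way to write such a logistic-regression (Bernoulli) policy over the state features, which should match the paper up to notation, is:

$$\pi_\Theta(s_i, a_i) = a_i\,\sigma\bigl(W \cdot F(s_i) + b\bigr) + (1 - a_i)\,\bigl(1 - \sigma(W \cdot F(s_i) + b)\bigr)$$

where $a_i \in \{0, 1\}$ indicates whether sentence $x_i$ is selected and $\sigma$ is the sigmoid function.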
reward: delayed, given by the relation classifier only at the terminal state (reward-equation image in the original paper; a reconstruction follows the list below)

  1. For the special case $\hat{B} = \emptyset$, we set the reward to the average likelihood of all sentences in the training data
  2. where $\hat{B}$ is the set of selected sentences, a subset of $B$, and $r$ is the relation label of bag $B$
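Putting the two points above together, the terminal reward can be reconstructed as the average log-likelihood that the relation classifier assigns to the bag's relation label over the selected sentences (a reconstruction consistent with these notes; the paper's exact formula may differ slightly):

$$r\bigl(s_{|B|+1} \mid B\bigr) = \frac{1}{|\hat{B}|} \sum_{x_j \in \hat{B}} \log p\bigl(r \mid x_j\bigr)$$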

Optimization:
objective function: the expected total reward of the policy's selections, maximized over Θ (objective-function image in the original paper)

value function: determined entirely by the terminal reward, since no intermediate reward is given before all selections in a bag are made

$v_i = V(s_i \mid B) = r(s_{|B|+1} \mid B), \quad \text{for } i = 1, 2, \dots, |B|$

update policy:
According to the policy gradient theorem and the REINFORCE algorithm, a Monte Carlo based policy gradient method is used (update-rule image in the original paper; a reconstruction follows).
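A reconstruction of the per-bag REINFORCE update implied by the value function above (standard form; the paper may add a learning-rate schedule or a baseline):

$$\Theta \leftarrow \Theta + \alpha \sum_{i=1}^{|B|} v_i \, \nabla_\Theta \log \pi_\Theta(a_i \mid s_i)$$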

Relation classifier

CNN + softmax (architecture proposed by Kim, Y.)
- CNN (architecture figure in the original paper)
- loss: cross entropy over the predicted relation distribution (equation image in the original paper; a sketch follows)
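A minimal PyTorch sketch of a CNN + softmax sentence classifier of this kind (hyper-parameters, plain word embeddings, and the class name are illustrative assumptions; the paper's network, e.g. its use of position features, may differ):

```python
import torch
import torch.nn as nn

class RelationCNN(nn.Module):
    """Word embeddings -> 1-D convolution -> max-over-time pooling -> softmax."""
    def __init__(self, vocab_size, emb_dim=50, num_filters=230,
                 window=3, num_relations=53):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=window, padding=1)
        self.fc = nn.Linear(num_filters, num_relations)

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        h = torch.tanh(self.conv(x))               # (batch, num_filters, seq_len)
        h, _ = h.max(dim=2)                        # max-over-time pooling
        return self.fc(h)                          # relation logits

# cross-entropy loss over the predicted relation distribution:
# loss = nn.CrossEntropyLoss()(model(token_ids), relation_labels)
```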

Model training
  1. the pre-training strategy is quite crucial and is widely recommended by many other reinforcement learning studies
  2. in order to have a stable update, Θ′ and Φ′ are updated by linear interpolation: Θ′ ← (1 − τ)Θ′ + τΘ and Φ′ ← (1 − τ)Φ′ + τΦ, where τ ≪ 1 is a hyper-parameter (a small soft-update sketch follows this list)
  3. procedure
    • training-procedure figure in the original paper
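A minimal sketch of that linear-interpolation (soft) update, assuming PyTorch-style parameter lists (names are illustrative):

```python
import torch

@torch.no_grad()
def soft_update(target_params, source_params, tau=0.001):
    """target <- (1 - tau) * target + tau * source, applied parameter-wise.
    With tau << 1 the primed (target) networks change slowly, stabilizing training."""
    for p_target, p_source in zip(target_params, source_params):
        p_target.mul_(1.0 - tau).add_(tau * p_source)

# usage: soft_update(list(selector_target.parameters()), list(selector.parameters()))
```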
Note
  1. distant supervision: In distant supervision, we make use of an already existing database, such as Freebase or a domain-specific database, to collect examples for the relation we want to extract, and then use these examples to automatically generate training data. For example, Freebase contains the fact that Barack Obama and Michelle Obama are married. We take this fact and label every sentence in which “Barack Obama” and “Michelle Obama” appear together as a positive example of the marriage relation. This way we can easily generate a large amount of (possibly noisy) training data (a toy labeling sketch is given after this list).

  2. Monte Carlo policy gradient theorem
    high variance, slow convergence rate
    (Monte Carlo policy gradient figure in the original paper)
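A toy sketch of the distant-supervision labeling described in note 1 (the KB triples and the plain string-matching heuristic are simplified assumptions):

```python
def distant_supervision(sentences, kb_facts):
    """Label every sentence that mentions both entities of a known
    (head, relation, tail) fact with that relation. The naive matching
    is exactly what introduces label noise."""
    labeled = []
    for sent in sentences:
        for head, relation, tail in kb_facts:
            if head in sent and tail in sent:
                labeled.append((sent, head, tail, relation))  # possibly a false positive
    return labeled

# usage:
# kb_facts = [("Barack Obama", "marriage", "Michelle Obama")]
# distant_supervision(["Barack Obama met Michelle Obama in 1989."], kb_facts)
```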

Thoughts
  1. When training the policy function, the reward is computed per bag for updating, and all sentences in a bag share the same relation label r. For other tasks, how should the bags be divided?
  2. For the instance selector's parameter updates, does the order of the sentences in the sequence matter?
    Given the current way v(s) (or the reward) is computed, if the number of (s, a) pairs is the same but their order differs, it should make no difference.
  3. Is the relation classifier's loss function (cross entropy) missing a term?
  4. Training uses a Monte Carlo based policy gradient; are there alternatives?