source:SAFE REINFORCEMENT LEARNING by PHILIP S. THOMAS
1. Importance Sampling (IS)
- For each trajectory in the batch D, the per-trajectory IS estimate is the trajectory's return multiplied by its importance weight: the product over time steps of pi_e(a_t|s_t) / pi_b(a_t|s_t).
- The IS estimator is the mean of the individual per-trajectory IS estimates.
- Properties: unbiased, but can have very high variance (the weights are products of ratios, so they can blow up).
- Upper and Lower Bounds on the IS Estimator
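The basic estimator above can be sketched in a few lines. This is a minimal illustration, assuming the same batch format used by the weighted_is code later in these notes: each history is a list of per-time-step probabilities of the taken actions under each policy, plus a list of per-time-step rewards.

```python
def importance_sampling(pi_b, pi_e, reward):
    """Ordinary (unweighted) IS estimate of the evaluation policy's performance."""
    estimates = []
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        # Importance weight of the whole trajectory:
        # product of the per-step probability ratios pi_e / pi_b.
        weight = 1.0
        for prob_b, prob_e in zip(history_b, history_e):
            weight *= prob_e / prob_b
        # Weight the (undiscounted) return of this trajectory.
        estimates.append(weight * sum(history_reward))
    # Mean of the per-trajectory estimates -- this is what makes it unbiased.
    return sum(estimates) / len(estimates)
```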
2. PDIS (Per-Decision Importance Sampling)
- Lower variance than ordinary IS, and still unbiased.
- Assumes all rewards are normalized.
- Uses a different importance weight for each reward rather than one importance weight for the entire return: the reward at time t is weighted only by the probability ratios of the actions taken up to and including time t.
- Batch case: average the per-trajectory PDIS estimates over all trajectories in the batch.
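A sketch of the batch PDIS estimator, under the same assumed input format as the other snippets in these notes and with no discounting (gamma = 1). The only change from ordinary IS is that the running product of ratios is applied to each reward as it is accumulated, instead of one full-trajectory weight applied to the whole return.

```python
def per_decision_is(pi_b, pi_e, reward):
    """Per-decision IS estimate: each reward uses only the ratios up to its time step."""
    estimates = []
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        weight = 1.0     # running product of ratios pi_e / pi_b up to time t
        estimate = 0.0
        for prob_b, prob_e, r in zip(history_b, history_e, history_reward):
            weight *= prob_e / prob_b
            estimate += weight * r  # reward at t weighted by ratios for steps <= t
        estimates.append(estimate)
    # Batch case: mean of the per-trajectory PDIS estimates.
    return sum(estimates) / len(estimates)
```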
3. NPDIS (Normalized Per-Decision Importance Sampling)
4. WIS (Weighted Importance Sampling)
'''
Weighted Importance Sampling (WIS)
* Works in a batch setting.
pi_b : batch of histories; each history is the list of per-time-step probabilities of the taken actions under the behavior policy
pi_e : batch of histories; each history is the list of per-time-step probabilities of the same actions under the evaluation policy
reward : batch of lists of rewards obtained per time step
Returns a normalized estimate of performance under the evaluation policy
(biased, but consistent, and typically lower variance than ordinary IS).
'''
def weighted_is(pi_b, pi_e, reward):
    total_weighted_return = 0.0
    total_weight = 0.0
    for history_b, history_e, history_reward in zip(pi_b, pi_e, reward):
        # Importance weight of the whole trajectory: product of the
        # per-step action-probability ratios pi_e / pi_b.
        history_weight = 1.0
        for prob_b, prob_e in zip(history_b, history_e):
            history_weight *= prob_e / prob_b
        # Undiscounted return of the trajectory.
        history_return = sum(history_reward)
        total_weighted_return += history_weight * history_return
        total_weight += history_weight
    # Normalize by the sum of the importance weights rather than by the
    # number of trajectories -- this is what makes the estimator "weighted".
    return total_weighted_return / total_weight