Andrew Ng's Coursera Machine Learning Specialization, Machine Learning: Unsupervised Learning, Recommenders, Reinforcement Learning, Week 3: all Jupyter notebook files (including the lab exercise files).

This assignment

Exercise 1

# UNQ_C1
# GRADED CELL

# Create the Q-Network
q_network = Sequential([
    ### START CODE HERE ### 
    tf.keras.Input(shape=state_size),                       # specify the input size (state vector)
    Dense(64, activation='relu', name='layer1'),            # first hidden layer, 64 units
    Dense(64, activation='relu', name='layer2'),            # second hidden layer, 64 units
    Dense(num_actions, activation='linear', name='layer3')  # one linear output (Q-value) per action
    ### END CODE HERE ### 
    ])

# Create the target Q^-Network (same architecture as the Q-Network)
target_q_network = Sequential([
    ### START CODE HERE ### 
    tf.keras.Input(shape=state_size),                       # specify the input size (state vector)
    Dense(64, activation='relu', name='layer1'),
    Dense(64, activation='relu', name='layer2'),
    Dense(num_actions, activation='linear', name='layer3')
    ### END CODE HERE ###
    ])

### START CODE HERE ### 
optimizer = tf.keras.optimizers.Adam()
### END CODE HERE ###
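For reference, here is a minimal, self-contained sketch of the cell above. In the notebook, state_size and num_actions come from the Gym environment created in an earlier cell (env.observation_space.shape and env.action_space.n), so the hard-coded values below are illustrative assumptions matching the Lunar Lander setup rather than part of the graded cell.

# Self-contained sketch of the graded cell above (not the graded cell itself).
# state_size and num_actions are assumptions: Lunar Lander has an 8-dimensional
# observation vector and 4 discrete actions.
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

state_size = (8,)    # assumed: 8-dimensional state vector
num_actions = 4      # assumed: 4 discrete actions

q_network = Sequential([
    tf.keras.Input(shape=state_size),
    Dense(64, activation='relu', name='layer1'),
    Dense(64, activation='relu', name='layer2'),
    Dense(num_actions, activation='linear', name='layer3'),
])

# The target Q^-Network mirrors the Q-Network; during training its weights are
# updated slowly towards the Q-Network's weights to keep the targets stable.
target_q_network = Sequential([
    tf.keras.Input(shape=state_size),
    Dense(64, activation='relu', name='layer1'),
    Dense(64, activation='relu', name='layer2'),
    Dense(num_actions, activation='linear', name='layer3'),
])

optimizer = tf.keras.optimizers.Adam()

q_network.summary()  # Dense stack: 8 -> 64 -> 64 -> 4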

Exercise 2

# UNQ_C2
# GRADED FUNCTION: compute_loss

def compute_loss(experiences, gamma, q_network, target_q_network):
    """ 
    Calculates the loss.
    
    Args:
      experiences: (tuple) tuple of ["state", "action", "reward", "next_state", "done"] namedtuples
      gamma: (float) The discount factor.
      q_network: (tf.keras.Sequential) Keras model for predicting the q_values
      target_q_network: (tf.keras.Sequential) Keras model for predicting the targets
          
    Returns:
      loss: (TensorFlow Tensor, scalar, dtype=float32) the mean-squared error between
            the y targets and the Q(s,a) values.
    """
    
    # Unpack the mini-batch of experience tuples
    states, actions, rewards, next_states, done_vals = experiences
    
    # Compute max Q^(s,a)
    max_qsa = tf.reduce_max(target_q_network(next_states), axis=-1)
    
    # Set y = R if episode terminates, otherwise set y = R + γ max Q^(s,a).
    ### START CODE HERE ### 
    y_targets = rewards*done_vals + (rewards + gamma*max_qsa)*(1-done_vals)
    ### END CODE HERE ###
    
    # Get the q_values
    q_values = q_network(states)
    q_values = tf.gather_nd(q_values, tf.stack([tf.range(q_values.shape[0]),
                                                tf.cast(actions, tf.int32)], axis=1))
        
    # Compute the loss
    ### START CODE HERE ### 
    loss = MSE(y_targets, q_values)
    ### END CODE HERE ### 
    
    return loss
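A quick way to sanity-check compute_loss is to call it on a fabricated mini-batch using the networks from Exercise 1. The tensors below are random stand-ins for what the lab's replay buffer would return; the shapes (8-dimensional states, 4 actions) and the discount factor of 0.995 are assumptions matching the Lunar Lander setup, and MSE is imported here because compute_loss relies on it being in scope.

# Smoke test for compute_loss with a fabricated mini-batch (illustrative only;
# in the notebook the experiences are sampled from the replay buffer).
import tensorflow as tf
from tensorflow.keras.losses import MSE   # compute_loss expects MSE to be in scope

batch_size = 64
states      = tf.random.normal((batch_size, 8))                        # assumed 8-dim states
actions     = tf.random.uniform((batch_size,), 0, 4, dtype=tf.int32)   # assumed 4 actions
rewards     = tf.random.normal((batch_size,))
next_states = tf.random.normal((batch_size, 8))
done_vals   = tf.cast(tf.random.uniform((batch_size,)) < 0.1, tf.float32)

experiences = (states, actions, rewards, next_states, done_vals)
loss = compute_loss(experiences, gamma=0.995, q_network=q_network,
                    target_q_network=target_q_network)
print(float(loss))   # a single scalar: the MSE between the y targets and Q(s,a)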