protomotions.agents.ppo.utils module
Utility functions for PPO algorithm.
This module provides helper functions for PPO, including advantage computation using Generalized Advantage Estimation (GAE).
- Key Functions:
discount_values: Compute GAE advantages from rewards and values
protomotions.agents.ppo.utils.discount_values(mb_fdones, mb_values, mb_rewards, mb_next_values, gamma, tau)
Compute Generalized Advantage Estimation (GAE) advantages.
Computes advantages using GAE-Lambda, which provides a bias-variance tradeoff for advantage estimation. Advantages are computed by iterating backwards through the rollout and bootstrapping from the next-state value predictions.
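For reference, the GAE-Lambda recurrence from Schulman et al. can be written as below, using d_t for the done flags and tau as the lambda parameter. The (1 - d_t) masking shown is the standard convention and is an assumption about this module's exact handling:

$$\delta_t = r_t + \gamma\,(1 - d_t)\,V(s_{t+1}) - V(s_t)$$
$$\hat{A}_t = \delta_t + \gamma\,\tau\,(1 - d_t)\,\hat{A}_{t+1}, \qquad \hat{A}_{T} = 0$$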
- Parameters:
mb_fdones – Done flags (num_steps, num_envs). 1.0 = episode ended.
mb_values – Value predictions at each timestep (num_steps, num_envs).
mb_rewards – Rewards received at each timestep (num_steps, num_envs).
mb_next_values – Value predictions for next states (num_steps, num_envs).
gamma – Discount factor for future rewards (typically 0.99).
tau – GAE lambda parameter for bias-variance tradeoff (typically 0.95).
- Returns:
Tensor of advantages with shape (num_steps, num_envs).
Example
>>> advantages = discount_values(dones, values, rewards, next_values, 0.99, 0.95)
>>> returns = advantages + values  # Can compute returns from advantages
Note
GAE-Lambda trades off bias and variance in the advantage estimate: lower lambda means lower variance but more bias, and higher lambda the reverse. Lambda = 0 reduces to the one-step TD error; lambda = 1 recovers Monte Carlo returns.
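The following is a minimal sketch of the backward pass described above, assuming PyTorch tensors with the shapes listed in the parameters. It illustrates the standard GAE-Lambda computation and is not necessarily the module's exact implementation; the name discount_values_sketch is hypothetical.

```python
import torch

def discount_values_sketch(mb_fdones, mb_values, mb_rewards, mb_next_values, gamma, tau):
    # Illustrative GAE-Lambda backward pass; all tensors are (num_steps, num_envs).
    num_steps = mb_rewards.shape[0]
    advantages = torch.zeros_like(mb_rewards)
    last_gae = torch.zeros_like(mb_rewards[0])
    for t in reversed(range(num_steps)):
        not_done = 1.0 - mb_fdones[t]
        # One-step TD error, bootstrapped from the next-state value prediction.
        delta = mb_rewards[t] + gamma * not_done * mb_next_values[t] - mb_values[t]
        # Exponentially weighted sum of TD errors, reset at episode boundaries.
        last_gae = delta + gamma * tau * not_done * last_gae
        advantages[t] = last_gae
    return advantages
```

Value-function targets can then be recovered as advantages + mb_values, as in the example above.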
- Reference:
Schulman et al. “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (2015)