protomotions.agents.ppo.config module#
Configuration classes for PPO agent.
This module defines all configuration dataclasses for the Proximal Policy Optimization (PPO) algorithm, including actor-critic architecture parameters, optimization settings, and training hyperparameters.
- Key Classes:
PPOAgentConfig: Main PPO agent configuration
PPOModelConfig: PPO model (actor-critic) configuration
PPOActorConfig: Policy network configuration
AdvantageNormalizationConfig: Advantage normalization settings
- class protomotions.agents.ppo.config.PPOActorConfig(mu_key, in_keys=<factory>, out_keys=<factory>, _target_='protomotions.agents.ppo.model.PPOActor', mu_model=<factory>, num_out=None, actor_logstd=-2.9, learnable_std=False)#
Bases: object
Configuration for PPO Actor network.
- Attributes:
  mu_key: The key of the output of the mu model.
  in_keys: Input observation keys.
  out_keys: Output keys: action, mean_action, neglogp.
  mu_model: Neural network model for action mean.
  num_out: Number of actions. Set from robot config.
  actor_logstd: Initial log std for action distribution.
  learnable_std: Make action log std learnable (requires_grad=True).
- mu_model: ModuleContainerConfig#
- __init__(mu_key, in_keys=<factory>, out_keys=<factory>, _target_='protomotions.agents.ppo.model.PPOActor', mu_model=<factory>, num_out=None, actor_logstd=-2.9, learnable_std=False)#
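As a rough illustration of how actor_logstd feeds a diagonal-Gaussian policy head that emits the action and neglogp keys listed above, here is a standalone pure-Python sketch. The helper name sample_action and its exact shape are assumptions for illustration, not part of the protomotions API; the neglogp formula is the standard Gaussian negative log-likelihood.

```python
import math
import random

def sample_action(mean, actor_logstd=-2.9):
    """Sample from a diagonal Gaussian policy; return (action, neglogp).

    `mean` plays the role of the mu_model output; `actor_logstd` is the
    shared log standard deviation, as in the config field above.
    """
    std = math.exp(actor_logstd)
    action = [m + std * random.gauss(0.0, 1.0) for m in mean]
    # Negative log-probability of the sampled action under N(mean, std^2).
    neglogp = sum(
        0.5 * ((a - m) / std) ** 2 + actor_logstd + 0.5 * math.log(2 * math.pi)
        for a, m in zip(action, mean)
    )
    return action, neglogp
```

With the default actor_logstd of -2.9 the initial std is exp(-2.9) ≈ 0.055, i.e. exploration starts narrow; learnable_std would make that log std a trainable parameter instead of a constant.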
- class protomotions.agents.ppo.config.PPOModelConfig(_target_='protomotions.agents.ppo.model.PPOModel', in_keys=<factory>, out_keys=<factory>, actor=<factory>, critic=<factory>, actor_optimizer=<factory>, critic_optimizer=<factory>)#
Bases: BaseModelConfig
Configuration for PPO Model (Actor-Critic).
- Attributes:
  in_keys: Input keys.
  out_keys: Output keys including actions and value estimate.
  actor: Actor (policy) network configuration.
  critic: Critic (value) network configuration.
  actor_optimizer: Optimizer settings for actor network.
  critic_optimizer: Optimizer settings for critic network.
- actor: PPOActorConfig#
- critic: ModuleContainerConfig#
- actor_optimizer: OptimizerConfig#
- critic_optimizer: OptimizerConfig#
- __init__(_target_='protomotions.agents.ppo.model.PPOModel', in_keys=<factory>, out_keys=<factory>, actor=<factory>, critic=<factory>, actor_optimizer=<factory>, critic_optimizer=<factory>)#
- class protomotions.agents.ppo.config.AdvantageNormalizationConfig(enabled=True, shift_mean=True, use_ema=True, ema_alpha=0.05, min_std=0.02, clamp_range=4.0)#
Bases: object
Configuration for advantage normalization.
- Attributes:
  enabled: Whether to normalize advantages.
  shift_mean: Subtract mean from advantages.
  use_ema: Use EMA for normalization statistics.
  ema_alpha: EMA weight for new data.
  min_std: Minimum std to prevent extreme normalization.
  clamp_range: Clamp normalized advantages to [-range, range].
- __init__(enabled=True, shift_mean=True, use_ema=True, ema_alpha=0.05, min_std=0.02, clamp_range=4.0)#
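A minimal standalone sketch of how these fields could combine in an EMA-based normalizer. The function name normalize_advantages, the (mean, std) state tuple, and the exact update rule are assumptions for illustration, not the protomotions implementation.

```python
import math

def normalize_advantages(advs, state, enabled=True, shift_mean=True,
                         use_ema=True, ema_alpha=0.05, min_std=0.02,
                         clamp_range=4.0):
    """Normalize a batch of advantages; `state` holds running (mean, std)."""
    if not enabled:
        return advs, state
    n = len(advs)
    batch_mean = sum(advs) / n
    batch_std = math.sqrt(sum((a - batch_mean) ** 2 for a in advs) / n)
    if use_ema and state is not None:
        mean, std = state
        mean = (1 - ema_alpha) * mean + ema_alpha * batch_mean
        std = (1 - ema_alpha) * std + ema_alpha * batch_std
    else:
        mean, std = batch_mean, batch_std
    std = max(std, min_std)  # min_std guards against divide-by-tiny blow-ups
    out = []
    for a in advs:
        x = (a - mean) if shift_mean else a
        x /= std
        out.append(max(-clamp_range, min(clamp_range, x)))  # clamp to [-r, r]
    return out, (mean, std)
```

The small ema_alpha (0.05) means statistics move slowly across batches, while clamp_range caps the influence of outlier advantages on any single update.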
- class protomotions.agents.ppo.config.AdaptiveLRConfig(enabled=False, desired_kl=0.01, min_lr=1e-05, max_lr=0.01)#
Bases: object
Configuration for adaptive learning rate based on KL divergence.
- Attributes:
  enabled: Enable adaptive learning rate based on KL divergence.
  desired_kl: Target KL divergence for adaptive learning rate.
  min_lr: Minimum learning rate for both actor and critic.
  max_lr: Maximum learning rate for both actor and critic.
- __init__(enabled=False, desired_kl=0.01, min_lr=1e-05, max_lr=0.01)#
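A common KL-based schedule consistent with these fields, sketched as a standalone function. The 2x/0.5x thresholds and the 1.5 scaling factor are conventional choices borrowed from other PPO implementations and are assumptions here, not confirmed details of this library.

```python
def adapt_lr(lr, kl, desired_kl=0.01, min_lr=1e-5, max_lr=1e-2, factor=1.5):
    """Shrink lr when the observed KL overshoots the target, grow it when
    the KL undershoots, then clamp to [min_lr, max_lr]."""
    if kl > desired_kl * 2.0:
        lr /= factor          # policy moved too far: slow down
    elif kl < desired_kl / 2.0:
        lr *= factor          # policy barely moved: speed up
    return min(max(lr, min_lr), max_lr)
```

Between desired_kl/2 and 2*desired_kl the rate is left alone, which keeps the schedule stable when the KL hovers near its target.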
- class protomotions.agents.ppo.config.L2C2Config(enabled=False, lambda_l2c2=0.1, obs_pairs=<factory>)#
Bases: object
L2C2 (Lipschitz-ratio) actor regularization (Kobayashi 2022).
Penalizes the ratio ||mu(noisy) - mu(clean)||^2 / ||noisy - clean||^2 so the actor's Lipschitz constant stays bounded.
- Attributes:
  enabled: Enable L2C2 regularization.
  lambda_l2c2: L2C2 loss coefficient.
  obs_pairs: Map from noisy actor obs key to clean counterpart key.
- __init__(enabled=False, lambda_l2c2=0.1, obs_pairs=<factory>)#
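The penalized ratio from the docstring can be written out directly. l2c2_penalty is an illustrative standalone helper following that formula, with mu standing in for the actor's mean network; it is not the library's code, and the eps term is an assumption to guard against identical observation pairs.

```python
def l2c2_penalty(mu, obs_noisy, obs_clean, lambda_l2c2=0.1, eps=1e-8):
    """lambda * ||mu(noisy) - mu(clean)||^2 / ||noisy - clean||^2.

    `mu` maps an observation (list of floats) to an action mean.
    `eps` avoids division by zero when the two observations coincide.
    """
    mu_n, mu_c = mu(obs_noisy), mu(obs_clean)
    num = sum((a - b) ** 2 for a, b in zip(mu_n, mu_c))
    den = sum((a - b) ** 2 for a, b in zip(obs_noisy, obs_clean))
    return lambda_l2c2 * num / (den + eps)
```

For a linear policy with slope k the ratio is k^2, so the penalty directly pressures the actor's local sensitivity to observation noise.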
- class protomotions.agents.ppo.config.PPOAgentConfig(batch_size, training_max_steps, _target_='protomotions.agents.ppo.agent.PPO', model=<factory>, num_steps=32, gradient_clip_val=0.0, fail_on_bad_grads=False, check_grad_mag=True, gamma=0.99, bounds_loss_coef=0.0, task_reward_w=1.0, num_mini_epochs=1, training_early_termination=None, save_epoch_checkpoint_every=1000, save_last_checkpoint_every=10, max_episode_length_manager=None, evaluator=<factory>, normalize_rewards=True, normalized_reward_clamp_value=5.0, tau=0.95, e_clip=0.2, clip_critic_loss=True, actor_clip_frac_threshold=0.6, entropy_coef=0.005, l2c2=<factory>, adaptive_lr=<factory>, advantage_normalization=<factory>)#
Bases: BaseAgentConfig
Main configuration class for PPO Agent.
- Attributes:
  batch_size: Training batch size.
  training_max_steps: Maximum training steps.
  model: Model configuration.
  num_steps: Environment steps per update.
  gradient_clip_val: Max gradient norm. 0 = disabled.
  fail_on_bad_grads: Fail on NaN/Inf gradients.
  check_grad_mag: Log gradient magnitude.
  gamma: Discount factor.
  bounds_loss_coef: Action bounds loss coefficient. 0 for tanh outputs.
  task_reward_w: Task reward weight.
  num_mini_epochs: Mini-epochs per update.
  training_early_termination: Stop early at this step. None = disabled.
  save_epoch_checkpoint_every: Save epoch_xxx.ckpt every N epochs.
  save_last_checkpoint_every: Save last.ckpt every K epochs.
  max_episode_length_manager: Episode length curriculum.
  evaluator: Evaluation config.
  normalize_rewards: Normalize rewards.
  normalized_reward_clamp_value: Clamp normalized rewards to [-val, val].
  tau: GAE lambda for advantage estimation.
  e_clip: PPO clipping parameter epsilon.
  clip_critic_loss: Clip critic loss similarly to the actor loss.
  actor_clip_frac_threshold: Skip actor update if clip_frac > threshold (e.g., 0.5).
  entropy_coef: Entropy bonus coefficient for learnable-std exploration.
  l2c2: L2C2 settings.
  adaptive_lr: Adaptive learning rate settings.
  advantage_normalization: Advantage normalization settings.
- model: PPOModelConfig#
- l2c2: L2C2Config#
- adaptive_lr: AdaptiveLRConfig#
- advantage_normalization: AdvantageNormalizationConfig#
- __init__(batch_size, training_max_steps, _target_='protomotions.agents.ppo.agent.PPO', model=<factory>, num_steps=32, gradient_clip_val=0.0, fail_on_bad_grads=False, check_grad_mag=True, gamma=0.99, bounds_loss_coef=0.0, task_reward_w=1.0, num_mini_epochs=1, training_early_termination=None, save_epoch_checkpoint_every=1000, save_last_checkpoint_every=10, max_episode_length_manager=None, evaluator=<factory>, normalize_rewards=True, normalized_reward_clamp_value=5.0, tau=0.95, e_clip=0.2, clip_critic_loss=True, actor_clip_frac_threshold=0.6, entropy_coef=0.005, l2c2=<factory>, adaptive_lr=<factory>, advantage_normalization=<factory>)#
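To see how gamma, tau, and e_clip interact, here is a compact standalone sketch of generalized advantage estimation and the clipped surrogate objective these fields parameterize, assuming a single non-terminating rollout of num_steps transitions. Function names are illustrative, not the agent's API; the real agent additionally handles episode termination masks.

```python
def gae(rewards, values, last_value, gamma=0.99, tau=0.95):
    """Generalized Advantage Estimation over one rollout.

    `tau` is the GAE lambda from the config; `last_value` bootstraps
    the value of the state after the final step.
    """
    advantages, last_adv = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = last_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_v - values[t]   # TD residual
        last_adv = delta + gamma * tau * last_adv          # lambda-weighted sum
        advantages[t] = last_adv
    return advantages

def clipped_surrogate(ratio, adv, e_clip=0.2):
    """PPO clipped objective for one sample (to be maximized): takes the
    pessimistic minimum of the raw and clipped importance-weighted advantage."""
    clipped = min(max(ratio, 1.0 - e_clip), 1.0 + e_clip)
    return min(ratio * adv, clipped * adv)
```

With e_clip=0.2, policy-ratio deviations beyond [0.8, 1.2] stop contributing gradient in the favorable direction, which is the mechanism actor_clip_frac_threshold monitors: when too large a fraction of samples is clipped, the actor update is skipped entirely.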