SeparateActorCriticLearner
SeparateActorCriticLearner uses two distinct gradient boosted tree learners to represent an actor and a critic (value function). It is useful when the actor and critic need to be trained independently, with different tree configurations or update rules. It is a wrapper around MultiGBTLearner and supports training, prediction, saving/loading, and per-ensemble SHAP value computation.
- class gbrl.learners.actor_critic_learner.SeparateActorCriticLearner(input_dim: int, output_dim: int, tree_struct: Dict, policy_optimizer: Dict, value_optimizer: Dict, params: Dict = {}, verbose: int = 0, device: str = 'cpu')[source]
Bases:
MultiGBTLearner
Implements a separate actor-critic learner using two independent gradient boosted trees.
This class extends MultiGBTLearner by maintaining two separate models:
- One for policy learning (actor).
- One for value estimation (critic).
It provides separate step_actor and step_critic methods for updating the respective models.
- distil(obs: ndarray | Tensor, policy_targets: ndarray, value_targets: ndarray, params: Dict, verbose: int) Tuple[List[float], List[Dict]] [source]
Distills the trained actor and critic models into student models.
- Parameters:
obs (NumericalData) – Input observations.
policy_targets (np.ndarray) – Target values for the policy (actor).
value_targets (np.ndarray) – Target values for the value function (critic).
params (Dict) – Distillation parameters.
verbose (int) – Verbosity level.
- Returns:
The final loss values and updated parameters for distillation.
- Return type:
Tuple[List[float], List[Dict]]
- predict(obs: ndarray | Tensor, requires_grad: bool = True, start_idx: int = 0, stop_idx: int = None, tensor: bool = True) Tuple[ndarray, ndarray] [source]
Predicts both the policy and value outputs for the given observations.
- Parameters:
obs (NumericalData) – Input observations.
requires_grad (bool, optional) – Whether to compute gradients. Defaults to True.
start_idx (int, optional) – Start index for prediction. Defaults to 0.
stop_idx (int, optional) – Stop index for prediction. Defaults to None.
tensor (bool, optional) – Whether to return a tensor. Defaults to True.
- Returns:
Predicted policy outputs and value function outputs.
- Return type:
Tuple[np.ndarray, np.ndarray]
- predict_critic(obs: ndarray | Tensor, requires_grad: bool = True, start_idx: int = 0, stop_idx: int = None, tensor: bool = True) ndarray | Tensor [source]
Predicts the value function (critic) output for the given observations.
- Parameters:
obs (NumericalData) – Input observations.
requires_grad (bool, optional) – Whether to compute gradients. Defaults to True.
start_idx (int, optional) – Start index for prediction. Defaults to 0.
stop_idx (int, optional) – Stop index for prediction. Defaults to None.
tensor (bool, optional) – Whether to return a tensor. Defaults to True.
- Returns:
Predicted value function outputs.
- Return type:
NumericalData
- predict_policy(obs: ndarray | Tensor, requires_grad: bool = True, start_idx: int = 0, stop_idx: int = None, tensor: bool = True) ndarray | Tensor [source]
Predicts the policy (actor) output for the given observations.
- Parameters:
obs (NumericalData) – Input observations.
requires_grad (bool, optional) – Whether to compute gradients. Defaults to True.
start_idx (int, optional) – Start index for prediction. Defaults to 0.
stop_idx (int, optional) – Stop index for prediction. Defaults to None.
tensor (bool, optional) – Whether to return a tensor. Defaults to True.
- Returns:
Predicted policy outputs.
- Return type:
NumericalData
- step(obs: ndarray | Tensor, theta_grad: ndarray | Tensor, value_grad: ndarray | Tensor, model_idx: int | None = None) None [source]
Performs a single gradient update step on both the policy and value models.
- Parameters:
obs (NumericalData) – Input observations.
theta_grad (NumericalData) – Gradient update for the policy (actor).
value_grad (NumericalData) – Gradient update for the value function (critic).
model_idx (Optional[int], optional) – Index of the model to update. If None, updates both models. Defaults to None.