viberl.agents.ppo

PPO: Proximal Policy Optimization for stable policy gradient updates.

Algorithm Overview:

PPO is a policy gradient method that prevents large policy updates through a clipped surrogate objective, making training more stable and reliable while maintaining sample efficiency.

Key Concepts:

  • Clipped Surrogate Objective: Prevents destructive policy updates
  • Generalized Advantage Estimation (GAE): Computes stable advantage estimates
  • Multiple PPO Epochs: Reuses collected data efficiently
  • Policy Network: \(\pi_\theta(a|s)\) for action selection
  • Value Network: \(V_\phi(s)\) for baseline estimation

Mathematical Foundation:

Optimization Objective:

\[L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\]
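
The objective maps directly onto a few tensor operations. A minimal sketch of the loss term in PyTorch (the function name clipped_surrogate is illustrative; it mirrors the computation inside PPOAgent.learn shown further below):

import torch

def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    # Negated because optimizers minimize; taking the min keeps the update conservative
    return -torch.min(surr1, surr2).mean()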

Advantage Function:

\[A_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \dots, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\]
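
In practice the discounted sum is evaluated backwards with the recursion \(A_t = \delta_t + \gamma\lambda A_{t+1}\). A standalone NumPy sketch of that recursion (illustrative; the agent's internal _compute_gae helper may differ in details):

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # Backward pass: accumulate TD errors, resetting at episode boundaries
    advantages = np.zeros(len(rewards), dtype=np.float32)
    next_value, next_advantage = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])                 # stop bootstrapping when done
        delta = rewards[t] + gamma * next_value * mask - values[t]
        next_advantage = delta + gamma * lam * next_advantage * mask
        advantages[t] = next_advantage
        next_value = values[t]
    returns = advantages + np.asarray(values, dtype=np.float32)  # value targets
    return advantages, returns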

Reference: Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.

Classes:

PPOAgent: PPO agent implementation with clipped surrogate objective and GAE.

PPOAgent

PPOAgent(
    state_size: int,
    action_size: int,
    learning_rate: float = 0.0003,
    gamma: float = 0.99,
    lam: float = 0.95,
    clip_epsilon: float = 0.2,
    value_loss_coef: float = 0.5,
    entropy_coef: float = 0.01,
    max_grad_norm: float = 0.5,
    ppo_epochs: int = 4,
    batch_size: int = 64,
    hidden_size: int = 128,
    num_hidden_layers: int = 2,
    device: str = 'auto',
)

Bases: Agent

PPO agent implementation with clipped surrogate objective and GAE.

This agent implements Proximal Policy Optimization using a clipped surrogate objective to prevent large policy updates, along with Generalized Advantage Estimation for stable advantage computation.

Parameters:

state_size (int, required): Dimension of the state space. Must be positive.
action_size (int, required): Number of possible actions. Must be positive.
learning_rate (float, default 0.0003): Learning rate for the Adam optimizer. Must be positive.
gamma (float, default 0.99): Discount factor for future rewards. Should be in (0, 1].
lam (float, default 0.95): GAE lambda parameter for advantage computation. Should be in [0, 1].
clip_epsilon (float, default 0.2): PPO clipping parameter. Should be positive.
value_loss_coef (float, default 0.5): Coefficient for value loss. Should be positive.
entropy_coef (float, default 0.01): Coefficient for entropy bonus. Should be positive.
max_grad_norm (float, default 0.5): Maximum gradient norm for clipping. Should be positive.
ppo_epochs (int, default 4): Number of PPO epochs per update. Must be positive.
batch_size (int, default 64): Batch size for training. Must be positive.
hidden_size (int, default 128): Number of neurons in each hidden layer. Must be positive.
num_hidden_layers (int, default 2): Number of hidden layers. Must be non-negative.
device (str, default 'auto'): Device for computation ('auto', 'cpu', or 'cuda').

Raises:

ValueError: If any parameter is invalid.
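
A minimal construction sketch, assuming a CartPole-like task with a 4-dimensional observation and 2 discrete actions (the environment dimensions are illustrative):

from viberl.agents.ppo import PPOAgent

agent = PPOAgent(
    state_size=4,      # observation dimension
    action_size=2,     # number of discrete actions
    learning_rate=3e-4,
    clip_epsilon=0.2,
    device='auto',     # picks CUDA when available, otherwise CPU
)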

Methods:

act: Select action using policy \(\pi(a|s;\theta)\).
learn: Update policy and value networks using PPO clipped objective.
save: Save the agent's neural network parameters to a file.
load: Load the agent's neural network parameters from a file.

Attributes:

gamma, lam, clip_epsilon, value_loss_coef, entropy_coef, max_grad_norm,
ppo_epochs, batch_size, device, policy_network, value_network, optimizer

Source code in viberl/agents/ppo.py
def __init__(
    self,
    state_size: int,
    action_size: int,
    learning_rate: float = 3e-4,
    gamma: float = 0.99,
    lam: float = 0.95,
    clip_epsilon: float = 0.2,
    value_loss_coef: float = 0.5,
    entropy_coef: float = 0.01,
    max_grad_norm: float = 0.5,
    ppo_epochs: int = 4,
    batch_size: int = 64,
    hidden_size: int = 128,
    num_hidden_layers: int = 2,
    device: str = 'auto',
):
    super().__init__(state_size, action_size)
    self.gamma = gamma
    self.lam = lam
    self.clip_epsilon = clip_epsilon
    self.value_loss_coef = value_loss_coef
    self.entropy_coef = entropy_coef
    self.max_grad_norm = max_grad_norm
    self.ppo_epochs = ppo_epochs
    self.batch_size = batch_size

    # Set device
    if device == 'auto':
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    else:
        self.device = torch.device(device)

    # Initialize networks
    self.policy_network = PolicyNetwork(
        state_size=state_size,
        action_size=action_size,
        hidden_size=hidden_size,
        num_hidden_layers=num_hidden_layers,
    ).to(self.device)

    self.value_network = VNetwork(
        state_size=state_size,
        hidden_size=hidden_size,
        num_hidden_layers=num_hidden_layers,
    ).to(self.device)

    # Initialize optimizer
    self.optimizer = torch.optim.Adam(
        list(self.policy_network.parameters()) + list(self.value_network.parameters()),
        lr=learning_rate,
    )

gamma instance-attribute

gamma = gamma

lam instance-attribute

lam = lam

clip_epsilon instance-attribute

clip_epsilon = clip_epsilon

value_loss_coef instance-attribute

value_loss_coef = value_loss_coef

entropy_coef instance-attribute

entropy_coef = entropy_coef

max_grad_norm instance-attribute

max_grad_norm = max_grad_norm

ppo_epochs instance-attribute

ppo_epochs = ppo_epochs

batch_size instance-attribute

batch_size = batch_size

device instance-attribute

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

policy_network instance-attribute

policy_network = PolicyNetwork(state_size=state_size, action_size=action_size, hidden_size=hidden_size, num_hidden_layers=num_hidden_layers).to(device)

value_network instance-attribute

value_network = VNetwork(state_size=state_size, hidden_size=hidden_size, num_hidden_layers=num_hidden_layers).to(device)

optimizer instance-attribute

optimizer = torch.optim.Adam(list(policy_network.parameters()) + list(value_network.parameters()), lr=learning_rate)

act

act(state: ndarray, training: bool = True) -> Action

Select action using policy \(\pi(a|s;\theta)\).

Parameters:

state (ndarray, required): Current state observation.
training (bool, default True): Whether in training mode (affects exploration).

Returns:

Action: An Action containing the selected action.
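
Usage sketch, reusing the agent constructed above (the observation is a placeholder array):

import numpy as np

state = np.zeros(4, dtype=np.float32)       # placeholder observation
step = agent.act(state, training=True)      # samples from the policy, logprobs attached
print(step.action)                          # integer action index
greedy = agent.act(state, training=False)   # argmax action for evaluation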

Source code in viberl/agents/ppo.py
def act(self, state: np.ndarray, training: bool = True) -> Action:
    r"""Select action using policy $\pi(a|s;\theta)$.

    Args:
        state: Current state observation.
        training: Whether in training mode (affects exploration).

    Returns:
        Action containing the selected action.
    """
    state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)

    with torch.no_grad():
        action_probs = self.policy_network(state_tensor)
        dist = Categorical(action_probs)

        if training:
            # Training mode: sample from policy distribution
            action = dist.sample()
            log_prob = dist.log_prob(action)
            return Action(action=action.item(), logprobs=log_prob)
        else:
            # Evaluation mode: select most likely action (greedy)
            action = action_probs.argmax().item()
            return Action(action=action)

learn

learn(trajectories: list[Trajectory]) -> dict[str, float]

Update policy and value networks using PPO clipped objective.

Parameters:

trajectories (list[Trajectory], required): List of trajectories to learn from.

Returns:

dict[str, float]: Dictionary containing policy loss, value loss, and total loss.
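
A hedged end-to-end sketch of one update. Real code should build viberl's own Trajectory/Transition objects; the SimpleNamespace stand-ins below exist only to show which fields learn reads (transitions, state, action, reward, done), and the rewards and termination flags are dummy values:

from types import SimpleNamespace
import numpy as np

# Stand-ins for viberl's trajectory containers (hypothetical: prefer the real classes).
transitions = []
state = np.zeros(4, dtype=np.float32)             # placeholder observations
for t in range(16):
    step = agent.act(state, training=True)        # Action with log-probability attached
    reward, done = 1.0, (t == 15)                 # dummy environment feedback
    transitions.append(SimpleNamespace(state=state, action=step, reward=reward, done=done))

metrics = agent.learn([SimpleNamespace(transitions=transitions)])
print({k: round(v, 4) for k, v in metrics.items()})   # ppo/policy_loss, ppo/value_loss, ...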

Source code in viberl/agents/ppo.py
def learn(self, trajectories: list[Trajectory]) -> dict[str, float]:
    """Update policy and value networks using PPO clipped objective.

    Args:
        trajectories: List of trajectories to learn from

    Returns:
        Dictionary containing policy loss, value loss, and total loss.
    """
    if not trajectories:
        return {}

    # Collect all data from all trajectories
    all_states = []
    all_actions = []
    all_rewards = []
    all_log_probs = []
    all_dones = []
    all_values = []

    for trajectory in trajectories:
        if not trajectory.transitions:
            continue

        # Extract data from trajectory
        states = [t.state for t in trajectory.transitions]
        actions = [t.action.action for t in trajectory.transitions]
        rewards = [t.reward for t in trajectory.transitions]
        log_probs = [
            t.action.logprobs.item() if t.action.logprobs is not None else 0.0
            for t in trajectory.transitions
        ]
        dones = [t.done for t in trajectory.transitions]

        # Compute values for each state
        values = []
        for state in states:
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            with torch.no_grad():
                value = self.value_network(state_tensor).squeeze(-1).item()
                values.append(value)

        all_states.extend(states)
        all_actions.extend(actions)
        all_rewards.extend(rewards)
        all_log_probs.extend(log_probs)
        all_dones.extend(dones)
        all_values.extend(values)

    if not all_states:
        return {}

    # Convert to tensors
    states_tensor = torch.FloatTensor(np.array(all_states)).to(self.device)
    actions_tensor = torch.LongTensor(all_actions).to(self.device)
    old_log_probs_tensor = torch.FloatTensor(all_log_probs).to(self.device)

    # Compute advantages and returns
    advantages, returns = self._compute_gae(all_rewards, all_values, all_dones)
    advantages_tensor = torch.FloatTensor(advantages).to(self.device)
    returns_tensor = torch.FloatTensor(returns).to(self.device)

    # Normalize advantages (handle small sample sizes)
    if len(advantages_tensor) > 1:
        advantages_tensor = (advantages_tensor - advantages_tensor.mean()) / (
            advantages_tensor.std() + 1e-8
        )
    else:
        advantages_tensor = advantages_tensor - advantages_tensor.mean()

    # Create dataset
    dataset_size = len(all_states)
    indices = np.arange(dataset_size)

    metrics = {
        'ppo/policy_loss': 0.0,
        'ppo/value_loss': 0.0,
        'ppo/entropy_loss': 0.0,
        'ppo/total_loss': 0.0,
        'ppo/batch_size': len(trajectories),
    }

    # PPO epochs
    for _epoch in range(self.ppo_epochs):
        np.random.shuffle(indices)

        for start in range(0, dataset_size, self.batch_size):
            end = start + self.batch_size
            batch_indices = indices[start:end]

            batch_states = states_tensor[batch_indices]
            batch_actions = actions_tensor[batch_indices]
            batch_old_log_probs = old_log_probs_tensor[batch_indices]
            batch_advantages = advantages_tensor[batch_indices]
            batch_returns = returns_tensor[batch_indices]

            # Forward pass
            action_probs = self.policy_network(batch_states)
            # Ensure action_probs are valid probabilities
            action_probs = torch.clamp(action_probs, 1e-8, 1 - 1e-8)
            action_probs = action_probs / action_probs.sum(dim=1, keepdim=True)

            values = self.value_network(batch_states).squeeze(-1)

            dist = Categorical(action_probs)
            new_log_probs = dist.log_prob(batch_actions)
            entropy = dist.entropy()

            # Compute ratio for PPO
            ratio = torch.exp(new_log_probs - batch_old_log_probs)

            # Clipped surrogate objective
            surr1 = ratio * batch_advantages
            surr2 = (
                torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon)
                * batch_advantages
            )
            policy_loss = -torch.min(surr1, surr2).mean()

            # Value loss
            value_loss = nn.MSELoss()(values.squeeze(), batch_returns.squeeze())

            # Entropy loss
            entropy_loss = -entropy.mean()

            # Total loss
            total_loss = (
                policy_loss
                + self.value_loss_coef * value_loss
                + self.entropy_coef * entropy_loss
            )

            # Update networks
            self.optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(
                list(self.policy_network.parameters()) + list(self.value_network.parameters()),
                self.max_grad_norm,
            )
            self.optimizer.step()

            # Accumulate metrics
            metrics['ppo/policy_loss'] += policy_loss.item()
            metrics['ppo/value_loss'] += value_loss.item()
            metrics['ppo/entropy_loss'] += entropy_loss.item()
            metrics['ppo/total_loss'] += total_loss.item()

    # Average metrics over all batches and epochs
    num_batches = (dataset_size + self.batch_size - 1) // self.batch_size
    for key in ['ppo/policy_loss', 'ppo/value_loss', 'ppo/entropy_loss', 'ppo/total_loss']:
        metrics[key] /= num_batches * self.ppo_epochs

    return metrics

save

save(filepath: str) -> None

Save the agent's neural network parameters to a file.

Parameters:

filepath (str, required): Path where to save the model.

Source code in viberl/agents/ppo.py
def save(self, filepath: str) -> None:
    """Save the agent's neural network parameters to a file.

    Args:
        filepath: Path where to save the model
    """
    torch.save(
        {
            'policy_network': self.policy_network.state_dict(),
            'value_network': self.value_network.state_dict(),
        },
        filepath,
    )

load

load(filepath: str) -> None

Load the agent's neural network parameters from a file.

Parameters:

filepath (str, required): Path from which to load the model.

Source code in viberl/agents/ppo.py
def load(self, filepath: str) -> None:
    """Load the agent's neural network parameters from a file.

    Args:
        filepath: Path from which to load the model
    """
    checkpoint = torch.load(filepath, map_location='cpu')
    self.policy_network.load_state_dict(checkpoint['policy_network'])
    self.value_network.load_state_dict(checkpoint['value_network'])
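
Checkpointing sketch; the filename is illustrative. Only the network state dicts are stored, so the restoring agent must be built with matching state_size, action_size, hidden_size, and num_hidden_layers:

agent.save('ppo_agent.pt')                         # writes policy and value state dicts

restored = PPOAgent(state_size=4, action_size=2)   # same architecture as the saved agent
restored.load('ppo_agent.pt')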