viberl.agents.ppo

PPO: Proximal Policy Optimization for stable policy gradient updates.

Algorithm Overview:

PPO is a policy gradient method that prevents large policy updates through a clipped surrogate objective, making training more stable and reliable while maintaining sample efficiency.

Key Concepts:

  • Clipped Surrogate Objective: Prevents destructive policy updates
  • Generalized Advantage Estimation (GAE): Computes stable advantage estimates
  • Multiple PPO Epochs: Reuses collected data efficiently
  • Policy Network: \(\pi_\theta(a|s)\) for action selection
  • Value Network: \(V_\phi(s)\) for baseline estimation

Mathematical Foundation:

Optimization Objective:

\[L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\]
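
The objective maps directly onto a few tensor operations. A minimal sketch of the loss term in PyTorch (the function name clipped_surrogate is illustrative; it mirrors the computation inside PPOAgent.learn shown further below):

import torch

def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    # Negated because optimizers minimize; taking the min keeps the update conservative
    return -torch.min(surr1, surr2).mean()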

Advantage Function:

\[A_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \dots, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\]
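
In practice the discounted sum is evaluated backwards with the recursion \(A_t = \delta_t + \gamma\lambda A_{t+1}\). A standalone NumPy sketch of that recursion (illustrative; the agent's internal _compute_gae helper may differ in details):

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # Backward pass: accumulate TD errors, resetting at episode boundaries
    advantages = np.zeros(len(rewards), dtype=np.float32)
    next_value, next_advantage = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])                 # stop bootstrapping when done
        delta = rewards[t] + gamma * next_value * mask - values[t]
        next_advantage = delta + gamma * lam * next_advantage * mask
        advantages[t] = next_advantage
        next_value = values[t]
    returns = advantages + np.asarray(values, dtype=np.float32)  # value targets
    return advantages, returns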

Reference: Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.

Classes:

PPOAgent: PPO agent implementation with clipped surrogate objective and GAE.

PPOAgent

PPOAgent(
    state_size: int,
    action_size: int,
    learning_rate: float = 0.0003,
    gamma: float = 0.99,
    lam: float = 0.95,
    clip_epsilon: float = 0.2,
    value_loss_coef: float = 0.5,
    entropy_coef: float = 0.01,
    max_grad_norm: float = 0.5,
    ppo_epochs: int = 4,
    batch_size: int = 64,
    hidden_size: int = 128,
    num_hidden_layers: int = 2,
    device: str = 'auto',
)

Bases: Agent

PPO agent implementation with clipped surrogate objective and GAE.

This agent implements Proximal Policy Optimization using a clipped surrogate objective to prevent large policy updates, along with Generalized Advantage Estimation for stable advantage computation.

Parameters:

state_size (int, required): Dimension of the state space. Must be positive.
action_size (int, required): Number of possible actions. Must be positive.
learning_rate (float, default 0.0003): Learning rate for the Adam optimizer. Must be positive.
gamma (float, default 0.99): Discount factor for future rewards. Should be in (0, 1].
lam (float, default 0.95): GAE lambda parameter for advantage computation. Should be in [0, 1].
clip_epsilon (float, default 0.2): PPO clipping parameter. Should be positive.
value_loss_coef (float, default 0.5): Coefficient for value loss. Should be positive.
entropy_coef (float, default 0.01): Coefficient for entropy bonus. Should be positive.
max_grad_norm (float, default 0.5): Maximum gradient norm for clipping. Should be positive.
ppo_epochs (int, default 4): Number of PPO epochs per update. Must be positive.
batch_size (int, default 64): Batch size for training. Must be positive.
hidden_size (int, default 128): Number of neurons in each hidden layer. Must be positive.
num_hidden_layers (int, default 2): Number of hidden layers. Must be non-negative.
device (str, default 'auto'): Device for computation ('auto', 'cpu', or 'cuda').

Raises:

ValueError: If any parameter is invalid.
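
A minimal construction sketch, assuming a CartPole-like task with a 4-dimensional observation and 2 discrete actions (the environment dimensions are illustrative):

from viberl.agents.ppo import PPOAgent

agent = PPOAgent(
    state_size=4,      # observation dimension
    action_size=2,     # number of discrete actions
    learning_rate=3e-4,
    clip_epsilon=0.2,
    device='auto',     # picks CUDA when available, otherwise CPU
)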

Methods:

act: Select action using policy \(\pi(a|s;\theta)\).
learn: Update policy and value networks using PPO clipped objective.
save: Save the agent's neural network parameters to a file.
load: Load the agent's neural network parameters from a file.

Attributes:

gamma, lam, clip_epsilon, value_loss_coef, entropy_coef, max_grad_norm,
ppo_epochs, batch_size, device, policy_network, value_network, optimizer

Source code in viberl/agents/ppo.py
def __init__(
    self,
    state_size: int,
    action_size: int,
    learning_rate: float = 3e-4,
    gamma: float = 0.99,
    lam: float = 0.95,
    clip_epsilon: float = 0.2,
    value_loss_coef: float = 0.5,
    entropy_coef: float = 0.01,
    max_grad_norm: float = 0.5,
    ppo_epochs: int = 4,
    batch_size: int = 64,
    hidden_size: int = 128,
    num_hidden_layers: int = 2,
    device: str = 'auto',
):
    super().__init__(state_size, action_size)
    self.gamma = gamma
    self.lam = lam
    self.clip_epsilon = clip_epsilon
    self.value_loss_coef = value_loss_coef
    self.entropy_coef = entropy_coef
    self.max_grad_norm = max_grad_norm
    self.ppo_epochs = ppo_epochs
    self.batch_size = batch_size

    # Set device
    if device == 'auto':
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    else:
        self.device = torch.device(device)

    # Initialize networks
    self.policy_network = PolicyNetwork(
        state_size=state_size,
        action_size=action_size,
        hidden_size=hidden_size,
        num_hidden_layers=num_hidden_layers,
    ).to(self.device)

    self.value_network = VNetwork(
        state_size=state_size,
        hidden_size=hidden_size,
        num_hidden_layers=num_hidden_layers,
    ).to(self.device)

    # Initialize optimizer
    self.optimizer = torch.optim.Adam(
        list(self.policy_network.parameters()) + list(self.value_network.parameters()),
        lr=learning_rate,
    )

gamma instance-attribute

gamma = gamma

lam instance-attribute

lam = lam

clip_epsilon instance-attribute

clip_epsilon = clip_epsilon

value_loss_coef instance-attribute

value_loss_coef = value_loss_coef

entropy_coef instance-attribute

entropy_coef = entropy_coef

max_grad_norm instance-attribute

max_grad_norm = max_grad_norm

ppo_epochs instance-attribute

ppo_epochs = ppo_epochs

batch_size instance-attribute

batch_size = batch_size

device instance-attribute

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

policy_network instance-attribute

policy_network = PolicyNetwork(state_size=state_size, action_size=action_size, hidden_size=hidden_size, num_hidden_layers=num_hidden_layers).to(device)

value_network instance-attribute

value_network = VNetwork(state_size=state_size, hidden_size=hidden_size, num_hidden_layers=num_hidden_layers).to(device)

optimizer instance-attribute

optimizer = torch.optim.Adam(list(policy_network.parameters()) + list(value_network.parameters()), lr=learning_rate)

act

act(state: ndarray, training: bool = True) -> Action

Select action using policy \(\pi(a|s;\theta)\).

Parameters:

state (ndarray, required): Current state observation.
training (bool, default True): Whether in training mode (affects exploration).

Returns:

Action: An Action containing the selected action.
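
Usage sketch, reusing the agent constructed above (the observation is a placeholder array):

import numpy as np

state = np.zeros(4, dtype=np.float32)       # placeholder observation
step = agent.act(state, training=True)      # samples from the policy, logprobs attached
print(step.action)                          # integer action index
greedy = agent.act(state, training=False)   # argmax action for evaluation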

Source code in viberl/agents/ppo.py
def act(self, state: np.ndarray, training: bool = True) -> Action:
    r"""Select action using policy $\pi(a|s;\theta)$.

    Args:
        state: Current state observation.
        training: Whether in training mode (affects exploration).

    Returns:
        Action containing the selected action.
    """
    state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)

    with torch.no_grad():
        action_probs = self.policy_network(state_tensor)
        dist = Categorical(action_probs)

        if training:
            # Training mode: sample from policy distribution
            action = dist.sample()
            log_prob = dist.log_prob(action)
            return Action(action=action.item(), logprobs=log_prob)
        else:
            # Evaluation mode: select most likely action (greedy)
            action = action_probs.argmax().item()
            return Action(action=action)

learn

learn(trajectories: list[Trajectory]) -> dict[str, float]

Update policy and value networks using PPO clipped objective.

Parameters:

trajectories (list[Trajectory], required): List of trajectories to learn from.

Returns:

dict[str, float]: Dictionary containing policy loss, value loss, and total loss.
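
A hedged end-to-end sketch of one update. Real code should build viberl's own Trajectory/Transition objects; the SimpleNamespace stand-ins below exist only to show which fields learn reads (transitions, state, action, reward, done), and the rewards and termination flags are dummy values:

from types import SimpleNamespace
import numpy as np

# Stand-ins for viberl's trajectory containers (hypothetical: prefer the real classes).
transitions = []
state = np.zeros(4, dtype=np.float32)             # placeholder observations
for t in range(16):
    step = agent.act(state, training=True)        # Action with log-probability attached
    reward, done = 1.0, (t == 15)                 # dummy environment feedback
    transitions.append(SimpleNamespace(state=state, action=step, reward=reward, done=done))

metrics = agent.learn([SimpleNamespace(transitions=transitions)])
print({k: round(v, 4) for k, v in metrics.items()})   # ppo/policy_loss, ppo/value_loss, ...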

Source code in viberl/agents/ppo.py
def learn(self, trajectories: list[Trajectory]) -> dict[str, float]:
    """Update policy and value networks using PPO clipped objective.

    Args:
        trajectories: List of trajectories to learn from

    Returns:
        Dictionary containing policy loss, value loss, and total loss.
    """
    if not trajectories:
        return {}

    # Collect all data from all trajectories
    all_states = []
    all_actions = []
    all_rewards = []
    all_log_probs = []
    all_dones = []
    all_values = []

    for trajectory in trajectories:
        if not trajectory.transitions:
            continue

        # Extract data from trajectory
        states = [t.state for t in trajectory.transitions]
        actions = [t.action.action for t in trajectory.transitions]
        rewards = [t.reward for t in trajectory.transitions]
        log_probs = [
            t.action.logprobs.item() if t.action.logprobs is not None else 0.0
            for t in trajectory.transitions
        ]
        dones = [t.done for t in trajectory.transitions]

        # Compute values for each state
        values = []
        for state in states:
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            with torch.no_grad():
                value = self.value_network(state_tensor).squeeze(-1).item()
                values.append(value)

        all_states.extend(states)
        all_actions.extend(actions)
        all_rewards.extend(rewards)
        all_log_probs.extend(log_probs)
        all_dones.extend(dones)
        all_values.extend(values)

    if not all_states:
        return {}

    # Convert to tensors
    states_tensor = torch.FloatTensor(np.array(all_states)).to(self.device)
    actions_tensor = torch.LongTensor(all_actions).to(self.device)
    old_log_probs_tensor = torch.FloatTensor(all_log_probs).to(self.device)

    # Compute advantages and returns
    advantages, returns = self._compute_gae(all_rewards, all_values, all_dones)
    advantages_tensor = torch.FloatTensor(advantages).to(self.device)
    returns_tensor = torch.FloatTensor(returns).to(self.device)

    # Normalize advantages (handle small sample sizes)
    if len(advantages_tensor) > 1:
        advantages_tensor = (advantages_tensor - advantages_tensor.mean()) / (
            advantages_tensor.std() + 1e-8
        )
    else:
        advantages_tensor = advantages_tensor - advantages_tensor.mean()

    # Create dataset
    dataset_size = len(all_states)
    indices = np.arange(dataset_size)

    metrics = {
        'ppo/policy_loss': 0.0,
        'ppo/value_loss': 0.0,
        'ppo/entropy_loss': 0.0,
        'ppo/total_loss': 0.0,
        'ppo/batch_size': len(trajectories),
    }

    # PPO epochs
    for _epoch in range(self.ppo_epochs):
        np.random.shuffle(indices)

        for start in range(0, dataset_size, self.batch_size):
            end = start + self.batch_size
            batch_indices = indices[start:end]

            batch_states = states_tensor[batch_indices]
            batch_actions = actions_tensor[batch_indices]
            batch_old_log_probs = old_log_probs_tensor[batch_indices]
            batch_advantages = advantages_tensor[batch_indices]
            batch_returns = returns_tensor[batch_indices]

            # Forward pass
            action_probs = self.policy_network(batch_states)
            # Ensure action_probs are valid probabilities
            action_probs = torch.clamp(action_probs, 1e-8, 1 - 1e-8)
            action_probs = action_probs / action_probs.sum(dim=1, keepdim=True)

            values = self.value_network(batch_states).squeeze(-1)

            dist = Categorical(action_probs)
            new_log_probs = dist.log_prob(batch_actions)
            entropy = dist.entropy()

            # Compute ratio for PPO
            ratio = torch.exp(new_log_probs - batch_old_log_probs)

            # Clipped surrogate objective
            surr1 = ratio * batch_advantages
            surr2 = (
                torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon)
                * batch_advantages
            )
            policy_loss = -torch.min(surr1, surr2).mean()

            # Value loss
            value_loss = nn.MSELoss()(values.squeeze(), batch_returns.squeeze())

            # Entropy loss
            entropy_loss = -entropy.mean()

            # Total loss
            total_loss = (
                policy_loss
                + self.value_loss_coef * value_loss
                + self.entropy_coef * entropy_loss
            )

            # Update networks
            self.optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(
                list(self.policy_network.parameters()) + list(self.value_network.parameters()),
                self.max_grad_norm,
            )
            self.optimizer.step()

            # Accumulate metrics
            metrics['ppo/policy_loss'] += policy_loss.item()
            metrics['ppo/value_loss'] += value_loss.item()
            metrics['ppo/entropy_loss'] += entropy_loss.item()
            metrics['ppo/total_loss'] += total_loss.item()

    # Average metrics over all batches and epochs
    num_batches = (dataset_size + self.batch_size - 1) // self.batch_size
    for key in ['ppo/policy_loss', 'ppo/value_loss', 'ppo/entropy_loss', 'ppo/total_loss']:
        metrics[key] /= num_batches * self.ppo_epochs

    return metrics

save

save(filepath: str) -> None

Save the agent's neural network parameters to a file.

Parameters:

filepath (str, required): Path where to save the model.

Source code in viberl/agents/ppo.py
def save(self, filepath: str) -> None:
    """Save the agent's neural network parameters to a file.

    Args:
        filepath: Path where to save the model
    """
    torch.save(
        {
            'policy_network': self.policy_network.state_dict(),
            'value_network': self.value_network.state_dict(),
        },
        filepath,
    )

load

load(filepath: str) -> None

Load the agent's neural network parameters from a file.

Parameters:

filepath (str, required): Path from which to load the model.

Source code in viberl/agents/ppo.py
def load(self, filepath: str) -> None:
    """Load the agent's neural network parameters from a file.

    Args:
        filepath: Path from which to load the model
    """
    checkpoint = torch.load(filepath, map_location='cpu')
    self.policy_network.load_state_dict(checkpoint['policy_network'])
    self.value_network.load_state_dict(checkpoint['value_network'])
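
Checkpointing sketch; the filename is illustrative. Only the network state dicts are stored, so the restoring agent must be built with matching state_size, action_size, hidden_size, and num_hidden_layers:

agent.save('ppo_agent.pt')                         # writes policy and value state dicts

restored = PPOAgent(state_size=4, action_size=2)   # same architecture as the saved agent
restored.load('ppo_agent.pt')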