viberl.agents.dqn

DQN: Deep Q-Network combining Q-learning with deep neural networks for human-level control.

Algorithm Overview:

DQN combines traditional Q-learning with deep neural networks to learn optimal action-value functions in environments with high-dimensional state spaces. Applying deep function approximation to reinforcement learning is unstable because consecutive samples are strongly correlated and the bootstrap targets shift as the network changes; DQN addresses both problems with experience replay and a separately updated target network.

Key Concepts:

  • Deep Q-Learning: Uses neural networks to approximate Q-values \(Q(s,a;\theta)\)
  • Experience Replay: Stores and samples experiences to break correlation between samples
  • Target Network: Separate frozen network provides stable target Q-values
  • Epsilon-Greedy: Balances exploration and exploitation during training (see the decay sketch after this list)
  • Temporal Difference: Uses TD error for Q-value updates
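
To make the exploration schedule concrete, the snippet below is a minimal sketch of how epsilon decays over training. It reproduces the per-update rule applied in learn() (epsilon = max(epsilon_end, epsilon * epsilon_decay)) with the documented default values; the loop and printed milestones are purely illustrative.

# Illustrative sketch of the epsilon schedule: one decay step per learn() call.
epsilon_start, epsilon_end, epsilon_decay = 1.0, 0.01, 0.995  # documented defaults

epsilon = epsilon_start
for update in range(1, 1001):
    epsilon = max(epsilon_end, epsilon * epsilon_decay)
    if update in (100, 500, 1000):
        print(f"after {update} updates: epsilon ~= {epsilon:.3f}")
# after 100 updates: epsilon ~= 0.606
# after 500 updates: epsilon ~= 0.082
# after 1000 updates: epsilon ~= 0.010 (clamped at epsilon_end)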

Mathematical Foundation:

Optimization Objective:

\[L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q_{\text{target}}(s',a') - Q_\theta(s,a)\right)^2\right]\]

Bellman Optimality Equation:

\[Q^*(s,a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s',a') \mid s, a\right]\]
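
As a concrete reading of the optimization objective, the sketch below computes the TD target and the squared-error loss for one sampled batch. It mirrors the update performed in learn() further down this page; the function name and tensor arguments are illustrative placeholders standing in for \(Q_\theta\), \(Q_{\text{target}}\), and a replay batch.

import torch
import torch.nn as nn

def td_loss(q_network, target_network, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q_theta(s, a): value of the action actually taken in each sampled transition
    q_sa = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target r + gamma * max_a' Q_target(s', a'); no bootstrap on terminal states
    with torch.no_grad():
        max_next_q = target_network(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (~dones)

    # Mean squared TD error, i.e. the loss L(theta) above
    return nn.MSELoss()(q_sa, target)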

Reference: Mnih, V., Kavukcuoglu, K., Silver, D., et al. Human-level control through deep reinforcement learning. Nature 518, 529-533 (2015).

Classes:

  • DQNAgent: DQN agent implementation with deep Q-learning and experience replay.

DQNAgent

DQNAgent(
    state_size: int,
    action_size: int,
    learning_rate: float = 0.001,
    gamma: float = 0.99,
    epsilon_start: float = 1.0,
    epsilon_end: float = 0.01,
    epsilon_decay: float = 0.995,
    memory_size: int = 10000,
    batch_size: int = 64,
    target_update: int = 10,
    hidden_size: int = 128,
    num_hidden_layers: int = 2,
)

Bases: Agent

DQN agent implementation with deep Q-learning and experience replay.

This agent implements the Deep Q-Network algorithm using neural networks to approximate Q-values, with experience replay and target networks for stability.

Parameters:

  • state_size (int, required): Dimension of the state space. Must be positive.
  • action_size (int, required): Number of possible actions. Must be positive.
  • learning_rate (float, default 0.001): Learning rate for the Adam optimizer. Must be positive.
  • gamma (float, default 0.99): Discount factor for future rewards. Should be in (0, 1].
  • epsilon_start (float, default 1.0): Initial exploration rate. Should be in [0, 1].
  • epsilon_end (float, default 0.01): Final exploration rate. Should be in [0, 1].
  • epsilon_decay (float, default 0.995): Decay rate for exploration. Should be in (0, 1].
  • memory_size (int, default 10000): Size of the experience replay buffer. Must be positive.
  • batch_size (int, default 64): Batch size for training. Must be positive.
  • target_update (int, default 10): Frequency of target network updates. Must be positive.
  • hidden_size (int, default 128): Number of neurons in each hidden layer. Must be positive.
  • num_hidden_layers (int, default 2): Number of hidden layers in the Q-network. Must be non-negative.

Raises:

  • ValueError: If any parameter is invalid.
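
A minimal construction sketch. It assumes DQNAgent is importable from viberl.agents.dqn as documented on this page; the state and action sizes (4 and 2, e.g. a CartPole-like task) are illustrative, and the remaining keyword arguments simply restate the documented defaults.

from viberl.agents.dqn import DQNAgent

# Hypothetical dimensions for a small control task: a 4-dimensional
# observation and 2 discrete actions.
agent = DQNAgent(
    state_size=4,
    action_size=2,
    learning_rate=1e-3,
    gamma=0.99,
    batch_size=64,
    target_update=10,
)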

Methods:

  • act: Select action using epsilon-greedy policy.
  • learn: Update Q-network using Q-learning with experience replay.
  • save: Save the agent's neural network parameters to a file.
  • load: Load the agent's neural network parameters from a file.

Attributes:

  • learning_rate, gamma, epsilon_start, epsilon_end, epsilon_decay, memory_size, batch_size, target_update: stored constructor hyperparameters.
  • epsilon: current exploration rate (initialized to epsilon_start, decayed in learn).
  • q_network: online Q-network (QNetwork instance).
  • target_network: target Q-network used to compute stable TD targets.
  • optimizer: Adam optimizer over the online network's parameters.
  • memory: experience replay buffer (deque with maxlen=memory_size).

Source code in viberl/agents/dqn.py
def __init__(
    self,
    state_size: int,
    action_size: int,
    learning_rate: float = 1e-3,
    gamma: float = 0.99,
    epsilon_start: float = 1.0,
    epsilon_end: float = 0.01,
    epsilon_decay: float = 0.995,
    memory_size: int = 10000,
    batch_size: int = 64,
    target_update: int = 10,
    hidden_size: int = 128,
    num_hidden_layers: int = 2,
):
    super().__init__(state_size, action_size)
    self.learning_rate = learning_rate
    self.gamma = gamma
    self.epsilon_start = epsilon_start
    self.epsilon_end = epsilon_end
    self.epsilon_decay = epsilon_decay
    self.memory_size = memory_size
    self.batch_size = batch_size
    self.target_update = target_update
    self.epsilon = epsilon_start

    # Neural networks
    self.q_network = QNetwork(state_size, action_size, hidden_size, num_hidden_layers)
    self.target_network = QNetwork(state_size, action_size, hidden_size, num_hidden_layers)
    self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)

    # Copy weights to target network
    self._update_target_network()

    # Experience replay buffer
    self.memory = deque(maxlen=memory_size)

learning_rate instance-attribute

learning_rate = learning_rate

gamma instance-attribute

gamma = gamma

epsilon_start instance-attribute

epsilon_start = epsilon_start

epsilon_end instance-attribute

epsilon_end = epsilon_end

epsilon_decay instance-attribute

epsilon_decay = epsilon_decay

memory_size instance-attribute

memory_size = memory_size

batch_size instance-attribute

batch_size = batch_size

target_update instance-attribute

target_update = target_update

epsilon instance-attribute

epsilon = epsilon_start

q_network instance-attribute

q_network = QNetwork(state_size, action_size, hidden_size, num_hidden_layers)

target_network instance-attribute

target_network = QNetwork(state_size, action_size, hidden_size, num_hidden_layers)

optimizer instance-attribute

optimizer = Adam(parameters(), lr=learning_rate)

memory instance-attribute

memory = deque(maxlen=memory_size)

act

act(state: ndarray, training: bool = True) -> Action

Select action using epsilon-greedy policy.

Parameters:

  • state (ndarray, required): Current state observation.
  • training (bool, default True): Whether in training mode (affects exploration).

Returns:

  • Action: Action containing the selected action.

Source code in viberl/agents/dqn.py
def act(self, state: np.ndarray, training: bool = True) -> Action:
    """Select action using epsilon-greedy policy.

    Args:
        state: Current state observation.
        training: Whether in training mode (affects exploration).

    Returns:
        Action containing the selected action.
    """
    if training and random.random() < self.epsilon:
        action = random.randint(0, self.action_size - 1)
    else:
        with torch.no_grad():
            q_values = self.q_network.get_q_values(state)
            action = q_values.argmax().item()

    return Action(action=action)
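
A short usage sketch, assuming agent is the DQNAgent instance constructed above with a 4-dimensional state space; the observation is a placeholder.

import numpy as np

# Placeholder observation matching state_size=4.
state = np.zeros(4, dtype=np.float32)

# Training mode: epsilon-greedy, so the returned action may be random.
action = agent.act(state, training=True)

# Evaluation mode: always greedy with respect to the current Q-network.
greedy = agent.act(state, training=False)
print(action.action, greedy.action)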

learn

learn(trajectories: list[Trajectory]) -> dict[str, float]

Update Q-network using Q-learning with experience replay.

Parameters:

  • trajectories (list[Trajectory], required): List of trajectories to learn from.

Returns:

  • dict[str, float]: Dictionary containing loss, epsilon, and memory size.

Source code in viberl/agents/dqn.py
def learn(self, trajectories: list[Trajectory]) -> dict[str, float]:
    """Update Q-network using Q-learning with experience replay.

    Args:
        trajectories: List of trajectories to learn from

    Returns:
        Dictionary containing loss, epsilon, and memory size.
    """
    if not trajectories:
        return {}

    # Store all transitions from all trajectories in memory
    transitions_added = 0
    for trajectory in trajectories:
        for transition in trajectory.transitions:
            self.memory.append(transition)
            transitions_added += 1

    if len(self.memory) < self.batch_size:
        return {
            'dqn/memory_size': len(self.memory),
            'dqn/transitions_added': transitions_added,
            'dqn/batch_size': len(trajectories),
        }

    # Sample batch from memory
    batch = random.sample(self.memory, self.batch_size)

    # Extract batch data
    states = torch.FloatTensor([t.state for t in batch])
    actions = torch.LongTensor([t.action.action for t in batch])
    rewards = torch.FloatTensor([t.reward for t in batch])
    next_states = torch.FloatTensor([t.next_state for t in batch])
    dones = torch.BoolTensor([t.done for t in batch])

    # Current Q values
    current_q_values = self.q_network(states).gather(1, actions.unsqueeze(1))

    # Next Q values from target network
    with torch.no_grad():
        next_q_values = self.target_network(next_states).max(1)[0]
        target_q_values = rewards + (self.gamma * next_q_values * (~dones))

    # Compute loss
    loss = nn.MSELoss()(current_q_values.squeeze(), target_q_values)

    # Optimize
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    # Update target network
    self._update_target_network()

    # Decay epsilon
    self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)

    return {
        'dqn/loss': loss.item(),
        'dqn/epsilon': self.epsilon,
        'dqn/memory_size': len(self.memory),
        'dqn/transitions_added': transitions_added,
        'dqn/batch_size': len(trajectories),
    }
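
A training-loop sketch around learn(). The collect_trajectory helper and env are hypothetical placeholders for whatever rollout code produces Trajectory objects in your setup; only the call to learn() and the returned metric keys are taken from the source above.

# collect_trajectory is a hypothetical helper that rolls out one episode and
# returns a Trajectory whose transitions carry state, action, reward,
# next_state and done, as learn() expects.
num_episodes = 500  # illustrative

for episode in range(num_episodes):
    trajectory = collect_trajectory(env, agent)
    metrics = agent.learn([trajectory])

    # Once the replay buffer holds at least batch_size transitions,
    # metrics includes 'dqn/loss' and 'dqn/epsilon'.
    if 'dqn/loss' in metrics:
        print(f"episode {episode}: loss={metrics['dqn/loss']:.4f}, "
              f"epsilon={metrics['dqn/epsilon']:.3f}")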

save

save(filepath: str) -> None

Save the agent's neural network parameters to a file.

Parameters:

  • filepath (str, required): Path where to save the model.
Source code in viberl/agents/dqn.py
def save(self, filepath: str) -> None:
    """Save the agent's neural network parameters to a file.

    Args:
        filepath: Path where to save the model
    """
    torch.save(
        {
            'q_network': self.q_network.state_dict(),
            'target_network': self.target_network.state_dict(),
        },
        filepath,
    )

load

load(filepath: str) -> None

Load the agent's neural network parameters from a file.

Parameters:

  • filepath (str, required): Path from which to load the model.
Source code in viberl/agents/dqn.py
def load(self, filepath: str) -> None:
    """Load the agent's neural network parameters from a file.

    Args:
        filepath: Path from which to load the model
    """
    checkpoint = torch.load(filepath, map_location='cpu')
    self.q_network.load_state_dict(checkpoint['q_network'])
    self.target_network.load_state_dict(checkpoint['target_network'])
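
A save/load round-trip sketch, continuing the earlier example; the file path is illustrative. Note that, as the source above shows, only the two networks' parameters are persisted: the optimizer state, replay memory, and current epsilon are not saved.

# Persist the online and target networks, then restore them into a fresh
# agent constructed with the same architecture.
agent.save("dqn_agent.pt")

restored = DQNAgent(state_size=4, action_size=2)
restored.load("dqn_agent.pt")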