Constrained Optimizer
How to use a ConstrainedOptimizer
The ConstrainedOptimizer
class is the cornerstone of Cooper. A
ConstrainedOptimizer
performs parameter updates to solve a
ConstrainedMinimizationProblem
given a chosen
Formulation
.
A ConstrainedOptimizer
wraps a torch.optim.Optimizer
used for updating the “primal” parameters associated directly with the
optimization problem. These might be, for example, the parameters of the model
you are training.
Additionally, a ConstrainedOptimizer
includes a second
torch.optim.Optimizer
, which performs updates on the “dual”
parameters (e.g. the multipliers used in a
LagrangianFormulation
).
Construction
The main ingredients to build a ConstrainedOptimizer
are a
Formulation
(associated with a
ConstrainedMinimizationProblem
) and a
torch.optim.Optimizer
corresponding to a primal_optimizer
.
Note
Cooper supports the use of multiple primal_optimizers
, each
corresponding to different groups of primal variables. The
primal_optimizers
argument accepts a single optimizer, or a list
of optimizers. See Multiple primal optimizers.
If the ConstrainedMinimizationProblem
you are dealing with is in fact
constrained, depending on your formulation, you might also need to provide a
dual_optimizer
. Check out the section on Partial optimizer instantiation
for more details on defining dual_optimizer
s.
Note
Cooper includes extra-gradient implementations of SGD and Adam which can be used as primal or dual optimizers. See Extra-gradient optimizers.
Examples
The highlighted lines below show the small changes required to go from an
unconstrained to a constrained problem. Note that these changes should also be
accompanied with edits to the custom problem class which inherits from
ConstrainedMinimizationProblem
. More details on
the definition of a CMP can be found under the entry for Constrained Minimization Problem.
Unconstrained problem
1model = ModelClass(...) 2cmp = cooper.ConstrainedMinimizationProblem() 3formulation = cooper.formulation.Formulation(...) 4 5primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-2) 6 7constrained_optimizer = cooper.UnconstrainedOptimizer( 8 formulation=formulation, 9 primal_optimizers=primal_optimizer, 10)
Constrained problem
1model = ModelClass(...) 2cmp = cooper.ConstrainedMinimizationProblem() 3formulation = cooper.formulation.Formulation(...) 4 5primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-2) 6# Note that dual_optimizer is "partly instantiated", *without* parameters 7dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3, momentum=0.9) 8 9constrained_optimizer = cooper.SimultaneousConstrainedOptimizer( 10 formulation=formulation, 11 primal_optimizers=primal_optimizer, 12 dual_optimizer=dual_optimizer, 13)
The training loop
We have gathered all the ingredients we need for tackling our CMP: the
custom ConstrainedMinimizationProblem
class, along
with your ConstrainedOptimizer
of choice and a
ConstrainedOptimizer
for updating the parameters. Now it is time to
put them to good use.
The typical training loop for solving a CMP in a machine learning set up using Cooper (with a Lagrangian Formulation) will involve the following steps:
Overview of main steps in a training loop
(Optional) Iterate over your dataset and sample of mini-batch.
Call
constrained_optimizer.zero_grad()
to reset the parameters’ gradientsCompute the current
CMPState
(or estimate it with the minibatch) and calculate the Lagrangian usingformulation.compute_lagrangian(cmp.closure, ...)
.Populate the primal and dual gradients with
formulation.backward(lagrangian)
Perform updates on the parameters using the primal and dual optimizers based on the recently computed gradients, via a call to
constrained_optimizer.step()
.
Example
1model = ModelClass(...) 2cmp = cooper.ConstrainedMinimizationProblem(...) 3formulation = cooper.LagrangianFormulation(...) 4 5primal_optimizer = torch.optim.SGD(model.parameters(), lr=primal_lr) 6# Note that dual_optimizer is "partly instantiated", *without* parameters 7dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=primal_lr) 8 9constrained_optimizer = cooper.ConstrainedOptimizer( 10 formulation=formulation, 11 primal_optimizers=primal_optimizer, 12 dual_optimizer=dual_optimizer, 13) 14 15for inputs, targets in dataset: 16 # Clear gradient buffers 17 constrained_optimizer.zero_grad() 18 19 # The closure is required to compute the Lagrangian 20 # The closure might in turn require the model, inputs, targets, etc. 21 lagrangian = formulation.compute_lagrangian(cmp.closure, ...) 22 23 # Populate the primal and dual gradients 24 formulation.backward(lagrangian) 25 26 # Perform primal and dual parameter updates 27 constrained_optimizer.step()
Parameter updates
By default, parameter updates are performed using simultaneous gradient descent-ascent updates (according to the choice of primal and dual optimizers). Formally,
Note
We explicitly include a negative sign in front of the gradient for
\(\lambda\) in order to highlight the fact that \(\lambda\) solves
maximization problem. Cooper handles the sign flipping internally, so
you should provide your definition for a dual_optimizer
using a non-negative
learning rate, as usual!
Multiplier projection
Lagrange multipliers associated with inequality constraints should remain
non-negative. Cooper executes the standard projection to
\(\mathbb{R}^{+}\) by default for
DenseMultiplier
s. For more details
on using custom projection operations, see the section on Multipliers.
Other update strategies supported by ConstrainedOptimizer
include:
Alternating updates for (projected) gradient descent-ascent
The Augmented Lagrangian method (ALM)
Performing Dual restarts on the Lagrange multipliers for inequality constraints
Using Extra-gradient
Extra-gradient-based optimizers require an extra call to the
cmp.closure()
. See the section on Extra-gradient optimizers for usage details.
The ConstrainedOptimizer
implements a ConstrainedOptimizer.step()
method, that updates the primal and dual parameters (if Formulation
has any).
The nature of the update depends on the attributes provided during the
initialization of the ConstrainedOptimizer
. By default, updates are via
gradient descent on the primal parameters and (projected) ascent
on the dual parameters, with simultaneous updates.
Note
When applied to an unconstrained problem, ConstrainedOptimizer.step()
will be equivalent to performing optimizer.step()
on all of the
primal_optimizers
based on the gradient of the loss with respect to the
primal parameters.
Additional features
In this section we provide details on using “advanced features” such as
alternating updates, the Augmented Lagrangian method or dual restarts, in
conjunction with a
ConstrainedOptimizer
.
Alternating updates
It is possible to perform alternating updates between the primal and dual
parameters by setting the flag alternating=True
in the construction of the
ConstrainedOptimizer
. In this case, the gradient computed by calling
backward()
is used to update the
primal parameters. Then, the gradient with respect to the dual variables (given
the new value of the primal parameters!) is computed and used to update the dual
variables. This two-stage process is handled by Cooper inside the
ConstrainedOptimizer.step()
method.
Important
Selecting alternating=True
does not necessarily double the number of
backward passes through a the primal parameters!
When using a LagrangianFormulation
, to obtain the gradients with respect
to the Lagrange multipliers, it suffices to evaluate the constraint
defects (through a call to
closure()
). This
operation does not require a having to back-propagate through the Lagrangian
with respect to the primal parameters.
Providing a defect_fn
in the call to ConstrainedOptimizer.step()
allows for updating the Lagrange multiplier without having to re-evaluate
the loss function, but rather only the constraints.
Warning
Combining alternating updates with dual restarts is untested. Use at your own risk.
Augmented Lagrangian method
Exactly solving the intermediate optimization problems considered by the Augmented Lagrangian formulation can be challenging. This is particularly true when \(L_{c_t}(x, \lambda)\) is non-convex in \(x\).
Rather than aiming to solve the intermediate optimization problem to high precision, Cooper implements a version of the Augmented Lagrangian method where the primal variables are updated using a gradient-based step.
The new primal variables are then used to perform an alternating update on the Lagrange multipliers.
Augmented Lagrangian coefficients
Recall that the update for the dual variables in an Augmented Lagrangian formulation is:
This corresponds exactly to a (projected) gradient ascent update on the dual variables with “step size” \(c_t\) on the function:
Therefore, the sequence of Augmented Lagrangian coefficients can be identified with a scheduler on the dual learning rate.
As in the default parameter updates, we explicitly include a negative sign in front of the dual gradient to denote the maximization performed in the dual variables, since Pytorch optimizers use a minimization convention. Cooper handles the gradient sign flip internally.
Example
1import torch
2import math
3import cooper
4
5class MyCustomCMP(cooper.ConstrainedMinimizationProblem):
6 # ConstrainedMinimizationProblem with 3 inequality constraints and 4 equality constraints
7 ...
8
9cmp = MyCustomCMP()
10
11params = torch.nn.Parameter(...)
12primal_optimizer = torch.optim.SGD([params], lr=1e-2)
13
14# Set the dual_optimizer base learning rate to 1.0 since the penalty coefficient
15# for the Augmented Lagrangian is controlled via the dual scheduler
16dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1.0)
17
18# Increasing Augmented Lagrangian coefficient schedule
19dual_scheduler = cooper.optim.partial_scheduler(
20 torch.optim.lr_scheduler.LambdaLR, lr_lambda=lambda step: math.sqrt(step / 100)
21)
22
23formulation = cooper.formulation.AugmentedLagrangianFormulation(cmp)
24
25constrained_optimizer = cooper.ConstrainedOptimizer(
26 formulation=formulation,
27 primal_optimizers=primal_optimizer,
28 dual_optimizer=dual_optimizer,
29 dual_scheduler=dual_scheduler,
30 dual_restarts=False,
31 alternating=True, # Remember that ALM performs alternating updates
32)
33
34# We need to manually trigger the creation of the Lagrange multipliers (formulation state)
35# and the link them to the dual optimizer.
36# This is so that the dual scheduler is fully initialized before we start training, since
37# we need it to compute the Augmented Lagrangian coefficient.
38formulation.create_state_from_metadata(
39 dtype=params.dtype, device=device, ineq_size=torch.Size([3]), eq_size=torch.Size([4])
40)
41coop.instantiate_dual_optimizer_and_scheduler()
42
43for step_id in range(1000):
44 coop.zero_grad()
45
46 lagrangian = formulation.compute_lagrangian(
47 aug_lag_coeff_scheduler=coop.dual_scheduler,
48 closure=cmp.closure,
49 params=params,
50 )
51 formulation.backward(lagrangian)
52
53 # We need to pass the closure or defect_fn to perform the alternating updates
54 # required by the Augmented Lagrangian method.
55 coop.step(defect_fn=cmp.defect_fn, params=params) # Or: closure=cmp.closure
56
57 # Remember that you need to call the dual_scheduler manually!
58 coop.dual_scheduler.step()
Dual restarts
dual_restarts=True
can be used to set the value of the dual variables
associated with inequality constraints to zero whenever the respective
constraints are being satisfied. We do not perform dual_restarts
on
multipliers for equality constraints.
For simplicity, consider a CMP with two inequality constraints \(g_1(x) \le 0\) and \(g_2(x) \le 0\) and loss \(f(x)\). Suppose that the constraint on \(g_1\) is strictly satisfied, while that on \(g_2\) is violated at \(x\). Consider the Lagrangian at \(x\) with (non-negative) multipliers \(\lambda_1\) and \(\lambda_2\).
The best response for \(\lambda_1\) and \(\lambda_2\) at the current value of the primal parameters \(x\) is \((\lambda_1, \lambda_2) = (0, +\infty)\). This is due to the non-negativity constraints on the Lagrange multipliers, the sign of the constraint violations, and the fact that the \(\lambda\)-player wants to maximize \(\mathcal{L}(x,\lambda_1, \lambda_2)\).
“Playing a best response” for the Lagrange multiplier \(\lambda_2\) of the
violated constraint clearly leads to an impractical algorithm (due to numerical
overflow). However, for the currently feasible constraint, performing a best
response update on \(\lambda_1\) is indeed implementable, and trivially so.
This is exactly the effect of setting dual_restarts=True
!
In practice, this prevents the optimization from over-focusing on a constraint (at the expense on improving on the loss function!) when said constraint is already being satisfied. This over-penalization arises from the accumulation of the (positive) defects in the value of the Lagrange multiplier while the constraint was being violated in the past.
Note
We recommend setting dual_restarts=False
when dealing with constraints
whose violations are estimated stochastically, for example Monte Carlo
estimates constraints described by expectations. This is to avoid
restarting multipliers when a constraint is “being satisfied” for a single
estimate, since this can be a result of the stochasticity of the estimator.
Warning
The behavior of dual restarts has only been tested using
DenseMultiplier
objects.
Multiple primal optimizers
When constructing a ConstrainedOptimizer
,
one or multiple primal optimizers (grouped in a list) can be provided. Allowing
for multiple primal optimizers is useful when setting separate groups of primal
variables to have different optimizer classes and hyperparameters.
When a list of optimizers is provided for the primal_optimizers
argument, they are
treated “as if they were a single optimizer”. In particular, all primal optimizers
operations such as optimizer.step()
are
executed simultaneously (without intermediate calls to
cmp.closure()
or
formulation.backward(lagrangian)
).
ConstrainedOptimizer
Class
- class cooper.optim.constrained_optimizer.ConstrainedOptimizer
Optimizes a
ConstrainedMinimizationProblem
given a providedFormulation
.A
ConstrainedOptimizer
includes one or moretorch.optim.Optimizer
s for the primal variables. It also includes atorch.optim.Optimizer
for the dual variables associated with the providedFormulation
.For handling unconstrained problems, we provide an
UnconstrainedOptimizer
. Please refer to the documentation of theConstrainedMinimizationProblem
andFormulation
classes for further details on handling unconstrained problems.- Parameters
formulation –
Formulation
of theConstrainedMinimizationProblem
to be optimized.primal_optimizers – Fully instantiated
torch.optim.Optimizer
s used to optimize the primal parameters (e.g. model parameters). The primal parameters can be partitioned into multiple optimizers, in this caseprimal_optimizers
accepts a list oftorch.optim.Optimizer
s.dual_optimizer – Partially instantiated
torch.optim.Optimizer
used to optimize the dual variables (e.g. Lagrange multipliers). If dealing with an unconstrained problem, please use aUnconstrainedOptimizer
instead.dual_scheduler – Partially instantiated
torch.optim.lr_scheduler._LRScheduler
used to schedule the learning rate of the dual variables. Defaults to None.extrapolation – Whether to perform extragradient updates. Defaults to False.
alternating – Whether to alternate parameter updates between primal and dual parameters. Otherwise, do simultaneous parameter updates. Defaults to False.
dual_restarts – If True, perform “restarts” on the Lagrange multipliers associated with inequality constraints: whenever the constraint is satisfied, directly set the multiplier to zero. Defaults to False.
- base_sanity_checks()
Perform sanity checks on the initialization of
ConstrainedOptimizer
.
- instantiate_dual_optimizer_and_scheduler()
Instantiates the dual optimizer and scheduler.
- restart_dual_variables()
Execute restart for multipliers associated with inequality constraints
- state_dict()
Returns the state of the ConstrainedOptimizer. See
CooperOptimizerState
.- Return type
CooperOptimizerState