Constrained Optimizer

How to use a ConstrainedOptimizer

The ConstrainedOptimizer class is the cornerstone of Cooper. A ConstrainedOptimizer performs parameter updates to solve a ConstrainedMinimizationProblem given a chosen Formulation.

A ConstrainedOptimizer wraps a torch.optim.Optimizer used for updating the “primal” parameters associated directly with the optimization problem. These might be, for example, the parameters of the model you are training.

Additionally, a ConstrainedOptimizer includes a second torch.optim.Optimizer, which performs updates on the “dual” parameters (e.g. the multipliers used in a LagrangianFormulation).

Construction

The main ingredients to build a ConstrainedOptimizer are a Formulation (associated with a ConstrainedMinimizationProblem) and a torch.optim.Optimizer corresponding to a primal_optimizer.

Note

Cooper supports the use of multiple primal_optimizers, each corresponding to different groups of primal variables. The primal_optimizers argument accepts a single optimizer, or a list of optimizers. See Multiple primal optimizers.

If the ConstrainedMinimizationProblem you are dealing with is in fact constrained, depending on your formulation, you might also need to provide a dual_optimizer. Check out the section on Partial optimizer instantiation for more details on defining dual_optimizers.

Note

Cooper includes extra-gradient implementations of SGD and Adam which can be used as primal or dual optimizers. See Extra-gradient optimizers.

Examples

The highlighted lines below show the small changes required to go from an unconstrained to a constrained problem. Note that these changes should also be accompanied with edits to the custom problem class which inherits from ConstrainedMinimizationProblem. More details on the definition of a CMP can be found under the entry for Constrained Minimization Problem.

  • Unconstrained problem

     1model =  ModelClass(...)
     2cmp = cooper.ConstrainedMinimizationProblem()
     3formulation = cooper.formulation.Formulation(...)
     4
     5primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
     6
     7constrained_optimizer = cooper.UnconstrainedOptimizer(
     8    formulation=formulation,
     9    primal_optimizers=primal_optimizer,
    10)
    
  • Constrained problem

     1model =  ModelClass(...)
     2cmp = cooper.ConstrainedMinimizationProblem()
     3formulation = cooper.formulation.Formulation(...)
     4
     5primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
     6# Note that dual_optimizer is "partly instantiated", *without* parameters
     7dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3, momentum=0.9)
     8
     9constrained_optimizer = cooper.SimultaneousConstrainedOptimizer(
    10    formulation=formulation,
    11    primal_optimizers=primal_optimizer,
    12    dual_optimizer=dual_optimizer,
    13)
    

The training loop

We have gathered all the ingredients we need for tackling our CMP: the custom ConstrainedMinimizationProblem class, along with your ConstrainedOptimizer of choice and a ConstrainedOptimizer for updating the parameters. Now it is time to put them to good use.

The typical training loop for solving a CMP in a machine learning set up using Cooper (with a Lagrangian Formulation) will involve the following steps:

Overview of main steps in a training loop

  1. (Optional) Iterate over your dataset and sample of mini-batch.

  2. Call constrained_optimizer.zero_grad() to reset the parameters’ gradients

  3. Compute the current CMPState (or estimate it with the minibatch) and calculate the Lagrangian using formulation.compute_lagrangian(cmp.closure, ...).

  4. Populate the primal and dual gradients with formulation.backward(lagrangian)

  5. Perform updates on the parameters using the primal and dual optimizers based on the recently computed gradients, via a call to constrained_optimizer.step().

Example

 1model = ModelClass(...)
 2cmp = cooper.ConstrainedMinimizationProblem(...)
 3formulation = cooper.LagrangianFormulation(...)
 4
 5primal_optimizer = torch.optim.SGD(model.parameters(), lr=primal_lr)
 6# Note that dual_optimizer is "partly instantiated", *without* parameters
 7dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=primal_lr)
 8
 9constrained_optimizer = cooper.ConstrainedOptimizer(
10    formulation=formulation,
11    primal_optimizers=primal_optimizer,
12    dual_optimizer=dual_optimizer,
13)
14
15for inputs, targets in dataset:
16    # Clear gradient buffers
17    constrained_optimizer.zero_grad()
18
19    # The closure is required to compute the Lagrangian
20    # The closure might in turn require the model, inputs, targets, etc.
21    lagrangian = formulation.compute_lagrangian(cmp.closure, ...)
22
23    # Populate the primal and dual gradients
24    formulation.backward(lagrangian)
25
26    # Perform primal and dual parameter updates
27    constrained_optimizer.step()

Parameter updates

By default, parameter updates are performed using simultaneous gradient descent-ascent updates (according to the choice of primal and dual optimizers). Formally,

\[\begin{split}x_{t+1} &= \texttt{primal_optimizer_update} \left( x_{t}, \nabla_{x} \mathcal{L}(x, \lambda_t)|_{x=x_t} \right)\\ \lambda_{t+1} &= \texttt{dual_optimizer_update} \left( \lambda_{t}, {\color{red} \mathbf{-}} \nabla_{\lambda} \mathcal{L}({x_{t}}, \lambda)|_{\lambda=\lambda_t} \right)\end{split}\]

Note

We explicitly include a negative sign in front of the gradient for \(\lambda\) in order to highlight the fact that \(\lambda\) solves maximization problem. Cooper handles the sign flipping internally, so you should provide your definition for a dual_optimizer using a non-negative learning rate, as usual!

Multiplier projection

Lagrange multipliers associated with inequality constraints should remain non-negative. Cooper executes the standard projection to \(\mathbb{R}^{+}\) by default for DenseMultipliers. For more details on using custom projection operations, see the section on Multipliers.

Other update strategies supported by ConstrainedOptimizer include:

The ConstrainedOptimizer implements a ConstrainedOptimizer.step() method, that updates the primal and dual parameters (if Formulation has any). The nature of the update depends on the attributes provided during the initialization of the ConstrainedOptimizer. By default, updates are via gradient descent on the primal parameters and (projected) ascent on the dual parameters, with simultaneous updates.

Note

When applied to an unconstrained problem, ConstrainedOptimizer.step() will be equivalent to performing optimizer.step() on all of the primal_optimizers based on the gradient of the loss with respect to the primal parameters.

Additional features

In this section we provide details on using “advanced features” such as alternating updates, the Augmented Lagrangian method or dual restarts, in conjunction with a ConstrainedOptimizer.


Alternating updates

It is possible to perform alternating updates between the primal and dual parameters by setting the flag alternating=True in the construction of the ConstrainedOptimizer. In this case, the gradient computed by calling backward() is used to update the primal parameters. Then, the gradient with respect to the dual variables (given the new value of the primal parameters!) is computed and used to update the dual variables. This two-stage process is handled by Cooper inside the ConstrainedOptimizer.step() method.

\[\begin{split}x_{t+1} &= \texttt{primal_optimizers_update} \left( x_{t}, \nabla_{x} \mathcal{L}_{c_t}(x, \lambda_t)|_{x=x_t} \right)\\ \lambda_{t+1} &= \texttt{dual_optimizer_update} \left( \lambda_{t}, {\color{red} \mathbf{-}} \nabla_{\lambda} \mathcal{L}({\color{red} x_{t+1}}, \lambda)|_{\lambda=\lambda_t} \right)\end{split}\]

Important

Selecting alternating=True does not necessarily double the number of backward passes through a the primal parameters!

When using a LagrangianFormulation, to obtain the gradients with respect to the Lagrange multipliers, it suffices to evaluate the constraint defects (through a call to closure()). This operation does not require a having to back-propagate through the Lagrangian with respect to the primal parameters.

Providing a defect_fn in the call to ConstrainedOptimizer.step() allows for updating the Lagrange multiplier without having to re-evaluate the loss function, but rather only the constraints.

Warning

Combining alternating updates with dual restarts is untested. Use at your own risk.


Augmented Lagrangian method

Exactly solving the intermediate optimization problems considered by the Augmented Lagrangian formulation can be challenging. This is particularly true when \(L_{c_t}(x, \lambda)\) is non-convex in \(x\).

Rather than aiming to solve the intermediate optimization problem to high precision, Cooper implements a version of the Augmented Lagrangian method where the primal variables are updated using a gradient-based step.

\[x_{t+1} = \texttt{primal_optimizers_update} \left( x_{t}, \nabla_{x} \mathcal{L}_{c_t}(x, \lambda_t)|_{x=x_t} \right)\]

The new primal variables are then used to perform an alternating update on the Lagrange multipliers.

Augmented Lagrangian coefficients

Recall that the update for the dual variables in an Augmented Lagrangian formulation is:

\[\begin{split}\lambda_{g, t+1} &= \left[\lambda_{g, t} + c_t g(x_{t+1}) \right]^+ \\ \lambda_{h, t+1} &= \lambda_{h, t} + c_t h(x_{t+1})\end{split}\]

This corresponds exactly to a (projected) gradient ascent update on the dual variables with “step size” \(c_t\) on the function:

\[\begin{split}\mathcal{L}_{c_t}(x_{t+1}, \lambda) \triangleq & \, \, {\color{gray} \overbrace{ f(x_{t+1}) +\frac{c_t}{2} ||g(x_{t+1}) \odot \mathbf{1}_{g(x_{t+1}) \ge 0 \vee \lambda_{g} > 0}||^2 + \frac{c_t}{2} ||h(x_{t+1})||^2}^{\text{do not contribute to gradient } \nabla_{\lambda} \mathcal{L}(x_{t+1}, \lambda)|_{\lambda = \lambda_t}}} \\ & \, \, + \lambda_{g}^{\top} \, g(x_{t+1}) + \lambda_{h}^{\top} \, h(x_{t+1})\end{split}\]

Therefore, the sequence of Augmented Lagrangian coefficients can be identified with a scheduler on the dual learning rate.

\[\lambda_{t+1} = \texttt{dual_optimizer_update} \left( \lambda_{t}, {\color{red} \mathbf{-}} \nabla_{\lambda} \mathcal{L}({\color{red} x_{t+1}}, \lambda_t) , \texttt{lr} = c_t\right)\]

As in the default parameter updates, we explicitly include a negative sign in front of the dual gradient to denote the maximization performed in the dual variables, since Pytorch optimizers use a minimization convention. Cooper handles the gradient sign flip internally.

Example

 1import torch
 2import math
 3import cooper
 4
 5class MyCustomCMP(cooper.ConstrainedMinimizationProblem):
 6    # ConstrainedMinimizationProblem with 3 inequality constraints and 4 equality constraints
 7    ...
 8
 9cmp = MyCustomCMP()
10
11params = torch.nn.Parameter(...)
12primal_optimizer = torch.optim.SGD([params], lr=1e-2)
13
14# Set the dual_optimizer base learning rate to 1.0 since the penalty coefficient
15# for the Augmented Lagrangian is controlled via the dual scheduler
16dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1.0)
17
18# Increasing Augmented Lagrangian coefficient schedule
19dual_scheduler = cooper.optim.partial_scheduler(
20    torch.optim.lr_scheduler.LambdaLR, lr_lambda=lambda step: math.sqrt(step / 100)
21)
22
23formulation = cooper.formulation.AugmentedLagrangianFormulation(cmp)
24
25constrained_optimizer = cooper.ConstrainedOptimizer(
26    formulation=formulation,
27    primal_optimizers=primal_optimizer,
28    dual_optimizer=dual_optimizer,
29    dual_scheduler=dual_scheduler,
30    dual_restarts=False,
31    alternating=True, # Remember that ALM performs alternating updates
32)
33
34# We need to manually trigger the creation of the Lagrange multipliers (formulation state)
35# and the link them to the dual optimizer.
36# This is so that the dual scheduler is fully initialized before we start training, since
37# we need it to compute the Augmented Lagrangian coefficient.
38formulation.create_state_from_metadata(
39    dtype=params.dtype, device=device, ineq_size=torch.Size([3]), eq_size=torch.Size([4])
40)
41coop.instantiate_dual_optimizer_and_scheduler()
42
43for step_id in range(1000):
44    coop.zero_grad()
45
46    lagrangian = formulation.compute_lagrangian(
47        aug_lag_coeff_scheduler=coop.dual_scheduler,
48        closure=cmp.closure,
49        params=params,
50    )
51    formulation.backward(lagrangian)
52
53    # We need to pass the closure or defect_fn to perform the alternating updates
54    # required by the Augmented Lagrangian method.
55    coop.step(defect_fn=cmp.defect_fn, params=params) # Or: closure=cmp.closure
56
57    # Remember that you need to call the dual_scheduler manually!
58    coop.dual_scheduler.step()

Dual restarts

dual_restarts=True can be used to set the value of the dual variables associated with inequality constraints to zero whenever the respective constraints are being satisfied. We do not perform dual_restarts on multipliers for equality constraints.

For simplicity, consider a CMP with two inequality constraints \(g_1(x) \le 0\) and \(g_2(x) \le 0\) and loss \(f(x)\). Suppose that the constraint on \(g_1\) is strictly satisfied, while that on \(g_2\) is violated at \(x\). Consider the Lagrangian at \(x\) with (non-negative) multipliers \(\lambda_1\) and \(\lambda_2\).

\[\mathcal{L}(x,\lambda_1, \lambda_2) = f(x) + \lambda_1 \, g_1(x) + \lambda_2 \, g_2(x)\]

The best response for \(\lambda_1\) and \(\lambda_2\) at the current value of the primal parameters \(x\) is \((\lambda_1, \lambda_2) = (0, +\infty)\). This is due to the non-negativity constraints on the Lagrange multipliers, the sign of the constraint violations, and the fact that the \(\lambda\)-player wants to maximize \(\mathcal{L}(x,\lambda_1, \lambda_2)\).

“Playing a best response” for the Lagrange multiplier \(\lambda_2\) of the violated constraint clearly leads to an impractical algorithm (due to numerical overflow). However, for the currently feasible constraint, performing a best response update on \(\lambda_1\) is indeed implementable, and trivially so. This is exactly the effect of setting dual_restarts=True!

In practice, this prevents the optimization from over-focusing on a constraint (at the expense on improving on the loss function!) when said constraint is already being satisfied. This over-penalization arises from the accumulation of the (positive) defects in the value of the Lagrange multiplier while the constraint was being violated in the past.

Note

We recommend setting dual_restarts=False when dealing with constraints whose violations are estimated stochastically, for example Monte Carlo estimates constraints described by expectations. This is to avoid restarting multipliers when a constraint is “being satisfied” for a single estimate, since this can be a result of the stochasticity of the estimator.

Warning

The behavior of dual restarts has only been tested using DenseMultiplier objects.

Multiple primal optimizers

When constructing a ConstrainedOptimizer, one or multiple primal optimizers (grouped in a list) can be provided. Allowing for multiple primal optimizers is useful when setting separate groups of primal variables to have different optimizer classes and hyperparameters.

When a list of optimizers is provided for the primal_optimizers argument, they are treated “as if they were a single optimizer”. In particular, all primal optimizers operations such as optimizer.step() are executed simultaneously (without intermediate calls to cmp.closure() or formulation.backward(lagrangian)).

ConstrainedOptimizer Class

class cooper.optim.constrained_optimizer.ConstrainedOptimizer

Optimizes a ConstrainedMinimizationProblem given a provided Formulation.

A ConstrainedOptimizer includes one or more torch.optim.Optimizers for the primal variables. It also includes a torch.optim.Optimizer for the dual variables associated with the provided Formulation.

For handling unconstrained problems, we provide an UnconstrainedOptimizer. Please refer to the documentation of the ConstrainedMinimizationProblem and Formulation classes for further details on handling unconstrained problems.

Parameters
  • formulationFormulation of the ConstrainedMinimizationProblem to be optimized.

  • primal_optimizers – Fully instantiated torch.optim.Optimizers used to optimize the primal parameters (e.g. model parameters). The primal parameters can be partitioned into multiple optimizers, in this case primal_optimizers accepts a list of torch.optim.Optimizers.

  • dual_optimizer – Partially instantiated torch.optim.Optimizer used to optimize the dual variables (e.g. Lagrange multipliers). If dealing with an unconstrained problem, please use a UnconstrainedOptimizer instead.

  • dual_scheduler – Partially instantiated torch.optim.lr_scheduler._LRScheduler used to schedule the learning rate of the dual variables. Defaults to None.

  • extrapolation – Whether to perform extragradient updates. Defaults to False.

  • alternating – Whether to alternate parameter updates between primal and dual parameters. Otherwise, do simultaneous parameter updates. Defaults to False.

  • dual_restarts – If True, perform “restarts” on the Lagrange multipliers associated with inequality constraints: whenever the constraint is satisfied, directly set the multiplier to zero. Defaults to False.

base_sanity_checks()

Perform sanity checks on the initialization of ConstrainedOptimizer.

instantiate_dual_optimizer_and_scheduler()

Instantiates the dual optimizer and scheduler.

restart_dual_variables()

Execute restart for multipliers associated with inequality constraints

state_dict()

Returns the state of the ConstrainedOptimizer. See CooperOptimizerState.

Return type

CooperOptimizerState

zero_grad(ignore_primal=False, ignore_dual=False)

Sets the gradients of all optimized Parameters to zero. This includes both the primal and dual variables.

Parameters
  • ignore_primal (bool) – If True, the gradients of the primal variables will not be zeroed. Defaults to False.

  • ignore_dual (bool) – If True, the gradients of the dual variables will not be zeroed. Defaults to False.