Optim

Partial optimizer instantiation

When constructing a ConstrainedOptimizer, the dual_optimizer parameter is expected to be a torch.optim.Optimizer for which the params argument has not yet been passed. The rest of the instantiation of the dual_optimizer is handled internally by Cooper.

The cooper.optim.partial_optimizer() method below allows you to provide a configuration for your dual_optimizer's hyperparameters (e.g. learning rate, momentum).

optim.partial_optimizer(optim_cls, **optim_kwargs)

Partially instantiates an optimizer class. This approach is preferred over functools.partial() since the returned value is an optimizer class whose attributes can be inspected and which can be further instantiated.

Parameters
  • optim_cls – PyTorch optimizer class to be partially instantiated.

  • **optim_kwargs – Keyword arguments for optimizer hyperparameters.
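
For example, a dual SGD optimizer with its hyperparameters fixed ahead of time could be configured as in the sketch below (the hyperparameter values are arbitrary placeholders):

import torch

import cooper

# Bind the hyperparameters now. Cooper supplies the dual variables as the
# missing `params` argument when it instantiates the optimizer internally.
dual_optimizer = cooper.optim.partial_optimizer(
    torch.optim.SGD, lr=1e-3, momentum=0.9
)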

Extra-gradient optimizers

The extra-gradient method [Korpelevich, 1976] is a standard approach for solving min-max games such as those arising from the LagrangianFormulation.

Given a Lagrangian \(\mathcal{L}(x,\lambda)\), define the joint variable \(\omega = (x,\lambda)\) and the “gradient” operator:

\[F(\omega) = [\nabla_x \mathcal{L}(x,\lambda), -\nabla_{\lambda} \mathcal{L}(x,\lambda)]^{\top}\]

The extra-gradient update can be summarized as:

\[\begin{split}\omega_{t+1/2} &= P_{\Omega}[\omega_{t} - \eta F(\omega_{t})] \\ \omega_{t+1} &= P_{\Omega}[\omega_{t} - \eta F(\omega_{t+1/2})]\end{split}\]
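
As a purely illustrative sketch (not Cooper's implementation), the update can be carried out by hand on the toy bilinear game \(\mathcal{L}(x, \lambda) = x \lambda\), taking \(P_{\Omega}\) to be the identity:

import torch

eta = 0.1
x, lam = torch.tensor(1.0), torch.tensor(1.0)

def F(x_val, lam_val):
    # Gradient operator of L(x, lam) = x * lam: [dL/dx, -dL/dlam] = [lam, -x]
    return lam_val, -x_val

for _ in range(1000):
    # Look-ahead step: omega_{t+1/2} = omega_t - eta * F(omega_t)
    gx, glam = F(x, lam)
    x_half, lam_half = x - eta * gx, lam - eta * glam

    # Update step: omega_{t+1} = omega_t - eta * F(omega_{t+1/2})
    gx_half, glam_half = F(x_half, lam_half)
    x, lam = x - eta * gx_half, lam - eta * glam_half

# (x, lam) spirals towards the saddle point (0, 0), whereas simultaneous
# gradient descent-ascent with the same step size diverges on this game.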

Note

In the unconstrained case, the extra-gradient update is “intrinsically different” from that of Nesterov momentum [Gidel et al., 2019]. The current version of Cooper raises a RuntimeError when trying to use an ExtragradientOptimizer to solve an unconstrained problem. This restriction might be lifted in future releases.

The implementations of ExtraSGD and ExtraAdam included in Cooper are minor edits of those originally written by Hugo Berard. Gidel et al. [2019] provides a concise presentation of the extra-gradient method in the context of solving Variational Inequality Problems.

Warning

If you decide to use extra-gradient optimizers for defining a ConstrainedOptimizer, the primal and dual optimizers must both be instances of classes inheriting from ExtragradientOptimizer.

When provided with extrapolation-capable optimizers, Cooper will automatically trigger the calls to the extrapolation function.

Due to the calculation of gradients at the “look-ahead” point \(\omega_{t+1/2}\), the call to cooper.constrained_optimizer.ConstrainedOptimizer.step() requires passing the parameters needed for the computation of the cooper.problem.ConstrainedMinimizationProblem.closure().

Example:

model = ...

cmp = cooper.ConstrainedMinimizationProblem(is_constrained=True)
formulation = cooper.problem.Formulation(...)

# Non-extra-gradient optimizers
primal_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3)

# Extra-gradient optimizers
primal_optimizer = cooper.optim.ExtraSGD(model.parameters(), lr=1e-2)
dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraSGD, lr=1e-3)

const_optim = cooper.ConstrainedOptimizer(
    formulation=formulation,
    primal_optimizer=primal_optimizer,
    dual_optimizer=dual_optimizer,
)

for step in range(num_steps):
    const_optim.zero_grad()
    lagrangian = formulation.composite_objective(cmp.closure, model, inputs)
    formulation.custom_backward(lagrangian)

    # Non-extra-gradient optimizers
    # Passing (cmp.closure, model, inputs) to step will simply be ignored
    const_optim.step()

    # Extra-gradient optimizers
    # Must pass (cmp.closure, model, inputs) to step
    const_optim.step(cmp.closure, model, inputs)

class cooper.optim.ExtragradientOptimizer(params, defaults)[source]

Base class for optimizers with extrapolation step.

Parameters
  • params (Iterable) – an iterable of torch.Tensors or dicts. Specifies what Tensors should be optimized.

  • defaults (dict) – a dict containing default values of optimization options (used when a parameter group doesn’t specify them).

extrapolation()[source]

Performs the extrapolation step and saves a copy of the current parameters for the update step.

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable]) – A closure that reevaluates the model and returns the loss.
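
A rough sketch of the calling order when using such an optimizer on its own follows; inside Cooper, ConstrainedOptimizer.step() triggers this sequence for you. The toy objective is a placeholder for your model or Lagrangian evaluation.

import torch

import cooper

params = [torch.nn.Parameter(torch.randn(10))]
optimizer = cooper.optim.ExtraSGD(params, lr=1e-2)

def loss_fn():
    # Placeholder objective; in practice this evaluates your model or Lagrangian.
    return (params[0] ** 2).sum()

optimizer.zero_grad()
loss_fn().backward()
optimizer.extrapolation()  # look-ahead step; saves a copy of the current parameters

optimizer.zero_grad()
loss_fn().backward()       # gradients evaluated at the look-ahead point
optimizer.step()           # update from the saved parameters using these gradients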

class cooper.optim.ExtraSGD(params, lr, momentum=0, dampening=0, weight_decay=0, nesterov=False)[source]

Implements stochastic gradient descent with extrapolation step (optionally with momentum).

Nesterov momentum is based on the formula from Sutskever et al. [2013].

Parameters
  • params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.

  • lr (float) – Learning rate.

  • momentum (float) – Momentum factor.

  • weight_decay (float) – Weight decay (L2 penalty).

  • dampening (float) – Dampening for momentum.

  • nesterov (bool) – If True, enables Nesterov momentum.

Note

The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. [2013] and implementations in some other frameworks.

Considering the specific case of Momentum, the update can be written as

\[\begin{split}v &= \rho \cdot v + g \\ p &= p - lr \cdot v\end{split}\]

where \(p\), \(v\), \(g\) and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively.

This is in contrast to Sutskever et al. [2013] and other frameworks which employ an update of the form

\[\begin{split}v &= \rho \cdot v + lr \cdot g \\ p &= p - v\end{split}\]

The Nesterov version is analogously modified.
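
Schematically, the two conventions differ only in where the learning rate enters the velocity recursion. The comparison below is a sketch, not the library's internal implementation:

def momentum_update_pytorch_style(p, v, g, lr, rho):
    # Convention used here: lr scales the velocity only at update time.
    v = rho * v + g
    p = p - lr * v
    return p, v

def momentum_update_sutskever_style(p, v, g, lr, rho):
    # Sutskever et al. [2013]: lr is folded into the velocity itself.
    v = rho * v + lr * g
    p = p - v
    return p, v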

step(closure=None)

Performs a single optimization step.

Parameters

closure (Optional[Callable]) – A closure that reevaluates the model and returns the loss.

class cooper.optim.ExtraAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)[source]

Implements the Adam algorithm with an extrapolation step.

Parameters
  • params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.

  • lr (float) – Learning rate.

  • betas (Tuple[float, float]) – Coefficients used for computing running averages of the gradient and its square.

  • eps (float) – Term added to the denominator to improve numerical stability.

  • weight_decay (float) – Weight decay (L2 penalty).

  • amsgrad (bool) – Flag to use the AMSGrad variant of this algorithm from Reddi et al. [2018].
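
As required by the warning above, an extrapolation-capable optimizer must be used on both the primal and dual side; for instance, ExtraAdam can serve as both (the learning rates below are arbitrary placeholders):

import cooper

model = ...  # your torch.nn.Module

primal_optimizer = cooper.optim.ExtraAdam(model.parameters(), lr=1e-3)
dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraAdam, lr=1e-3)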

Learning rate schedulers

Cooper supports learning rate schedulers for the primal and dual optimizers. Recall that Cooper handles the primal and dual optimizers in slightly different ways: the primal optimizer is “fully” instantiated by the user, while we expect a “partially” instantiated dual optimizer. We follow a similar pattern for the learning rate schedulers.

Example:

from torch.optim.lr_scheduler import StepLR, ExponentialLR

...
primal_optimizer = torch.optim.SGD(...)
dual_optimizer = cooper.optim.partial_optimizer(...)

primal_scheduler = StepLR(primal_optimizer, step_size=1, gamma=0.1)
dual_scheduler = cooper.optim.partial_scheduler(ExponentialLR, **scheduler_kwargs)

const_optim = cooper.ConstrainedOptimizer(..., primal_optimizer, dual_optimizer, dual_scheduler)

for step in range(num_steps):
    ...
    const_optim.step()  # Cooper calls dual_scheduler.step() internally
    primal_scheduler.step()  # You must call this explicitly

Primal learning rate scheduler

You must instantiate the scheduler for the learning rate used by the primal_optimizer and call the scheduler’s step method explicitly, as is usual in PyTorch. See torch.optim.lr_scheduler for details.

Dual learning rate scheduler

When constructing a ConstrainedOptimizer, the dual_scheduler parameter is expected to be a partially instantiated learning rate scheduler from PyTorch, for which the optimizer argument has not yet been passed. The cooper.optim.partial_scheduler() method allows you to provide a configuration for your dual_scheduler's hyperparameters. The rest of the instantiation of the dual_scheduler is managed internally by Cooper.

The calls to the step method of the dual_scheduler are made by Cooper during the execution of cooper.constrained_optimizer.ConstrainedOptimizer.step().

optim.partial_scheduler(scheduler_cls, **scheduler_kwargs)

Partially instantiates a learning rate scheduler class. This approach is preferred over functools.partial() since the returned value is a scheduler class whose attributes can be inspected and which can be further instantiated.

Parameters
  • scheduler_cls – PyTorch scheduler class to be partially instantiated.

  • **scheduler_kwargs – Keyword arguments for scheduler hyperparameters.
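
For example, an exponentially decaying dual learning rate could be configured as follows (the decay factor is an arbitrary placeholder):

from torch.optim.lr_scheduler import ExponentialLR

import cooper

# Bind the scheduler hyperparameters now. Cooper passes the instantiated dual
# optimizer as the missing `optimizer` argument internally.
dual_scheduler = cooper.optim.partial_scheduler(ExponentialLR, gamma=0.99)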