Optim Module
Partial optimizer instantiation
When constructing a ConstrainedOptimizer, the dual_optimizer parameter is expected to be a torch.optim.Optimizer for which the params argument has not yet been passed. The rest of the instantiation of the dual_optimizer is handled internally by Cooper.
The cooper.optim.partial_optimizer() method below allows you to provide a configuration for your dual_optimizer's hyperparameters (e.g. learning rate, momentum, etc.).
- optim.partial_optimizer(optim_cls, **optim_kwargs)
Partially instantiates an optimizer class. This approach is preferred over functools.partial() since the returned value is an optimizer class whose attributes can be inspected and which can be further instantiated.
- Parameters
optim_cls – Pytorch optimizer class to be partially instantiated.
**optim_kwargs – Keyword arguments for optimizer hyperparameters.
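For instance, a dual optimizer could be partially instantiated as follows (a minimal sketch; the optimizer class and hyperparameter values are illustrative, and Cooper supplies the params argument internally when building the ConstrainedOptimizer):

import torch
import cooper

# Only the hyperparameters are provided here; the dual variables are passed as
# `params` by Cooper when it completes the instantiation internally.
dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3, momentum=0.9)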
Learning rate schedulers
Cooper supports learning rate schedulers for the primal and dual optimizers. Recall that Cooper handles the primal and dual optimizers in slightly different ways: the primal optimizer is “fully” instantiated by the user, while we expect a “partially” instantiated dual optimizer. We follow a similar pattern for the learning rate schedulers.
Example:
from torch.optim.lr_scheduler import StepLR, ExponentialLR

...
primal_optimizer = torch.optim.SGD(...)
dual_optimizer = cooper.optim.partial_optimizer(...)

primal_scheduler = StepLR(primal_optimizer, step_size=1, gamma=0.1)
dual_scheduler = cooper.optim.partial_scheduler(ExponentialLR, **scheduler_kwargs)

const_optim = cooper.ConstrainedOptimizer(..., primal_optimizer, dual_optimizer, dual_scheduler)

for step in range(num_steps):
    ...
    const_optim.step()  # Cooper calls the dual optimizer's step() internally
    primal_scheduler.step()  # You must call this explicitly
Primal learning rate scheduler
You must instantiate the scheduler for the learning rate used by each primal_optimizer and call the scheduler's step method explicitly, as is usual in Pytorch. See torch.optim.lr_scheduler for details.
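For example (a minimal sketch; assumes model is your torch.nn.Module and the hyperparameter values are illustrative):

import torch
from torch.optim.lr_scheduler import StepLR

# The primal optimizer is fully instantiated by the user, and so is its scheduler.
primal_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
primal_scheduler = StepLR(primal_optimizer, step_size=30, gamma=0.1)

# ... later, inside your training loop, after calling const_optim.step():
primal_scheduler.step()  # you are responsible for this call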
Dual learning rate scheduler
When constructing a ConstrainedOptimizer, the dual_scheduler parameter is expected to be a partially instantiated learning rate scheduler from Pytorch, for which the optimizer argument has not yet been passed. The cooper.optim.partial_scheduler() method allows you to provide a configuration for your dual_scheduler's hyperparameters. The rest of the instantiation of the dual_scheduler is managed internally by Cooper.
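For instance (a minimal sketch; the scheduler class and gamma value are illustrative):

import torch
import cooper

# Only the hyperparameters are given here; Cooper attaches the internally built
# dual optimizer as the scheduler's `optimizer` argument later on.
dual_scheduler = cooper.optim.partial_scheduler(
    torch.optim.lr_scheduler.ExponentialLR, gamma=0.99
)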
Note
The call to the step() method of the dual optimizer is handled internally by Cooper. However, you must perform the call to the dual scheduler's step method manually. This will usually come after several calls to cooper.optim.constrained_optimizer.ConstrainedOptimizer.step().
The reasoning behind this design is to provide you, the user, with greater visibility and control over the dual learning rate scheduler. For example, you might want to synchronize the changes in the dual learning rate with the number of training epochs elapsed so far.
This flexibility is also desirable when using an Augmented Lagrangian Formulation, since the penalty coefficient for the augmented Lagrangian can be controlled directly via the dual learning rate scheduler.
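A possible pattern, sketched under the assumption that the fully instantiated dual scheduler is exposed as an attribute of the ConstrainedOptimizer (written here as const_optim.dual_scheduler; verify the attribute name for your version of Cooper):

for epoch in range(num_epochs):
    for batch in loader:
        ...  # compute the Lagrangian, call formulation.backward(), etc.
        const_optim.step()  # the dual optimizer's step() is handled by Cooper

    # Adjust the dual learning rate once per epoch, at your discretion
    const_optim.dual_scheduler.step()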
PartialScheduler Class
- optim.partial_scheduler(scheduler_cls, **scheduler_kwargs)
Partially instantiates a learning rate scheduler class. This approach is preferred over functools.partial() since the returned value is a scheduler class whose attributes can be inspected and which can be further instantiated.
- Parameters
scheduler_cls – Pytorch scheduler class to be partially instantiated.
**scheduler_kwargs – Keyword arguments for scheduler hyperparameters.
Extra-gradient optimizers
The extra-gradient method [Korpelevich, 1976] is a standard approach for solving min-max games such as those arising in the LagrangianFormulation.
Given a Lagrangian \(\mathcal{L}(x,\lambda)\), define the joint variable \(\omega = (x,\lambda)\) and the “gradient” operator:
\[F(\omega) = \begin{bmatrix} \nabla_x \mathcal{L}(x,\lambda) \\ -\nabla_{\lambda} \mathcal{L}(x,\lambda) \end{bmatrix}\]
The extra-gradient update can be summarized as:
\[\begin{split}\omega_{t+1/2} &= P_{\Omega}\left[\omega_t - \eta F(\omega_t)\right] \\ \omega_{t+1} &= P_{\Omega}\left[\omega_t - \eta F(\omega_{t+1/2})\right]\end{split}\]
where \(P_{\Omega}[\cdot]\) denotes projection onto the feasible set and \(\eta\) is the step size.
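As a toy illustration of the look-ahead structure (not Cooper's internal implementation), consider the bilinear game \(\mathcal{L}(x,\lambda) = x\lambda\), for which \(F(\omega) = (\lambda, -x)\):

import torch

x, lam, eta = torch.tensor(1.0), torch.tensor(1.0), 0.1

# Extrapolation: evaluate F at omega_t and move to the look-ahead point omega_{t+1/2}
x_half, lam_half = x - eta * lam, lam - eta * (-x)

# Update: step from omega_t using F evaluated at the look-ahead point
x_next = x - eta * lam_half
lam_next = lam - eta * (-x_half)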
Note
In the unconstrained case, the extra-gradient update is “intrinsically different” from that of Nesterov momentum [Gidel et al., 2019]. The current version of Cooper raises a RuntimeError when trying to use an ExtragradientOptimizer in this unconstrained setting. This restriction might be lifted in future releases.
The implementations of ExtraSGD and ExtraAdam included in Cooper are minor edits from those originally written by Hugo Berard. Gidel et al. [2019] provides a concise presentation of the extra-gradient method in the context of solving Variational Inequality Problems.
Warning
If you decide to use extra-gradient optimizers for defining a ConstrainedOptimizer, the primal and dual optimizers must both be instances of classes inheriting from ExtragradientOptimizer.
When provided with extrapolation-capable optimizers, Cooper will automatically trigger the calls to the extrapolation function.
Due to the calculation of gradients at the “look-ahead” point \(\omega_{t+1/2}\), the call to cooper.optim.constrained_optimizer.ConstrainedOptimizer.step() requires passing the parameters needed for the computation of cooper.problem.ConstrainedMinimizationProblem.closure().
Example:
model = ...

cmp = cooper.ConstrainedMinimizationProblem()
formulation = cooper.Formulation(...)

# Non-extra-gradient optimizers
primal_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3)

# Extra-gradient optimizers
primal_optimizer = cooper.optim.ExtraSGD(model.parameters(), lr=1e-2)
dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraSGD, lr=1e-3)

const_optim = cooper.ConstrainedOptimizer(
    formulation=formulation,
    primal_optimizers=primal_optimizer,
    dual_optimizer=dual_optimizer,
)

for step in range(num_steps):
    const_optim.zero_grad()
    lagrangian = formulation.compute_lagrangian(cmp.closure, model, inputs)
    formulation.backward(lagrangian)

    # Non-extra-gradient optimizers
    # Passing (cmp.closure, model, inputs) to step will simply be ignored
    const_optim.step()

    # Extra-gradient optimizers
    # Must pass (cmp.closure, model, inputs) to step
    const_optim.step(cmp.closure, model, inputs)
- class cooper.optim.ExtragradientOptimizer(params, defaults)[source]
Base class for optimizers with extrapolation step.
- Parameters
params (Iterable) – An iterable of torch.Tensors or dicts. Specifies what Tensors should be optimized.
defaults (dict) – A dict containing default values of optimization options (used when a parameter group doesn’t specify them).
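Outside of Cooper, such optimizers are typically driven by alternating an extrapolation call with a regular step, following Hugo Berard's reference implementation. The sketch below assumes the class exposes an extrapolation() method alongside step(), and uses a hypothetical compute_loss helper; verify both against the code you are using:

optimizer = cooper.optim.ExtraSGD(model.parameters(), lr=1e-2)

loss = compute_loss(model, inputs)  # hypothetical loss helper
optimizer.zero_grad()
loss.backward()
optimizer.extrapolation()  # move the parameters to the look-ahead point

loss = compute_loss(model, inputs)  # re-evaluate at the look-ahead point
optimizer.zero_grad()
loss.backward()
optimizer.step()  # update the original iterate with the look-ahead gradients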
- class cooper.optim.ExtraSGD(params, lr, momentum=0, dampening=0, weight_decay=0, nesterov=False)[source]
Implements stochastic gradient descent with extrapolation step (optionally with momentum).
Nesterov momentum is based on the formula from Sutskever et al. [2013].
- Parameters
params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – Learning rate.
momentum (float) – Momentum factor.
weight_decay (float) – Weight decay (L2 penalty).
dampening (float) – Dampening for momentum.
nesterov (bool) – If True, enables Nesterov momentum.
Note
The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. [2013] and implementations in some other frameworks.
Considering the specific case of Momentum, the update can be written as
\[\begin{split}v &= \rho \cdot v + g \\ p &= p - lr \cdot v\end{split}\]
where \(p\), \(v\), \(g\) and \(\rho\) denote the parameters, velocity, gradient, and momentum respectively.
This is in contrast to Sutskever et al. [2013] and other frameworks which employ an update of the form
\[\begin{split}v &= \rho \cdot v + lr \cdot g \\ p &= p - v\end{split}\]
The Nesterov version is analogously modified.
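As a small numeric illustration (not part of Cooper), the two conventions produce identical parameter updates while the learning rate stays constant, but diverge once a scheduler changes it:

rho, g = 0.9, 1.0   # momentum factor and a constant gradient
lrs = [0.1, 0.01]   # learning rate dropped after the first step

vA = vB = 0.0
for lr in lrs:
    vA = rho * vA + g       # this implementation: v = rho*v + g,    update is lr*v
    vB = rho * vB + lr * g  # Sutskever et al.:    v = rho*v + lr*g, update is v
    print(lr * vA, vB)      # step 1: 0.1 vs 0.1; step 2: 0.019 vs 0.1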
- class cooper.optim.ExtraAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)[source]
Implements the Adam algorithm with an extrapolation step.
- Parameters
params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – Learning rate.
betas (Tuple[float, float]) – Coefficients used for computing running averages of gradient and its square.
eps (float) – Term added to the denominator to improve numerical stability.
weight_decay (float) – Weight decay (L2 penalty).
amsgrad (bool) – Flag to use the AMSGrad variant of this algorithm from Reddi et al. [2018].