Optim Module
Partial optimizer instantiation
When constructing a ConstrainedOptimizer, the dual_optimizer parameter is expected to be a torch.optim.Optimizer for which the params argument has not yet been passed. The rest of the instantiation of the dual_optimizer is handled internally by Cooper.
The cooper.optim.partial_optimizer() method below allows you to provide a configuration for your dual_optimizer's hyperparameters (e.g. learning rate, momentum, etc.).
- optim.partial_optimizer(optim_cls, **optim_kwargs)
Partially instantiates an optimizer class. This approach is preferred over functools.partial() since the returned value is an optimizer class whose attributes can be inspected and which can be further instantiated.
- Parameters
optim_cls – Pytorch optimizer class to be partially instantiated.
**optim_kwargs – Keyword arguments for optimizer hyperparameters.
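For instance, a dual optimizer could be partially instantiated as follows (a minimal sketch; the optimizer class and hyperparameter values are illustrative, and Cooper supplies the params argument internally when building the ConstrainedOptimizer):

import torch
import cooper

# Only the hyperparameters are provided here; the dual variables are passed as
# `params` by Cooper when it completes the instantiation internally.
dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3, momentum=0.9)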
Learning rate schedulers
Cooper supports learning rate schedulers for the primal and dual optimizers. Recall that Cooper handles the primal and dual optimizers in slightly different ways: the primal optimizer is “fully” instantiated by the user, while we expect a “partially” instantiated dual optimizer. We follow a similar pattern for the learning rate schedulers.
Example:
from torch.optim.lr_scheduler import StepLR, ExponentialLR

...
primal_optimizer = torch.optim.SGD(...)
dual_optimizer = cooper.optim.partial_optimizer(...)

primal_scheduler = StepLR(primal_optimizer, step_size=1, gamma=0.1)
dual_scheduler = cooper.optim.partial_scheduler(ExponentialLR, **scheduler_kwargs)

const_optim = cooper.ConstrainedOptimizer(..., primal_optimizer, dual_optimizer, dual_scheduler)

for step in range(num_steps):
    ...
    const_optim.step()  # Cooper calls the dual optimizer's step() internally
    primal_scheduler.step()  # You must call this explicitly
Primal learning rate scheduler
You must instantiate the scheduler for the learning rate used by each primal_optimizer and call the scheduler's step method explicitly, as is usual in Pytorch. See torch.optim.lr_scheduler for details.
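For example (a minimal sketch; assumes model is your torch.nn.Module and the hyperparameter values are illustrative):

import torch
from torch.optim.lr_scheduler import StepLR

# The primal optimizer is fully instantiated by the user, and so is its scheduler.
primal_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
primal_scheduler = StepLR(primal_optimizer, step_size=30, gamma=0.1)

# ... later, inside your training loop, after calling const_optim.step():
primal_scheduler.step()  # you are responsible for this call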
Dual learning rate scheduler
When constructing a ConstrainedOptimizer, the dual_scheduler parameter is expected to be a partially instantiated learning rate scheduler from Pytorch, for which the optimizer argument has not yet been passed. The cooper.optim.partial_scheduler() method allows you to provide a configuration for your dual_scheduler's hyperparameters. The rest of the instantiation of the dual_scheduler is managed internally by Cooper.
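For instance (a minimal sketch; the scheduler class and gamma value are illustrative):

import torch
import cooper

# Only the hyperparameters are given here; Cooper attaches the internally built
# dual optimizer as the scheduler's `optimizer` argument later on.
dual_scheduler = cooper.optim.partial_scheduler(
    torch.optim.lr_scheduler.ExponentialLR, gamma=0.99
)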
Note
The call to the step() method of the dual optimizer is handled internally by Cooper. However, you must perform the call to the dual scheduler's step method manually. This will usually come after several calls to cooper.optim.constrained_optimizer.ConstrainedOptimizer.step().
The reasoning behind this design is to provide you, the user, with greater visibility and control over the dual learning rate scheduler. For example, you might want to synchronize the changes in the dual learning rate with the number of training epochs elapsed so far.
This flexibility is also desirable when using an Augmented Lagrangian Formulation, since the penalty coefficient for the augmented Lagrangian can be controlled directly via the dual learning rate scheduler.
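A possible pattern, sketched under the assumption that the fully instantiated dual scheduler is exposed as an attribute of the ConstrainedOptimizer (written here as const_optim.dual_scheduler; verify the attribute name for your version of Cooper):

for epoch in range(num_epochs):
    for batch in loader:
        ...  # compute the Lagrangian, call formulation.backward(), etc.
        const_optim.step()  # the dual optimizer's step() is handled by Cooper

    # Adjust the dual learning rate once per epoch, at your discretion
    const_optim.dual_scheduler.step()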
PartialScheduler Class
- optim.partial_scheduler(scheduler_cls, **scheduler_kwargs)
Partially instantiates a learning rate scheduler class. This approach is preferred over functools.partial() since the returned value is a scheduler class whose attributes can be inspected and which can be further instantiated.
- Parameters
scheduler_cls – Pytorch scheduler class to be partially instantiated.
**scheduler_kwargs – Keyword arguments for scheduler hyperparameters.
Extra-gradient optimizers
The extra-gradient method [Korpelevich, 1976] is a standard approach for solving min-max games such as those arising in the LagrangianFormulation.
Given a Lagrangian \(\mathcal{L}(x,\lambda)\), define the joint variable \(\omega = (x,\lambda)\) and the “gradient” operator:
\[F(\omega) = \begin{bmatrix} \nabla_x \mathcal{L}(x,\lambda) \\ -\nabla_{\lambda} \mathcal{L}(x,\lambda) \end{bmatrix}\]
The extra-gradient update can be summarized as:
\[\begin{split}\omega_{t+1/2} &= P_{\Omega}\left[\omega_t - \eta F(\omega_t)\right] \\ \omega_{t+1} &= P_{\Omega}\left[\omega_t - \eta F(\omega_{t+1/2})\right]\end{split}\]
where \(P_{\Omega}[\cdot]\) denotes projection onto the feasible set and \(\eta\) is the step size.
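As a toy illustration of the look-ahead structure (not Cooper's internal implementation), consider the bilinear game \(\mathcal{L}(x,\lambda) = x\lambda\), for which \(F(\omega) = (\lambda, -x)\):

import torch

x, lam, eta = torch.tensor(1.0), torch.tensor(1.0), 0.1

# Extrapolation: evaluate F at omega_t and move to the look-ahead point omega_{t+1/2}
x_half, lam_half = x - eta * lam, lam - eta * (-x)

# Update: step from omega_t using F evaluated at the look-ahead point
x_next = x - eta * lam_half
lam_next = lam - eta * (-x_half)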
Note
In the unconstrained case, the extra-gradient update is “intrinsically different” from that of Nesterov momentum [Gidel et al., 2019]. The current version of Cooper raises a RuntimeError when trying to use an ExtragradientOptimizer in this unconstrained setting. This restriction might be lifted in future releases.
The implementations of ExtraSGD and ExtraAdam included in Cooper are minor edits from those originally written by Hugo Berard. Gidel et al. [2019] provides a concise presentation of the extra-gradient method in the context of solving Variational Inequality Problems.
Warning
If you decide to use extra-gradient optimizers for defining a ConstrainedOptimizer, the primal and dual optimizers must both be instances of classes inheriting from ExtragradientOptimizer.
When provided with extrapolation-capable optimizers, Cooper will automatically trigger the calls to the extrapolation function.
Due to the calculation of gradients at the “look-ahead” point \(\omega_{t+1/2}\), the call to cooper.optim.constrained_optimizer.ConstrainedOptimizer.step() requires passing the parameters needed for the computation of cooper.problem.ConstrainedMinimizationProblem.closure().
Example:
model = ...

cmp = cooper.ConstrainedMinimizationProblem()
formulation = cooper.Formulation(...)

# Non-extra-gradient optimizers
primal_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3)

# Extra-gradient optimizers
primal_optimizer = cooper.optim.ExtraSGD(model.parameters(), lr=1e-2)
dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraSGD, lr=1e-3)

const_optim = cooper.ConstrainedOptimizer(
    formulation=formulation,
    primal_optimizers=primal_optimizer,
    dual_optimizer=dual_optimizer,
)

for step in range(num_steps):
    const_optim.zero_grad()
    lagrangian = formulation.compute_lagrangian(cmp.closure, model, inputs)
    formulation.backward(lagrangian)

    # Non-extra-gradient optimizers
    # Passing (cmp.closure, model, inputs) to step will simply be ignored
    const_optim.step()

    # Extra-gradient optimizers
    # Must pass (cmp.closure, model, inputs) to step
    const_optim.step(cmp.closure, model, inputs)
- class cooper.optim.ExtragradientOptimizer(params, defaults)[source]
Base class for optimizers with extrapolation step.
- Parameters
params (Iterable) – An iterable of torch.Tensors or dicts. Specifies what Tensors should be optimized.
defaults (dict) – A dict containing default values of optimization options (used when a parameter group doesn’t specify them).
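Outside of Cooper, such optimizers are typically driven by alternating an extrapolation call with a regular step, following Hugo Berard's reference implementation. The sketch below assumes the class exposes an extrapolation() method alongside step(), and uses a hypothetical compute_loss helper; verify both against the code you are using:

optimizer = cooper.optim.ExtraSGD(model.parameters(), lr=1e-2)

loss = compute_loss(model, inputs)  # hypothetical loss helper
optimizer.zero_grad()
loss.backward()
optimizer.extrapolation()  # move the parameters to the look-ahead point

loss = compute_loss(model, inputs)  # re-evaluate at the look-ahead point
optimizer.zero_grad()
loss.backward()
optimizer.step()  # update the original iterate with the look-ahead gradients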
- class cooper.optim.ExtraSGD(params, lr, momentum=0, dampening=0, weight_decay=0, nesterov=False)[source]
Implements stochastic gradient descent with extrapolation step (optionally with momentum).
Nesterov momentum is based on the formula from Sutskever et al. [2013].
- Parameters
params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – Learning rate.
momentum (float) – Momentum factor.
weight_decay (float) – Weight decay (L2 penalty).
dampening (float) – Dampening for momentum.
nesterov (bool) – If True, enables Nesterov momentum.
Note
The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. [2013] and implementations in some other frameworks.
Considering the specific case of Momentum, the update can be written as
\[\begin{split}v &= \rho \cdot v + g \\ p &= p - lr \cdot v\end{split}\]
where \(p\), \(v\), \(g\) and \(\rho\) denote the parameters, velocity, gradient, and momentum respectively.
This is in contrast to Sutskever et al. [2013] and other frameworks which employ an update of the form
\[\begin{split}v &= \rho \cdot v + lr \cdot g \\ p &= p - v\end{split}\]
The Nesterov version is analogously modified.
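As a small numeric illustration (not part of Cooper), the two conventions produce identical parameter updates while the learning rate stays constant, but diverge once a scheduler changes it:

rho, g = 0.9, 1.0   # momentum factor and a constant gradient
lrs = [0.1, 0.01]   # learning rate dropped after the first step

vA = vB = 0.0
for lr in lrs:
    vA = rho * vA + g       # this implementation: v = rho*v + g,    update is lr*v
    vB = rho * vB + lr * g  # Sutskever et al.:    v = rho*v + lr*g, update is v
    print(lr * vA, vB)      # step 1: 0.1 vs 0.1; step 2: 0.019 vs 0.1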
- class cooper.optim.ExtraAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)[source]
Implements the Adam algorithm with an extrapolation step.
- Parameters
params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – Learning rate.
betas (Tuple[float, float]) – Coefficients used for computing running averages of gradient and its square.
eps (float) – Term added to the denominator to improve numerical stability.
weight_decay (float) – Weight decay (L2 penalty).
amsgrad (bool) – Flag to use the AMSGrad variant of this algorithm from Reddi et al. [2018].