Optim
Partial optimizer instantiation
When constructing a ConstrainedOptimizer, the dual_optimizer parameter is expected to be a torch.optim.Optimizer for which the params argument has not yet been passed. The rest of the instantiation of the dual_optimizer is handled internally by Cooper.
The cooper.optim.partial_optimizer() method below allows you to provide a configuration for your dual_optimizer's hyperparameters (e.g. learning rate, momentum, etc.).
- optim.partial_optimizer(optim_cls, **optim_kwargs)
Partially instantiates an optimizer class. This approach is preferred over functools.partial() since the returned value is an optimizer class whose attributes can be inspected and which can be further instantiated.
- Parameters
optim_cls – PyTorch optimizer class to be partially instantiated.
**optim_kwargs – Keyword arguments for optimizer hyperparameters.
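For instance, a minimal usage sketch (the momentum value here is illustrative):

import torch
import cooper

# Fix the hyperparameters of SGD without passing params; Cooper supplies
# the dual parameters internally when the ConstrainedOptimizer is built.
dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3, momentum=0.9)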
Extra-gradient optimizers
The extra-gradient method [Korpelevich, 1976] is a standard approach for solving min-max games such as those appearing in the LagrangianFormulation.
Given a Lagrangian \(\mathcal{L}(x,\lambda)\), define the joint variable \(\omega = (x,\lambda)\) and the “gradient” operator:
\[F(\omega) = [\nabla_x \mathcal{L}(x,\lambda)^{\top}, -\nabla_{\lambda} \mathcal{L}(x,\lambda)^{\top}]^{\top}\]
The extra-gradient update can be summarized as:
\[\begin{split}\omega_{t+1/2} &= P_{\Omega}[\omega_t - \eta F(\omega_t)] \\ \omega_{t+1} &= P_{\Omega}[\omega_t - \eta F(\omega_{t+1/2})]\end{split}\]
where \(P_{\Omega}[\cdot]\) denotes the projection onto the feasible set \(\Omega\) and \(\eta\) is the step size.
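To make the two-step update concrete, here is a minimal sketch of the extra-gradient iteration on the bilinear game \(\min_x \max_{\lambda} x \cdot \lambda\) (a toy problem, not Cooper's API; the projection \(P_{\Omega}\) is taken to be the identity):

import torch

x, lam, eta = torch.tensor(1.0), torch.tensor(1.0), 0.1

for _ in range(200):
    # Extrapolation: F(omega_t) = (dL/dx, -dL/dlam) = (lam, -x)
    x_half = x - eta * lam
    lam_half = lam + eta * x
    # Update: step from omega_t using F evaluated at omega_{t+1/2}
    x, lam = x - eta * lam_half, lam + eta * x_half

# (x, lam) converges to the saddle point (0, 0), whereas plain
# simultaneous gradient descent-ascent spirals away on this game.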
Note
In the unconstrained case, the extra-gradient update is “intrinsically different” from that of Nesterov momentum [Gidel et al., 2019]. The current version of Cooper raises a RuntimeError when trying to use an ExtragradientOptimizer on an unconstrained problem. This restriction might be lifted in future releases.
The implementations of ExtraSGD and ExtraAdam included in Cooper are minor edits from those originally written by Hugo Berard. Gidel et al. [2019] provide a concise presentation of the extra-gradient method in the context of solving Variational Inequality Problems.
Warning
If you decide to use extra-gradient optimizers for defining a ConstrainedOptimizer, the primal and dual optimizers must both be instances of classes inheriting from ExtragradientOptimizer.
When provided with extrapolation-capable optimizers, Cooper will automatically trigger the calls to the extrapolation function.
Due to the calculation of gradients at the “look-ahead” point \(\omega_{t+1/2}\), the call to cooper.constrained_optimizer.ConstrainedOptimizer.step() requires passing the parameters needed for the computation of cooper.problem.ConstrainedMinimizationProblem.closure().
Example:
import torch
import cooper

model = ...

cmp = cooper.ConstrainedMinimizationProblem(is_constrained=True)
formulation = cooper.LagrangianFormulation(cmp)

# Non-extra-gradient optimizers
primal_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3)

# Extra-gradient optimizers
primal_optimizer = cooper.optim.ExtraSGD(model.parameters(), lr=1e-2)
dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraSGD, lr=1e-3)

const_optim = cooper.ConstrainedOptimizer(
    formulation=formulation,
    primal_optimizer=primal_optimizer,
    dual_optimizer=dual_optimizer,
)

for step in range(num_steps):
    const_optim.zero_grad()
    lagrangian = formulation.composite_objective(cmp.closure, model, inputs)
    formulation.custom_backward(lagrangian)

    # Non-extra-gradient optimizers:
    # passing (cmp.closure, model, inputs) to step will simply be ignored
    const_optim.step()

    # Extra-gradient optimizers:
    # must pass (cmp.closure, model, inputs) to step
    const_optim.step(cmp.closure, model, inputs)
- class cooper.optim.ExtragradientOptimizer(params, defaults)[source]
Base class for optimizers with an extrapolation step.
- Parameters
params (Iterable) – An iterable of torch.Tensors or dicts. Specifies which tensors should be optimized.
defaults (dict) – A dict containing default values of optimization options (used when a parameter group doesn’t specify them).
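These optimizers follow the two-call pattern of Hugo Berard's original implementation: extrapolation() computes an update at the current point \(\omega_t\) and moves the parameters to the look-ahead point \(\omega_{t+1/2}\), and the subsequent step() applies the update from \(\omega_t\) using the gradients computed at the look-ahead point. A minimal standalone sketch (the quadratic objective is purely illustrative):

import torch
import cooper

w = torch.nn.Parameter(torch.randn(2))
optimizer = cooper.optim.ExtraSGD([w], lr=1e-2)

for _ in range(10):
    # Gradients at omega_t, then extrapolate to omega_{t+1/2}
    optimizer.zero_grad()
    (w ** 2).sum().backward()
    optimizer.extrapolation()
    # Gradients at omega_{t+1/2}, then update starting from omega_t
    optimizer.zero_grad()
    (w ** 2).sum().backward()
    optimizer.step()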
- class cooper.optim.ExtraSGD(params, lr, momentum=0, dampening=0, weight_decay=0, nesterov=False)[source]
Implements stochastic gradient descent with an extrapolation step (optionally with momentum).
Nesterov momentum is based on the formula from Sutskever et al. [2013].
- Parameters
params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – Learning rate.
momentum (float) – Momentum factor.
weight_decay (float) – Weight decay (L2 penalty).
dampening (float) – Dampening for momentum.
nesterov (bool) – If True, enables Nesterov momentum.
Note
The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. [2013] and implementations in some other frameworks.
Considering the specific case of Momentum, the update can be written as
\[\begin{split}v &= \rho \cdot v + g \\ p &= p - \text{lr} \cdot v\end{split}\]
where \(p\), \(g\), \(v\) and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively.
This is in contrast to Sutskever et al. [2013] and other frameworks which employ an update of the form
\[\begin{split}v &= \rho \cdot v + \text{lr} \cdot g \\ p &= p - v\end{split}\]
The Nesterov version is analogously modified.
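With a constant learning rate the two conventions produce identical parameter trajectories, since the Sutskever velocity is exactly \(\text{lr}\) times the velocity above; they only diverge if the learning rate changes during training, because the form used here applies the current \(\text{lr}\) to the entire velocity. A quick numeric check of this invariant (values illustrative):

g, rho, lr = 1.0, 0.9, 0.1
v_here, v_sutskever = 0.0, 0.0

for _ in range(5):
    v_here = rho * v_here + g                 # convention used here
    v_sutskever = rho * v_sutskever + lr * g  # Sutskever et al. convention
    # While lr is constant, the parameter updates coincide:
    # -lr * v_here == -v_sutskever
    assert abs(v_sutskever - lr * v_here) < 1e-12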
- class cooper.optim.ExtraAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)[source]
Implements the Adam algorithm with an extrapolation step.
- Parameters
params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – Learning rate.
betas (Tuple[float, float]) – Coefficients used for computing running averages of the gradient and its square.
eps (float) – Term added to the denominator to improve numerical stability.
weight_decay (float) – Weight decay (L2 penalty).
amsgrad (bool) – Flag to use the AMSGrad variant of this algorithm from Reddi et al. [2018].
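Per the warning above, if ExtraAdam drives the primal updates, the dual optimizer must also be extrapolation-capable. For example (the learning rates are illustrative):

primal_optimizer = cooper.optim.ExtraAdam(model.parameters(), lr=1e-3)
dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraAdam, lr=1e-4)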
Learning rate schedulers
Cooper supports learning rate schedulers for the primal and dual optimizers. Recall that Cooper handles the primal and dual optimizers in slightly different ways: the primal optimizer is “fully” instantiated by the user, while we expect a “partially” instantiated dual optimizer. We follow a similar pattern for the learning rate schedulers.
Example:
from torch.optim.lr_scheduler import StepLR, ExponentialLR

...
primal_optimizer = torch.optim.SGD(...)
dual_optimizer = cooper.optim.partial_optimizer(...)

primal_scheduler = StepLR(primal_optimizer, step_size=1, gamma=0.1)
dual_scheduler = cooper.optim.partial_scheduler(ExponentialLR, **scheduler_kwargs)

const_optim = cooper.ConstrainedOptimizer(..., primal_optimizer, dual_optimizer, dual_scheduler)

for step in range(num_steps):
    ...
    const_optim.step()  # Cooper calls dual_scheduler.step() internally
    primal_scheduler.step()  # You must call this explicitly
Primal learning rate scheduler
You must instantiate the scheduler for the learning rate used by the primal_optimizer and call the scheduler's step method explicitly, as is usual in PyTorch. See torch.optim.lr_scheduler for details.
Dual learning rate scheduler
When constructing a ConstrainedOptimizer, the dual_scheduler parameter is expected to be a partially instantiated learning rate scheduler from PyTorch, for which the optimizer argument has not yet been passed. The cooper.optim.partial_scheduler() method allows you to provide a configuration for your dual_scheduler's hyperparameters. The rest of the instantiation of the dual_scheduler is managed internally by Cooper.
The calls to the step method of the dual_scheduler are made by Cooper during the execution of cooper.constrained_optimizer.ConstrainedOptimizer.step().
- optim.partial_scheduler(scheduler_cls, **scheduler_kwargs)
Partially instantiates a learning rate scheduler class. This approach is preferred over functools.partial() since the returned value is a scheduler class whose attributes can be inspected and which can be further instantiated.
- Parameters
scheduler_cls – PyTorch scheduler class to be partially instantiated.
**scheduler_kwargs – Keyword arguments for scheduler hyperparameters.
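For instance, mirroring the example above (the gamma value is illustrative):

from torch.optim.lr_scheduler import ExponentialLR
import cooper

# Fix the scheduler hyperparameters without passing an optimizer; Cooper
# attaches the dual optimizer internally and calls dual_scheduler.step()
# during ConstrainedOptimizer.step().
dual_scheduler = cooper.optim.partial_scheduler(ExponentialLR, gamma=0.99)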