Torch Optimizers

PyTorch provides implementations of many popular optimizers for solving unconstrained minimization problems. Cooper extends PyTorch’s functionality by offering optimizers tailored for min-max optimization problems involving Lagrange multipliers, such as the Lagrangian and AugmentedLagrangian formulations.

The following optimizers are implemented in Cooper:

\(\nu\)PI

The \(\nu\)PI optimizer is a first-order optimization algorithm introduced by Sohrabi et al. [SRZ+24]. It generalizes several popular first-order optimization techniques, including gradient descent, gradient descent with Polyak momentum [Pol64], Nesterov accelerated gradient [Nes83], the optimistic gradient method [Pop80], and Proportional-Integral (PI) controllers [AH95].

The \(\nu\)PI optimizer has been shown to reduce oscillations and overshoot in the value of the Lagrange multipliers, leading to more stable convergence to feasible solutions. For a detailed discussion on the \(\nu\)PI algorithm, see the ICML 2024 paper: On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization.

enum cooper.optim.nuPIInitType(value)

nuPI initialization types. This is used to determine how to initialize the error and derivative terms of the nuPI controller. The initialization scheme SGD ensures that the first step of nuPI(KP, KI) is equivalent to SGD with learning rate \(\eta \times K_I\). The ZEROS scheme yields a first step which corresponds to SGD with a learning rate of \(\eta \times (K_P + K_I)\).

Valid values are as follows:

ZEROS = <nuPIInitType.ZEROS: 0>
SGD = <nuPIInitType.SGD: 1>
class cooper.optim.nuPI(params, lr, weight_decay=0.0, Kp=0.0, Ki=1.0, ema_nu=0.0, init_type=nuPIInitType.SGD, maximize=False)
__init__(params, lr, weight_decay=0.0, Kp=0.0, Ki=1.0, ema_nu=0.0, init_type=nuPIInitType.SGD, maximize=False)

Implements the nuPI controller as a PyTorch optimizer.

Controllers are designed to guide a system toward a desired state by adjusting a control variable. This is achieved by measuring the error, which is the difference between the desired and current states, and using this error to modify the control variable, thereby influencing the system.

For this controller, the error signal is derived from the gradient of a loss function \(L\) being optimized with respect to a parameter \(\vtheta\). Here, \(\vtheta\) acts as the control variable, while the gradient of \(L\) serves as the error signal, defined as \(\ve_t = \nabla L_t(\vtheta_t)\). The control objective of setting \(\nabla L_t(\vtheta_t) = 0\) corresponds to finding a stationary point of the loss function, thereby minimizing (or maximizing) it.

Note

When applied to the Lagrange multipliers of a constrained minimization problem, the error signal \(\nabla L_t(\vtheta_t)\) corresponds to the gradient of the Lagrangian function with respect to the multipliers (e.g., \(\nabla_{\vlambda} \Lag(\vx, \vlambda) = \vg(\vx)\) for inequality-constrained problems). Driving this gradient to zero (or below zero, for inequality constraints) corresponds to finding a point that satisfies the constraints.

The nuPI controller updates parameters as follows:

\[\begin{split}\vxi_t &= \nu \vxi_{t-1} + (1 - \nu) \ve_t, \\ \vtheta_1 &= \vtheta_0 - \eta (K_P \vxi_0 + K_I \ve_0), \\ \vtheta_{t+1} &= \vtheta_t - \eta (K_I \ve_t + K_P (\vxi_t - \vxi_{t-1}))\end{split}\]

Here, \(\vxi_t\) is a smoothed version of the error signal (\(\ve_t\)), using an exponential moving average (EMA) with coefficient \(\nu\). \(K_P\) and \(K_I\) are the proportional and integral gains, respectively, while the learning rate \(\eta\) is kept separate to allow comparison with other optimizers.
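As a rough illustration of the recursion above, the following scalar sketch applies the update to \(L(\theta) = \theta^2 / 2\), whose gradient is \(\theta\). The names `nupi_trajectory` and `quadratic_grad` are hypothetical and not part of Cooper; weight decay and `maximize` are left out, and \(\vxi_{-1}\) is passed in explicitly as `xi_init`.

```python
def nupi_trajectory(grad_fn, theta0, lr, kp, ki, nu, xi_init, num_steps):
    """Scalar reference sketch of the nuPI recursion (not Cooper's implementation).

    `xi_init` plays the role of xi_{-1}.
    """
    theta, xi_prev = theta0, xi_init
    trajectory = [theta]
    for _ in range(num_steps):
        error = grad_fn(theta)                  # e_t = grad L(theta_t)
        xi = nu * xi_prev + (1.0 - nu) * error  # EMA-smoothed error xi_t
        theta = theta - lr * (ki * error + kp * (xi - xi_prev))
        xi_prev = xi
        trajectory.append(theta)
    return trajectory


def quadratic_grad(theta):
    """Gradient of L(theta) = theta**2 / 2."""
    return theta


theta0 = 1.0
traj = nupi_trajectory(
    quadratic_grad, theta0, lr=0.1, kp=1.0, ki=1.0, nu=0.5,
    xi_init=quadratic_grad(theta0),  # xi_{-1} = e_0
    num_steps=200,
)
print(abs(traj[-1]))  # distance to the minimizer at 0 after 200 steps
```

With these (arbitrarily chosen) gains the iterates contract toward the stationary point at zero.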

Weight decay is applied based only on the error signal \(\ve_t\), following a similar approach to PyTorch’s AdamW optimizer.

When maximize=True, the parameter update is multiplied by \(-1\) before being applied.

Initialization Schemes: The initialization of the nuPI controller requires specifying the initial smoothed error signal, \(\vxi_{-1}\), which impacts the first parameter update. Two initialization schemes are available:

  • nuPIInitType.ZEROS: Initializes \(\vxi_{-1} = \vzero\). The first update rule becomes:

    \[\vtheta_1 = \vtheta_0 - \eta (K_P \ve_0 + K_I \ve_0) = \vtheta_0 - \eta (K_P + K_I) \ve_0.\]
  • nuPIInitType.SGD: Initializes \(\vxi_{-1} = \ve_0\), producing a first step identical to SGD:

    \[\begin{split}\vxi_0 &= \nu \vxi_{-1} + (1 - \nu) \ve_0 = \ve_0, \\ \vtheta_1 &= \vtheta_0 - \eta (K_P (\vxi_0 - \vxi_{-1}) + K_I \ve_0) = \vtheta_0 - \eta K_I \ve_0.\end{split}\]
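The two schemes can be compared numerically with a small sketch, assuming \(\nu = 0\) and no weight decay. `first_nupi_step` is a hypothetical helper written for this illustration, not a Cooper function, and the string labels stand in for the nuPIInitType members.

```python
def first_nupi_step(theta0, e0, lr, kp, ki, nu, init_type):
    """First nuPI update for a scalar parameter (illustrative sketch only)."""
    if init_type == "ZEROS":
        xi_prev = 0.0  # xi_{-1} = 0
    elif init_type == "SGD":
        xi_prev = e0   # xi_{-1} = e_0
    else:
        raise ValueError(f"Unknown init_type: {init_type}")
    xi0 = nu * xi_prev + (1.0 - nu) * e0
    return theta0 - lr * (ki * e0 + kp * (xi0 - xi_prev))


theta0, e0, lr, kp, ki = 1.0, 2.0, 0.1, 0.5, 1.0

theta1_zeros = first_nupi_step(theta0, e0, lr, kp, ki, nu=0.0, init_type="ZEROS")
theta1_sgd = first_nupi_step(theta0, e0, lr, kp, ki, nu=0.0, init_type="SGD")

# ZEROS takes a step of size lr * (kp + ki) * e0; SGD takes lr * ki * e0.
print(theta1_zeros, theta0 - lr * (kp + ki) * e0)
print(theta1_sgd, theta0 - lr * ki * e0)
```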

Note

nuPI(\(\eta\), \(K_P=0\), \(K_I=1\), \(\nu=0\)) corresponds to SGD with learning rate \(\eta\).

nuPI(\(\eta\), \(K_P=1\), \(K_I=1\), \(\nu=0\)) corresponds to the optimistic gradient method [Pop80].
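The second correspondence can be checked from the update rule given earlier. With \(\nu = 0\) the smoothed error reduces to \(\vxi_t = \ve_t\), so \(K_P = K_I = 1\) gives, for \(t \geq 1\):

\[\begin{split}\vtheta_{t+1} &= \vtheta_t - \eta (\ve_t + (\ve_t - \ve_{t-1})) \\ &= \vtheta_t - 2 \eta \ve_t + \eta \ve_{t-1}.\end{split}\]

This matches the commonly written form of the optimistic gradient update with step size \(\eta\), in which the current gradient is counted twice and the previous gradient is subtracted once.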

Parameters:
  • params (Iterable[Tensor]) – iterable of parameters to optimize, or dicts defining parameter groups.

  • lr (float) – learning rate.

  • weight_decay (Optional[float]) – weight decay (L2 penalty). Defaults to 0.

  • Kp (Optional[Tensor]) – proportional gain. Defaults to 0.

  • Ki (Optional[Tensor]) – integral gain. Defaults to 1.

  • ema_nu (float) – EMA coefficient for the smoothed error signal. Defaults to 0, meaning no smoothing is applied.

  • init_type (nuPIInitType) – initialization scheme for \(\vxi_{-1}\). Defaults to nuPIInitType.SGD, which matches the first step of SGD.

  • maximize (bool) – whether to maximize the objective with respect to the parameters instead of minimizing. Defaults to False.

Raises:
  • ValueError – If the learning rate or weight decay is negative.

  • ValueError – If the EMA coefficient is not in the range \((-1, 1)\).

  • ValueError – If the initialization type is invalid.

  • NotImplementedError – If multiple parameter groups are used with non-scalar proportional and integral gains.

Warning

A warning is issued in any of the following cases:
  • A negative proportional or integral gain is used.

  • Both the proportional and integral gains are zero.

  • The EMA coefficient is negative.

step(closure=None)

Performs a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

Return type:

Optional[float]

Extragradient Optimizers

Extragradient optimizers are PyTorch optimizers equipped with an extrapolation method, allowing them to be used alongside the ExtrapolationConstrainedOptimizer.

In Cooper, we implement two extragradient optimizers: ExtraSGD and ExtraAdam. We also provide a base class, ExtragradientOptimizer, that can be used to create custom extragradient optimizers.

The implementations of ExtraSGD and ExtraAdam in Cooper are based on minor modifications to the original implementations by Hugo Berard. For a concise overview of the extra-gradient algorithm and its application to solving Variational Inequality Problems, refer to [GBV+19].
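To make the role of the extrapolation step concrete, here is a minimal pure-Python sketch of the extragradient method on the bilinear game \(\min_x \max_y \, xy\), compared against simultaneous gradient descent-ascent. It uses neither Cooper nor PyTorch; the function names are hypothetical, and the step size and iteration count are arbitrary choices for illustration.

```python
import math


def gda_step(x, y, lr):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y."""
    return x - lr * y, y + lr * x


def extragradient_step(x, y, lr):
    """One extragradient iteration for f(x, y) = x * y."""
    # Extrapolation: take a look-ahead step from the current point,
    # keeping the current point (x, y) around for the update.
    x_half, y_half = x - lr * y, y + lr * x
    # Update: step from the saved point using gradients at the look-ahead point.
    return x - lr * y_half, y + lr * x_half


x_gda, y_gda = 1.0, 1.0
x_eg, y_eg = 1.0, 1.0
for _ in range(1000):
    x_gda, y_gda = gda_step(x_gda, y_gda, lr=0.1)
    x_eg, y_eg = extragradient_step(x_eg, y_eg, lr=0.1)

dist_gda = math.hypot(x_gda, y_gda)  # grows: plain GDA spirals away from (0, 0)
dist_eg = math.hypot(x_eg, y_eg)     # shrinks: extragradient approaches (0, 0)
print(dist_gda, dist_eg)
```

The two-phase structure (extrapolate, then update from the saved point) mirrors the split between an `extrapolation()` call and a `step()` call described for the classes below.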

class cooper.optim.ExtragradientOptimizer(params, defaults)

Base class for torch.optim.Optimizers with an extrapolation step.

Parameters:
  • params (Iterable) – an iterable of torch.Tensors or dicts. Specifies what Tensors should be optimized.

  • defaults (dict) – a dict containing default values of optimization options (used when a parameter group doesn’t specify them).

extrapolation()

Performs the extrapolation step and saves a copy of the current parameters for the update step.

Return type:

None

step(closure=None)

Performs a single optimization step.

Parameters:

closure (Optional[Callable]) – A closure that reevaluates the model and returns the loss.

Return type:

Optional[Tensor]

class cooper.optim.ExtraSGD(params, lr=0.001, momentum=0, dampening=0, weight_decay=0, nesterov=False, maximize=False)

Extrapolation-compatible implementation of SGD with momentum.

Note

The implementation of SGD with Momentum/Nesterov subtly differs from [SMDH13] and implementations in some other frameworks.

Considering the specific case of Momentum, the update can be written as:

\[\begin{split}\vv_{t+1} &= \rho \cdot \vv_t + \nabla_{\vtheta} L(\vtheta_t) \\ \vtheta_{t+1} &= \vtheta_t - \eta \cdot \vv_{t+1},\end{split}\]

where \(\vtheta\), \(\vv\), \(\nabla_{\vtheta} L\), \(\rho\) and \(\eta\) denote the parameters, velocity, gradient, momentum and learning rate, respectively.

This is in contrast to [SMDH13] and other frameworks which employ an update of the form:

\[\begin{split}\vv_{t+1} &= \rho \cdot \vv_t + \eta \cdot \nabla_{\vtheta} L(\vtheta_t) \\ \vtheta_{t+1} &= \vtheta_t - \vv_{t+1}.\end{split}\]

The Nesterov version is modified analogously.
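With a constant learning rate, the two formulations trace the same parameter trajectory, since the velocity in the second form is simply \(\eta\) times the velocity in the first. The sketch below checks this numerically on a scalar quadratic. The function names are hypothetical, and dampening, weight decay, and Nesterov momentum are left out.

```python
def momentum_lr_outside(grad_fn, theta, lr, rho, num_steps):
    """First form: v <- rho * v + grad;  theta <- theta - lr * v."""
    v = 0.0
    for _ in range(num_steps):
        v = rho * v + grad_fn(theta)
        theta = theta - lr * v
    return theta


def momentum_lr_inside(grad_fn, theta, lr, rho, num_steps):
    """Second form: v <- rho * v + lr * grad;  theta <- theta - v."""
    v = 0.0
    for _ in range(num_steps):
        v = rho * v + lr * grad_fn(theta)
        theta = theta - v
    return theta


def shifted_quadratic_grad(theta):
    """Gradient of (theta - 3)**2."""
    return 2.0 * (theta - 3.0)


a = momentum_lr_outside(shifted_quadratic_grad, theta=0.0, lr=0.05, rho=0.9, num_steps=50)
b = momentum_lr_inside(shifted_quadratic_grad, theta=0.0, lr=0.05, rho=0.9, num_steps=50)
print(abs(a - b))  # the two agree up to floating-point rounding
```

The trajectories generally differ once the learning rate changes between steps, because the first form rescales the entire accumulated velocity by the current \(\eta\), while the second only scales the newest gradient.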

Parameters:
  • params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.

  • lr (float) – Learning rate.

  • momentum (float) – Momentum factor.

  • weight_decay (float) – Weight decay (L2 penalty).

  • dampening (float) – Dampening for momentum.

  • nesterov (bool) – If True, enables Nesterov momentum.

Raises:
  • ValueError – If the learning rate, momentum, or weight decay are negative.

  • ValueError – If Nesterov momentum is enabled while momentum is set to zero or dampening is not zero.

class cooper.optim.ExtraAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, maximize=False)

Implements the Adam algorithm with an extrapolation step.

Parameters:
  • params (Iterable) – Iterable of parameters to optimize or dicts defining parameter groups.

  • lr (float) – Learning rate.

  • betas (tuple[float, float]) – Coefficients used for computing running averages of gradient and its square.

  • eps (float) – Term added to the denominator to improve numerical stability.

  • weight_decay (float) – Weight decay (L2 penalty).

  • amsgrad (bool) – Flag to use the AMSGrad variant of this algorithm from [RKK18].

Raises:
  • ValueError – If the learning rate or epsilon value is negative.

  • ValueError – If the beta parameters are not in the range [0, 1).