Parameter scaling in nonlinear optimization

$ \def\R{\mathbb R} \def\F{\mathbf F} $

You know the scenario: you have a function $f:\R^n \rightarrow \R$ and a starting guess $x_0 \in \R^n$. You want to find a better $x$, ideally the one for which $f(x)$ is smallest, which you might call $x^\ast$.

(I'll assume $f(x) = \|\F(x)\|^2 = \sum_{i=1}^m F_i(x)^2$ for $\F:\R^n\rightarrow\R^m$ so I can talk about the Jacobian $J = \frac{\partial\F}{\partial x}$, but if your problem doesn't look like that, just let $m=1$, or read Jacobian as gradient.)

Many nonlinear optimization packages aim to be scaling-covariant, that is they promise to give you an equivalent answer even if you scale x with a diagonal matrix D. That is, if

xstar = minimize(f, x0)

and you define (Julia syntax)

g(x) = f(D*x)
ystar = minimize(g, inv(D)*x0)

then they promise that

ystar = inv(D)*xstar

This is a good thing, right?

Well, sure it's a good thing. But it's generally implemented badly. In fact, it's basically not really possible to do it right. Most attempts fail if your starting point is at the origin, which may well be a very reasonable starting point, or if one column of the Jacobian is zero at the starting point, again this may well be reasonable. 1)

So instead I far prefer to place some onus on you, the user of the routine. Your job is this:

Scale x so its components are “around 1”

This is actually pretty easy in practice. Just think about the parameters in your model: maybe they are spring constants, initial temperatures, robot joint angles, phase of the moon. Think about the output of your model. Maybe it's predicted energy use in watts. Think about the real-world application. Is it energy use of a city? Then use megawatts or gigawatts to keep the objective function value around 1. Actually, this is the easy bit – almost all implementations deal well with scaling the objective, but I want to get you into the regime of thinking about scaling.

Then, to scale the parameters, ensure that when they change by about 1e-5, the objective also changes by a reasonable amount. So if one of the parameters is phase of the moon, it might feel natural to measure it in days (0 < phoon < 29.5) with some handling for wraparound. Then your labmate points out that the SI unit for time is the second so you dutifully reparameterize it in seconds, in the range (0, 2500000) But if you add 1e-5 seconds, it will probably change the objective by very little, maybe by 1e-12 gigawatts, and the optimizer will be in trouble.

Your first idea (do it in days) was actually a better choice (about 86400 times better), giving a change of 1e-7 GW when you add 1e-5 days (about a second). Can we do even better?

Well, look at how the parameter appears in the objective funciton. For phase of the moon, that's normally in some term like:

$$ \cos(\eta * \mathtt{phoon})^\beta $$

so a sensible range might be 0 to $2\pi/\eta$. The point is just to think about how a change in phoon affects the objective.

But that's the easy case, what about dimensionless parameters?

Yeah, not an issue. Imagine that beta above is also part of the optimization. 2) Again, we perform a little thought experiment to see what a reasonable range might be. Below 1 it will flatten out the curve, that could be useful to model some DC effect, but might also cause phoon to be conflated with some other parameter, and ultimately will make the function quite insensitive to phoon…. Hmmm… Recheck thoughts above.. OK, let's keep beta bigger than 0.5, and smaller than say 10. Quick mental check if adding 1e-5 at either end of that range will do anything very weird. Seems OK.

$\def\e#1#2{{\mathbf e}^{#1}_{#2}}$

Another reason for good scaling: finite difference Jacobian calculation

In many situations, you aren't yet sure enough of your model function $\F(x)$ to have written code to compute the Jacobian. Or, you have written that code, but you need to test it. (Believe me, you really do.) Either way, you need another way to get the Jacobian $J$. It's well known that you can get a decent approximation to $J$ using finite differences as follows.

Let's look at one entry: $J_{ij} = \frac{\partial F_i}{\partial x_j}$, and approximate it using the limit definition. The vector $\e nj$ is the unit vector in $\R^n$ with a one in the $j^\text{th}$ element and zeros elsewhere, and $F_i(x)$ is the $i^\text{th}$ component of $\F(x)$.

$$ \frac{\partial F_i}{\partial x_j}(x) = \lim_{\delta \rightarrow 0} \frac{F_i(x + \delta \e n j) - F_i(x)}{\delta} $$

In a computer, we can't evaluate the limit, so we just choose some “small” value for $\delta$ and implement the formula in code, building a column at a time:

  for j = 1:n
    J[:,j] = (F(x + delta * e(m,j)) - F(x))/delta

This can be made more efficient, and more accurate (central differences, Ridders's method etc), but the point I mainly want to explore here is the choice of delta, and argue that really, you should set things up so that 1e-5 works…

Imagine doing something else: you might set delta = 1e-5 * abs(x[j]) for each j. But then if the user evaluates the Jacobian at the origin $x=0$, it will divide by zero. So such code ends up doing something like delta = max(1e-5 * abs(x[j]), 1e-7). No longer scale invariant, and much harder to reason about.

So. In conclusion. Think about the parameters, and try to keep them in sensible units.

1)
Obviously if *all* columns of the Jacobian are zero, you're already at the optimum.
2)
It's quite likely we would need beta, as one reason phase of the moon might be contributing is something to do with werewolf activity, so it will be sharply peaked around the full moon, but we don't know how sharp the peak is. Fitting beta will let us determine that.
Enter your comment. Wiki syntax is allowed: