

Parameter scaling in nonlinear optimization

$ \def\R{\mathbb R} \def\F{\mathbf F} $

You know the scenario: you have a function $f:\R^n \to \R$ and a starting guess $x_0 \in \R^n$. You want to find a better $x$, ideally the one for which $f(x)$ is smallest, which you might call $x^\ast$.

(I'll assume $f(x) = \|\F(x)\|$ for $\F:\R^n\to\R^m$ so I can talk about the Jacobian $J = \frac{\partial\F}{\partial x}$, but if your problem doesn't look like that, just let $m=1$.)

Many nonlinear optimization packages are scaling-covariant: they promise to give you the same answer even if you scale $x$ with a diagonal matrix $D$. That is, if

xstar = minimize(f, x0)

and you define

g(x) = f(D*x)
ystar = minimize(g, inv(D)*x0)

then they promise that

ystar = inv(D)*xstar
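
To make that promise concrete, here is a minimal sketch in Python. The quadratic, the diagonal D, and the starting point are all made up, and SciPy's minimize is just a stand-in for whichever package you use; whether the property holds, and to what accuracy, depends on the solver.

  import numpy as np
  from scipy.optimize import minimize
  
  # A made-up objective, purely for illustration.
  def f(x):
      return (x[0] - 3.0)**2 + 5.0 * (x[1] - 2.0)**2
  
  D = np.diag([10.0, 0.1])           # diagonal rescaling of the parameters
  x0 = np.array([1.0, 1.0])
  
  def g(y):                          # g(y) = f(D y)
      return f(D @ y)
  
  xstar = minimize(f, x0).x
  ystar = minimize(g, np.linalg.solve(D, x0)).x
  
  # A scaling-covariant solver promises ystar == inv(D) @ xstar.
  print(ystar, np.linalg.solve(D, xstar))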

This is a good thing, right?

Well, sure it's a good thing. But it's generally implemented badly, and in fact it's essentially impossible to implement right. Most attempts fail if your starting point is at the origin, which may well be a perfectly reasonable starting point, or if one column of the Jacobian is zero at the starting point, which again may well be reasonable. 1)

So instead I far prefer to place some onus on you, the user of the routine. Your job is this:

*Scale x so its values are “around 1”*

This is actually pretty easy in practice. Just think about the parameters in your model: maybe they are spring constants, initial temperatures, robot joint angles, the phase of the moon. Think about the output of your model; maybe it's predicted energy use in watts. Think about the real-world application. Is it the energy use of a city? Then use megawatts or gigawatts to keep the objective function value around 1. Actually, this is the easy bit: almost all implementations deal well with scaling of the objective, but I want to get you into the habit of thinking about scaling.

Then, to scale the parameters, ensure that when they change by about 1e-5, the objective also changes by a reasonable amount. So if one of the parameters is the phase of the moon, it might feel natural to measure it in days, in the range (0, 29.5), with some handling for wraparound. Then your labmate points out that the SI unit for time is the second, so you dutifully reparameterize it in seconds, in the range (0, 2500000). But if you add 1e-5 seconds, it will probably change the objective by very little, maybe by 1e-12 gigawatts, and the optimizer will be in trouble.

Your days were a better choice (about 86400 times better), giving a change of 1e-7 GW when you add 1e-5 days (about a second).

Ultimately, you're probably looking at terms in the objective which include the phase of the moon as

cos(eta * phoon)^beta

so perhaps a range of 0..1/eta is sensible. The point is just to think about how a change in phoon affects the objective.
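
For instance, a quick numerical probe (a sketch with made-up values for eta and beta, not taken from any real model) shows how differently a 1e-5 nudge behaves in the two parameterizations:

  import numpy as np
  
  # Hypothetical moon-phase term, with made-up constants.
  eta, beta = 2 * np.pi / 29.5, 4.0     # eta in 1/days: one cycle per 29.5 days
  
  def term_days(phoon):                 # phoon measured in days
      return np.cos(eta * phoon) ** beta
  
  def term_seconds(phoon):              # phoon measured in seconds
      return np.cos(eta * phoon / 86400.0) ** beta
  
  p = 3.0                               # some phase, in days
  print(term_days(p + 1e-5) - term_days(p))                        # a usable change
  print(term_seconds(86400 * p + 1e-5) - term_seconds(86400 * p))  # ~86400x smaller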

But that's the easy case; what about dimensionless parameters?

Yeah, not an issue. Imagine that beta above is also part of the optimization. 2) Again, we perform a little thought experiment to see what a reasonable range might be. Below 1 it will flatten out the curve; that could be useful to model some DC effect, but it might also cause phoon to be conflated with some other parameter, and will ultimately make the function quite insensitive to phoon… Hmmm… recheck the thoughts above… OK, let's keep beta bigger than 0.5 and smaller than, say, 10. Quick mental check: will adding 1e-5 at either end of that range do anything very weird? Seems OK.
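
The same probe (same made-up term as above, purely a sketch) works for beta at the ends of that range:

  import numpy as np
  
  eta = 2 * np.pi / 29.5                # same made-up constant as above
  
  def term(phoon, beta):
      return np.cos(eta * phoon) ** beta
  
  for beta in (0.5, 10.0):              # proposed endpoints of the range
      for phoon in (1.0, 3.0, 6.0):     # a few phases (days) where the cosine is positive
          print(beta, phoon, term(phoon, beta + 1e-5) - term(phoon, beta))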

Another reason for good scaling: finite difference Jacobian calculation

In many situations, you aren't yet sure enough of your model function $\F(x)$ to have written code to compute the Jacobian. Or you have written that code, but you need to test it. (Believe me, you really do.) Either way, you need another way to get the Jacobian $J$. It's well known that you can get a decent approximation to an entry of $J$ using finite differences, as follows.

$\def\e#1#2{{\mathbf e}^{#1}_{#2}}$ Let's look at one entry: $J_{ij} = \frac{\partial F_i}{\partial x_j}$, and approximate it using the limit definition. The vector $\e nk$ is the unit vector in $\R^n$ with a one in the $k^\text{th}$ element and zeros elsewhere, and $F_i(x)$ is the $i^\text{th}$ component of $\F(x)$.

$$ \frac{\partial F_i}{\partial x_j}(x) = \lim_{\delta \rightarrow 0} \frac{F_i(x + \delta \e n j) - F_i(x)}{\delta} $$

In a computer, we can't evaluate the limit, so we just choose some “small” value for $\delta$ and implement the formula in code, building the Jacobian a column at a time:

  # build J one column at a time (forward differences)
  for j = 1:n
    J[:,j] = (F(x + delta * e(n,j)) - F(x))/delta

This can be made more efficient and more accurate (central differences, Ridders's method, etc.), but the point I mainly want to explore here is the choice of delta, and to argue that really, you should set things up so that 1e-5 works…
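
For reference, here is a self-contained version of that loop in Python (a sketch: the residual function F and the analytic Jacobian used for the check are made up, and delta defaults to the 1e-5 argued for above):

  import numpy as np
  
  def fd_jacobian(F, x, delta=1e-5):
      """Forward-difference Jacobian of F: R^n -> R^m, one column per parameter."""
      F0 = np.asarray(F(x))
      J = np.empty((F0.size, x.size))
      for j in range(x.size):
          xp = x.copy()
          xp[j] += delta
          J[:, j] = (np.asarray(F(xp)) - F0) / delta
      return J
  
  # Made-up residual function, with parameters scaled "around 1".
  def F(x):
      return np.array([np.sin(x[0]) * x[1], x[0]**2 - x[1], np.exp(x[1]) - 1.0])
  
  x = np.array([0.3, 0.7])
  J_exact = np.array([[np.cos(x[0]) * x[1], np.sin(x[0])],
                      [2 * x[0],            -1.0        ],
                      [0.0,                  np.exp(x[1])]])
  print(np.max(np.abs(fd_jacobian(F, x) - J_exact)))   # small (order 1e-5 or better)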

1)
Obviously if *all* columns of the Jacobian are zero, you're already at the optimum.
2)
It's quite likely we would need beta: one reason the phase of the moon might be contributing is something to do with werewolf activity, so its effect will be sharply peaked around the full moon, but we don't know how sharp the peak is. Fitting beta will let us determine that.