====== Parameter scaling in nonlinear optimization ======

$
\def\R{\mathbb R}
\def\F{\mathbf F}
$

You know the scenario: you have a function $f:\R^n \rightarrow \R$ and a starting guess $x_0 \in \R^n$.
You want to find a better $x$, ideally the one for which $f(x)$ is smallest.

(I'll assume $f(x) = \|\F(x)\|^2 = \sum_{i=1}^m F_i(x)^2$ for $\F:\R^n \rightarrow \R^m$, so I can talk about the Jacobian of $\F$.)

Many nonlinear optimization packages promise to be invariant to diagonal rescaling of the parameters.
That is, if
  xstar = minimize(f, x0)
and you define
  g(x) = f(D*x)
for some diagonal scaling matrix D, and run
  ystar = minimize(g, inv(D)*x0)
then they promise that
  ystar = inv(D)*xstar

This is a good thing, right?
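
To see the promise in action, here's a minimal sketch in Python: SciPy's ''minimize'' stands in for the package, and the badly-scaled quadratic ''f'' and the scaling ''d'' are made up for illustration.

  # Sketch: check the rescaling promise on a made-up objective.
  import numpy as np
  from scipy.optimize import minimize

  def f(x):
      # Deliberately badly scaled: the second coordinate lives near 2e-3.
      return (x[0] - 3.0)**2 + 1e6 * (x[1] - 2e-3)**2

  d = np.array([1.0, 1e-3])        # diagonal of the scaling matrix D
  g = lambda y: f(d * y)           # g(y) = f(D*y)

  x0 = np.zeros(2)
  xstar = minimize(f, x0).x
  ystar = minimize(g, x0 / d).x    # start from inv(D)*x0

  # If the promise holds, D*ystar recovers xstar.
  print(xstar, d * ystar)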

So instead I far prefer to place some onus on you, the user of the routine.

> **Scale x so its components are all of roughly unit size.**

This is actually pretty easy in practice.

Think about the output of your model.
Think about the real-world application.
Then use megawatts or gigawatts to keep the objective function value around 1.
Actually, this is the easy bit -- almost all implementations deal well with scaling the objective, but I want to get you into the regime of thinking about scaling.

Then, to scale the parameters, ensure that when they change by about 1e-5,
the objective also changes by a reasonable amount.
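
You can make that check mechanical. Here's a hypothetical little helper (the name ''scaling_report'' and the test objective are invented): nudge each parameter by 1e-5 and print how much the objective moves.

  # Hypothetical diagnostic: perturb each parameter by 1e-5 and report
  # the change in f. Components whose df is wildly smaller or larger
  # than the others are badly scaled.
  import numpy as np

  def scaling_report(f, x, step=1e-5):
      fx = f(x)
      for j in range(x.size):
          xp = x.copy()
          xp[j] += step
          print(f"param {j}: df = {f(xp) - fx:.3e}")

  scaling_report(lambda x: np.sum(x**2), np.array([1.0, 1e6]))  # second df ~1e6x larger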

So if one of the parameters is phase of the moon, it might feel natural to
measure it in days (0 < phoon < 29.5) with some handling for wraparound.
Then your labmate points out that the SI unit for time is the second, so you dutifully reparameterize it in seconds, in the range (0, 2500000).
But if you add 1e-5 seconds, it will probably change the objective by very little, maybe by 1e-12 gigawatts, and the optimizer will be in trouble.

Your first idea (do it in days) was actually better (about 86400 times better), giving a change of 1e-7 GW when you add 1e-5 days (about a second).

Well, look at how the parameter appears in your model, perhaps as
$$ \cos(\eta * \mathtt{phoon})^\beta $$
so a sensible choice might be to optimize over the angle $\eta * \mathtt{phoon}$ directly, measured in radians.
The point is just to think about how a change in ''phoon'' affects the objective.
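
If you want to see the numbers, here's a toy version of that example; the model term, $\eta$, $\beta$, and the evaluation point are all invented for illustration.

  # Toy phoon example: compare the objective change from a 1e-5 step
  # measured in days versus in seconds.
  import numpy as np

  eta, beta = 2 * np.pi / 29.5, 3.0           # eta maps days to radians
  model = lambda phoon_days: np.cos(eta * phoon_days)**beta

  p = 7.0                                     # phase of the moon, in days
  print(model(p + 1e-5) - model(p))           # a 1e-5-day step
  print(model(p + 1e-5 / 86400) - model(p))   # a 1e-5-second step: ~86400x smaller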

> But that's the easy case, what about dimensionless parameters?

Yeah, not an issue.

Seems OK.

$\def\e#1#2{\mathbf{e}^{#1}_{#2}}$

> Another reason for good scaling: finite difference Jacobian calculation

In many situations, you won't have (or won't want to write)
code to compute the Jacobian; or you have, and want to check it.
(Believe me, you really do.) Either way, you need another way to get the Jacobian $J$.
It's finite differences to the rescue.

Let's look at one entry: $J_{ij} = \frac{\partial F_i}{\partial x_j}$, and approximate it by a finite difference.
The vector $\e nj$ is the unit vector in $\R^n$ with a one in the $j^\text{th}$ element and zeros elsewhere,
and $F_i(x)$ is the $i^\text{th}$ component of $\F(x)$.

$$
\frac{\partial F_i}{\partial x_j}(x) = \lim_{\delta \rightarrow 0} \frac{F_i(x + \delta \e n j) - F_i(x)}{\delta}
$$

In a computer, we can't evaluate the limit, so we just choose some "small" value
for $\delta$ and implement the formula in code, building a column at a time:

  for j = 1:n
      J[:,j] = (F(x + delta * e(n,j)) - F(x)) / delta

This can be made more efficient, and more accurate (central differences, and so on),
but the point I mainly want to explore here is the choice of delta,
and argue that really, you should set things up so that 1e-5 works...
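
For reference, here is a runnable NumPy version of the loop above; the test function ''F'' and point ''x'' are invented so you can check the result against the analytic Jacobian by hand.

  # Runnable forward-difference Jacobian, built a column at a time.
  import numpy as np

  def fd_jacobian(F, x, delta=1e-5):
      Fx = F(x)
      J = np.empty((Fx.size, x.size))
      for j in range(x.size):
          step = np.zeros_like(x)
          step[j] = delta                  # delta * e(n, j)
          J[:, j] = (F(x + step) - Fx) / delta
      return J

  F = lambda x: np.array([x[0]**2, x[0] * x[1], np.sin(x[1])])
  x = np.array([1.0, 2.0])
  # Analytic rows for comparison: [2*x0, 0], [x1, x0], [0, cos(x1)]
  print(fd_jacobian(F, x))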

Imagine doing something else: you might set ''delta = 1e-5 * norm(x)'', to adapt the step to the overall scale of $x$.
But then if the user evaluates the Jacobian at the origin $x=0$, it will divide by zero.
So such code ends up doing something like ''delta = 1e-5 * norm(x) + 1e-8''.
No longer scale invariant, and much harder to reason about.
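
To see the loss of invariance, probe that guarded rule at two scalings of the same point (the numbers here are just for illustration): the relative step it takes drifts by orders of magnitude.

  # The guarded step rule, probed at two scalings of the same point.
  import numpy as np

  guarded_delta = lambda x: 1e-5 * np.linalg.norm(x) + 1e-8   # safe at x = 0, but...

  x = np.array([1.0, 1.0])
  for scale in (1.0, 1e-6):
      y = scale * x
      # Relative step: ~1e-5 at scale 1, ~7e-3 at scale 1e-6.
      print(scale, guarded_delta(y) / np.linalg.norm(y))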

So. In conclusion. Scale your parameters so that a change of about 1e-5 in any one of them
produces a sensible change in the objective, and all of this machinery just works.