Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
optimization-parameter-scaling [2017/11/11 11:44]
awf
optimization-parameter-scaling [2017/11/13 10:43] (current)
awf
Line 1: Line 1:
 ====== Parameter scaling in nonlinear optimization ====== ====== Parameter scaling in nonlinear optimization ======
  
 +$
 +\def\R{\mathbb R}
 +\def\F{\mathbf F}
 +$
  
-You know the scenario: you have a function f:R^n -> R and a starting guess x0 \in R^n.   +You know the scenario: you have a function ​$f:\R^n -> \Rand a starting guess $x_0 \in \R^n$.   
-You want to find a better x, ideally the one for which f(x) is smallest.+You want to find a better ​$x$, ideally the one for which $f(x)is smallest, which you might call $x^\ast$.
  
-(Assume ​f(x) = norm(F(x)) for F:​R^n->​R^m so I can talk about the Jacobian, but it's not important.)+(I'll assume $f(x) = \|\F(x)\|^2 = \sum_{i=1}^m F_i(x)^2$ for $\F:\R^n->\R^mso I can talk about the Jacobian ​$J = \frac{\partial\F}{\partial x}$, but if your problem doesn't look like that, just let $m=1$, or read Jacobian as gradient.)
  
-Many nonlinear optimization packages ​are scaling-covariant,​ that is they promise to give you the same answer even if you scale x with a diagonal matrix D.+Many nonlinear optimization packages ​aim to be scaling-covariant,​ that is they promise to give you an equivalent ​answer even if you scale x with a diagonal matrix D.
 That is, if  That is, if 
  
   xstar = minimize(f, x0)   xstar = minimize(f, x0)
  
-and you define+and you define ​(Julia syntax)
  
   g(x) = f(D*x)   g(x) = f(D*x)
Line 19: Line 23:
 then they promise that  then they promise that 
  
-  ystar = D*xstar+  ystar = inv(D)*xstar
   ​   ​
 This is a good thing, right? This is a good thing, right?
Line 32: Line 36:
 So instead I far prefer to place some onus on you, the user of the routine. ​ Your job is this: So instead I far prefer to place some onus on you, the user of the routine. ​ Your job is this:
  
-> *Scale x so its values ​are "​around 1"*+**Scale x so its components ​are "​around 1"**
  
 This is actually pretty easy in practice.  ​ This is actually pretty easy in practice.  ​
Line 39: Line 43:
 Think about the output of your model. ​ Maybe it's predicted energy use in watts. ​   Think about the output of your model. ​ Maybe it's predicted energy use in watts. ​  
 Think about the real-world application. ​ Is it energy use of a city?  ​ Think about the real-world application. ​ Is it energy use of a city?  ​
-Then use megawatts or gigawatts to keep the objective function value around 1.+Then use megawatts or gigawatts to keep the objective function value around 1.   
 +Actually, this is the easy bit -- almost all implementations deal well with scaling the objective, but I want to get you into the regime of thinking about scaling.
  
 Then, to scale the parameters, ensure that when they change by about 1e-5,  Then, to scale the parameters, ensure that when they change by about 1e-5, 
 the objective also changes by a reasonable amount. ​   the objective also changes by a reasonable amount. ​  
 So if one of the parameters is phase of the moon, it might feel natural to  So if one of the parameters is phase of the moon, it might feel natural to 
-measure it in days (0, 28) with some handling for wraparound. ​  +measure it in days (0 < phoon < 29.5) with some handling for wraparound. ​  
-Then your labmate points out that the SI unit for time is the second so you dutifully reparameterize it in seconds, in the range (0, 2419200)+Then your labmate points out that the SI unit for time is the second so you dutifully reparameterize it in seconds, in the range (0, 2500000)
 But if you add 1e-5 seconds, it will probably change the objective by very little, maybe by 1e-12 gigawatts, and the optimizer will be in trouble. But if you add 1e-5 seconds, it will probably change the objective by very little, maybe by 1e-12 gigawatts, and the optimizer will be in trouble.
  
-Your days were a better choice (about 86400 times better), giving a change of 1e-7 GW when you add 1e-5 days (about a second).+Your first idea (do it in days) was actually ​a better choice (about 86400 times better), giving a change of 1e-7 GW when you add 1e-5 days (about a second). ​ Can we do even better?
  
-Ultimately you're probably anyway looking ​at terms in the objective ​which include the phase of the moon as+Well, look at how the parameter appears ​in the objective ​funciton. ​ For phase of the moon, that's normally in some term like:
  
-  ​cos(eta * phoon)^beta+$$  \cos(\eta * \mathtt{phoon})^\beta  $$
  
-so perhaps ​a range of 0..1/eta is sensible.   +so a sensible ​range might be to $2\pi/\eta$.   
-The point is just to think about how a change in phoon affects the objective.+The point is just to think about how a change in ''​phoon'' ​affects the objective.
  
 === But that's the easy case, what about dimensionless parameters? === === But that's the easy case, what about dimensionless parameters? ===
Line 70: Line 75:
 Seems OK. Seems OK.
  
-=== The proof in the pudding: finite difference Jacobian calculation ===+$\def\e#​1#​2{{\mathbf e}^{#​1}_{#​2}}$
  
-So you've carefully scaled all your parameters, let's see how it goes...+=== Another reason for good scaling: finite difference Jacobian calculation ===
  
 +In many situations, you aren't yet sure enough of your model function $\F(x)$ to have 
 +written code to compute the Jacobian. ​ Or, you have written that code, but you need to test it.  ​
 +(Believe me, you really do.)  Either way, you need another way to get the Jacobian $J$.
 +It's well known that you can get a decent approximation to $J$ using finite differences as follows.
 +
 +Let's look at one entry: $J_{ij} = \frac{\partial F_i}{\partial x_j}$, and approximate it using the limit definition.  ​
 +The vector $\e nj$ is the unit vector in $\R^n$ with a one in the $j^\text{th}$ element and zeros elsewhere, ​
 +and $F_i(x)$ is the $i^\text{th}$ component of $\F(x)$.
 +
 +$$
 +\frac{\partial F_i}{\partial x_j}(x) = \lim_{\delta \rightarrow 0} \frac{F_i(x + \delta \e n j) - F_i(x)}{\delta}
 +$$
 +
 +In a computer, we can't evaluate the limit, so we just choose some "​small"​ value 
 +for $\delta$ and implement the formula in code, building a column at a time:
 +
 +    for j = 1:n
 +      J[:,j] = (F(x + delta * e(m,j)) - F(x))/delta
 +
 +This can be made more efficient, and more accurate (central differences,​ Ridders'​s method etc), 
 +but the point I mainly want to explore here is the choice of delta, ​
 +and argue that really, you should set things up so that 1e-5 works...
 +
 +Imagine doing something else: you might set ''​delta = 1e-5 * abs(x[j])''​ for each ''​j''​.  ​
 +But then if the user evaluates the Jacobian at the origin $x=0$, it will divide by zero.  ​
 +So such code ends up doing something like ''​delta = max(1e-5 * abs(x[j]), 1e-7)''​.  ​
 +No longer scale invariant, and much harder to reason about.
 +
 +So.  In conclusion. ​ Think about the parameters, and try to keep them in sensible units.
 +
 +~~DISCUSSION~~