In this post, we will try to connect Energy-Based Models (EBMs) with classical optimal control frameworks such as Model-Predictive Control (MPC), from the perspective of Lagrangian optimization.
This post is one part of a series about energy-based learning and optimal control. A recommended reading order is:
- Notes on "The Energy-Based Learning Model" by Yann LeCun, 2021
- Learning Data Distribution Via Gradient Estimation
- [From MPC to Energy-Based Policy]
- How Would Diffusion Model Help Robot Imitation
- Causality hidden in EBM
Review of EKF and MPC
Consider a state-space model with process noise and measurement noise:

$$ \begin{aligned} x[k+1] &= f(x[k], u[k]) + w[k], \qquad w[k] \sim \mathcal{N}(0, Q_w) \\ y[k] &= h(x[k]) + v[k], \qquad v[k] \sim \mathcal{N}(0, R_v) \end{aligned} $$
To perform optimal control on such a system, we first estimate the state by predicting and updating with an Extended Kalman Filter (EKF):
Prediction Step
Priori Estimation:

$$ \hat{x}^-[k] = f(\hat{x}[k-1], u[k-1]) $$

Jacobian of the State Transition Function:

$$ F[k] = \left. \frac{\partial f}{\partial x} \right|_{\hat{x}[k-1],\, u[k-1]} $$

Error Covariance Prediction:

$$ P^-[k] = F[k]\, P[k-1]\, F[k]^\top + Q_w $$
Update Step
Jacobian of the Measurement Function:

$$ H[k] = \left. \frac{\partial h}{\partial x} \right|_{\hat{x}^-[k]} $$

Kalman Gain:

$$ K[k] = P^-[k]\, H[k]^\top \left( H[k]\, P^-[k]\, H[k]^\top + R_v \right)^{-1} $$

Posterior Estimation:

$$ \hat{x}[k] = \hat{x}^-[k] + K[k] \left( y[k] - h(\hat{x}^-[k]) \right) $$

Error Covariance Update:

$$ P[k] = \left( I - K[k]\, H[k] \right) P^-[k] $$
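To make the two steps concrete, here is a minimal NumPy sketch of one full EKF iteration. The callables `f`, `h`, `F_jac`, and `H_jac` are placeholders for your system's dynamics, measurement model, and their Jacobians; they are assumptions for illustration, not part of any particular library:

```python
import numpy as np

def ekf_step(x_hat, P, u, y, f, h, F_jac, H_jac, Q_w, R_v):
    """One EKF predict + update cycle for the model above."""
    # --- Prediction step ---
    x_prior = f(x_hat, u)                    # a priori state estimate
    F = F_jac(x_hat, u)                      # Jacobian of f at the previous estimate
    P_prior = F @ P @ F.T + Q_w              # error covariance prediction

    # --- Update step ---
    H = H_jac(x_prior)                       # Jacobian of h at the prior estimate
    S = H @ P_prior @ H.T + R_v              # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_post = x_prior + K @ (y - h(x_prior))  # posterior state estimate
    P_post = (np.eye(len(x_hat)) - K @ H) @ P_prior  # error covariance update
    return x_post, P_post
```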
With the estimated state $\hat{x}[k]$, we can perform Model-Predictive Control by solving the following optimization problem:
$$ \begin{aligned} \min_{\mathbf{u}}\; J &= \sum_{i=0}^{N-1} \ell(x[k+i], u[k+i]) + \ell_f(x[k+N]) \\
\ell(x, u) &= (x - x_{\text{ref}})^{\top} Q\, (x - x_{\text{ref}}) + u^{\top} R\, u \\
\text{s.t.}\;\; & x[k+i+1] = f(x[k+i], u[k+i]), \quad i = 0, \dots, N-1 \\
& x[k] = \hat{x}[k] \\
& x_{\min} \le x[k+i] \le x_{\max} \quad \forall i \\
& u_{\min} \le u[k+i] \le u_{\max} \quad \forall i \end{aligned} $$
- $\mathbf{u} = \{u[k], \dots, u[k+N-1]\}$: control input sequence.
- $N$: prediction horizon.
- $\ell$: stage cost function.
- $\ell_f$: terminal cost function.
- $x_{\min}, x_{\max}$: state constraints.
- $u_{\min}, u_{\max}$: control constraints.
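To make this concrete, here is a minimal sketch that solves the horizon problem by direct shooting with `scipy.optimize.minimize`. The double-integrator dynamics, cost weights, horizon, and bounds are all illustrative assumptions rather than part of the formulation above; a real controller would typically use a dedicated QP/NLP solver:

```python
import numpy as np
from scipy.optimize import minimize

N, dt = 10, 0.1                    # prediction horizon and step size (assumed)
Q, R = np.diag([1.0, 0.1]), 0.01   # stage cost weights (assumed)
x_ref = np.array([1.0, 0.0])       # reference state

def f(x, u):
    """Toy double-integrator dynamics: state = (position, velocity)."""
    return np.array([x[0] + dt * x[1], x[1] + dt * u])

def rollout_cost(u_seq, x0):
    """Simulate the horizon under u_seq, accumulating the quadratic cost."""
    x, J = x0, 0.0
    for u in u_seq:
        J += (x - x_ref) @ Q @ (x - x_ref) + R * u**2
        x = f(x, u)
    J += (x - x_ref) @ Q @ (x - x_ref)   # terminal cost (reuses Q for simplicity)
    return J

x0 = np.zeros(2)
res = minimize(rollout_cost, np.zeros(N), args=(x0,),
               bounds=[(-2.0, 2.0)] * N)   # control constraints
u_mpc = res.x[0]   # receding horizon: apply the first control, then re-solve
```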
Introducing the Energy-Based Model
We can replace the state transition function $f$ with a learned energy-based model $E_\theta$: instead of propagating the state through an explicit $f$, the predicted next state is obtained by minimizing a learned energy landscape, e.g. $x[k+1] = \arg\min_{x} E_\theta(x \mid x[k], u[k])$.
Supervised Training
Since the model density is $p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}$, taking the gradient of the log-density removes the intractable normalizing constant $Z(\theta)$. Here we have the score function as:

$$ s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) $$
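In code this is a single autograd call; a minimal PyTorch sketch, where `energy_net` stands in for any scalar-output energy network:

```python
import torch

def score(energy_net, x):
    """s_theta(x) = -grad_x E_theta(x); note Z(theta) never appears."""
    x = x.detach().requires_grad_(True)
    E = energy_net(x).sum()   # sum over the batch so the gradient has x's shape
    return -torch.autograd.grad(E, x, create_graph=True)[0]
```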
Dataset: a set of transitions $\mathcal{D} = \{(x_i[k], u_i[k], x_i[k+1])\}$ collected under the "independently and identically distributed (i.i.d.)" assumption. The training objective is to minimize the difference between the data landscape $s_{\text{data}}$ and the model landscape $s_\theta$, and the objective function is defined as follows, where the loss is commonly an MSE loss:

$$ J(\theta) = \frac{1}{2}\, \mathbb{E}_{p_{\text{data}}} \left[ \left\| s_\theta(\mathbf x) - s_{\text{data}}(\mathbf x) \right\|^2 \right] $$
HOWEVER, we cannot get access to the full data distribution. According to (Hyvärinen & Dayan, 2005)[^1], we may use the following procedure (more in Appendix A of the original paper):
$$ \begin{aligned} J(\theta) &= \frac{1}{2}\, \mathbb{E}_{p_{\text{data}}} \left[ \left\| s_\theta(\mathbf x) \right\|^2 - 2\, s_\theta(\mathbf x)^\top s_{\text{data}}(\mathbf x) + \left\| s_{\text{data}}(\mathbf x) \right\|^2 \right] \\
&=\frac{1}{2} \mathbb{E}_{p_{\text{data}}} \left[ \left\| s_\theta(\mathbf x) \right\|^2 \right] - \mathbb{E}_{p_{\text{data}}} \left[ s_\theta(\mathbf x)^\top s_{\text{data}}(\mathbf x) \right] + \overbrace{\frac{1}{2} \mathbb{E}_{p_{\text{data}}} \left[ \left\| s_{\text{data}}(\mathbf x) \right\|^2 \right]}^{\text{constant}} \\
J'(\theta) &= \frac{1}{2}\, \mathbb{E}_{p_{\text{data}}} \left[ \left\| s_\theta(\mathbf x) \right\|^2 \right] - \mathbb{E}_{p_{\text{data}}} \left[ s_\theta(\mathbf x)^\top s_{\text{data}}(\mathbf x) \right]
\end{aligned} $$
By integrating by parts, we move the derivative from the unknown data score $s_{\text{data}}$ onto the model score $s_\theta$. Noting that
$$ \begin{aligned} \nabla_{\mathbf x}\, s_\theta(\mathbf x) &= \nabla_{\mathbf x} \left( -\nabla_{\mathbf x} E_\theta(\mathbf x) \right) = -\nabla_{\mathbf x} \nabla_{\mathbf x} E_\theta(\mathbf x) = -\nabla_{\mathbf x}^2 E_\theta(\mathbf x) \\
\mathbb{E}_{p_{\text{data}}} \left[ s_\theta(\mathbf x)^\top s_{\text{data}}(\mathbf x) \right] &= -\mathbb{E}_{p_{\text{data}}} \left[ \operatorname{tr}\left( \nabla_{\mathbf x}\, s_\theta(\mathbf x) \right) \right] \\
J_k(\theta) &= -\operatorname{tr}\left( \nabla_{x[k]}^2 E_\theta(x[k]) \right) + \frac{1}{2} \left\| \nabla_{x[k]} E_\theta(x[k]) \right\|^2 \end{aligned} $$
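A direct (if expensive) PyTorch sketch of this per-sample objective computes both terms with automatic differentiation. The exact Hessian trace below loops over dimensions, which is only practical for small state spaces (large ones typically switch to a Hutchinson-style estimator); `energy_net` is again a placeholder for any scalar-output energy network:

```python
import torch

def score_matching_loss(energy_net, x):
    """J_k = -tr(Hess_x E_theta) + 0.5 * ||grad_x E_theta||^2, batch-averaged."""
    x = x.detach().requires_grad_(True)
    E = energy_net(x).sum()
    grad_E = torch.autograd.grad(E, x, create_graph=True)[0]   # (batch, dim)

    trace = 0.0
    for i in range(x.shape[1]):   # exact Hessian trace, one dimension at a time
        trace = trace + torch.autograd.grad(
            grad_E[:, i].sum(), x, create_graph=True)[0][:, i]

    return (-trace + 0.5 * grad_E.pow(2).sum(dim=1)).mean()
```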
If you haven't seen such a formulation in diffusion models and it feels strange:
- In diffusion models, the training objective is to learn the score of the perturbed distribution $q(x_t \mid x_0)$, which is a known Gaussian distribution since the noise level is provided. This is also why q-sampling requires $x_0$ (see the sketch below).
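For comparison, a tiny sketch of that q-sampling step, assuming the usual cumulative noise schedule $\bar\alpha_t$ (named `alpha_bar_t` here); nothing in it is specific to this post:

```python
import torch

def q_sample(x0, alpha_bar_t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(a)*x_0, (1-a)*I): needs x_0 and the noise level."""
    a = torch.as_tensor(alpha_bar_t)
    return a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)
```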
Optimization-Based Inference
Langevin Dynamics can produce samples from a probability density $p(x)$ using only its score function $\nabla_x \log p(x)$:

$$ x_{t+1} = x_t + \frac{\epsilon}{2} \nabla_x \log p(x_t) + \sqrt{\epsilon}\, z_t, \qquad z_t \sim \mathcal{N}(0, I) $$

where $\epsilon$ is the step size. For our EBM, $\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$, so sampling only requires gradients of the energy.
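A minimal sketch of this sampler, reusing the hypothetical `score` helper defined earlier; the step size and iteration count are illustrative, and no Metropolis correction is applied:

```python
import torch

def langevin_sample(energy_net, x_init, n_steps=100, eps=1e-2):
    """Unadjusted Langevin dynamics: x <- x + (eps/2) * s_theta(x) + sqrt(eps) * z."""
    x = x_init.detach()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        # detach each step so the autograd graph does not grow across iterations
        x = (x + 0.5 * eps * score(energy_net, x) + eps**0.5 * z).detach()
    return x
```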
[^1]: Hyvärinen, A., & Dayan, P. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4). https://jmlr.org/papers/volume6/hyvarinen05a/hyvarinen05a.pdf