From MPC to Diffusion Policy

  1. Review of EKF and MPC
  2. Introducing the Energy-Based Model
    2.1. Supervised Training
    2.2. Optimization-Based Inference

In this post, we will try to connect Energy-Based Models with classical optimal control frameworks such as Model-Predictive Control, from the perspective of Lagrangian optimization.

This is one part of the series about energy-based learning and optimal control. A recommended reading order is:

  1. Notes on “The Energy-Based Learning Model” by Yann LeCun, 2021
  2. Learning Data Distribution Via Gradient Estimation
  3. [From MPC to Energy-Based Policy]
  4. How Would Diffusion Model Help Robot Imitation
  5. Causality hidden in EBM

Review of EKF and MPC

Consider a state-space model with process noise and measurement noise as:

$$
\left\{
\begin{aligned}
x[k] &= f(x[k-1], u[k-1], w[k-1]), \quad & w &\sim \mathcal{N}(0, Q_w) \\
z[k] &= h(x[k], v[k]), \quad & v &\sim \mathcal{N}(0, Q_v)
\end{aligned}
\right.
$$

with state $x$ and observation $z$.
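
For concreteness, one instance of this form (an illustrative example, not from the original post) is a discretized double integrator with a noisy position measurement:

$$
\left\{
\begin{aligned}
x[k] &= \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} x[k-1] + \begin{bmatrix} 0 \\ \Delta t \end{bmatrix} u[k-1] + w[k-1] \\
z[k] &= \begin{bmatrix} 1 & 0 \end{bmatrix} x[k] + v[k]
\end{aligned}
\right.
$$

where the state $x = [p, \dot p]^{\top}$ stacks position and velocity, and only the position is measured.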

To perform optimal control on such a system, we first estimate the state from the observations with an Extended Kalman Filter (a minimal code sketch follows the step list):

  • Prediction Step

    • A Priori Estimate:

      $$\hat{x}^{-}[k] = f(\hat{x}[k-1], u[k-1], 0)$$

    • Jacobians of the State Transition Function:

      $$A[k] = \frac{\partial f}{\partial x}\big(\hat{x}[k-1], u[k-1], 0\big), \qquad W[k] = \frac{\partial f}{\partial w}\big(\hat{x}[k-1], u[k-1], 0\big)$$

    • Error Covariance Prediction:

      $$P^{-}[k] = A[k]\,P[k-1]\,A^{\top}[k] + W[k]\,Q_w\,W^{\top}[k]$$

  • Update Step

    • Jacobians of the Measurement Function:

      $$H[k] = \frac{\partial h}{\partial x}\big(\hat{x}^{-}[k], 0\big), \qquad V[k] = \frac{\partial h}{\partial v}\big(\hat{x}^{-}[k], 0\big)$$

    • Kalman Gain:

      $$K[k] = P^{-}[k]\,H^{\top}[k]\left( H[k]\,P^{-}[k]\,H^{\top}[k] + V[k]\,Q_v\,V^{\top}[k] \right)^{-1}$$

    • Posterior Estimate:

      $$\hat{x}[k] = \hat{x}^{-}[k] + K[k]\left( z[k] - h(\hat{x}^{-}[k], 0) \right)$$

    • Error Covariance Update:

      $$P[k] = \left( I - K[k]\,H[k] \right) P^{-}[k]$$
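
Putting the two steps together, here is a minimal NumPy sketch of one EKF cycle. The callbacks `f`, `h`, `jac_f_x`, `jac_f_w`, `jac_h_x`, and `jac_h_v` are placeholders for the model and its Jacobians, to be supplied by the user:

```python
import numpy as np

def ekf_step(x_hat, P, u, z, f, h, jac_f_x, jac_f_w, jac_h_x, jac_h_v, Qw, Qv):
    """One EKF predict/update cycle for the state-space model above."""
    # --- Prediction step ---
    w0 = np.zeros(Qw.shape[0])                    # evaluate at zero process noise
    x_pred = f(x_hat, u, w0)                      # a priori estimate
    A = jac_f_x(x_hat, u, w0)                     # df/dx
    W = jac_f_w(x_hat, u, w0)                     # df/dw
    P_pred = A @ P @ A.T + W @ Qw @ W.T           # a priori error covariance

    # --- Update step ---
    v0 = np.zeros(Qv.shape[0])                    # evaluate at zero measurement noise
    H = jac_h_x(x_pred, v0)                       # dh/dx
    V = jac_h_v(x_pred, v0)                       # dh/dv
    S = H @ P_pred @ H.T + V @ Qv @ V.T           # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_hat = x_pred + K @ (z - h(x_pred, v0))      # posterior estimate
    P = (np.eye(len(x_hat)) - K @ H) @ P_pred     # posterior error covariance
    return x_hat, P
```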

With the estimated state $\hat{x}[k]$, we can perform Model-Predictive Control by solving the following optimization problem at each step (a numerical sketch follows the symbol list below):

$$
\begin{aligned}
\min_{\mathbf{u}} \quad J &= \sum_{i=0}^{N-1} \ell(x[k+i], u[k+i]) + \ell_f(x[k+N]) \\
\ell(x, u) &= (x - x_{\text{ref}})^{\top} Q (x - x_{\text{ref}}) + u^{\top} R u \\
\text{s.t.} \quad & x[k+i+1] = f(x[k+i], u[k+i]) \quad \forall i = 0, \dots, N-1 \\
& x[k] = \hat{x}[k] \\
& x_{\text{min}} \leq x[k+i] \leq x_{\text{max}} \quad \forall i \\
& u_{\text{min}} \leq u[k+i] \leq u_{\text{max}} \quad \forall i
\end{aligned}
$$
  • $\mathbf{u} = \{ u[k], u[k+1], \dots, u[k+N-1] \}$: control input sequence.
  • $N$: prediction horizon.
  • $\ell(x, u)$: stage cost function.
  • $\ell_f(x)$: terminal cost function.
  • $x_{\text{min}}, x_{\text{max}}$: state constraints.
  • $u_{\text{min}}, u_{\text{max}}$: control constraints.
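
As a minimal illustration of this receding-horizon problem, the sketch below solves the horizon with single shooting via `scipy.optimize.minimize`, handling only the control bounds (the state constraints would need penalty terms or a proper NLP solver). All names here are illustrative assumptions, not from the original post:

```python
import numpy as np
from scipy.optimize import minimize

def mpc_control(x_hat, f, x_ref, Q, R, N, u_min, u_max, u_dim):
    """Solve the N-step MPC problem starting from the EKF estimate x_hat
    and return only the first control of the optimized sequence."""
    def cost(u_flat):
        u_seq = u_flat.reshape(N, u_dim)
        x, J = x_hat, 0.0
        for i in range(N):
            J += (x - x_ref) @ Q @ (x - x_ref) + u_seq[i] @ R @ u_seq[i]
            x = f(x, u_seq[i])                      # roll the nominal model forward
        J += (x - x_ref) @ Q @ (x - x_ref)          # terminal cost (same Q here)
        return J

    u0 = np.zeros(N * u_dim)                        # warm start with zeros
    bounds = [(u_min, u_max)] * (N * u_dim)         # box constraints on u only
    res = minimize(cost, u0, bounds=bounds, method="L-BFGS-B")
    return res.x.reshape(N, u_dim)[0]               # receding horizon: apply first input
```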

Introducing the Energy-Based Model

We can replace the state transition function $f$ in the state-space model above with an EBM:

$$p(x[k] \mid x[k-1], u[k-1]) = \frac{e^{-E(x[k],\, x[k-1],\, u[k-1])}}{Z(x[k-1], u[k-1])}$$
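
In practice the energy would be parameterized by a neural network. Below is a minimal PyTorch sketch, assuming a simple MLP over the concatenated transition triple; the class and argument names are illustrative only:

```python
import torch
import torch.nn as nn

class TransitionEnergy(nn.Module):
    """Scalar energy E_theta(x[k], x[k-1], u[k-1]); lower energy means a
    more likely transition. The MLP architecture is only illustrative."""
    def __init__(self, x_dim, u_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * x_dim + u_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_k, x_prev, u_prev):
        # Concatenate the transition triple and map it to a scalar energy.
        return self.net(torch.cat([x_k, x_prev, u_prev], dim=-1)).squeeze(-1)
```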

Supervised Training

Since $Z$ is often intractable, we will use score matching to learn the EBM in what follows.

Here the score function is:

$$s_\theta(x[k], x[k-1], u[k-1]) = s_\theta(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) = -\nabla_{\mathbf{x}} E_\theta(\mathbf{x})$$
  • Dataset: a set of transitions $\{(x_i[k-1], u_i[k-1], x_i[k])\}_{i=1}^{N}$ under the independent and identically distributed (i.i.d.) assumption.

  • The training objective is to minimize the difference between the data score $s_{\text{data}}(\mathbf{x})$ and the model score $s_\theta(\mathbf{x})$. The objective function is defined as follows, where the loss $\mathcal{L}$ is commonly the MSE loss:

    $$J(\theta) = \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\mathcal{L}\big(s_{\text{data}}(\mathbf{x}), s_{\theta}(\mathbf{x})\big)\right] = \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\left\| s_{\text{data}}(\mathbf{x}) - s_{\theta}(\mathbf{x}) \right\|^2\right]$$

However, we cannot access the full data distribution. Following (Hyvärinen, 2005)1, we can rewrite the objective as follows (see Appendix A of the original paper for details).

$$
\begin{aligned}
J(\theta) &= \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\left\| s_\theta(\mathbf{x}) \right\|^2 - 2\, s_\theta(\mathbf{x})^\top s_{\text{data}}(\mathbf{x}) + \left\| s_{\text{data}}(\mathbf{x}) \right\|^2\right] \\
&= \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\left\| s_\theta(\mathbf{x}) \right\|^2\right] - \mathbb{E}_{p_{\text{data}}}\left[s_\theta(\mathbf{x})^\top s_{\text{data}}(\mathbf{x})\right] + \overbrace{\frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\left\| s_{\text{data}}(\mathbf{x}) \right\|^2\right]}^{\text{constant}} \\
J'(\theta) &= \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\left[\left\| s_\theta(\mathbf{x}) \right\|^2\right] - \mathbb{E}_{p_{\text{data}}}\left[s_\theta(\mathbf{x})^\top s_{\text{data}}(\mathbf{x})\right]
\end{aligned}
$$

The cross term can be written as an integral against $\nabla_{\mathbf{x}} p_{\text{data}}$, using $s_{\text{data}}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x}) = \nabla_{\mathbf{x}} p_{\text{data}}(\mathbf{x}) / p_{\text{data}}(\mathbf{x})$:

$$
\mathbb{E}_{p_{\text{data}}}\left[s_\theta(\mathbf{x})^\top s_{\text{data}}(\mathbf{x})\right] = \int p_{\text{data}}(\mathbf{x})\, s_\theta(\mathbf{x})^\top s_{\text{data}}(\mathbf{x})\, d\mathbf{x} = \int s_\theta(\mathbf{x})^\top \nabla_{\mathbf{x}} p_{\text{data}}(\mathbf{x})\, d\mathbf{x}
$$

By integrating by parts, we move the derivative from $p_{\text{data}}(\mathbf{x})$ to $s_\theta(\mathbf{x})$ (assuming the boundary term vanishes, i.e. $p_{\text{data}}(\mathbf{x})\, s_\theta(\mathbf{x}) \to 0$ as $\|\mathbf{x}\| \to \infty$):

$$
\begin{aligned}
\int s_\theta(\mathbf{x})^\top \nabla_{\mathbf{x}} p_{\text{data}}(\mathbf{x})\, d\mathbf{x} &= -\int p_{\text{data}}(\mathbf{x})\, \mathrm{div}_{\mathbf{x}}\, s_\theta(\mathbf{x})\, d\mathbf{x} \\
\mathrm{div}_{\mathbf{x}}\, s_\theta(\mathbf{x}) &= \mathrm{div}_{\mathbf{x}}\left( -\nabla_{\mathbf{x}} E_\theta(\mathbf{x}) \right) = -\Delta_{\mathbf{x}} E_\theta(\mathbf{x}) \\
\mathbb{E}_{p_{\text{data}}}\left[s_\theta(\mathbf{x})^\top s_{\text{data}}(\mathbf{x})\right] &= -\mathbb{E}_{p_{\text{data}}}\left[\mathrm{div}_{\mathbf{x}}\, s_\theta(\mathbf{x})\right]
\end{aligned}
$$

Substituting $\mathrm{div}_{\mathbf{x}}\, s_\theta(\mathbf{x}) = -\Delta_{\mathbf{x}} E_\theta(\mathbf{x})$ back, we eventually obtain the updated objective function:

$$
\begin{aligned}
J'(\theta) &= \frac{1}{2}\,\mathbb{E}_{p_{\text{data}}}\left[\left\| \nabla_{\mathbf{x}} E_\theta(\mathbf{x}) \right\|^2\right] - \mathbb{E}_{p_{\text{data}}}\left[\Delta_{\mathbf{x}} E_\theta(\mathbf{x})\right] \\
J_k(\theta) &= \frac{1}{2}\left\| \nabla_{\mathbf{x}[k]} E_\theta(\mathbf{x}[k]) \right\|^2 - \mathrm{Tr}\left( \nabla_{\mathbf{x}[k]}^2 E_\theta(\mathbf{x}[k]) \right)
\end{aligned}
$$

$\mathrm{Tr}\left( \nabla_{\mathbf{x}[k]}^2 E_\theta(\mathbf{x}[k]) \right)$ denotes the trace of the Hessian of the energy w.r.t. $\mathbf{x}[k]$, i.e. the negative trace of the Jacobian of the score.
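
As a sketch of how this per-sample objective could be computed with automatic differentiation in PyTorch (exact Hessian trace, so only practical for small state dimensions; `energy_model` is assumed to be a module like the sketch above):

```python
import torch

def score_matching_loss(energy_model, x_k, x_prev, u_prev):
    """Hyvarinen score matching loss for the conditional EBM:
    batch mean of 0.5 * ||grad_x E||^2 - tr(Hessian_x E)."""
    x_k = x_k.clone().requires_grad_(True)
    energy = energy_model(x_k, x_prev, u_prev).sum()
    grad = torch.autograd.grad(energy, x_k, create_graph=True)[0]   # dE/dx[k]

    trace = torch.zeros(x_k.shape[0], device=x_k.device)
    for i in range(x_k.shape[-1]):                                  # exact tr(d2E/dx2)
        trace = trace + torch.autograd.grad(
            grad[:, i].sum(), x_k, create_graph=True)[0][:, i]

    return (0.5 * (grad ** 2).sum(dim=-1) - trace).mean()
```

In practice the trace term is usually replaced by a Hutchinson estimator (sliced score matching) or avoided entirely via denoising score matching.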

If this formulation looks unfamiliar compared with what you have seen in diffusion models:

  • There, the training objective is to learn the distribution $q(x_{t-1} \mid x_t, x_0)$, which is a known Gaussian distribution since the noise level is provided. This is also why q-sampling requires $x_0$.

Optimization-Based Inference

Langevin dynamics can produce samples from a probability density $p(\mathbf{x})$ using only the score function $\nabla_{\mathbf{x}} \log p(\mathbf{x})$. Given a fixed step size $\epsilon > 0$ and an initial value $\tilde{\mathbf{x}}_0 \sim \pi(\mathbf{x})$, with $\pi$ a prior distribution, the Langevin method recursively computes the following:

$$\tilde{\mathbf{x}}_t = \tilde{\mathbf{x}}_{t-1} + \frac{\epsilon}{2} \nabla_{\mathbf{x}} \log p(\tilde{\mathbf{x}}_{t-1}) + \sqrt{\epsilon}\, \mathbf{z}_t$$

where $\mathbf{z}_t \sim \mathcal{N}(0, I)$. The distribution of $\tilde{\mathbf{x}}_T$ equals $p(\mathbf{x})$ when $\epsilon \to 0$ and $T \to \infty$; otherwise, a Metropolis-Hastings update is needed to correct the error.
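
A minimal PyTorch sketch of this sampler, assuming a `score_fn` that returns $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ (for the EBM above, $-\nabla_{\mathbf{x}} E_\theta$), with a fixed step size and no Metropolis-Hastings correction:

```python
import torch

def langevin_sample(score_fn, x_init, n_steps=1000, eps=1e-3):
    """Unadjusted Langevin dynamics:
    x_t = x_{t-1} + (eps / 2) * score(x_{t-1}) + sqrt(eps) * z_t."""
    x = x_init.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * eps * score_fn(x) + (eps ** 0.5) * z
    return x
```

Conditioning the score on $(x[k-1], u[k-1])$ turns this sampler into a way of drawing next states from the learned transition model $p(x[k] \mid x[k-1], u[k-1])$.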


  1. Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6, 695–709. https://jmlr.org/papers/volume6/hyvarinen05a/hyvarinen05a.pdf