Notes on Denoising Diffusion Implicit Models
Forward pass
In DDPM, the forward pass is a Markov chain; in DDIM, it is defined as a non-Markovian process.
Given an observation \(\mathbf{x}_0\) drawn from an unknown distribution \(q(\mathbf{x}_0)\), a final timestep \(T\), and the Gaussian distribution
\[q(\mathbf{x}_{T} | \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1 - \bar{\alpha}_T) \mathbf{I}),\]we define the following
\[q(\mathbf{x}_{1:T} | \mathbf{x}_0) = q(\mathbf{x}_T | \mathbf{x}_0) \prod_{t = 2}^{T} q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0).\]Note that the DDIM paper writes \(\alpha_{t}\) for the quantity that the DDPM paper calls \(\bar{\alpha}_{t}\); we use \(\bar{\alpha}_t\) here to stay consistent with DDPM. The process is: 1) start from \(\mathbf{x}_0\); 2) draw \(\mathbf{x}_T\) from \(q(\mathbf{x}_T | \mathbf{x}_0)\); 3) draw \(\mathbf{x}_{T - 1}\) from \(q(\mathbf{x}_{T - 1} | \mathbf{x}_{T}, \mathbf{x}_0)\); 4) draw \(\mathbf{x}_{T - 2}\) from \(q(\mathbf{x}_{T - 2} | \mathbf{x}_{T - 1}, \mathbf{x}_0)\); and so on, until \(\mathbf{x}_1\) is drawn from \(q(\mathbf{x}_1 | \mathbf{x}_2, \mathbf{x}_{0})\). The distribution \(q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0)\) is defined as
\[q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N} ( \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} \frac { \mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0 } { \sqrt{1 - \bar{\alpha}_{t}} }, \sigma_{t}^{2} \mathbf{I} ).\]This definition leads to the following
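As a concrete illustration, here is a minimal NumPy sketch of this forward pass; the linear \(\bar{\alpha}_t\) schedule and the choice of \(\sigma_t\) below are placeholder assumptions, not the schedules used in the papers.

```python
import numpy as np

# Placeholder schedule (an assumption for illustration, not the papers' schedule):
# alpha_bar[t] plays the role of \bar{\alpha}_t and decays from ~1 towards 0.
T = 1000
alpha_bar = np.linspace(1.0, 1e-4, T + 1)
eta = 0.5                                                              # any eta in [0, 1] keeps sigma_t valid
sigma = np.concatenate([[0.0], eta * np.sqrt(1.0 - alpha_bar[:-1])])   # sigma[t]^2 <= 1 - alpha_bar[t-1]

def forward_pass(x0, rng=np.random.default_rng(0)):
    """Draw x_T ~ q(x_T | x_0), then x_{t-1} ~ q(x_{t-1} | x_t, x_0) for t = T, ..., 2."""
    xs = {T: np.sqrt(alpha_bar[T]) * x0 + np.sqrt(1 - alpha_bar[T]) * rng.standard_normal(x0.shape)}
    for t in range(T, 1, -1):
        mean = (np.sqrt(alpha_bar[t - 1]) * x0
                + np.sqrt(1 - alpha_bar[t - 1] - sigma[t] ** 2)
                * (xs[t] - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1 - alpha_bar[t]))
        xs[t - 1] = mean + sigma[t] * rng.standard_normal(x0.shape)
    return xs  # xs[t] is a sample from q(x_t | x_0) for t = 1, ..., T

samples = forward_pass(np.zeros(4))
```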
\[q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N} ( \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I} ).\]Let’s verify it through induction. The equation holds for \(t = T\) by definition. Assume it also holds for some \(t \le T\); we then show it holds for \(t - 1\), i.e., we calculate \(q(\mathbf{x}_{t - 1} | \mathbf{x}_0)\). Since the equation holds for \(q(\mathbf{x}_t | \mathbf{x}_0)\), we have
\[\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon_t\]where \(\epsilon_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\). Given \(\mathbf{x}_0\) and \(\mathbf{x}_t\), we can draw \(\mathbf{x}_{t - 1}\) through \(q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \mathbf{x}_0)\):
\[\mathbf{x}_{t - 1} = \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} \frac { \mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0 } { \sqrt{1 - \bar{\alpha}_{t}} } + \sigma_{t} \epsilon_{t - 1}.\]By substituting \(\mathbf{x}_t\), we have
\[\mathbf{x}_{t - 1} = \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} \epsilon_{t} + \sigma_{t} \epsilon_{t - 1}.\]Since \(\epsilon_t\) and \(\epsilon_{t - 1}\) are independent Gaussian samples, \(q(\mathbf{x}_{t - 1} | \mathbf{x}_{0})\) is also Gaussian. Its mean is \(\sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0\), and its variance is \((1 - \bar{\alpha}_{t - 1} - \sigma_{t}^2) + \sigma_{t}^{2} = 1 - \bar{\alpha}_{t - 1}\), which completes the induction.
An interesting property is that the marginal \(q(\mathbf{x}_{t}|\mathbf{x}_{0})\) does not depend on the choice of \(\sigma_t\).
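The following quick Monte Carlo check illustrates both the induction step and this \(\sigma_t\)-independence for a scalar example; the numerical values of \(\bar{\alpha}_t\), \(\bar{\alpha}_{t - 1}\), and \(\sigma_t\) are arbitrary assumptions.

```python
import numpy as np

# Numerical sanity check of the induction step, for a scalar x_0 and one fixed t
# (the alpha_bar values and sigma_t below are arbitrary assumptions for illustration).
rng = np.random.default_rng(0)
x0, a_t, a_tm1, sigma_t = 1.7, 0.36, 0.49, 0.3        # x_0, \bar{\alpha}_t, \bar{\alpha}_{t-1}, sigma_t
n = 1_000_000

eps_t = rng.standard_normal(n)
x_t = np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * eps_t                      # x_t ~ q(x_t | x_0)
x_tm1 = (np.sqrt(a_tm1) * x0
         + np.sqrt(1 - a_tm1 - sigma_t ** 2) * (x_t - np.sqrt(a_t) * x0) / np.sqrt(1 - a_t)
         + sigma_t * rng.standard_normal(n))                             # x_{t-1} ~ q(x_{t-1} | x_t, x_0)

print(x_tm1.mean(), np.sqrt(a_tm1) * x0)   # both ~ sqrt(alpha_bar_{t-1}) * x_0
print(x_tm1.var(), 1 - a_tm1)              # both ~ 1 - alpha_bar_{t-1}, regardless of sigma_t
```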
Backward pass for a generative process
Recall the form of \(q(\mathbf{x}_T | \mathbf{x}_0)\): if we design \(\bar{\alpha}_{T}\) to be close to 0, it becomes close to a standard Gaussian distribution. Thus, we can draw a sample from the standard normal distribution as an approximation of \(\mathbf{x}_T\). For any \(t\), \(\mathbf{x}_t\) can be written in terms of a noise sample \(\epsilon_{t}\) as above, and a generative network \(\epsilon_{\theta}(\mathbf{x}_t)\) is trained to predict this noise. Note that the network should also depend on the timestep \(t\); we omit it from the notation for simplicity.
With the predicted noise, we can estimate \(\mathbf{x}_0\) as
\[\hat{\mathbf{x}}_0(\mathbf{x}_t) = \frac { \mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\epsilon_{\theta}(\mathbf{x}_t) } { \sqrt{\bar{\alpha}_{t}} }.\]Then, we can draw a sample of \(\mathbf{x}_{t - 1}\) from \(q(\mathbf{x}_{t - 1} | \mathbf{x}_t, \hat{\mathbf{x}}_0)\). This process repeats until we reach \(\mathbf{x}_{1}\). Finally, the original \(\mathbf{x}_0\) is estimated by \(\hat{\mathbf{x}}_0(\mathbf{x}_1)\).
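The generative loop can be sketched as follows; `eps_theta` is a stand-in for the trained noise-prediction network, and the \(\bar{\alpha}_t\) schedule is a placeholder assumption. Setting all \(\sigma_t = 0\) here corresponds to DDIM, discussed below.

```python
import numpy as np

T = 1000
alpha_bar = np.linspace(1.0, 1e-4, T + 1)             # placeholder \bar{\alpha}_t schedule (an assumption)
sigma = np.zeros(T + 1)                               # sigma_t = 0 gives DDIM; other choices give noisier samplers

def eps_theta(x_t, t):
    """Stand-in for the trained noise-prediction network epsilon_theta(x_t, t)."""
    return np.zeros_like(x_t)

def generate(shape, rng=np.random.default_rng(0)):
    x_t = rng.standard_normal(shape)                  # x_T approximated by a standard normal sample
    for t in range(T, 0, -1):
        eps = eps_theta(x_t, t)
        x0_hat = (x_t - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])   # estimate of x_0
        if t == 1:
            return x0_hat                             # final output: \hat{x}_0(x_1)
        # one step of q(x_{t-1} | x_t, \hat{x}_0); note (x_t - sqrt(abar_t) x0_hat)/sqrt(1 - abar_t) = eps
        x_t = (np.sqrt(alpha_bar[t - 1]) * x0_hat
               + np.sqrt(1 - alpha_bar[t - 1] - sigma[t] ** 2) * eps
               + sigma[t] * rng.standard_normal(shape))

sample = generate((4,))
```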
The key is to train \(\epsilon_{\theta}(\mathbf{x}_t)\), which can follow exactly the same procedure as in DDPM. Thus, DDIM and DDPM share the same training process; they differ only at inference time.
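Since the objective is shared with DDPM, a single training step can be sketched as below. This is a minimal PyTorch sketch; the tiny MLP, the crude timestep conditioning, and the random data are placeholder assumptions, whereas in practice \(\epsilon_{\theta}\) is typically a U-Net conditioned on \(t\).

```python
import torch

# Minimal sketch of the shared DDPM/DDIM training step (placeholder model and data).
T = 1000
alpha_bar = torch.linspace(1.0, 1e-4, T + 1)

eps_theta = torch.nn.Sequential(torch.nn.Linear(5, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
opt = torch.optim.Adam(eps_theta.parameters(), lr=1e-4)

def train_step(x0):                                   # x0: batch of clean data, shape (B, 4)
    t = torch.randint(1, T + 1, (x0.shape[0],))       # random timestep per sample
    eps = torch.randn_like(x0)                        # target noise
    a = alpha_bar[t].unsqueeze(1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps        # x_t ~ q(x_t | x_0)
    inp = torch.cat([x_t, t.unsqueeze(1).float() / T], dim=1)   # crude timestep conditioning
    loss = ((eps_theta(inp) - eps) ** 2).mean()       # noise-prediction (simple DDPM) loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(train_step(torch.randn(8, 4)))
```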
\(\sigma_{t} = 0\) for DDIM
When \(\sigma_{t} = 0\), the process is called DDIM. That is, in the forward pass, we draw \(\mathbf{x}_T\) from \(\mathbf{x}_0\) with random noise \(\epsilon_{T}\); all other \(\mathbf{x}_t\) are then drawn deterministically rather than randomly, since \(\sigma_t\) is always 0.
In the backward generative process, we first draw a random sample from the standard normal distribution as an approximation of \(\mathbf{x}_{T}\). All subsequent samples are then produced in the same deterministic way.
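Concretely, setting \(\sigma_t = 0\) in the update above and using \((\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\hat{\mathbf{x}}_0(\mathbf{x}_t)) / \sqrt{1 - \bar{\alpha}_t} = \epsilon_{\theta}(\mathbf{x}_t)\) gives the deterministic step
\[\mathbf{x}_{t - 1} = \sqrt{\bar{\alpha}_{t - 1}} \hat{\mathbf{x}}_0(\mathbf{x}_t) + \sqrt{1 - \bar{\alpha}_{t - 1}} \epsilon_{\theta}(\mathbf{x}_t),\]so the only randomness in the whole generative process comes from the initial draw of \(\mathbf{x}_T\).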
Appropriate \(\sigma_{t}\) for DDPM
When \(\sigma_{t}\) is larger than 0, we introduce some noise in both the forward pass and the backward pass. One special case is when
\[\sigma_{t} = \sqrt{ \frac { 1 - \bar{\alpha}_{t - 1} } { 1 - \bar{\alpha}_{t} } (1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t - 1}}) }.\]Recalling DDPM’s inference process, we know both DDPM and DDIM derive \(\mathbf{x}_{t - 1}\) from \(\hat{\mathbf{x}}_0\) and \(\mathbf{x}_t\). With this choice of \(\sigma_t\), the variance is
\[\sigma_t^2 = \frac { 1 - \bar{\alpha}_{t - 1} } { 1 - \bar{\alpha}_{t} } (1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t - 1}})= \frac { 1 - \bar{\alpha}_{t - 1} } { 1 - \bar{\alpha}_{t} }\beta_{t}.\]Recall that in DDPM, \(\bar{\alpha}_{t} = \prod_{i = 1}^{t} \alpha_{i} = \prod_{i = 1}^{t} (1 - \beta_{i})\), so \(1 - \bar{\alpha}_t / \bar{\alpha}_{t - 1} = \beta_t\). This variance is exactly the variance in Eq. (7) of the DDPM paper [1].
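A quick numerical check of this identity, using an assumed linear \(\beta_t\) schedule purely for illustration:

```python
import numpy as np

# Check that the special sigma_t equals DDPM's posterior standard deviation
# (the beta schedule below is an arbitrary assumption for illustration).
T = 1000
beta = np.linspace(1e-4, 0.02, T)                              # beta_1, ..., beta_T
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - beta)])    # alpha_bar[t] = prod_{i<=t} (1 - beta_i)

t = 500
sigma_ddim = np.sqrt((1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * (1 - alpha_bar[t] / alpha_bar[t - 1]))
sigma_ddpm = np.sqrt((1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * beta[t - 1])   # DDPM Eq. (7) variance
print(np.isclose(sigma_ddim, sigma_ddpm))                      # True
```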
To derive the mean, let’s first calculate
\[\begin{aligned} 1 - \bar{\alpha}_{t - 1} - \sigma_{t}^2 & = 1 - \bar{\alpha}_{t - 1} - \frac { 1 - \bar{\alpha}_{t - 1} } { 1 - \bar{\alpha}_{t} } \frac { \bar{\alpha}_{t - 1} - \bar{\alpha}_{t} } { \bar{\alpha}_{t - 1} } \\ & = \frac { (1 - \bar{\alpha}_t - \bar{\alpha}_{t - 1} + \bar{\alpha}_t\bar{\alpha}_{t - 1}) \bar{\alpha}_{t - 1} - (1 - \bar{\alpha}_{t - 1}) (\bar{\alpha}_{t - 1} - \bar{\alpha}_{t}) } { (1 - \bar{\alpha}_t) \bar{\alpha}_{t - 1} } \\ & = \frac { - 2 \bar{\alpha}_{t} \bar{\alpha}_{t - 1} + \bar{\alpha}_{t} \bar{\alpha}_{t - 1}^{2} + \bar{\alpha}_t } { (1 - \bar{\alpha}_t) \bar{\alpha}_{t - 1} } \\ & = \frac { \bar{\alpha}_t (1 - \bar{\alpha}_{t - 1})^2 } { (1 - \bar{\alpha}_t) \bar{\alpha}_{t - 1} } \end{aligned}\]Then, the mean is
\[\begin{aligned} & \sqrt{\bar{\alpha}_{t - 1}} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} \frac { \mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0 } { \sqrt{1 - \bar{\alpha}_{t}} } \\ = & ( \sqrt{\bar{\alpha}_{t - 1}} - \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} \frac { \sqrt{\bar{\alpha}_t} } { \sqrt{1 - \bar{\alpha}_t} } ) \mathbf{x}_0 + \frac { \sqrt{1 - \bar{\alpha}_{t - 1} - \sigma_t^2} } { \sqrt{1 - \bar{\alpha}_{t}} }\mathbf{x}_t \\ = & ( \sqrt{\bar{\alpha}_{t - 1}} - \frac { \sqrt{\bar{\alpha}_t} (1 - \bar{\alpha}_{t - 1}) } { \sqrt{1 - \bar{\alpha}_t} \sqrt{\bar{\alpha}_{t - 1}} } \frac { \sqrt{\bar{\alpha}_t} } { \sqrt{1 - \bar{\alpha}_t} } ) \mathbf{x}_0 + \frac { \sqrt{\bar{\alpha}_t} (1 - \bar{\alpha}_{t - 1}) } { \sqrt{1 - \bar{\alpha}_{t}} \sqrt{1 - \bar{\alpha}_t} \sqrt{\bar{\alpha}_{t - 1}} }\mathbf{x}_t \\ = & ( \sqrt{\bar{\alpha}_{t - 1}} - \frac { \bar{\alpha}_t (1 - \bar{\alpha}_{t - 1}) } { (1 - \bar{\alpha}_t) \sqrt{\bar{\alpha}_{t - 1}} } ) \mathbf{x}_0 + \frac { \sqrt{\bar{\alpha}_t} (1 - \bar{\alpha}_{t - 1}) } { (1 - \bar{\alpha}_{t}) \sqrt{\bar{\alpha}_{t - 1}} }\mathbf{x}_t \\ = & \frac { \bar{\alpha}_{t - 1} - \bar{\alpha}_{t} } { (1 - \bar{\alpha}_t) \sqrt{\bar{\alpha}_{t - 1}} } \mathbf{x}_0 + \frac { \sqrt{\alpha_t} (1 - \bar{\alpha}_{t - 1}) } { 1 - \bar{\alpha}_{t} }\mathbf{x}_t \\ = & \frac { \sqrt{\bar{\alpha}_{t - 1}} \beta_{t} } { 1 - \bar{\alpha}_t } \mathbf{x}_0 + \frac { \sqrt{\alpha_t} (1 - \bar{\alpha}_{t - 1}) } { 1 - \bar{\alpha}_{t} }\mathbf{x}_t \end{aligned}\]This is exactly the mean in Eq. (7) of the DDPM paper [1]. Thus, with this choice of \(\sigma_t\), the inference process is exactly the same as DDPM's.
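A scalar numerical check that, with this \(\sigma_t\), the DDIM mean coincides with the DDPM posterior mean; all values below are arbitrary illustrative assumptions.

```python
import numpy as np

# Check that the DDIM mean of q(x_{t-1} | x_t, x_0), with the special sigma_t,
# equals the DDPM posterior mean (scalar example with assumed values).
beta_t = 0.02
a_tm1 = 0.6                                            # \bar{\alpha}_{t-1}
a_t = a_tm1 * (1 - beta_t)                             # \bar{\alpha}_t
x0, x_t = 1.3, -0.4
sigma2 = (1 - a_tm1) / (1 - a_t) * (1 - a_t / a_tm1)   # the special sigma_t^2

ddim_mean = (np.sqrt(a_tm1) * x0
             + np.sqrt(1 - a_tm1 - sigma2) * (x_t - np.sqrt(a_t) * x0) / np.sqrt(1 - a_t))
ddpm_mean = (np.sqrt(a_tm1) * beta_t / (1 - a_t) * x0
             + np.sqrt(1 - beta_t) * (1 - a_tm1) / (1 - a_t) * x_t)   # DDPM Eq. (7) mean
print(np.isclose(ddim_mean, ddpm_mean))                # True
```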
References
[1] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.
[2] Song, Jiaming, Chenlin Meng, and Stefano Ermon. “Denoising diffusion implicit models.” arXiv preprint arXiv:2010.02502 (2020).