You have almost certainly heard of OpenAI’s DALL·E 2. It may well be why you are reading this post. As you most likely know, it is a large-scale text-to-image model that demonstrates impressive image-generating capabilities. The results come about through the use of a category of models known as denoising diffusion models. In this post we will learn about the basic diffusion-based generative model introduced in Denoising Diffusion Probabilistic Models (DDPM). The post covers the basics at a relatively high level, but along the way there are optional exercises that will help you delve deeper into the mathematical details.

Animation of CIFAR-10 samples generated from noise by a diffusion model

## Introduction

Diffusion is not unlike the process of creating a painting. The artist starts with a blank canvas and a collection of ideas, thoughts, experiences and observations. We can think of these as latent variables. Then with each brush stroke an image is gradually created. The AI algorithm begins instead with random noise and over a sequence of steps turns it into an image. Arguably any generative model could be characterised in this way but an important aspect of a diffusion model is the iterative nature of the process.

## Reverse process

Consider a dataset of images, denoted by $\xz$, which come from the data distribution $\qr{\xz}$. We would like to be able to generate new images, starting with random noise $\mathbf{x}_T$.

From noise to image

The image generation model works by successively sampling latent variables $\xtt{T}, \ldots, \xtt{1}$, which all have the same dimension as $\xz$, and then finally sampling the image $\xz$. It is modelled as a Markov chain starting from $\xtt{T}$. The elements of the sequence have a joint distribution $p_\theta\left(\xzT\right)$ called the reverse process. Given the next element $\xtt{t+1}$, each element $\xt$ is independent of all the later elements

$p_\theta(\xt \vert \xtt{t+1}, \ldots, \xtt{T}) = p_\theta(\xt \vert \xtt{t+1}) \\ p_\theta\left(\xzT\right) = p\left(\xtt{T}\right)\prod_{t=1}^{T}p_\theta\left(\xtmone \vert \xt\right)$

This gives rise to the following marginal distribution over $x_0$

$p_\theta(x_0) = \int p_\theta(x_{0:T}) dx_{1:T}$

Notice that all but $p\left(\xtt{T}\right)$ are parameterised by $\theta$, indicating that they are learned distributions. Both $p\left(\xtt{T}\right)$ and the learned transitions for $t=2, \ldots, T$ are assumed to be Gaussian, which greatly simplifies the calculation of quantities like the KL-divergence

$p\left(\xtt{T}\right) = \norm{\mathbf{0},\II}$ $\prtacond{\theta}{\xtmone}{\xt} = \norm{\thetamuterm, \sigta}, \text{ }\text{ } t = 2, \ldots, T$

Since the final transition distribution $\prtacond{\theta}{\xz}{\xt}$ needs to yield plausible data samples, it has a different form. This is explained in detail below but in practice we don’t usually need to worry about it.

## Forward process

But how do you learn the reverse process? After all we don’t know $\mathbf{x}_T$ or the intermediate $\mathbf{x}_t$ that gave rise to a given $\mathbf{x}_0$. To solve this, diffusion models introduce a forward process or diffusion process which starts with $x_0$ and gradually adds Gaussian noise over $T$ timesteps, parameterised by a schedule of variances $\beta_1, \ldots, \beta_T$. The forward process is also a Markov chain, but now each latent variable $\xt$ is independent of all the earlier elements given the previous element $\xtmone$.

$q(\xt \vert \xz, \ldots, \xtmone) = \qrcond{\xt}{\xtmone}\\ \qrcond{\xoneT}{\xz} = \prod_{t=1}^T\qrcond{\xt}{\xtmone}$ $\qrcond{\xt}{\xtmone} = \norm{\sqrtonembt{t}\xtmone, \bt\II}$

Given $\xz$ it is possible to sample $\xt$ at an arbitrary timestep.
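
To make the forward process concrete, here is a minimal NumPy sketch that applies the one-step transition repeatedly to a toy input. The linear schedule from $10^{-4}$ to $0.02$ with $T = 1000$ and the helper names `linear_beta_schedule` and `forward_step` are assumptions made for illustration; they are not taken from the code in this post.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed variance schedule beta_1..beta_T (index 0 holds beta_1)."""
    return np.linspace(beta_start, beta_end, T)

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    z = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * z

rng = np.random.default_rng(0)
T = 1000
betas = linear_beta_schedule(T)

x = np.ones(4096)             # a toy "image" with every pixel equal to 1
for beta_t in betas:          # walk the chain x_0 -> x_1 -> ... -> x_T
    x = forward_step(x, beta_t, rng)
x_T = x

# After T steps the sample should look like standard normal noise.
print(x_T.mean(), x_T.std())
```

With this schedule almost nothing of the original signal survives at step $T$, which is why starting the reverse process from $\norm{\mathbf{0},\II}$ is reasonable.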

#### Exercise

Show that

$\qrcond{\xt}{\xz} = \norm{\xt; \sqrt{\abt}\xz, (1 - \abt)\II} \\\\\\ \at := 1 - \bt \\\\\\ \abt := \prod_{s=1}^{t} \att{s}$

• Since $\xz$ is given, you could think about how to sample $x_1$ by first sampling $\mathbf{z} \sim \norm{\mathbf{0}, \II}$ and transforming it
• Then consider how you can repeat the process for successive timesteps
• You should see a pattern emerging where any $x_t$ can be expressed in terms of $\xz$, $\mathbf{z}_1 \ldots \mathbf{z}_t \sim \norm{\mathbf{0}, \II}$ and the variances $\beta_t$
• You just need to find the mean and variance of this expression
• Also recall that all the variances involved are diagonal so you can just use the formula for the 1d case for transforming random variables applied elementwise
$x = \sigma z + \mu$
• The same goes for finding mean and variance

$\Ef{}{ax + b} = a\Ef{}{x} + b$ $\text{Var}(ax + b) = a^2\text{Var}(x)$
• You can use the equality $\sum_{z=1}^Za_z\prod_{i=z+1}^{Z}(1 - a_i) = 1 - \prod_{z=1}^Z (1 - a_z)$ (you can prove it by induction)

One way to approach this is to think about how we can sample from $\qrcond{\xt}{\xtmone}$ given $\xtmone$. Since it is a normal distribution we can sample from it by first sampling a standard normal distribution, then transforming it. Let us start with $t=1$, where $\xtmone$ is the data $\xz$ $$\mathbf{z} \sim \norm{\mathbf{0}, \II} \\\\\\ \xtt{1} = \sqrt{\btt{1}} \mathbf{z} + \sqrt{1-\btt{1}} \xz$$ Armed with $\xtt{1}$ we can sample $\xtt{2}$ and from now for clarity we will use the notation $\mathbf{z}_t$ to indicate the sample from $\norm{\mathbf{0}, \II}$ taken at step $t$ $$\mathbf{z}_2 \sim \norm{\mathbf{0}, \II} \\\\\\ \xtt{2} = \sqrt{\btt{2}} \mathbf{z}_2 + \sqrt{1-\btt{2}} \xtt{1}$$ $$=\sqrt{\btt{2}} \mathbf{z}_2 + \sqrt{1-\btt{2}} \left(\sqrt{\btt{1}} \mathbf{z}_1 + \sqrt{1-\btt{1}} \xz\right)$$ $$=\sqrt{\btt{2}} \mathbf{z}_2 + \sqrt{\btt{1}}\sqrt{1-\btt{2}} \mathbf{z}_1 + \sqrt{1-\btt{2}} \sqrt{1-\btt{1}} \xz$$ We can see a pattern emerging $$\xt = \sum_{s=1}^t\sqrt{\btt{s}}\mathbf{z}_s\prod_{i=s+1}^{t}\sqrt{1 - \btt{i}} + \xz\prod_{s=1}^{t}\sqrt{1 - \btt{s}}$$ Let us now find the mean and the variance of $\xt$.

Noting that the mean of $\mathbf{z}$ is zero $$\Ef{\qrcond{\xt}{\xz}}{\xt} = \xz\prod_{s=1}^{t}\sqrt{1 - \btt{s}} = \xz\sqrt{\prod_{s=1}^{t}{1 - \btt{s}}} = \xz\sqrt{\prod_{s=1}^{t}{\as}}= \sqrt{\abt}\xz$$ as required.

The covariance matrix is just the sum of the covariance matrices of the $\mathbf{z}_i$ terms. Applying the 1d formula elementwise we have the sum of the squares of the $\mathbf{z}_i$ coefficients multiplied by $\II$ $$\text{Var}\left(\xt\right) = \II\sum_{s=1}^t{\btt{s}}\prod_{i=s+1}^{t}\left(1 - \btt{i}\right)$$ Define $$f_t := \sum_{s=1}^t{\btt{s}}\prod_{i=s+1}^{t}\left(1 - \btt{i}\right)$$ We can show that $\text{ }f_t = 1 - \prod_{s=1}^{t}(1 - \btt{s})$ as follows
1. Holds trivially for the base case $t=1$ since $$\sum_{s=1}^1{\btt{s}}\prod_{i=2}^{1}\left(1 - \btt{i}\right) = \btt{1} = 1 - \prod_{s=1}^{1}(1 - \btt{s})$$ where we used the convention that the product over an empty set i.e. $\prod_{i=2}^{1}$ is 1.
2. Now assuming this holds for $t-1$ i.e. $\sum_{s=1}^{t-1}{\btt{s}}\prod_{i=s+1}^{t-1}\left(1 - \btt{i}\right) = 1 - \prod_{s=1}^{t-1}(1 - \btt{s})$, consider the case for $t$ $$\sum_{s=1}^t{\btt{s}}\prod_{i=s+1}^{t}\left(1 - \btt{i}\right)\\\\\\ =\sum_{s=1}^{t-1}{\btt{s}}\prod_{i=s+1}^t\left(1 - \btt{i}\right) + {\btt{t}}\prod_{i=t+1}^t\left(1 - \btt{i}\right)\\\\\\ =\left(1 - \btt{t}\right) \underbrace{\sum_{s=1}^{t-1}{\btt{s}}\prod_{i=s+1}^{t-1}\left(1 - \btt{i}\right)}_{=f_{t-1}} + {\btt{t}} \\\\\\ = \left(1 - \btt{t}\right)\underbrace{\left(1 - \prod_{s=1}^{t-1}(1 - \btt{s})\right)}_{\text{by the induction hypothesis}} + {\btt{t}}$$ $$= 1 - \bt - \prod_{s=1}^{t}(1 - \btt{s}) + \bt = 1 - \prod_{s=1}^{t}(1 - \btt{s})$$
Therefore $$\text{Var}\left(\xt\right) = \left(1 - \prod_{s=1}^{t}(1 - \btt{s})\right)\II = \left(1 - \abt\right)\II$$ as required.
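
The closed form can also be checked numerically. The sketch below iterates the one-step recursions for the coefficient of $\xz$ in the mean and for the variance, then compares them with $\sqrt{\abt}$ and $1 - \abt$. The linear schedule is an assumption chosen only for the check.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed schedule, for illustration only
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # alpha_bar[t-1] = prod_{s<=t} alpha_s

# One forward step multiplies the mean by sqrt(1 - beta_t) and maps the
# variance v to (1 - beta_t) * v + beta_t.
mean_coef, var = 1.0, 0.0
mean_coefs, variances = [], []
for beta_t in betas:
    mean_coef = np.sqrt(1.0 - beta_t) * mean_coef
    var = (1.0 - beta_t) * var + beta_t
    mean_coefs.append(mean_coef)
    variances.append(var)
mean_coefs = np.array(mean_coefs)
variances = np.array(variances)

closed_form_ok = bool(np.allclose(mean_coefs, np.sqrt(alpha_bar))
                      and np.allclose(variances, 1.0 - alpha_bar))
print(closed_form_ok)
```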

## Training

The goal is to learn the reverse process so that the model can generate images from arbitrary latents $\xtt{T}$. To do so we parameterise the reverse process by $\theta$ and then seek to learn $\theta$ to minimise the negative log likelihood $-\log\left(\prta{\theta}{\xz}\right)$. The integral $\int p(\xzT) d\xoneT$ is typically intractable so we minimise an upper bound on the negative log likelihood instead.

#### Exercise

Show that

$E_{q(\xz)}\left[-\log(p(\xz))\right] \leq E_{q(\xzT)}\left[-\log\frac{p(\xzT)}{q(\xoneT \vert \xz)}\right]$

• Use the definition for $p(\xz)$
• Multiply the integrand by something that results in an expectation over $\qrcond{\xoneT}{\xz}$
• Use Jensen’s inequality by which

$\log\left(E_{p(x)}\left[f(x)\right]\right) \geq E_{p(x)}\left[\log\left(f(x)\right)\right]$

We'll drop the negative sign for now to use Jensen's inequality more conveniently then bring it back later. We also assume a continuous distribution here but it will work equally well if you replace the integral with a sum. $$E_{q(x_0)}\left[\log(p(x_0))\right] = \int \log(p(x_0)) q(x_0) dx_0$$ $$=\int \log\left(\int p(x_{0:T}) dx_{1:T}\right) q(x_0) dx_0$$ $$=\int \log\left(\int \frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}{q(x_{1:T}\vert x_0)} dx_{1:T}\right) q(x_0) dx_0$$ Note that now the inner integral is an expectation over $q(x_{1:T}\vert x_0)$ $$\int \frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}{q(x_{1:T}\vert x_0)} dx_{1:T} =E_{q(x_{1:T}\vert x_0)}\left[\frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}\right]$$ By Jensen's inequality: $$\log\left(E_{q(x_{1:T}\vert x_0)}\left[\frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}\right]\right) \geq E_{q(x_{1:T}\vert x_0)}\left[\log\left(\frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}\right)\right]$$ Since this inequality holds for every value of $x_0$ $$\int \log\left(E_{q(x_{1:T}\vert x_0)}\left[\frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}\right]\right) q(x_0) dx_0 \geq \int E_{q(x_{1:T}\vert x_0)}\left[\log\left(\frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}\right)\right]q(x_0)dx_0$$ Expressing the expectation on the right hand side as an integral $$=\int \left(\int \log\left(\frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}\right) q(x_{1:T}\vert x_0)dx_{1:T}\right)q(x_0) dx_0$$ Grouping the integrals over $x_0$ and $x_{1:T}$ $$=\int \int \log\left(\frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}\right) q(x_{1:T}\vert x_0)q(x_0) dx_{1:T}dx_0 \\\\\\ =\int \log\left(\frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}\right) q(x_{0:T})dx_{0:T}$$ Expressing this as an expectation over $q(x_{0:T})$ $$=E_{q(x_{0:T})}\left[\log\left(\frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}\right)\right]$$ Bringing back the negative sign and thus reversing the inequality $$E_{q(x_0)}\left[-\log(p(x_0))\right] \leq E_{q(x_{0:T})}\left[-\log\left(\frac{p(x_{0:T})}{q(x_{1:T}\vert x_0)}\right)\right]$$

How to learn the reverse process is a design choice. The obvious approach might be to train a neural network to predict $\thetamuterm$. However in the DDPM paper the model is trained to predict a noise value $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$.
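
The bound can be illustrated on a toy discrete model, where the integrals become sums that can be evaluated exactly. The sketch below uses a single latent $x_1$ with three states and data $x_0$ with four states; the random tables `p_joint` and `q_cond` are made-up stand-ins for $p(x_{0:1})$ and $q(x_1 \vert x_0)$, not anything from the DDPM model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up joint p(x0, x1) over 4 data states and 3 latent states.
p_joint = rng.random((4, 3))
p_joint /= p_joint.sum()

# Made-up q(x1 | x0): each row is a distribution over the latent.
q_cond = rng.random((4, 3))
q_cond /= q_cond.sum(axis=1, keepdims=True)

q_data = np.full(4, 0.25)                 # q(x0), uniform for simplicity

# Left-hand side: E_{q(x0)}[-log p(x0)] with p(x0) = sum_{x1} p(x0, x1).
p_marginal = p_joint.sum(axis=1)
nll = -(q_data * np.log(p_marginal)).sum()

# Right-hand side: E_{q(x0, x1)}[-log(p(x0, x1) / q(x1 | x0))].
bound = -(q_data[:, None] * q_cond * np.log(p_joint / q_cond)).sum()

print(nll, bound)
```

The gap between the two numbers is the expected KL-divergence between `q_cond` and the true posterior $p(x_1 \vert x_0)$, so the bound is tight only when the two agree.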

Recall that you can sample normally distributed $\xtt{} \sim \norm{\boldsymbol{\mu}, \Sigma}$ by transforming a sample from the standard normal distribution

• Sample $\ftt{z}{} \sim \norm{\mathbf{0}, \II}$
• Apply a linear transform $\xtt{} = \ftt{L}{}\ftt{z}{} + \boldsymbol{\mu}$ where $\ftt{L}{}\ftt{L}{}^T = \Sigma$
• In case of a diagonal covariance $\Sigma = \sigma^2\II$, you have $\ftt{L}{} = \sigma\II \implies \xtt{} = \sigma\ftt{z}{} + \boldsymbol{\mu}$
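
The recipe above can be sketched in NumPy as follows, using a Cholesky factor for $\ftt{L}{}$. The function name `sample_gaussian` and the example mean and covariance are made up for illustration.

```python
import numpy as np

def sample_gaussian(mu, cov, n, rng):
    """Draw n samples from N(mu, cov) via x = L z + mu with L L^T = cov."""
    L = np.linalg.cholesky(cov)                 # lower-triangular factor
    z = rng.standard_normal((n, mu.shape[0]))   # rows are z ~ N(0, I)
    return z @ L.T + mu

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])
samples = sample_gaussian(mu, cov, 200_000, rng)

# The empirical mean and covariance should be close to mu and cov.
print(samples.mean(axis=0))
print(np.cov(samples, rowvar=False))
```

For the diagonal case $\Sigma = \sigma^2\II$ the Cholesky factor is just $\sigma\II$, which recovers $\xtt{} = \sigma\ftt{z}{} + \boldsymbol{\mu}$.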

We can sample in this way from $\qrcond{\xt}{\xz}$ and train the model to predict the noise value

Sample $\xz \sim \qr{\xz}$

    def train_step(self, x0):
        cfg = self.cfg

        if cfg.randflip:
            x0 = tf.image.random_flip_left_right(x0)
        batch_size = tf.shape(x0)[0]
        t = tf.random.uniform([batch_size], 0, cfg.T, dtype=tf.int32)


Sample $\boldeps \sim \norm{\mathbf{0},\II}$

        eps = tf.random.normal(shape=tf.shape(x0))


Find $\xt = \sqrt{\abt}\xz + \sqrt{1 - \abt}\boldeps$ which is a sample from $\qrcond{\xt}{\xz}$

        xt = self.get_xt(x0, eps, t)
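
The body of `get_xt` is not shown in this post. A hypothetical NumPy version, assuming a precomputed array `alpha_bar` of cumulative products and 0-based timestep indices, might look like this:

```python
import numpy as np

def get_xt(x0, eps, t, alpha_bar):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, per sample.

    x0, eps:   arrays of shape (batch, H, W, C)
    t:         integer array of shape (batch,), 0-based (0 means timestep 1)
    alpha_bar: array of shape (T,) holding the cumulative products of 1 - beta
    """
    a = alpha_bar[t].reshape(-1, 1, 1, 1)    # one scalar per sample, broadcast over H, W, C
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(8, 32, 32, 3))
eps = rng.standard_normal(x0.shape)
t = rng.integers(0, 1000, size=8)

xt = get_xt(x0, eps, t, alpha_bar)
print(xt.shape)
```

The key detail is that every sample in the batch has its own timestep, so the coefficients must be gathered per sample and reshaped before broadcasting.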


Predict $\etta\left(\xt, t\right)$ using a model that takes as input the latent $\xt$ and a timestep $t$.

        eps_theta = self.model(self.get_input(xt, t), training=True)


Train using the loss $L_\text{simple}\left(\theta\right) := \Ef{t, \xz, \boldeps}{\left\Vert\boldeps - \etta\left(\sqrt{\abt}\xz + \sqrt{1-\abt}\boldeps, t\right)\right\Vert^2}$ which is an approximation to $L$. See the Appendix for mathematical details if you are interested.

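        # Mean squared error per sample, averaged over all non-batch axes.
        losses = tf.reduce_mean(
            (eps - eps_theta) ** 2,
            axis=tf.range(1, tf.rank(eps))
        )
        loss = tf.reduce_mean(losses)

        # The gradient computation and optimiser update are not shown in this excerpt.
        self.ema.apply(self.model.trainable_variables)
        if tf.equal(self.optimizer.iterations, 1):
            self.ema_decay.assign(cfg.ema_decay)

        return {'loss': loss}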


The paper uses a U-Net model to predict $\etta\left(\xt, t\right)$. The timestep is input as an integer in the range $0, \ldots, T-1$ (representing $1, \ldots, T$) which is transformed to a harmonic embedding similar to the position embedding in a Transformer model.
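
As an illustration of what such an embedding can look like, here is a sketch of the sinusoidal encoding used for Transformer position embeddings, applied to timesteps. The function name, the embedding size and the `max_period` constant are assumptions; the exact scaling used in the paper's implementation may differ.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps of shape (batch,) to (batch, dim) features.

    Assumes dim is even: the first half holds sines, the second half cosines.
    """
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None].astype(np.float64) * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding(np.array([0, 1, 999]), dim=128)
print(emb.shape)
```

Each timestep becomes a fixed vector of sines and cosines at different frequencies, which a network can then process with a couple of dense layers.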

## Sampling

When the model has been trained, we can sample new data points by first sampling random noise $\xtt{T}$, then successively sampling latents $\xt$

Animation of $\xz$ across timesteps

Assume we have $\xt$. To start with we sample $\xtt{T} \sim \pr{\xtt{T}} = \norm{\mathbf{0}, \II}$

    def p_sample_step(self, t, xt):
        batch_shape = tf.shape(xt)


Predict $\etta\left(\xt, t\right)$

        eps_theta = self.model(
            self.get_input(xt, t), training=False
        )


Transform this to $\xz = \frac{1}{\sqrt{\abt}}\left(\xt - \sqrt{1 - \abt}\etta\left(\xt, t\right)\right)$, clipping it to the range $[-1, 1]$. Initially this won’t be a high quality sample but it is used to get $\xtmone$

        x0 = (xt - self.select_timestep(self.sqrt_1_m_alpha_bar, t, xt) * eps_theta) / tf.math.sqrt(
            self.select_timestep(self.alpha_bar, t, xt)
        )
        x0 = tf.clip_by_value(x0, -1, 1)


For $t > 1$, sample $\boldeps \sim \norm{\mathbf{0},\II}$ and return $\xtmone = \sqrt{\abtmone}\xz + \sqrt{1 - \abtmone}\boldeps$, which is a sample from $\qrcond{\xtmone}{\xz}$. At the final step, $t=1$, just return $\xz$.

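        # Use fresh noise for t > 1 and no noise at the final step.
        z = tf.cond(
            tf.greater(t, 1),
            lambda: tf.random.normal(shape=batch_shape),
            lambda: tf.zeros(batch_shape)
        )
        # `model_mean` is computed in a part of the method that is not shown in this excerpt.
        xtm1 = model_mean + z * tf.math.sqrt(self.select_timestep(self.sigma_square, t, xt))
        return x0, xtm1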
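
To see the whole loop in one place, here is a NumPy sketch that follows the steps as described in the text: predict the noise, convert it to a clipped estimate of $\xz$, then draw $\xtmone$ from $\qrcond{\xtmone}{\xz}$. The function `sample` and the stand-in predictor `dummy_eps_model` are hypothetical, and the excerpt above may compute its update differently since it refers to `model_mean` and `sigma_square`, which are not shown.

```python
import numpy as np

def sample(eps_model, shape, alpha_bar, rng):
    """Run the reverse loop from x_T ~ N(0, I) down to x_0.

    eps_model(xt, t) stands in for the trained noise predictor.
    t is 1-based here, so alpha_bar[t - 1] holds alpha_bar_t.
    """
    T = len(alpha_bar)
    xt = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        a_t = alpha_bar[t - 1]
        eps_theta = eps_model(xt, t)
        # Estimate of the clean image, clipped to the data range.
        x0 = np.clip((xt - np.sqrt(1.0 - a_t) * eps_theta) / np.sqrt(a_t), -1.0, 1.0)
        if t == 1:
            return x0                      # final step: return the estimate itself
        a_prev = alpha_bar[t - 2]
        eps = rng.standard_normal(shape)
        xt = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps

def dummy_eps_model(xt, t):
    """Placeholder predictor that always outputs zero noise."""
    return np.zeros_like(xt)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 50))   # short assumed schedule
x0 = sample(dummy_eps_model, (2, 8, 8, 3), alpha_bar, rng)
print(x0.shape, x0.min(), x0.max())
```

With a real trained model in place of `dummy_eps_model`, the intermediate `x0` estimates would gradually sharpen as $t$ decreases.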


### Performance

Unconditional CIFAR-10 samples from the basic DDPM model

Qualitatively the images generated by diffusion look realistic. Diffusion achieves a state-of-the-art Fréchet Inception Distance (FID) on CIFAR-10, which can be seen as an indicator of image quality. However it is inferior to other models with regard to the Inception Score and log-likelihood. It also gets qualitatively impressive results on other datasets like LSUN, but the metrics are not the best.

### Time

A significant disadvantage of diffusion is that you sample by going through every step in turn, making this a slow process. In this respect it is not unlike auto-regressive models where images must be generated a pixel at a time. However the image below shows that the model is starting to generate samples of decent quality well before the last step.

$\xz$ improving across timesteps

Later works show that competitive results can be obtained with fewer sampling steps.

## Extensions

A lot of work has gone into addressing the limitations of diffusion models and extending their capabilities, and many of these improvements are leveraged by DALL·E 2:

• Improved Denoising Diffusion Probabilistic Models (Improved Diffusion) introduces simple modifications that improve the log likelihood.
• Diffusion Models Beat GANs on Image Synthesis manages to get better ImageNet scores compared to the best GAN models.
• In the Denoising Diffusion Implicit Models paper the authors come up with a sampling method that is able to skip some of the steps but still get good quality results.
• Whilst the model considered here is unconditional, later papers such as Improved Diffusion also explore conditional models. It is possible to use different kinds of supervision, such as labels or text inputs, to generate conditional samples, a notable example of which is of course DALL·E 2. There are different approaches to guiding diffusion models. See the GLIDE post for a discussion and references.
• Besides image generation, diffusion can be used to learn other tasks like upsampling and inpainting.

## What’s next

• The appendix contains exercises that will guide you through the derivation of the loss.
• Check out this post about GLIDE to read about the conditional diffusion architecture on which DALL·E 2 is based.

## Appendix: Loss derivation

In this section we will go through the mathematical details to understand how the approach outlined above comes about.

### Terms of the loss function

As a first step we will split $L$ into three losses.

#### Exercise

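Show that the loss $L$ may be written as

$L = \Eq{L_T + \sum_{t>1}^T L_{t-1} + L_0}$ where $L_T := \DKL{\qrcond{\xT}{\xz}}{\pr{\xT}}$, $L_{t-1} := \DKL{\qtttwo{t-1}{t}{0}}{\prtacond{\theta}{\xtmone}{\xt}}$ and $L_0 := -\log\prtacond{\theta}{\xz}{\xtt{1}}$
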
• Use Bayes rule and the Markovian property to write $\qterm$ as a product of conditional distributions $q\left(x_{t-1} \lvert x_t, x_0\right)$
• Then it is simply a question of rearranging everything and applying a little reasoning to make it match the expression
• Remember that $\DKL{q}{p} = \Eq{\log\frac{q}{p}}$
• Also recall if $a$ does not depend on $x$ then $\Ef{q(x)}{a} = a$

Let us consider just the term inside the expectation
$\def\pxztprod{\prod_{t=1}^T\prtacond{\theta}{\xtmone}{\xt}}$ $$-\log{\frac{\pterm}{\qterm}} = -\log{\frac{\pr{\xT}\pxztprod}{\qterm}}$$
$\def\qprod{\prod_{t=1}^T\qrcond{\xt}{\xtmone}}$ $$= -\log{\pr{\xT}} - \log{\frac{\pxztprod}{\qprod}}$$
By the Markov property of the forward process, for $t > 1$ $$\qtt{t}{t-1} = \qtttwo{t}{t-1}{0}$$
and applying Bayes rule we get
$$= \frac{\qtttwo{t-1}{t}{0}\qtt{t}{0}}{\qtt{t-1}{0}}$$
Now consider the product in the denominator of the second term $$\qprod = \frac{\qtttwo{T-1}{T}{0}\qtt{T}{0}}{\qtt{T-1}{0}} \frac{\qtttwo{T-2}{T-1}{0}\qtt{T-1}{0}}{\qtt{T-2}{0}} \times \\\\\\ \ldots\times\frac{\qtttwo{1}{2}{0}\qtt{2}{0}}{\qtt{1}{0}} \qtt{1}{0}$$ $$=\qtt{T}{0}\prod_{t>1}^T\qtttwo{t-1}{t}{0} \\\\\\ \implies \qr{\xzT} = \qr{\xz}\qtt{T}{0}\prod_{t>1}^T\qtttwo{t-1}{t}{0}$$ since the denominators and numerators of the successive terms cancel except for $\qtt{T}{0}$. Now with some rearranging we can write the term inside the expectation as $\def\LTfrac{\frac{\qtt{T}{0}}{\pr{\xT}}}$ $\def\Ltfrac{\frac{\qtttwo{t-1}{t}{0}}{\prtacond{\theta}{\xtmone}{\xt}}}$ $$\log{\LTfrac} + \sum_{t>1}^T \log{\Ltfrac} -\log{\prtacond{\theta}{\xz}{\xtt{1}}}$$ Putting this back into $\Eq{\cdot}$, we trivially find $\Eq{L_0}$. As for the KL-divergence terms, my reasoning for how to get them in the form $\Eq{\text{D}_\text{KL}\left(\ldots\right)}$ is as follows. Recall that the subscript $q$ in the expectation refers to $\qr{\xzT}$.
Using the expression derived above for $\qr{\xzT}$, the expectation over $\log\Ltfrac$ for some $\tau > 1$ becomes $\newcommand\Ltfract[1]{\frac{\qtttwo{#1 - 1}{#1}{0}}{\prtacond{\theta}{\xtt{#1 - 1}}{\xtt{#1}}}}$ $$\Eq{\log\Ltfract{\tau}} = \Ef{\qr{\xz}\qtt{T}{0}\prod_{t>1, t\neq\tau}^T\qtttwo{t-1}{t}{0}}{\Ef{\qtttwo{\tau-1}{\tau}{0}}{\log\Ltfract{\tau}}}$$ $$=\Ef{\qr{\xz}\qtt{T}{0}\prod_{t>1, t\neq\tau}^T\qtttwo{t-1}{t}{0}}{\DKL{\qtttwo{\tau-1}{\tau}{0}}{\prtacond{\theta}{\xtt{\tau - 1} }{ \xtt{\tau} }}}$$ $$=\Ef{\qr{\xz}\qtt{T}{0}\prod_{t>1}^T\qtttwo{t-1}{t}{0}}{\DKL{\qtttwo{\tau-1}{\tau}{0}}{\prtacond{\theta}{\xtt{\tau - 1} }{ \xtt{\tau} }}} \\\\\\ =\Eq{\DKL{\qtttwo{\tau-1}{\tau}{0}}{\prtacond{\theta}{\xtt{\tau - 1} }{ \xtt{\tau} }}}$$ The last two steps hold because the KL term already involves taking the expectation over $\xtt{\tau-1}$, which results in a term that is constant with respect to $\xtt{\tau-1}$, so taking the expectation over it again with respect to a distribution over $\xtt{\tau-1}$ leaves it unchanged. We can repeat the same reasoning to arrive at the term for $\Eq{L_T}$.

As both $\pr{\xT}$ and $\qrcond{\xT}{\xz}$ are fixed, $L_T = \DKL{\qrcond{\xT}{\xz}}{\pr{\xT}}$ does not depend on $\theta$ so we can neglect it from now on.

### Evaluating $L_{t-1}$

To simplify $L_{t-1}$ let us consider the forward process posterior $\qrcond{\xtmone}{\xt,\xz}$. The KL-divergence is straightforward to evaluate for normal distributions. We know that the transitions in both the forward and reverse case as well as $\pr{\xtt{T}}$ are Gaussian. It turns out that the forward process posterior is also Gaussian.

#### Exercise

Show that $\qrcond{\xtmone}{\xt,\xz}$ is a normal distribution with mean

$\tilde{\mu}\left(\xt, \xz \right) := \frac{\sqrt{\abtmone}\bt}{1 - \abt}\xz + \frac{\sqrt{\at}(1 - \abtmone)}{1 - \abt}\xt$

and diagonal covariance $\tbt\II$

• Remember that both $\qrcond{\xtmone}{\xz}$ and $\qrcond{\xt}{\xtmone}$ are Gaussian
• Use Bayes rule and the fact that the forward process is Markovian to derive an expression for $\qrcond{\xtmone}{\xt, \xz}$
• Again since covariances are diagonal we need just consider the distributions for a single element
• You should be able to find an expression proportional to $\exp\left(-(x_{t-1} - \mu)^2/(2\sigma^2)\right)$ where $x_{t-1}$ is a single element of $\xtmone$ and you can show that $\sigma^2$ and $\mu$ are equivalent to the expressions above

Using Bayes rule $$\qrcond{\xtmone}{\xt, \xz} = \frac{\qrcond{\xt}{\xtmone, \xz}\qrcond{\xtmone}{\xz}}{\qrcond{\xt}{\xz}} \\\\\\ = \frac{\qrcond{\xt}{\xtmone}\qrcond{\xtmone}{\xz}}{\qrcond{\xt}{\xz}}$$ since the forward process is Markovian $$\propto \qrcond{\xt}{\xtmone}\qrcond{\xtmone}{\xz}$$ considering only terms that depend on $\xtmone$, since this is a distribution over $\xtmone$. Since all covariances involved are diagonal it suffices to consider the distributions for a single element. $$\qrcond{x_{t-1}}{x_t, x_0} \propto \exp\left({-\frac{\left(x_t - \sqrt{1 - \bt}x_{t-1}\right)^2}{2\bt}}\right) \exp\left({-\frac{\left(x_{t-1} - \sqrt{\abtmone}x_0\right)^2}{2\left(1 - \abtmone\right)}}\right)$$ $$\propto \exp\left(-\left(\frac{1}{2}\left(\frac{1 - \bt}{\bt} + \frac{1}{\left(1 - \abtmone\right)}\right)x_{t-1}^2 - \left(\frac{\sqrt{\abtmone}}{\left(1 - \abtmone\right)}x_0 + \frac{\sqrt{1 - \bt}}{\bt}x_t\right)x_{t-1}\right)\right)$$ $$\propto \exp\left(-\frac{1}{2\sigma^2}\left(x_{t-1} - \mu\right)^2\right) \\\\\\ \frac{1}{\sigma^2} = {\left(\frac{\at}{\bt} + \frac{1}{\left(1 - \abtmone\right)}\right)}\\\\\\ \mu = \sigma^2\left(\frac{\sqrt{\abtmone}}{\left(1 - \abtmone\right)}x_0 + \frac{\sqrt{\at}}{\bt}x_t\right)$$ Notice that this has the form of a Gaussian. Now we can show $\sigma^2$ and $\mu$ are equivalent to the desired expressions $$\frac{1}{\sigma^2} = \frac{\at(1 - \abtmone) + \bt}{\bt(1 - \abtmone)} = \frac{(1 - \bt) - \abt + \bt}{\bt(1 - \abtmone)} \implies \sigma^2 = \frac{1 - \abtmone}{1 - \abt}\bt = \tbt$$ Using the expression for $\sigma^2$ $$\mu = \frac{1 - \abtmone}{1 - \abt}\bt\left( \frac{\sqrt{\abtmone}}{1 - \abtmone}x_0 + \frac{\sqrt{\at}}{\bt}x_t\right) = \frac{\sqrt{\abtmone}\bt}{1 - \abt}x_0 + \frac{\sqrt{\at}(1 - \abtmone)}{1 - \abt}x_t$$ and $\tilde{\mu}\left(\xt, \xz \right)$ is simply the expression above applied to each element.

We have said that the reverse process transition distributions $\prtacond{\theta}{\xtmone}{\xt}$ are also Gaussian but we have not yet mentioned what form they take. Here we consider the approach from the DDPM paper where only the mean is learned whilst the covariance is set as $\sigta = \sigma_t^2 \II$ where $\sigma_t$ is a hyperparameter. Using a diagonal covariance considerably simplifies the KL-divergence term in $L_{t-1}$.
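
As a sanity check, the variance and mean coefficients obtained by completing the square can be compared numerically against the closed forms from the exercise. The linear schedule below is an assumption used only for the check, and $t = 1$ is skipped because $1 - \bar{\alpha}_0 = 0$ there.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)     # assumed schedule, for illustration only
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])   # alpha_bar_0 := 1

# Drop t = 1, where the posterior is degenerate.
b, a = betas[1:], alphas[1:]
ab, abp = alpha_bar[1:], alpha_bar_prev[1:]

# Variance and mean coefficients read off after completing the square.
var = 1.0 / (a / b + 1.0 / (1.0 - abp))
coef_x0 = var * np.sqrt(abp) / (1.0 - abp)
coef_xt = var * np.sqrt(a) / b

# Closed forms stated in the exercise.
beta_tilde = (1.0 - abp) / (1.0 - ab) * b
posterior_ok = bool(np.allclose(var, beta_tilde)
                    and np.allclose(coef_x0, np.sqrt(abp) * b / (1.0 - ab))
                    and np.allclose(coef_xt, np.sqrt(a) * (1.0 - abp) / (1.0 - ab)))
print(posterior_ok)
```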

#### Exercise

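Show that $\def\tilmuterm{\tilde{\mu}\left(\xt, \xz\right)}$ $\def\thetamuterm{\mu_\theta\left(\xt, t\right)}$ $L_{t-1} = \DKL{\qtttwo{t-1}{t}{0}}{\prtacond{\theta}{\xtmone}{\xt}}= \text{const.} + \frac{\left\Vert\tilmuterm- \thetamuterm\right\Vert^2}{2\sigma_t^2}$

• You can use the approach in other exercises of simply considering one coordinate and then extending the answer to the multivariate case
• Neglect any constant that does not depend on $\theta$
• Recall that for a Gaussian distribution $\Ef{}{x^2} = \sigma^2 + \mu^2$

$$\DKL{\qtttwo{t-1}{t}{0}}{\prtacond{\theta}{\xtmone}{\xt}} = \Ef{\qtttwo{t-1}{t}{0}}{\log\frac{\qtttwo{t-1}{t}{0}}{\prtacond{\theta}{\xtmone}{\xt}}} \\\\\\ = \text{const.} + \Ef{\qtttwo{t-1}{t}{0}}{-\frac{\left\Vert\xtmone - \tilmuterm\right\Vert^2}{2\tbt} + \frac{\left\Vert\xtmone - \thetamuterm\right\Vert^2}{2\sigma_t^2}} \\\\\\ = \text{const.} + \frac{1}{2\sigma_t^2}\Ef{\qtttwo{t-1}{t}{0}}{\left\Vert\xtmone - \thetamuterm\right\Vert^2} \\\\\\ = \text{const.} + \frac{\left\Vert\tilmuterm - \thetamuterm\right\Vert^2}{2\sigma_t^2}$$ The expectation of the first term in the second line does not depend on $\theta$, so it is absorbed into the constant. The last step expands the square and uses $\Ef{}{x^2} = \sigma^2 + \mu^2$ elementwise, where the variance contribution is again constant. Here $\text{const.}$ stands for any constants that don't depend on the parameters or the data.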
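
The claim that the KL-divergence and the squared difference of means differ only by a constant can be checked with the standard closed form for the KL-divergence between two isotropic Gaussians. In the sketch below `kl_gauss` and the example variances are made up for illustration.

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q I) || N(mu_p, var_p I) ) for isotropic Gaussians."""
    d = mu_q.shape[0]
    return 0.5 * (d * np.log(var_p / var_q) + d * var_q / var_p - d
                  + np.sum((mu_q - mu_p) ** 2) / var_p)

rng = np.random.default_rng(0)
mu_tilde = rng.standard_normal(5)
beta_tilde, sigma_sq = 0.01, 0.02         # arbitrary fixed variances

# The gap between the KL and the squared-error term should be the same
# whatever value the learned mean takes.
gaps = []
for _ in range(5):
    mu_theta = rng.standard_normal(5)
    kl = kl_gauss(mu_tilde, beta_tilde, mu_theta, sigma_sq)
    gaps.append(kl - np.sum((mu_tilde - mu_theta) ** 2) / (2.0 * sigma_sq))

print(max(gaps) - min(gaps))
```

Because the gap does not change with `mu_theta`, minimising the KL-divergence with respect to the mean is the same as minimising the scaled squared distance between the two means.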

### Noise prediction

As mentioned above the model is trained to predict a noise value $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, from which the mean can be derived. To start with let us sample from $\qrcond{\xt}{\xz}$ by first sampling $\boldeps \sim \norm{\mathbf{0}, \II}$ and transforming it.

#### Exercise

Show that by reparameterising $\xt$ as $\xt\left(\xz, \boldeps\right) = \sqrt{\abt}\xz + \sqrt{1 - \abt}\boldeps$, $L_{t-1}$ may be written as


$L_{t-1} - \text{const.} =\Ef{\xz, \boldeps}{\frac{1}{2\sigma_t^2}\left\Vert\frac{1}{\sqrt{\at}}\left(\xtxzep - \frac{\bt}{\sqrt{1 - \abt}}\boldeps\right) -\mutxt\right\Vert^2}$

• You just need to find an expression for $\tilmuterm$ and plug it into $L_{t-1}$

We just need to find $\tilmuterm$ in terms of $\xt\left(\xz, \boldeps\right)$ and $\boldeps$. For simplicity we will only write $\xt$ in the derivation. Inverting the reparameterisation gives $$\xz = \frac{1}{\sqrt{\abt}}\left(\xt - \sqrt{1 - \abt}\boldeps\right)$$ Substituting this into the posterior mean and using $\sqrt{\abtmone}/\sqrt{\abt} = 1/\sqrt{\at}$ $$\tilmuterm = \frac{1}{\sqrt{\at}(1 - \abt)}\left(\left(\bt + \at(1 - \abtmone)\right)\xt - \sqrt{1 - \abt}\bt\boldeps \right) \\\\\\ = \frac{1}{\sqrt{\at}(1 - \abt)}\left((1 - \at) + \at(1 - \abtmone)\right)\xt - \frac{\sqrt{1 - \abt}\bt}{\sqrt{\at}{(1 - \abt)}}\boldeps \\\\\\ = \frac{1}{\sqrt{\at}}\left(\xtxzep - \frac{\bt}{\sqrt{1 - \abt}}\boldeps\right)$$ Since $\xt$ is fully determined by $\xz$ and $\boldeps$ we can write the loss as an expectation over only $\xz$ and $\boldeps$.
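
A quick numerical check of this rewriting: the sketch below evaluates the forward process posterior mean once in terms of $\xz$ and $\xt$ and once in terms of $\xt$ and $\boldeps$, with the prefactor written using $\at = 1 - \bt$. The schedule and the timestep are arbitrary choices for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # assumed schedule, for illustration only
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 500                                    # arbitrary 1-based timestep with t > 1
b, a = betas[t - 1], alphas[t - 1]
ab, abp = alpha_bar[t - 1], alpha_bar[t - 2]

x0 = rng.uniform(-1.0, 1.0, size=16)
eps = rng.standard_normal(16)
xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

# Posterior mean written in terms of x0 and xt.
mu_posterior = (np.sqrt(abp) * b * x0 + np.sqrt(a) * (1.0 - abp) * xt) / (1.0 - ab)

# The same mean written in terms of xt and the noise eps.
mu_reparam = (xt - b / np.sqrt(1.0 - ab) * eps) / np.sqrt(a)

print(np.allclose(mu_posterior, mu_reparam))
```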

To predict $\mutxt$ directly, the steps would be

• Sample $\xz \sim \qr{\xz}$, $\xt \sim \qtt{t}{0}$
• Predict $\mutxt$

However the reparameterised version of $L_{t-1}$ suggests another possibility. We see that the optimal value of $\mutxt$ is $\tilde{\mu}\left(\xt\left(\xz, \boldeps\right), \xz\right)= \muoptim$. It might be a good idea to use this form for the mean. Now the steps become

• Sample $\xz \sim \qr{\xz}$, $\boldeps \sim \norm{\mathbf{0},\II}$
• Find $\xt = \sqrt{\abt}\xz + \sqrt{1 - \abt}\boldeps$
• Predict $\etta\left(\xt, t\right)$
• Estimate $\hat{\mu} = \frac{1}{\sqrt{\at}}\left(\xt - \frac{\bt}{\sqrt{1 - \abt}}\etta\right)$

#### Exercise

As it stands $L_{t-1}$ is proportional to the squared difference of the means $\tilde{\mu}$ and $\hat{\mu}$. Plug in the expression for $\hat{\mu}$ in terms of $\etta$ and transform $L_{t-1}$ into an expression that is proportional to the squared difference of $\boldeps$ and $\etta$ up to a constant.

• Just plug in the expression for $\hat{\mu}$ and simplify

$\def\xtxzetta{\xt\left(\xz, \boldeps_\theta\right)}$ $$L_{t-1} - \text{const.} =\Ef{\xz, \boldeps}{\frac{1}{2\sigma_t^2}\left\Vert\tilde{\mu} - \hat{\mu}\right\Vert^2}$$ $$=\Ef{\xz, \boldeps}{\frac{1}{2\sigma_t^2}\left\Vert\frac{1}{\sqrt{\at}}\left(\xt - \frac{\bt}{\sqrt{1 - \abt}}\boldeps\right) - \frac{1}{\sqrt{\at}}\left(\xt - \frac{\bt}{\sqrt{1 - \abt}}\etta\right)\right\Vert^2}$$ $$=\Ef{\xz, \boldeps}{\frac{\bt^2}{2\sigma_t^2\at(1-\abt)}\left\Vert\boldeps - \etta\right\Vert^2}$$ $$=\Ef{\xz, \boldeps}{\frac{\bt^2}{2\sigma_t^2\at(1-\abt)}\left\Vert\boldeps - \etta\left(\sqrt{\abt}\xz + \sqrt{1 - \abt}\boldeps, t\right)\right\Vert^2}$$

### $\prtacond{\theta}{\xz}{\xtt{1}}$ as an independent discrete decoder

$L_0$ is easy to evaluate. We just need to specify what form $\prtacond{\theta}{\xz}{\xtt{1}}$ takes.

The data $\xz$ is assumed to consist of integers $\{0, 1, \ldots, 255\}$ linearly scaled to lie in the interval $[-1, 1]$. The form of the distribution for the last term of the reverse process is given by the product of the elementwise distributions


$\prtacond{\theta}{\xz}{\xtt{1}} = \prod_{i=1}^D \int_{\deltaxzi{-}}^{\deltaxzi{+}} \norm{x;\mu^i_\theta\left(\xtt{1}, 1\right), \sigma_1^2} dx$ $\deltax{+}{x}= \left\{ \begin{array}{ll} \infty & x = 1 \\ x + \frac{1}{255} & x < 1 \\ \end{array} \right.$ $\deltax{-}{x}= \left\{ \begin{array}{ll} -\infty & x = -1 \\ x - \frac{1}{255} & x > -1 \\ \end{array} \right.$

This is more easily visualised

For $-1 < x < 1$, it is the interval $\left[x-\frac{1}{255}, x+\frac{1}{255}\right]$. For $x=1$ it is the region to the right of $x - \frac{1}{255}$ and for $x =-1$ it is the region to the left of $x + \frac{1}{255}$. For each element the distribution can look different as it depends on $\mu^i_\theta\left(\xtt{1}, 1\right)$. For example if the dataset contains lots of outdoor images where the sky is seen at the top of the image, blue channel elements in this region would have a high probability for $x=1$.

#### Exercise

Show that the expression for $\prtacond{\theta}{\xz}{\xtt{1}}$ is a valid probability distribution.

• Express the integrals for each case of $\xzi$ in terms of cumulative distribution functions $F_X(x)$
• Note that the sum over all $\xz$ may be written as product of sums over each $\xzi$
• Then you should be able to show that terms in the sum over each $\xzi$ cancel so that the result is 1

For a given $\xzi$ $$\int_{\deltaxzi{-}}^{\deltaxzi{+}} \norm{x;\mu^i_\theta\left(\xtt{1}, 1\right), \sigma_1^2}dx= \left\{ \begin{array}{ll} \int_{1 - \frac{1}{255}}^{\infty} \norm{x;\mu^i_\theta\left(\xtt{1}, 1\right), \sigma_1^2}dx = P\left(X \geq 1 - \frac{1}{255}\right) & \xzi = 1 \\ \int_{\xzi - \frac{1}{255}}^{\xzi + \frac{1}{255}} \norm{x;\mu^i_\theta\left(\xtt{1}, 1\right), \sigma_1^2}dx = P\left(\xzi - \frac{1}{255} \leq X \leq \xzi + \frac{1}{255}\right) & -1 < \xzi < 1 \\ \int_{-\infty}^{-1 + \frac{1}{255}} \norm{x;\mu^i_\theta\left(\xtt{1}, 1\right), \sigma_1^2}dx = P\left(X \leq -1 + \frac{1}{255}\right) & \xzi = -1 \\ \end{array} \right.$$ Summing over $\xz$ and noting that we can write this as a product of sums over each $\xzi$ $$\sum_{\xz} \prtacond{\theta}{\xz}{\xtt{1}} = \sum_{x_0^1=-1}^{1}\ldots\sum_{x_0^D=-1}^{1}\prod_{i=1}^D \int_{\deltaxzi{-}}^{\deltaxzi{+}} \norm{x;\mu^i_\theta\left(\xtt{1}, 1\right), \sigma_1^2} dx \\= \prod_{i=1}^D \sum_{\xzi=-1}^{1}\int_{\deltaxzi{-}}^{\deltaxzi{+}} \norm{x;\mu^i_\theta\left(\xtt{1}, 1\right), \sigma_1^2} dx$$ The transformation to go from $[0, 255]$ to $[-1, 1]$ is $\frac{2x}{255} - 1$ so the intervals between consecutive scaled values will be $\frac{2}{255}$. Expanding the sum for a single element $i$ and plugging in the values for the integrals $$\sum_{\xzi=-1}^{1}\int_{\deltaxzi{-}}^{\deltaxzi{+}} \norm{x;\mu^i_\theta\left(\xtt{1}, 1\right), \sigma_1^2} dx \\= P\left(X \geq 1 - \frac{1}{255}\right) + \sum_{-1+\frac{2}{255}}^{1-\frac{2}{255}}P\left(\xzi - \frac{1}{255} \leq X \leq \xzi + \frac{1}{255}\right) + P\left(X\leq -1 + \frac{1}{255}\right)$$ $$= 1 - F_X\left(1 - \frac{1}{255}\right) + \sum_{-1+\frac{2}{255}}^{1-\frac{2}{255}}\left(F_X\left(\xzi + \frac{1}{255}\right) - F_X\left(\xzi - \frac{1}{255}\right)\right) + F_X\left(-1 + \frac{1}{255}\right)$$ where $F_X$ is the cumulative distribution function.
$$= 1 - F_X\left(1 - \frac{1}{255}\right) + \sum_{-1+\frac{2}{255}}^{1-\frac{2}{255}}F_X\left(\xzi + \frac{1}{255}\right) - \sum_{-1}^{1-\frac{4}{255}}F_X\left(\xzi + \frac{1}{255}\right) + F_X\left(-1 + \frac{1}{255}\right)$$ where we shifted $\xzi$ by $-\frac{2}{255}$ for the second term in the summation $$= 1 - F_X\left(1 - \frac{1}{255}\right) + F_X\left(1 - \frac{1}{255}\right) - F_X\left(-1 + \frac{1}{255}\right) + F_X\left(-1 + \frac{1}{255}\right) \\=1$$ Since every factor in the product over $i$ equals 1 $$\sum_{\xz} \prtacond{\theta}{\xz}{\xtt{1}} = 1$$
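
The same result can be confirmed numerically for a single element. The sketch below computes the probability of each of the 256 pixel values under the discretised Gaussian and sums them. The helper names and the example values of the mean and standard deviation are made up for illustration.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probability(k, mu, sigma):
    """Probability the decoder assigns to pixel value k in {0, ..., 255}."""
    x = 2.0 * k / 255.0 - 1.0                  # scale to [-1, 1]
    # The edge bins extend to +infinity and -infinity respectively.
    upper = 1.0 if k == 255 else normal_cdf(x + 1.0 / 255.0, mu, sigma)
    lower = 0.0 if k == 0 else normal_cdf(x - 1.0 / 255.0, mu, sigma)
    return upper - lower

mu, sigma = 0.3, 0.05                           # arbitrary example values
probs = [bin_probability(k, mu, sigma) for k in range(256)]
total = sum(probs)
best = max(range(256), key=lambda k: probs[k])  # most likely pixel value

print(total, best)
```

The most likely pixel value is the one whose bin contains the mean, and the total is 1 up to floating point error because consecutive CDF terms cancel exactly as in the derivation.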

### The final training objective

Whilst it is possible to use the loss directly in its simplified form, in the paper they find that the mean squared distance between the sampled noise $\boldeps$ and the predicted noise $\boldeps_\theta$ yields better sample quality

$L_\text{simple}\left(\theta\right) := \Ef{t, \xz, \boldeps}{\left\Vert\boldeps - \etta\left(\sqrt{\abt}\xz + \sqrt{1-\abt}\boldeps, t\right)\right\Vert^2}$

As with a lot of things in deep learning what works in practice does not necessarily accord with theory. Nevertheless it is possible to argue that $L_\text{simple}\left(\theta\right)$ is an approximation to the original $L$.

#### Exercise

Show that

• For $t>1$ that $L_\text{simple}\left(\theta\right)$ corresponds to an unweighted version of $L_{t-1}$
• For $t=1$ it is $L_0$ with the integral in $\prtacond{\theta}{\xz}{\xtt{1}}$ approximated by the probability density function of the Gaussian distribution multiplied by the bin width, neglecting edge effects and leaving out the timestep-dependent weight.

(As noted earlier $L_T$ does not depend on $\theta$ so we don’t consider it here)

• For $t>1$ you just need to identify the weights that are left out
• For $t=1$, do the approximation of the integral, then use reparameterisation of $\xtt{1}$ and $\mu_\theta$ to introduce the noise terms and eliminate the $\xtt{}$ terms and leave out the weights as for $t>1$

Write the loss as $$L =\Ef{t}{L_{t-1}}$$ where $t = 1, \ldots, T$. It is easy to show this for $t > 1$ since $$L_{t-1} -\text{const}. = \Ef{\xz, \boldeps}{\frac{\bt^2}{2\sigma_t^2\at(1 - \abt)}\left\Vert\boldeps - \etta\left(\sqrt{\abt}\xz + \sqrt{1-\abt}\boldeps, t\right)\right\Vert^2} = \frac{\bt^2}{2\sigma_t^2\at(1 - \abt)}\Ef{\xz, \boldeps}{\left\Vert\boldeps - \etta\left(\sqrt{\abt}\xz + \sqrt{1-\abt}\boldeps, t\right)\right\Vert^2}$$ where $\frac{\bt^2}{2\sigma_t^2\at(1 - \abt)}$ can be regarded as a timestep-dependent weight. Neglecting that you get $$L_{t-1,\text{simple}} - \text{const}. = \Ef{\xz, \boldeps}{\left\Vert\boldeps - \etta\left(\sqrt{\abt}\xz + \sqrt{1-\abt}\boldeps, t\right)\right\Vert^2}$$ The $t=1$ case involves a bit more work. The loss is given as $$L_0 = -\log{\prtacond{\theta}{\xz}{\xtt{1}}} = \text{const}. - \sum_{i=1}^D\log\int_{\deltaxzi{-}}^{\deltaxzi{+}} \norm{x;\mu^i_\theta\left(\xtt{1}, 1\right), \sigma_1^2}dx$$ Neglecting the case of $\xzi=\pm 1$ i.e. the edge values, the limits of the integral are $\xzi - \frac{1}{255}$ and $\xzi + \frac{1}{255}$. Approximating the integral by the value of the integrand at the centre of the interval $\xzi$ multiplied by the width of the interval $\frac{2}{255}$ $\def\normexpterm{\frac{\left(\xzi - \mu_\theta^i\right)^2}{2\sigma_1^2}}$ $\def\normterm{\frac{1}{\sqrt{2\pi}\sigma_1}\exp\left({-\normexpterm}\right)}$ $$\approx \text{const}. - \sum_{i=1}^D\log \left(\frac{2}{255}\normterm\right)$$ Keeping only the terms of the right-hand side that depend on $\theta$ $$\text{RHS} = \text{const}. + \sum_{i=1}^D\normexpterm = \text{const}. + \frac{1}{2\sigma_1^2}\left\Vert\xz - \mu_\theta\left(\xtt{1}, 1\right)\right\Vert^2$$ $\def\abo{\bar{\alpha}_1}$ Substituting $\xtt{1} = \sqrt{\bar{\alpha}_1}\xz + \boldeps\sqrt{1 - \bar{\alpha}_1}$ and using the parameterisation for $\mu_\theta$ $$\mu_\theta\left(\xtt{1}, 1\right) = \frac{1}{\sqrt{\abo}}\left(\xtt{1} - \frac{\beta_1}{\sqrt{1 - \abo}}\boldeps_\theta\left(\xtt{1}, 1\right)\right)$$ $$=\frac{1}{\sqrt{\alpha_1}}\left(\sqrt{\alpha_1}\xz + \boldeps\sqrt{\beta_1} - \sqrt{\beta_1}\boldeps_\theta\left(\xtt{1}, 1\right)\right)$$ Plugging in this expression for $\mu_\theta$ the $\theta$-dependent part of the RHS becomes $$\frac{1}{2\sigma_1^2}\frac{\beta_1}{\alpha_1}\left\Vert\boldeps - \etta\left(\xtt{1}, 1\right)\right\Vert^2$$ Recognising that $\frac{1}{2\sigma_1^2}\frac{\beta_1}{\alpha_1} = \left.\frac{\bt^2}{2\sigma_t^2\at(1 - \abt)}\right\vert_{t=1}$ and leaving it out as we did for $t > 1$ we get $$L_{0, \text{simple}} - \text{const}. = \Ef{\xz, \boldeps}{\left\Vert\boldeps - \etta\left(\sqrt{\abo}\xz + \sqrt{1-\abo}\boldeps, 1\right)\right\Vert^2}$$ Since $L_{t-1, \text{simple}}$ has the same form for each $t$ (neglecting constant terms) $$L_{\text{simple}}\left(\theta\right) =\Ef{t}{L_{t-1, \text{simple}}} = \Ef{t, \xz, \boldeps}{\left\Vert\boldeps - \etta\left(\sqrt{\abt}\xz + \sqrt{1-\abt}\boldeps, t\right)\right\Vert^2}$$
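
To get a feel for what is being discarded, the timestep-dependent weight can be computed for a concrete schedule. The sketch below assumes a linear schedule and the choice $\sigma_t^2 = \beta_t$; both are assumptions for illustration, since this post treats $\sigma_t$ as a free hyperparameter.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)      # assumed schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
sigma_sq = betas                            # assumed choice sigma_t^2 = beta_t

# Weight multiplying ||eps - eps_theta||^2 in L_{t-1}; L_simple replaces it by 1.
weights = betas ** 2 / (2.0 * sigma_sq * alphas * (1.0 - alpha_bar))

# At t = 1 the general expression reduces to beta_1 / (2 sigma_1^2 alpha_1).
w1 = betas[0] / (2.0 * sigma_sq[0] * alphas[0])

print(weights[0], weights[-1], np.isclose(weights[0], w1))
```

Under these assumptions the weight at the first timestep is considerably larger than at the last, so dropping it changes the relative emphasis the loss places on different noise levels.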