

When batching inputs for sequence models, the sequences often have different lengths, so the shorter ones must be padded before the batch can be passed in as a single tensor. For example, here is a pair of lines of dialogue from Twelfth Night, Act 2, Scene 4, which differ in length:

[Figure: a simple example of two lines of different lengths]
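As a rough illustration of the padding step, here is a minimal NumPy sketch. The helper name `pad_batch` and the token ids are hypothetical, chosen only for this example, and padding to the longest sequence in the batch is an assumption rather than something the post prescribes.

```python
import numpy as np

def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length token id lists to the longest length (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    # Start from a batch filled with the pad value, then copy each sequence in.
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch

# Two made-up token id sequences of lengths 4 and 3.
batch = pad_batch([[1, 1, 2, 6], [1, 5, 5]])
print(batch)
```

Index 0 is reserved for padding here, which matches the convention both embedding layers in this post rely on.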

However, you don’t want the pad positions to influence the weight updates. In this post we will look at how PyTorch and TensorFlow handle this in their respective embedding layers.

import torch
import tensorflow as tf
import numpy as np


torch.nn.Embedding(
    num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0,
    scale_grad_by_freq=False, sparse=False, _weight=None
)


tf.keras.layers.Embedding(
    input_dim, output_dim, embeddings_initializer='uniform',
    embeddings_regularizer=None, activity_regularizer=None,
    embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs
)

padding_idx in PyTorch

From the PyTorch documentation:

padding_idx (int, optional) – If specified, the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training, i.e. it remains as a fixed “pad”. For a newly constructed Embedding, the embedding vector at padding_idx will default to all zeros, but can be updated to another value to be used as the padding vector.
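The behaviour described in the documentation can be checked with a small sketch. The pad value 0.5, the learning rate, and the use of plain SGD are assumptions made only for illustration.

```python
import torch

torch.manual_seed(0)
embed = torch.nn.Embedding(7, 1, padding_idx=0)

# A newly constructed layer starts with an all-zero vector at padding_idx.
initial_pad = embed.weight[0].detach().clone()

# The pad vector can be overwritten by hand (0.5 is an arbitrary choice).
with torch.no_grad():
    embed.weight[0] = torch.tensor([0.5])

tokens = torch.tensor([[1, 1, 2, 6, 0],
                       [1, 5, 5, 0, 0]])
optimizer = torch.optim.SGD(embed.parameters(), lr=0.1)

before = embed.weight.detach().clone()
loss = embed(tokens).sum()
loss.backward()
optimizer.step()
after = embed.weight.detach().clone()

print(initial_pad)           # zero vector at construction time
print(before[0], after[0])   # pad row is the same before and after the step
print(before[1], after[1])   # row for token 1 has moved
```

Since the row at padding_idx receives no gradient, the optimizer step leaves whatever value it holds untouched.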

embed_size = 7
embed_pyt = torch.nn.Embedding(embed_size, 1, padding_idx=0, scale_grad_by_freq=False)
lin_pyt = torch.nn.Linear(1, 1)
arr = np.stack([[1, 1, 2, 6, 0],
                [1, 5, 5, 0, 0]])
inp_pyt = torch.from_numpy(arr)

Run a forward pass, weighting each location randomly to force the gradients for each embedding to be different:

z = lin_pyt(embed_pyt(inp_pyt))
weight = torch.rand_like(z)  # random per-position weights
z2 = torch.sum(z*weight)
z2.backward()  # populates embed_pyt.weight.grad

Note that the weight for padding_idx (0 here) is zero:

Parameter containing:
tensor([[ 0.0000],
        [ 0.6048],
        [ 1.1213],
        [ 0.3374],
        [-0.1433]], requires_grad=True)

The grad values are typically larger for the tokens that occur more often (1 and 5, versus 2 and 6), zero for tokens that do not appear (3 and 4), and zero for padding_idx=0.
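A self-contained way to inspect those gradients is sketched below. It rebuilds the same small model under a fixed seed (the seed is an assumption added for reproducibility), so the exact numbers will differ from the ones in this post.

```python
import torch

torch.manual_seed(0)
embed = torch.nn.Embedding(7, 1, padding_idx=0)
lin = torch.nn.Linear(1, 1)
tokens = torch.tensor([[1, 1, 2, 6, 0],
                       [1, 5, 5, 0, 0]])

z = lin(embed(tokens))
weight = torch.rand_like(z)      # random per-position weights
loss = torch.sum(z * weight)
loss.backward()

grad = embed.weight.grad         # dense tensor of shape (7, 1)
print(grad)

# Rows with no gradient: the pad index plus the tokens absent from the batch.
zero_rows = [i for i in range(grad.shape[0]) if torch.all(grad[i] == 0)]
print(zero_rows)
```

Rows 3 and 4 are zero simply because those tokens never appear in the batch, while row 0 is zero because of padding_idx even though token 0 appears three times.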


mask_zero in TensorFlow

This parameter serves a similar purpose to padding_idx above. From the TensorFlow documentation:

mask_zero: Boolean, whether or not the input value 0 is a special “padding” value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

However, it works differently. It does not change the output of the embedding layer. Instead, the layer produces a boolean mask that subsequent layers can use to block the pad positions.
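To make the difference concrete, here is a small sketch that asks the layer for its mask directly through `compute_mask` and compares the output at a pad position with row 0 of the embedding table. Calling `compute_mask` by hand is done here only for inspection; in a model Keras normally propagates the mask for you.

```python
import numpy as np
import tensorflow as tf

tokens = tf.constant([[1, 1, 2, 6, 0],
                      [1, 5, 5, 0, 0]])
embed = tf.keras.layers.Embedding(7, 1, mask_zero=True)

out = np.asarray(embed(tokens))                   # shape (2, 5, 1)
mask = np.asarray(embed.compute_mask(tokens))     # shape (2, 5), boolean
table = embed.get_weights()[0]                    # embedding table, shape (7, 1)

print(mask)
# The output at a pad position is just the (non-zero) row 0 of the table.
print(out[0, 4], table[0])
```

The mask is False exactly where the input is 0, while the embedding output at those positions is left as an ordinary lookup.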

First, let us construct a similar model in TensorFlow, reusing the input from before:

embed_tf = tf.keras.layers.Embedding(embed_size, 1, mask_zero=True)
inp_tf = tf.convert_to_tensor(arr)

Now we will create a simple masked wrapper for Dense, the equivalent of Linear in PyTorch.

class MaskedDense(tf.keras.layers.Dense):
    def call(self, inputs, mask=None):
        if mask is not None:
            return super(MaskedDense, self).call(inputs * tf.cast(mask[..., None], tf.float32))
        return super(MaskedDense, self).call(inputs)

lin_tf = MaskedDense(1, use_bias=False, kernel_initializer=

Now get the outputs and gradients:

with tf.GradientTape() as tape:
    z = lin_tf(embed_tf(inp_tf))
    z2 = tf.reduce_sum(z * weight.numpy())

grad = tape.gradient(z2, embed_tf.trainable_variables[0])

Note that the weights here are not zero for the padding index of 0.

<tf.Variable 'embedding/embeddings:0' shape=(7, 1) dtype=float32, numpy=
array([[ 0.01041734],
       [ 0.04714436],
       [ 0.04238981],
       [ 0.01394412],
       [ 0.00850029],
       [-0.0347899 ],
       [ 0.00357641]], dtype=float32)>

What about the gradients?

<tensorflow.python.framework.indexed_slices.IndexedSlices at 0x7fd851f25dc0>

The gradient for Embedding is an instance of IndexedSlices, so we need to reconstruct a dense tensor from it.

tf.scatter_nd(grad.indices[:, None], grad.values, tf.cast(grad.dense_shape, tf.int64))
<tf.Tensor: shape=(7, 1), dtype=float32, numpy=
array([[0.        ],
       [0.        ],
       [0.        ],
       [0.03467343]], dtype=float32)>

As expected, the gradient for index 0 is zero.
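The same effect can be reproduced without Keras layers, which may help show where the zero comes from. The sketch below is an assumption-laden stand-in for the model above: it uses a plain `tf.Variable` as the embedding table, `tf.gather` for the lookup, and an explicit mask multiplication, then densifies the IndexedSlices gradient with `tf.math.unsorted_segment_sum` as an alternative to `tf.scatter_nd`.

```python
import tensorflow as tf

tf.random.set_seed(0)
table = tf.Variable(tf.random.normal([7, 1]))     # stand-in embedding table
tokens = tf.constant([[1, 1, 2, 6, 0],
                      [1, 5, 5, 0, 0]])
# Random positive per-position weights, kept away from zero.
pos_weight = tf.random.uniform([2, 5, 1], minval=0.1, maxval=1.0)

with tf.GradientTape() as tape:
    emb = tf.gather(table, tokens)                            # (2, 5, 1)
    mask = tf.cast(tf.not_equal(tokens, 0), tf.float32)[..., None]
    loss = tf.reduce_sum(emb * mask * pos_weight)             # pads contribute nothing

grad = tape.gradient(loss, table)
print(type(grad).__name__)

# Sum the slices that share an index to obtain a dense (7, 1) gradient.
dense = tf.math.unsorted_segment_sum(grad.values, grad.indices, num_segments=7)
print(dense.numpy())
```

Row 0 still appears among the slice indices, but every slice for it is zero because the mask removed the pad positions from the loss.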