Chainer character embeddings

(Continues from Numpy character embeddings.) The numpy embedding model turned out to be extremely slow because it wasn’t vectorised. Chainer is a python deep learning package that enables us to implement the model easily with automatic differentiation and the resulting vectorised operations are fast—and can be run on a GPU if you want. In this post I’ll explore how the different optimisers perform out of the box.


There are a few deep learning packages available with Python interfaces at the moment (and more are being added). From the venerable Theano and its offspring like Lasagne and Keras, to Tensorflow (which is also supported by Keras).

Chainer caught my attention when I first looked at because it made recurrent neural networks easy to do. Theano, which was the state-of-the-art package at that stage, struggles with RNNs—and RNNs are awesome.

Chainer’s interface is a joy to work with and I think the define-by-run scheme is very clever. It might not be the fastest library out there but it is extremely flexible and the fact that you can install it with pip install chainer clinched it for me. Many other packages are difficult to install and don’t really want you to use it without the GPU—sometimes you just need a medium-sized deep model that is easy to deploy.

Embedding model

Back to our simple embedding model. The main trick in this implementation is to pack the input sequences into a 2 dimensional ndarray. Each row is a training point and each column represents a token. The number of columns is the size of the window you’re using to predict the next token.

Chainer then applies this input matrix to the embedding matrix . The EmbedID operation broadcasts the embedding lookups over the last dimension of the input array, resulting in a three dimensional ndarray. If we now reshape this ndarray our embeddings are neatly packed next to each other just like we wanted.

The last step is a simple linear layer.

class EmbeddingModel(Chain):
    def __init__(self, vocab_size, embedding_size, ngram_size):
        super(EmbeddingModel, self).__init__(
            l1 = L.EmbedID(vocab_size, embedding_size),
            l2 = L.Linear(embedding_size * ngram_size, vocab_size)
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.ngram_size = ngram_size
    def __call__(self, x):
        h = F.reshape(self.l1(x), (-1, self.embedding_size * self.ngram_size, 1))
        y = self.l2(h)
        return y

Training setup

Training is accomplished by defining a loss function—in our case the softmax cross entropy loss—and calling the backward() method on the loss to run automatic differentiation via back propagation on the model. The optimizer object uses the parameter gradients to update the parameters.

The following standard training loop was taken from the chainer documentation:

# Setup the model with a vocabulary of V,
# hidden dimensionality of H and window
# size of M.
model = L.Classifier(EmbeddingModel(V, H, M))
optimizer = optimizers.NesterovAG()

# Training loop
for epoch in range(n_epochs):
    indexes = np.random.permutation(datasize) 
    for i in range(0, datasize, batchsize):
        x = Variable(train_x[indexes[i : i + batchsize]])
        t = Variable(train_y[indexes[i : i + batchsize]])
        loss = model(x, t)


Chainer comes with a few of the beloved deep learning optimisers, like adam and NesterovAG as part of the package.

Different batch sizes

Because deep models are usually trained with minibatch stochastic gradient methods, we are stuck with a bunch of optimiser hyperparameters. Ain’t nobody got time to tune all of them for all the available optimisers, so let’s pick NesterovAG to see what a good minibatch size is.

We use a small collection of Shakespeare text as the training data; set the character window size to 10; and the embedding dimension also to 10.

(We plot only the training accuracy because we’re interested in the optimiser performance and not yet in how well the model generalises.)


Different optimisers

Now we can compare the different optimizers’ out-of-the-box performance with a minibatch size of 256.


Looks like the three front runners are Adam, RMSPropGraves, and NesterovAG.

Embedding dimensions

Now we can experiment with a few different hidden layer sizes and window sizes.


As the window size increases the next character is more accurately predicted. A hidden layer size of 40, however, gives better accuracy after 300 seconds than a larger network because the larger network is also slower to train.


Finally, as always, let’s generate some text!

Not, but I way?

Hers att, tor my the text Soald hemereef Pracceit the th lood,
wimen's oflly hop selingeasp wimy biontog nofour prayse? 
Goout, bown loth.

Now, KI pet char.
Written on January 28, 2016