Chainer character embeddings
Continues from [Numpy character embeddings]
The numpy embedding model turned out to be extremely slow because
it wasn’t vectorised.
Chainer is a Python deep learning
package that enables us to
implement the model easily with automatic differentiation. The
resulting vectorised operations are fast and can be run on a GPU if you want.
In this post I’ll explore how the different optimisers perform out of the box.
There are a few deep learning packages available with Python interfaces at the moment (and more are being added). They range from the venerable Theano and its offspring like Lasagne and Keras, to TensorFlow (which is also supported by Keras).
Chainer caught my attention when I first looked at it because it made
recurrent neural networks easy to do.
Theano, which was the state-of-the-art
package at that stage, struggled with RNNs - and RNNs are awesome.
Chainer’s interface is a joy to work with and I think the define-by-run
scheme is very clever.
It might not be the fastest library out there but it is extremely
flexible and the fact that you can install it with
pip install chainer
clinched it for me. Many other packages are difficult to install and don’t
really want you to use them without a GPU - sometimes you just need a
medium-sized deep model that is easy to deploy.
Back to our simple embedding model.
The main trick in this implementation is to pack the input
sequences into a two-dimensional ndarray.
Each row is a training point and each column represents a token.
The number of columns is the size of the window you’re using to
predict the next token.
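To make the packing concrete, here is a minimal sketch in plain numpy. The helper name `make_windows` and the toy sentence are my own inventions for illustration, not part of the original code. Each row of `x` is one window of token ids and `y` holds the token that follows each window.

```python
import numpy as np

def make_windows(text, window_size):
    """Pack a string into an (n_points, window_size) id array plus next-token targets."""
    # Hypothetical vocabulary: one integer id per distinct character.
    vocab = sorted(set(text))
    char_to_id = {c: i for i, c in enumerate(vocab)}
    ids = np.array([char_to_id[c] for c in text], dtype=np.int32)

    n_points = len(ids) - window_size
    # Row i holds tokens i .. i + window_size - 1.
    x = np.stack([ids[i:i + window_size] for i in range(n_points)])
    # The target for row i is the token that follows its window.
    y = ids[window_size:]
    return x, y, vocab

train_x, train_y, vocab = make_windows("to be or not to be", window_size=3)
print(train_x.shape, train_y.shape)
```

The `int32` dtype is chosen because that is what the embedding lookup expects for token ids.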
Chainer then applies this input matrix to the embedding matrix.
The EmbedID operation broadcasts the
embedding lookups over the last dimension of the input array, resulting
in a three dimensional
ndarray. If we now reshape this
ndarray our embeddings are neatly packed next to each other
just like we wanted.
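The lookup-then-reshape step can be mimicked with numpy fancy indexing. This is only a sketch of the idea with made-up sizes and a random stand-in matrix `W`, not what Chainer does internally.

```python
import numpy as np

vocab_size, embedding_size, ngram_size = 7, 4, 3
rng = np.random.RandomState(0)

# Stand-in for the embedding matrix: one row per token.
W = rng.randn(vocab_size, embedding_size).astype(np.float32)

# Two training points, each a window of three token ids.
x = np.array([[0, 2, 5],
              [2, 5, 1]], dtype=np.int32)

# Integer-array indexing looks up a row of W for every entry of x,
# giving shape (batch, ngram_size, embedding_size).
embedded = W[x]

# Flattening the last two axes places the window's embeddings side by side.
h = embedded.reshape(-1, ngram_size * embedding_size)
print(embedded.shape, h.shape)
```

Columns 4 to 7 of the first row of `h` are exactly the embedding of token 2, the second token in that window.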
The last step is a simple linear layer.
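from chainer import Chain
import chainer.functions as F
import chainer.links as L

class EmbeddingModel(Chain):
    def __init__(self, vocab_size, embedding_size, ngram_size):
        super(EmbeddingModel, self).__init__(
            # One embedding vector per vocabulary entry.
            l1=L.EmbedID(vocab_size, embedding_size),
            # Maps the concatenated window embeddings to next-token scores.
            l2=L.Linear(embedding_size * ngram_size, vocab_size)
        )
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.ngram_size = ngram_size

    def __call__(self, x):
        # Concatenate the embeddings of each window into one long vector per row.
        h = F.reshape(self.l1(x), (-1, self.embedding_size * self.ngram_size, 1))
        y = self.l2(h)
        return y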
Training is accomplished by defining a loss function - in our case
the softmax cross entropy loss - and calling the
backward() method on the
loss to run automatic differentiation via back propagation on the model.
The optimiser then uses the parameter gradients to update the parameters.
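For reference, the softmax cross entropy loss mentioned above can be written in a few lines of numpy. This is a sketch of the formula only. The function name and the toy logits are mine, and averaging over the minibatch is an assumption about the convention being used.

```python
import numpy as np

def softmax_cross_entropy(logits, targets):
    """Mean negative log-probability of the integer targets under softmax(logits)."""
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of the correct class in every row.
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
targets = np.array([0, 2])
loss = softmax_cross_entropy(logits, targets)
print(loss)
```

A row of all-equal scores contributes log(number of classes) to the loss, which is a handy sanity check for an untrained model.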
The following standard training loop was taken from the
import numpy as np
from chainer import Variable, optimizers
import chainer.links as L

# Set up the model with a vocabulary of V,
# hidden dimensionality of H and window
# size of M.
model = L.Classifier(EmbeddingModel(V, H, M))
optimizer = optimizers.NesterovAG()
optimizer.setup(model)

# Training loop
for epoch in range(n_epochs):
    # Visit the training points in a fresh random order each epoch.
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        x = Variable(train_x[indexes[i : i + batchsize]])
        t = Variable(train_y[indexes[i : i + batchsize]])
        model.zerograds()
        loss = model(x, t)
        loss.backward()
        optimizer.update()
Chainer comes with a few of the beloved deep learning optimisers, like
NesterovAG, as part
of the package.
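To give a feel for what a Nesterov-style optimiser does, here is a sketch of the textbook look-ahead momentum update on a toy quadratic. This is the generic formulation and not necessarily how Chainer implements NesterovAG. The function name, learning rate and momentum values are illustrative assumptions.

```python
import numpy as np

def nesterov_step(theta, velocity, grad_fn, lr=0.1, momentum=0.9):
    """One textbook Nesterov momentum update."""
    # Evaluate the gradient at the look-ahead point, not at theta itself.
    lookahead = theta + momentum * velocity
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    return theta + velocity, velocity

# Toy problem: minimise f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([5.0, -3.0])
velocity = np.zeros_like(theta)
for _ in range(200):
    theta, velocity = nesterov_step(theta, velocity, lambda p: p)
print(np.linalg.norm(theta))
```

Even this tiny version already has two knobs, the learning rate and the momentum, which is why tuning every optimiser quickly gets out of hand.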
Different batch sizes
Because deep models are usually trained with minibatch stochastic
gradient methods, we are stuck with a bunch of optimiser hyperparameters.
Ain’t nobody got time to tune all of them for all the available optimisers, so
we use NesterovAG to see what a good minibatch size is.
We use a small collection of Shakespeare text as the training data; set the character window size to 10; and the embedding dimension also to 10.
(We plot only the training accuracy because we’re interested in the optimiser performance and not yet in how well the model generalises.)
Now we can compare the different optimizers’ out-of-the-box performance with a minibatch size of 256.
Looks like the three front runners are
Now we can experiment with a few different hidden layer sizes and window sizes.
As the window size increases the next character is more accurately predicted. A hidden layer size of 40, however, gives better accuracy after 300 seconds than a larger network because the larger network is also slower to train.
Finally, as always, let’s generate some text!
SEROLED: Not, but I way? LARTERMEE:' OXFRONUK: Hers att, tor my the text Soald hemereef Pracceit the th lood, wimen's oflly hop selingeasp wimy biontog nofour prayse? Goout, bown loth. FRIALAS: Now, KI pet char.
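Text like the sample above can be produced by repeatedly sampling the next character and sliding the window along. The sketch below shows one way to do it, under assumptions: `predict_logits` is a hypothetical callable that maps a `(1, window)` array of token ids to a `(1, vocab_size)` array of scores, and a dummy model stands in for the trained one.

```python
import numpy as np

def sample_text(predict_logits, seed_ids, n_chars, rng):
    """Grow a sequence of token ids by sampling one token at a time."""
    window = len(seed_ids)
    ids = list(seed_ids)
    for _ in range(n_chars):
        # Score the next token given the most recent window of ids.
        x = np.array([ids[-window:]], dtype=np.int32)
        logits = np.asarray(predict_logits(x)[0], dtype=np.float64)
        # Softmax turns the scores into a probability distribution.
        p = np.exp(logits - logits.max())
        p /= p.sum()
        ids.append(int(rng.choice(len(p), p=p)))
    return ids

# Dummy stand-in model: every token is equally likely.
vocab_size = 5
def dummy_logits(x):
    return np.zeros((len(x), vocab_size))

ids = sample_text(dummy_logits, seed_ids=[0, 1, 2], n_chars=20,
                  rng=np.random.RandomState(0))
print(ids)
```

With a trained model, `predict_logits` would wrap the forward pass of the embedding model, and the resulting ids would be mapped back to characters through the vocabulary.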