Chainer character embeddings
(Continues from Numpy character embeddings.)
The numpy embedding model turned out to be extremely slow because it wasn't vectorised. Chainer is a Python deep learning package that enables us to implement the model easily with automatic differentiation, and the resulting vectorised operations are fast—and can be run on a GPU if you want. In this post I'll explore how the different optimisers perform out of the box.
Chainer
There are a few deep learning packages available with Python interfaces at the moment (and more are being added): from the venerable Theano and its offspring like Lasagne and Keras, to TensorFlow (which Keras also supports as a backend).
Chainer caught my attention when I first looked at it because it made recurrent neural networks easy to do. Theano, which was the state-of-the-art package at that stage, struggles with RNNs—and RNNs are awesome.
Chainer's interface is a joy to work with and I think the define-by-run scheme is very clever. It might not be the fastest library out there, but it is extremely flexible, and the fact that you can install it with pip install chainer clinched it for me. Many other packages are difficult to install and don't really want you to use them without a GPU—sometimes you just need a medium-sized deep model that is easy to deploy.
Embedding model
Back to our simple embedding model. The main trick in this implementation is to pack the input sequences into a two-dimensional ndarray. Each row is a training point and each column represents a token; the number of columns is the size of the window you're using to predict the next token.
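As a rough sketch, the packing might look something like this (make_windows is a hypothetical helper, and mapping the raw text to integer character ids is assumed to have happened already):

import numpy as np

def make_windows(ids, window_size):
    # `ids` is the text already mapped to integer character ids. Each row of
    # xs is one window of `window_size` token ids, and the matching entry of
    # ts is the id of the character that follows that window.
    xs, ts = [], []
    for i in range(len(ids) - window_size):
        xs.append(ids[i : i + window_size])
        ts.append(ids[i + window_size])
    # EmbedID expects int32 indices.
    return np.asarray(xs, dtype=np.int32), np.asarray(ts, dtype=np.int32)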
Chainer then applies this input matrix to the embedding matrix. The EmbedID operation broadcasts the embedding lookups over the last dimension of the input array, resulting in a three-dimensional ndarray. If we now reshape this ndarray, our embeddings are neatly packed next to each other, just like we wanted. The last step is a simple linear layer.
import chainer.functions as F
import chainer.links as L
from chainer import Chain


class EmbeddingModel(Chain):
    def __init__(self, vocab_size, embedding_size, ngram_size):
        super(EmbeddingModel, self).__init__(
            # Embedding lookup table: one row per character in the vocabulary.
            l1=L.EmbedID(vocab_size, embedding_size),
            # Linear layer from the concatenated window embeddings to vocabulary scores.
            l2=L.Linear(embedding_size * ngram_size, vocab_size),
        )
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.ngram_size = ngram_size

    def __call__(self, x):
        # Look up the embeddings for the whole window and pack them next to
        # each other before the linear layer.
        h = F.reshape(self.l1(x), (-1, self.embedding_size * self.ngram_size, 1))
        y = self.l2(h)
        return y
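To see the broadcasting and reshaping at work, here is a quick shape check (the sizes and the random input are only for illustration):

import numpy as np
from chainer import Variable

V, E, M = 60, 10, 10                                    # illustrative sizes
model = EmbeddingModel(V, E, M)
x = Variable(np.random.randint(0, V, size=(32, M)).astype(np.int32))
e = model.l1(x)   # EmbedID broadcasts the lookup: shape (32, M, E)
y = model(x)      # after the reshape and linear layer: shape (32, V)
print(e.data.shape, y.data.shape)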
Training setup
Training is accomplished by defining a loss function—in our case the softmax cross entropy loss—and calling the backward() method on the loss to run automatic differentiation via backpropagation through the model. The optimizer object then uses the parameter gradients to update the parameters. The following standard training loop was taken from the Chainer documentation:
import numpy as np
from chainer import Variable, optimizers
import chainer.links as L

# Set up the model with a vocabulary of V, hidden dimensionality of H and
# window size of M. L.Classifier wraps the model with a softmax cross entropy
# loss. train_x, train_y, datasize, batchsize and n_epochs come from the data
# preparation.
model = L.Classifier(EmbeddingModel(V, H, M))
optimizer = optimizers.NesterovAG()
optimizer.setup(model)

# Training loop
for epoch in range(n_epochs):
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        x = Variable(train_x[indexes[i : i + batchsize]])
        t = Variable(train_y[indexes[i : i + batchsize]])
        model.zerograds()
        loss = model(x, t)
        loss.backward()
        optimizer.update()
Optimisers
Chainer comes with a few of the beloved deep learning optimisers, like Adam and NesterovAG, as part of the package.
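For the comparisons below, each optimiser is simply constructed with its default hyperparameters and attached to a fresh copy of the model, along these lines (the exact set of optimisers is the one shown in the plots; this list is illustrative):

from chainer import optimizers
import chainer.links as L

# Candidate optimisers, all with their default settings.
candidates = {
    'SGD': optimizers.SGD,
    'MomentumSGD': optimizers.MomentumSGD,
    'AdaGrad': optimizers.AdaGrad,
    'AdaDelta': optimizers.AdaDelta,
    'Adam': optimizers.Adam,
    'RMSpropGraves': optimizers.RMSpropGraves,
    'NesterovAG': optimizers.NesterovAG,
}

for name, make_optimizer in candidates.items():
    model = L.Classifier(EmbeddingModel(V, H, M))   # fresh parameters per run
    optimizer = make_optimizer()
    optimizer.setup(model)
    # ... run the same training loop as above and record the accuracy ...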
Different batch sizes
Because deep models are usually trained with minibatch stochastic gradient methods, we are stuck with a bunch of optimiser hyperparameters. Ain't nobody got time to tune all of them for all the available optimisers, so let's pick NesterovAG to see what a good minibatch size is.
We use a small collection of Shakespeare text as the training data, set the character window size to 10, and set the embedding dimension also to 10.
(We plot only the training accuracy because we're interested in the optimiser performance and not yet in how well the model generalises.)
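The accuracy itself comes essentially for free: L.Classifier computes it as a side effect of each forward pass, since compute_accuracy is on by default, so inside the minibatch loop it can be read off like this:

def minibatch_stats(model, x, t):
    # L.Classifier computes the softmax cross entropy loss and, as a side
    # effect, the classification accuracy of the forward pass
    # (compute_accuracy is True by default).
    loss = model(x, t)
    return float(loss.data), float(model.accuracy.data)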
Different optimisers
Now we can compare the different optimisers' out-of-the-box performance with a minibatch size of 256. Looks like the three front runners are Adam, RMSpropGraves, and NesterovAG.
Embedding dimensions
Now we can experiment with a few different hidden layer sizes and window sizes. As the window size increases, the next character is predicted more accurately. A hidden layer size of 40, however, gives better accuracy after 300 seconds than a larger network, because the larger network is also slower to train.
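Such a sweep is just a nested loop over model configurations; the sizes below are illustrative rather than the exact grid behind the plots:

from chainer import optimizers
import chainer.links as L

for hidden_size in (10, 20, 40, 80):        # embedding / hidden layer size
    for window_size in (2, 5, 10):          # number of context characters
        model = L.Classifier(EmbeddingModel(V, hidden_size, window_size))
        optimizer = optimizers.NesterovAG()
        optimizer.setup(model)
        # ... rebuild train_x / train_y for this window size, run the
        # training loop above, and record accuracy against wall-clock time ...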
Generate
Finally, as always, let’s generate some text!
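Generation runs the model the other way around: start from a seed window of M character ids, sample the next character from the softmax output, slide the window along and repeat. A minimal sketch, assuming hypothetical names sample_text, seed_ids and id_to_char (the id mapping comes from the data preparation):

import numpy as np
from chainer import Variable
import chainer.functions as F

def sample_text(model, seed_ids, n_chars, id_to_char):
    # `model` is the L.Classifier-wrapped model, `seed_ids` is a list of M
    # integer character ids and `id_to_char` maps ids back to characters.
    window = list(seed_ids)
    out = []
    for _ in range(n_chars):
        x = Variable(np.asarray([window], dtype=np.int32))   # shape (1, M)
        p = F.softmax(model.predictor(x)).data[0].astype(np.float64)
        p /= p.sum()                                         # guard against rounding
        next_id = int(np.random.choice(len(p), p=p))         # sample the next character id
        out.append(id_to_char[next_id])
        window = window[1:] + [next_id]                      # slide the window along
    return ''.join(out)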
SEROLED:
Not, but I way?
LARTERMEE:'
OXFRONUK:
Hers att, tor my the text Soald hemereef Pracceit the th lood,
wimen's oflly hop selingeasp wimy biontog nofour prayse?
Goout, bown loth.
FRIALAS:
Now, KI pet char.