# Numpy character embeddings

*Continues from Embedding derivative derivation.*

`numpy`

, train it on some
characters, generate some text, and plot two of the components over time.
## Vec transpose

To implement the derivative we need the vec-transpose operator. Numpy doesn't have an implementation but we can implement our own by doing some reshaping and transposition:

```
def vec_transpose(X, p):
m, n = X.shape
return X.reshape(m / p, p, n).T.reshape((n * p), m / p)
```

## Commutation matrix

There is an implementation of the construction of the commutation matrix in statsmodels.tsa.tsatools that we can use.

```
def commutation_matrix(p, q):
K = np.eye(p * q)
indices = np.arange(p * q).reshape((p, q), order='F')
return K.take(indices.ravel(), axis=0)
```

## Objective function

A straightforward implementation of the log likelihood and log likelihood
derivatives (using the previous notation)
for a single data point in `python`

yields:

```
def log_likelihood(x, sample_weight, W, R, C):
logZ = -np.inf
p = 0.0
for c in range(C):
en = np.dot(np.dot(x, R), W[c])
logZ = np.logaddexp(logZ, en)
p += en * y[c]
ll += (p - logZ) * sample_weight
def dlog_likelihood(x, sample_weight, W, R, C):
dlldW = [np.zeros(Wc.shape) for Wc in W]
dlldE = np.zeros(E.shape)
for c in range(C):
en = np.dot(np.dot(x, R), W[c])
dlldW[c] += (np.dot(np.dot(y[c], x), R) - np.exp(-logZ + en) \
* np.dot(x, R)).reshape(dlldW[c].shape) \
* sample_weight
inner = np.dot(np.dot(W[c], x.reshape(1, -1)), K_mv).T
term = np.dot(K_hm_m_T, vec_transpose(inner, M)).T
dlldE += ((y[c] - np.exp(-logZ + en)) * term) * sample_weight
```

Now the total log likelihood is calculated as:

```
def objective(params, (M, H, X, Y, sample_weights)):
V = X.shape[1] / M
N, C = Y.shape
E = params[0: V * H].reshape(V, H)
W = [params[V * H + c * M * H: V * H + (c + 1) * M * H].reshape(M * H, 1)
for c in range(C)]
R = np.kron(np.eye(M), E)
K_hm_m_T = vec_transpose(commutation_matrix(H, M), M).T
K_mv = commutation_matrix(M, V)
ll = 0.0
dlldW = [np.zeros(Wc.shape) for Wc in W]
dlldE = np.zeros(E.shape)
for x, y, sample_weight in zip(X, Y, sample_weights):
ll += log_likelihood(x, sample_weight, W, R, C)
dlldW_delta, dlldE_delta = dlog_likelihood(x,
sample_weight,
W, R, C,
K_hm_m_T, K_mv)
for c in range(C):
dlldW[c] += dlldW_delta[c]
dlldE += dlldE_delta
dparams = np.concatenate([d.flatten() for d in [dlldE] + dlldW])
return -ll, -dparams.T
```

## Character N-grams

To construct the training data the input string is first split into N-grams (in our case M-grams), and then counted. This reduces the number of training points, and hence the training time.

```
text = open('input.txt', 'r').read()
vocab = list(set(text))
V = len(vocab)
vocab_to_index = {v: i for i, v in enumerate(vocab)}
ngrams = {}
M = 3
for i in xrange(len(text) - (M + 1)):
ngram = text[i: i + (M + 1)]
ngrams[ngram] = ngrams.get(ngram, 0) + 1
N = len(ngrams)
```

We then sort the points M-grams from most common to least common (because curriculum learning, or something), and construct the input feature vectors.

```
scounts = sorted(ngrams.items(), key=lambda x: -x[1])
scounts[:10]
train_x = np.zeros((N, V * M))
train_y = np.zeros((N, V))
for n, (ngram, _) in enumerate(scounts):
init = ngram[:-1]
y = ngram[-1]
for i, token in enumerate(init):
train_x[n, i*V + vocab_to_index[token]] = 1
train_y[n, vocab_to_index[y]] = 1
train_sw = [count for _, count in scounts]
```

(I'm not bothering with train/test splits because this implementation is too slow to take seriously)

# Training

Now we decide on our embedding size - in our case 5 - and get an initial random parameter vector.

```
H = 5
x0 = random_params(M, H, V, V, std=1.0 / M)
```

I've wanted to check out Jascha Sohl-Dickstein's minibatch version of the BFGS optimizer for a while and it was fairly easy to plug the objective into that.

```
from sfo import SFO
minibatch_size = 400
subs = [(M, H,
train_x[n * minibatch_size: (n + 1) * minibatch_size].copy(),
train_y[n * minibatch_size: (n + 1) * minibatch_size].copy(),
train_sw[n * minibatch_size: (n + 1) * minibatch_size],
) for n in range(len(train_sw) / minibatch_size)]
optimizer = SFO(objective,
x0.copy(),
subs,
max_history_terms=6,
hessian_algorithm='bfgs')
```

## Embedding evolution

To make a `gif`

of how the character embeddings evolve during training
we'll need to save the intermediate parameter vectors.

Also the first optimizer step must be two steps for some reason.

```
param_list = []
param_list.append(optimizer.optimize(num_steps=2).copy())
for step in range(500):
param_list.append(optimizer.optimize(num_steps=1).copy())
```

## Make a gif

Using this recipe, the first two components (out of the possible 5) is plotted over time.

```
from matplotlib import animation
frame_factor = 2
def animate(nframe):
plt.cla()
E = param_list[nframe * frame_factor][0: V * H].reshape(V, H)
for i, char in enumerate(vocab):
plt.text(E[i, 0], E[i, 1], char)
plt.scatter(E[i, 0], E[i, 1], marker='.')
plt.ylim(-1.1, 1.5)
plt.xlim(-2.5, 1.1)
plt.title('Iteration {}'.format(nframe))
fig = plt.figure(figsize=(12, 12))
anim = animation.FuncAnimation(fig, animate,
frames=len(param_list) / frame_factor)
anim.save('embeddings.gif', writer='imagemagick', fps=32);
```

(*The first two componenents of each character embedding as training processes:*)

## Generate

This model is expected to work about as well as character N-grams, so let's generate some text to get a feel for what it does:

```
def predict_proba(params, M, H, X):
V = X.shape[1] / M
N = X.shape[0]
C = (len(params) - V*H) / (M * H)
E = params[0: V * H].reshape(V, H)
W = [params[V * H + c * M * H: V * H + (c + 1) * M * H].reshape(M * H, 1) for c in range(C)]
R = np.kron(np.eye(M), E)
p = np.zeros((N, C))
for n, x in enumerate(X):
logZ = -np.inf
for c in range(C):
en = np.dot(np.dot(x, R), W[c])
logZ = np.logaddexp(logZ, en)
p[n, c] = en
p[n, :] -= logZ
return np.exp(p)
def generate(params, M, H, n, start):
V = len(vocab_to_index)
for i in range(n):
x = np.zeros((1, M*V))
for i, c in enumerate(start[-M:]):
x[0, i*V + vocab_to_index[c]] = 1
nextchar = vocab[np.random.multinomial(1, predict_proba(params, M, H, x).flatten()).argmax()]
sys.stdout.write(nextchar)
start += nextchar
generate(optimizer.theta, M, H, 300, 'The')
```

The model (which is trained on a small piece of Shakespeare) gives:

```
e! shye thole,
nsretH torpthe lor.
Towd-pe; yore'n Gprthr toas uount ool I
Anh y and- yaw ancacame.
CUTHURMA:
INROMd?
His:
CNwond.
FLRXDZM&zrele hed varadd bheons;
I law sade piot io her fitponeve py nolr, it ans torl tist,
ar thar hat anasd yoind, feat ye sraps thon-rasg.
S Rosg- folbel, sheron
```

Unfortunately the implementation is very slow to train. This means that we are restricted to small N-grams and small embedding dimensions - which means that it's not going to work very well.

I can try to optimise the implementation to run faster but next
I'll rather try to implement the same model with `chainer`

to show how
much simpler and better automatic differentiation makes things.