Spelling correction with pyhacrf
In this post I’ll describe how to generate suggestions for incorrectly spelled words using the pyhacrf package. pyhacrf implements the Hidden Alignment Conditional Random Field (HACRF) model in Python with a sklearn-like interface.
Learnable edit distance
One approach to spelling correction is to find the ‘closest’ dictionary word to our incorrectly spelled word. A common way to define distance between strings is the Levenshtein distance.
A drawback of the Levenshtein distance is that all edit operations have the same weight. For example, say we have the incorrect token kat and three candidate corrections: hat, mat, and cat. All three corrections are a Levenshtein distance of 1 from kat, so we cannot choose between the candidates.
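To make the drawback concrete, here is a minimal sketch of the unweighted edit distance (plain Python, written for illustration rather than taken from pyhacrf):
def levenshtein(a, b):
    # Classic dynamic programming: every insertion, deletion, and
    # substitution costs exactly 1.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                    # deletion
                               current[j - 1] + 1,                 # insertion
                               previous[j - 1] + (ca != cb)))      # substitution
        previous = current
    return previous[-1]
>>> [levenshtein('kat', candidate) for candidate in ('hat', 'mat', 'cat')]
[1, 1, 1]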
A solution is to generalise the Levenshtein distance. Two possible generalisations are:
- Weighted finite state transducers (WFSTs). The Levenshtein distance is a special case of a WFST in which every transition has a weight of 1. Different transitions can have different weights, however. That allows the distance between kat and cat to be 0.3 while the distance between kat and hat is 1.0, for example. These weights can be learned from examples.
- Hidden alignment conditional random field (HACRF) (or see my thesis on the use of the model for noisy text normalization). This model can classify pairs as matching or non-matching. For example, it can classify the pair (kat, cat) as a match and (kat, hat) as a mismatch directly. It can also give probabilities for these predictions. We will use the probability of a match as an ‘edit distance’.
Training data
Let’s use Fei Liu’s wonderful collection of Tweet misspelling examples.
Here are the first few examples of matching word pairs:
[('0kkay', 'okay'),
('0n', 'on'),
('0neee', 'one'),
('0r', 'or'),
('1s', 'once')]
Because our model also needs examples of pairs that do not match, we generate non-matching pairs by randomly combining incorrectly spelled words and correctly spelled words:
[('sausge', 'uses'),
('finnally', 'mummy'),
('kall', 'hit'),
('backkk', 'transfered'),
('bihg', 'morning')]
In the end we thus have 3974 positive examples and 3974 negative examples.
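The random pairing itself only takes a few lines. A minimal sketch, assuming the matching pairs sit in a hypothetical list positive_pairs (in practice you would also discard any accidental matches):
import random

# `positive_pairs` is assumed to hold the (misspelling, correct word)
# pairs from the collection above.
incorrect_words = [misspelled for misspelled, _ in positive_pairs]
correct_words = [correct for _, correct in positive_pairs]

# Pair a random misspelling with a random, unrelated correct word to get
# one negative example per positive example.
negative_pairs = [(random.choice(incorrect_words), random.choice(correct_words))
                  for _ in range(len(positive_pairs))]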
Feature extraction
Now that we have training examples, we have to transform the string pairs into feature vectors. The HACRF model assigns probabilities to the different alignments of the two strings by associating an energy with each position in the lattice that summarises those alignments.
For the example pair ('1s', 'once') we can visualize all alignments as:
s . . . .          s (1, 0) (1, 1) (1, 2) (1, 3)
1 . . . .    or    1 (0, 0) (0, 1) (0, 2) (0, 3)
  o n c e              o      n      c      e
The model associates an energy with each lattice position and lattice transition by taking the inner product between a feature vector and the parameters:
\[ \Phi_{i,j}(\mathbf{x}_{i,j}, \mathbf{\lambda}) = \exp \{ \mathbf{x}_{i,j} \mathbf{\lambda}^T \} \]
There is thus a feature vector \( \mathbf{x}_{i,j} \) for each lattice position \( (i, j) \). We can stack these vectors together into a \( (I \times J \times K) \) tensor where \( I \) is the length of the first string, \( J \) is the length of the second string, and \( K \) is the length of each feature vector. This tensor is nicely stored in a numpy ndarray of shape (I, J, K).
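To make the formula concrete, here is a toy numpy sketch that computes one potential per lattice position, assuming for simplicity a single parameter vector; the actual model learns separate parameters per state and per transition:
import numpy as np

I, J, K = 2, 4, 3
x = np.random.rand(I, J, K)    # feature tensor for one string pair
lam = np.random.rand(K)        # parameter vector (simplified to one vector)

# Inner product along the feature dimension, then exponentiate:
# one potential per lattice position (i, j).
phi = np.exp(np.einsum('ijk,k->ij', x, lam))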
So feature extraction starts with a list of string pairs and outputs a list of ndarrays:
from pyhacrf import StringPairFeatureExtractor, Hacrf

fe = StringPairFeatureExtractor(match=True, numeric=True)
x = fe.fit_transform(x_raw)
where x now contains arrays of shapes:
>>> [x_n.shape for x_n in x]
[(4, 4, 3),
(8, 5, 3),
(11, 10, 3),
...
(7, 7, 3),
(5, 10, 3)]
Each example is now an ndarray that is filled with ones and zeros:
>>> x[0]
array([[[ 1., 0., 0.],
[ 1., 1., 0.],
[ 1., 0., 0.],
[ 1., 1., 0.]],
[[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 1., 0.],
[ 1., 0., 0.]],
[[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 1., 0.]],
[[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 1., 0.]]])
The feature vector for each lattice position has three dimensions (if we were to unstack them).
- The first element is always 1 - this is the bias or offset feature.
- The second element is 1 when the characters of the two input strings at that lattice position are equal, and 0 otherwise.
- The third element is 1 when the characters are equal and both are also numerical characters, and 0 otherwise.
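For intuition, here is a minimal sketch that mirrors this description (it is not StringPairFeatureExtractor’s actual implementation):
import numpy as np

def simple_features(s1, s2):
    # Build an (I, J, 3) feature tensor for the string pair (s1, s2):
    # bias, character match, and numeric character match.
    x = np.zeros((len(s1), len(s2), 3))
    x[..., 0] = 1.0                   # bias feature: always on
    for i, a in enumerate(s1):
        for j, b in enumerate(s2):
            if a == b:
                x[i, j, 1] = 1.0      # characters are equal
                if a.isdigit():
                    x[i, j, 2] = 1.0  # ... and both are numeric
    return x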
Learning
The training labels are just a list of strings:
>>> y
['match', 'mismatch', ... 'mismatch']
We can now learn a model:
>>> from sklearn.cross_validation import train_test_split
>>> x_train, x_test, y_train, y_test = \
train_test_split(x, y, test_size=0.2, random_state=42)
>>> model = Hacrf()
>>> model.fit(x_train, y_train, verbosity=5)
Iteration Log-likelihood |gradient|
0 -138.6 528.7
5 -44.44 70.03
10 -37.29 150.9
...
410 -19.51 0.007586
415 -19.51 0.002551
Evaluation
To evaluate how well it has generalized, let’s compute the confusion matrices on the training and testing sets.
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.metrics import accuracy_score
>>> pr = model.predict(x_train)
>>> accuracy_score(y_train, pr)
0.96
>>> confusion_matrix(y_train, pr)
[[96 5]
[ 3 96]]
And the final performance on the test set is:
>>> pr = model.predict(x_test)
>>> accuracy_score(y_test, pr)
0.951260504202
>>> confusion_matrix(y_test, pr)
[[571 39]
[ 19 561]]
It looks like the model has generalized well (the test and training scores are similar) and can differentiate between random word pairs and matching word pairs. Can it, however, give reasonable scores that can be used as a distance measure?
To see how it ranks candidate corrections, let’s use it to score candidate correct tokens from a thousand random words.
# Construct list of candidate pairs
incorrect = 'ses'
candidates = set(candidate for _, candidate in x_raw[:1000])
test_pairs = [(incorrect, candidate) for candidate in candidates]
# Extract features
x_test = fe.transform(test_pairs)
# Score
pr = model.predict_proba(x_test)
# Display the ten highest-scoring candidates
candidate_scores = zip(pr, test_pairs)
candidate_scores = sorted(candidate_scores, key=lambda x: -x[0][0])
print(candidate_scores[:10])
which produces
[(array([ 0.99492903, 0.00507097]), ('ses', 'she')),
(array([ 0.99460332, 0.00539668]), ('ses', 'seems')),
(array([ 0.98942973, 0.01057027]), ('ses', 'sheesh')),
(array([ 0.98908823, 0.01091177]), ('ses', 'sake')),
(array([ 0.98908823, 0.01091177]), ('ses', 'some')),
(array([ 0.98788131, 0.01211869]), ('ses', 'singers')),
(array([ 0.97803790, 0.02196210]), ('ses', 'send')),
(array([ 0.97687055, 0.02312945]), ('ses', 'shine')),
(array([ 0.97687055, 0.02312945]), ('ses', 'space')),
(array([ 0.97633783, 0.02366217]), ('ses', 'says'))]
The correct token is ‘uses’, which it doesn’t get in the top ten candidates. Here are, for a few other words, the top ten candidates and the probability of them being a match:
| mummmyyy | | m0rning | | mannz | | baak | | bekause | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mummy | 0.98 | morning | 1.00 | man | 0.99 | back | 0.99 | because | 1.00 |
| mommy | 0.65 | might | 0.89 | mean | 0.99 | break | 0.98 | breakfast | 0.93 |
| may | 0.11 | missing | 0.89 | meant | 0.94 | balkan | 0.97 | been | 0.91 |
| much | 0.10 | mine | 0.89 | may | 0.92 | bad | 0.96 | bike | 0.88 |
| me | 0.05 | marketing | 0.86 | mad | 0.92 | bike | 0.94 | bias | 0.87 |
| mouth | 0.05 | turning | 0.83 | mais | 0.74 | baby | 0.89 | be | 0.82 |
| sum | 0.05 | burning | 0.83 | make | 0.74 | bias | 0.89 | babes | 0.79 |
| maybe | 0.01 | fronting | 0.81 | made | 0.74 | be | 0.88 | break | 0.76 |
| mad | 0.01 | printing | 0.80 | me | 0.71 | bastards | 0.86 | bread | 0.74 |
| man | 0.01 | training | 0.78 | my | 0.71 | bread | 0.78 | please | 0.69 |
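Wrapping the scoring steps into a small helper makes it easy to generate such rankings for any token. Below is a hypothetical suggest function, assuming (as the output above suggests) that the first column returned by predict_proba is the match probability:
def suggest(model, fe, word, vocabulary, n=10):
    # Score `word` against every candidate in `vocabulary` and return the
    # n candidates with the highest match probability.
    vocabulary = list(vocabulary)
    pairs = [(word, candidate) for candidate in vocabulary]
    probabilities = model.predict_proba(fe.transform(pairs))
    scores = zip((p[0] for p in probabilities), vocabulary)
    return sorted(scores, reverse=True)[:n]
Calling, for example, suggest(model, fe, 'mummmyyy', candidates) should produce a ranking like the first column above, up to the candidate set used.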
Conclusion
The model gives pretty good spelling recommendations, although the examples we looked at are not particularly difficult. In a follow-up post I’ll discuss how to
- add character features (the current model still cannot differentiate between hat and cat),
- regularize, and
- use a different optimizer.