Spelling correction with pyhacrf
In this post I’ll describe how to generate suggestions for incorrectly spelled words using the pyhacrf package. pyhacrf implements the Hidden Alignment Conditional Random Field (HACRF) model in Python with a sklearn-like interface.
Learnable edit distance
One approach to spelling correction is to find the ‘closest’ dictionary word to our incorrectly spelled word. A common way to define distance between strings is the Levenshtein distance.
A drawback of the Levenshtein distance is that all edit operations have the same weight. For example, say we have the incorrect token kat and three candidate corrections: hat, mat, and cat. All three corrections are a Levenshtein distance of 1 from kat, so we cannot choose between the candidates.
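To make the drawback concrete, here is a minimal sketch of the unweighted edit distance (plain Python, written for illustration rather than taken from pyhacrf):
def levenshtein(a, b):
    # Classic dynamic programming: every insertion, deletion, and
    # substitution costs exactly 1.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                    # deletion
                               current[j - 1] + 1,                 # insertion
                               previous[j - 1] + (ca != cb)))      # substitution
        previous = current
    return previous[-1]
>>> [levenshtein('kat', candidate) for candidate in ('hat', 'mat', 'cat')]
[1, 1, 1]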
A solution is to generalise the Levenshtein distance. Two possible generalisations are:
- Weighted finite state transducers (WFSTs). The Levenshtein distance is a special case of a WFST in which every transition has a weight of 1. Different transitions can have different weights, however. That allows the distance between kat and cat to be 0.3 while the distance between kat and hat is 1.0, for example. These weights can be learned from examples.
- Hidden alignment conditional random field (HACRF) (or see my thesis on the use of the model for noisy text normalization). This model can classify pairs as matching or non-matching. For example, it can classify the pair (kat, cat) as a match and (kat, hat) as a mismatch directly. It can also give probabilities for these predictions. We will use the probability of a match as an ‘edit distance’.
Training data
Let’s use Fei Liu’s wonderful collection of Tweet misspelling examples.
Here are the first few examples of matching word pairs:
[('0kkay', 'okay'),
('0n', 'on'),
('0neee', 'one'),
('0r', 'or'),
('1s', 'once')]
Because our model also needs examples of pairs that do not match, we generate non-matching pairs by randomly combining incorrectly spelled words and correctly spelled words:
[('sausge', 'uses'),
('finnally', 'mummy'),
('kall', 'hit'),
('backkk', 'transfered'),
('bihg', 'morning')]
In the end we thus have 3974 positive examples and 3974 negative examples.
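The random pairing itself only takes a few lines. A minimal sketch, assuming the matching pairs sit in a hypothetical list positive_pairs (in practice you would also discard any accidental matches):
import random

# `positive_pairs` is assumed to hold the (misspelling, correct word)
# pairs from the collection above.
incorrect_words = [misspelled for misspelled, _ in positive_pairs]
correct_words = [correct for _, correct in positive_pairs]

# Pair a random misspelling with a random, unrelated correct word to get
# one negative example per positive example.
negative_pairs = [(random.choice(incorrect_words), random.choice(correct_words))
                  for _ in range(len(positive_pairs))]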
Feature extraction
Now that we have training examples, we have to transform the string pairs into feature vectors. The HACRF model assigns probabilities to the different alignments of the two strings by associating an energy with each position in the lattice that summarises those alignments.
For the example pair ('1s', 'once') we can visualize all alignments as:
s . . . .          s (1, 0) (1, 1) (1, 2) (1, 3)
1 . . . .    or    1 (0, 0) (0, 1) (0, 2) (0, 3)
  o n c e              o      n      c      e
The model associates an energy with each lattice position and lattice transition by taking the inner product between a feature vector and the parameters:
\[ \Phi_{i,j}(\mathbf{x}_{i,j}, \mathbf{\lambda}) = \exp \{ \mathbf{x}_{i,j} \mathbf{\lambda}^T \} \]
There is thus a feature vector \( \mathbf{x}_{i,j} \) for each lattice position \( (i, j) \). We can stack these vectors together into a \( (I \times J \times K) \) tensor where \( I \) is the length of the first string, \( J \) is the length of the second string, and \( K \) is the length of each feature vector. This tensor is nicely stored in a numpy ndarray of shape (I, J, K).
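To make the formula concrete, here is a toy numpy sketch that computes one potential per lattice position, assuming for simplicity a single parameter vector; the actual model learns separate parameters per state and per transition:
import numpy as np

I, J, K = 2, 4, 3
x = np.random.rand(I, J, K)    # feature tensor for one string pair
lam = np.random.rand(K)        # parameter vector (simplified to one vector)

# Inner product along the feature dimension, then exponentiate:
# one potential per lattice position (i, j).
phi = np.exp(np.einsum('ijk,k->ij', x, lam))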
So feature extraction starts with a list of string pairs and outputs a list of ndarrays:
from pyhacrf import StringPairFeatureExtractor, Hacrf

fe = StringPairFeatureExtractor(match=True, numeric=True)
x = fe.fit_transform(x_raw)
where x now contains arrays of shapes:
>>> [x_n.shape for x_n in x]
[(4, 4, 3),
(8, 5, 3),
(11, 10, 3),
...
(7, 7, 3),
(5, 10, 3)]
Each example is now an ndarray that is filled with ones and zeros:
>>> x[0]
array([[[ 1., 0., 0.],
[ 1., 1., 0.],
[ 1., 0., 0.],
[ 1., 1., 0.]],
[[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 1., 0.],
[ 1., 0., 0.]],
[[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 1., 0.]],
[[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 1., 0.]]])
The feature vector for each lattice position has three dimensions (if we were to unstack them).
- The first element is always 1 - this is the bias or offset feature.
- The second element is 1 when the characters of the two input strings at that lattice position are equal, and 0 otherwise.
- The third element is 1 when the characters are equal and both are also numerical characters, and 0 otherwise.
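For intuition, here is a minimal sketch that mirrors this description (it is not StringPairFeatureExtractor’s actual implementation):
import numpy as np

def simple_features(s1, s2):
    # Build an (I, J, 3) feature tensor for the string pair (s1, s2):
    # bias, character match, and numeric character match.
    x = np.zeros((len(s1), len(s2), 3))
    x[..., 0] = 1.0                   # bias feature: always on
    for i, a in enumerate(s1):
        for j, b in enumerate(s2):
            if a == b:
                x[i, j, 1] = 1.0      # characters are equal
                if a.isdigit():
                    x[i, j, 2] = 1.0  # ... and both are numeric
    return x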
Learning
The training labels are just a list of strings:
>>> y
['match', 'mismatch', ... 'mismatch']
We can now learn a model:
>>> from sklearn.cross_validation import train_test_split
>>> x_train, x_test, y_train, y_test = \
train_test_split(x, y, test_size=0.2, random_state=42)
>>> model = Hacrf()
>>> model.fit(x_train, y_train, verbosity=5)
Iteration Log-likelihood |gradient|
0 -138.6 528.7
5 -44.44 70.03
10 -37.29 150.9
...
410 -19.51 0.007586
415 -19.51 0.002551
Evaluation
To evaluate how well it has generalized, let’s compute the confusion matrices on the training and testing sets.
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.metrics import accuracy_score
>>> pr = model.predict(x_train)
>>> accuracy_score(y_train, pr)
0.96
>>> confusion_matrix(y_train, pr)
[[96 5]
[ 3 96]]
And the final performance on the test set is:
>>> pr = model.predict(x_test)
>>> accuracy_score(y_test, pr)
0.951260504202
>>> confusion_matrix(y_test, pr)
[[571 39]
[ 19 561]]
It looks like the model has generalized well (the test and training scores are similar) and can differentiate between random word pairs and matching word pairs. Can it, however, give reasonable scores that can be used as a distance measure?
To see how it ranks candidate corrections, let’s use it to score candidate correct tokens from a thousand random words.
# Construct list of candidate pairs
incorrect = 'ses'
candidates = set(candidate for _, candidate in x_raw[:1000])
test_pairs = [(incorrect, candidate) for candidate in candidates]
# Extract features
x_test = fe.transform(test_pairs)
# Score
pr = model.predict_proba(x_test)
# Display the ten highest-scoring candidates
candidate_scores = zip(pr, test_pairs)
candidate_scores = sorted(candidate_scores, key=lambda x: -x[0][0])
print(candidate_scores[:10])
which produces
[(array([ 0.99492903, 0.00507097]), ('ses', 'she')),
(array([ 0.99460332, 0.00539668]), ('ses', 'seems')),
(array([ 0.98942973, 0.01057027]), ('ses', 'sheesh')),
(array([ 0.98908823, 0.01091177]), ('ses', 'sake')),
(array([ 0.98908823, 0.01091177]), ('ses', 'some')),
(array([ 0.98788131, 0.01211869]), ('ses', 'singers')),
(array([ 0.97803790, 0.02196210]), ('ses', 'send')),
(array([ 0.97687055, 0.02312945]), ('ses', 'shine')),
(array([ 0.97687055, 0.02312945]), ('ses', 'space')),
(array([ 0.97633783, 0.02366217]), ('ses', 'says'))]
The correct token is ‘uses’, which it doesn’t get in the top ten candidates. Here are, for a few other words, the top ten candidates and the probability of them being a match:
| mummmyyy | | m0rning | | mannz | | baak | | bekause | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mummy | 0.98 | morning | 1.00 | man | 0.99 | back | 0.99 | because | 1.00 |
| mommy | 0.65 | might | 0.89 | mean | 0.99 | break | 0.98 | breakfast | 0.93 |
| may | 0.11 | missing | 0.89 | meant | 0.94 | balkan | 0.97 | been | 0.91 |
| much | 0.10 | mine | 0.89 | may | 0.92 | bad | 0.96 | bike | 0.88 |
| me | 0.05 | marketing | 0.86 | mad | 0.92 | bike | 0.94 | bias | 0.87 |
| mouth | 0.05 | turning | 0.83 | mais | 0.74 | baby | 0.89 | be | 0.82 |
| sum | 0.05 | burning | 0.83 | make | 0.74 | bias | 0.89 | babes | 0.79 |
| maybe | 0.01 | fronting | 0.81 | made | 0.74 | be | 0.88 | break | 0.76 |
| mad | 0.01 | printing | 0.80 | me | 0.71 | bastards | 0.86 | bread | 0.74 |
| man | 0.01 | training | 0.78 | my | 0.71 | bread | 0.78 | please | 0.69 |
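Wrapping the scoring steps into a small helper makes it easy to generate such rankings for any token. Below is a hypothetical suggest function, assuming (as the output above suggests) that the first column returned by predict_proba is the match probability:
def suggest(model, fe, word, vocabulary, n=10):
    # Score `word` against every candidate in `vocabulary` and return the
    # n candidates with the highest match probability.
    vocabulary = list(vocabulary)
    pairs = [(word, candidate) for candidate in vocabulary]
    probabilities = model.predict_proba(fe.transform(pairs))
    scores = zip((p[0] for p in probabilities), vocabulary)
    return sorted(scores, reverse=True)[:n]
Calling, for example, suggest(model, fe, 'mummmyyy', candidates) should produce a ranking like the first column above, up to the candidate set used.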
Conclusion
The model gives pretty good spelling recommendations, although the examples we looked at are not particularly difficult. In a follow-up post I’ll discuss how to
- add character features (the current model still cannot differentiate between hat and cat),
- regularize, and
- use a different optimizer.