This is an implementation of a spell checker in Python. The spell checker reads in a corpus file (in this case, ./english.txt) and computes the probability of each word in the corpus. It then uses this information to suggest corrections for a given misspelled word.
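As a rough illustration of the probability step, the sketch below counts how often each word occurs and divides by the total number of words. The helper names tokenize and word_probabilities, the lowercase alphabetic tokenization, and the inline sample text are assumptions made for this example; the project itself reads its words from the corpus file.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic words (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def word_probabilities(words):
    """Map each distinct word to its relative frequency: count / total words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical stand-in for the contents of a corpus file.
sample = "the cat sat on the mat the end"
probs = word_probabilities(tokenize(sample))
print(probs["the"])  # "the" is 3 of the 8 words
```

Frequent words get higher probabilities, which is what lets the checker prefer a common word over a rare one when both are equally close to the misspelling.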
The code contains the following functions:
read_corpus(filename): reads in a corpus file and returns a list of all words in the file.
split(word): returns a list of all possible ways to split a word into two parts.
delete(word): returns a list of all possible words that can be generated by deleting one character from the input word.
swap(word): returns a list of all possible words that can be generated by swapping adjacent characters in the input word.
replace(word): returns a list of all possible words that can be generated by replacing one character in the input word with a letter from the alphabet.
insert(word): returns a list of all possible words that can be generated by inserting one character from the alphabet into the input word.
edit1(word): returns a set of all possible words that can be generated by applying one edit operation (i.e. delete, swap, replace, or insert) to the input word.
edit2(word): returns a set of all possible words that can be generated by applying two edit operations to the input word.
correct_spelling(word, vocabulary, word_probabilities): takes a misspelled word and returns a list of suggested corrections, along with their probabilities. The suggested corrections are generated by applying edit operations to the input word and selecting the correction with the highest probability of being the intended word.
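The edit operations described above could be sketched roughly as follows. This is one possible implementation written for illustration, not necessarily the project's own code; the ALPHABET constant and the restriction to lowercase ASCII letters are assumptions.

```python
import string

# Assumption: candidate letters are limited to lowercase ASCII.
ALPHABET = string.ascii_lowercase

def split(word):
    """All (left, right) pairs with left + right == word."""
    return [(word[:i], word[i:]) for i in range(len(word) + 1)]

def delete(word):
    """Drop the first character of each non-empty right part."""
    return [left + right[1:] for left, right in split(word) if right]

def swap(word):
    """Exchange the first two characters of each right part of length >= 2."""
    return [left + right[1] + right[0] + right[2:]
            for left, right in split(word) if len(right) > 1]

def replace(word):
    """Substitute each character with every alphabet letter.
    Note: this also yields the input word itself (a letter replaced by itself)."""
    return [left + c + right[1:]
            for left, right in split(word) if right for c in ALPHABET]

def insert(word):
    """Insert every alphabet letter at every position."""
    return [left + c + right for left, right in split(word) for c in ALPHABET]

def edit1(word):
    """Set of strings one edit operation away from word."""
    return set(delete(word) + swap(word) + replace(word) + insert(word))

def edit2(word):
    """Set of strings two edit operations away from word."""
    return {e2 for e1 in edit1(word) for e2 in edit1(e1)}

print(delete("cat"))  # ['at', 'ct', 'ca']
print(swap("cat"))    # ['act', 'cta']
```

Most strings produced this way are not real words, so the results are only useful once they are intersected with the vocabulary.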
The SpellChecker class reads in a corpus file and stores the vocabulary, word counts, and word probabilities. It also provides a method check(word) that takes a misspelled word and returns a list of suggested corrections, sorted by probability.
To use the spell checker, create an instance of the SpellChecker class with the path to the corpus file as an argument.
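A minimal, self-contained sketch of how such a class might look and be used is shown below. The attribute names, the compact edit1 helper, the strategy of trying one-edit candidates before two-edit candidates, and the temporary corpus file are all assumptions for illustration; the actual class may differ.

```python
import os
import re
import string
import tempfile
from collections import Counter

def edit1(word):
    """Compact version of the one-edit operations (delete, swap, replace, insert)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

class SpellChecker:
    def __init__(self, corpus_path):
        # Read the corpus and keep vocabulary, counts, and probabilities.
        with open(corpus_path, encoding="utf-8") as f:
            words = re.findall(r"[a-z]+", f.read().lower())
        self.word_counts = Counter(words)
        self.vocabulary = set(words)
        total = sum(self.word_counts.values())
        self.word_probabilities = {w: c / total for w, c in self.word_counts.items()}

    def check(self, word):
        """Return (suggestion, probability) pairs, most probable first."""
        word = word.lower()
        if word in self.vocabulary:
            candidates = {word}
        else:
            # Assumed strategy: prefer one-edit candidates, fall back to two edits.
            candidates = edit1(word) & self.vocabulary
            if not candidates:
                two_edits = {e2 for e1 in edit1(word) for e2 in edit1(e1)}
                candidates = two_edits & self.vocabulary
        pairs = [(w, self.word_probabilities[w]) for w in candidates]
        return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hypothetical tiny corpus written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the cat sat on the mat and the cat ran")
    corpus_path = tmp.name

checker = SpellChecker(corpus_path)
print(checker.check("cta"))  # [('cat', 0.2)]
os.remove(corpus_path)
```

With the real corpus, the temporary file would be replaced by the path to ./english.txt.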
A Google Colab notebook of this implementation is also available.