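Autocorrect:
Correct misspelled words.
How it works:
- Identify the misspelled word
- Find strings within n edit distance
- Filter the candidates
- Calculate word probabilities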
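The last step can be sketched in a few lines. This is a minimal illustration, assuming the probability of a word is its count divided by the total number of tokens in the corpus (the notes do not spell the formula out). The helper name word_probabilities and the toy corpus are hypothetical.

```python
from collections import Counter

def word_probabilities(tokens):
    """Map each word to count(word) / total number of tokens (assumed definition)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy corpus, purely illustrative.
tokens = "i am happy because i am learning".split()
probs = word_probabilities(tokens)
print(probs["i"])  # 2 occurrences out of 7 tokens
```

A dictionary of this shape is what the probs argument of get_corrections expects.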
Building the model
- Identify a misspelled word: match it against the dictionary
If a word is not in the vocabulary, then it is misspelled.
- Find strings n edit distance away
Edit: an operation performed on a string to change it; each edit changes one letter (or swaps one adjacent pair).
- Insert (add a letter)
Add a letter to a string at any position: to ==> top, two, …
- Delete (remove a letter)
Remove a letter from a string: hat ==> ha, at, ht
- Switch (swap 2 adjacent letters)
Example: eta ==> eat, tea
- Replace (change 1 letter to another)
Example: jaw ==> jar, paw, saw, …
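The four edit operations above can be sketched as one function that returns every string exactly one edit away. This is an illustrative version, not the graded solution: the name one_edit_away is hypothetical, and it assumes a lowercase ASCII alphabet.

```python
import string

def one_edit_away(word, letters=string.ascii_lowercase):
    """Return the set of strings one insert, delete, switch or replace away."""
    # Every way to cut the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = {left + c + right for left, right in splits for c in letters}
    deletes = {left + right[1:] for left, right in splits if right}
    switches = {left + right[1] + right[0] + right[2:]
                for left, right in splits if len(right) > 1}
    replaces = {left + c + right[1:]
                for left, right in splits if right for c in letters}
    edits = inserts | deletes | switches | replaces
    edits.discard(word)  # a replace or switch that leaves the word unchanged is not an edit
    return edits

print(sorted(one_edit_away("hat") & {"ha", "at", "ht"}))  # ['at', 'ha', 'ht']
```

Strings two edits away can be obtained by applying the same function to each result of the first pass.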
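Minimum edit distance
Measures the similarity between two strings: the minimum number of edits needed to transform one string into the other. The algorithm looks for the sequence of edits with the lowest total cost.
Minimum edit distance algorithm
The source word runs down the left column and the target word runs along the top row; cell (0,0) corresponds to the empty string at the start of each word.
D[i,j] is the minimum edit distance between the first i characters of the source word and the first j characters of the target word.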
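The table D can be filled with dynamic programming. The notes do not state the edit costs, so this sketch assumes insert = 1, delete = 1 and replace = 2 as defaults and exposes them as parameters. It does not include the backtrace step.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    """Return (D, distance) where D[i][j] is the cost of turning
    source[:i] into target[:j]. Default costs are an assumption."""
    m, n = len(source), len(target)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # First column: delete every character of the source prefix.
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost
    # First row: insert every character of the target prefix.
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Replacing a letter with itself costs nothing.
            r = 0 if source[i - 1] == target[j - 1] else rep_cost
            D[i][j] = min(D[i - 1][j] + del_cost,   # delete
                          D[i][j - 1] + ins_cost,   # insert
                          D[i - 1][j - 1] + r)      # replace or match
    return D, D[m][n]

_, distance = min_edit_distance("play", "stay")
print(distance)  # 4: two replacements at cost 2 each
```

Setting rep_cost=1 gives the plain Levenshtein distance.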
Assignment
Materials: GitHub link
The final backtrace algorithm is left unfinished.
Part 3-3: suggest spelling suggestions
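Note 1: Short circuit
In Python, the logical operators 'and' and 'or' have two useful properties. They work on any objects, including lists and sets, returning one of their operands rather than a plain True or False, and they have 'short-circuit' behavior: evaluation stops as soon as the result is known.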
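A small demonstration of both properties. The helper lookup and the sample values are hypothetical; it only records which operands were actually evaluated.

```python
calls = []

def lookup(name, result):
    """Record that this operand was evaluated, then return its value."""
    calls.append(name)
    return result

# 'or' returns the first truthy operand and skips the rest.
chosen = lookup("a", []) or lookup("b", ["x"]) or lookup("c", ["y"])
print(chosen)  # ['x']
print(calls)   # ['a', 'b'] : "c" was never evaluated

# 'and' returns the second operand only when the first is truthy.
vocab = {"day", "dye"}
in_vocab = ("day" in vocab) and "day"      # "day"
not_in_vocab = ("dys" in vocab) and "dys"  # False
```

An empty list or set counts as false, which is why a chain of 'or' falls through to the first non-empty candidate collection.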
Note 2: Sort a list of tuples by second item
https://www.geeksforgeeks.org/python-program-to-sort-a-list-of-tuples-by-second-item/
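As a quick illustration, sorted with a key function that picks the second element of each tuple does the job. The words and probabilities below are made up.

```python
# (word, probability) pairs with invented values.
pairs = [("dye", 0.00001), ("days", 0.0004), ("day", 0.0008)]

# key selects the second item; reverse=True puts the highest value first.
by_prob = sorted(pairs, key=lambda pair: pair[1], reverse=True)
print([word for word, _ in by_prob])  # ['day', 'days', 'dye']
```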
Note 3: Set intersection
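Intersecting the candidate strings with the vocabulary keeps only real words. A minimal example with made-up sets:

```python
candidates = {"day", "dyz", "dye", "ays"}
vocab = {"day", "dye", "days"}

# intersection keeps the elements present in both sets (same as candidates & vocab).
known = candidates.intersection(vocab)
print(sorted(known))  # ['day', 'dye']

# An empty intersection is falsy, so it can take part in an 'or' chain.
nothing = {"xyz"}.intersection(vocab)
print(bool(nothing))  # False
```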
# UNQ_C10 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# UNIT TEST COMMENT: Candidate for Table Driven Tests
# GRADED FUNCTION: get_corrections
def get_corrections(word, probs, vocab, n=2, verbose=False):
    '''
    Input:
        word: a user entered string to check for suggestions
        probs: a dictionary that maps each word to its probability in the corpus
        vocab: a set containing all the vocabulary
        n: number of possible word corrections you want returned
    Output:
        n_best: a list of tuples with the most probable n corrected words and their probabilities.
    '''
    suggestions = []
    n_best = []
    ### START CODE HERE ###
    # edit_one_letter and edit_two_letter come from earlier parts of the assignment.
    # Short circuit: take the word itself if it is known, else known words one edit
    # away, else known words two edits away. Wrapping the word in a set keeps
    # list() from splitting it into single characters.
    suggestions = list((word in vocab and {word}) or edit_one_letter(word).intersection(vocab) or edit_two_letter(word).intersection(vocab))
    all_best = [(suggestion, probs[suggestion]) for suggestion in suggestions]
    # Sort by probability, highest first, and keep only the top n.
    n_best = sorted(all_best, key=lambda x: x[1], reverse=True)[:n]
    ### END CODE HERE ###
    if verbose: print("entered word = ", word, "\nsuggestions = ", suggestions)
    return n_best