C2W1.LAB.Vocabulary Creation+Candidates from String Edits

Lecture notes: C2W1.Auto-correct



Vocabulary Creation

Create a vocabulary from a small corpus.

Imports and Data

Import the packages:

# imports
import re # regular expression library; for tokenization of words
from collections import Counter # collections library; counter: dict subclass for counting hashable objects
import matplotlib.pyplot as plt # for data visualization

The corpus is just a single sentence:

# the tiny corpus of text ! 
text = 'red pink pink blue blue yellow ORANGE BLUE BLUE PINK' 
print(text)
print('string length : ',len(text))

Output:
red pink pink blue blue yellow ORANGE BLUE BLUE PINK
string length : 52

Preprocessing

Since the text contains no special characters, preprocessing is simple: lowercase everything, then tokenize.

# convert all letters to lower case
text_lowercase = text.lower()
print(text_lowercase)
print('string length : ',len(text_lowercase))

Output:
red pink pink blue blue yellow orange blue blue pink
string length : 52

# some regex to tokenize the string to words and return them in a list
words = re.findall(r'\w+', text_lowercase)
print(words)
print('count : ',len(words))

Output:
['red', 'pink', 'pink', 'blue', 'blue', 'yellow', 'orange', 'blue', 'blue', 'pink']
count : 10
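The corpus above has no punctuation, so the pattern `\w+` is barely exercised. As a small sketch (the sample string is made up for illustration and is not part of the lab), here is how the same pattern behaves on text that does contain punctuation:

```python
import re

# Hypothetical sample text, not part of the lab corpus.
sample = "Red, pink; BLUE! Don't stop."

# \w+ matches runs of letters, digits and underscores, so every
# punctuation mark (including the apostrophe) acts as a separator.
tokens = re.findall(r'\w+', sample.lower())
print(tokens)  # ['red', 'pink', 'blue', 'don', 't', 'stop']
```

Note that "Don't" is split into two tokens; a real corpus may need a pattern that treats apostrophes differently.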

Create Vocabulary

Method 1: using a set

# create vocab
vocab = set(words)
print(vocab)
print('count : ',len(vocab))

Output (a set is unordered, so the order may differ between runs):
{'red', 'pink', 'orange', 'blue', 'yellow'}
count : 5

Method 2: using a dictionary with word counts

Using dict.get:

# create vocab including word count
counts_a = dict()
for w in words:
    counts_a[w] = counts_a.get(w,0)+1
print(counts_a)
print('count : ',len(counts_a))

Output:
{'red': 1, 'pink': 3, 'blue': 4, 'yellow': 1, 'orange': 1}
count : 5

Using Counter:

# create vocab including word count using collections.Counter
counts_b = Counter(words) # Counter builds the counts directly; no empty dict is needed first
print(counts_b)
print('count : ',len(counts_b))

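The counts are the same as above, although print shows a Counter object with its entries ordered from most to least frequent.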
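Beyond counting, Counter offers a couple of conveniences that the hand-built dictionary lacks. A minimal sketch, re-declaring the word list so that it runs on its own:

```python
from collections import Counter

words = ['red', 'pink', 'pink', 'blue', 'blue',
         'yellow', 'orange', 'blue', 'blue', 'pink']
counts = Counter(words)

# most_common(n) returns the n most frequent words with their counts
top_two = counts.most_common(2)
print(top_two)          # [('blue', 4), ('pink', 3)]

# a word that never occurred has count 0 instead of raising KeyError
print(counts['green'])  # 0
```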

Visualization

# barchart of sorted word counts
d = {'blue': counts_b['blue'], 'pink': counts_b['pink'], 'red': counts_b['red'], 'yellow': counts_b['yellow'], 'orange': counts_b['orange']}
plt.bar(range(len(d)), list(d.values()), align='center', color=list(d.keys())) # each bar is drawn in the color it names
_ = plt.xticks(range(len(d)), list(d.keys()))

Output:
[Figure: bar chart of the word counts, one bar per color]

Ungraded Exercise

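Note that counts_b, returned by collections.Counter above, is printed in descending order of word count.
Modify the text of the tiny corpus so that a new color appears between pink and red in counts_b.

Do you need to rerun all the cells, or only specific ones?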

# modify the text variable: 'green' now occurs twice
text = 'red pink green pink green blue blue yellow ORANGE BLUE BLUE PINK'

# only the cells that derive counts_b from text need to be rerun
text_lowercase = text.lower()
words = re.findall(r'\w+', text_lowercase)
counts_b = Counter(words)
print(counts_b)
print('count : ', len(counts_b))
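One way to confirm the result without reading the printed Counter by eye is to inspect the ranking directly. This is only a sketch; the names `counts` and `ranking` are introduced here for illustration:

```python
import re
from collections import Counter

text = 'red pink green pink green blue blue yellow ORANGE BLUE BLUE PINK'
counts = Counter(re.findall(r'\w+', text.lower()))

# most_common() lists the words from the highest count to the lowest
ranking = [w for w, _ in counts.most_common()]
print(ranking)

# green (2 occurrences) should sit between pink (3) and red (1)
is_between = ranking.index('pink') < ranking.index('green') < ranking.index('red')
print(is_between)  # True
```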

Candidates from String Edits

Imports and Data

No imports are needed, and the data is a single word:

# data
word = 'dearz' # 🦌

Splits

Find every way to split a word into two parts, including the splits where one part is empty.

# splits with a loop
splits_a = []
for i in range(len(word)+1):
    splits_a.append([word[:i],word[i:]])

for i in splits_a:
    print(i)

Output:
['', 'dearz']
['d', 'earz']
['de', 'arz']
['dea', 'rz']
['dear', 'z']
['dearz', '']

The same can be done with a list comprehension:

# same splits, done using a list comprehension
splits_b = [(word[:i], word[i:]) for i in range(len(word) + 1)]

for i in splits_b:
    print(i)

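The splits are the same as above, except that each one is printed as a tuple, e.g. ('d', 'earz'), rather than a list.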

Delete Edit

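Delete the first letter from the second half (R) of each split in the splits list.
Taken over all splits, this deletes every possible letter of the original word, one at a time.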

# deletes with a loop
splits = splits_a
deletes = []

print('word : ', word)
# iterate over the splits and check that the second half is not empty
for L,R in splits:
    if R: # if R is not empty, print the word with R's first letter removed
        print(L + R[1:], ' <-- delete ', R[0])

Output:
word : dearz
earz <-- delete d
darz <-- delete e
derz <-- delete a
deaz <-- delete r
dear <-- delete z

The breakdown below shows how a single delete works:

# breaking it down
print('word : ', word)
one_split = splits[0]
print('first item from the splits list : ', one_split)
L = one_split[0]
R = one_split[1]
print('L : ', L)
print('R : ', R)
print('*** now implicit delete by excluding the leading letter ***')
print('L + R[1:] : ',L + R[1:], ' <-- delete ', R[0])

Output:
word : dearz
first item from the splits list : ['', 'dearz']
L :
R : dearz
*** now implicit delete by excluding the leading letter ***
L + R[1:] : earz <-- delete d

A list comprehension does the same thing more concisely:

# deletes with a list comprehension
splits = splits_a
deletes = [L + R[1:] for L, R in splits if R]

print(deletes)
print('*** which is the same as ***')
for i in deletes:
    print(i)

Output:
['earz', 'darz', 'derz', 'deaz', 'dear']
*** which is the same as ***
earz
darz
derz
deaz
dear
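The split and delete steps can be combined into one reusable function. The following is a sketch under the assumption that a plain list of strings is the desired return value; the function name `delete_letter` is chosen here for illustration:

```python
def delete_letter(word):
    """Return every string obtained by deleting one letter from word."""
    # all (left, right) splits, including the ones with an empty half
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # drop the first letter of the right half whenever it is not empty
    return [L + R[1:] for L, R in splits if R]

print(delete_letter('dearz'))  # ['earz', 'darz', 'derz', 'deaz', 'dear']
print(delete_letter('too'))    # ['oo', 'to', 'to']
```

As the second call shows, repeated letters produce duplicate candidates, so wrapping the result in set() may be preferable when only distinct strings matter.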

Ungraded Exercise

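The steps above produced deletes, the list of candidate strings created by the delete edit.
The next step is to filter this list and keep only the candidates that appear in the vocabulary.
Given the example vocabulary below, can you think of a way to build the list of candidate words?
['dean','deer','dear','fries','and','coke']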

vocab = ['dean','deer','dear','fries','and','coke']
edits = list(deletes)

print('vocab : ', vocab)
print('edits : ', edits)

candidates=[]

### START CODE HERE ###
#candidates = ??  # hint: 'set.intersection'
#candidates = list(set(edits) & set(vocab))
candidates = list(set(edits).intersection(set(vocab)))
### END CODE HERE ###

print('candidate words : ', candidates)

Note: besides splits and deletes, there are other edit types, such as insert, replace and switch. They are not implemented here and are left for the reader to complete.
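The remaining edit types follow the same split-based pattern as delete. The following is one possible sketch, not the course's reference solution; the variable names `switches`, `replaces` and `inserts`, and the choice of the lowercase English alphabet, are assumptions made for illustration.

```python
import string

word = 'dearz'
letters = string.ascii_lowercase  # assumption: lowercase English letters only
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

# switch: swap the first two letters of the right half
switches = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) >= 2]

# replace: substitute the first letter of the right half,
# discarding the result that equals the original word
replaces = sorted({L + c + R[1:] for L, R in splits if R for c in letters} - {word})

# insert: add one letter between the left and right halves
inserts = sorted({L + c + R for L, R in splits for c in letters})

print(switches)  # ['edarz', 'daerz', 'deraz', 'deazr']
print(len(replaces), len(inserts))
```

Sets are used for replace and insert because different splits can yield the same string, for example inserting 'd' before or after the leading 'd'.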
