deep-learning-with-pytorch p1ch4: end-of-chapter exercise 2

Article link: https://blog.csdn.net/weixin_43972097/article/details/107890190

2. Select a relatively large file containing Python source code.

a. Build an index of all the words in the source file (feel free to make your tokenization as simple or as complex as you like; we suggest starting with replacing r"[^a-zA-Z0-9_]+" with spaces).

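Here I downloaded the Flask source code. The first step is to clean the corpus, using re, the regular-expression module from the Python standard library.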

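import re

# Read the whole source file into one string (path on the author's machine)
with open(r'F:\Users\asus\JupyterProjects\data\flask.py') as f:
    text = f.read()
# Delete every character that is neither a word character nor whitespace
text = re.sub(r'[^\w\s]','',text)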

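# Same as in the earlier chapter; fairly simple
def clean_words(input_str):
    punctuation = ',;:"!?”“_-'
    # Lowercase and split on whitespace; str.replace() cannot take a regex,
    # and punctuation was already removed by the re.sub call above
    word_list = input_str.lower().split()
    # Strip leftover punctuation, including leading and trailing underscores
    word_list = [word.strip(punctuation) for word in word_list]
    return word_list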

text = clean_words(text)
text_list = sorted(set(text))
text_listdict = {word:i for (i,word) in enumerate(text_list)}

len(text_listdict)
Out[114]
1103

text_listdict['1px']
Out[115]
70
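
The exercise statement suggests replacing matches of r"[^a-zA-Z0-9_]+" with spaces rather than deleting punctuation. The following is a minimal sketch of that variant, not the code that produced the counts above. The helper name tokenize and the one-line sample string are made up for illustration, and running this on the Flask file would likely give a different vocabulary size.

```python
import re

def tokenize(source):
    """Lowercase, turn each run of non-identifier characters into a space, then split."""
    return re.sub(r"[^a-zA-Z0-9_]+", " ", source.lower()).split()

# Hypothetical sample line, not taken from the Flask file
sample = "app = flask.Flask(__name__)"

# Variant suggested by the exercise: punctuation becomes a separator
tokens = tokenize(sample)

# Deleting punctuation instead glues neighbouring identifiers together
glued = re.sub(r"[^\w\s]", "", sample.lower()).split()

# Word-to-index mapping over the sorted set of distinct tokens
index = {word: i for i, word in enumerate(sorted(set(tokens)))}

print(tokens)
print(glued)
print(index)
```

With replacement by spaces, flask and __name__ stay separate tokens. With deletion, they merge into the single token flaskflask__name__, which inflates the index with words that never appear in the source.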

b. Compare your index with the one we made for Pride and Prejudice. Which is larger?

The Pride and Prejudice index is larger. My text_listdict has 1103 entries, while the Pride and Prejudice index has more than 7000.

c. Create the one-hot encoding for the source code file.

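import torch

# One row per token in the file, one column per distinct word in the index
word_t = torch.zeros(len(text),len(text_listdict))

# Set a single 1 in each row, at the column of that token's index
for i,word in enumerate(text):
    word_index = text_listdict[word]
    word_t[i][word_index] = 1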
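
The loop above sets one element at a time. As a sketch of a vectorized alternative, torch.nn.functional.one_hot can build the same matrix from a tensor of word indices. The toy token list and the names vocab, indices and loop_version are assumptions standing in for the cleaned Flask tokens and their index.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the cleaned token list (not the real Flask tokens)
tokens = ["def", "run", "self", "def", "stop", "self"]
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}

# Map every token to its integer index; one_hot expects an int64 tensor
indices = torch.tensor([vocab[word] for word in tokens])

# Shape: (number of tokens, vocabulary size), converted to float to match torch.zeros
one_hot = F.one_hot(indices, num_classes=len(vocab)).float()

# Element-by-element loop, for comparison
loop_version = torch.zeros(len(tokens), len(vocab))
for i, word in enumerate(tokens):
    loop_version[i, vocab[word]] = 1

print(one_hot.shape)
print(torch.equal(one_hot, loop_version))
```

Passing num_classes explicitly keeps the column count equal to the vocabulary size even if the highest index happens not to occur in the token list.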

d. What information is lost with this encoding? How does that information compare to what's lost in the Pride and Prejudice encoding?

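1. First, the relationships between words are lost. For example, one word may be very likely to be followed by a particular other word, but the one-hot representation cannot express this.
2. Compared with the Flask file, Pride and Prejudice loses somewhat more, because the Flask source file contains fewer words and the code is full of symbols.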
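
The first point can be checked on a toy example. In the sketch below, the token list and the helpers one_hot and dot are invented for illustration. The dot product of any two distinct one-hot vectors is 0, so the vectors themselves say nothing about which words are related or tend to occur together.

```python
from itertools import combinations

# Toy token list, not taken from the Flask file
tokens = ["import", "flask", "from", "flask", "import", "request"]
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}

def one_hot(word):
    """Return a list of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Dot product for every pair of distinct words in the vocabulary
pair_dots = {(a, b): dot(one_hot(a), one_hot(b)) for a, b in combinations(vocab, 2)}

print(pair_dots)
```

Every pair gets the same score, whether the two words always appear next to each other or never do. Learned embeddings are the usual way to recover that kind of similarity.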