1. PSP Table
PSP2 | Personal Software Process Stages | Estimated Time (min) | Actual Time (min) |
---|---|---|---|
Planning | Plan | 30 | 10 |
· Estimate | · Estimate how long this task will take | 30 | 10 |
Development | Develop | 550 | 980 |
· Analysis | · Requirements analysis (including learning new technologies) | 60 | 180 |
· Design Spec | · Write the design document | 60 | 30 |
· Design Review | · Design review (review the design document with peers) | 30 | 30 |
· Coding Standard | · Coding standard (define a suitable standard for this project) | 20 | 10 |
· Design | · Detailed design | 60 | 40 |
· Coding | · Coding | 180 | 540 |
· Code Review | · Code review | 20 | 30 |
· Test | · Testing (self-testing, fixing code, committing changes) | 120 | 120 |
Reporting | Report | 90 | 180 |
· Test Report | · Test report | 60 | 120 |
· Size Measurement | · Measure workload | 10 | 30 |
· Postmortem & Process Improvement Plan | · Postmortem and process improvement plan | 20 | 30 |
Total | | 730 | 1170 |
2. Problem-Solving Approach
2.1 Requirements Analysis
2.1.1 Output the 10 most frequent letters in an English text, sorted in dictionary order. Usage:
wf.exe -c <file name>
2.1.2 Output the N most frequent words in an English text. Usage:
wf.exe -f <file> // output every distinct word in the file, ordered from most to least frequent; words with equal counts are ordered lexicographically.
wf.exe -d <directory> // given a directory, run the equivalent of wf.exe -f <file> on every file in it.
wf.exe -d -s <directory> // same as above, but recursively traverses all subdirectories.
The -n parameter is supported to output only the top n most frequent words.
2.1.3 Support stop words. Usage:
wf.exe -x <stopwordfile> -f <file>
2.1.4 Output phrases, sorted by frequency. Usage:
-p <number> // the <number> parameter specifies how many words each phrase contains.
2.1.5 Normalize verb forms before counting. Usage:
wf.exe -v <verb file> // <verb file> is a text file recording verb forms.
As the above shows, this task is driven from the command line.
2.2 Approach
Following the steps given by the instructor, I implemented each feature one by one, optimizing earlier code along the way. I decided to complete the entire project in Python, partly as a challenge to myself. First the functions for each feature had to be defined; Python's sys and getopt modules handle program invocation, so each function can be driven from the command line. The re module covers the stop-words requirement, and the os module handles file operations.
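As a minimal sketch of this idea (the function and variable names here are illustrative, not the project's actual code), getopt can turn the flag string into a small dictionary:

```python
import getopt

def parse_args(argv):
    # "hcfds" are plain flags; "n:x:p:v:" each take an argument
    opts, args = getopt.getopt(argv, "hcfdsn:x:p:v:")
    flags = {opt.lstrip('-'): (arg if arg else True) for opt, arg in opts}
    return flags, args
```

For example, parse_args(['-f', '-n', '10', 'book.txt']) yields the flags {'f': True, 'n': '10'} and the remaining argument ['book.txt'].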
2.3 Research
Since I had never done a similar project before, all of these libraries were new to me, so I spent a long time learning them.
I also consulted material on pushing the local repository to Gitee and on deploying the blog, collected here as well.
The Hexo installation and setup were finished locally, but visiting the blog page kept returning 404; after spending a long time fixing the 404, new problems appeared, so this post was not published on Hexo.
3. Design and Implementation
3.1 Code Organization
Import the required libraries (logging is added here because main() below uses it):
import io
import logging
import os
import getopt
import operator
import string
import sys
Collecting files:
def eachFile(filepath, txt_file_list):
    pathDir = os.listdir(filepath)  # list the entry names under the current path
    for s in pathDir:
        newDir = os.path.join(filepath, s)  # append the name to the current path
        if os.path.isfile(newDir):
            if os.path.splitext(newDir)[1] == ".txt":  # keep only .txt files
                txt_file_list.append(newDir)
            else:
                pass
        else:
            eachFile(newDir, txt_file_list)  # not a file: recurse into the directory
Character-type predicates:
def is_lower_letter(chatr):
    return 97 <= ord(chatr) <= 122

def is_upper_letter(chatr):
    return 65 <= ord(chatr) <= 90

def is_digit(chatr):
    return 48 <= ord(chatr) <= 57

def is_space(chatr):
    return ord(chatr) in (9, 10, 13, 32)

# Character sets used by the counting functions below
all_lower_letters = set(string.ascii_lowercase)
all_upper_letters = set(string.ascii_uppercase)
all_digits = set(string.digits)
all_spaces = {'\t', '\n', '\r', ' '}

It is worth noting that regardless of the case of the original character, we normalize everything to lowercase.
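A quick sanity check (a sketch, not part of the submitted program) confirms that these ord-range predicates agree with Python's built-in str methods on ASCII, while deliberately rejecting non-ASCII letters:

```python
import string

def is_lower_letter(chatr):
    return 97 <= ord(chatr) <= 122

# Agrees with str.islower() on ASCII letters...
assert all(is_lower_letter(c) for c in string.ascii_lowercase)
assert not any(is_lower_letter(c) for c in string.ascii_uppercase + string.digits)
# ...but not on accented letters, which str.islower() accepts
assert 'é'.islower() and not is_lower_letter('é')
```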
Counting letter frequencies:
def calculate_character_freq(filename, flag_xx=False, stop_words=None):
    chatr_dict = dict.fromkeys(string.ascii_letters, 0)
    with open(filename, 'r', encoding='utf-8') as in_file:
        all_chatrs = in_file.read()
    if flag_xx:
        for stop_word in stop_words:
            all_chatrs = all_chatrs.replace(' ' + stop_word + ' ', ' ') \
                                   .replace(' ' + stop_word + '\n', '\n') \
                                   .replace('\n' + stop_word + ' ', '\n')
    for chatr in all_chatrs:
        try:
            chatr_dict[chatr] += 1
        except KeyError:  # skip characters that are not letters
            pass
    return chatr_dict
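Requirement 2.1.1 also asks for the top 10 letters with ties in dictionary order. A compact way to get that from a frequency table like the one above (a sketch using collections.Counter, not the project's code) is a single sort on a (-count, letter) key:

```python
from collections import Counter
import string

def top_letters(text, n=10):
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    # negative count puts high frequencies first; the letter breaks ties alphabetically
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

For instance, top_letters("Banana bread", 3) gives [('a', 4), ('b', 2), ('n', 2)]: 'b' and 'n' both occur twice, and the tie is broken alphabetically.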
Counting word frequencies:
def calculate_word_freq(filename):
    word_dict = {}
    with open(filename, 'r', encoding='utf-8') as in_file:
        all_chatrs = in_file.read()
    started = False
    word = ""
    for chatr in all_chatrs:
        if started:
            if (chatr in all_lower_letters) or (chatr in all_digits):
                word += chatr
            elif chatr in all_upper_letters:
                word += chatr.lower()
            else:  # the word ends here
                started = False
                try:
                    word_dict[word] += 1
                except KeyError:
                    word_dict[word] = 1
                word = ""
        else:
            if chatr in all_lower_letters:
                started = True
                word += chatr
            elif chatr in all_upper_letters:
                started = True
                word += chatr.lower()
            else:
                pass
    if word:  # flush a word that runs to the end of the file
        word_dict[word] = word_dict.get(word, 0) + 1
    return word_dict
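The state machine above can be cross-checked against a much shorter regular-expression tokenizer (a sketch assuming the same word rule: a word starts with a letter and continues with letters or digits, case-folded):

```python
import re
from collections import Counter

def word_freq_re(text):
    # lowercase first, then take letter-initial runs of letters and digits
    return Counter(re.findall(r'[a-z][a-z0-9]*', text.lower()))
```

For example, word_freq_re("It is what it is.") returns Counter({'it': 2, 'is': 2, 'what': 1}).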
Counting phrase frequencies:
def calculate_phrase_freq(filename, phrase_length):
    phrase_dict = {}
    with open(filename, 'r', encoding='utf-8') as in_file:
        all_contents = in_file.read()  # TODO: handle files larger than memory
    started = False
    previous_words = []
    previous_words_num = 0
    current_word = ""
    for chatr in all_contents:
        if started:
            if (chatr in all_lower_letters) or (chatr in all_digits):
                current_word += chatr
            elif (chatr in all_spaces) and current_word:
                previous_words += [current_word]
                previous_words_num += 1
                current_word = ""
                if previous_words_num == phrase_length:
                    phrase = ' '.join(previous_words)
                    if phrase in phrase_dict:
                        phrase_dict[phrase] += 1
                    else:
                        phrase_dict[phrase] = 1
                    previous_words = previous_words[1:]  # slide the window by one word
                    previous_words_num -= 1
            elif chatr in all_upper_letters:
                current_word += chatr.lower()
            else:  # punctuation: flush the window and start over
                if current_word:
                    previous_words += [current_word]
                    previous_words_num += 1
                    if previous_words_num == phrase_length:
                        phrase = ' '.join(previous_words)
                        if phrase in phrase_dict:
                            phrase_dict[phrase] += 1
                        else:
                            phrase_dict[phrase] = 1
                started = False
                previous_words = []
                previous_words_num = 0
                current_word = ""
        else:
            if chatr in all_lower_letters:
                started = True
                current_word += chatr
            elif chatr in all_upper_letters:
                started = True
                current_word += chatr.lower()
            else:
                pass
    return phrase_dict
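The sliding window above can also be expressed with zip over shifted copies of the token list (a simplified sketch: unlike the function above, it does not reset the window at punctuation):

```python
import re
from collections import Counter

def phrase_freq_simple(text, length):
    words = re.findall(r'[a-z][a-z0-9]*', text.lower())
    # zip the token list against itself shifted by 1..length-1 to form every window
    return Counter(' '.join(w) for w in zip(*(words[i:] for i in range(length))))
```

On "the cat sat on the cat sat" with length 2, this counts 'the cat' and 'cat sat' twice each and every other bigram once.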
Loading the verb table (the -v option):
def get_verb_format_dict(verb_file):
    verb_dict = {}
    with open(verb_file, 'r', encoding='utf-8') as infile:
        all_verb_lines = infile.readlines()
    for verb_line in all_verb_lines:
        verb_info = verb_line.strip().split(' -> ')
        origin_verb = verb_info[0]
        deformed_verbs = verb_info[1].split(',')
        for deformed_verb in deformed_verbs:
            # Note: we ignore the case where one deformed verb has more than one origin verb.
            verb_dict[deformed_verb] = origin_verb
    return verb_dict
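The expected line format, inferred from the parsing above, is `origin -> form1,form2,...`. A string-based sketch of the same logic makes that easy to verify without a file on disk:

```python
def parse_verb_lines(lines):
    # maps every deformed form back to its origin verb
    verb_dict = {}
    for line in lines:
        origin, _, forms = line.strip().partition(' -> ')
        for form in forms.split(','):
            verb_dict[form] = origin
    return verb_dict
```

Given the single line "be -> is,are,was,were,been", this produces a dictionary sending all five forms to 'be'.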
Loading the stop words file (the -x option):
def drop_stop_words(filename, stop_file):
    with open(stop_file, 'r', encoding='utf-8') as infile:
        stop_words = [line.strip('\n') for line in infile.readlines()]
    with open(filename, 'r', encoding='utf-8') as in_file:
        all_lines = in_file.read()
    # Accumulate the replacements; replacing from all_lines inside the loop
    # would keep only the last stop word's substitution.
    new_all_lines = all_lines.lower()
    for stop_word in stop_words:
        new_all_lines = new_all_lines.replace(' ' + stop_word + ' ', ' ') \
                                     .replace(' ' + stop_word + '\n', '\n') \
                                     .replace('\n' + stop_word + ' ', '\n')
    new_all_lines = new_all_lines.split('\n')
    return new_all_lines
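Section 2.2 mentioned the re module for stop words; a regex with \b word boundaries removes every stop word in a single pass and also catches words at the very start or end of the text, which the space-padded replace above misses (a sketch, not the submitted implementation):

```python
import re

def drop_stop_words_re(text, stop_words):
    # \b matches word boundaries, so only whole words are removed
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\b')
    return pattern.sub('', text.lower())
```

On "The theme of the day" with stop words ["the", "of"], both occurrences of "the" and the "of" are removed, while "theme" is left intact.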
Printing the results:
def print_word_dict(input_dict, filename, stop_words=None, flag_pp=False, flag_ff=False, reverse=True):
    if flag_ff:
        for stop_word in stop_words:
            input_dict.pop(stop_word, None)
    word_list = [(key, input_dict[key]) for key in input_dict.keys()]
    # Two stable sorts: lexicographic first, then by descending count,
    # so equal counts stay in dictionary order.
    word_list.sort(key=operator.itemgetter(0), reverse=False)
    word_list.sort(key=operator.itemgetter(1), reverse=True)
    if not flag_pp:
        for key, value in word_list:
            print("%40s\t%d" % (key, value))
    else:
        for key, value in word_list:
            # skip phrases that contain a stop word
            if not any(stop_word in key for stop_word in stop_words):
                print("%40s\t%d" % (key, value))
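The two consecutive stable sorts are what give "count descending, ties in dictionary order". The same ordering can be obtained with one sort on a composite key, which this sketch (with illustrative sample data) checks:

```python
import operator

def two_pass(items):
    items = sorted(items, key=operator.itemgetter(0))     # lexicographic first
    items.sort(key=operator.itemgetter(1), reverse=True)  # stable: count decides
    return items

def one_pass(items):
    return sorted(items, key=lambda kv: (-kv[1], kv[0]))  # composite key

sample = [('to', 5), ('and', 9), ('the', 9), ('a', 5)]
```

Both orderings place ('and', 9) before ('the', 9) and ('a', 5) before ('to', 5).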
3.2 Function Dispatch
The main function is as follows:
def main(argv):
    usage = ('usage : wf.py -c <count> -f <frequency> -d <directory> '
             '-n <number> -x <stopword file> -p <number2> -v <verb file>')
    try:
        opts, args = getopt.getopt(argv, "hcfdsn:x:p:v:")  # h, c, f, d, s take no argument
    except getopt.GetoptError:
        logging.error(usage)
        sys.exit(1)
    flag_c = flag_f = flag_d = flag_n = flag_x = flag_p = flag_v = flag_s = False
    for opt, arg in opts:
        if opt == '-h':
            logging.info(usage)
            sys.exit(0)
        elif opt == '-c':
            flag_c = True
        elif opt == '-f':
            flag_f = True
        elif opt == '-d':
            flag_d = True
        elif opt == '-s':
            flag_s = True
        elif opt == '-n':
            top_number = arg
            flag_n = True
        elif opt == '-x':
            stop_file = arg
            flag_x = True
        elif opt == '-p':
            phrase_length = arg
            flag_p = True
        elif opt == '-v':
            verb_file = arg
            flag_v = True
        else:
            logging.error(usage)
            sys.exit(1)
    txt_file_list = []  # txt_file_list stores the txt files' names
    if flag_d and flag_s:
        # -d -s must be checked before plain -d, otherwise -s never recurses
        eachFile(str(args[0]), txt_file_list)
    elif flag_d:
        directory_name = args[0]
        for elem in os.listdir(str(directory_name)):
            if os.path.splitext(elem)[1] == '.txt':
                txt_file_list.append(os.path.join(directory_name, elem))
    else:
        txt_file_list.append(os.path.join('./', args[0]))
    # Process every collected file; do not return inside the loop,
    # otherwise -d would stop after the first file.
    for txt_file in txt_file_list:
        if flag_x:
            with open(stop_file, 'r', encoding='utf-8') as infile:
                stop_words = [line.strip('\n') for line in infile.readlines()]
            if flag_v:
                if flag_p:
                    results = calculate_phrase_freq_with_v(txt_file, verb_file, int(phrase_length))
                    if flag_n:
                        print_word_dict_top_n(results, txt_file, top_number, stop_words, flag_pp=True)
                    else:
                        print_word_dict(results, txt_file, stop_words, flag_pp=True)
                elif flag_c:
                    results = calculate_character_freq_with_v(txt_file, verb_file, flag_xx=True,
                                                              stop_words=stop_words)
                    if flag_n:
                        print_word_dict_top_n(results, txt_file, top_number)
                    else:
                        print_word_dict(results, txt_file)
                elif flag_f:
                    results = calculate_word_freq_with_v(txt_file, verb_file)
                    if flag_n:
                        print_word_dict_top_n(results, txt_file, top_number, stop_words, flag_ff=True)
                    else:
                        print_word_dict(results, txt_file, stop_words, flag_ff=True)
                else:
                    raise ValueError("You must specify one of the <-f -c -p>")
            else:
                if flag_p:
                    results = calculate_phrase_freq(txt_file, int(phrase_length))
                    if flag_n:
                        print_word_dict_top_n(results, txt_file, top_number, stop_words, flag_pp=True)
                    else:
                        print_word_dict(results, txt_file, stop_words, flag_pp=True)
                elif flag_f:
                    results = calculate_word_freq(txt_file)
                    if flag_n:
                        print_word_dict_top_n(results, txt_file, top_number, stop_words, flag_ff=True)
                    else:
                        print_word_dict(results, txt_file, stop_words, flag_ff=True)
                elif flag_c:
                    results = calculate_character_freq(txt_file, flag_xx=True, stop_words=stop_words)
                    if flag_n:
                        print_word_dict_top_n(results, txt_file, top_number)
                    else:
                        print_word_dict(results, txt_file)
                else:
                    raise ValueError("You must specify one of the <-f -c -p>")
        else:  # no stop-words file given
            if flag_v:
                if flag_p:
                    results = calculate_phrase_freq_with_v(txt_file, verb_file, int(phrase_length))
                elif flag_c:
                    results = calculate_character_freq_with_v(txt_file, verb_file)
                elif flag_f:
                    results = calculate_word_freq_with_v(txt_file, verb_file)
                else:
                    raise ValueError("You must specify one of the <-f -c -p>")
            else:
                if flag_p:
                    results = calculate_phrase_freq(txt_file, int(phrase_length))
                elif flag_f:
                    results = calculate_word_freq(txt_file)
                elif flag_c:
                    results = calculate_character_freq(txt_file)
                else:
                    raise ValueError("You must specify one of the <-f -c -p>")
            if flag_n:
                print_word_dict_top_n(results, txt_file, top_number)
            else:
                print_word_dict(results, txt_file)
3.3 Unit Testing
Unit tests exercise individual branches of the program and can begin during coding; any result that misses expectations is analyzed to find the cause of the error and how to fix it. Below are the commands executed on the command line together with their output:
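The command-line checks below were run by hand; the same properties could be pinned down in automated tests. A sketch using unittest against a hypothetical pure helper (letter_freq stands in for calculate_character_freq operating on a string instead of a file):

```python
import string
import unittest
from collections import Counter

def letter_freq(text):
    # hypothetical string-based stand-in for calculate_character_freq
    return Counter(c for c in text.lower() if c in string.ascii_lowercase)

class TestLetterFreq(unittest.TestCase):
    def test_case_folding(self):
        self.assertEqual(letter_freq("AaBb")['a'], 2)

    def test_non_letters_ignored(self):
        self.assertEqual(sum(letter_freq("123 ?!").values()), 0)

# run with: python -m unittest <this file>
```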
Input: wf.exe -c pride-and-prejudice.txt -n 10
Output: e 12.69%
t 8.67%
a 7.83%
o 7.21%
n 6.86%
h 6.65%
s 5.92%
r 5.82%
i 5.63%
d 4.72%
Input: wf.exe -f pride-and-prejudice.txt -n 20
Output: the 19199
and 15902
to 9997
of 8611
she 8435
her 8314
a 7685
in 6007
was 5975
i 5406
you 5239
he 4921
that 4598
had 4493
it 4486
on 3766
with 3318
for 3309
s 3193
his 3140
Input: wf.exe -x stopwords.txt -f pride-and-prejudice.txt -n 10
Output: a 7685
in 6007
was 5975
i 5406
you 5239
he 4921
that 4598
had 4493
it 4486
s 3766
Input: wf.exe -v verbs.txt -f pride-and-prejudice.txt -n 10
Output: the 19193
and 15903
be 14121
to 9997
of 8611
she 8435
her 8314
a 7685
have 7556
in 6007
Input: wf.exe -p 3 pride-and-prejudice.txt -n 10
Output: she does not 315
there is a 306
she could not 273
for a moment 210
going to 201
it was a 192
there was no 177
it would be 156
ve got to 138
she would have 132
Input: wf.exe -v verbs.txt -p 2 pride-and-prejudice.txt -n 10
Output: of the 1824
in the 1548
and the 1420
she have 1339
it be 1192
there be 1092
she be 1026
have be 1010
he be 829
and she 822
Input: wf.exe -f -d pride-and-prejudice.txt -s -n 10
Output: the 19193
and 15903
to 9997
of 8611
she 8435
her 8314
a 7685
in 6007
was 5975
i 5406
Input: wf.exe -p 3 -d pride-and-prejudice.txt -n 10
Output: she does not 315
there was a 306
she could not 273
for a moment 210
m going to 201
it was a 192
there was no 177
it would be 156
ve got to 138
she would have 132
Input: wf.exe -x stopwords.txt -f -d pride-and-prejudice.txt -n 10
Output: a 7685
in 6007
was 5975
i 5406
you 5239
he 4921
that 4598
had 4493
it 4486
s 3766
Input: wf.exe -v verbs.txt -p 2 -d testFile -n 10
Output: of the 1726
in the 1680
and the 1388
she have 1342
it be 1268
there be 1028
she be 1026
have be 1002
he be 952
and she 822
Input: wf.exe -v verbs.txt -f -d testFile -n 10
Output: the 19193
and 15903
be 14121
to 9997
of 8611
she 8435
her 8314
a 7685
have 7556
in 6007
4. Performance Testing
4.1 Tooling
Profiling was done with the Profile tool in PyCharm Professional.
4.2 Test File
The test file was produced by repeatedly concatenating pride-and-prejudice.txt until it exceeded 500 MB.
4.3 Results
The profiling results are shown in the figure below.
The results show that the function testing whether a character is a lowercase letter is called the most often and consumes the most time.
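To quantify that hotspot, a micro-benchmark can compare the per-character ord-range test with a set-membership test (a sketch with illustrative input; absolute timings depend on the machine, so none are asserted here):

```python
import string
import timeit

LOWER = set(string.ascii_lowercase)

def count_via_ord(text):
    return sum(1 for c in text if 97 <= ord(c) <= 122)

def count_via_set(text):
    return sum(1 for c in text if c in LOWER)

text = "Pride and Prejudice " * 5000
assert count_via_ord(text) == count_via_set(text)  # same answer either way
t_ord = timeit.timeit(lambda: count_via_ord(text), number=20)
t_set = timeit.timeit(lambda: count_via_set(text), number=20)
print(f"ord-range: {t_ord:.3f}s  set-membership: {t_set:.3f}s")
```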
5. Summary and Improvements
Completing this project deepened my understanding of what this software engineering course teaches. As the textbook puts it, "software engineering is not the same as programming", and I experienced that firsthand. Programming is important, yet it is only a small part of the whole; a complete engineering effort also includes requirements analysis, design, testing, improvement, maintenance, and more. Along the way I learned to use more Python libraries and came to understand the language better. Because much of this had to be learned on the spot, I spent a great deal of time looking things up. Tutorials are plentiful online, but it is best not to skim too many: pick one that suits you, follow it through, and consult other sources only for the side issues that come up.
As the profiling showed, classifying input characters is the most expensive work. I later abandoned the approach of unconditionally lowercasing everything and instead convert only when an uppercase letter actually appears, which is noticeably faster. For combined options such as -v and -p, we can also scan the txt file once instead of twice (first handling -v, then -p). Further small details like these remain to be optimized for speed.
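The single-scan idea for combined -v and -p can be sketched as follows: each token is normalized through the verb dictionary the moment it is produced, so the text is read only once (the function and test data are illustrative; verb_dict has the deformed-form → origin shape built by get_verb_format_dict):

```python
import re
from collections import Counter

def phrase_freq_with_verbs(text, verb_dict, length):
    window, counts = [], Counter()
    for token in re.findall(r'[a-z][a-z0-9]*', text.lower()):
        # normalize the verb form before it ever enters the phrase window
        window.append(verb_dict.get(token, token))
        if len(window) == length:
            counts[' '.join(window)] += 1
            window.pop(0)
    return counts
```

On "she was here and she was there" with verb_dict {'was': 'be'} and length 2, the normalized bigram 'she be' is counted twice in a single pass.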