Module 5: Advanced text processing 1
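Preface
I have tried to be as detailed as possible and to explain every step. There is more than one correct answer, so if you have a better solution, feel free to leave a comment. The other chapters of this course are available on my homepage.
Module 5 is about processing lines of text. It covers the following patterns: filtering and transforming text lines, extracting words, line-by-line summaries, and processing based on fixed word positions.
Blank lines and # comments are not counted in the line numbers used in the explanations (line1, line2, ...), so skip them when you count lines of code.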
Question1: End in end
Your task is to write a program that loops through the full Jane Eyre novel and counts those lines which end in the word “end” and prints out the result. We stored the novel in the jane_eyre.txt file.
Let your program print out the following message:
There are 4 lines that end in 'end'.
Where 4 is the number of lines that your program should count. As a reference, these are the lines that you should find:
Make sure that you only count lines which end in the word “end”, not just the characters. To do this, you could, for example, put a white space in front of the word, to make sure you don’t get longer words whose last three letters happen to be “end”.
Task: open the text file, keep only the lines that end in the word "end", count them, and print the result in the required format.
count = 0
for line in open("jane_eyre.txt"):
    line_strip = line.rstrip()
    if line_strip.endswith(" end"):
        count += 1
print("There are "+str(count)+" lines that end in 'end'.")
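line1: Create a variable count with the value 0. It is the counter that is updated inside the loop.
line2: Loop over the file; each line of the file is assigned in turn to the loop variable line.
line3: rstrip() removes trailing characters from the string. With no argument it removes all trailing whitespace, including the newline at the end of each line. The result is assigned to line_strip.
line4: The endswith() method checks whether the line ends with " end".
Putting a space in front of the word makes sure you don't match longer words whose last three letters happen to be "end".
line5: Each time a line ends in the word "end", count is increased by one.
line6: Print the result. The + operator can only join strings with strings, and count is an integer, so it has to be converted with str(count).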
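To see why the leading space matters, here is a small sketch that runs the same check on a few made-up lines instead of the novel. The helper name count_lines_ending_in and the sample lines are my own, for illustration only.

```python
def count_lines_ending_in(lines, word):
    """Count lines that end in `word` preceded by a space."""
    count = 0
    for line in lines:
        # rstrip() with no argument removes trailing whitespace, including "\n".
        if line.rstrip().endswith(" " + word):
            count += 1
    return count


# Made-up sample lines for illustration only (not taken from jane_eyre.txt).
sample = [
    "and so it came to an end\n",
    "there was no one to defend\n",
    "that was the end   \n",
    "the end of the road\n",
]

with_space = count_lines_ending_in(sample, "end")
# The naive check without the space also counts "defend".
naive = sum(1 for line in sample if line.rstrip().endswith("end"))

print(with_space)  # 2
print(naive)       # 3
```

One limitation of this check: a line that consists of the single word "end" has no space in front of it, so it would not be counted.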
Question2: Rare words
Your task is to write a program that loops through the words in the provided jane_eyre.txt file and counts those words which fulfill both of the following conditions:
- The word has more than 10 characters
- The word does not contain the letter “e”
Once you’ve finished the aggregation, let your program print out the following message:
Again, as a reference here are the words that your program should find (and count):
If you’re unsure how to start this task, go back one slide and take another look at the read method, and then a few slides more where we covered looping over lists.
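Task: find the words in the text that have more than 10 characters and do not contain the letter "e", and count how many there are. This exercise also uses another way of reading a file: read().
Method 1: the main solution, explained below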
count = 0
f = open("jane_eyre.txt").read()
words = f.split()
for word in words:
    if len(word) > 10 and "e" not in word:
        count += 1
print("There are "+str(count)+" long words without an 'e'.")
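line1: Set a counter variable to 0. It is updated in the loop below.
line2: open().read() opens the file and reads its whole content into one string.
line3: split() splits that string on whitespace and returns the words as a list.
line4: A for ... in loop goes through words, assigning each item in turn to word.
line5-line6: The condition checks that the word has more than 10 characters and, combined with and, that "e" is not in the word. If both conditions hold, the counter is increased by 1.
line7: Print the result.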
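The behaviour of split() and of the combined condition is easier to see on a short string. The text below is made up for illustration and is not a passage from the novel.

```python
# Made-up text for illustration only (not a passage from the novel).
text = "An astonishing\nmisunderstanding was   fortunately avoided\n"

# split() with no argument splits on any run of whitespace,
# so spaces and newlines are treated the same way.
words = text.split()
print(words)

# Keep only the words with more than 10 characters and no "e".
long_without_e = [w for w in words if len(w) > 10 and "e" not in w]
print(long_without_e)       # ['astonishing']
print(len(long_without_e))  # 1
```

Note that split() leaves punctuation attached to the words, and that "e" not in word is case-sensitive, so an uppercase "E" does not count as an "e" here.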
Method 2: for reference
count = 0
for line in open("jane_eyre.txt"):
    line = line.rstrip()
    for word in line.split():
        if len(word) > 10 and "e" not in word:
            count += 1
print("There are " + str(count) + " long words without an 'e'.")
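As an optional variation, the line-by-line version can be wrapped in a function and combined with a with statement, which closes the file automatically. This is only a sketch: the function name count_rare_words and the temporary demo file are my own assumptions, not part of the course material.

```python
import os
import tempfile


def count_rare_words(path):
    """Count words longer than 10 characters that do not contain 'e'."""
    count = 0
    # The with statement closes the file automatically when the block ends.
    with open(path) as f:
        for line in f:
            # split() already ignores the trailing newline.
            for word in line.split():
                if len(word) > 10 and "e" not in word:
                    count += 1
    return count


# Small made-up file for demonstration; with the real data you would call
# count_rare_words("jane_eyre.txt") instead.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("An astonishing\nmisunderstanding was fortunately avoided\n")
    demo_path = tmp.name

demo_count = count_rare_words(demo_path)
os.remove(demo_path)
print("There are " + str(demo_count) + " long words without an 'e'.")
```

Because split() with no argument discards leading and trailing whitespace, the rstrip() call in Method 2 is harmless but not strictly required.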
Question3: Accumulating characters
To practise nested loops and in-loop aggregations, your task is to write a program that, for each line, sums up the number of characters in whichever words in the line do not contain the letter “e”.
For example, in the following line of words,
we have four words (“coming”, “was”, “my”, “day”) that don’t contain the letter “e”. Summing up the number of characters in each of these four words, we get 6+3+2+3, which is 14.
Let your program print out the number of characters for each line. Using the provided jane_eyre.txt, the output of your program should look like this:
Hint
You’ll need a nested for loop here - go back a few slides if you’re unsure how to solve this problem.
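Task: for each line, find the words that do not contain "e", add up the number of letters in those words, and print the total for that line.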
for line in open("jane_eyre.txt"):
    line_strip = line.strip()
    count = 0
    for word in line_strip.split():
        if "e" not in word:
            count += len(word)
    print(count)
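line1: A for ... in loop opens the file and goes through it line by line.
line2: strip() removes the whitespace at the beginning and end of the line.
line3: Set the counter to 0. It sits inside the outer loop but outside the inner loop, so it is reset for every line.
line4: The second (inner) loop. split() turns the stripped line from line2 into a list of words.
line5: The condition checks that "e" is not in the word.
line6: If the condition is true, the counter from line3 is increased by the length of the word, computed with len().
line7: Print the result for the current line.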
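The per-line sum can be checked by hand on a single line. The sketch below uses a made-up line that happens to contain the four words from the exercise's example ("coming", "was", "my", "day"); it is not the original line from the novel, and the function name chars_without_e is my own.

```python
def chars_without_e(line):
    """Sum the lengths of the words in `line` that do not contain 'e'."""
    total = 0
    for word in line.strip().split():
        if "e" not in word:
            total += len(word)
    return total


# Made-up line for illustration: "the", "never" and "best" contain an "e",
# so only "coming" (6), "day" (3), "was" (3) and "my" (2) are counted.
sample_line = "the coming day was never my best\n"
print(chars_without_e(sample_line))  # 14

# A blank line has no words, so the sum is 0.
print(chars_without_e("\n"))  # 0
```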
Question4: It’s all about perspective
The novel Jane Eyre is written in the First Person perspective and, interestingly enough, about 20% of all sentences in this novel start with the word “I”. It also turns out that the most common last word is “me”.
Which words follow and precede “I” and “me”? Your task is to write a program that loops over the provided jane_eyre_sentences.txt file and prints out the second and second-last words of each sentence that starts with the word “I” and ends with the word “me”. Each line in the file is one sentence with the ending punctuation mark removed.
The file contains the first 500 sentences of our novel. The output of your program for those should look like this:
For example, the first sentence that starts with “I” and ends with “me” is:
I resisted all the way a new thing for me and a circumstance which greatly strengthened the bad opinion Bessie and Miss Abbot were disposed to entertain of me
In this sentence, the second word is resisted and the second-last word is of.
When you check the last word of each sentence, don’t forget about the newline and/or carriage return character. It’s probably best to strip each line first before looking at the individual words.
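Task: open the text file and find the sentences that start with "I" and end with "me", then print the second and second-last word of each of them. There are two methods that differ only slightly: the first is the teacher's, the second is my own.
Method 1:
for line in open("jane_eyre_sentences.txt"):
    line_strip = line.strip()
    words = line_strip.split()
    if words[0] == "I" and words[-1] == "me":  # compare the first and last word exactly
        print(words[1], words[-2])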
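line1: A for ... in loop opens the file and goes through it line by line.
line2: strip() removes the whitespace at both ends of the line, including the newline.
line3: split() splits line_strip on whitespace into a list of words.
line4: The condition: index [0] is the first word and index [-1] is the last word, and == checks whether they are exactly "I" and "me". The keyword in is not a drop-in replacement here, because on strings it tests for a substring (for example, "m" in "me" is True).
line5: Since split() has produced a list, [1] and [-2] pick out the second and second-last word, which are then printed.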
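The first solution above assumes that every line contains at least one word; on a blank line, split() returns an empty list and words[0] raises an IndexError. The sketch below adds a length check as a precaution. The function name second_and_second_last and the second and third sample lines are my own; the first sample is the sentence quoted in the exercise.

```python
def second_and_second_last(sentence):
    """Return (second word, second-last word) if the sentence starts with
    "I" and ends with "me"; otherwise return None."""
    words = sentence.strip().split()
    # Guard: a blank line gives an empty list, and words[0] would raise IndexError.
    if len(words) >= 2 and words[0] == "I" and words[-1] == "me":
        return words[1], words[-2]
    return None


samples = [
    # The sentence quoted in the exercise text.
    "I resisted all the way a new thing for me and a circumstance which "
    "greatly strengthened the bad opinion Bessie and Miss Abbot were "
    "disposed to entertain of me\n",
    # Made-up lines for illustration only.
    "It was a gift for me\n",
    "\n",
]

results = [second_and_second_last(s) for s in samples]
for result in results:
    print(result)
```

For the first sample this gives ("resisted", "of"), matching the example in the exercise; the other two lines give None.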
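Method 2 (for reference):
for word in open('jane_eyre_sentences.txt'):
    words = word.strip()
    if words.startswith('I '):
        if words.endswith(' me'):
            word = word.split()
            print(word[1], word[-2])
Note: the only difference is that split() is called at the end, after the startswith() and endswith() checks, to turn the stored string into a list.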
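The spaces in 'I ' and ' me' do the same job as the space in " end" in Question 1. A quick check on two made-up sentences (not from the novel) shows what goes wrong without them:

```python
# Made-up sentences for illustration only.
starts_with_it = "It was kind of you to come for me"
ends_with_time = "I thought about it for some time"

# Without the space, "It" is mistaken for "I".
check_no_space = starts_with_it.startswith("I")     # True
check_with_space = starts_with_it.startswith("I ")  # False

# Without the space, "time" is mistaken for "me".
end_no_space = ends_with_time.endswith("me")        # True
end_with_space = ends_with_time.endswith(" me")     # False

print(check_no_space, check_with_space)
print(end_no_space, end_with_space)
```

Comparing whole words after split(), as in the first method, avoids the problem in a different way, because "It" == "I" is simply False.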
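Summary:
This unit is not hard. Once you understand indexing and can tell split() and strip() apart, there shouldn't be any big problems.
I will keep updating these notes and stay true to my original goal of recording what I learn. If anything above is wrong or unclear, please leave a comment or send me a private message, and I will update the post as soon as I see it.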