Module 6: Advanced text processing 2
The most common word in the English language is the definite article “the”, according to an analysis of the Oxford English text corpus. In our novel, the most common word is in fact “the”, with a word frequency of about 4.2%.
In which part of a sentence is the word “the” most commonly used? Write a program that iterates over the provided jane_eyre_sentences.txt file and prints out the position of the first occurrence of the word “the” in each sentence.
For sentences that don’t contain “the”, print out the word “missing”. Your program should be case-insensitive, such that upper case as well as lower case words get counted.
The provided file contains the first 20 sentences of the novel. Here’s what the output of your program for these sentences should look like:
The first sentence doesn’t contain “the”, so your program should print missing here. Then, the next sentence,
We had been wandering indeed in the leafless shrubbery
has “the” appearing as the sixth word (starting to count from zero), so your program should print out the index 6. Make sure your program finds the “The”'s at the beginning of sentences as well - those will have the index 0.
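Task: Open the text file and print the position of the first occurrence of the word “the” in each sentence. For sentences that don’t contain “the”, print the word “missing”. The program should be case-insensitive, so that both upper-case and lower-case occurrences are counted.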
for line in open("jane_eyre_sentences.txt"):
    line_strip = line.strip()
    line_lower = line_strip.lower()
    words = line_lower.split()
    if "the" in words:
        print(words.index("the"))
    else:
        print("missing")
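line1: The for ... in loop opens the file and iterates over its lines.
line7-line8: The else branch covers the remaining sentences, which contain no “the”, and prints missing for them.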
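The same logic can be wrapped in a small function so it can be checked against the example sentence from the task without opening the file. This is only a sketch: the name `first_the_index` is made up for illustration, and it splits on whitespace only, so a token with attached punctuation such as "the," would not be matched.

```python
def first_the_index(sentence):
    """Return the zero-based position of the first "the", or None if absent."""
    # Lower-case first so that "The" and "the" are treated alike.
    words = sentence.strip().lower().split()
    if "the" in words:
        return words.index("the")
    return None


example = "We had been wandering indeed in the leafless shrubbery"
position = first_the_index(example)
print(position if position is not None else "missing")  # prints 6

# A sentence-initial "The" is found at index 0.
print(first_the_index("The day was cold"))  # prints 0
```

Returning `None` instead of printing inside the function keeps the lookup separate from the output formatting, so the caller decides when to print missing.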
With the letter “e” being the most common letter in the English language, let’s have a look at how many sentences consist mostly of words which don’t contain this letter.
Write a program that iterates over the provided jane_eyre_sentences.txt file and counts the number of words without an “e”, including both upper and lower case. For each sentence in which the relative amount of words without “e” is over 70%, print out how many words in that sentence contain no “e”, and how many words there are in total. Also, let your program print out the corresponding line number (starting to count from zero).
The provided file contains the first 50 sentences of the novel. Here’s what the output of your program for these sentences should look like:
For the first line, 9 out of 10 words contain no “e”, which makes 90% of the words in this sentence “e”-less.
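Task: Open the text file and iterate over it, counting the words that don’t contain an “e”. For every line in which more than 70% of the words have no “e”, print the number of words without “e”, the total number of words, and the line number (counting from zero). The solution is explained in two parts to make it easier to follow.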
#Part 1
line_count = 0  #line1
for line in open('jane_eyre_sentences.txt'):  #line2
    words = line.split()  #line3
    count = 0  #line4
    for word in words:  #line5
        if 'e' not in word.lower():  #line6, lower() so that an upper-case "E" counts too
            count += 1  #line7
    ans = count/len(words)*100 if words else 0  #line8, blank lines count as 0%
    #Part 2
    if ans > 70:  #line9
        print(str(line_count)+': '+str(count)+' out of '+str(len(words))+" words contain no 'e'.")  #line10
    line_count += 1  #line11
line2: The for ... in loop opens the text file and iterates over its lines.
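To try the counting step without the file, it can be pulled out into a helper and run on a couple of strings. This is a sketch under assumptions: `e_less_stats` is a hypothetical name, and the two sample sentences are made up rather than taken from the novel.

```python
def e_less_stats(sentence):
    """Return (number of words without 'e', total number of words)."""
    words = sentence.split()
    # Compare in lower case so that "E" and "e" are both detected.
    no_e = sum(1 for word in words if "e" not in word.lower())
    return no_e, len(words)


# Made-up sample sentences, not taken from the provided file.
samples = ["It was a cold and rainy day", "Every evening she went there"]

for line_number, sentence in enumerate(samples):
    no_e, total = e_less_stats(sentence)
    # Skip empty lines to avoid dividing by zero.
    if total > 0 and no_e / total > 0.7:
        print(f"{line_number}: {no_e} out of {total} words contain no 'e'.")
```

Only the first sample is printed, because all seven of its words are “e”-less, while every word of the second one contains an “e”. Using `enumerate` replaces the manual line counter.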
You might have noticed that the sentences in our novel, and classic English literature in general, can get quite long. In order to extract the relevant information out of literature, data scientists write clever abridgement algorithms which reduce the number of words without compromising the information stored in the text too much.
A first step to developing such algorithms is to be able to rearrange words in an arbitrary way to form new sentences. Using the supplied jane_eyre_sentences.txt file which contains one of the two ballads in that novel, your task is to transform each sentence such that we only keep the words between the third and the third-last word (inclusive) and skip every second word on the way.
For example, using the following sentence,
Your program should transform this sentence to the following, new sentence:
Here the sentence starts from the third word (remember that indexing starts at 0) and then contains every second word of the original sentence up to the third last.
Here’s what the first lines of your output should look like:
for line in open("jane_eyre_sentences.txt"):
    words = line.split()
    print(" ".join(words[2:-2:2]))
line3: The join() function joins the words selected by words[2:-2:2] into one string, using a single space " " as the separator.
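The slice `words[2:-2:2]` is easier to follow on a sentence whose words name their own position. The sketch below uses a made-up sentence (it is not from the ballad), and `abridge` is a hypothetical helper name.

```python
def abridge(sentence):
    """Keep every second word from the third word up to the third-last word."""
    words = sentence.split()
    # Start at index 2 (third word), stop before index -2 (so the
    # third-last word is the last candidate), and step by 2.
    return " ".join(words[2:-2:2])


sample = "one two three four five six seven eight nine"
print(abridge(sample))  # prints: three five seven
```

For sentences with fewer than five words the slice is empty, so the function returns an empty string.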
The semicolon is a popular stylistic element to connect two closely related ideas with equal position or rank. Charlotte Brontë also makes ample use of semicolons; you can find about 3460 semicolons in the 7460 sentences of the Jane Eyre novel.
Your task is to find sentences which contain a semicolon, and find the number of words before and after the semicolon. Use the jane_eyre_sentences.txt, which contains an excerpt from the first chapter with complete punctuation.
Your program should print out the line number, and then the number of words before and after the semicolon, separated by a semicolon. For example, for the following sentence, which is the third sentence from the file,
Be seated somewhere; and until you can speak pleasantly, remain silent."
your program should print:
There are three words before the semicolon, and then another eight afterwards.
For the entire file, the output of your program should look like this:
You can use the split method to split up each sentence at the semicolon. This should give you a list with two strings - when you call split again on both elements of the list, you get two individual lists of words before and after the semicolon.
There are no sentences in the file with more than one semicolon.
count = 0
for line in open("jane_eyre_sentences.txt"):
    if ";" in line:
        # split(";") returns a list: element 0 is the text before the semicolon, element 1 the text after it
        line_split = line.split(";")
        words_before = line_split[0].split()
        words_after = line_split[1].split()
        print("Line " + str(count) + ": " + str(len(words_before)) + ";" + str(len(words_after)))
    count += 1
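As an alternative to calling split(";") and indexing into the resulting list, `str.partition` returns the text before the separator, the separator itself, and the text after it in one step. The sketch below swaps in that technique and checks it against the example sentence from the task; `words_around_semicolon` is a hypothetical name.

```python
def words_around_semicolon(sentence):
    """Return (words before, words after) the first semicolon, or None."""
    # partition() yields an empty separator string when ";" is absent.
    before, separator, after = sentence.partition(";")
    if not separator:
        return None
    return len(before.split()), len(after.split())


example = 'Be seated somewhere; and until you can speak pleasantly, remain silent."'
print(words_around_semicolon(example))  # prints (3, 8)
```

Because the file is stated to contain at most one semicolon per sentence, splitting at the first semicolon is enough here.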