USYD (University of Sydney) DATA1002: Detailed Assignment Walkthrough, Module 6

不二程序猿

Category: University of Sydney DATA1002. Tags: python, programmer life, experience sharing

Article link: https://blog.csdn.net/weixin_43773228/article/details/108961216

Module 6: Advanced text processing 2

Preface
Question 1: The first "the"
Question 2: The letter "e"
Question 3: Output abridgements
Question 4: The use of semicolons
Summary

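Preface

I have tried to be as detailed as possible and to explain every step. There is more than one valid answer, so if you have a better one, feel free to leave a comment. The other modules are available on my profile page. Blank lines and # comments are not counted in the line numbers used in the explanations, so skip them when counting lines of code.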

Question 1: The first “the”

The most common word in the English language is the definite article “the”, according to an analysis of the Oxford English text corpus. In our novel, the most common word is in fact “the”, with a word frequency of about 4.2%.

In which part of a sentence is the word “the” most commonly used? Write a program that iterates over the provided jane_eyre_sentences.txt file and prints out the position of the first occurrence of the word “the” in each sentence.

For sentences that don’t contain “the”, print out the word “missing”. Your program should be case-insensitive, such that upper case as well as lower case words get counted.

The provided file contains the first 20 sentences of the novel. Here’s what the output of your program for these sentences should look like:

The first sentence doesn’t contain “the”, so your program should print missing here. Then, the next sentence,

We had been wandering indeed in the leafless shrubbery

has “the” appearing as the sixth word (starting to count from zero), so your program should print out the index 6. Make sure your program finds the “The”'s at the beginning of sentences as well - those will have the index 0.

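Task: Open the text file and print the position of the first occurrence of the word "the" in each sentence. For sentences that do not contain "the", print the word "missing". The program should be case-insensitive, so that both upper-case and lower-case occurrences are counted.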

for line in open("jane_eyre_sentences.txt"):
  line_strip = line.strip()
  line_lower = line_strip.lower()
  words = line_lower.split()
  if "the" in words:
    print(words.index("the"))
  else:
    print("missing")

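line1: Open the file and iterate over it line by line with a for ... in loop.
line2: Remove leading and trailing whitespace from the line with strip().
line3: Convert the line to lower case with lower().
line4: Turn the line into a list of words with split(), so that the position of a word can be looked up by index.
line5-line6: Conditional check: if "the" is in the list words, print its position. index() returns the index of the first matching element in the list.
line7-line8: Otherwise the sentence contains no "the", so print "missing".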
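
The same steps can also be wrapped in a small function, which makes them easy to try out without the course file. The sketch below is only an illustration: the function name first_the_position and the sample sentences are my own assumptions, apart from the second sentence, which is the example from the task.

```python
def first_the_position(sentence):
    """Return the index of the first "the" in the sentence, or "missing"."""
    # Lower-case first so that "The" and "the" are treated the same.
    words = sentence.lower().split()
    if "the" in words:
        return words.index("the")
    return "missing"


# Hypothetical sample input; the second sentence is the example from the task.
sample_sentences = [
    "Hello world",
    "We had been wandering indeed in the leafless shrubbery",
    "The cold winter wind",
]

for sentence in sample_sentences:
    print(first_the_position(sentence))
# prints: missing, 6, 0 (one per line)
```

Because the check is made against the list of words rather than the raw string, words such as "there" or "then" do not count as "the". A "the" with punctuation attached (for example "the,") would not match either, so this relies on the file containing no punctuation, as the example sentence suggests.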

Question 2: The letter “e”

With the letter “e” being the most common letter in the English language, let’s have a look how many sentences consist mostly of words which don’t contain this letter.

Write a program that iterates over the provided jane_eyre_sentences.txt file and counts the number of words without an “e”, including both upper and lower case. For each sentence in which the relative amount of words without “e” is over 70%, print out how many words in that sentence contain no “e”, and how many words there are in total. Also, let your program print out the corresponding line number (starting to count from zero).

The provided file contains the first 50 sentences of the novel. Here’s what the output of your program for these sentences should look like:

For the first line, 9 out of 10 words contain no “e”, which makes 90% of the words in this sentence “e”-less.

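Task: Iterate over the text file and count the words in each line that do not contain an 'e'. For every line in which more than 70% of the words have no 'e', print the number of words without 'e', the total number of words, and the line number (counting from 0). The explanation is split into two parts to keep it simple.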

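# Part 1
line_count = 0  #line1
for line in open('jane_eyre_sentences.txt'):  #line2
  words = line.split()  #line3
  count = 0  #line4
  for word in words:  #line5
    if 'e' not in word.lower():  #line6
      count += 1  #line7
  ans = count / len(words) * 100  #line8
  # Part 2
  if ans > 70:  #line9
    print(str(line_count) + ': ' + str(count) + ' out of ' + str(len(words)) + " words contain no 'e'.")  #line10
  line_count += 1  #line11

Part 1:
line1: Before opening the file, set up a counter variable that keeps track of the current line number.
line2: Open the text file and iterate over it with a for ... in loop.
line3: Split the line into a list of words and store it in the new variable words.
line4: Set up a counter for the number of words without an 'e'.
line5: A second loop iterates over the list of words.
line6-line7: Conditional check: if the lower-cased word does not contain 'e' (so that an upper-case 'E' is caught as well, as the task requires), increment the counter by one.
line8: A new variable ans holds the percentage of such words: count is the number of words without 'e', and len(words) is the number of words in the line.
Part 2:
line9: Conditional check: if the percentage is over 70, run the block below.
line10: Print the final result: str(line_count) is the line number, str(count) is the number of words without 'e', and str(len(words)) is the total number of words in the line.
line11: Increment the line counter on every pass through the outer loop. Note that counting starts at 0 and that this statement is indented to the same level as line 3, so it runs once per line of the file and not once per word.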
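
As a variation on the walkthrough above, the counting can be separated from the printing, and enumerate() can supply the line number in place of a manual counter. This is a sketch under my own assumptions, not the official solution: the function name count_words_without_e and the sample sentences are hypothetical, and the guard against empty lines is an addition that the course file may not need.

```python
def count_words_without_e(sentence):
    """Return (words without 'e' or 'E', total words) for one sentence."""
    words = sentence.split()
    # Lower-case each word so that an upper-case 'E' also counts as an 'e'.
    without_e = sum(1 for word in words if "e" not in word.lower())
    return without_e, len(words)


# Hypothetical sample input, used here instead of the course file.
sample_sentences = [
    "There was no possibility of taking a walk that day",
    "Every teacher here seemed pleased",
    "",
]

for line_number, sentence in enumerate(sample_sentences):
    without_e, total = count_words_without_e(sentence)
    # Skip empty lines to avoid dividing by zero.
    if total > 0 and without_e / total > 0.7:
        print(f"{line_number}: {without_e} out of {total} words contain no 'e'.")
# prints: 0: 9 out of 10 words contain no 'e'.
```

enumerate() starts counting at 0 by default, which matches the line numbering the task asks for.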

Question 3: Output abridgements

You might have noticed that the sentences in our novel, and classic English literature in general, can get quite long. In order to extract the relevant information out of literature, data scientists write clever abridgement algorithms which reduce the number of words without compromising the information stored in the text too much.

A first step to developing such algorithms is to be able to rearrange words in an arbitrary way to form new sentences. Using the supplied jane_eyre_sentences.txt file which contains one of the two ballads in that novel, your task is to transform each sentence such that we only keep the words between the third and the third-last word (inclusive) and skip every second word on the way.

For example, using the following sentence,

Your program should transform this sentence to the following, new sentence:

Here the sentence starts from the third word (remember that indexing starts at 0) and then contains every second word of the original sentence up to the third last.

Here’s what the first lines of your output should look like:

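Task: Open the text file. For each line, drop the first two and the last two words, then print every second word of what remains, starting with the third word.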

for line in open("jane_eyre_sentences.txt"):
  words = line.split()
  print(" ".join(words[2:-2:2]))

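line1: Open the file and iterate over its lines.
line2: Split the line into a list of words.
line3: Use join() to join the words selected by the slice words[2:-2:2] into one string, with a single space " " as the separator.
Link: usage of list indexing and slicing.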
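
To see what the slice words[2:-2:2] selects, it helps to run it on short sentences whose words are easy to count. The following is a minimal sketch: the function name abridge and the sample sentences are made up for illustration and do not come from the course file.

```python
def abridge(sentence):
    """Keep every second word from the third word up to the third-last word."""
    words = sentence.split()
    # start=2 is the third word, stop=-2 excludes the last two words,
    # and step=2 skips every second word in between.
    return " ".join(words[2:-2:2])


# Hypothetical sample sentences (not taken from the course file).
print(abridge("one two three four five six seven eight nine"))  # three five seven
print(abridge("one two three four five six seven eight"))       # three five
print(abridge("one two three four"))                            # prints an empty line
```

The stop index of a slice is exclusive, so -2 means the third-last word is the last one that can be included. Whether it actually appears depends on the step: with eight words the slice stops at "five", because the third-last word "six" sits at an odd offset from the start.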

Question 4: The use of semicolons

The semicolon is a popular stylistic element to connect two closely related ideas with equal position or rank. Charlotte Brontë also makes ample use of semicolons; you can find about 3460 semicolons in the 7460 sentences of the Jane Eyre novel.

Your task is to find sentences which contain a semicolon, and find the number of words before and after the semicolon. Use the jane_eyre_sentences.txt, which contains an excerpt from the first chapter with complete punctuation.

Your program should print out the line number, and then the number of words before and after the semicolon, separated by a semicolon. For example, for the following sentence, which is the third sentence from the file,

Be seated somewhere; and until you can speak pleasantly, remain silent."

your program should print:

There are three words before the semicolon, and then another eight afterwards.

For the entire file, the output of your program should look like this:

Hint
You can use the split method to split up each sentence at the semicolon. This should give you a list with two strings - when you call split again on both elements of the list, you get two individual lists of words before and after the semicolon.
There are no sentences in the file with more than one semicolon.

Task: Find the sentences that contain a semicolon and count the words before and after the semicolon.

count = 0
for line in open("jane_eyre_sentences.txt"):
  if ";" in line:
    line_split = line.split(";")
    words_before = line_split[0].split()
    words_after = line_split[1].split()
    print("Line " + str(count) + ": "+str(len(words_before)) + \
            ";" + str(len(words_after)))
  count += 1

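line1: Set up a counter that keeps track of the current line number.
line2: Open the text file and iterate over its lines.
line3: Conditional check: is there a ';' in the line?
line4: Use split(";") to separate the text before and after the semicolon; the result is a list.
line5: line_split[0] is the text before the semicolon; split() turns it into a list of words so that they can be counted when printing.
line6: line_split[1] is the text after the semicolon; split() turns it into a list of words so that they can be counted when printing. (In this file, each line contains at most one semicolon.)
line7: Print the expected answer: str(count) gives the line number, and str(len(...)) gives the number of words before and after the semicolon.
line8: On every pass through the loop, whatever the outcome of the check, increment the counter by one so that it always holds the current line number.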
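
The two-step split can also be packaged as a function and tried on the example sentence from the task. This is a sketch under assumptions: the name semicolon_word_counts is hypothetical, and it swaps split(";") for str.partition(";"), which cuts at the first semicolon only. For lines with a single semicolon, as in this file, both approaches give the same counts.

```python
def semicolon_word_counts(sentence):
    """Return (words before, words after) the semicolon, or None if there is none."""
    if ";" not in sentence:
        return None
    # partition() returns the text before the separator, the separator itself,
    # and the text after it.
    before, _, after = sentence.partition(";")
    return len(before.split()), len(after.split())


# The second sentence is the example from the task; the first is hypothetical.
sample_sentences = [
    "No semicolon in this one.",
    'Be seated somewhere; and until you can speak pleasantly, remain silent."',
]

for line_number, sentence in enumerate(sample_sentences):
    counts = semicolon_word_counts(sentence)
    if counts is not None:
        before, after = counts
        print(f"Line {line_number}: {before};{after}")
# prints: Line 1: 3;8
```

Punctuation stays attached to its word after split(), so "pleasantly," and 'silent."' each count as one word, which agrees with the three and eight words stated in the task.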