[Series Notes 1] USYD (University of Sydney) DATA1002 Grok Module 3: course notes and assignment walkthrough

Module 3 Text processing and data cleaning

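This post covers all of Grok Module 3. It is fairly long, so please be patient. The last subsection of each major section is a programming quiz that counts toward the final grade, and it gets a detailed walkthrough.
Module 3 has six major sections: 1. Introduction 2. Transforming data 3. Filtering data 4. Filtering and transforming 5. Advanced filtering and transforming 6. Alternative transforms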



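Preface

Writing this took effort, so please do not plagiarize. Quoting is welcome as long as you credit the source.
I will do my best to cover every point. If you spot a mistake or something missing, please leave a comment or send a private message.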


Introduction

In this module we will learn how to process text-based data. We start by looking at how to write programs that open and read from text files.


From there, we will concentrate on two important concepts in the field of text processing: transforming and filtering. These two tasks are routinely applied in data cleaning and data mining.

The patterns for this module are:
1. Transforming data
2. Filtering data
3. Filtering and transforming
4. Advanced filtering and transforming



Pattern 1: Transforming data

Transforming data

Below we have a text file that contains the beginning of our novel (only the first six lines are shown).

pride_and_prejudice.txt

It
Is
A
Truth
Universally
Acknowledged

We want to transform each word so that it contains only lower case characters.

for word in open("pride_and_prejudice.txt"):
    word_new = word.lower()
    print(word_new)

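When you run this code, you should see the following output (each line read from the file keeps its trailing newline and print adds another, so the real output also has blank lines between the words; they are omitted here):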

it
is
a
truth
universally
acknowledged


Breaking it down

The first line of our example program starts a loop that runs through each line of the file. This is the standard syntax for working with files.

for word in open("pride_and_prejudice.txt"):

The loop variable plays a special role in the for statement: each line of the file is assigned to this variable in turn.
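A minimal sketch of what the loop variable holds on each pass. The temporary file, its three-line contents, and the name `lines_seen` are assumptions made for illustration so the example runs without pride_and_prejudice.txt on disk.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("It\nIs\nA\n")
    path = tmp.name

lines_seen = []
with open(path) as f:
    for word in f:              # word is assigned each line in turn
        lines_seen.append(word)

os.remove(path)                 # clean up the throwaway file
print(lines_seen)               # ['It\n', 'Is\n', 'A\n']
```

Each value still carries its trailing newline, which matters later when measuring word lengths.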


Transform each word

Inside the indented block of the for loop we can do anything we want with our word variable.

for word in open("pride_and_prejudice.txt"):
    word_new = word.lower()
    print(word_new)

Here we created a new variable called word_new, which contains a transformed version of the original word variable. Calling .lower() on a string returns a copy converted to lower case characters.

Here's another example of how we can transform words in a file. For the following small file, we want to print out the length of each word. We can do this by using the len function:

example.txt

one
two
three

for word in open("example.txt"):
  length = len(word)
  print(length)

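When you run this code, you should get the following output (the first two lengths include the newline character at the end of the line; the last line of the file has none):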

4
4
5
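The lengths above can be reproduced with a small sketch that compares the raw and stripped lengths side by side. Here `io.StringIO` stands in for the opened example.txt, and the names `fake_file`, `raw_lengths`, and `clean_lengths` are assumptions for illustration. The rstrip method it uses is explained in the next section.

```python
import io

# io.StringIO stands in for open("example.txt") so no file is needed.
# Note that the last line has no trailing newline.
fake_file = io.StringIO("one\ntwo\nthree")

raw_lengths = []
clean_lengths = []
for word in fake_file:
    raw_lengths.append(len(word))                  # counts the "\n" too
    clean_lengths.append(len(word.rstrip("\n")))   # newline removed first

print(raw_lengths)    # [4, 4, 5]
print(clean_lengths)  # [3, 3, 5]
```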


String methods recap

lower() in the example above is called a string method. It converts all characters in a string to lower case. String methods are called like this:
"string".method(args)

A very useful string method in the context of file processing is the rstrip method. It allows us to strip characters from the end of a string. For example, it can be used to remove the newline (\n) character:

s1 = "line 1\n"
s2 = "line 2"
s1_stripped = s1.rstrip("\n")
print(s1_stripped)
print(s2)

Stripping whitespace

If we call rstrip and pass in a particular character as the argument, then all trailing occurrences of that character are stripped from the right side (end) of the string. Called with no argument, rstrip strips trailing whitespace instead.

Python has two other string methods that work similarly to rstrip:

  • lstrip: same as rstrip, but strips from the left side (beginning) of the string; and
  • strip: same as rstrip, but strips from both sides of the string.
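The three methods can be compared on one string. This is a small sketch; the sample strings and variable names are made up for illustration, and repr is used only to make whitespace visible when printing.

```python
raw = "  \thello world\n"

right_only = raw.rstrip("\n")   # removes only the trailing newline
left_only = raw.lstrip()        # no argument: strips leading whitespace
both_sides = raw.strip()        # no argument: strips whitespace on both ends
trailing_x = "xxhixx".rstrip("x")  # strips every trailing "x", not just one

print(repr(right_only))  # '  \thello world'
print(repr(left_only))   # 'hello world\n'
print(repr(both_sides))  # 'hello world'
print(repr(trailing_x))  # 'xxhi'
```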


Strip it! (graded)

Link: USYD DATA1002 detailed assignment walkthrough, Module 3



Pattern 2: Filtering data

Using the following text file (only the head of the file is shown),

pride_and_prejudice.txt

I
certainly
have
had
my
of

we want to keep the words with more than six characters, thereby filtering out all the short words.

for word in open("pride_and_prejudice.txt"):
  if len(word.rstrip("\n")) > 6:
    print(word)

When you run this code, you should see the following output:

certainly
pretend
extraordinary


Breaking it down

Here we check whether the word contains more than 6 characters using the len function, which returns the number of characters in a string. We strip off the newline character with rstrip("\n") before we calculate the length, like we did before. You can think of Python as evaluating from the innermost instruction to the outermost: here rstrip gets executed first, and then len gets executed with the result of rstrip.

Why do we need to strip?

Remember that each line in a file (except possibly the last one) ends in a newline character. This character is counted as well when you call the len function on the string that contains the data from that line. Compare:

word1 = "Carriage\n"
word2 = "Carriage"
print(len(word1))
print(len(word2))
# output:
# 9
# 8

For this reason (and many others) we should always use rstrip("\n") in filtering statements.
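A short sketch of how forgetting to strip changes a filter's result. The helper `long_words`, its `strip` flag, and the sample lines are hypothetical names chosen for this illustration; the threshold of 6 matches the example above.

```python
lines = ["little\n", "pretend\n", "of\n"]

def long_words(lines, strip):
    """Keep words longer than 6 characters, optionally stripping first."""
    kept = []
    for line in lines:
        word = line.rstrip("\n") if strip else line
        if len(word) > 6:
            kept.append(line.rstrip("\n"))  # store the clean word either way
    return kept

without_strip = long_words(lines, strip=False)
with_strip = long_words(lines, strip=True)

print(without_strip)  # ['little', 'pretend']
print(with_strip)     # ['pretend']
```

Without stripping, the six-letter word "little" slips through because its newline is counted as a seventh character.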


In or Out

Boundary cases

A programmer needs to be very careful and precise in stating the condition in any if statement. It is good practice to check your code on examples that fall exactly on the boundary between the cases that are printed and those that are filtered, and also on cases that are just on either side of the boundary.

For example:

if len(word.rstrip("\n")) <= 9:

if len(word.rstrip("\n")) < 9:

Code that "filters out the words with more than 9 characters" and code that "filters out the words with 9 or more characters" will do different things when given a word with exactly 9 characters: the first condition above is True for such a word and the second is False.
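One way to check the boundary is to run both conditions on words of length 8, 9, and 10. This is a sketch only; the function names `keep_le_9` and `keep_lt_9` and the three test words are assumptions chosen for illustration.

```python
def keep_le_9(word):
    """True when the word has 9 or fewer characters."""
    return len(word.rstrip("\n")) <= 9

def keep_lt_9(word):
    """True when the word has fewer than 9 characters."""
    return len(word.rstrip("\n")) < 9

# One word below the boundary, one exactly on it, one above it.
samples = ["feelings\n", "certainly\n", "acceptable\n"]  # lengths 8, 9, 10

le_results = [keep_le_9(w) for w in samples]
lt_results = [keep_lt_9(w) for w in samples]

print(le_results)  # [True, True, False]
print(lt_results)  # [True, False, False]
```

The two conditions disagree only on the nine-letter word, which is exactly the case worth testing.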


Character in string

This filtering technique checks whether a certain character (or sequence of characters) is present within a string. For this, we use the in keyword like this:

letter in word

This syntax checks whether the value of a variable called letter is among the characters present in the value of a variable called word, and it evaluates to either True or False.

letter = "p"
word = "accept"
print(not letter in word)
print(letter in word and len(word)==6)
# output:
# False
# True
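The same test can drive a filter over several words. The sketch below uses a made-up word list and variable names; it also shows that in matches a multi-character sequence only when those characters appear next to each other in that order.

```python
words = ["accept", "pride", "truth", "happy"]

with_p = []
with_th = []
for word in words:
    if "p" in word:        # single character
        with_p.append(word)
    if "th" in word:       # two adjacent characters, in this order
        with_th.append(word)

print(with_p)   # ['accept', 'pride', 'happy']
print(with_th)  # ['truth']
```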

Filter (graded)

Link: USYD DATA1002 detailed assignment walkthrough, Module 3



Pattern 3: Filtering and transforming

We want to be able to apply both filtering and transforming at the same time, using the following text file as an example (the file is long, so only the head is shown):

pride_and_prejudice.txt
However
little
known
the
feelings
or

Problem: suppose we want to find all words that end in the character "e" and then find out how long these words are.

Approach: we first have to filter out the words which don't end with an "e" and then transform each word that wasn't filtered out into a number which represents its length.

for word in open('pride_and_prejudice.txt'):
    if word.rstrip('\n').endswith('e'):
        length = len(word.rstrip('\n'))
        print(length)

# out
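As a hedged illustration, the same filter-then-transform pattern can be run on just the six head words shown above. The list `head_words` stands in for the file and `lengths` is a name chosen here, so the result covers only these six words, not the full text.

```python
# The six lines shown from the head of the file, newlines included.
head_words = ["However\n", "little\n", "known\n", "the\n", "feelings\n", "or\n"]

lengths = []
for line in head_words:
    word = line.rstrip("\n")        # strip once and reuse the clean word
    if word.endswith("e"):          # filter: keep words ending in "e"
        lengths.append(len(word))   # transform: word -> its length

print(lengths)  # [6, 3]
```

Stripping once into a separate variable avoids calling rstrip twice, as the version above does.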

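Summary

That is all for today: this post covered the course notes for Module 3. The last subsection of each section is graded, so please read carefully, ask when something is unclear, and good luck getting a high mark.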