Module 3 Text processing and data cleaning
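This post covers the full content of GROK Module 3. It is fairly long, so please be patient and read it through. The last subsection of each major section is a programming quiz that counts toward the final grade, so those are explained in detail.
Module 3 has six major sections: 1. Introduction 2. Transforming data 3. Filtering data 4. Filtering and transforming 5. Advanced filtering and transforming 6. Alternative transforms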
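Preface
Writing this took real effort, so please do not plagiarize. Quoting is fine as long as you credit the source.
I will do my best to cover every point thoroughly. If you find errors or omissions, feel free to leave a comment or send a private message.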
Introduction
In this module we will learn how to process text-based data. We start by looking at how to write programs that open and read from text files.
From there, we will concentrate on two important concepts in the field of text processing: transforming and filtering. These two tasks are routinely applied in data cleaning and data mining.
The patterns for this module are:
1. Transforming data
2. Filtering data
3. Filtering and transforming
4. Advanced filtering and transforming
Pattern 1: Transforming data
Transforming data
Below we have a text file that contains the beginning of our novel (only the first six lines are shown).
pride_and_prejudice.txt
It
Is
A
Truth
Universally
Acknowledged
We want to transform each word so that it contains only lowercase characters.
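# Loop over the file one line at a time.
for word in open("pride_and_prejudice.txt"):
    # Convert the line to lowercase (it still ends with the newline read from the file).
    word_new = word.lower()
    print(word_new)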
When you run this code, you should see the following output:
it
is
a
truth
universally
acknowledged
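The same transformation can be written a little more defensively. The following is a minimal sketch, not the course's own solution: the helper name lowercase_words and the temporary sample file are assumptions made for illustration. It opens the file in a with block so the file is closed automatically, and it calls strip() so the trailing newline of each line is not printed a second time.

```python
import os
import tempfile


def lowercase_words(path):
    """Return each line of the file at `path`, lowercased and without its newline."""
    result = []
    # `with` closes the file automatically when the block ends.
    with open(path, encoding="utf-8") as f:
        for word in f:
            # strip() removes the trailing newline before lowercasing.
            result.append(word.strip().lower())
    return result


# Hypothetical stand-in for pride_and_prejudice.txt so the sketch runs anywhere.
sample_path = os.path.join(tempfile.mkdtemp(), "pride_and_prejudice.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("It\nIs\nA\nTruth\nUniversally\nAcknowledged\n")

for word_new in lowercase_words(sample_path):
    print(word_new)
```

Collecting the results in a list keeps the transformation separate from the printing, which makes the helper easier to reuse and to check.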
Breaking it down
The first line of our example program starts a for loop that runs through each line of the file. This is the standard idiom for reading a file line by line.
for word in open("pride_and_prejudice.txt"):
The loop variable (here, word) plays a special role in the for statement: each line of the file is assigned to it in turn.
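To see what the loop variable actually holds on each pass, it can help to print it with repr(). This is a small sketch under stated assumptions: io.StringIO is used here as a stand-in for an opened file so that nothing needs to exist on disk, and the names fake_file and seen are illustrative only.

```python
import io

# io.StringIO behaves like an open text file for the purpose of iteration.
fake_file = io.StringIO("It\nIs\nA\n")

seen = []
for word in fake_file:
    # On each pass, `word` is rebound to the next line, newline included.
    seen.append(word)
    print(repr(word))
```

Each value shown ends with \n, which is why a plain print(word) leaves a blank line after every word unless the newline is stripped first.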