DATA2001 期末知识点概括Week 2

本文链接：https://blog.csdn.net/weixin_43773228/article/details/118032625

DATA2001 期末知识点概括

文章目录

DATA2001 期末知识点概括
前言
Week 2 Data Cleaning and Exploration with Python
- 2.1 Level of Measurements and Type of Data: Categorical Data (Nominal,Dichotomous,Ordinal) , Quantitative(Interval,Ratio)
- 2.2 variance and standard 方差和标准差
- 2.3 Data Cleaning--pandas 数据清理
- 2.4 correlation statistics 相关统计
Week 3 Accessing Data in Relational Databases; Introduction to SQL
- 3.1 What is a Database?什么是数据库
- 3.2 Advantages of Database 数据库好处
- 3.3 Key Database Concepts 数据库概念
- - 3.3.1 Primary Key and Foreign Key 主键和外键
- 3.4 Kinds of Relationships (One-One , One-Many , Many-Many Relationship) 表和表的关系
Week 4 Declarative Data Analysis with SQL
- 4.1 SQL – The Structured Query Language 结构化查询语言(DDL, DML)
- 4.2 Table Constraints and Relational Keys(Primary key,Foreign keys) 表约束和关系键
- 4.3 SQL Domain Constraints 域约束
- 4.4 SQL查询语言（暂无）
Week 5 Scalable Data Analytics The Role of Indexes and Data Partitioning
- 5.1 Where is Data Stored? 数据存储在哪里
- 5.2 How to Access to Data on Secondary Storage FAST?如何快速访问二级存储上的数据
- 5.3 Alternative File Organizations 替代文件组织（Heap Files，Sorted Files）
- 5.4 Index 索引
- 5.5 Distributed Data Management 分布式数据管理(Partitioning)
- - 5.5.1 Partitioning 分区
Week 6 Scraping Web Data
- 6.1 (儿童节快乐）Web Scraping – General Approach 网页抓取 - 一般方法
- 6.2 Robots Exclusion Standard 机器人排除标准(robots协议）
- 6.3 Is it Legal? 法外狂徒？
- 6.4 Web Page Retrieval: URLs 网页检索 URL
- 6.5 HTML – Hypertext Markup Language 超文本标记语言
- 6.6 General Structure of a Web Page 网页的一般结构
- 6.7How to Select Content in a Webpage?如何选择网页中的内容？
- 6.8 HTML Document Model (DOM): Element-Tree 文档模型 (DOM)：元素树
Week 7 Semistructured Data; NoSQL
- 7.1 Getting Data via Service-APIs (Web Services) 通过服务 API（Web 服务）获取数据
- 7.2 Semistructured Data 半结构化数据
- 7.3 HTML vs. XML
- 7.4 Logical Document Structure 逻辑文件结构
- 7.5 How to query or filter XML?如何查询或过滤 XML？
- - 7.5.1 XPath Document Model Tree XPath 文档模型树
- 7.6 Semi-Structured data versus Structured Data 半结构化数据与结构化数据
- 7.7 NoSQL
- - 7.7.1NOSQL
- 7.8 MongoDB Data Model 数据模型
- - 7.8.1 MongoDB vs. RDBMS
Week 8 Text Data Processing: Feature Extraction & Analysis
- 8.1 Text data
- 8.2 Machine Learning tasks 机器学习任务
- 8.3 Tokenisation
- 8.4 Normalisation
- 8.5 Indicator Features
- 8.6 Term Frequency Weighting
- 8.7 TF-IDF Weighting 加权
- 8.8 Vector Space Model 向量空间模型
- 8.9 Document Vectors 文档向量
Week 9 (Geo-)Spatial Data
- 9.1 Spatial Data 空间数据
- 9.2 SDBMS vs GIS
- 9.3 Object Model 对象模型
- 9.4 PostGIS: Geometry vs. Geography Type
- 更多week 9 知识点请翻阅 week_9_lecture
Week 10 Time Series Data
- 10.1 Temporal Data 时间数据
- 10.2 Temporal Support in Databases 数据库中的时间支持
- 10.3 Concepts in Temporal Databases
Week 11 Image Data Processing
- 11.1 Image Data 图像数据
- 11.2 Types of Images种类
- 11.3 Aspects of Image Processing 图像处理方面
- 11.4 Morphological Image Processing 形态学图像处理
Week 12 Big Data
- 12.1 Big Data: Volume 量
- 12.2 Big Data: Velocity 速度
- 12.3 Big Data: Variety 多样性
- 12.4 Scale-Up 单个升级
- 12.5 The Alternative: Scale-Out 增加数量
- 12.6 MapReduce Overview
- 12.7 MapReduce Discussion
总结

前言

每周知识点来自于学校lecture，内容是每周的重点，可能会漏掉几个小的part知识点，对应在lecture上找就好。中英文为机翻，有的对照着大概理解一下就好。

Week 2 Data Cleaning and Exploration with Python

2.1 Level of Measurements and Type of Data: Categorical Data (Nominal,Dichotomous,Ordinal) , Quantitative(Interval,Ratio)

Categorical Data

A categorical variable is also known as a discrete or qualitative variable and can have two or more categories.

分类变量也称为离散变量或定性变量，可以有两个或多个类别。
It is further divided into two variants, nominal and ordinal.
它进一步分为两种变体，名义型和有序型。

These variables are sometimes coded as numerical values, or as strings.
这些变量有时被编码为数值或字符串。

Nominal Data

This is an unordered category data. This type of variable may be “label-coded” in numeric form but these numerical values have no mathematical interpretation and are just labeling to denote categories. For example, colours: black, red and white can be coded as 1, 2 and 3.
这是一个无序的类别数据。这种类型的变量可能以数字形式进行**“标签编码”**，但这些数值没有数学解释，只是用来表示类别的标签。例如，颜色：黑色、红色和白色可以编码为 1、2 和 3。

- Dichotomous Data

A dichotomous is a type of nominal data that can only have two possible values, e.g. true or false, or presence or absence. These are also sometimes referred as binary or Boolean variables.二分法是一种名义数据，它只能有两个可能的值，例如 真或假，或存在或不存在。这些有时也称为二进制或布尔变量。

e.g.: ture(1) or false(0)

Ordinal Data

This is ordered categorical data in which there is strict order for comparing the values, so a labelling as numbers is not completely arbitrary. For example, human height (small, medium and high) can be coded into numbers small = 1, medium = 2, high = 3.
这是有序的分类数据，其中比较值有严格的顺序，因此标记为数字并不是完全任意的。例如，人的身高（小、中、高）可以编码为数字小 = 1、中 = 2、高 = 3。

– Values are ordered 值是有序的
– No distance is implied 没有暗示距离
– Eg rank, agreement 例如等级、协议

Quantitative Data

Interval Data

It is a variable in which the interval between values has meaning and there is no true zero value.
它是一个变量，其中值之间的间隔有意义并且没有真正的零值。

Ratio Data

It is variable that might have a true value of zero and represents the total absence of the variable being measured. For example, it makes sense to say a Kelvin temperature of 100 is twice as hot as a Kelvin temperature of 50 because it represents twice as much the thermal energy (unlike Fahrenheit temperatures of 100 and 50).
它是真实值可能为零的变量，表示被测变量完全不存在。例如，可以说 100 的开尔文温度是 50 的开尔文温度的两倍，因为它代表了两倍的热能（与华氏温度 100 和 50 不同）。

2.2 variance and standard 方差和标准差

在这里插入图片描述

Samples from two populations with the same mean but different variances. The red population has mean 100 and variance 100 (SD=10) while the blue population has mean 100 and variance 2500 (SD=50).

分布较远variance较大，分布较近variance较小。

2.3 Data Cleaning–pandas 数据清理

Missing Data Handling

Pandas provides various functions for handling missing/wrong data
Part of this already included in the input functions (cf. csv_read() ) where missing values are automatically replaced with NA/NaN.
Pandas 提供了各种处理丢失/错误数据的函数
– 这部分内容已包含在输入函数中（参见 csv_read() ），其中缺失值会自动替换为 NA/NaN

data2 = data[’numGen’].dropna() 

data[‘numGen’].fillna(0, inplace=True)

data[‘numGen’].replace(to_replace=‘<Null>’, value=0, inplace=True)

Fix missing values during import

Some datasets contain placeholders for missing values
such as ‘n/a’, ‘–’ or ‘null’
Best to replace during import to avoid later problem
一些数据集包含缺失值的占位符
例如‘n/a’、‘–’或‘null’
最好在导入时更换以避免以后出现问题

import pandas as pd

missing_values = [“--”,”<Null>”]
data = pd.read_csv(‘MajorPowerStations.csv’, na_values = missing_values)
data.head()

2.4 correlation statistics 相关统计

Scipy 包括各种相关统计，得到-1到1之间的数。
-1到0表示反比，1到0表示正比，0表示毫不相干。

stats.spearmanr
stats.pearsonr

Week 3 Accessing Data in Relational Databases; Introduction to SQL

3.1 What is a Database?什么是数据库

A database is a shared collection of logically related data and its description.
数据库是逻辑相关数据及其描述的共享集合。

The database represents the entities (real-world things), the attributes (their relevant properties), and the logical relationships between the entities.
数据库表示实体（现实世界的事物）、属性（它们的相关属性）以及实体之间的逻辑关系。

3.2 Advantages of Database 数据库好处

– Data is managed, so quality can be enforced by the DBMS
管理数据，因此 DBMS 可以强制执行质量

– Improved Data Sharing 改进的数据共享
• Different users get different views of the data 不同的用户对数据有不同的看法
• Efficient concurrent access 高效的并发访问

– Enforcement of Standards 执行标准
• All data access is done in the same way 所有数据访问均以相同方式完成

– Integrity constraints, data validation rules 完整性约束、数据验证规则

– Better Data Accessibility/ Responsiveness 更好的数据可访问性/响应能力
• Use of standard data query language (SQL) 使用标准数据查询语言 (SQL)

– Security, Backup/Recovery, Concurrency 安全、备份/恢复、并发
• Disaster recovery is easier 灾难恢复更容易

Program-Data Independence 程序数据独立性
– Metadata stored in DBMS, so applications don’t need to worry about data formats 元数据存储在 DBMS 中，因此应用程序无需担心数据格式

– Data queries/updates managed by DBMS so programs don’t need to process data access routines
数据查询/更新由 DBMS 管理，因此程序不需要处理数据访问例程

– Results in:
• Reduced application development time 缩短应用程序开发时间
• Increased maintenance productivity 提高维护效率
• Efficient access 高效访问

3.3 Key Database Concepts 数据库概念

– Table – an arrangement of related information stored in columns and rows.
表格 – 存储在列和行中的相关信息的排列。

– Field / Attribute – column in a table, contains homogenous set of data.
字段/属性 – 表中的列，包含同类数据集。

– Field data types - kind of data that can be stored in a field. For example, a field whose data type is Text can store data consisting of either text or number characters, but a Number field can store only numerical data.
字段数据类型 - 可以存储在字段中的数据类型。例如，数据类型为文本的字段可以存储由文本或数字字符组成的数据，但数字字段只能存储数字数据。

– Primary Key (PK) – a field in a table whose value is uniquely identifies each record in the table. A PK cannot be null (it must be given).
主键 (PK) – 表中的字段，其值唯一标识表中的每条记录。 PK 不能为空（必须给出）。

– Record – A row in table.
记录 – 表中的一行

3.3.1 Primary Key and Foreign Key 主键和外键

Primary Key

– A primary key is a unique attribute which the database uses to identify a row in a table.
主键是数据库用来标识表中行的唯一属性。

– It is a unique, auto-incrementing ID which is filled in by the database - in other words it is NEVER NULL-– ( NULL has the special meaning in databases of “unknown” or “not given” )
它是一个唯一的、自动递增的 ID，由数据库填充 - 换句话说，它永远不会为空–（NULL 在“未知”或“未给出”的数据库中具有特殊含义）

– A primary ID number will only ever be issued once
一个主要的 ID 号码只会发出一次

Foreign Key

– When we need to refer to a record in a separate table we reference its ID as a foreign key.
当我们需要引用单独表中的记录时，我们将其 ID 作为外键引用。

– A foreign key is defined in a second table, but it refers to the primary key or a unique key in the first table.
外键定义在第二个表中，但它指的是第一个表中的主键或唯一键

3.4 Kinds of Relationships (One-One , One-Many , Many-Many Relationship) 表和表的关系

One-One Relationship (1-1 Relationship):
One-to-One (1-1) relationship is defined as the relationship between two tables where both the tables should be associated with each other based on only one matching row.
一对一 (1-1) 关系定义为两个表之间的关系，其中两个表应仅基于一个匹配行相互关联。

One-Many Relationship (1-M Relationship):
The One-toMany relationship is defined as a relationship between two tables where a row from one table can have multiple matching rows in another table.
一对多关系定义为两个表之间的关系，其中一个表中的一行可以在另一个表中具有多个匹配行。

Many-to-Many Relationship (M-N Relationship) 多对多关系

Week 4 Declarative Data Analysis with SQL

4.1 SQL – The Structured Query Language 结构化查询语言(DDL, DML)

– SQL is the standard declarative query language for RDBMS
SQL 是 RDBMS 的标准声明式查询语言

– Describing what data we are interested in, but not how to retrieve it.
描述我们感兴趣的数据，而不是如何检索它。

– Supported commands from roughly two categories:
支持的命令大致分为两类

– DDL (Data Definition Language) 数据定义语言
• Create, drop, or alter the relation schema 创建、删除或更改关系模式
• Example:
CREATE TABLE name ( list_of_columns )

– DML (Data Manipulation Language) 数据操作语言
• for retrieval of information also called query language 用于检索信息，也称为查询语言
•Example:
INSERT, DELETE, UPDATE
SELECT … FROM … WHERE

4.2 Table Constraints and Relational Keys(Primary key,Foreign keys) 表约束和关系键

When creating a table, we can also specify Integrity Constraints for columns
创建表时，我们还可以为列指定完整性约束
– eg. domain types per attribute, or NULL / NOT NULL constraints
例如。每个属性的域类型，或 NULL / NOT NULL 约束

– Primary key: unique, minimal identifier of a relation.
主键：唯一的、最小的关系标识符。
– Examples include employee numbers, social security numbers, etc. This is how we can guarantee that all rows are unique.
示例包括员工编号、社会保险号等。这就是我们如何保证所有行都是唯一的。

– Foreign keys are identifiers that enable a dependent relation (on the many side of a relationship) to refer to its parent relation (on the one side of the relationship)
外键是使依赖关系（在关系的多方面）能够引用其父关系（在关系的一侧）的标识符

– Must refer to a candidate key of the parent relation 必须引用父关系的候选键
– Like a `logical pointer’ 就像一个“逻辑指针”

– Keys can be simple (single attribute) or composite (multiple attributes)
键可以是简单的（单属性）或复合的（多属性）

4.3 SQL Domain Constraints 域约束

SQL supports various domain constraints to restrict attribute to valid domains
SQL 支持各种域约束以将属性限制为有效域

• NULL / NOT NULL whether an attribute is allowed to become NULL (unknown)
是否允许属性变为 NULL（未知）

• DEFAULT to specify a default value
DEFAULT 指定默认值

• CHECK( condition ) a Boolean condition that must hold for every tuple in the db instance
一个布尔条件，该条件必须适用于数据库实例中的每个元组

Example:(DDL)

CREATE TABLE Student 
( 
	sid 		INTEGER 		PRIMARY KEY,
	name 		VARCHAR(20) 	NOT NULL,
	gender 		CHAR CHECK 		(gender IN ('M,'F','T')),
	birthday 	DATE,
	country 	VARCHAR(20),
	level 		INTEGER 		DEFAULT 1 CHECK (level BETWEEN 1 and 5)
);

4.4 SQL查询语言（暂无）

SQL语言不做详细介绍，基础语言可以在菜鸟教程找
链接: PostgreSQL 教程

Week 5 Scalable Data Analytics The Role of Indexes and Data Partitioning

5.1 Where is Data Stored? 数据存储在哪里

Main Memory (RAM):主要存储内存条
• Expensive 昂贵的
• Volatile 易挥发的

Secondary Storage (HDD):辅助存储硬盘
• Cheap 便宜的
• Stable 稳定的
• BIG 容量大

Tertiary Storage (e.g. Tape): 三级存储（例如磁带）
• Very Cheap 非常便宜
• Stable 稳定的

5.2 How to Access to Data on Secondary Storage FAST?如何快速访问二级存储上的数据

Key Challenge: Secondary storage needed for sheer data volume (and persistence), but it is slow.
主要挑战：纯粹的数据量（和持久性）需要二级存储，但速度很慢……

Approaches:方法
– Block-wise transfer 分块转移
• transfer data in fixed-size chunks (blocks or pages) between storage layers
在存储层之间以固定大小的块（块或页）传输数据

– Caching / Buffering 缓存/缓冲
• Keep ‘hot’ data in memory, use secondary storage for ‘cold’ data
将“热”数据保存在内存中，将“冷”数据使用二级存储

– Optimised File Organisation 优化的文件组织
• Heap Files vs. Sorted Files; Row Stores vs Column Stores
堆文件与排序文件；行存储与列存储

– Indexing 索引

– Partitioning 分区

5.3 Alternative File Organizations 替代文件组织（Heap Files，Sorted Files）

Many alternatives exist, each ideal for some situations, and not so good in others:
存在许多替代方案，每一种都适用于某些情况，但在其他情况下不太好：

– Indexes – data structures to organize records via trees or hashing
通过树或散列来组织记录的数据结构
– like sorted files, they speed up searches for a subset of records, based on values in certain (“search key”) fields
与排序文件一样，它们可以根据某些（“搜索键”）字段中的值加快对记录子集的搜索
– Updates are much faster than in sorted files.
更新比排序文件快得多。

5.3.1Heap Files（Unordered）堆文件无序

a record can be placed anywhere in the file where there is space (random order)
堆文件 – 记录可以放置在文件中任何有空间的地方（随机顺序）

– suitable when typical access is a file scan retrieving all records.
适用于典型访问是检索所有记录的文件扫描。

– Simplest file structure contains records in no particular order.
最简单的文件结构包含没有特定顺序的记录

–Access method is a linear scan 访问方法是线性扫描
–— In average half of the pages in a file must be read,in the worst case even the whole file
平均必须读取文件中的一半页面，在最坏的情况下甚至整个文件
–— Efficient if all rows are returned (SELECT * FROM table)
如果返回所有行则有效（SELECT * FROM table）
—– Very inefficient if a few rows are requested 如果请求几行，效率非常低

– Rows appended to end of file as they are inserted
行在插入时附加到文件末尾
—– Hence the file is unordered因此文件是无序的

– Deleted rows create gaps in file 删除的行在文件中产生间隙
–— File must be periodically compacted to recover space 必须定期压缩文件以恢复空间

5.3.2 Sorted Files

– store records in sequential order, based on the value of the search key of each record
Sorted Files – 根据每条记录的搜索键值按顺序存储记录
– best if records must be retrieved in some order, or only a ‘range’ of records is needed.
最好是必须按某种顺序检索记录，或者只需要一个“范围”的记录。

5.3.3 Column Store – Pros and Cons (暂无）

看lecture上讲的全。(偷懒一下，没人发现）

5.4 Index 索引

5.4.1 Indices (指数）

– Idea: Separate location mechanism from data storage
将定位机制与数据存储分离

– Just remember a book index:只需记住一个书籍索引
Index is a set of pages (a separate file) with pointers (page numbers) to the data page which contains the value
索引是一组页（一个单独的文件），带有指向包含值的数据页的指针（页码）

– Instead of scanning through whole book (relation) each time, using the index is much faster to navigate (less data to search)
不是每次都扫描整本书（关系），使用索引导航要快得多（要搜索的数据更少）

– Index typically much smaller than the actual data
索引通常比实际数据小得多

5.4.2 Index Example 索引例子

在这里插入图片描述

Here, index is on name attribute of Stations table
索引位于 Stations 表的 name 属性上
– We say name is the search key for this index (it is the attribute which we use to look up data)
name 是这个索引的搜索键（它是我们用来查找数据的属性）

5.4.3 Index structure choices:结构选择

– Tree index: search keys are stored in sorted order in index [this supports range query for search key]
树索引：搜索键按排序顺序存储在索引中[这支持搜索键的范围查询]

– Hash index: search keys are distributed uniformly across “buckets” in index using a “hash function”.
哈希索引：使用“哈希函数”在索引中的“桶”中均匀分布搜索键。

5.4.4 Index Definition in SQL 在SQL语言里

创建index：
CREATE INDEX name ON relation-name (<attributelist>)
例子：

CREATE INDEX StationNameIdx ON Stations(name)

To drop an index 删除索引:
DROP INDEX index-name

5.4.5 Clustered Index 聚集索引

In a clustered index, both index entries and rows with the actual data are ordered in the same way.
在聚集索引中，索引条目和包含实际数据的行都以相同的方式排序。

– The particular index structure (e.g. hash or tree) dictates how the index entries are organized in the storage structure
特定的索引结构（例如哈希或树）决定了索引条目在存储结构中的组织方式

– For a clustered index, this then dictates how the data rows are organized
对于聚集索引，这决定了数据行的组织方式

– There can be at most one clustered index on a table.
一张表上最多可以有一个聚集索引。
– e.g. the white pages of the phone book in alphabetical order
例如按字母顺序排列的电话簿白页

–CREATE TABLE statement generally creates a clustered index on primary key.
CREATE TABLE 语句通常在主键上创建聚集索引。
– To have clustered index on other attribute, 要在其他属性上设置聚集索引
in PostgreSQL use command: CLUSTER TABLE name ON Index

5.4.6 Unclustered Index 非聚集索引

– Index entries and rows are not ordered in the same way.
索引条目和行的排序方式不同。

– There can be many secondary indices on a table.
一个表上可以有很多次要索引。

– Index created by CREATE INDEX is generally an unclustered, secondary index.
CREATE INDEX 创建的索引通常是非聚集的二级索引

5.4.7 Covering Index 覆盖索引

– Goal: Is it possible to answer whole query just from an index?
目标：是否可以仅从索引中回答整个查询？

– Covering Index - an index that contains all attributes required to answer a given SQL query:覆盖索引 - 包含回答给定 SQL 查询所需的所有属性的索引：
– all attributes from the WHERE filter condition 来自 WHERE 过滤条件的所有属性
– if it is a grouping query, also all attributes from GROUP BY & HAVING
果是分组查询，还有来自 GROUP BY 和 HAVING 的所有属性
– all attributes mentioned in the SELECT clause
SELECT 子句中提到的所有属性

– Typically a multi-attribute index
通常是多属性索引

– Order of attributes is important: Prefix of the search key must be the attributes from the WHERE
属性顺序很重要：搜索关键字的前缀必须是来自 WHERE 的属性.

5.5 Distributed Data Management 分布式数据管理(Partitioning)

Two main physical design techniques:

– Data Partitioning 数据分区

– Storing sub-sets of the original data set at different places
在不同地方存储原始数据集的子集

• can be in different tables in schema on same server, or at remote sites
可以位于同一服务器或远程站点的架构中的不同表中

– Goal is to query smaller data sets & to gain scalability by parallelism
目标是查询较小的数据集并通过并行性获得可扩展性
\

– Sub-sets can be defined by
可以通过以下方式定义子集

• columns: Vertical Partitioning 列：垂直分区
• rows: Horizontal Partitioning 行：水平分区
(if each partition is stored on a different site also called Sharding)
如果每个分区都存储在不同的站点上，也称为分片）

–Data Replication (Not covered in this unit of study)
数据复制（本研究单元未涵盖）

– Storing copies (‘replicas’) of the same data at more than one place
在多个地方存储相同数据的副本（“副本”）
– Goal is fail safety / availability
目标是故障安全/可用性

5.5.1 Partitioning 分区

– Advantages of Partitioning:分区的优点：
– Easier to manage than a large table 比大桌子更容易管理

– Better availability: 更好的可用性
if one partition is down, others are unaffected if stored on different tablespace / disk
如果一个分区关闭，其他分区不受影响，如果存储在不同的表空间/磁盘上

– Helps with bulk loading, e.g for data warehouse applications
助于批量加载，例如用于数据仓库应用程序

– Queries faster on smaller partitions; can be evaluated in parallel
在较小的分区上查询速度更快；可以并行评估

Week 6 Scraping Web Data

6.1 (儿童节快乐）Web Scraping – General Approach 网页抓取 - 一般方法

– Reconnaissance 侦察

– Identify source, and check its structure and content
识别来源，并检查其结构和内容

– Webpage Retrieval 网页检索

– Download one or multiple pages from source
从源下载一个或多个页面
– Typically in a script or program that auto-generates new URLs based on website structure and its URL format
通常在根据网站结构及其 URL 格式自动生成新 URL 的脚本或程序中

– Data Extraction from webpage 从网页中提取数据
– Content parsing, raw data extraction
内容解析、原始数据提取

– Data Cleaning and transformation into required format
数据清理和转换为所需格式

– Data Storage / Analysis / combining with other data sets
数据存储/分析/与其他数据集结合

6.2 Robots Exclusion Standard 机器人排除标准(robots协议）

– Many websites provide a robots.txt file
许多网站提供 robots.txt 文件

– Meant for web crawlers who should check this content first before starting crawling a website
适用于在开始抓取网站之前应先检查此内容的网络爬虫

– Different rules in here这里有不同的规则
• Crawling/scraping allowed at all?是否允许爬行/抓取？

• Only specific subdirectories?
只有特定的子目录?

• Only certain programs (“user-agent”)?
只有某些程序（“用户代理”）？

• Which frequency (“request-rate”)?
哪个频率（“请求率”）？

Df. https://en.wikipedia.org/wiki/Robots_exclusion_standard

– Be a good net citizen: 做一个好的网民
Check, ask, don’t overload – and don’t steal (check copyright!)
检查、询问、不要超载——也不要偷窃（检查版权！）

6.3 Is it Legal? 法外狂徒？

– Web scraping per itself is not illegal, you are free to save all publicly data available on the internet to your computer.
网络抓取本身并不违法，您可以自由地将互联网上所有可用的公开数据保存到您的计算机上。

– The way you will use that data is what might be illegal.
您使用该数据的方式可能是非法的。

– Please read the website terms and conditions, and robots.txt, and make sure you are not doing anything illegal 😃
请阅读网站条款和条件以及 robots.txt，并确保您没有做任何违法的事情

6.4 Web Page Retrieval: URLs 网页检索 URL

– URL – Uniform Resource Locator URL – 统一资源定位符

– “address” format on the web 网络上的“地址”格式
– Example:
• https://convictrecords.com.au/ships/adamant/1821

– General Format 通用格式
• protocol://site/path_to_resource
• Typical protocols: http https ftp
典型协议

– Can be scripted or programmed; more details later and in tutorials
可以编写脚本或编程；稍后和教程中的更多详细信息

6.5 HTML – Hypertext Markup Language 超文本标记语言

– Webpages are written in HTML网页是用 HTML 编写的

– Textual markup language that defines structure, content, and design of a page as well as active elements (scripts, forms, etc.)
定义页面结构、内容和设计以及活动元素（脚本、表单等）的文本标记语言

– Typically several additional files linked:通常链接几个附加文件：
• CSS - cascading style sheets CSS - 级联样式表
• Scripts, Images, videos etc. 脚本、图像、视频等

6.6 General Structure of a Web Page 网页的一般结构

– Head 头部
– title, style sheets, scripts, meta-data
标题、样式表、脚本、元数据

– Body 主体
– headings, text, lists, tables, images, forms etc.
标题、文本、列表、表格、图像、表格等

6.7How to Select Content in a Webpage?如何选择网页中的内容？

详细请看lecture

– Four options: 四个选项

– text patterns 文本模式
– DOM navigation DOM 导航
– CSS selectors CSS 选择器
– XPath expressions XPath 表达式

6.8 HTML Document Model (DOM): Element-Tree 文档模型 (DOM)：元素树

在这里插入图片描述

Week 7 Semistructured Data; NoSQL

7.1 Getting Data via Service-APIs (Web Services) 通过服务 API（Web 服务）获取数据

– Many website or web service provide programmable APIs which allow you to explicitly request data for a program to process, instead of pages to view in browser
许多网站或 Web 服务提供可编程 API，允许您明确请求数据以供程序处理，而不是在浏览器中查看页面

7.2 Semistructured Data 半结构化数据

– HTML, XML and JSON are examples of so-called semistructured data models
HTML、XML 和 JSON 是所谓的半结构化数据模型的示例
– data with non-rigid structure具有非刚性结构的数据

– Characteristics of semistructured data半结构化数据的特征
– Missing or additional attributes
缺少或额外的属性

– Multiple attributes
多个属性

– Nesting: semistructured objects (‘documents’) are hierarchical / have tree-structure
嵌套：半结构化对象（“文档”）是分层的/具有树状结构

– Different types in different objects
不同对象中的不同类型

– Heterogeneous collections
异构集合

Self-describing, irregular data, no a priori structure
自描述，不规则数据，无先验结构

7.3 HTML vs. XML

– While HTML is mainly for web page design, 虽然 HTML 主要用于网页设计
XML is the more structured “cousin” for data exchange XML 是更结构化的数据交换“表亲”

– Some web services can be asked to send XML rather than HTML pages
可以要求某些 Web 服务发送 XML 而不是 HTML 页面

– Also common in enterprise data exchange, or open data sets
也常见于企业数据交换或开放数据集

7.4 Logical Document Structure 逻辑文件结构

– XML refers to its objects as elements XML 将其对象称为元素

– The top-most element is called the root or document element.
最顶层的元素称为根或文档元素。

– Elements are bound by tags:元素由标签绑定

– Tree structure! (not a graph) 树结构！（不是图表）

– Solely data type for leaf elements:PCDATA (parseable character data)
叶元素的唯一数据类型：PCDATA（可解析的字符数据）

7.5 How to query or filter XML?如何查询或过滤 XML？

– DOM Navigation DOM 导航

– XML documents represent a tree structure which can be navigated using XML’s Document Object Model (DOM)
XML 文档表示可以使用 XML 的文档对象模型 (DOM) 导航的树结构

– XPath

– XPath expressions allow to query single values, node(s) or whole subtrees within one XML document
XPath 表达式允许在一个 XML 文档中查询单个值、节点或整个子树

– XQuery

– XQuery builds on XPath to specify a declarative query language over a set of XML documents
XQuery 建立在 XPath 之上，以在一组 XML 文档上指定声明性查询语言

7.5.1 XPath Document Model Tree XPath 文档模型树

在这里插入图片描述

7.6 Semi-Structured data versus Structured Data 半结构化数据与结构化数据

– Relational World关系世界

Schema-first, rich type system for attributes, integrity constraints
模式优先，丰富的属性类型系统，完整性约束

– “First Normal Form”: only atomic type attributes allowed
第一范式”：仅允许原子类型属性

– Semi-structured World 半结构化世界

– Self-describing data with flexible structure
结构灵活的自描述数据

– Nested data model with tree-structure
具有树结构的嵌套数据模型

– optional attributes, grammar, schema and vocabulary
可选属性、语法、模式和词汇

7.7 NoSQL

– Traditional dbms platforms were relational (SQL as query language; relational data model) and also powerful (lots of features for integrity, security, tuning), expensive, resource-intensive, hard to administer
传统的 dbms 平台是关系型的（SQL 作为查询语言；关系型数据模型）并且功能强大（许多功能用于完整性、安全性、调优）、昂贵、资源密集、难以管理

– Mostly focused on scale-up (run on powerful expensive servers to get excellent performance)
主要关注纵向扩展（在功能强大的昂贵服务器上运行以获得出色的性能）

– Rise of cloud computing shifted focus to scale-out on many commodity simple servers, with fault-tolerance
云计算的兴起将重点转移到具有容错性的许多商用简单服务器上的横向扩展

– New systems were designed, and described as “NoSQL” because they gave up features of traditional platforms
设计了新系统，并将其描述为“NoSQL”，因为它们放弃了传统平台的功能

– Simpler data model, simpler queries and updates (eg without crosstable joins or triggers), weaker guarantees for consistency and integrity
更简单的数据模型，更简单的查询和更新（例如，没有跨表连接或触发器），一致性和完整性的保证较弱

– Often open-source and sometimes free
通常是开源的，有时是免费的

7.7.1NOSQL

– Over time, the new platforms added features like joins, triggers and integrity (under pressure from users) while old platforms added support for more diverse data models
随着时间的推移，新平台增加了连接、触发器和完整性等功能（在用户压力下），而旧平台增加了对更多样化数据模型的支持

– The phrase “Not only SQL” has been used for these systems
短语“不仅是 SQL”已用于这些系统

7.8 MongoDB Data Model 数据模型

– Basically a JSON store (JSON type system)
基本上是一个 JSON 存储（JSON 类型系统）

– Flexible schema: Document in a collection do not need to have the same structure
灵活的模式：集合中的文档不需要具有相同的结构

– All documents have an object ID (_id) – either user-defined or automatically generated
所有文档都有一个对象 ID (_id) – 用户定义或自动生成

– Relationships:
either via nested documents (“embedded sub-documents”) or using references
通过嵌套文档（“嵌入式子文档”）或使用引用

7.8.1 MongoDB vs. RDBMS

在这里插入图片描述

Week 8 Text Data Processing: Feature Extraction & Analysis

8.1 Text data

Text data usually does not have a pre-defined data model, is unstructured and is typically text-heavy, but may contain dates, numbers and facts as well.
文本数据通常没有预定义的数据模型，是非结构化的，通常是大量文本，但也可能包含日期、数字和事实。

This results in ambiguities that make it more difficult to understand than data in structured databases.
这会导致歧义，使其比结构化数据库中的数据更难理解。

8.2 Machine Learning tasks 机器学习任务

– Supervised learning – predict a value where truth is available in the training data
监督学习 – 预测训练数据中的真实值

– Prediction 预言

– Classification (categorical - discrete labels), Regression (quantitative -numeric values)
分类（分类 - 离散标签），回归（定量 - 数值）

– Unsupervised learning – find patterns without ground truth in training data
无监督学习 – 在训练数据中找到没有基本事实的模式

– Clustering 聚类

– Probability distribution estimation 概率分布估计

– Finding association (in features) 寻找关联（在功能中）

– Dimension reduction 降维

Other tasks: Semi-supervised learning, Reinforcement learning
其他任务：半监督学习、强化学习
在这里插入图片描述

8.3 Tokenisation

把有用的提取出来

Split a string (document) into pieces called tokens
将字符串（文档）拆分为称为令牌的部分

– Possibly remove some characters, e.g., punctuation
可能会删除一些字符，例如标点符号

– Remove “stop words” such as “a”, “the”, “and” which are considered irrelevant
删除“停用词”，如“a”、“the”、“and”等被认为不相关的词

8.4 Normalisation

统一同一格式

Map similar words to the same token
将相似的词映射到相同的标记

– Stemming/lemmatisation 词干/词形还原

– Avoid grammatical and derivational sparseness 避免语法和派生稀疏
– E.g., “was” => “be”

– Lower casing, encoding 下壳，编码

– E.g., “Naïve” => “naive”

8.5 Indicator Features

记录

Binary indicator feature for each word in a document
文档中每个单词的二进制指示符功能

Ignore frequencies
忽略频率

8.6 Term Frequency Weighting

Term frequency 词频

– Give more weight to terms that are common in document
对文档中常见的术语给予更多权重
– TF = |occurrences of term in doc|

– Damping 阻尼

– Sometimes want to reduce impact of high counts
有时想减少高计数的影响
TF = log(|occurrences of term in doc|)

8.7 TF-IDF Weighting 加权

Inverse document frequency (IDF)逆向文档频率 (IDF)

– Give less weight to terms that are common across documents
减少文档中常见术语的权重
• deals with the problems of the Zipf distribution
处理 Zipf 分布的问题
– IDF = log(|total docs|/|docs containing term|)

– TFIDF
– TFIDF = TF * IDF

8.8 Vector Space Model 向量空间模型

Documents are represented as vectors in term space
文档在术语空间中表示为向量

– Terms are usually stems
术语通常是词干
– Document vector values can be weighted by, e.g., frequency
文档向量值可以通过例如频率加权

– Queries represented the same as documents
查询表示与文档相同

8.9 Document Vectors 文档向量

All document vectors together: Document-Term-Matrix (Feature-Matrix) 所有文档向量加在一起

Week 9 (Geo-)Spatial Data

9.1 Spatial Data 空间数据

– Spatial data is about objects and entities which have a location and/or a geometry
空间数据是关于具有位置和/或几何形状的对象和实体

– A special form is geospatial data which refers to data or information that identifies the geographic location of features and boundaries on Earth (such as localities, cities, suburbs etc)
一种特殊形式是地理空间数据，它指的是识别地球上特征和边界（例如地点、城市、郊区等）的地理位置的数据或信息

9.2 SDBMS vs GIS

– Spatial Database Management System (SDBMS)
空间数据库管理系统 (SDBMS)

– Handle large amount of spatial data stored in secondary storage.
处理存储在二级存储中的大量空间数据。

– Spatial semantics built into query language
查询语言中内置的空间语义

– Specialized index structure to access spatial data
访问空间数据的专用索引结构

– **Geographic Information System (GIS)**地理信息系统 (GIS)

– SDBMS Client SDBMS 客户端

– Characterized by a rich set of geographic analysis functions
以丰富的地理分析功能为特点

– SDBMS allows GIS to scale to large databases, which are now becoming the norm
SDBMS 允许 GIS 扩展到大型数据库，这现已成为常态

– Information in a GIS is typically organized in “layers”. GIS 中的信息通常按“层”组织。
• For example a map will have a layer of “roads”, “train stations”, “suburbs” and “water bodies”.
例如，地图将具有“道路”、“火车站”、“郊区”和“水体”层。
• GIS allows data exploration and integration across layers.
GIS 允许跨层进行数据探索和集成

9.3 Object Model 对象模型

– Object model concepts 对象模型概念

– Objects: distinct identifiable things relevant to an application
对象：与应用程序相关的不同可识别事物
• Objects have attributes and operations
对象具有属性和操作

– Attribute: a simple (e.g. numeric, string) property of an object
属性：对象的简单（例如数字、字符串）属性

– Operations: function maps object attributes to other objects
操作：函数将对象属性映射到其他对象

9.4 PostGIS: Geometry vs. Geography Type

– Geometry type: 几何类型(平面)：
– shapes on a plane; shortest path between two points is a straight line
平面上的形状；两点之间的最短路径是一条直线

– Geography type 地理类型(球体):
– Basis is a sphere; shortest path between two points is a circle arc
基础是一个球体；两点之间的最短路径是圆弧

Week 10 Time Series Data

10.1 Temporal Data 时间数据

– Almost all data is qualified with time (period or point)
几乎所有数据都用时间（周期或点）限定

– Web stores 网上商店
– Data warehousing 数据仓库
– Medical records, loans, … 医疗记录、贷款、…
– Sensor data and time series 传感器数据和时间序列
– Transport information 运输信息

10.2 Temporal Support in Databases 数据库中的时间支持

– Limited support for temporal data management in DBMSs
对 DBMS 中时态数据管理的有限支持

– Conventional (non-temporal) DBs represent a static snapshot
传统（非临时）数据库代表静态快照

– Management of temporal aspects is implemented by the application
时间方面的管理由应用程序实现
• Adds additional complexity to application programs
增加了应用程序的复杂性

– Some time data types and functions available in SQL, e.g., DATE, TIME, DATEADD(), DATEDIFF()
SQL 中可用的一些时间数据类型和函数，例如 DATE、TIME、DATEADD()、DATEDIFF()
• SQL:2011 added support for temporal tables
SQL:2011 添加了对时态表的支持
• Still very limited query support
仍然非常有限的查询支持

– A temporal database provides built-in support for the management of temporal data/time
时态数据库为时态数据/时间的管理提供内置支持

– Representation of various temporal aspects, e.g., valid time, transaction time
各种时间方面的表示，例如，有效时间、交易时间

– Support for multiple calendars and granularities
支持多种日历和粒度

– Easy formulation of complex queries over time
随着时间的推移轻松制定复杂的查询

– Queries over and modification of previous states
查询和修改以前的状态

10.3 Concepts in Temporal Databases

10.3.1 Temporal Data Types 时态数据类型

– SQL supports time instants and intervals (but no periods)
SQL 支持时间瞬间和间隔（但不支持句点）

– Instant data types: 即时数据类型：

– DATE 日期
• SQL-92: day, month and year of a time instant (from year 1 to 9999)
• Postgresql: date (no time of day) from 4713 BC to 5874897 AD

– TIMESTAMP 时间戳
• SQL-92: date + time with variable resolution of fractions of a second (default: 1ms)
• Postgresql: date + time of same range than DATE with 1 ms resolution; optional time zone

– TIME 时间
• SQL-92: hours, minutes, seconds and optional fractional digits of second
• not really a time instant (no date!); in PostgreSQL with 1ms resolution

– Interval data types: 间隔数据类型
– Various specification options, eg. Year-Month Intervals: INTERVAL YEAR TO MONTH
各种规格选项，例如。年月间隔

– Many DBMS only support time instants, but no intervals
许多 DBMS 只支持时间瞬间，但不支持时间间隔
– Must hence be simulated with two time instants (start + end)
因此必须模拟两个时间点（开始 + 结束）

10.3.2 Kinds of Data 数据种类

– User-defined time 用户定义的时间

– According to Snodgrass as ‘an uninterpreted time interval’
根据 Snodgrass 的说法，这是“未解释的时间间隔”

– E.g. a birthdate or a publication time
例如出生日期或出版时间

– Valid Time & Transaction Time
有效时间和交易时间
– Cf. following examples
参见下面的例子

– A table can be associated with none, one, two or all three kinds of time
一个表可以关联无、一种、两种或所有三种时间

10.3.3 Kinds of Temporal Statements

– Current 当前的

– “What is now?”
• E.g. “How many products do we currently have in stock?”

– Sequenced已排序

– “What was, and when?”
• E.g. “Give the sequence of how many product were in stock.”
• Or “When did the stock level fall below X in the past?”

– Very central, but not directly supported by SQL!

– Nonsequenced 无序

– “What was at any time?”
• E.g. “How many products A did we have at any time in stock?”

10.3.4 Transaction Time and Valid Time 交易时间和有效时间

– Valid time records the time when a fact is true in the real world.
有效时间记录事实在现实世界中为真的时间。

– Can move forward and backward 可以前后移动

– Transaction time records the history of database activity.
事务时间记录了数据库活动的历史。

– Only moves forward (as you cannot go back in history and change things –alas!)
只能前进（因为您无法回到历史并改变事物 – 唉！）

– Therefore allows rollback (very useful for auditing)
因此允许回滚（对审计非常有用）

Week 11 Image Data Processing

11.1 Image Data 图像数据

– Images can be described as vector graphics or raster data
图像可以描述为矢量图形或光栅数据

– Raster images 光栅图像

– Matrix with fixed number of rows and columns
具有固定行数和列数的矩阵

– Digital images consist of fixed number of picture elements, called pixels
数字图像由固定数量的图片元素组成，称为像素

– Each pixel represents brightness of a given color
每个像素代表给定颜色的亮度
• Color depth => different number of channels
颜色深度 => 不同的通道数

– Raster images can be created in multiple ways
可以通过多种方式创建光栅图像

– Digital photography / video
数码摄影/视频

– Image sensors in (scientific) instruments (e.g. satellite images, astronomy, DNA sequencers, microscopes, …)
(科学）仪器中的图像传感器（例如卫星图像、天文学、DNA 测序仪、显微镜等）
– Scanners 扫描仪

– Medical instruments (e.g. Xray, CET, MRT) 医疗器械（例如 X 射线、CET、MRT）

11.2 Types of Images种类

TrueColor or RGB Image 真彩色或RGB图像

Gray-scale image 灰度图像

Binary image 二进制图像

详细内容在lecture-11上

11.3 Aspects of Image Processing 图像处理方面

– Image Enhancement: Processing an image so that the result is more suitable for a particular application. (sharpening or de-blurring an out of focus image, highlighting edges, improving image contrast, or brightening an image, removing noise)
– 图像增强：处理图像以使结果更适合特定应用程序。（锐化或去模糊离焦图像、突出边缘、提高图像对比度或增亮图像、去除噪点）

– Image Restoration: This may be considered as reversing the damage done to an image by a known cause. (removing of blur caused by linear motion, removal of optical distortions)
图像恢复：这可以被视为逆转已知原因对图像造成的损坏。（去除线性运动引起的模糊，去除光学畸变）

– Image Segmentation: This involves subdividing an image into constituent parts, or isolating certain aspects of an image. (finding lines, circles, or particular shapes in an image, in an aerial photograph, identifying cars, trees, buildings, or roads.
– 图像分割：这涉及将图像细分为组成部分，或隔离图像的某些方面。（在图像、航拍照片中寻找线条、圆圈或特定形状，识别汽车、树木、建筑物或道路。

11.4 Morphological Image Processing 形态学图像处理

– broad set of operations that process images based on shapes.
基于形状处理图像的广泛操作集。

– Goal: removing of imperfections in images (binary or grayscale)
目标：去除图像中的缺陷（二进制或灰度）

– Morphological techniques probe an image with a small shape or template called a structuring element.
形态学技术使用称为结构元素的小形状或模板探测图像。

– The structuring element is a small binary image, i.e. a small matrix of pixels, each with a value of zero or one
结构元素是一个小的二进制图像，即一个小的像素矩阵，每个像素的值为零或一

Week 12 Big Data

在这里插入图片描述

12.1 Big Data: Volume 量

– very relative due to Moore’s Law 由于摩尔定律而非常相关

– What once was considered big data, is considered a main-memory problem nowadays
曾经被认为是大数据，现在被认为是主内存问题

– eg. Excel: In 2003 max 65000 rows, now max 1 million rows, still …
例如。 Excel：2003 年最多 65000 行，现在最多 100 万行，仍然…

– Nowadays: Terabyte to Exabyte 如今：太字节到艾字节

12.2 Big Data: Velocity 速度

– conventional scientific research:
常规科学研究

– months to gather data from 100s cases, weeks to analyze the data and years to publish.
几个月收集 100 个案例的数据，几周来分析数据，几年来发布。

– Example: Iris flower data set by Edgar Anderson and Ronal Fisher from 1936
示例：1936 年 Edgar Anderson 和 Ronal Fisher 设置的鸢尾花数据

– on the other end of the scale: Twitter
在天平的另一端：推特
– average 6000 tweets/sec, 500 million per day or 200 billion per year
平均每秒 6000 条推文，每天 5 亿条或每年 2000 亿条

12.3 Big Data: Variety 多样性

– Structured Data, such as CSV or RDBMS
结构化数据，例如 CSV 或 RDBMS

– Semi-structured Data, such as JSON or XML
半结构化数据，例如 JSON 或 XML

– Unstructured Data, ie. text, e-mails, images, video
非结构化数据，即。文本、电子邮件、图像、视频

– an estimated 80% of enterprise data is unstructured
估计 80% 的企业数据是非结构化的

– study by Forester Research: variety biggest challenge in Big Data
Forester Research 的研究：大数据中的多样性最大挑战

12.4 Scale-Up 单个升级

– The traditional approach: 传统方法：

– To scale with increasing load, buy more powerful, larger hardware
为了随着负载的增加而扩展，购买更强大、更大的硬件
• from single workstation
从单个工作站

• to dedicated db server
到专用数据库服务器

• to large massive-parallel database appliance
到大型海量并行数据库设备

12.5 The Alternative: Scale-Out 增加数量

A single server has limits… 单个服务器有限制……
For real Big Data processing, need to scale-out to a cluster of multiple servers (nodes):
对于真正的大数据处理，需要横向扩展到多台服务器（节点）的集群：

12.6 MapReduce Overview

– Scan large volumes of data
扫描大量数据

– Map: Extract some interesting information
地图：提取一些有趣的信息

– Shuffle and sort intermediate results
对中间结果进行洗牌和排序

– Reduce: aggregate intermediate results
减少：聚合中间结果

– Generate final output
生成最终输出

– Key idea: provide an abstraction at the point of these two operations (map and reduce)
关键思想：在这两个操作（map 和 reduce）的点上提供一个抽象

– Higher-order functions高阶函数
– Cf. map functions in functional programming languages such as Lisp or Haskell
参见函数式编程语言（如 Lisp 或 Haskell）中的映射函数

12.7 MapReduce Discussion

Pros:优点

– very flexible due to the user-defined functions 由于用户定义的功能而非常灵活

– great scalability because FP approach 伟大的可扩展性，因为 FP 方法

– easy parallelism due to stateless functions 由于无状态函数而易于并行

– fault-tolerance 容错

Cons: 缺点
– requires programming skills and functional thinking 需要编程技能和函数式思维
– relatively low-level, even filtering to be coded manually相对低级，甚至过滤手动编码
– complex frameworks 复杂的框架
– batch-processing oriented 面向批处理