INT303 Final Review Summary

Module 1

The basics.

1. What is big data analytics?

The ability to take data, be able to understand it, process it, extract value from it, visualize it, and communicate it.

2. What is data?

“A datum (a piece of information) is a single measurement of something on a scale that is understandable to both the recorder and the reader. Data are multiple such measurements.”
Claim: everything is (can be) data!

3. What is ‘Big Data’?

A rectangular data set has two dimensions: the number of observations (n) and the number of predictors (p).
‘Big Data’ is when:

  • n is big (and p is small to moderate)
  • p is big (and n is small to moderate)
  • n and p are both big

4. Big data grammar

Python data grammar (pandas, etc.), as sketched below.
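
A minimal sketch of this data grammar using pandas; the column names and values are made up purely for illustration. It loads a rectangular data set, checks its n observations and p predictors, and filters rows.

    import pandas as pd

    # Hypothetical rectangular data set: n observations (rows) x p predictors (columns)
    df = pd.DataFrame({
        "age": [23, 45, 31, 52],
        "income": [38000, 91000, 55000, 72000],
        "clicks": [12, 3, 7, 9],
    })

    n, p = df.shape             # n = number of observations, p = number of predictors
    print(n, p)                 # 4 3
    print(df.describe())        # summary statistics per column
    print(df[df["age"] > 30])   # filter: observations with age above 30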

Module 2

Big data collection and visualization.

1. Data collection and Data scraping

Why scrape the web?

  • companies have not provided APIs
  • automate tasks
  • keep up with sites
  • fun!

Challenges in web scraping

  • Which data?
    • It is not always easy to know which site to scrape
    • Which data is relevant, up to date, reliable?
  • The internet is dynamic
    • Each web site has a particular structure, which may be changed at any time
  • Data is volatile
    • Be aware of changing data patterns over time

Legal

  • Privacy:
    • Legislation on the protection of personal information
    • Only scrape public sources
  • Netiquette (practical), illustrated in the sketch after this list:
    • respect the Robots Exclusion Protocol, also known as robots.txt
    • identify yourself (user-agent)
    • do not overload servers: leave some idle time between requests, run crawlers at night / in the morning
    • inform website owners if feasible
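
A minimal Python sketch of these netiquette rules; https://example.com, the paths, and the contact address in the User-Agent string are placeholders:

    import time
    import urllib.robotparser
    import requests

    BASE = "https://example.com"   # placeholder site
    USER_AGENT = "int303-review-bot/0.1 (contact: student@example.com)"   # identify yourself

    # Respect the Robots Exclusion Protocol (robots.txt)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(BASE + "/robots.txt")
    rp.read()

    for url in [BASE + "/page1", BASE + "/page2"]:
        if not rp.can_fetch(USER_AGENT, url):
            continue                 # skip paths the site disallows
        resp = requests.get(url, headers={"User-Agent": USER_AGENT})
        print(url, resp.status_code)
        time.sleep(2)                # idle time between requests; do not overload the server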

2. Data Visualization

Visualization Motivation

Visualizations help us to analyze and explore the data.
They help to:

  • Identify hidden patterns and trends
  • Formulate/test hypotheses
  • Communicate any modeling results
    • Present information and ideas succinctly
    • Provide evidence and support
    • Influence and persuade
  • Determine the next step in analysis/modeling

Principles of Visualization

Some basic data visualization guidelines from Edward Tufte:

  1. Maximize the data-to-ink ratio: show the data
  2. Don’t lie with scale: minimize the lie factor
  3. Minimize chart junk: show data variation, not design variation
  4. Clear, detailed and thorough labeling

Types of Visualization

What do you want your visualization to show about your data?

  • Distribution: how a variable or variables in the dataset distribute over a range of possible values. (Histogram; see the sketch after this list)
  • Relationship: how the values of multiple variables in the dataset relate. (Scatter Plot)
  • Composition: how a part of your data compares to the whole. (Line or Bar Charts)
  • Comparison: how trends in multiple variables or datasets compare. (Line or Bar Charts)
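
A minimal matplotlib sketch of the first two plot types; the data is randomly generated purely for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)           # a synthetic variable
    y = 2 * x + rng.normal(size=500)   # a second variable related to the first

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].hist(x, bins=30)           # Distribution: histogram of one variable
    axes[0].set(title="Distribution", xlabel="x", ylabel="count")
    axes[1].scatter(x, y, s=10)        # Relationship: scatter plot of two variables
    axes[1].set(title="Relationship", xlabel="x", ylabel="y")
    plt.tight_layout()
    plt.show()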

Module 3

Systems and software. It introduces the popular platforms available for big data processing.

1. Large-scale computing

Challenges:

  • How do you distribute computation?
  • How can we make it easy to write distributed programs?
  • Machines fail

Issue: Copying data over a network takes time

Idea:

  • Bring computation to the data
  • Store files multiple times for reliability
  • Spark/Hadoop address these problems
    • Storage Infrastructure - File system
    • Google: GFS. Hadoop: HDFS
  • Programming model
    • MapReduce
    • Spark

Problem:
If nodes fail, how to store data persistently?

Answer:

  • Distributed File System
    • Provides global file namespace
  • Typical usage pattern:
    • Huge files (100s of GB to TB)
    • Data is rarely updated in place
    • Reads and appends are common

2. Distributed file system

  • Chunk servers
    • File is split into contiguous chunks
    • Typically each chunk is 16-64MB
    • Each chunk is replicated (usually 2x or 3x)
    • Try to keep replicas in different racks
  • Master node
    • a.k.a. Name Node in Hadoop’s HDFS
    • Stores metadata about where files are stored
    • Might be replicated
  • Client library for file access (see the toy sketch after this list)
    • Talks to the master to find chunk servers
    • Connects directly to chunk servers to access data
  • Reliable distributed file system
    • Data kept in “chunks” spread across machines
    • Each chunk replicated on different machines
    • Seamless recovery from disk or machine failure
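
A toy, in-memory sketch of the read path described above; all class and function names are hypothetical, and a real distributed file system would do this over the network:

    class MasterNode:
        """Stores only metadata: which chunk servers hold each chunk of a file."""
        def __init__(self, chunk_locations):
            # {filename: [(chunk_id, [server_name, ...]), ...]}
            self.chunk_locations = chunk_locations

        def locate(self, filename):
            return self.chunk_locations[filename]

    class ChunkServer:
        """Holds the actual chunk bytes."""
        def __init__(self, chunks):
            self.chunks = chunks            # {chunk_id: data}

        def read(self, chunk_id):
            return self.chunks[chunk_id]

    def client_read(master, servers, filename):
        # 1. ask the master where the chunks are; 2. read each chunk directly from a replica
        data = b""
        for chunk_id, replicas in master.locate(filename):
            data += servers[replicas[0]].read(chunk_id)   # any live replica would do
        return data

    # "log.txt" split into two chunks, each replicated on two servers
    servers = {
        "s1": ChunkServer({0: b"hello "}),
        "s2": ChunkServer({0: b"hello ", 1: b"world"}),
        "s3": ChunkServer({1: b"world"}),
    }
    master = MasterNode({"log.txt": [(0, ["s1", "s2"]), (1, ["s2", "s3"])]})
    print(client_read(master, servers, "log.txt"))   # b'hello world'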

3. MapReduce: Distributed computing programming model

MapReduce is a style of programming designed for:

  • Easy parallel programming
  • Invisible management of hardware and software failures
  • Easy management of very-large-scale data

It has several implementations, including Hadoop, Spark (used in this class), Flink, and the original Google implementation, just called “MapReduce”.

Steps of MapReduce

Map:

  • Apply a user-written Map function to each input element
    • Mapper applies the Map function to a single element
    • Many mappers grouped in a Map task (the unit of parallelism)
  • The output of the Map function is a set of 0, 1, or more key-value pairs.

Group by key: Sort and shuffle

  • System sorts all the key-value pairs by key, and outputs key-(list of values) pairs

Reduce:

  • The user-written Reduce function is applied to each key-(list of values) pair (see the word-count sketch below)
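
A minimal sketch of the three steps above, simulated in plain Python with word count as the Map and Reduce functions; a real framework would run many mappers and reducers in parallel on different machines:

    from itertools import groupby
    from operator import itemgetter

    def map_fn(document):
        # Map: emit 0, 1, or more (key, value) pairs per input element
        for word in document.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # Reduce: combine all values that share a key
        return (key, sum(values))

    documents = ["big data big ideas", "big data systems"]

    # Map task: apply the Map function to every input element
    pairs = [pair for doc in documents for pair in map_fn(doc)]

    # Group by key: sort and shuffle into key-(list of values) pairs
    pairs.sort(key=itemgetter(0))
    grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0))]

    # Reduce: apply the Reduce function to each key-(list of values) pair
    print([reduce_fn(k, vs) for k, vs in grouped])
    # [('big', 3), ('data', 2), ('ideas', 1), ('systems', 1)]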

True or False:

  • Each mapper/reducer must generate the same number of output key/value pairs as it receives on the input. (False)
  • The output type of keys/values of mappers/reducers must be of the same type as their input. (False)
  • The inputs to reducers are grouped by key. (True)
  • It is possible to start reducers while some mappers are still running. (False)

4. Spark: Extends MapReduce

Spark: Overview

Open source software (Apache Foundation)
Supports Java, Scala and Python

  • Key construct/idea: Resilient Distributed Dataset (RDD)
  • Higher-level APIs:
    • DataFrames & DataSets
    • Introduced in more recent versions of Spark
    • Different APIs for aggregate data, which made it possible to introduce SQL support

Key concept: Resilient Distributed Dataset (RDD)
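
A minimal PySpark sketch of the RDD idea, assuming a local Spark installation; the input strings are made up:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-sketch")

    # An RDD is an immutable, partitioned collection that Spark can recompute if a node fails
    lines = sc.parallelize(["big data big ideas", "big data systems"])

    counts = (lines.flatMap(lambda line: line.split())   # map step: split into words
                   .map(lambda word: (word, 1))          # emit (key, value) pairs
                   .reduceByKey(lambda a, b: a + b))     # group by key and reduce

    print(counts.collect())   # e.g. [('big', 3), ('data', 2), ('ideas', 1), ('systems', 1)]
    sc.stop()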
