INT303 Final Review Summary

Module 1

The basics.

1. What is big data analytics?

The ability to take data, be able to understand it, process it, extract value from it, visualize it, and communicate it.

2. What is data?

“A datum (a piece of information) is a single measurement of something on a scale that is understandable to both the recorder and the reader. Data are multiple such measurements.”
Claim: everything is (can be) data!

3. What is ‘Big Data’?

A rectangular data set has two dimensions: the number of observations (n) and the number of predictors (p).
‘Big Data’ is when:

  • n is big (and p is small to moderate)
  • p is big (and n is small to moderate)
  • n and p are both big

4. Big data grammar

Python data grammar (pandas, etc.), as sketched below.
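
A minimal sketch of this data grammar using pandas; the column names and values are made up purely for illustration. It loads a rectangular data set, checks its n observations and p predictors, and filters rows.

    import pandas as pd

    # Hypothetical rectangular data set: n observations (rows) x p predictors (columns)
    df = pd.DataFrame({
        "age": [23, 45, 31, 52],
        "income": [38000, 91000, 55000, 72000],
        "clicks": [12, 3, 7, 9],
    })

    n, p = df.shape             # n = number of observations, p = number of predictors
    print(n, p)                 # 4 3
    print(df.describe())        # summary statistics per column
    print(df[df["age"] > 30])   # filter: observations with age above 30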

Module 2

Big data collection and visualization.

1. Data collection and Data scraping

Why scrape the web?

  • companies have not provided APIs
  • automate tasks
  • keep up with sites
  • fun!

Challenges in web scraping

  • Which data?
    • It is not always easy to know which site to scrape
    • Which data is relevant, up to date, reliable?
  • The internet is dynamic
    • Each web site has a particular structure, which may be changed at any time
  • Data is volatile
    • Be aware of changing data patterns over time

Legal

  • Privacy:
    • Legislation on the protection of personal information
    • Only scrape public sources
  • Netiquette (practical), illustrated in the sketch after this list:
    • respect the Robots Exclusion Protocol, also known as robots.txt
    • identify yourself (user-agent)
    • do not overload servers: leave some idle time between requests, run crawlers at night / in the morning
    • inform website owners if feasible
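
A minimal Python sketch of these netiquette rules; https://example.com, the paths, and the contact address in the User-Agent string are placeholders:

    import time
    import urllib.robotparser
    import requests

    BASE = "https://example.com"   # placeholder site
    USER_AGENT = "int303-review-bot/0.1 (contact: student@example.com)"   # identify yourself

    # Respect the Robots Exclusion Protocol (robots.txt)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(BASE + "/robots.txt")
    rp.read()

    for url in [BASE + "/page1", BASE + "/page2"]:
        if not rp.can_fetch(USER_AGENT, url):
            continue                 # skip paths the site disallows
        resp = requests.get(url, headers={"User-Agent": USER_AGENT})
        print(url, resp.status_code)
        time.sleep(2)                # idle time between requests; do not overload the server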

2. Data Visualization

Visualization Motivation

Visualizations help us to analyze and explore the data.
They help to:

  • Identify hidden patterns and trends
  • Formulate/test hypotheses
  • Communicate any modeling results
    • Present information and ideas succinctly
    • Provide evidence and support
    • Influence and persuade
  • Determine the next step in analysis/modeling

Principles of Visualization

Some basic data visualization guidelines from Edward Tufte:

  1. Maximize the data-to-ink ratio: show the data
  2. Don’t lie with scale: minimize the lie factor
  3. Minimize chart junk: show data variation, not design variation
  4. Clear, detailed and thorough labeling

Types of Visualization

What do you want your visualization to show about your data?

  • Distribution: how a variable or variables in the dataset distribute over a range of possible values. (Histogram; see the sketch after this list)
  • Relationship: how the values of multiple variables in the dataset relate. (Scatter Plot)
  • Composition: how a part of your data compares to the whole. (Line or Bar Charts)
  • Comparison: how trends in multiple variables or datasets compare. (Line or Bar Charts)
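
A minimal matplotlib sketch of the first two plot types; the data is randomly generated purely for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)           # a synthetic variable
    y = 2 * x + rng.normal(size=500)   # a second variable related to the first

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].hist(x, bins=30)           # Distribution: histogram of one variable
    axes[0].set(title="Distribution", xlabel="x", ylabel="count")
    axes[1].scatter(x, y, s=10)        # Relationship: scatter plot of two variables
    axes[1].set(title="Relationship", xlabel="x", ylabel="y")
    plt.tight_layout()
    plt.show()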

Module 3

Systems and software. It introduces the popular platforms available for big data processing.

1. Large-scale computing

Challenges:

  • How do you distribute computation?
  • How can we make it easy to write distributed programs?
  • Machines fail

Issue: Copying data over a network takes time

Idea:

  • Bring computation to the data
  • Store files multiple times for reliability
  • Spark/Hadoop address these problems
    • Storage Infrastructure - File system
    • Google: GFS. Hadoop: HDFS
  • Programming model
    • MapReduce
    • Spark

Problem:
If nodes fail, how to store data persistently?

Answer:

  • Distributed File System
    • Provides global file namespace
  • Typical usage pattern:
    • Huge files (100s of GB to TB)
    • Data is rarely updated in place
    • Reads and appends are common

2. Distributed file system

  • Chunk servers
    • File is split into contiguous chunks
    • Typically each chunk is 16-64MB
    • Each chunk is replicated (usually 2x or 3x)
    • Try to keep replicas in different racks
  • Master node
    • a.k.a. Name Node in Hadoop’s HDFS
    • Stores metadata about where files are stored
    • Might be replicated
  • Client library for file access (see the toy sketch after this list)
    • Talks to the master to find chunk servers
    • Connects directly to chunk servers to access data
  • Reliable distributed file system
    • Data kept in “chunks” spread across machines
    • Each chunk replicated on different machines
    • Seamless recovery from disk or machine failure
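
A toy, in-memory sketch of the read path described above; all class and function names are hypothetical, and a real distributed file system would do this over the network:

    class MasterNode:
        """Stores only metadata: which chunk servers hold each chunk of a file."""
        def __init__(self, chunk_locations):
            # {filename: [(chunk_id, [server_name, ...]), ...]}
            self.chunk_locations = chunk_locations

        def locate(self, filename):
            return self.chunk_locations[filename]

    class ChunkServer:
        """Holds the actual chunk bytes."""
        def __init__(self, chunks):
            self.chunks = chunks            # {chunk_id: data}

        def read(self, chunk_id):
            return self.chunks[chunk_id]

    def client_read(master, servers, filename):
        # 1. ask the master where the chunks are; 2. read each chunk directly from a replica
        data = b""
        for chunk_id, replicas in master.locate(filename):
            data += servers[replicas[0]].read(chunk_id)   # any live replica would do
        return data

    # "log.txt" split into two chunks, each replicated on two servers
    servers = {
        "s1": ChunkServer({0: b"hello "}),
        "s2": ChunkServer({0: b"hello ", 1: b"world"}),
        "s3": ChunkServer({1: b"world"}),
    }
    master = MasterNode({"log.txt": [(0, ["s1", "s2"]), (1, ["s2", "s3"])]})
    print(client_read(master, servers, "log.txt"))   # b'hello world'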

3. MapReduce: Distributed computing programming model

MapReduce is a style of programming designed for:

  • Easy parallel programming
  • Invisible management of hardware and software failures
  • Easy management of very-large-scale data

It has several implementations, including Hadoop, Spark (used in this class), Flink, and the original Google implementation, just called “MapReduce”.

Steps of MapReduce

Map:

  • Apply a user-written Map function to each input element
    • Mapper applies the Map function to a single element
    • Many mappers grouped in a Map task (the unit of parallelism)
  • The output of the Map function is a set of 0, 1, or more key-value pairs.

Group by key: Sort and shuffle

  • System sorts all the key-value pairs by key, and outputs key-(list of values) pairs

Reduce:

  • The user-written Reduce function is applied to each key-(list of values) pair (see the word-count sketch below)
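
A minimal sketch of the three steps above, simulated in plain Python with word count as the Map and Reduce functions; a real framework would run many mappers and reducers in parallel on different machines:

    from itertools import groupby
    from operator import itemgetter

    def map_fn(document):
        # Map: emit 0, 1, or more (key, value) pairs per input element
        for word in document.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # Reduce: combine all values that share a key
        return (key, sum(values))

    documents = ["big data big ideas", "big data systems"]

    # Map task: apply the Map function to every input element
    pairs = [pair for doc in documents for pair in map_fn(doc)]

    # Group by key: sort and shuffle into key-(list of values) pairs
    pairs.sort(key=itemgetter(0))
    grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0))]

    # Reduce: apply the Reduce function to each key-(list of values) pair
    print([reduce_fn(k, vs) for k, vs in grouped])
    # [('big', 3), ('data', 2), ('ideas', 1), ('systems', 1)]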

True or False:

  • Each mapper/reducer must generate the same number of output key/value pairs as it receives on the input. (False)
  • The output type of keys/values of mappers/reducers must be of the same type as their input. (False)
  • The inputs to reducers are grouped by key. (True)
  • It is possible to start reducers while some mappers are still running. (False)

4. Spark: Extends MapReduce

Spark: Overview

Open source software (Apache Foundation)
Supports Java, Scala and Python

  • Key construct/idea: Resilient Distributed Dataset (RDD)
  • Higher-level APIs:
    • DataFrames & DataSets
    • Introduced in more recent versions of Spark
    • Different APIs for aggregate data, which made it possible to introduce SQL support

Key concept: Resilient Distributed Dataset (RDD)
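
A minimal PySpark sketch of the RDD idea, assuming a local Spark installation; the input strings are made up:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-sketch")

    # An RDD is an immutable, partitioned collection that Spark can recompute if a node fails
    lines = sc.parallelize(["big data big ideas", "big data systems"])

    counts = (lines.flatMap(lambda line: line.split())   # map step: split into words
                   .map(lambda word: (word, 1))          # emit (key, value) pairs
                   .reduceByKey(lambda a, b: a + b))     # group by key and reduce

    print(counts.collect())   # e.g. [('big', 3), ('data', 2), ('ideas', 1), ('systems', 1)]
    sc.stop()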
