Scalable machine learning: Lecture one: Introduction to Spark and HPC

1. Data structure spectrum (spectrum: a broad range of related values, qualities, ideas, or activities)

Structured: relational databases, formatted messages
Semi-structured: XML documents, tagged text/media, CSV files, JSON
Unstructured: plain text, media

2. Structured data

Database: relational data model
Schema: the organization of the data, a blueprint of how the database is constructed (the programmer must specify the schema statically)
SQL: Structured Query Language
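
A minimal sketch of a statically declared schema and an SQL query over it, using Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical and chosen only for illustration:

```python
import sqlite3

# In-memory database; the table and its contents are made up for this sketch.
conn = sqlite3.connect(":memory:")

# The schema is declared up front: column names and types are fixed
# before any row is inserted.
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, score REAL)"
)
conn.executemany(
    "INSERT INTO students (id, name, score) VALUES (?, ?, ?)",
    [(1, "Ada", 91.5), (2, "Grace", 78.0), (3, "Alan", 85.0)],
)

# SQL queries rely on that declared structure.
high_scorers = conn.execute(
    "SELECT name FROM students WHERE score > ? ORDER BY name", (80,)
).fetchall()
print(high_scorers)
```

The query can name the column score directly because the schema guarantees that every row has one.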

3. Semi-structured data

Self-describing rather than formally structured

Tags/markers separate semantic elements

Schema: the column types
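
A small sketch of what "self-describing" means in practice, assuming a made-up student record written once as JSON and once as XML. The field names and tags are hypothetical. No schema is declared in advance; the keys and tags carry the structure, and a rough schema can be read off the data afterwards:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records; the keys and tags describe the data themselves.
json_text = '{"name": "Ada", "courses": ["ML", "HPC"]}'
xml_text = "<student><name>Ada</name><course>ML</course><course>HPC</course></student>"

# JSON: keys mark the semantic elements.
record = json.loads(json_text)
json_courses = record["courses"]

# XML: tags mark the semantic elements.
root = ET.fromstring(xml_text)
xml_courses = [course.text for course in root.findall("course")]

# A rough "schema" inferred from the data rather than declared beforehand.
inferred_schema = {key: type(value).__name__ for key, value in record.items()}
print(json_courses, xml_courses, inferred_schema)
```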

4. ETL

Impose structure on unstructured data:

Extract, Transform, Load
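
The three steps can be sketched as follows, assuming some made-up log lines as the unstructured input. The line format, the regular expression, and the events table are all hypothetical; this is one possible illustration, not a prescribed pipeline:

```python
import re
import sqlite3

# Hypothetical unstructured input: free-text log lines.
raw_lines = [
    "2024-01-05 user=ada action=login duration=12ms",
    "malformed line without fields",
    "2024-01-06 user=grace action=upload duration=340ms",
]

PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) user=(\w+) action=(\w+) duration=(\d+)ms$"
)

def extract(lines):
    """Extract: pull the raw fields out of each line, skipping lines that do not match."""
    for line in lines:
        match = PATTERN.match(line)
        if match:
            yield match.groups()

def transform(fields):
    """Transform: normalise names and convert the duration to an integer."""
    for day, user, action, duration in fields:
        yield (day, user.capitalize(), action, int(duration))

def load(rows, conn):
    """Load: write the typed rows into a table with a fixed schema."""
    conn.execute(
        "CREATE TABLE events (day TEXT, username TEXT, action TEXT, duration_ms INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(raw_lines)), conn)
loaded = conn.execute(
    "SELECT username, duration_ms FROM events ORDER BY day"
).fetchall()
print(loaded)
```

After loading, the data can be queried like any other structured table.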

5. Apache Spark

Fast and general cluster computing system:

Interoperable with Hadoop

Improves efficiency:
a. In-memory computing primitives
b. General computation graphs

Improves usability:
a. Rich APIs in Scala, Python, Java
b. Interactive shell

6. Spark models

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs):
a. Collections of objects that can be stored in memory or on disk across a cluster
b. Parallel functional transformations (map, filter…)
c. Automatically rebuilt on failure
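
To make points b and c concrete, here is a toy single-machine model written in plain Python. It is not the Spark API: the class ToyRDD and its lose_cache method are invented for this sketch, and nothing here is distributed or parallel. It only illustrates the idea that a dataset remembers the chain of transformations that produced it, so a lost result can be recomputed from the source. In PySpark, the corresponding calls would be along the lines of sc.parallelize(data).map(f).filter(g).collect().

```python
class ToyRDD:
    """Toy stand-in for an RDD: records its transformations instead of running them."""

    def __init__(self, source, lineage=()):
        self._source = source    # function that returns the base data
        self._lineage = lineage  # tuple of (kind, function) steps applied so far
        self._cache = None       # materialised result, if any

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # An action: replay the lineage from the source if no result is cached.
        if self._cache is None:
            data = list(self._source())
            for kind, fn in self._lineage:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self._cache = data
        return self._cache

    def lose_cache(self):
        # Simulates a failure that destroys the materialised result.
        self._cache = None


numbers = ToyRDD(lambda: range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = even_squares.collect()
even_squares.lose_cache()         # pretend the node holding the result failed
rebuilt = even_squares.collect()  # recomputed from the recorded lineage
print(first, rebuilt)
```

Because only the recipe is needed to rebuild the data, no copy of the result has to be kept elsewhere for recovery in this toy model.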
