Lecture one: Introduction to Spark and HPC
1. Data structure spectrum (spectrum: a broad range of related values, qualities, ideas, or activities)
Structured: Relational database, Formatted messages
Semi-structured: XML documents, tagged text/media, .csv documents, JSON
Unstructured: Plain text, Media
2. Structured data
Database: relational data model
Schema: the organization of the data, a blueprint of how the database is constructed (the programmer must specify the schema statically)
SQL: Structured Query Language
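A minimal sketch of the points above, using Python's built-in sqlite3 module. The table name, column names, and rows are made up for illustration. The schema is declared before any data is stored, and SQL then queries against the declared columns.

```python
import sqlite3

# In-memory database; the table, columns, and rows below are illustrative only.
conn = sqlite3.connect(":memory:")

# The schema is specified up front, before any row is stored.
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, score REAL)"
)

conn.executemany(
    "INSERT INTO students (id, name, score) VALUES (?, ?, ?)",
    [(1, "Ada", 91.5), (2, "Linus", 78.0), (3, "Grace", 88.25)],
)

# An SQL query expressed in terms of the declared columns.
high_scorers = conn.execute(
    "SELECT name FROM students WHERE score > ? ORDER BY name", (80,)
).fetchall()
print(high_scorers)
conn.close()
```

Because name is declared NOT NULL, the database would reject a row that omits it. That is the practical effect of a statically specified schema.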
3. Semi-structured data
Self-describing rather than formally structured
Tags/markers separate semantic elements
Schema: the column types
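A small sketch of what self-describing means in practice, assuming JSON input and Python's built-in json module. The records are hypothetical. Each record names its own fields, records do not have to share the same fields, and a rough schema (field name and observed types) can be recovered after the fact.

```python
import json

# Hypothetical records: each carries its own field names (self-describing),
# and the two records do not share the same set of fields.
raw = """
[
  {"id": 1, "name": "Ada", "tags": ["math", "engines"]},
  {"id": 2, "name": "Linus", "email": "linus@example.org"}
]
"""

records = json.loads(raw)

# Recover a rough schema from the data itself:
# field name -> set of Python type names seen for that field.
inferred = {}
for record in records:
    for field, value in record.items():
        inferred.setdefault(field, set()).add(type(value).__name__)

for field in sorted(inferred):
    print(field, sorted(inferred[field]))
```

Nothing had to be declared before loading. The structure comes from the field names embedded in the data.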
4. ETL
Impose structure on unstructured data:
Extract, Transform, Load
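The three steps can be sketched with the Python standard library alone. The log lines, the regular expression, and the logs table are assumptions made for this example, not part of any particular ETL tool.

```python
import re
import sqlite3

# Extract: hypothetical unstructured log lines (in practice read from files).
raw_lines = [
    "2024-01-05 ERROR disk full on node7",
    "2024-01-05 INFO job 42 finished",
    "garbage line without the expected shape",
    "2024-01-06 ERROR timeout on node3",
]

# Transform: impose a (day, level, message) structure; drop lines that do not fit.
pattern = re.compile(r"^(\d{4}-\d{2}-\d{2}) (ERROR|INFO|WARN) (.+)$")
rows = [m.groups() for m in map(pattern.match, raw_lines) if m]

# Load: store the structured rows in a table with a declared schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (day TEXT, level TEXT, message TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)

# Once loaded, the data can be queried like any structured data.
error_counts = conn.execute(
    "SELECT day, COUNT(*) FROM logs WHERE level = 'ERROR' GROUP BY day ORDER BY day"
).fetchall()
print(error_counts)
conn.close()
```

The malformed line is dropped during the transform step, so only rows that fit the imposed structure reach the table.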
5. Apache Spark
Fast and general cluster computing system:
Interoperable with Hadoop
improve efficiency:
a. In-memory computing primitives
b. general computation graphs
improve usability:
a. Rich APIs in Scala, Python, Java
b. Interactive shell
6. Spark models
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets(RDDs):
a. Collections of objects that can be stored in memory or on disk across a cluster
b. parallel functional transformations(map, filter…)
c. automatically rebuilt on failure
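To make the three RDD properties concrete, here is a toy, single-process model in plain Python. It is not Spark's implementation or API: ToyRDD and its methods are hypothetical names invented for this sketch. It only illustrates the idea that a dataset is a set of partitions plus a recorded chain of transformations, so that any lost partition can be recomputed from its source.

```python
class ToyRDD:
    """Toy stand-in for an RDD: source partitions plus a recorded lineage."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions   # original data, split into partitions
        self._lineage = tuple(lineage)     # ordered (kind, function) steps

    def map(self, fn):
        # Transformations only record a step; nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def compute_partition(self, index):
        # Replay the lineage on one source partition. This same replay is
        # what allows a lost partition to be rebuilt after a failure.
        items = iter(self._source[index])
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        # Each partition is independent, so a real cluster could compute
        # them in parallel; this toy version simply loops.
        return [
            x
            for i in range(len(self._source))
            for x in self.compute_partition(i)
        ]


numbers = ToyRDD([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

result = even_squares.collect()
print(result)

# Simulate losing partition 1 and rebuilding it from lineage alone.
rebuilt = even_squares.compute_partition(1)
print(rebuilt)
```

In PySpark, the comparable chain would use SparkContext.parallelize followed by map, filter, and collect on the resulting RDD. The toy class above mirrors only the shape of that workflow.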