Week 1 Introduction
----------------------------------------------------------
Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management and preservation of large collections of information.
Three types of tasks in a data science project:
Preparing to run a model (80% of the work)
Running the model
Communicating the results (the other 80% of the work)
Science is about asking questions
Traditionally: query the world
eScience: download the world
Ways to do science
Empirical (for thousands of years)
Theoretical (in the last few hundred years, reinforcing empirical methods)
Computational (in the last 50 years or so, simulate phenomena that cannot be observed directly and whose theoretical models have become too complex to solve analytically)
eScience (= Data Science) (in the last ten years or so, explore massive data)
What is Big Data
Big data is any data that is expensive to manage and hard to extract value from
Big Data: Three Challenges
Volume (the size of the data)
Velocity (the latency of data processing relative to the growing demand for interactivity)
Variety (the diversity of sources, formats, quality, structures)
Week 2 Relational Databases, Relational Algebra
-----------------------------------------------------------------------------------
What is a Data Model
three components:
1. structures
2. constraints
3. operations
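The three components can be seen side by side in a tiny relational example. This is only a sketch using Python's built-in sqlite3 module; the "student" table and its columns are made up for illustration and are not from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# 1. Structures: a table made of rows and columns (hypothetical schema).
# 2. Constraints: PRIMARY KEY, NOT NULL and CHECK restrict which rows are legal.
conn.execute(
    """
    CREATE TABLE student (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        year INTEGER CHECK (year BETWEEN 1 AND 4)
    )
    """
)

# 3. Operations: inserting and querying rows.
conn.execute("INSERT INTO student VALUES (1, 'Ada', 2)")

try:
    # year = 9 violates the CHECK constraint, so the database rejects the row.
    conn.execute("INSERT INTO student VALUES (2, 'Bob', 9)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

names = [row[0] for row in conn.execute("SELECT name FROM student")]
print(rejected, names)
```

The point of the sketch is that the constraint is enforced by the data model itself, not by application code.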
Database:
Physical Data Independence (applications see only tables and an algebra over them)
select, project, cross-product, join
SQL is a declarative language: it states "what", not "how"
Algebraic Optimization
Logical Data Independence
view: a query with a name
A database can exploit indexes, and it is designed to complete an operation no matter how large the data is
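The four operators named above (select, project, cross-product, join) can be sketched over plain Python lists of dicts. This is a toy illustration, not how a real database implements them; the relations "students" and "enrolled" and all function names are hypothetical. The last lines also hint at algebraic optimization: pushing the selection below the join gives the same answer while joining fewer rows.

```python
from itertools import product


def select(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def project(rows, columns):
    """Keep only the named columns (bag semantics: duplicates are kept)."""
    return [{c: r[c] for c in columns} for r in rows]


def cross_product(left, right):
    """Pair every left row with every right row."""
    return [{**l, **r} for l, r in product(left, right)]


def join(left, right, predicate):
    """A join is a cross-product followed by a select."""
    return select(cross_product(left, right), predicate)


# Hypothetical example relations.
students = [{"sid": 1, "name": "Ada"}, {"sid": 2, "name": "Bob"}]
enrolled = [
    {"student": 1, "course": "DB"},
    {"student": 2, "course": "ML"},
    {"student": 1, "course": "ML"},
]


def same_student(r):
    return r["sid"] == r["student"]


def takes_ml(r):
    return r["course"] == "ML"


# Plan 1: join first, then select.
ml_names = project(select(join(students, enrolled, same_student), takes_ml), ["name"])

# Plan 2: select first (fewer rows reach the join), then join.
pushed_down = project(join(students, select(enrolled, takes_ml), same_student), ["name"])

print(ml_names == pushed_down)
```

Because both plans are equivalent expressions in the algebra, an optimizer is free to pick the cheaper one; the user only states what is wanted.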
Week 3 MapReduce
------------------------------------------------------------------------------------------
Scalable
Operationally:
Scale up: works even if data doesn't fit in main memory
Scale out: can make use of 1000s of cheap computers
Algorithmically:
the complexity should be polynomial, parallelizable polynomial, or n log(n)
Parallel Architectures
Shared nothing
Shared disk
Shared memory
Two notions of parallel query processing
distributed query
rewrite the query as a union of subqueries; the results are combined at the end (a bottleneck)
parallel query (Teradata, parallel database)
each operator is implemented with a parallel algorithm (like the mapreduce fashion)
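The "MapReduce fashion" mentioned above can be sketched on a single machine as three steps: map, shuffle (group by key), reduce. This is only an illustration with made-up function names and input; in a real cluster each step runs in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input: each string stands for one document on some node.
documents = ["big data big value", "data science"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Each map call looks at one document only and each reduce call looks at one key only, which is what makes both phases easy to spread over many machines.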
Pig (Yahoo)
Relational Algebra over Hadoop
Hive (Facebook)
SQL over Hadoop
Both are declarative query languages that support schemas and algebraic optimization
Hadoop vs. RDBMS
loading data: Hadoop is faster (Hadoop only needs to partition the data; databases need extra effort)
execution: RDBMS is faster (because of indexes)
Week 4 NoSQL
----------------------------------------------------------------------
NoSQL is mainly used for building very large, scalable web applications
Social Network application (when to see a friend's status)
database: see all or nothing
two-phase commit
prepare to be ready: usually write to a log
commit: if all subordinates are ready
if only one coordinator is used: single point of failure
distributed protocol for committing: Paxos
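The two phases described above can be sketched as follows. This is a simplified, single-process illustration: the class and function names are hypothetical, there is no networking or crash recovery, and the "log" is just a Python list.

```python
class Subordinate:
    """One participant in the transaction (hypothetical, in-memory only)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []          # stands in for a durable log on disk
        self.state = "init"

    def prepare(self):
        """Phase 1: vote. A 'ready' vote is logged before it is returned."""
        if self.can_commit:
            self.log.append("prepared")
            self.state = "ready"
            return True
        self.state = "aborted"
        return False

    def commit(self):
        self.log.append("committed")
        self.state = "committed"

    def abort(self):
        self.log.append("aborted")
        self.state = "aborted"


def two_phase_commit(subordinates):
    """Coordinator: commit only if every subordinate voted ready."""
    votes = [s.prepare() for s in subordinates]   # phase 1: prepare
    if all(votes):                                # phase 2: commit or abort
        for s in subordinates:
            s.commit()
        return "committed"
    for s in subordinates:
        s.abort()
    return "aborted"


ok = two_phase_commit([Subordinate("a"), Subordinate("b")])
nodes = [Subordinate("a"), Subordinate("b", can_commit=False)]
failed = two_phase_commit(nodes)
print(ok, failed, [n.state for n in nodes])
```

Note that the coordinator function is the single place where the decision is made, which is the single point of failure mentioned above.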
MongoDB
eventual consistency through vector clocks
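A vector clock keeps one counter per node, which lets replicas tell whether one version of a record came before another or whether the two were written concurrently and need reconciling. The sketch below is a generic illustration with hypothetical function and node names; it does not describe the internals of any particular system.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Take the element-wise maximum, as done when two versions are reconciled."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def happened_before(a, b):
    """True if a <= b on every node and a < b on at least one node."""
    nodes = a.keys() | b.keys()
    no_greater = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    some_smaller = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return no_greater and some_smaller


def concurrent(a, b):
    """True if each clock is ahead of the other on some node (a conflict)."""
    nodes = a.keys() | b.keys()
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    return a_ahead and b_ahead


v1 = increment({}, "A")        # first write, handled by node A
v2a = increment(v1, "A")       # later write on node A
v2b = increment(v1, "B")       # independent write on node B
print(happened_before(v1, v2a), concurrent(v2a, v2b), merge(v2a, v2b))
```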
CAP Theorem
sacrifice consistency or availability to achieve partition tolerance
NoSQL features
lookup, read, write 1 or few records over many servers (high scale)
able to replicate and partition data (high scale)
no SQL or other declarative query language (no language)
weaker concurrency model than ACID (Atomicity, Consistency, Isolation, Durability) transactions (no transactions)
dynamically add new attributes to records (no schema)
Category for data models
document = nested values, extensible records (XML, JSON)
extensible record (HBase/BigTable)
key-value object (Memcached)
Consistent hashing ( Memcached: no persistence, no replication, no fault-tolerance)
map server IDs and the key values into the same space
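Mapping server IDs and keys into the same space can be sketched as a ring: each key belongs to the first server found at or after its position, wrapping around at the end. This is a minimal illustration without virtual nodes or replication; the class name, server names, and the choice of MD5 as the hash are assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(value):
    """Hash a server ID or a key into the same integer space."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16)


class ConsistentHashRing:
    def __init__(self, servers):
        # Servers sorted by their position on the ring.
        self._ring = sorted((ring_position(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._ring]

    def server_for(self, key):
        """Return the first server clockwise from the key's position."""
        index = bisect.bisect_right(self._positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]


# Hypothetical servers and keys.
ring = ConsistentHashRing(["s1", "s2", "s3"])
smaller = ConsistentHashRing(["s1", "s2"])        # same ring after s3 leaves
keys = [f"user:{i}" for i in range(200)]

# Only keys that lived on the removed server change owner.
moved = [k for k in keys if ring.server_for(k) != smaller.server_for(k)]
print(all(ring.server_for(k) == "s3" for k in moved))
```

With ordinary modulo hashing, removing one server would reassign most keys; here only the removed server's keys move.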
schema-on-read instead of schema-on-write (Pig)
When data is too big, you cannot bring data to computation, you have to bring the computation to the data
Three Special Joins:
Replicated Join
Skewed Join
Merge Join
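Of the three, the replicated join is the easiest to sketch: the small relation is copied to every worker and held in memory, so the large relation never has to be shuffled. The code below is a single-machine illustration under that assumption; the function name, relations, and column names are hypothetical.

```python
def replicated_join(large_partitions, small_relation, large_key, small_key):
    """Join each partition of the large relation against an in-memory copy
    of the small relation."""
    # Build a hash table on the small relation; in a cluster every worker
    # would hold its own copy of this table.
    lookup = {}
    for row in small_relation:
        lookup.setdefault(row[small_key], []).append(row)

    joined = []
    for partition in large_partitions:      # each partition = one map task
        for row in partition:
            for match in lookup.get(row[large_key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical data: a large click log split in two partitions, and a small
# lookup table of countries.
clicks = [
    [{"user": "u1", "cc": "US"}, {"user": "u2", "cc": "FR"}],
    [{"user": "u3", "cc": "US"}, {"user": "u4", "cc": "XX"}],
]
countries = [
    {"code": "US", "country": "United States"},
    {"code": "FR", "country": "France"},
]

result = replicated_join(clicks, countries, large_key="cc", small_key="code")
print([(r["user"], r["country"]) for r in result])
```

This follows the earlier point about bringing the computation to the data: only the small relation travels.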
NoSQL Features:
No Schema
No Language
No Transactions