Lecture 1 Introduction
Definition
A distributed system is a group of independent computers that presents itself to the user as a single, unified system. It pools a variety of physical and logical resources, can allocate tasks among them dynamically, and relies on a computer network to exchange information between those scattered resources.
1. Advantages and disadvantages of Distributed Systems
Advantages:
- Parallelism of CPUs: Enhancing computing ability.
- High reliability: If one of the nodes fails, the others can continue (for example, by replicating state with a consensus algorithm such as Raft).
- Overcome physical distance: The nodes are interconnected through a communication network.
Disadvantages:
- Security: Easy data sharing also means that confidential data can be stolen easily.
- Concurrency: Concurrent programs have complex interactions that are hard to reason about.
- Partial failure: With multiple machines plus a network, some components can fail unexpectedly while the rest keep running.
2. Infrastructure and Implementation
Infrastructure:
The following infrastructure must be considered when building a distributed system:
- Storage
- Communication
- Computation
Implementation:
- RPC (Remote Procedure Call): a message-passing mechanism for inter-process communication in distributed systems. It lets a program call a procedure on a remote machine much as it would call a local one.
- Threads: a way of structuring concurrent operations that hopefully simplifies the programmer's view of those operations.
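As a rough illustration of how threads structure concurrent operations, the sketch below sends one request per node in parallel. The names `fake_request` and `ask_all` are hypothetical, and the sleep is only a stand-in for a blocking remote call such as an RPC.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(node_id: int) -> str:
    """Hypothetical stand-in for a blocking call to a remote node."""
    time.sleep(0.1)  # simulate network latency
    return f"reply from node {node_id}"


def ask_all(node_ids: list) -> list:
    """Run one request per node on its own thread; replies keep input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(node_ids))) as pool:
        return list(pool.map(fake_request, node_ids))
```

Because each thread spends its time waiting, calling `ask_all([1, 2, 3])` should take roughly as long as one request rather than three, while the calling code still reads like a simple sequential loop.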
3. Expected Performance
- Scalability: Improve throughput by adding more computers, with little or no refactoring.
- Fault Tolerance:
  - Availability: Under certain kinds of failures, the system keeps operating.
  - Recoverability: After the failure is repaired, the system can continue as if nothing had gone wrong, without any loss of correctness. (One solution: use non-volatile storage such as hard drives, flash, or solid-state drives.)
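One way to picture recoverability through non-volatile storage is a node that saves its state to disk before acknowledging work and reloads it after a restart. This is a minimal sketch, not a complete durability scheme. The function names and the JSON file format are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_state(path: str, state: dict) -> None:
    """Write state to a temp file, force it to disk, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to the device
    # Replace the old file in one step so a crash leaves old or new, not half.
    os.replace(tmp_path, path)


def load_state(path: str, default: dict) -> dict:
    """After a restart, return the last saved state, or default if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A node would call `load_state` once at startup and `save_state` whenever the state it must not lose changes. The cost is a disk write on the critical path, which is one reason non-volatile storage is an expensive route to fault tolerance.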
4. MapReduce
Definition
MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). It lets programmers run their programs on a distributed system without having to write distributed, parallel code themselves. The programmer specifies a Map function that maps a set of key-value pairs to a new set of intermediate key-value pairs, and a Reduce function that merges all intermediate values that share the same key. The framework runs many Map and Reduce tasks concurrently.
Advice: Read the MapReduce paper.
Example: WordCount
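In this case, Map is called with a file name k and its contents v, and Reduce is called with a word k and the list v of values emitted for that word:
Map(k, v)
  split v into words
  for each word w
    emit(w, "1")
Reduce(k, v)
  emit(len(v))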
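The WordCount pseudocode above can be sketched as runnable Python. The driver `run_wordcount` is a hypothetical, single-process stand-in for the framework: it calls Map on each document, groups the emitted values by key, and then calls Reduce once per key. A real MapReduce system would spread these steps across many machines.

```python
from collections import defaultdict


def map_fn(filename: str, contents: str):
    """Map: emit (word, "1") for every word in the document."""
    for word in contents.split():
        yield word, "1"


def reduce_fn(word: str, values: list) -> int:
    """Reduce: the count is the number of values emitted for this word."""
    return len(values)


def run_wordcount(documents: dict) -> dict:
    """Sequential stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for filename, contents in documents.items():
        for key, value in map_fn(filename, contents):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}
```

For example, `run_wordcount({"a.txt": "the cat the dog", "b.txt": "the end"})` counts "the" three times across both files, because the grouping step collects every "1" emitted for that word before Reduce sees it.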