Distributed Systems (1)

风带走了时间

Original link: https://www.bilibili.com/video/BV1R7411t71W?p=1

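This post introduces the definition of a distributed system and the advantages it brings: more parallel computing power, high availability, and overcoming geographic distance. It also covers the challenges of security, concurrency control, and partial failure. On the infrastructure side, it touches on storage, communication, and computation. Distributed systems use RPC and threads for inter-process communication and concurrency. Expected performance focuses on scalability and fault tolerance, for example using non-volatile storage to make recovery from failures possible. MapReduce is a programming model that simplifies parallel computation over large data sets, as in the WordCount application.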

Lecture 1 Introduction

Definition
A distributed system is a group of independent computers that appears to its users as a single, unified system. It pools a variety of physical and logical resources and can allocate tasks among them dynamically; the scattered resources exchange information over a computer network.

1. Advantages and disadvantages of Distributed Systems

Advantages:

Parallelism: Many CPUs working in parallel increase computing capacity.
High reliability: If one node fails, the others can continue serving requests (for example, by replicating state with the Raft algorithm).
Overcoming physical distance: Nodes in different locations are interconnected through a communication network.

Disadvantages:

Security: Easy data sharing also means that confidential data is easier to steal.
Concurrency: Concurrent programs have complex interactions that are hard to reason about.
Partial failure: With multiple machines plus a network, some components can fail unexpectedly while the rest keep running.
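
To make the concurrency point concrete, here is a minimal sketch in Python (the language and the helper name `run_counter` are choices made for illustration, not part of the lecture). Several threads update one shared counter; the lock is what keeps their read-modify-write steps from interleaving.

```python
import threading


def run_counter(num_threads: int = 4, increments: int = 10_000) -> int:
    """Increment a shared counter from several threads, guarded by a lock."""
    count = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal count
        for _ in range(increments):
            # Without the lock, this read-modify-write could interleave
            # with another thread and lose updates.
            with lock:
                count += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count


if __name__ == "__main__":
    print(run_counter())
```

Even this tiny program needs explicit coordination; in a distributed system the same problem appears across machines, where a simple in-process lock is no longer available.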

2. Infrastructure and Implementation

Infrastructure:
The following kinds of infrastructure need to be considered when building a distributed system:

Storage
Communication
Computation

Implementation:

RPC (Remote Procedure Call): a message-passing mechanism for communication between processes on different machines that makes a remote call look like a local function call.
Threads: a way of structuring concurrent operations that, ideally, simplifies the programmer's view of those operations.
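
The two tools can be seen working together in a small sketch. It uses Python's standard `xmlrpc` modules purely as an illustration (the lecture does not prescribe a library, and the function names `add`, `start_server`, and `call_add` are hypothetical). The server handles requests on a background thread, and the client calls `add` as if it were a local function.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def add(a: int, b: int) -> int:
    """Procedure that executes on the server side."""
    return a + b


def start_server() -> SimpleXMLRPCServer:
    # Port 0 asks the OS for any free port; logRequests=False keeps output quiet.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(add, "add")
    # Serve requests on a background thread so the caller can keep running.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_add(port: int, a: int, b: int) -> int:
    # The proxy turns a local-looking method call into a network request.
    with ServerProxy(f"http://127.0.0.1:{port}") as proxy:
        return proxy.add(a, b)


if __name__ == "__main__":
    server = start_server()
    print(call_add(server.server_address[1], 2, 3))
    server.shutdown()
    server.server_close()
```

Here both ends run in one process for convenience; in a real deployment the server would live on another machine, and the call could fail or time out in ways a local call cannot.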

3. Expected Performance

Scalability: Improve throughput by adding more computers, with little or no refactoring.
Fault Tolerance:
- Availability: Under certain kinds of failures, the system keeps operating.
- Recoverability: After the failure is repaired, the system can continue as if nothing had gone wrong, without any loss of correctness. (One solution: use non-volatile storage such as hard drives, flash, or solid-state drives.)
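
As a rough sketch of the non-volatile storage idea, the Python snippet below checkpoints state to a file and reloads it after a simulated restart. The helper names `save_state` and `load_state` and the JSON format are assumptions made for this example, not something the lecture specifies. The state is written to a temporary file, flushed to disk, and then renamed over the old checkpoint so that a crash mid-write should not leave a half-written file.

```python
import json
import os
import tempfile


def save_state(path: str, state: dict) -> None:
    """Persist state so that a crash does not leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to the storage device
    os.replace(tmp_path, path)  # swap the new checkpoint in with a single rename


def load_state(path: str) -> dict:
    """Recover the last saved state, or start empty if no checkpoint exists."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        checkpoint = os.path.join(d, "state.json")
        save_state(checkpoint, {"applied": 2, "log": ["x=1", "y=2"]})
        # Simulate a restart: in-memory state is gone, the file is not.
        print(load_state(checkpoint))
```

The cost of this approach is that every checkpoint waits for the disk, which is why systems try to limit how often they must write to non-volatile storage.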

4. MapReduce

Definition
MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). It lets programmers run their programs on a distributed system without writing the distributed, parallel code themselves. The programmer specifies a Map function that turns a set of input key-value pairs into a set of intermediate key-value pairs, and a Reduce function that merges all intermediate values that share the same key.
Advice: Read the MapReduce paper.
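Example: WordCount
In this case, Map receives a file name k and its contents v, and Reduce receives a word k and the list v of all values emitted for that word:
Map(k, v)
    split v into words
    for each word w
        emit(w, "1")
Reduce(k, v)
    emit(len(v))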
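
The WordCount pseudocode can be tried out with a single-process sketch in Python. This is only an illustration of the data flow, not a distributed implementation: the names `map_fn`, `reduce_fn`, and `run_wordcount` are hypothetical, and the "shuffle" step that a real MapReduce framework performs across machines is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_fn(filename: str, contents: str) -> List[Tuple[str, str]]:
    """Map: emit (word, "1") for every word in the input."""
    return [(word, "1") for word in contents.split()]


def reduce_fn(word: str, values: List[str]) -> str:
    """Reduce: the count of a word is the number of values collected for it."""
    return str(len(values))


def run_wordcount(inputs: Dict[str, str]) -> Dict[str, str]:
    # Map phase: run map_fn once per input file.
    intermediate: List[Tuple[str, str]] = []
    for filename, contents in inputs.items():
        intermediate.extend(map_fn(filename, contents))

    # Shuffle: group all intermediate values by key.
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: run reduce_fn once per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


if __name__ == "__main__":
    print(run_wordcount({"a.txt": "a b a", "b.txt": "b c"}))
```

Because each `map_fn` call depends only on its own input file and each `reduce_fn` call only on one key's values, a framework is free to run them on different machines in parallel.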