MIT 6.824 Lecture 2: Threads and a Concurrent Distributed Crawler

Today:
  Threads and RPC in Go, with an eye towards the labs


Why Go?
  good support for threads
  convenient RPC
  type safe
  garbage-collected (no use-after-free problems)
  threads + GC is particularly attractive!
  relatively simple
  After the tutorial, use https://golang.org/doc/effective_go.html


Threads
  a useful structuring tool, but can be tricky
  Go calls them goroutines; everyone else calls them threads


Thread = "thread of execution"
  threads allow one program to do many things at once
  each thread executes serially, just like an ordinary non-threaded program
  the threads share memory
  each thread includes some per-thread state:
    program counter, registers, stack
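
A minimal sketch (not from the lecture code) of what that looks like in Go: each
goroutine runs serially on its own stack, while all of them share the slice they
close over.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	results := make([]int, 3) // shared memory: every goroutine can see this slice

	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) { // each goroutine gets its own stack, PC, and registers
			defer wg.Done()
			results[i] = i * i // distinct indices, so these writes don't race
		}(i)
	}

	wg.Wait()            // block until all three threads of execution finish
	fmt.Println(results) // [0 1 4]
}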


Why threads?
  They express concurrency, which you need in distributed systems
  I/O concurrency
    Client sends requests to many servers in parallel and waits for replies.
    Server processes multiple client requests; each request may block.
    While waiting for the disk to read data for client X,
      process a request from client Y.
  Multicore performance
    Execute code in parallel on several cores.
  Convenience
    In background, once per second, check whether each worker is still alive.
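
As a rough sketch of the "convenience" case, assuming a made-up isAlive() probe
(a real system would send an RPC or ping), a background goroutine can do the
once-per-second check:

package main

import (
	"fmt"
	"time"
)

// isAlive is a stand-in liveness probe for this sketch only.
func isAlive(worker string) bool {
	return worker != "worker-2" // pretend worker-2 has crashed
}

func main() {
	workers := []string{"worker-1", "worker-2", "worker-3"}

	// background goroutine: once per second, check whether each worker is alive
	go func() {
		for {
			for _, w := range workers {
				if !isAlive(w) {
					fmt.Printf("%s appears to be dead\n", w)
				}
			}
			time.Sleep(1 * time.Second)
		}
	}()

	time.Sleep(3 * time.Second) // let the checker run a few rounds, then exit
}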

 


Is there an alternative to threads?
  Yes: write code that explicitly interleaves activities, in a single thread.
    Usually called "event-driven."
  Keep a table of state about each activity, e.g. each client request.
  One "event" loop that:
    checks for new input for each activity (e.g. arrival of reply from server),
    does the next step for each activity,
    updates state.
  Event-driven gets you I/O concurrency,
    and eliminates thread costs (which can be substantial),
    but doesn't get multi-core speedup,
    and is painful to program.
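
For contrast, a tiny hypothetical event-driven sketch (not from the lecture
code): a single thread owns a table of per-activity state and one loop that
handles whatever input arrives next.

package main

import "fmt"

// event is a hypothetical notification that input arrived for some activity.
type event struct {
	activity int
	input    string
}

func main() {
	// one table of per-activity state, owned entirely by the single event loop
	state := map[int]string{1: "waiting", 2: "waiting"}

	// pretend these replies arrived from remote servers
	events := make(chan event, 2)
	events <- event{1, "reply from server A"}
	events <- event{2, "reply from server B"}
	close(events)

	// the one "event" loop: take the next input, do the next step, update state
	for ev := range events {
		fmt.Printf("activity %d received %q\n", ev.activity, ev.input)
		state[ev.activity] = "done"
	}
	fmt.Println(state)
}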

 


Threading challenges:
  shared data 
    e.g. what if two threads do n = n + 1 at the same time?
      or one thread reads while another increments?
    this is a "race" -- and is usually a bug (see the sketch after this list)
    -> use locks (Go's sync.Mutex)
    -> or avoid sharing mutable data
  coordination between threads
    e.g. one thread is producing data, another thread is consuming it
      how can the consumer wait (and release the CPU)?
      how can the producer wake up the consumer?
    -> use Go channels or sync.Cond or WaitGroup
  deadlock
    cycles via locks and/or communication (e.g. RPC or Go channels)
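
A minimal sketch of the shared-data race and the sync.Mutex fix (hypothetical
example): the lock makes the read-modify-write atomic; delete the Lock/Unlock
calls and go run -race will report the race.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	n := 0

	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock() // without the lock, n = n + 1 is a read-modify-write race
			n = n + 1
			mu.Unlock()
		}()
	}

	wg.Wait()
	fmt.Println(n) // always 1000 with the lock; often less without it
}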

Let's look at the tutorial's web crawler as a threading example.


What is a web crawler?
  goal is to fetch all web pages, e.g. to feed to an indexer
  web pages and links form a graph
  multiple links to some pages
  graph has cycles

 


Crawler challenges
  Exploit I/O concurrency
    Network latency is more limiting than network capacity
    Fetch many URLs at the same time
      To increase URLs fetched per second
    => Need threads for concurrency
  Fetch each URL only *once*
    avoid wasting network bandwidth
    be nice to remote servers
    => Need to remember which URLs visited 
  Know when finished


We'll look at two styles of solution [crawler.go on schedule page]

Serial crawler:
  performs depth-first exploration via recursive Serial calls
  the "fetched" map avoids repeats, breaks cycles
    a single map, passed by reference, caller sees callee's updates
  but: fetches only one page at a time
    can we just put a "go" in front of the Serial() call?
    let's try it... what happened?
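
For reference, a sketch of that broken attempt (hypothetical name SerialGo; it
reuses the Fetcher interface from the listing at the end of these notes): the
parent returns without waiting for its children, so main() exits before most
fetches happen, and the goroutines also race on the shared map.

// Broken variant, for discussion only.
// SerialGo returns immediately instead of waiting for its children, so the
// crawl is cut short; the goroutines also update the shared `fetched` map
// concurrently, which is a data race.
func SerialGo(url string, fetcher Fetcher, fetched map[string]bool) {
	if fetched[url] {
		return
	}
	fetched[url] = true
	urls, err := fetcher.Fetch(url)
	if err != nil {
		return
	}
	for _, u := range urls {
		go SerialGo(u, fetcher, fetched) // starts children but never waits for them
	}
}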

 


ConcurrentMutex crawler:
  Creates a thread for each page fetch
    Many concurrent fetches, higher fetch rate
  the "go func" creates a goroutine and starts it running
    func... is an "anonymous function"
  The threads share the "fetched" map
    So only one thread will fetch any given page
  Why the Mutex (Lock() and Unlock())?
    One reason:
      Two different web pages contain links to the same URL
      Two threads simultaneously fetch those two pages
      T1 reads fetched[url], T2 reads fetched[url]
      Both see that the url hasn't been fetched (already == false)
      Both fetch, which is wrong
      The lock makes the check and update atomic,
        so only one thread sees already == false
    Another reason:
      Internally, a map is a complex data structure (tree? expanding hash?)
      Concurrent update/update may wreck its internal invariants
      Concurrent update/read may crash the reader
    What if I comment out Lock()/Unlock()?
      go run crawler.go -- why does it still seem to work?
      go run -race crawler.go -- detects the race even when the output is correct!
  How does the ConcurrentMutex crawler decide it is done?
    sync.WaitGroup
    Wait() waits for all Add()s to be balanced by Done()s
      i.e. waits for all child threads to finish
    [diagram: tree of goroutines, overlaid on the cyclic URL graph]
    There is a WaitGroup per node in the tree
  How many concurrent threads might this crawler create?

 

Why is it not a race that multiple threads use the same channel?

Is there a race when worker thread writes into a slice of URLs,
  and master thread reads that slice, without locking?
  * worker only writes slice *before* sending
  * master only reads slice *after* receiving
  So they can't use the slice at the same time.

When to use sharing and locks, versus channels?
  Most problems can be solved in either style
  What makes the most sense depends on how the programmer thinks
    state -- sharing and locks
    communication -- channels
  For the 6.824 labs, I recommend sharing+locks for state,
    and sync.Cond or channels or time.Sleep() for waiting/notification.
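
As a minimal sketch of the sync.Cond pattern for waiting/notification
(hypothetical example, not from the lecture code): the shared flag is protected
by the mutex, the consumer waits on the condition variable, and the producer
signals after updating the state.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	ready := false // shared state, protected by mu

	finished := make(chan bool)

	// consumer: wait (releasing the lock and the CPU) until ready becomes true
	go func() {
		mu.Lock()
		for !ready {
			cond.Wait() // atomically unlocks mu, sleeps, and re-locks when woken
		}
		mu.Unlock()
		fmt.Println("consumer: the producer says the data is ready")
		finished <- true
	}()

	// producer: update the shared state under the lock, then wake the consumer
	mu.Lock()
	ready = true
	cond.Signal()
	mu.Unlock()

	<-finished
}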

 

The discussion above covers the crawler designs; the complete crawler.go listing follows.

package main

import (
	"fmt"
	"sync"
)

//
// Several solutions to the crawler exercise from the Go tutorial
// https://tour.golang.org/concurrency/10
//

//
// Serial crawler
//

func Serial(url string, fetcher Fetcher, fetched map[string]bool) {
	if fetched[url] {
		return
	}
	fetched[url] = true
	urls, err := fetcher.Fetch(url)
	if err != nil {
		return
	}
	for _, u := range urls {
		Serial(u, fetcher, fetched)
	}
	return
}

//
// Concurrent crawler with shared state and Mutex
//

type fetchState struct {
	mu      sync.Mutex
	fetched map[string]bool
}

func ConcurrentMutex(url string, fetcher Fetcher, f *fetchState) {
	f.mu.Lock()
	already := f.fetched[url]
	f.fetched[url] = true
	f.mu.Unlock()

	if already {
		return
	}

	urls, err := fetcher.Fetch(url)
	if err != nil {
		return
	}
	var done sync.WaitGroup
	for _, u := range urls {
		done.Add(1)
		u2 := u // copy the loop variable so each goroutine fetches its own URL
		go func() {
			defer done.Done()
			ConcurrentMutex(u2, fetcher, f)
		}()
		//go func(u string) {
		//	defer done.Done()
		//	ConcurrentMutex(u, fetcher, f)
		//}(u)
	}
	done.Wait()
	return
}

func makeState() *fetchState {
	f := &fetchState{}
	f.fetched = make(map[string]bool)
	return f
}

//
// Concurrent crawler with channels
//

// worker fetches one URL and sends the URLs it found (or an empty slice
// on error) back to the master over ch.
func worker(url string, ch chan []string, fetcher Fetcher) {
	urls, err := fetcher.Fetch(url)
	if err != nil {
		ch <- []string{}
	} else {
		ch <- urls
	}
}

func master(ch chan []string, fetcher Fetcher) {
	n := 1 // number of outstanding sends on ch; starts at 1 for the initial URL
	fetched := make(map[string]bool)
	for urls := range ch {
		for _, u := range urls {
			if fetched[u] == false {
				fetched[u] = true
				n += 1
				go worker(u, ch, fetcher)
			}
		}
		n -= 1 // this batch of URLs has been handled
		if n == 0 {
			break // no workers outstanding, so the crawl is complete
		}
	}
}

func ConcurrentChannel(url string, fetcher Fetcher) {
	ch := make(chan []string)
	go func() {
		ch <- []string{url} // seed the master loop with the starting URL
	}()
	master(ch, fetcher)
}

//
// main
//

func main() {
	fmt.Printf("=== Serial===\n")
	Serial("http://golang.org/", fetcher, make(map[string]bool))

	fmt.Printf("=== ConcurrentMutex ===\n")
	ConcurrentMutex("http://golang.org/", fetcher, makeState())

	fmt.Printf("=== ConcurrentChannel ===\n")
	ConcurrentChannel("http://golang.org/", fetcher)
}

//
// Fetcher
//

type Fetcher interface {
	// Fetch returns a slice of URLs found on the page.
	Fetch(url string) (urls []string, err error)
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f fakeFetcher) Fetch(url string) ([]string, error) {
	if res, ok := f[url]; ok {
		fmt.Printf("found:   %s\n", url)
		return res.urls, nil
	}
	fmt.Printf("missing: %s\n", url)
	return nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
	"http://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{
			"http://golang.org/pkg/",
			"http://golang.org/cmd/",
		},
	},
	"http://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{
			"http://golang.org/",
			"http://golang.org/cmd/",
			"http://golang.org/pkg/fmt/",
			"http://golang.org/pkg/os/",
		},
	},
	"http://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{
			"http://golang.org/",
			"http://golang.org/pkg/",
		},
	},
	"http://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{
			"http://golang.org/",
			"http://golang.org/pkg/",
		},
	},
}

 
