[Translation] Go Concurrency Patterns: Pipelines and Cancellation

Original article: https://blog.golang.org/pipelines

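Go Concurrency Patterns: Pipelines and cancellation

Original author: Sameer Ajmani, 2014-03-12

Translator: Narcism, 2020-04-02

Introduction

Go's concurrency primitives make it easy to construct streaming data pipelines that make efficient use of I/O and multiple CPUs. This article presents examples of such pipelines, highlights subtleties that arise when operations fail, and introduces techniques for dealing with failures cleanly.

What is a pipeline?

There is no formal definition of a pipeline in Go; it is just one of many kinds of concurrent programs. Informally, a pipeline is a series of stages connected by channels, where each stage is a group of goroutines running the same function. In each stage, the goroutines:

receive values from upstream via inbound channels
perform some function on that data, usually producing new values
send values downstream via outbound channels

Each stage has any number of inbound and outbound channels, except the first and last stages, which have only outbound or only inbound channels, respectively. The first stage is sometimes called the source or producer; the last stage, the sink or consumer.

We'll begin with a simple example pipeline to explain the ideas and techniques. Later, we'll present a more realistic example.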

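Squaring numbers

[Figure: schematic of the pipeline]

Consider a pipeline with three stages.

The first stage, gen, is a function that converts a list of integers to a channel that emits the integers in the list. It starts a goroutine that sends the integers on the channel and closes the channel when all the values have been sent: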

func gen(nums ...int) <-chan int {
    out := make(chan int)
    go func() {
        for _, n := range nums {
            out <- n
        }
        close(out)
    }()
    return out
}

The second stage, sq, receives integers from an inbound channel and returns an outbound channel that emits the square of each received integer. After the inbound channel is closed and this stage has sent all the values downstream, it closes the outbound channel:

func sq(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        for n := range in {
            out <- n * n
        }
        close(out)
    }()
    return out
}

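The third stage lives in the main function. main sets up the pipeline and runs the final stage: it receives values from the second stage and prints each one, until the second stage's outbound channel is closed: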

func main() {
    // Set up the pipeline.
    c := gen(2, 3)
    out := sq(c)

    // Consume the output.
    fmt.Println(<-out) // 4
    fmt.Println(<-out) // 9
}

Since sq has the same type for its inbound and outbound channels, we can compose it any number of times. We can also rewrite main as a range loop, like the other stages:

func main() {
    // Set up the pipeline and consume the output.
    for n := range sq(sq(gen(2, 3))) {
        fmt.Println(n) // 16 then 81
    }
}

[Figure: schematic of the pipeline]

Fan-out, fan-in

Multiple functions can read from the same channel until that channel is closed; this is called fan-out. It provides a way to distribute work amongst a group of workers to parallelize CPU use and I/O.

A function can read from multiple inputs and proceed until all are closed by multiplexing the input channels onto a single channel that's closed when all the inputs are closed. This is called fan-in.

We can change our pipeline to run two instances of sq, each reading from the same input channel. We introduce a new function, merge, to fan in the results:

func main() {
    in := gen(2, 3)

    // Distribute the sq work across two goroutines that both read from in.
    c1 := sq(in)
    c2 := sq(in)

    // Consume the merged output from c1 and c2.
    for n := range merge(c1, c2) {
        fmt.Println(n) // 4 then 9, or 9 then 4
    }
}

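The merge function converts a list of channels to a single channel by starting a goroutine for each inbound channel that copies the values to the sole outbound channel. Once all the output goroutines have been started, merge starts one more goroutine to close the outbound channel after all sends on that channel are done.

Sends on a closed channel panic, so it's important to ensure all sends are done before calling close. The sync.WaitGroup type provides a simple way to arrange this synchronization: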

func merge(cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int)

    // Start an output goroutine for each input channel in cs.  output
    // copies values from c to out until c is closed, then calls wg.Done.
    output := func(c <-chan int) {
        for n := range c {
            out <- n
        }
        wg.Done()
    }
    wg.Add(len(cs))
    for _, c := range cs {
        go output(c)
    }

    // Start a goroutine to close out once all the output goroutines are
    // done.  This must start after the wg.Add call.
    go func() {
        wg.Wait()
        close(out)
    }()
    return out
}
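
To see why the close has to wait for every sender, the following minimal sketch shows that a send on a closed channel panics while a send on an open channel does not. The helper name sendPanics is hypothetical and exists only for this illustration; the channels are buffered so that the successful send does not block.

```go
package main

import "fmt"

// sendPanics reports whether sending a value on ch panics.
// It is an illustrative helper, not part of the pipeline.
func sendPanics(ch chan int) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	ch <- 1
	return false
}

func main() {
	open := make(chan int, 1) // buffered so the send completes without a receiver
	fmt.Println(sendPanics(open)) // false

	closed := make(chan int, 1)
	close(closed)
	fmt.Println(sendPanics(closed)) // true
}
```

This is the failure merge avoids by calling close(out) only after wg.Wait returns.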

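[Figure: schematic of the pipeline]

Stopping short

There is a pattern to our pipeline functions:

stages close their outbound channels when all the send operations are done.
stages keep receiving values from inbound channels until those channels are closed.

This pattern allows each receiving stage to be written as a range loop and ensures that all goroutines exit once all values have been successfully sent downstream.

But in real pipelines, stages don't always receive all the inbound values. Sometimes this is by design: the receiver may only need a subset of values to make progress. More often, a stage exits early because an inbound value represents an error in an earlier stage. In either case the receiver should not have to wait for the remaining values to arrive, and we want earlier stages to stop producing values that later stages don't need.

In our example pipeline, if a stage fails to consume all the inbound values, the goroutines attempting to send those values will block indefinitely: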

    // Consume the first value from the output.
    out := merge(c1, c2)
    fmt.Println(<-out) // 4 or 9
    return
    // Since we didn't receive the second value from out,
    // one of the output goroutines is hung attempting to send it.
}

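This is a resource leak: goroutines consume memory and runtime resources, and heap references in goroutine stacks keep data from being garbage collected. Goroutines are not garbage collected; they must exit on their own.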
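
The leak can be observed directly. The sketch below reuses the unbuffered gen, sq, and merge from earlier, reads a single value, and compares runtime.NumGoroutine before and after. The helper leakedAfterPartialRead is a hypothetical name introduced for this illustration. The exact count can vary from run to run, but at least two goroutines should remain: the sender blocked on out and the goroutine waiting to close out.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// gen, sq, and merge are the unbuffered versions from the text above.
func gen(nums ...int) <-chan int {
	out := make(chan int)
	go func() {
		for _, n := range nums {
			out <- n
		}
		close(out)
	}()
	return out
}

func sq(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		for n := range in {
			out <- n * n
		}
		close(out)
	}()
	return out
}

func merge(cs ...<-chan int) <-chan int {
	var wg sync.WaitGroup
	out := make(chan int)
	output := func(c <-chan int) {
		for n := range c {
			out <- n
		}
		wg.Done()
	}
	wg.Add(len(cs))
	for _, c := range cs {
		go output(c)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// leakedAfterPartialRead builds the pipeline, consumes one value, and
// returns how many more goroutines exist than before the pipeline started.
func leakedAfterPartialRead() int {
	before := runtime.NumGoroutine()
	in := gen(2, 3)
	out := merge(sq(in), sq(in))
	<-out // consume only the first value, then walk away
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines left behind:", leakedAfterPartialRead())
}
```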

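We need to arrange for the upstream stages of our pipeline to exit even when the downstream stages fail to receive all the inbound values. One way to do this is to change the outbound channels to have a buffer. A buffer can hold a fixed number of values; send operations complete immediately if there is room in the buffer: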

c := make(chan int, 2) // buffer size 2
c <- 1  // succeeds immediately
c <- 2  // succeeds immediately
c <- 3  // blocks until another goroutine does <-c and receives 1

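When the number of values to be sent is known at channel creation time, a buffer can simplify the code. For example, we can rewrite gen to copy the list of integers into a buffered channel and avoid creating a new goroutine: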

func gen(nums ...int) <-chan int {
    out := make(chan int, len(nums))
    for _, n := range nums {
        out <- n
    }
    close(out)
    return out
}

Returning to the blocked goroutines in our pipeline, we might consider adding a buffer to the outbound channel returned by merge:

func merge(cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int, 1) // enough space for the unread inputs
    // ... the rest is unchanged ...

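While this fixes the blocked goroutine in this program, this is bad code. The choice of a buffer size of 1 depends on knowing the number of values merge will receive and the number of values downstream stages will consume. This is fragile: if we pass one more value to gen, or if the downstream stage reads one fewer value, we will again have blocked goroutines.

Instead, we need to provide a way for downstream stages to indicate to the senders that they will stop accepting input.

Explicit cancellation

When main decides to exit without receiving all the values from out, it must tell the goroutines in the upstream stages to abandon the values they're trying to send. It does so by sending values on a channel called done. It sends two values since there are potentially two blocked senders: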

func main() {
    in := gen(2, 3)

    // Distribute the sq work across two goroutines that both read from in.
    c1 := sq(in)
    c2 := sq(in)

    // Consume the first value from output.
    done := make(chan struct{}, 2)
    out := merge(done, c1, c2)
    fmt.Println(<-out) // 4 or 9

    // Tell the remaining senders we're leaving.
    done <- struct{}{}
    done <- struct{}{}
}

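The sending goroutines replace their send operation with a select statement that proceeds either when the send on out happens or when they receive a value from done. The value type of done is the empty struct because the value doesn't matter: it is the receive event that indicates the send on out should be abandoned. The output goroutines continue looping on their inbound channel, c, so the upstream stages are not blocked. (We'll discuss in a moment how to allow this loop to return early.)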

func merge(done <-chan struct{}, cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int)

    // Start an output goroutine for each input channel in cs.  output
    // copies values from c to out until c is closed or it receives a value
    // from done, then output calls wg.Done.
    output := func(c <-chan int) {
        for n := range c {
            select {
            case out <- n:
            case <-done:
            }
        }
        wg.Done()
    }
    // ... the rest is unchanged ...

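This approach has a problem: each downstream receiver needs to know the number of potentially blocked upstream senders and arrange to signal those senders on early return. Keeping track of these counts is tedious and error-prone.

We need a way to tell an unknown and unbounded number of goroutines to stop sending their values downstream. In Go, we can do this by closing a channel, because a receive operation on a closed channel can always proceed immediately, yielding the element type's zero value.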
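
The receive behaviour described above can be checked with a small sketch. It first receives from a closed channel, then shows that a single close releases any number of goroutines blocked on the same channel. The helper releaseAll is a hypothetical name used only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// releaseAll starts n goroutines that block receiving from done, closes
// done once, and returns how many goroutines were released by that close.
func releaseAll(n int) int {
	done := make(chan struct{})
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		released int
	)
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			<-done // blocks until done is closed
			mu.Lock()
			released++
			mu.Unlock()
		}()
	}
	close(done) // one close unblocks every receiver
	wg.Wait()
	return released
}

func main() {
	ch := make(chan int)
	close(ch)
	v, ok := <-ch      // returns immediately on a closed channel
	fmt.Println(v, ok) // 0 false

	fmt.Println(releaseAll(5)) // 5
}
```

Sending values instead would require exactly one send per waiting goroutine, which is the bookkeeping problem the close avoids.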

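This means that main can unblock all the senders simply by closing the done channel. This close is effectively a broadcast signal to the senders. We extend each of our pipeline functions to accept done as a parameter and arrange for the close to happen via a defer statement, so that all return paths from main signal the pipeline stages to exit.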

func main() {
    // Set up a done channel that's shared by the whole pipeline,
    // and close that channel when this pipeline exits, as a signal
    // for all the goroutines we started to exit.
    done := make(chan struct{})
    defer close(done)

    in := gen(done, 2, 3)

    // Distribute the sq work across two goroutines that both read from in.
    c1 := sq(done, in)
    c2 := sq(done, in)

    // Consume the first value from output.
    out := merge(done, c1, c2)
    fmt.Println(<-out) // 4 or 9

    // done will be closed by the deferred call.
}

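Each of our pipeline stages is now free to return as soon as done is closed. The output routine in merge can return without draining its inbound channel, since it knows the upstream sender, sq, will stop attempting to send when done is closed. output ensures wg.Done is called on all return paths via a defer statement: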

func merge(done <-chan struct{}, cs ...<-chan int) <-chan int {
    var wg sync.WaitGroup
    out := make(chan int)

    // Start an output goroutine for each input channel in cs.  output
    // copies values from c to out until c or done is closed, then calls
    // wg.Done.
    output := func(c <-chan int) {
        defer wg.Done()
        for n := range c {
            select {
            case out <- n:
            case <-done:
                return
            }
        }
    }
    // ... the rest is unchanged ...

Similarly, sq can return as soon as done is closed. sq ensures its out channel is closed on all return paths via a defer statement:

func sq(done <-chan struct{}, in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out)
        for n := range in {
            select {
            case out <- n * n:
            case <-done:
                return
            }
        }
    }()
    return out
}

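Here are the guidelines for pipeline construction:

stages close their outbound channels when all the send operations are done.

stages keep receiving values from inbound channels until those channels are closed.

Pipelines unblock senders either by ensuring there is enough buffer for all the values that are sent or by explicitly signalling senders when the receiver may abandon the channel.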
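
Putting the guidelines together, here is one possible complete program. The text above shows sq and merge with a done parameter but only the call site gen(done, 2, 3), so the done-aware gen below is an assumption about how it might look; firstSquare is likewise a hypothetical helper added so the pipeline can be run end to end.

```go
package main

import (
	"fmt"
	"sync"
)

// gen emits nums and stops early if done is closed.
// Assumption: this done-aware variant is not shown in the text above.
func gen(done <-chan struct{}, nums ...int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, n := range nums {
			select {
			case out <- n:
			case <-done:
				return
			}
		}
	}()
	return out
}

func sq(done <-chan struct{}, in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for n := range in {
			select {
			case out <- n * n:
			case <-done:
				return
			}
		}
	}()
	return out
}

func merge(done <-chan struct{}, cs ...<-chan int) <-chan int {
	var wg sync.WaitGroup
	out := make(chan int)
	output := func(c <-chan int) {
		defer wg.Done()
		for n := range c {
			select {
			case out <- n:
			case <-done:
				return
			}
		}
	}
	wg.Add(len(cs))
	for _, c := range cs {
		go output(c)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// firstSquare runs the pipeline, takes one result, and cancels the rest
// by closing done on return.
func firstSquare(nums ...int) int {
	done := make(chan struct{})
	defer close(done)
	in := gen(done, nums...)
	return <-merge(done, sq(done, in), sq(done, in))
}

func main() {
	fmt.Println(firstSquare(2, 3)) // 4 or 9
}
```

Every stage closes its own outbound channel via defer and every send sits in a select with done, so no goroutine is left blocked after firstSquare returns.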

Digesting a tree

Let's consider a more realistic pipeline.

MD5 is a message-digest algorithm that's useful as a file checksum. The command line utility md5sum prints digest values for a list of files:

% md5sum *.go
d47c2bbc28298ca9befdfbc5d3aa4e65  bounded.go
ee869afd31f83cbb2d10ee81b2b831dc  parallel.go
b88175e65fdcbc01ac08aaf1fd9b5e96  serial.go

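Our example program is like md5sum but instead takes a single directory as an argument and prints the digest value of each regular file under that directory, sorted by path name.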

% go run serial.go .
d47c2bbc28298ca9befdfbc5d3aa4e65  bounded.go
ee869afd31f83cbb2d10ee81b2b831dc  parallel.go
b88175e65fdcbc01ac08aaf1fd9b5e96  serial.go

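The main function of our program invokes a helper function MD5All, which returns a map from path name to digest value, then sorts and prints the results: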

func main() {
    // Calculate the MD5 sum of all files under the specified directory,
    // then print the results sorted by path name.
    m, err := MD5All(os.Args[1])
    if err != nil {
        fmt.Println(err)
        return
    }
    var paths []string
    for path := range m {
        paths = append(paths, path)
    }
    sort.Strings(paths)
    for _, path := range paths {
        fmt.Printf("%x  %s\n", m[path], path)
    }
}

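The MD5All function is the focus of our discussion. In serial.go, the implementation uses no concurrency and simply reads and sums each file as it walks the tree.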

// MD5All reads all the files in the file tree rooted at root and returns a map
// from file path to the MD5 sum of the file's contents.  If the directory walk
// fails or any read operation fails, MD5All returns an error.
func MD5All(root string) (map[string][md5.Size]byte, error) {
    m := make(map[string][md5.Size]byte)
    err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        if !info.Mode().IsRegular() {
            return nil
        }
        data, err := ioutil.ReadFile(path)
        if err != nil {
            return err
        }
        m[path] = md5.Sum(data)
        return nil
    })
    if err != nil {
        return nil, err
    }
    return m, nil
}

Parallel digestion

In parallel.go, we split MD5All into a two-stage pipeline. The first stage, sumFiles, walks the tree, digests each file in a new goroutine, and sends the results on a channel with value type result:

type result struct {
    path string
    sum  [md5.Size]byte
    err  error
}

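sumFiles returns two channels: one for the results and another for the error returned by filepath.Walk. The walk function starts a new goroutine to process each regular file, then checks done. If done is closed, the walk stops immediately: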

func sumFiles(done <-chan struct{}, root string) (<-chan result, <-chan error) {
    // For each regular file, start a goroutine that sums the file and sends
    // the result on c.  Send the result of the walk on errc.
    c := make(chan result)
    errc := make(chan error, 1)
    go func() {
        var wg sync.WaitGroup
        err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
            if err != nil {
                return err
            }
            if !info.Mode().IsRegular() {
                return nil
            }
            wg.Add(1)
            go func() {
                data, err := ioutil.ReadFile(path)
                select {
                case c <- result{path, md5.Sum(data), err}:
                case <-done:
                }
                wg.Done()
            }()
            // Abort the walk if done is closed.
            select {
            case <-done:
                return errors.New("walk canceled")
            default:
                return nil
            }
        })
        // Walk has returned, so all calls to wg.Add are done.  Start a
        // goroutine to close c once all the sends are done.
        go func() {
            wg.Wait()
            close(c)
        }()
        // No select needed here, since errc is buffered.
        errc <- err
    }()
    return c, errc
}

MD5All receives the digest values from c. MD5All returns early on error, so it uses a defer to make sure done is always closed:

func MD5All(root string) (map[string][md5.Size]byte, error) {
    // MD5All closes the done channel when it returns; it may do so before
    // receiving all the values from c and errc.
    done := make(chan struct{})
    defer close(done)

    c, errc := sumFiles(done, root)

    m := make(map[string][md5.Size]byte)
    for r := range c {
        if r.err != nil {
            return nil, r.err
        }
        m[r.path] = r.sum
    }
    if err := <-errc; err != nil {
        return nil, err
    }
    return m, nil
}

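Bounded parallelism

The MD5All implementation in parallel.go starts a new goroutine for each file. In a directory with many large files, this may allocate more memory than is available on the machine.

We can limit these allocations by bounding the number of files read in parallel. In bounded.go, we do this by creating a fixed number of goroutines for reading files. Our pipeline now has three stages: walk the tree, read and digest the files, and collect the digests.

The first stage, walkFiles, emits the paths of regular files in the tree: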

func walkFiles(done <-chan struct{}, root string) (<-chan string, <-chan error) {
    paths := make(chan string)
    errc := make(chan error, 1)
    go func() {
        // Close the paths channel after Walk returns.
        defer close(paths)
        // No select needed for this send, since errc is buffered.
        errc <- filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
            if err != nil {
                return err
            }
            if !info.Mode().IsRegular() {
                return nil
            }
            select {
            case paths <- path:
            case <-done:
                return errors.New("walk canceled")
            }
            return nil
        })
    }()
    return paths, errc
}

The middle stage starts a fixed number of digester goroutines that receive file names from paths and send results on channel c:

func digester(done <-chan struct{}, paths <-chan string, c chan<- result) {
    for path := range paths {
        data, err := ioutil.ReadFile(path)
        select {
        case c <- result{path, md5.Sum(data), err}:
        case <-done:
            return
        }
    }
}

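Unlike our previous examples, digester does not close its output channel, as multiple goroutines are sending on a shared channel. (Translator's note: digester is comparable to the output function in the earlier merge example, which does not close the channel either.) Instead, code in MD5All arranges for the channel to be closed when all the digesters are done. (Each function closes only its own output channel and never closes any other; in other words, a channel is closed by the function that created it, which avoids confusion.)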

    // Start a fixed number of goroutines to read and digest files.
    c := make(chan result)
    var wg sync.WaitGroup
    const numDigesters = 20
    wg.Add(numDigesters)
    for i := 0; i < numDigesters; i++ {
        go func() {
            digester(done, paths, c)
            wg.Done()
        }()
    }
    go func() {
        wg.Wait()
        close(c)
    }()

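We could instead have each digester create and return its own output channel, but then we would need additional goroutines to fan in the results. (Compared with the squaring example, MD5All here combines the roles of sq and merge: both the computation and the fan-in are arranged inside MD5All.)

The final stage receives all the results from c, then checks the error from errc. This check cannot happen any earlier, since before this point walkFiles may block sending values downstream: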

    m := make(map[string][md5.Size]byte)
    for r := range c {
        if r.err != nil {
            return nil, r.err
        }
        m[r.path] = r.sum
    }
    // Check whether the Walk failed.
    if err := <-errc; err != nil {
        return nil, err
    }
    return m, nil
}
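
To check that a fixed pool of workers really bounds the amount of work in flight, the following sketch replaces file reading with a short sleep and records the peak number of jobs being processed at once. The function runBounded and its counters are hypothetical names introduced for this illustration and are not part of bounded.go; only the structure (a shared input channel drained by a fixed number of goroutines) mirrors the code above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runBounded processes n jobs with numWorkers goroutines. It returns the
// number of jobs processed and the peak number of jobs in flight at once.
func runBounded(n, numWorkers int) (processed, peak int) {
	jobs := make(chan int)
	go func() {
		defer close(jobs)
		for i := 0; i < n; i++ {
			jobs <- i
		}
	}()

	var inFlight, maxSeen, count int64
	var wg sync.WaitGroup
	wg.Add(numWorkers)
	for i := 0; i < numWorkers; i++ {
		go func() {
			defer wg.Done()
			for range jobs {
				cur := atomic.AddInt64(&inFlight, 1)
				// Record the highest concurrency observed so far.
				for {
					old := atomic.LoadInt64(&maxSeen)
					if cur <= old || atomic.CompareAndSwapInt64(&maxSeen, old, cur) {
						break
					}
				}
				time.Sleep(time.Millisecond) // stand-in for reading and hashing a file
				atomic.AddInt64(&count, 1)
				atomic.AddInt64(&inFlight, -1)
			}
		}()
	}
	wg.Wait()
	return int(count), int(maxSeen)
}

func main() {
	processed, peak := runBounded(100, 4)
	fmt.Println(processed, peak <= 4) // 100 true
}
```

Because only numWorkers goroutines ever receive from jobs, the peak can never exceed that number, however many jobs are queued.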

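Conclusion

This article has presented techniques for constructing streaming data pipelines in Go. Dealing with failures in such pipelines is tricky, since each stage in the pipeline may block attempting to send values downstream, and the downstream stages may no longer care about the incoming data. We showed how closing a channel can broadcast a "done" signal to all the goroutines started by a pipeline and defined guidelines for constructing pipelines correctly.

Further reading: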

Go Concurrency Patterns (video) presents the basics of Go’s concurrency primitives and several ways to apply them.
Advanced Go Concurrency Patterns (video) covers more complex uses of Go’s primitives, especially select.
Douglas McIlroy’s paper Squinting at Power Series shows how Go-like concurrency provides elegant support for complex calculations.

serial.go

// +build OMIT

package main

import (
	"crypto/md5"
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"sort"
)

// MD5All reads all the files in the file tree rooted at root and returns a map
// from file path to the MD5 sum of the file's contents.  If the directory walk
// fails or any read operation fails, MD5All returns an error.
func MD5All(root string) (map[string][md5.Size]byte, error) {
	m := make(map[string][md5.Size]byte)
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.Mode().IsRegular() {
			return nil
		}
		data, err := ioutil.ReadFile(path)
		if err != nil {
			return err
		}
		m[path] = md5.Sum(data)
		return nil
	})
	if err != nil {
		return nil, err
	}
	return m, nil
}

func main() {
	// Calculate the MD5 sum of all files under the specified directory,
	// then print the results sorted by path name.
	m, err := MD5All(os.Args[1])
	if err != nil {
		fmt.Println(err)
		return
	}
	var paths []string
	for path := range m {
		paths = append(paths, path)
	}
	sort.Strings(paths)
	for _, path := range paths {
		fmt.Printf("%x  %s\n", m[path], path)
	}
}

parallel.go

// +build OMIT

package main

import (
	"crypto/md5"
	"errors"
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"sort"
	"sync"
)

// A result is the product of reading and summing a file using MD5.
type result struct {
	path string
	sum  [md5.Size]byte
	err  error
}

// sumFiles starts goroutines to walk the directory tree at root and digest each
// regular file.  These goroutines send the results of the digests on the result
// channel and send the result of the walk on the error channel.  If done is
// closed, sumFiles abandons its work.
func sumFiles(done <-chan struct{}, root string) (<-chan result, <-chan error) {
	// For each regular file, start a goroutine that sums the file and sends
	// the result on c.  Send the result of the walk on errc.
	c := make(chan result)
	errc := make(chan error, 1)
	go func() {
		var wg sync.WaitGroup
		err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
			if err != nil {
				return err
			}
			if !info.Mode().IsRegular() {
				return nil
			}
			wg.Add(1)
			go func() {
				data, err := ioutil.ReadFile(path)
				select {
				case c <- result{path, md5.Sum(data), err}:
				case <-done:
				}
				wg.Done()
			}()
			// Abort the walk if done is closed.
			select {
			case <-done:
				return errors.New("walk canceled")
			default:
				return nil
			}
		})
		// Walk has returned, so all calls to wg.Add are done.  Start a
		// goroutine to close c once all the sends are done.
		go func() {
			wg.Wait()
			close(c)
		}()
		// No select needed here, since errc is buffered.
		errc <- err
	}()
	return c, errc
}

// MD5All reads all the files in the file tree rooted at root and returns a map
// from file path to the MD5 sum of the file's contents.  If the directory walk
// fails or any read operation fails, MD5All returns an error.  In that case,
// MD5All does not wait for inflight read operations to complete.
func MD5All(root string) (map[string][md5.Size]byte, error) {
	// MD5All closes the done channel when it returns; it may do so before
	// receiving all the values from c and errc.
	done := make(chan struct{})
	defer close(done)

	c, errc := sumFiles(done, root)

	m := make(map[string][md5.Size]byte)
	for r := range c {
		if r.err != nil {
			return nil, r.err
		}
		m[r.path] = r.sum
	}
	if err := <-errc; err != nil {
		return nil, err
	}
	return m, nil
}

func main() {
	// Calculate the MD5 sum of all files under the specified directory,
	// then print the results sorted by path name.
	m, err := MD5All(os.Args[1])
	if err != nil {
		fmt.Println(err)
		return
	}
	var paths []string
	for path := range m {
		paths = append(paths, path)
	}
	sort.Strings(paths)
	for _, path := range paths {
		fmt.Printf("%x  %s\n", m[path], path)
	}
}

bounded.go

// +build OMIT

package main

import (
	"crypto/md5"
	"errors"
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"sort"
	"sync"
)

// walkFiles starts a goroutine to walk the directory tree at root and send the
// path of each regular file on the string channel.  It sends the result of the
// walk on the error channel.  If done is closed, walkFiles abandons its work.
func walkFiles(done <-chan struct{}, root string) (<-chan string, <-chan error) {
	paths := make(chan string)
	errc := make(chan error, 1)
	go func() {
		// Close the paths channel after Walk returns.
		defer close(paths)
		// No select needed for this send, since errc is buffered.
		errc <- filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
			if err != nil {
				return err
			}
			if !info.Mode().IsRegular() {
				return nil
			}
			select {
			case paths <- path:
			case <-done:
				return errors.New("walk canceled")
			}
			return nil
		})
	}()
	return paths, errc
}

// A result is the product of reading and summing a file using MD5.
type result struct {
	path string
	sum  [md5.Size]byte
	err  error
}

// digester reads path names from paths and sends digests of the corresponding
// files on c until either paths or done is closed.
func digester(done <-chan struct{}, paths <-chan string, c chan<- result) {
	for path := range paths {
		data, err := ioutil.ReadFile(path)
		select {
		case c <- result{path, md5.Sum(data), err}:
		case <-done:
			return
		}
	}
}

// MD5All reads all the files in the file tree rooted at root and returns a map
// from file path to the MD5 sum of the file's contents.  If the directory walk
// fails or any read operation fails, MD5All returns an error.  In that case,
// MD5All does not wait for inflight read operations to complete.
func MD5All(root string) (map[string][md5.Size]byte, error) {
	// MD5All closes the done channel when it returns; it may do so before
	// receiving all the values from c and errc.
	done := make(chan struct{})
	defer close(done)

	paths, errc := walkFiles(done, root)

	// Start a fixed number of goroutines to read and digest files.
	c := make(chan result)
	var wg sync.WaitGroup
	const numDigesters = 20
	wg.Add(numDigesters)
	for i := 0; i < numDigesters; i++ {
		go func() {
			digester(done, paths, c)
			wg.Done()
		}()
	}
	go func() {
		wg.Wait()
		close(c)
	}()

	m := make(map[string][md5.Size]byte)
	for r := range c {
		if r.err != nil {
			return nil, r.err
		}
		m[r.path] = r.sum
	}
	// Check whether the Walk failed.
	if err := <-errc; err != nil {
		return nil, err
	}
	return m, nil
}

func main() {
	// Calculate the MD5 sum of all files under the specified directory,
	// then print the results sorted by path name.
	m, err := MD5All(os.Args[1])
	if err != nil {
		fmt.Println(err)
		return
	}
	var paths []string
	for path := range m {
		paths = append(paths, path)
	}
	sort.Strings(paths)
	for _, path := range paths {
		fmt.Printf("%x  %s\n", m[path], path)
	}
}
