[go-libp2p Source Code Analysis] Swarm Dialing

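1. Introduction

The libp2p swarm is the "low-level" interface to a libp2p network, and it gives you finer control over various aspects of the system. A swarm can set up listeners and can also dial other hosts to establish new connections (for example, a TCP connection to some host). "Dialing" here simply means establishing an outbound connection. The implementation is fairly involved, so this article walks through it.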

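2. Code structure

Repository: https://github.com/libp2p/go-libp2p-swarm.git
The dialing code lives mainly in three files, swarm_dial.go, limiter.go, and dial_sync.go. The structs they define are:
swarm_dial.go: DialBackoff, backoffAddr
DialBackoff limits how soon an address may be dialed again after a failed dial.
dial_sync.go: DialSync, activeDial
DialSync is a dial synchronization helper: at any moment only one dial to a given Peer is active.
limiter.go: dialLimiter, dialJob, dialResult
dialLimiter limits the number of concurrent dials.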

3. Sequence diagram

[Figure: sequence diagram of the dial flow; image not available]

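As the diagram above shows, dialing amounts to a series of checks for concurrency limiting, synchronization, and retry, followed by a call into the Transport to do the actual dial. Suppose there are 1000 Peers and each Peer has 5 different addresses. Dialing them synchronously one by one would be inefficient, so multiple goroutines dial concurrently. The concurrency cannot be left unbounded, though, and dialLimiter is what caps it. If a dial to an address fails, retrying immediately will most likely fail again and only waste resources, so the swarm waits a while first; how long to wait is decided by an algorithm, which DialBackoff implements. Why is DialSync still needed, then? A caller of DialPeer may start several goroutines that dial the same Peer concurrently. Because the swarm has no control over how it is called, the restriction has to be applied at the source of the dial (concurrent dialing across a Peer's addresses is already handled internally).
Here is a diagram borrowed from swarm_dial.go that shows how DialSync works: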

 Diagram of dial sync:

   many callers of Dial()   synched w.  dials many addrs       results to callers
  ----------------------\    dialsync    use earliest          /--------------
  -----------------------\             |----------\           /----------------
  ------------------------>------------<------------>---------<-----------------
  -----------------------|              \----x                 \----------------
  ----------------------|                \-----x                \---------------
                                         any may fail          if no addr at end
                                                                   retry dialAttempt x
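
The "many callers, one dial" idea in the diagram can be sketched independently of libp2p. The block below is a minimal model, not the library's DialSync: peers and connections are plain strings, and the names dialSync, dialLock, waiting, and countUnderlyingDials are hypothetical. It also leaves out what the real helper does when callers cancel. It only shows that concurrent callers for the same peer share a single underlying dial.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight dial to a peer and the callers waiting on it.
type call struct {
	done    chan struct{} // closed when the dial finishes
	waiters int           // callers that have joined this dial
	conn    string
	err     error
}

// dialSync makes sure at most one dial per peer runs at a time.
// (Hypothetical type; a simplified stand-in for the idea behind DialSync.)
type dialSync struct {
	mu     sync.Mutex
	active map[string]*call
	dial   func(peer string) (string, error)
}

func newDialSync(dial func(string) (string, error)) *dialSync {
	return &dialSync{active: make(map[string]*call), dial: dial}
}

// dialLock joins the active dial to peer, starting one if none exists.
func (ds *dialSync) dialLock(peer string) (string, error) {
	ds.mu.Lock()
	c, ok := ds.active[peer]
	if !ok {
		c = &call{done: make(chan struct{})}
		ds.active[peer] = c
		go func() {
			c.conn, c.err = ds.dial(peer)
			ds.mu.Lock()
			delete(ds.active, peer)
			ds.mu.Unlock()
			close(c.done) // publish the result to every waiter
		}()
	}
	c.waiters++
	ds.mu.Unlock()

	<-c.done
	return c.conn, c.err
}

// waiting reports how many callers have joined the active dial to peer.
func (ds *dialSync) waiting(peer string) int {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	if c, ok := ds.active[peer]; ok {
		return c.waiters
	}
	return 0
}

// countUnderlyingDials starts n concurrent callers for one peer and returns
// how many times the underlying dial function actually ran.
func countUnderlyingDials(n int) int {
	var dials int32
	release := make(chan struct{})
	ds := newDialSync(func(peer string) (string, error) {
		atomic.AddInt32(&dials, 1)
		<-release // hold the dial open until every caller has joined
		return "conn-to-" + peer, nil
	})

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ds.dialLock("peerA")
		}()
	}
	for ds.waiting("peerA") < n {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return int(atomic.LoadInt32(&dials))
}

func main() {
	fmt.Println("underlying dials for 10 callers:", countUnderlyingDials(10))
}
```

The per-peer map plus a "done" channel is the core of the pattern: the first caller creates the entry and starts the dial, and later callers only wait on the channel.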

4. Entry points

The swarm exposes a DialPeer method that applications can call directly to dial a Peer. Dialing is reached from two places.

// DialPeer connects to a peer.
func (s *Swarm) DialPeer(ctx context.Context, p peer.ID) (network.Conn, error) {
	if s.gater != nil && !s.gater.InterceptPeerDial(p) {
		log.Debugf("gater disallowed outbound connection to peer %s", p.Pretty())
		return nil, &DialError{Peer: p, Cause: ErrGaterDisallowedConnection}
	}

	return s.dialPeer(ctx, p)
}

1. BasicHost's Connect method calls DialPeer

func (h *BasicHost) Connect(ctx context.Context, pi peer.AddrInfo) error {
	// absorb addresses into peerstore
	h.Peerstore().AddAddrs(pi.ID, pi.Addrs, peerstore.TempAddrTTL)

	if h.Network().Connectedness(pi.ID) == network.Connected {
		return nil
	}

	resolved, err := h.resolveAddrs(ctx, h.Peerstore().PeerInfo(pi.ID))
	if err != nil {
		return err
	}
	h.Peerstore().AddAddrs(pi.ID, resolved, peerstore.TempAddrTTL)

	return h.dialPeer(ctx, pi.ID)
}

func (h *BasicHost) dialPeer(ctx context.Context, p peer.ID) error {
	log.Debugf("host %s dialing %s", h.ID(), p)
	c, err := h.Network().DialPeer(ctx, p)
	if err != nil {
		return err
	}
	select {
	case <-h.ids.IdentifyWait(c):
	case <-ctx.Done():
		return ctx.Err()
	}

	log.Debugf("host %s finished dialing %s", h.ID(), p)
	return nil
}

In turn, IpfsDHT calls BasicHost's Connect

func (dht *IpfsDHT) dialPeer(ctx context.Context, p peer.ID) error {
	// short-circuit if we're already connected.
	if dht.host.Network().Connectedness(p) == network.Connected {
		return nil
	}

	logger.Debug("not connected. dialing.")
	routing.PublishQueryEvent(ctx, &routing.QueryEvent{
		Type: routing.DialingPeer,
		ID:   p,
	})

	pi := peer.AddrInfo{ID: p}
	if err := dht.host.Connect(ctx, pi); err != nil {
		logger.Debugf("error connecting: %s", err)
		routing.PublishQueryEvent(ctx, &routing.QueryEvent{
			Type:  routing.QueryError,
			Extra: err.Error(),
			ID:    p,
		})

		return err
	}
	logger.Debugf("connected. dial success.")
	return nil
}

2. Swarm's NewStream also calls dialPeer: if no connection has been established yet, it dials first

func (s *Swarm) NewStream(ctx context.Context, p peer.ID) (network.Stream, error) {
	log.Debugf("[%s] opening stream to peer [%s]", s.local, p)
	dials := 0
	for {
		c := s.bestConnToPeer(p)
		if c == nil {
			if nodial, _ := network.GetNoDial(ctx); nodial {
				return nil, network.ErrNoConn
			}

			if dials >= DialAttempts {
				return nil, errors.New("max dial attempts exceeded")
			}
			dials++

			var err error
			c, err = s.dialPeer(ctx, p)
			if err != nil {
				return nil, err
			}
		}
		s, err := c.NewStream()
		if err != nil {
			if c.conn.IsClosed() {
				continue
			}
			return nil, err
		}
		return s, nil
	}
}

5. Dial helper initialization

type Swarm struct {
	// ... (other fields elided)
	// dialing helpers
	dsync   *DialSync
	backf   DialBackoff
	limiter *dialLimiter
}

func NewSwarm(ctx context.Context, local peer.ID, peers peerstore.Peerstore, bwc metrics.Reporter, extra ...interface{}) *Swarm {
	s := &Swarm{
		local: local,
		peers: peers,
		bwc:   bwc,
	}
	.....
	s.dsync = NewDialSync(s.doDial)
	s.limiter = newDialLimiter(s.dialAddr, s.IsFdConsumingAddr)
	s.proc = goprocessctx.WithContext(ctx)
	s.ctx = goprocessctx.OnClosingContext(s.proc)
	s.backf.init(s.ctx)

	return s
}

type DialFunc func(context.Context, peer.ID) (*Conn, error)

// NewDialSync constructs a new DialSync
func NewDialSync(dfn DialFunc) *DialSync {
	return &DialSync{
		dials:    make(map[peer.ID]*activeDial),
		dialFunc: dfn,
	}
}

type dialfunc func(context.Context, peer.ID, ma.Multiaddr) (transport.CapableConn, error)
type isFdConsumingFnc func(ma.Multiaddr) bool

func newDialLimiter(df dialfunc, fdFnc isFdConsumingFnc) *dialLimiter {
	fd := ConcurrentFdDials
	if env := os.Getenv("LIBP2P_SWARM_FD_LIMIT"); env != "" {
		if n, err := strconv.ParseInt(env, 10, 32); err == nil {
			fd = int(n)
		}
	}
	return newDialLimiterWithParams(fdFnc, df, fd, DefaultPerPeerRateLimit)
}

func newDialLimiterWithParams(isFdConsumingFnc isFdConsumingFnc, df dialfunc, fdLimit, perPeerLimit int) *dialLimiter {
	return &dialLimiter{
		isFdConsumingFnc:   isFdConsumingFnc,
		fdLimit:            fdLimit,
		perPeerLimit:       perPeerLimit,
		waitingOnPeerLimit: make(map[peer.ID][]*dialJob),
		activePerPeer:      make(map[peer.ID]int),
		dialFunc:           df,
	}
}

func (db *DialBackoff) init(ctx context.Context) {
	if db.entries == nil {
		db.entries = make(map[peer.ID]map[string]*backoffAddr)
	}
	go db.background(ctx)
}

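NewSwarm initializes the three dial helpers, DialSync, dialLimiter, and DialBackoff, when it creates the Swarm instance.
NewDialSync takes a dial function as its argument (in practice, Swarm's doDial).
newDialLimiter takes two functions: a dial function (in practice, Swarm's dialAddr) and a predicate that reports whether an address's protocol consumes a file descriptor (UNIX/TCP).
DialBackoff's init also starts a background goroutine that cleans up stale Backoff entries.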

6. Goroutines involved

1. For each Peer, DialSync starts one goroutine to dial

func (ad *activeDial) start(ctx context.Context) {
	ad.conn, ad.err = ad.ds.dialFunc(ctx, ad.id)

	// This isn't the user's context so we should fix the error.
	switch ad.err {
	case context.Canceled:
		// The dial was canceled with `CancelDial`.
		ad.err = errDialCanceled
	case context.DeadlineExceeded:
		// We hit an internal timeout, not a context timeout.
		ad.err = ErrDialTimeout
	}
	close(ad.waitch)
	ad.cancel()
}

func (ds *DialSync) getActiveDial(p peer.ID) *activeDial {
	ds.dialsLk.Lock()
	defer ds.dialsLk.Unlock()

	actd, ok := ds.dials[p]
	if !ok {
		adctx, cancel := context.WithCancel(context.Background())
		actd = &activeDial{
			id:     p,
			cancel: cancel,
			waitch: make(chan struct{}),
			ds:     ds,
		}
		ds.dials[p] = actd

		go actd.start(adctx)
	}

	// increase ref count before dropping dialsLk
	actd.incref()

	return actd
}

2. For each address of each Peer, dialLimiter starts one goroutine to dial

func (dl *dialLimiter) addCheckFdLimit(dj *dialJob) {
	if dl.shouldConsumeFd(dj.addr) {
		if dl.fdConsuming >= dl.fdLimit {
			log.Debugf("[limiter] blocked dial waiting on FD token; peer: %s; addr: %s; consuming: %d; "+
				"limit: %d; waiting: %d", dj.peer, dj.addr, dl.fdConsuming, dl.fdLimit, len(dl.waitingOnFd))
			dl.waitingOnFd = append(dl.waitingOnFd, dj)
			return
		}

		log.Debugf("[limiter] taking FD token: peer: %s; addr: %s; prev consuming: %d",
			dj.peer, dj.addr, dl.fdConsuming)
		// take token
		dl.fdConsuming++
	}

	log.Debugf("[limiter] executing dial; peer: %s; addr: %s; FD consuming: %d; waiting: %d",
		dj.peer, dj.addr, dl.fdConsuming, len(dl.waitingOnFd))
	go dl.executeDial(dj)
}

func (dl *dialLimiter) addCheckPeerLimit(dj *dialJob) {
	if dl.activePerPeer[dj.peer] >= dl.perPeerLimit {
		log.Debugf("[limiter] blocked dial waiting on peer limit; peer: %s; addr: %s; active: %d; "+
			"peer limit: %d; waiting: %d", dj.peer, dj.addr, dl.activePerPeer[dj.peer], dl.perPeerLimit,
			len(dl.waitingOnPeerLimit[dj.peer]))
		wlist := dl.waitingOnPeerLimit[dj.peer]
		dl.waitingOnPeerLimit[dj.peer] = append(wlist, dj)
		return
	}
	dl.activePerPeer[dj.peer]++

	dl.addCheckFdLimit(dj)
}

// executeDial calls the dialFunc, and reports the result through the response channel when finished. Once the response is sent it also releases all tokens it held during the dial.
func (dl *dialLimiter) executeDial(j *dialJob) {
	defer dl.finishedDial(j)
	if j.cancelled() {
		return
	}

	dctx, cancel := context.WithTimeout(j.ctx, j.dialTimeout())
	defer cancel()

	con, err := dl.dialFunc(dctx, j.peer, j.addr)
	select {
	case j.resp <- dialResult{Conn: con, Addr: j.addr, Err: err}:
	case <-j.ctx.Done():
		if err == nil {
			con.Close()
		}
	}
}
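
The token bookkeeping behind the two limits is easier to follow without goroutines. The block below is a simplified, single-threaded model under stated assumptions: the job and limiter types, the finish method, and the simulate helper are hypothetical, and "starting a dial" just appends the address to a slice. It mirrors the order of checks shown above (per-peer limit first, then the FD limit) and shows how finishing one dial lets queued jobs proceed.

```go
package main

import "fmt"

// job is one (peer, address) dial request. fd marks addresses whose
// transport consumes a file descriptor (for example TCP).
type job struct {
	peer, addr string
	fd         bool
}

// limiter is a toy model of a dial limiter with a global FD limit and a
// per-peer limit. (Hypothetical type, for illustration only.)
type limiter struct {
	fdLimit, perPeerLimit int
	fdInUse               int
	activePerPeer         map[string]int
	waitingOnPeer         map[string][]job
	waitingOnFd           []job
	started               []string // addresses whose dial was started, in order
}

func newLimiter(fdLimit, perPeerLimit int) *limiter {
	return &limiter{
		fdLimit:       fdLimit,
		perPeerLimit:  perPeerLimit,
		activePerPeer: make(map[string]int),
		waitingOnPeer: make(map[string][]job),
	}
}

// add applies the per-peer check first, then the FD check.
func (l *limiter) add(j job) {
	if l.activePerPeer[j.peer] >= l.perPeerLimit {
		l.waitingOnPeer[j.peer] = append(l.waitingOnPeer[j.peer], j)
		return
	}
	l.activePerPeer[j.peer]++
	l.checkFd(j)
}

// checkFd takes an FD token if the address needs one, or queues the job.
func (l *limiter) checkFd(j job) {
	if j.fd {
		if l.fdInUse >= l.fdLimit {
			l.waitingOnFd = append(l.waitingOnFd, j)
			return
		}
		l.fdInUse++
	}
	l.started = append(l.started, j.addr)
}

// finish releases the tokens held by j and starts queued jobs, if any.
func (l *limiter) finish(j job) {
	if j.fd {
		l.fdInUse--
		if len(l.waitingOnFd) > 0 {
			next := l.waitingOnFd[0]
			l.waitingOnFd = l.waitingOnFd[1:]
			l.fdInUse++
			l.started = append(l.started, next.addr)
		}
	}
	l.activePerPeer[j.peer]--
	if q := l.waitingOnPeer[j.peer]; len(q) > 0 {
		l.waitingOnPeer[j.peer] = q[1:]
		l.activePerPeer[j.peer]++
		l.checkFd(q[0])
	}
}

// simulate runs four jobs through a limiter with fdLimit=1, perPeerLimit=2.
func simulate() []string {
	l := newLimiter(1, 2)
	a1 := job{"A", "A/tcp/1", true}
	a2 := job{"A", "A/tcp/2", true}  // blocked on the FD token
	a3 := job{"A", "A/udp/3", false} // blocked on the per-peer limit
	b1 := job{"B", "B/udp/1", false}
	for _, j := range []job{a1, a2, a3, b1} {
		l.add(j)
	}
	l.finish(a1) // frees one FD token and one per-peer slot
	return l.started
}

func main() {
	fmt.Println(simulate())
}
```

In this run, A/tcp/2 counts against peer A's limit even while it waits for an FD token, which is why A/udp/3 has to wait until A/tcp/1 finishes.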

3. Backoff cleanup

func (db *DialBackoff) init(ctx context.Context) {
	if db.entries == nil {
		db.entries = make(map[peer.ID]map[string]*backoffAddr)
	}
	go db.background(ctx)
}

func (db *DialBackoff) background(ctx context.Context) {
	ticker := time.NewTicker(BackoffMax)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			db.cleanup()
		}
	}
}

func (db *DialBackoff) cleanup() {
	db.lock.Lock()
	defer db.lock.Unlock()
	now := time.Now()
	for p, e := range db.entries {
		good := false
		for _, backoff := range e {
			backoffTime := BackoffBase + BackoffCoef*time.Duration(backoff.tries*backoff.tries)
			if backoffTime > BackoffMax {
				backoffTime = BackoffMax
			}
			if now.Before(backoff.until.Add(backoffTime)) {
				good = true
				break
			}
		}
		if !good {
			delete(db.entries, p)
		}
	}
}

7. Some important rules and algorithms

1. Filtering dial addresses

// filterKnownUndialables takes a list of multiaddrs, and removes those that we definitely don't want to dial: addresses configured to be blocked, IPv6 link-local addresses, addresses without a dial-capable transport, and addresses that we know to be our own. This is an optimization to avoid wasting time on dials that we know are going to fail.
func (s *Swarm) filterKnownUndialables(p peer.ID, addrs []ma.Multiaddr) []ma.Multiaddr {
	lisAddrs, _ := s.InterfaceListenAddresses()
	var ourAddrs []ma.Multiaddr
	for _, addr := range lisAddrs {
		protos := addr.Protocols()
		// we're only sure about filtering out /ip4 and /ip6 addresses, so far
		if len(protos) == 2 && (protos[0].Code == ma.P_IP4 || protos[0].Code == ma.P_IP6) {
			ourAddrs = append(ourAddrs, addr)
		}
	}

	return addrutil.FilterAddrs(addrs,
		addrutil.SubtractFilter(ourAddrs...),
		s.canDial,
		// TODO: Consider allowing link-local addresses
		addrutil.AddrOverNonLocalIP,
		func(addr ma.Multiaddr) bool {
			return s.gater == nil || s.gater.InterceptAddrDial(p, addr)
		},
	)
}

// FilterAddrs is a filter that removes certain addresses, according to the given filters.
// If all filters return true, the address is kept.
func FilterAddrs(a []ma.Multiaddr, filters ...func(ma.Multiaddr) bool) []ma.Multiaddr {
	b := make([]ma.Multiaddr, 0, len(a))
	for _, addr := range a {
		good := true
		for _, filter := range filters {
			good = good && filter(addr)
		}
		if good {
			b = append(b, addr)
		}
	}
	return b
}

// AddrOverNonLocalIP returns whether the addr uses a non-local ip link
func AddrOverNonLocalIP(a ma.Multiaddr) bool {
	split := ma.Split(a)
	if len(split) < 1 {
		return false
	}
	if manet.IsIP6LinkLocal(split[0]) {
		return false
	}
	return true
}
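
The filter chain is an "all predicates must accept" composition. The sketch below illustrates that pattern with plain strings instead of multiaddrs. The helpers filterAddrs, subtract, and notIP6LinkLocal are hypothetical stand-ins, and the link-local check is a crude prefix test for illustration only, not how manet.IsIP6LinkLocal works.

```go
package main

import (
	"fmt"
	"strings"
)

// filterAddrs keeps an address only if every filter accepts it.
func filterAddrs(addrs []string, filters ...func(string) bool) []string {
	kept := make([]string, 0, len(addrs))
	for _, a := range addrs {
		ok := true
		for _, f := range filters {
			if !f(a) {
				ok = false
				break
			}
		}
		if ok {
			kept = append(kept, a)
		}
	}
	return kept
}

// subtract builds a filter that rejects any address listed in own,
// e.g. our own listen addresses.
func subtract(own ...string) func(string) bool {
	set := make(map[string]struct{}, len(own))
	for _, a := range own {
		set[a] = struct{}{}
	}
	return func(a string) bool {
		_, found := set[a]
		return !found
	}
}

// notIP6LinkLocal rejects addresses that start with /ip6/fe80.
// Assumption: a simple prefix check is good enough for this demo.
func notIP6LinkLocal(a string) bool {
	return !strings.HasPrefix(strings.ToLower(a), "/ip6/fe80")
}

func main() {
	candidates := []string{
		"/ip4/10.0.0.1/tcp/4001", // our own listen address
		"/ip6/fe80::1/tcp/4001",  // link-local
		"/ip4/1.2.3.4/tcp/4001",
	}
	dialable := filterAddrs(candidates,
		subtract("/ip4/10.0.0.1/tcp/4001"),
		notIP6LinkLocal,
	)
	fmt.Println(dialable)
}
```

Each predicate stays small and independent, so adding another rule (such as a connection gater check) is just one more argument.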

2. Ordering dial addresses

// ranks addresses in descending order of preference for dialing   Private UDP > Public UDP > Private TCP > Public TCP > UDP Relay server > TCP Relay server
	rankAddrsFnc := func(addrs []ma.Multiaddr) []ma.Multiaddr {
		var localUdpAddrs []ma.Multiaddr // private udp
		var relayUdpAddrs []ma.Multiaddr // relay udp
		var othersUdp []ma.Multiaddr     // public udp

		var localFdAddrs []ma.Multiaddr // private fd consuming
		var relayFdAddrs []ma.Multiaddr //  relay fd consuming
		var othersFd []ma.Multiaddr     // public fd consuming

		for _, a := range addrs {
			if _, err := a.ValueForProtocol(ma.P_CIRCUIT); err == nil {
				if s.IsFdConsumingAddr(a) {
					relayFdAddrs = append(relayFdAddrs, a)
					continue
				}
				relayUdpAddrs = append(relayUdpAddrs, a)
			} else if manet.IsPrivateAddr(a) {
				if s.IsFdConsumingAddr(a) {
					localFdAddrs = append(localFdAddrs, a)
					continue
				}
				localUdpAddrs = append(localUdpAddrs, a)
			} else {
				if s.IsFdConsumingAddr(a) {
					othersFd = append(othersFd, a)
					continue
				}
				othersUdp = append(othersUdp, a)
			}
		}

		relays := append(relayUdpAddrs, relayFdAddrs...)
		fds := append(localFdAddrs, othersFd...)

		return append(append(append(localUdpAddrs, othersUdp...), fds...), relays...)
	}
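
The same preference order can be expressed as a rank number per address followed by a stable sort. This is a sketch under assumptions, not the swarm's code: the addr struct with its boolean flags, bucket, and rank are hypothetical, and the flags stand in for the ma.P_CIRCUIT, manet.IsPrivateAddr, and IsFdConsumingAddr checks.

```go
package main

import (
	"fmt"
	"sort"
)

// addr is a toy address with the three properties the ranking looks at.
type addr struct {
	name    string
	relay   bool // goes through a relay (circuit) server
	private bool // private network address
	fd      bool // consumes a file descriptor (TCP-like); false means UDP-like
}

// bucket returns the dial preference of an address; lower is dialed earlier:
// private UDP, public UDP, private FD, public FD, relay UDP, relay FD.
func bucket(a addr) int {
	switch {
	case a.relay && !a.fd:
		return 4
	case a.relay:
		return 5
	case a.private && !a.fd:
		return 0
	case !a.fd:
		return 1
	case a.private:
		return 2
	default:
		return 3
	}
}

// rank returns address names in dial order. The sort is stable, so
// addresses in the same bucket keep their original relative order.
func rank(addrs []addr) []string {
	sorted := append([]addr(nil), addrs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return bucket(sorted[i]) < bucket(sorted[j])
	})
	names := make([]string, len(sorted))
	for i, a := range sorted {
		names[i] = a.name
	}
	return names
}

func main() {
	fmt.Println(rank([]addr{
		{name: "relay-tcp", relay: true, fd: true},
		{name: "public-tcp", fd: true},
		{name: "private-udp", private: true},
		{name: "relay-udp", relay: true},
		{name: "private-tcp", private: true, fd: true},
		{name: "public-udp"},
	}))
}
```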

3. Backoff timing

// BackoffBase is the base amount of time to backoff (default: 5s).
var BackoffBase = time.Second * 5

// BackoffCoef is the backoff coefficient (default: 1s).
var BackoffCoef = time.Second

// BackoffMax is the maximum backoff time (default: 5m).
var BackoffMax = time.Minute * 5

// AddBackoff lets other nodes know that we've entered backoff with peer p, so dialers should not wait unnecessarily. We still will attempt to dial with one goroutine, in case we get through.
//
// Backoff is not exponential, it's quadratic and computed according to the following formula:
//
//     BackoffBase + BackoffCoef * PriorBackoffs^2
//
// Where PriorBackoffs is the number of previous backoffs.
func (db *DialBackoff) AddBackoff(p peer.ID, addr ma.Multiaddr) {
	saddr := string(addr.Bytes())
	db.lock.Lock()
	defer db.lock.Unlock()
	bp, ok := db.entries[p]
	if !ok {
		bp = make(map[string]*backoffAddr, 1)
		db.entries[p] = bp
	}
	ba, ok := bp[saddr]
	if !ok {
		bp[saddr] = &backoffAddr{
			tries: 1,
			until: time.Now().Add(BackoffBase),
		}
		return
	}

	backoffTime := BackoffBase + BackoffCoef*time.Duration(ba.tries*ba.tries)
	if backoffTime > BackoffMax {
		backoffTime = BackoffMax
	}
	ba.until = time.Now().Add(backoffTime)
	ba.tries++
}
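
To make the quadratic schedule concrete, the sketch below recomputes the formula with the default constants quoted above. The helper names backoffAfter and backoffSeconds are hypothetical. With zero prior failures the formula gives BackoffBase, which matches the first-failure branch of AddBackoff.

```go
package main

import (
	"fmt"
	"time"
)

// Defaults as quoted in the listing above.
const (
	backoffBase = 5 * time.Second
	backoffCoef = time.Second
	backoffMax  = 5 * time.Minute
)

// backoffAfter returns the wait applied when an address that has already
// failed priorTries times fails again, capped at backoffMax.
func backoffAfter(priorTries int) time.Duration {
	d := backoffBase + backoffCoef*time.Duration(priorTries*priorTries)
	if d > backoffMax {
		d = backoffMax
	}
	return d
}

// backoffSeconds is the same value expressed in whole seconds.
func backoffSeconds(priorTries int) int {
	return int(backoffAfter(priorTries) / time.Second)
}

func main() {
	for _, n := range []int{0, 1, 2, 3, 10, 17, 18} {
		fmt.Printf("prior failures=%d -> wait %v\n", n, backoffAfter(n))
	}
}
```

The wait grows as 5s, 6s, 9s, 14s, and so on, and with these defaults it reaches the 5-minute cap at 18 prior failures (5s + 324s exceeds 300s).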

8. Core dialing logic

func (s *Swarm) dialAddrs(ctx context.Context, p peer.ID, remoteAddrs []ma.Multiaddr) (transport.CapableConn, *DialError) {
	/*
		This slice-to-chan code is temporary, the peerstore can currently provide
		a channel as an interface for receiving addresses, but more thought
		needs to be put into the execution. For now, this allows us to use
		the improved rate limiter, while maintaining the outward behaviour
		that we previously had (halting a dial when we run out of addrs)
	*/
	var remoteAddrChan chan ma.Multiaddr
	if len(remoteAddrs) > 0 {
		remoteAddrChan = make(chan ma.Multiaddr, len(remoteAddrs))
		for i := range remoteAddrs {
			remoteAddrChan <- remoteAddrs[i]
		}
		close(remoteAddrChan)
	}

	log.Debugf("%s swarm dialing %s", s.local, p)

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancel work when we exit func

	// use a single response type instead of errs and conns, reduces complexity *a ton*
	respch := make(chan dialResult)
	err := &DialError{Peer: p}

	defer s.limiter.clearAllPeerDials(p)

	var active int
dialLoop:
	for remoteAddrChan != nil || active > 0 {
		// Check for context cancellations and/or responses first.
		select {
		case <-ctx.Done():
			break dialLoop
		case resp := <-respch:
			active--
			if resp.Err != nil {
				// Errors are normal, lots of dials will fail
				if resp.Err != context.Canceled {
					s.backf.AddBackoff(p, resp.Addr)
				}

				log.Infof("got error on dial: %s", resp.Err)
				err.recordErr(resp.Addr, resp.Err)
			} else if resp.Conn != nil {
				return resp.Conn, nil
			}

			// We got a result, try again from the top.
			continue
		default:
		}

		// Now, attempt to dial.
		select {
		case addr, ok := <-remoteAddrChan:
			if !ok {
				remoteAddrChan = nil
				continue
			}

			s.limitedDial(ctx, p, addr, respch)
			active++
		case <-ctx.Done():
			break dialLoop
		case resp := <-respch:
			active--
			if resp.Err != nil {
				// Errors are normal, lots of dials will fail
				if resp.Err != context.Canceled {
					s.backf.AddBackoff(p, resp.Addr)
				}

				log.Infof("got error on dial: %s", resp.Err)
				err.recordErr(resp.Addr, resp.Err)
			} else if resp.Conn != nil {
				return resp.Conn, nil
			}
		}
	}

	if ctxErr := ctx.Err(); ctxErr != nil {
		err.Cause = ctxErr
	} else if len(err.DialErrors) == 0 {
		err.Cause = network.ErrNoRemoteAddrs
	} else {
		err.Cause = ErrAllDialsFailed
	}
	return nil, err
}

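A Peer may have several addresses. After the undialable ones are filtered out and the rest are sorted, they are handed to this function, which first puts them into a channel and then iterates over the channel (dialLoop):

step 1. Check the context and pending responses
1.1 If the context is canceled, break out of the loop.
1.2 If a response has arrived, decrement the active counter. If that dial failed, call AddBackoff (unless the error is context.Canceled) and record the error, then continue with the next iteration; if it succeeded, return the Conn immediately.

step 2. Attempt to dial
2.1 Take an address from the channel, call limitedDial to dial it (which starts a goroutine internally), and increment the active counter.
2.2 If the context is canceled, break out of the loop.
2.3 If a response arrives, decrement the active counter. If the dial failed, call AddBackoff (unless the error is context.Canceled) and record the error; if it succeeded, return the Conn immediately. Otherwise move on to the next iteration, which starts again with the context and response check.

step 3. Return an error. If dialLoop ends without returning a Conn, then either the context was canceled, there were no addresses to dial, or every dial failed and not a single address could be reached.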

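Suppose there are three addresses. Because each dial runs in its own goroutine, the third dial may already be in flight by the time the first has failed and the second reports success. When dialAddrs returns the Conn, its deferred calls run in last-in, first-out order. s.limiter.clearAllPeerDials(p) runs first and clears this Peer's entries in waitingOnPeerLimit, since dialing this Peer is over whether it succeeded or failed. Then cancel() runs and the third dial job receives the cancel signal; if that dial had actually succeeded, its Conn is closed (see dialLimiter.executeDial).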
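
The "first success wins, cancel the rest" behaviour described above can be sketched in isolation. This is a minimal model under assumptions, not the swarm's dialAddrs: addresses are strings, all dials are started at once instead of being fed through a limiter, and dialFirst, fakeDial, and demo are hypothetical names. A real dialer would also close a connection that completes after cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errAllDialsFailed = errors.New("all dials failed")

type dialResult struct {
	addr string
	err  error
}

// dialFirst dials every address concurrently and returns the first address
// that succeeds. Returning cancels the context, which stops the others.
func dialFirst(ctx context.Context, addrs []string, dial func(context.Context, string) error) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancel dials still in flight when we return

	respch := make(chan dialResult)
	for _, a := range addrs {
		go func(a string) {
			err := dial(ctx, a)
			select {
			case respch <- dialResult{addr: a, err: err}:
			case <-ctx.Done():
				// Nobody is listening any more; a real dialer would
				// close a successfully opened connection here.
			}
		}(a)
	}

	for active := len(addrs); active > 0; active-- {
		select {
		case r := <-respch:
			if r.err == nil {
				return r.addr, nil
			}
			// A failed dial: keep waiting for the remaining ones.
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", errAllDialsFailed
}

// fakeDial simulates three kinds of addresses: "good" succeeds at once,
// "slow" hangs until it is canceled, anything else fails at once.
func fakeDial(ctx context.Context, addr string) error {
	switch addr {
	case "good":
		return nil
	case "slow":
		<-ctx.Done()
		return ctx.Err()
	default:
		return errors.New("connection refused")
	}
}

// demo returns the winning address, or the error text if no dial succeeded.
func demo(addrs ...string) string {
	winner, err := dialFirst(context.Background(), addrs, fakeDial)
	if err != nil {
		return "error: " + err.Error()
	}
	return winner
}

func main() {
	fmt.Println(demo("bad", "good", "slow"))
	fmt.Println(demo("bad", "bad2"))
}
```

Using one result channel for both errors and successes keeps the loop to a single select, which is the same simplification the comment in dialAddrs points out.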
