Distributed Tracing
Distributed tracing was first described in Google's Dapper paper. It provides a simple, easy-to-use API for recording the call paths and latencies across different systems, which in turn supplies the data needed to analyze each system's performance.
Overview of the Dapper paper
Dapper sets out a series of requirements for a distributed tracing system:
- Low performance overhead: the tracing system should impose as little overhead as possible on the traced services, especially for performance-sensitive workloads.
- Application-level transparency: instrumentation should be minimally invasive for application developers, ideally shipping as out-of-the-box tooling, for example by folding the tracing system into shared components.
- Scalability: the system should be easy to extend as the deployment grows.
A distributed tracing example
In the classic example, request RequestX enters service A from the user side; A calls B and C via RPC, C in turn calls D and E via RPC, and the results are aggregated and returned to the user. By instrumenting the calls from A to B and C, and from C to D and E, the tracing system can monitor the performance of each hop; the desired result looks like the following.
A Span records what happened at each node, and ids express the parent/child relationships between calls, giving a detailed description of the trace. The call tree is recorded in a language-independent format. For further details, see the paper itself.
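As a rough illustration (a hypothetical sketch, not Dapper's actual data model), a span can be modeled as a record carrying a trace id, its own span id, and its parent's span id; following the parent pointers reconstructs the call tree for RequestX:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str

def build_tree(spans):
    """Group spans by parent_id so the call hierarchy can be walked."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    return children

spans = [
    Span("t1", "a", None, "RequestX@A"),
    Span("t1", "b", "a", "A->B"),
    Span("t1", "c", "a", "A->C"),
    Span("t1", "d", "c", "C->D"),
    Span("t1", "e", "c", "C->E"),
]
tree = build_tree(spans)
# the children of span "c" are the calls C made: D and E
assert [s.name for s in tree["c"]] == ["C->D", "C->E"]
assert tree[None][0].name == "RequestX@A"
```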
How Zipkin works
Zipkin is an open-source distributed tracing system that Twitter built based on Google's Dapper paper. Its architecture is as follows:
Instrumented clients and servers report span data through a transport to the Zipkin server, which persists it to storage. Zipkin supports several storage backends, including Elasticsearch and MySQL, and the transport can also be a message broker such as Kafka or RabbitMQ. With the basic concepts in place, let's deepen our understanding with a hands-on example.
Zipkin in practice
Installation and configuration details are on the official site; this article simply starts Zipkin via Docker for testing. For production deployments, consult the official documentation. Two Flask apps are used to demonstrate Zipkin.
Service 1
import requests
from flask import Flask
from py_zipkin.zipkin import zipkin_span, create_http_headers_for_new_span
import time

app = Flask(__name__)


def do_stuff():
    time.sleep(2)
    headers = create_http_headers_for_new_span()
    requests.get('http://localhost:6000/service/', headers=headers)
    return 'OK'


def http_transport(encoded_span):
    zipkin_url = "http://192.168.10.204:9411/api/v1/spans"
    headers = {"Content-Type": "application/x-thrift"}
    # You'd probably want to wrap this in a try/except in case POSTing fails
    requests.post(zipkin_url, data=encoded_span, headers=headers)


@app.route('/')
def index():
    with zipkin_span(
        service_name='flask_server_1',
        span_name='index',
        transport_handler=http_transport,
        port=5000,
        sample_rate=100,  # Value between 0.0 and 100.0
    ):
        with zipkin_span(service_name='service_1', span_name='service_1_do_stuff'):
            do_stuff()
        time.sleep(1)
    return 'OK', 200


if __name__ == '__main__':
    app.run(host="0.0.0.0", port=5000, debug=True)
Service 1 listens on port 5000; on receiving a request it calls the service on port 6000, and finishes handling the request once that call returns.
Service 2
import time

import requests
from flask import Flask, request
from py_zipkin.zipkin import zipkin_span, ZipkinAttrs

app = Flask(__name__)


def do_stuff():
    time.sleep(2)
    with zipkin_span(service_name='service_2', span_name='service_2_db_search'):
        time.sleep(2)
    return 'OK'


def http_transport(encoded_span):
    zipkin_url = "http://192.168.10.204:9411/api/v1/spans"
    headers = {"Content-Type": "application/x-thrift"}
    # You'd probably want to wrap this in a try/except in case POSTing fails
    requests.post(zipkin_url, data=encoded_span, headers=headers)


@app.route('/service/')
def index():
    with zipkin_span(
        service_name='flask_server_2',
        zipkin_attrs=ZipkinAttrs(
            trace_id=request.headers['X-B3-TraceID'],
            span_id=request.headers['X-B3-SpanID'],
            parent_span_id=request.headers['X-B3-ParentSpanID'],
            flags=request.headers['X-B3-Flags'],
            is_sampled=request.headers['X-B3-Sampled'],
        ),
        span_name='service_span',
        transport_handler=http_transport,
        port=6000,
        sample_rate=100,  # Value between 0.0 and 100.0
    ):
        with zipkin_span(service_name='service_2', span_name='service_2_do_stuff'):
            do_stuff()
    return 'OK', 200


if __name__ == '__main__':
    app.run(host="0.0.0.0", port=6000, debug=True)
Service 2 listens on port 6000, serves the request, and returns when done; the time.sleep calls simulate real work.
Now start both services and visit http://127.0.0.1:5000/.
Then open http://192.168.10.204:9411/ to view the Zipkin UI, and search for the service name flask_server_1 to see the call chain.
The UI shows the entire call chain and the time spent in each application. We won't walk through Zipkin's server-side source here; a quick skim shows it is mostly straightforward save-and-query business logic with a clear structure. Instead, let's examine how py_zipkin executes on the client side.
Overview of the py_zipkin flow
Root span processing flow
py_zipkin is a client that implements Zipkin's reporting protocol. At each point of interest it collects span information, keeps every generated span organized, and ships them all to the Zipkin server. Let's start from the following code:
with zipkin_span(
    service_name='flask_server_1',
    span_name='index',
    transport_handler=http_transport,
    port=5000,
    sample_rate=100,  # Value between 0.0 and 100.0
):
    with zipkin_span(service_name='service_1', span_name='service_1_do_stuff'):
        do_stuff()
    time.sleep(1)
A with context manager wraps the root span, and any number of child spans can be created inside it; a queue therefore holds all the spans created under the root, in order.
class zipkin_span(object):
    """Context manager/decorator for all of your zipkin tracing needs.

    Usage #1: Start a trace with a given sampling rate

    This begins the zipkin trace and also records the root span. The required
    params are service_name, transport_handler, and sample_rate.

    # Start a trace with do_stuff() as the root span
    def some_batch_job(a, b):
        with zipkin_span(
            service_name='my_service',
            span_name='my_span_name',
            transport_handler=some_handler,
            port=22,
            sample_rate=0.05,
        ):
            do_stuff()

    Usage #2: Trace a service call.

    The typical use case is instrumenting a framework like Pyramid or Django. Only
    ss and sr times are recorded for the root span. Required params are
    service_name, zipkin_attrs, transport_handler, and port.

    # Used in a pyramid tween
    def tween(request):
        zipkin_attrs = some_zipkin_attr_creator(request)
        with zipkin_span(
            service_name='my_service',
            span_name='my_span_name',
            zipkin_attrs=zipkin_attrs,
            transport_handler=some_handler,
            port=22,
        ) as zipkin_context:
            response = handler(request)
            zipkin_context.update_binary_annotations(
                some_binary_annotations)
            return response

    Usage #3: Log a span within the context of a zipkin trace

    If you're already in a zipkin trace, you can use this to log a span inside. The
    only required param is service_name. If you're not in a zipkin trace, this
    won't do anything.

    # As a decorator
    @zipkin_span(service_name='my_service', span_name='my_function')
    def my_function():
        do_stuff()

    # As a context manager
    def my_function():
        with zipkin_span(service_name='my_service', span_name='do_stuff'):
            do_stuff()
    """
    def __init__(
        self,
        service_name,
        span_name="span",
        zipkin_attrs=None,
        transport_handler=None,
        max_span_batch_size=None,
        annotations=None,
        binary_annotations=None,
        port=0,
        sample_rate=None,
        include=None,
        add_logging_annotation=False,
        report_root_timestamp=False,
        use_128bit_trace_id=False,
        host=None,
        context_stack=None,
        span_storage=None,
        firehose_handler=None,
        kind=None,
        timestamp=None,
        duration=None,
        encoding=Encoding.V1_THRIFT,
        _tracer=None,
    ):
        """Logs a zipkin span. If this is the root span, then a zipkin
        trace is started as well.

        :param service_name: The name of the called service
        :type service_name: string
        :param span_name: Optional name of span, defaults to 'span'
        :type span_name: string
        :param zipkin_attrs: Optional set of zipkin attributes to be used
        :type zipkin_attrs: ZipkinAttrs
        :param transport_handler: Callback function that takes a message parameter
            and handles logging it
        :type transport_handler: BaseTransportHandler
        :param max_span_batch_size: Spans in a trace are sent in batches,
            max_span_batch_size defines max size of one batch
        :type max_span_batch_size: int
        :param annotations: Optional dict of str -> timestamp annotations
        :type annotations: dict of str -> int
        :param binary_annotations: Optional dict of str -> str span attrs
        :type binary_annotations: dict of str -> str
        :param port: The port number of the service. Defaults to 0.
        :type port: int
        :param sample_rate: Rate at which to sample; 0.0 - 100.0. If passed-in
            zipkin_attrs have is_sampled=False and the sample_rate param is > 0,
            a new span will be generated at this rate. This means that if you
            propagate sampling decisions to downstream services, but still have
            sample_rate > 0 in those services, the actual rate of generated
            spans for those services will be > sampling_rate.
        :type sample_rate: float
        :param include: which annotations to include
            can be one of {'client', 'server'}
            corresponding to ('cs', 'cr') and ('ss', 'sr') respectively.
            DEPRECATED: use kind instead. `include` will be removed in 1.0.
        :type include: iterable
        :param add_logging_annotation: Whether to add a 'logging_end'
            annotation when py_zipkin finishes logging spans
        :type add_logging_annotation: boolean
        :param report_root_timestamp: Whether the span should report timestamp
            and duration. Only applies to "root" spans in this local context,
            so spans created inside other span contexts will always log
            timestamp/duration. Note that this is only an override for spans
            that have zipkin_attrs passed in. Spans that make their own
            sampling decisions (i.e. are the root spans of entire traces) will
            always report timestamp/duration.
        :type report_root_timestamp: boolean
        :param use_128bit_trace_id: If true, generate 128-bit trace_ids.
        :type use_128bit_trace_id: boolean
        :param host: Contains the ipv4 or ipv6 value of the host. The ip value
            isn't automatically determined in a docker environment.
        :type host: string
        :param context_stack: explicit context stack for storing
            zipkin attributes
        :type context_stack: object
        :param span_storage: explicit Span storage for storing zipkin spans
            before they're emitted.
        :type span_storage: py_zipkin.storage.SpanStorage
        :param firehose_handler: [EXPERIMENTAL] Similar to transport_handler,
            except that it will receive 100% of the spans regardless of trace
            sampling rate.
        :type firehose_handler: BaseTransportHandler
        :param kind: Span type (client, server, local, etc...).
        :type kind: Kind
        :param timestamp: Timestamp in seconds, defaults to `time.time()`.
            Set this if you want to use a custom timestamp.
        :type timestamp: float
        :param duration: Duration in seconds, defaults to the time spent in the
            context. Set this if you want to use a custom duration.
        :type duration: float
        :param encoding: Output encoding format, defaults to V1 thrift spans.
        :type encoding: Encoding
        :param _tracer: Current tracer object. This argument is passed in
            automatically when you create a zipkin_span from a Tracer.
        :type _tracer: Tracer
        """
        self.service_name = service_name  # name of the service being recorded
        self.span_name = span_name  # name of this span
        self.zipkin_attrs_override = zipkin_attrs  # zipkin attributes to override with
        self.transport_handler = transport_handler  # handler used to send the data
        self.max_span_batch_size = max_span_batch_size
        self.annotations = annotations or {}
        self.binary_annotations = binary_annotations or {}
        self.port = port
        self.sample_rate = sample_rate  # sampling rate
        self.add_logging_annotation = add_logging_annotation
        self.report_root_timestamp_override = report_root_timestamp
        self.use_128bit_trace_id = use_128bit_trace_id  # whether to use 128-bit trace ids
        self.host = host
        self._context_stack = context_stack
        self._span_storage = span_storage
        self.firehose_handler = firehose_handler
        self.kind = self._generate_kind(kind, include)
        self.timestamp = timestamp
        self.duration = duration
        self.encoding = encoding
        self._tracer = _tracer

        self._is_local_root_span = False
        self.logging_context = None
        self.do_pop_attrs = False
        # Spans that log a 'cs' timestamp can additionally record a
        # 'sa' binary annotation that shows where the request is going.
        self.remote_endpoint = None
        self.zipkin_attrs = None

        # It used to be possible to override timestamp and duration by passing
        # in the cs/cr or sr/ss annotations. We want to keep backward compatibility
        # for now, so this logic overrides self.timestamp and self.duration in the
        # same way.
        # This doesn't fit well with v2 spans since those annotations are gone, so
        # we also log a deprecation warning.
        if "sr" in self.annotations and "ss" in self.annotations:
            self.duration = self.annotations["ss"] - self.annotations["sr"]
            self.timestamp = self.annotations["sr"]
            log.warning(
                "Manually setting 'sr'/'ss' annotations is deprecated. Please "
                "use the timestamp and duration parameters."
            )
        if "cr" in self.annotations and "cs" in self.annotations:
            self.duration = self.annotations["cr"] - self.annotations["cs"]
            self.timestamp = self.annotations["cs"]
            log.warning(
                "Manually setting 'cr'/'cs' annotations is deprecated. Please "
                "use the timestamp and duration parameters."
            )

        # Root spans have transport_handler and at least one of
        # zipkin_attrs_override or sample_rate.
        if self.zipkin_attrs_override or self.sample_rate is not None:
            # transport_handler is mandatory for root spans
            if self.transport_handler is None:
                raise ZipkinError("Root spans require a transport handler to be given")
            self._is_local_root_span = True

        # If firehose_handler is set then this is a local root span.
        if self.firehose_handler:
            self._is_local_root_span = True

        if self.sample_rate is not None and not (0.0 <= self.sample_rate <= 100.0):
            raise ZipkinError("Sample rate must be between 0.0 and 100.0")

        if self._span_storage is not None and not isinstance(
            self._span_storage, storage.SpanStorage
        ):
            raise ZipkinError(
                "span_storage should be an instance of py_zipkin.storage.SpanStorage"
            )

        if self._span_storage is not None:
            log.warning("span_storage is deprecated. Set local_storage instead.")
            self.get_tracer()._span_storage = self._span_storage

        if self._context_stack is not None:
            log.warning("context_stack is deprecated. Set local_storage instead.")
            self.get_tracer()._context_stack = self._context_stack
    def __call__(self, f):
        @functools.wraps(f)
        def decorated(*args, **kwargs):
            with zipkin_span(
                service_name=self.service_name,
                span_name=self.span_name,
                zipkin_attrs=self.zipkin_attrs,
                transport_handler=self.transport_handler,
                max_span_batch_size=self.max_span_batch_size,
                annotations=self.annotations,
                binary_annotations=self.binary_annotations,
                port=self.port,
                sample_rate=self.sample_rate,
                include=None,
                add_logging_annotation=self.add_logging_annotation,
                report_root_timestamp=self.report_root_timestamp_override,
                use_128bit_trace_id=self.use_128bit_trace_id,
                host=self.host,
                context_stack=self._context_stack,
                span_storage=self._span_storage,
                firehose_handler=self.firehose_handler,
                kind=self.kind,
                timestamp=self.timestamp,
                duration=self.duration,
                encoding=self.encoding,
                _tracer=self._tracer,
            ):
                return f(*args, **kwargs)

        return decorated

    def get_tracer(self):
        if self._tracer is not None:
            return self._tracer
        else:
            return get_default_tracer()  # fetch the tracer kept in thread-local data

    def __enter__(self):
        return self.start()  # start the span
    def _generate_kind(self, kind, include):
        # If `kind` is not set, then we generate it from `include`.
        # This code maintains backward compatibility with old versions of py_zipkin
        # which used include rather than kind to identify client / server spans.
        if kind:
            return kind
        else:
            if include:
                # If `include` contains only one of `client` or `server`
                # then it's a client or server span respectively.
                # If neither or both are present, then it's a local span
                # which is represented by kind = None.
                log.warning("The include argument is deprecated. Please use kind.")
                if "client" in include and "server" not in include:
                    return Kind.CLIENT
                elif "client" not in include and "server" in include:
                    return Kind.SERVER
                else:
                    return Kind.LOCAL

        # If both kind and include are unset, then it's a local span.
        return Kind.LOCAL
    def _get_current_context(self):
        """Returns the current ZipkinAttrs and generates new ones if needed.

        :returns: (report_root_timestamp, zipkin_attrs)
        :rtype: (bool, ZipkinAttrs)
        """
        # This check is technically not necessary since only root spans will have
        # sample_rate, zipkin_attrs or a transport set. But it helps making the
        # code clearer by separating the logic for a root span from the one for a
        # child span.
        if self._is_local_root_span:  # is this the local root span?
            # If sample_rate is set, we need to (re)generate a trace context.
            # If zipkin_attrs (trace context) were passed in as argument there are
            # 2 possibilities:
            # is_sampled = False --> we keep the same trace_id but re-roll the dice
            #                        for is_sampled.
            # is_sampled = True --> we don't want to stop sampling halfway through
            #                       a sampled trace, so we do nothing.
            # If no zipkin_attrs were passed in, we generate new ones and start a
            # new trace.
            if self.sample_rate is not None:  # a sampling rate was provided
                # If this trace is not sampled, we re-roll the dice.
                if (
                    self.zipkin_attrs_override
                    and not self.zipkin_attrs_override.is_sampled
                ):
                    # This will be the root span of the trace, so we should
                    # set timestamp and duration.
                    return (
                        True,
                        create_attrs_for_span(
                            sample_rate=self.sample_rate,
                            trace_id=self.zipkin_attrs_override.trace_id,
                        ),  # keep the existing trace id, re-roll the sampling decision
                    )

                # If zipkin_attrs_override was not passed in, we simply generate
                # new zipkin_attrs to start a new trace.
                elif not self.zipkin_attrs_override:
                    return (
                        True,
                        create_attrs_for_span(
                            sample_rate=self.sample_rate,
                            use_128bit_trace_id=self.use_128bit_trace_id,
                        ),  # no override given, so create brand-new attrs
                    )

            if self.firehose_handler and not self.zipkin_attrs_override:
                # If it has gotten here, the only thing that is
                # causing a trace is the firehose. So we force a trace
                # with sample rate of 0
                return (
                    True,
                    create_attrs_for_span(
                        sample_rate=0.0,
                        use_128bit_trace_id=self.use_128bit_trace_id,
                    ),
                )

            # If we arrive here it means the sample_rate was not set while
            # zipkin_attrs_override was, so let's simply return that.
            return False, self.zipkin_attrs_override

        else:
            # Check if there's already a trace context in _context_stack.
            existing_zipkin_attrs = self.get_tracer().get_zipkin_attrs()  # not a root span: look up the existing attrs
            # If there's an existing context, let's create new zipkin_attrs
            # with that context as parent.
            if existing_zipkin_attrs:
                return (
                    False,
                    ZipkinAttrs(
                        trace_id=existing_zipkin_attrs.trace_id,
                        span_id=generate_random_64bit_string(),
                        parent_span_id=existing_zipkin_attrs.span_id,
                        flags=existing_zipkin_attrs.flags,
                        is_sampled=existing_zipkin_attrs.is_sampled,
                    ),  # derive new attrs from the existing context
                )

        return False, None
    def start(self):
        """Enter the new span context. All annotations logged inside this
        context will be attributed to this span. All new spans generated
        inside this context will have this span as their parent.

        In the unsampled case, this context still generates new span IDs and
        pushes them onto the threadlocal stack, so downstream services calls
        made will pass the correct headers. However, the logging handler is
        never attached in the unsampled case, so the spans are never logged.
        """
        self.do_pop_attrs = False

        report_root_timestamp, self.zipkin_attrs = self._get_current_context()  # get the current trace context

        # If zipkin_attrs are not set up by now, that means this span is not
        # configured to perform logging itself, and it's not in an existing
        # Zipkin trace. That means there's nothing else to do and it can exit
        # early.
        if not self.zipkin_attrs:  # no attrs: nothing to record, return early
            return self

        self.get_tracer().push_zipkin_attrs(self.zipkin_attrs)  # push the current attrs onto the stack
        self.do_pop_attrs = True

        self.start_timestamp = time.time()  # record the start timestamp

        if self._is_local_root_span:  # only the root span sets up the logging context
            # Don't set up any logging if we're not sampling
            if not self.zipkin_attrs.is_sampled and not self.firehose_handler:
                return self

            # If transport is already configured don't override it. Doing so would
            # cause all previously recorded spans to never be emitted as exiting
            # the inner logging context will reset transport_configured to False.
            if self.get_tracer().is_transport_configured():
                log.info(
                    "Transport was already configured, ignoring override "
                    "from span {}".format(self.span_name)
                )
                return self

            endpoint = create_endpoint(self.port, self.service_name, self.host)
            self.logging_context = ZipkinLoggingContext(
                self.zipkin_attrs,
                endpoint,
                self.span_name,
                self.transport_handler,
                report_root_timestamp or self.report_root_timestamp_override,
                self.get_tracer,
                self.service_name,
                binary_annotations=self.binary_annotations,
                add_logging_annotation=self.add_logging_annotation,
                client_context=self.kind == Kind.CLIENT,
                max_span_batch_size=self.max_span_batch_size,
                firehose_handler=self.firehose_handler,
                encoding=self.encoding,
                annotations=self.annotations,
            )
            self.logging_context.start()  # the logging context will eventually send all spans out
            self.get_tracer().set_transport_configured(configured=True)

        return self
    def __exit__(self, _exc_type, _exc_value, _exc_traceback):
        self.stop(_exc_type, _exc_value, _exc_traceback)

    def stop(self, _exc_type=None, _exc_value=None, _exc_traceback=None):
        """Exit the span context. Zipkin attrs are pushed onto the
        threadlocal stack regardless of sampling, so they always need to be
        popped off. The actual logging of spans depends on sampling and that
        the logging was correctly set up.
        """
        if self.do_pop_attrs:  # pop the attrs we pushed in start()
            self.get_tracer().pop_zipkin_attrs()

        # If no transport is configured, there's no reason to create a new Span.
        # This also helps avoiding memory leaks since without a transport nothing
        # would pull spans out of get_tracer().
        if not self.get_tracer().is_transport_configured():
            return

        # Add the error annotation if an exception occurred
        if any((_exc_type, _exc_value, _exc_traceback)):
            error_msg = u"{0}: {1}".format(_exc_type.__name__, _exc_value)
            self.update_binary_annotations({ERROR_KEY: error_msg})  # record the error as an annotation

        # Logging context is only initialized for "root" spans of the local
        # process (i.e. this zipkin_span not inside of any other local
        # zipkin_spans)
        if self.logging_context:
            try:
                self.logging_context.stop()  # root span: send all collected spans to the zipkin server
            except Exception as ex:
                err_msg = "Error emitting zipkin trace. {}".format(repr(ex))
                log.error(err_msg)
            finally:
                self.logging_context = None  # reset the context
                self.get_tracer().clear()  # clear the stored spans
                self.get_tracer().set_transport_configured(configured=False)
                return

        # If we've gotten here, that means that this span is a child span of
        # this context's root span (i.e. it's a zipkin_span inside another
        # zipkin_span).
        end_timestamp = time.time()  # not a root span: record the end time

        # If self.duration is set, it means the user wants to override it
        if self.duration:
            duration = self.duration
        else:
            duration = end_timestamp - self.start_timestamp  # compute the elapsed time

        endpoint = create_endpoint(self.port, self.service_name, self.host)  # create an endpoint
        self.get_tracer().add_span(
            Span(
                trace_id=self.zipkin_attrs.trace_id,
                name=self.span_name,
                parent_id=self.zipkin_attrs.parent_span_id,
                span_id=self.zipkin_attrs.span_id,
                kind=self.kind,
                timestamp=self.timestamp if self.timestamp else self.start_timestamp,
                duration=duration,
                annotations=self.annotations,
                local_endpoint=endpoint,
                remote_endpoint=self.remote_endpoint,
                tags=self.binary_annotations,
            )
        )  # append this span to the tracer's span storage
    def update_binary_annotations(self, extra_annotations):
        """Updates the binary annotations for the current span."""
        if not self.logging_context:
            # This is not the root span, so binary annotations will be added
            # to the log handler when this span context exits.
            self.binary_annotations.update(extra_annotations)
        else:
            # Otherwise, we're in the context of the root span, so just update
            # the binary annotations for the logging context directly.
            self.logging_context.tags.update(extra_annotations)

    def add_annotation(self, value, timestamp=None):
        """Add an annotation for the current span

        The timestamp defaults to "now", but may be specified.

        :param value: The annotation string
        :type value: str
        :param timestamp: Timestamp for the annotation
        :type timestamp: float
        """
        timestamp = timestamp or time.time()
        if not self.logging_context:
            # This is not the root span, so annotations will be added
            # to the log handler when this span context exits.
            self.annotations[value] = timestamp
        else:
            # Otherwise, we're in the context of the root span, so just update
            # the annotations for the logging context directly.
            self.logging_context.annotations[value] = timestamp

    def add_sa_binary_annotation(
        self, port=0, service_name="unknown", host="127.0.0.1",
    ):
        """Adds a 'sa' binary annotation to the current span.

        'sa' binary annotations are useful for situations where you need to log
        where a request is going but the destination doesn't support zipkin.

        Note that the span must have 'cs'/'cr' annotations.

        :param port: The port number of the destination
        :type port: int
        :param service_name: The name of the destination service
        :type service_name: str
        :param host: Host address of the destination
        :type host: str
        """
        if self.kind != Kind.CLIENT:
            # TODO: trying to set a sa binary annotation for a non-client span
            # should result in a logged error
            return

        remote_endpoint = create_endpoint(
            port=port, service_name=service_name, host=host,
        )
        if not self.logging_context:
            if self.remote_endpoint is not None:
                raise ValueError("SA annotation already set.")
            self.remote_endpoint = remote_endpoint
        else:
            if self.logging_context.remote_endpoint is not None:
                raise ValueError("SA annotation already set.")
            self.logging_context.remote_endpoint = remote_endpoint

    def override_span_name(self, name):
        """Overrides the current span name.

        This is useful if you don't know the span name yet when you create the
        zipkin_span object. i.e. pyramid_zipkin doesn't know which route the
        request matched until the function wrapped by the context manager
        completes.

        :param name: New span name
        :type name: str
        """
        self.span_name = name
        if self.logging_context:
            self.logging_context.span_name = name
zipkin_span's main job is to decide whether it is the root span and, if so, to create the logging_context; the tracer then collects every span created under the root, and finally logging_context.stop() sends them all out. The default tracer is obtained as follows.
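The child-span branch of _get_current_context above can be mimicked with a small standalone function (a hypothetical sketch with made-up names, not py_zipkin's API): the child reuses the trace_id, mints a fresh span_id, and points parent_span_id at the enclosing span.

```python
import secrets

def derive_child_attrs(existing):
    """Mimic the non-root branch: same trace_id, new span_id,
    parent_span_id set to the enclosing span's id."""
    return {
        "trace_id": existing["trace_id"],
        "span_id": secrets.token_hex(8),  # random 64-bit id, like generate_random_64bit_string()
        "parent_span_id": existing["span_id"],
        "is_sampled": existing["is_sampled"],  # sampling decision is inherited
    }

root = {"trace_id": "abc", "span_id": "1111", "is_sampled": True}
child = derive_child_attrs(root)
assert child["trace_id"] == root["trace_id"]
assert child["parent_span_id"] == root["span_id"]
assert child["span_id"] != root["span_id"]
```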
def _get_thread_local_tracer():
    """Returns the current tracer from thread-local.

    If there's no current tracer it'll create a new one.

    :returns: current tracer.
    :rtype: Tracer
    """
    if not hasattr(_thread_local_tracer, "tracer"):
        _thread_local_tracer.tracer = Tracer()
    return _thread_local_tracer.tracer


def _set_thread_local_tracer(tracer):
    """Sets the current tracer in thread-local.

    :param tracer: current tracer.
    :type tracer: Tracer
    """
    _thread_local_tracer.tracer = tracer
class Tracer(object):
    def __init__(self):
        self._is_transport_configured = False
        self._span_storage = SpanStorage()
        self._context_stack = Stack()

    def get_zipkin_attrs(self):
        return self._context_stack.get()  # peek at the current attrs

    def push_zipkin_attrs(self, ctx):
        self._context_stack.push(ctx)  # push a new set of attrs

    def pop_zipkin_attrs(self):
        return self._context_stack.pop()  # pop the current attrs

    def add_span(self, span):
        self._span_storage.append(span)  # append the span to the storage queue

    def get_spans(self):
        return self._span_storage

    def clear(self):
        self._span_storage.clear()

    def set_transport_configured(self, configured):
        self._is_transport_configured = configured

    def is_transport_configured(self):
        return self._is_transport_configured

    def zipkin_span(self, *argv, **kwargs):
        from py_zipkin.zipkin import zipkin_span

        kwargs["_tracer"] = self
        return zipkin_span(*argv, **kwargs)

    def copy(self):
        """Return a copy of this instance, but with a deep-copied
        _context_stack. The use-case is for passing a copy of a Tracer into
        a new thread context.
        """
        the_copy = self.__class__()
        the_copy._is_transport_configured = self._is_transport_configured
        the_copy._span_storage = self._span_storage
        the_copy._context_stack = self._context_stack.copy()
        return the_copy
class Stack(object):
    """
    Stack is a simple stack class.

    It offers the operations push, pop and get.
    The latter two return None if the stack is empty.

    .. deprecated::
       Use the Tracer interface which offers better multi-threading support.
       Stack will be removed in version 1.0.
    """

    def __init__(self, storage=None):
        if storage is not None:
            log.warning("Passing a storage object to Stack is deprecated.")
            self._storage = storage
        else:
            self._storage = []

    def push(self, item):
        self._storage.append(item)

    def pop(self):
        if self._storage:
            return self._storage.pop()

    def get(self):
        if self._storage:
            return self._storage[-1]

    def copy(self):
        # Return a new Stack() instance with a deep copy of our stack contents
        the_copy = self.__class__()
        the_copy._storage = self._storage[:]
        return the_copy


class SpanStorage(deque):
    """Stores the list of completed spans ready to be sent.

    .. deprecated::
       Use the Tracer interface which offers better multi-threading support.
       SpanStorage will be removed in version 1.0.
    """

    pass
As the tracer getter shows, the tracer lives in thread-local data; the trace context is kept on a Stack, while completed spans are stored in a deque. In zipkin_span.__exit__, each finished span is saved via add_span as a Span instance. Next, let's look at the logging_context flow.
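The thread-local trick can be demonstrated with a stdlib-only sketch (hypothetical names, mimicking _get_thread_local_tracer): each thread lazily gets its own tracer-like object, so concurrent requests never mix their span stacks.

```python
import threading

_local = threading.local()

def get_tracer_like():
    """Lazily create one tracer-like object per thread,
    mimicking _get_thread_local_tracer."""
    if not hasattr(_local, "tracer"):
        _local.tracer = {"stack": [], "spans": []}
    return _local.tracer

results = {}

def worker(name):
    t = get_tracer_like()
    t["stack"].append(name)
    results[name] = t  # keep the object alive for comparison

threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
for th in threads:
    th.start()
for th in threads:
    th.join()

# the two threads saw two distinct tracer objects with isolated state
assert results["t1"] is not results["t2"]
assert results["t1"]["stack"] == ["t1"]
```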
class ZipkinLoggingContext(object):
    """A logging context specific to a Zipkin trace. If the trace is sampled,
    the logging context sends serialized Zipkin spans to a transport_handler.

    The logging context sends root "server" or "client" span, as well as all
    local child spans collected within this context.

    This class should only be used by the main `zipkin_span` entrypoint.
    """

    def __init__(
        self,
        zipkin_attrs,
        endpoint,
        span_name,
        transport_handler,
        report_root_timestamp,
        get_tracer,
        service_name,
        binary_annotations=None,
        add_logging_annotation=False,
        client_context=False,
        max_span_batch_size=None,
        firehose_handler=None,
        encoding=None,
        annotations=None,
    ):
        self.zipkin_attrs = zipkin_attrs
        self.endpoint = endpoint
        self.span_name = span_name
        self.transport_handler = transport_handler
        self.response_status_code = 0
        self._get_tracer = get_tracer
        self.service_name = service_name
        self.report_root_timestamp = report_root_timestamp
        self.tags = binary_annotations or {}
        self.add_logging_annotation = add_logging_annotation
        self.client_context = client_context
        self.max_span_batch_size = max_span_batch_size
        self.firehose_handler = firehose_handler
        self.annotations = annotations or {}
        self.remote_endpoint = None
        self.encoder = get_encoder(encoding)
    def start(self):
        """Actions to be taken before request is handled."""
        # Record the start timestamp, i.e. the root span's start time.
        self.start_timestamp = time.time()
        return self

    def stop(self):
        """Actions to be taken post request handling."""
        self.emit_spans()  # emit all collected spans

    def emit_spans(self):
        """Main function to log all the annotations stored during the entire
        request. This is done if the request is sampled and the response was
        a success. It also logs the service (`ss` and `sr`) or the client
        ('cs' and 'cr') annotations.
        """
        # FIXME: Should have a single aggregate handler
        if self.firehose_handler:
            # FIXME: We need to allow different batching settings per handler
            self._emit_spans_with_span_sender(
                ZipkinBatchSender(
                    self.firehose_handler, self.max_span_batch_size, self.encoder
                )
            )

        if not self.zipkin_attrs.is_sampled:
            self._get_tracer().clear()
            return

        span_sender = ZipkinBatchSender(
            self.transport_handler, self.max_span_batch_size, self.encoder
        )  # build a batch sender for the spans

        self._emit_spans_with_span_sender(span_sender)  # package up and send all spans
        self._get_tracer().clear()

    def _emit_spans_with_span_sender(self, span_sender):
        with span_sender:
            end_timestamp = time.time()  # the end time of the whole request

            # Collect, annotate, and log client spans from the logging handler
            for span in self._get_tracer()._span_storage:
                span.local_endpoint = copy_endpoint_with_new_service_name(
                    self.endpoint, span.local_endpoint.service_name,
                )
                span_sender.add_span(span)

            if self.add_logging_annotation:
                self.annotations[LOGGING_END_KEY] = time.time()

            span_sender.add_span(
                Span(
                    trace_id=self.zipkin_attrs.trace_id,
                    name=self.span_name,
                    parent_id=self.zipkin_attrs.parent_span_id,
                    span_id=self.zipkin_attrs.span_id,
                    kind=Kind.CLIENT if self.client_context else Kind.SERVER,
                    timestamp=self.start_timestamp,
                    duration=end_timestamp - self.start_timestamp,
                    local_endpoint=self.endpoint,
                    remote_endpoint=self.remote_endpoint,
                    shared=not self.report_root_timestamp,
                    annotations=self.annotations,
                    tags=self.tags,
                )  # finally add the root span itself
            )
class ZipkinBatchSender(object):
    MAX_PORTION_SIZE = 100

    def __init__(self, transport_handler, max_portion_size, encoder):
        self.transport_handler = transport_handler
        self.max_portion_size = max_portion_size or self.MAX_PORTION_SIZE
        self.encoder = encoder

        if isinstance(self.transport_handler, BaseTransportHandler):
            self.max_payload_bytes = self.transport_handler.get_max_payload_bytes()
        else:
            self.max_payload_bytes = None

    def __enter__(self):
        self._reset_queue()  # start from an empty queue
        return self

    def __exit__(self, _exc_type, _exc_value, _exc_traceback):
        if any((_exc_type, _exc_value, _exc_traceback)):
            filename = os.path.split(_exc_traceback.tb_frame.f_code.co_filename)[1]
            error = "({0}:{1}) {2}: {3}".format(
                filename, _exc_traceback.tb_lineno, _exc_type.__name__, _exc_value,
            )
            raise ZipkinError(error)
        else:
            self.flush()  # send out the serialized data

    def _reset_queue(self):
        self.queue = []
        self.current_size = 0

    def add_span(self, internal_span):
        encoded_span = self.encoder.encode_span(internal_span)

        # If we've already reached the max batch size or the new span doesn't
        # fit in max_payload_bytes, send what we've collected until now and
        # start a new batch.
        is_over_size_limit = (
            self.max_payload_bytes is not None
            and not self.encoder.fits(
                current_count=len(self.queue),
                current_size=self.current_size,
                max_size=self.max_payload_bytes,
                new_span=encoded_span,
            )
        )
        is_over_portion_limit = len(self.queue) >= self.max_portion_size
        if is_over_size_limit or is_over_portion_limit:  # the current batch is full
            self.flush()

        self.queue.append(encoded_span)  # enqueue the encoded span
        self.current_size += len(encoded_span)

    def flush(self):
        if self.transport_handler and len(self.queue) > 0:
            message = self.encoder.encode_queue(self.queue)  # encode everything in the queue
            self.transport_handler(message)  # send the data via the root span's transport_handler
            self._reset_queue()  # reset the queue
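The batching behavior above can be sketched with a toy sender (hypothetical names; encoding and the size-in-bytes check are omitted): spans accumulate in a queue and are flushed whenever the batch reaches max_portion_size.

```python
class ToyBatchSender:
    """Flush the queue whenever it reaches max_portion_size,
    mimicking ZipkinBatchSender.add_span/flush."""

    def __init__(self, transport, max_portion_size=100):
        self.transport = transport  # callable that receives a full batch
        self.max_portion_size = max_portion_size
        self.queue = []

    def add_span(self, encoded_span):
        if len(self.queue) >= self.max_portion_size:
            self.flush()  # batch is full: ship it before enqueueing more
        self.queue.append(encoded_span)

    def flush(self):
        if self.queue:
            self.transport(self.queue[:])
            self.queue = []

batches = []
sender = ToyBatchSender(batches.append, max_portion_size=2)
for s in ("s1", "s2", "s3"):
    sender.add_span(s)
sender.flush()  # final flush, like __exit__
assert batches == [["s1", "s2"], ["s3"]]
```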
That covers the root-span processing flow. Next, let's look at the call from service 1 to service 2.
Flow from service 1 to service 2
def do_stuff():
    time.sleep(2)
    headers = create_http_headers_for_new_span()
    requests.get('http://localhost:6000/service/', headers=headers)
    return 'OK'
When this function runs, it calls create_http_headers_for_new_span:
def create_http_headers_for_new_span(context_stack=None, tracer=None):
    """
    Generate the headers for a new zipkin span.

    .. note::
        If the method is not called from within a zipkin_trace context,
        empty dict will be returned back.

    :returns: dict containing (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId,
        X-B3-Flags and X-B3-Sampled) keys OR an empty dict.
    """
    return create_http_headers(context_stack, tracer, True)


def create_http_headers(
    context_stack=None, tracer=None, new_span_id=False,
):
    """
    Generate the headers for a new zipkin span.

    .. note::
        If the method is not called from within a zipkin_trace context,
        empty dict will be returned back.

    :returns: dict containing (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId,
        X-B3-Flags and X-B3-Sampled) keys OR an empty dict.
    """
    if tracer:
        zipkin_attrs = tracer.get_zipkin_attrs()
    elif context_stack:
        zipkin_attrs = context_stack.get()
    else:
        zipkin_attrs = get_default_tracer().get_zipkin_attrs()  # get the attrs on top of the tracer's stack

    # If zipkin_attrs is still not set then we're not in a trace context
    if not zipkin_attrs:
        return {}

    if new_span_id:
        span_id = generate_random_64bit_string()  # mint a new span id
        parent_span_id = zipkin_attrs.span_id  # the current span becomes the parent
    else:
        span_id = zipkin_attrs.span_id  # reuse the current span id
        parent_span_id = zipkin_attrs.parent_span_id  # and the current parent id

    return {
        "X-B3-TraceId": zipkin_attrs.trace_id,  # propagate the trace context via headers
        "X-B3-SpanId": span_id,
        "X-B3-ParentSpanId": parent_span_id,
        "X-B3-Flags": "0",
        "X-B3-Sampled": "1" if zipkin_attrs.is_sampled else "0",
    }
Across services, the trace is propagated by setting the X-B3-TraceId and related headers on the outgoing request; the downstream service reads them and attaches its spans to the same trace. The overall process can be summarized as the flow above.
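The header-building step can be reproduced as a small standalone function (a hypothetical sketch of create_http_headers, not py_zipkin's actual code): the outgoing request carries the trace id, a freshly minted span id, and the caller's span id as the parent.

```python
import secrets

def build_b3_headers(attrs, new_span_id=True):
    """Sketch of create_http_headers: propagate the trace context
    so the downstream service joins the same trace."""
    span_id = secrets.token_hex(8) if new_span_id else attrs["span_id"]
    parent = attrs["span_id"] if new_span_id else attrs.get("parent_span_id")
    return {
        "X-B3-TraceId": attrs["trace_id"],
        "X-B3-SpanId": span_id,
        "X-B3-ParentSpanId": parent,
        "X-B3-Flags": "0",
        "X-B3-Sampled": "1" if attrs["is_sampled"] else "0",
    }

attrs = {"trace_id": "abc", "span_id": "1111", "is_sampled": True}
h = build_b3_headers(attrs)
assert h["X-B3-TraceId"] == "abc"
assert h["X-B3-ParentSpanId"] == "1111"  # caller's span becomes the parent
assert h["X-B3-Sampled"] == "1"
```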
Summary
This article gave a brief overview of how distributed tracing works; for the full details, read the Dapper paper and the material on the Zipkin site. By walking through the py_zipkin client's execution flow against a simple HTTP example, we deepened our understanding of how a Zipkin trace is produced end to end. Given my limited knowledge, corrections are welcome if you spot any mistakes.