

How are the numbers looking?

Working in tech start-ups, we are often asked about the metrics of each technical component in production. Ambitious new start-ups have the “build it first, and they will come” mentality. Whether their offerings are going viral or dead in the water, they prefer shipping new features to thinking about platform reliability.

Bigger shops with an analytics person (or team) on board can put together analysis on which direction the business should go. However, as the product becomes bigger and more complex, it will grow to have a temperament of its own.

You may be familiar with these customer service reports:


“Customer is saying the form loads for them, but it does not complete.”


“Customer is saying their identification request has not completed in a week. Shouldn’t it finish in three days?”


“Customer is complaining about the product not loading and blank documents are shown.”


If you have heard reports like those, you may also be familiar with this internal chatter:


“I thought we rolled out that fix last month, why is it coming back?”


“Didn’t we QA for international names?”


“Are we being hacked? Why is the site behaving so badly?”


For every customer report you hear about, ten to a hundred other customers reloaded, retried, and gave up. By the time the engineering team hears about it, the problem has impacted thousands or more.

How can we stay on top of these technical issues? This is where instrumentation and observability come into the picture.

Is your code instrumented today?

  • Can you give me the number of successfully handled requests vs failed ones?

  • How about the number of times where a customer order was inserted into the database?

  • If this data is in a database and in remote log storage, how much effort would it take you to put together a report?


  • How about a report that updates hourly?


Conventional troubleshooting relies on building pattern-matching rules on log files. In some cases, operators log into the server to look at logs directly. The more elements there are in the system, the more places errors can appear, seemingly out of thin air. We know the mantra of keeping everything as simple as possible, of course, but some problems do require taking on additional expertise and complexity. At the end of the day, you may have more logs than time available to scan through them all.

How do we detect warning signs before they impact business revenue, given limited team bandwidth?


The answer is: learn about open source instrumentation systems. The two products highly recommended by the community are Grafana and Prometheus: https://radar.cncf.io/2020-09-observability

Forget about logs. Focus on metrics first.

Instrumentation allows us to keep tabs on a program's current state.


  • We can declare a counter that gets increased whenever a record is successfully inserted into the database.

  • We can measure the amount of time an external system takes to handle our requests.


  • We can measure the current environment's CPU/memory usage to check for possible memory leaks.
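
As a rough illustration of the first two bullets, the sketch below keeps a counter and a timing measurement in plain Python. The names `Counter`, `timed`, `orders_inserted`, and `insert_order` are hypothetical and chosen only for this example, and the sleep stands in for a real external call. In a real service the Prometheus client library would provide the counter and timing types instead.

```python
import time
from contextlib import contextmanager


class Counter:
    """A value that only ever goes up, e.g. rows inserted so far."""

    def __init__(self, name):
        self.name = name
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


@contextmanager
def timed(durations):
    """Append the elapsed wall-clock seconds of the wrapped block to `durations`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations.append(time.perf_counter() - start)


orders_inserted = Counter("orders_inserted_total")  # hypothetical metric name
external_call_seconds = []


def insert_order(order):
    """Stand-in for a database insert plus a call to an external system."""
    with timed(external_call_seconds):
        time.sleep(0.01)  # pretend the external system takes about 10 ms
    orders_inserted.inc()  # count only after the work succeeded


for order_id in range(3):
    insert_order({"id": order_id})

print(orders_inserted.name, orders_inserted.value)
print("slowest external call (s):", round(max(external_call_seconds), 3))
```

The counter answers "how many times did this happen?", while the list of durations answers "how long did it take?". Those are the two questions most dashboards start with.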

Whereas logs allow an investigator to pinpoint exactly where a user journey goes wrong, metrics build a top-level model for the team to operate with.


Imagine a person going on a diet and an exercise routine:

They are instructed to keep tabs on the calorie content of the food they eat, and on the intensity and length of the exercises they perform. Lastly, they must record their weight at regular intervals! When we are not thinking in terms of metrics, we lack the proper units to even frame our goals. A diet routine without numbers may work very well, but we cannot be certain until we measure with numbers.

The same can be said for creating and maintaining software offerings. If we are not thinking in terms of metrics such as:


  • 99th percentile request response time

  • server uptime

  • error and disconnect rates


We are already lost in terms of quality. We can make tweaks, ship fixes, and push new features to our platform, but we cannot be sure whether they make matters better or worse. When flying blind, the only certainty is that the errors in the logs have stopped. But was that due to our fix, or would a restart have fixed it too? Who knows?!
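
To show what two of the metrics listed above look like as numbers, here is a minimal sketch that computes a 99th percentile response time and an error rate. The sample data is made up, the helper names `percentile` and `error_rate` are hypothetical, and the nearest-rank method used here is only one of several ways to define a percentile. Prometheus and Grafana would normally compute these for you.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 0 and 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(pct * n / 100), which avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def error_rate(failed, total):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


# Hypothetical response times in milliseconds: 98 fast requests and 2 slow ones.
response_ms = [20] * 98 + [900, 1500]

p50 = percentile(response_ms, 50)
p99 = percentile(response_ms, 99)
rate = error_rate(failed=3, total=200)

print(f"median={p50} ms, p99={p99} ms, error rate={rate:.1%}")
```

With this data the median is 20 ms while the 99th percentile is 900 ms. A team watching only the typical request would never see the slow ones that customers complain about.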

Forget about logs. Focus on metrics first.


“But we are too busy to spend time on measurements!”

No one is perfect. We have a limited amount of time for choosing winning strategies and implementing them. Writing code without customer feedback, and without insight into how the code is running, is a frightening reality many developers face.

If you care about winning and staying in business, you need to keep your customers happy and your services reliable. Just as the person dieting and exercising must count calories, time their exercises, and record their weight, developer teams can sit down to figure out which numbers to measure, then leverage Prometheus and Grafana to keep measuring them.

You literally cannot set objectives without measurements.


“OK, but how many steps are there?”

Now on to the business of monitoring itself. Here are the 8 steps any engineer can follow to get insights into their platform:

  1. Install Prometheus. It retains 15 days of metrics by default, which is enough for most start-ups. If you are a bigger business that needs longer data retention, you know who to reach out to.

  2. Install Grafana. Ideally, it would use an external database (such as Postgres) instead of the internal sqlite3 database. We want all monitoring-related components to be as reliable as possible.

  3. Install node_exporter on our virtual machines. Packages are available for Ubuntu, CentOS, and other flavors of Linux. This lightweight agent helps monitor resource usage on the machines.

  4. Import the Prometheus client library into the application, and start by tracking the number of errors and exceptions occurring in the system.


  5. Configure Prometheus to scrape both application metrics and node_exporter machine metrics. Verify that samples are flowing through.

  6. Set up Prometheus as a data source in Grafana, and create your very first dashboard to see how many errors are occurring in the system.

  7. Set up a Slack notification channel in Grafana, so you can be warned when the error rate rises.

  8. Iterate, add, and refine metrics as the team becomes more knowledgeable on the types of metrics it cares about.
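
To make steps 4 and 5 a little more concrete, here is roughly what Prometheus reads when it scrapes an application. This is a minimal sketch in plain Python, assuming two made-up counters (`app_errors_total` and `app_requests_total`) and a hypothetical helper named `render_metrics`. In practice the Prometheus client library keeps the counters and produces this text for you, so treat the code only as an illustration of the text format.

```python
def render_metrics(counters):
    """Render {name: (help_text, value)} in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical application counters, as they might stand at scrape time.
counters = {
    "app_errors_total": ("Errors and exceptions raised by the application.", 7),
    "app_requests_total": ("Requests handled by the application.", 1200),
}

body = render_metrics(counters)
print(body, end="")
```

Each scrape returns the current totals. Prometheus stores them as samples over time, and a Grafana panel can then chart how fast the error counter is growing.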


“How long will this effort take?”

How long should these items take?


For smaller footprints of under twenty machines and five applications, this exercise in instrumenting and monitoring will take a single engineer no longer than two weeks. That's 80 hours. Hire a contractor, and you will be done, complete with training and documentation, within a month. This is much less time than the many hours the engineering team would otherwise spend reading through logs in the future. Once the pipeline is established, many more types of measurements become possible.

Better metrics can lead to better business decisions.


Conclusion

Just as you wouldn't trust a hospital that treats you without taking measurements, we cannot be sure of our product's reliability until we actually look at the numbers. Are you serious about service reliability, but not sure where to begin?

Reach out to us with questions about observability at info@teamzerolabs.com


