Interview: How SitePoint Manages and Prioritizes Monitoring

This post was sponsored by PagerDuty. Thank you for supporting the sponsors who make SitePoint possible!

Like many developers, I expect stuff to just work … and I throw a tantrum when it doesn’t! Behind the scenes, many technical people are working their magic to integrate hardware, software and services into workable cohesive systems. In this article I interview Jude Aakjaer about his DevOps duties and experiences at SitePoint.

Craig Buckler: Hey Jude! (Sorry, couldn’t resist.) Could you tell us who you are and what you do for SitePoint?

Jude Aakjaer: Hey Craig. Not a problem — unsurprisingly I do get that quite often!

I’m one of the developers working on products and systems at both SitePoint and Learnable. That means backend programming in Ruby and PHP but also DevOps tasks.

CB: What are the biggest challenges and issues you face daily?

JA: Definitely sorting the signal from the noise. If we jumped at every package update email and website exception we would never get any work done!

As well as updates, we also need to fix issues and bugs with the code. It doesn’t matter how good or robust your code is — errors will occur. The challenge is identifying which problems require immediate attention and which can be examined as part of a wider refactoring task.

CB: Where do you receive alerts from?

JA: We use a variety of tools to monitor different parts of applications and services.

For our Ruby on Rails websites, we use a notifier gem (Airbrake) that alerts us whenever our code throws an exception or there are other unexpected events. We also use an external monitoring website (Wormly) which is configured to detect certain HTTP responses. Lastly, we use the AWS CloudWatch monitoring service which alerts us about hardware problems or failures.

Alerts are primarily sent by SMS and email. As you can imagine, messages arrive from many applications and from many directions. We are constantly looking to improve our monitoring tools.

CB: How do you prioritize alerts? Do you judge their importance by business value, long-term importance, difficulty, whoever shouts loudest, or other factors?

JA: Alert priorities are context sensitive and we manually determine the order. Obviously if one of our websites has fallen over, that takes highest priority! Other alerts — such as disk space reaching certain levels — are scheduled into weekly review tasks and attended to in a more relaxed manner.
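
The team described here triages by hand. Purely as an illustration, the two examples above (a site that is down, and disk space reaching a threshold) could be written down as a starting rule. This is a minimal sketch: the source labels `uptime` and `disk` and all names below are hypothetical, not part of any tool mentioned in the interview.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    IMMEDIATE = "immediate"          # drop everything and respond now
    WEEKLY_REVIEW = "weekly_review"  # handle in the scheduled weekly review


@dataclass(frozen=True)
class Alert:
    source: str   # hypothetical label, e.g. "uptime" or "disk"
    message: str


# Hypothetical rule: only outage alerts interrupt people.
# A real team would tune this set by hand over time.
IMMEDIATE_SOURCES = {"uptime"}


def triage(alert: Alert) -> Priority:
    """Return how urgently an alert should be handled."""
    if alert.source in IMMEDIATE_SOURCES:
        return Priority.IMMEDIATE
    return Priority.WEEKLY_REVIEW
```

Anything not explicitly marked as urgent falls through to the weekly review, which mirrors the relaxed default described in the answer.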

Many of the processes have been in place for a number of years and we can quickly identify what needs to be done. For example, the Wormly alerts are always important. Airbrake reports application-specific issues and we’ll examine the issue frequency to decide when it should be fixed.

We encourage our developers to tackle at least one recurring error per sprint. This also allows us to keep the error reporting noise down to a minimum.

CB: How do you plan monitoring for new systems and services?

JA: Monitoring has a variety of flavors but must be considered from the start.

First, we want to monitor the actual servers the application runs on. Since we’re using AWS for deployment, the built-in CloudWatch statistics let us discover issues such as consistently high CPU and memory usage, running out of disk space or unresponsive servers.
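
CloudWatch alarms can evaluate a metric over several consecutive periods on their own. The idea behind "consistently high" is sketched below in plain Python rather than through the AWS API. The function name, the threshold and the number of periods are assumptions chosen for illustration.

```python
from typing import Sequence


def consistently_high(samples: Sequence[float], threshold: float, periods: int) -> bool:
    """Return True if the most recent `periods` samples all exceed `threshold`.

    A single spike does not count; the metric has to stay high.
    """
    if periods <= 0 or len(samples) < periods:
        return False  # not enough data to call it sustained
    return all(value > threshold for value in samples[-periods:])
```

For example, `consistently_high(cpu_percentages, 80.0, 3)` would only flag a server whose last three CPU readings were all above 80%.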

We then monitor the program code itself. The tools report fatal exceptions or unexpected events within the application.

Lastly, we monitor applications as they are seen from the outside world. The monitoring systems send HTTP requests to key pages and compare each response with the expected one, whether that is a success, a redirect, or even an error.
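
An external check of this kind can be approximated with the Python standard library. This is a rough sketch, not how Wormly works internally: the function names are hypothetical, and redirects are deliberately not followed so that an expected 301 or 302 can be observed.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the 3xx status stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request, without following redirects."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses, including unfollowed redirects, end up here.
        return err.code


def check_pages(expected: Dict[str, int],
                fetch: Callable[[str], int] = fetch_status) -> List[str]:
    """Compare each page's status with the expected one and list the mismatches."""
    failures = []
    for url, want in expected.items():
        try:
            got = fetch(url)
        except OSError as err:  # connection refused, DNS failure, timeout
            failures.append(f"{url}: request failed ({err})")
            continue
        if got != want:
            failures.append(f"{url}: expected {want}, got {got}")
    return failures
```

An empty list means every key page answered as expected. The `fetch` parameter exists only so the comparison logic can be exercised without network access.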

All our new applications and services should follow this process. Of course, sometimes something slips through. When that occurs, we write additional tests to detect that event in the future. Getting tripped up the first time a problem occurs is one thing — you’re in trouble if it occurs twice!

We employ various tools and technologies but, naturally, our requirements evolve. It’s important for tools to grow with us.

CB: What advice would you give to someone on a team that’s transitioning to a DevOps model?

JA: That’s a broad question but, at heart, it’s about understanding the concerns of both developers and system administrators. Developers want an environment which can be built and deployed quickly so they can continue with the more interesting issues of application development. System administrators want to ensure best-practice security, privacy and scalable architectures are created. There are times when these two sets of concerns conflict; a pragmatic approach is recommended.

Crucially, you should be constantly building and deploying applications and servers. Your orchestration and deployment scripts must be constantly exercised and improved. You should avoid snowflake systems which few people understand or can recreate. Ideally, aim for phoenix systems which can be burnt and reborn at a moment’s notice by anyone on the team.

Treat your servers like cattle — not pets! It’ll give you the confidence to create new stacks or scale quickly on demand.

CB: Thanks Jude. We appreciate all your efforts in keeping the SitePoint.com services up and running.

PagerDuty: Stop Incidents Becoming Emergencies

Not every company has a team of experts ready to pounce on every alert. PagerDuty can help manage incidents, increase visibility and improve collaboration. The core features:

  • PagerDuty is quick to set up and integrates with more than 100 systems
  • monitoring is aggregated in a single place: everything can be viewed on one dashboard
  • alerts are effective: use SMS, push notifications, phone calls, email or whatever method suits you
  • automated escalation policy rules can be defined, so the system can prioritize work for you
  • you can schedule, collaborate and analyze your systems with ease.

For more information, visit PagerDuty.com.

Source: https://www.sitepoint.com/manage-and-prioritize-systems-monitoring/
