In my previous blog post, I explored the Unified Test Runner (or simply UTR) and Hoarder. In short: UTR is a single entry point we use to run all types of tests, both locally and on our CI system, Katana. UTR sends test execution data to a database; we call this database and its surrounding web services Hoarder. This post is about how UTR and Hoarder helped us make our automation smarter, create new tools, and improve existing ones.
Hoarder Analytics UI
All the Hoarder data is available through a REST API. We built some applications on top of this API. One of them is the Hoarder Analytics UI. It helps us query different slices of test execution data. Most often we use it to find out how often a test fails, how long it takes on average, and how stable its execution time is.
For example, we can set up a Filter which tells us the statistics for the AudioClipCanBeLoadedFromAssetBundle test for the Last Month on the branch draft/trunk. Once the filter is specified and the Get Data button is clicked, Analytics UI shows the following information:
Looking at the result above, we see that AudioClipCanBeLoadedFromAssetBundle passed in 92% of its 590 executions and takes ~14 seconds on average. What's interesting here is that the deviation is almost twice the average execution time. Very often this is a good predictor that the test has some infrastructure issues. Let's figure out why!
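For scripted access, the same question the Analytics UI answers can be asked directly against the REST API. Below is a minimal sketch of that idea; the endpoint URL, parameter names, and response fields are invented placeholders rather than the real Hoarder API, but the deviation-versus-mean heuristic is exactly the one described above.

```python
import statistics

import requests

# Hypothetical Hoarder endpoint; the real API shape is not documented
# in this post, so treat the URL, params, and fields as placeholders.
HOARDER_URL = "https://hoarder.example.com/api/test-executions"

def fetch_executions(test_name, branch, period="last-month"):
    """Fetch raw execution records for one test on one branch."""
    response = requests.get(HOARDER_URL, params={
        "test": test_name,
        "branch": branch,
        "period": period,
    })
    response.raise_for_status()
    # Assumed shape: [{"result": "Passed", "durationSeconds": 7.1}, ...]
    return response.json()

def summarize(executions):
    """Compute the same stats the Analytics UI shows for a filter."""
    durations = [e["durationSeconds"] for e in executions]
    passed = sum(1 for e in executions if e["result"] == "Passed")
    mean = statistics.mean(durations)
    deviation = statistics.stdev(durations)
    return {
        "runs": len(executions),
        "pass_rate": passed / len(executions),
        "avg_seconds": mean,
        "deviation_seconds": deviation,
        # A deviation much larger than the mean often points at
        # infrastructure issues (crashes, timeouts), not the test itself.
        "suspicious": deviation > mean,
    }

executions = fetch_executions("AudioClipCanBeLoadedFromAssetBundle", "draft/trunk")
print(summarize(executions))
```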
By clicking on the test name we can drill down into the test execution history and find points of failure.
A successful run takes ~7 seconds, but in the case of failure, we wait for almost 2 minutes! If we click the Build Number link on the failed test run, we navigate back to Katana, where we can find out what exactly has failed:
There are thread dump and minidump artifacts. This means that the standalone player running the test has crashed. But still, why did it take 2 minutes to execute the test? The reason was that the test framework was expecting a 'test finished' marker, which never arrived because the player crashed. Besides the actual problem in the standalone player, it's clear that the test framework should be improved to detect crashes as soon as possible. Every second of test execution time in Katana counts.
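The improvement suggested above boils down to watching the player process while waiting for the marker, instead of waiting blindly until a timeout expires. Here is a small sketch of that idea; the marker file and the timeout value are illustrative assumptions, not the actual framework internals.

```python
import subprocess
import time
from pathlib import Path

def wait_for_test_finish(player: subprocess.Popen, marker: Path,
                         timeout_seconds: float = 120.0) -> str:
    """Wait for the 'test finished' marker, but bail out early if the
    player process dies, so a crash costs seconds, not the full timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if marker.exists():
            return "finished"
        if player.poll() is not None:  # player exited (or crashed) early
            return "crashed"
        time.sleep(0.5)
    return "timeout"
```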
Another interesting fact about the Hoarder Analytics UI is that while implementing it, we hit some performance vs. testability issues. If you want to know how we approached them, read "Stored procedures guided by tests".
Improved test report
The test execution report is often the first place we look to spot failures. This report is generated from JSON data produced by Unified Test Runner. The most significant advantage of the JSON format is that we can extend it in any way we want. For example, for every test, we embedded a list of related artifacts produced while running that test. We also inserted a command line to run a failed test locally. So when Katana displays the test execution report, it can use this information:
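To make this concrete, here is what one test entry in such an extended JSON report could look like. The field names and the command line below are invented for illustration; the post does not document the actual UTR report schema or flags.

```python
import json

# Invented schema: the real UTR report fields are not shown in this post.
test_entry = {
    "name": "AudioClipCanBeLoadedFromAssetBundle",
    "state": "Failed",
    "durationSeconds": 118.4,
    # Artifacts produced while running this test, linked from the report.
    "artifacts": [
        "artifacts/AudioClipCanBeLoadedFromAssetBundle/player.log",
        "artifacts/AudioClipCanBeLoadedFromAssetBundle/crash.dmp",
    ],
    # Hypothetical command line to reproduce the failure locally.
    "commandLine": "utr --suite=runtime --testfilter=AudioClipCanBeLoadedFromAssetBundle",
}
print(json.dumps(test_entry, indent=2))
```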
A similar version of the test report is also available when running tests locally. However, the test report isn’t perfect — there are still a lot of things that can be improved:
- Crash analyzer integration
Graphics Test Tool
Hoarder data helped us build the Graphics Test Tool. Before I jump into it, let me briefly explain the graphics tests. A graphics test works by comparing a rendered image with a reference image; if the rendered image differs too much from the reference image, the test fails. For example, if this is what is rendered:
The graphics test fails because the difference between the two images is too big:
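At its core, the pass/fail decision is a thresholded image comparison. A minimal sketch with NumPy follows; the actual metric and threshold used internally are not covered in this post, so treat these as stand-ins.

```python
import numpy as np

def images_differ(rendered: np.ndarray, reference: np.ndarray,
                  threshold: float = 0.01) -> bool:
    """Fail the graphics test if the mean per-pixel difference exceeds
    the threshold. Arrays are HxWxC with values in [0, 255]."""
    if rendered.shape != reference.shape:
        return True  # a resolution mismatch counts as a failure
    diff = np.abs(rendered.astype(np.float64) - reference.astype(np.float64))
    return diff.mean() / 255.0 > threshold
```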
We store thousands of graphics reference images in a Mercurial repository. If we change some rendering feature, it might require updating hundreds of reference images in the graphics test repository.
The Graphics Test Tool helps solve this problem. It asks Hoarder "Which graphics tests have failed for a given revision?" Knowing the answer, it can fetch all the failed images from Katana. It then displays the combined information via a web interface, which allows us to quickly spot differences, and to download and update reference images in the repository. Read more in this dedicated blog post: "Graphics tests, the last line of automated testing."
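Conceptually, the tool's pipeline is just a few steps: ask Hoarder which graphics tests failed at a revision, then pull the corresponding rendered images from Katana for review. A sketch under assumed, placeholder endpoints and response fields:

```python
from pathlib import Path

import requests

HOARDER = "https://hoarder.example.com/api"  # placeholder URLs; the real
KATANA = "https://katana.example.com"        # services are internal

def failed_graphics_tests(revision: str) -> list[dict]:
    """Ask Hoarder which graphics tests failed for a given revision."""
    r = requests.get(f"{HOARDER}/failures",
                     params={"revision": revision, "suite": "graphics"})
    r.raise_for_status()
    # Assumed shape: [{"testName": ..., "renderedImagePath": ...}, ...]
    return r.json()

def fetch_failure_images(revision: str, out_dir: str = "failed") -> None:
    """Download the rendered image for every failed test from Katana."""
    Path(out_dir).mkdir(exist_ok=True)
    for failure in failed_graphics_tests(revision):
        image = requests.get(f"{KATANA}{failure['renderedImagePath']}")
        image.raise_for_status()
        (Path(out_dir) / f"{failure['testName']}.png").write_bytes(image.content)
```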
Better visibility
Suppose you are a physics developer and you've just fixed a bug in the physics part of the Unity codebase. Now, when you run the tests, you notice a documentation test failing. Whose fault is this? Is it a known failure or just an instability? Hoarder can answer this question. All a developer has to do is paste the test name into Hoarder. If the data in Hoarder shows the failure was already there, it is not the developer's fault, and their branch can still be merged to trunk.
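The decision the developer makes here can be expressed as a tiny classification over the test's recent history on trunk, which is the data Hoarder holds. A sketch of that logic follows; the thresholds are arbitrary examples, not Hoarder's actual rules.

```python
def classify_failure(trunk_results: list[str]) -> str:
    """Classify a local failure using the test's recent trunk history.
    trunk_results: e.g. ["Passed", "Failed", "Passed", ...], newest first."""
    if not trunk_results:
        return "new failure: likely caused by this branch"
    if trunk_results[0] == "Failed":
        return "known failure: already red on trunk"
    failure_rate = trunk_results.count("Failed") / len(trunk_results)
    if failure_rate > 0.05:  # example threshold, not a real Hoarder rule
        return "known instability: fails intermittently on trunk"
    return "new failure: likely caused by this branch"

# Usage: feed it the recent trunk results fetched from Hoarder.
print(classify_failure(["Passed", "Failed", "Passed", "Passed", "Failed"]))
```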
User experience analysis
We can also see how our employees run tests locally. Here are the top 5 testing frameworks we use to run tests locally (April 2017):
| Framework type | # of runs |
| --- | --- |
| runtime | 17052 |
| integration | 14162 |
| native | 8932 |
| graphics | 6874 |
| editor | 2154 |
It is possible to drill down further and figure out usage scenarios depending on the framework type. This gives us very useful insights for improving the user experience.
Smart Test Execution
Any pull request must pass automated tests before being merged to trunk or any other production branch. However, we could not afford to run all the tests on each pull request, so we split our tests into two categories: ABV tests and nightly tests.
We required a green ABV for any pull request targeting trunk or any other release branch; nightly tests were optional. This saved a lot of execution time, but it created a hole through which red nightly tests could get into trunk. For a while, that was a big challenge for us. It is at this point that we introduced a new Queue Verification process.
We stopped merging pull requests into trunk directly. Instead, we take a number of pull requests and merge them all into a draft repository. We run all our tests on this batch of pull requests. If any tests fail, we use Mercurial bisect to find the point of failure. The pull request that introduced the failure gets kicked out of the batch. The remaining pull requests are merged to trunk. You can read more about it in this dedicated blog post: "A Look Inside: The Path to Trunk."
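Finding the offending pull request in a batch is a classic bisection: if the tests are green with the first k pull requests applied and red with k+1, the (k+1)-th one introduced the failure. The real system drives Mercurial bisect over merge commits; the sketch below just shows the search logic, assuming the batch is green without any PRs and failures are deterministic.

```python
from typing import Callable, Sequence

def find_breaking_pr(prs: Sequence[str],
                     tests_pass_with: Callable[[Sequence[str]], bool]) -> str:
    """Binary search for the first PR whose inclusion turns the tests red.
    tests_pass_with(prs[:k]) runs the suite with the first k PRs merged."""
    low, high = 0, len(prs)  # invariant: prs[:low] green, prs[:high] red
    while high - low > 1:
        mid = (low + high) // 2
        if tests_pass_with(prs[:mid]):
            low = mid
        else:
            high = mid
    return prs[high - 1]  # the PR that flipped the batch from green to red
```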
We built Smart Test Execution on top of the Queue Verification process. The idea was very simple: do not run tests whose recent execution history suggests they are very unlikely to fail.
But we still run all tests in queue verification and on trunk and release branches.
UTR and Hoarder played their role here too. Smart Test Execution was implemented by having UTR send Hoarder the list of tests it is going to run. For each test, Hoarder decides whether it should be run, based on the rules described above.
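In practice this is a round trip: UTR posts its candidate test list, and Hoarder answers with the subset that actually needs to run. The sketch below shows both halves; the endpoint, payload shape, and the "stably green for the last N runs" rule are illustrative assumptions, since the post doesn't spell out the exact rules.

```python
import requests

def filter_tests(candidate_tests: list[str], branch: str) -> list[str]:
    """UTR side: ask Hoarder which of these tests actually need to run.
    Endpoint path and payload are placeholders for illustration."""
    r = requests.post("https://hoarder.example.com/api/smart-selection",
                      json={"tests": candidate_tests, "branch": branch})
    r.raise_for_status()
    return r.json()["testsToRun"]

def should_run(history: list[str], n: int = 50) -> bool:
    """Hoarder side (conceptually): skip a test that has been stably green
    for its last n runs; it still runs in queue verification and on trunk,
    which keeps its history fresh. `history` is newest-first."""
    recent = history[:n]
    return len(recent) < n or "Failed" in recent
```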
This could lead to a situation where someone discovers that a test excluded by Smart Test Selection fails when it hits Queue Verification. That could be an unpleasant surprise. We address this by letting our developers disable Smart Test Selection.
We are also working on making Smart Test Selection smarter about which tests it excludes by analyzing code coverage data.
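Coverage-driven selection reduces to a reverse map from source files to the tests that execute them: only tests whose covered files intersect the change set need to run. A minimal sketch of that idea, with made-up file and test names:

```python
def select_by_coverage(changed_files: set[str],
                       coverage_map: dict[str, set[str]]) -> set[str]:
    """coverage_map: test name -> set of source files it executes.
    Returns the tests that touch at least one changed file."""
    return {test for test, files in coverage_map.items()
            if files & changed_files}

# Usage with hypothetical names:
coverage_map = {
    "AudioClipCanBeLoadedFromAssetBundle": {"Runtime/Audio/AudioClip.cpp"},
    "PhysicsRaycastHitsCollider": {"Runtime/Physics/Raycast.cpp"},
}
print(select_by_coverage({"Runtime/Audio/AudioClip.cpp"}, coverage_map))
```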
Even though some people were affected by this situation, overall we saved thousands of hours of Katana’s execution time.
Check out my colleague Boris Prikhodskiy’s talk about this at GTAC: GTAC2016: Using Test Run Automation Statistics to Predict Which Tests to Run.
Fun with charts
Before we started collecting data, we didn’t know exactly how many tests we were running hourly/daily/weekly/monthly. So this was one of the first things we decided to look at. In August 2015 we ran 16,559,749 tests both locally and on Katana. Since then, the company has grown, Katana has grown, and the number of tests we run every day has increased a lot. In March 2017, for comparison, there were 78,312,981 test runs — 4.5 times more than in August 2015.
If we take a look at the number of tests we run per day, we’ll see the following picture:
The lower numbers of test runs correspond to weekends. It told us two things:
Let’s look at how the number of tests run each month changed during 2016:
Why is there a spike in November? Hoarder data can tell us where all these test runs came from:
| Branch | # of runs |
| --- | --- |
| trunk | 6043317 |
| 5.5/staging | 5328429 |
| draft/trunk | 4272583 |
| 5.3/staging | 1606800 |
| xxx1 | 1362732 |
| xxx2 | 692974 |
| xxx3 | 603495 |
Looking at the top contributors, we can say that we've been busy running tests on trunk and draft/trunk. These runs come from Queue Verification and the nightly runs on trunk. Nothing surprising.
We can also notice that we ran a lot of tests against 5.5/staging, because we released 5.5 that month. We also maintain 5.4 and 5.3, and therefore we run some tests there as well.
Let's look at xxx1. We intentionally removed the real branch name. The story behind this branch was that we updated NUnit to the newest version, which could potentially affect all our NUnit-based test suites. We overused Katana to test it. Hundreds of thousands of test runs could have been avoided if we had planned this effort better: we could have run a set of smoke tests locally before running them all in Katana. It taught us that not everyone thinks of Katana as a shared resource. It was (and still is) an educational problem. There are several ways to solve it. One is to teach people how to avoid overusing Katana. Another is to scale Katana up and let people do what is easiest for them. Ideally, we should do both.
The stories behind xxx2 and xxx3 might have told us something different, but we didn't investigate those cases because they are too old. Instead, we keep an eye on current cases. For each case, because we know its source, we can figure out the exact reason and decide what to do about it.
Just for fun: a test counter device
Hundreds of our employees run tests on tens of platforms producing millions of test runs every day. We decided to build a small device, which shows us how many tests we’ve run since we started measuring this (top number), how many we ran this month (middle number), and today’s runs (bottom number).
The device is made with a Raspberry Pi connected via PCI to an LED RGB matrix. All the internals are enclosed in a custom-made 3D-printed case. It was inspired by one of the bicycle counters installed in Copenhagen.
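Feeding the display is simple periodic polling of the three aggregate counters. The sketch below shows only the polling loop; the counters endpoint is a placeholder, and driving the actual LED matrix is left out.

```python
import time

import requests

COUNTERS_URL = "https://hoarder.example.com/api/counters"  # placeholder

def fetch_counters() -> dict:
    """Assumed response shape: {"total": ..., "month": ..., "today": ...}."""
    r = requests.get(COUNTERS_URL)
    r.raise_for_status()
    return r.json()

while True:
    c = fetch_counters()
    # On the real device these three numbers go to the LED RGB matrix.
    print(f"{c['total']}\n{c['month']}\n{c['today']}")
    time.sleep(60)
```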
This device not only reminds us that Katana is a shared resource, but also that automation is a giant living creature that requires care.
Conclusion
UTR and Hoarder became fertile soil for growing new tools and improving existing ones. This is not a coincidence. Unification is a powerful technique which often leads to simplicity. Take power sockets: wouldn't it be nice if there were fewer types of them? At least device vendors would greatly appreciate it.
By unifying and simplifying the test running process, we uncovered a lot of possibilities to improve our automation. It was easier to build tools knowing there was only one type of socket to plug a tool into the test running process.
UTR and Hoarder continue to shape our automation. They play a significant role in one crucial project we are currently working on: Parallel Test Execution, which we'll cover in a future blog post.
If you want to hear more about anything I mentioned in this post, just write a comment here and follow me (@ydrugalya) on Twitter.
Source: https://blogs.unity3d.com/2017/06/15/a-look-inside-evolving-automation/