Data annotation tools for machine learning are advancing fast. In technical solution engineering at CloudFactory, we’re seeing new tools and new features nearly every month. One emerging feature is automation, also known as pre-annotation or auto labeling. This article will focus on some of its benefits and drawbacks.
What’s auto labeling?
Auto labeling is a feature of data annotation tools that applies artificial intelligence (AI) to enrich, annotate, or label a dataset. Tools with this feature augment the work of humans in the loop to save time and money on data labeling for machine learning.
Most tools allow you to load pre-annotated data into the tool. More advanced tools, which are evolving into platforms (e.g., a tool plus a software development kit, or SDK), allow you to leverage AI or bring your own algorithm to the tool to improve the data enrichment process by auto labeling data.
Other tools offer prediction models that suggest annotations so workers can validate them. Some features leverage embedded neural networks that can learn from every annotation made. All of these features can save time and resources for machine learning teams and will have a profound effect on data annotation workflows.
Top benefits of auto labeling
In our work with organizations using tools to annotate images for machine learning, we find auto labeling can be helpful when it is applied in a data annotation workflow in two ways:
1) Pre-annotate some or all of your dataset. Workers follow the automation to review, correct, and complete the annotations. Automation cannot annotate everything; there will be exceptions and edge cases. It’s also far from perfect, so you must plan for people to review and correct its output as necessary.
2) Reduce the amount of work sent to people. An auto-labeling model can assign a confidence level based on the use case, task difficulty, and other factors. It enriches the dataset with annotations, and sends annotations with lower confidence scores to a person for review or correction.
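The second approach, routing by confidence, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the behavior of any particular annotation tool: the `Annotation` class, the `route_by_confidence` function, the 0.8 threshold, and the sample predictions are all hypothetical. In practice the threshold would be tuned to the use case and task difficulty.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One model-generated label (hypothetical structure for illustration)."""
    item_id: str
    label: str
    confidence: float  # model score in the range [0, 1]


def route_by_confidence(annotations, threshold=0.8):
    """Split annotations into an accepted list and a human-review list.

    The threshold is an assumed value, not a recommendation.
    """
    accepted, needs_review = [], []
    for ann in annotations:
        if ann.confidence >= threshold:
            accepted.append(ann)
        else:
            needs_review.append(ann)
    return accepted, needs_review


# Made-up predictions, for illustration only.
predictions = [
    Annotation("img_001", "car", 0.97),
    Annotation("img_002", "person", 0.55),
    Annotation("img_003", "truck", 0.81),
]

accepted, needs_review = route_by_confidence(predictions, threshold=0.8)
print([a.item_id for a in accepted])      # ['img_001', 'img_003']
print([a.item_id for a in needs_review])  # ['img_002']
```

Even with a split like this, the accepted annotations would still need periodic quality checks, since a confident model can be confidently wrong.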
We’ve run time experiments in which one team used tools with an automation feature while another team manually annotated the same data. In some cases, we’ve seen auto labeling produce low-quality results, which increases the time required per annotation task. Other times, it has provided a helpful starting point and reduced task time.
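A comparison like this comes down to simple arithmetic on per-task times. The sketch below assumes that savings are measured as the percentage reduction in average time per task; the function name `percent_time_saved` and all timing values are hypothetical and are not data from our experiments.

```python
from statistics import mean


def percent_time_saved(manual_seconds, assisted_seconds):
    """Percentage reduction in average task time relative to manual labeling.

    A negative result means the assisted workflow was slower.
    """
    manual_avg = mean(manual_seconds)
    assisted_avg = mean(assisted_seconds)
    return 100.0 * (manual_avg - assisted_avg) / manual_avg


# Made-up per-task times in seconds, for illustration only.
manual = [60, 62, 58, 60]            # average: 60 s
assisted_good = [55, 53, 54, 54]     # average: 54 s
assisted_poor = [70, 74, 72, 72]     # average: 72 s (low-quality pre-labels)

print(percent_time_saved(manual, assisted_good))  # 10.0
print(percent_time_saved(manual, assisted_poor))  # -20.0
```

The negative value in the second case corresponds to the situation described above, where correcting poor pre-annotations takes longer than labeling from scratch.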
In one image annotation experiment, auto labeling combined with human-powered review and improvements was 10% faster than the 100% manual labeling process. Those time savings grew to 40% to 50% as the automation learned over time.
The automation also had a margin of error of more than five pixels for vehicles and missed the objects that were farthest from the camera. In one image, an auto-labeling feature tagged a garbage bin as a person. It’s important to keep in mind that pre-annotation predictions are based on existing models, and any misses in the auto labeling reflect the accuracy of those models.
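One way to flag pre-annotated boxes that exceed a pixel tolerance is to compare each predicted box against a human-corrected reference box. The sketch below assumes boxes are `(x_min, y_min, x_max, y_max)` tuples in pixels and that the margin of error is the largest per-edge difference. Both are assumptions made for illustration, and the function names and coordinates are hypothetical.

```python
def max_edge_error(predicted, reference):
    """Largest absolute per-edge difference, in pixels, between two boxes.

    Each box is an (x_min, y_min, x_max, y_max) tuple.
    """
    return max(abs(p - r) for p, r in zip(predicted, reference))


def needs_correction(predicted, reference, tolerance_px=5):
    """True if any edge of the predicted box is off by more than the tolerance."""
    return max_edge_error(predicted, reference) > tolerance_px


# Made-up coordinates, for illustration only.
vehicle_pred = (100, 50, 208, 120)
vehicle_ref = (102, 52, 200, 121)   # right edge differs by 8 px

sign_pred = (10, 10, 50, 50)
sign_ref = (12, 9, 53, 50)          # largest difference is 3 px

print(max_edge_error(vehicle_pred, vehicle_ref))    # 8
print(needs_correction(vehicle_pred, vehicle_ref))  # True
print(needs_correction(sign_pred, sign_ref))        # False
```

A check like this only measures box placement. It would not catch a missed object or a wrong class, such as a garbage bin tagged as a person, so those still depend on human review.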
Some tasks are ripe for pre-annotation. In a case like our experiment, for example, you could use pre-annotation to label images, and a team of data labelers could then decide whether to resize or delete the resulting labels, or bounding boxes. This reduction in labeling time can be helpful for a team that needs to annotate images at pixel-level segmentation.
Our takeaway from the experiments is that applying auto labeling requires creativity. We find that our clients who use it successfully are willing to experiment, fail, and pivot their process as necessary.
The bottom line on auto labeling
Auto labeling is a game-changer, but it’s not a slam dunk. Like most AI-powered solutions, it requires creativity and iteration along the way to successfully generate time and resource savings. Using these features can save annotation time, but you’ll still have to perform quality control checks on the work that is done.
We expect auto labeling to continue to improve, so this is an area to keep an eye on as you prepare for your next machine learning project. To learn more about data annotation tools, check out Data Annotation Tools for Machine Learning (An Evolving Guide).
Originally published at https://blog.cloudfactory.com.
Translated from: https://medium.com/@thecloudfactory/top-benefits-and-limitations-of-auto-labeling-41198fb5679d