深度学习 大数据集处理
Machine learning is data-driven. Most artificial intelligence (AI) practitioners would agree that dataset ingestion, data processing, data cleansing, and data management take more than 90% of the development effort of a total machine learning project. Even though many tested data collection tools (such as Spark, Kafka, and Hadoop) are very well discussed, the topic of dataset management, surprisingly, doesn’t come up very often within the machine learning community.
机器学习是数据驱动的。 大多数人工智能(AI)从业人员都同意,数据集提取,数据处理,数据清理和数据管理将花费整个机器学习项目90%以上的开发工作。 尽管对许多经过测试的数据收集工具(例如Spark,Kafka和Hadoop)进行了很好的讨论,但令人惊讶的是,在机器学习社区中,数据集管理的主题并不经常出现。
At Salesforce, we knew this was a crucial part of our internal AI development process, so we designed a unique system for organizing datasets that aims to protect data integrity for updates, improve data retrieval efficiency, and facilitate lifelong learning.
在Salesforce,我们知道这是内部AI开发流程的关键部分,因此我们设计了一个用于组织数据集的独特系统,旨在保护数据完整性以进行更新,提高数据检索效率并促进终身学习。
数据集管理挑战 (Dataset Management Challenges)
Before talking about our system design, let’s first take a look at the major challenges of a typical dataset management system.
在讨论我们的系统设计之前,让我们首先看一下典型数据集管理系统的主要挑战。
终身学习 (Lifelong Learning)
As we said from the beginning of this post, machine learning training is data-driven. A model is typically trained with stationary batches of data, but our world is changing everyday; users of the model demand that it perform well in a dynamic environment, meaning that the dataset management system needs to absorb incrementally available data from non-stationary data sources and keep the training data updated.
正如我们从本文开头所说的那样,机器学习培训是由数据驱动的。 通常使用固定的数据批次训练模型,但是我们的世界每天都在变化。 模型的用户要求它在动态环境中表现良好,这意味着数据集管理系统需要吸收来自非平稳数据源的增量可用数据并保持训练数据更新。
数据集版本控制 (Dataset Versioning)
Machine learning is an iterative process. When we observe model performance regression after a retrain, comparing the datasets in the two training runs could reveal a lot of details that are critical for troubleshooting. Unlike a single file, a training dataset is usually a large group of binary files. How to version different groups of files and efficiently retrieve these files with the version requires deliberate thinking.
机器学习是一个反复的过程。 当我们在重新训练后观察模型性能回归时,比较两次训练运行中的数据集可能会发现许多对故障排除至关重要的细节。 与单个文件不同,训练数据集通常是一大组二进制文件。 如何对不同组的文件进行版本控制以及如何通过版本有效地检索这些文件需要认真思考。
使用严格的访问控制策略训练数据 (Training data with strict access control policy)
There are some highly sensitive datasets that require strict access, such as customer sales records and payment records. But training systems usually operate with normal security standards, making it impossible to train with these valuable but highly sensitive datasets.
有些高度敏感的数据集需要严格的访问权限,例如客户销售记录和付款记录。 但是培训系统通常以正常的安全标准运行,因此无法使用这些有价值但高度敏感的数据集进行培训。
多租户数据集管理 (Multi Tenant Dataset Management)
There could be different users and organizations sharing one training system. This requires the data management system to restrict data access only to the authenticated data owner and not to share data or grant access outside of its tenant.
可能会有不同的用户和组织共享一个培训系统。 这要求数据管理系统仅将数据访问限制为仅通过身份验证的数据所有者,而不能共享数据或在其租户外部授予访问权限。
数据可访问性 (Data Accessibility)
The Internet is full of data. The list of data collectors and tools seems to keep growing forever. We are building new data ingestion pipelines almost everyday, bringing in fresh and valuable data, but this also introduces lots of complexity to downstream model training components. A well-thought-out dataset management system should encapsulate all the details (such as data source, tools etc) and only expose some simple APIs for retrieving training data.
互联网上充满了数据。 数据收集器和工具的列表似乎永远都在增长。 我们几乎每天都在建立新的数据摄取管道,以获取新鲜且有