Machine Learning's Secret Source: Curation

weixin_26746401

Original article: https://towardsdatascience.com/machine-learnings-secret-source-curation-e8c3107dcc13

Methods for Successful Machine Learning / Artificial Intelligence

It’s widely stated that data is the new oil, and like oil, data needs the right refinement before it can be put to good use. The power of a machine learning model depends significantly on the quality of its data; I’m not saying anything new here.

As AI development and its applications become ever more pervasive, ML engineers everywhere are confronted with a grim reality. Once stakeholders have overcome their biases or scepticism, bought in, identified a use case with proven ROI, and become eager to jump onto the AI ship, data curation is usually neglected and does not get the importance it is due, often because of a quick-win mentality and the fact that it’s not sexy!

There are many assumptions, even within technology groups, that AI only needs to be fed data collected and combined at scale; in most cases, this gravely backfires. Inaccurate datasets come in many forms, ranging from factually incorrect information to knowledge gaps to wrong guidelines. Among many other problems, an uncurated dataset can be:

Biased: recently, several popular AIs used for image recognition displayed disturbing gender and racial bias.
Inaccurate, unreliable or falsely represented
Error-ridden or ambiguous
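
Some of these problems can be surfaced with very simple checks before any training happens. The sketch below is only an illustration, not a complete bias audit: it assumes the dataset is a list of (input id, label) pairs, and the function name `audit_dataset` and the sample values are made up for this example. It reports how the labels are distributed (a crude first signal of imbalance) and which inputs appear with more than one label (a sign of ambiguity).

```python
from collections import Counter, defaultdict

def audit_dataset(samples):
    """Report label shares and inputs that appear with more than one label."""
    # Share of each label in the dataset; a heavily skewed share is worth a closer look.
    label_counts = Counter(label for _, label in samples)
    total = sum(label_counts.values())
    shares = {label: count / total for label, count in label_counts.items()}

    # Inputs that were given conflicting labels are ambiguous and need review.
    labels_per_input = defaultdict(set)
    for input_id, label in samples:
        labels_per_input[input_id].add(label)
    conflicts = {i: labels for i, labels in labels_per_input.items() if len(labels) > 1}
    return shares, conflicts

# Hypothetical sample data: (input id, label) pairs.
samples = [("img_01", "cat"), ("img_02", "cat"), ("img_03", "cat"), ("img_03", "dog")]
shares, conflicts = audit_dataset(samples)
print(shares)
print(conflicts)
```

A skewed label share does not prove bias on its own, but it tells you where to look first.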

Using raw datasets that have not been refined or curated is widely known to decrease feature quality and to limit the evaluation and application of transfer tasks. So how should datasets be treated so that they serve the exact purpose the ML needs? That depends heavily on the use cases the ML engineers are trying to address.

Types of Datasets for Machine Learning

ML engineers depend on data at every step of their AI journey, from model choice through training to testing. These datasets typically fall under three classifications:

Training sets
Validation sets
Testing sets

Every ML project starts with two dataset categories: the training dataset and the testing dataset.

The training dataset is used to train an algorithm: the model learns patterns from it and produces results.
The testing dataset is used to check how well the trained model performs on data it has not seen. Training data is not reused for testing because the model has already seen it and would simply produce the expected outputs.
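
As a minimal sketch of how the three sets can be carved out of one collection, the snippet below shuffles the rows and slices them. The function name `split_dataset`, the 70/15/15 proportions and the seed are assumptions chosen for illustration; real projects often need stratified or time-based splits instead.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation and test lists."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # everything left over is held out
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The important property is that no row ends up in more than one set, so the test score reflects data the model never saw.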

[Image created by the author, Steve Leven]

Data Needs for Machine Learning

Data scientists collect data from various sources, integrate it into one form, and then validate, manipulate, archive, preserve, retrieve, and present it.

The process of curating datasets for machine learning starts well before the datasets are put to use.

My suggestion:

Identify the aim of the AI
Identify what dataset you will require to solve the problem
Create a record of your hypotheses while selecting the data
Strive to collect varied and meaningful data from both external and internal sources
Create datasets that are hard for your competitors to copy (defensibility)

If you have a small dataset, a great approach can be to take a model pre-trained on large datasets and use your small dataset to fine-tune it.
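
To make the idea concrete without pulling in a deep learning library, here is a toy sketch in plain Python. It is not a real pre-trained model: `PRETRAINED_WEIGHTS` is a made-up stand-in for a network trained on a large dataset, and all function names are hypothetical. It shows one simple form of fine-tuning, where the pre-trained part is frozen and only a small classification head is trained on the small dataset.

```python
import math

# Stand-in for a frozen pre-trained feature extractor. In practice this would be
# a network trained on a large dataset; these weights are made up for illustration.
PRETRAINED_WEIGHTS = [[0.8, -0.2], [0.1, 0.9]]

def extract_features(x):
    """Frozen layer: a fixed linear map followed by tanh. Never updated below."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_WEIGHTS]

def fine_tune_head(data, epochs=500, lr=0.5):
    """Train only a small logistic-regression head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            feats = extract_features(x)
            logit = sum(w * f for w, f in zip(weights, feats)) + bias
            pred = 1 / (1 + math.exp(-logit))
            error = pred - y                  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias

def predict(x, weights, bias):
    """Classify x as 1 or 0 using the frozen features and the trained head."""
    feats = extract_features(x)
    return 1 if sum(w * f for w, f in zip(weights, feats)) + bias > 0 else 0

# Hypothetical small labelled dataset: (input vector, class) pairs.
small_dataset = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
weights, bias = fine_tune_head(small_dataset)
print([predict(x, weights, bias) for x, _ in small_dataset])
```

With a real framework the pattern is the same: load the pre-trained weights, decide which layers stay frozen, and train the rest on your own data.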

Once you have accumulated the right data, you can progress to creating the training set. This step of putting data into the optimal format is called feature transformation, and it involves four stages:

Formatting: Data arrives in different formats, and formatting brings it together in one sheet. For example, consumer data can come with different currencies, semantics and so on. These need to be compiled into a single format so that the foundation is uniform.
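
A minimal sketch of the formatting stage, assuming raw records arrive as dictionaries with inconsistent key casing and mixed currencies. The field names, the function `normalise_record` and the exchange rates in `RATES_TO_USD` are all hypothetical values chosen for illustration.

```python
from decimal import Decimal

# Hypothetical fixed rates to USD, for illustration only.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.10"), "GBP": Decimal("1.25")}

def normalise_record(record):
    """Return a copy of a raw record with lowercase keys and the amount in USD."""
    clean = {key.strip().lower(): value for key, value in record.items()}
    currency = str(clean.pop("currency")).strip().upper()
    amount = Decimal(str(clean.pop("amount")))
    # An unknown currency raises KeyError, which is better than guessing a rate.
    clean["amount_usd"] = (amount * RATES_TO_USD[currency]).quantize(Decimal("0.01"))
    return clean

raw = [
    {"Customer": "A", "Amount": "10.00", "Currency": "eur"},
    {"customer ": "B", "amount": 8, "currency": "GBP"},
]
unified = [normalise_record(r) for r in raw]
print(unified)
```

`Decimal` is used instead of floats so that monetary amounts do not pick up rounding noise during conversion.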

Labelling: Labelling ensures the dataset works for the specific model choice. For example, an autonomous car requires images labelled as cars, pedestrians, road signs and walkways.
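
One small, practical part of labelling is checking that every label belongs to the schema the model expects. The sketch below assumes samples are dictionaries with a "label" key; the label set `ALLOWED_LABELS` and the function `validate_labels` are hypothetical and only mirror the autonomous-car example above.

```python
# Hypothetical label schema based on the autonomous-car example.
ALLOWED_LABELS = {"car", "pedestrian", "road_sign", "walkway"}

def validate_labels(samples):
    """Split samples into (valid, rejected) according to the allowed label set."""
    valid, rejected = [], []
    for sample in samples:
        # Normalise spelling variants such as "Road Sign" before checking.
        label = str(sample.get("label") or "").strip().lower().replace(" ", "_")
        if label in ALLOWED_LABELS:
            valid.append({**sample, "label": label})
        else:
            rejected.append(sample)       # send back to annotators for review
    return valid, rejected

samples = [
    {"image": "0001.png", "label": "Road Sign"},
    {"image": "0002.png", "label": "bicycle"},
    {"image": "0003.png", "label": None},
]
valid, rejected = validate_labels(samples)
print(valid)
print(rejected)
```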

Cleansing: Unwanted characters need to be removed, and missing values are handled according to how important each field is.
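
A minimal sketch of the cleansing stage, under two assumptions that are not from the original text: "unwanted characters" are taken to be control characters and stray whitespace, and missing numeric values are filled with the median. Other fields may call for dropping the row or a different fill strategy.

```python
import re
from statistics import median

def clean_text(value):
    """Replace control characters with spaces and collapse repeated whitespace."""
    no_control = re.sub(r"[\x00-\x1f\x7f]", " ", value)
    return re.sub(r"\s+", " ", no_control).strip()

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

cleaned = clean_text("  hello\x00\tworld \n")
filled = impute_median([1, None, 3, 100])
print(repr(cleaned), filled)
```

The median is used here rather than the mean because it is less affected by outliers such as the 100 in the example.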

Extraction: Features are examined and optimised, keeping those that are essential for predictive capability, faster computation and lower memory consumption.
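
As one possible way to approach extraction, the sketch below drops constant columns and then keeps the columns most correlated with the target. This is a simple filter-style selection chosen for illustration, not the only or necessarily the best method; the column names, thresholds and the function `select_features` are hypothetical.

```python
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / norm if norm else 0.0

def select_features(columns, target, min_variance=1e-9, top_k=2):
    """Keep the top_k non-constant columns ranked by |correlation| with the target."""
    # Constant columns carry no information, so they are dropped first.
    varying = {name: col for name, col in columns.items() if pvariance(col) > min_variance}
    ranked = sorted(varying, key=lambda name: abs(pearson(varying[name], target)), reverse=True)
    return ranked[:top_k]

columns = {
    "age": [20, 30, 40, 50],
    "constant": [1, 1, 1, 1],
    "noise": [3, 1, 4, 1],
    "income": [21, 29, 42, 48],
}
target = [2.0, 3.0, 4.0, 5.0]
selected = select_features(columns, target)
print(selected)
```

Fewer, more informative features mean less memory and faster training, which is the trade-off this stage is after.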

The Bottom Line

A dataset alone can determine the success or failure of a machine learning model. Data curation is one of the fundamental aspects of machine learning, and if exercised correctly, it can unleash tremendous potential. The methods and subsequent processes can appear time-consuming; however, they help keep your dataset aligned with the goals of your machine learning at each step.

Introducing data curation processes and the procedures that follow into your data team will appear time-consuming and expensive in the short term; therefore, organisations must carefully analyse current objectives and develop a strategy that supports the case for curation-as-a-function. Managed services and unsupervised methods trained on curated data are available and marketed by advisory and technology firms; choose carefully, because this will play a key role in your AI future.