AI Transformation Guide: Passing the Turning Point of AI Transformation

When people talk about open-source AI projects, they usually think of model frameworks such as Google TensorFlow and PyTorch. Because the model framework is the critical component for training AI models, these projects receive the most attention. But artificial intelligence (AI) is not a single technology. It is a complex field that spans several sub-areas and many different components.

Milvus is an open-source project that provides data serving functionality for the open-source AI ecosystem.

The turning point of AI transformation

Generally speaking, the turning point of a technology upgrade is the point at which its returns far exceed its cost. For AI transformation, the fundamental factors are models (algorithms), model inference, and data service.

When it comes to models, we need to ask what we expect from AI technologies. If we want AI to beat and replace human workers, for example by replacing all customer support specialists with an AI-powered conversational bot, then the demands on the models are very high and may not be achievable in the short term.

If instead we want to relieve our customer support specialists of tedious daily routines, that is, to use AI as an amplifier of human productivity and capability, then today's models are already good enough in many scenarios.

That sounds inspiring. However, one heated debate about models is that, although some models are available to the public, the best ones (the state-of-the-art, or SOTA, models) are not. The companies that can afford to hire AI scientists own those SOTA models. So would I lose my competitive advantage if I only used public models?

People scratch their heads over this because they assume that more effective models deliver higher business value. Yet this could be wrong. In most cases, the relationship between model effectiveness and business value is neither linear nor monotonically increasing. The graph of the function looks like the one below.

[Figure: business value as a function of model effectiveness. Image by author.]

It is a piecewise function. In the first phase, before the model becomes practical for the application scenario, there is no business value. In the second phase, although a better model should in theory perform better (response time, effectiveness, etc.), the difference may not be obvious in real-world scenarios. Consider the following case.

Before a doctor can confirm whether a patient has a lung infection, the doctor needs CT images of the lungs of the suspected patient. One scan produces around 300 CT images, and an experienced doctor has to spend 5 to 15 minutes studying them. Normally this is not a problem if a doctor sees only a small number of patients. However, in extreme cases such as the COVID-19 pandemic, doctors are overwhelmed by the surge of patients.

The good news is that data scientists have tried to help doctors with computer vision technology. They trained models that can process the hundreds of CT images and provide diagnostic suggestions in seconds. As a result, doctors need only about 1 minute to review the results generated by the models.

So before machine learning was applied, reviewing the results of one CT scan took 10 minutes on average; now it takes about 1 minute. That cuts the review time by almost 90%.

What if we had a faster model that needed only 3 seconds to generate the results? Going from 1 minute and 5 seconds to 1 minute and 3 seconds does not seem attractive.
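The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the 5-second inference time of the current model is an assumption inferred from the "1 minute and 5 seconds" figure above.

```python
def seconds_per_scan(review_seconds, inference_seconds=0):
    """Total time spent on one CT scan: model inference plus human review."""
    return review_seconds + inference_seconds


def time_saved(before, after):
    """Fraction of the original time that is saved."""
    return (before - after) / before


manual = seconds_per_scan(10 * 60)      # doctor reads every image: 10 minutes
assisted = seconds_per_scan(60, 5)      # 1 minute of review + 5 s of inference (assumed)
faster_model = seconds_per_scan(60, 3)  # same review time, 3 s of inference

saving_with_model = time_saved(manual, assisted)   # roughly 0.89
extra_saving = time_saved(assisted, faster_model)  # roughly 0.03

print(f"model vs. manual review: {saving_with_model:.0%} less time per scan")
print(f"faster model vs. current model: {extra_saving:.0%} less time per scan")
```

Introducing a usable model removes most of the review time, while a faster model on top of it changes the total by only a few percent, because the human review dominates.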

What if we had a more effective model that raised the accuracy from 80% to 90%? Could doctors review fewer results? The answer is no: although the model would be wrong in only 1 case out of 10, we could not know which one. The final reviewer therefore still has to go through all the results, so no further diagnostic time is saved.
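The doctor example fits the piecewise shape described earlier. The sketch below is purely illustrative: the usability threshold, the value units, and the slope are made-up numbers chosen to show the shape, not measurements from the article.

```python
def business_value(effectiveness, usable_at=0.75, base_value=90.0, slope=10.0):
    """Toy piecewise curve of business value versus model effectiveness.

    All default parameters are hypothetical illustration values.
    """
    if effectiveness < usable_at:
        # Phase 1: the model is not yet practical, so it delivers no value.
        return 0.0
    # Phase 2: once the model is usable, most of the value is captured at once
    # and further improvements add comparatively little.
    return base_value + slope * (effectiveness - usable_at)


for e in (0.5, 0.8, 0.9):
    print(f"effectiveness {e:.0%}: value {business_value(e):.1f}")
```

Under these assumptions, crossing the usability threshold accounts for almost all of the value, while the step from 80% to 90% adds only a small increment.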

Furthermore, to lower the cost of the model inference service, we would sometimes rather sacrifice model effectiveness. Here is a real example from one of our users, a business intelligence platform that holds 55 million trademark images. The company wanted to provide a service that lets users look up the owners of these trademarks. Users search by uploading a trademark image as the query instead of typing keywords.

The technology behind this is computer vision, for instance the VGG model. If the company runs model inference on back-end servers, it has to allocate and reserve hardware resources in the data center. Another choice is to deploy a smaller model so that inference can run on edge devices (smartphones in most cases). That certainly cuts the cost of expensive inference hardware such as GPUs. It is another example showing that SOTA models cannot be the most competitive choice in every scenario.

We are already at the turning point of AI transformation. The question is how to pass it and adopt AI technologies to empower our business.

A usable model is a prerequisite. However, we cannot easily develop an AI application with a model alone. As in traditional applications, data serving is always a critical piece, and it is becoming an essential component of AI adoption. That is why we started the open-source project Milvus: to accelerate AI adoption.

The data challenge of AI adoption

Most of the data we try to process with AI technologies is unstructured. We expect the Milvus project to provide a solid foundation for unstructured data services.

People usually categorize data into three types: structured, semi-structured, and unstructured. Structured data includes numbers, dates, strings, etc. Semi-structured data usually comprises text in certain formats, such as the logs of various computer systems. Unstructured data includes pictures, video, voice, natural language, and any other data that computers cannot process directly.

It is estimated that unstructured data accounts for at least 80% of the overall digital data universe. For example, you may exchange several kilobytes of text messages with your family, friends, or colleagues every day. But a single photo taken on your mobile device, say an iPhone 11 with a 12-megapixel camera, takes several megabytes. And what if you shoot a video at 720p resolution?

People have developed technologies such as relational databases and big data platforms to process structured data efficiently. Semi-structured data can be handled by text-based search engines such as Lucene, Solr, and Elasticsearch. For unstructured data, however, which makes up the bulk of all data, people had no effective analytical methods in the past. Until the rise of deep learning in recent years, progress in unstructured data processing was unthinkable.

Unstructured data service

Embedding, a deep learning term, refers to transforming unstructured data into feature vectors through models. Since a feature vector is a numeric array, it is easy for computers to process. The analysis of unstructured data can thus be translated into vector computation.

One of the most common arguments is that a feature vector seems to be merely an intermediate result of unstructured data processing. Is it necessary to build a general-purpose vector similarity search engine? Should it be included in the models instead?

From my perspective, a feature vector is not just an intermediate result. It is the knowledge representation of unstructured data in deep learning scenarios. This is also known as feature learning.

Another argument is that, since a feature vector consists of numerical values, why not perform vector computation on existing data processing platforms such as databases, or on computing frameworks such as Spark?

To be precise, a vector is a list of numbers. This leads to two significant differences between vector computation and numerical operations.

First, the most frequent operations on vectors and on numbers are different. For numbers, addition, subtraction, multiplication, and division are the most common. For vectors, the most common requirement is to calculate similarity. Take the formula for Euclidean distance as an example: the computational cost of such vector operations is much higher than that of ordinary numerical calculations.
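As a concrete illustration, the Euclidean distance between two n-dimensional vectors a and b is the square root of the sum of the squared differences of their components. A minimal Python sketch follows; the function names are hypothetical, and the brute-force search is shown only to make the cost visible (one full distance computation per stored vector), not as how Milvus implements search.

```python
import math


def euclidean_distance(a, b):
    """sqrt((a1 - b1)^2 + ... + (an - bn)^2) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query, vectors):
    """Brute-force search: index of the stored vector closest to the query."""
    return min(range(len(vectors)),
               key=lambda i: euclidean_distance(query, vectors[i]))


stored = [[5.0, 5.0], [1.0, 2.0], [9.0, 0.0]]
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
print(most_similar([1.0, 1.0], stored))
```

A single comparison of two numbers is one operation, whereas one distance computation touches every dimension of both vectors, and a brute-force search repeats that for every stored vector.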

Second, the data is indexed differently. Two numbers can be compared directly, so a numeric index can be built on an algorithm such as a B-tree. Two vectors cannot be compared in that way; we can only calculate the similarity between them. So a vector index is usually based on approximate nearest neighbor (ANN) algorithms.
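The contrast can be sketched in Python. The first part uses a sorted list with binary search as a stand-in for an ordered index such as a B-tree. The second part is a toy approximate index in the inverted-file style: vectors are grouped by their nearest centroid and only one group is searched. The class name, the hand-picked centroids, and the data are all hypothetical, and this is not a description of how Milvus builds its indexes.

```python
import bisect
import math

# Numbers have a total order, so an ordered structure can locate a value
# by comparisons alone.
prices = [3, 8, 15, 23, 42]
position = bisect.bisect_left(prices, 15)


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, candidates):
    """Index of the candidate closest to the query."""
    return min(range(len(candidates)), key=lambda i: distance(query, candidates[i]))


class CoarseBucketIndex:
    """Toy approximate index: one bucket per centroid, search a single bucket."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = [[] for _ in centroids]

    def add(self, vector):
        # Store the vector in the bucket of its nearest centroid.
        self.buckets[nearest(vector, self.centroids)].append(vector)

    def search(self, query):
        # Only the bucket nearest to the query is scanned, which is why the
        # result is approximate: the true nearest neighbor may live elsewhere.
        bucket = self.buckets[nearest(query, self.centroids)]
        if not bucket:
            return None
        return bucket[nearest(query, bucket)]


index = CoarseBucketIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
for v in [[1.0, 1.0], [0.5, 2.0], [9.0, 9.5], [11.0, 10.0]]:
    index.add(v)
result = index.search([8.0, 9.0])
print(position, result)
```

The design trade-off is that scanning one bucket instead of the whole collection reduces the number of distance computations at the price of occasionally missing the exact nearest neighbor.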

[Figure. Image by author.]

Because of these significant differences, we find it hard to meet the requirements of vector analysis with traditional database and big data technologies. The algorithms they support and the scenarios they focus on are all different.

We are not reinventing the wheel in the Milvus project.

[Figure. Image by author.]

Finding use cases

Links to use cases:

  1. Milvus × VGG: Building a Content-based Image Retrieval System

  2. Smarter Housing Search and Recommendation Powered by Milvus

  3. Accelerating New Drug Discovery

Join the open-source community

The Milvus project is an incubation project hosted by the LF AI Foundation.

If you are interested in the Milvus project, want to build AI applications on it, or would like to take part in its development, please feel free to join the Milvus community.

If you want to learn more about the project, please follow the publication Unstructured Data Service.

About the author

I am Jun Gu, a database engineer and one of the enthusiastic organizers of the Milvus open-source community.

I am currently a member of the Technical Advisory Council (TAC) at the LF AI Foundation. However, my longest working experience was in the fintech domain: I served in the Enterprise Infrastructure department of Morgan Stanley for 8 years.

Translated from: https://towardsdatascience.com/passing-the-turning-point-of-ai-transformation-4855bc9742a1
