The Democratization of AI Is Irrelevant, Data Silos, and How to Build an AI Company

Opinion

When I first heard the phrase, “data is the new oil”, I wrote it off as clever marketing.

But having worked in machine learning for a few years, I take it back.

It’s an understatement.

The right data can launch a company, create jobs, and solve real problems.

If only it were that easy.

You cannot just build an AI company

The smartest AI scientists cannot train a model without data.

This poses a chicken-and-egg problem for startups.

You need data to build an “AI Company”. But you need a functioning company to collect data in a specific domain.

This explains why:

companies pretend to do AI
very few products have AI at their core
big companies (with data) have a huge advantage

Solutions include data partnerships with larger companies, and utilizing public datasets.

But I propose an alternative.

Build a business (software or otherwise), collect data, and THEN use ML to augment that business.

Make data collection a priority from day one. Lack of data is a bigger impediment to AI than lack of talent.
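
One low-effort way to start collecting data from day one is to append every business event to a log file that can be turned into training data later. The sketch below is only an illustration: the function names `log_event` and `load_events`, the JSON Lines layout, and the example event types are all assumptions, not something the article prescribes.

```python
import json
import time
from pathlib import Path


def log_event(path, event_type, payload):
    """Append one event as a JSON line (hypothetical schema: ts, type, payload)."""
    record = {"ts": time.time(), "type": event_type, "payload": payload}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_events(path, event_type=None):
    """Read events back, optionally keeping only one event type."""
    events = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines in the log
            record = json.loads(line)
            if event_type is None or record["type"] == event_type:
                events.append(record)
    return events
```

Calling `log_event("events.jsonl", "order_placed", {"sku": "A1", "qty": 2})` wherever the business already handles an order costs almost nothing, and the accumulated file becomes a private dataset once there is enough of it to train on.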

You do not need big data, you need niche data

Self-driving cars and AI-powered drug discovery require huge amounts of data.

But if you minimize the scope of a problem, you often don’t need much data.

Recipe generation in a specific cuisine, optimizing water levels for greenhouse tomatoes and brewing the perfect shot of espresso likely don’t require a million data points.

If you can automate a tiresome/time-consuming piece of work with a combination of narrow models, domain knowledge and hardcoded logic, you’ve built something valuable.
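
As a rough illustration of combining a narrow model, domain knowledge and hardcoded logic, here is a minimal sketch built on the greenhouse tomato example. Everything in it is an assumption made for illustration: the `WateringAdvisor` class, the thresholds, and the four toy observations are invented, and a real system would use measured data and an agronomist's limits.

```python
from dataclasses import dataclass


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b on a small dataset."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


@dataclass
class WateringAdvisor:
    slope: float
    intercept: float
    min_litres: float = 0.5          # hypothetical domain limit
    max_litres: float = 3.0          # hypothetical domain limit
    wet_soil_threshold: float = 0.8  # hypothetical moisture fraction

    def recommend(self, temperature_c, soil_moisture):
        # Hardcoded logic: skip watering when the soil is already wet.
        if soil_moisture >= self.wet_soil_threshold:
            return 0.0
        # Narrow model: linear estimate of litres from temperature.
        estimate = self.slope * temperature_c + self.intercept
        # Domain knowledge: keep the estimate inside safe limits.
        return max(self.min_litres, min(self.max_litres, estimate))


# Toy, made-up observations (temperature in Celsius, litres per plant).
temps = [18.0, 22.0, 26.0, 30.0]
litres = [1.0, 1.4, 1.8, 2.2]
slope, intercept = fit_line(temps, litres)
advisor = WateringAdvisor(slope, intercept)
```

The model here is deliberately tiny. The point is that a handful of niche data points plus rules from someone who knows the domain can already produce a usable recommendation, for example via `advisor.recommend(24.0, 0.3)`.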

General AI is overrated. Solve a specific problem in a specific domain.

The democratization of AI is overhyped

By definition, democratization means increasing access for those without knowledge and resources.

In reality, it’s marketing from companies providing AI-powered APIs.

The ability to easily add AI-powered chat, image recognition, or sentiment analysis is great for mildly augmenting an existing product.

But not for building the core of a product.

It provides no moat against other companies using the same API
You’re giving away hard-earned data that a larger company will use to train its models
The API could be deprecated one day

You need to own your models to build a sustainable enterprise.

Silo your own data

I wish there were enough open data for everyone. There isn’t.

For an AI startup, data is your moat.

Big companies have huge amounts of siloed data.

Google has browsing history
Facebook has your images, friends and interests
Amazon has purchase history

This combined data could spawn a hundred new companies. But it won’t.

You need your own private store of data. You can open source what you’ve built after you’re successful.

Supercharge your data with domain knowledge

If data is a moat, data + domain knowledge is an ocean.

Most of the real opportunities are solving problems you wouldn’t know about unless you worked in a field.

Given 20 years of granular weather data, I could come up with a few potential startup ideas. But a farmer, a general contractor, or a logistics company could come up with use cases I couldn’t imagine. The world’s real problems are better suited to domain experts than to teams of engineers at FAAMG.

Labelling data that requires domain expertise is near-impossible to outsource. I’ve tried. Outsourcing the labelling of dog vs. cat images is easy. But classifying legal cases takes an expert.

You probably need to label your own data. It sucks. But it’s a good thing if only you can do it.

Do not over-rely on public data sets

If you find a great opportunity to use a public dataset, take it. But anecdotally, these are few and far between.

A product built on a public dataset is less defensible because anyone can use the same data, and you likely can’t generate additional data points unless the dataset is updated.

As a proportion of data in the world, public datasets make up a tiny fraction of a tiny fraction.

To riff on my previous example, finding images of dogs vs. cats is easy. Finding images of hamburger buns without enough seeds is hard.

In my experience, this effect is even more pronounced in NLP than in images.

Collect your own data.

This is an opportunity

AI has a data problem. And we know the future has more AI. So solving that is an opportunity.

Governments can spur innovation with data

Governments own a lot of data. Not all of it is sensitive.

Open data and data partnerships can attract companies to solve specific problems and, given the right data, generate economic value.

This Fish Hackathon comes to mind.

Sell shovels to the miners

During historical gold rushes, selling shovels was more profitable than mining. Supporting AI companies with data is a product. We need more:

data marketplaces
the ability to rent domain expertise
data curation

Solving the data problem is an integral part of solving problems with AI.

Conclusion

Disclaimer: I conflate AI with ML.

There are a lot of problems that artificial intelligence has the potential to solve. AI requires data to do that.

Being a machine learning expert is not enough. Acquiring and creatively using data is a business problem that comes first.

This isn’t easy, and that’s a good thing. When you have the data, it provides an advantage over the competition and over others with technical expertise.

Translated from: https://towardsdatascience.com/data-over-everything-abbeb9ee758