Machine Learning for Startups

Latest recommended article published 2024-10-03 09:02:12

weixin_26706653

Tags: python, machine learning, artificial intelligence

Original article: https://medium.com/the-innovation/machine-learning-for-startups-7d1c411577b2

Machine learning is all the rage these days and can be a major differentiator for your startup. Unfortunately, most startups underestimate how difficult and expensive implementing ML can be. The following points are guidelines that I follow to successfully integrate machine learning into software while a startup is still in its early stages.

What you’ll need

Machine learning is essentially a set of statistical methods that can be applied to large datasets to make predictions. Whether you are focused on computer vision, robotics, recommendation systems, or any of the myriad other ways that ML can be used, you are essentially making predictions: you put an input in, and a prediction comes out. This is simple in theory, but finding the right model to accomplish what you want isn’t always so straightforward. To help facilitate your ML journey, you’ll need a way to capture, transfer, and transform data to fit your needs.

Common methods for getting started with ML

The primary problem with machine learning is that, as a startup, it can seem nearly impossible to acquire all the data you will need. You could try partnering with an organisation that already has that data, but few startups actually accomplish this. Another commonly suggested strategy is to pay for services that extract and label data, but that is usually quite expensive. As a startup you generally don’t have a lot of money and don’t want to blow your entire budget on creating a data set. You can save the expense by doing the extraction and labeling yourself, as long as you don’t have anything else you want to do for the next few months.

I’ve offered a few options and argued against all of them. So what is a startup founder to do? I’ve found that the best way to solve the problem is to plan for an incremental evolution toward machine learning, and to design your product so that your users label your data for you. The rest of this post focuses on this strategy and how to go about implementing it.

The power of traditional AI

Before machine learning was a common term, companies implemented AI using more traditional methods, and they were often really successful. Techniques such as rule-based expert systems, logic trees, and clustering algorithms like k-means are quite effective, and in many cases they are still the backbone of the ML industry today. By building your service on these technologies, you can bridge the data gap while you collect the data you need to layer in machine learning algorithms.
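
As a rough illustration of the rule-based approach, here is a minimal sketch in Python. The support-message scenario, the rule names, and the labels are all hypothetical and chosen only for the example; a real product would have its own rules.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rule:
    name: str                         # identifier, useful for auditing decisions
    condition: Callable[[str], bool]  # returns True when the rule fires
    label: str                        # label assigned when the rule fires


# Hypothetical rules for routing support messages, ordered by priority.
RULES: List[Rule] = [
    Rule("refund_keywords", lambda t: "refund" in t or "charge" in t, "billing"),
    Rule("login_keywords", lambda t: "password" in t or "log in" in t, "account"),
]


def classify(text: str, default: str = "general") -> Tuple[str, Optional[str]]:
    """Return (label, name of the rule that fired) for a message.

    The first matching rule wins; if none match, the default label is
    returned together with None.
    """
    lowered = text.lower()
    for rule in RULES:
        if rule.condition(lowered):
            return rule.label, rule.name
    return default, None
```

Returning the name of the rule that fired alongside the label makes each decision easy to log. Those logged decisions, once users confirm or correct them, can later serve as training data when you are ready to layer in a learned model.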

Make a Plan

As an early-stage startup, the tools you use matter, especially the tools that govern your company’s data. How should you pick the right data tool? Whatever you pick should be inexpensive or free to start with, have a gentle learning curve, and provide enough flexibility that your company can grow without unnecessary friction. Essentially, you’re going to want to avoid the enterprise tools.

If you ask the average data engineer which tool you should use, they’ll likely say Spark. It’s a great tool for enterprises, but the problems it solves aren’t the problems a startup experiences, so it’s best left out of your stack until you have hundreds of gigabytes to sift through. Instead, I prefer tools like Segment.io. Their service can be added to your product within minutes, and it lets you easily change which services you send your data to. I don’t get anything for singing their praises; I just really like the service. I’m sure there are similar tools out there. If you have one that you like, please leave a comment and I’ll work on adding it in.

Be Data Oriented

Building data pipelines into your system from the beginning will make it easier to add machine learning as your startup matures. You won’t have any data on day 1, though. So when you are designing your product experience, you should consider ways to get your users to label data for you. The goal is to acquire accurately labeled data for as close to free as possible.
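
One way this could look in practice: the product shows the user a suggested label, and whatever the user accepts or corrects is stored as a labeled example. The sketch below assumes a hypothetical expense-categorization feature; the field names, the `record_feedback` helper, and the JSON Lines storage format are illustrative choices, not something prescribed by any particular tool.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import Optional, TextIO


@dataclass
class LabeledExample:
    item_text: str        # the raw input the user was looking at
    suggested_label: str  # what the product proposed (e.g. from a rule)
    user_label: str       # the label the user ended up with
    source: str           # "accepted" or "corrected"


def record_feedback(
    item_text: str,
    suggested_label: str,
    user_correction: Optional[str],
    sink: TextIO,
) -> LabeledExample:
    """Append one labeled example to `sink` as a JSON line.

    If the user did not change the suggestion, the suggestion itself is
    treated as the confirmed label.
    """
    corrected = user_correction is not None
    example = LabeledExample(
        item_text=item_text,
        suggested_label=suggested_label,
        user_label=user_correction if corrected else suggested_label,
        source="corrected" if corrected else "accepted",
    )
    sink.write(json.dumps(asdict(example)) + "\n")
    return example


# Usage: an in-memory buffer stands in for a file or an event pipeline.
buffer = io.StringIO()
record_feedback("Lunch at cafe", "travel", "food", buffer)
record_feedback("Train ticket", "travel", None, buffer)
```

Keeping both the suggestion and the user's final choice means the same records can later be used to train a model and to measure how often the current rules are wrong.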

Data Expansion

When you have a lot of data, but not quite enough for an ML model, you might be able to use a data expansion strategy (commonly called data augmentation) to increase the size of your data set without merely duplicating records. In doing so you can reduce the risk of skewed, poorly generalizing models that are caused by data sets without enough variety. Depending on the data you have available, you can apply transformations to existing records to generate additional training examples. Techniques like rotating images or adjusting the pitch of audio can help expand your data set and produce better results.
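
To make the image case concrete, here is a minimal sketch that uses plain nested lists as stand-ins for grayscale images. The helper names are hypothetical, and a real project would more likely rely on an image library; the point is only to show how one record becomes several.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    # Reverse the row order, then transpose.
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(image: Image) -> List[Image]:
    """Return the original, its three rotations, and a mirrored copy."""
    variants = [image]
    current = image
    for _ in range(3):
        current = rotate_90_clockwise(current)
        variants.append(current)
    variants.append(flip_horizontal(image))
    return variants
```

Each variant keeps the label of the original image, so this only helps when the label does not depend on orientation. A rotated photo of a cat is still a cat, but a rotated "6" may read as a "9".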

Don’t reinvent the wheel

Companies like Google, Amazon, and Microsoft all have APIs that let you make predictions using their ML models. For more specialized applications that their services don’t cover, you will have to roll up your sleeves and use frameworks like TensorFlow and Keras to create models of your own. I don’t suggest doing this at an early stage unless it is absolutely crucial to your value proposition. Creating quality models from scratch can take months and, despite best efforts, can still end in failure.

Wrapping Up

It’s not easy running a startup, let alone one focused on machine learning. By being creative with how you acquire and label data, you’ll find that ML doesn’t have to be an insurmountable hurdle. I realize that the strategy I present is likely just one of many. If you have a process that you have had success with, please leave a comment.
