My framework for helping startups build and deploy data science

I help startups go from “product” to “product + machine learning”.

This is my framework for achieving that, including advice, caveats and examples at each stage.

While every company, problem and data set is different, they always have a lot in common.

This framework revolves around building a proof-of-concept as soon as possible, then incrementally improving it. This follows from my experience in ML that you don’t know whether something will work until you try it.

I mostly work on natural language processing, but this framework applies equally to images and numeric data.

Start with a problem or data

Companies with ML potential fall into two buckets:

  1. Start with a problem (to solve with data)
  2. Start with data (to extract value from)

Anecdotally, tech companies fall into bucket #1 and non-tech companies into bucket #2.

Starting with a problem

You have a problem that ML may be able to solve.

Example: A startup wants to recommend which vegetables can be grown, given geography and environmental conditions. They don’t have any data related to this space.

The first step is brainstorming what data is required.

Data can then be acquired via strategic partnerships, web scraping or open data sets.

Starting with data

You own data (and likely a functioning business) and want to derive additional value from that data.

Example: A uniform manufacturer owns granular movement data about each of its sales reps.

The first step is brainstorming potential use cases for the data.

In this example, it could be detecting which salespeople are the least efficient at navigating their territory, so they can be proactively encouraged to improve.

Starting with data (and an existing business) is great because marginal improvements are valuable. Starting with a problem only makes sense if solving it is aspirational and game-changing.

Data exploration

Understanding your data is important, but it often gets too much attention in data science projects.

Data exploration helps find gaps between what you think the data looks like and what it actually looks like. This is not the place to show off your data visualization skills.

Investigate data types, distributions of values, and whether anything is missing or dirty.
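
As a rough illustration, the checks above can be sketched in plain Python. This assumes the data is already loaded as a list of dicts; the column names and sample rows are invented for the example, and the `profile` helper is hypothetical rather than part of any library.

```python
from collections import Counter

# Hypothetical sample rows; in practice these come from your own data source.
rows = [
    {"crop": "tomato", "rainfall_mm": 620, "region": "north"},
    {"crop": "carrot", "rainfall_mm": None, "region": "north"},
    {"crop": "tomato", "rainfall_mm": "n/a", "region": ""},
    {"crop": "kale", "rainfall_mm": 480, "region": "south"},
]


def profile(rows):
    """Summarize value types, missing values and common values per column."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing (an assumption for this sketch).
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "types": Counter(type(v).__name__ for v in present),
            "missing": len(values) - len(present),
            "top_values": Counter(present).most_common(3),
        }
    return report


report = profile(rows)
for col, summary in report.items():
    print(col, summary)
```

Here the mixed `int` and `str` types in `rainfall_mm` are the kind of gap between expectation and reality that exploration is meant to surface.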

List out popular solutions

Whether you plan on coding a solution from scratch or using an API on AWS, make a list of potential algorithms, libraries and APIs.

Example: Your product detects whether an image contains a poisonous mushroom.

List out how this might be solved. Options include Fastai, Keras, Sklearn, Amazon Rekognition and some other niche classification providers.

This is high level and not intended to be exhaustive. Simply find the popular options you can run with.

Split the data into test and train sets

Models are trained on the train set and evaluated on the test set.

I have 2 pieces of advice here:

  1. Don’t look at examples in your test set. Otherwise you’ll be inclined to add similar examples to your train set. This will cause you trouble later on, when your model performs worse in production than on the test set.
  2. Ensure the distribution of classes in your test and train sets is similar. This will help train more accurate models. Some tools, like sklearn’s stratified shuffle split, are designed to do this.

If non-technical colleagues are involved in this data science process, you’ll likely get pushback on #1. Be prepared to push back.
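
A minimal sketch of a stratified split, assuming scikit-learn is installed. It uses `train_test_split` with the `stratify` argument, which relies on stratified shuffle splitting internally; the file names, labels and 80/20 ratio are placeholders chosen for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical unbalanced data: 90 "edible" examples and 10 "poisonous" ones.
examples = [f"image_{i}.jpg" for i in range(100)]
labels = ["edible"] * 90 + ["poisonous"] * 10

# stratify=labels keeps the class proportions the same in both sets.
train_x, test_x, train_y, test_y = train_test_split(
    examples, labels, test_size=0.2, stratify=labels, random_state=0
)

print(Counter(train_y))
print(Counter(test_y))
```

Both sets keep the 9:1 class ratio. Without `stratify`, a small test set could easily end up with no poisonous examples at all.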

Solve the problem as simply as possible

Pick the fastest-to-implement solution you found above and run your data through it using the simplest acceptable level of pre-processing.

That means no feature engineering, a bag-of-words for vectorizing your text, minimal transformations on training images, and so on.

If a pre-trained model can accomplish your task, use that!

The point is to get a result, even if it’s not great.
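
For a text problem, a first pass along these lines could look like the sketch below. It assumes scikit-learn; the toy reviews, the labels and the choice of logistic regression are illustrative assumptions, not a recommendation from the original framework.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; a real project would use the train set.
train_texts = [
    "great product, works well",
    "love it, great value",
    "terrible, broke after a day",
    "awful quality, do not buy",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words counts fed into a default classifier: no feature
# engineering, no tuning, just the quickest route to a first result.
baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(train_texts, train_labels)

predictions = baseline.predict(["great value", "awful, broke quickly"])
print(predictions)
```

Everything here uses library defaults on purpose. The aim is a number to beat later, not a good model.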

Evaluate results

Choose a metric for success.

For a classification problem, this will be F1, precision and recall. Beware of “accuracy”, which is misleading on unbalanced data sets.

For other problems, like regression or recommendation engines, you’ll need to decide on other measures.

At the end of the day, you need “some” numeric measure of success to objectively compare different solutions.
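
To see why accuracy misleads on unbalanced data, here is a small plain-Python sketch. The labels are made up, and the "model" is a deliberately useless one that always predicts the majority class.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical unbalanced test set: 95 safe mushrooms, 5 poisonous ones.
y_true = ["safe"] * 95 + ["poisonous"] * 5
# A useless model that always predicts the majority class.
y_pred = ["safe"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="poisonous")
print(accuracy, precision, recall, f1)
```

The model scores 95% accuracy while never catching a single poisonous mushroom, which precision, recall and F1 on the poisonous class make obvious.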

This is your baseline

Based on the above, you have a mark that all future models can be measured against.

At this point, the result may be terrible. But now you have a sense of the complexity of the problem and how close you are to solving it.

If you don’t already have one, now’s the deadline for deciding on a minimum acceptable success threshold. This is the mark you need to achieve before deploying to production.

Keep in mind that model results vary depending on the volume of data and type of pre-processing. Set your intention to compare apples to apples as you move forward and try other models and setups.

List your levers

What levers can we pull to ratchet our results up?

List them all. This list should be exhaustive, but geared to the specific problem at hand.

Examples:

  • data cleaning (removing weird characters from text)
  • vectorization (bag-of-words, embeddings…)
  • stemming and lemmatization
  • feature engineering (handcrafted and composite features)
  • feature reduction (PCA, top-n-features)
  • ngrams
  • model selection
  • hyperparameter tuning
  • class weighting (some models require a balanced number of classes for training)

Beat the baseline by pulling levers

Improve your model results by pulling levers.

Note the effect of pulling each lever on results. This will build your intuition for future data science projects.

Keep pulling levers until you get your results to an acceptable level for deployment.

Conversely, you may discover that solving the problem is impossible given current data, resources, and technology.
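
One lightweight way to note the effect of each lever is a simple experiment log compared against the baseline. The sketch below is plain Python; the `compare_to_baseline` helper is hypothetical, and every score in it is a made-up placeholder rather than a real result.

```python
def compare_to_baseline(baseline_score, experiments):
    """Return (lever, score, change vs. baseline) rows, biggest gain first."""
    rows = [
        (name, score, round(score - baseline_score, 4))
        for name, score in experiments.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Placeholder F1 scores: one entry per lever, each measured on the same
# test set with only that lever changed (apples to apples).
baseline_f1 = 0.61
experiments = {
    "bigrams": 0.66,
    "stemming": 0.60,
    "class weighting": 0.70,
}

ranked = compare_to_baseline(baseline_f1, experiments)
for name, score, change in ranked:
    print(f"{name:16s} f1={score:.2f} change={change:+.2f}")
```

Changing one lever at a time keeps the comparison fair and shows which levers are worth combining next.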

Productionizing your model

There’s a huge gap between running a model locally and making that model available to a production application.

This involves productionizing both the pre-processing pipeline and the model: requests that hit production must go through the same pipeline that the training data went through.
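
One common way to keep training and serving on the same pipeline is to bundle the pre-processing and the model into a single object and ship that object. The sketch below assumes scikit-learn and the standard `pickle` module; the toy support tickets, the labels and the `predict` wrapper are hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy training data.
texts = [
    "refund my order",
    "where is my package",
    "cancel my subscription",
    "package arrived damaged",
]
labels = ["billing", "shipping", "billing", "shipping"]

# Pre-processing and model live in one object, so they cannot drift apart.
pipeline = Pipeline([
    ("vectorize", CountVectorizer()),
    ("classify", LogisticRegression()),
])
pipeline.fit(texts, labels)

# Training side: serialize the whole fitted pipeline as one artifact.
artifact = pickle.dumps(pipeline)

# Serving side: load the artifact and pass raw requests straight through it.
served = pickle.loads(artifact)


def predict(raw_text):
    """Run a raw request through the same steps used in training."""
    return served.predict([raw_text])[0]


print(predict("refund please"))
```

Because the vectorizer is stored with the classifier, a production request is tokenized exactly as the training data was. Only unpickle artifacts you created yourself, since loading a pickle can execute arbitrary code.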

Pro tip: Decouple your model and infrastructure components as much as possible. This allows re-deploying small pieces rather than the whole pipeline if you change a single hyperparameter.

The correct production infrastructure is a function of model size, libraries, memory requirements, budget and technical expertise.

Digging into this could fill a whole book, but AWS users should consider combinations of EC2, Lambda and SageMaker to start.

Drop me a message in the comments if you’d like some high-level advice.

Testing production

Can you cause weird behaviour, or make it throw an error?

At this point, we’re only concerned with scalability and robustness, not the quality of predictions.

Scalability is how it performs as load varies.

Robustness is how it handles unexpected inputs.
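
As a sketch of what probing robustness can look like, the snippet below puts a small validation step in front of a model and throws a list of awkward inputs at it. The `validate_request` function, the payload shape and the length limit are all assumptions made for illustration.

```python
def validate_request(payload, max_chars=10_000):
    """Return (True, cleaned_text) for usable input, else (False, reason)."""
    if not isinstance(payload, dict):
        return False, "payload must be a JSON object"
    text = payload.get("text")
    if not isinstance(text, str):
        return False, "'text' must be a string"
    text = text.strip()
    if not text:
        return False, "'text' must not be empty"
    if len(text) > max_chars:
        return False, "'text' is too long"
    return True, text


# Inputs a teammate might use when trying to break the endpoint.
weird_inputs = [
    None,
    [],
    {},
    {"text": None},
    {"text": 42},
    {"text": "   "},
    {"text": "x" * 20_000},
    {"text": "  naïve but valid input  "},
]

results = [validate_request(payload) for payload in weird_inputs]
for payload, result in zip(weird_inputs, results):
    print(repr(payload)[:40], "->", result)
```

Each rejected input returns a reason instead of raising, so the service can answer with a clear error rather than crashing somewhere inside the model.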

Get your whole team involved in trying to crash the system you’ve built. Better to break it now than when customers are using it.

If you’ve passed all the above steps, congratulations! You’re ready to incrementally roll out ML to your customers.

Conclusion

This is a general framework I’ve applied to multiple machine learning and data science problems.

In a nutshell, build a quick and dirty solution, then incrementally improve it until it’s good enough for production.

A common pitfall is over-engineering before understanding the likelihood of success. You could literally spend years in the pre-processing stage without building anything useful, so don’t fall into that trap! Move fast!

This framework should apply to both startup ML and any hobby data science projects you’re working on. I hope you found it helpful.

Translated from: https://towardsdatascience.com/my-framework-for-helping-startups-build-and-deploy-data-science-43cf40bc1a1e
