These Frameworks Have Helped LinkedIn Build Machine Learning at Scale

weixin_26632369

Original article: https://medium.com/swlh/these-frameworks-have-helped-linkedin-build-machine-learning-at-scale-50406f87a58b

I recently started a new newsletter focused on AI education. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers and concepts. Please give it a try by subscribing.

Building machine learning at scale is a road full of challenges, and there are not many well-documented case studies to use as a reference. My team at Invector Labs recently published a slide deck that summarizes some of the lessons we have learned building machine learning solutions at scale, but we are also always trying to study how other companies in the space are solving these issues.

LinkedIn is one of the companies that have been applying machine learning to large-scale scenarios for years, but little was known about the specific methods and techniques used at the software giant. Recently, the LinkedIn engineering team published a series of blog posts that provide some very interesting insights into their machine learning infrastructure and practices. While many of the scenarios are very specific to LinkedIn, the techniques and best practices are applicable to many large-scale machine learning solutions.

Machine Learning with Humans in the Loop

One of the most interesting aspects of LinkedIn’s machine learning architecture is how it uses humans as part of its machine learning workflows. Take, for instance, a scenario that discovers relationships between different titles, such as “sr. software engineer” or “lead developer”, to improve the search experience. LinkedIn uses human taxonomists to tag relationships between titles so that they can be used in machine learning models, such as long short-term memory (LSTM) networks, which help discover additional relationships between titles. That machine learning architecture is the foundation of the LinkedIn Knowledge Graph.
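
The workflow described above, where human-tagged title relationships seed the discovery of further ones, can be sketched in a few lines of Python. This is only an illustration: the abbreviation table, the example titles, and the token-overlap score are assumptions made for the sketch, and the score is a deliberately simple stand-in for a learned model such as an LSTM, not LinkedIn's actual method.

```python
# Hypothetical abbreviation table; a real taxonomy would be far larger.
ABBREVIATIONS = {"sr.": "senior", "sr": "senior", "dev": "developer"}


def normalize(title):
    """Lowercase a title and expand known abbreviations into a token set."""
    return frozenset(ABBREVIATIONS.get(tok, tok) for tok in title.lower().split())


def similarity(a, b):
    """Jaccard token overlap; a stand-in for the score of a learned model."""
    ta, tb = normalize(a), normalize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def expand_labels(human_labels, titles, threshold=0.6):
    """Propose new related pairs by extending each human-tagged pair
    to titles that look like one of its two members."""
    known = {frozenset(pair) for pair in human_labels}
    proposals = set()
    for a, b in human_labels:
        for candidate in titles:
            if candidate in (a, b):
                continue
            # If the candidate resembles one side, relate it to the other side.
            for anchor, partner in ((a, b), (b, a)):
                pair = frozenset((candidate, partner))
                if pair not in known and similarity(candidate, anchor) >= threshold:
                    proposals.add(pair)
    return sorted(tuple(sorted(pair)) for pair in proposals)


# One relationship tagged by a (hypothetical) human taxonomist.
human_labels = [("sr. software engineer", "lead developer")]
titles = [
    "sr. software engineer",
    "lead developer",
    "senior software engineer",
    "software developer",
]
proposals = expand_labels(human_labels, titles)
print(proposals)
```

In a real pipeline the proposed pairs would presumably go back to the taxonomists for review, which keeps humans in the loop while the model widens coverage.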

Machine Learning Infrastructure at Scale

The core of LinkedIn’s machine learning infrastructure is a proprietary system called Pro-ML. Conceptually, Pro-ML controls the entire lifecycle of machine learning models, from training to monitoring. To scale Pro-ML, LinkedIn has built an architecture that combines some of its open source technologies, such as Kafka or Samza, with infrastructure building blocks like Spark or Hadoop YARN.

While most of the technologies used as part of LinkedIn’s machine learning stack are well known, there are a few new contributions that deserve further exploration:

· Ambry: LinkedIn’s Ambry is a distributed, immutable blob storage system that is highly available and easy to scale. It is optimized to serve immutable objects ranging from a few KBs to multiple GBs in size with high throughput and low latency, and it enables end-to-end streaming from clients to the storage tiers and vice versa. The system was built to work in an active-active setup across multiple datacenters and provides very cheap storage.

· TonY: TensorFlow on YARN (TonY) is a framework to natively run TensorFlow on Apache Hadoop. TonY enables running either single-node or distributed TensorFlow training as a Hadoop application.

· Photon ML: Photon ML is a machine learning library based on Apache Spark. Currently, Photon ML supports training different types of Generalized Linear Models (GLMs) and Generalized Linear Mixed Models (GLMMs/GLMix models): logistic, linear, and Poisson.

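
To make the three model families named in the Photon ML entry concrete, here is a minimal plain-Python sketch of how a GLM turns a linear score into a prediction through its inverse link function. It does not use the Photon ML API. The function names, weights, and the `glmix_score` helper are hypothetical, and the mixed-model part assumes the usual reading of GLMix as a global model plus a per-entity correction.

```python
import math


def linear_predictor(weights, features, bias=0.0):
    """Compute eta = bias + w . x, the linear score shared by all GLMs."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Inverse link functions map the linear score to the predicted mean.
INVERSE_LINKS = {
    "linear": lambda eta: eta,                              # identity link
    "logistic": lambda eta: 1.0 / (1.0 + math.exp(-eta)),   # logit link
    "poisson": lambda eta: math.exp(eta),                   # log link
}


def glm_predict(model_type, weights, features, bias=0.0):
    """Predict the mean response for one example."""
    return INVERSE_LINKS[model_type](linear_predictor(weights, features, bias))


def glmix_score(global_weights, per_entity_weights, entity_id, features):
    """Linear score of a mixed model: global (fixed-effect) score plus a
    per-entity (random-effect) score; unknown entities get no correction."""
    entity_weights = per_entity_weights.get(entity_id, [0.0] * len(features))
    return linear_predictor(global_weights, features) + linear_predictor(
        entity_weights, features
    )


features = [2.0, 2.0]
print(glm_predict("logistic", [0.0, 0.0], features))              # 0.5
print(glmix_score([1.0, 0.5], {"u1": [0.5, 0.0]}, "u1", features))  # 4.0
```

The same linear score feeds all three families; only the inverse link changes, which is why a single library can train them with shared machinery.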

TensorFlow on Hadoop

Last month, the LinkedIn engineering team open sourced the first release of its TensorFlow on YARN (TonY) framework. The goal of the release was to enable TensorFlow programs to run on distributed YARN clusters. While TensorFlow workflows are widely supported on infrastructures like Apache Spark, YARN has remained largely ignored by the machine learning community. TonY enables first-class support for running TensorFlow jobs on Hadoop by handling tasks such as resource negotiation and container environment setup.

At its core, TonY takes a TensorFlow program and splits it into multiple parallel tasks that can be executed on a YARN cluster. It does so while maintaining full support for TensorFlow’s computation graph, which means that tools such as TensorBoard can be used on TonY without any modifications.
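
To give a feel for what splitting a job into parallel tasks involves, the sketch below builds the `TF_CONFIG` value that distributed TensorFlow conventionally reads in each process to learn the cluster layout and its own role. The host names and the `build_tf_configs` helper are hypothetical, and how TonY itself passes this information to its containers is an assumption here, not something the article states.

```python
import json


def build_tf_configs(cluster):
    """Return one TF_CONFIG JSON string per task in the cluster spec.

    `cluster` maps a task type (for example "ps" or "worker") to the list
    of host:port addresses running that type of task.
    """
    configs = {}
    for task_type, hosts in cluster.items():
        for index in range(len(hosts)):
            configs[(task_type, index)] = json.dumps(
                {"cluster": cluster, "task": {"type": task_type, "index": index}},
                sort_keys=True,
            )
    return configs


# Hypothetical addresses; on YARN these would come from allocated containers.
cluster = {
    "ps": ["host-a:2222"],
    "worker": ["host-b:2222", "host-c:2222"],
}
configs = build_tf_configs(cluster)

# Each task would receive its own value, e.g. as an environment variable.
print(configs[("worker", 1)])
```

Every task sees the same cluster description and differs only in its `task` entry, so the training script itself can stay identical across containers.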

TonY is an interesting contribution to the TensorFlow ecosystem that can improve the experience of running TensorFlow applications at scale. Furthermore, TonY can benefit from the wide range of tools and libraries available in the YARN ecosystem to provide a highly scalable runtime for training and running TensorFlow applications.

Testing

LinkedIn runs thousands of concurrent machine learning models which are constantly evolving and being versioned. In those scenarios, developing a robust testing methodology is essential to optimize the performance of machine learning models at runtime. In the case of LinkedIn, the engineering team has embedded A/B testing as a first-class citizen of its Pro-ML architecture, allowing machine learning engineers to deploy competing algorithms for specific scenarios and identify the one that yields the best results.
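
As an illustration of how two competing models might be compared in an A/B test, the sketch below runs a two-proportion z-test on a success metric such as click-through. This is a generic statistical technique chosen for the example, not a description of Pro-ML's evaluation logic, and the counts are made up.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Compare the success rates of variants A and B.

    Returns (z, p_value), where p_value is two-sided under the
    standard normal approximation.
    """
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: model A converts 100 of 1000 users, model B 150 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the two models is unlikely to be noise; the significance threshold and the metric itself are choices each team has to make for its own scenario.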

Internet giants like LinkedIn are at the forefront of implementing large-scale machine learning solutions, and their insights on this subject are incredibly valuable to companies starting their machine learning journey. LinkedIn’s work clearly shows that developing machine learning at scale is a never-ending exercise that combines popular open source libraries and platforms with proprietary frameworks and methodologies.
