[Translation] Developing Apache Spark Applications in .NET using Mobius, the Spark Extension for the .NET Platform

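A recent big data project of mine involves data mining. I have always developed in C# and do not know Java yet, so I tried to do Spark development in C#. I found Microsoft's Mobius project online, but there is very little Chinese material on it (or perhaps I just have not found it yet), so I spent a few hours translating this article. If it proves useful, I will continue translating related material.
This is my first translation, so please bear with any mistakes or rough spots. For rigor, I have kept the original English text.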
============================================================
Developing Apache Spark Applications in .NET using Mobius
Extending Spark to the .NET developer community


by Kaarthik Sivashanmugam (a member of Microsoft's Mobius project). Posted in Company Blog, August 3, 2016


To address the gap between Spark and .NET, Microsoft created Mobius, an open source project, with guidance from Databricks. By adding a C# language API to Spark, Mobius enables .NET Framework developers to build Apache Spark applications. This guest blog provides an overview of this C# API.

Apache Spark has transformed the big data processing and analytics space over the last few years. It provides high-level APIs in Scala, Java, Python and R and has dramatically reduced the cost and complexity of building a wide variety of big data workloads. The results of Spark Survey 2015 indicate that ease of programming is one of the most important aspects of Spark. So it is apparent that having APIs in multiple languages appealed to various developer personas and contributed to the rapid adoption of Spark.

However, Spark had been out of reach for the .NET developer community. The results of Spark Survey 2015 also indicated that there was a huge spike in Spark usage on Windows, and there is a high likelihood that a good portion of the developers using Spark on Windows are .NET professionals. To address the gap between Spark and .NET, Microsoft created Mobius as an open source project with the goal of adding a C# language API to Spark, enabling the use of any .NET Framework language in building Apache Spark applications. With Mobius, organizations deeply invested in .NET can reuse their existing .NET libraries in their Spark applications.


Spark Applications in .NET

The C# language binding to Spark is similar to the Python and R bindings. In fact, Mobius follows the same design pattern and leverages the existing implementation of language binding components in Spark where applicable for consistency and reuse. The following picture shows the dependency between the .NET application and the C# API in Mobius, which internally depends on Spark’s public API in Scala and Java and extends PythonRDD from PySpark to implement CSharpRDD.


 
As shown above, the driver programs are written entirely in a .NET programming language like C# or F# using the C# API in Mobius. Mobius applications can be used with Spark deployed on premises or in the cloud. Mobius is supported on Windows and Linux. In Linux, Mobius uses Mono, an open source implementation of the .NET framework.

Developing & Submitting Mobius Applications

Mobius driver applications can be developed in an IDE (like Visual Studio) that supports .NET development. Mobius API and the worker implementation (used to execute user defined functionality in C# code in Spark worker nodes) are released to NuGet. Once these Mobius binaries and any other .NET library dependencies are added to the Mobius driver project in the IDE, the driver application code can be developed, debugged and tested like any other .NET program within the IDE.

Mobius driver applications in .NET are compiled into an executable (.exe file), which is copied along with its dependencies to the client machine from which the Spark job will be submitted. A supported Mobius release is also needed on that client machine, where the Mobius job submission script (sparkclr-submit.cmd or sparkclr-submit.sh) is used to submit the Mobius-based application to a Spark cluster. The Mobius job submission script accepts the same parameters as the spark-submit script, plus an additional parameter that specifies the Mobius driver executable name and its path.

More information on running a Mobius application is available on GitHub.

The Mobius API has the same method names and signatures with similar data types as the Scala API for Spark. As a result, the driver programs implemented using Mobius look similar to those implemented in Scala or Java. Here is a code example for implementing Spark’s “Word Count” example in C# using Mobius API.

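// Read the input file as an RDD of lines.
var textFile = sparkContext.TextFile(@"hdfs://...");
var counts = textFile
             .FlatMap(x => x.Split(' '))                              // split each line into words
             .Map(word => new KeyValuePair<string, int>(word, 1))     // pair each word with a count of 1
             .ReduceByKey((x, y) => x + y)                            // sum the counts for each word
             .Map(wordCount => $"{wordCount.Key},{wordCount.Value}"); // format as "word,count"
// Write the formatted results out as text.
counts.SaveAsTextFile(@"hdfs://...");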
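
To make the data flow of the Word Count pipeline easier to follow, here is a minimal local sketch of the same steps. It is an assumption-laden stand-in, not Mobius code: it uses plain LINQ over an in-memory sequence instead of an RDD, so `SelectMany` plays the role of `FlatMap`, `Select` the role of `Map`, and `GroupBy` plus `Sum` the role of `ReduceByKey`. The class name `LocalWordCount` and the sample input are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: a local LINQ analogue of the Word Count pipeline.
// It does not use Mobius or Spark; it only mirrors the shape of the steps.
public static class LocalWordCount
{
    public static List<string> Count(IEnumerable<string> lines)
    {
        return lines
            // FlatMap analogue: one line becomes many words.
            .SelectMany(line => line.Split(' '))
            // Map analogue: pair each word with an initial count of 1.
            .Select(word => new KeyValuePair<string, int>(word, 1))
            // ReduceByKey analogue: group by word, then sum the counts.
            .GroupBy(pair => pair.Key)
            .Select(group => new KeyValuePair<string, int>(
                group.Key, group.Sum(pair => pair.Value)))
            // Final Map analogue: format each pair as "word,count".
            .Select(wordCount => $"{wordCount.Key},{wordCount.Value}")
            .ToList();
    }

    public static void Main()
    {
        var lines = new[] { "to be or", "not to be" };
        foreach (var row in Count(lines))
        {
            Console.WriteLine(row);
        }
    }
}
```

A local version like this can be handy for unit-testing the per-record logic before submitting a job. Note that on a cluster the order of the output rows is not guaranteed, whereas LINQ's `GroupBy` happens to keep keys in order of first appearance.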


The code snippet below is in F# and shows how to query the data in JSON format and use the DataFrame API to look for rows with State = ‘California’ and also register those rows as a temp table and use Spark SQL to query for all rows with name = ‘Bill’.

// Load the JSON data as a DataFrame.
let peopleDataFrame = sqlContext.Read().Json("hdfs://...")
// Keep the name and state columns, then keep only the rows for California.
let filteredDf = peopleDataFrame.Select("name", "address.state")
                 .Where("state = 'California'")
filteredDf.Show()
// Register the filtered rows as a temp table so Spark SQL can query them.
filteredDf.RegisterTempTable "filteredDfAsTempTable"
let countAsDf = sqlContext.Sql "SELECT * FROM filteredDfAsTempTable where name='Bill'"
let countOfRows = countAsDf.Count()
printf "Count of rows with name='Bill' and State='California' = %d" countOfRows
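
For readers who want to check the filtering logic without a cluster, the query above can be approximated locally. The following C# sketch is only an illustration under stated assumptions: it uses LINQ over in-memory objects rather than the Mobius DataFrame API, and the `Person` type, the `LocalPeopleQuery` class, and the sample rows are all hypothetical stand-ins for the JSON input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local analogue of the DataFrame query; not Mobius code.
public static class LocalPeopleQuery
{
    // Assumed record shape: the name and the flattened address.state field.
    public sealed class Person
    {
        public string Name { get; set; }
        public string State { get; set; }
    }

    public static long CountBillsInCalifornia(IEnumerable<Person> people)
    {
        // Analogue of Select("name", "address.state") and
        // Where("state = 'California'").
        var filtered = people
            .Select(p => new { p.Name, p.State })
            .Where(row => row.State == "California");

        // Analogue of the SQL query "where name='Bill'" followed by Count().
        return filtered.Where(row => row.Name == "Bill").LongCount();
    }

    public static void Main()
    {
        var people = new List<Person>
        {
            new Person { Name = "Bill", State = "California" },
            new Person { Name = "Bill", State = "Washington" },
            new Person { Name = "Ann", State = "California" },
        };
        Console.WriteLine(
            "Count of rows with name='Bill' and State='California' = "
            + CountBillsInCalifornia(people));
    }
}
```

The two filters are kept as separate steps to mirror the original, where the first runs through the DataFrame API and the second through Spark SQL on the registered temp table.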

More examples for RDD, DataFrame and DStream API are available here. These examples also cover HDFS, Cassandra, Event Hubs, Kafka, Hive and JDBC sources in Mobius applications.

More Resources

You can peruse our GitHub repository, and we welcome your contributions. Additional information on Mobius is available in the slides and video from the talk on Mobius presented at Spark Summit 2016. Finally, Mobius powers several .NET-based Spark workloads at Microsoft. For example, the Spark Summit 2016 talk (slides, video) covers the lessons learned using Spark in a Bing-scale workload.
