High-Performance Computing -- HPCC -- HPCC vs. Hadoop

Original: http://hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop
Translated by 那海蓝蓝; the translator's annotations are marked inline below.

Read how the HPCC Platform compares to Hadoop

Note: a Chinese translation of the HPCC-related portions is available at: http://blog.163.com/li_hx/blog/static/183991413201163104244293/

Each comparison item below lists the HPCC Platform approach first, followed by the Hadoop approach.
Hardware Type
HPCC Platform: Processing clusters using commodity off-the-shelf (COTS) hardware. Typically rack-mounted blade servers with Intel or AMD processors, local memory, and disk, connected to a high-speed communications switch (usually Gigabit Ethernet connections) or a hierarchy of communications switches depending on the total size of the cluster. Clusters are usually homogeneous (all processors are configured identically), but this is not a requirement.
[For a Chinese translation, see: http://blog.163.com/li_hx/blog/static/183991413201163104244293/]
Operating System
Linux and Windows
System Configurations
HPCC Platform: HPCC clusters can be implemented in two configurations: the Data Refinery (Thor) is analogous to the Hadoop MapReduce cluster; the Data Delivery Engine (Roxie) provides separate high-performance online query processing and data warehouse capabilities. Both configurations also function as distributed file systems, but they are implemented differently based on the intended use to improve performance. HPCC environments typically consist of multiple clusters of both configuration types. Although the file systems on each cluster are independent, a cluster can access files within a file system on any other cluster in the same environment.
Hadoop: Hadoop system software implements clusters with the MapReduce processing paradigm. The cluster also functions as a distributed file system running HDFS. Other capabilities are layered on top of the Hadoop MapReduce and HDFS system software, including HBase, Hive, etc.
Licensing & Maintenance Cost
HPCC Platform: The Community Edition is free. Enterprise license fees currently depend on the size and type of the system configurations. (Translator's note: only a completely free offering will get the technology adopted by large-scale users; a model like this will not last long. On the technical side, HPCC is still worth exploring in depth.)
Hadoop: Free, although there are different paid maintenance offerings from multiple vendors.
Core Software
HPCC Platform: For a Thor configuration, core software includes the operating system and various services installed on each node of the cluster to provide job execution and distributed file system access. A separate server called the Dali server provides file system name services and manages workunits for jobs in the HPCC environment. A Thor cluster is also configured with a master node and multiple slave nodes. A Roxie cluster is a peer-coupled cluster where each node runs Server and Agent tasks for query execution and key and file processing. The file system on the Roxie cluster uses a distributed B+Tree to store index and data and provides keyed access to the data. Additional middleware components are required for the operation of Thor and Roxie clusters.
Hadoop: Core software includes the operating system, the Hadoop MapReduce cluster software, and the HDFS software. Each slave node runs a Tasktracker service and a Datanode service. A master node runs a Jobtracker service, which can be configured on a separate hardware node or run on one of the slave hardware nodes. Likewise, for HDFS, a master Namenode service is also required to provide name services and can run on one of the slave nodes or on a separate node.
Middleware Components
HPCC Platform: Middleware components include an ECL code repository implemented on a MySQL server; an ECLServer for compiling ECL programs and queries; an ECLAgent, acting on behalf of a client program, to manage the execution of a job on a Thor cluster; an ESPServer (Enterprise Services Platform) providing authentication, logging, security, and other services for the job execution and Web services environment; and the Dali server, which functions as the system data store for job workunit information and provides naming services for the distributed file systems. There is flexibility to run the middleware components on one to several nodes. Multiple copies of these servers can provide redundancy and improve performance.
Hadoop: None. Client software can submit jobs directly to the Jobtracker on the master node of the cluster. A Hadoop Workflow Scheduler (HWS), which will run as a server, is currently under development to manage jobs that require multiple MapReduce sequences.
System Tools
HPCC Platform: HPCC includes a suite of client and operation tools for managing, maintaining, and monitoring HPCC configurations and environments. These include ECL IDE, the program development environment; an Attribute Migration Tool; a Distributed File Utility (DFU); an Environment Configuration Utility; and a Roxie Configuration Utility. Command-line versions are also available. ECLWatch is a Web-based utility program for monitoring the HPCC environment and includes queue management, distributed file system management, job monitoring, and system performance monitoring tools. Additional tools are provided through Web services interfaces.
Hadoop: The dfsadmin tool provides information about the state of the file system; fsck is a utility for checking the health of files in HDFS; the datanode block scanner periodically verifies all the blocks stored on a datanode; the balancer re-distributes blocks from over-utilized datanodes to under-utilized datanodes as needed. The MapReduce Web UI includes the JobTracker page, which displays information about running and completed jobs; drilling down on a specific job displays detailed information about that job. There is also a Tasks page that displays information about Map and Reduce tasks.
Ease of Deployment
HPCC Platform: An environment configuration tool is provided. A Genesis server provides a central repository to distribute OS-level settings, services, and binaries to all net-booted nodes in a configuration.
Hadoop: Assisted by online wizard-based tools provided by third parties. Requires a manual RPM deployment.
Distributed File System
HPCC Platform: The Thor DFS is record-oriented and uses the local Linux file system to store file parts. Files are initially loaded (sprayed) across nodes, and each node has a single file part, which can be empty, for each distributed file. Files are divided on even record/document boundaries specified by the user. Master/Slave architecture, with name services and file mapping information stored on a separate server. Only one local file per node is required to represent a distributed file. Read/write access is supported between clusters configured in the same environment. Special adapters allow files from external databases such as MySQL to be accessed, allowing transactional data to be integrated with DFS data and incorporated into batch jobs. The Roxie DFS utilizes distributed B+Tree index files containing key information and data stored in local files on each node.
Hadoop: Block-oriented; most installations use large 64 MB or 128 MB blocks. Blocks are stored as independent units/local files in the node's local Unix/Linux file system. Metadata information for each block is stored in a separate file. Master/Slave architecture with a single Namenode, which provides name services and block mapping, and multiple Datanodes. Files are divided into blocks and spread across nodes in the cluster. Multiple local files (one containing the block, one containing its metadata) are required for each logical block stored on a node to represent a distributed file.
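The block-oriented layout described in the Hadoop column is visible directly through the HDFS client API. Below is a minimal Java sketch, assuming a reachable HDFS cluster configured through the usual core-site.xml/hdfs-site.xml files and a hypothetical file path passed on the command line; it simply prints the block size and replication factor HDFS reports for that file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);        // connect to the configured HDFS
        Path file = new Path(args[0]);               // hypothetical path, e.g. /user/demo/input.txt
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size (bytes): " + status.getBlockSize());   // typically 64 MB or 128 MB
        System.out.println("replication factor: " + status.getReplication()); // copies kept for fault tolerance
        fs.close();
    }
}
```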
Fault Resilience
HPCC Platform: The DFS for Thor and Roxie stores replicas of file parts on other nodes (configurable) to protect against disk and node failure. The Thor system offers either automatic or manual node swap and warm start following a node failure; jobs are restarted from the last checkpoint or persist. Replicas are automatically used while copying data to the new node. The Roxie system continues running following a node failure, with a reduced number of nodes.
Hadoop: HDFS stores multiple replicas (user-specified) of data blocks on other nodes (configurable) to protect against disk and node failure, with automatic recovery. The MapReduce architecture includes speculative execution: when a slow or failed Map task is detected, additional Map tasks are started to recover from node failures.
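On the Hadoop side, both mechanisms above (block replication and speculative execution) are normally driven by configuration properties. A minimal sketch follows, assuming a Hadoop 2.x-style client configuration; the values shown are illustrative, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class ResilienceSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);                  // number of block replicas HDFS keeps
        conf.setBoolean("mapreduce.map.speculative", true);  // re-run slow map tasks on other nodes
        conf.setBoolean("mapreduce.reduce.speculative", true);
        System.out.println("replication = " + conf.getInt("dfs.replication", 1));
    }
}
```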
Job Execution Environment
HPCC Platform: Thor utilizes a Master/Slave processing architecture. Processing steps defined in an ECL job can specify local (data processed separately on each node) or global (data processed across all nodes) operation. Multiple processing steps for a procedure are executed automatically as part of a single job, based on an optimized execution graph for a compiled ECL dataflow program. A single Thor cluster can be configured to run multiple jobs concurrently, reducing latency if adequate CPU and memory resources are available on each node. Middleware components, including an ECLAgent, ECLServer, and Dali server, provide the client interface and manage execution of the job, which is packaged as a workunit. Roxie utilizes a multiple Server/Agent architecture to process ECL programs accessed by queries, using Server tasks acting as a manager for each query and multiple Agent tasks as needed to retrieve and process data for the query.
Hadoop: Uses the MapReduce processing paradigm with input data in key-value pairs. Master/Slave processing architecture. A Jobtracker runs on the master node, and a Tasktracker runs on each of the slave nodes. Map tasks are assigned to input splits of the input file, usually one per block. The number of Reduce tasks is assigned by the user. Map processing is local to the assigned node. A shuffle and sort operation is done following the Map phase to distribute and sort key-value pairs to Reduce tasks based on key regions, so that pairs with identical keys are processed by the same Reduce task. Multiple MapReduce processing steps are typically required for most procedures and must be sequenced and chained separately by the user or by a language such as Pig; a minimal example of a single MapReduce step is sketched below.
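To make the Hadoop job flow concrete (one map task per input split, a shuffle/sort on the key, a user-chosen number of reduce tasks), here is a minimal word-count sketch using the org.apache.hadoop.mapreduce API. The input and output paths are hypothetical command-line arguments; the cluster address comes from the client's Hadoop configuration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs locally on the node holding each input split and emits (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: after the shuffle/sort, all pairs with the same key reach the same reducer,
    // which sums the counts.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                           // the number of reduce tasks is chosen by the user
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that this is one MapReduce step; procedures needing several such steps must be chained by the user or by a higher-level language such as Pig, as the comparison above points out.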
Programming Languages
HPCC Platform: ECL is the primary programming language for the HPCC environment. ECL is compiled into optimized C++, which is then compiled into DLLs for execution on the Thor and Roxie platforms. ECL can include inline C++ code encapsulated in functions. External services can be written in any language and compiled into shared libraries of functions callable from ECL. A Pipe interface allows external programs written in any language to be incorporated into jobs.
Hadoop: Hadoop MapReduce jobs are usually written in Java. Other languages are supported through a streaming or pipe interface. Other processing environments, such as HBase and Hive, execute on top of Hadoop MapReduce and have their own language interfaces. The Pig Latin language and Pig execution environment provide a high-level dataflow language that is then mapped into multiple Java MapReduce jobs.
Integrated Program Development Environment
HPCC Platform: The HPCC platform is provided with ECL IDE, a comprehensive IDE specifically for the ECL language. ECL IDE provides access to shared source code repositories and a complete development and testing environment for developing ECL dataflow programs. Access to the ECLWatch tool is built in, allowing developers to watch job graphs as they execute. Access to current and historical job workunits allows developers to easily compare results from one job to the next during development cycles.
Hadoop: Hadoop MapReduce utilizes the Java programming language, and there are several excellent program development environments for Java, including NetBeans and Eclipse, which offer plug-ins for access to Hadoop clusters. The Pig environment does not have its own IDE, but instead uses Eclipse and other editing environments for syntax checking. A PigPen add-in for Eclipse provides access to Hadoop clusters to run Pig programs, as well as additional development capabilities.
Database Capabilities
HPCC Platform: The HPCC platform includes the capability to build multi-key, multivariate indexes on DFS files. These indexes can be used to improve performance and provide keyed access for batch jobs on a Thor system, or to support the development of queries deployed to Roxie systems. Keyed access to data is supported directly in the ECL language.
Hadoop: The basic Hadoop MapReduce system does not provide any keyed-access indexed database capabilities. An add-on system for Hadoop called HBase provides a column-oriented database capability with keyed access. A custom script language and a Java interface are provided. Access to HBase is not directly supported by the Pig environment and requires user-defined functions or separate MapReduce procedures.
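As a sketch of what keyed access through HBase's Java interface looks like, the following reads a single row by key. The table name "users" and the column family/qualifier "info:name" are hypothetical; the code assumes an HBase 1.x+ client and a cluster reachable through the client's hbase-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyedLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();                    // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {      // hypothetical table
            Get get = new Get(Bytes.toBytes(args[0]));                       // row key from the command line
            Result row = table.get(get);
            byte[] name = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(name == null ? "not found" : Bytes.toString(name));
        }
    }
}
```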
Online Query and Data Warehouse Capabilities
HPCC Platform: The Roxie system configuration in the HPCC platform is specifically designed to provide data warehouse capabilities for structured queries and data analysis applications. Roxie is a high-performance platform capable of supporting thousands of users and providing sub-second response times, depending on the application.
Hadoop: The basic Hadoop MapReduce system does not provide any data warehouse capabilities. An add-on system for Hadoop called Hive provides data warehouse capabilities and allows HDFS data to be loaded into tables and accessed with an SQL-like language. Access to Hive is not directly supported by the Pig environment and requires user-defined functions or separate MapReduce procedures.
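The SQL-like access that Hive layers on top of HDFS is typically reached from Java through the Hive JDBC driver. A minimal sketch follows; the HiveServer2 endpoint and the table page_views are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");                    // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver.example.com:10000/default");   // hypothetical endpoint
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```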
Scalability
HPCC Platform: One to several thousand nodes. In practice, HPCC configurations require significantly fewer nodes to provide the same processing performance as a Hadoop cluster. Sizing of clusters may, however, depend on the overall storage requirements for the distributed file system.
Hadoop: One to thousands of nodes.
Performance
HPCC Platform: The HPCC platform has demonstrated sorting 1 TB on a high-performance 400-node system in 102 seconds. In a recent head-to-head benchmark against Hadoop on another 400-node system, the HPCC time was 6 minutes 27 seconds and the Hadoop time was 25 minutes 28 seconds. This result on the same hardware configuration showed that HPCC was 3.95 times faster than Hadoop for this benchmark.
Hadoop: Currently the only available standard performance benchmarks are the sort benchmarks sponsored by http://sortbenchmark.org. Yahoo! has demonstrated sorting 1 TB on 1460 nodes in 62 seconds, 100 TB using 3452 nodes in 173 minutes, and 1 PB using 3658 nodes in 975 minutes.
Training
HPCC Platform: Basic and advanced training classes on ECL programming are offered monthly in several locations, or can be conducted on customer premises. A system administration class is also offered and scheduled as needed. A free HPCC VM image with a complete HPCC and ECL learning environment, usable on a single PC or laptop, is also available.
Hadoop: Hadoop training is offered through third parties. Both beginning and advanced classes are provided; the advanced class covers Hadoop add-ons, including HBase and Pig. Another third party provides a VMware-based learning environment that can be used on a standard laptop or PC. Online tutorials are also available.