What Hadoop is Not

We see a lot of emails from people who have heard about Hadoop and think it will be the silver bullet that solves all their application/datacentre problems. It is not. It solves some specific problems for some companies and organisations, but only after they have understood the technology and where it is appropriate. If you start using Hadoop in the belief it is a drop-in replacement for your database or SAN filesystem, you will be disappointed.

Apache Hadoop is not a substitute for a database

Databases are wonderful. Issue an SQL SELECT call against an indexed/tuned database and the response comes back in milliseconds. Want to change that data? SQL UPDATE and the change is in. Hadoop does not do this.

Hadoop stores data in files, and does not index them. If you want to find something, you have to run a MapReduce job going through all the data. This takes time, and means that you cannot directly use Hadoop as a substitute for a database. Where Hadoop works is where the data is too big for a database (i.e. you have reached the technical limits, not just that you don't want to pay for a database license). With very large datasets, the cost of regenerating indexes is so high you can't easily index changing data. With many machines trying to write to the database, you can't get locks on it. Here the idea of vaguely-related files in a distributed filesystem can work.
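To make that cost concrete, here is a minimal sketch, not code from the Hadoop distribution, of what a "lookup" means in MapReduce: a map-only job that reads every line of every input file and emits the ones containing a search string. The class names and the WANTED_KEY literal are made up for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FindRecord {
    public static class ScanMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Every line of every input split is read and tested; there is
            // no index to narrow the search the way a database would.
            if (line.toString().contains("WANTED_KEY")) {
                ctx.write(line, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "full-scan lookup");
        job.setJarByClass(FindRecord.class);
        job.setMapperClass(ScanMapper.class);
        job.setNumReduceTasks(0); // map-only: just emit the matches
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Even if only one record matches, every block of the input still gets read and tested. That is exactly the work a database index exists to avoid.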

There is a high performance column-oriented database that runs on top of Hadoop HDFS: Apache HBase. This is a great place to keep the results extracted from your original data.
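As a rough sketch of what that buys you, assuming a results table already exists (the table name, column family and row key below are invented), the HBase client API can fetch a single row by key without scanning anything:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FetchResult {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("job_results"))) {
            // Keyed read: milliseconds, not a MapReduce pass over raw data.
            Result row = table.get(new Get(Bytes.toBytes("user#42")));
            byte[] value = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("total"));
            System.out.println(value == null ? "missing" : Bytes.toString(value));
        }
    }
}
```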

MapReduce is not always the best algorithm

MapReduce is a profound idea: taking a simple functional programming operation and applying it, in parallel, to gigabytes or terabytes of data. But there is a price. For that parallelism, each MR operation needs to be independent of all the others. If you need to know everything that has gone before, you have a problem. Such problems can be helped by:

  • Iteration: run multiple MR jobs, with the output of one being the input to the next (see the driver sketch after this list).
  • Shared state information: HBase is one option to consider here; otherwise, something like memcache can work.
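Here is a minimal driver sketch of the iteration approach; the mapper/reducer classes named in the comments are placeholders you would supply, and the paths come from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path raw = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path result = new Path(args[2]);

        // Pass 1: read the raw data, write an intermediate dataset.
        Job pass1 = Job.getInstance(conf, "pass 1");
        pass1.setJarByClass(ChainedDriver.class);
        // pass1.setMapperClass(Pass1Mapper.class);   // placeholder
        // pass1.setReducerClass(Pass1Reducer.class); // placeholder
        FileInputFormat.addInputPath(pass1, raw);
        FileOutputFormat.setOutputPath(pass1, intermediate);
        if (!pass1.waitForCompletion(true)) System.exit(1);

        // Pass 2: the previous job's output directory is this job's input.
        Job pass2 = Job.getInstance(conf, "pass 2");
        pass2.setJarByClass(ChainedDriver.class);
        // pass2.setMapperClass(Pass2Mapper.class);   // placeholder
        FileInputFormat.addInputPath(pass2, intermediate);
        FileOutputFormat.setOutputPath(pass2, result);
        System.exit(pass2.waitForCompletion(true) ? 0 : 1);
    }
}
```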

Do not try to remember things in shared variables, as they are only remembered in a single JVM, for the life of that JVM. That is the wrong way to work in a massively parallel environment.

Hadoop and MapReduce are not the place to learn Java programming

The Hadoop APIs and documentation currently make a lot of assumptions: that you know the basics of Java programming, and that you recognise the common error messages you get when things don't work. If you do not know about classpaths, or how to compile and debug Java code, step back from Hadoop and learn a bit more about Java before proceeding.

Hadoop is not an ideal place to learn networking error messages

You will find things work a lot easier if you are already familiar with networking and its common error messages - for example, what "Connection Refused" means, and how it differs from "No Route to Host".
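If you want to see the difference for yourself, this small self-contained probe (an illustration, not a Hadoop tool) separates the two failures using the standard java.net exceptions:

```java
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.Socket;

public class PortProbe {
    public static void main(String[] args) throws Exception {
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 5000);
            System.out.println("Connected: something is listening on " + host + ":" + port);
        } catch (ConnectException e) {
            // The host is reachable, but nothing is listening on that port
            // (or a firewall actively rejected the connection).
            System.out.println("Connection refused: " + e.getMessage());
        } catch (NoRouteToHostException e) {
            // A routing or firewall problem: packets cannot reach the host at all.
            System.out.println("No route to host: " + e.getMessage());
        }
    }
}
```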

A lot of people post to the user list with problems related to "Connection Refused", "No Route to Host" and other common TCP/IP-level errors. These are usually signs of an invalid cluster configuration, some parts of the cluster not running, or machines being unable to talk to each other on the LAN. People on the mailing list cannot debug your network configuration for you, as it is your network, not theirs. They can point you at some tools and tests to try, but since every email round trip takes a day, you won't find this a very fast way to get help.

Nobody on the Hadoop team is deliberately trying to make things hard; it's just that when things do not work in a large distributed system, you get some interesting error messages. If you can help improve those network messages or diagnostics, we would love to have that code.

Hadoop clusters are not a place to learn Unix/Linux system administration

You need to know your way around a Unix/Linux system: how to install it, what the various files in /etc/ are for, how to set up networking, what a good hosts table looks like, how to debug DNS problems, why to keep logs on a separate disk from the root disk, and so on. If you cannot look after a single machine, you aren't going to be able to handle a cluster of 80 of them. That said, don't try to maintain those 80+ boxes by hand-editing files like /etc/hosts, because that technique doesn't scale.

Things you need to know

  • SSH, what it is, how to set up authorized_keys, how to use ssh and scp
  • ifconfig, nslookup and other network config/diagnostics tools
  • How your platform keeps itself up to date
  • What log files your machine generates, and what they mean
  • How to set up native filesystems and mount them

This is important. If you don't know these things, you are out of your depth and should not start installing Hadoop until you have a couple of Linux systems up and running: machines you can ssh into without entering a password, that know each other's hostnames, and such like. The Hadoop installation documents all assume you can do these things, and aren't going to bother explaining them.
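One small self-check worth doing early, sketched here as a standalone class rather than anything shipped with Hadoop: confirm that each machine can resolve the hostnames of its peers before you blame Hadoop for the resulting errors.

```java
import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        // Print this machine's own idea of its hostname first.
        System.out.println("local: " + InetAddress.getLocalHost().getCanonicalHostName());
        // Then try to resolve each peer hostname passed on the command line.
        for (String host : args) {
            try {
                InetAddress a = InetAddress.getByName(host);
                System.out.println(host + " -> " + a.getHostAddress());
            } catch (java.net.UnknownHostException e) {
                System.out.println(host + " -> UNRESOLVED (check /etc/hosts or DNS)");
            }
        }
    }
}
```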

HDFS is not a POSIX filesystem

The POSIX filesystem model has files that can be appended to, seek() calls made on them, and locks taken out. You cannot seamlessly move code that assumes all filesystems are POSIX-compatible over to HDFS.
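A short sketch against the org.apache.hadoop.fs client API illustrates the mismatch; the path below is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSemantics {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/example.txt"); // hypothetical path

        // Writes: you can create a file and stream bytes into it...
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.writeBytes("hello\n");
            // ...but FSDataOutputStream has no seek(): you cannot go back
            // and overwrite bytes in the middle of a file the way a POSIX
            // write after lseek() can.
        }

        // Reads: seek() *is* supported on input streams.
        try (FSDataInputStream in = fs.open(p)) {
            in.seek(3);
            System.out.println((char) in.read()); // prints 'l'
        }

        // And there is no user-level file locking comparable to fcntl()/flock().
    }
}
```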


Source: http://wiki.apache.org/hadoop/HadoopIsNot
