Why the days are numbered for Hadoop as we know it

Hadoop is everywhere. For better or worse, it has become synonymous with big data. In just a few years it has gone from a fringe technology to the de facto standard. Want to be big data or enterprise analytics or BI-compliant? You’d better play well with Hadoop.

It’s therefore far from controversial to say that Hadoop is firmly planted in the enterprise as the big data standard and will likely remain firmly entrenched for at least another decade. But, building on some previous discussion, I’m going to go out on a limb and ask, “Is the enterprise buying into a technology whose best day has already passed?”

First, there were Google File System and Google MapReduce

To study this question we need to return to Hadoop’s inspiration – Google’s MapReduce. Confronted with a data explosion, Google engineers including Jeff Dean and Sanjay Ghemawat architected (and published!) two seminal systems: the Google File System (GFS) and Google MapReduce (GMR). The former was a brilliantly pragmatic solution to exabyte-scale data management using commodity hardware. The latter was an equally brilliant implementation of a long-standing design pattern applied to massively parallel processing of said data on said commodity machines.

GMR’s brilliance was to make big data processing approachable to Google’s typical user/developer and to make it fast and fault tolerant. Simply put, it boiled data processing at scale down to the bare essentials and took care of everything else. GFS and GMR became the core of the processing engine used to crawl, analyze, and rank web pages into the giant inverted index that we all use daily at google.com. This was clearly a major advantage for Google.

Enter reverse engineering in the open source world, and, voila, Apache Hadoop — comprised of the Hadoop Distributed File System and Hadoop MapReduce — was born in the image of GFS and GMR. Yes, Hadoop is developing into an ecosystem of projects that touch nearly all parts of data management and processing. But, at its core, it is a MapReduce system. Your code is turned into map and reduce jobs, and Hadoop runs those jobs for you.
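To make the map-and-reduce split concrete, here is a minimal single-process sketch in Python that builds a tiny inverted index. It is not Hadoop's API. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are made up for illustration, and the "cluster" is just an in-memory dictionary.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per word in a document."""
    for word in text.lower().split():
        yield word, doc_id

def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Reduce: collapse the postings for one word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))

def build_inverted_index(docs):
    # Every document is read on every run; this is the batch behaviour
    # the rest of the article contrasts with incremental systems.
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())

# Hypothetical sample corpus.
docs = {"d1": "hadoop runs map jobs", "d2": "hadoop runs reduce jobs"}
index = build_inverted_index(docs)
```

In a real deployment the map calls run in parallel across machines and the shuffle moves data over the network, but the programmer still writes only the two small functions.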

Then Google evolved. Can Hadoop catch up?

Most interesting to me, however, is that GMR no longer holds such prominence in the Google stack. Just as the enterprise is locking into MapReduce, Google seems to be moving past it. In fact, many of the technologies I’m going to discuss below aren’t even new; they date back to the second half of the last decade, mere years after the seminal GMR paper was in print.


Here are the technologies that I hope will ultimately seed the post-Hadoop era. While many Apache projects and commercial Hadoop distributions are actively trying to address some of the issues below via technologies and features such as HBase, Hive and Next-Generation MapReduce (aka YARN), it is my opinion that it will require new, non-MapReduce-based architectures that leverage the Hadoop core (HDFS and Zookeeper) to truly compete with Google’s technology. (A more technical exposition with published benchmarks is available at http://www.slideshare.net/mlmilleratmit/gluecon-miller-horizon.)

Percolator for incremental indexing and analysis of frequently changing datasets. Hadoop is a big machine. Once you get it up to speed it’s great at crunching your data. Get the disks spinning forward as fast as you can. However, each time you want to analyze the data (say after adding, modifying or deleting data) you have to stream over the entire dataset. If your dataset is always growing, this means your analysis time also grows without bound.

So, how does Google manage to make its search results increasingly real-time? By displacing GMR in favor of an incremental processing engine called Percolator. By dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Google was able to dramatically decrease the time to value. As the authors of the Percolator paper write, “[C]onverting the indexing system to an incremental system … reduced the average document processing latency by a factor of 100.” This means that new content on the Web could be indexed 100 times faster than possible using the MapReduce system!
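The idea of touching only what changed can be sketched in a few lines of Python. This is an illustration of incremental index maintenance in general, not Percolator's actual design (which is built on distributed transactions and observers). The `IncrementalIndex` class and its method names are hypothetical.

```python
from collections import defaultdict

class IncrementalIndex:
    """Inverted index that is updated per document instead of rebuilt."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of documents containing it
        self.doc_words = {}               # doc id -> words currently indexed for it

    def upsert(self, doc_id, text):
        """Add or modify one document; return how many postings were touched."""
        new_words = set(text.lower().split())
        old_words = self.doc_words.get(doc_id, set())
        for word in old_words - new_words:   # words that disappeared
            self._discard(word, doc_id)
        for word in new_words - old_words:   # words that appeared
            self.postings[word].add(doc_id)
        self.doc_words[doc_id] = new_words
        return len(old_words ^ new_words)

    def delete(self, doc_id):
        """Remove one document without scanning any other document."""
        for word in self.doc_words.pop(doc_id, set()):
            self._discard(word, doc_id)

    def _discard(self, word, doc_id):
        self.postings[word].discard(doc_id)
        if not self.postings[word]:
            del self.postings[word]

    def lookup(self, word):
        return sorted(self.postings.get(word, ()))

# Hypothetical usage: editing d1 touches two postings, not the whole corpus.
idx = IncrementalIndex()
idx.upsert("d1", "hadoop batch jobs")
idx.upsert("d2", "percolator incremental jobs")
touched = idx.upsert("d1", "hadoop incremental jobs")
```

The work per update is proportional to the size of the change, so analysis time no longer grows with the total size of the dataset.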

Coming from the Large Hadron Collider (an ever-growing big data corpus), this topic is near and dear to my heart. Some datasets simply never stop growing. It is why we baked a similar approach deep into the Cloudant data layer service, it is why trigger-based processing is now available in HBase, and it is a primary reason that Twitter Storm is gaining momentum for real-time processing of stream data.

Dremel for ad hoc analytics. Google and the Hadoop ecosystem worked very hard to make MapReduce an approachable tool for ad hoc analyses. From Sawzall through Pig and Hive, many interface layers have been built. Yet, for all of the SQL-like familiarity, they ignore one fundamental reality – MapReduce (and thereby Hadoop) is purpose-built for organized data processing (jobs). It is baked from the core for workflows, not ad hoc exploration.

In stark contrast, many BI/analytics queries are fundamentally ad hoc, interactive, low-latency analyses. Not only is writing map and reduce workflows prohibitive for many analysts, but waiting minutes for jobs to start and hours for workflows to complete is not conducive to the interactive experience. Therefore, Google invented Dremel (now exposed as the BigQuery product) as a purpose-built tool to allow analysts to scan over petabytes of data in seconds to answer ad hoc queries and, presumably, power compelling visualizations.

Google BigQuery

Google’s Dremel paper says it is “capable of running aggregation queries over trillions of rows in seconds,” and the same paper notes that running identical queries in standard MapReduce is approximately 100 times slower than in Dremel. Most impressive, however, is real world data from production systems at Google, where the vast majority of Dremel queries complete in less than 10 seconds, a time well below the typical latencies of even beginning execution of a MapReduce workflow and its associated jobs.
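The Dremel paper attributes much of this speed to two ideas not spelled out above: columnar storage, so a query reads only the columns it names, and a tree of servers that compute partial aggregates in parallel and merge them. The Python sketch below imitates both on a toy table. The table contents, column names, and function names are invented for illustration, and the "leaf servers" are ordinary function calls.

```python
from collections import Counter

# Hypothetical table stored column-wise: one list per column, same row order.
table = {
    "country": ["US", "DE", "US", "FR", "DE", "US"],
    "latency_ms": [120, 80, 100, 90, 70, 110],
}

def partition(columns, n_parts):
    """Split every column into contiguous row ranges, one per leaf server."""
    n_rows = len(next(iter(columns.values())))
    step = -(-n_rows // n_parts)  # ceiling division
    for start in range(0, n_rows, step):
        yield {name: col[start:start + step] for name, col in columns.items()}

def leaf_aggregate(part, key_col, value_col):
    """Leaf: scan only the two columns the query names; return partial sums and counts."""
    sums, counts = Counter(), Counter()
    for key, value in zip(part[key_col], part[value_col]):
        sums[key] += value
        counts[key] += 1
    return sums, counts

def root_merge(partials):
    """Root: merge the partial results and finish the average per key."""
    total_sums, total_counts = Counter(), Counter()
    for sums, counts in partials:
        total_sums.update(sums)
        total_counts.update(counts)
    return {key: total_sums[key] / total_counts[key] for key in total_counts}

# Roughly: SELECT country, AVG(latency_ms) FROM table GROUP BY country
partials = [leaf_aggregate(p, "country", "latency_ms") for p in partition(table, 3)]
avg_latency = root_merge(partials)
```

No job is scheduled and no intermediate data is written to disk between stages, which is a large part of why such a query can start answering immediately.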

Interestingly, I’m not aware of any compelling open source alternatives to Dremel at the time of this writing and consider this a fantastic BI/analytics opportunity.

Pregel for analyzing graph data. Google MapReduce was purpose-built for crawling and analyzing the world’s largest graph data structure – the internet. However, certain core assumptions of MapReduce are at fundamental odds with analyzing networks of people, telecommunications equipment, documents and other graph data structures. For example, calculation of the single-source shortest path (SSSP) through a graph requires copying the graph forward to future MapReduce passes, an amazingly inefficient approach and simply untenable at scale.

Therefore, Google built Pregel, a large bulk synchronous processing application for petabyte-scale graph processing on distributed commodity machines. The results are impressive. In contrast to Hadoop, which often causes exponential data amplification in graph processing, Pregel is able to naturally and efficiently execute graph algorithms such as SSSP or PageRank in dramatically shorter time and with significantly less complicated code. Most stunning is the published data demonstrating processing on billions of nodes with trillions of edges in mere minutes, with a near linear scaling of execution time with graph size.
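For a feel of the vertex-centric, bulk synchronous style, here is a small single-machine Python sketch of SSSP. It follows the pattern described in the Pregel paper (vertices receive messages, update their value, send messages along out-edges, and the computation halts when no messages remain), but it is not Pregel's API. The function name, the graph encoding, and the sample graph are assumptions, and edge weights are assumed non-negative.

```python
import math

def pregel_sssp(edges, source):
    """Single-source shortest paths in supersteps.

    edges: {vertex: [(neighbor, weight), ...]} with non-negative weights.
    Returns (distances, number_of_supersteps).
    """
    vertices = set(edges) | {n for outs in edges.values() for n, _ in outs}
    distance = {v: math.inf for v in vertices}
    inbox = {source: [0]}   # messages to be delivered in the next superstep
    supersteps = 0
    while inbox:            # halt once no vertex has pending messages
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                # The vertex improved, so it tells its neighbors; the graph
                # itself stays in place and only small messages move.
                distance[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox      # synchronization barrier between supersteps
        supersteps += 1
    return distance, supersteps

# Hypothetical sample graph.
graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 1)],
    "d": [],
}
distances, steps = pregel_sssp(graph, "a")
```

Compare this with a chain of MapReduce jobs, where each pass would have to re-read and re-emit the whole graph just to carry the edges forward to the next pass.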

At the time of writing, the only viable option in the open source world is Giraph, an early Apache incubator project that leverages HDFS and Zookeeper. There’s another project called Golden Orb available on GitHub.

In summary, Hadoop is an incredible tool for large-scale data processing on clusters of commodity hardware. But if you’re trying to process dynamic data sets, ad hoc analytics or graph data structures, Google’s own actions clearly demonstrate better alternatives to the MapReduce paradigm. Percolator, Dremel and Pregel make an impressive trio and comprise the new canon of big data. I would be shocked if their impact on IT were not similar to that of Google’s original big three of GFS, GMR, and BigTable.

Mike Miller (@mlmilleratmit) is chief scientist and co-founder at Cloudant, and Affiliate Professor of Particle Physics at the University of Washington.

Feature image courtesy of Shutterstock user Jason Prince; evolution of the wheel image courtesy of Shutterstock user James Steidl.

Reference: https://gigaom.com/2012/07/07/why-the-days-are-numbered-for-hadoop-as-we-know-it/
