June 2013_macyang

Reposted: The Secret To 10 Million Concurrent Connections - The Kernel Is The Problem, Not The Solution

Now that we have the C10K concurrent connection problem licked, how do we level up and support 10 million concurrent connections? Impossible you say. Nope, systems right now are delivering 10 millio

2013-06-29 20:53:18 1199

Reposted: Bitmap Index vs. B-tree Index: Which and When?

Understanding the proper application of each index can have a big impact on performance. Published 2005. Conventional wisdom holds that bitmap indexes are most appropriate for columns having low

2013-06-29 10:14:15 1130

Reposted: Indexing Strategies for Optimizing Queries on MySQL

This article on MySQL indexing begins by reviewing how indexes work, as well as their structure. Next, it reviews indexing features specific to each of the major MySQL data storage engines. This article then examines

2013-06-29 09:49:30 842

Reposted: TAO: The power of the graph

Facebook puts an extremely demanding workload on its data backend. Every time any one of over a billion active users visits Facebook through a desktop browser or on a mobile device, they are presented

2013-06-27 22:04:11 1056

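Original: Lost task tracker issue in CDH 4.1.2

Today I helped a colleague troubleshoot a job that was running for too long. After the task was killed, the error message was “Lost task tracker: tracker_xxxxxx”. The job history showed “Stage-2 map = 100%, reduce = 100%” being printed for a long time, so I suspected that dumping the file was taking too long. Checking the code, I then found that her SQL contained a join of two large tables (900 million+ * 97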

2013-06-27 13:54:57 1790

Reposted: Improvements in the Hadoop YARN Fair Scheduler

Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more powerful Fair Scheduler. In addition to doing nearly all that it could do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule n

2013-06-24 22:26:34 1088

Reposted: Wormhole pub/sub system: Moving data through space and time

Over the last couple of years, we have built and deployed a reliable publish-subscribe system called Wormhole. Wormhole has become a critical part of Facebook's software infrastructure. At a high leve

2013-06-17 19:23:45 1457

Reposted: Moving Hadoop Beyond Batch with Apache YARN

Apache Hadoop 2.0 continues to make its way through the open source community process at the Apache Software Foundation and is getting closer to being declared “ready” from a community development per

2013-06-16 21:07:29 1041

Reposted: Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing

At Yahoo!, Hadoop plays a central role in providing personalized experiences for our users and creating value for our advertisers. To serve Yahoo!’s emerging business needs, the Cloud Engineering Grou

2013-06-16 20:56:11 1319

Reposted: 28msec - query data from any source in real time

Derrick Harris, writing about 28msec, a still-in-stealth-mode generic query language: Their solution was to create a platform able to extract data from any of these sources, transform it into a sta

2013-06-14 17:33:18 837

Reposted: Hadoop and the EDW

Rob Klopp summarizes a whitepaper published by Cloudera and Teradata: Simply put, Hadoop becomes the staging area for “raw data streams” while the EDW stores data from “operational systems”. Ha

2013-06-14 17:14:10 904

Reposted: Optimizing Joins running on HDInsight Hive on Azure at GFS

Introduction: To analyze hardware utilization within their data centers, Microsoft’s Online Services Division – Global Foundation Services (GFS) is working with Hadoop / Hive via HDInsight on Azure.

2013-06-14 17:04:24 1076

Reposted: Migration to the New Metrics Hotness – Metrics2

Introduction: HBase is a distributed big data store modeled after Google’s Bigtable paper. As with all distributed systems, knowing what’s happening at a given time can help spot problems before

2013-06-14 16:55:12 802

Reposted: Storing Big Data With Hive: RCFile

Are sequence files or RCFile (Record Columnar File) the best way to store big data in Hive? There are reasons to use text on the periphery of an ETL process, as the previous post discussed. (See: S

2013-06-14 16:50:50 1357

Reposted: Introduction to HBase Mean Time to Recover (MTTR)

HBase is an always-available service and remains available in the face of machine failures and rack failures. Machines in the cluster run RegionServer daemons. When a RegionServer crashes or the mach

2013-06-14 16:41:12 1053

Reposted: HBase - Who needs a Master?

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the cas

2013-06-14 16:10:55 831

Mac Track