From OSDI'10

Summary:

A growing body of work focuses on large-scale data-parallel computing.

This paper, which presents Mantri, is the first to characterize the prevalence of stragglers in production and their various causes: i) machine characteristics, covering both hardware reliability and run-time contention for processor, memory and other resources; ii) network characteristics, with varying bandwidths and congestion along paths; iii) imbalance in workload among tasks. By understanding these causes, addressing stragglers early, and scheduling duplicates only when there is a fair chance that the speculation saves both time and resources, Mantri greatly reduces job completion time while using fewer resources than prior strategies that duplicate tasks towards the end of a phase.
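The idea of duplicating a task only when speculation is likely to save both time and resources can be illustrated with a small sketch. This is a simplified, deterministic rule written for these notes, not the paper's actual policy: the function name `should_duplicate` and the parameters `t_rem` (estimated remaining time of the running copies), `t_new` (estimated run time of a fresh copy) and `copies_running` are hypothetical, and the estimates are assumed to be given.

```python
def should_duplicate(t_rem: float, t_new: float, copies_running: int = 1) -> bool:
    """Decide whether to launch one more copy of a straggling task.

    Illustrative rule only (an assumption of these notes, not Mantri's exact policy):
      - time is saved if a fresh copy would finish before the running copies;
      - resources are saved if running (copies_running + 1) copies for t_new
        costs less than running the current copies for t_rem.
    """
    if copies_running < 1:
        raise ValueError("copies_running must be at least 1")
    if t_rem < 0 or t_new < 0:
        raise ValueError("time estimates must be non-negative")

    saves_time = t_new < t_rem
    # Slot-time consumed if we duplicate vs. if we just wait.
    cost_with_duplicate = (copies_running + 1) * t_new
    cost_without_duplicate = copies_running * t_rem
    saves_resources = cost_with_duplicate < cost_without_duplicate

    return saves_time and saves_resources


if __name__ == "__main__":
    # One copy running with 100s left; a fresh copy would take 40s.
    print(should_duplicate(t_rem=100, t_new=40))
    # Same straggler, but a fresh copy would take 60s: faster, yet costs more in total.
    print(should_duplicate(t_rem=100, t_new=60))
```

Under this rule a duplicate that would merely finish sooner is not enough; it must also lower the total slot-time consumed, which is what distinguishes a resource-aware approach from duplicating every slow task near the end of a phase.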

Related work mentioned in the paper that is worth reading:

Dryad, which investigates programming models for writing parallel and distributed programs that scale from a small cluster to a large data center (by Microsoft Research).

LATE (Longest Approximate Time to End), a speculative-execution scheduling algorithm that is highly robust to heterogeneity.

Hadoop, MapReduce