MapReduce is a good fit for problems
that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
MapReduce suits applications where the data is written once, and read many
times, whereas a relational database is good for datasets that are continually updated.
MapReduce works well on unstructured or semistructured
data, since it is designed to interpret the data at processing time.
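As a rough illustration of interpreting data at processing time, the sketch below applies a structure to raw text lines only when they are read. The three-field record layout (host, status, size) and the name parse_record are assumptions made for this example, not anything defined by MapReduce.

```python
def parse_record(line):
    """Interpret one raw text line at processing time.

    Returns a dict for lines that fit the assumed layout, or None otherwise.
    """
    parts = line.split()
    if len(parts) != 3:
        return None  # line does not match the assumed three-field layout
    host, status, size = parts
    if not (status.isdigit() and size.isdigit()):
        return None  # fields present but not numeric where expected
    return {"host": host, "status": int(status), "bytes": int(size)}


# The stored data is plain text; no schema was enforced when it was written.
raw_lines = [
    "10.0.0.1 200 512",
    "malformed line",
    "10.0.0.2 404 0",
]

# Structure is imposed here, at read time, and unusable lines are skipped.
records = [r for r in map(parse_record, raw_lines) if r is not None]
```

Because the interpretation lives in the processing code, a different analysis could read the same lines with a different parser without reloading the data.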
MapReduce is a linearly scalable programming model.
Moving data to the compute nodes over the network becomes a problem when nodes need to
access larger data volumes (hundreds of gigabytes, the point at which MapReduce really
starts to shine), since the network bandwidth is the bottleneck and compute nodes
become idle.
MapReduce tries to colocate the data with the compute node, so data access is fast
since it is local. This feature, known as data locality, is at the heart of MapReduce and
is the reason for its good performance.
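The idea of data locality can be sketched as a placement rule: run each task on a node that already stores its input block, and fall back to a remote node only when no such node has capacity. The greedy function below is a simplified illustration, not the scheduler Hadoop actually uses; assign_tasks, the block and node names, and the slot counts are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each block to a node, preferring one that already stores it.

    block_locations maps a block id to the nodes holding a copy of it.
    free_slots maps a node to the number of tasks it can still accept.
    Returns {block: (node, is_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignments = {}
    for block, holders in block_locations.items():
        # First choice: a node that holds the block and has a free slot.
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node = local[0]
        else:
            # Fallback: any node with a free slot; data must cross the network.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free slots left")
            node = remote[0]
        slots[node] -= 1
        assignments[block] = (node, node in holders)
    return assignments


# Hypothetical cluster: n3 holds b3 but has no free capacity.
block_locations = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]}
free_slots = {"n1": 1, "n2": 1, "n3": 0, "n4": 1}
plan = assign_tasks(block_locations, free_slots)
```

In this made-up cluster only b1 ends up running where its data lives; b2 and b3 have to read over the network, which is the case the locality preference tries to keep rare.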
MPI (the Message Passing Interface) gives programmers great control, but requires them to
handle explicitly both the mechanics of the data flow, exposed via low-level C routines and
constructs such as sockets, and the higher-level algorithm for the analysis. MapReduce operates
only at the higher level: the programmer thinks in terms of functions of key and value
pairs, and the data flow is implicit.
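To make the key-value style concrete, here is a minimal single-process sketch of a word count. The programmer writes only map_fn and reduce_fn; the run_job helper is a hypothetical stand-in for the framework, which in a real system would handle grouping and moving data between machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: (line offset, line text) -> a (word, 1) pair for each word."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: (word, list of counts) -> (word, total count)."""
    yield key, sum(values)


def run_job(records, map_fn, reduce_fn):
    """Toy stand-in for the framework: group map output by key, then reduce.

    The user functions never see this data flow; it is implicit to them.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):  # reduce input is processed in key order
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output


# Input records are (offset, line) pairs; the offsets are illustrative only.
lines = [(0, "the quick brown fox"), (20, "the lazy dog")]
result = run_job(lines, map_fn, reduce_fn)
```

Neither function mentions sockets, nodes, or message passing, which is the contrast with the MPI style described above.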
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated
hardware running in a single data center with very high aggregate bandwidth interconnects.