Algorithm:

http://www.norstad.org/matrix-multiply/index.html

A classic summary of a MapReduce algorithm for matrix multiplication, including four blocking strategies.
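
As a rough illustration of the idea, here is a minimal one-pass sketch in plain Python. It is not one of the four strategies from the page: it keys every value by a single output cell (i, k) rather than by a block, and the function names (map_phase, shuffle, reduce_phase, mapreduce_matmul) are made up for this example. A blocked strategy would group many cells under one key to cut down the number of emitted pairs.

```python
from collections import defaultdict

def map_phase(A, B):
    """Yield ((i, k), (tag, j, value)) pairs for C = A x B."""
    n_rows, n_inner, n_cols = len(A), len(B), len(B[0])
    # A[i][j] is needed by every output cell in row i.
    for i in range(n_rows):
        for j in range(n_inner):
            for k in range(n_cols):
                yield (i, k), ("A", j, A[i][j])
    # B[j][k] is needed by every output cell in column k.
    for j in range(n_inner):
        for k in range(n_cols):
            for i in range(n_rows):
                yield (i, k), ("B", j, B[j][k])

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Pair A and B values on the shared index j and sum the products."""
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    return sum(a[j] * b[j] for j in a)

def mapreduce_matmul(A, B):
    groups = shuffle(map_phase(A, B))
    C = [[0] * len(B[0]) for _ in range(len(A))]
    for (i, k), values in groups.items():
        C[i][k] = reduce_phase(values)
    return C

print(mapreduce_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Every input element is copied once per output cell that uses it, which is the replication that the blocking strategies try to reduce.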


Anatomy of High-Performance Matrix Multiplication

The paper behind GotoBLAS. It analyzes different blocking strategies for a hierarchical memory.

OpenBLAS, which is based on GotoBLAS, is the version now under active maintenance.
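
To make "blocking" concrete, here is a minimal sketch of plain loop tiling in Python. It only shows the basic idea of working on one small tile at a time so that the tile can stay in fast memory. It does not reproduce the paper's packing or kernel design, and the function name blocked_matmul and the block parameter are assumptions made for this example.

```python
def blocked_matmul(A, B, block=2):
    """Multiply A (n x m) by B (m x p), visiting the work tile by tile."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    # Outer three loops pick a tile; inner three loops work inside it.
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # min(...) handles edge tiles when block does not divide the size.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        a_ik = A[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            C[i][j] += a_ik * B[k][j]
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(blocked_matmul(A, B, block=2))
```

In Python the tiling brings no speedup; the point is the loop structure. In a compiled language the block size would be tuned to the cache level being targeted.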


Cost:

Upper and Lower Bounds on the Cost of a Map-Reduce Computation

This paper models the tradeoff between parallelism and communication: in general, more parallelism requires more replication of the inputs and therefore more communication. The paper works through three examples, including matrix multiplication.
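
The tradeoff can be seen by simple counting on one hypothetical scheme, which is an assumption of this note rather than something taken from the paper: split the rows of A and the columns of B (both n x n) into groups of size s, and give each reducer one row group and one column group. The function name square_tiling_cost is made up for this sketch.

```python
def square_tiling_cost(n, s):
    """Return (reducers, inputs per reducer, replication) for group size s."""
    assert n % s == 0, "this sketch assumes s divides n"
    groups = n // s
    reducers = groups * groups       # one reducer per (row group, column group)
    reducer_size = 2 * s * n         # s rows of A plus s columns of B
    total_sent = reducers * reducer_size
    inputs = 2 * n * n               # elements of A and B together
    replication = total_sent / inputs
    return reducers, reducer_size, replication

for s in (1, 2, 4, 8):
    print(s, square_tiling_cost(8, s))
```

For n = 8, going from s = 8 to s = 1 raises the number of reducers from 1 to 64 while the replication rises from 1 to 8. In this scheme the replication equals n / s, so halving the reducer size doubles the communication.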


http://www.gordon-taft.net/MatrixMultiplication.html

It summarizes the types of cache misses and the main cache principles for matrix multiplication.