What are some good class projects for machine learning using MapReduce?
We are looking for a (not necessarily academic) project for a class in which we are learning to implement various machine learning algorithms on the MapReduce framework using AWS. The project should meet the following criteria:
1. Low time spent on cleaning the data
2. Parallelizable across ~3 people (e.g., each person could try a different method, and the results could then be combined into an ensemble)
3. Duration: ~2-3 hours/(week * student) * 2-3 students/group * 4 weeks * 7 groups
4. Have one or more verifiable sub-results (so that students have a way of knowing they are on the right track)
5. (Optional) Have an open ended question that could possibly be pursued by the more enthusiastic folks.
Try implementing some ML algorithms not yet covered in Apache Mahout. See: What are some important algorithms not yet covered in Mahout? and What are the top 10 data mining or machine learning algorithms?
See open items: https://cwiki.apache.org/
1) Matrix Decomposition routines (QR, Cholesky, etc.)
- Numerical Recipes: http://www.nr.com/
- Matrix factorization algorithms: http://bickson.blogspot.com/2011...
2) Decision Trees with ID3, C4.5 or other heuristic ( https://issues.apache.or
Tutorials: Decision Trees: What are some good resources for learning about decision trees?
Note: Mahout appears to have a partial implementation of random decision forests, which you may be able to use to test your code (if questions arise, ask on the Mahout mailing list; the community there is very helpful):
https://cwiki.apache.org/
https://cwiki.apache.org/
https://cwiki.apache.org/
3) Linear Regression https://cwiki.apache.org/
4) Gradient Descent and other optimization and linear programming algorithms. See Convex Optimization: What are some good resources for learning about distributed optimization?, What are some fast gradient descent algorithms?, and the Matlab optimization toolbox: http://www.mathworks.com/
5) AdaBoost and other meta-algorithms: http://en.wikipedia.org/w
6) SVM: https://issues.apache.org
7) Vector space models http://en.wikipedia.org/w
8) Hidden Markov Models - an extremely popular method in NLP & bioinformatics. See Hidden Markov Models: What are some good resources for learning about Hidden Markov Models? and https://issues.apache.org
9) Slope One by Daniel Lemire: http://en.wikipedia.org/w
10) DFT/FFT, Wavelets, z-transform, other popular signal and image processing transforms, see Matlab Signal Processing toolbox: http://www.mathworks.com/
11) PageRank, here is a good tutorial: http://michaelnielsen.org
12) Build an eigensolver: http://www.cs.cmu.edu/~uk
13) For a wealth of open ended problems see Programming Challenges: What are some good "toy problems" in data science?
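As an illustration of how one of the items above (PageRank, item 11) maps onto the map/reduce pattern, here is a minimal in-memory sketch. It is not taken from the linked tutorial or from Mahout. The function names, the damping factor of 0.85, and the toy graph are all assumptions made for illustration, and the sketch assumes every node has at least one out-link and that every link target is itself a key in the graph.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor (a common conventional choice)


def pagerank_map(node, rank, out_links):
    """Map step: split this node's rank evenly among its out-links."""
    # Emit a zero contribution for the node itself so that it still
    # reaches the reduce step even if nothing links to it.
    yield node, 0.0
    share = rank / len(out_links)  # assumes out_links is non-empty
    for target in out_links:
        yield target, share


def pagerank_reduce(contributions, num_nodes):
    """Reduce step: combine incoming shares with the teleport term."""
    return (1.0 - DAMPING) / num_nodes + DAMPING * sum(contributions)


def pagerank(graph, iterations=50):
    """Run a fixed number of map -> shuffle -> reduce rounds in memory."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        grouped = defaultdict(list)  # shuffle: group emitted values by key
        for node, out_links in graph.items():
            for key, value in pagerank_map(node, ranks[node], out_links):
                grouped[key].append(value)
        ranks = {node: pagerank_reduce(vals, n) for node, vals in grouped.items()}
    return ranks


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical data
    for node, rank in sorted(pagerank(toy_graph).items()):
        print(node, round(rank, 4))
```

Under the stated assumptions the ranks should sum to 1 after every iteration, which gives students a simple check that they are on the right track before moving the same mapper and reducer onto Hadoop.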
Notes:
- See Jimmy Lin's book Data-Intensive Text Processing with MapReduce for some good tips: http://www.umiacs.umd.edu/~jimmy... and Tom White's great book on Hadoop: http://www.hadoopbook.com/
- Map-Reduce for Machine Learning on Multicore by Chu et al.: www-cs.stanford.edu/~ang/papers/...
- Muthu Muthukrishnan's MapReduce resources: http://www.cs.rutgers.edu/~muthu...
- Top 10 algorithms in data mining: http://www.mendeley.com/research...
- Large Data Logistic Regression (with example Hadoop code): http://www.win-vector.com/blog/2...
- A Comparison of Eight MapReduce Languages: http://www.dataspora.com/2011/04...
- Seven data-mining algorithms which are 200-400x faster on GPUs: http://www.smedirector.com/2010/... via Michael E Driscoll
- RecLab Core by Darren Erik Vengroff: http://code.richrelevance.com/re...
- Amund Tveit's links: http://atbrox.com/2011/05/16/map...
- Jeff Hammerbacher's links: http://www.mendeley.com/groups/1...
- MR bibliography I've compiled a while back: http://www.columbia.edu/~ak2834/...
- Scaling up machine learning: http://www.cs.umass.edu/~ronb/sc...
- Machine Learning: What are some good learning projects to teach oneself about machine learning?
- Implement the sequential version first, then parallelize with either Hadoop, or one of the alternatives (What are some promising open-source alternatives to Hadoop MapReduce for map/reduce?), or a self-made runtime; always abstract the MR logic away from the DFS. One of your teams could build a simple MapReduce engine; we did this for a term project (using an experimental language called X10) and it was fun.
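To make the last note concrete, here is a rough sketch of what a self-made, single-process MapReduce runtime could look like, with word count as the job. This is not the X10 engine mentioned above. The names run_mapreduce, wc_mapper and wc_reducer are hypothetical, and the runtime only sees an iterable of records, so the map/reduce logic stays independent of wherever the data is stored.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce round entirely in memory.

    records: any iterable of input records (a list, a file object, ...)
    mapper:  function(record) yielding (key, value) pairs
    reducer: function(key, values) yielding (key, result) pairs
    """
    # Map and shuffle: group every emitted value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)

    # Reduce: process keys in sorted order so the output is deterministic.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def wc_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in the line."""
    for word in line.lower().split():
        yield word, 1


def wc_reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    for word, count in run_mapreduce(lines, wc_mapper, wc_reducer):
        print(word, count)
```

Because the mapper and reducer are plain functions, the same pair can be tested sequentially here and later wrapped for Hadoop or another runtime without changing the algorithm itself.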