What are some good class projects for machine learning using MapReduce?

xingzhixi

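Copyright notice: This article is the blogger's original work, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/xingzhixi/article/details/6913467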

We are looking for a (not necessarily academic) class project for a course in which we are learning to implement various machine learning algorithms on the MapReduce framework using AWS. The project should meet the following criteria:

1. Low time spent on cleaning the data
2. Parallelizable to ~3 people (e.g., each person could try a different method, and the results could then be combined into an ensemble)
3. Duration: ~2-3 hours per student per week, with 2-3 students per group, over 4 weeks, for 7 groups
4. Has verifiable sub-result(s) (so that students have a way of knowing they are on the right track)
5. (Optional) Has an open-ended question that could be pursued by the more enthusiastic folks.

Try implementing some ML algorithms not yet covered in Apache Mahout. See "What are some important algorithms not yet covered in Mahout?" and "What are the top 10 data mining or machine learning algorithms?"

See open items: https://cwiki.apache.org/conflue... . You can also ask on the Mahout mailing list.

1) Matrix decomposition routines (QR, Cholesky, etc.)

http://en.wikipedia.org/wiki/Mat...

Numerical Recipes: http://www.nr.com/
Matrix factorization algorithms: http://bickson.blogspot.com/2011...
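
As a rough illustration of how a decomposition from item 1 might fit the map/reduce pattern: for a tall, skinny matrix A, mappers can emit one outer product per row, a reducer can sum them into the small Gram matrix A^T A, and a single machine can then run Cholesky on it (the transpose of the result is the R factor of a QR decomposition, up to signs). This is a minimal pure-Python sketch, not Mahout code. All function names are hypothetical, it assumes A has full column rank, and forming A^T A squares the condition number, so it is not a numerically robust method.

```python
import math
from functools import reduce

def map_row(row):
    """Map step: one matrix row -> its outer product with itself (n x n)."""
    return [[a * b for b in row] for a in row]

def reduce_sum(m1, m2):
    """Reduce step: element-wise sum of two n x n partial Gram matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def cholesky(g):
    """Return lower-triangular L with L * L^T == g (g symmetric positive definite)."""
    n = len(g)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(g[i][i] - s)
            else:
                L[i][j] = (g[i][j] - s) / L[j][j]
    return L

# Toy 3 x 2 matrix; in a real job each row would arrive at some mapper.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gram = reduce(reduce_sum, map(map_row, A))  # A^T A
L = cholesky(gram)                          # L^T plays the role of R in A = QR
```

A verifiable sub-result for students is that multiplying L by its transpose reproduces the Gram matrix.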

2) Decision Trees with ID3, C4.5, or another heuristic ( https://issues.apache.org/jira/b... ). This is one of the most popular algorithms in data mining, with countless applications.
Tutorials: see "Decision Trees: What are some good resources for learning about decision trees?"

Note: It looks like Mahout has a partial implementation of random decision forests, which you may be able to use to test your code. If questions arise, please ask on the Mahout mailing list; the community there is very helpful.
https://cwiki.apache.org/MAHOUT/...
https://cwiki.apache.org/MAHOUT/...
https://cwiki.apache.org/MAHOUT/...
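
For item 2, one piece that maps naturally onto map/reduce is the split criterion: the information gain used by ID3 depends only on counts of (feature value, class label) pairs, and those counts can be summed across machines. The sketch below is an assumption-laden toy in pure Python, not the Mahout implementation. The record layout and all function names are hypothetical.

```python
import math
from collections import Counter

def map_record(record, feature):
    """Map step: emit ((feature value, class label), 1) for one training record."""
    features, label = record
    return ((features[feature], label), 1)

def reduce_counts(pairs):
    """Reduce step: sum the counts for each (feature value, label) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(totals):
    """ID3 split criterion computed only from the reduced counts."""
    by_label, by_value = Counter(), {}
    for (value, label), c in totals.items():
        by_label[label] += c
        by_value.setdefault(value, Counter())[label] += c
    n = sum(by_label.values())
    remainder = sum(sum(cs.values()) / n * entropy(list(cs.values()))
                    for cs in by_value.values())
    return entropy(list(by_label.values())) - remainder

# Toy data: "outlook" separates the classes perfectly, "windy" not at all.
records = [
    ({"outlook": "sunny", "windy": True}, "no"),
    ({"outlook": "sunny", "windy": False}, "no"),
    ({"outlook": "rain", "windy": True}, "yes"),
    ({"outlook": "rain", "windy": False}, "yes"),
]
gain_outlook = information_gain(reduce_counts(map_record(r, "outlook") for r in records))
gain_windy = information_gain(reduce_counts(map_record(r, "windy") for r in records))
```

On this toy data the tree should choose "outlook" as its first split, which gives students a quick sanity check before they move to a real dataset.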

3) Linear Regression ( https://cwiki.apache.org/conflue... ), Ordinary Least Squares, or other linear least squares methods: http://en.wikipedia.org/wiki/Ord... . Also see the Matlab Statistics Toolbox for ideas: http://www.mathworks.com/help/to...
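
To make item 3 concrete, here is a hedged sketch of ordinary least squares for a single predictor. The closed-form slope and intercept depend only on five sums, so each mapper can emit per-point statistics and the reducer can add them component-wise. Function names and the two-shard toy data are assumptions for illustration only.

```python
from functools import reduce

def map_point(point):
    """Map step: one (x, y) pair -> (n, sum x, sum y, sum x*x, sum x*y)."""
    x, y = point
    return (1, x, y, x * x, x * y)

def combine(a, b):
    """Reduce step: sufficient statistics add component-wise."""
    return tuple(u + v for u, v in zip(a, b))

def solve(stats):
    """Closed-form OLS slope and intercept from the summed statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Two shards standing in for two input splits; the data follow y = 2x + 1.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
partials = [reduce(combine, map(map_point, shard)) for shard in shards]
slope, intercept = solve(reduce(combine, partials))
```

Because the statistics are associative and commutative, the same `combine` function can also serve as a combiner on each mapper, which keeps the shuffle small.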

4) Gradient Descent and other optimization and linear programming algorithms. See "Convex Optimization: What are some good resources for learning about distributed optimization?", "What are some fast gradient descent algorithms?", the Matlab Optimization Toolbox ( http://www.mathworks.com/help/to... ), and "Convex Optimization: Which optimization algorithms are good candidates for parallelization with MapReduce?"
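
For item 4, batch gradient descent parallelizes in a simple way when the loss is a sum over examples: each mapper computes the gradient over its shard, and the reducer adds the partial gradients before one parameter update. The following is a minimal sketch for a squared-error linear model. The learning rate, step count, and function names are assumptions chosen for this toy data, not recommended settings.

```python
def shard_gradient(shard, w, b):
    """Map step: summed squared-error gradient over one shard, plus its size."""
    gw = gb = 0.0
    for x, y in shard:
        err = w * x + b - y
        gw += err * x
        gb += err
    return gw, gb, len(shard)

def gradient_descent(shards, lr=0.1, steps=2000):
    """Fit y ~ w*x + b; each loop iteration stands in for one MapReduce job."""
    w = b = 0.0
    for _ in range(steps):
        parts = [shard_gradient(s, w, b) for s in shards]  # map
        gw = sum(p[0] for p in parts)                      # reduce
        gb = sum(p[1] for p in parts)
        n = sum(p[2] for p in parts)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Same toy data as a line y = 2x + 1, split across two shards.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
w, b = gradient_descent(shards)
```

One design point worth discussing in class is that every iteration is a separate job, so per-job startup cost can dominate on Hadoop; this is part of why iterative methods are a common motivation for alternative runtimes.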

5) AdaBoost and other meta-algorithms: http://en.wikipedia.org/wiki/Ada...

6) SVM: https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... . See also "Support Vector Machines: What is the best way to implement an SVM using Hadoop?"

7) Vector space models http://en.wikipedia.org/wiki/Vec...

8) Hidden Markov Models, an extremely popular method in NLP and bioinformatics. See "Hidden Markov Models: What are some good resources for learning about Hidden Markov Models?" and https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... , http://www.mendeley.com/c/424264...

9) Slope One by Daniel Lemire: http://en.wikipedia.org/wiki/Slo... or other Collaborative Filtering algorithms. See Mahout in Action by Sean Owen: http://www.manning.com/owen/
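
As a sketch of how item 9 could be split into map and reduce steps: a mapper handles one user and emits the rating difference for every ordered pair of items that user rated, and the reducer keeps a running sum and count per item pair. Predictions then use the weighted Slope One formula. The function names are hypothetical, and the three-user dataset is a made-up toy example.

```python
from collections import defaultdict
from itertools import permutations

def map_user(ratings):
    """Map step: for one user, emit ((item_i, item_j), r_i - r_j) per ordered pair."""
    for i, j in permutations(ratings, 2):
        yield (i, j), ratings[i] - ratings[j]

def reduce_diffs(pairs):
    """Reduce step: per item pair, keep the summed difference and the count."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, diff in pairs:
        sums[key] += diff
        counts[key] += 1
    return sums, counts

def predict(user_ratings, item, sums, counts):
    """Weighted Slope One prediction for an item the user has not rated."""
    num = den = 0.0
    for j, r in user_ratings.items():
        c = counts.get((item, j), 0)
        if c:
            num += (sums[(item, j)] / c + r) * c
            den += c
    return num / den if den else None

users = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
pairs = (kv for ratings in users.values() for kv in map_user(ratings))
sums, counts = reduce_diffs(pairs)
lucy_a = predict(users["lucy"], "A", sums, counts)  # 13/3, about 4.33
```

The number of emitted pairs grows quadratically with the number of items a user has rated, which is a good open-ended question for students who want to think about skew.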

10) DFT/FFT, wavelets, z-transform, and other popular signal and image processing transforms. See the Matlab Signal Processing Toolbox: http://www.mathworks.com/help/to... , Image Processing Toolbox: http://www.mathworks.com/help/to... , and Wavelet Toolbox: http://www.mathworks.com/help/to... . Also see the OpenCV catalog: http://opencv.willowgarage.com/w...

11) PageRank. Here is a good tutorial: http://michaelnielsen.org/blog/u...
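
A minimal sketch of item 11 as an iterated map/reduce computation: in each round a mapper splits a page's current rank evenly across its out-links, and the reducer sums the shares arriving at each target before applying the damping factor. This toy version assumes every page has at least one out-link (no dangling nodes) and holds the whole graph in memory. The damping value 0.85 is the conventional choice, and the function names are hypothetical.

```python
from collections import defaultdict

def map_page(page, rank, links):
    """Map step: a page sends an equal share of its rank to each out-link."""
    for target in links:
        yield target, rank / len(links)

def pagerank(graph, damping=0.85, iterations=50):
    """Iterate map (spread rank) and reduce (sum shares per target page)."""
    n = len(graph)
    ranks = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        contributions = defaultdict(float)
        for page, links in graph.items():
            for target, share in map_page(page, ranks[page], links):
                contributions[target] += share  # reduce: sum per target
        ranks = {p: (1 - damping) / n + damping * contributions[p] for p in graph}
    return ranks

# Toy graph: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

With no dangling nodes the ranks sum to 1 after every round, which gives a cheap invariant to check while debugging a distributed version.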

12) Build an eigensolver: http://www.cs.cmu.edu/~ukang/pap...

13) For a wealth of open-ended problems, see "Programming Challenges: What are some good "toy problems" in data science?"

Notes:

See Jimmy Lin's book Data-Intensive Text Processing with MapReduce for some good tips: http://www.umiacs.umd.edu/~jimmy... and Tom White's great book on Hadoop: http://www.hadoopbook.com/
Map-Reduce for Machine Learning on Multicore by Chu et al.: www-cs.stanford.edu/~ang/papers/...
Muthu Muthukrishnan's MapReduce resources: http://www.cs.rutgers.edu/~muthu...
Top 10 algorithms in data mining: http://www.mendeley.com/research...
Large Data Logistic Regression (with example Hadoop code): http://www.win-vector.com/blog/2...
A Comparison of Eight MapReduce Languages: http://www.dataspora.com/2011/04...
Seven data-mining algorithms which are 200-400x faster on GPUs: http://www.smedirector.com/2010/... via Michael E Driscoll
RecLab Core by Darren Erik Vengroff: http://code.richrelevance.com/re...
Amund Tveit's links: http://atbrox.com/2011/05/16/map...
Jeff Hammerbacher's links: http://www.mendeley.com/groups/1...
MR bibliography I compiled a while back: http://www.columbia.edu/~ak2834/...
Scaling up machine learning: http://www.cs.umass.edu/~ronb/sc...
See also "Machine Learning: What are some good learning projects to teach oneself about machine learning?"
Implement the sequential version first, then parallelize it with Hadoop, with one of the alternatives (see "What are some promising open-source alternatives to Hadoop MapReduce for map/reduce?"), or with a self-made runtime. Always abstract the MR logic away from the DFS. One of your teams could build a simple MapReduce engine; we did this for a term project (using an experimental language called X10) and it was fun.
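
To illustrate the last note, here is a hedged sketch of what a very small self-made engine might look like, together with a sequential baseline to check it against. It runs in a single process and in memory, so it only demonstrates the map, shuffle, and reduce phases, not distribution or fault tolerance. The name `run_mapreduce` and the word-count job are illustrative assumptions, not part of Hadoop or any other library.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group values by key), and reduce entirely in memory."""
    groups = defaultdict(list)
    for record in records:              # map phase
        for key, value in mapper(record):
            groups[key].append(value)   # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}

def wc_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield word, 1

def wc_reducer(word, counts):
    """Sum the partial counts for one word."""
    return sum(counts)

lines = ["map reduce map", "reduce shuffle"]

# Sequential version first: this is the reference answer.
sequential = {}
for line in lines:
    for word in line.split():
        sequential[word] = sequential.get(word, 0) + 1

# Same job through the toy engine; the two results should match.
parallel = run_mapreduce(lines, wc_mapper, wc_reducer)
```

Because `run_mapreduce` accepts any iterable of records, the mapper and reducer never touch the storage layer, which is one simple way to keep the MR logic separate from the DFS.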
