What are some good class projects for machine learning using MapReduce?

We are looking for a (not necessarily academic) project for a class in which we are learning to implement various machine learning algorithms on the MapReduce framework using AWS. The project should meet the following criteria:
1. Low time spent on cleaning the data
2. Parallelizable across ~3 people (e.g., each person could try a different method and the results could then be combined into an ensemble)
3. Duration: ~2-3 hours per student per week, with 2-3 students per group, over 4 weeks (7 groups in total)
4. Has one or more verifiable sub-results (so that students have a way of knowing they are on the right track)
5. (Optional) Has an open-ended question that the more enthusiastic students could pursue further.

Try implementing some ML algorithms not yet covered in Apache Mahout. See the related questions "What are some important algorithms not yet covered in Mahout?" and "What are the top 10 data mining or machine learning algorithms?"

See the open items at https://cwiki.apache.org/conflue... ; you can also ask on the Mahout mailing list.

1) Matrix decomposition routines (QR, Cholesky, etc.)

2) Decision trees with ID3, C4.5, or another heuristic ( https://issues.apache.org/jira/b... ). This is one of the most popular algorithms in data mining, with countless applications.
Tutorials: "Decision Trees: What are some good resources for learning about decision trees?"

Note: It looks like Mahout has a partial implementation of random decision forests, which you may be able to use to test your code (if questions arise, please ask on the Mahout mailing list; the community there is very helpful):
https://cwiki.apache.org/MAHOUT/...
https://cwiki.apache.org/MAHOUT/...
https://cwiki.apache.org/MAHOUT/...

3) Linear regression ( https://cwiki.apache.org/conflue... ), ordinary least squares, or other linear least squares methods: http://en.wikipedia.org/wiki/Ord... . Also see the Matlab Statistics toolbox for ideas: http://www.mathworks.com/help/to...

4) Gradient descent and other optimization and linear programming algorithms. See "Convex Optimization: What are some good resources for learning about distributed optimization?", "What are some fast gradient descent algorithms?", "Convex Optimization: Which optimization algorithms are good candidates for parallelization with MapReduce?", and the Matlab Optimization toolbox: http://www.mathworks.com/help/to...

5) AdaBoost and other meta-algorithms:  http://en.wikipedia.org/wiki/Ada...

6) SVM: https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... . See also "Support Vector Machines: What is the best way to implement an SVM using Hadoop?"

7) Vector space models: http://en.wikipedia.org/wiki/Vec...

8) Hidden Markov Models, an extremely popular method in NLP and bioinformatics. See "Hidden Markov Models: What are some good resources for learning about Hidden Markov Models?" and https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... , http://www.mendeley.com/c/424264...

9) Slope One by Daniel Lemire ( http://en.wikipedia.org/wiki/Slo... ) or other collaborative filtering algorithms. See Mahout in Action by Sean Owen: http://www.manning.com/owen/

10) DFT/FFT, wavelets, z-transform, and other popular signal and image processing transforms. See the Matlab Signal Processing toolbox: http://www.mathworks.com/help/to... , Image Processing toolbox: http://www.mathworks.com/help/to... , and Wavelet Toolbox: http://www.mathworks.com/help/to... ; also see the OpenCV catalog: http://opencv.willowgarage.com/w...

11) PageRank. Here is a good tutorial: http://michaelnielsen.org/blog/u...

12) Build an eigensolver:  http://www.cs.cmu.edu/~ukang/pap...

13) For a wealth of open-ended problems, see "Programming Challenges: What are some good "toy problems" in data science?"
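
To give a feel for how one of the items above fits the MapReduce model, here is a minimal single-machine sketch of ordinary least squares (item 3). It assumes the feature count is small enough that the final solve can run on one node. Each map call turns one record into partial sums of X^T X and X^T y, the reduce step adds those partials, and a driver solves the normal equations. The function names (`map_record`, `reduce_sums`, `solve`, `fit_ols`) and the toy data are made up for illustration and are not taken from Mahout or Hadoop.

```python
from functools import reduce


def map_record(record):
    """Map step: turn one (features, target) record into partial sums."""
    x, y = record
    x = [1.0] + list(x)                      # prepend the intercept term
    xtx = [[a * b for b in x] for a in x]    # outer product x x^T
    xty = [a * y for a in x]                 # x * y
    return xtx, xty


def reduce_sums(left, right):
    """Reduce step: element-wise sum of two (X^T X, X^T y) partials."""
    (a1, b1), (a2, b2) = left, right
    xtx = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    xty = [p + q for p, q in zip(b1, b2)]
    return xtx, xty


def solve(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def fit_ols(records):
    """Driver: map every record, reduce the partials, solve once."""
    xtx, xty = reduce(reduce_sums, map(map_record, records))
    return solve(xtx, xty)


# Toy data generated from y = 2 + 3x, so the expected weights are known.
data = [((float(x),), 2.0 + 3.0 * x) for x in range(10)]
weights = fit_ols(data)   # [intercept, slope]
```

Because the toy data comes from a known line, the recovered intercept and slope double as the kind of verifiable sub-result the question asks for. On a real cluster the map and reduce functions would run as separate tasks over input splits rather than through Python's built-in `map` and `reduce`.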

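PageRank (item 11) is another project whose structure maps directly onto map and reduce phases. The sketch below simulates one possible formulation in plain Python; it is not necessarily the one used in the linked tutorial. It assumes every page has at least one out-link and that every link target is itself a key in the graph, so dangling pages are not handled. The names `map_page`, `reduce_page`, and `pagerank`, the damping value 0.85, and the three-page graph are all illustrative choices.

```python
from collections import defaultdict


def map_page(page, rank, out_links):
    """Map step: send an equal share of this page's rank to each target."""
    yield page, 0.0                    # keep the page present in the output
    for target in out_links:
        yield target, rank / len(out_links)


def reduce_page(contributions, damping, n_pages):
    """Reduce step: combine incoming shares with the teleport term."""
    return (1.0 - damping) / n_pages + damping * sum(contributions)


def pagerank(graph, damping=0.85, iterations=50):
    """Run a fixed number of map/shuffle/reduce rounds over the graph."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        grouped = defaultdict(list)    # stands in for the shuffle phase
        for page, links in graph.items():
            for key, value in map_page(page, ranks[page], links):
                grouped[key].append(value)
        ranks = {page: reduce_page(values, damping, n)
                 for page, values in grouped.items()}
    return ranks


# Hypothetical three-page graph: a links to b and c, b to c, c to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

With no dangling pages the ranks keep summing to 1 after every round, which gives students a cheap sanity check before they move to a larger graph.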
Notes: