Top 10 challenging problems in data mining
In a previous post, I wrote about the top 10 data mining algorithms, a paper that was published in Knowledge and Information Systems. The selection process used there is the same as the one used to identify the most important data mining problems (according to the survey answers). The paper by Yang and Wu was published in 2006 in the International Journal of Information Technology & Decision Making. It lists the following problems (in no specific order):
- Developing a unifying theory of data mining
- Scaling up for high dimensional data and high speed data streams
- Mining sequence data and time series data
- Mining complex knowledge from complex data
- Data mining in a network setting
- Distributed data mining and mining multi-agent data
- Data mining for biological and environmental problems
- Data mining process-related problems
- Security, privacy and data integrity
- Dealing with non-static, unbalanced and cost-sensitive data
I sometimes receive emails from master's students or practitioners interested in data mining. The usual question is “What can I do as research in data mining?” Of course, the answer depends on what you like and on the opportunities of the moment. However, this paper may give some hints on possible directions for research.
As usual, the “data mining automation process” issue is mentioned. It is worth noting that researchers argue that they need to find a way to automate data mining, while practitioners say that they can do it (KXEN, for example). Finally, I think that one of the most important issues is pointed out by the following sentence in the paper:
“[...] they’re [data mining systems] unable to relate the results of mining to the real-world decisions they affect [...]”
In my opinion, ranking top problems is more subjective than ranking top algorithms. Most people will certainly agree on the selected data mining algorithms. The choice of problems is more open to debate, since some of them may be relevant only to certain fields of research.