1997-2017: Twenty Years of KDD Cup

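The 23rd KDD conference was held in Halifax, Canada, on August 13-17, 2017. KDD Cup is the annual data mining and knowledge discovery competition organized by ACM SIGKDD. An important part of the annual KDD conference, it has run since 1997, a history of twenty years, and is currently the most influential competition in data mining. Let's look back over these twenty years of KDD Cup.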

KDD Cup 1997 Direct marketing for lift curve optimization (predicting the most likely charity donors)

Intro:

This year’s challenge is to predict who is most likely to donate to a charity. Contestants were evaluated on accuracy on the validation data set. Note: the data used in KDD Cup 1997 is exactly the same as KDD Cup 1998.

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-1997/Data

Results:

  • First Place (jointly shared):

    • Charles Elkan (University of California, San Diego)
      with his software BNB, Boosted Naive Bayesian Classifier
    • Urban Science Applications, Inc.
      with their software gain, Direct Marketing Selection System
  • Runner Up:

    • Silicon Graphics, Inc.
      with their software MineSet

KDD Cup 1998 Direct marketing for profit optimization (producing the best direct-mail list)

Intro:

The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits.

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-1998/Tasks

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-1998/Data

Results

  • First Place:
    Urban Science Applications, Inc. with their software GainSmarts

  • First Runner Up:
    SAS Institute, Inc. with their software Enterprise Miner

  • Second Runner Up:
    Quadstone Limited with their software Decisionhouse

KDD Cup 1999 Computer network intrusion detection (network intrusion detection and reporting)

Intro:

The task for the classifier learning contest organized in conjunction with the KDD’99 conference was to learn a predictive model (i.e. a classifier) capable of distinguishing between legitimate and illegitimate connections in a computer network.

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-1999/Tasks

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-1999/Data

Result:

  • First Place: Bernhard Pfahringer
    Austrian Research Institute for Artificial Intelligence

  • First Runner Up: Itzhak Levin
    LLSoft, Inc. (using Kernel Miner)

  • Second Runner Up: Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin
    MP13 company, Moscow, Russia

KDD Cup 2000 Online retailer website clickstream analysis (web mining task based on clickstream and transaction data)

Intro:

The KDD Cup 2000 domain contains clickstream and purchase data from Gazelle.com, a legwear and legcare web retailer that closed their online store on 8/18/2000.

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2000/Data

Result:

  • Question 1 of KDD Cup 2000

    • First Place: Amdocs
    • Honorable Mentions:
      Mui Seng Martin Lee, Chong Jin Ong and S. Sathiya Keerthi of Mechanical and Production Engineering Department, National University of Singapore
  • Question 2 of KDD Cup 2000

    • First Place: Salford Systems, Inc
    • Honorable Mentions:
      MP13 team of Alexei Vopilov, Ivan Shabalin and Vladimir Mikheyev, and the team of Mukund Deshpande, George Karypis, Department of Computer Science and Engineering, University of Minnesota
  • Question 3 of KDD Cup 2000

    • First Place: Salford Systems, Inc
    • Honorable Mentions:
      Orit Rafaely, Tel-Aviv University and Amdocs
  • Question 4 of KDD Cup 2000

    • First Place: e-steam
    • Honorable Mentions:
      SAS, Amdocs, and LLSoft, Ltd

KDD Cup 2001 Molecular bioactivity; plus protein locale prediction (bioinformatics and pharmaceuticals: bioactivity prediction for drug design, and prediction of gene/protein function and localization)

Intro:

Because of the rapid growth of interest in mining biological databases, KDD Cup 2001 was focused on data from genomics and drug design. Sufficient (yet concise) information was provided so that detailed domain knowledge was not a requirement for entry. A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2001/Tasks

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2001/Data

Result:

  • Task 1 - Thrombin

    • First Place: Jie Cheng
      Canadian Imperial Bank of Commerce

    • Honorable Mention: T. Silander
      University of Helsinki

  • Task 2 - Function

    • First Place: Mark-A. Krogel
      University of Magdeburg

    • Honorable Mentions:

    C. Lambert (Golden Helix)

    J. Sese, H. Hayashi, and S. Morishita (University of Tokyo)

    D. Vogel and R. Srinivasan (A.I. Insight)

    S. Pocinki, R. Wilkinson, and P. Gaffney (Lubrizol Corp.)

  • Task 3 - Localization

    • First Place: Hisashi Hayashi, Jun Sese, and Shinichi Morishita
      University of Tokyo
    • Honorable Mentions:

    M. Schonlau (RAND)

    W. DuMouchel, C. Volinsky and C. Cortes (AT&T)

    B. Frasca, Z. Zheng, R. Parekh, and R. Kohavi (Blue Martini)

KDD Cup 2002 BioMed document; plus gene role classification (bioinformatics and text mining in molecular biology)

Intro:

This year the competition included two tasks that involved data mining in molecular biology domains. The first task focused on constructing models that can assist genome annotators by automatically extracting information from scientific articles. The second task focused on learning models that characterize the behavior of individual genes in a hidden experimental setting. Both are described in more detail on the Tasks page.

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2002/Tasks

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2002/Data

Result:

  • Task 1: Information Extraction from Biomedical Articles

    • First Place: ClearForest and Celera

    Yizhar Regev and Michal Finkelstein

    • Honorable Mentions:

    Design Technology Institute Ltd., Department of Mechanical Engineering at the National University of Singapore and Genome Institute of Singapore (Shi Min)

    Data Mining Group, Imperial College and Inforsense Limited (Huma Lodhi and Yong Zhang)

    Verity Inc. and Exelixis, Inc. (Bin Chen)

  • Task 2: Yeast Gene Regulation Prediction

    • First Place: Adam Kowalczyk and Bhavani Raskutti

    Telstra Research Laboratories

    • Honorable Mentions:

    David Vogel and Randy Axelrod
    A.I. Insight Inc. and Sentara Healthcare

    Marcus Denecke, Mark-A. Krogel, Marco Landwehr and Tobias Scheffer
    Magdeburg University

    George Forman
    Hewlett Packard Laboratories

    Amal Perera, Bill Jockheck, Willy Valdivia Granda, Anne Denton, Pratap Kotala and William Perrizo
    North Dakota State University

KDD Cup 2003 Network mining and usage log analysis

Intro:

The first task involves predicting the future; contestants predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference. For the second task, contestants must build a citation graph of a large subset of the archive from only the LaTeX sources. In the third task, each paper’s popularity will be estimated based on partial download logs. And the last task is open! Given the large amount of data, contestants can devise their own questions and the most interesting result is the winner.

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2003/Tasks

Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2003/Rules

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2003/Data

Result:

  • I. Citation Prediction Task

    • First Place: J N Manjunatha, Raghavendra Pandey, Sivaramakrishnan R., and M Narasimha Murty (1329)
    • First Runner Up: Claudia Perlich, Foster Provost, and Sofus Macskassy (1360)
    • Second Runner Up: David Vogel (1398)

KDD Cup 2004 Particle physics; plus protein homology prediction (multiple performance measures for supervised classification)

Intro:

This year’s competition focuses on data-mining for a variety of performance criteria such as Accuracy, Squared Error, Cross Entropy, and ROC Area. As described on this WWW-site, there are two main tasks based on two datasets from the areas of bioinformatics and quantum physics.
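As a rough illustration of the four criteria named above, the sketch below computes textbook versions of accuracy, squared error, cross entropy, and ROC area for a binary problem. The function names, the 0.5 threshold, and the toy labels and scores are assumptions made for this example; the exact definitions used by the organizers are on the Tasks page and may differ in detail.

```python
import math


def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


def squared_error(y_true, p_pred):
    """Mean squared difference between labels and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood; eps keeps predictions away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def roc_area(y_true, p_pred):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores, for illustration only.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]

for metric in (accuracy, squared_error, cross_entropy, roc_area):
    print(metric.__name__, metric(labels, scores))
```

The same predictions can rank differently under each measure, which is why a single model rarely wins on all criteria at once.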

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2004/Tasks

Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2004/Rules

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2004/Data

Result:

  • Quantum Physics Problem

    • First Place:

    David S. Vogel, Eric Gottschalk, and Morgan C. Wang

    MEDai / A.I. Insight / University of Central Florida
    • Honorable Mention for ROC Area
    • Honorable Mention for Cross Entropy
    • Honorable Mention for SLQ Score

    • First Runner Up:
      Arpita Chowdhury, Dinesh Bharule, Don Yan, Lalit Wangikar
      Inductis Inc.
      • Honorable Mention for Accuracy
    • Second Runner Up:
      Christophe Lambert
      Golden Helix Inc.
  • Protein Homology Problem

    • First Place:

    Bernhard Pfahringer

    University of Waikato, Computer Science Department

    • Tied for 1st Place Overall:

    Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao

    Institute of Computing Technology, Chinese Academy of Sciences
    • Honorable Mention for Squared Error
    • Honorable Mention for Average Precision

    • Tied for 1st Place Overall:

    David S. Vogel, Eric Gottschalk, and Morgan C. Wang

    MEDai / A.I. Insight / University of Central Florida
    • Honorable Mention for Top-1 Accuracy

    • Honorable Mention for Rank of Last:
      Dirk Dach, Holger Flick, Christophe Foussette, Marcel Gaspar, Daniel Hakenjos, Felix Jungermann, Christian Kullmann, Anna Litvina, Lars Michele, Katharina Morik, Martin Scholz, Siehyun Strobel, Marc Twiehaus, Nazif Veliu

    Artificial Intelligence Unit, University of Dortmund, Germany

KDD Cup 2005 Internet user search query categorization

Intro:

This year’s competition is about classifying internet user search queries. The task was specifically designed to draw participation from industry, academia, and students.

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2005/Tasks

Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2005/Rules

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2005/Data

Result:

  • Winners

    • Query Categorization Precision Award

    Hong Kong University of Science and Technology team

    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang

    • Query Categorization Performance Award

    Hong Kong University of Science and Technology team

    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang

    • Query Categorization Creativity Award

    Hong Kong University of Science and Technology team

    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang

  • Runners-up

    • Query Categorization Precision Award

    Budapest University of Technology team

    Zsolt T. Kardkovacs, Domonkos Tikk, Zoltan Bansaghi

    • Query Categorization Performance Award

    MEDai/AI Insight/Humboldt University team

    David S. Vogel, Steve Bridges, Steffen Bickel, Peter Haider, Rolf Schimpfky, Peter Siemen, Tobias Scheffer

    • Query Categorization Creativity Award

    Budapest University of Technology team

    Zsolt T. Kardkovacs, Domonkos Tikk, Zoltan Bansaghi

KDD Cup 2006 Pulmonary embolisms detection from image data (medical data mining)

Intro:

This year’s KDD Cup challenge problem is drawn from the domain of medical data mining. The tasks are a series of Computer-Aided Detection problems revolving around the clinical problem of identifying pulmonary embolisms from three-dimensional computed tomography data. This challenging domain is characterized by:

  • Multiple instance learning
  • Non-IID examples
  • Nonlinear cost functions
  • Skewed class distributions
  • Noisy class labels
  • Sparse data

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2006/Tasks

Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2006/Rules

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2006/Data

Result:

  • Task 1 - PE Identification

    • First Place: Robert Bell, Patrick Haffner, and Chris Volinsky (AT&T Research)
    • First Runner Up: Dmitriy Fradkin (Ask.com)
    • Second Runner Up: Domonkos Tikk (Budapest University of Technology & Economics), Zsolt T. Kardkovacs (Budapest University of Technology & Economics), Ferenc P. Szidarovszky (Szidarovszky Ltd. and Budapest University of Technology & Economics), Gyorgy Biro (TextMiner Ltd.), and Zoltan Balint (Budapest University of Technology & Economics)
    • Best Student Entry: Karthik Kumara (team leader), Sourangshu Bhattacharya, Mehul Parsana, Shivramkrishnan K, Rashmin Babaria, Saketha Nath J, and Chiranjib Bhattacharyya (Indian Institute of Science)
  • Task 2 - Patient Classification

    • First Place: Domonkos Tikk (Budapest University of Technology & Economics), Zsolt T. Kardkovacs (Budapest University of Technology & Economics), Ferenc P. Szidarovszky (Szidarovszky Ltd. and Budapest University of Technology & Economics), Gyorgy Biro (TextMiner Ltd.), and Zoltan Balint (Budapest University of Technology & Economics)
    • First Runner Up: Ruiping Wang, Yu Su, Ting Liu, Fei Yang, Liangguo Zhang, Dong Zhang, Shiguang Shan, Weiqiang Wang, Ruixiang Sun, and Wen Gao (Institute of Computing Technology, Chinese Academy of Sciences)
    • Second Runner Up: Cas Zhang, Y. Zhou, Q. Wang, and H. Ge (Joint R&D Lab, Chinese Academy of Sciences)
    • Third Runner Up: Dmitriy Fradkin (Ask.com)
    • Best Student Entry: Zhang Cas (IA, PKU)
  • Task 3 - Negative Predictive Value

    • First Place: William Perrizo and Amal Shehan Perera (DataSURG Group, North Dakota State University)
    • Runner Up: Nimisha Gupta and Tarun Agarwal (Strand Life Sciences Pvt. Ltd.)
    • Best Student Entry: Karthik Kumara (team leader), Sourangshu Bhattacharya, Mehul Parsana, Shivramkrishnan K, Rashmin Babaria, Saketha Nath J, and Chiranjib Bhattacharyya (Indian Institute of Science)

KDD Cup 2007 Consumer recommendations (predicting movie ratings)

Intro:

This year’s KDD Cup focuses on predicting aspects of movie rating behavior. There are two tasks. The tasks, developed in conjunction with Netflix, have been selected to be interesting to participants from both academia and industry. You can choose to compete in either or both of the tasks.

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2007/Tasks

Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2007/Rules

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2007/Data

Result:

  • Task 1 - Who Rated What

    • First Place: Miklos Kurucz, Andras A. Benczur, Tamas Kiss, Istvan Nagy, Adrienn Szabo, Balazs Torma (Hungarian Academy of Sciences)
    • First Runner Up: Advanced Analytical Solutions Team of Neo Metrics directed by Jorge Sueiras (Neo Metrics)
    • Second Runner Up: Yan Liu, Zhenzhen Kou (IBM Research)
  • Task 2 - How Many Ratings

    • First Place: Saharon Rosset, Claudia Perlich, Yan Liu (IBM Research)
    • First Runner Up: Advanced Analytical Solutions Team of Neo Metrics directed by Jorge Sueiras (Neo Metrics)
    • Second Runner Up: James Malaugh (Team Lead), Sachin Gangaputra, Nikhil Rastogi, Rahul Shankar, Sandeep Gupta, Kushagra Gupta, Neha Gupta, Gaurav Lal (Inductis)

KDD Cup 2008 Breast cancer (early detection of breast cancer)

Intro:

The KDD Cup 2008 challenge focuses on the problem of early detection of breast cancer from X-ray images of the breast. In a screening population, a small fraction of cancerous patients have more than one malignant lesion. To simplify the problem, we only consider one type of cancer - cancerous masses - and only include cancer patients with at most one cancerous mass per patient. The challenge will consist of two parts, each of which is related to the development of algorithms for Computer Aided Detection (CAD) of early stage breast cancer from X-ray images.

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2008/Tasks

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2008/Data

Result:

  • Challenge 1

    • First Place: PMG-IBM-Research

    Team Members: Claudia Perlich, Prem Melville, Grzegorz Swirszcz, Yan Liu, Saharon Rosset, Richard Lawrence.

    Affiliation: IBM Research

    • First Runner Up: Hung-Yi Lo

    Team Members: Hung-Yi Lo, Chun-Min Chang, Tsung-Hsien Chiang, Cho-Yi Hsiao, Anta Huang, Tsung-Ting Kuo, Wei-Chi Lai, Ming-Han Yang, Jung-Jung Yeh, Chun-Chao Yen and Shou-De Lin

    Affiliation: National Taiwan University

    • Second Runner Up: yazhene

    Team Members: Yazhene Krishnaraj and Chandan K. Reddy

    Affiliation: Wayne State University

  • Challenge 2

    • First Place: PMG-IBM-Research

    Team Members: Claudia Perlich, Prem Melville, Grzegorz Swirszcz, Yan Liu, Saharon Rosset, Richard Lawrence.

    Affiliation: IBM Research

    • First Runner Up: TZTeam

    Team Members: Didier Baclin

    • Second Runner Up: Hung-Yi Lo

    Team Members: Hung-Yi Lo, Chun-Min Chang, Tsung-Hsien Chiang, Cho-Yi Hsiao, Anta Huang, Tsung-Ting Kuo, Wei-Chi Lai, Ming-Han Yang, Jung-Jung Yeh, Chun-Chao Yen and Shou-De Lin

    Affiliation: National Taiwan University

KDD Cup 2009 Customer relationship prediction (predicting customer behavior for a telecom operator)

Intro:

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

The most practical way to build knowledge about customers in a CRM system is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable to explain (i.e. churn, appetency or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed using input variables that describe instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction and indexation, based on an efficient model combined with variable selection regularization and a model averaging method. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables), and unbalanced class distributions. Time efficiency is often a crucial point. Therefore part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2009/Tasks

Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2009/Rules

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data

Result:

  • Fast Track

    • First Place: IBM Research

    Ensemble Selection for the KDD Cup Orange Challenge

    • First Runner Up: ID Analytics, Inc

    KDD Cup Fast Scoring on a Large Database

    • Second Runner Up: Old dogs with new tricks (David Slate, Peter W. Frey)
  • Slow Track

    • First Place: University of Melbourne

    University of Melbourne entry

    • First Runner Up: Financial Engineering Group, Inc. Japan

    Stochastic Gradient Boosting

    • Second Runner Up: National Taiwan University, Computer Science and Information Engineering

    Fast Scoring on a Large Database using regularized maximum entropy model, categorical/numerical balanced AdaBoost and selective Naive Bayes

KDD Cup 2010 Student performance evaluation (predicting students' performance on math problems from logs of their interaction with intelligent tutoring systems)

Intro:

How generally or narrowly do students learn? How quickly or slowly? Will the rate of improvement vary between students? What does it mean for one problem to be similar to another? It might depend on whether the knowledge required for one problem is the same as the knowledge required for another. But is it possible to infer the knowledge requirements of problems directly from student performance data, without human analysis of the tasks?

This year’s challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. This task presents interesting technical challenges, has practical importance, and is scientifically interesting.

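Task description
Predict students' results on math problems from logs of the interaction between students and an intelligent tutoring system. The task is both practically important and scientifically interesting. The competition provides 3 development data sets and 2 challenge data sets, each split into a training part and a test part. The test part of each challenge data set is hidden, and participants must develop a learning model that accurately predicts student performance on this hidden part.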

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2010-student-performance-evaluation/Tasks

Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2010-student-performance-evaluation/Rules

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2010-student-performance-evaluation/Data

Result:

  • All Teams

    • First Place: National Taiwan University

    Feature engineering and classifier ensembling for KDD CUP 2010

    • First Runner Up: Zhang and Su

    Gradient Boosting Machines with Singular Value Decomposition

    • Second Runner Up: BigChaos @ KDD

    Collaborative Filtering Applied to Educational Data Mining

  • Student Teams

    • First Place: National Taiwan University

    Feature engineering and classifier ensembling for KDD CUP 2010

    • First Runner Up: Zach A. Pardos

    Using HMMs and bagged decision trees to leverage rich features of user and skill

    • Second Runner Up: SCUT Data Mining

    Split-Score-Predicate

KDD Cup 2011 Predict music ratings and identify favorite songs (music rating prediction; identifying whether a song was rated by the user)

Intro:

  • Learn the rhythm, predict the musical scores

People have been fascinated by music since the dawn of humanity. A wide variety of music genres and styles has evolved, reflecting diversity in personalities, cultures and age groups. It comes as no surprise that human tastes in music are remarkably diverse, as nicely exhibited by the famous quotation: “We don’t like their sound, and guitar music is on the way out” (Decca Recording Co. rejecting the Beatles, 1962).

Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to.

Such an exciting analysis introduces new scientific challenges. The KDD Cup contest releases over 300 million ratings performed by over 1 million anonymized users. The ratings are given to different types of items (songs, albums, artists, genres), all tied together within a known taxonomy.

  • Two Tracks

The competition is divided into two tracks:

The first track is aimed at predicting scores that users gave to various items.
The second track requires separation of loved songs from other songs.

Both tracks are open to all research groups in academia and industry.

The KDD Cup 2011 files are currently offline.

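Task description

Track 1: Predicting scores that users gave to various items
(music rating prediction)

Given users' historical ratings of items on Yahoo! Music, predict their ratings of other items (songs, albums, and so on). Predictions are scored by the RMSE (root mean squared error) between predicted and actual ratings. The album, artist, and genre of each song are also provided.

Track 2: Separation of loved songs from other songs
(identifying whether a song was rated by the user)

For each user, 6 candidate songs are given: 3 that the user has rated, and 3 that the user has not rated but that are drawn from songs rated highly by users overall. Song attributes (album, artist, genre, and so on) are provided as well. Participants submit a binary (0/1) classification, and the final ranking is computed from overall accuracy.

The official page for this task has been taken offline, so the data set cannot be downloaded.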
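The two scoring rules mentioned for the 2011 tracks can be sketched in a few lines. This is only an illustration of the standard formulas: the function names and the sample numbers are made up here, and the official evaluation scripts are not available to check against.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings (Track 1 style)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(labels, guesses):
    """Share of 0/1 guesses that match the true labels (Track 2 style)."""
    if len(labels) != len(guesses):
        raise ValueError("labels and guesses must have the same length")
    return sum(l == g for l, g in zip(labels, guesses)) / len(labels)


# Hypothetical ratings and predictions, for illustration only.
print(rmse([90, 50, 0], [80, 50, 10]))

# Hypothetical 0/1 labels for one user's six candidate songs.
print(accuracy([1, 0, 1, 0, 1, 0], [1, 0, 0, 0, 1, 1]))
```

A lower RMSE is better for the rating track, while a higher accuracy is better for the classification track.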

KDD Cup 2012 (Track 1) Predict which users (or information sources) one user might follow in Tencent Weibo (personalized recommendation in a social network)

Intro:

Online social networking services have become tremendously popular in recent years, with popular social networking sites like Facebook, Twitter, and Tencent Weibo adding thousands of enthusiastic new users each day to their existing billions of actively engaged users. Since its launch in April 2010, Tencent Weibo, one of the largest micro-blogging websites in China, has become a major platform for building friendship and sharing interests online. Currently, there are more than 200 million registered users on Tencent Weibo, generating over 40 million messages each day. This scale benefits the Tencent Weibo users but it can also flood users with huge volumes of information and hence puts them at risk of information overload. Reducing the risk of information overload is a priority for improving the user experience and it also presents opportunities for novel data mining solutions. Thus, capturing users’ interests and accordingly serving them with potentially interesting items (e.g. news, games, advertisements, products) is a fundamental and crucial feature of social networking websites like Tencent Weibo.

More information on KDD Cup 2012 (Track 1) can be found at Kaggle.com

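Task description

Track 1: Predict which users (or information sources) one user might follow in Tencent Weibo
(personalized recommendation in a social network)

Given user profiles from Tencent Weibo, SNS social relationships, interaction records in the social network (retweet, comment, at), and the item recommendation history of the past 30 days, predict the list of recommended items that the user is most likely to accept next.

Competition page
https://www.kaggle.com/c/kddcup2012-track1#description

Data set
https://www.kaggle.com/c/kddcup2012-track1/data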

KDD Cup 2012 (Track 2) Predict the click-through rate of ads given the query and user information (pCTR click-through rate estimation for a search advertising system)

Intro:

Search advertising has been one of the major revenue sources of the Internet industry for years. A key technology behind search advertising is to predict the click-through rate (pCTR) of ads, as the economic model behind search advertising requires pCTR values to rank ads and to price clicks. In this task, given the training instances derived from session logs of the Tencent proprietary search engine, soso.com, participants are expected to accurately predict the pCTR of ads in the testing instances.
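To make the idea of a pCTR value concrete, here is a naive baseline sketch: an empirical click-through rate per ad, smoothed toward a prior so that ads with very few impressions do not get extreme estimates. It is not the method of any participant, and the row layout `(ad_id, impressions, clicks)`, the function name, and the prior settings are all assumptions made for this example.

```python
from collections import defaultdict


def smoothed_ctr(rows, prior_ctr=0.05, prior_weight=10.0):
    """Estimate a click-through rate per ad from (ad_id, impressions, clicks) rows.

    The estimate is (clicks + prior_ctr * prior_weight) / (impressions + prior_weight),
    which pulls ads with few impressions toward prior_ctr.
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for ad_id, n_impressions, n_clicks in rows:
        impressions[ad_id] += n_impressions
        clicks[ad_id] += n_clicks
    return {
        ad_id: (clicks[ad_id] + prior_ctr * prior_weight)
        / (impressions[ad_id] + prior_weight)
        for ad_id in impressions
    }


# Hypothetical aggregated log rows, for illustration only.
rows = [("ad_a", 100, 10), ("ad_b", 2, 1), ("ad_a", 90, 0)]
ctr = smoothed_ctr(rows)

# Rank ads from highest to lowest estimated click-through rate.
ranked = sorted(ctr, key=ctr.get, reverse=True)
print(ctr, ranked)
```

A competitive entry would condition on the query, ad position, and user features rather than on the ad alone; this sketch only shows what kind of number is being predicted.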

More information on KDD Cup 2012 (Track 2) can be found at Kaggle.com

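Task description

Track 2: Predict the click-through rate of ads given the query and user information
(pCTR click-through rate estimation for a search advertising system)

The data provides the queries users issued on Tencent's search engine, the ads displayed (including ad title, description, URL, and so on), the relative position of each ad (its rank among the ads shown), whether the user clicked, and attributes of advertisers and users. The goal is to predict users' clicks on ads in a later time period.

Competition page
https://www.kaggle.com/c/kddcup2012-track2#description

Data set
https://www.kaggle.com/c/kddcup2012-track2/data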

KDD Cup 2013 (Track 1) Determine whether an author has written a given paper

Intro:

The ability to search literature and collect/aggregate metrics around publications is a central tool for modern research. Both academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom.

Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community, in addition to literature search. It covers more than 50 million publications and over 19 million authors across a variety of domains, with updates added each week. One of the main challenges of providing this service is caused by author-name ambiguity. On one hand, there are many authors who publish under several variations of their own name. On the other hand, different authors might share a similar or even the same name.

As a result, the profile of an author with an ambiguous name tends to contain noise, resulting in papers that are incorrectly assigned to him or her. This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.

More information on KDD Cup 2013 (Track 1) can be found at Kaggle.com

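Task description

Track 1: Author-Paper Identification Challenge

Microsoft Academic Search is an open platform that covers more than 50 million publications and over 19 million authors across academic fields, with weekly updates. One of the main challenges in providing this service is author-name ambiguity. On one hand, many authors publish under several variations of their name. On the other hand, different authors may share a similar or even identical name.
As a result, papers are often wrongly attributed to an author with an ambiguous name. This challenge asks participants to identify, within an author profile, the papers that were actually written by that author.

Competition page
https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge

Data set
https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge/data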

KDD Cup 2013 (Track 2) Identify which authors correspond to the same person

Intro:

The ability to search literature and collect/aggregate metrics around publications is a central tool for modern research. Both academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom.

Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community, in addition to literature search. It covers more than 50 million publications and over 19 million authors across a variety of domains, with updates added each week. One of the main challenges of providing this service is caused by author-name ambiguity. This KDD Cup task challenges participants to determine which authors in a given data set are duplicates.

More information on KDD Cup 2013 (Track 2) can be found at Kaggle.com

Task description

Track 2: Author Disambiguation Challenge

This challenge asks participants to determine which authors in the data set are the same person.

Competition page
https://www.kaggle.com/c/kdd-cup-2013-author-disambiguation

Data set
https://www.kaggle.com/c/kdd-cup-2013-author-disambiguation/data

KDD Cup 2014 Predict funding requests that deserve an A+ (helping a charity website identify exceptionally exciting projects)

Intro:

DonorsChoose.org is an online charity that makes it easy to help students in need through school donations. At any time, thousands of teachers in K-12 schools propose projects requesting materials to enhance the education of their students. When a project reaches its funding goal, they ship the materials to the school.

The 2014 KDD Cup asks participants to help DonorsChoose.org identify projects that are exceptionally exciting to the business, at the time of posting. While all projects on the site fulfill some kind of need, certain projects have a quality above and beyond what is typical. By identifying and recommending such projects early, they will improve funding outcomes, better the user experience, and help more students receive the materials they need to learn.

Successful predictions may require a broad range of analytical skills, from natural language processing on the need statements to data mining and classical supervised learning on the descriptive factors around each project.

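Task description
KDD Cup 2014 asks participants to help the charity website DonorsChoose.org pick out projects that are exceptionally exciting to the business. Every project meets some specific need, but only a few rise well above the typical standard. By identifying and recommending such projects early, the site can improve funding outcomes, offer a better user experience, and help more students receive the learning materials they need.

Competition page
https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose

Data set
https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data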

KDD Cup 2015 Predicting dropouts in MOOC (using big data to predict whether MOOC learners will drop out)

Intro:

Students’ high dropout rate on MOOC platforms has been heavily criticized, and predicting their likelihood of dropout would be useful for maintaining and encouraging students’ learning activities. Therefore, in KDD Cup 2015, we will predict dropout on XuetangX, one of the largest MOOC platforms in China.

The competition participants need to predict whether a user will drop a course within the next 10 days based on his or her prior activities. If a user leaves no records for course C in the log during the next 10 days, we define it as dropout from course C. For more details about the log, please refer to the Data Descriptions.
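The dropout definition above translates directly into a labeling rule. The sketch below is a minimal version under stated assumptions: the log is a list of hypothetical `(user, course, day)` tuples, enrollments are `(user, course)` pairs, and the 10-day window is taken to start the day after a cutoff date and to include its last day. The real file layout is given in the Data Descriptions and is not reproduced here.

```python
from datetime import date, timedelta


def dropout_labels(enrollments, log, cutoff, window_days=10):
    """Label each (user, course) pair: 1 = dropout, 0 = still active.

    A pair counts as a dropout when the log holds no record for it
    after `cutoff` and up to `cutoff + window_days` (inclusive).
    """
    window_end = cutoff + timedelta(days=window_days)
    active = {
        (user, course)
        for user, course, day in log
        if cutoff < day <= window_end
    }
    return {pair: 0 if pair in active else 1 for pair in enrollments}


# Hypothetical enrollments and log records, for illustration only.
cutoff = date(2014, 6, 1)
enrollments = [("u1", "c1"), ("u2", "c1"), ("u2", "c2")]
log = [
    ("u1", "c1", date(2014, 6, 3)),   # inside the window: active
    ("u2", "c1", date(2014, 5, 30)),  # before the cutoff: does not count
    ("u2", "c2", date(2014, 6, 12)),  # after the window: does not count
]

labels = dropout_labels(enrollments, log, cutoff)
print(labels)
```

Note that the label is per user-course pair, so the same user can be a dropout in one course and active in another.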

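Task description
Dropout rates on MOOC platforms are very high, so predicting whether students will drop out is valuable for sustaining and encouraging their learning. KDD Cup 2015 focuses on predicting dropout on XuetangX, one of the largest MOOC platforms in China. Participants predict, from each user's activity records, how likely the user is to drop out within the next 10 days.

Competition page
http://www.kddcup2015.com/information.html

Data set
http://data-mining.philippe-fournier-viger.com/the-kddcup-2015-dataset-download-link/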

KDD Cup 2016 Whose papers are accepted the most: towards measuring the impact of research institutions

Intro:

Finding influential nodes in a social network for identifying patterns or maximizing information diffusion has been an actively researched area with many practical applications. In addition to the obvious value to the advertising industry, the research community has long sought mechanisms to effectively disseminate new scientific discoveries and technological breakthroughs so as to advance our collective knowledge and elevate our civilization. For students, parents and funding agencies that are planning their academic pursuits or evaluating grant proposals, having an objective picture of the institutions in question is particularly essential. Partly against this backdrop we have witnessed that releasing a yearly Research Institution or University Ranking has become a tradition for many popular newspapers, magazines and academic institutes. Such rankings not only attract attention from governments, universities, students and parents, but also create debates on the scientific correctness behind the rankings. The most criticized aspect of these rankings is: the data used and the methodology employed for the ranking are mostly unknown to the public.

The 2016 KDD Cup will address this very important problem through publicly available datasets, like the Microsoft Academic Graph (MAG), a freely available dataset that includes information on academic publications and citations. This dataset, being a heterogeneous graph, can be used to study the influential nodes of various types including authors, affiliations and venues; we choose to focus on affiliations in this competition. In effect, given a research field, we are challenging the KDD Cup community to jointly develop data mining techniques to identify the best research institutions based on their publications and how they are cited in research articles.

Tasks:http://www.kdd.org/kdd-cup/view/kdd-cup-2016/Tasks

Rules:http://www.kdd.org/kdd-cup/view/kdd-cup-2016/Rules

Data:http://www.kdd.org/kdd-cup/view/kdd-cup-2016/Data

KDD Cup 2017 Highway Tollgates Traffic Flow Prediction: Travel Time & Traffic Volume Prediction

Intro:

Highway tollgates are well known bottlenecks in traffic networks. During rush hours, long queues at tollgates can overwhelm traffic management authorities. Effective preemptive countermeasures are desired to solve this challenge. Such countermeasures include expediting the toll collection process and streamlining future traffic flow. The expedition of toll collection could be simply allocating temporary toll collectors to open more lanes. Future traffic flow could be streamlined by adaptively tweaking traffic signals at upstream intersections. Preemptive countermeasures will only work when the traffic management authorities receive reliable predictions for future traffic flow. For example, if heavy traffic in the next hour is predicted, then traffic regulators could immediately deploy additional toll collectors and/or divert traffic at upstream intersections.
Traffic flow patterns vary due to different stochastic factors, such as weather conditions, holidays, time of the day, etc. The prediction of future traffic flow and ETA (Estimated Time of Arrival) is a known challenge. An unprecedentedly large amount of traffic data from mobile apps such as Waze (in the US) or Amap (in China) can help us take up that challenge. If the contestants in this proposed KDD CUP could design reliable approaches for future traffic flow and ETA prediction, then the traffic management authorities might be able to capitalize on big data & algorithms for less congestion at tollgates.

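Task description

Highway tollgates are well-known bottlenecks in traffic networks. If congestion in the coming hour can be predicted in advance, traffic management authorities can act in time to divert and control traffic at upstream intersections. KDD CUP 2017 asks participants to design algorithms that predict traffic flow and vehicle arrival times, using algorithms and data to empower the transportation sector and reduce congestion.

Task 1: To estimate the average travel time from designated intersections to tollgates

Task 2: To predict average tollgate traffic volume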
