PAKDD 2009 Data Mining Competition

Several files are available for download from the site. All are in standard ANSI format except the variables list, which is in MS-Excel© format.

The three data sets are arranged in TAB-separated columns. The first column, "ID_CLIENT", should be used as the key/identifier. "ID_CLIENT" ranges from 1 to 50,000 in the modeling data set, from 50,001 to 60,000 in the leaderboard data set, and from 60,001 to 70,000 in the prediction data set. The last column, "TARGET_LABEL", is filled only for the modeling data set, with BAD=1 and GOOD=0. All numerical data use the dot "." as the decimal separator (not the comma ",").
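As a minimal sketch of how the ID ranges above can be used, the following Python function maps an "ID_CLIENT" value to the data set it belongs to. The function name and the returned labels are hypothetical and are not part of the competition material.

```python
def dataset_for_id(id_client):
    """Return the data set an ID_CLIENT value belongs to.

    The ranges come from the description above; the function name and
    the returned strings are illustrative only.
    """
    if 1 <= id_client <= 50_000:
        return "modeling"
    if 50_001 <= id_client <= 60_000:
        return "leaderboard"
    if 60_001 <= id_client <= 70_000:
        return "prediction"
    raise ValueError(f"ID_CLIENT out of range: {id_client}")
```

A check like this can catch rows that were accidentally mixed between files before modeling starts.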

The column labels of the data files are in a separate variables list file. That file has two columns, containing the variable names and their descriptions.
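Because the column labels are kept apart from the data, they have to be attached when a file is read. The sketch below shows one way to do this in Python, assuming the data files carry no header row and that the variable names have already been copied out of the variables list. The function name, the three-column sample, and "SOME_FEATURE" are hypothetical; the real files have many more columns.

```python
import csv
import io

def read_competition_file(handle, column_names):
    """Read headerless TAB-separated rows into dicts keyed by column name."""
    rows = []
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(column_names):
            raise ValueError(
                f"line {line_no}: expected {len(column_names)} fields, "
                f"got {len(fields)}"
            )
        rows.append(dict(zip(column_names, fields)))
    return rows

# Hypothetical excerpt standing in for a real data file.
sample = "1\t35.5\t0\n2\t41.0\t1\n"
names = ["ID_CLIENT", "SOME_FEATURE", "TARGET_LABEL"]
rows = read_competition_file(io.StringIO(sample), names)

# Values arrive as strings; the dot decimal separator parses directly.
feature_value = float(rows[0]["SOME_FEATURE"])
```

For a file on disk, an open file object can be passed in place of the `io.StringIO` wrapper.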

The Leaderboard submission example is a file in the format required for leaderboard submissions.

The AUC_ROC Java code is provided to help teams calculate the metric with the same algorithm used for the competition's performance assessment.
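For orientation only, the area under the ROC curve can be sketched in Python with the standard rank-sum (Mann-Whitney) formulation, treating BAD=1 as the positive class. This is not the competition's Java code, and details such as tie handling may differ from it, so official results should be checked against the provided code. The function name is hypothetical.

```python
def auc_roc(labels, scores):
    """Area under the ROC curve via the rank-sum formulation.

    labels: iterable of 0/1 values (1 = BAD, the positive class).
    scores: iterable of numbers, higher meaning more likely BAD.
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        # Find the run of equal scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # 1-based ranks i+1 .. j share their average.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
                n_pos += 1
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

With labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four BAD/GOOD pairs are ordered correctly, so the function returns 0.75.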

The files can only be downloaded one at a time.

Standard Classification Task
Files                           Number of patterns  Time interval  Target variable  Target proportion  Release  File size (KB)
Modeling                        50,000              12 months      Labeled          20% vs. 80%        Feb 12   1,319
LeaderBoard                     10,000              12 months      Unlabeled        Unrevealed         Feb 16   293
Prediction                      10,000              12 months      Unlabeled        Unrevealed         Mar 02   265
Area Under ROC (Java Code)      --                  --             --               --                 Feb 12   1
Leaderboard Submission Example  --                  --             --               --                 Feb 12   76
Variables List                  --                  --             --               --                 Feb 12   6

