PAKDD 2009 Data Mining Competition
Problem Characterization

This credit risk assessment problem comes from the private label credit card operation of a major Brazilian retail chain.

The company has operated its private label card for over 8 years and has applied two different risk assessment methods, with the application acceptance rate varying from 50% to 75% over this period.

Each accepted application gives the applicant (now a client) access to credit for purchases at the retail chain, billed 10 to 40 days after the purchase, monthly, on a fixed day of the month. A client was labeled bad (target variable = 1) if, during the 11 months after the first bill, he or she had any payment default (a delay longer than 60 days). Otherwise, the client was labeled good (target variable = 0). Thus, after credit acceptance, a client would take some time to make a first purchase and receive a first bill. Eleven months later, with or without further bills, the client's set of bills for credit risk assessment was complete. A further 60 days were allowed for the period's last bill to mature.
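The labeling rule can be illustrated with a minimal Python sketch. The function name and the input format (one payment delay in days per bill issued in the 11-month window) are assumptions made for illustration only; the organizers' actual labeling procedure is not published here.

```python
from typing import Sequence

# "a delay longer than 60 days" counts as a payment default, per the text
DEFAULT_THRESHOLD_DAYS = 60


def label_client(payment_delays_days: Sequence[int],
                 threshold_days: int = DEFAULT_THRESHOLD_DAYS) -> int:
    """Return 1 (bad) if any bill in the window was paid more than
    `threshold_days` late, otherwise 0 (good).

    `payment_delays_days` is a hypothetical input: one delay, in days,
    for each bill issued in the 11 months after the first bill.
    """
    return 1 if any(d > threshold_days for d in payment_delays_days) else 0
```

Under this reading, a delay of exactly 60 days does not count as a default, and a client with no further bills in the window is labeled good.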

The competition focuses on robustness of performance against degradation over time. The competitors' task therefore consists of extracting knowledge from the modeling data to achieve the best performance on the company's clients analyzed in a one-year period starting three years later (the prediction data set). Competitors should produce scores that rank the clients, with the highest scores assigned to those most likely to become delinquent.
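As a small illustration of the scoring convention, the Python sketch below orders clients from most to least likely to become delinquent. The client identifiers, the score values and the function name are hypothetical; the submission format itself is not specified in this description.

```python
from typing import Dict, List


def rank_clients(scores: Dict[str, float]) -> List[str]:
    """Return client ids ordered from highest score (riskiest) to lowest.

    Ties keep the order in which the clients appear in `scores`,
    because Python's sort is stable.
    """
    return sorted(scores, key=lambda client_id: scores[client_id], reverse=True)


# Hypothetical scores produced by some model
example_scores = {"c1": 0.12, "c2": 0.87, "c3": 0.40}
ranking = rank_clients(example_scores)
```

Only the ordering matters for ranking, so the scores need not be calibrated probabilities.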

Three data sets are available to the participants: the modeling, leaderboard and prediction sets. All data sets consist of fully matured data captured over a whole year, in different time periods separated by additional years. Labels are available only for the modeling data set, which has roughly 20% bad clients. This class proportion may differ in the leaderboard and prediction data sets, depending on several factors, chiefly the company's policy. The general characteristics of the data samples are presented in the table below.

Data set            Modeling      Leaderboard   Prediction
Number of patterns  50,000        10,000        10,000
Time interval       12 months     12 months     12 months
Target variable     Labeled       Unlabeled     Unlabeled
Target proportion   20% vs. 80%   Unrevealed    Unrevealed

An extra difficulty is that the examples in the data sets were randomly sampled from non-adjacent periods. There are time lags of 12 months between the sampled periods, as shown in the table below.

The information about the clients consists of 31 explanatory variables of several types, affected by the typical imperfections of real-world problems, such as noise, missing data and outliers. The 32nd variable (the last column on the sheet) is the problem target, with value 1 for bad clients and 0 for good clients. The variable list, with descriptions, can be downloaded along with the data sets.
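The record layout described above (31 explanatory values followed by the target) can be sketched in Python as follows. This is only an illustration under stated assumptions: each record is taken to be a sequence of 32 text fields, and a missing value is assumed to appear as an empty field. The real file format and missing-value encoding should be checked against the downloadable variable list.

```python
from typing import List, Sequence, Tuple

N_FEATURES = 31  # explanatory variables, per the description
MISSING = ""     # assumption: a missing value appears as an empty field


def split_row(row: Sequence[str]) -> Tuple[List[str], int]:
    """Split one record into its 31 explanatory values and the 0/1 target."""
    if len(row) != N_FEATURES + 1:
        raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(row)}")
    target = int(row[-1])  # the target is the last (32nd) column
    if target not in (0, 1):
        raise ValueError(f"target must be 0 or 1, got {target}")
    return list(row[:N_FEATURES]), target


def bad_rate(rows: Sequence[Sequence[str]]) -> float:
    """Fraction of records labeled bad (target = 1)."""
    targets = [split_row(r)[1] for r in rows]
    return sum(targets) / len(targets)


def missing_counts(rows: Sequence[Sequence[str]]) -> List[int]:
    """Number of missing values in each of the 31 explanatory columns."""
    counts = [0] * N_FEATURES
    for r in rows:
        features, _ = split_row(r)
        for i, value in enumerate(features):
            if value == MISSING:
                counts[i] += 1
    return counts
```

Checking the bad rate and the per-column missing counts on the modeling set is a quick sanity test before any modeling: the bad rate should come out near the roughly 20% stated above.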

Participants should not give up or feel dismayed at attaining apparently low performance on the modeling or leaderboard data sets. Bear in mind that several variables concerning residence location and personal identification have been encoded or removed, either to preserve clients' confidentiality or to prevent an advantage for teams with knowledge of Brazilian regions. Furthermore, the variable set has been reduced to the intersection of the variables available across all the years in which the data sets' information was stored.

After the competition, the labels of the leaderboard and prediction data sets will not be revealed, because Neurotech aims to provide this environment as a benchmark for future impartial performance assessment on the problem, on a publicly accessible site, with several performance metrics for this binary decision problem.
