The 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2010) is pleased to host another data mining competition, once again co-organized by NeuroTech Ltd. and the Center for Informatics of the Federal University of Pernambuco (Brazil).
Competitions at scientific events are organized worldwide to stimulate the application of state-of-the-art approaches to real-world problems. In recent years, PAKDD has organized several data mining competitions. This year's problem is set in the well-known application of credit scoring, but its solution also applies to several other domains where binary decision solutions in permanent operation have to be re-calibrated based on biased data originating from previous decisions made with a high-quality decision support system.
All the interesting features of last year's competition have been preserved, particularly the real-time LeaderBoard, which encourages competitors to participate daily. The webpage layout has been improved for larger screen sizes, and there is now a moderated forum for discussing the competition. Several binary decision metrics have also been added for evaluating the performance of submissions.
As currently planned, the competition will be held entirely over the internet. A workshop session during the PAKDD Conference remains a possibility.
The competition is open to participants from academia and industry. The only ineligible participants are staff and students of the Center for Informatics of the Federal University of Pernambuco and of NeuroTech Ltd.
Re-Calibration of a Credit Risk Assessment System Based on Biased Data
The most fundamental and most frequently found type of decision is the Binary Decision. This type of decision appears in any business activity where the decision outcome is either to "do that" or to "do something else".
In decision support systems, the typical approach for binary decision problems is to map the multivariate input space into a scalar space (the score) where a simple threshold becomes the control parameter for producing decisions.
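The score-and-threshold approach can be sketched in a few lines of Python. This is only an illustration: the logistic form of the score, the function names `score` and `decide`, and the weights and threshold used are assumptions made for the example, not part of the competition material.

```python
import math


def score(features, weights, bias=0.0):
    """Map a multivariate input to a scalar in (0, 1).

    A logistic model is assumed here purely for illustration; any
    function that returns a single number per case would serve.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def decide(score_value, threshold=0.5):
    """Return True for "do that" when the score reaches the threshold."""
    return score_value >= threshold


# Hypothetical applicant with two input variables.
applicant = [1.0, 2.0]
s = score(applicant, weights=[0.5, -0.25])
print(s, decide(s, threshold=0.5))
```

Moving the threshold changes how many cases receive the "do that" decision without changing the model itself, which is why it acts as the control parameter.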
Binary decisions could, in principle, be assessed as "successful" or "unsuccessful" for either outcome, via type-I and type-II errors. In general, however, only the "do that" decision outcome is monitored for decision assessment. There are several reasons for this, the main one being the cost of betting on decisions that are expected to be wrong.
As a consequence, only a part of the "market" is monitored and has its decisions assessed as "successful" or "unsuccessful". Furthermore, this part is a very biased sample of the market for system re-calibration/re-training because, instead of having been randomly drawn, it has been extracted from the market by a process focused on optimizing the decision objective.
This competition focuses on how to build a model for a binary decision support system based on this type of biased sample in a credit scoring application. Only data about the company's clients are available for modeling; no data are available about the rejected applicants. This is the context of the PAKDD 2010 Competition.
The competition data set available for modeling comprises the company's clients with their delinquency status labeled. The competition leaderboard data set also contains only data about the company's clients. The prediction data set, however, contains randomly selected applicants whose applications were rejected by the credit scoring system but who received credit anyway, for the purpose of monitoring the decision support system's performance and collecting data for future model re-calibration.
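A small simulation can make the selection bias concrete. The sketch below is entirely synthetic: the uniform latent risk, the acceptance cut-off of 0.3, and the names `simulate_market` and `bad_rate` are assumptions chosen for illustration and have no relation to the competition data.

```python
import random


def simulate_market(n=10000, seed=0):
    """Generate (risk, bad) pairs for a synthetic market of applicants."""
    rng = random.Random(seed)
    market = []
    for _ in range(n):
        risk = rng.random()        # latent probability of an "unsuccessful" outcome
        bad = rng.random() < risk  # realized outcome
        market.append((risk, bad))
    return market


def bad_rate(rows):
    """Fraction of cases with an "unsuccessful" outcome."""
    return sum(1 for _, bad in rows if bad) / len(rows)


market = simulate_market()
# The previous decision system only says "do that" for low-risk cases,
# so outcomes are observed for this subset alone.
accepted = [row for row in market if row[0] < 0.3]

print("whole market:", round(bad_rate(market), 3))
print("accepted only:", round(bad_rate(accepted), 3))
```

Because only the accepted cases are monitored, the observed rate of "unsuccessful" outcomes understates the rate in the whole market, and a model trained on that subset never sees the higher-risk region.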
This competition focuses on the credit scoring model's generalization capacity from partial biased data sets available for modeling.
Participants will download a labeled data set from a one-year period for modeling; download an unlabeled data set from a period over one year later and submit their scores to the LeaderBoard; and download another unlabeled data set (the Prediction data set) and submit their scores. These data sets come from a private label credit card operation of a Brazilian credit company and its partner shops, under stable inflation conditions (2006-2009).
The official competition performance metric will be the area under the ROC curve, and a Java routine for calculating it is available for download. Some other model performance metrics will be used for comparative purposes.
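For readers who want to check their scores locally, the area under the ROC curve can be computed from ranks. The following Python sketch is not the official Java routine and may differ from it in details such as tie handling; the function name `roc_auc` and the convention that label 1 marks the positive class are assumptions made for this example.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels: iterable of 0/1 values, with 1 assumed to be the positive class.
    scores: iterable of numbers, higher meaning "more likely positive".
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case of each class")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the end of the group of cases sharing the same score.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j

    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The value is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half.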