CURE: ClUster REsampling

This page contains all the code and data used in the experiments reported in Paper 910, submitted for consideration at ECML PKDD 2019. Below we describe how to reproduce the experiments: first the system requirements, then how to use the code. Finally, we show some auxiliary figures that were not included in the paper due to space constraints.

Requirements

The experiments require Python 2.7 and the following packages: scikit-learn, SciPy, pandas, and imbalanced-learn.
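Before running anything, it can help to confirm that the packages are importable. The snippet below is a minimal sketch and not part of the published code; the helper name `missing_packages` is hypothetical. It uses the standard import names of the four packages (`sklearn`, `scipy`, `pandas`, `imblearn`).

```python
import importlib

# Import names of the packages listed above.
REQUIRED = ["sklearn", "scipy", "pandas", "imblearn"]


def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages: " + ", ".join(missing))
```

If nothing is printed, all four packages were found in the current environment.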

Usage

Tested on Python 2.7.9.

You can download all the code and data sets used in our experiments here. Once the download is complete, use the instructions below to reproduce all our experiments.

Download Datasets and Initialize Databases

To download all of the necessary datasets:

python datasets.py

To initialize the databases:

python databases.py

Run the Experiments and Export the Results

To schedule the experiments associated with the preliminary or final analysis (the command below schedules the final analysis):

python experiments/schedule_final.py

To start a runner that pulls unfinished trials until none are left (several runners can operate simultaneously):

python run.py
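To illustrate how several runners can share one queue of trials without running the same trial twice, here is a minimal sketch. It rests on assumptions, not on the actual `run.py`: it assumes a SQLite database, and the `trials` table, its `status` column, and the function names are all hypothetical.

```python
import sqlite3


def init_db(conn, n_trials):
    # Hypothetical schema: one row per trial with a status column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trials ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO trials (status) VALUES (?)",
        [("pending",)] * n_trials,
    )


def claim_trial(conn):
    """Atomically mark one pending trial as running; return its id or None."""
    # BEGIN IMMEDIATE takes the write lock before reading, so two runners
    # cannot both select the same pending trial.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM trials WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE trials SET status = 'running' WHERE id = ?",
                (row[0],),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else row[0]


def run_until_empty(conn, run_trial):
    """Pull and run trials until none are pending; return the finished ids."""
    finished = []
    while True:
        trial_id = claim_trial(conn)
        if trial_id is None:
            return finished
        run_trial(trial_id)
        conn.execute(
            "UPDATE trials SET status = 'finished' WHERE id = ?",
            (trial_id,),
        )
        finished.append(trial_id)


# isolation_level=None disables implicit transactions so that the explicit
# BEGIN IMMEDIATE / COMMIT statements above control locking.
conn = sqlite3.connect(":memory:", isolation_level=None)
init_db(conn, 3)
done = run_until_empty(conn, lambda trial_id: None)
```

The in-memory database is only for demonstration. Runners in separate processes would need to connect to the same database file, since an in-memory database is private to its connection.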

To export the results from a previously initialized database into a CSV file:

python databases.py

This experimental framework is based on the framework published in:

Koziarski, Michal, Bartosz Krawczyk, and Michal Woźniak. “Radial-Based oversampling for noisy imbalanced data classification.” Neurocomputing (2019).

Auxiliary Figures

In this section we present further auxiliary figures that were not included in the paper due to space constraints. These figures illustrate the impact of changing one of the parameters of our CURE algorithm when the other parameter is fixed at a certain value.
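As a rough illustration of how such rankings can be obtained, the sketch below fixes one parameter and ranks the values of the other on each data set, then averages the ranks over data sets. This assumes the rankings are average ranks with rank 1 as the best, which this page does not state. The function name and the scores are made up purely for illustration and are not results from the paper.

```python
def average_ranks(scores):
    """Average rank of each configuration over all data sets.

    `scores` maps data set name -> {configuration: score}, where a higher
    score is better. Rank 1 is best; tied scores share the mean rank.
    """
    totals = {}
    for per_config in scores.values():
        ordered = sorted(per_config.items(), key=lambda kv: -kv[1])
        i = 0
        while i < len(ordered):
            # Find the end of the run of scores tied with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            # Mean of the 1-based positions i+1 .. j+1.
            shared = (i + j) / 2.0 + 1
            for k in range(i, j + 1):
                config = ordered[k][0]
                totals[config] = totals.get(config, 0.0) + shared
            i = j + 1
    n = float(len(scores))
    return {config: total / n for config, total in totals.items()}


# Made-up scores for three alpha values with s held fixed (illustration only).
toy_scores = {
    "dataset_a": {0.25: 0.80, 0.45: 0.90, 0.65: 0.70},
    "dataset_b": {0.25: 0.60, 0.45: 0.60, 0.65: 0.50},
}
ranks = average_ranks(toy_scores)
```

In this toy input, `0.45` wins on the first data set and ties for first on the second, so it receives the lowest (best) average rank.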

IR50 Data sets: Data sets with 50 minority class examples

Rankings variation of CURE method for IR50 for \(s=0.25\) and \(0.25 \leq \alpha \leq 0.85\)

Rankings variation of CURE method for IR50 for \(s=0.45\) and \(0.25 \leq \alpha \leq 0.85\)

Rankings variation of CURE method for IR50 for \(s=0.65\) and \(0.25 \leq \alpha \leq 0.85\)

Rankings variation of CURE method for IR50 for \(s=0.85\) and \(0.25 \leq \alpha \leq 0.85\)

Rankings variation of CURE method for IR50 for \(s=1.0\) and \(0.25 \leq \alpha \leq 0.85\)

IR30 Data sets: Data sets with 30 minority class examples

Rankings variation of CURE method for IR30 for \(s=0.25\) and \(0.25 \leq \alpha \leq 0.85\)

Rankings variation of CURE method for IR30 for \(s=0.45\) and \(0.25 \leq \alpha \leq 0.85\)

Rankings variation of CURE method for IR30 for \(s=0.65\) and \(0.25 \leq \alpha \leq 0.85\)

Rankings variation of CURE method for IR30 for \(s=0.85\) and \(0.25 \leq \alpha \leq 0.85\)

Rankings variation of CURE method for IR30 for \(s=1.0\) and \(0.25 \leq \alpha \leq 0.85\)

IR10 Data sets: Data sets with 10 minority class examples

Rankings variation of CURE method for IR10 for \(s=0.25\) and \(0.25 \leq \alpha \leq 0.85\)

[Figure: size10std0.25ByAlphaHist]

Rankings variation of CURE method for IR10 for \(s=0.45\) and \(0.25 \leq \alpha \leq 0.85\)

[Figure: std0.45ByAlphaHist]

Rankings variation of CURE method for IR10 for \(s=0.65\) and \(0.25 \leq \alpha \leq 0.85\)

[Figure: std0.65ByAlphaHist]

Rankings variation of CURE method for IR10 for \(s=0.85\) and \(0.25 \leq \alpha \leq 0.85\)

[Figure: std0.85ByAlphaHist]

Rankings variation of CURE method for IR10 for \(s=1.0\) and \(0.25 \leq \alpha \leq 0.85\)

[Figure: std1.0ByAlphaHist]