What makes multi-class imbalanced problems difficult? An experimental study.

Mateusz Lango, Jerzy Stefanowski

1 Impact of class overlapping and imbalanced ratio on classifier performance

1.1 Decision Tree, datasets with 3 classes

Class recalls (odd rows) and OC Ratio (even rows) for the triangle datasets with imbalanced ratios (columns) from 1 to 20 versus class overlapping, measured by standard deviation. Class 1 is the majority class, whereas classes 2 and 3 are minority classes (except in the balanced dataset, where all classes have the same size).
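To make the roles of the two experimental factors concrete, the following is a minimal sketch of how a triangle dataset of this kind could be generated. It is not the authors' generator: the equilateral layout, the side length `side`, the minority class size `n_minority`, and the function name `make_triangle_dataset` are all assumptions introduced for illustration. The only elements taken from the caption are that class 1 is IR times larger than classes 2 and 3, and that overlapping grows with the standard deviation.

```python
import math
import random


def make_triangle_dataset(ir, std, n_minority=100, side=10.0, seed=0):
    """Sample 2-D points for three Gaussian classes placed at triangle vertices.

    Assumptions (not from the source): an equilateral triangle with the given
    side length, isotropic Gaussian classes, and n_minority examples in each
    minority class. Class 1 is the majority class with n_minority * ir examples.
    """
    rng = random.Random(seed)
    centers = [
        (0.0, 0.0),
        (side, 0.0),
        (side / 2.0, side * math.sqrt(3) / 2.0),
    ]
    sizes = [int(round(n_minority * ir)), n_minority, n_minority]

    points, labels = [], []
    for label, ((cx, cy), size) in enumerate(zip(centers, sizes), start=1):
        for _ in range(size):
            # A larger std spreads each class further and increases overlapping.
            points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
            labels.append(label)
    return points, labels


points, labels = make_triangle_dataset(ir=5, std=3.0, n_minority=10)
```

With `ir=1` the sketch yields the balanced dataset in which all classes have the same size; larger values of `std` relative to `side` correspond to stronger overlapping.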

1.2 K-nearest neighbors, datasets with 3 classes

Class recalls (odd rows) and OC Ratio (even rows) for the triangle datasets with imbalanced ratios (columns) from 1 to 20 versus class overlapping, measured by standard deviation. Class 1 is the majority class, whereas classes 2 and 3 are minority classes (except in the balanced dataset, where all classes have the same size).

1.3 Bagging with 30 decision trees, datasets with 3 classes

Class recalls (odd rows) and OC Ratio (even rows) for the triangle datasets with imbalanced ratios (columns) from 1 to 20 versus class overlapping, measured by standard deviation. Class 1 is the majority class, whereas classes 2 and 3 are minority classes (except in the balanced dataset, where all classes have the same size).

1.4 Decision Tree, datasets with 4 classes

Class recalls (odd rows) and OC Ratio (even rows) for the square datasets with imbalanced ratios (columns) from 1 to 20 versus class overlapping, measured by standard deviation. Class 1 is the majority class, whereas classes 2, 3 and 4 are minority classes (except in the balanced dataset, where all classes have the same size).

1.5 K-nearest neighbors, datasets with 4 classes

Class recalls (odd rows) and OC Ratio (even rows) for the square datasets with imbalanced ratios (columns) from 1 to 20 versus class overlapping, measured by standard deviation. Class 1 is the majority class, whereas classes 2, 3 and 4 are minority classes (except in the balanced dataset, where all classes have the same size).

1.6 Bagging with 30 decision trees, datasets with 4 classes

Class recalls (odd rows) and OC Ratio (even rows) for the square datasets with imbalanced ratios (columns) from 1 to 20 versus class overlapping, measured by standard deviation. Class 1 is the majority class, whereas classes 2, 3 and 4 are minority classes (except in the balanced dataset, where all classes have the same size).

2 The role of the different class size configurations

2.1 Decision Tree, datasets with 3 classes

The values of OC Ratio obtained by the decision tree classifier on triangle datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

The values of Recall obtained by the decision tree classifier on triangle datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

2.2 K-nearest neighbors, datasets with 3 classes

The values of OC Ratio obtained by the kNN classifier on triangle datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

The values of Recall obtained by the kNN classifier on triangle datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

2.3 Bagging with 30 decision trees, datasets with 3 classes

The values of OC Ratio obtained by the bagging classifier on triangle datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

The values of Recall obtained by the bagging classifier on triangle datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

2.4 Decision Tree, datasets with 4 classes

The values of OC Ratio obtained by the decision tree classifier on square datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

The values of Recall obtained by the decision tree classifier on square datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

2.5 K-nearest neighbors, datasets with 4 classes

The values of OC Ratio obtained by the kNN classifier on square datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

The values of Recall obtained by the kNN classifier on square datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

2.6 Bagging with 30 decision trees, datasets with 4 classes

The values of OC Ratio obtained by the bagging classifier on square datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

The values of Recall obtained by the bagging classifier on square datasets for the imbalanced ratio IR growing from 1 to 20, with three levels of class overlapping, std in {1, 3, 10} (columns), and three types of class distributions (rows).

3 Overlapping and interrelation between different class types

3.1 Decision Tree

Recall results on the triangle datasets (IR=8), where the class overlap was increased only for the selected class. Each row presents the results for one dataset configuration (the distribution of class cardinalities is indicated in the plot title). The first two plots in each row (from the left) show class recalls when the selected class (green line) was overlapped with one of the two other classes. The third plot (the right column) compares the green lines from the first two plots for the selected class.

OC Ratio results on the triangle datasets (IR=8), where the class overlap was increased only for the selected class. Each row presents the results for one dataset configuration (the distribution of class cardinalities is indicated in the plot title). The first two plots in each row (from the left) show per-class results when the selected class (green line) was overlapped with one of the two other classes. The third plot (the right column) compares the green lines from the first two plots for the selected class.

3.2 K-nearest neighbors

Recall results on the triangle datasets (IR=8), where the class overlap was increased only for the selected class. Each row presents the results for one dataset configuration (the distribution of class cardinalities is indicated in the plot title). The first two plots in each row (from the left) show class recalls when the selected class (green line) was overlapped with one of the two other classes. The third plot (the right column) compares the green lines from the first two plots for the selected class.

OC Ratio results on the triangle datasets (IR=8), where the class overlap was increased only for the selected class. Each row presents the results for one dataset configuration (the distribution of class cardinalities is indicated in the plot title). The first two plots in each row (from the left) show per-class results when the selected class (green line) was overlapped with one of the two other classes. The third plot (the right column) compares the green lines from the first two plots for the selected class.

3.3 Bagging with 30 decision trees

Recall results on the triangle datasets (IR=8), where the class overlap was increased only for the selected class. Each row presents the results for one dataset configuration (the distribution of class cardinalities is indicated in the plot title). The first two plots in each row (from the left) show class recalls when the selected class (green line) was overlapped with one of the two other classes. The third plot (the right column) compares the green lines from the first two plots for the selected class.

OC Ratio results on the triangle datasets (IR=8), where the class overlap was increased only for the selected class. Each row presents the results for one dataset configuration (the distribution of class cardinalities is indicated in the plot title). The first two plots in each row (from the left) show per-class results when the selected class (green line) was overlapped with one of the two other classes. The third plot (the right column) compares the green lines from the first two plots for the selected class.

4 Increasing the number of classes

4.1 Decision Tree

The recall of the worst recognized class vs. the number of classes in the dataset for the tree classifier. Each row presents the results for datasets with a different type of class size configuration (multi-minority, multi-majority and gradual), whereas each column presents the results for datasets with a different level of class overlapping (no overlapping: std=1, moderate overlapping: std=3, strong overlapping: std=10).
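The quantities named in this caption can be sketched as follows. The recall of a class is the fraction of its examples that are predicted correctly, and the worst recognized class is the one with the lowest recall. The class size configurations are only named in the source, so the definitions below are assumptions based on common usage: multi-minority as one majority class with the remaining classes minority, multi-majority as one minority class with the remaining classes majority, and gradual as sizes decreasing linearly from majority to minority. The function names and the `n_minority` parameter are hypothetical.

```python
def class_sizes(n_classes, ir, config, n_minority=100):
    """Return a list of class sizes for one configuration (assumed definitions)."""
    n_majority = n_minority * ir
    if config == "multi-minority":
        # One majority class, all remaining classes are minority classes.
        return [n_majority] + [n_minority] * (n_classes - 1)
    if config == "multi-majority":
        # One minority class, all remaining classes are majority classes.
        return [n_majority] * (n_classes - 1) + [n_minority]
    if config == "gradual":
        # Sizes decrease linearly from the majority to the minority size.
        step = (n_majority - n_minority) / (n_classes - 1)
        return [round(n_majority - i * step) for i in range(n_classes)]
    raise ValueError(f"unknown configuration: {config}")


def class_recalls(y_true, y_pred):
    """Map each class label to the fraction of its examples predicted correctly."""
    totals, hits = {}, {}
    for true_label, predicted_label in zip(y_true, y_pred):
        totals[true_label] = totals.get(true_label, 0) + 1
        hits[true_label] = hits.get(true_label, 0) + int(true_label == predicted_label)
    return {label: hits[label] / totals[label] for label in totals}


def worst_class_recall(y_true, y_pred):
    """Return the recall of the worst recognized class."""
    return min(class_recalls(y_true, y_pred).values())


y_true = [1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 2, 1, 1, 1]
print(class_sizes(3, 8, "gradual", n_minority=10))
print(worst_class_recall(y_true, y_pred))
```

In the usage lines, class 3 is never predicted, so it is the worst recognized class even though the majority class is recognized perfectly; this is the effect the plots track as the number of classes grows.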

4.2 K-nearest neighbors

The recall of the worst recognized class vs. the number of classes in the dataset for the kNN classifier. Each row presents the results for datasets with a different type of class size configuration (multi-minority, multi-majority and gradual), whereas each column presents the results for datasets with a different level of class overlapping (no overlapping: std=1, moderate overlapping: std=3, strong overlapping: std=10).

4.3 Bagging with 30 decision trees

The recall of the worst recognized class vs. the number of classes in the dataset for the bagging classifier. Each row presents the results for datasets with a different type of class size configuration (multi-minority, multi-majority and gradual), whereas each column presents the results for datasets with a different level of class overlapping (no overlapping: std=1, moderate overlapping: std=3, strong overlapping: std=10).