We first define the notion of operating context. Given a dataset T = <X, Y> where Y = ℝ, the operating context c ∈ Y is defined as the value that determines a cutoff that splits the set Y (or its prediction Ŷ) into two disjoint sets: the values y_i (or ŷ_i) that are above and below c, respectively.
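As an illustration of this definition, the following minimal sketch binarises a list of true or predicted values at a cutoff c. The function name `binarise` and the sample values are hypothetical, and treating values equal to c as "below" is an assumption, since the text does not say how ties are handled.

```python
from typing import List, Sequence


def binarise(values: Sequence[float], cutoff: float) -> List[int]:
    """Map each value to 1 if it is strictly above the cutoff, else 0.

    Assumption: values equal to the cutoff are assigned to the lower set.
    """
    return [1 if v > cutoff else 0 for v in values]


# Hypothetical true outputs and predictions, used only for illustration.
y_true = [1.0, 2.5, 4.0, 7.5]
y_pred = [1.5, 3.5, 3.0, 8.0]
c = 3.0

print(binarise(y_true, c))  # [0, 0, 1, 1]
print(binarise(y_pred, c))  # [0, 1, 0, 1]
```

The same cutoff is applied to Y and to Ŷ, so the binarised labels and the binarised predictions can be compared directly.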
The binarised regression problems are characterised by the following features that determine the context: the cutoff distribution and its range. According to the degree of knowledge about these two features of the context that we may have during training, we distinguish the following cases (from more to less information):
Case a is the simplest one. In this scenario, there is no need to learn a regression model, since the cutoff is available at training time: the problem is first binarised using the cutoff and a classifier is then learnt. For the remaining cases, however, the performance of the models must be examined over the range of cutoff values we are interested in. For cases c and d, where no information about the cutoff is available, a uniformly distributed or an output-distributed cutoff may be the most straightforward assumptions, as we will discuss in the following sections; the uniform distribution is especially applicable when we want to focus on a small region of cutoffs. Without some such assumption, the problem could not be solved.
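To make the idea of examining performance over a range of cutoffs concrete, here is a minimal sketch that binarises the true and predicted values at each candidate cutoff and averages the resulting error. The function names and data are hypothetical, and the unweighted 0/1 misclassification rate is an assumption chosen for simplicity; it is not one of the measures recommended later in the paper.

```python
from typing import Sequence


def misclassification_rate(
    y_true: Sequence[float], y_pred: Sequence[float], cutoff: float
) -> float:
    """Fraction of examples whose true and predicted values fall on
    different sides of the cutoff (ties count as 'below')."""
    disagreements = sum(
        (t > cutoff) != (p > cutoff) for t, p in zip(y_true, y_pred)
    )
    return disagreements / len(y_true)


def mean_error_over_cutoffs(
    y_true: Sequence[float], y_pred: Sequence[float], cutoffs: Sequence[float]
) -> float:
    """Average the misclassification rate over a set of candidate cutoffs,
    giving every cutoff the same weight."""
    total = sum(misclassification_rate(y_true, y_pred, c) for c in cutoffs)
    return total / len(cutoffs)


# Hypothetical data and cutoffs, used only for illustration.
y_true = [1.0, 2.5, 4.0, 7.5]
y_pred = [1.5, 3.5, 3.0, 8.0]
cutoffs = [2.0, 3.0, 5.0]

for c in cutoffs:
    print(c, misclassification_rate(y_true, y_pred, c))
print(mean_error_over_cutoffs(y_true, y_pred, cutoffs))
```

Giving every cutoff equal weight corresponds to assuming a uniform cutoff distribution over the chosen set; a different assumed distribution would change the weights.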
Focussing on cases b and c, we have considered six possible binarised regression scenarios (types), which are shown in Table 1. The second column gives the true distribution of cutoffs (w), which can be uniform, be similar to the output distribution observed in the training set, or follow any other distribution. The third column shows the range of this distribution as Full or Region, depending on whether we expect a wide or a narrow range of cutoffs, respectively. The fourth column shows whether each type of problem can be considered common, and the fifth column shows which error measures are recommended for each case (these are introduced in the next sections).
For each problem in the repository, we have included information about the binarised regression task, the distribution and regions of cutoffs that make more sense for the problem, as well as the CFP vs CFN costs. The repository features three datasets for which the cutoff has a narrow uniform range (type 2), seven with a wide output-distributed range (type 3), one with a narrow output-distributed range (type 4), eight with a wide range with any other distribution (type 5) and one with a narrow range with any other distribution (type 6).
| Type | Cutoff distribution | Range | Common? | Measure Method |
|---|---|---|---|---|
| 1 | UNIFORM | Full | No | MAE |
| 2 | UNIFORM | Region | Yes | Clipped MAE (cMAE) |
| 3 | OUTPUT | Full | Yes | AOCE |
| 4 | OUTPUT | Region | Yes | AOCE for the region (or cMAE) |
| 5 | OTHER | Full | Yes | ACE |
| 6 | OTHER | Region | Yes | ACE for the region (or cMAE) |
Table 1. Types of cutoff contexts, including the distribution and range of cutoffs. Common? indicates how realistic each kind of problem seems to be, and Measure Method shows which evaluation measure is recommended, where AP denotes the area under the plot P. For types 4 and 6, if the region is small, it is acceptable to replace the true distribution with a flat uniform distribution, so that cMAE can ultimately be used.
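The UNIFORM and OUTPUT cutoff distributions and the Full and Region ranges of Table 1 can be illustrated with a small sampling sketch. Everything here is hypothetical: the function `sample_cutoffs`, its parameters, and the choice to approximate the output distribution by resampling the training outputs are assumptions made for illustration, not a procedure taken from the paper.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_cutoffs(
    y_train: Sequence[float],
    n: int,
    distribution: str = "uniform",
    region: Optional[Tuple[float, float]] = None,
    seed: int = 0,
) -> List[float]:
    """Draw n cutoffs under an assumed cutoff distribution.

    distribution: "uniform" draws cutoffs uniformly over the range;
                  "output" resamples the observed training outputs.
    region: (low, high) for a narrow Region range; None means the Full
            range spanned by the training outputs.
    """
    rng = random.Random(seed)
    low, high = region if region is not None else (min(y_train), max(y_train))
    if distribution == "uniform":
        return [rng.uniform(low, high) for _ in range(n)]
    if distribution == "output":
        # Keep only the training outputs that fall inside the range.
        pool = [y for y in y_train if low <= y <= high]
        if not pool:
            raise ValueError("no training outputs fall inside the region")
        return [rng.choice(pool) for _ in range(n)]
    raise ValueError(f"unknown distribution: {distribution!r}")


# Hypothetical training outputs, used only for illustration.
y_train = [1.0, 2.5, 4.0, 7.5, 9.0]
print(sample_cutoffs(y_train, 3, "uniform"))                  # type 1 setting
print(sample_cutoffs(y_train, 3, "uniform", (2.0, 5.0)))      # type 2 setting
print(sample_cutoffs(y_train, 3, "output"))                   # type 3 setting
print(sample_cutoffs(y_train, 3, "output", (2.0, 5.0)))       # type 4 setting
```

Types 5 and 6 would require a user-supplied distribution, which this sketch does not cover.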