Binarised Regression - Context plots and Cutoff Distributions

We first define the notion of operating context. Given a dataset T = <X, Y> where Y = ℝ, the operating context c ∈ Y is defined as the value that determines a cutoff that splits the set Y (or its prediction Ŷ) into two disjoint sets: the values y_i (or ŷ_i) that are above and below c, respectively.
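As a minimal sketch of this definition, the snippet below binarises true and predicted values with a single cutoff c and compares the two resulting splits. The function name `binarise` and the toy values are illustrative assumptions, as is the choice to treat a value exactly equal to c as "not above"; the text does not specify how ties are handled.

```python
import numpy as np

def binarise(values, cutoff):
    """Return a boolean array that is True where a value lies above the cutoff.

    Assumption: values equal to the cutoff count as "not above".
    """
    return np.asarray(values, dtype=float) > cutoff

# Toy data (hypothetical): true outputs and the predictions of some regressor.
y_true = np.array([1.0, 2.5, 4.0, 7.5])
y_pred = np.array([1.5, 3.5, 3.0, 8.0])
c = 3.2  # operating context (cutoff)

above_true = binarise(y_true, c)
above_pred = binarise(y_pred, c)

# Fraction of examples that fall on the wrong side of the cutoff.
error_rate = float(np.mean(above_true != above_pred))
print(above_true, above_pred, error_rate)
```

Note that a regression error only matters here when the true and predicted values end up on different sides of c.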

Context plots and Cutoff Distributions

The binarised regression problems are characterised by the following features that determine the context: the cutoff distribution and its range. According to the degree of knowledge about these two features of the context that we may have during training, we distinguish the following cases (from more to less information):

a) We know the exact cutoff.
b) We know the range and the expected distribution of cutoffs for deployment precisely.
c) We may have some information about the range but not about the distribution.
d) We may have a complete absence of information about the range and the distribution.

Case a is the simplest one. In this scenario, there is no need to learn a regression model, since the problem can be solved directly by using the cutoff at training time: first, the problem is binarised with the cutoff and then a classifier is learnt. For the remaining cases, however, we must examine how the models perform over the range of cutoff values we are interested in. For cases c and d, where the distribution of cutoffs is not known, assuming uniformly distributed or output-distributed cutoffs is the most straightforward option, as we will discuss in the following sections; the uniform assumption is especially applicable when we want to focus on a small region of cutoffs. Without an assumption of this kind, the problem could not be solved.
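The two default assumptions for the cutoff distribution can be sketched as follows. This is only an illustration: the function `sample_cutoffs`, its parameters, and the choice to draw output-distributed cutoffs by resampling the training outputs are assumptions made here, not part of the original method.

```python
import numpy as np

def sample_cutoffs(y_train, n, distribution="uniform", region=None, seed=0):
    """Draw n cutoffs under one of two simple assumptions.

    distribution="uniform": cutoffs are uniform over the range.
    distribution="output":  cutoffs are resampled from the training outputs.
    region: optional (low, high) pair; defaults to the full training range.
    """
    rng = np.random.default_rng(seed)
    y_train = np.asarray(y_train, dtype=float)
    low, high = region if region is not None else (y_train.min(), y_train.max())

    if distribution == "uniform":
        return rng.uniform(low, high, size=n)
    if distribution == "output":
        # Keep only the training outputs that fall inside the region.
        candidates = y_train[(y_train >= low) & (y_train <= high)]
        if candidates.size == 0:
            raise ValueError("no training outputs fall inside the region")
        return rng.choice(candidates, size=n, replace=True)
    raise ValueError(f"unknown distribution: {distribution!r}")

# Hypothetical training outputs.
y_train = [0.0, 1.0, 2.0, 10.0]
narrow_uniform = sample_cutoffs(y_train, 5, "uniform", region=(1.0, 2.0))
full_output = sample_cutoffs(y_train, 5, "output")
print(narrow_uniform, full_output)
```

A model can then be evaluated by binarising its predictions at each sampled cutoff and averaging the resulting errors.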

Focussing on cases b and c, we have considered six possible binarised regression scenarios (types), which are shown in Table 1. The second column gives the true distribution of cutoffs (w), which can be uniform, similar to the output distribution observed in the training set, or any other distribution. The third column shows the range of this distribution as Full or Region, depending on whether we expect a wide or a narrow range of cutoffs, respectively. The fourth column shows whether each type of problem can be considered common, and the fifth column shows which error measure is recommended for each case (these measures are introduced in the next sections).

For each problem in the repository, we have included information about the binarised regression task, the distribution and regions of cutoffs that make more sense for the problem, as well as the CF_P vs CF_N costs. The repository features three datasets for which the cutoff has a narrow uniform range (type 2), seven with a wide output-distributed range (type 3), one with a narrow output-distributed range (type 4), eight with a wide range with any other distribution (type 5) and one with a narrow range with any other distribution (type 6).

Type	Cutoff distribution	Range	Common?	Measure Method
1	UNIFORM	Full	No	MAE
2	UNIFORM	Region	Yes	Clipped MAE (cMAE)
3	OUTPUT	Full	Yes	A_OCE
4	OUTPUT	Region	Yes	A_OCE for the region (or cMAE)
5	OTHER	Full	Yes	A_CE
6	OTHER	Region	Yes	A_CE for the region (or cMAE)
Table 1. Types of cutoff contexts, including the distribution and range of cutoffs. Common? indicates how realistic each kind of problem seems to be, and Measure Method shows which evaluation measure is recommended, where A_P denotes the area under the plot P. For types 4 and 6, if the region is small it is acceptable to replace the true distribution with a flat uniform distribution, so that cMAE can be used.
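One way to see why MAE and cMAE pair naturally with uniform cutoffs: for a single example, the set of cutoffs at which the prediction falls on the wrong side is the interval between y and ŷ, whose length is |y − ŷ|. Restricting cutoffs to a region [low, high] gives the same length after clipping both values to that region. The sketch below checks this numerically. It assumes that cMAE means the MAE computed after clipping true and predicted values to the region, which is a reading of the name rather than a definition taken from the text; the function names and data are hypothetical.

```python
import numpy as np

def clipped_mae(y_true, y_pred, low, high):
    """MAE after clipping both vectors to [low, high] (assumed reading of cMAE)."""
    t = np.clip(np.asarray(y_true, dtype=float), low, high)
    p = np.clip(np.asarray(y_pred, dtype=float), low, high)
    return float(np.mean(np.abs(t - p)))

def integrated_error(y_true, y_pred, low, high, n_grid=10001):
    """Approximate the misclassification rate integrated over cutoffs in [low, high]."""
    t = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    cutoffs = np.linspace(low, high, n_grid)
    # disagree[i, j] is True when example j is misclassified at cutoff i.
    disagree = (t[None, :] > cutoffs[:, None]) != (p[None, :] > cutoffs[:, None])
    # Mean over cutoffs and examples, scaled by the width of the region.
    return float(disagree.mean() * (high - low))

y_true = np.array([1.0, 2.5, 4.0, 7.5])
y_pred = np.array([1.5, 3.5, 3.0, 8.0])

# Full range: the integrated error approximates the plain MAE.
print(integrated_error(y_true, y_pred, 0.0, 10.0), np.mean(np.abs(y_true - y_pred)))
# Narrow region: it approximates the clipped MAE instead.
print(integrated_error(y_true, y_pred, 2.0, 5.0), clipped_mae(y_true, y_pred, 2.0, 5.0))
```

The integrated value is unnormalised; dividing by the region width turns it into an expected error rate under uniform cutoffs.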


Common Problem Families

Supply-demand regulation: It includes all problems where we predict sales, trade, stocks, etc., and there is a supply-demand equilibrium, which aligns the cutoffs with the output values.
Trend sign detection: It includes all problems where we predict the change of a quantity in time and the decision is whether the value is higher or lower than the current value (so cutoffs are taken from the output values).
Who's above me: It includes all problems where an individual is interested in knowing which examples are above them, independently of the magnitude. Again, the cutoffs are samples from the output distribution.
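To illustrate how cutoffs arise from the output values in these families, here is a small sketch of trend sign detection, where the cutoff for each prediction is the current value of the series. The series, the forecasts and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical time series and one-step-ahead forecasts for series[1:].
series = np.array([10.0, 12.0, 11.0, 15.0, 14.0])
forecast = np.array([11.0, 11.5, 13.0, 15.5])

# The cutoff for each step is the current value, so cutoffs are output values.
cutoffs = series[:-1]

actual_up = series[1:] > cutoffs   # did the quantity really increase?
predicted_up = forecast > cutoffs  # did the model predict an increase?

accuracy = float(np.mean(actual_up == predicted_up))
print(actual_up, predicted_up, accuracy)
```

Because every cutoff is itself an observed output, the cutoff distribution matches the output distribution, which corresponds to the OUTPUT rows of the table.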
