Binarised Regression - Data sets


Summary

TYPE: 1
DATASETS: None provided
CUTOFF DISTRIBUTION: UNIFORM with WIDE range
IS IT COMMON?: No, quite unlikely
EXPECTED ERROR: AUCE

TYPE: 2
DATASETS:
  Airfoil Data Set (airfoil)
  Energy efficiency Data Set - Cooling task (data_cooling)
  Energy efficiency Data Set - Heating task (data_heating)
CUTOFF DISTRIBUTION: UNIFORM with NARROW range
IS IT COMMON?: Yes
EXPECTED ERROR: Clipped AUCE

TYPE: 3
DATASETS:
  Yacht Hydrodynamics Data Set (yathc)
  Concrete Data Set (concrete)
  Housing Data Set (housing)
  Solar Flare (flare.data1) Data Set (solar1)
  Solar Flare (flare.data2) Data Set (solar2)
  Plastic Data Set (plastic)
  Auto Price Data Set (autoprice)
CUTOFF DISTRIBUTION: OUTPUT with WIDE range
IS IT COMMON?: Yes
EXPECTED ERROR: AOCE

TYPE: 4
DATASETS:
  Auto MPG Data Set (autoMpg)
CUTOFF DISTRIBUTION: OUTPUT with NARROW range
IS IT COMMON?: Yes
EXPECTED ERROR: AOCE(*)

TYPE: 5
DATASETS:
  Wankara Data Set (ankara)
  Wizmir Data Set (izmir)
  Treasury-Mortgage Data Sets (mortgage)
  Dee Data Set (dee)
  Wine Quality (white) Data Set (wineW)
  Wine Quality (red) Data Set (wineR)
  Forest Fires Data Set (forest)
  Combined Cycle Power Plant Data Set (combined)
CUTOFF DISTRIBUTION: OTHER with WIDE range
IS IT COMMON?: Yes
EXPECTED ERROR: ACE

TYPE: 6
DATASETS:
  CPU Small Data Set (cpu)
CUTOFF DISTRIBUTION: OTHER with NARROW range
IS IT COMMON?: Yes
EXPECTED ERROR: ACE(*)

Note 1: The dataset identifiers used in the paper are given in parentheses.

Note 2: (*) The error measure is only applied to the cutoff region.




Type 1


None provided


Type 2



Airfoil Data Set (TYPE 2)

STORY: The airfoil self-noise dataset consists of aerodynamic and acoustic tests of two- and three-dimensional airfoil blade sections conducted in an anechoic wind tunnel. The input space X consists of the frequency, the angle of attack, the chord length, the free-stream velocity and the suction side displacement thickness. The output space Y is a numerical variable describing the scaled sound pressure level, in decibels. As the output is numerical, the data looks like a regression problem. A model m(x) that estimates the sound level y' in Y of a new blade x in X before it is built can be very useful. However, in this scenario we have to consider that aircraft noise regulations may change from country to country (or even between types of aircraft), so once the blade is built and put into place, the operating context c (the cutoff above which the plane is not authorised to fly because of its noise) may change, and is rarely known in advance. Once the cutoff c is known, a decision process has to discard the blade if its noise is above the cutoff c or take it into production if it is below. As the cutoff c is subject to change, we want procedures and models that behave well over the range of possible cutoffs.
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: No, it's reasonable to think that the range will be smaller than the range of output values.
FPCost vs FNcost: Not very uneven.
SUGGESTED RANGE: absolute cutoff values 110-140 (30%-75% of the cutoffs with UNIFORM distribution).
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=1 and beta=1.
SIZE: 1503
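
The decision process described in the story can be sketched as follows. This is only an illustration, not the authors' procedure: the function names, the toy noise values and the evenly spaced grid of cutoffs over the suggested 110-140 range are all assumptions made for the example.

```python
def binarise(value, cutoff):
    """Return True when the blade is discarded (noise above the cutoff)."""
    return value > cutoff


def mean_decision_error(y_true, y_pred, cutoffs):
    """Average fraction of wrong discard/keep decisions over all cutoffs."""
    total = 0.0
    for c in cutoffs:
        wrong = sum(binarise(t, c) != binarise(p, c) for t, p in zip(y_true, y_pred))
        total += wrong / len(y_true)
    return total / len(cutoffs)


# Hypothetical sound levels (dB) and model estimates for four blades.
y_true = [112.0, 125.0, 131.0, 138.0]
y_pred = [115.0, 127.0, 129.0, 136.0]

# Evenly spaced cutoffs over the suggested range 110-140 (assumed grid).
cutoffs = [110 + 5 * i for i in range(7)]  # 110, 115, ..., 140

error = mean_decision_error(y_true, y_pred, cutoffs)
```

Averaging over the whole grid rewards a model whose decisions are right for most plausible cutoffs, rather than for a single one fixed in advance.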

Energy efficiency Data Set, Cooling task (TYPE 2)

STORY: This dataset comes from an energy analysis using 12 different building shapes. As there are two responses, we can derive two datasets from it. Cutoffs could be some kind of certification threshold above (or below) which the building is considered "energy efficient", depending on the country, the region or the use of the building.
PARADIGM: Supply-demand regulation
WIDE-RANGE OF CUTOFFS: No.
FPCost vs FNcost: False positives are more costly.
SUGGESTED RANGE: absolute cutoff values 18-30 (0%-55% of the cutoffs with UNIFORM distribution).
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 768
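
One possible reading of the suggested range and distribution is that cutoffs are drawn from the beta distribution and rescaled to the absolute range. The sketch below assumes that reading. The helper name `sample_cutoffs`, the sample size and the seed are hypothetical, and only the Python standard library is used.

```python
import random


def sample_cutoffs(low, high, alpha, beta, n, seed=0):
    """Draw n cutoffs from Beta(alpha, beta) rescaled to [low, high]."""
    rng = random.Random(seed)
    return [low + (high - low) * rng.betavariate(alpha, beta) for _ in range(n)]


# Suggested values for the cooling task: range 18-30, alpha=2, beta=2.
cutoffs = sample_cutoffs(18.0, 30.0, alpha=2.0, beta=2.0, n=1000)
mean_cutoff = sum(cutoffs) / len(cutoffs)
```

With alpha=2 and beta=2 the samples concentrate around the middle of the range, while alpha=1 and beta=1 would reduce to a uniform distribution over it.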

Energy efficiency Data Set, Heating task (TYPE 2)

STORY: This dataset comes from an energy analysis using 12 different building shapes. As there are two responses, we can derive two datasets from it. Cutoffs could be some kind of certification threshold above (or below) which the building is considered "energy efficient", depending on the country, the region or the use of the building.
PARADIGM: Supply-demand regulation
WIDE-RANGE OF CUTOFFS: No.
FPCost vs FNcost: False positives are more costly.
SUGGESTED RANGE: absolute cutoff values 26-38 (50%-85% of the cutoffs with UNIFORM distribution).
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 768

Type 3



Yacht Hydrodynamics Data Set (TYPE 3)

STORY: Predicting the residuary resistance of sailing yachts at the initial design stage is of great value for evaluating the ship's performance and for estimating the required propulsive power. Depending on that estimate, some engines will produce insufficient propulsive power. However, using the engine power as the cutoff is unrealistic. It is perhaps more realistic to think of cutoffs for minimum efficiency: if the cutoff is not reached, the ship design must be rethought or the ship rebuilt.
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: Yes.
FPCost vs FNcost: False negatives are more costly.
SUGGESTED RANGE: absolute cutoff values 0-60 (0%-60% of the cutoffs with OUTPUT distribution).
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=5.
SIZE: 308

Concrete Data Set (TYPE 3)

STORY: Concrete is the most important material in civil engineering. Its compressive strength is a highly nonlinear function of age and ingredients. Different strengths are obtained and used depending on the part of a building. Also, during construction, the strength of some parts of the building increases with the number of days elapsed. A reasonable cutoff here is the level of compressive strength needed by a given part of the structure (to build upon it or to resist a given weight). This all seems reasonable, but one difficulty is that false positives are much more costly than false negatives.
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: Yes, depending on the part of the structure.
FPCost vs FNcost: False positives are much more costly.
SUGGESTED RANGE: absolute cutoff values 15-85 (15%-85% of the cutoffs with OUTPUT distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 1030
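
The effect of uneven costs can be illustrated with a cost-weighted count of wrong decisions at one cutoff. This is a sketch under stated assumptions: "positive" is taken to mean strength at or above the cutoff, and the toy values and the 10:1 cost ratio are hypothetical, not taken from the dataset.

```python
def weighted_cost(y_true, y_pred, cutoff, fp_cost, fn_cost):
    """Total cost of binarised decisions at one cutoff.

    Assumption: "positive" means strength at or above the cutoff, so a
    false positive is a part predicted to be strong enough that is not.
    """
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        actual, predicted = t >= cutoff, p >= cutoff
        if predicted and not actual:
            cost += fp_cost  # weak part wrongly accepted
        elif actual and not predicted:
            cost += fn_cost  # adequate part wrongly rejected
    return cost


# Hypothetical strengths and predictions; the 10:1 cost ratio is illustrative.
y_true = [20.0, 35.0, 50.0, 70.0]
y_pred = [32.0, 28.0, 55.0, 65.0]
cost = weighted_cost(y_true, y_pred, cutoff=30.0, fp_cost=10.0, fn_cost=1.0)
```

Here one false positive and one false negative occur, and the false positive dominates the total cost.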

Housing Data Set (TYPE 3)

STORY: House prices per city area (suburbs of Boston). The cutoff could be motivated by a company that wants to open some offices in the city (e.g., a bank or an estate agent). The company may be interested in areas where house prices are above (or below) a cutoff.
PARADIGM: Supply-demand regulation
WIDE-RANGE OF CUTOFFS: Not clear.
FPCost vs FNcost: It is not clear which ones are more costly.
SUGGESTED RANGE: absolute cutoff values 10-50 (20%-70% of the cutoffs with OUTPUT distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 506

Solar Flare (solar_flare1) Data Set (TYPE 3)

STORY: Each class attribute counts the number of solar flares of a certain class that occur in a 24-hour period. There are three outputs, so this could be converted into three different datasets, or the three types of flares could be aggregated (summed) into one variable. A possible case for binarisation would be a research team that is only interested in analysing regions for which the number of flares is above a cutoff.
Note: The data are divided into two sections. The second section (flare.data2) has had much more error correction applied to it, and has consequently been treated as more reliable.
PARADIGM: Who's above me
WIDE-RANGE OF CUTOFFS: Not clear.
FPCost vs FNcost: It is not clear which ones are more costly.
SUGGESTED RANGE: absolute cutoff values 0-4 (0%-70% of the cutoffs with OUTPUT distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=6.
SIZE: 323

Solar Flare (solar_flare2) Data Set (TYPE 3)

STORY: Each class attribute counts the number of solar flares of a certain class that occur in a 24-hour period. There are three outputs, so this could be converted into three different datasets, or the three types of flares could be aggregated (summed) into one variable. A possible case for binarisation would be a research team that is only interested in analysing regions for which the number of flares is above a cutoff.
PARADIGM: Who's above me
WIDE-RANGE OF CUTOFFS: Not clear.
FPCost vs FNcost: It is not clear which ones are more costly.
SUGGESTED RANGE: absolute cutoff values 0-8 (0%-70% of the cutoffs with OUTPUT distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=6.
SIZE: 1066
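
The aggregation option mentioned in the story (summing the three flare classes into one variable and then keeping the regions above a cutoff) can be sketched as follows. The function name and the toy counts are hypothetical.

```python
def regions_above(flare_counts, cutoff):
    """Return indices of regions whose summed flare count exceeds the cutoff."""
    totals = [sum(counts) for counts in flare_counts]
    return [i for i, total in enumerate(totals) if total > cutoff]


# Hypothetical counts per region for the three flare classes.
flare_counts = [(0, 0, 0), (3, 1, 0), (1, 0, 0), (5, 2, 1)]
selected = regions_above(flare_counts, cutoff=2)
```

Treating each flare class as a separate output would instead apply the same comparison to one column at a time.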

Plastic Data Set (TYPE 3)

STORY: The cutoff is the resistance to pressure required by the piece that is to be made from the plastic.
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: Yes.
FPCost vs FNcost: False positives are more costly.
SUGGESTED RANGE: absolute cutoff values 10-20 (0%-100% of the cutoffs with OUTPUT distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=1 and beta=1.
SIZE: 1650

Auto Price Data Set (TYPE 3)

STORY: Predict the price of a car from a set of attributes. Buyers' limits are the cutoffs (a case of SALES, TRADE, STOCKS, etc.).
PARADIGM: Supply-demand regulation
WIDE-RANGE OF CUTOFFS: Yes.
FPCost vs FNcost: False positives are more costly.
SUGGESTED RANGE: absolute cutoff values 5000-35000 (0%-70% of the cutoffs with OUTPUT distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 159

Type 4



Auto MPG Data Set (TYPE 4)

STORY: Fuel consumption (1993). The cutoffs could be companies or individuals buying the cars and setting the threshold on some fuel consumption levels (e.g., I want a car that does not consume more than...). This is ok, but it is a little bit strange to consider that we are going to have a model for fuel consumption (this is usually given by the car manufacturer). Perhaps a vintage example could make more sense.
PARADIGM: Supply-demand regulation
WIDE-RANGE OF CUTOFFS: Perhaps, if it is individual choice by a buyer. If it is an external regulation, this cutoff range will be usually narrow.
FPCost vs FNcost: False positives are more costly.
SUGGESTED RANGE: absolute cutoff values 10-40 (20%-60% of the cutoffs with OTHER distribution).
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=4.
SIZE: 398

Type 5



Wankara Data Sets (TYPE 5)

STORY: Depending on whether the temperature goes up or down we may decide to do or postpone a given activity (Ankara).
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: Yes.
FPCost vs FNcost: May be similar (or not).
SUGGESTED RANGE: absolute cutoff values 20-80 (20%-80% of the cutoffs with OTHER distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 1609

Wizmir Data Sets (TYPE 5)

STORY: Depending on whether the temperature goes up or down we may decide to do or postpone a given activity (Izmir).
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: Yes.
FPCost vs FNcost: May be similar (or not).
SUGGESTED RANGE: absolute cutoff values 30-90 (20%-80% of the cutoffs with OTHER distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 1461

Treasury-Mortgage Data Sets (TYPE 5)

STORY: We use the rate today as cutoff and we want to know whether the rate is going to be higher or lower in the near future (e.g., next week). If it is lower, we delay a buy. If it is higher, we delay a sale.
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: Yes.
FPCost vs FNcost: May be similar.
SUGGESTED RANGE: absolute cutoff values 5-20 (0%-100% of the cutoffs with OTHER distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=1.5 and beta=3.
SIZE: 1049
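
The trend sign rule in the story, with today's rate acting as the cutoff for the predicted future rate, can be sketched as below. The function name and the handling of a tie ("no change") are assumptions added for the example.

```python
def trade_decision(rate_today, predicted_rate):
    """Use today's rate as the cutoff for the predicted future rate."""
    if predicted_rate < rate_today:
        return "delay buy"   # the rate is expected to fall
    if predicted_rate > rate_today:
        return "delay sale"  # the rate is expected to rise
    return "no change"       # tie handling is an assumption


decision = trade_decision(rate_today=7.5, predicted_rate=7.1)
```

Because the cutoff is whatever the rate happens to be on the day, the model has to behave well over the whole range of rates rather than at one fixed value.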

Dee Data Set (TYPE 5)

STORY: Depending on whether the energy price goes up or down, we may decide to delay starting a machine that consumes a lot of energy in a company, or a washing machine in a house. For companies selling and buying energy, we can wait or not depending on whether the price is going to go up or down.
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: Yes.
FPCost vs FNcost: May be similar.
SUGGESTED RANGE: absolute cutoff values 1-5 (0%-100% of the cutoffs with OTHER distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 365

Wine Quality (white) Data Set (TYPE 5)

STORY: Two datasets are included, related to red and white vinho verde wine samples from the north of Portugal. The goal is to model wine quality based on physicochemical tests. Depending on the customer, we want to offer wines above a quality cutoff.
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: Yes.
FPCost vs FNcost: It is not clear which ones are more costly.
SUGGESTED RANGE: absolute cutoff values 3-9 (10%-90% of the cutoffs with OTHER distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 4898

Wine Quality (red) Data Set (TYPE 5)

STORY: Two datasets are included, related to red and white vinho verde wine samples from the north of Portugal. The goal is to model wine quality based on physicochemical tests. Depending on the customer, we want to offer wines above a quality cutoff.
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: Yes.
FPCost vs FNcost: It is not clear which ones are more costly.
SUGGESTED RANGE: absolute cutoff values 3-8 (10%-90% of the cutoffs with OTHER distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 1599

Forest Fires Data Set (TYPE 5)

STORY: This is a difficult regression task, where the aim is to predict the burned area of forest fires in the northeast region of Portugal by using meteorological and other data. A cutoff can be set for the need of a special resource (a hydroplane or a fire station) depending on how much fire is expected. It's not a very natural problem.
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: No, very narrow.
FPCost vs FNcost: It is not clear which ones are more costly.
SUGGESTED RANGE: absolute cutoff values 0-1000 (0%-100% of the cutoffs with OTHER distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=1 and beta=5.
SIZE: 517

Combined Cycle Power Plant Data Set (TYPE 5)

STORY: The task here is to predict the net hourly electrical energy output. We could think of a cutoff related to the energy needs outside the power plant, that is, whether the plant will be able to supply enough energy at any given hour. For instance, is a second plant or a reconnection needed? Will production be enough? This makes sense as a binarisation problem.
PARADIGM: Supply-demand regulation
WIDE-RANGE OF CUTOFFS: Yes, probably. I expect the production of a combined cycle power plant to be less variable than the consumption, so perhaps in this case the distribution of cutoffs goes even beyond the limits of the min and max of the response variable. This is not a problem, because these can be clipped.
FPCost vs FNcost: It is not clear which ones are more costly.
SUGGESTED RANGE: absolute cutoff values 420-480 (15%-85% of the cutoffs with OTHER distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 9568
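
The clipping mentioned above, where cutoffs that fall beyond the minimum and maximum of the response variable are pulled back to those limits, can be sketched as follows. The function name and the toy values are hypothetical.

```python
def clip_cutoffs(cutoffs, y_values):
    """Clip each cutoff to the [min, max] interval of the response values."""
    low, high = min(y_values), max(y_values)
    return [min(max(c, low), high) for c in cutoffs]


# Hypothetical hourly outputs and demand-driven cutoffs.
y_values = [425.0, 440.0, 455.0, 470.0]
cutoffs = [400.0, 430.0, 465.0, 500.0]
clipped = clip_cutoffs(cutoffs, y_values)
```

Cutoffs already inside the observed range are left unchanged.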

Type 6



CPU Small Data Set (TYPE 6)

STORY: Predict usr (the portion of time (%) that CPUs run in user mode) using a restricted number of attributes (excluding the paging information, attributes 10-18). We want to predict when the CPU is free for a certain portion of time (%), which is the cutoff. We may be interested in launching other processes.
PARADIGM: Trend sign detection
WIDE-RANGE OF CUTOFFS: No.
FPCost vs FNcost: False positives are more costly.
SUGGESTED RANGE: absolute cutoff values 0-100 (30%-70% of the cutoffs with OTHER distribution)
SUGGESTED DISTRIBUTION: a beta distribution, with alpha=2 and beta=2.
SIZE: 8192

© 2015 José Hernández Orallo Cèsar Ferri.