Estimation of random accuracy and its use in validation of predictive quality of classification models within predictive challenges

Bono Lučić; Jadranko Batista; Viktor Bojović; Mario Lovric; Ana Sović Križić; Drago Bešlo; Damir Nadramija; Dražen Vikić Topić

doi:10.5562/cca3551

Know-Center GmbH Research Center for Data-Driven Business & Big Data Analytics (98770)

Publication: Contribution to journal › Article › Peer-reviewed

Abstract

Shortcomings of the correlation coefficient (Pearson's) as a measure for estimating and calculating the accuracy of predictive model properties are analysed. Here we discuss two such cases that can often occur in the application of the model in predicting properties of a new external set of compounds. The first problem in using the correlation coefficient is its insensitivity to the systemic error that must be expected in predicting properties of a novel external set of compounds, which is not a random sample selected from the training set. The second problem is that an external set can be arbitrarily large or small and have an arbitrary and uneven distribution of the measured value of the target variable, whose values are not known in advance. In these conditions, the correlation coefficient can be an overoptimistic measure of agreement of predicted values with the corresponding experimental values and can lead to a highly optimistic conclusion about the predictive ability of the model. Due to these shortcomings of the correlation coefficient, the use of standard error (root-mean-square-error) of prediction is suggested as a better quality measure of predictive capabilities of a model. In the case of classification models, the use of the difference between the real accuracy and the most probable random accuracy of the model shows very good characteristics in ranking different models according to predictive quality, having at the same time an obvious interpretation.
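The two points made in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the data are invented for illustration, and the helper names `pearson_r`, `rmse`, `accuracy`, and `chance_accuracy` are hypothetical. In particular, `chance_accuracy` assumes that random accuracy can be approximated by the chance agreement computed from the confusion-matrix marginals (the quantity used in Cohen's kappa); the paper's own definition of the most probable random accuracy may differ.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def accuracy(tp, fn, fp, tn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + fn + fp + tn)


def chance_accuracy(tp, fn, fp, tn):
    """Chance agreement from the marginals (assumption: stands in for
    the abstract's 'random accuracy'; not the paper's exact formula)."""
    n = tp + fn + fp + tn
    return ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2


# Regression case: every prediction is shifted by a constant offset of 2.0.
experimental = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [v + 2.0 for v in experimental]

# The correlation is perfect although every prediction is off by 2.0;
# only the RMSE reports the systematic error.
print(f"r    = {pearson_r(experimental, predicted):.2f}")
print(f"RMSE = {rmse(experimental, predicted):.2f}")

# Classification case: an unbalanced set with 10 positives and 90 negatives.
tp, fn, fp, tn = 5, 5, 5, 85
acc = accuracy(tp, fn, fp, tn)
rand = chance_accuracy(tp, fn, fp, tn)
print(f"accuracy = {acc:.2f}, chance = {rand:.2f}, difference = {acc - rand:.2f}")
```

In the regression case the correlation coefficient stays at 1 while the RMSE equals the offset. In the classification case an accuracy of 0.90 lies only 0.08 above the chance level of 0.82 implied by the class imbalance, which is the kind of gap the difference-based measure is meant to expose.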

Original language	English
Pages (from - to)	379
Number of pages	391
Journal	Croatica Chemica Acta
Volume	92
Issue number	3
DOIs	https://doi.org/10.5562/cca3551
Publication status	Published - 21 Oct 2019

Access to document

10.5562/cca3551 (License: CC BY 4.0)

Cite this

@article{805ae22bbff047768d4529c6dd651c37,
title = "Estimation of random accuracy and its use in validation of predictive quality of classification models within predictive challenges",

abstract = "Shortcomings of the correlation coefficient (Pearson's) as a measure for estimating and calculating the accuracy of predictive model properties are analysed. Here we discuss two such cases that can often occur in the application of the model in predicting properties of a new external set of compounds. The first problem in using the correlation coefficient is its insensitivity to the systemic error that must be expected in predicting properties of a novel external set of compounds, which is not a random sample selected from the training set. The second problem is that an external set can be arbitrarily large or small and have an arbitrary and uneven distribution of the measured value of the target variable, whose values are not known in advance. In these conditions, the correlation coefficient can be an overoptimistic measure of agreement of predicted values with the corresponding experimental values and can lead to a highly optimistic conclusion about the predictive ability of the model. Due to these shortcomings of the correlation coefficient, the use of standard error (root-mean-square-error) of prediction is suggested as a better quality measure of predictive capabilities of a model. In the case of classification models, the use of the difference between the real accuracy and the most probable random accuracy of the model shows very good characteristics in ranking different models according to predictive quality, having at the same time an obvious interpretation.",

author = "Bono Lu{\v c}i{\'c} and Jadranko Batista and Viktor Bojovi{\'c} and Mario Lovric and {Sovi{\'c} Kri{\v z}i{\'c}}, Ana and Drago Be{\v s}lo and Damir Nadramija and {Viki{\'c} Topi{\'c}}, Dra{\v z}en",
year = "2019",
month = oct,
day = "21",
doi = "10.5562/cca3551",
language = "English",
volume = "92",
pages = "379",
journal = "Croatica Chemica Acta",
issn = "0011-1643",
publisher = "Croatian Chemical Society",
number = "3",
}

TY - JOUR
T1 - Estimation of random accuracy and its use in validation of predictive quality of classification models within predictive challenges
AU - Lučić, Bono
AU - Batista, Jadranko
AU - Bojović, Viktor
AU - Lovric, Mario
AU - Sović Križić, Ana
AU - Bešlo, Drago
AU - Nadramija, Damir
AU - Vikić Topić, Dražen
PY - 2019/10/21
Y1 - 2019/10/21

N2 - Shortcomings of the correlation coefficient (Pearson's) as a measure for estimating and calculating the accuracy of predictive model properties are analysed. Here we discuss two such cases that can often occur in the application of the model in predicting properties of a new external set of compounds. The first problem in using the correlation coefficient is its insensitivity to the systemic error that must be expected in predicting properties of a novel external set of compounds, which is not a random sample selected from the training set. The second problem is that an external set can be arbitrarily large or small and have an arbitrary and uneven distribution of the measured value of the target variable, whose values are not known in advance. In these conditions, the correlation coefficient can be an overoptimistic measure of agreement of predicted values with the corresponding experimental values and can lead to a highly optimistic conclusion about the predictive ability of the model. Due to these shortcomings of the correlation coefficient, the use of standard error (root-mean-square-error) of prediction is suggested as a better quality measure of predictive capabilities of a model. In the case of classification models, the use of the difference between the real accuracy and the most probable random accuracy of the model shows very good characteristics in ranking different models according to predictive quality, having at the same time an obvious interpretation.

U2 - 10.5562/cca3551
DO - 10.5562/cca3551
M3 - Article
SN - 0011-1643
VL - 92
SP - 379
JO - Croatica Chemica Acta
JF - Croatica Chemica Acta
IS - 3
ER -
