Model Evaluation Metrics

1. Common Classification Model Evaluation Metrics#

Accuracy#

Accuracy: The percentage of correct predictions out of the total samples.

code

from sklearn.metrics import accuracy_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]

accuracy_score(y_true, y_pred)
# 2 of 4 predictions are correct -> 0.5

Disadvantages
Accuracy can be misleading when the classes are imbalanced. For example, when judging whether users browsing a shopping website will make a purchase: if only 1 out of 100 visitors buys, a model that predicts "no purchase" for everyone still reaches 99% accuracy while being useless (a quick sketch of this failure mode follows the list below). Two common remedies:
1️⃣ Handling sample imbalance: Resampling, undersampling, oversampling, etc.

2️⃣ Switching to appropriate metrics: F1-Score, which considers not only the number of incorrect predictions but also the types of errors.
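
A minimal sketch of the failure mode above, assuming the 99-to-1 purchase scenario (the labels are synthetic): the always-predict-majority model scores 99% accuracy but an F1 of 0 for the buyer class. The zero_division argument assumes scikit-learn ≥ 0.22.

code

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic imbalanced labels: 99 non-buyers (0) and 1 buyer (1)
y_true = np.array([0] * 99 + [1])
# A naive "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

accuracy_score(y_true, y_pred)
# 0.99 -> looks excellent

f1_score(y_true, y_pred, zero_division=0)
# 0.0 -> the buyer class is never detected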

Confusion Matrix#

The diagonal holds the correctly classified samples; off-diagonal cells show which classes are confused with which.
code

import matplotlib.pyplot as plt
import scikitplot as skplt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
clf = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=1)
clf.fit(X, y)
clf.score(X, y)
pred = clf.predict(X)

skplt.metrics.plot_confusion_matrix(y, pred, normalize=True)
plt.show()

image
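
If scikit-plot is not installed, plain scikit-learn can draw a similar plot. A minimal sketch, assuming scikit-learn ≥ 1.0 (which provides ConfusionMatrixDisplay.from_predictions) and reusing y and pred from the block above:

code

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Row-normalized confusion matrix for the same digits predictions
ConfusionMatrixDisplay.from_predictions(y, pred, normalize='true')
plt.show()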

Metrics Derived from the Binary Confusion Matrix#

image

  • True Positive (TP): Positive samples predicted as positive by the model;

  • False Positive (FP): Negative samples predicted as positive by the model;

  • False Negative (FN): Positive samples predicted as negative by the model;

  • True Negative (TN): Negative samples predicted as negative by the model;

  • Precision = TP/(TP+FP)

  • Recall = TP/(TP+FN)

  • F1 score = 2*(P*R)/(P+R), where P is precision and R is recall; a worked sketch of all of these follows below.
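
To make these definitions concrete, here is a minimal sketch (with made-up labels) that reads TP, FP, FN, and TN off sklearn's confusion matrix and derives precision, recall, and F1 by hand, cross-checking against sklearn's own functions:

code

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# sklearn lays out the binary confusion matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                           # 2 / (2 + 1) ≈ 0.667
recall = tp / (tp + fn)                              # 2 / (2 + 2) = 0.5
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.571

# The manual values match sklearn's implementations
precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred)
# (0.666..., 0.5, 0.571...)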

Precision#

Precision = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}

image

The percentage of samples predicted as 1 that are actually 1.
Case:

  • When predicting stocks, we care more about precision, i.e., among the stocks we predict will rise, how many actually do, because those are the stocks we invest in.

  • For predicting criminals, we want the predictions to be very accurate; even if some actual criminals are let go, we cannot wrongly accuse an innocent person.

code

from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# average=None: per-class precision (no averaging)
precision_score(y_true, y_pred, average=None)
# [0.375 1.   ]

# average='macro': unweighted mean of the per-class precisions
precision_score(y_true, y_pred, average='macro')
# (0.375 + 1.0)/2 = 0.6875

# average='weighted': mean of the per-class precisions weighted by class support
precision_score(y_true, y_pred, average='weighted')
# 0.375*0.3 + 1.0*0.7 = 0.8125

# average='micro': global precision computed over all samples
precision_score(y_true, y_pred, average='micro')
# equals accuracy here: 0.5

accuracy_score(y_true, y_pred) 
# 0.5

Recall#

Recall = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}

image

The fraction of actual positive (1) samples that the model successfully finds, also known as the coverage rate.

Case:

  • Suppose there are 10 earthquakes in total; we would prefer to issue 1000 alerts to cover all 10 earthquakes (at this point recall is 100%, precision is 1%), rather than issue 100 alerts where 8 earthquakes are predicted but 2 are missed (at this point recall is 80%, precision is 8%).

  • In disease prediction, we care more about recall: we want to miss as few true patients as possible, because failing to detect a real patient can have serious consequences. (The naive majority-class model from the accuracy example has a recall of 0 on the positive class.)

code

from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# average=None: per-class recall
recall_score(y_true, y_pred, average=None)
# all 3 zeros were recalled; 2 of the 7 ones were recalled: [1.         0.28571429]

# average='macro': unweighted mean of the per-class recalls
recall_score(y_true, y_pred, average='macro')
# (1.0 + 0.28571429)/2 = 0.6428571428571428

# average='weighted': mean weighted by class support
recall_score(y_true, y_pred, average='weighted')
# 1.0*0.3 + 0.28571429*0.7 = 0.5

# average='micro': global recall computed over all samples
recall_score(y_true, y_pred, average='micro')
# equals accuracy here: 0.5
accuracy_score(y_true, y_pred)
# 0.5

Why Precision and Recall Contradict Each Other#

1️⃣ To raise recall, the model has to flag more samples as positive, which inevitably lets in more false positives, so precision drops.

2️⃣ If the model is conservative and only flags samples it is very certain about, precision will be high, but recall will be relatively low. The sketch after this list illustrates the trade-off by sweeping the decision threshold.
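
A minimal sketch of this trade-off on synthetic data (make_classification, the logistic regression, and the 0.2/0.5/0.8 thresholds are arbitrary illustration choices): lowering the decision threshold raises recall at the cost of precision, and raising it does the opposite.

code

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Imbalanced toy problem: roughly 80% negatives, 20% positives
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # predicted probability of the positive class

for threshold in (0.2, 0.5, 0.8):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}  "
          f"precision={precision_score(y, pred, zero_division=0):.2f}  "
          f"recall={recall_score(y, pred):.2f}")
# A lower threshold flags more samples as positive -> higher recall, lower precision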

F1 Score#

\frac{1}{F_1} = \frac{1}{2}\left(\frac{1}{\text{precision}} + \frac{1}{\text{recall}}\right)
F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}}

Harmonic Mean#

  • What is the harmonic mean?
H = \frac{n}{\frac{1}{x_{1}} + \frac{1}{x_{2}} + \ldots + \frac{1}{x_{n}}}

🔴 Since it is calculated based on the reciprocals of the variables, it is also known as the reciprocal average.

  • For a distance of 2 kilometers, with a speed of 20 km/h for the first kilometer and 10 km/h for the second kilometer, what is the average speed?

Simple average:
(20+10)/2 = 15

Time-weighted average:

Total time = 1/20 + 1/10 = 0.15 h

Time for the first kilometer: 1/20 = 0.05 h, weight ≈ 33%

Time for the second kilometer: 1/10 = 0.10 h, weight ≈ 67%

Time-weighted average speed = 20 × 33% + 10 × 67% ≈ 13.33 km/h

Harmonic mean:

\text{Average speed} = \frac{\text{Total distance}}{\text{Total time}} = \frac{2}{\frac{1}{20}+\frac{1}{10}} \approx 13.33\ \text{km/h}
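
As a quick check of the arithmetic above, Python's standard library gives the same value (statistics.harmonic_mean has been available since Python 3.6):

code

from statistics import harmonic_mean

speeds = [20, 10]                 # km/h on the two one-kilometer legs
sum(speeds) / len(speeds)         # arithmetic mean: 15.0
harmonic_mean(speeds)             # harmonic mean: 13.333...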

Why use the harmonic mean for F1?#

  • If using a simple average, P=0.8, R=0.8 and P=0.7, R=0.9 would both yield an arithmetic average of 0.8, suggesting that precision and recall are interchangeable.
  • The harmonic mean effectively adds a penalty mechanism: higher values receive lower weights (in the speed example above, the 20 km/h leg gets only a 33% weight). This avoids the inflated score an arithmetic mean gives when one value is high and the other low: if P = 1.0 and R = 0.1, the arithmetic mean is 0.55 while the harmonic mean is only about 0.18, as the sketch after this list shows numerically.

  • The idea behind the F1 score is that an algorithm with balanced precision and recall is more reliable than one where one metric is much better than the other. In short: both metrics must be good for the model to be truly good.
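
A short numerical sketch of this penalty effect (the precision/recall pairs are made up): the balanced pairs keep roughly their arithmetic score, while the lopsided pair is pulled down sharply by the harmonic mean.

code

def f1(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

for p, r in [(0.8, 0.8), (0.7, 0.9), (1.0, 0.1)]:
    print(f"P={p}, R={r}: arithmetic={(p + r) / 2:.2f}, F1={f1(p, r):.2f}")
# P=0.8, R=0.8: arithmetic=0.80, F1=0.80
# P=0.7, R=0.9: arithmetic=0.80, F1=0.79
# P=1.0, R=0.1: arithmetic=0.55, F1=0.18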

code

from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# average=None: per-class F1
f1_score(y_true, y_pred, average=None)
# [0.54545455 0.44444444]

# average='macro': unweighted mean of the per-class F1 scores
f1_score(y_true, y_pred, average='macro')
# (0.54545455 + 0.44444444)/2 = 0.4949494949494949

# average='weighted': mean weighted by class support
f1_score(y_true, y_pred, average='weighted')
# 0.54545455*0.3 + 0.44444444*0.7 = 0.47474747474747475

# average='micro': computed globally over all samples
f1_score(y_true, y_pred, average='micro')
# equals accuracy here: 0.5
accuracy_score(y_true, y_pred)
# 0.5