Calculate mean Average Precision (mAP) for multi-label classification
Suppose we want to train a model to recognize the ingredients in a food image. One effective way to evaluate its performance is mean Average Precision (mAP); another is the ROC curve. I’m always confused about how mAP is actually calculated, so here I’ll use an example to learn it (hopefully).
Assume we have a toy data set with three samples (e.g. food images) and the corresponding targets and predicted scores. In my case, we have four possible food ingredients [a, b, c, d]. Sample 1 has ingredients b, c, d; sample 2 only has ingredient c; sample 3’s ingredients can be read off the table below.
After a few epochs of training, we get the predicted scores shown below, which are still far from correct. Let’s start from there: first we introduce precision and recall, then move to average precision, and finally talk about mAP.
Precision and Recall
Precision and recall are computed per class (ingredient), NOT per sample!
The numerators of precision and recall are the same: the number of true samples among the retrieved (positive) samples, which we call true positives (TP).
The denominator of precision is the number of retrieved (positive) samples, which we call P; P keeps increasing as we retrieve more samples.
The denominator of recall is the number of true samples in the data set, which we call T; T is fixed.
- precision = TP/P
- recall = TP/T
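To make the definitions concrete, here is a minimal numpy sketch for a single class. The function name, the top-k retrieval framing, and the variable names are my own, not taken from any particular library.

```python
import numpy as np

def precision_recall_at_k(scores, targets, k):
    """Precision and recall for ONE class when we retrieve the top-k scored samples.

    scores  : 1-D array of predicted scores for this class, one per sample
    targets : 1-D binary array, 1 if the sample truly contains this ingredient
    """
    order = np.argsort(scores)[::-1]    # rank samples by predicted score, highest first
    retrieved = order[:k]               # P: the k retrieved (positive) samples
    tp = int(targets[retrieved].sum())  # TP: true samples among the retrieved ones
    precision = tp / k                  # TP / P
    recall = tp / int(targets.sum())    # TP / T, where T is fixed for the data set
    return precision, recall
```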
In our toy example, we have 4 classes (ingredients). Let’s do two practices.
Practice 1: class a
For the first class, we have the target and predictions shown in dark red.
- If we just retrieve one sample for class a, we will choose sample 3, now precision=1/1=1.00, recall=1/1=1.00
- If we retrieve two samples for class a, we will choose sample 3, then sample 2, now precision=1/2=0.50, recall=1/1=1.00
- If we retrieve three samples for class a, we will choose sample 3, then sample 2, then sample 1, now precision=1/3=0.33, recall=1/1=1.00
We immediately see that as we retrieve more samples, recall always increases (or at least stays the same), while precision may rise and fall. What we really want is to choose a threshold at which both precision and recall are high. That’s why people developed the F1 score, a metric that combines the two:
F1 = 2 * (precision * recall) / (precision + recall)
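In code this is a one-liner; the zero-division guard below is my own addition for the case where both precision and recall are 0.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. class a at top-2 in Practice 1: f1(0.50, 1.00) ≈ 0.67
```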
However, F1 only evaluates the model’s performance at one specific threshold, so people went on to develop metrics like ROC (not covered here) and mAP that evaluate the performance over all possible thresholds. Before touching mAP, let’s do one more practice to make sure we understand precision and recall.
Practice 2: class b
For the second class, we have the target and predictions shown in dark red.
- If we just retrieve one sample for class b, we will choose sample 2, now precision=0/1=0.00, recall=0/2=0.00
- If we retrieve two samples for class b, we will choose sample 2, then sample 1, now precision=1/2=0.50, recall=1/2=0.50
- If we retrieve three samples for class b, we will choose sample 2, then sample 1, then sample 3, now precision=2/3=0.67, recall=2/2=1.00
Ha, for class b the precision keeps rising! Why? If we take a close look at the data, we see that the true samples have smaller predicted scores, which means we have to retrieve more samples (a larger retrieval threshold) before we get these true samples, not a good sign…
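To double-check these numbers, here is a short script that walks through all three retrieval depths for classes a and b. The binary targets come straight from the two practices above; the score values are placeholders I made up, so only their ordering (sample 3 > 2 > 1 for class a, sample 2 > 1 > 3 for class b) matches the walkthrough, and they are not the actual predictions from the table.

```python
import numpy as np

# Binary targets per class, ordered [sample 1, sample 2, sample 3], as in the practices above
targets = {"a": np.array([0, 0, 1]),   # only sample 3 contains ingredient a
           "b": np.array([1, 0, 1])}   # samples 1 and 3 contain ingredient b

# Placeholder scores: made-up values whose ordering matches the walkthrough,
# NOT the actual predictions from the table
scores = {"a": np.array([0.1, 0.4, 0.9]),   # sample 3 ranked first, then 2, then 1
          "b": np.array([0.5, 0.8, 0.2])}   # sample 2 ranked first, then 1, then 3

for cls in ("a", "b"):
    order = np.argsort(scores[cls])[::-1]   # retrieval order, highest score first
    for k in (1, 2, 3):
        tp = int(targets[cls][order[:k]].sum())
        print(f"class {cls}, top-{k}: precision={tp / k:.2f}, "
              f"recall={tp / targets[cls].sum():.2f}")
```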
All classes
Hopefully I have computed all these values correctly.
Average Precision (AP)
Now we want to know the performance of each class.
We need to take both precision and recall into account; one simple way is to just average the precisions at all possible recall levels (see the sketch after this list).
- class a: the only recall level appears at top-1, so we average just that precision (top-2 and top-3 are ignored because recall does not change there)
- class b: the recall levels appear at top-2 and top-3, so we average those two precisions
- …
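In code, “average the precisions at all possible recall levels” amounts to averaging the precision at every rank where a new true sample shows up (recall only changes at those ranks, which is exactly why the other retrieval depths are ignored). A sketch, with the expected results for the two classes we worked through:

```python
import numpy as np

def average_precision(scores, targets):
    """Average of the precisions taken at each rank where recall increases,
    i.e. where a true sample is retrieved."""
    order = np.argsort(scores)[::-1]
    ranked = targets[order]            # targets re-ordered by descending score
    precisions, tp = [], 0
    for k, is_true in enumerate(ranked, start=1):
        if is_true:
            tp += 1
            precisions.append(tp / k)  # precision at this new recall level
    return float(np.mean(precisions))

# With the placeholder scores from the previous snippet:
#   class a -> AP = 1.00                      (only one recall level, precision 1.00)
#   class b -> AP = (0.50 + 0.67) / 2 ≈ 0.58
```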
We see that class a and class d are better than class c, and the worst class is class b; this conclusion makes sense if we check Tab. 1 again.
In the real world we choose a different threshold for each class. For class a and class d, retrieving only the top 1 sample is already pretty good; for class b and class c, it depends on whether we care more about precision or recall. If we treat them equally, F1 is a good metric to help us choose the threshold, and in this case both of them should use top-3, i.e. retrieve all three samples (a small sketch of this selection follows).
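Here is that selection sketched in code, reusing the placeholder data from before: for each class we simply pick the retrieval depth with the highest F1 (for the made-up class-b data this picks top-3, in line with the remark above).

```python
import numpy as np

def best_depth_by_f1(scores, targets):
    """Return the retrieval depth k (and its F1) that maximizes F1 for one class."""
    order = np.argsort(scores)[::-1]
    ranked = targets[order]
    best_k, best_f1, tp = 0, -1.0, 0
    for k, is_true in enumerate(ranked, start=1):
        tp += int(is_true)
        precision, recall = tp / k, tp / int(targets.sum())
        f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1
```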
mean Average Precision (mAP)
Now we want to know the performance of the model over all classes.
mAP just moves one step further and averages the APs over all classes. It measures how good the model is on the entire data set. In our case mAP=0.81, not bad!
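Given the full target and score matrices (shape: samples x classes), a sketch of that final averaging could look like the following. It leans on scikit-learn’s non-interpolated average_precision_score, which should match the “average the precisions at each recall level” definition used here as long as there are no tied scores; I have not re-run it against the table, so I am not claiming it reproduces the exact 0.81.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(target_matrix, score_matrix):
    """target_matrix, score_matrix: arrays of shape (n_samples, n_classes)."""
    # per-class AP, then the mean over all classes
    aps = [average_precision_score(target_matrix[:, c], score_matrix[:, c])
           for c in range(target_matrix.shape[1])]
    return float(np.mean(aps))

# Equivalently, in a single call:
# mAP = average_precision_score(target_matrix, score_matrix, average="macro")
```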