<p align="left"> <a href="https://zenodo.org/badge/latestdoi/134606465"> <img src="https://zenodo.org/badge/134606465.svg"/></a> <a href="https://opensource.org/licenses/MIT" > <img src="https://img.shields.io/badge/License-MIT-yellow.svg" /></a> <a href="https://github.com/rafaelpadilla/Object-Detection-Metrics/raw/master/paper_survey_on_performance_metrics_for_object_detection_algorithms.pdf"> <img src="https://img.shields.io/badge/paper-published-red"/></a> </p>

Citation

If you use this code for your research, please consider citing:

@Article{electronics10030279,
AUTHOR = {Padilla, Rafael and Passos, Wesley L. and Dias, Thadeu L. B. and Netto, Sergio L. and da Silva, Eduardo A. B.},
TITLE = {A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit},
JOURNAL = {Electronics},
VOLUME = {10},
YEAR = {2021},
NUMBER = {3},
ARTICLE-NUMBER = {279},
URL = {https://www.mdpi.com/2079-9292/10/3/279},
ISSN = {2079-9292},
DOI = {10.3390/electronics10030279}
}

Download the paper here or here.

@INPROCEEDINGS {padillaCITE2020,
    author    = {R. {Padilla} and S. L. {Netto} and E. A. B. {da Silva}},
    title     = {A Survey on Performance Metrics for Object-Detection Algorithms}, 
    booktitle = {2020 International Conference on Systems, Signals and Image Processing (IWSSIP)}, 
    year      = {2020},
    pages     = {237-242},
}

Download the paper here


Attention! A new version of this tool is available here

The new version includes all COCO metrics, supports other file formats, provides a User Interface (UI) to guide the evaluation process, and presents the STT-AP metric to evaluate object detection in videos.


Metrics for object detection

The motivation of this project is the lack of consensus among different works and implementations regarding the evaluation metrics for the object detection problem. Although on-line competitions use their own metrics to evaluate the object detection task, only some of them offer reference code snippets to calculate the accuracy of the detected objects.
Researchers who want to evaluate their work using datasets other than those offered by the competitions need to implement their own version of the metrics. Sometimes a wrong or different implementation can produce different, biased results. Ideally, in order to have trustworthy benchmarking among different approaches, a flexible implementation is needed that can be used by everyone, regardless of the dataset.

This project provides easy-to-use functions implementing the same metrics used by the most popular object detection competitions. Our implementation does not require modifications of your detection model or complicated input formats, avoiding conversions to XML or JSON files. We simplified the input data (ground truth bounding boxes and detected bounding boxes) and gathered the main metrics used by academia and challenges in a single project. Our implementation was carefully compared against the official implementations, and our results are exactly the same.

In the topics below you can find an overview of the most popular metrics used in different competitions and works, as well as samples showing how to use our code.

Table of contents

<a name="different-competitions-different-metrics"></a>

Different competitions, different metrics

Important definitions

Intersection Over Union (IOU)

Intersection Over Union (IOU) is a measure based on the Jaccard Index that evaluates the overlap between two bounding boxes. It requires a ground truth bounding box <img src="https://latex.codecogs.com/gif.latex?B_%7Bgt%7D"> and a predicted bounding box <img src="https://latex.codecogs.com/gif.latex?B_%7Bp%7D">. By applying the IOU we can tell if a detection is valid (True Positive) or not (False Positive).

IOU is given by the overlapping area between the predicted bounding box and the ground truth bounding box divided by the area of union between them:  

<p align="center"> <img src="https://latex.codecogs.com/gif.latex?%5Ctext%7BIOU%7D%3D%5Cfrac%7B%5Ctext%7Barea%7D%5Cleft%28B_%7Bp%7D%20%5Ccap%20B_%7Bgt%7D%20%5Cright%29%7D%7B%5Ctext%7Barea%7D%5Cleft%28B_%7Bp%7D%20%5Ccup%20B_%7Bgt%7D%20%5Cright%29%7D"> </p> <!--- \text{IOU}=\frac{\text{area}\left(B_{p} \cap B_{gt} \right)}{\text{area}\left(B_{p} \cup B_{gt} \right)} --->

The image below illustrates the IOU between a ground truth bounding box (in green) and a detected bounding box (in red).

<!--- IOU ---> <p align="center"> <img src="https://github.com/rafaelpadilla/Object-Detection-Metrics/blob/master/aux_images/iou.png" align="center"/></p>
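For reference, the IOU between two boxes given by their (left, top, right, bottom) coordinates can be computed in a few lines of Python. This is a minimal, self-contained sketch for illustration (the function and variable names are ours), not the repository's internal implementation:

```python
def iou(box_a, box_b):
    """IOU between two boxes given as (left, top, right, bottom) in absolute coordinates."""
    # coordinates of the intersection rectangle
    left, top = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    right, bottom = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    if right <= left or bottom <= top:
        return 0.0  # the boxes do not overlap
    intersection = (right - left) * (bottom - top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union

# e.g. two partially overlapping boxes
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # 0.1428... (= 400 / 2800)
```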

True Positive, False Positive, False Negative and True Negative

Some basic concepts used by the metrics:

- True Positive (TP): a correct detection, i.e. a detection with IOU ≥ threshold;
- False Positive (FP): a wrong detection, i.e. a detection with IOU < threshold;
- False Negative (FN): a ground truth object that was not detected;
- True Negative (TN): does not apply here, since it would correspond to all the possible bounding boxes that were correctly not detected, and there are infinitely many of them within an image;
- threshold: depending on the metric, it is usually set to 50%, 75% or 95%.

Precision

Precision is the ability of a model to identify only the relevant objects. It is the percentage of correct positive predictions and is given by:

<p align="center"> <img src="https://latex.codecogs.com/gif.latex?%5Ctext%7BPrecision%7D%20%3D%20%5Cfrac%7B%5Ctext%7BTP%7D%7D%7B%5Ctext%7BTP%7D&plus;%5Ctext%7BFP%7D%7D%3D%5Cfrac%7B%5Ctext%7BTP%7D%7D%7B%5Ctext%7Ball%20detections%7D%7D"> </p> <!--- \text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}=\frac{\text{TP}}{\text{all detections}} --->

Recall

Recall is the ability of a model to find all the relevant cases (all ground truth bounding boxes). It is the percentage of true positives detected among all ground truths and is given by:

<p align="center"> <img src="https://latex.codecogs.com/gif.latex?%5Ctext%7BRecall%7D%20%3D%20%5Cfrac%7B%5Ctext%7BTP%7D%7D%7B%5Ctext%7BTP%7D&plus;%5Ctext%7BFN%7D%7D%3D%5Cfrac%7B%5Ctext%7BTP%7D%7D%7B%5Ctext%7Ball%20ground%20truths%7D%7D"> </p> <!--- \text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}=\frac{\text{TP}}{\text{all ground truths}} --->

Metrics

In the topics below there are some comments on the most popular metrics used for object detection.

Precision x Recall curve

The Precision x Recall curve is a good way to evaluate the performance of an object detector as the confidence threshold is varied, by plotting a curve for each object class. An object detector of a particular class is considered good if its precision stays high as recall increases, meaning that if you vary the confidence threshold, both precision and recall remain high. Another way to identify a good object detector is to look for a detector that identifies only relevant objects (0 False Positives = high precision) while finding all ground truth objects (0 False Negatives = high recall).

A poor object detector needs to increase the number of detected objects (increasing False Positives = lower precision) in order to retrieve all ground truth objects (high recall). That's why the Precision x Recall curve usually starts with high precision values and decreases as recall increases. You can see an example of the Precision x Recall curve in the next topic (Average Precision). This kind of curve is used by the PASCAL VOC 2012 challenge and is available in our implementation.

Average Precision

Another way to compare the performance of object detectors is to calculate the area under the curve (AUC) of the Precision x Recall curve. As Precision x Recall curves are often zigzag curves going up and down, comparing different curves (different detectors) in the same plot is usually not an easy task, because the curves tend to cross each other quite frequently. That's why Average Precision (AP), a numerical metric, can also help us compare different detectors. In practice, AP is the precision averaged across all recall values between 0 and 1.

From 2010 onward, the method used by the PASCAL VOC challenge to compute AP has changed. Currently, the interpolation performed by the PASCAL VOC challenge uses all data points, rather than interpolating only 11 equally spaced points as stated in their paper. As we want to reproduce their default implementation, our default code (as seen further below) follows their most recent approach (interpolating all data points). However, we also offer the 11-point interpolation approach.

11-point interpolation

The 11-point interpolation tries to summarize the shape of the Precision x Recall curve by averaging the precision at a set of eleven equally spaced recall levels [0, 0.1, 0.2, ... , 1]:

<p align="center"> <img src="https://latex.codecogs.com/gif.latex?%5Ctext%7BAP%7D%3D%5Cfrac%7B1%7D%7B11%7D%20%5Csum_%7Br%5Cin%20%5Cleft%20%5C%7B%200%2C%200.1%2C%20...%2C1%20%5Cright%20%5C%7D%7D%5Crho_%7B%5Ctext%7Binterp%7D%5Cleft%20%28%20r%20%5Cright%20%29%7D"> </p> <!--- \text{AP}=\frac{1}{11} \sum_{r\in \left \{ 0, 0.1, ...,1 \right \}}\rho_{\text{interp}\left ( r \right )} --->

with

<p align="center"> <img src="https://latex.codecogs.com/gif.latex?%5Crho_%7B%5Ctext%7Binterp%7D%7D%20%3D%20%5Cmax_%7B%5Ctilde%7Br%7D%3A%5Ctilde%7Br%7D%20%5Cgeq%20r%7D%20%5Crho%5Cleft%20%28%20%5Ctilde%7Br%7D%20%5Cright%20%29"> </p> <!--- \rho_{\text{interp}} = \max_{\tilde{r}:\tilde{r} \geq r} \rho\left ( \tilde{r} \right ) --->

where <img src="https://latex.codecogs.com/gif.latex?%5Crho%5Cleft%28%5Ctilde%7Br%7D%5Cright%29"> is the measured precision at recall <img src="https://latex.codecogs.com/gif.latex?%5Ctilde%7Br%7D">.

Instead of using the precision observed at each point, the AP is obtained by interpolating the precision only at the 11 recall levels <img src="https://latex.codecogs.com/gif.latex?r">, taking the maximum precision whose recall value is greater than or equal to <img src="https://latex.codecogs.com/gif.latex?r">.
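The formula above maps directly to code. The sketch below (our own helper, not the project's API) computes the 11-point AP given the recall and precision values of each accumulated detection:

```python
import numpy as np

def eleven_point_ap(recalls, precisions):
    """11-point interpolated AP over the recall levels 0, 0.1, ..., 1."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0, 1, 11):
        mask = recalls >= r
        # interpolated precision: maximum precision at any recall >= r (0 if none)
        rho_interp = precisions[mask].max() if mask.any() else 0.0
        ap += rho_interp / 11.0
    return ap
```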

Interpolating all points

Instead of interpolating only in the 11 equally spaced points, you could interpolate through all points <img src="https://latex.codecogs.com/gif.latex?n"> in such a way that:

<p align="center"> <img src="https://latex.codecogs.com/gif.latex?%5Csum_%7Bn%3D0%7D%20%5Cleft%20%28%20r_%7Bn&plus;1%7D%20-%20r_%7Bn%7D%20%5Cright%20%29%20%5Crho_%7B%5Ctext%7Binterp%7D%7D%5Cleft%20%28%20r_%7Bn&plus;1%7D%20%5Cright%20%29"> </p> <!--- \sum_{n=0} \left ( r_{n+1} - r_{n} \right ) \rho_{\text{interp}}\left ( r_{n+1} \right ) --->

with

<p align="center"> <img src="https://latex.codecogs.com/gif.latex?%5Crho_%7B%5Ctext%7Binterp%7D%7D%5Cleft%20%28%20r_%7Bn&plus;1%7D%20%5Cright%20%29%20%3D%20%5Cmax_%7B%5Ctilde%7Br%7D%3A%5Ctilde%7Br%7D%20%5Cge%20r_%7Bn&plus;1%7D%7D%20%5Crho%20%5Cleft%20%28%20%5Ctilde%7Br%7D%20%5Cright%20%29"> </p> <!--- \rho_{\text{interp}}\left ( r_{n+1} \right ) = \max_{\tilde{r}:\tilde{r} \ge r_{n+1}} \rho \left ( \tilde{r} \right ) --->

where <img src="https://latex.codecogs.com/gif.latex?%5Crho%5Cleft%28%5Ctilde%7Br%7D%5Cright%29"> is the measured precision at recall <img src="https://latex.codecogs.com/gif.latex?%5Ctilde%7Br%7D">.

In this case, instead of using the precision observed at only a few points, the AP is obtained by interpolating the precision at each recall level <img src="https://latex.codecogs.com/gif.latex?r_%7Bn&plus;1%7D">, taking the maximum precision whose recall value is greater than or equal to <img src="https://latex.codecogs.com/gif.latex?r_%7Bn&plus;1%7D">. This way we compute an estimate of the area under the curve.
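Likewise, the all-point interpolation can be sketched as follows, assuming the recall values are sorted in increasing order (as they are when detections are accumulated by decreasing confidence). This is an illustration of the equations above, not the exact code of the project:

```python
import numpy as np

def every_point_ap(recalls, precisions):
    """AP as the area under the interpolated Precision x Recall curve."""
    mrec = np.concatenate(([0.0], np.asarray(recalls, dtype=float)))
    mpre = np.concatenate(([0.0], np.asarray(precisions, dtype=float)))
    # replace each precision with the maximum precision to its right (the interpolation step)
    for i in range(len(mpre) - 1, 0, -1):
        mpre[i - 1] = max(mpre[i - 1], mpre[i])
    # sum (r_{n+1} - r_n) * rho_interp(r_{n+1}) over the points where the recall changes
    idx = np.where(mrec[1:] != mrec[:-1])[0] + 1
    return float(np.sum((mrec[idx] - mrec[idx - 1]) * mpre[idx]))
```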

To make things more clear, we provided an example comparing both interpolations.

An illustrated example

An example helps us better understand the concept of the interpolated average precision. Consider the detections below:

<!--- Image samples 1 ---> <p align="center"> <img src="https://github.com/rafaelpadilla/Object-Detection-Metrics/blob/master/aux_images/samples_1_v2.png" align="center"/></p>

There are 7 images with 15 ground truth objects represented by the green bounding boxes and 24 detected objects represented by the red bounding boxes. Each detected object has a confidence level and is identified by a letter (A,B,...,Y).

The following table shows the bounding boxes with their corresponding confidences. The last column identifies the detections as TP or FP. In this example a detection is considered a TP if IOU ≥ 30%; otherwise it is a FP. By looking at the images above we can roughly tell whether the detections are TP or FP.

<!--- Table 1 ---> <p align="center"> <img src="https://github.com/rafaelpadilla/Object-Detection-Metrics/blob/master/aux_images/table_1_v2.png" align="center"/></p> <!--- | Images | Detections | Confidences | TP or FP | |:------:|:----------:|:-----------:|:--------:| | Image 1 | A | 88% | FP | | Image 1 | B | 70% | TP | | Image 1 | C | 80% | FP | | Image 2 | D | 71% | FP | | Image 2 | E | 54% | TP | | Image 2 | F | 74% | FP | | Image 3 | G | 18% | TP | | Image 3 | H | 67% | FP | | Image 3 | I | 38% | FP | | Image 3 | J | 91% | TP | | Image 3 | K | 44% | FP | | Image 4 | L | 35% | FP | | Image 4 | M | 78% | FP | | Image 4 | N | 45% | FP | | Image 4 | O | 14% | FP | | Image 5 | P | 62% | TP | | Image 5 | Q | 44% | FP | | Image 5 | R | 95% | TP | | Image 5 | S | 23% | FP | | Image 6 | T | 45% | FP | | Image 6 | U | 84% | FP | | Image 6 | V | 43% | FP | | Image 7 | X | 48% | TP | | Image 7 | Y | 95% | FP | --->

In some images there is more than one detection overlapping a ground truth (Images 2, 3, 4, 5, 6 and 7). For those cases, the predicted box with the highest IOU is considered a TP (e.g. in Image 2, "E" is a TP while "D" is a FP, because the IOU between "E" and the ground truth is greater than the IOU between "D" and the ground truth). This rule is applied by the PASCAL VOC 2012 metric: "e.g. 5 detections (TP) of a single object is counted as 1 correct detection and 4 false detections".
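One common way to implement this rule is to process the detections of an image (for a given class) in decreasing order of confidence and let each ground truth validate at most one detection. The sketch below reuses the `iou` helper shown earlier; it illustrates the rule described above, and details such as tie-breaking may differ from the project's own code:

```python
def classify_detections(detections, ground_truths, iou_threshold=0.3):
    """detections: list of (confidence, box); ground_truths: list of boxes.
    Returns a list with True (TP) or False (FP) for each detection."""
    order = sorted(range(len(detections)), key=lambda i: detections[i][0], reverse=True)
    gt_taken = [False] * len(ground_truths)
    is_tp = [False] * len(detections)
    for i in order:
        _, box = detections[i]
        ious = [iou(box, gt) for gt in ground_truths]
        best = max(range(len(ious)), key=ious.__getitem__, default=None)
        if best is not None and ious[best] >= iou_threshold and not gt_taken[best]:
            gt_taken[best] = True   # each ground truth validates only one detection
            is_tp[i] = True         # true positive
        # otherwise it stays a false positive (low overlap or duplicate detection)
    return is_tp
```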

The Precision x Recall curve is plotted by calculating the precision and recall values of the accumulated TP and FP detections. For this, we first order the detections by their confidences, then we calculate the precision and recall for each accumulated detection, as shown in the table below. Note that for the recall computation the denominator ("Acc TP + Acc FN", i.e. all ground truths) is constant at 15, since the number of ground truth boxes does not depend on the detections:

<!--- Table 2 ---> <p align="center"> <img src="https://github.com/rafaelpadilla/Object-Detection-Metrics/blob/master/aux_images/table_2_v2.png" align="center"/></p> <!--- | Images | Detections | Confidences | TP | FP | Acc TP | Acc FP | Precision | Recall | |:------:|:----------:|:-----------:|:---:|:--:|:------:|:------:|:---------:|:------:| | Image 5 | R | 95% | 1 | 0 | 1 | 0 | 1 | 0.0666 | | Image 7 | Y | 95% | 0 | 1 | 1 | 1 | 0.5 | 0.6666 | | Image 3 | J | 91% | 1 | 0 | 2 | 1 | 0.6666 | 0.1333 | | Image 1 | A | 88% | 0 | 1 | 2 | 2 | 0.5 | 0.1333 | | Image 6 | U | 84% | 0 | 1 | 2 | 3 | 0.4 | 0.1333 | | Image 1 | C | 80% | 0 | 1 | 2 | 4 | 0.3333 | 0.1333 | | Image 4 | M | 78% | 0 | 1 | 2 | 5 | 0.2857 | 0.1333 | | Image 2 | F | 74% | 0 | 1 | 2 | 6 | 0.25 | 0.1333 | | Image 2 | D | 71% | 0 | 1 | 2 | 7 | 0.2222 | 0.1333 | | Image 1 | B | 70% | 1 | 0 | 3 | 7 | 0.3 | 0.2 | | Image 3 | H | 67% | 0 | 1 | 3 | 8 | 0.2727 | 0.2 | | Image 5 | P | 62% | 1 | 0 | 4 | 8 | 0.3333 | 0.2666 | | Image 2 | E | 54% | 1 | 0 | 5 | 8 | 0.3846 | 0.3333 | | Image 7 | X | 48% | 1 | 0 | 6 | 8 | 0.4285 | 0.4 | | Image 4 | N | 45% | 0 | 1 | 6 | 9 | 0.7 | 0.4 | | Image 6 | T | 45% | 0 | 1 | 6 | 10 | 0.375 | 0.4 | | Image 3 | K | 44% | 0 | 1 | 6 | 11 | 0.3529 | 0.4 | | Image 5 | Q | 44% | 0 | 1 | 6 | 12 | 0.3333 | 0.4 | | Image 6 | V | 43% | 0 | 1 | 6 | 13 | 0.3157 | 0.4 | | Image 3 | I | 38% | 0 | 1 | 6 | 14 | 0.3 | 0.4 | | Image 4 | L | 35% | 0 | 1 | 6 | 15 | 0.2857 | 0.4 | | Image 5 | S | 23% | 0 | 1 | 6 | 16 | 0.2727 | 0.4 | | Image 3 | G | 18% | 1 | 0 | 7 | 16 | 0.3043 | 0.4666 | | Image 4 | O | 14% | 0 | 1 | 7 | 17 | 0.2916 | 0.4666 | --->

Example computation for the 2nd row of the table (detection "Y" of Image 7): Precision = TP/(TP+FP) = 1/2 = 0.5 and Recall = TP/(TP+FN) = 1/15 ≈ 0.0666.
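The table above can be reproduced programmatically: sort the detections by decreasing confidence, accumulate the TP and FP counts, and compute the precision and recall at each row. A minimal sketch (names are ours):

```python
def precision_recall_points(detections, num_ground_truths):
    """detections: list of (confidence, is_tp) pairs gathered from all images.
    Returns the accumulated (recalls, precisions) used to plot the curve."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    acc_tp = acc_fp = 0
    recalls, precisions = [], []
    for _, is_tp in detections:
        acc_tp += int(is_tp)
        acc_fp += int(not is_tp)
        precisions.append(acc_tp / (acc_tp + acc_fp))
        recalls.append(acc_tp / num_ground_truths)   # denominator fixed at 15 in this example
    return recalls, precisions
```

Feeding these points to the `eleven_point_ap` and `every_point_ap` sketches shown earlier reproduces, up to rounding, the 26.84% and 24.56% values computed in the sections below.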

Plotting the precision and recall values we have the following Precision x Recall curve:

<!--- Precision x Recall graph ---> <p align="center"> <img src="https://github.com/rafaelpadilla/Object-Detection-Metrics/blob/master/aux_images/precision_recall_example_1_v2.png" align="center"/> </p>

As mentioned before, there are two different ways to measure the interpolated average precision: the 11-point interpolation and interpolating all points. Below we make a comparison between them:

Calculating the 11-point interpolation

The idea of the 11-point interpolated average precision is to average the precision at a set of 11 recall levels (0, 0.1, ..., 1). The interpolated precision values are obtained by taking the maximum precision whose recall value is greater than or equal to the current recall level, as follows:

<!--- interpolated precision curve ---> <p align="center"> <img src="https://github.com/rafaelpadilla/Object-Detection-Metrics/blob/master/aux_images/11-pointInterpolation.png" align="center"/> </p>

By applying the 11-point interpolation, with the interpolated precision values (1, 0.6666, 0.4285, 0.4285, 0.4285, 0, 0, 0, 0, 0, 0) at the 11 recall levels, we have:

AP = (1/11) × (1 + 0.6666 + 0.4285 + 0.4285 + 0.4285 + 0 + 0 + 0 + 0 + 0 + 0)

AP = (1/11) × 2.9521 ≈ 26.84%

Calculating the interpolation performed in all points

By interpolating all points, the Average Precision (AP) can be interpreted as an approximated AUC of the Precision x Recall curve. The intention is to reduce the impact of the wiggles in the curve. By applying the equations presented before, we can obtain the areas as will be demonstrated here. We can also obtain the interpolated precision points visually by looking at the recalls starting from the highest (0.4666) down to 0 (scanning the plot from right to left): as we decrease the recall, we collect the highest precision values seen so far, as shown in the image below:

<!--- interpolated precision AUC ---> <p align="center"> <img src="https://github.com/rafaelpadilla/Object-Detection-Metrics/blob/master/aux_images/interpolated_precision_v2.png" align="center"/> </p>

Looking at the plot above, we can divide the AUC into 4 areas (A1, A2, A3 and A4):

<!--- interpolated precision AUC ---> <p align="center"> <img src="https://github.com/rafaelpadilla/Object-Detection-Metrics/blob/master/aux_images/interpolated_precision-AUC_v2.png" align="center"/> </p>

Calculating the total area, we have the AP:

AP = A1 + A2 + A3 + A4

with

A1 = (0.0666 - 0) × 1 = 0.0666

A2 = (0.1333 - 0.0666) × 0.6666 = 0.04446222

A3 = (0.4 - 0.1333) × 0.4285 = 0.11428095

A4 = (0.4666 - 0.4) × 0.3043 = 0.02026638

AP = 0.0666 + 0.04446222 + 0.11428095 + 0.02026638 = 0.24560955 ≈ 24.56%

The results obtained by the two interpolation methods are slightly different: 24.56% with the every-point interpolation and 26.84% with the 11-point interpolation.

Our default implementation is the same as PASCAL VOC's: every-point interpolation. If you want to use the 11-point interpolation, change the functions that use the argument method=MethodAveragePrecision.EveryPointInterpolation to method=MethodAveragePrecision.ElevenPointInterpolation.

If you want to reproduce these results, see the Sample 2.

<!--In order to evaluate your detections, you just need a simple list of `Detection` objects. A `Detection` object is a very simple class containing the class id, class probability and bounding boxes coordinates of the detected objects. This same structure is used for the groundtruth detections.-->

How to use this project

This project was created to evaluate your detections in a very easy way. If you want to evaluate your algorithm with the most used object detection metrics, you are in the right place.

Sample_1 and sample_2 are practical examples demonstrating how to directly access the core functions of this project, providing more flexibility in the usage of the metrics. But if you don't want to spend your time understanding our code, see the instructions below to easily evaluate your detections:

Follow the steps below to start evaluating your detections:

  1. Create the ground truth files
  2. Create your detection files
  3. For Pascal VOC metrics, run the command: python pascalvoc.py
    If you want to reproduce the example above, run the command: python pascalvoc.py -t 0.3
  4. (Optional) You can use arguments to control the IOU threshold, bounding boxes format, etc.

Create the ground truth files

Create a separate ground truth text file for each image, where each line is in the format: <class_name> <left> <top> <right> <bottom>. If you prefer, you can also have your bounding boxes in the format: <class_name> <left> <top> <width> <height> (see here * how to use it). In this case, your "2008_000034.txt" would be represented as:

bottle 6 234 39 128
person 1 156 102 180
person 36 111 162 305
person 91 42 247 458

Create your detection files

Similarly, create a separate detection text file for each image, where each line is in the format: <class_name> <confidence> <left> <top> <right> <bottom>. Also, if you prefer, you can have your bounding boxes in the format: <class_name> <confidence> <left> <top> <width> <height>.

Optional arguments

Optional arguments:

| Argument | Description | Example | Default |
|:--------:|:------------|:--------|:--------|
| `-h`,<br>`--help` | show help message | `python pascalvoc.py -h` | |
| `-v`,<br>`--version` | check version | `python pascalvoc.py -v` | |
| `-gt`,<br>`--gtfolder` | folder that contains the ground truth bounding boxes files | `python pascalvoc.py -gt /home/whatever/my_groundtruths/` | `/Object-Detection-Metrics/groundtruths` |
| `-det`,<br>`--detfolder` | folder that contains your detected bounding boxes files | `python pascalvoc.py -det /home/whatever/my_detections/` | `/Object-Detection-Metrics/detections/` |
| `-t`,<br>`--threshold` | IOU threshold that tells if a detection is TP or FP | `python pascalvoc.py -t 0.75` | `0.50` |
| `-gtformat` | format of the coordinates of the ground truth bounding boxes * | `python pascalvoc.py -gtformat xyrb` | `xywh` |
| `-detformat` | format of the coordinates of the detected bounding boxes * | `python pascalvoc.py -detformat xyrb` | `xywh` |
| `-gtcoords` | reference of the ground truth bounding box coordinates.<br>If the annotated coordinates are relative to the image size (as used in YOLO), set it to `rel`.<br>If the coordinates are absolute values, not depending on the image size, set it to `abs` | `python pascalvoc.py -gtcoords rel` | `abs` |
| `-detcoords` | reference of the detected bounding box coordinates.<br>If the coordinates are relative to the image size (as used in YOLO), set it to `rel`.<br>If the coordinates are absolute values, not depending on the image size, set it to `abs` | `python pascalvoc.py -detcoords rel` | `abs` |
| `-imgsize` | image size in the format `width,height` `<int,int>`.<br>Required if `-gtcoords` or `-detcoords` is set to `rel` | `python pascalvoc.py -imgsize 600,400` | |
| `-sp`,<br>`--savepath` | folder where the plots are saved | `python pascalvoc.py -sp /home/whatever/my_results/` | `Object-Detection-Metrics/results/` |
| `-np`,<br>`--noplot` | if present, no plot is shown during execution | `python pascalvoc.py -np` | not present.<br>Therefore, plots are shown |

<a name="asterisk"> </a> (*) set -gtformat xywh and/or -detformat xywh if format is <left> <top> <width> <height>. Set to -gtformat xyrb and/or -detformat xyrb if format is <left> <top> <right> <bottom>.

References