Multi-Object Tracking with DeepSORT
This example shows how to integrate appearance features from a re-identification (Re-ID) deep neural network with a multi-object tracker to improve the performance of camera-based object tracking. The implementation closely follows the Deep Simple Online and Realtime Tracking (DeepSORT) multi-object tracking algorithm [1]. This example uses the Sensor Fusion and Tracking Toolbox™ and the Computer Vision Toolbox™.
Introduction
The objectives of multi-object tracking are to estimate the number of objects in a scene, to accurately estimate their positions, and to establish and maintain unique identities for all objects. You often achieve this through a tracking-by-detection approach that consists of two consecutive tasks. First, you obtain the detections of objects in each frame. Second, you perform track association and management across frames.
This example builds upon the SORT algorithm, introduced in the Implement Simple Online and Realtime Tracking example. The data association and track management of SORT are efficient and simple to implement, but they are ineffective when tracking objects over occlusions in single-view camera scenes.
The increasingly popular Re-ID networks provide appearance features, sometimes called appearance embeddings, for each object detection. Appearance features are a representation of the visual appearance of an object. They offer an additional measure of the similarity (or distance) between a detection and a track. The integration of appearance information into the data association is a powerful technique to handle tracking over longer occlusions and therefore reduces the number of switches in track identities.
Pre-Trained Person Re-Identification Network
Download the pretrained re-identification network from the internet. Refer to the Reidentify People Throughout a Video Sequence Using ReID Network (Computer Vision Toolbox) example to learn about this network and how to train it. You use this pretrained network to extract appearance features for each detection.
helperDownloadReIDResNet();
Downloading Pretrained Person ReID Network (~198 MB)
Load the Re-ID network.
load("personReIDResNet_v2.mat","net");
To obtain the appearance feature vector of a detection, you extract the bounding box coordinates and convert them to image frame indices. You can then crop out the bounding box of the detection and use the extractReidentificationFeatures (Computer Vision Toolbox) method of the reidentificationNetwork (Computer Vision Toolbox) object to obtain the appearance features. The associationExampleData MAT-file contains a detection object and a frame. The following code illustrates the use of the extractReidentificationFeatures method.
load("associationExampleData.mat","newDetection","frame"); % Crop frame to measurement bounding box bbox = newDetection.Measurement; croppedPerson = imcrop(frame, bbox); imshow(croppedPerson);
% Extract appearance features of the cropped pedestrian.
appearanceVect = extractReidentificationFeatures(net,croppedPerson)
appearanceVect = 2048×1 single column vector
-0.4880
0.3705
-0.4299
-0.0240
0.6064
0.4683
0.0888
0.4270
-0.0068
-0.0947
⋮
Use the supporting function runReIDNet to iterate over a set of detections and perform the steps above.
Assignment Distances
In this section, you learn about the three types of distances that the DeepSORT assignment strategy relies on.
Consider the previous frame and detection, depicted in the image below. In the current frame, an object detector returns the detection (Det: 1, in yellow), which should be associated with the existing tracks maintained by the multi-object tracker. The tracker hypothesizes that an object with TrackID 1 exists in the current frame, and its estimated bounding box is shown in orange. The track shown in the image is also saved in the associationExampleData MAT-file.
Each distance type may return values in a different range but larger values always indicate that the detection and track are less likely to be of the same object.
load("associationExampleData.mat","predictedTrack");
Bounding Box Intersection Over Union
This is the distance metric used in SORT. It formulates a distance between a track and a detection based on the overlap ratio of the two bounding boxes.
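Assuming the conventional complement-of-IoU form, the distance between a track bounding box and a detection bounding box can be written as:

$$d_{IoU}(\mathrm{trk},\mathrm{det}) = 1 - \frac{\left| bb_{trk} \cap bb_{det} \right|}{\left| bb_{trk} \cup bb_{det} \right|}$$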
The output is a scalar between 0 and 1. Evaluate the intersection-over-union distance using the helperDeepSORT.distanceIoU function.
helperDeepSORT.distanceIoU(predictedTrack, newDetection)
ans = 0.5688
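For reference, the following sketch shows how such a distance could be computed directly from two bounding boxes in [x y width height] format using the bboxOverlapRatio function. The bounding box values here are hypothetical, and this is not the helper's actual implementation.

% Hypothetical track and detection bounding boxes in [x y width height] format
trackBBox = [962 354 54 175];
detectionBBox = [979 353 70 175];
% The IoU distance is one minus the intersection-over-union ratio
iouDistance = 1 - bboxOverlapRatio(trackBBox, detectionBBox)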
Mahalanobis Distance
Another common approach to evaluate the distance between detections and tracks is the Mahalanobis distance, a statistical distance between probability density functions. It accounts for the uncertainty in the current bounding box location estimate and the uncertainty in the measurement. The distance is given by the following equation:

$$d_{Mahalanobis}(\mathrm{trk},\mathrm{det}) = (z - Hx)^{\top} S^{-1} (z - Hx)$$

Here, $z$ is the bounding box measurement of the detection and $x$ is the track state. $H$ is the Jacobian of the measurement function, which can also be interpreted as the projection from the 8-dimensional state space to the 4-dimensional measurement space in this example. In other words, $Hx$ is the predicted measurement. $S$ is the innovation covariance matrix with the following definition:

$$S = HPH^{\top} + R$$

where $P$ is the track state covariance and $R$ is the measurement noise covariance.
Evaluate the Mahalanobis distance between the predicted track and the detection.
predictedMeasurement = predictedTrack.State([1 3 5 7])' % Same as Hx
predictedMeasurement = 1×4
962.4930 353.9284 54.4362 174.6672
innovation = newDetection.Measurement-predictedMeasurement % z - Hx
innovation = 1×4
16.9370 -0.4384 15.2838 0.7628
S = predictedTrack.StateCovariance([1 3 5 7],[1 3 5 7]) + newDetection.MeasurementNoise % Same as HPH' + R
S = 4×4
49.1729 0 0 0
0 49.1729 0 0
0 0 49.1729 0
0 0 0 49.1729
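Using the innovation and innovation covariance above, you can evaluate the distance directly. This is a minimal check, assuming the squared form of the definition given earlier; it reproduces the value returned by the helper function below.

% Squared Mahalanobis distance from the innovation and innovation covariance
dMahalanobis = innovation / S * innovation'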
Use the helperDeepSORT.distanceMahalanobis function to calculate the distance.
helperDeepSORT.distanceMahalanobis(predictedTrack,newDetection)
ans = 10.5999
The Mahalanobis distance is a positive scalar. Unlike the other two distances, it is not bounded.
Appearance Cosine Distance
This distance metric evaluates the distance between a detection and the predicted track in the appearance feature space.
In DeepSORT [1], each track keeps the history of appearance feature vectors from previous detection assignments. Inspect the Appearance field of the saved track, under the ObjectAttributes property. In this example, appearance vectors are unit vectors with 2048 elements. The following predicted track history has 3 vectors.
appearanceHistory = predictedTrack.ObjectAttributes.Appearance
appearanceHistory = 2048×3 single matrix
-0.2481 -0.5268 -0.7212
0.5355 1.1441 2.0087
-0.5731 -0.8569 -1.7000
-0.1705 0.1594 0.0325
0.6062 1.3976 1.8887
0.4375 0.4383 1.0141
-0.2393 0.0501 0.2047
0.1737 -0.0448 -0.3690
-0.3931 -0.7453 -1.9172
-0.1576 -0.1666 0.0391
⋮
The distance between two appearance vectors is derived directly from their scalar product:

$$d_{cosine}(\mathrm{trk},\mathrm{det}) = 1 - \frac{a_{det}^{\top}\, a_{trk}}{\lVert a_{det} \rVert \, \lVert a_{trk} \rVert}$$

With this formula, you can calculate the distance between the appearance vector of a detection and the track history as follows.
detectionAppearance = newDetection.ObjectAttributes.Appearance;
1 - (detectionAppearance./vecnorm(detectionAppearance))' * (appearanceHistory./vecnorm(appearanceHistory))
ans = 1×3 single row vector
0.1729 0.1154 0.1303
Define the appearance cosine distance between a track and a detection as the minimum distance across the history of the track appearance vectors, also called a gallery. Use the helperDeepSORT.distanceCosine function to calculate it.
helperDeepSORT.distanceCosine(predictedTrack, newDetection)
ans = single
0.1154
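You can cross-check this value against the per-column distances computed earlier by taking the minimum over the gallery.

% Minimum cosine distance across the track appearance gallery
normalizedDet = detectionAppearance./vecnorm(detectionAppearance);
normalizedHistory = appearanceHistory./vecnorm(appearanceHistory);
min(1 - normalizedDet' * normalizedHistory)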
The appearance cosine distance returns a scalar between 0 and 2.
In this example you use the three distance metrics to formulate the overall assignment problem in terms of cost minimization. You calculate distances for all possible pairs of detections and tracks to form cost matrices.
Matching Cascade
The original idea behind DeepSORT is to combine the Mahalanobis distance and the appearance feature cosine distance to assign a set of new detections to the set of current tracks. The combination is done using a weight parameter that has a value between 0 and 1.
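Following the convention of [1], and assuming the AppearanceWeight property plays the role of the weight $\lambda$ (so that $\lambda = 1$ relies on the Mahalanobis distance only, consistent with the experiment suggested at the end of this example), the combined cost between track $i$ and detection $j$ takes the form:

$$c_{i,j} = \lambda\, d_{Mahalanobis}(i,j) + (1-\lambda)\, d_{cosine}(i,j)$$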
Both the Mahalanobis and the appearance cosine cost matrices are subjected to gating thresholds. Thresholding is done by setting cost matrix elements larger than their respective thresholds to Inf.
Due to the growth of the state covariance for unassigned tracks, the Mahalanobis distance tends to favor tracks that have not been updated in the last few frames over tracks with a smaller prediction error. DeepSORT handles this effect by splitting tracks into groups according to the last frame they were assigned. The algorithm assigns tracks that were updated in the previous frame first. Tracks are assigned to the new detections using linear assignment. Any remaining detections are considered for the assignment with the next track group. Once all track groups have been given a chance to get assigned, the remaining unassigned tracks of unassigned age 1, and the remaining unassigned detections are selected for linear assignment based on their IoU cost matrix. The flowchart below describes the matching cascade.
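The following is a minimal sketch of the cascade loop with synthetic data, using the assignmunkres function for the per-group linear assignment. The cost matrix, ages, and variable names are hypothetical, and the final IoU-based association of age-1 tracks is only indicated by a comment; see the helperDeepSORT class for the actual implementation.

% Illustrative matching cascade with synthetic data
rng(0);
numTracks = 4;
numDetections = 3;
trackUnassignedAge = [1 1 2 3];                 % frames since each track was last assigned
combinedCost = rand(numTracks,numDetections);   % gated, lambda-weighted cost matrix (hypothetical)
remainingDetections = 1:numDetections;
allAssignments = zeros(0,2);                    % [trackIndex detectionIndex] pairs
for age = 1:max(trackUnassignedAge)
    trackGroup = find(trackUnassignedAge == age);
    if isempty(trackGroup) || isempty(remainingDetections)
        continue
    end
    % Linear assignment within this track group
    cost = combinedCost(trackGroup,remainingDetections);
    assignment = assignmunkres(cost,1);
    assignedTracks = trackGroup(assignment(:,1));
    assignedDetections = remainingDetections(assignment(:,2));
    allAssignments = [allAssignments; assignedTracks(:) assignedDetections(:)]; %#ok<AGROW>
    remainingDetections = setdiff(remainingDetections,assignedDetections);
end
% Remaining age-1 tracks and remaining detections would then be
% associated with a linear assignment on their IoU cost matrix.
disp(allAssignments)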
The helperDeepSORT class implements the assignment routine. You can modify the code and try your own assignment strategy instead.
Build DeepSORT Tracker
In this section you construct DeepSORT. The remaining components are the estimation filters, the feature update, and the track initialization and deletion routine. The diagram below gives a summary of all the components involved in tracking-by-detection with DeepSORT.
Matching Cascade
The following properties configure DeepSORT's matching cascade assignment, described in the previous section.
AppearanceWeight
MahalanobisAssignmentThreshold
AppearanceAssignmentThreshold
IOUAssignmentThreshold
Set IOUAssignmentThreshold to 0.95 to allow assignment of detections to new tentative tracks with as little as 5% bounding box overlap. In this video, the low frame rate, the closeness of the camera to the scene, and the small number of people in the scene lead to few and small overlaps between detections in consecutive frames. You can set the threshold to a lower value in videos with a higher frame rate or more crowded scenes.
Next, set the MahalanobisAssignmentThreshold and AppearanceAssignmentThreshold properties. The Mahalanobis distance follows a chi-square distribution. Therefore, draw the threshold from the inverse chi-square distribution for a confidence interval of about 95%. For a 4-dimensional measurement space, the value is 9.4877. Manual tuning leads to an appearance threshold of 0.4.
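As a quick check, you can compute this gating value from the inverse chi-square cumulative distribution. This assumes the chi2inv function from the Statistics and Machine Learning Toolbox™ is available.

% 95% gating threshold for a 4-dimensional measurement space
mahalanobisGate = chi2inv(0.95, 4)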
In [1], setting the AppearanceWeight to 0 gives better results. In this scene, the combination of the Mahalanobis threshold and the appearance threshold resolves most assignment ambiguities. Therefore, you can choose any value between 0 and 1. For more crowded scenes, consider including some Mahalanobis distance in the cost by using a non-zero appearance weight, as noted in [2].
Estimation Filters
As in SORT, the bounding boxes are estimated with a linear Kalman filter using a constant velocity motion model. You use the initvisionbboxkf filter initialization function. Set the following properties of the helperDeepSORT object to configure the filter. Refer to the initvisionbboxkf documentation for more details.
FrameRate
FrameSize
NoiseIntensity
The video has a frame rate of 1 Hz and a frame size of [1288 964] pixels. The Kalman filter noise intensity is a tuning parameter. In this example, the value 0.001 leads to satisfactory results.
Track Initialization and Deletion
A new track is confirmed if it has been assigned for 2 consecutive frames. An existing track is deleted if it is missed for more than Tlost frames. In this example, you set Tlost = 5. This is long enough to account for all the occlusions in the video, which has a low frame rate (1 Hz). For videos with a higher frame rate, you should increase this value accordingly. The following properties, inherited from the trackerGNN System object, specify the confirmation and deletion logic.
ConfirmationThreshold
DeletionThreshold
Set ConfirmationThreshold to [2 2] and DeletionThreshold to [Tlost Tlost] according to the above.
Appearance Feature Update
For each assignment, DeepSORT stores the appearance feature vector of the detection in its assigned track. Configure the appearance update using the following properties.
AppearanceUpdate
MaxNumAppearanceFrames
AppearanceMomentum
In the original algorithm, DeepSORT stores a gallery of appearance vectors from past frames. Set the AppearanceUpdate property to "Gallery" and use the MaxNumAppearanceFrames property to choose the depth of the gallery. You first use a value of 50 frames. Consider increasing this value for high frame-rate videos.
There exist variants of DeepSORT using a different update mechanism [2,3]. You can also set AppearanceUpdate to "EMA" to use an exponential moving average update. In this configuration, each track only stores a single appearance vector and updates it with the assigned detection's appearance using the equation:

$$e_{t} = \alpha\, e_{t-1} + (1 - \alpha)\, f_{t}$$

where $e_{t}$ is the track appearance vector at frame $t$, $f_{t}$ is the appearance vector of the assigned detection, and $\alpha$ is a real number between 0 and 1 called the momentum term.
In this configuration, the MaxNumAppearanceFrames property is not used. Similarly, in the previous gallery configuration, the AppearanceMomentum property is not used.
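The following sketch illustrates this update with the variables loaded earlier, assuming alpha plays the role of the AppearanceMomentum property. It is illustrative only and may differ from the helper's implementation.

% Illustrative exponential moving average appearance update
alpha = 0.9;                                                      % momentum term
trackAppearance = appearanceHistory(:,end);                       % current track appearance vector
detectionAppearance = newDetection.ObjectAttributes.Appearance;   % assigned detection appearance
trackAppearance = alpha*trackAppearance + (1-alpha)*detectionAppearance;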
A second variant consists of combining the exponential moving average with the gallery method. In this method, each track stores a gallery of EMA appearance vectors. Set AppearanceUpdate to "EMA Gallery" to use this option. Both MaxNumAppearanceFrames and AppearanceMomentum are applicable properties in this configuration.
The gallery method captures long-term appearance changes for tracks because it stores appearances from previous frames and does not favor the appearance from the latest frames over older ones. For the same reason, the gallery method is not robust to association errors: once an erroneous appearance feature is stored in the gallery, it corrupts the distance evaluation in later frames, because the distance is the minimum across the gallery. The exponential moving average appearance update is more robust to erroneous associations, since the error is averaged out, at the expense of not capturing long-term appearance changes. The exponential moving average gallery offers a compromise between the two methods.
Configure DeepSORT Tracker
With all the previous considerations, create a DeepSORT tracker.
lambda = 0.02;
Tlost = 5;
tracker = helperDeepSORT(ConfirmationThreshold = [2 2],...
    DeletionThreshold = [Tlost Tlost],...
    AppearanceUpdate = "Gallery",...
    MaxNumAppearanceFrames = 50,...
    MahalanobisAssignmentThreshold = 10,...
    AppearanceAssignmentThreshold = 0.4,...
    IOUAssignmentThreshold = 0.95,...
    AppearanceWeight = lambda,...
    FrameSize = [1288 964],...
    FrameRate = 1,...
    NoiseIntensity = 1e-3*ones(1,4));
Evaluate DeepSORT
In this section, you exercise the tracker on the pedestrian tracking video and evaluate its performance using tracking metrics.
Pedestrian Tracking Dataset
Download the pedestrian tracking video file.
helperDownloadPedestrianTrackingVideo();
The PedestrianTrackingYOLODetections MAT-file contains detections generated from a YOLO v4 object detector, using a CSP-DarkNet-53 network trained on the COCO dataset. See the yolov4ObjectDetector (Computer Vision Toolbox) object for more details. The PedestrianTrackingGroundTruth MAT-file contains the ground truth for this video. Refer to the Import Camera-Based Datasets in MOT Challenge Format for Object Tracking example to learn how to import the ground truth and detection data into appropriate Sensor Fusion and Tracking Toolbox™ formats.
datasetname="PedestrianTracking"; load(datasetname+"GroundTruth.mat","truths"); load(datasetname+"YOLODetections.mat","detections");
Set the measurement noise covariance matrix using a standard deviation of 5 pixels for the corner coordinates and for the width and height of the bounding box. The measurement noise covariance depends on the statistics of the detector; modify this value accordingly if you use a different object detector.
R = diag([25, 25, 25, 25]);
for i=1:numel(detections)
    for j=1:numel(detections{i})
        detections{i}(j).MeasurementNoise = R;
    end
end
Run the Tracker
Next, exercise the complete tracking workflow on the Pedestrian Tracking video. To use the tracker, call the tracker with an array of objectDetection objects as the input, as if it were a function. The tracker returns confirmed, tentative, and all tracks, as well as an analysis info structure, similar to the trackerGNN object.
Filter out the YOLO detections with a confidence score lower than 0.5. Delete tracks if their bounding box is entirely out of the camera frame. This is to avoid maintaining tracks that are outside of the camera field of view for more than 5 frames.
% Display
reader = VideoReader("PedestrianTrackingVideo.avi");

% Initialize track log
deepSORTTrackLog = objectTrack.empty;

% Set minimum detection score
detectionScoreThreshold = 0.5;

% Choose appearance update method
tracker.AppearanceUpdate = "Gallery";
tracker.AppearanceMomentum = 0.9;

% Choose cost appearance weight
tracker.AppearanceWeight = 0;

% Toggle on/off visualization
player = vision.DeployableVideoPlayer;

reset(tracker);

for i=1:reader.NumFrames

    % Advance reader
    frame = readFrame(reader);

    % Parse detections set to retrieve detections on the ith frame
    curFrameDetections = detections{i};
    attributes = arrayfun(@(x) x.ObjectAttributes, curFrameDetections);
    scores = arrayfun(@(x) x.Score, attributes);
    highScoreDetections = curFrameDetections(scores > detectionScoreThreshold);

    % Run Re-ID Network on detections
    highScoreDetections = runReIDNet(net, frame, highScoreDetections);

    [tracks, tenttracks, ~, info] = tracker(highScoreDetections);

    deleteOutOfFrameTracks(tracker, tracks);

    frameWithTracks = helperAnnotateDeepSORTTrack(tracks, frame);

    step(player, frameWithTracks);

    % Log tracks for evaluation
    deepSORTTrackLog = [deepSORTTrackLog ; tracks]; %#ok<AGROW>
end
From the results, the person tracked with ID = 3 is occluded multiple times and makes abrupt changes of direction. This makes the person difficult to track with motion information alone, whether by means of the Mahalanobis distance or the bounding box overlap. The use of appearance features allows the tracker to maintain a unique track identifier for this person over this entire sequence and for the rest of the video. This is not achieved with the simpler SORT algorithm or when setting DeepSORT to only use the Mahalanobis distance. You can verify this by setting the AppearanceWeight parameter to 1 and relaxing the appearance threshold by setting AppearanceAssignmentThreshold to 2.
Tracking Metrics
The CLEAR multi-object tracking metrics provide a standard set of tracking metrics to evaluate the quality of a tracking algorithm. These metrics are popular for video-based tracking applications. Use the trackCLEARMetrics object to evaluate the CLEAR metrics for the DeepSORT results.
The CLEAR metrics require a similarity method to match track and true object pairs in each frame. In this example, you use the IoU2d similarity method and set the SimilarityThreshold property to 0.01. This means that a track can only be considered a true positive match with a truth object if their bounding boxes overlap by at least 1%. The metric results can vary depending on the choice of this threshold.
tcm = trackCLEARMetrics(SimilarityMethod ="IoU2d", SimilarityThreshold = 0.01);
The first step is to convert the objectTrack format to the trackCLEARMetrics input format specific to the IoU2d similarity method. Convert the track log.
deepSORTTrackedObjects = repmat(struct("Time",0,"TrackID",1,"BoundingBox",[0 0 0 0]),size(deepSORTTrackLog));
for i=1:numel(deepSORTTrackedObjects)
    deepSORTTrackedObjects(i).Time = deepSORTTrackLog(i).UpdateTime;
    deepSORTTrackedObjects(i).TrackID = deepSORTTrackLog(i).TrackID;
    deepSORTTrackedObjects(i).BoundingBox(:) = getTrackPositions(deepSORTTrackLog(i), [1 0 0 0 0 0 0 0; 0 0 1 0 0 0 0 0; 0 0 0 0 1 0 0 0; 0 0 0 0 0 0 1 0])';
end
To evaluate the results on the Pedestrian class only, keep only the ground truth elements with ClassID equal to 1 and filter out other classes.
truths = truths([truths.ClassID]==1);
Use the evaluate object function to obtain the metrics as a table.
deepSORTresults = evaluate(tcm, deepSORTTrackedObjects, truths);
disp(deepSORTresults)
    MOTA (%)    MOTP (%)    Mostly Tracked (%)    Partially Tracked (%)    Mostly Lost (%)    False Positive    False Negative    Recall (%)    Precision (%)    False Track Rate    ID Switches    Fragmentations
    ________    ________    __________________    _____________________    _______________    ______________    ______________    __________    _____________    ________________    ___________    ______________

     89.037      92.064           84.615                  15.385                  0                  25                41             93.189        95.734             0.14793              0                3
The CLEAR MOT metrics corroborate the quality of DeepSORT in keeping track identities over time, with no ID switches and very few fragmentations. This is the main benefit of using DeepSORT over SORT. Meanwhile, maintaining tracks alive over occlusions means that predicted locations are maintained (coasting) and compared against true positions, which leads to an increased number of false positives and false negatives when the overlap between the coasted tracks and the true bounding boxes is less than the metric threshold. This is reflected in the MOTA score of DeepSORT.
Refer to the trackCLEARMetrics page for additional information about all the CLEAR metrics quantities.
Note that the matching cascade is the original idea behind DeepSORT to handle the spread of covariance during occlusions. The Mahalanobis distance can be modified to be more robust to such effects, and a single step assignment can lead to identical or even better performance, as shown in [2].
Conclusion
In this example you have learned how to implement the DeepSORT object tracking algorithm. This is an example of attribute fusion by using deep appearance features for the assignment. The appearance attribute is updated using a simple memory buffer. You also have learned how to integrate a Re-Identification Deep Learning network as part of the tracking-by-detection framework to improve the performance of camera-based tracking in the presence of occlusions.
Supporting Functions
runReIDNet iterates over a set of detections, crops the frame to each detection bounding box, and extracts the appearance features with the Re-ID network.

function detections = runReIDNet(net, frame, detections)
    if isempty(detections)
        detections = objectDetection.empty;
    else
        for j = 1:numel(detections)
            % Crop frame to the detection bounding box
            bbox = detections(j).Measurement;
            croppedPerson = imcrop(frame,bbox);
            % Extract appearance features of the cropped pedestrian
            appearanceVect = extractReidentificationFeatures(net,croppedPerson);
            detections(j).ObjectAttributes.Appearance = appearanceVect;
        end
    end
end
deleteOutOfFrameTracks deletes tracks if their bounding box is entirely out of the video frame.
function deleteOutOfFrameTracks(tracker, confirmedTracks)
    % Get bounding boxes
    allboxes = helperDeepSORT.getTrackRectangles(confirmedTracks);
    allboxes = max(allboxes, realmin);
    % Find tracks with no overlap with the video frame
    alloverlaps = bboxOverlapRatio(allboxes,[1,1,1288,964]);
    isOutOfFrame = ~alloverlaps;
    allTrackIDs = [confirmedTracks.TrackID];
    trackToDelete = allTrackIDs(isOutOfFrame);
    for i=1:numel(trackToDelete)
        tracker.deleteTrack(trackToDelete(i));
    end
end
References
[1] Wojke, Nicolai, Alex Bewley, and Dietrich Paulus. "Simple Online and Realtime Tracking with a Deep Association Metric." In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645-3649. IEEE, 2017.
[2] Du, Yunhao, Zhicheng Zhao, Yang Song, Yanyun Zhao, Fei Su, Tao Gong, and Hongying Meng. "StrongSORT: Make DeepSORT Great Again." IEEE Transactions on Multimedia (2023).
[3] Du, Yunhao, Junfeng Wan, Yanyun Zhao, Binyu Zhang, Zhihang Tong, and Junhao Dong. "GIAOTracker: A Comprehensive Framework for MCMOT with Global Information and Optimizing Strategies in VisDrone 2021." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2809-2819. 2021.