Multi Target Single Camera Tracking Paper


interesting to see a variant of SORT (observation-centered) achieve decent results

not a tracking paper, but seems applicable to multi-camera tracking: detect bboxes in images and match them roughly, then refine camera poses with a GNN formulation (image as node, relative pose as edge, bbox info added during message passing)


first associate boxes with high detection scores, then associate boxes with low detection scores; improves tracking of occluded objects
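
A minimal sketch of the two-pass association described above, not the paper's implementation. Assumptions: boxes are (x1, y1, x2, y2), matching is greedy on IoU rather than an optimal assignment, and the names `two_stage_associate`, `high_thresh` and `min_iou` are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(track_boxes, det_boxes, min_iou):
    """Greedily pair tracks and detections by descending IoU (assumed matcher)."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in track_boxes.items()
         for di, d in det_boxes.items()),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, ti, di in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap even less
        if ti in used_tracks or di in used_dets:
            continue
        matches[ti] = di
        used_tracks.add(ti)
        used_dets.add(di)
    return matches


def two_stage_associate(tracks, detections, high_thresh=0.6, min_iou=0.3):
    """tracks: {track_id: box}; detections: list of (box, score).

    Returns {track_id: detection_index}.
    """
    high = {i: box for i, (box, s) in enumerate(detections) if s >= high_thresh}
    low = {i: box for i, (box, s) in enumerate(detections) if s < high_thresh}
    # Pass 1: confident detections against all tracks.
    first = greedy_match(tracks, high, min_iou)
    # Pass 2: low-score detections against tracks left unmatched in pass 1.
    remaining = {t: b for t, b in tracks.items() if t not in first}
    second = greedy_match(remaining, low, min_iou)
    return {**first, **second}


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [((1, 0, 11, 10), 0.9), ((20, 21, 30, 31), 0.2)]
result = two_stage_associate(tracks, detections)
```

Here track 2 is only recovered in the second pass, since its detection has a low score (e.g. a partially occluded object).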

instance similarity learning based on region proposal, flexible, no external data required

Transformer, detection and tracking simultaneously


Deep Hungarian Net; approximates MOTA and MOTP so they can be used directly as the loss function

appearance embedding (node) and geometry distance embedding (edge) for graph, edge classification with cross entropy loss
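
A small sketch of the loss in the note above, assuming edge classification is binary (edge links the same identity or not) and the network outputs a probability per edge. The function name `edge_cross_entropy` is hypothetical.

```python
import math


def edge_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross entropy over graph edges.

    probs[i]: predicted probability that edge i connects the same identity.
    labels[i]: 1 if it really does, else 0.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


loss = edge_cross_entropy([0.5, 0.5], [1, 0])
```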

pipeline: detection, feature extraction, affinity, association

end-to-end MOT, use adjacent frames (chained) to combine detection, feature extraction and tracking


use appearance, location and topology cues for similarity score, then graph solved by Hungarian algorithm
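
To illustrate the assignment step: given a track-by-detection similarity matrix, the Hungarian algorithm returns the one-to-one assignment with maximum total similarity. The sketch below is a stand-in, not the Hungarian algorithm itself: it brute-forces all permutations, which gives the same optimum on tiny inputs. The name `best_assignment` is hypothetical, and how the three cues are combined into one score is left out.

```python
from itertools import permutations


def best_assignment(similarity):
    """Return (assignment, total) maximizing summed similarity.

    similarity[i][j]: similarity of track i and detection j.
    Assumes n_tracks <= n_detections; assignment[i] is the detection for track i.
    """
    n_tracks, n_dets = len(similarity), len(similarity[0])
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(similarity[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total


# A greedy matcher would take (0, 0) first and end with total 1.0;
# the optimal assignment swaps the pairs.
similarity = [[0.90, 0.80],
              [0.85, 0.10]]
assignment, total = best_assignment(similarity)
```

For real problem sizes an actual Hungarian solver is needed, since the brute-force search grows factorially.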

GNN, Siamese network

motion and appearance extension -> Tracktor++

traditional and deep visual trackers

correlation filter, deep learning and convolutional features


use epipolar geometry, tracklet as node in graph

online MOT tracker


learn statistics to normalize the effect of camera poses; temporal adjacency constraint for data association

does not use appearance features; very fast but not accurate

IoU tracker, no visual cues used, fast
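
A rough sketch of an IoU-only tracker in the spirit of this note: each track is extended by the detection that overlaps its last box most, provided the IoU clears a threshold. Assumptions: boxes are (x1, y1, x2, y2), tracks end as soon as they miss one frame, and `iou_track` and `sigma_iou` are hypothetical names.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_track(frames, sigma_iou=0.5):
    """frames: list of per-frame box lists.

    Returns a list of tracks, each a list of (frame_index, box).
    """
    active, finished = [], []
    for f, dets in enumerate(frames):
        dets = list(dets)  # copy, matched detections are removed below
        still_active = []
        for track in active:
            last_box = track[-1][1]
            best = max(dets, key=lambda d: iou(last_box, d), default=None)
            if best is not None and iou(last_box, best) >= sigma_iou:
                track.append((f, best))
                dets.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # no overlap this frame: track ends
        # Every unmatched detection starts a new track.
        still_active.extend([(f, d)] for d in dets)
        active = still_active
    return finished + active


frames = [
    [(0, 0, 10, 10)],
    [(1, 0, 11, 10), (50, 50, 60, 60)],
    [(2, 0, 12, 10)],
]
tracks = iou_track(frames)
```

No image content is read at any point, which is why this family of trackers is fast but fragile under missed detections.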

RNN as tracker, LSTM for data association


use Siamese CNN to learn similarity, for data association, graph solved by Linear Programming


interaction between objects, relax the dependency of tracking on detections

Multi Target Multi Camera Tracking Paper


step 1: single camera tracking & generate appearance feature, step 2: multi camera association with GNN (single camera trajectories as node, averaged feature as node feature, cos(feature) as edge feature), weighted loss for imbalance
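
A sketch of how the graph in step 2 might be assembled before it is passed to a GNN: one node per single-camera trajectory with the averaged appearance feature, and cosine similarity as the edge feature. Assumptions not stated in the note: edges are only created between trajectories from different cameras, and the names `build_graph`, `mean_feature` and `cosine` are hypothetical.

```python
import math


def mean_feature(features):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def build_graph(trajectories):
    """trajectories: {(camera_id, track_id): [feature vectors]}.

    Returns (nodes, edges): nodes maps key -> averaged feature,
    edges maps (key_a, key_b) -> cosine similarity for cross-camera pairs.
    """
    nodes = {key: mean_feature(feats) for key, feats in trajectories.items()}
    keys = sorted(nodes)
    edges = {}
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if a[0] != b[0]:  # assumed: skip pairs within the same camera
                edges[(a, b)] = cosine(nodes[a], nodes[b])
    return nodes, edges


trajectories = {
    ("cam0", 1): [[1.0, 0.0], [1.0, 0.0]],
    ("cam1", 7): [[2.0, 0.0]],
    ("cam1", 8): [[0.0, 3.0]],
}
nodes, edges = build_graph(trajectories)
```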


tracklet as node, link prediction for data association; works with or without overlapping views; uses large training data

detection-> feature extraction, homography -> cross-camera cluster -> incremental temporal association, small latency, not very accurate
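
A sketch of the homography step in this pipeline: mapping a detection from image coordinates onto a shared ground plane so that detections from different cameras can be clustered there. Assumptions: the 3x3 homography `H` is already calibrated, the bottom-center of the box is used as the foot point, and `project_to_ground` is a hypothetical name.

```python
def project_to_ground(H, box):
    """Map the bottom-center of box (x1, y1, x2, y2) through homography H.

    H is a 3x3 matrix given as nested lists (image plane -> ground plane).
    """
    x = (box[0] + box[2]) / 2.0  # horizontal center of the box
    y = box[3]                   # bottom edge, assumed to touch the ground
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)        # divide out the homogeneous coordinate


# Toy homography: scale by 2, then shift x by 1.
H = [[2.0, 0.0, 1.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]
ground_point = project_to_ground(H, (0, 0, 10, 20))
```

Ground-plane points from all cameras can then be compared by plain Euclidean distance when forming cross-camera clusters.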


fuse all views into a ground-plane occupancy heatmap

tracklet representation with spatial-temporal attention, then tracklet-to-target assignment

tracklet-to-target assignment

single camera tracklet -> multi-camera tracklet fusion with appearance and physical features

use TrackletNet for single camera trajectory -> inter-camera tracking

single camera tracking -> match tracklets across camera views

Reinforcement learning, collaborative multi-camera

camera synchronization, SfM, Bundle Adjustment, spline representation for drone trajectory

combine appearance and homography for hierarchical clustering, known camera pose


Centralized (combine cross-camera views before tracking, like Wen et al.) and Distributed methods (single-camera tracking before fusion)

single camera detection -> create/match to track, with appearance, motion, spatial-temporal cues (cross-camera)


tracklet -> single camera trajectory (correlation clustering) -> multi camera trajectory

single camera tracking -> CNN feature extraction -> multi camera tracking (KMeans)


3D position for affinity computation, requires known camera parameters, cross-view coupling before trajectory


two trackers (detection and regression) run in parallel, measure their correspondence


detections as nodes in a hypergraph to find 3D reconstructions, each of which is a node in a min-cost flow graph, solved by binary linear programming


Related Github Repo

Related Dataset

Related Competition

