
MATLAB Deep Learning Model Hub

Discover pretrained models for deep learning in MATLAB.

Models <a name="Models"/>

Computer Vision

Natural Language Processing

Audio

Lidar

Robotics

Image Classification <a name="ImageClassification"/>

Pretrained image classification networks have already learned to extract powerful and informative features from natural images. Use them as a starting point to learn a new task using transfer learning.

Inputs are RGB images; the outputs are the predicted label and score.

These networks have been trained on more than a million images and can classify images into 1000 object categories.

Models available in MATLAB:

Note 1: Since R2024a, please use the imagePretrainedNetwork function instead of the individual pretrained network functions, and specify the pretrained model by name. For example, use the following code to access googlenet:

```matlab
[net, classes] = imagePretrainedNetwork("googlenet");
```
| Network | Size (MB) | Classes | Accuracy % | Location |
| ------- | --------- | ------- | ---------- | -------- |
| googlenet<sup>1</sup> | 27 | 1000 | 66.25 | Doc <br /> GitHub |
| squeezenet<sup>1</sup> | 5.2 | 1000 | 55.16 | Doc |
| alexnet<sup>1</sup> | 227 | 1000 | 54.10 | Doc |
| resnet18<sup>1</sup> | 44 | 1000 | 69.49 | Doc <br /> GitHub |
| resnet50<sup>1</sup> | 96 | 1000 | 74.46 | Doc <br /> GitHub |
| resnet101<sup>1</sup> | 167 | 1000 | 75.96 | Doc <br /> GitHub |
| mobilenetv2<sup>1</sup> | 13 | 1000 | 70.44 | Doc <br /> GitHub |
| vgg16<sup>1</sup> | 515 | 1000 | 70.29 | Doc |
| vgg19<sup>1</sup> | 535 | 1000 | 70.42 | Doc |
| inceptionv3<sup>1</sup> | 89 | 1000 | 77.07 | Doc |
| inceptionresnetv2<sup>1</sup> | 209 | 1000 | 79.62 | Doc |
| xception<sup>1</sup> | 85 | 1000 | 78.20 | Doc |
| darknet19<sup>1</sup> | 78 | 1000 | 74.00 | Doc |
| darknet53<sup>1</sup> | 155 | 1000 | 76.46 | Doc |
| densenet201<sup>1</sup> | 77 | 1000 | 75.85 | Doc |
| shufflenet<sup>1</sup> | 5.4 | 1000 | 63.73 | Doc |
| nasnetmobile<sup>1</sup> | 20 | 1000 | 73.41 | Doc |
| nasnetlarge<sup>1</sup> | 332 | 1000 | 81.83 | Doc |
| efficientnetb0<sup>1</sup> | 20 | 1000 | 74.72 | Doc |
| ConvMixer | 7.7 | 10 | - | GitHub |
| Vision Transformer | Large-16 - 1100 <br /> Base-16 - 331.4 <br /> Small-16 - 84.7 <br /> Tiny-16 - 22.2 | 1000 | Large-16 - 85.59 <br /> Base-16 - 85.49 <br /> Small-16 - 83.73 <br /> Tiny-16 - 78.22 | Doc |
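
For a quick end-to-end check, here is a minimal sketch following the imagePretrainedNetwork doc pattern; it classifies peppers.png, a sample image that ships with MATLAB (R2024a or later, plus the relevant support package):

```matlab
% Load a pretrained network and its class names (R2024a and later).
[net, classNames] = imagePretrainedNetwork("googlenet");

% Resize a sample image to the network input size.
img = imread("peppers.png");
img = imresize(img, net.Layers(1).InputSize(1:2));

% Predict class scores and convert them to a label.
scores = predict(net, single(img));
label = scores2label(scores, classNames)
```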

Tips for selecting a model

Pretrained networks have different characteristics that matter when choosing a network to apply to your problem. The most important characteristics are network accuracy, speed, and size. Choosing a network is generally a tradeoff between these characteristics. The following figure highlights these tradeoffs:

Figure. Comparing image classification model accuracy, speed and size.

Back to top

Object Detection <a name="ObjectDetection"/>

Object detection is a computer vision technique used for locating instances of objects in images or videos. When humans look at images or video, we can recognize and locate objects of interest within a matter of moments. The goal of object detection is to replicate this intelligence using a computer.

Inputs are RGB images; the outputs are the predicted label, bounding box, and score.

These networks have been trained to detect 80 object classes from the COCO dataset. These models are suitable for training a custom object detector using transfer learning.

| Network | Network variants | Size (MB) | Mean Average Precision (mAP) | Object Classes | Location |
| ------- | ---------------- | --------- | ---------------------------- | -------------- | -------- |
| EfficientDet-D0 | efficientnet | 15.9 | 33.7 | 80 | GitHub |
| YOLO v9 | yolo9t <br /> yolo9s <br /> yolo9m <br /> yolo9c <br /> yolo9e | 7.5 <br /> 25 <br /> 67.2 <br /> 85 <br /> 190 | 38.3 <br /> 46.8 <br /> 51.4 <br /> 53.0 <br /> 55.6 | 80 | GitHub |
| YOLO v8 | yolo8n <br /> yolo8s <br /> yolo8m <br /> yolo8l <br /> yolo8x | 10.7 <br /> 37.2 <br /> 85.4 <br /> 143.3 <br /> 222.7 | 37.3 <br /> 44.9 <br /> 50.2 <br /> 52.9 <br /> 53.9 | 80 | GitHub |
| YOLOX | YoloX-s <br /> YoloX-m <br /> YoloX-l | 32 <br /> 90.2 <br /> 192.9 | 39.8 <br /> 45.9 <br /> 48.6 | 80 | Doc <br /> GitHub |
| YOLO v4 | yolov4-coco <br /> yolov4-tiny-coco | 229 <br /> 21.5 | 44.2 <br /> 19.7 | 80 | Doc <br /> GitHub |
| YOLO v3 | darknet53-coco <br /> tiny-yolov3-coco | 220.4 <br /> 31.5 | 34.4 <br /> 9.3 | 80 | Doc |
| YOLO v2 | darknet19-COCO <br /> tiny-yolo_v2-coco | 181 <br /> 40 | 28.7 <br /> 10.5 | 80 | Doc <br /> GitHub |
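
As a minimal sketch, assuming Computer Vision Toolbox and the YOLO v4 support package are installed, a COCO-trained detector can be run like this (visionteam.jpg ships with the toolbox):

```matlab
% Create a YOLO v4 object detector pretrained on COCO.
detector = yolov4ObjectDetector("csp-darknet53-coco");

% Detect objects and visualize the results.
img = imread("visionteam.jpg");
[bboxes, scores, labels] = detect(detector, img);
annotated = insertObjectAnnotation(img, "rectangle", bboxes, string(labels));
imshow(annotated)
```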

Tips for selecting a model

Pretrained object detectors have different characteristics that matter when choosing a network to apply to your problem. The most important characteristics are mean average precision (mAP), speed, and size. Choosing a network is generally a tradeoff between these characteristics.

Application Specific Object Detectors

These networks have been trained to detect specific objects for a given application.

| Network | Application | Size (MB) | Location | Example Output |
| ------- | ----------- | --------- | -------- | -------------- |
| Spatial-CNN | Lane detection | 74 | GitHub | <img src="Images/lanedetection.jpg" width=150> |
| RESA | Road boundary detection | 95 | GitHub | <img src="Images/road_boundary.png" width=150> |
| Single Shot Detector (SSD) | Vehicle detection | 44 | Doc | <img src="Images/ObjectDetectionUsingSSD.png" width=150> |
| Faster R-CNN | Vehicle detection | 118 | Doc | <img src="Images/faster_rcnn.png" width=150> |

Back to top

Semantic Segmentation <a name="SemanticSegmentation"/>

Segmentation is essential for image analysis tasks. Semantic segmentation describes the process of associating each pixel of an image with a class label (such as flower, person, road, sky, ocean, or car).

Inputs are RGB images, outputs are pixel classifications (semantic maps). <img src="Images/semanticseg.png" class="center">

This network has been trained to segment 20 object classes from the PASCAL VOC dataset:

| Network | Size (MB) | Mean Accuracy | Object Classes | Location |
| ------- | --------- | ------------- | -------------- | -------- |
| DeepLabv3+ | 209 | 0.87 | 20 | GitHub |
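
A minimal usage sketch, assuming you have downloaded the pretrained DeepLabv3+ network from the linked repository into a variable `net` (you may need to resize the image to the network input size):

```matlab
% Assumes `net` holds the pretrained DeepLabv3+ network from the
% linked repository.
img = imread("visionteam.jpg");
segMap = semanticseg(img, net);        % per-pixel class labels
imshow(labeloverlay(img, segMap))      % blend labels onto the image
```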

Zero-shot image segmentation model:

| Network | Size (MB) | Location |
| ------- | --------- | -------- |
| segmentAnythingModel | 358 | Doc |
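
A rough prompt-based segmentation sketch; the segmentAnythingModel methods and name-value arguments shown here are assumptions based on recent Computer Vision Toolbox releases, so verify them against the linked doc:

```matlab
% Load the pretrained Segment Anything Model (assumed API).
sam = segmentAnythingModel;

% Compute image embeddings once, then segment from a point prompt.
img = imread("visionteam.jpg");
embeddings = extractEmbeddings(sam, img);
mask = segmentObjectsFromEmbeddings(sam, embeddings, size(img), ...
    ForegroundPoints=[510 290]);       % example [x y] prompt location
imshow(labeloverlay(img, mask))
```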

Application Specific Semantic Segmentation Models

| Network | Application | Size (MB) | Location | Example Output |
| ------- | ----------- | --------- | -------- | -------------- |
| U-net | Raw Camera Processing | 31 | Doc | <img src="Images/rawimage.png" width=150> |
| 3-D U-net | Brain Tumor Segmentation | 56.2 | Doc | <img src="Images/Segment3DBrainTumor.gif" width=150> |
| AdaptSeg (GAN) | Model tuning using 3-D simulation data | 54.4 | Doc | <img src="Images/adaptSeg.png" width=150> |

Back to top

Instance Segmentation <a name="InstanceSegmentation"/>

Instance segmentation is an enhanced type of object detection that generates a segmentation map for each detected instance of an object. Instance segmentation treats individual objects as distinct entities, regardless of the class of the objects. In contrast, semantic segmentation considers all objects of the same class as belonging to a single entity.

Inputs are RGB images, outputs are pixel classifications (semantic maps), bounding boxes and classification labels.

| Network | Object Classes | Location |
| ------- | -------------- | -------- |
| Mask R-CNN | 80 | Doc <br /> GitHub |
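
A minimal sketch, assuming the Mask R-CNN support package is installed:

```matlab
% Load a Mask R-CNN network pretrained on COCO.
net = maskrcnn("resnet50-coco");

% Segment object instances and overlay the masks.
img = imread("visionteam.jpg");
[masks, labels, scores, boxes] = segmentObjects(net, img);
imshow(insertObjectMask(img, masks))
```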

Back to top

Image Translation <a name="ImageTranslation"/>

Image translation is the task of transferring styles and characteristics from one image domain to another. This technique can be extended to other image-to-image learning operations, such as image enhancement, image colorization, defect generation, and medical image analysis.

Inputs are images, outputs are translated RGB images. This example workflow shows how a semantic segmentation map input translates to a synthetic image via a pretrained model (Pix2PixHD):

| Network | Application | Size (MB) | Location | Example Output |
| ------- | ----------- | --------- | -------- | -------------- |
| Pix2PixHD (CGAN) | Synthetic Image Translation | 648 | Doc | <img src="Images/SynthesizeSegmentation.png" width=150> |
| UNIT (GAN) | Day-to-Dusk and Dusk-to-Day Image Translation | 72.5 | Doc | <img src="Images/day2dusk.png" width=150> |
| UNIT (GAN) | Medical Image Denoising | 72.4 | Doc | <img src="Images/unit_imagedenoising.png" width=150> |
| CycleGAN | Medical Image Denoising | 75.3 | Doc | <img src="Images/cyclegan_imagedenoising.png" width=150> |
| VDSR | Super Resolution (estimate a high-resolution image from a low-resolution image) | 2.4 | Doc | <img src="Images/SuperResolution.png" width=150> |

Back to top

Pose Estimation <a name="PoseEstimation"/>

Pose estimation is a computer vision technique for localizing the position and orientation of an object using a fixed set of keypoints.

Inputs are RGB images; outputs are heatmaps and part affinity fields (PAFs), which are post-processed to produce the pose estimates.

| Network | Backbone Networks | Size (MB) | Location |
| ------- | ----------------- | --------- | -------- |
| OpenPose | vgg19 | 14 | Doc |
| HR Net | human-full-body-w32 <br /> human-full-body-w48 | 106.9 <br /> 237.7 | Doc |
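
A hedged sketch of top-down pose estimation; hrnetObjectKeypointDetector and its detect signature are assumptions based on recent Computer Vision Toolbox releases, so verify against the linked doc. HRNet expects person bounding boxes, produced here with an ACF people detector:

```matlab
% Create a pretrained HRNet keypoint detector (assumed API).
keypointDetector = hrnetObjectKeypointDetector("human-full-body-w32");

% HRNet is top-down: find person boxes first, then keypoints per box.
img = imread("visionteam.jpg");
personDetector = peopleDetectorACF;
bboxes = detect(personDetector, img);
keypoints = detect(keypointDetector, img, bboxes);
```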

Back to top

3D Reconstruction <a name="3DReconstruction"/>

3D reconstruction is the process of capturing the shape and appearance of real objects.

| Network | Size (MB) | Location | Example Output |
| ------- | --------- | -------- | -------------- |
| NeRF | 3.78 | GitHub | NeRF |

Back to top

Video Classification <a name="VideoClassification"/>

Video classification is a computer vision technique for classifying the action or content in a sequence of video frames.

Inputs are video only, or video plus optical flow data; outputs are action classifications and scores.

| Network | Inputs | Size (MB) | Classifications (Human Actions) | Description | Location |
| ------- | ------ | --------- | ------------------------------- | ----------- | -------- |
| SlowFast | Video | 124 | 400 | Faster convergence than Inflated-3D | Doc |
| R(2+1)D | Video | 112 | 400 | Faster convergence than Inflated-3D | Doc |
| Inflated-3D | Video & Optical Flow data | 91 | 400 | Accuracy of the classifier improves when combining optical flow and RGB data | Doc |
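
A minimal sketch, assuming the SlowFast video classification support package is installed; the video file name is hypothetical:

```matlab
% Load a SlowFast classifier pretrained on Kinetics-400.
sf = slowFastVideoClassifier();

% Classify the dominant action in a video file (hypothetical name).
label = classifyVideoFile(sf, "pushup.mp4")
```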

Back to top

Text Detection and Recognition <a name="textdetection"/>

Text detection is a computer vision technique used for locating instances of text within images.

Inputs are RGB images, outputs are bounding boxes that identify regions of text.

| Network | Application | Size (MB) | Location |
| ------- | ----------- | --------- | -------- |
| CRAFT | Trained to detect English, Korean, Italian, French, Arabic, German, and Bangla (Indian) text. | 3.8 | Doc <br /> GitHub |
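
A minimal sketch combining CRAFT detection with OCR recognition (businessCard.png ships with Computer Vision Toolbox):

```matlab
% Detect text regions with the pretrained CRAFT model.
img = imread("businessCard.png");
bboxes = detectTextCRAFT(img);

% Recognize the detected text with OCR, one result per region.
results = ocr(img, bboxes);
recognizedText = string({results.Text})
```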

Application Specific Text Detectors

| Network | Application | Size (MB) | Location | Example Output |
| ------- | ----------- | --------- | -------- | -------------- |
| Seven Segment Digit Recognition | Seven-segment digit recognition using deep learning and OCR, helpful in industrial automation applications where digital displays are often surrounded by complex backgrounds. | 3.8 | Doc <br /> GitHub | |

Back to top

Transformers (Text) <a name="transformers"/>

Pretrained transformer models have already learned to extract powerful and informative features from text. Use them as a starting point to learn a new task using transfer learning.

Inputs are sequences of text, outputs are text feature embeddings.

| Network | Applications | Size (MB) | Location |
| ------- | ------------ | --------- | -------- |
| BERT | Feature Extraction (Sentence and Word embedding), Text Classification, Token Classification, Masked Language Modeling, Question Answering | 390 | GitHub <br /> Doc |
| all-MiniLM-L6-v2 | Document Embedding, Clustering, Information Retrieval | 80 | Doc |
| all-MiniLM-L12-v2 | Document Embedding, Clustering, Information Retrieval | 120 | Doc |
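
For sentence-level features, here is a short sketch using the all-MiniLM models; the documentEmbedding interface shown is an assumption based on recent Text Analytics Toolbox releases, so check the linked doc pages:

```matlab
% Load a pretrained sentence-embedding model (assumed interface).
emb = documentEmbedding(Model="all-MiniLM-L6-v2");

% Embed two sentences and compare them with cosine similarity.
E = embed(emb, ["Pretrained models save training time."; ...
                "Transfer learning reuses learned features."]);
similarity = dot(E(1,:), E(2,:)) / (norm(E(1,:)) * norm(E(2,:)))
```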

Application Specific Transformers

| Network | Application | Size | Location | Output Example |
| ------- | ----------- | ---- | -------- | -------------- |
| FinBERT | The FinBERT model is a BERT model for financial sentiment analysis. | 388 MB | GitHub | |
| GPT-2 | The GPT-2 model is a decoder model used for text summarization. | 1.2 GB | GitHub | |

Back to top

Audio Embeddings <a name="AudioEmbeddings"/>

Audio embedding pretrained models have already learned to extract powerful and informative features from audio signals. Use them as a starting point to learn a new task using transfer learning.

Inputs are audio signals, outputs are audio feature embeddings.

Note 2: Since R2024a, please use the audioPretrainedNetwork function instead of the individual pretrained network functions, and specify the pretrained model by name. For example, use the following code to access VGGish:

```matlab
net = audioPretrainedNetwork("vggish");
```
| Network | Application | Size (MB) | Location |
| ------- | ----------- | --------- | -------- |
| VGGish<sup>2</sup> | Feature Embeddings | 257 | Doc |
| OpenL3<sup>2</sup> | Feature Embeddings | 200 | Doc |
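
A minimal sketch, assuming Audio Toolbox and the VGGish support package are installed (the audio file ships with Audio Toolbox):

```matlab
% Extract one 128-dimensional VGGish embedding per audio frame.
[audioIn, fs] = audioread("Counting-16-44p1-mono-15secs.wav");
embeddings = vggishEmbeddings(audioIn, fs);
```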

Application Specific Audio Models<a name="Application Specific Audio Models"/>

| Network | Application | Size (MB) | Output Classes | Location | Output Example |
| ------- | ----------- | --------- | -------------- | -------- | -------------- |
| vadnet<sup>2</sup> | Voice Activity Detection (regression) | 0.427 | - | Doc | <img src="Images/vadnet.png" width=150> |
| <a name="SoundClassification"/>YAMNet<sup>2</sup> | Sound Classification | 13.5 | 521 | Doc | <img src="Images/audio_classification.png" width=150> |
| <a name="PitchEstimation"/>CREPE<sup>2</sup> | Pitch Estimation (regression) | 132 | - | Doc | <img src="Images/pitch_estimation.png" width=150> |

Speech to Text <a name="Speech2Text"/>

Speech-to-text models provide a fast, efficient method to convert spoken language into written text. They enhance accessibility for individuals with disabilities, enable downstream tasks such as text summarization and sentiment analysis, and streamline documentation processes. As a key element of human-machine interfaces, including personal assistants, speech-to-text allows for natural and intuitive interactions, enabling machines to understand and execute spoken commands, improving usability and broadening inclusivity across applications.

Inputs are audio signals; the output is text.

| Network | Application | Size (MB) | Word Error Rate (WER) | Location |
| ------- | ----------- | --------- | --------------------- | -------- |
| wav2vec | Speech to Text | 236 | 3.2 | GitHub |
| deepspeech | Speech to Text | 167 | 5.97 | GitHub |
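
A hedged transcription sketch; the speechClient/speech2text interface shown is an assumption based on recent Audio Toolbox releases and requires the wav2vec 2.0 support files, so verify against the linked repo and doc:

```matlab
% Transcribe speech with a wav2vec 2.0 client (assumed interface).
[audioIn, fs] = audioread("Counting-16-44p1-mono-15secs.wav");
transcriber = speechClient("wav2vec2.0");
transcript = speech2text(audioIn, fs, Client=transcriber)
```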

Back to top

Lidar <a name="PointCloud"/>

Point cloud data is acquired by a variety of sensors, such as lidar, radar, and depth cameras. Training robust classifiers with point cloud data is challenging because of the sparsity of data per object, object occlusions, and sensor noise. Deep learning techniques have been shown to address many of these challenges by learning robust feature representations directly from point cloud data.

Inputs are lidar point clouds converted to five channels; outputs are segmentation, classification, or object detection results overlaid on the point clouds.

| Network | Application | Size (MB) | Object Classes | Location |
| ------- | ----------- | --------- | -------------- | -------- |
| PointNet | Classification | 5 | 14 | Doc |
| <a name="PointCloudSeg"/>PointNet++ | Segmentation | 3 | 8 | Doc |
| PointSeg | Segmentation | 14 | 3 | Doc |
| SqueezeSegV2 | Segmentation | 5 | 12 | Doc |
| SalsaNext | Segmentation | 20.9 | 13 | GitHub |
| <a name="PointCloudObj"/>PointPillars | Object Detection | 8 | 3 | Doc |
| Complex YOLO v4 | Object Detection | 233 (complex-yolov4) <br /> 21 (tiny-complex-yolov4) | 3 | GitHub |
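
A rough detection sketch; it assumes `detector` holds a pointPillarsObjectDetector loaded from the pretrained weights referenced in the linked doc example, and the point cloud file name is hypothetical:

```matlab
% Assumes `detector` holds a pointPillarsObjectDetector loaded from
% the pretrained weights referenced in the linked doc example.
ptCloud = pcread("lidarFrame.pcd");    % hypothetical point cloud file
[bboxes, scores, labels] = detect(detector, ptCloud);
```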

Back to top

Manipulator Motion Planning <a name="ManipMotionPlanning"/>

Manipulator motion planning is a technique used to plan a trajectory for a robotic arm from a start position to a goal position in an obstacle environment.

Pretrained deep learning models have learned to plan such trajectories for repetitive tasks such as picking and placing of objects, leading to speedups over traditional algorithms.

Inputs are the start configuration, goal configuration, and an encoding of the obstacle environment for the robot; outputs are intermediate trajectory guesses.

| Network | Application | Size (MB) | Location |
| ------- | ----------- | --------- | -------- |
| Deep-Learning-Based CHOMP (DLCHOMP) | Trajectory Prediction | 25 | Doc <br /> GitHub |

Back to top

Path Planning with Motion Planning Networks <a name="PathPlanningMPNet"/>

Motion Planning Networks (MPNet) is a deep-learning-based approach for finding optimal paths between a start point and goal point in motion planning problems. MPNet is a deep neural network that can be trained on multiple environments to learn optimal paths between various states in those environments, and it uses this prior knowledge to plan paths in new environments.

To learn more, see Get Started with Motion Planning Networks.

| Network | Application | Size (MB) | Location |
| ------- | ----------- | --------- | -------- |
| mazeMapTrainedMPNET | Path Planning | 0.23 | Doc |

Back to top

Model requests

If you'd like to request MATLAB support for additional pretrained models, please create an issue in this repo.

Alternatively, send the request to:

Jianghao Wang <br /> Deep Learning Product Manager <br /> jianghaw@mathworks.com

Copyright 2024, The MathWorks, Inc.