VideoSAM: A Large Vision Foundation Model for High-Speed Video Segmentation

Topics: High-Speed Video · Deep Learning · Bubble Segmentation · Multi-Modality Analysis · Dynamic Fluids · Patchification · IoU Metrics · Composite Frames · CNN Comparison · MIT License

Overview

VideoSAM is a large vision foundation model for high-speed video segmentation, with a focus on dynamic fluid environments. The model was rigorously tested across multiple data modalities, including Argon, Nitrogen, FC-72, and Water. VideoSAM uses a patchification process for fine-grained segmentation and achieves high-accuracy bubble segmentation across these modalities.

Key features include:

  • Zero-shot generalization across fluid modalities (Argon, Nitrogen, FC-72, Water)
  • Patchification-based segmentation of both single and composite frames
  • Benchmarking against SAM and a U-Net CNN baseline
  • Evaluation with IoU, F1 Score, and Precision metrics

Key Experiments

  1. Zero-Shot Generalization Across Modalities:

    • VideoSAM was trained on Argon data and tested on the remaining modalities (Nitrogen, FC-72, and Water). It demonstrated superior segmentation of complex fluids, especially those with intricate bubble boundaries.
  2. Performance Across Multiple Modalities:

    • The model was trained on multiple datasets and consistently outperformed baseline models like SAM, particularly excelling in fluids with complex dynamics such as Nitrogen and FC-72.
  3. Comparison with U-Net CNN:

    • VideoSAM was benchmarked against U-Net, a traditional CNN architecture. While U-Net performed better on simpler datasets like Water, VideoSAM surpassed it in handling more dynamic and complex fluid environments.

Data Location

All training and testing datasets for VideoSAM are located in the following directories within the project structure:

To reconstruct the data from its split files, follow the instructions below.


How to Unpack Split Zip Files

To reassemble files that were split into smaller parts, follow these steps:

  1. Navigate to the directory where the split files are located.
  2. Use the cat command to combine them:
    cat train_image_masks_part_* > train_image_masks.zip
    
  3. Unzip the combined file:
    unzip train_image_masks.zip
    

Apply the same process for test data or any other split zip files.
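
If you prefer to script the reassembly, the minimal Python sketch below performs the same steps programmatically. It assumes the split parts sit in the current directory and follow the naming pattern shown above; the train_image_masks prefix is taken from the example and should be adjusted for other archives.

    import glob
    import shutil
    import zipfile

    # Prefix taken from the example above; change it for test data or other split archives.
    prefix = "train_image_masks"

    # Concatenate the sorted parts into one zip file (equivalent to the `cat` command).
    parts = sorted(glob.glob(f"{prefix}_part_*"))
    with open(f"{prefix}.zip", "wb") as combined:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, combined)

    # Extract the reassembled archive (equivalent to `unzip`).
    with zipfile.ZipFile(f"{prefix}.zip") as archive:
        archive.extractall(".")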


Installation

  1. Clone the repository:

    git clone https://github.com/chikap421/videosam.git
    cd videosam
    
  2. Install the required dependencies:

    pip install -r requirements.txt
    
  3. Set up the environment:

    python setup.py install
    

Inference Pipeline

The inference pipeline for VideoSAM was designed to evaluate its performance across different data modalities. It comprises the following steps (a minimal code sketch follows the list):

  1. Grayscale Conversion and Normalization:

    • Frames are first converted to grayscale and normalized.
  2. Patchification:

    • For both single and composite frames, each frame is divided into smaller patches using a grid of bounding boxes.
  3. Mask Extraction:

    • Patches are processed through both VideoSAM and SAM models, and the predicted masks are stitched together to reconstruct full-image masks.
  4. Metrics Evaluation:

    • IoU, F1 Score, and Precision metrics are used for both single-frame and sequence-based performance evaluation.
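
The sketch below illustrates these steps in NumPy: grayscale conversion and normalization, grid-based patchification, stitching patch masks back into a full-image mask, and computing IoU, F1 Score, and Precision. It is a minimal illustration only; the function names, the 64-pixel patch size, and the threshold used as a stand-in for model predictions are assumptions, not the repository's actual API.

    import numpy as np

    # Illustrative helpers only -- not the VideoSAM repository's actual API.

    def to_grayscale_normalized(frame: np.ndarray) -> np.ndarray:
        """Convert an RGB frame (H, W, 3) to grayscale and rescale values to [0, 1]."""
        gray = frame.mean(axis=-1) if frame.ndim == 3 else frame.astype(np.float64)
        gray = gray.astype(np.float64)
        return (gray - gray.min()) / (gray.max() - gray.min() + 1e-8)

    def patchify_grid(image: np.ndarray, patch: int):
        """Split an image into non-overlapping patches with their grid bounding boxes."""
        h, w = image.shape[:2]
        patches = []
        for y in range(0, h - patch + 1, patch):
            for x in range(0, w - patch + 1, patch):
                patches.append(((y, x, y + patch, x + patch), image[y:y + patch, x:x + patch]))
        return patches

    def stitch_masks(patch_masks, shape):
        """Reassemble per-patch binary masks into a full-image mask."""
        full = np.zeros(shape, dtype=bool)
        for (y0, x0, y1, x1), mask in patch_masks:
            full[y0:y1, x0:x1] |= mask
        return full

    def segmentation_metrics(pred: np.ndarray, gt: np.ndarray):
        """IoU, F1 (Dice), and Precision for binary masks."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        return {
            "IoU": tp / (tp + fp + fn + 1e-8),
            "F1": 2 * tp / (2 * tp + fp + fn + 1e-8),
            "Precision": tp / (tp + fp + 1e-8),
        }

    # Example: run the steps on a random frame, with thresholding standing in for the model.
    frame = np.random.rand(256, 256, 3)
    gray = to_grayscale_normalized(frame)
    patch_masks = [(box, p > 0.5) for box, p in patchify_grid(gray, 64)]
    pred_mask = stitch_masks(patch_masks, gray.shape)
    print(segmentation_metrics(pred_mask, gray > 0.5))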

License

This project is licensed under the MIT License. See the LICENSE file for details.


🖋️ Citations

If you use this repository in your research, please cite:

@misc{maduabuchi2024videosamlargevisionfoundation,
      title={VideoSAM: A Large Vision Foundation Model for High-Speed Video Segmentation}, 
      author={Chika Maduabuchi and Ericmoore Jossou and Matteo Bucci},
      year={2024},
      eprint={2410.21304},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.21304}, 
}