Seamless Intro

Seamless is a family of AI models that enable more natural and authentic communication across languages. SeamlessM4T is a massively multilingual multimodal machine translation model supporting around 100 languages. SeamlessM4T serves as the foundation for SeamlessExpressive, a model that preserves elements of prosody and voice style across languages, and SeamlessStreaming, a model supporting simultaneous translation and streaming ASR for around 100 languages. SeamlessExpressive and SeamlessStreaming are combined into Seamless, a unified model featuring multilinguality, real-time and expressive translations.

Links

Demos

	SeamlessM4T v2	SeamlessExpressive	SeamlessStreaming
Demo	SeamlessM4T v2 Demo	SeamlessExpressive Demo
HuggingFace Space Demo	🤗 SeamlessM4T v2 Space	🤗 SeamlessExpressive Space	🤗 SeamlessStreaming Space

Papers

Seamless

EMMA

SONAR

Blog

AI at Meta Blog

Tutorial

An exhaustive tutorial, given at the NeurIPS 2023 - Seamless EXPO, is a one-stop shop for learning how to use the entire suite of Seamless models. Please feel free to play with the notebook.

SeamlessM4T

SeamlessM4T is our foundational all-in-one Massively Multilingual and Multimodal Machine Translation model delivering high-quality translation for speech and text in nearly 100 languages.

SeamlessM4T models support the tasks of:

Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Text-to-speech translation (T2ST)
Text-to-text translation (T2TT)
Automatic speech recognition (ASR)
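
As a quick reference, the tasks above can be summarized by their input and output modality. This is only an illustrative sketch: the lowercase codes `s2st` and `t2tt` appear in the CLI examples later in this README, while the remaining codes and the helper `tasks_for_input` are assumptions made for this example.

```python
# Input/output modality of each SeamlessM4T task listed above.
# Task codes other than "s2st" and "t2tt" are assumed by analogy.
TASKS = {
    "s2st": ("speech", "speech"),
    "s2tt": ("speech", "text"),
    "t2st": ("text", "speech"),
    "t2tt": ("text", "text"),
    "asr": ("speech", "text"),  # transcription: the text stays in the source language
}


def tasks_for_input(modality: str) -> list[str]:
    """Return the task codes that accept the given input modality (hypothetical helper)."""
    return sorted(code for code, (src, _) in TASKS.items() if src == modality)


print(tasks_for_input("speech"))
print(tasks_for_input("text"))
```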

:star2: We are releasing SeamlessM4T v2, an updated version with our novel UnitY2 architecture. This new model improves over SeamlessM4T v1 in quality as well as inference latency in speech generation tasks.

To learn more about the collection of SeamlessM4T models, the approach used in each, their language coverage and their performance, visit the SeamlessM4T README or 🤗 Model Card.

[!NOTE] SeamlessM4T is also available in the 🤗 Transformers library. Visit this section for more details.

SeamlessExpressive

SeamlessExpressive is a speech-to-speech translation model that captures certain underexplored aspects of prosody such as speech rate and pauses, while preserving the style of one's voice and high content translation quality.

To learn more about SeamlessExpressive models, visit the SeamlessExpressive README or 🤗 Model Card.

SeamlessStreaming

SeamlessStreaming is a streaming translation model. The model supports speech as input modality and speech/text as output modalities.

The SeamlessStreaming model supports the following tasks:

Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Automatic speech recognition (ASR)

To learn more about SeamlessStreaming models, visit the SeamlessStreaming README or 🤗 Model Card.

Seamless

The Seamless model is the unified model for expressive streaming speech-to-speech translations.

What's new

[12/18/2023] We are open-sourcing our Conformer-based W2v-BERT 2.0 speech encoder as described in Section 3.2.1 of the paper, which is at the core of our Seamless models.
[12/14/2023] We are releasing the Seamless tutorial given at NeurIPS 2023.

Quick Start

Installation

[!NOTE] One of the prerequisites is fairseq2, which has pre-built packages available only for Linux x86-64 and Apple-silicon Mac computers. In addition, it depends on libsndfile, which might not be installed on your machine. If you experience any installation issues, please refer to its README for further instructions.

pip install .

[!NOTE] Transcribing inference audio for computing metrics uses Whisper, which is automatically installed. Whisper in turn requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers.

Running inference

SeamlessM4T Inference

Here’s an example of using the CLI from the root directory to run inference.

S2ST task:

m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio>

T2TT task:

m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang>

Please refer to the inference README for detailed instructions on how to run inference and for the list of supported source and target languages for the speech and text modalities.
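
If you want to drive the CLI from a script, one option is to assemble the argument list in Python and hand it to `subprocess`. The sketch below only builds and prints the command; it uses the flags shown in the examples above and nothing else. The helper name `build_m4t_predict_cmd` and the language codes `eng` and `fra` are placeholders chosen for this example.

```python
import shlex
from typing import Optional


def build_m4t_predict_cmd(
    input_: str,
    task: str,
    tgt_lang: str,
    src_lang: Optional[str] = None,
    output_path: Optional[str] = None,
) -> list[str]:
    """Assemble an argument list for the m4t_predict CLI (hypothetical helper)."""
    cmd = ["m4t_predict", input_, "--task", task, "--tgt_lang", tgt_lang]
    if src_lang is not None:
        cmd += ["--src_lang", src_lang]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


cmd = build_m4t_predict_cmd("Hello, world!", "t2tt", tgt_lang="fra", src_lang="eng")
# shlex.join quotes arguments so the printed line can be pasted into a shell.
print(shlex.join(cmd))
# To actually run it (requires the package to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a single string avoids shell quoting problems when the input text contains spaces or punctuation.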

For running S2TT/ASR natively (without Python) using GGML, please refer to the unity.cpp section.

SeamlessExpressive Inference

[!NOTE] Please check the section on how to download the model.

Here’s an example of using the CLI from the root directory to run inference.

expressivity_predict <path_to_input_audio> --tgt_lang <tgt_lang> --model_name seamless_expressivity --vocoder_name vocoder_pretssel --output_path <path_to_save_audio>

SeamlessStreaming and Seamless Inference

The Streaming Evaluation README has detailed instructions for running evaluations for the SeamlessStreaming and Seamless models. The CLI has a --no-scoring option that can be used to skip the scoring part and just run inference.

Please check the inference README for more details.

Running SeamlessStreaming Demo

You can duplicate the SeamlessStreaming HF space to run the streaming demo.

You can also run the demo locally, by cloning the space from here. See the README of the SeamlessStreaming HF repo for more details on installation.

Running SeamlessM4T & SeamlessExpressive Gradio demos locally

To launch the same demo Space we host on Hugging Face locally:

cd demo
pip install -r requirements.txt
python app.py

Resources and usage

Model

SeamlessM4T models

Model Name	#params	checkpoint	metrics
SeamlessM4T-Large v2	2.3B	🤗 Model card - checkpoint	metrics
SeamlessM4T-Large (v1)	2.3B	🤗 Model card - checkpoint	metrics
SeamlessM4T-Medium (v1)	1.2B	🤗 Model card - checkpoint	metrics

SeamlessExpressive models

🤗 Model card

To access and download SeamlessExpressive, please request the model artifacts through this request form. Upon approval, you will then receive an email with download links to each model artifact.

Please note that SeamlessExpressive is made available under its own License and Acceptable Use Policy.

SeamlessStreaming models

Model Name	#params	checkpoint	metrics
SeamlessStreaming	2.5B	🤗 Model card - monotonic decoder checkpoint - streaming UnitY2 checkpoint	metrics

Seamless models

The Seamless model is simply the SeamlessStreaming model with the non-expressive vocoder_v2 swapped out for the expressive vocoder_pretssel. Please check out the section above on how to acquire the vocoder_pretssel checkpoint.
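
Conceptually, the relationship between the two models can be pictured as the same pipeline description with a single field changed. The sketch below is purely illustrative: `PipelineSpec` is a hypothetical class invented for this example and is not part of this repository; only the vocoder names come from the text above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineSpec:
    """Hypothetical description of a streaming pipeline; not a class from this repository."""

    name: str
    vocoder_name: str


seamless_streaming = PipelineSpec(name="SeamlessStreaming", vocoder_name="vocoder_v2")
# Seamless: the same pipeline, with only the vocoder replaced by the expressive one.
seamless = replace(seamless_streaming, name="Seamless", vocoder_name="vocoder_pretssel")
print(seamless)
```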

W2v-BERT 2.0 speech encoder

Model Name	#params	checkpoint
W2v-BERT 2.0	600M	🤗 Model card - checkpoint

Here's how you should do a forward pass through the speech encoder:

import torch

from fairseq2.data.audio import AudioDecoder, WaveformToFbankConverter
from fairseq2.memory import MemoryBlock
from fairseq2.nn.padding import get_seqs_and_padding_mask
from fairseq2.data import Collater
from pathlib import Path
from seamless_communication.models.conformer_shaw import load_conformer_shaw_model


# Placeholders: supply the path to an audio file, a torch.device and a torch.dtype.
audio_wav_path, device, dtype = ...

# Decode the audio into a waveform, then convert it to 80-bin filterbank features.
audio_decoder = AudioDecoder(dtype=torch.float32, device=device)
fbank_converter = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=device,
    dtype=dtype,
)
collater = Collater(pad_value=1)

# Load the W2v-BERT 2.0 speech encoder and switch it to evaluation mode.
model = load_conformer_shaw_model("conformer_shaw", device=device, dtype=dtype)
model.eval()

# Read the raw bytes of the audio file.
with Path(audio_wav_path).open("rb") as fb:
    block = MemoryBlock(fb.read())

decoded_audio = audio_decoder(block)
# Batch the features and build the matching padding mask.
src = collater(fbank_converter(decoded_audio))["fbank"]
seqs, padding_mask = get_seqs_and_padding_mask(src)

# Run the encoder frontend and the encoder without tracking gradients.
with torch.inference_mode():
    seqs, padding_mask = model.encoder_frontend(seqs, padding_mask)
    seqs, padding_mask = model.encoder(seqs, padding_mask)

Evaluation

SeamlessM4T Evaluation

To reproduce our results, or to evaluate using the same metrics over your own test sets, please check out the README here.

SeamlessExpressive Evaluation

Below is the script for efficient batched evaluation.

export MODEL_DIR="/path/to/SeamlessExpressive/model"
export TEST_SET_TSV="input.tsv" # Your dataset in a TSV file, with headers "id", "audio"
export TGT_LANG="spa" # Target language to translate into; other options include "fra", "deu", "eng" ("cmn" and "ita" are experimental)
export OUTPUT_DIR="tmp/" # Output directory for generated text/unit/waveform
export TGT_TEXT_COL="tgt_text" # The column in your ${TEST_SET_TSV} holding the reference target text used to calculate the BLEU score. You can skip this argument.
export DFACTOR="1.0" # Duration factor applied to the predicted duration at each position (preddur=DFACTOR*preddur), which affects the output speech rate. A greater value means a slower speech rate (defaults to 1.0). See the expressive evaluation README for details on the duration factors we used.
expressivity_evaluate ${TEST_SET_TSV} \
  --gated-model-dir ${MODEL_DIR} --task s2st --tgt_lang ${TGT_LANG} \
  --audio_root_dir "" --output_path ${OUTPUT_DIR} --ref_field ${TGT_TEXT_COL} \
  --model_name seamless_expressivity --vocoder_name vocoder_pretssel \
  --text_unk_blocking True --duration_factor ${DFACTOR}

Please check out this README section for details.
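
The input TSV expected by the script above can be produced with Python's standard `csv` module. This is a minimal sketch: the column names `id`, `audio` and `tgt_text` come from the comments in the script, while the row contents and the helper `write_test_set` are placeholders invented for this example. It also assumes the `audio` column holds paths that resolve as-is, since `--audio_root_dir` is empty above.

```python
import csv
import io

# Hypothetical rows: ids, audio paths and reference translations are placeholders.
rows = [
    {"id": "utt_0001", "audio": "/path/to/utt_0001.wav", "tgt_text": "Hola, mundo."},
    {"id": "utt_0002", "audio": "/path/to/utt_0002.wav", "tgt_text": "Hasta luego."},
]


def write_test_set(rows, fileobj):
    """Write a tab-separated manifest with 'id', 'audio' and 'tgt_text' columns."""
    writer = csv.DictWriter(
        fileobj,
        fieldnames=["id", "audio", "tgt_text"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


# Written to an in-memory buffer here; to create the file, use
# open("input.tsv", "w", newline="", encoding="utf-8") instead.
buf = io.StringIO()
write_test_set(rows, buf)
print(buf.getvalue())
```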

SeamlessStreaming and Seamless Evaluation

The Streaming Evaluation README has detailed instructions for running evaluations on the SeamlessStreaming and Seamless models.

Unity.cpp

To enable Seamless Communication Everywhere, we implemented unity.cpp so users can run SeamlessM4T models in GGML, a C tensor library that allows easier integration on various platforms.

To transcribe/translate a given audio file:

./ggml/bin/unity --model seamlessM4T_medium.ggml input.wav

For build details and more usage examples, please check out unity.cpp.

Expressive Datasets

We created two expressive speech-to-speech translation datasets, mExpresso and mDRAL, between English and five other languages: French, German, Italian, Mandarin and Spanish. We currently open source the speech-to-text part of mExpresso for out-of-English directions, and we will open source the remaining part of the datasets soon. For details, please check out the README.

SeamlessAlignExpressive

We’re introducing the first expressive speech alignment procedure. Starting with raw data, the expressive alignment procedure automatically discovers pairs of audio segments sharing not only the same meaning, but the same overall expressivity. To showcase this procedure, we are making metadata available to create a benchmarking dataset called SeamlessAlignExpressive, that can be used to validate the quality of our alignment method. SeamlessAlignExpressive is the first large-scale (11k+ hours) collection of multilingual audio alignments for expressive translation. More details can be found on the SeamlessAlignExpressive README.

Converting raw audio to units

Please check out the README here. Note that the SeamlessM4T v1 model uses reduced units, while the other models use non-reduced units.
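
To make the distinction concrete, the sketch below assumes that "reduced" has its usual meaning in unit-based speech translation, namely that each run of consecutive identical units is collapsed into one. The unit values and the helper `reduce_units` are made up for illustration and are not taken from this repository.

```python
from itertools import groupby


def reduce_units(units: list[int]) -> list[int]:
    """Collapse each run of consecutive identical units into a single unit."""
    return [unit for unit, _ in groupby(units)]


# Hypothetical non-reduced unit sequence: repeats encode how long each unit lasts.
units = [71, 71, 71, 12, 12, 305, 71, 71]
print(reduce_units(units))  # [71, 12, 305, 71]
```

Under this assumption, non-reduced sequences keep duration information in the repeats, whereas reduced sequences are shorter and discard it.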

Libraries

Seamless Communication depends on 4 libraries developed by Meta.

fairseq2

fairseq2 is our next-generation open-source library of sequence modeling components that provides researchers and developers with building blocks for machine translation, language modeling, and other sequence generation tasks. All SeamlessM4T models in this repository are powered by fairseq2.

SONAR and BLASER 2.0

SONAR (Sentence-level multimOdal and laNguage-Agnostic Representations) is a new multilingual and multimodal sentence embedding space that outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. SONAR provides text and speech encoders for many languages. SeamlessAlign was mined based on SONAR embeddings.

BLASER 2.0 is our latest model-based evaluation metric for multimodal translation. It is an extension of BLASER, supporting both speech and text. It operates directly on the source signal, and as such, does not require any intermediate ASR system like ASR-BLEU. As in the first version, BLASER 2.0 leverages the similarity between input and output sentence embeddings. SONAR is the underlying embedding space for BLASER 2.0. Scripts to run evaluation with BLASER 2.0 can be found in the SONAR repo.
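
To give some intuition for what "similarity between embeddings" means, the sketch below computes the cosine similarity between toy vectors. It is not BLASER 2.0 itself, whose scoring is implemented in the SONAR repo; the three-dimensional vectors are invented for illustration and stand in for real sentence embeddings, which have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings: a source sentence and two candidate translations.
source = [0.2, 0.9, 0.1]
faithful = [0.25, 0.85, 0.05]  # points in nearly the same direction as the source
unrelated = [0.9, -0.1, 0.4]   # points in a very different direction

print(cosine_similarity(source, faithful))
print(cosine_similarity(source, unrelated))
```

In a shared embedding space, a translation that preserves the meaning of the source is expected to land closer to it than an unrelated sentence, which is why the first score is higher than the second.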

stopes

As part of the Seamless Communication project, we've extended the stopes library. Version 1 provided a text-to-text mining tool to build training datasets for translation models. Version 2 has been extended, thanks to SONAR, to support tasks around training large speech translation models. In particular, we provide tools to read/write the fairseq audiozip datasets and a new mining pipeline that can do speech-to-speech, text-to-speech, speech-to-text and text-to-text mining, all based on the new SONAR embedding space.

SimulEval

SimulEval is a library used for evaluating simultaneous translation models. SimulEval also provides a backend for generation using partial/incremental inputs with flexible/extensible states, which is used to implement streaming inference. Users define agents that implement SimulEval's interface and that can be connected together in a pipeline. You can find agents implemented for SeamlessStreaming here.
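
The idea of an agent that consumes partial input and decides at each step whether to wait or to emit output can be sketched in plain Python. The classes below are hypothetical stand-ins written for this README and do not use SimulEval's actual classes or signatures. The toy policy simply waits until `k` segments are buffered ahead of the output, then emits one uppercased segment per step.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyState:
    """Hypothetical agent state: what has been received and emitted so far."""

    source: list[str] = field(default_factory=list)
    source_finished: bool = False
    emitted: int = 0


class ToyWaitKAgent:
    """Toy agent: writes one segment once it is k segments behind the input."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.state = ToyState()

    def push(self, segment: str, finished: bool = False) -> None:
        """Receive one more chunk of partial input."""
        self.state.source.append(segment)
        self.state.source_finished = finished

    def policy(self) -> Optional[str]:
        """Return a segment to write, or None to ask for more input (a read)."""
        s = self.state
        lag = len(s.source) - s.emitted
        if lag >= self.k or (s.source_finished and lag > 0):
            out = s.source[s.emitted].upper()
            s.emitted += 1
            return out
        return None


def run(agent: ToyWaitKAgent, segments: list[str]) -> list[tuple[int, str]]:
    """Feed segments one at a time; record (input step, output) for each write."""
    outputs = []
    last = len(segments) - 1
    for i, seg in enumerate(segments):
        agent.push(seg, finished=(i == last))
        out = agent.policy()
        if out is not None:
            outputs.append((i, out))
    # Once the input is finished, flush whatever is still buffered.
    while (out := agent.policy()) is not None:
        outputs.append((last, out))
    return outputs


print(run(ToyWaitKAgent(k=2), ["a", "b", "c"]))  # [(1, 'A'), (2, 'B'), (2, 'C')]
```

The recorded input step for each write is the kind of information a latency metric needs: it shows how much input had been seen when each piece of output was produced.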

[Legacy] SeamlessM4T v1 instructions

Finetuning SeamlessM4T v1 models

Please check out the README here.

On-device models

Apart from the SeamlessM4T Large (2.3B) and Medium (1.2B) models, we are also releasing a small model (281M) targeted for on-device inference. To learn more about the usage and model details, check out the README here.

SeamlessAlign mined dataset

We open-source the metadata to SeamlessAlign, the largest open dataset for multimodal translation, totaling 270k+ hours of aligned speech and text data. The dataset can be rebuilt by the community based on the SeamlessAlign README.

Citation

If you use Seamless in your work or any models/datasets/artifacts published in Seamless, please cite:

@inproceedings{seamless2023,
   title="Seamless: Multilingual Expressive and Streaming Speech Translation",
   author="{Seamless Communication}, Lo{\"i}c Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-juss{\`a}, Maha Elbayad, Hongyu Gong, Francisco Guzm{\'a}n, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson",
  journal={ArXiv},
  year={2023}
}

License

We have three license categories.

The following non-generative components are MIT licensed as found in MIT_LICENSE:

W2v-BERT 2.0 speech encoder
Code
Text only part of the mExpresso dataset found in the SeamlessExpressive README.
UnitY2 forced alignment extractor found in the UnitY2 Aligner README.
Speech toxicity tool with the etox dataset found in the ETOX README.
MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector, found in the MuTox README.

The following models are CC-BY-NC 4.0 licensed as found in the LICENSE:

SeamlessM4T models (v1 and v2).
SeamlessStreaming models.

The following models are Seamless licensed as found in SEAMLESS_LICENSE:

Seamless models.
SeamlessExpressive models.