🐧Pengi: An Audio Language Model for Audio Tasks
[Paper] [Checkpoints]
Pengi is an Audio Language Model that leverages transfer learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input and generates free-form text as output. The unified architecture of Pengi enables open-ended and close-ended tasks without any additional fine-tuning or task-specific extensions.
News
[Sep 23] 🐧Pengi is accepted at NeurIPS 2023
Setup
- Install the dependencies with pip install -r requirements.txt. If you have conda installed, you can run the following:
cd Pengi && \
conda create -n pengi python=3.8 && \
conda activate pengi && \
pip install -r requirements.txt
- Download Pengi weights: Pretrained Model [Zenodo]
- Move base.pth and base_no_text_enc.pth under the configs folder
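Before loading the wrapper, it can help to confirm that both checkpoint files ended up in the right place. The sketch below is one way to do that. The helper name missing_checkpoints and the default location (a configs folder in the current working directory) are assumptions made for illustration and are not part of the repository.

```python
from pathlib import Path

# Checkpoint file names taken from the setup steps above.
EXPECTED_CHECKPOINTS = ("base.pth", "base_no_text_enc.pth")


def missing_checkpoints(configs_dir="configs"):
    """Return the expected checkpoint names that are not present in configs_dir.

    Hypothetical helper: the default directory is an assumption about where
    the script is run from.
    """
    root = Path(configs_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```

An empty list means both files are in place; otherwise the list names the files that still need to be moved.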
Supported models
The wrapper supports two models. The base option is the Pengi architecture reported in the paper and shown above. The base_no_text_enc option is the Pengi architecture without the text encoder, using only $m_2$ to encode tokenized text. All models support only 44.1 kHz input audio.
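Since the models expect 44.1 kHz audio, it may be worth checking the sample rate of each file up front. The sketch below reads the rate from a WAV header using Python's standard wave module, which only handles PCM WAV files; other formats would need a different reader. The function names are hypothetical and are not part of the Pengi wrapper.

```python
import wave

REQUIRED_SAMPLE_RATE = 44100  # 44.1 kHz, as stated above


def wav_sample_rate(path):
    """Read the sample rate (in Hz) from a PCM WAV file header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path):
    """Return True if the file's sample rate differs from 44.1 kHz."""
    return wav_sample_rate(path) != REQUIRED_SAMPLE_RATE
```

Files for which needs_resampling returns True would have to be converted to 44.1 kHz with an audio tool of your choice.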
Usage
The wrapper provides an easy way to get Pengi output given audio and text input. The wrapper requires the following inputs:
- config: Choose between "base" or "base_no_text_enc"
- audio_file_paths: List of audio file paths for inference
- text_prompts: List of input text prompts, one for each file in audio_file_paths. Example: ["generate metadata", "generate metadata"]. Refer to Tables 1 and 11 in the paper for prompts and performance. The default recommendation is to use the "generate metadata" prompt
- add_texts: List of additional texts, one for each file in audio_file_paths and each prompt in text_prompts. This is additional text input that the user can provide to guide GPT2.
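Because text_prompts and add_texts are parallel to audio_file_paths, the three lists need to have the same length. A minimal sketch of building them from a single prompt follows. The helper build_inputs is hypothetical and not part of the wrapper; it defaults to the recommended "generate metadata" prompt and an empty additional text.

```python
def build_inputs(audio_file_paths, prompt="generate metadata", add_text=""):
    """Return (audio_file_paths, text_prompts, add_texts) with equal lengths.

    Hypothetical helper: repeats one prompt and one additional text
    for every audio file.
    """
    paths = list(audio_file_paths)
    if not paths:
        raise ValueError("audio_file_paths must not be empty")
    text_prompts = [prompt] * len(paths)
    add_texts = [add_text] * len(paths)
    return paths, text_prompts, add_texts
```

The three returned lists can then be passed together as the audio paths, text prompts, and additional texts described above.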
Supported functions:
- generate: Produces a text response for the given audio files and text prompts
- describe: Produces a text description of the given audio file by concatenating the outputs of predefined text prompts
- get_audio_embeddings: Loads a list of audio files and returns the audio prefix and audio embeddings
- get_prompt_embeddings: Loads a list of text prompts and returns the prompt prefix and embeddings
Text generation
from wrapper import PengiWrapper as Pengi

pengi = Pengi(config="<choice of config>")
# audio_file_paths is a list of audio file paths defined by the caller
generated_response = pengi.generate(
    audio_paths=audio_file_paths,
    text_prompts=["generate metadata"],
    add_texts=[""],
    max_len=30,
    beam_size=3,
    temperature=1.0,
    stop_token=' <|endoftext|>'
)
Audio description
from wrapper import PengiWrapper as Pengi

pengi = Pengi(config="<choice of config>")
# audio_file_paths is a list of audio file paths defined by the caller
generated_summary = pengi.describe(
    audio_paths=audio_file_paths,
    max_len=30,
    beam_size=3,
    temperature=1.0,
    stop_token=' <|endoftext|>'
)
Generate audio, audio prefix and prompt embeddings
audio_prefix, audio_embeddings = pengi.get_audio_embeddings(audio_paths=audio_file_paths)
text_prefix, text_embeddings = pengi.get_prompt_embeddings(prompts=["generate metadata"])
Citation
@inproceedings{deshmukh2023pengi,
author = {Deshmukh, Soham and Elizalde, Benjamin and Singh, Rita and Wang, Huaming},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {18090--18108},
publisher = {Curran Associates, Inc.},
title = {Pengi: An Audio Language Model for Audio Tasks},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/3a2e5889b4bbef997ddb13b55d5acf77-Paper-Conference.pdf},
volume = {36},
year = {2023}
}
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.