<div>
  <h1> <img src="docs/images/logo.png" height="40" align="top"> OmAgent</h1>
</div>

<p align="center">
  <img src="docs/images/intro.png" width="600"/>
</p>

<p align="center">
  <a href="https://twitter.com/intent/follow?screen_name=OmAI_lab" target="_blank">
    <img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/OmAI_lab">
  </a>
  <a href="https://discord.gg/Mkqs8z5U" target="_blank">
    <img alt="Discord" src="https://img.shields.io/discord/1296666215548321822?style=flat&logo=discord">
  </a>
</p>

<p align="center">
  <a>English</a> | <a href="README_ZH.md">中文</a>
</p>

🗓️ Updates

📖 Introduction

OmAgent is an open-source agent framework designed to streamline the development of on-device multimodal agents. Our goal is to enable agents that can empower a wide range of hardware devices, from smartphones and smart wearables (e.g., glasses) to IP cameras and futuristic robots. To that end, OmAgent provides an abstraction over different types of devices and simplifies the process of connecting them to state-of-the-art multimodal foundation models and agent algorithms, allowing everyone to build compelling on-device agents. Moreover, OmAgent focuses on optimizing the end-to-end computing pipeline in order to provide the most responsive, real-time user interaction experience out of the box.

In summary, OmAgent's key features are device-centric design, graph-based workflow orchestration, and native multimodality.

🛠️ How To Install

1. Deploy the Workflow Orchestration Engine

OmAgent utilizes Conductor as its workflow orchestration engine. Conductor is an open-source, distributed, and scalable workflow engine that supports a variety of programming languages and frameworks. By default, it uses Redis for persistence and Elasticsearch (7.x) as the indexing backend.
It is recommended to deploy Conductor using Docker:

docker-compose -f docker/conductor/docker-compose.yml up -d
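
To confirm the engine is up, you can run a quick check. The ports below are assumptions based on Conductor's usual defaults (API on 8080, UI on 5000) and may differ if the compose file maps them elsewhere:

    # Optional sanity check -- adjust the ports to match docker/conductor/docker-compose.yml
    curl -s http://localhost:8080/health
    # The Conductor UI, if enabled, is typically available at http://localhost:5000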

2. Install OmAgent
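
A minimal from-source setup might look like the following. This is only a sketch, assuming the public GitHub repository (om-ai-lab/OmAgent) and a standard Python environment; the exact package layout and install command may differ, so follow the repository's instructions if they diverge:

    # Hypothetical install steps -- adjust to the project's actual packaging
    git clone https://github.com/om-ai-lab/OmAgent.git
    cd OmAgent
    pip install -e .   # or: pip install -r requirements.txt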

3. Connect Devices

If you wish to use smart devices to access your agents, we provide a smartphone app and corresponding backend, allowing you to focus on agent functionality without worrying about complex device connection issues.

🚀 Quick Start

Hello World

1. Configuration

The container.yaml file is a configuration file that manages dependencies and settings for different components of the system. To set up your configuration:

  1. Generate the container.yaml file:

    cd examples/step2_outfit_with_switch
    python compile_container.py
    

    This will create a container.yaml file with default settings under examples/step2_outfit_with_switch.

  2. Configure your LLM settings in configs/llms/gpt.yml and configs/llms/text_res.yml:

    • Set your OpenAI API key or a compatible endpoint through the environment variables below or by directly modifying the yml files; an illustrative sketch of these files follows this list.
    export custom_openai_key="your_openai_api_key"
    export custom_openai_endpoint="your_openai_endpoint"
    
  3. Update settings in the generated container.yaml (an illustrative excerpt follows this list):

    • Configure the Redis connection settings, including host, port, and credentials, for both the redis_stream_client and redis_stm_client sections
    • Update the Conductor server URL under the conductor_config section
    • Adjust any other component settings as needed
  4. Web search supports multiple providers; you can choose one by modifying the configs/tools/all_tools.yml file.

    1. [Recommended] Use Tavily as the web search tool; the all_tools.yml file should look like this:
    llm: ${sub|text_res}
    tools:
        - ...other tools...
        - name: TavilyWebSearch
          tavily_api_key: ${env|tavily_api_key, null}
    

    You can get the tavily_api_key from here. It starts with tvly-xxx. By setting the tavily_api_key, you can get better search results.

    2. Use Bing search or DuckDuckGo search; the all_tools.yml file should look like this:

    llm: ${sub|text_res}
    tools:
        - ...other tools...
        - name: WebSearch
          bing_api_key: ${env|bing_api_key, null}
    

    For better results, it is recommended to configure Bing Search by setting the bing_api_key.
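
As referenced in step 2, here is an illustrative sketch of what configs/llms/gpt.yml can look like. The field names are assumptions inferred from the ${env|...} substitution syntax used above; treat the files shipped with the example as the source of truth:

    # Illustrative only -- field names are assumptions; check configs/llms/gpt.yml in the example
    name: OpenaiGPTLLM
    model_id: gpt-4o
    api_key: ${env|custom_openai_key, null}
    endpoint: ${env|custom_openai_endpoint, https://api.openai.com/v1}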

For more information about the container.yaml configuration, please refer to the container module.
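
As referenced in step 3, the relevant parts of a generated container.yaml might look roughly like the excerpt below. Only the section names (conductor_config, redis_stream_client, redis_stm_client) come from the steps above; the nesting and field names are assumptions, so always edit the file produced by compile_container.py rather than copying this sketch:

    # Illustrative excerpt -- the generated container.yaml is the authoritative version
    conductor_config:
      base_url: http://localhost:8080   # Conductor server URL (assumed default port)
    connectors:
      redis_stream_client:
        host: localhost
        port: 6379
        password: null
      redis_stm_client:
        host: localhost
        port: 6379
        password: null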

2. Running the Example

  1. Run the outfit with switch example:

    For terminal/CLI usage: Input and output are in the terminal window

    cd examples/step2_outfit_with_switch
    python run_cli.py
    

    For app/GUI usage: Input and output are in the app

    cd examples/step2_outfit_with_switch
    python run_app.py
    

    For app backend deployment, please refer here.
    For the connection and usage of the OmAgent app, please check the app usage documentation.

๐Ÿ— Architecture

The design architecture of OmAgent adheres to three fundamental principles:

  1. Graph-based workflow orchestration;
  2. Native multimodality;
  3. Device-centricity.

With OmAgent, one has the opportunity to craft a bespoke intelligent agent program.

For a deeper comprehension of OmAgent, let us elucidate key terms:

<p align="center"> <img src="docs/images/architecture.jpg" width="700"/> </p>

Basic Principles of Building an Agent
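
As a rough orientation, building an agent follows the graph-based workflow principle above: you declare tasks, wire them into a workflow graph, and register the graph with Conductor. The sketch below is loosely modeled on the step1_simpleVQA example; the module paths, class names, and task names are assumptions and may not match the current API exactly, so use the example projects as the reference implementation.

    # Illustrative sketch only -- imports and APIs are assumptions, see examples/step1_simpleVQA
    from omagent_core.engine.workflow.conductor_workflow import ConductorWorkflow
    from omagent_core.engine.workflow.task.simple_task import simple_task

    # Declare a workflow and its tasks; each task is backed by a registered worker class
    workflow = ConductorWorkflow(name='hello_world_vqa')
    input_task = simple_task(task_def_name='InputInterface', task_reference_name='input')
    vqa_task = simple_task(task_def_name='SimpleVQA', task_reference_name='vqa',
                           inputs={'user_instruction': input_task.output('user_instruction')})

    # Chain the tasks into a graph: the output of one node feeds the next
    workflow >> input_task >> vqa_task

    # Register the workflow with the Conductor server configured in container.yaml
    workflow.register(True)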

Examples

We provide example projects to demonstrate the construction of intelligent agents using OmAgent. You can find a comprehensive list in the examples directory. Here is the suggested order:

  1. step1_simpleVQA illustrates the creation of a simple multimodal VQA agent with OmAgent.

  2. step2_outfit_with_switch demonstrates how to build an agent with switch-case branches using OmAgent.

  3. step3_outfit_with_loop shows the construction of an agent incorporating loops using OmAgent.

  4. step4_outfit_with_ltm exemplifies using OmAgent to create an agent equipped with long-term memory.

  5. dnc_loop demonstrates the development of an agent utilizing the DnC algorithm to tackle complex problems.

  6. video_understanding showcases the creation of a video understanding agent for interpreting video content using OmAgent.

API Documentation

The API documentation is available here.

🔗 Related works

If you are intrigued by multimodal large language models and agent technologies, we invite you to delve deeper into our research endeavors:
🔆 How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (AAAI24)
🏠 GitHub Repository

🔆 OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (IET Computer Vision)
🏠 GitHub Repository

⭐️ Citation

If you find our repository beneficial, please cite our paper:

@article{zhang2024omagent,
  title={OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer},
  author={Zhang, Lu and Zhao, Tiancheng and Ying, Heting and Ma, Yibo and Lee, Kyusong},
  journal={arXiv preprint arXiv:2406.16620},
  year={2024}
}

Third-Party Dependencies

This project includes code from the following third-party projects:

Star History

Star History Chart