octoml-profile

octoml-profile is a Python library and cloud service that enables ML engineers to easily assess the performance and cost of PyTorch models on cloud hardware with state-of-the-art ML acceleration technology.

Whether you're building machine learning models for research, development, or production, benchmarking your AI applications is a necessary step before deployment. An optimally chosen hardware + runtime deployment strategy can reduce cloud costs by more than 10x over default solutions.

With octoml-profile, you can measure the performance and cost of your PyTorch models on different cloud hardware within minutes. Our ML acceleration technology ensures that you get the most accurate and efficient results, so you can make informed decisions about how to optimize your AI applications.

Note: this tool is not designed for profiling individual PyTorch ops on the local machine. Please use torch.profiler for that purpose.
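For local op-level profiling, a minimal torch.profiler sketch (with a placeholder model) looks like this:

import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(128, 10)  # placeholder model
x = torch.randn(32, 128)

# torch.profiler measures individual ops on the local machine,
# which is out of scope for octoml-profile.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))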

Key Features

Limitations

Demos

Latest

Documentation quick links

"Hello World" example

Let's say you have a PyTorch program that performs sentiment analysis with a DistilBert model, and you want to optimize it for cloud deployment. With octoml-profile, you can easily benchmark the predict function on various cloud hardware and try different acceleration techniques to find the optimal deployment strategy.

Distilbert Example
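A minimal sketch of that example follows; the Hugging Face checkpoint name is an assumption, but the predict function is the same one dissected in the profile report section below.

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from octoml_profile import accelerate, remote_profile

# Assumed SST-2 checkpoint; any sequence-classification model works similarly.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

@accelerate
def predict(input: str):
    inputs = tokenizer(input, return_tensors="pt")
    logits = model(**inputs).logits
    predicted_class_id = logits.argmax().item()
    return model.config.id2label[predicted_class_id]

with remote_profile():
    # Call a few times so each backend gets multiple measurement runs.
    for _ in range(3):
        predict("octoml-profile is easy to use")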

Within a few seconds, you will get the runtime and cost numbers that help you pick the optimal hardware and inference engine for deployment.

Function `predict` has 1 profile:
- Profile `predict[1/1]` ran 3 times. (1 discarded because compilation happened)

Instance     Processor           Backend              Backend Time (ms)  Total Time (ms)  Cost ($/MReq)
=======================================================================================================
r6i.large    Intel Ice Lake CPU  torch-eager-cpu                 24.735           52.009          $1.82
g4dn.xlarge  Nvidia T4 GPU       torch-eager-cuda                 5.336           32.610          $4.76
g4dn.xlarge  Nvidia T4 GPU       torch-inductor-cuda              3.249           30.523          $4.46
g4dn.xlarge  Nvidia T4 GPU       onnxrt-cuda                      1.399           28.673          $4.19
-------------------------------------------------------------------------------------------------------
Total time above is `remote backend time + local python code time`,
 in which local python code run time is 27.274 ms.
Graph level profile is located at /tmp/octoml_profile_n603dewx/0/predict_1*

Installation and Getting Started

Issues with installation

Dynamic shapes

This is an experimental feature that requires installing a nightly build of PyTorch. The dynamic-shape graph capture feature is still under active development by the PyTorch team, so your results may vary. If you find any problems, please report them via a GitHub issue.

pip install --pre torch==2.1.0.dev20230416 torchaudio==2.1.0.dev20230416 torchvision==0.16.0.dev20230416 --index-url https://download.pytorch.org/whl/nightly/cpu

By default, the @accelerate decorator will recompile a new graph whenever the input shapes to the graph change. For generative use cases such as text generation, it is inefficient to compile a separate graph for each sequence length. The solution is to turn on "dynamic shapes" for the compiler, which makes graph compilation agnostic to input shapes, resulting in drastically fewer compiled graphs and lower memory usage end to end.

As a toy example:

import torch
from octoml_profile import accelerate, remote_profile

conv = torch.nn.Conv2d(16, 16, 3)

# With `dynamic=True` any model inside will not be specialized to the input shape
@accelerate(dynamic=True)
def predict(x: torch.Tensor):
    return conv(x)

with remote_profile(backends=["r6i.large/onnxrt-cpu"]):
    # The batch size changes on each iteration,
    # but compilation happens only once.
    for i in range(1, 5):
        predict(torch.randn(i, 16, 10, 10))

You can set dynamic=True on any @accelerate usage.

How octoml-profile works

octoml-profile consists of two main components: a Python library and a cloud service. The Python library is used to automatically extract PyTorch models on your local machine and send them to the cloud service for remote benchmarking. The cloud service provides access to different cloud hardware targets that are prepared with various deep learning inference engines. This enables users to optimize and measure their PyTorch models in a variety of deployment configurations.

Architecture Illustration

In the examples above, we first import the octoml-profile Python library and then decorate a predict function with the @accelerate decorator. By default, it behaves like @torch.compile: TorchDynamo is used to extract one or more computation graphs, optimize them, and replace the bytecode inside the function with the optimized version.

When the code is wrapped in the remote_profile() context manager, the behavior of the @accelerate decorator changes. Instead of running the extracted graphs on the local machine, the graphs are sent to one or more remote inference workers for execution and measurement. The run time of the offloaded graphs is referred to as "remote backend time" in the output above.

Code that cannot be captured as a computation graph is not offloaded -- such code runs locally and is shown as "local python". For more details see the local python code section below.

When the remote_profile() context manager is entered, it reserves exclusive access to the hardware specified in the optional backends keyword argument (or to a set of default hardware targets if the argument is omitted). If multiple backends are specified, they run in parallel.

The predict function may contain pre-/post-processing code, non-tensor logic such as control flow, side effects, and multiple models. Only eligible graphs are intelligently extracted and offloaded for remote execution.

As a result, the estimated end-to-end run time of the decorated function on a particular hardware and acceleration engine is remote backend time + local python run time. In the table above, for example, 24.735 ms of backend time plus 27.274 ms of local python time gives the 52.009 ms total on r6i.large. If the local python run time is much smaller compared to the total time, the estimate is fairly accurate, because any difference between the local and remote machines affects only the small local python portion.

Where to apply @accelerate

In general, @accelerate is a drop-in replacement for @torch.compile and should be applied to a function that runs inference with a PyTorch model. When the function is called inside the with remote_profile() context manager, remote execution and profiling are activated. When called without remote_profile(), it behaves just like torch.compile. By default, torch.no_grad() is set in the remote_profile() context, because remote execution does not support training yet.

If you expect the input shape to change especially for generative models, see Dynamic Shapes.

Last but not least, @accelerate should not be used to decorate a function that has already been decorated with @accelerate or @torch.compile.
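A short sketch of these rules, using a hypothetical toy model:

import torch
from octoml_profile import accelerate, remote_profile

linear = torch.nn.Linear(16, 2)

@accelerate  # do not stack with another @accelerate or @torch.compile
def predict(x: torch.Tensor):
    return linear(x).softmax(dim=-1)

# Without remote_profile(), this behaves just like torch.compile
# and runs locally.
predict(torch.randn(4, 16))

# Inside remote_profile(), remote execution and profiling are activated,
# and torch.no_grad() is applied by default.
with remote_profile():
    predict(torch.randn(4, 16))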

The profile report

Most users should be satisfied with just the total run time and cost across different hardware and software backends. However, advanced users, such as ML compiler engineers, may be interested in diving into graph-level performance analysis or reducing the number of graph breaks. This section shows where to find that next level of detail.

The location of the profile report is printed at the end of the total runtime table, for instance:

Graph level profile is located at /tmp/octoml_profile_8o45fe39/0/predict_1*

For each decorated function, the profile report file is suffixed with .profile.txt; e.g., /tmp/octoml_profile_8o45fe39/0/predict_1.profile.txt is the report for the function predict.

An example report for DistilBert is the following:

   Segment                            Samples  Avg ms  Failures
===============================================================
0  Local python                             2  27.227

1  Graph #1
     r6i.large/torch-eager-cpu             20  25.648         0
     g4dn.xlarge/torch-eager-cuda          20   5.381         0
     g4dn.xlarge/torch-inductor-cuda       20   3.208         0
     g4dn.xlarge/onnxrt-cuda               20   1.400         0

2  Local python                             2   0.110
---------------------------------------------------------------

To understand this report, let's first define some terminology.

Terminology:

On function, subgraph, and profile:

To find out what operations each graph contains and where in the source code each graph comes from, take a look at graph_dump.txt in the same directory.

As in the DistilBert example above, when there are only a few segments, the profile shows the linear sequence of subgraph segment runs.

What happens when there are tens or hundreds of segments? In that case, we collapse the linear segment sequence into an abridged summary that shows only the few subgraphs with the highest aggregate run times across their run segments.

For example, a generative encoder-decoder model that produces a large number of run segments will, by default, display an abridged report showing runtime by subgraph instead of runtime by segment. In cases like this, you'll see:

Top subgraph               Avg ms/call  Avg ms/run  Runtime % of e2e  Failures
==============================================================================
Graph #7 (17 calls)      
  r6i.large/onnxrt-cpu          36.979     628.645              70.2         0
  g4dn.xlarge/onnxrt-cuda        5.612      95.403              37.9         0

Graph #4 (1 calls)       
  r6i.large/onnxrt-cpu          43.823      43.823               4.9         0
  g4dn.xlarge/onnxrt-cuda        5.002       5.002               2.0         0

Graph #2 (1 calls)       
  r6i.large/onnxrt-cpu          43.357      43.357               4.8         0
  g4dn.xlarge/onnxrt-cuda        3.154       3.154               1.3         0

4 other subgraphs        
  r6i.large/onnxrt-cpu                      35.892               4.0         0
  g4dn.xlarge/onnxrt-cuda                    4.833               1.9         0

42 local python segments                   143.621
------------------------------------------------------------------------------

Other graphs are hidden. If your output has been abridged in this way but you want the full, sequential results of your profiling run, you can print the report verbosely:

# print_results_to=None silences the default output profile report.
with remote_profile(print_results_to=None) as prof:
    ...
prof.report().print(verbose=True)

Local python segments

You will see Local python segments in the profiling report because TorchDynamo captures only PyTorch code as graphs. Where it cannot continue capturing, a "graph break" occurs and the uncaptured code runs eagerly in Python.

Let's look at the DistilBert example again:

@accelerate
def predict(input: str):
    # Segment 0 (local python): the tokenizer call is not captured as a graph
    inputs = tokenizer(input, return_tensors="pt")
    # Segment 1 (Graph #1): the model's forward pass, offloaded to remote backends
    logits = model(**inputs).logits
    # Segment 2 (local python): .item() and the label lookup run eagerly
    predicted_class_id = logits.argmax().item()
    return model.config.id2label[predicted_class_id]

Its profile report:

   Segment                            Samples  Avg ms  Failures
===============================================================
0  Local python                             2  27.227

1  Graph #1
     r6i.large/torch-eager-cpu             20  25.648         0
     g4dn.xlarge/torch-eager-cuda          20   5.381         0
     g4dn.xlarge/torch-inductor-cuda       20   3.208         0
     g4dn.xlarge/onnxrt-cuda               20   1.400         0

2  Local python                             2   0.110
---------------------------------------------------------------

Segment 1 runs Graph #1, which corresponds to the model. Segment 0 corresponds to the tokenizer call, and Segment 2 to the argmax and label mapping.

Each local python segment runs locally once per invocation. The total estimated time equals the total remote backend time for all graphs plus the total local python time; for torch-eager-cpu above, that is 25.648 ms of remote backend time plus 27.227 + 0.110 = 27.337 ms of local python time.

To print graph breaks and understand more of what TorchDynamo is doing under the hood, see the TorchDynamo Deeper Dive, Torch.FX and PyTorch Troubleshooting pages.

Quota

Each user has a limit on the number of concurrent backends held by the user's sessions, which are automatically created and closed within the remote_profile() context manager. If you find yourself hitting quota limits because you need to run more tasks concurrently, please contact us to increase the limit.

Supported backends

To programmatically access a list of supported backends, please invoke:

import octoml_profile
print(octoml_profile.get_supported_backends())

Supported Cloud Hardware

AWS

Supported Acceleration Libraries

ONNXRuntime

PyTorch

If no backends are specified when calling remote_profile(), defaults determined by the server are used. At the time of writing, the default is ["r6i.large/torch-eager-cpu", "g4dn.xlarge/torch-eager-cuda", "g4dn.xlarge/torch-inductor-cuda"].
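For example (backend names taken from the tables above):

from octoml_profile import remote_profile

# With no backends argument, the server-determined defaults are used.
with remote_profile():
    ...  # call your @accelerate-decorated function here

# Or pick instance/engine pairs explicitly, as listed by
# octoml_profile.get_supported_backends().
with remote_profile(backends=["r6i.large/onnxrt-cpu",
                              "g4dn.xlarge/onnxrt-cuda"]):
    ...  # call your @accelerate-decorated function here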

Data privacy

We know that keeping your model data private is important to you. We guarantee that no other user has access to your data. We do not scrape any model information internally, and we do not use your uploaded data to improve our system -- we rely on you filing GitHub issues or otherwise contacting us to understand your use case and anything else about your model. Here's how our system currently works.

We leverage TorchDynamo's subgraph capture to identify PyTorch-only code, serialize those subgraphs, and upload them to our system for benchmarking. On model upload, we cache your model in AWS S3. This helps with your development iteration speed: you won't have to wait for a full model re-upload after every minor tweak to your model or update to your requested backends list. Any model subgraphs and constants left untouched for four weeks are automatically removed from S3.

Your model subgraphs are loaded onto our remote workers and are cleaned up when each subsequent session is created. Between your session's closure and the next session's startup, serialized subgraphs may remain idle on the worker; no user can access these subgraphs during this interval.

If you still have concerns around data privacy, please contact the team.

Known issues

Waiting for Session

When you create a session, you get exclusive access to the requested hardware. When no hardware is available, new session requests are queued.

OOM for large models

When a function contains too many graph breaks or an individual graph exceeds the memory available on the worker, the remote inference worker may run out of CPU/GPU memory. When this happens, you may get an "Error on loading model component" message. This is known to happen with models like Stable Diffusion. We are actively working on optimizing memory allocation across many subgraphs.

Limitations of TorchDynamo

TorchDynamo is under active development, so you may encounter TorchDynamo-related errors. These should not be fundamental problems, as we expect TorchDynamo to keep improving its coverage. If you find a broken model, please file an issue.

Contact the team