DevOps AI Assistant Open Leaderboard

This project tracks, ranks, and evaluates DevOps AI Assistants across knowledge domains.

📅 Book a time on my calendar or email derek@opstower.ai to chat about these benchmarks.

🏆 Current Leaderboard

AWS Services (dataset)

| Name | Accuracy | Median Duration (s) | Created At |
|------|----------|---------------------|------------|
| OpsTower.ai | 92% 🏆 | 29 | 2023-09-17 |
| ReleaseAI | 72% | 11 | 2023-09-17 |

AWS CloudWatch Metrics (dataset)

| Name | Accuracy | Median Duration (s) | Created At |
|------|----------|---------------------|------------|
| OpsTower.ai | 89% 🏆 | 42 | 2023-09-17 |
| ReleaseAI | 56% | 20 | 2023-09-18 |

AWS Billing (dataset)

| Name | Accuracy | Median Duration (s) | Created At |
|------|----------|---------------------|------------|
| OpsTower.ai | 91% 🏆 | 53 | 2023-09-18 |
| ReleaseAI | 73% | 23 | 2023-09-18 |

kubectl (dataset)

| Name | Accuracy | Median Duration (s) | Created At |
|------|----------|---------------------|------------|
| abhishek-ch/kubectl-GPT | 83% 🏆 | 5 | 2023-09-19 |
| devinjeon/kubectl-gpt | 50% | 1 | 2023-09-19 |
| mico | 17% | 1 | 2023-09-19 |

Metrics: Accuracy is the percentage of dataset questions the assistant answered correctly (as judged by the dynamic evaluation described below), and Median Duration (s) is the median number of seconds the assistant took to answer a question.

What is a DevOps AI Assistant?

A DevOps AI Assistant is an LLM-backed autonomous agent that helps DevOps engineers perform their daily tasks. These assistants connect to external systems like AWS and Kubernetes to perform actions on behalf of the user.

List of DevOps AI Assistants

This list only includes assistants that can be invoked from the command line or via a REST API, are functional, and are available for immediate use (i.e., not in a private beta).

| Name | Focus | Evaluated? |
|------|-------|------------|
| aiac | Terraform, kubectl, AWS | No - code generation only |
| aiws | AWS | No - does not decipher command output |
| Aptible AI | ? | No |
| Argon | Kubernetes | No |
| cloud copilot | Azure | No - does not decipher command output |
| k8sgpt | Kubernetes | Planned |
| kubectl-GPT | kubectl | Yes |
| kubectl-gpt | kubectl | Yes |
| KubeCtl-ai | Kubernetes manifests | No - code generation only |
| mico | kubectl | Yes |
| OpsTower.ai | AWS | Yes |
| ReleaseAI | AWS, Kubectl | Yes |

Submit a DevOps AI Assistant for evaluation

Open a PR to submit a DevOps AI Assistant for automated evaluation. To be evaluated, the agent must meet the following criteria:

  1. It can be invoked from the command line or via a REST API.
  2. It is not in a private beta.

Question Datasets

See the datasets/ directory for the question datasets. Each dataset CSV file has 3 columns (a loading sketch follows this list):

  1. question: The question to ask the DevOps AI Assistant.
  2. answer_format: The expected format of the answer, described in natural language.
  3. reference_functions: The reference functions the DevOps AI Assistant should call to answer the question.
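
To make the schema concrete, here is a minimal Python sketch that loads one of the datasets. It assumes the CSV headers exactly match the column names above and that it is run from the repository root; it is illustrative only.

```python
# Minimal sketch: print each row of a question dataset, assuming the headers
# match the columns listed above (question, answer_format, reference_functions).
import csv

with open("datasets/aws_services.csv", newline="") as f:
    for row in csv.DictReader(f):
        print("Question:           ", row["question"])
        print("Answer format:      ", row["answer_format"])
        print("Reference functions:", row["reference_functions"])
        print("---")
```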

List of datasets:

| Name | Example Question |
|------|------------------|
| aws_cloudwatch_metrics.csv | Were there any Lambda invocations that lasted over 30 seconds in the last day? |
| aws_services.csv | Did our ec2 instances have any unexpected reboots or terminations over the past 7 days? |
| aws_billing.csv | Which region has the highest AWS expenses for me over the past 3 months? |
| kubectl.csv | How many pods are currently running in the default namespace? |

Evaluation Process

  1. Iterate over each question in the dataset and store the DevOps AI Assistant's answer along with how long it took to respond.
  2. Iterate over the answer results, using the dynamic eval prompt to compare each answer to the truth answer. This generates a confidence score and a short explanation for background on the score (see the sketch after this list).
  3. Store the results in the results/ directory.
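
In code form, the process looks roughly like the sketch below. The helpers ask_assistant, run_reference_functions, and grade_with_llm are hypothetical stand-ins for the assistant invocation, the reference-function execution, and the dynamic eval prompt; the actual implementation in this repository may differ.

```python
# Hypothetical sketch of the evaluation loop; ask_assistant, run_reference_functions,
# and grade_with_llm are illustrative stand-ins, not this repository's actual API.
import csv
import json
import statistics
import time

def evaluate(dataset_path, results_path, ask_assistant, run_reference_functions, grade_with_llm):
    results = []
    with open(dataset_path, newline="") as f:
        for row in csv.DictReader(f):
            # Step 1: ask the assistant and record its answer and response duration.
            start = time.time()
            answer = ask_assistant(row["question"])
            duration = time.time() - start

            # Step 2: build the truth answer from the reference functions, then have
            # the eval LLM compare it to the assistant's answer.
            truth = run_reference_functions(row["reference_functions"])
            score, explanation = grade_with_llm(row["question"], answer, truth)

            results.append({
                "question": row["question"],
                "answer": answer,
                "duration_s": round(duration, 1),
                "confidence": score,
                "explanation": explanation,
            })

    # Step 3: persist the per-question results plus a summary for the leaderboard.
    with open(results_path, "w") as out:
        json.dump({
            "results": results,
            "median_duration_s": statistics.median(r["duration_s"] for r in results),
        }, out, indent=2)
```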

A note on dynamic evaluation

A critical component of the evaluation process is the dynamic evaluation. It's not feasible to provide a static answer for most questions as the correct answer is environment-specific. For example, the answer to "What is the average CPU utilization across my EC2 instances?" is not a static answer. It depends on the current state of the EC2 instances.

To solve this, I've stored a set of human-evaluated reference functions that generate the data needed for a correct answer. An LLM prompt then converts that data into a natural-language truth answer. This would be a poor evaluation process if the LLM misinterpreted the returned data, but I have yet to observe significant errors in its reasoning over the function output.
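
As an illustration, a reference function for the EC2 CPU question above might look like the following. This is a hedged sketch using boto3, not one of the repository's actual reference functions.

```python
# Hypothetical reference function sketch (not from this repository): compute the
# average CPUUtilization across all running EC2 instances over the past hour.
from datetime import datetime, timedelta, timezone
import boto3

def average_ec2_cpu_utilization(hours=1):
    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)

    averages = []
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
                StartTime=start,
                EndTime=end,
                Period=3600,
                Statistics=["Average"],
            )
            datapoints = stats["Datapoints"]
            if datapoints:
                averages.append(sum(d["Average"] for d in datapoints) / len(datapoints))

    return sum(averages) / len(averages) if averages else None
```

The dynamic eval prompt would then turn the number such a function returns into a natural-language truth answer for comparison against the assistant's answer.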

Please submit a PR if you believe a reference function is incorrect.

Contact Info

Reach out to derek@opstower.ai if you have general questions about this leaderboard.