# Awesome DevOps AI Assistant Open Leaderboard
This project tracks, ranks, and evaluates DevOps AI Assistants across knowledge domains.
📅 Book a time on my calendar or email derek@opstower.ai to chat about these benchmarks.
## 🏆 Current Leaderboard
### AWS Services (dataset)
Name | Accuracy | Median Duration (s) | Created At |
---|---|---|---|
OpsTower.ai | 92% 🏆 | 29 | 2023-09-17 |
ReleaseAI | 72% | 11 | 2023-09-17 |
### AWS CloudWatch Metrics (dataset)
Name | Accuracy | Median Duration (s) | Created At |
---|---|---|---|
OpsTower.ai | 89% 🏆 | 42 | 2023-09-17 |
ReleaseAI | 56% | 20 | 2023-09-18 |
### AWS Billing (dataset)
Name | Accuracy | Median Duration (s) | Created At |
---|---|---|---|
OpsTower.ai | 91% 🏆 | 53 | 2023-09-18 |
ReleaseAI | 73% | 23 | 2023-09-18 |
### kubectl (dataset)
Name | Accuracy | Median Duration (s) | Created At |
---|---|---|---|
abhishek-ch/kubectl-GPT | 83% 🏆 | 5 | 2023-09-19 |
devinjeon/kubectl-gpt | 50% | 1 | 2023-09-19 |
mico | 17% | 1 | 2023-09-19 |
Metrics:

- **Accuracy**: The percentage of questions that the DevOps AI Assistant answered correctly.
- **Median Duration**: The median duration, in seconds, that it took the DevOps AI Assistant to answer a question.
## What is a DevOps AI Assistant?
A DevOps AI Assistant is an LLM-backed autonomous agent that helps DevOps engineers perform their daily tasks. They connect to external systems like AWS and Kubernetes to perform actions on behalf of the user.
## List of DevOps AI Assistants
This list only includes assistants that can be invoked from the command line or via a REST API, are functional, and are available for immediate use (not in private beta).
Name | Focus | Evaluated? |
---|---|---|
aiac | Terraform, kubectl, AWS | No - code generation only |
aiws | AWS | No - does not decipher command output |
Aptible AI | ? | No |
Argon | Kubernetes | No |
cloud copilot | Azure | No - does not decipher command output |
k8sgpt | Kubernetes | Planned |
kubectl-GPT | kubectl | ✅ |
kubectl-gpt | kubectl | ✅ |
KubeCtl-ai | Kubernetes manifests | No - code generation only |
mico | kubectl | ✅ |
OpsTower.ai | AWS | ✅ |
ReleaseAI | AWS, Kubectl | ✅ |
Terraform AI | Terraform | No - code generation only |
tfgpt | Terraform | No - code generation only |
## Submit a DevOps AI Assistant for evaluation
Open a PR and submit a DevOps AI Assistant for automated evaluation. To be evaluated, the agent must meet the following criteria:
- Can be invoked from the command line or via a REST API.
- Is not in a private beta.
## Question Datasets
See the datasets/ directory for the question datasets. Each dataset CSV file has 3 columns:

- `question`: The question to ask the DevOps AI Assistant.
- `answer_format`: The expected format of the answer, in natural language.
- `reference_functions`: The reference functions that the DevOps AI Assistant should call to answer the question.
List of datasets:
Name | Example Question |
---|---|
aws_cloudwatch_metrics.csv | Were there any Lambda invocations that lasted over 30 seconds in the last day? |
aws_services.csv | Do our EC2 instances have any unexpected reboots or terminations over the past 7 days? |
aws_billing.csv | Which region has the highest AWS expenses for me over the past 3 months? |
kubectl.csv | How many pods are currently running in the default namespace? |
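For example, a dataset can be read with Python's standard csv module. This is a minimal sketch that only assumes the three columns described above and that it is run from the repository root:

```python
import csv

# Print the question, expected answer format, and reference functions for each row of the kubectl dataset.
with open("datasets/kubectl.csv", newline="") as f:
    for row in csv.DictReader(f):
        print("Q:", row["question"])
        print("Expected format:", row["answer_format"])
        print("Reference functions:", row["reference_functions"])
```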
## Evaluation Process
- Iterate over each question in the dataset and store:
  - the answer from the DevOps AI Assistant.
  - the truth answer, derived by executing the human-reviewed reference functions and prompting an LLM to summarize their output into an answer.
- Iterate over the answer results, using the dynamic eval prompt to compare the DevOps AI Assistant's answer to the truth answer. This generates a confidence score and a short explanation of the score.
- Store the results in the results/ directory.
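A minimal sketch of that loop is below. The helpers (`ask_assistant`, `run_reference_functions`, `summarize_to_answer`, `dynamic_eval`) are hypothetical placeholders, not this repository's actual API:

```python
import csv
import json

# Hypothetical hooks -- stand-ins for the repository's actual implementation.
def ask_assistant(question): ...
def run_reference_functions(reference_functions): ...
def summarize_to_answer(data, answer_format): ...
def dynamic_eval(question, assistant_answer, truth_answer): ...

def evaluate(dataset_path, results_path):
    with open(dataset_path, newline="") as f:
        rows = list(csv.DictReader(f))

    results = []
    for row in rows:
        # Step 1: collect the assistant's answer and the truth answer.
        assistant_answer = ask_assistant(row["question"])
        truth_data = run_reference_functions(row["reference_functions"])
        truth_answer = summarize_to_answer(truth_data, row["answer_format"])

        # Step 2: compare both answers with the dynamic eval prompt.
        score, explanation = dynamic_eval(row["question"], assistant_answer, truth_answer)
        results.append({
            "question": row["question"],
            "assistant_answer": assistant_answer,
            "truth_answer": truth_answer,
            "confidence": score,
            "explanation": explanation,
        })

    # Step 3: store the results.
    with open(results_path, "w") as f:
        json.dump(results, f, indent=2)
```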
### A note on dynamic evaluation
A critical component of the evaluation process is the dynamic evaluation. It's not feasible to provide a static answer for most questions as the correct answer is environment-specific. For example, the answer to "What is the average CPU utilization across my EC2 instances?" is not a static answer. It depends on the current state of the EC2 instances.
To solve this, I've stored a set of human-reviewed reference functions that generate the data behind the correct answers. I then use an LLM prompt to turn that data into a natural language answer. This would be a poor evaluation process if the LLM produced an incorrect answer from the returned data, but I have yet to observe significant errors in the LLM's reasoning about the function output.
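To make that concrete, a reference function for the CPU-utilization example above might look roughly like the boto3 sketch below (illustrative only, not one of this repository's actual reference functions):

```python
from datetime import datetime, timedelta, timezone
import boto3

def average_ec2_cpu_utilization(hours=24):
    """Average CPUUtilization across EC2 instances over the last `hours` hours."""
    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",  # no Dimensions: the across-all-instances rollup (covers instances with detailed monitoring)
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=3600,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else None
```

The summarization prompt then turns the returned number into a natural language truth answer that the assistant's answer is compared against.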
Please submit a PR if you believe a reference function is incorrect.
## Contact Info
Reach out to derek@opstower.ai if you have general questions about this leaderboard.