Chaos Lemur

This project is a self-hostable application to randomly destroy virtual machines in a BOSH-managed environment, as an aid to resilience testing of high-availability systems. Its main features are:

Although Chaos Lemur recognizes deployments and jobs, it is not possible to select an entire deployment or job for destruction. Entire deployments and jobs will be destroyed over time by chance, given sufficient runs.

Requirements

Java, Maven

The application is written in Java 8 and packaged as a self-executable JAR file, so it can run anywhere that Java is available. Building the application (required for deployment) requires Maven.

Configuration

Since the application is designed to work in a PaaS environment, all configuration is done with environment variables.

| Key | Description |
| --- | --- |
| `<DEPLOYMENT \| JOB>_PROBABILITY` | |
| BLACKLIST | A comma-delimited list of deployments and jobs. Any member of the deployment or job will be excluded from destruction. Default is blank, i.e. all members of all deployments and jobs are eligible for destruction. Can be combined with WHITELIST (see below). |
| DEFAULT_PROBABILITY | The default probability for a VM to be destroyed, ranging from 0.0 (will never be destroyed) to 1.0 (will always be destroyed). The probability is per run, with each run independent of any other. Default is 0.2. |
| DRYRUN | Causes Chaos Lemur to omit the actual destruction of VMs, but work properly in all other respects. Default is false. |
| SCHEDULE | The schedule to trigger a run of Chaos Lemur. Defined using Spring cron syntax, so `0 0/10 * * * *` would run every 10 minutes. Default is `0 0 * * * *` (once per hour, on the hour). |
| WHITELIST | A comma-delimited list of deployments and jobs. If specified, only members of the deployment or job will be considered for destruction. If WHITELIST is not specified or blank, all deployments and jobs are eligible for destruction. Default is blank. Can be combined with BLACKLIST (see below). |
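
Because each run is independent, the effect of a probability compounds over the schedule. As a rough sketch (it treats a destroyed VM as simply gone and ignores whatever may recreate it afterwards), the chance that one VM survives `n` runs at per-run probability `p` is `(1 - p)^n`. The helper name `survival_probability` is made up for this illustration and only needs `awk`.

```sh
# survival_probability P N
# Prints the chance that a single VM survives N independent runs when each
# run destroys it with probability P. Illustrative arithmetic only.
survival_probability() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - p) ^ n }'
}

survival_probability 0.2 1    # prints 0.8000
survival_probability 0.2 24   # prints 0.0047
```

With the default probability of 0.2 and the default hourly schedule, an eligible VM therefore has roughly a 0.5% chance of going a full day untouched.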

BLACKLIST and WHITELIST can be used individually as noted above. They can also be combined for more complex filtering. The list of deployments and jobs is filtered first by excluding anything not in the whitelist, and then by excluding everything in the blacklist.

For example, say you have a BOSH environment with three deployments, 'cf', 'redis', and 'mysql', and plan to add more deployments over time. You only want the 'dea' and 'router' jobs in the 'cf' deployment to be eligible for destruction. One option is to BLACKLIST: "redis, mysql" as well as all the jobs in 'cf' except for 'dea' and 'router' (e.g. BLACKLIST: "nfs_server, ccdb, uaadb, ha_proxy, ..."). With this approach, whenever you add a deployment, or new jobs are added to the 'redis' and 'mysql' deployments, you must remember to add them to the blacklist as well. Alternatively, you could WHITELIST: "cf" and then BLACKLIST: "nfs_server, ccdb, uaadb, ha_proxy, ...". With this approach you need only manage the blacklist of 'cf' as its jobs change over time.
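
The filtering order described above (whitelist first, then blacklist) can be sketched in shell as follows. This illustrates the documented behaviour and is not Chaos Lemur's implementation: the function names `in_list` and `is_eligible` are hypothetical, matching is assumed to be exact and case-sensitive, the blacklist is the abbreviated one from the example, and the job name `redis_node` is invented.

```sh
# in_list NAME "a, b, c"
# Succeeds if NAME is an entry of the comma-delimited list (whitespace ignored).
in_list() {
  list=$(printf '%s' "$2" | tr -d '[:space:]')
  case ",$list," in
    *",$1,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# is_eligible DEPLOYMENT JOB
# Succeeds if a VM in that deployment and job may be destroyed, given the
# WHITELIST and BLACKLIST environment variables.
is_eligible() {
  deployment=$1 job=$2
  whitelist=$(printf '%s' "${WHITELIST:-}" | tr -d '[:space:]')
  # Step 1: a non-blank whitelist excludes anything it does not mention.
  if [ -n "$whitelist" ]; then
    in_list "$deployment" "$whitelist" || in_list "$job" "$whitelist" || return 1
  fi
  # Step 2: the blacklist then excludes anything it does mention.
  if in_list "$deployment" "${BLACKLIST:-}" || in_list "$job" "${BLACKLIST:-}"; then
    return 1
  fi
  return 0
}

WHITELIST="cf"
BLACKLIST="nfs_server, ccdb, uaadb, ha_proxy"
is_eligible cf dea           && echo "cf/dea: eligible"
is_eligible cf ccdb          || echo "cf/ccdb: excluded by the blacklist"
is_eligible redis redis_node || echo "redis/redis_node: excluded by the whitelist"
```

Because the whitelist is applied first, a deployment added later is excluded automatically unless it is named in WHITELIST.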

Infrastructure

Chaos Lemur requires an infrastructure to be configured, so you must set the AWS, OPENSTACK, or VSPHERE values, or SIMPLE_INFRASTRUCTURE.

| Key | Description |
| --- | --- |
| AWS_ACCESSKEYID | Gives Chaos Lemur access to your AWS infrastructure to destroy VMs. |
| AWS_REGION | The AWS region in which to kill VMs. Default is us-east-1. |
| AWS_SECRETACCESSKEY | Used with AWS_ACCESSKEYID to give AWS access. |
| DIRECTOR_HOST | The BOSH Director host to query for destruction candidates. |
| DIRECTOR_PASSWORD | Used with DIRECTOR_HOST to give BOSH Director access. |
| DIRECTOR_USERNAME | Used with DIRECTOR_HOST to give BOSH Director access. |
| OPENSTACK_ENDPOINT | The OpenStack API endpoint used to destroy VMs. |
| OPENSTACK_PASSWORD | Used with OPENSTACK_ENDPOINT to give OpenStack access. |
| OPENSTACK_TENANT | Used with OPENSTACK_ENDPOINT to identify the OpenStack tenant whose VMs are to be destroyed. |
| OPENSTACK_USERNAME | Used with OPENSTACK_ENDPOINT to give OpenStack access. |
| SIMPLE_INFRASTRUCTURE | Chaos Lemur will use its built-in infrastructure rather than AWS, OpenStack, or vSphere. Useful for testing. The value of the variable is not read, but something is required for Cloud Foundry (e.g. 'true'). |
| VSPHERE_HOST | The vSphere host used to destroy VMs. |
| VSPHERE_PASSWORD | Used with VSPHERE_HOST to give vSphere access. |
| VSPHERE_USERNAME | Used with VSPHERE_HOST to give vSphere access. |
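
Before deploying, it can help to check which infrastructure the current environment would select. The following is a hypothetical preflight check, not part of Chaos Lemur: the function name `infrastructure_choice` is invented, and treating "more than one infrastructure configured" as an error is an assumption of this sketch rather than documented behaviour.

```sh
# infrastructure_choice
# Prints aws, openstack, vsphere, or simple when exactly one infrastructure is
# configured in the environment; otherwise reports the problem and fails.
infrastructure_choice() {
  found=""
  [ -n "${AWS_ACCESSKEYID:-}" ] && [ -n "${AWS_SECRETACCESSKEY:-}" ] && found="$found aws"
  [ -n "${OPENSTACK_ENDPOINT:-}" ] && found="$found openstack"
  [ -n "${VSPHERE_HOST:-}" ] && found="$found vsphere"
  [ -n "${SIMPLE_INFRASTRUCTURE:-}" ] && found="$found simple"
  set -- $found   # split the collected names into $1, $2, ...
  if [ "$#" -ne 1 ]; then
    echo "expected exactly one infrastructure, found:${found:- none}" >&2
    return 1
  fi
  echo "$1"
}
```

For example, with only `SIMPLE_INFRASTRUCTURE=true` set, `infrastructure_choice` prints `simple`.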

Reporting

| Key | Description |
| --- | --- |
| DATADOG_APIKEY | Allows Chaos Lemur to log destruction events to DataDog. If this value is not set, Chaos Lemur will redirect the output to the logger at INFO level. |
| DATADOG_APPKEY | Used with DATADOG_APIKEY to give DataDog access. |
| DATADOG_TAGS | A set of tags to attach to each DataDog event. |

Security

| Key | Description |
| --- | --- |
| SECURITY_BASIC_ENABLED | Enables authentication. Default is true. |
| SECURITY_USER_NAME | The username for authenticating to the API. Default is user. |
| SECURITY_USER_PASSWORD | The password for authenticating to the API. Default is randomly generated by Spring Boot Security and printed at INFO level when the application starts up. See the Spring Boot security reference documentation for additional details. We recommend setting your own username and password, or at least noting the default when it is logged rather than retrieving it later. |

Services

Chaos Lemur can use Redis to persist its status across restarts (e.g. if the PaaS environment forces a reboot). If a Redis Cloud instance called chaos-lemur-persistence exists, Chaos Lemur will use it. Chaos Lemur only requires a few bytes of storage, so the smallest Redis plan should be sufficient.

Deployment

The following instructions assume that you have created an account and installed the cf command line tool.

To automate the deployment process as much as possible, the project contains a Cloud Foundry manifest. To deploy, run the following commands:

```sh
mvn clean package
cf push
```

To confirm that Chaos Lemur has started correctly, run:

```sh
cf logs chaos-lemur --recent
```
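
Since configuration is supplied through environment variables, one cautious way to try a first deployment is to push without starting, set a few variables, and then start the application. The sketch below assumes the application is named `chaos-lemur` (as in the log command above) and that the manifest does not already set these variables. `deploy_dry_run` is a hypothetical helper name, and a real deployment will likely also need the DIRECTOR_* values and any other keys your environment requires.

```sh
# deploy_dry_run [APP_NAME]
# Builds the JAR, pushes it without starting, enables DRYRUN and the built-in
# infrastructure, then starts the app. Stops at the first failing command.
deploy_dry_run() {
  app=${1:-chaos-lemur}
  mvn clean package &&
    cf push --no-start &&
    cf set-env "$app" DRYRUN true &&
    cf set-env "$app" SIMPLE_INFRASTRUCTURE true &&
    cf start "$app"
}
```

Once the logs show runs behaving as expected, DRYRUN can be switched off and the real infrastructure values set in the same way.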

API

Chaos Lemur is designed to run continuously, destroying VMs on a definable schedule. To help with testing and development it is possible to pause and resume destroys using its RESTful API. All data is sent and received as application/json.

The API requires credentials provided via Basic Authentication. See the SECURITY_USER_NAME and SECURITY_USER_PASSWORD environment variables.

| Call | Payload | Status | Description |
| --- | --- | --- | --- |
| GET /state | `{ "status": "[STOPPED \| STARTED]" }` | 200 | |
| POST /state | `{ "status": "STOPPED" }` | 202 | Pause Chaos Lemur indefinitely. |
| POST /state | `{ "status": "STARTED" }` | 202 | Resume Chaos Lemur. |
| POST /chaos | `{ "event": "DESTROY" }` | 202 | Initiate a round of destroys. The destroys will happen even if Chaos Lemur is stopped. Returns the Location of a task for the destroy. |
| GET /task | - | 200 | Reports the same information as GET /task/{id}, for all tasks. |
| GET /task/{id} | - | 200 | Reports the status (`{ "COMPLETE
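
The calls above can be wrapped in a few small shell functions around `curl`. This is a sketch under stated assumptions: `CHAOS_LEMUR_URL` and all function names are hypothetical, the example.com address is a placeholder for your application's route, and the credentials are read from the same SECURITY_USER_NAME and SECURITY_USER_PASSWORD values the application was configured with.

```sh
# Placeholder route; replace with the URL of your own deployment.
CHAOS_LEMUR_URL=${CHAOS_LEMUR_URL:-https://chaos-lemur.example.com}

# chaos_lemur METHOD PATH [JSON_BODY]
# Sends one authenticated request and prints the response, headers included.
chaos_lemur() {
  method=$1 path=$2 body=${3:-}
  set -- --silent --show-error --include \
         --user "${SECURITY_USER_NAME}:${SECURITY_USER_PASSWORD}" \
         --request "$method" --header "Accept: application/json"
  if [ -n "$body" ]; then
    set -- "$@" --header "Content-Type: application/json" --data "$body"
  fi
  curl "$@" "${CHAOS_LEMUR_URL}${path}"
}

show_state()    { chaos_lemur GET  /state; }
pause_chaos()   { chaos_lemur POST /state '{ "status": "STOPPED" }'; }
resume_chaos()  { chaos_lemur POST /state '{ "status": "STARTED" }'; }
trigger_chaos() { chaos_lemur POST /chaos '{ "event": "DESTROY" }'; }
list_tasks()    { chaos_lemur GET  /task; }
```

The `--include` flag prints response headers, so the Location returned by `trigger_chaos` is visible and can be passed to a follow-up `chaos_lemur GET` call for that task.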

Developing

The project is set up as a Maven project and doesn't have any special requirements beyond that. It has been created using IntelliJ and contains configuration information for that environment, but should work with other IDEs.

License

The project is released under version 2.0 of the Apache License.