Awesome
aurproxy
aurproxy is a load balancer manager with knowledge of Apache Aurora's service discovery mechanism and integration with Aurora's task lifecycle. It is a python application that manages a fully featured but not Aurora-aware load balancer (currently only nginx is supported). When Aurora service discovery events occur, aurproxy detects them, rewrites the load balancer's configuration file and then triggers a graceful restart of the load balancer. Use aurproxy to expose dynamic Aurora service endpoints to services and applications that don't have knowledge of Aurora's service discovery mechanism.
Features
- Battle tested at TellApart
- EC2 Classic-hosted Aurora cluster
- Regularly proxies ~150k QPS in production
- Load balance HTTP (nginx)
- Load balance TCP (nginx)
- Active healthchecks for managed endpoints
- Delayed start for new endpoints
- Traffic ramping for new endpoints
- Simple overload protection
- HTTP traffic mirroring & replay modes
- Using gor
- Route HTTP requests to a pool of endpoints from multiple jobs
- Route HTTP requests based on HTTP route
- Route HTTP requests based on HTTP "Host" header
- Register and deregister aurproxy task instance with upstream services
- DNS (Route53)
- Other load balancers (ELBs)
Try it
Locally
Replace your zk_servers, role, environment, job, etc. in the value of the "config" argument below:
# 1 - Build and push container to your registry - replace registry address.
PACKAGE_VERSION=20150430.0; docker build -t docker.mydomain.com/library/aurproxy:$PACKAGE_VERSION . && docker push docker.mydomain.com/library/aurproxy:$PACKAGE_VERSION
# 2 - Launch container
docker run -t -i --net=host docker.mydomain.com/aurproxy:20150430.0 bash
# 3 - Run setup to configure load balancer
cd /opt/aurproxy && \
python -m "tellapart.aurproxy.command" run \
--setup \
--management-port=31325 \
--config '{"backend": "nginx", "servers": [{"routes": [{"sources": [{"endpoint": "http", "zk_servers": "0.zk.mycluster.mydomain.com:2181,1.zk.mycluster.mydomain.com:2181,2.zk.mycluster.mydomain.com:2181", "environment": "devel", "job": "web", "role": "myrole", "source_class": "tellapart.aurproxy.source.AuroraProxySource"}], "locations": ["/"]}], "hosts": ["default"], "ports": [8080], "context": {"default_server": "True", "location_blacklist": ["/health", "/quitquitquit", "/abortabortabort"]}}]}'
# 4 - Start load balancer
/usr/sbin/nginx -c /etc/nginx/nginx.conf &
# 5 - Re-run #2 without --setup flag.
# 6 - Test from another shell - assuming host networking
# (otherwise you'll have to map port through in #1 above)
# and something to point to in configured ProxySource.
curl 127.0.0.1:8080/robots.txt
On Aurora devcluster
-
Set up Aurora devcluster
-
Build and push container to your registry - replace registry address.
PACKAGE_VERSION=20150430.0; docker build -t docker.mydomain.com/library/aurproxy:$PACKAGE_VERSION . && docker push docker.mydomain.com/library/aurproxy:$PACKAGE_VERSION
-
Edit example/aurproxy_hello_world.aur to match the information for the docker image above.
-
Copy example/aurproxy_hello_world.aur, example/hello_world.aur, example/hello_world.py to your local aurora source directory (will show up in "/vagrant/" in devcluster).
-
"vagrant ssh" into aurora devcluster instance.
-
Install hello_world.py prerequisites:
apt-get install python-pip pip install Flask==0.10.1
-
Create jobs:
aurora job create devcluster/www-data/devel/hello_world /vagrant/hello_world.aur aurora job create devcluster/www-data/devel/aurproxy /vagrant/aurproxy_hello_world.aur
-
Wait for the jobs to come up - the first download of the aurproxy docker image may take a while:
http://192.168.33.7:8081/scheduler/www-data/devel/aurproxy
-
Test from host:
curl 192.168.33.7:8080 # Expect 200 "Hello World!"
-
Find running aurproxy task instance in Aurora web interface, open stderr of "aurproxy" process.
-
Restart hello_world job while watching aurproxy stderr log to see proxy configuration update in action:
aurora job restart devcluster/www-data/devel/hello_world
Deployment Suggestions:
- Make sure that Mesos' Docker containerizer is configured for Host mode networking.
- Currently the default
- If you need aurproxy to listen on a fixed port (say, 80), make sure that it isn't in your ephemeral port range and then set your job up to be run in a dedicated group with a "host" constraint value of "limit:1" in order to ensure only 1 aurproxy task instance per slave for that job.
Traffic Mirroring & Replay Modes:
Aurproxy HTTP traffic mirroring and replay features use gor, which can be set up to mirror a fixed number of queries per second from a full aurproxy task instance over a TCP stream to one or more gor replay servers. Aurproxy finds gor instances using sources, and will update its gor command line to add new endpoints as they appear and remove old ones as they disappear.
To use traffic mirroring:
-
Set up a gor replay server. It can be but doesn't have to be in Aurora. See "traffic replay" for instructions on how to set one up using aurproxy to manage it.
-
Add a gor_process to your Aurproxy task definition.
# max_failures=0 is important: # aurproxy kills the gor process when it updates mirror.sh, # and max_failures=0 means that Aurora won't treat the task as # unhealthy after some number of gor process restarts. gor_process = Process( name='gor', cmdline='/etc/aurproxy/gor/dynamic.sh', max_failures=0)
-
Pass values for mirror_source (an aurproxy source configured to point to your gor replication server(s)), mirror_ports, mirror_max_qps, and mirror_max_update_frequency into aurproxy when starting it.
To use traffic replay:
-
Set up a replay job to run aurproxy "run_replay" with the "--setup" flag and then both run_replay and gor, as per above in the traffic mirroring instructions. run_replay command line example:
cd /opt/aurproxy && \ python -m tellapart.aurproxy.command run_replay \ --management-port 12345 \ --replay-port 12346 \ --replay-source '{"name": "replay", "source_class": "tellapart.aurproxy.source.ApiSource"}' \ --replay-max-qps 1000
Lifecycle
- Setup
- Configure proxy (nginx.conf)
- Start proxy (nginx)
- Run aurproxy
- Initialize metrics collection plugin, if configured.
- Initialize and execute registration plugin, if configured.
- Initialize and start proxy updater.
- Periodically check for update requests signaled by proxy sources and share adjusters.
- If need to update is signaled, update proxy configuration and restart proxy.
- Start aurproxy's web server
- Listens for Aurora lifecycle events - health, shutdown, etc.
- Aurora signals aurproxy task instance should shut down (/quitquitquit)
- Run shutdown handlers.
- Deregister, if registration plugin configured.
- Flush metrics.
- Run shutdown handlers.
Develop
Build
Build the image. EG:
docker build -t docker.mydomain.com/aurproxy:latest .
Test
Set up for testing
pip install virtualenv
virtualenv -p python2.7 --no-site-packages --distribute ~/.virtualenvs/aurproxy
cd ~/src/aurproxy
source ~/.virtualenvs/aurproxy/bin/activate
pip install -r ~/src/aurproxy/requirements.txt
pip install -r ~/src/aurproxy/requirements_dev.txt
Run tests
nosetests --all-modules --where ./tellapart --with-coverage --cover-package=tellapart.aurproxy --cover-inclusive
Terminology
Backends
aurproxy is designed to support using different load balancers as backends. Within aurproxy, a backend extends aurproxy.backends.ProxyBackend and is where the loadbalancer-specific logic and management code should live. nginx is currently the only backend implementation, but an HAProxy backend implementation should be possible.
Endpoints
Endpoints are host+port pairs that represent a running task instance. They don't have to be hosted in Aurora.
Sources
Sources are classes that are responsible for providing, maintaining, and signaling updates to a list of endpoints.
Source Lifecycle
- Setup
- Start
- Set up watches on service discovery data sources, if necessary.
- Ongoing:
- On update (node being added or removed), signal that an update is required.
- Return list of SourceEndpoints whenever requested.
Source Implementations
- AuroraSource, which watches Aurora's service discovery database in Zookeeper and whenever a ServiceInstance is added or removed, signals Aurproxy's ProxyUpdater that an update is required.
- StaticProxySource, which returns a single, statically configured endpoint.
Registration
Active Registration
Active Registration refers to registration events that take place during the normal Aurora lifecycle for Aurproxy task instances - on startup, registration is triggered (if configured), and on graceful shutdown, deregistration is triggered (if configured).
Active Registration Implementations
-
ELB
--registration-arg=region=us-east-1 --registration-arg=elb_names=MyElb --registration-class=tellapart.aurproxy.register.elb.ElbSelfRegisterer
-
Route53
--registration-arg=region=us-east-1 --registration-arg=domain=myproxy.mydomain.com --registration-arg=hosted_zone_id={{hosted_zone}} --registration-arg=ttl=60 --registration-class=tellapart.aurproxy.register.route53.Route53SelfRegisterer
Synchronization
Synchronization refers to a way to run a registration plugin as an administrative script or as an Aurora cron job. It is intended to both remove outdated aurproxy tasks left behind by slaves that may have disappeared or on which task shutdown was not graceful as well as add missing any aurproxy tasks o supported systems (EG: other load balancers or DNS).
Synchronization Implementations
-
ELB
cd /opt/aurproxy && python -m "aurproxy.command" synchronize \ --registration-arg=region=us-east-1 \ --registration-arg=elb_names=MyElb \ --registration-class=tellapart.aurproxy.register.elb.ElbJobRegisterer
-
Route53
cd /opt/aurproxy && python -m "aurproxy.command" synchronize \ --registration-arg=region=us-east-1 \ --registration-arg=domain=myproxy.mydomain.com \ --registration-arg=hosted_zone_id={{hosted_zone}} \ --registration-arg=ttl=60 \ --registration-class=tellapart.aurproxy.register.route53.Route53JobRegisterer
Metrics
Aurproxy supports recording:
- aurproxy metrics (restart attempts, etc.)
- Proxy backend metrics (nginx stats)
Metrics Implementations
-
Librato
--metric-publisher-class=tellapart.aurproxy.metrics.publisher.LibratoMetricPublisher \ --metric-publisher-arg=source={{cluster}}.{{role}}.{{environment}}.{{job}} \ --metric-publisher-arg=period=60 \ --metric-publisher-arg=api_user={{librato_user}} \ --metric-publisher-arg=api_token={{librato_token}}
-
OpenTSDB
--metric-publisher-class=tellapart.aurproxy.metrics.publisher.OpenTSDBMetricPublisher \ --metric-publisher-arg=source={{cluster}}.{{role}}.{{environment}}.{{job}} \ --metric-publisher-arg=period=60 \ --metric-publisher-arg=prefix=apps. \ --metric-publisher-arg=host=localhost \ --metric-publisher-arg=port=4242
Log Aggregation
Currently, the only log aggregation implementation is Sentry. If you pass a sentry_dsn in via the command line, logged exceptions will be recorded to Sentry.
--sentry-dsn='gevent+https://{{sentry_user}}:{{sentry_pass}}@app.getsentry.com/{{sentry_project}}'
Configuration Elements
Aurproxy Configuration (JSON String)
- servers: Required list of one or more ProxyServer dictionaries
- backend: Required name of backend to use. Use 'nginx'.
- context: Optional dictionary. See "context" note below.
ProxyServer
- routes: Optional list of one or more ProxyRoute dictionaries. Use routes for HTTP load balancing. At least one route or one stream must be set.
- streams: Optional list of one or more ProxyStream dictionaries. Use streams for TCP load balancing. At least one route or one stream must be set.
- ports: Required list of one or more ports to listen on. Shouldn't collide with other ports in use. Use dedicated group + host limit constraints to control.
- healthcheck_route: Optional HTTP route (EG: "/health"). Will always return 200 if set. Only applied to ProxyRoutes.
- context: Optional dictionary. See "context" note below.
ProxyRoute
- locations: Required list with one or more HTTP routes (EG: ["/"])
- sources: Required list with one or more ProxySource dictionaries
- empty_endpoint_status_code: Optional status code to return if there are no available endpoints. Defaults to 503 (Service Unavailable) if not specified.
- overflow_sources: Optional list of one or more ProxySource dictionaries
- overflow_threshold_pct: Optional threshold past which traffic will be sent to overflow_source.
ProxyStream
- sources: Required list with one or more ProxySource dictionaries
- overflow_sources: Optional list of one or more ProxySource dictionaries
- overflow_threshold_pct: Optional threshold past which traffic will be sent to overflow_source.
ProxySource (Base)
- source_class: Python class path for source type. EG: tellapart.aurproxy.source.AuroraSource
- share_adjusters: Optional list of ShareAdjuster dictionaries.
AuroraSource
- source_class: 'tellapart.aurproxy.source.AuroraSource'
- role: Required Aurora job role string
- environment: Required Aurora job environment string
- job: Required Aurora job name string
- zk_servers: Required comma-separated list in string of ZooKeeper servers with Aurora Service Discovery registry
- announcer_serverset_path: Optional ZooKeeper path under which Aurora Service Discovery lives. Default: '/aurora/'
- share_adjusters: Optional list of ShareAdjuster dictionaries.
StaticProxySource
- source_class: 'tellapart.aurproxy.source.StaticProxySource'
- name: Required name of static source (EG: 'devnull')
- host: Required host for static endpoint (EG: '127.0.0.1')
- port: Required port for static endpoint (EG: 12345)
- share_adjusters: Optional list of ShareAdjuster dictionaries.
ApiSource
- source_class: 'tellapart.aurproxy.source.ApiSource'
- name: Required name of static source (EG: 'replay')
- source_whitelist: Optional list(str) of source class paths that are allowed to be created under ApiSource.
ShareAdjuster (Base)
- share_adjuster_class: Required python class path for share adjuster type.
RampingShareAdjuster
- share_adjuster_class: 'tellapart.aurproxy.share.adjusters.RampingShareAdjuster'
- ramp_delay: Required, number of seconds before starting to ramp traffic up. (EG: 90)
- ramp_seconds: Required, number of seconds over which to ramp traffic up. (EG: 60)
- update_frequency: Required, how often need to update is signaled to proxy updater.
- curve: Optional, type of curve along which to ramp traffic. Currently only supports "linear".
HttpHealthCheckShareAdjuster
- share_adjuster_class: 'tellapart.aurproxy.share.adjusters.HttpHealthCheckShareAdjuster'
- route: Route to healthcheck. (EG: '/health')
- interval: Number of seconds between checks. (EG: 3)
- timeout: Number of seconds after which to time out a check. (EG: 2)
- unhealthy_threshold: Number of failed checks before a healthy task is marked as being unhealthy. (EG: 2)
- healthy_threshold: Number of failed checks before an unhealthy task is marked as being healthy. (EG: 2)
- http_method: Which http method to use for healthchecks (EG: HEAD or GET)
Context
Some of the configuration elements above support a "context" dictionary. This a "bucket of stuff" that gets passed directly down to the configuration rendering context and is the place to put backend-specific configuration values (EG: "default_server": True)
NginxServerContext
- default_server: boolean, whether to treat this server as the default (vs requiring a Host header match).
- location_blacklist: list(str), nginx-style locations for which to deny requests for this server.
Configuration Example
import json
cmd = 'cd /opt/aurproxy && python -m "tellapart.aurproxy.command" run ' \
'--setup --management-port=31325 --max-update-frequency=10 ' \
'--update-period=2 --sentry-dsn='' --config \''
config = {
"servers": [
{
"routes": [
{
"sources": [
{
"endpoint": "http",
"announcer_serverset_path": "sd/mycluster",
"job": "myjob",
"environment": "prod",
"zk_servers": "0.zk.mycluster.mydomain.com:2181,"
"1.zk.mycluster.mydomain.com:2181,"
"2.zk.mycluster.mydomain.com:2181",
"role": "myrole",
"source_class": "tellapart.aurproxy.source.AuroraProxySource",
"share_adjusters": [
{
"share_adjuster_class": "tellapart.aurproxy.share.adjusters.RampingShareAdjuster",
"ramp_seconds": 60,
"ramp_delay": 10,
"update_frequency": 10,
"curve": "linear"
},
{
"share_adjuster_class": "tellapart.aurproxy.share.adjusters.HttpHealthCheckShareAdjuster",
"route": "/health",
"interval": 3,
"timeout": 2,
"unhealthy_threshold": 2,
"healthy_threshold": 2,
}
]
}
],
"locations": [
"/"
]
}
],
"healthcheck_route": "/healthcheck",
"hosts": [
"default"
],
"ports": [
10080
],
"context": {
"default_server": "True",
"location_blacklist": ["/health", "/quitquitquit", "/abortabortabort"]
}
}
],
"backend": "nginx"
}
cmd += json.dumps(config) + '\''
print cmd