Ferry: Big Data Development Environment using Docker

Ferry lets you launch, run, and manage big data clusters on AWS, OpenStack, and your local machine. It does this by leveraging awesome technologies such as Docker. Ferry currently supports several big data stacks out of the box, including Hadoop/YARN and GlusterFS (an example is shown below).

All you have to do to start is specify your stack using YAML or download a pre-existing application.

Why?

Ferry is made for developers and data scientists who want to develop big data applications without the fuss of setting up the infrastructure.

Because Ferry uses Docker underneath, each virtual cluster is completely isolated. That means you can create multiple clusters for different applications.

Getting started

Ferry is a Python application that runs on your local machine. To get started, all you have to do is install Docker and run pip install -U ferry. More detailed installation instructions and examples can be found here.
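
For example, on a machine that already has Docker available, installation boils down to a single pip command. A minimal sketch (exact steps may vary by platform; see the installation instructions mentioned above):

    # confirm Docker is installed and reachable
    docker --version

    # install or upgrade Ferry from PyPI
    pip install -U ferry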

Once installed, you can create your big data application using YAML files.

   backend:
      - storage:
           personality: "gluster"
           instances: 2
        compute:
           - personality: "yarn"
             instances: 2
   connectors:
      - personality: "hadoop-client"
        name: "control-0"

This stack consists of two GlusterFS data nodes and two Hadoop/YARN compute nodes. There's also an Ubuntu-based client that automatically connects to those backend components. Of course, you can substitute your own customized client.
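
As a sketch of what that substitution might look like, the connectors section simply points at a different personality. Here my-client is an illustrative, hypothetical name for an image you have built yourself, not a built-in Ferry personality:

    connectors:
       - personality: "my-client"
         name: "control-0"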

To create this stack, just type ferry start yarn. Once the stack is up, you can log in to the client by typing ferry ssh control-0.
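
Put together, a typical session looks roughly like this (only the two commands from above are shown; stack and connector names depend on your YAML file):

    # launch the stack defined by the yarn application file
    ferry start yarn

    # open a shell on the Ubuntu-based client connector
    ferry ssh control-0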

Contributing

Contributions are totally welcome.

I strongly recommend using GitHub issues + pull requests for contributions. Tweets sent to @open_core_io are also welcome. Happy hacking!

Under the hood

Ferry leverages some awesome open source projects, most notably Docker, which provides the containers that every cluster node and client runs in, as well as the big data platforms it packages, such as Hadoop/YARN and GlusterFS.