Cassandra + Spark + Zeppelin
This repository contains two docker-compose scripts. The first creates two Docker containers: one with a Zeppelin instance and one with a Cassandra node. The second starts four containers: one with Zeppelin and three forming a three-node Cassandra cluster.
Configuration and Installation
Make sure you have a working Docker and docker-compose installation running on a 64-bit system, either directly on a Mac or Linux machine or in a VirtualBox (or similar) VM running a 64-bit guest. In the latter case you will end up running Docker inside a VM, which is fine for testing and learning purposes.
To install and configure Docker and Docker Compose, follow the steps described at https://docs.docker.com/compose/install/ and https://docs.docker.com/engine/installation/linux/ubuntu/ (the latter is for Ubuntu-based Linux systems).
As a last step, clone this repository (you might first need to run apt-get install git):
git clone https://github.com/academyofdata/cassandra-zeppelin
Starting a single node Cassandra + Zeppelin instance
Once the docker & docker-compose prerequisites are met and the repository is cloned (example below assumes it is cloned in a folder called cassandra-zeppelin), do the following
cd cassandra-zeppelin
docker-compose build
docker-compose up -d
Assuming that you haven't encountered problems during build or run phase, you can now test that the containers are running by issuing the following command
docker ps
which should produce output similar to the following
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
110e8f4b16b3 zeppelin_zeppelin "bin/zeppelin.sh" 4 days ago Up 3 days 0.0.0.0:4040->4040/tcp, 0.0.0.0:8080-8081->8080-8081/tcp zeppelin_zeppelin_1
bbb70c263987 cassandra:3.9 "/docker-entrypoint.s" 4 days ago Up 3 days 0.0.0.0:7000-7001->7000-7001/tcp, 0.0.0.0:7199->7199/tcp, 0.0.0.0:9042->9042/tcp, 0.0.0.0:9160->9160/tcp zeppelin_cassandra_1
(Pay particular attention to the STATUS column: it should say Up and not Exited.) Once the containers are running, go to http://virtualmachineip:8080 (replace virtualmachineip with the IP of your own VirtualBox VM or local machine) and you should see the Zeppelin interface.
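The STATUS check described above can also be scripted. The following is a minimal sketch, not part of this repository: the helper name not_running and the NAME|STATUS line format are assumptions made for illustration. It reads one container per line and prints the names of those whose status does not start with Up.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository).
# Reads "NAME|STATUS" lines on stdin and prints the name of every
# container whose STATUS column does not start with "Up".
not_running() {
    awk -F '|' '$2 !~ /^Up/ { print $1 }'
}

# Usage on the Docker host (needs a running Docker daemon):
#   docker ps -a --format '{{.Names}}|{{.Status}}' | not_running
# No output means every listed container reports Up.
```

An empty result only tells you that the containers are up; Zeppelin itself may still need a few more seconds before the web interface responds.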
Starting a Zeppelin instance connected to a Cassandra cluster (with 3 nodes)
PLEASE NOTE If you've previously started other containers with Zeppelin (for instance the Zeppelin + a single Cassandra node as outlined above), make sure to stop them before starting the instance connected to the cluster. You can do that with
docker-compose stop
Otherwise there will be port conflicts when attempting to start the new cluster and the new Zeppelin instance.
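To see whether a leftover container still holds one of the ports, you can filter the PORTS column of docker ps. This is a rough sketch under stated assumptions: the helper name containers_on_port and the NAME|PORTS line format are made up for illustration, and only ports that appear as a single mapping or at either end of a published range are detected.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository).
# Reads "NAME|PORTS" lines on stdin and prints the names of containers
# that publish the given host port. Matches ":8080->", ":8080-8081->"
# and ":8079-8080->"; ports in the middle of a range are not detected.
containers_on_port() {
    awk -F '|' -v p="$1" '$2 ~ (":([0-9]+-)?" p "(-[0-9]+)?->") { print $1 }'
}

# Usage on the Docker host (needs a running Docker daemon):
#   docker ps --format '{{.Names}}|{{.Ports}}' | containers_on_port 8080
#   docker ps --format '{{.Names}}|{{.Ports}}' | containers_on_port 9042
```

If either command prints a container name, stop that container before starting the cluster configuration.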
Start this more complex configuration by issuing the command below (in the same folder where you've cloned this git repository)
docker-compose -f docker-cluster.yml up -d
After starting, check that the containers are running (docker ps -a), wait a few seconds (20-30 should be enough), log into one of the Cassandra nodes (docker exec -ti zeppelin_node01_1 bash) and check the cluster status (run this inside the container)
nodetool status
If the cluster started correctly, you should see a few lines of output, three of them starting with UN, like this
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.17.0.3 110.13 KiB 256 67.6% 5460abe0-cf14-4d87-bf11-04f4ccd3f14c rack1
UN 172.17.0.2 108.46 KiB 256 62.0% 17d1e7cd-2ff6-4397-8495-a42c12a3807f rack1
UN 172.17.0.4 103.09 KiB 256 70.4% 70d2d32c-d7cd-4662-9e98-906167b0e4b7 rack1
This means that all the nodes are up (U) and operating normally (N)
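Instead of waiting a fixed 20-30 seconds, the UN check can be polled. The sketch below is only an illustration: the function names count_un_nodes and wait_for_cluster and the POLL_INTERVAL variable are assumptions, not something this repository provides. It counts the lines of nodetool status output whose first column is UN and retries until the expected number of nodes is reached.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this repository).

# Counts the node lines marked UN (Up/Normal) in `nodetool status`
# output read from stdin. Always prints a number, 0 if none are found.
count_un_nodes() {
    awk '$1 == "UN" { n++ } END { print n + 0 }'
}

# wait_for_cluster EXPECTED TRIES COMMAND...
# Runs COMMAND up to TRIES times and returns 0 as soon as at least
# EXPECTED nodes report UN, otherwise 1. Sleeps POLL_INTERVAL seconds
# (default 5) after each unsuccessful attempt.
wait_for_cluster() {
    expected="$1"
    tries="$2"
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if [ "$("$@" 2>/dev/null | count_un_nodes)" -ge "$expected" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "${POLL_INTERVAL:-5}"
    done
    return 1
}

# Usage on the Docker host, assuming the container name used above:
#   wait_for_cluster 3 12 docker exec zeppelin_node01_1 nodetool status
```

With the values in the usage line the loop gives the cluster about a minute to come up before reporting failure.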
Bulk-Loading data in Cassandra
PLEASE NOTE If you already have a 'test' keyspace it's better to drop it before executing the steps below.
To create a new "test" keyspace with all the required tables and load all the exercise data into it, run the following command inside the Cassandra container
apt-get update && apt-get install -y wget && wget -qO- https://raw.githubusercontent.com/academyofdata/cassandra-zeppelin/master/script.sh | bash
(to log into the container run 'docker exec -ti containers_cassandra_1 bash' from your container host, after you check the exact name of your container with 'docker ps -a')
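The note about dropping an existing 'test' keyspace can be handled from the container shell before running the loader. The following is a minimal sketch: the helper name has_keyspace is an assumption, and it expects the whitespace-separated keyspace list that cqlsh prints for DESCRIBE KEYSPACES.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository).
# Reads the output of `cqlsh -e 'DESCRIBE KEYSPACES'` on stdin and
# succeeds (exit status 0) only if the given keyspace name appears
# as a whole word in that list.
has_keyspace() {
    awk -v ks="$1" '
        { for (i = 1; i <= NF; i++) if ($i == ks) found = 1 }
        END { exit !found }'
}

# Usage inside the Cassandra container. The DROP is destructive, so
# only run it if you really want to discard the existing keyspace:
#   if cqlsh -e 'DESCRIBE KEYSPACES' | has_keyspace test; then
#       cqlsh -e 'DROP KEYSPACE IF EXISTS test;'
#   fi
```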
Connecting Zeppelin to Cassandra
To run queries from Zeppelin against a Cassandra cluster (or a single node), we need to instruct Zeppelin's Cassandra interpreter to connect to the right host. Since the docker-compose configuration makes the Cassandra container (or, when using a cluster, one of the containers) available as the host 'cassandra', we just need to adjust a single configuration value. To do this, click the "Anonymous" button in the top right corner of Zeppelin to open a menu with a few options, one of which is "Interpreter"
<img src="https://github.com/academyofdata/cassandra-zeppelin/blob/master/assets/1.png">
Once on that page, scroll to the Cassandra section and edit the value of cassandra.hosts to read cassandra, as shown below
<img src="https://github.com/academyofdata/cassandra-zeppelin/blob/master/assets/2.png">
NOTE We could configure Zeppelin to connect to any of the hosts when running in the cluster configuration. For this we would first need to amend the docker-compose configuration to also link the other nodes into the Zeppelin container (in the "links" section), and then we could set cassandra.hosts to the hostnames separated by commas (e.g. "cassandra,cassandra2,cassandra3")
Starting containers without docker-compose
Assuming that you already have a running Cassandra container, in order to connect a new zeppelin instance to it run the following
docker run -d -p 8080:8080 -p 8081:8081 -p 4040:4040 -e MASTER="local[*]" -e ZEPPELIN_PORT="8080" -e ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=1g -Dspark.executor.memory=2g" -e SPARK_SUBMIT_OPTIONS="--conf spark.driver.host=localhost --conf spark.driver.port=8081" --link <id_or_name_of_cassandra_container>:cassandra --name zeppelin dylanmei/zeppelin
after the container starts run
docker exec -ti `docker ps --format '{{.Names}}' | grep zeppelin` bash -c "/usr/zeppelin/bin/install-interpreter.sh --name cassandra"
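The backtick substitution above passes every container name containing zeppelin to docker exec, which fails if more than one container matches. A slightly more defensive variant might look like the sketch below. The helper name single_match is an assumption made for illustration: it prints the matching name only when exactly one line on stdin contains the given text.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository).
# Reads container names on stdin (one per line). Prints the name that
# contains the given text if exactly one line matches; otherwise writes
# a message to stderr and returns 1.
single_match() {
    awk -v pat="$1" '
        index($0, pat) { name = $0; n++ }
        END {
            if (n == 1) { print name; exit 0 }
            printf "expected exactly one container matching \"%s\", found %d\n", pat, n > "/dev/stderr"
            exit 1
        }'
}

# Usage on the Docker host (needs a running Docker daemon):
#   name=$(docker ps --format '{{.Names}}' | single_match zeppelin) &&
#       docker exec -ti "$name" bash -c "/usr/zeppelin/bin/install-interpreter.sh --name cassandra"
```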
Starting a Zeppelin only instance
Edit the docker-compose.yml file to read as below
zeppelin:
  image: dylanmei/zeppelin
  environment:
    ZEPPELIN_PORT: 8080
    ZEPPELIN_JAVA_OPTS: >-
      -Dspark.driver.memory=1g
      -Dspark.executor.memory=2g
    SPARK_SUBMIT_OPTIONS: >-
      --conf spark.driver.host=localhost
      --conf spark.driver.port=8081
    MASTER: local[*]
  ports:
    - 8080:8080
    - 8081:8081
    - 4040:4040
  volumes:
    - ./znotebooks:/usr/zeppelin/notebook
and issue the same docker-compose up -d command
get_num_processes
If you get a "get_num_processes() takes no keyword arguments" error, exit cqlsh (but stay in the container shell, do not return to the host system) and run
rm /usr/lib/pymodules/python2.7/cqlshlib/copyutil.so
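If you would rather keep a copy of the file than delete it, renaming it should work as well. This is a sketch under the assumption that cqlsh only picks up the module under its exact copyutil.so name; the helper name disable_copyutil is made up for illustration, and the default path is the one from the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository).
# Moves the compiled copyutil module out of the way instead of deleting
# it, so the change can be undone by renaming the .bak file back.
disable_copyutil() {
    target="${1:-/usr/lib/pymodules/python2.7/cqlshlib/copyutil.so}"
    if [ -f "$target" ]; then
        mv "$target" "$target.bak"
        echo "moved $target to $target.bak"
    else
        echo "nothing to do: $target not found"
    fi
}

# Usage inside the Cassandra container:
#   disable_copyutil
```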