# streaming-data-pipeline
Streaming pipeline repo for data engineering training program
See the setup READMEs for the producers and consumers in their respective directories.
# Local environment setup
### Prerequisites
- Make sure you have sbt installed.
- Make sure you have docker installed and running.
- Make sure you don't have a previous instance of Zookeeper, Kafka or Spark running before executing the script; otherwise it won't be able to allocate the ports it needs (see the port-check sketch after this list).
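A minimal pre-flight sketch for that last prerequisite, assuming the services bind their usual default ports (2181 for Zookeeper, 9092 for Kafka, 7077 and 8080 for Spark); the port mapping in this repo's Docker setup may differ:

```sh
# Hypothetical pre-flight check: warn if a leftover process is already
# listening on a port the containers will try to bind.
# Port numbers are the common defaults; adjust to match your setup.
for port in 2181 9092 7077 8080; do
  if lsof -i ":$port" -sTCP:LISTEN >/dev/null 2>&1; then
    echo "Port $port is already in use; stop the process holding it first."
  fi
done
```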
### Steps
- Run `./sbin/buildAndRunLocal.sh`. This creates various Docker containers (each with an independent purpose) for running and testing this setup on your local machine.
- If everything is up and running, you should be able to see data in Hadoop. To check for data, run `docker ps | grep hadoop`.
- You should see at least one container referencing hadoop (we can ignore hadoop_seed for now). Run `docker exec -it $CONTAINER_ID bash`, then inside the container run `/usr/local/hadoop/bin/hadoop fs -ls /free2wheelers/stationMart/data` (the whole check is scripted in the sketch after this list).
- Tada! We have data! (if you don't -- something went wrong, check "Considerations")
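The Hadoop check above can be scripted end to end; a small sketch (the `hadoop` name filter and the HDFS path come straight from the steps above, everything else is standard docker CLI):

```sh
# Find the first running hadoop container (skipping hadoop_seed)
# and list the stationMart output directory inside it.
CONTAINER_ID=$(docker ps --filter "name=hadoop" --format '{{.ID}} {{.Names}}' \
  | grep -v hadoop_seed | awk '{print $1}' | head -n 1)
docker exec "$CONTAINER_ID" /usr/local/hadoop/bin/hadoop fs -ls /free2wheelers/stationMart/data
```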
### Considerations
- Your Docker machine may need at least: CPUs: 2, Memory: 4 GiB, Swap: 512 MiB; remember to "Apply & Restart".
- While the script is running, run `docker stats` for some insights.
- There's a script for stopping: `./sbin/stopAndRemoveLocal.sh`; try stopping and restarting.
- If you're interested in execution logs: `docker logs $CONTAINER_ID` (see the example below).
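For example, to follow one container's logs live (the `kafka` name filter is just an illustration; swap in any service name from `docker ps`):

```sh
# Follow a single container's logs in real time; --tail limits the backlog.
CONTAINER_ID=$(docker ps --filter "name=kafka" --format '{{.ID}}' | head -n 1)
docker logs --tail 100 -f "$CONTAINER_ID"
```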