ngods stock market demo

This repository contains a stock market analysis demo of the ngods data stack. The demo performs the following steps:

Download selected stock symbols data from Yahoo Finance API.
Store the stock data in ngods data warehouse (using Iceberg format).
Transform the data (e.g. normalize stock prices) using dbt.
Expose analytics data model using cube.dev.
Visualize data as reports and dashboards using Metabase.
Predict stock prices using ARIMA in Apache Spark.

The demo is packaged as a docker-compose script that downloads, installs, and runs all components of the data stack.

UPDATES

2023-02-03:
- Upgrade to Apache Iceberg 1.1.0
- Upgrade to Trino 406
- Migrate to the new JDBC catalog (removes the heavyweight Hive Metastore)

ngods

ngods stands for New Generation Opensource Data Stack. It includes the following components:

Apache Spark for data transformation
Apache Iceberg as a data storage format
Trino for federated data query
dbt for ELT
Dagster for data orchestration
cube.dev for data analysis and semantic data model
Metabase for self-service data visualization (dashboards)
Minio for local S3 storage

ngods components

ngods is open-sourced under a BSD license and it is distributed as a docker-compose script that supports Intel and ARM architectures.

Running the demo

ngods requires a machine with at least 16GB RAM and an Intel or ARM64 CPU running Docker. It also requires docker-compose.

Clone the ngods repo

git clone https://github.com/zsvoboda/ngods-stocks.git

Start the data stack with the docker-compose up command

cd ngods-stocks

docker-compose up -d

NOTE: This can take a long time, depending on your network speed.

Stop the data stack via the docker-compose down command

docker-compose down

Execute the data pipeline from the Dagster console at http://localhost:3070/ with this yaml config file.

Dagster e2e

Copy and paste the content of the e2e.yaml file into this Dagster UI console page and start the data pipeline by clicking the Launch Run button.

NOTE: You can customize the list of stock symbols that will be downloaded.

Review and customize the cube.dev metrics and dimensions. Test these metrics in the cube.dev playground.

cube.dev playground

See the cube.dev documentation for more information.

Check out the Metabase data visualizations that are connected to the cube.dev analytical model. You can run SQL queries on top of the cube.dev schema.

Use username metabase@ngods.com and password metabase1.

Metabase

You can create your own data visualizations and dashboards. See the Metabase documentation for more information.

Predict the stock close price. Run the ARIMA time-series prediction model notebook, which is trained on 29 months of Apple (AAPL) stock data and predicts the next month.

Jupyter ARIMA
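
The train-then-forecast idea behind the notebook can be sketched in plain Python. This is not the notebook's Spark code: it is a simplified stand-in that fits an ARIMA(1,1,0) model without a constant term by least squares. The function names are hypothetical and any prices you pass in are your own data.

```python
def fit_ar1_on_differences(prices):
    """Estimate phi for an ARIMA(1,1,0) model without a constant term.

    The series is differenced once (the "I" part). phi is then the
    least-squares slope of each difference on the previous difference
    (the "AR" part). Returns phi and the last observed difference.
    """
    diffs = [b - a for a, b in zip(prices, prices[1:])]
    if len(diffs) < 2:
        raise ValueError("need at least three prices to fit the model")
    num = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    den = sum(prev * prev for prev in diffs[:-1])
    phi = num / den if den else 0.0
    return phi, diffs[-1]


def forecast(prices, steps):
    """Forecast `steps` future prices from the fitted model."""
    phi, last_diff = fit_ar1_on_differences(prices)
    level = prices[-1]
    predictions = []
    for _ in range(steps):
        last_diff = phi * last_diff  # predicted next difference
        level = level + last_diff    # integrate back to a price level
        predictions.append(level)
    return predictions
```

To mirror the notebook's setup, you would pass the daily close prices of the training window as `prices` and set `steps` to the number of trading days you want to predict. A real ARIMA implementation also estimates moving-average terms and selects the model order, which this sketch leaves out.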

Download DBeaver SQL tool.
Connect to the Postgres database that contains the gold stage data. Use jdbc:postgresql://localhost:5432/ngods JDBC URL with username ngods and password ngods.

Postgres JDBC connection

Connect to the Trino database that has access to all data stages (bronze, silver, and gold schemas of the warehouse database). Use jdbc:trino://localhost:8060 JDBC URL with username trino and password trino.

Trino JDBC connection

Trino schemas

Connect to the Spark database that is used for data transformations. Use jdbc:hive2://localhost:10009 JDBC URL with no username and password.

Spark JDBC connection

Customizing the demo

This chapter contains useful information for customizing the demo.

ngods directories

Here are a few of the distribution's directories that you may need to customize:

conf configuration of all data stack components
- cube cube.dev schema (semantic model definition)
data main data directory
- minio root data directory (contains buckets and file data)
- spark Jupyter notebooks
- stage file stage data. Spark can access this directory via the /var/lib/ngods/stage path.
projects dbt, Dagster, and DataHub projects
- dagster Dagster orchestration project
- dbt dbt transformations (one project per medallion stage: bronze, silver, and gold)
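
Before customizing, it can help to confirm that a checkout has the layout listed above. The following is a minimal sketch that assumes the nested entries are subdirectories of `conf`, `data`, and `projects`. The helper name `missing_dirs` is hypothetical.

```python
from pathlib import Path

# Directory layout as described in the list above (relative to the repo root).
EXPECTED_DIRS = [
    "conf/cube",
    "data/minio",
    "data/spark",
    "data/stage",
    "projects/dagster",
    "projects/dbt",
]


def missing_dirs(repo_root):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Calling `missing_dirs("ngods-stocks")` from the parent directory of the clone returns an empty list when every directory is present.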

ngods endpoints

The data stack has the following endpoints:

Spark
- http://localhost:8888 - Jupyter notebooks
- jdbc:hive2://localhost:10009 JDBC URL (no username / password)
- localhost:7077 - Spark API endpoint
- http://localhost:8061 - Spark master node monitoring page
- http://localhost:8062 - Spark slave node monitoring page
- http://localhost:18080 - Spark history server page
Trino
- jdbc:trino://localhost:8060 JDBC URL (username trino / no password)
Postgres
- jdbc:postgresql://localhost:5432/ngods JDBC URL (username ngods / password ngods)
Cube.dev
- http://localhost:4000 - cube.dev development UI
- jdbc:postgresql://localhost:3245/cube JDBC URL (username cube / password cube)
Metabase
- http://localhost:3030 Metabase UI (username metabase@ngods.com / password metabase1)
Dagster
- http://localhost:3070 - Dagster orchestration UI
Minio
- http://localhost:9001 - Minio UI (username minio / password minio123)
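
After starting the stack, a quick way to see which services are reachable is to probe the host ports listed above. This is a minimal sketch using only the Python standard library. The port numbers come from the list, the service labels are informal, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Host ports taken from the endpoint list above.
NGODS_PORTS = {
    "Jupyter notebooks": 8888,
    "Spark JDBC (hive2)": 10009,
    "Spark API": 7077,
    "Spark master UI": 8061,
    "Spark worker UI": 8062,
    "Spark history server": 18080,
    "Trino": 8060,
    "Postgres": 5432,
    "cube.dev UI": 4000,
    "cube.dev SQL API": 3245,
    "Metabase UI": 3030,
    "Dagster UI": 3070,
    "Minio UI": 9001,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(ports: dict, host: str = "localhost") -> dict:
    """Map each service label to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

Running `check_endpoints(NGODS_PORTS)` after `docker-compose up -d` returns a dictionary of service labels and booleans. Some services may need a few minutes before their ports open.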

ngods databases: Spark, Trino, and Postgres

The ngods stack includes three database engines: Spark, Trino, and Postgres. Both Spark and Trino have access to Iceberg tables in the warehouse.bronze and warehouse.silver schemas. The Trino engine can also access the analytics.gold schema in Postgres, so it can federate queries between the Postgres and Iceberg tables.

The Spark engine is configured for ELT and pyspark data transformations.

Spark

The Trino engine is configured for data federation between the Iceberg and Postgres tables. Additional catalogs can be configured as needed.

Trino
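
A federated query joins a Postgres table and an Iceberg table in a single SQL statement. The sketch below assumes that `analytics` and `warehouse` are Trino catalog names (inferred from the dotted schema names in this README) and that Trino accepts unauthenticated connections on port 8060. The table names you pass in and the `symbol` join column are hypothetical. `run_query` needs the third-party `trino` Python client.

```python
def build_federated_query(gold_table: str, silver_table: str, key: str = "symbol") -> str:
    """Build SQL that joins a Postgres gold table with an Iceberg silver table.

    Table and column names are inserted as-is, so pass trusted identifiers only.
    """
    return (
        f"SELECT g.{key}, count(*) AS matched_rows\n"
        f"FROM analytics.gold.{gold_table} AS g\n"
        f"JOIN warehouse.silver.{silver_table} AS s ON g.{key} = s.{key}\n"
        f"GROUP BY g.{key}"
    )


def run_query(sql: str, host: str = "localhost", port: int = 8060, user: str = "trino"):
    """Execute sql on Trino and return all rows (requires `pip install trino`)."""
    from trino.dbapi import connect  # imported lazily: third-party dependency

    conn = connect(host=host, port=port, user=user)
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()
```

Replace the placeholder table names with tables that exist in your warehouse. The Trino schema browser in DBeaver shows what is available.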

The Postgres database has access only to the analytics.gold schema and is used for executing analytical queries over the gold data.

Demo data pipeline

The demo data pipeline uses the medallion architecture with bronze, silver, and gold data stages.

data pipeline

It consists of the following phases:

Data are downloaded from the Yahoo Finance REST API to the local Minio bucket (./data/stage) using this Dagster operation.
The downloaded CSV file is loaded to the bronze stage Iceberg tables (warehouse.bronze Spark schema) using dbt models that are executed in Spark (./projects/dbt/bronze).
Silver stage Iceberg tables (warehouse.silver Spark schema) are created using dbt models that are executed in Spark (./projects/dbt/silver).
Gold stage Postgres tables (analytics.gold Trino schema) are created using dbt models that are executed in Trino (./projects/dbt/gold).
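
The three stages above can be illustrated with a small in-memory example. This is not the project's dbt code: the column names, the sample rows, and the choice to normalize each symbol's close price to 100 at its first date are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up sample data standing in for a downloaded CSV file.
RAW_CSV = """symbol,date,close
AAPL,2023-01-03,125.0
AAPL,2023-01-04,126.0
MSFT,2023-01-03,240.0
MSFT,2023-01-04,
MSFT,2023-01-05,246.0
"""


def load_bronze(text):
    """Bronze: keep rows exactly as they arrive (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def build_silver(bronze_rows):
    """Silver: cast types and drop rows with a missing close price."""
    silver = []
    for row in bronze_rows:
        if not row["close"]:
            continue
        silver.append(
            {"symbol": row["symbol"], "date": row["date"], "close": float(row["close"])}
        )
    return silver


def build_gold(silver_rows):
    """Gold: add a close price normalized to 100 at each symbol's first date."""
    by_symbol = defaultdict(list)
    for row in sorted(silver_rows, key=lambda r: (r["symbol"], r["date"])):
        by_symbol[row["symbol"]].append(row)
    gold = []
    for rows in by_symbol.values():
        base = rows[0]["close"]
        for row in rows:
            gold.append({**row, "normalized_close": 100.0 * row["close"] / base})
    return gold
```

Chaining the functions as `build_gold(build_silver(load_bronze(RAW_CSV)))` mirrors the bronze, silver, gold order of the dbt projects: each stage reads only the output of the previous one.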

DBT models

All data pipeline phases are orchestrated by the Dagster framework. Dagster operations, resources, and jobs are defined in the Dagster project.

Dagster console

The pipeline is executed by running the e2e job from the Dagster console at http://localhost:3070/ using this yaml config file.

ngods analytics layer

ngods includes cube.dev for the semantic data model and Metabase for self-service analytics (dashboards, reports, and visualizations).

Analytics

The analytical (semantic) model is defined in cube.dev and is used for executing analytical queries over the gold data.

cube.dev

Metabase is connected to cube.dev via the SQL API. End users can use it for self-service creation of dashboards, reports, and data visualizations. Metabase is also directly connected to the gold schema in the Postgres database.

Metabase

ngods machine learning

Jupyter Notebooks with Scala, Java and Python backends can be used for machine learning.

Jupyter

Support

Create a GitHub issue if you have any questions.