Datalake Studio

Datalake Studio is an enhanced data exploration and management tool.

Key Features of Datalake Studio:

Quick for big data: Datalake Studio is built on top of DuckDB, a high-performance, embedded SQL OLAP database management system. DuckDB is designed to handle large datasets, making it ideal for data exploration and analysis.

See your data: Plot your data automatically or view it on a map as points, H3 aggregations, and more.

Versatile Data Loading Options: Users can upload data from several sources: directly from the local computer, via a URL, or from an Amazon S3 bucket. Datalake Studio also supports downloading data directly from PostgreSQL databases, which makes it useful for database administrators and data analysts.

Several data formats: Datalake Studio is compatible with CSV, TSV, Parquet, and Shapefile formats, so you can load data without tedious conversions.

ChatGPT Integration with SQL Assistants: Users with ChatGPT credentials can use the power of SQL assistants. These assistants provide contextual understanding about your tables and fields, making data manipulation and query formulation more intuitive and efficient.

Enhancement through Remote APIs: Users have the ability to enrich their data by integrating information from remote APIs.

API Exposure for Data Sharing: After completing data transformation processes, users can expose their data through APIs. This feature allows for easy sharing and collaboration, making Datalake Studio not just a tool for data exploration, but also a platform for data distribution.

Project build with Docker

docker-compose up --build

Open http://localhost:8080/ in your browser.

## If you don't want to use compose

docker build -t datalakestudioserver .
docker run --name datalakestudioserver -p 8000:8000 datalakestudioserver

docker build -t datalakestudiofront .
docker run --name datalakestudiofront -p 8080:8080 datalakestudiofront

Project build without Docker

Server

Inside the server folder, run:

pip3 install -r requirements.txt
python3 server.py

If you want to use venv:

python3 -m venv venv
source venv/bin/activate

Exit venv:

deactivate

Client

Inside the client folder of the project, run these commands to install the dependencies and start the Vue UI:

npm install
npm run dev -- --port 8080

Open http://localhost:8080/ in your browser.

Configuration files

Server

Inside the server folder, create a file named config.yml. Example:

port: 8000
database: "data/datalakeStudio.db"

And another file named secrets.yml with properties:

# Optional, for DuckDB to work with S3. If not defined, the user's AWS credentials are loaded through the AWS Default Credentials Provider Chain
s3_access_key_id: "YOUR_S3_ACCESS_KEY_ID"
s3_secret_access_key: "YOUR_S3_SECRET_ACCESS_KEY"

# For OpenAI
openai_organization: "YOUR_OPENAI_ORGANIZATION"
openai_api_key: "YOUR_OPENAI_API_KEY"

# For API search
api_domain: "YOUR_API_DOMAIN"
api_context: "YOUR_API_CONTEXT"

# Database connections
pgpass_file: "YOUR_PG_PASS_FILE"

# Mapbox
mapbox_access_token: "YOUR_MAPBOX_ACCESS_TOKEN"

Also, docker-compose will get the credentials in .aws for AWS access.

If you want to use a remote database, copy your pgpass file to the server folder. A pgpass file has the following format:

hostname:port:database:username:password
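As an illustration of this format, the sketch below splits one pgpass line into its five fields. It is not Datalake Studio's own parser: `PgPassEntry` and `parse_pgpass_line` are hypothetical names, and the sample values are made up. It follows PostgreSQL's convention that a literal `:` or `\` inside a field is escaped with a backslash.

```python
from typing import NamedTuple


class PgPassEntry(NamedTuple):
    # Fields stay as strings because pgpass allows "*" as a wildcard.
    hostname: str
    port: str
    database: str
    username: str
    password: str


def parse_pgpass_line(line: str) -> PgPassEntry:
    """Split one pgpass line into hostname, port, database, username, password."""
    fields, current = [], []
    chars = iter(line.rstrip("\n"))
    for ch in chars:
        if ch == "\\":
            # A backslash escapes the next character (":" or "\").
            current.append(next(chars, ""))
        elif ch == ":":
            # An unescaped colon ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return PgPassEntry(*fields)


# Made-up example line; the password contains an escaped colon.
entry = parse_pgpass_line(r"db.example.com:5432:sales:analyst:pa\:ss")
print(entry.hostname, entry.port, entry.database, entry.username)
```

A line with too few or too many fields raises `ValueError`, which helps catch a malformed file before a connection attempt fails with a less obvious error.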

Usage

Load data

You can load data from the local filesystem, from any URL, or from S3. Try loading this example: https://raw.githubusercontent.com/javitorres/GenericCross/main/public/data/iris.csv

Table explorer

Inspect loaded data and export it to CSV or Parquet.

Get a data profile, or use crossfilter to play with your data.

If your data has spatial information, you can view it on a map.

Query panel

Query your data and generate new tables. Save or load your queries. Use ChatGPT to create new queries.

Load data from APIs

Enrich your datasets by calling external APIs.

New table:

Load data from remote databases

Explore your external databases and load data into Datalake Studio for local analysis.

Expose your data via API

Publish endpoints that serve your data through parametrized queries.

Keep control of the endpoints you have published.

Explore your S3 buckets

Browse your S3 buckets and write descriptions.

Preview files or load them into Datalake Studio.

Talk to ChatGPT

Talk to ChatGPT to explore your data (experimental).