<div align="center"> <img src="https://github.com/javitorres/datalakeStudio/assets/4235424/462ac5ee-21a8-4a75-b3bc-cf90d36089b4" height="200"> <br/> <img src="https://github.com/javitorres/datalakeStudio/assets/4235424/3306a67f-91d3-4427-8214-96f8a1f02eb1" width=60% height=auto> <br/><br/> </div> <p align="center"> <img src="https://img.shields.io/badge/Version-1.0.0-red" alt="Latest Release"> <img src="https://img.shields.io/badge/Vue-3.3.4-blue" alt="Vue3"> <img src="https://img.shields.io/badge/DuckDB-1.0.0-yellow" alt="DuckDB"> <img src="https://img.shields.io/badge/OpenAI-1.6.1-green" alt="OpenAI"> </p>
Datalake Studio
Datalake Studio is an enhanced data exploration and management tool.
Key Features of Datalake Studio:
<b>Quick for big data:</b> Datalake Studio is built on top of DuckDB, a high-performance, embedded SQL OLAP database management system. DuckDB is designed to handle large datasets, making it ideal for data exploration and analysis.
<b>See your data:</b> Plot your data automatically or view it on a map as points, H3 aggregations, and more.
<b>Versatile Data Loading Options:</b> Users can load data from several sources: directly from the local computer, via a URL, or from an Amazon S3 bucket. It also supports loading data directly from PostgreSQL databases, which is useful for database administrators and data analysts.
<b>Several data formats:</b> Datalake Studio is compatible with CSV, TSV, Parquet and Shapefile formats, so you can load data without tedious conversions.
<b>ChatGPT Integration with SQL Assistants:</b> Users with ChatGPT credentials can use the power of SQL assistants. These assistants provide contextual understanding about your tables and fields, making data manipulation and query formulation more intuitive and efficient.
<b>Enhancement through Remote APIs:</b> Users have the ability to enrich their data by integrating information from remote APIs.
<b>API Exposure for Data Sharing:</b> After completing data transformation processes, users can expose their data through APIs. This feature allows for easy sharing and collaboration, making Datalake Studio not just a tool for data exploration, but also a platform for data distribution.
Project build with Docker
docker-compose up --build
Open http://localhost:8080/ in your browser.
## If you don't want to use compose
docker build -t datalakestudioserver .
docker run --name datalakestudioserver -p 8000:8000 datalakestudioserver
docker build -t datalakestudiofront .
docker run --name datalakestudiofront -p 8080:8080 datalakestudiofront
Project build without Docker
Server
Inside the server folder, run:
pip3 install -r requirements.txt
python3 server.py
If you want to use venv:
python3 -m venv venv
source venv/bin/activate
Exit venv:
deactivate
Client
Inside the client folder of the project, run these commands to install the dependencies and start the Vue UI in development mode:
npm install
npm run dev -- --port 8080
Open http://localhost:8080/ in your browser.
Configuration files
Server
Inside the server folder, create a file named config.yml. Example:
port: 8000
database: "data/datalakeStudio.db"
Then create another file named secrets.yml with these properties:
# Optional for DuckDB to work with S3. If not defined, the user's AWS credentials are loaded through the AWS Default Credentials Provider Chain
s3_access_key_id: "YOUR_S3_ACCESS_KEY_ID"
s3_secret_access_key: "YOUR_S3_SECRET_ACCESS_KEY"
# For OpenAI
openai_organization: "YOUR_OPENAI_ORGANIZATION"
openai_api_key: "YOUR_OPENAI_API_KEY"
# For API search
api_domain: "YOUR_API_DOMAIN"
api_context: "YOUR_API_CONTEXT"
# Database connections
pgpass_file: "YOUR_PG_PASS_FILE"
# Mapbox
mapbox_access_token: "YOUR_MAPBOX_ACCESS_TOKEN"
Also, docker-compose will get the credentials in .aws for AWS access.
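Before starting the server, it can help to check which entries in secrets.yml still hold their `YOUR_...` placeholder. The following is a minimal sketch using only the Python standard library. It reads flat `key: "value"` lines and is not a general YAML parser, and it is not how the server itself loads its configuration. The helper names `read_flat_yaml` and `unset_keys` and the sample values are hypothetical.

```python
def read_flat_yaml(text: str) -> dict[str, str]:
    """Parse flat `key: "value"` lines. This is NOT a general YAML parser."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def unset_keys(settings: dict[str, str], wanted: list[str]) -> list[str]:
    """Return the wanted keys that are missing, empty, or still a YOUR_... placeholder."""
    return [
        key
        for key in wanted
        if not settings.get(key) or settings[key].startswith("YOUR_")
    ]


# Made-up sample content in the same shape as secrets.yml.
example = """
# For OpenAI
openai_organization: "YOUR_OPENAI_ORGANIZATION"
openai_api_key: "sk-example"
# Mapbox
mapbox_access_token: "YOUR_MAPBOX_ACCESS_TOKEN"
"""

settings = read_flat_yaml(example)
pending = unset_keys(settings, ["openai_api_key", "mapbox_access_token"])
print(pending)  # ['mapbox_access_token']
```

For anything beyond flat key/value files, a real YAML library is the safer choice.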
If you want to use a remote database, copy your pgpass file to the server folder. A pgpass file has one line per connection in the following format:
hostname:port:database:username:password
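To make the format concrete, here is a small sketch that parses pgpass text into its five fields. It follows the PostgreSQL password file rules: lines starting with `#` are comments, and a literal `:` or `\` inside a field is escaped with a backslash. The `PgPassEntry` type, the function names, and the sample line are hypothetical and only illustrate the format. This is not how Datalake Studio reads the file.

```python
from typing import NamedTuple


class PgPassEntry(NamedTuple):
    hostname: str
    port: str
    database: str
    username: str
    password: str


def split_pgpass_line(line: str) -> list[str]:
    """Split on unescaped ':'; a backslash escapes the character after it."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            # Take the next character literally (for example '\:' becomes ':').
            current.append(next(chars, ""))
        elif ch == ":":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_pgpass(text: str) -> list[PgPassEntry]:
    """Parse pgpass text, skipping blank lines and comments."""
    entries = []
    for raw_line in text.splitlines():
        if not raw_line.strip() or raw_line.lstrip().startswith("#"):
            continue
        fields = split_pgpass_line(raw_line)
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {raw_line!r}")
        entries.append(PgPassEntry(*fields))
    return entries


# Made-up entry; the password contains an escaped colon.
sample = "# analytics replica\ndb.example.com:5432:sales:analyst:s3cr\\:et\n"
entries = parse_pgpass(sample)
print(entries[0].hostname, entries[0].password)  # db.example.com s3cr:et
```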
Usage
Load data
You can load data from the local filesystem, from any URL, or from S3. Try loading this example: https://raw.githubusercontent.com/javitorres/GenericCross/main/public/data/iris.csv
Table explorer
Inspect loaded data and export it to CSV or Parquet.
Get a data profile
or use crossfilter to play with your data.
If your data has spatial information, you can view it on a map:
<img width="1239" alt="image" src="https://github.com/user-attachments/assets/5be35c2f-72ba-4678-a36f-4c2cab45046b"> <img width="1251" alt="image" src="https://github.com/user-attachments/assets/641c757a-c228-4f1d-9ad0-5c1d72554a99">
Query panel
Query your data and generate new tables. Save or load your queries. Use ChatGPT to create new queries.
Load data from APIs
Enrich your datasets by calling external APIs.
New table:
Load data from remote databases
Explore your external databases and load data into Datalake Studio for local analysis.
Expose your data via API
Publish endpoints that serve your data through parametrized queries:
Keep track of published endpoints:
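As an illustration of what a parametrized query endpoint does, the sketch below runs a stored query with request parameters bound as values rather than pasted into the SQL text. It uses Python's built-in `sqlite3` purely as a stand-in, because it ships with Python. Datalake Studio itself is built on DuckDB, and how it substitutes parameters is not described here. The table, rows, query, and the `run_endpoint` helper are all made up for the example.

```python
import sqlite3

# Stand-in database with made-up rows (the real tool uses DuckDB, not SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flowers (species TEXT, petal_length REAL)")
conn.executemany(
    "INSERT INTO flowers VALUES (?, ?)",
    [("setosa", 1.4), ("versicolor", 4.7), ("virginica", 6.0)],
)

# Hypothetical stored query with one named parameter.
QUERY = (
    "SELECT species, petal_length FROM flowers "
    "WHERE petal_length >= :min_length ORDER BY petal_length"
)


def run_endpoint(connection, query, params):
    """Run a stored query, binding params as values instead of formatting them into SQL."""
    cursor = connection.execute(query, params)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


rows = run_endpoint(conn, QUERY, {"min_length": 4.0})
print(rows)
# [{'species': 'versicolor', 'petal_length': 4.7}, {'species': 'virginica', 'petal_length': 6.0}]

# A hostile value is treated as plain data, so it cannot alter the statement.
run_endpoint(conn, QUERY, {"min_length": "0 OR 1=1; DROP TABLE flowers"})
remaining = conn.execute("SELECT COUNT(*) FROM flowers").fetchone()[0]
print(remaining)  # 3
```

Binding values this way keeps the query text fixed, so callers of an endpoint can only change the values it declares.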
Explore your S3 buckets
Browse your S3 buckets and write descriptions.
Preview files or load them into Datalake Studio.
Talk to ChatGPT
Chat with ChatGPT to explore your data (experimental).