Getl

About

Groovy ETL (Getl) is an open source project written in Groovy, developed since 2012 to automate loading and processing data from different sources.

When do you need Getl?

Copying datasets between RDBMS, file and cloud sources;
Capturing incremental data from sources and delivering it to the data warehouse;
Copying and processing files from local and external file sources;
Rapid development of pilot projects for data warehouses (converting the structure of sources to warehouse tables, multi-threaded reloading of data from sources to warehouse tables);
Setting up development, testing and production environments for ETL projects;
Automating the testing of ETL processes;
Automating data health monitoring in a data warehouse;
Centralized storage of data source descriptions and their structures in a repository;
Simplifying the development of data processing patterns.

Supported RDBMS

IBM DB2, Firebird, H2 Database, Hadoop Hive, Cloudera Impala, MS SQL Server, MySQL, IBM Netezza, NetSuite, Oracle, PostgreSQL, SAP HANA, SQLite, Micro Focus Vertica.

Supported file sources

CSV, MS Excel, JSON, XML, YAML, DBF.

Supported cloud sources

Kafka, Salesforce, web services.

Supported file systems

Local file systems (Windows and Linux), FTP, SFTP, Hadoop HDFS.

Operations with RDBMS data sources

Creating and dropping tables and temporary tables;
Creating indexes;
Managing table partitions (only Vertica);
Reading records from tables and queries;
Adding, modifying and deleting records in tables;
Batch loading CSV files into tables (only H2, Vertica, Hive and Impala);
Execution of SQL queries.

Operations with file sources

Reading records from files;
Saving records to files (only CSV and JSON);
Splitting files into portions when writing (only CSV);
Working with files compressed with gzip;
Deleting files.
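
To make "portions" and gzip compression concrete, the sketch below writes rows to pipe-delimited CSV portions compressed with gzip and reads one portion back. It uses only the JDK and plain Groovy, not the Getl API; `writePortions`, `readPortion` and the file naming scheme are hypothetical and chosen only for this illustration.

```groovy
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

// Hypothetical helper (not part of Getl): writes rows as gzip-compressed,
// pipe-delimited CSV portions holding at most portionSize rows each.
List<File> writePortions(List<List<String>> rows, File dir, String baseName, int portionSize) {
    List<File> result = []
    rows.collate(portionSize).eachWithIndex { List<String> portion, int i ->
        // Assumed naming scheme: <baseName>.<portion number>.csv.gz
        String name = baseName + '.' + (i + 1) + '.csv.gz'
        File file = new File(dir, name)
        // withWriter closes the writer, which also finishes the gzip stream
        new GZIPOutputStream(new FileOutputStream(file)).withWriter('UTF-8') { w ->
            portion.each { row -> w.write(row.join('|') + '\n') }
        }
        result << file
    }
    result
}

// Hypothetical helper: reads all lines of one gzip-compressed portion.
List<String> readPortion(File file) {
    new GZIPInputStream(new FileInputStream(file)).withReader('UTF-8') { r -> r.readLines() }
}
```

With three rows and a portion size of 2, `writePortions` produces two files: the first holds two rows and the second holds one.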

Working with cloud sources

Reading records from datasets;
Saving records to datasets (only Kafka).

Working with file systems

Creating and deleting directories;
Downloading and uploading files between the server and a local directory;
Renaming and moving files on the source;
Executing OS command at source (only local file systems and SFTP).

Getl language operators

ETL:
- Copying records from one source to another source or to multiple sources at the same time;
- Reading records from source with error handling;
- Saving records to a source or multiple sources at the same time;
ELT:
- Executing scripts in the Getl stored procedure language;
- Copying records between tables of the same RDBMS;
File systems:
- Copying a hierarchy of directories and their files between two file systems;
- Clearing the hierarchy of directories and their files in the file system;
- Multi-threaded parsing of files from the file system;
Object repository:
- Saving and reading repository object descriptions from repository files;
- Processing repository objects by name group or mask;
Configuration and resources:
- Loading and saving configuration files;
- Working with loaded configuration content;
Logging and profiling:
- Logging messages, events and errors;
- Profiling the running time of commands and code blocks;
Workflow management:
- Controlling permission to start processes;
- Automatic cloning and freeing of repository objects in threads;
- Automatically freeing temporary repository objects upon termination of their processes;
- Process and application termination commands.

Examples

Registration of connections to Oracle and Vertica:

package packet1

import groovy.transform.BaseScript
import groovy.transform.Field
import getl.lang.Getl

@BaseScript Getl main

oracleConnection('ora', true) {
    connectHost = 'oracle-host'
    connectDatabase = 'oradb'
    login = 'user'
    password = 'password'
}

verticaConnection('ver', true) {
    connectHost = 'vertica-host1'
    connectDatabase = 'verdb'
    extended.backupservernode = 'vertica-host2,vertica-host3'
    login = 'user'
    password = 'password'
}

Creating a table in Vertica based on an Oracle table:

oracleTable('ora:table1', true) {
    useConnection oracleConnection('ora')
    schemaName = 'user'
    tableName = 'table1'
}

verticaTable('ver.stage:table1', true) {
    useConnection verticaConnection('ver')
    schemaName = 'stage'
    tableName = 'table1'
    if (!exists) {
        setOracleFields oracleTable('ora:table1')
        create()
    }
}

verticaTable('ver.work:table1', true) {
    useConnection verticaConnection('ver')
    schemaName = 'public'
    tableName = 'table1'
    if (!exists)
        createLike verticaTable('ver.stage:table1')
}

Copying all rows from the Oracle table to the Vertica table by unloading them to a temporary CSV file and loading it with the COPY statement:

etl.copyRows(oracleTable('ora:table1'), verticaTable('ver.stage:table1')) { bulkLoad = true }

Copying table data from a staging area into a working one:

sql {
    useConnection verticaConnection('ver')
    exec '''
        /*:count_insert*/
        INSERT INTO public.table1 SELECT * FROM stage.table1;
        IF ({count_insert} > 0);
            ECHO Copied {count_insert} rows successfully.
        END IF;
        COMMIT;
    ''' 
}
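
In the script above, `{count_insert}` appears to act as a placeholder that is replaced with a value before the ECHO line is printed. The general idea of such placeholder expansion can be sketched in plain Groovy. This is only an illustration under that assumption, not how Getl implements it; `expandVars` is a hypothetical name.

```groovy
// Hypothetical helper (not part of Getl): replaces each {name} in the text
// with the matching value from vars; unknown placeholders are left untouched.
String expandVars(String text, Map<String, Object> vars) {
    text.replaceAll(/\{(\w+)\}/) { String all, String name ->
        vars.containsKey(name) ? vars[name].toString() : all
    }
}

println expandVars('Copied {count_insert} rows successfully.', [count_insert: 5])
```

Leaving unknown placeholders as they are makes a misspelled variable name visible in the output instead of silently turning it into an empty string.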

Truncating the staging table and purging the working table:

verticaTable('ver.stage:table1') { truncate() }
verticaTable('ver.work:table1') { purgeTable() }

Unloading rows that match a specified condition from the Vertica table to a CSV file:

csv('file1') {
    useConnection csvConnection { path = '/tmp/unload' }
    fileName = 'data.table1'
    extension = 'csv'
    fieldDelimiter = '|'
    codePage = 'utf-8'
    header = true
}

verticaTable('ver.work:table1') {
    readOpts {
        where = 'field1 = CURRENT_DATE'
        order = ['field2', 'field3']
    }
}

etl.copyRows(verticaTable('ver.work:table1'), csv('file1'))

Copying files over SSH (SFTP) to another server:

files('unload_files', true) {
    rootPath = '/tmp/unload'
}

sftp('ssh1', true) {
    host = 'ssh-host'
    login = 'user'
    password = 'password'
    rootPath = '/csv'
}

fileman.copier(files('unload_files'), sftp('ssh1')) {
    useSourcePath { mask = 'data.{table}.csv' }
}
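
The mask `data.{table}.csv` above mixes literal text with a named variable. As a rough illustration of how such a mask could be matched against file names, the sketch below converts it to a regular expression with a named group. It is plain Groovy, does not use the Getl API, and is not how Getl itself processes masks; `maskToPattern` is a hypothetical name, and the sketch assumes variable names consist of letters and digits only.

```groovy
import java.util.regex.Pattern

// Hypothetical helper (not part of Getl): turns a mask such as
// 'data.{table}.csv' into a regex where each {name} becomes a named group.
Pattern maskToPattern(String mask) {
    def regex = new StringBuilder()
    def m = mask =~ /\{([A-Za-z][A-Za-z0-9]*)\}/
    int pos = 0
    while (m.find()) {
        // Literal text between variables is quoted so '.' matches only a dot
        regex << Pattern.quote(mask.substring(pos, m.start()))
        regex << '(?<' + m.group(1) + '>.+)'
        pos = m.end()
    }
    regex << Pattern.quote(mask.substring(pos))
    Pattern.compile(regex.toString())
}

def matcher = maskToPattern('data.{table}.csv').matcher('data.table1.csv')
if (matcher.matches()) {
    println "table = ${matcher.group('table')}"
}
```

Quoting the literal parts matters here: without it, the dots in the mask would match any character and a name such as `dataXtable1Xcsv` would be accepted.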