Awesome

JSQLTranspiler - Transpile Dialect, Resolve Columns, Show Lineage

A pure Java stand-alone SQL Transpiler, Column- and Lineage Resolver for translating various large RDBMS SQL Dialects into a few smaller RDBMS Dialects for Unit Testing. Based on JSQLParser.

Supports SELECT queries as well as INSERT, UPDATE, DELETE and MERGE statements.

Internal Functions will be rewritten based on the actual meaning and purpose of the function (since the DuckDB Any() function does not necessarily behave like the RDBMS specific Any()). Respecting different function arguments count, order and type.

Rewrite of Window- and Aggregate-Functions with full coverage of the RDBMS specific published samples. The matrix of supported features and functions is shared on Google Sheets.

Dialects

Input: Google BigQuery, Databricks, Snowflake, Amazon Redshift

Output: DuckDB

Transpile Example

Google BigQuery specific SQL

-- BigQuery specific DATE() function
SELECT
  DATE(2016, 12, 25) AS date_ymd,
  DATE(DATETIME '2016-12-25 23:59:59') AS date_dt,
  DATE(TIMESTAMP '2016-12-25 05:30:00+07', 'America/Los_Angeles') AS date_tstz;

/* Output
"date_ymd","date_dt","date_tstz"
"2016-12-15","2016-12-15","2016-12-15"
*/

will become DuckDB compatible SQL

-- DuckDB compliant rewrite producing the same result
SELECT
  MAKE_DATE(2016, 12, 25) AS date_ymd,
  CAST(DATETIME '2016-12-25 23:59:59' AS DATE) AS date_dt,
  CAST(TIMESTAMP '2016-12-25 05:30:00+07' AS DATE) AS date_tstz;

/* Output
"date_ymd","date_dt","date_tstz"
"2016-12-15","2016-12-15","2016-12-15"
*/

Column Lineage Example

For the simplified schema definition and the given query

String[][] schemaDefinition = {
        // Table A with Columns col1, col2, col3, colAA, colAB
        {"a", "col1", "col2", "col3", "colAA", "colAB"},

        // Table B with Columns col1, col2, col3, colBA, colBB
        {"b", "col1", "col2", "col3", "colBA", "colBB"}
};

String sqlStr =
        "SELECT Case when Sum(colBA + colBB)=0 then c.col1 else a.col2 end AS total FROM a INNER JOIN (SELECT * FROM b) c ON a.col1 = c.col1";

JdbcResultSetMetaData resultSetMetaData = new JSQLColumResolver(databaseMetaData).getResultSetMetaData(sqlStr);

the ResultSetMetaData return a list of JdbcColumns, each traversable using the TreeNode interface. The resulting Column Lineage can be illustrated as:

SELECT
 └─total AS CaseExpression: CASE WHEN Sum(colBA + colBB) = 0 THEN c.col1 ELSE a.col2 END
    ├─WhenClause: WHEN Sum(colBA + colBB) = 0 THEN c.col1
    │  ├─EqualsTo: Sum(colBA + colBB) = 0
    │  │  └─Function: Sum(colBA + colBB)
    │  │     └─Addition: colBA + colBB
    │  │        ├─c.colBA → b.colBA : Other
    │  │        └─c.colBB → b.colBB : Other
    │  └─c.col1 → b.col1 : Other
    └─a.col2 : Other

Resolve `*` Star Operator Example

For the simplified schema definition and the given query with Star Operators

String[][] schemaDefinition = {
    // Table A with Columns col1, col2, col3, colAA, colAB
    {"a", "col1", "col2", "col3", "colAA", "colAB"},

    // Table B with Columns col1, col2, col3, colBA, colBB
    {"b", "col1", "col2", "col3", "colBA", "colBB"}
};

String sqlStr = "SELECT * FROM ( (SELECT * FROM b) c inner join a on c.col1 = a.col1 ) d;";
String resolved =  new JSQLColumResolver(schemaDefinition).getResolvedStatementText(sqlStr);

the query will be resolved and (optionally rewritten into):

SELECT  d.col1                 /* Resolved Column*/
        , d.col2               /* Resolved Column*/
        , d.col3               /* Resolved Column*/
        , d.colBA              /* Resolved Column*/
        , d.colBB              /* Resolved Column*/
        , d.col1_1             /* Resolved Column*/
        , d.col2_1             /* Resolved Column*/
        , d.col3_1             /* Resolved Column*/
        , d.colAA              /* Resolved Column*/
        , d.colAB              /* Resolved Column*/
FROM (  (   SELECT  b.col1     /* Resolved Column*/
                    , b.col2   /* Resolved Column*/
                    , b.col3   /* Resolved Column*/
                    , b.colba  /* Resolved Column*/
                    , b.colbb  /* Resolved Column*/
            FROM b ) c
            INNER JOIN a
                ON c.col1 = a.col1 ) d
;

Alternatively, the information about returned columns can be fetched as JDBC ResultsetMetaData (without actually executing this query):

import java.sql.DatabaseMetaData;

String sqlStr = "SELECT * FROM (  (  SELECT * FROM sales ) c INNER JOIN listing a ON c.listid = a.listid ) d;";
// the meta data of catalgogs, schemas, tables, columns, either virtually and physically
DatabaseMetaData databaseMetaData = ...;
ResultSetMetaData resultSetMetaData = new JSQLColumResolver(databaseMetaData).getResultSetMetaData(sqlStr);
System.out.println(resultSetMetaData.toString());

/*
"#","label","name","table","schema","catalog","type","type name","precision","scale","display size"
"1","salesid","salesid","d",,"JSQLTranspilerTest","INTEGER","INTEGER","0","32","0"
"2","listid","listid","d",,"JSQLTranspilerTest","INTEGER","INTEGER","0","32","0"
... (shortened) ...
"17","totalprice","totalprice","d",,"JSQLTranspilerTest","DECIMAL","DECIMAL(8,2)","0","8","0"
"18","listtime","listtime","d",,"JSQLTranspilerTest","TIMESTAMP","TIMESTAMP","0","0","0"
 */

How to use

Java Library

Maven Artifact with Snapshot support:

<repositories>
    <repository>
        <id>jsqltranspiler-snapshots</id>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
        <url>https://s01.oss.sonatype.org/content/repositories/snapshots/</url>
    </repository>
</repositories>

<dependency>
    <groupId>ai.starlake.jsqltranspiler</groupId>
    <artifactId>jsqltranspiler</artifactId>
    <version>0.7-SNAPSHOT</version>
</dependency>

Calling the Java class:

import ai.starlake.transpiler.JSQLTranspiler;

String providedSQL="SELECT Nvl(null, 1) a";
String expectedSQL="SELECT Coalesce(null, 1) a";

String result = JSQLTranspiler.transpile(providedSQL, Dialect.AMAZON_REDSHIFT);
assertEquals(expectedSQL, result);

Web API

curl -X 'POST'                                                                   \
  'https://starlake.ai/api/v1/transpiler/transpile?dialect=SNOWFLAKE'            \
  -H 'accept: text/plain'                                                        \
  -H 'Content-Type: text/plain'                                                  \
  -d 'SELECT Nvl(null, 1) a'

Java Command Line Interface

usage: java -jar JSQLTranspilerCLI.jar [-d <arg> | --any | --bigquery |
       --databricks | --snowflake | --redshift]      [-D <arg> | --duckdb]
       [-i <arg>] [-o <arg>] [-h]

 -d,--input-dialect <arg>    The SQL dialect to parse.
                             [ANY*, GOOGLE_BIG_QUERY, DATABRICKS,
                             SNOWFLAKE, AMAZON_REDSHIFT]
    --any                    Interpret the SQL as Generic Dialect
                             [DEFAULT].
    --bigquery               Interpret the SQL as Google BigQuery Dialect.
    --databricks             Interpret the SQL as DataBricks Dialect.
    --snowflake              Interpret the SQL as Snowflake Dialect.
    --redshift               Interpret the SQL as Amazon Snowflake
                             Dialect.
 -D,--output-dialect <arg>   The SQL dialect to write.
                             [DUCKDB*]
    --duckdb                 Write the SQL in the Duck DB Dialect
                             [DEFAULT].
 -i,--inputFile <arg>        The input SQL file or folder.
                             - Read from STDIN when no input file
                             provided.
 -o,--outputFile <arg>       The out SQL file for the formatted
                             statements.
                             - Create new SQL file when folder provided.
                             - Append when existing file provided.
                             - Write to STDOUT when no output file
                             provided.
 -h,--help                   Print the help synopsis.

TimeKey substitution

The transpiler can substitute time key expressions such as CURRENT_DATE or CURRENT_TIMESTAMP with System's properties like

System.setProperty("CURRENT_TIMESTAMP", "2024-06-09 16:24:23.123");
String expected = "SELECT TIMESTAMP WITHOUT TIME ZONE '2024-06-09T16:24:23.123'";
String actual = JSQLTranspiler.transpileQuery("SELECT CURRENT_TIMESTAMP", JSQLTranspiler.Dialect.ANY);

Assertions.assertThat(actual).isEqualTo(expected);

Alternatively parameters can be provided as Map<String,Object> (which would take precedence over any System's properties):

String expected = "SELECT TIME WITHOUT TIME ZONE '17:24:23.123'";
String actual =
        JSQLTranspiler.transpileQuery(
                "SELECT CURRENT_TIME"
                , JSQLTranspiler.Dialect.ANY
                , Map.of("CURRENT_TIME", "17:24:23.123")
        );

Assertions.assertThat(actual).isEqualTo(expected);

Geography vs. Geometry

DuckDB currently only supports the GEOMETRY type. So for overloaded functions we need to decide if to interpret as GEOMETRY or as GEOGRAPHY. One can use the property GEO_MODE and set it either to GEOGRAPHY or GEOMETRY (with GEOMETRY being the default). Alternatively the Parameter Map can be used.

String expected = "SELECT ST_Area_Spheroid(ST_GEOMFROMTEXT('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))')) AS area";
String actual = JSQLTranspiler
                    .transpileQuery(
                            "select st_area(ST_GEOGFROMTEXT('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))')) as area;"
                            , JSQLTranspiler.Dialect.GOOGLE_BIG_QUERY
                            , Map.of("GEO_MODE", "GEOGRAPHY")
                    );
Assertions.assertEquals(expected, actual);
Assertions.assertEquals(12308778361.469452, getQueryResults(actual)[1][0]);

System.setProperty("GEO_MODE", "GEOMETRY");
String expected = "SELECT ST_Area(ST_GEOMFROMTEXT('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))')) AS area";
String actual = JSQLTranspiler
                    .transpileQuery(
                            "select ST_Area(ST_GEOGFROMTEXT('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))')) as area;"
                            , JSQLTranspiler.Dialect.GOOGLE_BIG_QUERY
                    );
Assertions.assertEquals(expected, actual);
Assertions.assertEquals(1.0, getQueryResults(actual)[1][0]);

Error Handling

In case the query refers to objects not existing in the provided database schema, the JSQLColumnResolver offers three modes:

STRICT will let the resolution and lineage fail with an error message, which (first) object were not resolved
IGNORE will simply ignore the node of the unresolvable object
LENIENT will insert a "virtual" column node pointing on the unresolvable column of an unknown type

STRICT is the default error mode. It can be changed for the JdbcMetaData before passing it to the JSQLColumnResolver as shown in the code example below:


String sqlStr =
            "with \"mycte\" as (\n"
            + "    select invalidColumn, \"c\".\"id\", CURRENT_TIMESTAMP() as \"timestamp\"\n"
            + "    from nonExistingTable \"o\", \"sales\".\"customers\" \"c\"\n"
            + "    where \"o\".\"customer_id\" = \"c\".\"id\"\n"
            + ")\n"
            + "select \"id\", sum(\"amount\") as sum, \"timestamp\"\n"
            + "from \"mycte\"\n"
            + "group by \"mycte\".\"id\", \"mycte\".\"timestamp\"";

// STRICT MODE will throw an Exception
ResultSetMetaData res =
        JSQLColumResolver.getResultSetMetaData(sqlStr, JdbcMetaData.copyOf(metaData.setErrorMode(JdbcMetaData.ErrorMode.STRICT)));

// LENIENT MODE will show an unresolvable node
ResultSetMetaData res =
        JSQLColumResolver.getResultSetMetaData(sqlStr, JdbcMetaData.copyOf(metaData.setErrorMode(JdbcMetaData.ErrorMode.LENIENT)));
String lineage =
        "SELECT\n"
        + " ├─mycte.id → sales.customers.id : Other\n"
        + " ├─sum AS Function sum\n"
        + " │  └─unresolvable\n"
        + " └─mycte.timestamp → timestamp : Other\n";

// IGNORE will skip and supress the unresolvable node
ResultSetMetaData res =
        JSQLColumResolver.getResultSetMetaData(sqlStr, JdbcMetaData.copyOf(metaData.setErrorMode(JdbcMetaData.ErrorMode.IGNORE)));
String lineage =
        "SELECT\n"
        + " ├─mycte.id → sales.customers.id : Other\n"
        + " ├─sum AS Function sum\n"
        + " └─mycte.timestamp → timestamp : Other\n";

More Details at JSQLColumnResolverTest

Unsupported features

Please refer to the Feature Matrix:

DuckDB's Number and Currency formatting is very limited right now
SELECT * REPLACE(...) on DuckDB works very differently (replaces value instead of label)

License

JSQLTranspiler is licensed under Apache License, Version 2.0.