Awesome
SQL DB plugin for Nextflow
This plugin provides support for interacting with SQL databases in Nextflow scripts.
The following databases are currently supported:
- AWS Athena (Setup guide here)
- DuckDB
- Google BigQuery (Setup guide here)
- H2
- MySQL
- MariaDB
- PostgreSQL
- SQLite
NOTE: THIS IS A PREVIEW TECHNOLOGY, FEATURES AND CONFIGURATION SETTINGS CAN CHANGE IN FUTURE RELEASES.
Getting started
This plugin requires Nextflow 22.08.1-edge
or later. You can enable the plugin by adding the following snippet to your nextflow.config
file:
plugins {
id 'nf-sqldb'
}
Support for BigQuery is provided in a separate plugin:
plugins {
id 'nf-bigquery'
}
Configuration
You can configure any number of databases under the sql.db
configuration scope. For example:
sql {
db {
foo {
url = 'jdbc:mysql://localhost:3306/demo'
user = 'my-user'
password = 'my-password'
}
}
}
The above example defines a database named foo
that connects to a MySQL server running locally at port 3306 and
using the demo
schema, with my-name
and my-password
as credentials.
The following options are available:
sql.db.'<DB-NAME>'.url
: The database connection URL based on the JDBC standard.
sql.db.'<DB-NAME>'.driver
: The database driver class name (optional).
sql.db.'<DB-NAME>'.user
: The database connection user name.
sql.db.'<DB-NAME>'.password
: The database connection password.
Dataflow Operators
This plugin provides the following dataflow operators for querying from and inserting into database tables.
fromQuery
The fromQuery
factory method queries a SQL database and creates a channel that emits a tuple for each row in the corresponding result set. For example:
include { fromQuery } from 'plugin/nf-sqldb'
channel.fromQuery('select alpha, delta, omega from SAMPLE', db: 'foo').view()
The following options are available:
db
: The database handle. It must be defined under sql.db
in the Nextflow configuration.
batchSize
: Query the data in batches of the given size. This option is recommended for queries that may return large a large result set, so that the entire result set is not loaded into memory at once.
: NOTE: this feature requires that the underlying SQL database supports LIMIT
and OFFSET
.
emitColumns
: When true
, the column names in the SELECT
statement are emitted as the first tuple in the resulting channel.
sqlInsert
The sqlInsert
operator collects the items in a source channel and inserts them into a SQL database. For example:
include { sqlInsert } from 'plugin/nf-sqldb'
channel
.of('Hello','world!')
.map( it -> tuple(it, it.length) )
.sqlInsert( into: 'SAMPLE', columns: 'NAME, LEN', db: 'foo' )
The above example executes the following SQL statements into the database foo
(as defined in the Nextflow configuration).
INSERT INTO SAMPLE (NAME, LEN) VALUES ('HELLO', 5);
INSERT INTO SAMPLE (NAME, LEN) VALUES ('WORLD!', 6);
NOTE: the target table (e.g. SAMPLE
in the above example) must be created beforehand.
The following options are available:
db
: The database handle. It must be defined under sql.db
in the Nextflow configuration.
into
: The target table for inserting the data.
columns
: The database table column names to be filled with the channel data. The column names order and cardinality must match the tuple values emitted by the channel. The columns can be specified as a list or as a string of comma-separated values.
statement
: The SQL INSERT
statement to execute, using ?
as a placeholder for the actual values, for example: insert into SAMPLE(X,Y) values (?,?)
. The into
and columns
options are ignored when this option is provided.
batchSize
: Insert the data in batches of the given size (default: 10
).
setup
: A SQL statement that is executed before inserting the data, e.g. to create the target table.
: NOTE: the underlying database should support the create table if not exist idiom, as the plugin will execute this statement every time the script is run.
Querying CSV files
This plugin supports the H2 database engine, which can query CSV files like database tables using SQL statements.
For example, create a CSV file using the snippet below:
cat <<EOF > test.csv
foo,bar
1,hello
2,ciao
3,hola
4,bonjour
EOF
Then query it in a Nextflow script:
include { fromQuery } from 'plugin/nf-sqldb'
channel
.fromQuery("SELECT * FROM CSVREAD('test.csv') where foo>=2;")
.view()
The CSVREAD
function provided by the H2 database engine allows you to query any CSV file in your filesystem. As shown in the example, you can use standard SQL clauses like SELECT
and WHERE
to define your query.
Caveats
Like all dataflow operators in Nextflow, the operators provided by this plugin are executed asynchronously.
In particular, data inserted using the sqlInsert
operator is not guaranteed to be available to any subsequent queries using the fromQuery
operator, as it is not possible to make a channel factory operation dependent on some upstream operation.