Awesome
Datapumps: Simple ETL for node.js
Overview
Use pumps to import, export, transform or transfer data. A data pump will read from its input stream, array or datapumps Buffer and will write to its output buffers. A pump will finish when all data is consumed from its output buffers. Make a group of pumps to handle complex ETL tasks.
Installation
$ npm install datapumps --save
Usage example: export mongodb to excel
var
datapumps = require('datapumps'),
Pump = datapumps.Pump,
MongodbMixin = datapumps.mixin.MongodbMixin,
ExcelWriterMixin = datapumps.mixin.ExcelWriterMixin,
pump = new Pump();
pump
.mixin(MongodbMixin('mongodb://localhost/marketing'))
.useCollection('Contact')
.from(pump.find({ country: "US" }))
.mixin(ExcelWriterMixin())
.createWorkbook('/tmp/ContactsInUs.xlsx')
.createWorksheet('Contacts')
.writeHeaders(['Name', 'Email'])
.process(function(contact) {
return pump.writeRow([ contact.name, contact.email ]);
})
.logErrorsToConsole()
.run()
.then(function() {
console.log("Done writing contacts to file");
});
Usage example with more details:
-
First, we create a pump and setup reading from mongodb
var pump = new Pump(); pump .mixin(MongodbMixin('mongodb://localhost/marketing')) .useCollection('Contact') .from(pump.find({ country: "US" }))
Mixins extend the functionality of a pump. The MongodbMixin adds
.find()
method which executes a query on the collection specified with.useCollection()
method. The pump will read the query results and controls data flow, i.e. it pauses read when it cannot write excel rows. -
Write data to excel with ExcelWriterMixin:
pump .mixin(ExcelWriterMixin()) .createWorkbook('/tmp/ContactsInUs.xlsx') .createWorksheet('Contacts') .writeHeaders(['Name', 'Email']) .process(function(contact) { return pump.writeRow([ contact.name, contact.email ]); })
The excel workbook, worksheet and header rows are created after adding ExcelWriterMixin to the pump. Each pump has a
.process()
callback that may transform or filter data. The callback is called for every data item of the buffer and should return a promise (we use bluebird library) that fulfills when the data is processed. In this example, the default processing callback (which copies data to the output buffer by default) is overridden with writing rows to the excel worksheet. -
Finally, start the pump and write to console when it's done.
pump .logErrorsToConsole() .run() .then(function() { console.log("Done writing contacts to file"); });
The
.logErrorsToConsole()
will log any error to the console, surprisingly. The pump will start on calling.run()
. It returns a promise that resolves when the pump finished.
Pump
A pump reads data from its input buffer or stream and copies it to the output buffer by default:
datapumps = require('datapumps');
(pump = new datapumps.Pump())
.from(<put a nodejs stream or datapumps buffer here>)
.run()
To access the output buffer, use the .buffer()
method, which returns a Buffer instance:
buffer = pump.buffer('output');
buffer = pump.buffer(); // equivalent with previous as the default buffer
// of the pump is called 'output'
Use the .buffers()
method when you need to write data into multiple output buffers:
ticketsPump
.buffers({
openTickets: ticketsPump.createBuffer(),
closedTickets: ticketsPump.createBuffer(),
});
reminderMailer = new datapumps.Pump()
reminderMailer
.from(ticketPump.buffer('openTickets'))
...
Note that the ticketsPump pump has two output buffers: openTickets and closedTickets. The reminderMailer pump reads data from the openTickets buffer of the tickets pump.
Transforming data
Use the .process()
method to set the function which processes data:
ticketsPump
.process(function(ticket) {
ticket.title = 'URGENT: ' + ticket.title;
return this.buffer('openTickets').writeAsync(ticket);
});
The argument of .process()
is a function that will be executed after the pump reads a data item.
The function is executed in the context of the pump object, i.e. this
refers to the pump itself. The
function should return a Promise that fulfills when the data is processed (i.e. written into a buffer
or stored elsewhere).
Start and end of pumping
A pump is started by calling the .start()
method. The end
event will be emitted when the
input stream or buffer ended and all output buffers became empty.
pump.on('end', function() {
console.log('Pumped everything, and all my output buffers are empty. Bye.')
})
Pump group
You often need multiple pumps to complete an ETL task. Pump groups help starting multiple pump in one step, and also enables handling the event when every pump ended:
sendMails = datapumps.group();
sendMails.addPump('tickets')
...;
sendMails.addPump('reminderMailer')
...;
sendMails
.start()
.whenFinished().then(function() {
console.log('Tickets processed.');
});
The .addPump()
method creates a new pump with given name and returns it for configuration.
.start()
will start all pumps in the group, while .whenFinished()
returns a Promise the fulfills
when every pump ended (Note: end
event is also emitted).
Encapsulation
Sometimes you wish to encapsulate a part of an ETL process and also use it elsewhere. It is possible
to set an input pump and expose buffers from the group, so it will provide the same interface as a
simple pump (i.e. it has .from()
, .start()
, .buffer()
methods and emits end
event).
Most likely, you want to extend datapumps.Group
class (example is written in CoffeeScript):
{ Group, mixin: { MysqlMixin } } = require 'datapumps'
class Notifier extends Group
constructor: ->
super()
@addPump 'emailLookup'
.mixin MysqlMixin connection
.process (data) ->
@query('SELECT email FROM user where username = ?', [ data.username ])
.then (result) =>
data.emailAddress = result.email
@buffer().writeAsync data
@addPump 'sendMail'
.from @pump 'emailLookup'
.process (data) ->
... # send email to data.emailAddress
@buffer().writeAsync
recipient:
name: data.name
email: data.emailAddress
@setInputPump 'emailLookup'
@expose 'output', 'sendMail/output'
The Notifier
will behave like pump, but in the inside, it does an email address lookup using
mysql, and sends mail to those addresses. The output buffer of sendMail
pump is filled with
recipient data.
Use the created class like this:
etlProcess = datapumps.group()
etlProcess
.addPump 'notifier', new Notifier
.from <node stream or datapumps buffer>
etlProcess
.addPump 'logger'
.from etlProcess.pump('notifier').buffer()
.process (data) ->
console.log "Email sent to #{data.name} (#{data.email})"
Please note that you cannot use .process
method on a group.
Error handling
Errors may occur while data is transferred between systems. Most of the time, you don't want to stop
on the first error but complete the transfer and re-run after fixing problems. Therefore
the pump group has an error buffer (.errorBuffer()
) which can hold ten error messages by default.
When the error buffer fills up, error
event is triggered and .whenFinised()
promise is rejected:
group
.start()
.whenFinished()
.then(function() {
if (!group.errorBuffer().isEmpty()) {
console.log("Transfer finished, but with errors.");
// errors list will be at group.errorBuffer().getContent()
}
})
.catch(function() {
console.log("Pump group failed with errors");
// errors list will be at group.errorBuffer().getContent()
});
You can use the .logErrorsToConsole()
helper method will configure the pump or group to print
errors when processing finished:
group
.logErrorsToConsole()
.start();
You can use the .logErrorsToLogger()
helper method will configure the pump or group to print
errors to a logger when processing finished:
group
.logErrorsToLogger(logger)
.start();
This is useful for running the ETL on a server. The logger can be any logging method that contains
an .error()
method such as Winston, Log4js, etc.
Debugging
The following example shows a fingers-crossed type logging, i.e. debug logging is turned on after the first error occured:
{ group } = require('datapumps')
(d = group())
.addPump 'test'
.from d.createBuffer
sealed: true,
content: [ 'first', 'second', 'third', 'fourth' ]
.process (data) ->
throw new Error 'Start debugging', data if data == 'second'
@copy data
d.errorBuffer().on 'write', (data) ->
console.log data
d.buffer('test/output').on 'write', (data) ->
console.log "#{data} was written to test/output buffer"
d.start()
The output:
{ message: [Error: Start debugging], pump: 'test' }
third was written to test/output buffer
fourth was written to test/output buffer
Mixins
The core components of datapumps is only responsible for passing data in a flow-controlled manner. The features required for import, export or transfer is provided by mixins:
- BatchMixin - Processes input in batches. Useful with MysqlMixin or other database writing mixins (batch insert can be much faster than inserting one by one).
- MergeMixin - Enables pump to read from multiple input buffers.
- ObjectTransformMixin - Common object transformation and validation methods
- CsvWriterMixin - Writes csv files using fast-csv package
- ExcelWriterMixin - Writes excel xlsx workbooks
- ExcelReaderMixin - Reads excel xlsx workbooks
- MysqlMixin - Queries and writes mysql databases
- PostgresqlMixin - Queries and writes postgresql databases
- MongodbMixin - Queries and writes mongodb
- RestMixin - Interact with REST services
When you implement new mixins, please fork datapumps and make a pull request.