Gather

Gather is a command-line tool that merges JSON files, with a twist: gather can optionally add metadata from the filename or the file's stats to each dataset. Because sometimes filenames are just meaningless descriptors, but often they're not.

Install with NPM (bundled with node.js):

npm install gather-cli -g

Examples

Combine all of last month's analytics data into a single file, without losing track of when those analytics were recorded:

gather 'analytics/{date}.json' > metrics.json

Convert your Markdown blogposts with YAML frontmatter into JSON, bundle them together with Gather and then render them:

yaml2json posts \
    --output posts \
    --prose \
    --convert markdown
gather 'posts/{year}-{month}-{day}-{permalink}.json' \
    --annotate \
    --output posts/all.json
render post.jade \
    --input posts/all.json \
    --output 'build/{year}/{permalink}.html' \
    --many

Reorganize your data with a gather-and-groupby one-two punch:

gather 'staff/{department}/{username}.json' | \
groupby 'staff/{office}/{firstName}-{lastName}.json' --unique

Path metadata

By default, filled-in filename placeholders will get added to the data.

With this gather command...

gather 'analytics/{date}.json' > metrics.json

... the resulting metrics.json file will contain a date key:

[
    {
        "date": "2014-10-01", 
        ...
    }, 
    {
        "date": "2014-10-02", 
        ...
    }, 
    ...
]
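
To make the placeholder mechanism concrete, here is a minimal sketch of how a pattern such as 'analytics/{date}.json' could be matched against a path. It is not gather's actual implementation: the function name extractPlaceholders and the rule that a placeholder never spans a "/" are assumptions made for illustration.

```javascript
// Hypothetical helper, not part of gather's API.
// Turns a pattern with {placeholders} into a regular expression and
// returns the captured values keyed by placeholder name, or null on no match.
function extractPlaceholders(pattern, filePath) {
    var names = [];
    // Splitting on a capturing group keeps the placeholder names:
    // odd-indexed parts are names, even-indexed parts are literal text.
    var source = pattern.split(/\{(\w+)\}/).map(function (part, i) {
        if (i % 2 === 1) {
            names.push(part);
            // Assumption: a placeholder matches within one path segment.
            return '([^/]+?)';
        }
        // Escape regex metacharacters in the literal text.
        return part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    }).join('');

    var match = new RegExp('^' + source + '$').exec(filePath);
    if (!match) return null;

    var metadata = {};
    names.forEach(function (name, i) {
        metadata[name] = match[i + 1];
    });
    return metadata;
}

// Merge the extracted metadata into the data loaded from the file.
var record = Object.assign(
    { visits: 120 },
    extractPlaceholders('analytics/{date}.json', 'analytics/2014-10-01.json')
);
console.log(record);
```

The same helper handles multi-placeholder patterns such as 'posts/{year}-{month}-{day}-{permalink}.json', where the last placeholder absorbs any remaining hyphens.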

File metadata

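File metadata includes the file's origin (relative path, absolute path, basename and extension) and its dates (accessed, modified, created and inferred).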

While path metadata is enabled by default, file metadata is not. Use the --annotate flag to enable file metadata.

Here's an example of file metadata:

{
    "origin": {
        "relative": "...", 
        "absolute": "...", 
        "basename": "...", 
        "extension": "..."
    }, 
    "date": {
        "accessed": {
            "iso": "...", 
            "year": ..., 
            "month": ..., 
            "day": ...,
            ...
        }, 
        "modified": ..., 
        "created": ..., 
        "inferred": ...
    }
}
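
For readers curious where such values can come from, the following sketch assembles a similar object with Node's built-in fs and path modules. The helper names describeFile and expandDate are hypothetical, and several details are assumptions: that basename excludes the extension, that modified and created share the shape shown for accessed, and that dates are expressed in UTC. The inferred date is left out because nothing above says how gather derives it.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical helper: expand a Date into the fields shown for "accessed".
// Assumption: UTC is used; gather may well use local time instead.
function expandDate(date) {
    return {
        iso: date.toISOString(),
        year: date.getUTCFullYear(),
        month: date.getUTCMonth() + 1, // getUTCMonth() is zero-based
        day: date.getUTCDate()
    };
}

// Hypothetical helper: build origin and date metadata for one file.
function describeFile(filePath) {
    var stats = fs.statSync(filePath);
    var extension = path.extname(filePath);
    return {
        origin: {
            relative: path.relative(process.cwd(), filePath),
            absolute: path.resolve(filePath),
            // Assumption: the basename is reported without its extension.
            basename: path.basename(filePath, extension),
            extension: extension
        },
        date: {
            accessed: expandDate(stats.atime),
            modified: expandDate(stats.mtime),
            created: expandDate(stats.birthtime)
        }
    };
}
```

Calling describeFile('staff/smith.json') on an existing file would return an object with the same top-level keys as the example, minus date.inferred.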

Compact, underscored and extended metadata naming schemes

Metadata from the filename or from the file's stats can conflict with keys already present in the data. If you are concerned about naming clashes, there are two ways to avoid them: the underscored naming scheme and the extended naming scheme.

An example of the extended naming scheme:

{
    "origin": "file path, extension et cetera", 
    "date": "created, modified, accessed and inferred dates", 
    "metadata": "metadata extracted from path placeholders", 
    "data": "the original data"
}
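
The sketch below illustrates why the extended layout avoids clashes. Both functions are hypothetical stand-ins rather than gather's code, and the behaviour of mergeCompact (metadata overwriting data on a shared key) is an assumption chosen only to show the problem.

```javascript
// Assumed compact-style merge: data and metadata share one namespace,
// so a "date" key in the data collides with a {date} placeholder.
function mergeCompact(data, metadata) {
    return Object.assign({}, data, metadata);
}

// Extended-style layout, using the key names from the example above:
// each kind of information gets its own top-level key.
function wrapExtended(data, metadata, origin, date) {
    return { origin: origin, date: date, metadata: metadata, data: data };
}

var sampleData = { date: 'Q4 report', visits: 120 };
var sampleMetadata = { date: '2014-10-01' };

var compact = mergeCompact(sampleData, sampleMetadata);
var extended = wrapExtended(sampleData, sampleMetadata, null, null);

console.log(compact.date);           // the original "date" value is gone
console.log(extended.data.date);     // original value preserved
console.log(extended.metadata.date); // placeholder value kept separately
```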

Partial rebuilds

When you add file metadata with the --annotate option, the output records the origin of each piece of data in the merged dataset. On subsequent gathering operations, this metadata makes it possible to update or remove only the data that has changed, rather than redoing the entire merge from scratch.

For example, say you've added a new staff member at /staff/smith.json and would like to update the staff.json file, which contains thousands of staff members. For every staff member in /staff, gather first checks whether the existing staff.json already holds up-to-date information. Only smith.json is missing, so only smith.json needs to be loaded and parsed from disk.

Especially when merging thousands of files, these partial rebuilds dramatically speed up gathering operations. Because the caching mechanism is generally safe (it will never use stale data, it will remove data for files that are no longer there, et cetera) it is enabled by default.

Nevertheless, it is possible to disable partial rebuilds: use --force to force a full redo of the merge. Alternatively, just rm the output file before using gather.
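
The caching idea described in this section can be sketched as a small pure function. This is an illustration under stated assumptions, not gather's implementation: partialRebuild, the shape of the current list and the load callback are all hypothetical, and staleness is judged by comparing an assumed date.modified.iso field with the file's current modification time.

```javascript
// previous: entries from an earlier annotated output; each is assumed to carry
//           origin.absolute and date.modified.iso.
// current:  files on disk right now, as { absolute, modifiedIso } (hypothetical shape).
// load:     reads and parses one file; only called for new or changed files.
function partialRebuild(previous, current, load) {
    var cache = new Map();
    previous.forEach(function (entry) {
        cache.set(entry.origin.absolute, entry);
    });

    // Entries whose file no longer appears in `current` are dropped,
    // because only files that exist now are mapped into the result.
    return current.map(function (file) {
        var cached = cache.get(file.absolute);
        var upToDate = cached !== undefined &&
            cached.date.modified.iso === file.modifiedIso;
        return upToDate ? cached : load(file);
    });
}
```

In the staff example, every existing staff member would be served from the cache and load would run once, for smith.json.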

Use from node.js

var gather = require('gather-cli');

var source = 'examples/staff';
var options = {
    "extended": true,
    "scheme": "underscored"
};

gather(source, options, function(err, staffMembers) {
    // Surface any error before touching the result
    if (err) throw err;
    staffMembers.forEach(function(staff){
        console.log(staff.name);
    });
});