RcppSimdJson: Rcpp Bindings for the simdjson Header Library
Motivation
simdjson by Daniel Lemire (with contributions by Geoff Langdale, John Keiser and many others) is an engineering marvel. Through very clever use of SIMD instructions, it manages to parse JSON files faster than disk access. Wut? Yes, you read that right: parallel processing with so little overhead that the net throughput is limited only by disk speed.
Moreover, it is implemented in neat modern C++ and can be accessed as a header-only library. (Well, one library in two files, really.) Which makes R packaging easy and convenient and compelling. So here we are.
For further introduction, see the arXiv paper by Langdale and Lemire (out/to appear in VLDB Journal 28(6) as well) and/or the video of the recent talk by Daniel Lemire at QCon (voted best talk).
Example
jsonfile <- system.file("jsonexamples", "twitter.json", package="RcppSimdJson")
library(RcppSimdJson)
validateJSON(jsonfile) # validate a JSON file
res <- fload(jsonfile) # parse a JSON file
Comparison
A simple file-oriented parsing benchmark against the other R-accessible JSON parsers:
> print(res)
Unit: microseconds
    expr       min        lq      mean   median        uq        max neval cld
 yyjsonr   312.267   347.683   405.177   390.11   425.827    926.776   100 a
simdjson   274.367   323.998   447.691   467.79   526.237    773.070   100 a
 jsonify  2727.874  2813.681  2952.804  2896.84  2972.852   7442.755   100 b
jsonlite  4237.538  4435.683  4587.428  4552.38  4668.345   7082.673   100 c
 RJSONIO  9131.864  9425.515  9707.274  9599.48  9845.006  13516.616   100 d
  ndjson 91668.822 92628.357 95386.212 93192.37 94507.484 152179.095   100 e
Or in chart form, also including the second benchmark parsing strings:
Status
All three major OSs are supported, and JSON can be parsed from file and string under a variety of settings. A C++17 compiler is required for ease of setup (though upstream can fall back to older compilers; one can edit src/Makevars accordingly if need be).
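As a minimal sketch of the string-based route (the counterpart to the file-based fload() shown in the example above), the snippet below parses a small in-memory document with fparse(). The JSON string is made up for illustration, and max_simplify_lvl is just one of the available settings; see the package help pages for the full list.

```r
library(RcppSimdJson)

# A small in-memory JSON document (made up for this illustration)
json <- '{"name": "simdjson", "tags": ["fast", "simd"], "values": [1, 2, 3]}'

# Parse from a string; fload() is the file-based counterpart
parsed <- fparse(json)
parsed$values    # the JSON array of numbers, simplified to an R vector

# Keep JSON arrays as plain R lists instead of simplifying them
as_list <- fparse(json, max_simplify_lvl = "list")
str(as_list)
```

By default, arrays are simplified to vectors (or matrices and data frames) where possible; asking for "list" trades that convenience for a structure that mirrors the JSON more literally.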
Contributing
Any problems, bug reports, or feature requests for the package can be submitted and handled most conveniently as GitHub issues in the repository.
Before submitting pull requests, it is frequently preferable to first discuss need and scope in such an issue ticket. See the file Contributing.md (in the Rcpp repo) for a brief discussion.
See Also
For standard JSON work on R, as well as for other nicely done C++ libraries, consider these:
- jsonlite by Jeroen Ooms is excellent, very versatile, and probably the most widely used;
- rapidjsonr and jsonify by David Cooley bring RapidJSON to R;
- ndjson by Bob Rudis builds on the JSON for Modern C++ library by Niels Lohmann;
- RJSONIO by Duncan Temple Lang started all this but could use a little love;
- yyjsonr by Mike Cheng is a more recent and performant addition based on yyjson.
Author
For the R package, Dirk Eddelbuettel and Brendan Knapp.
For everything pertaining to simdjson, Daniel Lemire (and many contributors).