<img src="https://raw.githubusercontent.com/mcaceresb/mcaceresb.github.io/master/assets/icons/gtools-icon/gtools-icon-text.png" alt="Gtools" width="500px"/>

Overview | Installation | Examples | Remarks | FAQs <img src="https://upload.wikimedia.org/wikipedia/commons/6/64/Icon_External_Link.png" width="13px"/> | Benchmarks <img src="https://upload.wikimedia.org/wikipedia/commons/6/64/Icon_External_Link.png" width="13px"/> | Compiling <img src="https://upload.wikimedia.org/wikipedia/commons/6/64/Icon_External_Link.png" width="13px"/>

Faster Stata for big data. This package uses C plugins and hashes to provide massive speed improvements to common Stata commands, including: reshape, collapse, xtile, tabstat, isid, egen, pctile, winsor, contract, levelsof, duplicates, unique/distinct, and more.

Faster Stata for Big Data

This package provides a fast implementation of various Stata commands using hashes and C plugins. The syntax and purpose are largely analogous to those of their Stata counterparts; for example, you can replace collapse with gcollapse, reshape with greshape, and so on. For a comprehensive list of differences (including some extra features!) see the remarks below; for details and examples see the official project page.

Quickstart

ssc install gtools
gtools, upgrade

Some quick benchmarks:

NOTE: Stata 17 introduced massive speed improvements to sort and collapse. In the MP version, in particular with many cores available, the native collapse can be up to twice as fast. (YMMV; overall native collapses could still be slower in some use cases.) gcollapse remains faster in SE and older Stata versions.
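Since the relative speed depends on the Stata version, flavor, and data, one way to check is to time both commands on data shaped like yours. The following is a minimal sketch, not part of the gtools benchmarks: the simulated data (1 million rows, roughly 1,000 groups), the variable names g and x, and the seed are all illustrative assumptions.

```stata
* Simulated data: sizes and names are arbitrary, for illustration only
clear
set seed 42
set obs 1000000
gen long   g = ceil(runiform() * 1000)
gen double x = rnormal()

timer clear

* Timer 1: native collapse
preserve
timer on 1
collapse (mean) mean_x = x (median) p50_x = x, by(g)
timer off 1
restore

* Timer 2: gcollapse on the same data
preserve
timer on 2
gcollapse (mean) mean_x = x (median) p50_x = x, by(g)
timer off 2
restore

* Compare elapsed seconds for timers 1 and 2
timer list
```

Changing the number of groups or the statistics requested may change which command comes out ahead.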

<img src="https://raw.githubusercontent.com/mcaceresb/stata-gtools/master/docs/benchmarks/quick.png#gh-light-mode-only" alt="Gtools quick benchmark" style="display:block;margin-left:auto;margin-right:auto" width="80%"/>

<img src="https://raw.githubusercontent.com/mcaceresb/stata-gtools/master/docs/benchmarks/quickdark.png#gh-dark-mode-only" alt="Gtools quick benchmark" style="display:block;margin-left:auto;margin-right:auto" width="80%"/>

Gtools commands with a Stata equivalent

| Function | Replaces | Speedup (IC / MP) | Unsupported | Extras |
| --- | --- | --- | --- | --- |
| gcollapse | collapse | -0.5 to 2 (Stata 17+); 4 to 100 (Stata 16 and earlier) | | Quantiles, merge, labels, nunique, etc. |
| greshape | reshape | 4 to 20 / 4 to 15 | "advanced syntax" | fast, spread/gather (tidyr equiv) |
| gegen | egen | 9 to 26 / 4 to 9 (+,.) | labels | Weights, quantiles, nunique, etc. |
| gcontract | contract | 5 to 7 / 2.5 to 4 | | |
| gisid | isid | 8 to 30 / 4 to 14 | using, sort | if, in |
| glevelsof | levelsof | 3 to 13 / 2 to 7 | | Multiple variables, arbitrary levels |
| gduplicates | duplicates | 8 to 16 / 3 to 10 | | |
| gquantiles | xtile | 10 to 30 / 13 to 25 (-) | | by(), various (see usage) |
| | pctile | 13 to 38 / 3 to 5 (-) | | Ibid. |
| | _pctile | 25 to 40 / 3 to 5 | | Ibid. |
| gstats tab | tabstat | 10 to 50 / 5 to 30 (-) | See remarks | various (see usage) |
| gstats sum | sum, detail | 10 to 20 / 5 to 10 | See remarks | various (see usage) |

<small>(+) The upper end of the speed improvements is for quantiles (e.g. median, iqr, p90) and few groups. Weights have not been benchmarked.</small>

<small>(.) Only gegen group was benchmarked rigorously.</small>

<small>(-) Benchmarks computed 10 quantiles. When computing a large number of quantiles (e.g. thousands) pctile and xtile are prohibitively slow due to the way they are written; in that case gquantiles is hundreds or thousands of times faster, but this is an edge case.</small>

Extra commands

| Function | Similar (SSC/SJ) | Speedup (IC / MP) | Notes |
| --- | --- | --- | --- |
| fasterxtile | fastxtile | 20 to 30 / 2.5 to 3.5 | Allows by() |
| | egenmisc (SSC) (-) | 8 to 25 / 2.5 to 6 | |
| | astile (SSC) (-) | 8 to 12 / 3.5 to 6 | |
| gstats hdfe | (.) | | Allows weights, by() |
| gstats winsor | winsor2 | 10 to 40 / 10 to 20 | Allows weights |
| gunique | unique | 4 to 26 / 4 to 12 | |
| gdistinct | distinct | 4 to 26 / 4 to 12 | Also saves results in matrix |
| gtop (gtoplevelsof) | groups, select() | (+) | See table notes (+) |
| gstats range | rangestat | 10 to 20 / 10 to 20 | Allows weights; no flex stats |
| gstats transform | | | Various statistical functions |

<small>(-) fastxtile from egenmisc and astile were benchmarked against gquantiles, xtile (fasterxtile) using by().</small>

<small>(+) While similar to the user command 'groups' with the 'select' option, gtoplevelsof does not really have an equivalent. It is several dozen times faster than 'groups, select', but that command was not written with the goal of gleaning the most common levels of a varlist. Rather, it has a plethora of features and that one is somewhat incidental. As such, the benchmark is not equivalent and gtoplevelsof does not attempt to implement the features of 'groups'</small>

<small>(.) Other than the dated 'hdfe' command, I do not know of a Stata command that residualizes variables from a set of fixed effects. The 'hdfe' command, as far as I can tell, morphed into the 'reghdfe' package; the latter, however, is a fully-functioning regression command, while 'gstats hdfe' only residualizes a set of variables.</small>

Regression models

WARNING: Regression models are in beta and are only intended as utilities to compute coefficients and standard errors. I do not recommend their use in production; various post-estimation commands and statistics are not available. (See gstats hdfe for residualizing variables net of fixed effects.)

| Function | Model | Similar |
| --- | --- | --- |
| gregress | OLS | regress, reghdfe |
| givregress | 2SLS | ivregress 2sls, ivreghdfe |
| gglm | IRLS | logit, poisson, ppmlhdfe |

All commands allow the user to optionally add:

Linear regression is computed via OLS (or WLS), IV regression is computed via two-stage least squares (2SLS), and GLM (poisson or logit) regression is computed via iteratively reweighted least squares (IRLS). See the TODO section for planned features, or the Missing Features section in the documentation for what is missing before the first non-beta release.

Extra features

Several commands offer additional features on top of the massive speedup. See the remarks section below for an overview; for details and examples, see each command's help page:

In addition, several commands take gsort-style input, that is

[+|-]varname [[+|-]varname ...]

This does not affect the results in most cases, just the sort order. Commands that take this type of input include:

Ftools

The commands here are also faster than the commands provided by ftools. Further, gtools commands accept a mix of string and numeric variables, which ftools does not support. (Note I could not get several parts of ftools working on the Linux server where I have access to Stata/MP; hence the IC benchmarks.)

| Gtools | Ftools | Speedup (IC) |
| --- | --- | --- |
| gcollapse | fcollapse | 2-9 |
| gegen | fegen | 2.5-4 (+) |
| gisid | fisid | 4-14 |
| glevelsof | flevelsof | 1.5-13 |
| hashsort | fsort | 2.5-4 |

<small>(+) Only egen group was benchmarked rigorously.</small>

Limitations

Acknowledgements

Installation

I only have access to Stata 13.1, so I impose that to be the minimum. You can install gtools from Stata via SSC:

ssc install gtools
gtools, upgrade

By default this syncs to the master branch, which is stable. To install the latest version directly, type:

local github "https://raw.githubusercontent.com"
net install gtools, from(`github'/mcaceresb/stata-gtools/master/build/)

Examples

The syntax is generally analogous to the standard commands (see the corresponding help files for full syntax and options):

sysuse auto, clear

* gstats {hdfe|residualize} varlist [if] [in] [weight], [absorb(varlist) options]
gstats hdfe hdfe_price = price, absorb(foreign rep78)
gstats residualize price mpg, absorb(foreign rep78) prefix(res_)

* gstats {sum|tab} varlist [if] [in] [weight], [by(varlist) options]
gstats sum price [pw = gear_ratio / 4]
gstats tab price mpg, by(foreign) matasave

* gquantiles [newvarname =] exp [if] [in] [weight], {_pctile|xtile|pctile} [options]
gquantiles 2 * price, _pctile nq(10)
gquantiles p10 = 2 * price, pctile nq(10)
gquantiles x10 = 2 * price, xtile nq(10) by(rep78)
fasterxtile xx = log(price) [w = weight], cutpoints(p10) by(foreign)

* gstats winsor varlist [if] [in] [weight], [by(varlist) cuts(# #) options]
gstats winsor price gear_ratio mpg, cuts(5 95) s(_w1)
gstats winsor price gear_ratio mpg, cuts(5 95) by(foreign) s(_w2)
drop *_w?

* hashsort varlist, [options]
hashsort -make
hashsort foreign -rep78, benchmark verbose mlast

* gegen target  = stat(source) [if] [in] [weight], by(varlist) [options]
gegen tag   = tag(foreign)
gegen group = tag(-price make)
gegen p2_5  = pctile(price) [w = weight], by(foreign) p(2.5)

* gisid varlist [if] [in], [options]
gisid make, missok
gisid price in 1 / 2

* gduplicates varlist [if] [in], [options gtools(gtools_options)]
gduplicates report foreign
gduplicates report rep78 if foreign, gtools(bench(3))

* glevelsof varlist [if] [in], [options]
glevelsof rep78, local(levels) sep(" | ")
glevelsof foreign mpg if price < 4000, loc(lvl) sep(" | ") colsep(", ")
glevelsof foreign mpg in 10 / 70, gen(uniq_) nolocal

* gtop varlist [if] [in] [weight], [options]
* gtoplevelsof varlist [if] [in] [weight], [options]
gtoplevelsof foreign rep78
gtop foreign rep78 [w = weight], ntop(5) missrow groupmiss pctfmt(%6.4g) colmax(3)

* gregress depvar indepvars [if] [in] [weight], [by(varlist) options]
gregress price mpg rep78, mata(coefs) prefix(b(_b_) se(_se_))
gregress price mpg [fw = rep78], by(foreign) absorb(rep78 headroom) cluster(rep78)

* givregress depvar (endog = instruments) exog [if] [in] [weight], [by(varlist) options]
givregress price (mpg = gear_ratio) rep78, mata(coefs) prefix(b(_b_) se(_se_)) replace
givregress price (mpg = gear_ratio) [fw = rep78], by(foreign) absorb(rep78 headroom) cluster(rep78)

* gglm depvar indepvars [if] [in] [weight], family(...) [by(varlist) options]
gglm price mpg rep78, family(poisson) mata(coefs) prefix(b(_b_) se(_se_)) replace
gglm price mpg [fw = trunk], family(poisson) by(foreign) absorb(rep78 headroom) cluster(rep78)

gglm foreign price rep78 [fw = trunk], family(binomial) absorb(headroom) mata(coefs)
gglm foreign price if rep78 > 2, family(binomial) by(rep78) prefix(b(_b_) se(_se_)) replace

* gcollapse (stat) out = src [(stat) out = src ...] [if] [in] [weight], by(varlist) [options]
gen h1 = headroom
gen h2 = headroom
local lbl labelformat(#stat:pretty# #sourcelabel#)

gcollapse (mean) mean = price (median) p50 = gear_ratio, by(make) merge v `lbl'
disp "`:var label mean', `:var label p50'"
gcollapse (iqr) irq? = h? (nunique) turn (p97.5) mpg, by(foreign rep78) bench(2) wild

* gcontract varlist [if] [in] [fweight], [options]
gcontract foreign [fw = turn], freq(f) percent(p)

* greshape wide varlist,    i(i) j(j) [options]
* greshape long prefixlist, i(i) [j(j) string options]
*
* greshape spread varlist, j(j) [options]
* greshape gather varlist, j(j) value(value) [options]

gen j = _n
greshape wide f p, i(foreign) j(j)
greshape long f p, i(foreign) j(j)

greshape spread f p, j(j)
greshape gather f? p?, j(j) value(fp)

* gstats transform (stat) out = src [(stat) out = src ...] [if] [in] [weight], by(varlist) [options]
* gstats range  (stat) out = src [...] [if] [in] [weight], by(varlist) [options]
* gstats moving (stat) out = src [...] [if] [in] [weight], by(varlist) [options]

sysuse auto, clear
gstats transform (normalize) price (demean) price (range mean -sd sd) price, auto
gstats range  (mean) mean_r = price (sd) sd_r = price, interval(-10 10 mpg)
gstats moving (mean) mean_m = price (sd) sd_m = price, by(foreign) window(-5 5)

See the FAQs or the respective documentation for a list of supported gcollapse and gegen functions.

Remarks

Functions available with gegen, gcollapse, gstats tab

gcollapse supports every collapse function, including their weighted versions. In addition, weights can be selectively applied via rawstat(), and several additional statistics are allowed, including nunique, select#, and so on.
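As a rough illustration of selective weighting, the sketch below computes a weighted and an unweighted mean of the same variable in one call. It assumes that rawstat() takes the names of the targets for which weights should be ignored; that reading, along with the target names w_price and r_price, is an assumption here, so check the gcollapse help file before relying on it.

```stata
sysuse auto, clear

* w_price uses the analytic weights; r_price is listed in rawstat(),
* which (by assumption) makes gcollapse ignore the weights for it
gcollapse (mean) w_price = price (mean) r_price = price [aw = weight], ///
    by(foreign) rawstat(r_price)

list
```

The `///` line continuation works in a do-file; when typing interactively, put the command on one line.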

gegen technically does not support all of egen, but whenever a function that is not supported is requested, gegen hashes the data and calls egen grouping by the hash, which is often faster (gegen only supports weights for internal functions, since egen does not normally allow weights).

Hence both should be able to replicate all of the functionality of their Stata counterparts. Last, gstats tab allows every statistic allowed by tabstat as well as any statistic allowed by gcollapse; the syntax for the statistics specified via statistics() is the same as in tabstat.

The following are implemented internally in C:

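| Function | gcollapse | gegen | gstats tab |
| --- | --- | --- | --- |
| tag | | X | |
| group | | X | |
| total | | X | |
| count | X | X | X |
| nunique | X | X | X |
| nmissing | X | X (+) | X |
| sum | X | X | X |
| nansum | X | X | X |
| rawsum | X | | X |
| rawnansum | X | | X |
| mean | X | X | X |
| geomean | X | X | X |
| median | X | X | X |
| percentiles | X | X | X |
| iqr | X | X | X |
| sd | X | X | X |
| variance | X | X (+) | X |
| cv | X | X | X |
| max | X | X | X |
| min | X | X | X |
| range | X | X | X |
| select | X | X | X |
| rawselect | X | | X |
| percent | X | X | X |
| first | X | X (+) | X |
| last | X | X (+) | X |
| firstnm | X | X (+) | X |
| lastnm | X | X (+) | X |
| semean | X | X (+) | X |
| sebinomial | X | X | X |
| sepoisson | X | X | X |
| skewness | X | X | X |
| kurtosis | X | X | X |
| gini | X | X | X |
| gini dropneg | X | X | X |
| gini keepneg | X | X | X |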

<small>(+) indicates the function has the same or a very similar name to a function in the "egenmore" package, but the function was independently implemented and is hence analogous to its gcollapse counterpart, not necessarily the function in egenmore.</small>

The percentile syntax mimics that of collapse and egen, with the addition that quantiles are also supported. That is,

gcollapse (p#) target = var [target = var ...] , by(varlist)
gegen target = pctile(var), by(varlist) p(#)

where # is a "percentile" with arbitrary decimal places (e.g. 2.5 or 97.5). gtools also supports selecting the #th smallest or largest value:

gcollapse (select#) target = var [(select-#) target = var ...] , by(varlist)
gegen target = select(var), by(varlist) n(#)
gegen target = select(var), by(varlist) n(-#)
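A concrete version of the syntax above, using the auto data, might look as follows. This is only a sketch: the target names are arbitrary, and the two assert lines rely on the reading that select1 and select-1 pick the smallest and largest values, so they should match min and max.

```stata
sysuse auto, clear

* Percentile with decimal places, by group
gegen p97_5 = pctile(price), by(foreign) p(97.5)

* Second smallest and second largest price within each group
gegen price_lo2 = select(price), by(foreign) n(2)
gegen price_hi2 = select(price), by(foreign) n(-2)

* The same kinds of statistics in a collapse
gcollapse (p2.5) p2_5 = price (select1) lo = price (select-1) hi = price ///
          (min) mn = price (max) mx = price, by(foreign)

* The 1st smallest and 1st largest values should equal min and max
assert lo == mn
assert hi == mx
list
```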

In addition, the following are allowed in gegen as wrappers to other gtools functions (stat is any stat available to gcollapse, except percent, nunique):

| Function | calls |
| --- | --- |
| xtile | fasterxtile |
| standardize | gstats transform |
| normalize | gstats transform |
| demean | gstats transform |
| demedian | gstats transform |
| moving_stat | gstats transform |
| range_stat | gstats transform |
| cumsum | gstats transform |
| shift | gstats transform |
| rank | gstats transform |
| winsor | gstats winsor |
| winsorize | gstats winsor |

Last, when gegen calls a function that is not implemented internally by gtools, it will hash the by variables and call egen with by set to an id based on the hash. That is, if fcn is not one of the functions above,

gegen outvar = fcn(varlist) [if] [in], by(byvars)

would be the same as

hashsort byvars, group(id) sortgroup
egen outvar = fcn(varlist) [if] [in], by(id)

but preserving the original sort order. In case an egen option might conflict with a gtools option, the user can pass gtools_capture(fcn_options) to gegen.

Differences and Extras

Differences from collapse

Differences from reshape

Differences from regression models

gregress, givregress, and gglm do not aim to replicate the entire table of estimation results, nor the entire suite of post-estimation results and tests, that regress (reghdfe), ivregress 2sls (ivreghdfe), poisson (ppmlhdfe), or logit make available. At the moment, they are considered beta software and only coefficients and standard errors are computed.

Differences from xtile, pctile, and _pctile

Differences from egen

Differences from tabstat

Differences from summarize, detail

Differences from levelsof

Differences from isid

Differences from gsort

Differences from duplicates

Differences from rangestat

Hashing and Sorting

There are two key insights to the massive speedups of Gtools:

  1. Hashing the data and sorting a hash is a lot faster than sorting the data to then process it by group. Sorting a hash can be achieved in linear O(N) time, whereas the best general-purpose sorts take O(N log(N)) time. Sorting the groups would then be achievable in O(J log(J)) time (with J groups). Hence the speed improvements are largest when N / J is largest.

  2. Compiled C code is much faster than Stata commands. While it is true that many of Stata's underpinnings are compiled code, several operations are written in ado files without much thought given to optimization. If you're working with tens of thousands of observations you might barely notice (and the difference between 5 seconds and 0.5 seconds might not be particularly important). However, with tens of millions or hundreds of millions of rows, the difference between half a day and an hour can matter quite a lot.
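To see why the first insight favors large N / J, here is a back-of-the-envelope comparison of operation counts. It is a sketch under strong simplifying assumptions: constants, memory access patterns, and hashing cost are ignored, and the values of N and J are made up for illustration.

```stata
* Rough operation counts only; these are not timings
local N = 1e8
foreach J in 1e3 5e7 {
    * Comparison sort of all N rows: N * log2(N)
    local full_sort  = `N' * ln(`N') / ln(2)
    * Linear pass over N rows plus a sort of J groups: N + J * log2(J)
    local hash_group = `N' + `J' * ln(`J') / ln(2)
    local ratio      = `full_sort' / `hash_group'
    display "J = `J': full sort / hash-and-group = " %5.2f `ratio'
}
```

With 100 million rows, the ratio is about 27 when there are 1,000 groups but only about 2 when there are 50 million groups, which is consistent with the gains shrinking as the number of groups approaches the number of rows.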

Stata Sorting

It should be noted that Stata's sorting mechanism is hard to improve upon because of the overhead involved in sorting. We have implemented a hash-based sorting command, hashsort, which should be faster than Stata's sort for groups, but not necessarily otherwise:

| Function | Replaces | Speedup (IC / MP) | Unsupported | Extras |
| --- | --- | --- | --- | --- |
| hashsort | sort | 2.5 to 4 / .8 to 1.3 | | Group (hash) sorting |
| | gsort | 2 to 18 / 1 to 6 | mfirst (see mlast) | Sorts are stable |

The overhead involves copying the by variables, hashing, sorting the hash, sorting the groups, copying a sort index back to Stata, and having Stata do the final swaps. The plugin runs fast, but the copy overhead plus the Stata swaps often make hashsort slower than Stata's native sort.

The other functions are faster because they do not incur all of that overhead. By contrast, Stata's gsort is not efficient. To sort data, you need to make pairwise comparisons. For real numbers, this is just a > b. However, a generic comparison function can be written as compare(a, b) > 0, which is true if a is greater than b and false otherwise. To invert the sort order, one need only use compare(b, a) > 0, which is what gtools does internally.

However, Stata creates a variable that is the inverse of the sort variable. This is equivalent, but the overhead makes it slower than hashsort.
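The difference can be sketched in Stata itself. The first half mimics the negated-copy approach that the text attributes to gsort (this is an illustration, not gsort's actual source); the second half passes the sign directly to hashsort. The helper variable names are arbitrary.

```stata
sysuse auto, clear

* Descending sort via an extra, negated copy of the sort variable
gen double neg_price = -price
sort neg_price, stable
gen long order_neg = _n

* hashsort takes the sign directly; no extra variable is created
hashsort -price
gen long order_hash = _n

* Both orderings are descending in price, so the positions agree
assert order_neg == order_hash
drop neg_price order_neg order_hash
```

The negated copy costs an extra pass over the data and extra memory, which is the overhead described above.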

TODO

Planned features:

These are options/features/improvements I would like to add, but I don't have an ETA for them (i.e. they are a wishlist because I am either not sure how to implement them or because writing the code will take a long time). Roughly in order of likelihood:

About

Hi! I'm Mauricio Caceres; I made gtools after some of my Stata jobs were taking literally days to run because of repeated calls to egen, collapse, and similar on data with over 100M rows. Feedback and comments are welcome! I hope you find this package as useful as I do.

Along those lines, here are some other Stata projects I like:

License

Gtools is MIT-licensed. ./lib/spookyhash and ./src/plugin/common/quicksort.c belong to their respective authors and are BSD-licensed. Also see gtools, licenses.