CryptOSS
This repository contains tooling for collecting and viewing Cryptocurrency Open Source Software (OSS) development activity.
<details>
<summary>Click to expand Papers and Citations related to this project.</summary>

@inproceedings{trockman-striking-gold-2019,
  title     = {{Striking Gold in Software Repositories? An Econometric Study of Cryptocurrencies on GitHub}},
  booktitle = {International Conference on Mining Software Repositories},
  series    = {MSR '19},
  author    = {Trockman, Asher and {van~Tonder}, Rijnard and Vasilescu, Bogdan},
  year      = {2019}
}

@inproceedings{van-tonder-crypto-oss-2019,
  title     = {{A Panel Data Set of Cryptocurrency Development Activity on GitHub}},
  booktitle = {International Conference on Mining Software Repositories},
  series    = {MSR '19},
  author    = {{van~Tonder}, Rijnard and Trockman, Asher and {Le~Goues}, Claire},
  year      = {2019}
}
</details>
CSV and raw data
View the GitHub data online. The online view does not include the full data available in the CSV above, which also covers cryptocurrency prices, market caps, and trading volumes.
Building the tooling
- Install opam. Typically:
  sh <(curl -sL https://raw.githubusercontent.com/ocaml/opam/master/shell/install.sh)
- Then run:
  opam init
  opam switch create 4.05.0 4.05.0
- Next, install the dependencies:
  opam install core opium yojson hmap tyxml tls
- Then pin the GitHub API library:
  opam pin add github https://github.com/rvantonder/ocaml-github.git
- Finally, type make in this repository.
The scripts and command-line utilities should now work. Let's step through the possible uses.
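For convenience, the same steps collected into a single copy-pasteable block (this assumes a fresh opam setup and that you run it from the repository root):

```sh
# Install opam
sh <(curl -sL https://raw.githubusercontent.com/ocaml/opam/master/shell/install.sh)

# Create an OCaml 4.05.0 switch and install the dependencies
opam init
opam switch create 4.05.0 4.05.0
opam install core opium yojson hmap tyxml tls
opam pin add github https://github.com/rvantonder/ocaml-github.git

# Build the scripts and command-line utilities in this repository
make
```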
Collecting your own data
The cronjob folder contains the crontab for actively polling and collecting GitHub data. It's a good place to look if you want to understand how to collect data.
- cronjob/crontab: The crontab pulls data by invoking cronjob/save.sh and cronjob/ranks.sh at certain intervals (these can be customized).
- cronjob/save.sh: Essentially runs the crunch.exe save command (with a user-supplied GitHub token), see here. This command takes a list of comma-separated names registered in the db.ml file. You can see the invocation of the save.sh script in the crontab file.
- cronjob/ranks.sh: Pulls cryptocurrency data from CoinMarketCap.
- batches: The crontab uses batches of cryptocurrencies (listed in files, see the batches directory for an example). Each batch corresponds to a list of cryptocurrencies that fits within the 5000-request GitHub rate limit, so that batched requests can be spaced out over 24 hours. The interval and batch size can be changed depending on need (see cronjob/batches/generate.sh). A sketch of this scheduling appears after this list.
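To make the batching idea concrete, here is a minimal shell sketch of spacing batched crunch.exe save calls over a day. It is an illustration only, not the actual cronjob/save.sh: the batch file naming, the token variable, and the 3-hour spacing are all hypothetical.

```sh
#!/bin/sh
# Hypothetical sketch: process each batch file (one comma-separated list of
# currency names per file) and space the requests out so every batch stays
# within GitHub's 5000-request rate limit over 24 hours.
TOKEN="your-github-token"                      # hypothetical: supply your own token

for batch in cronjob/batches/batch-*.txt; do   # hypothetical batch file naming
  names=$(cat "$batch")                        # e.g. "Bitcoin,Ethereum,Litecoin"
  ./crunch.exe save "$names" -token "$TOKEN"
  sleep $((3 * 60 * 60))                       # hypothetical: wait 3 hours between batches
done
```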
Besides the cronjob, you can manually save data by running, say, crunch.exe save Bitcoin -token <your-github-token>. This produces a .dat file, which ./pipeline.sh processes.
The list of supported cryptocurrencies is in the database file. Modify it to include your own, and type make again to update the tooling. You can then run crunch.exe save <My Crypto> -token ...
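For example, the end-to-end flow for a newly added currency might look like this (MyCoin is a hypothetical name; use whatever you registered in the database file):

```sh
# MyCoin is a hypothetical entry added to the db.ml database file.
make                                                  # rebuild so the new entry is picked up
./crunch.exe save MyCoin -token <your-github-token>   # writes a .dat snapshot for MyCoin
```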
Processing data
If you want more control over data processing than ./pipeline.sh provides, you can use crunch.exe load. You can generate a CSV file from a .dat file with a command like:
crunch.exe load -no-forks -csv -with-ranks <ranks.json file from CoinMarketCap> -with-date <DD-MM-YYYY> <some-dat-file>.dat
A similar command is used in the csv-of-dat.sh script to generate the MSR data set.
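As a rough illustration of processing a whole day of snapshots (this is a sketch, not the actual csv-of-dat.sh; the ranks.json path and the date are placeholders):

```sh
# Convert every .dat snapshot for one day into CSV output.
# The ranks.json location and the date below are placeholders for this sketch;
# csv-of-dat.sh in this repository does the real work for the MSR data set.
for dat in datastore/2018-10-10/*.dat; do
  ./crunch.exe load -no-forks -csv \
    -with-ranks path/to/ranks.json \
    -with-date 10-10-2018 \
    "$dat"
done
```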
You can generate aggregate values by running ./crunch.exe aggregate on some directory containing .dat files. This will create .agg files, which can be used to generate the web view.
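For example, assuming the directory is passed as a plain argument and using the dated datastore layout from the next section:

```sh
# Aggregate all .dat snapshots for one day; this should leave .agg files
# behind for the web view (the exact invocation is an assumption here).
./crunch.exe aggregate datastore/2018-10-10
```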
Generating the web view
The ./deploy.sh script builds a static site. If you want to create the web view for a particular date, say Oct 10, 2018 (a directory containing .dat files), simply run ./deploy.sh datastore/2018-10-10. This will generate a web view in docs.
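To preview the generated site locally, any static file server will do; for example:

```sh
# Serve the generated docs/ directory at http://localhost:8000
cd docs
python3 -m http.server 8000
```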
Recreating the MSR dataset from the raw data
Create a directory called datastore. Download and untar the raw data file in this directory.
In the toplevel of this repository, run ./pipeline.sh <N>, where N is the number of parallel jobs (this speeds up processing). You can ignore any warnings/errors. Once finished, you'll have generated .csv files in the toplevel directory.
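Put together, the steps look roughly like this (the archive name is a placeholder for whatever raw data file you downloaded, and 4 is an arbitrary job count):

```sh
mkdir datastore
cd datastore
tar xf <raw-data-archive>   # placeholder for the downloaded raw data file
cd ..
./pipeline.sh 4             # 4 parallel jobs; adjust to your machine
# .csv files end up in the toplevel directory
```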
Feel free to add your own data in the datastore (for some date), and rerun ./pipeline.sh.