Home

Awesome

Clubmark (former PyCABeM) - Bench bench mark icon marking Framework for the Clustering Algorithms Evaluation

\brief Benchmarking and real-time profiling of the [overlapping] clustering algorithms working on graphs specified by pairwise relations (community detection algorithms). The quality evaluation is performed using extrinsic (all existing accuracy measures applicable for overlapping clustering algorithms on large datasets: Mean F1 Score family, Normalized Mutual Information(NMI) and Omega Index / fuzzy ARI family) and some intrinsic (Modularity(Q) and Conductance(f)) measures considering overlaps/covers (shared node membership by multiple clusters [on the same resolution level]) and multiple resolutions (the same node can be a full member of a cluster and its super-clusters).
\authors (c) Artem Lutov artem@exascale.info
\license Apache 2.0, optional commercial support and relicensing are provided by the request (write me)

The evaluating clustering algorithms (located in algorithms/) have own, possibly proprietary licenses, they are a) not a part of and b) not covered by the terms of the Clubmark framework.

\organizations eXascale Infolab, Lumais, ScienceWise
\keywords overlapping clustering benchmarking, community detection benchmarking, algorithms benchmarking framework.

\cite Clubmark: a Parallel Isolation Framework for Benchmarking and Profiling Clustering Algorithms on NUMA Architectures

@inproceedings{Clb18,
	author={Artem Lutov and Mourad Khayati and Philippe Cudr{\'e}-Mauroux},
	title={Clubmark: a Parallel Isolation Framework for Benchmarking and Profiling Clustering Algorithms on NUMA Architectures},
	booktitle={2018 IEEE International Conference on Data Mining Workshops (ICDMW)},
	year={2018},
	pages = {1481--1486},
	doi = {10.1109/ICDMW.2018.00212},
	keywords={benchmarking framework, clustering evaluation, parallel benchmarking, algorithm profiling, community detection benchmarking, constraint-aware load-balancing}
}

Content

Fast Start Tutorial

On Linux Ubuntu 16.04 LTS x64:

  1. Clone this repository: $ git clone https://github.com/eXascaleInfolab/clubmark.git
  2. Install dependencies ($ install_reqs.sh) and optionally fetch the SNAP networks with ground-truth communities (described in install_reqs.sh).
  3. Prepare system environment for the load balancer: $ prepare_hostenv.sh
  4. Run the benchmark from the cloned repository: $ python3 ./benchmark.py -w=0.0.0.0:8080 -g=3%3 -i%3=realnets/*.nse -a="CggciRg Daoc DaocAR DaocX LouvainIg Pscan Scd CggcRg DaocA DaocR Ganxis Oslom2 Randcommuns Scp" -r -q="Xmeasures -fh -o -s" -s -t=8h --runtimeout=15d 1>> ./bench.log 2>> ./bench.err
    • -w=0.0.0.0:8080 - deploy profiling Web UI at 0.0.0.0:8080
    • -g=3%3 - generate 4 instances of each kind of the synthetic network (weighted undirected graph with overlapping clusters in the ground-truth) and additionally shuffle (reorder links and nodes of) each instance 3 times
    • -i%3=realnets/*.nse - additionally process [real-world] networks from ./realnets (the networks should be put there in advance, see ./prepare_snapdata.sh)
    • -a=... - use specified clustering algorithms
    • -r - run (execute) the clustering algorithms (specified with -a, otherwise all available), produces sets (hierarchy / multiple levels of) clusters and execution profiles in ./results/<algname>
    • -q=... - evaluate produced clusters of the specified algorithms (-a) with the specified quality measures, produces (raw) evaluation results in ./results/qmeasures/qmeasures_*.h5

      Use utils/hdf5ToTxt.py script to convert qmeasures/qmeasures_*.h5 files to the textual representation (some HDF5 viewers do not support the latest HDF5 storage drivers used here)

    • -s - aggregate and summarize the quality evaluation results, produces ./results/qmeasures/aggqms_*.h5 from ./results/qmeasures/qmeasures_*.h5
    • -t=8h - limit execution time for each algorithm / evaluation measure on each dataset to 8 hours
    • --runtimeout=15d - limit execution time for the whole benchmarking to 15 days
    • 1>> ... output execution log (stdout tracing) to the ./bench.log
    • 2>> ... output execution log (stderr tracing) to the ./bench.err
  5. Look at the evaluation results in the ./results/ directory

To run the benchmark on other [POSIX-compatible] platfroms / operating systems, either the docker container for the default environment should be used, or all clustering algorithms and evaluating measures should be recompiled for the target platform.

Overview

The benchmark executes specified applications (clustering algorithms) on the specified or generated input datasets (networks - graphs specified by the pairwise relations), measures execution statistics and evaluates quality (including accuracy) of the results using specified measures. Clubmark is a general-purpose modular benchmarking framework specialized for the clustering (community detection) algorithms evaluation. The generic functionality is based on PyExPool multiprocess execution pool and load balancer. <!-- (the only another open-source load balancer I'm aware about, which has sufficient features to power Clubmark is [Slurm Workload Manager](https://slurm.schedmd.com/overview.html)). -->

Generic properties:

Properties specific for the clustering algorithms benchmarking:

Executing algorithms and evaluation measures with the accessory processing tools are described in the respective pages.

Clubmark Demo Poster for ICDM18: Clubmark_Poster-w1024

Motivation

I have not found any open source cross-platform framework for the [efficient] execution and evaluation of custom applications having significant variation of the time/memory complexity and custom constraints, so decided to write the own one.

Particularly, I had to evaluate various clustering (community detection) algorithms on large networks using specific measures. The main challenges are the following:

Ideally, the applications should be executed in parallel in a way to guarantee that they:

Before starting Clubmark development I found several open source frameworks for the "Community Detection Algorithms" evaluation but they do not fulfill the outlined constraints. The most comprehensive one is Circulo from Lab41, another one is called CommunityEvaluation. Circulo is an excellent framework until you don't run evaluations on the large networks, don't need to specify per-algorithm time/memory constraints and in case the default pipeline is sufficient, which was not the case for me.

Requirements

The benchmarking can be run either directly on the Linux Ubuntu 16.04 x64 or via the Docker container with the pre-installed environment on any platform. Anyway, the sources are required:

$ git clone https://github.com/eXascaleInfolab/clubmark.git

Prerequisites

The benchmarking framework itself is a cross-platform application implemented on Python, and works at least on CPython 2/3 and Pypy interpreters. However, the benchmark executes clustering algorithms and evaluation utilities built for the specific platform. The build is performed for the Linux Ubuntu 16.04 x64, on other NIX systems some requirements might be missed and not easily solvable. Docker image is prepared to run the benchmark from the docker container on any other platform without any issues.

Windows 10+ x64 provides Ubuntu-compatible bash shell, which allows to install and execute terminal Ubuntu apps and execute the benchmarking like on Linux Ubuntu 16.04 x64.

The subsequent steps are described for the NIX platforms including MaxOS. To be sure that the operational system allows to work with lots of opened files and has adequate swapping policy, execute:

$ . ./prepare_hostenv.sh

Alternatively, the operational system environment can be updated permanently.

To setup fs.file-max permanently in the system add the following line to the /etc/sysctl.conf:

fs.file-max = 1048576

and then reload it with # sysctl -p. To setup the ulimit permanently add the following lines to the /etc/security/limits.conf:

*               hard    nofile          524288
*               soft    nofile          32768

And then execute ulimit -n 32768 to set this value for the current shell.

ulimit can't be set higher than the hard limit ulimit -Hn, so if the latter < 32768 then execute:

$ ulimit -n `ulimit -Hn`

Reduce the system swappiness setting to 1..10 by $ sudo sysctl -w vm.swappiness=5 or set it permanently in /etc/sysctl.conf:

vm.swappiness = 5

Deployment via Docker

This section is optional if your host OS is Linux Ubuntu 16.04 x64, which makes possible to run the benchmarking directly.

The Docker can be installed on the Linux Ubuntu 16.04 executing:

$ sudo apt-get update && apt-get upgrade
$ sudo apt-get install -y docker.io

To install the Docker on any other platform refer the official installation instructions.

It is recommended to use overlay2 storage driver on any OS, see details in the official documentation. overlay2 requires Linux kernel v4.x, which can be updated from 3.x to 4.x on Red Hat / CentOS 7 / Scientific Linux as described in this article.

Add your user to the docker group to use it without sudo:

$ sudo groupadd docker
$ sudo usermod -aG docker $USER

Log out and log in back so that your group membership is re-evaluated, or execute:

$ su - $USER

Optionally, configure Docker to start on boot:

$ sudo systemctl enable docker

and check other docker post-installation steps.

Start Docker service:

$ sudo systemctl start docker
$ docker version

See also the brief tutorial on Docker installation and usage or the official getting started tutorial.

Optionally, the clubmark docker image containing the execution environment can be built from the source Dockerfile by executing from the repository directory:

$ docker build -t luaxi/clubmark-env:v3.0-U16.04 -t luaxi/clubmark-env .

The prebuilt image will be automatically pulled from the Docker Hub repository on first docker run if it has not been built locally.

To run the benchmark you can either execute:

$ docker run -it -u $UID -v `pwd`:/opt/clubmark luaxi/clubmark-env:v3.0-U16.04 [<clubmark_args>]

which calls ./benchmark.py [<clubmark_args>], or open a shell in the benchmarking directory:

$ docker run -it --entrypoint "" -u $UID -v `pwd`:/opt/clubmark luaxi/clubmark-env:v3.0-U16.04

and call ./benchmark.py with the required parameters there.

$UID might not be defined in the non-bash shell (sh, etc), then use id -u $USER instead. `pwd` projects to <CLUBMARK_REPOSITORY_PATH>, which is the current directory and the working directory of the benchmarking. /opt/clubmark or any other local directory denotes th e directory in the CONTAINER where the benchmarking results and traces are stored, which is mapped to the current host directory (`pwd`). See also Docker cheat sheet.

See the Usage section to learn more about the benchmark execution and results structure.

Direct Deployment

This section should be omitted if the benchmarking is run on the docker container.

The target environment is Linux Ubuntu 16.04 x64. To install all the requirements locally, execute from the Clubmark repository directory:

$ ./install_reqs.sh

See install_reqs.sh and pyreqs.txt for the details about the installing packages.

The benchmarking framework can be executed under Python 2.7+/3.x and verified on CPython and PyPy JIT. Some executing algorithms support only Python2 / pypy, others both Python3 and Python2. The appropriate interpreter for each executable is automatically selected in the runtime. The recommended environment, which is installed by the script is both Python3 and pypy. See also dependencies of the utilities, which are installed automatically.

Usage

There are two main scripts to execute the benchmark:

Note: Execution of the benchmark has been verified only on Linux Ubuntu 16.04 x64, but it should work on any platform where the dependences are satisfied.

The benchmark executes specified applications on the specified datasets in interactive or daemon mode logging the resources consumption, algorithms output and exceptions, and performing the workflow management (termination by timeout, resistance to exceptions, etc.) and results aggregation. All results and logs are stored even in case of the internal (crash) or external termination of the benchmarking applications or the whole framework itself.

<!-- #region BenchParams -->

To see the possible input parameters, run the benchmark without the arguments or with -h: $ ./benchmark.py:

$ ./benchmark.py
Usage:
  ./benchmark.py [-g[o][a]=[<number>][%<shuffles_number>][=<outpdir>] [-i[f][a][%<shuffles_number>]=<datasets_{dir,file}_wildcard> [-c[f][r]] [-a=[-]"app1 app2 ..."] [-r] [-q[="qmapp [arg1 arg2 ...]"]] [-s[p][*][[{-,+}]=<alg>[/<qmeasure1>,<qmeasure2>,...][:<net1>,<net2>,...][;<alg>...]]] [-t[{s,m,h}]=<timeout>] [-d=<seed_file>] [-w=<webui_addr>] | -h

Example:
  ./benchmark.py -w=0.0.0.0:8080 -g=3%5 -r -q="Xmeasures -fh -O -s" -s* -th=2.5 1> bench.log 2> bench.err
NOTE:
  - The benchmark should be executed exclusively from the current directory (./).
  - The expected format of input datasets (networks) is .ns<l> - network specified by <links> (arcs / edges), a generalization of the .snap, .ncol and Edge/Arcs Graph formats.
  - Paths can contain wildcards: *, ?, +.
  - Multiple paths can be specified with multiple -i, -s options (one per the item).

Parameters:
  --help, -h  - show this usage description
  --generate, -g[o][a]=[<number>][%<shuffles_number>][=<outpdir>]  - generate <number> synthetic datasets of the required format in the <outpdir> (default: syntnets/), shuffling (randomly reordering network links and saving under another name) each dataset <shuffles_number> times (default: 0). If <number> is omitted or set to 0 then ONLY shuffling of <outpdir>/networks//* is performed. The generated networks are automatically added to the begin of the input datasets.
    o  - overwrite existing network instances (old data is backed up) instead of skipping generation
    a  - generate networks specified by arcs (directed) instead of edges (undirected)
NOTE: shuffled datasets have the following naming format:
	<base_name>[(seppars)<param1>...][^<instance_index>][%<shuffle_index>].<net_extension>
  --input, -i[f][a][%<shuffles_number>]=<datasets_dir>  - input dataset(s), wildcards of files or directories, which are shuffled <shuffles_number> times. Directories should contain datasets of the respective extension (.ns{e,a}). Default: -i=syntnets/networks/*/, which are subdirs of the synthetic networks dir without shuffling.
    f  - make flat derivatives on shuffling instead of generating the dedicated directory (having the file base name) for each input network, might cause flooding of the base directory. Existed shuffles are backed up.
    NOTE: variance over the shuffles of each network instance is evaluated only for the non-flat structure.
    a  - the dataset is specified by arcs (asymmetric, directed links) instead of edges (undirected links), considered only for not .ns{a,e} extensions.
NOTE:
  - The following symbols in the path name have specific semantic and processed respectively: ('=', '^', '%', '#').
  - Paths may contain wildcards: *, ?, +.
  - Multiple directories and files wildcards can be specified with multiple -i options.
  - Existent shuffles are backed up if reduced, the existent shuffles are RETAINED and only the additional shuffles are generated if required.
  - Datasets should have the .ns<l> format: <node_src> <node_dest> [<weight>]
  - Ambiguity of links weight resolution in case of duplicates (or edges specified in both directions) is up to the clustering algorithm.
  --apps, -a[=[-]"app1 app2 ..."]  - apps (clustering algorithms) to be applied, default: all.
Leading "-" means apply all except the specified apps. Available apps (24): CggcRg, CggciRg, Daoc, DaocA, DaocAR, DaocARB, DaocARB1, DaocARB5, DaocB, DaocB1, DaocB5, DaocR, DaocRB, DaocRB1, DaocRB5, DaocRBX, DaocX, Ganxis, LouvainIg, Oslom2, Pscan, Randcommuns, Scd, Scp.
Impacts {r, q} options. Optional, all registered apps (see benchapps.py) are executed by default.
NOTE: output results are stored in the "results/<algname>/" directory
  --runapps, -r  - run specified apps on the specified datasets, default: all
  --quality, -q[="qmapp [arg1 arg2 ...]"  - evaluate quality (accuracy) with the specified quality measure application (<qmapp>) for the algorithms (specified with "-a") on the datasets (specified with "-i"). Default: MF1p, GNMI_max, OIx extrinsic and Q, f intrinsic measures on all datasets. Available qmapps (4): Gnmi, Imeasures, Onmi, Xmeasures.
NOTE:
  - Multiple quality measure applications can be specified with multiple -q options.
  - Existent quality measures with the same seed are updated (extended with the lacking evaluations omitting the already existent) until --quality-revalue is specified.
Notations of the quality measurements:
 = Extrinsic Quality (Accuracy) Measures =
   - GNMI[_{max,sqrt}]  - Generalized Normalized Mutual Information for overlapping and multi-resolution clusterings (collections of clusters), equals to the standard NMI when applied to the non-overlapping single-resolution clusterings.
   - MF1{p,h,a}[_{w,u,c}]  - mean F1 measure (harmonic or average) of all local best matches by the Partial Probabilities or F1 (harmonic mean) considering macro/micro/combined weighting.
   - OI[x]  - [x - extended] Omega Index for the overlapping clusterings, non-extended version equals to the Adjusted Rand Index when applied to the non-overlapping single-resolution clusterings.
 --- Less Indicative Extrinsic Quality Measures ---
   - F1{p,h}_[{w,u}]  - perform labeling of the evaluating clusters with the specified ground-truth and evaluate F1-measure of the labeled clusters
   - ONMI[_{max,sqrt,avg,lfk}]  - Ovelapping NMI suitable for a single-resolution clusterings having light overlaps, the resulting values are not compatible with the standard NMI when applied to the non-overlapping clusters.
 = Intrinsic Quality Measures =
   - Cdt  - conductance f for the overlapping clustering.
   - Q[a]  - [autoscaled] modularity for the overlapping clustering, non-autoscaled equals to the standard modularity
 when applied to the non-overlapping single-resolution clustering.
  --timeout, -t=[<days:int>d][<hours:int>h][<minutes:int>m][<seconds:float>] | -t[X]=<float>  - timeout for each benchmarking application per single evaluation on each network; 0 - no timeout, default: 1d12h. X option:
    s  - time in seconds, default option
    m  - time in minutes
    h  - time in hours
    Examples: `-th=2.5` is the same as `-t=2h30m` and `--timeout=2h1800`
  --quality-noupdate  - always create a new storage file for the quality measure evaluations or aggregations instead of updating the existent one.
NOTE: the shape of the updating dataset is retained, which results in distinct semantics for the evaluations and aggregations when if applied on the increased number of networks:
  1. The raw quality evaluation dataset has multi-dimensional fixed shape, which results in omission out of bound values logging these omissions.
  2. The quality metrics aggregation dataset has a single-dimensional resizable shape, so the absent networks are appended with the respective values.
  --quality-revalue  - evaluate resulting clusterings with the quality measures or aggregate the resulting raw quality measures from scratch instead of retaining the existent values (for the same seed) and adding only the non-existent.
NOTE: actual (makes sense) only when --quality-noupdate is NOT applied.
  --seedfile, -d=<seed_file>  - seed file to be used/created for the synthetic networks generation, stochastic algorithms and quality measures execution, contains uint64_t value. Default: results/seed.txt.
NOTE:
  - The seed file is not used on shuffling, so the shuffles are DISTINCT for the same seed.
  - Each re-execution of the benchmarking reuses once created seed file, which is permanent and can be updated manually.

Advanced parameters:
  --convret, -c[X]  - convert input networks into the required formats (app-specific formats: .rcg[.hig], .lig, etc.), deprecated
    f  - force the conversion even when the data is already exist
    r  - resolve (remove) duplicated links on conversion (recommended to be used)
  --summary, -s[p][*][[{-,+}]=<alg>[/<qmeasure1>,<qmeasure2>,...][:<net1>,<net2>,...][;<alg>...]]  - summarize evaluation results of the specified algorithms on the specified networks extending the existent quality measures storage considering the specified update policy. Especially useful to include extended evaluations into the final summarized results.
    p  - plot the aggregated results to the <aggqms>.png
    *  - aggregate all available quality evaluations besides the one matching the seed
    -/+  - filter inclusion prefix: "-" to filter out specified data (exclude) and "+" (default) to filter by (include only such data).
    <qmeasure>  - quality measure in the format:  <appname>[:<qmetric>][+u], for example "Xmeasures:MF1h_w+u", where "+u" denotes representative clusters fetched from the multi-resolution clustering and represented in a single level.
  - --quality-noupdate and --quality-revalue options are applied
  --webaddr, -w  - run WebUI on the specified <webui_addr> in the format <host>[:<port>], default port=8080.
  --runtimeout  - global clustering algorithms execution timeout in the format [<days>d][<hours>h][<minutes>m<seconds>], default: 10d.
  --evaltimeout  - global clustering algorithms execution timeout in the format [<days>d][<hours>h][<minutes>m<seconds>], default: 5d.
<!-- #endregion BenchParams -->

REPRODUCIBILITY NOTICE: Use seed to reproduce the evaluations but be aware that:

It is possible to have multiple input directories with similarly named files inside, which represent different instances / snapshots of the datasets. In such case, the results are produced per each snapshot, plus aggregated weighted average over all snapshots. This is useful to avoid occasional bias to the specific instance or to analyze evolving networks. If any application is crashed, the crash is logged and does not affect execution of the remaining applications. The benchmark can be terminated by the timeout or manually. automatic extension / back up of the already existent results to the timestamped .gzip archives on the benchmarking re-execution. logging of traces (stdout) and errors (stderr) (outputs of the executables and their runtime statistics) for each executable and for the benchmarking framework itself;

Synthetic networks generation, clustering algorithms execution and evaluation

To speed up generation of the synthetic networks, run the benchmarking under the PyPy:

$ pypy ./benchmark.py -g=3%2=syntnets_i3_s4 -a="scp oslom2" -r -q="Xmeasures -fh -s" -s -tm=90

This command generates synthetic networks producing 3 instances of each network with 2 shuffles (random reordering of network nodes) of each instance yielding 3*2=6 synthetic networks of each type (for each set of the network generation parameters). The generated networks are stored in the specified ./syntnets_i3_s4/ directory. scp and oslom2 clustering algorithms are executed for each generated network, default quality measures are evaluated for these algorithms with subsequent aggregation of the results over all instances and shuffles of each network. The timeout is set to 90 min for the following actions: networks generation and then for each network a) clustering and b) evaluation by each specified measure. The network here is each shuffle of each instance of each network type.

Shuffling of the existent networks, clustering algorithm execution and evaluation

$ pypy ./benchmark.py -i%5='realnets' -cr -a="daoc" -r -qenx -th=1

This command makes 5 shuffles for each network in the ./realnets/ directory (already existent shuffles of this network are backed up), converts networks to the .rcg format, executes daoc algorithm for all shuffles and evaluates NMI_max with timeout for each activity equal to 1 hour.

Already existent shuffles / clusterings / evaluations are backed up to the timestamped .gzip archive.

Aggregation of the specified evaluation results

$ pypy benchmark.py -s

Results aggregation is performed with automatic identification of the target clustering algorithm and evaluation measure taken from the already produced raw evaluations.

Benchmark Structure

Example of the <entity>.rcp format:

# ExecTime(sec)	CPU_time(sec)	CPU_usr(sec)	CPU_kern(sec)	RSS_RAM_peak(Mb)	Rcode TaskName
2.575555	2.574302	2.540420	0.033882	6.082	0 5K5
0.528582	0.528704	0.519277	0.009427	3.711	0 2K10
...

qmeasures_<seed>.h5 format:

<algname>
  <netname>
    <measure>
      Array(<instance_id>, <shuffle_id>[, <level_id>, <run_id>]): float32

aggqms_<seed>.h5 format:

<measure>
  <algname>
      Array(<netid>): (avg: float32, sd: float32, conf: float32)

where <netid> to the <netname> mapping is specified in the nets attribute of the aggqms_<seed>.h5 HFD5 dataset, avg is the average value, sd is the standard deviation and conf is the confidence (≤ 1, how many instances are successfully processed by the app among the all processed instances).

Structure of the generating synthetic networks:

<syntnets>/
  lastseed.txt
  seeds/
    <netname>.ngs
  params/
    <netname>.ngp
  networks/
    <netname>/
      <netname>^<instid>.{cnl,nse, nst}
      <netname>^<instid>/  # Optional, exists only when the shuffling is performed
        <netname>^<instid>.nse  # Hardlink
        <netname>^<instid>%<shufid>.nse

where <instid> is id of the synthetic network instance with the fixed parameters and <shufid> is id of the shuffle of the instance.

Expected structure of the input real-world networks with generating shuffles:

<realnets>/
    <netname>.{cnl, nse}
    <netname>/  # Optional, exists only when the shuffling is performed
        <netname>.nse  # Hardlink
        <netname>%<shufid>.nse

Benchmark Extension

To add custom apps / algorithms to be benchmarked just add corresponding function for "myalgorithm" app to benchapps.py:

def execMyalgorithm(execpool, netfile, asym, odir, timeout, seed=None, task=None, pathidsuf='', workdir=ALGSDIR)
  """Execute the algorithm (stub)

  execpool: ExecPool  - execution pool of worker processes
  netfile: str  - the input network to be clustered
  asym: bool  - whether the input network is asymmetric (directed, specified by arcs)
  odir: bool  - whether to output results to the dedicated dir named by the network instance name,
    which is actual for the shuffles with non-flat structure
  timeout: ufloat32  - processing (clustering) timeout of the input file, 0 means infinity
  seed: uint64 or None  - random seed, uint64_t
  task: Task  - owner task
  pathidsuf: str  - network path id prepended with the path separator
  workdir: str  - relative working directory of the app, actual when the app contains libs

  return  njobs: uint  - number of performed executions (started jobs)
  """

All the evaluations will be performed automatically, the algorithm should just follow conversion of the execution results output.

Related Projects

See also eXascale Infolab GitHub repository and our website where you can find another projects and research papers related to Big Data processing! Note: Please, star this project if you use it.

<!-- ## HDF5 File Format The HDF5 storage uses a memory-mapped file (core driver) results/qmeasures_<seed>.h5 (.hdf5) as a physical storage and holds the userblock consisting of the seed and timestamps of the updates. The storage format, where size of the working set among the allocated is stored in the attributes: // Note: rescons is formed on the storage file creation rescons.inf // Resource names: vector<utf8str>, associates name with the index // Note: nets can be formed and extended dynamically nets.inf // Net names with pathid: vector<utf8str>, associates basename with the index // HDF groups, netbasedir: {relnets, syntnets/networks, ...} /algs/<algorithm{levs}>/<net[#pathid]{insts,shufs}>/ // Note: Attributes are specified inside {} <qmname{runs}>[:<metric_name>]^insts%shufs[+u].dat // NetAlgParam: Vector<utf8str>, // Data is a Multi-dim array: // (iinst)[(ishuf)][(ilev)][(qmirun)]: float4 // float4 is 32 bit float; // (iins)[(ishf)][(ilev)][(qmirun)]: float4 // float4 is 32 bit float; // levs number (typically fixed, NOTE: “Data does not “rearrange” itself as it does when resizing a NumPy array.”) is defined by <algorithm>, qmiruns number by <qmname>[:<metric_name>]; // ilev is the index of the respective level file name in the ordered list of levels; // +u - a multi-resolution clustering (actual for DAOC representative levs, nlev is 1) // // Note: rescons and levs are formed either on the post-processing after the levs refined // or on the qms computing // Aggregated quality measures storage aggqms[%maxins]_<id>.h5: /<measure>/ <alg>.dat // Data format: <net[#pathid]>Enum: (avgares)(sdval)(conf=rins*rshf*rrun) -->