Software Metadata Extraction Framework (SOMEF)

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

<img src="docs/logo.png" alt="logo" width="150"/>

A command line interface for automatically extracting relevant metadata from code repositories (readme, configuration files, documentation, etc.).

Demo: See a demo running somef as a service, through the SOMEF-Vider tool.

Authors: Daniel Garijo, Allen Mao, Miguel Ángel García Delgado, Haripriya Dharmala, Vedant Diwanji, Jiaying Wang, Aidan Kelley, Jenifer Tabita Ciuciu-Kiss and Luca Angheluta.

Features

Given a readme file (or a GitHub/GitLab repository), SOMEF will extract the following categories (if present), listed in alphabetical order:

We use different supervised classifiers, header analysis, regular expressions and the GitHub/GitLab API to retrieve all these fields (more than one technique may be used for each field). Each extraction records its provenance, with the confidence and technique used at each step. For more information, check the output format description.
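Because each extraction carries a confidence and a technique, results can be filtered after the fact. The sketch below is illustrative only: it assumes each category maps to a list of entries with `confidence`, `technique` and `result` keys, and the sample data is made up. The output format description remains the authoritative reference for the real layout.

```python
# Hypothetical sample shaped like a SOMEF result.
# The field names and values are assumptions for illustration.
sample = {
    "description": [
        {"result": {"value": "A metadata extractor."},
         "confidence": 0.95, "technique": "supervised_classification"},
        {"result": {"value": "Unrelated sentence."},
         "confidence": 0.4, "technique": "supervised_classification"},
    ],
    "license": [
        {"result": {"value": "MIT"},
         "confidence": 1.0, "technique": "GitHub_API"},
    ],
}


def filter_by_confidence(results: dict, minimum: float) -> dict:
    """Keep only entries whose confidence is at least `minimum`."""
    kept = {}
    for category, entries in results.items():
        if not isinstance(entries, list):
            continue  # ignore fields that are not lists of extractions
        strong = [
            entry for entry in entries
            if isinstance(entry, dict) and entry.get("confidence", 0) >= minimum
        ]
        if strong:
            kept[category] = strong
    return kept


confident = filter_by_confidence(sample, 0.8)
for category, entries in confident.items():
    for entry in entries:
        print(category, entry["technique"], entry["result"]["value"])
```

Filtering after extraction makes it possible to run SOMEF once with a permissive threshold and compare stricter cut-offs without re-running the analysis.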

Documentation

See full documentation at https://somef.readthedocs.io/en/latest/

Cite SOMEF:

Journal publication (preferred):

@article{10.1162/qss_a_00167,
    author = {Kelley, Aidan and Garijo, Daniel},
    title = "{A Framework for Creating Knowledge Graphs of Scientific Software Metadata}",
    journal = {Quantitative Science Studies},
    pages = {1-37},
    year = {2021},
    month = {11},
    issn = {2641-3337},
    doi = {10.1162/qss_a_00167},
    url = {https://doi.org/10.1162/qss_a_00167},
    eprint = {https://direct.mit.edu/qss/article-pdf/doi/10.1162/qss\_a\_00167/1971225/qss\_a\_00167.pdf},
}

Conference publication (first):

@INPROCEEDINGS{9006447,
author={A. {Mao} and D. {Garijo} and S. {Fakhraei}},
booktitle={2019 IEEE International Conference on Big Data (Big Data)},
title={SoMEF: A Framework for Capturing Scientific Software Metadata from its Documentation},
year={2019},
doi={10.1109/BigData47090.2019.9006447},
url={http://dgarijo.com/papers/SoMEF.pdf},
pages={3032-3037}
}

Requirements

SOMEF has been tested on Unix, macOS and Microsoft Windows operating systems.

If you face any issues when installing SOMEF, please make sure you have installed the following packages: build-essential, libssl-dev, libffi-dev and python3-dev.

Install from PyPI

SOMEF is available on PyPI. To install it, run:

pip install somef

Install from GitHub

To run SOMEF from the GitHub repository, follow these steps:

Clone this GitHub repository

git clone https://github.com/KnowledgeCaptureAndDiscovery/somef.git

SOMEF uses Poetry for dependency management, so Poetry must be installed beforehand. It can be installed as follows:

curl -sSL https://install.python-poetry.org | python3 -

This option is recommended over installing Poetry with pip install.

Poetry will now handle the installation of SOMEF and all the dependencies declared in the TOML file.

Check that Poetry is installed correctly:

poetry --version

Optionally, review the list of libraries and dependencies the project requires:

poetry show

Install SOMEF and all its dependencies:

poetry install

The following command lists the environments available in the project and shows which one is currently active:

poetry env list

The following command enters the virtual environment created by Poetry. Once inside the environment, you can run the SOMEF installation test described below.

poetry shell

Test SOMEF installation

somef --help

If everything goes fine, you should see:

Usage: somef [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help  Show this message and exit.

Commands:
  configure  Configure credentials
  describe   Running the Command Line Interface
  version    Show somef version.
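The same check can be scripted, for example in a setup script. This is a minimal sketch using only the Python standard library; it assumes the installed distribution is named `somef`, as in the `pip install somef` command, and `somef_install_status` is a hypothetical helper, not part of SOMEF.

```python
import shutil
from importlib import metadata


def somef_install_status() -> dict:
    """Report where the `somef` executable is and which version is installed."""
    try:
        version = metadata.version("somef")
    except metadata.PackageNotFoundError:
        version = None  # package not installed in this environment
    return {"executable": shutil.which("somef"), "version": version}


status = somef_install_status()
print(status)
```

If either value is `None`, the active environment is probably not the one where SOMEF was installed.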

Installing through Docker

We provide a Docker image with SOMEF already installed. To run through Docker, you may build the Dockerfile provided in the repository by running:

docker build -t somef .

Or just use the Docker image already built in DockerHub:

docker pull kcapd/somef

Then, to run your image just type:

docker run -it kcapd/somef /bin/bash

And you will be ready to use SOMEF (see section below). If you want to have access to the results, we recommend mounting a volume. For example, the following command will mount the current directory as the /out folder in the Docker container:

docker run -it --rm -v $PWD/:/out kcapd/somef /bin/bash

If you move any files produced by somef into /out, then you will be able to see them in your current directory.

Configure

Before running SOMEF for the first time, you must configure it appropriately (you only need to do this once). Run:

somef configure

And you will be asked to provide the following:

If you want SOMEF to be automatically configured (without a GitHub authentication key and using the default classifiers), just type:

somef configure -a

To show help for the available options, run:

somef configure --help

Which displays:

Usage: somef configure [OPTIONS]

  Configure GitHub credentials and classifiers file path

Options:
  -a, --auto  Automatically configure SOMEF
  -h, --help  Show this message and exit.

Updating SOMEF

If you update SOMEF to a newer version, we recommend that you configure the library again (by running somef configure). The rationale is that different versions may rely on classifiers that are stored in a different path.

Usage

$ somef describe --help
  SOMEF Command Line Interface
Usage: somef describe [OPTIONS]

  Running the Command Line Interface

Options:
  -t, --threshold FLOAT           Threshold to classify the text  [required]
  Input: [mutually_exclusive, required]
    -r, --repo_url URL            Github/Gitlab Repository URL
    -d, --doc_src PATH            Path to the README file source
    -i, --in_file PATH            A file of newline separated links to GitHub/
                                  Gitlab repositories

  Output: [required_any]
    -o, --output PATH             Path to the output file. If supplied, the
                                  output will be in JSON

    -c, --codemeta_out PATH       Path to an output codemeta file
    -g, --graph_out PATH          Path to the output Knowledge Graph export
                                  file. If supplied, the output will be a
                                  Knowledge Graph, in the format given in the
                                  --format option chosen (turtle, json-ld)

  -f, --graph_format [turtle|json-ld]
                                  If the --graph_out option is given, this is
                                  the format that the graph will be stored in

  -p, --pretty                    Pretty print the JSON output file so that it
                                  is easy to compare to another JSON output
                                  file.

  -m, --missing                   The JSON will include a field
                                  somef_missing_categories to report with the
                                  missing metadata fields that SOMEF was not
                                  able to find.

  -kt, --keep_tmp PATH            SOMEF will NOT delete the temporary folder
                                  where files are stored for analysis. Files
                                  will be stored at the
                                  desired path


  -h, --help                      Show this message and exit.
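The -i option above expects a file of newline separated repository links. A minimal sketch of preparing such a file follows; the file name `repos.txt` and the helper `write_repo_list` are hypothetical, and the trailing newline is an assumption about what the reader tolerates.

```python
import tempfile
from pathlib import Path

repositories = [
    "https://github.com/dgarijo/Widoco/",
    "https://github.com/KnowledgeCaptureAndDiscovery/somef",
]


def write_repo_list(urls, path):
    """Write one repository URL per line, skipping blanks and duplicates."""
    unique = []
    for url in urls:
        url = url.strip()
        if url and url not in unique:
            unique.append(url)
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


in_file = Path(tempfile.gettempdir()) / "repos.txt"  # hypothetical file name
written = write_repo_list(repositories + [""], in_file)
print(in_file.read_text(encoding="utf-8"))
# The file could then be passed with the flags listed in the help text, e.g.:
#   somef describe -i <path to repos.txt> -o out.json -t 0.8
```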

Usage example:

The following command extracts all metadata available from https://github.com/dgarijo/Widoco/.

somef describe -r https://github.com/dgarijo/Widoco/ -o test.json -t 0.8
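The command above can also be assembled from a script, which helps when the same options are reused across many repositories. The sketch below only uses flags listed in the help text (-r, -o, -t, -p, -m); `build_describe_command` is a hypothetical helper, not part of SOMEF, and the actual invocation is left commented out because it requires SOMEF to be installed and configured.

```python
import subprocess


def build_describe_command(repo_url, output, threshold, pretty=False, missing=False):
    """Assemble the argument list for `somef describe`."""
    command = ["somef", "describe", "-r", repo_url, "-o", output, "-t", str(threshold)]
    if pretty:
        command.append("-p")  # pretty print the JSON output
    if missing:
        command.append("-m")  # report missing categories in the JSON
    return command


command = build_describe_command(
    "https://github.com/dgarijo/Widoco/", "test.json", 0.8, pretty=True
)
print(" ".join(command))

# Uncomment to run the extraction (needs a configured SOMEF installation):
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with URLs and paths.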

Try SOMEF in Binder with our sample notebook.

Contribute:

If you want to contribute with a pull request, please do so by submitting it to the dev branch.

Next features:

To see upcoming features, please have a look at our open issues and milestones.

Extending SOMEF categories:

To run a classifier with an additional category, add the corresponding path entry to config.json and add the category type to the category variable in cli.py. To remove an existing category, delete its path entry from config.json and remove its type from the same variable.