CASICS Extractor

<img width="100px" align="right" src=".graphics/casics-logo-small.svg">

The goal of the CASICS Extractor is to extract text and features from files, and serve them over a network API to other modules that do inference.

Authors: Michael Hucka<br> Repository: https://github.com/casics/extractor<br> License: Unless otherwise noted, this content is licensed under the GPLv3.

☀ Introduction

This module is a network server (Extractor) that is meant to run on the source code repository server. Given a repository identifier, it can return different extracted text and features from our cached copy of that source code repository. The network communication protocol is implemented using Pyro4, a very easy-to-use Python RPC library that allows complex objects to be exchanged.

The goal of Extractor is to extract text and features from files. It does some generic cleanup of what it finds, to produce a result that (hopefully) subsequent stages can use as a more effective starting point for different processes.

⧈ Interacting with the Extractor using the API

To call Extractor from a program, first import the module, then create an object of class ExtractorClient with arguments for the URI and the crypto key (in that order). After that, the methods described in the sections below will be available on the object. Example:

from extractor import ExtractorClient

extractor = ExtractorClient(uri, key)
print(extractor.get_status())

☛ Interacting with Extractor on the command line

The API provided by Extractor consists of a handful of methods on the RPC endpoint. The file extractor.py implements a simple interactive REPL interface for exploring and testing, although calling programs can also interact with the interface directly. The interactive interface in extractor.py is started by providing the URI of the server instance and a cryptographic key:

./extractor.py -k THE_KEY -u THE_URI

The values of THE_KEY and THE_URI are not stored anywhere and must be communicated separately. Once the interactive interface starts up (it is a normal IPython loop), the object extractor is the handle to the RPC interface; the available methods are described in the sections that follow.

❀ The formats of the text data returned by get_words()

The Extractor function get_words(...) returns a list of the textual words found in a repository's files. It currently understands only (1) English plain-text files, (2) files that can be converted to plain text (such as HTML and Markdown), and (3) Python code files. It takes a repository identifier and an optional argument that limits consideration to only text files or only code files.

For text files, it automatically converts some structured text files and some document formats into plain text. These are currently: HTML, Markdown, AsciiDoc, reStructuredText, RTF, TeXinfo, Textile, LaTeX/TeX, DOCX and ODT. For code files, it uses the text it finds in the (1) file header (or file docstring, in the case of Python), (2) comments in the file, and (3) documentation strings on classes and functions.
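As a rough sketch of this division of labor (the extension set and the function name `classify` are illustrative assumptions, not Extractor's actual internals), the dispatch between plain text, convertible documents, and code might look like:

```python
import os

# Illustrative set of convertible document formats; Extractor's real
# conversion logic is more extensive than a simple extension check.
CONVERTIBLE = {'.html', '.md', '.adoc', '.rst', '.rtf', '.texi',
               '.textile', '.tex', '.docx', '.odt'}

def classify(filename):
    '''Classify a file as 'code', 'convertible' structured text, or
    'plain' text, based on its file name extension.'''
    ext = os.path.splitext(filename)[1].lower()
    if ext == '.py':
        return 'code'
    if ext in CONVERTIBLE:
        return 'convertible'
    return 'plain'

print(classify('README.md'))   # → 'convertible'
print(classify('setup.py'))    # → 'code'
print(classify('notes.txt'))   # → 'plain'
```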

The list of words is processed to a limited extent before being returned. This processing is intended to balance the risk of incorrectly interpreting a word or term with the need to normalize the text to some degree.

✐ The format of the file/directory elements data returned by get_elements()

The Extractor function get_elements(...) returns a dictionary representation of all the files and subdirectories in a repository.

The outer dictionary (the value actually returned by get_elements(...)) is a wrapper with two key-value pairs: full_path, whose value is the full path to the directory on the disk, and elements, which is the actual content data.

The format of the elements structure is simple and recursive. Each element is a dictionary with at least the following key-value pairs: 'name', whose value is the name of a file or directory; 'type', whose value is either 'dir' or 'file'; and 'body', which holds the contents of the file or directory. In the case of files, the dictionary has two additional keys: 'text_language', for the predominant human language found in the text (based on the file header and comments), and 'code_language', for the language of the program (if the file contains code).

In the case of a directory, the value associated with the key 'body' is a list that can be either empty ([]) if the directory is empty, or else a list of dictionaries, each of which the same basic structure. In the case of a file, the content associated with the key 'body' can be one of four things: (1) an empty string, if the file is empty; (2) a string containing the plain-text content of the file (if the file is a non-code text file), (3) a dictionary containing the processed and reduced contents of the file (if the file is a code file), or (4) None, if the file is something we ignore. Reasons for ignoring a file include if it is a non-code file larger than 1 MB.
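To make the recursion concrete, here is a minimal sketch that walks an elements tree and yields every file element. The structure follows the description above; the helper name `iter_files` and the sample tree are ours, not part of the Extractor API.

```python
def iter_files(element):
    '''Recursively yield every 'file' element in an elements tree.'''
    if element['type'] == 'file':
        yield element
    elif element['type'] == 'dir':
        # A directory's 'body' is a (possibly empty) list of elements.
        for child in element['body']:
            yield from iter_files(child)

# A small, hypothetical elements tree matching the structure above.
tree = {'name': 'repo', 'type': 'dir', 'body': [
    {'name': 'README.md', 'type': 'file', 'body': 'Plain text.',
     'text_language': 'en', 'code_language': None},
    {'name': 'src', 'type': 'dir', 'body': [
        {'name': 'big.dat', 'type': 'file', 'body': None,  # ignored file
         'text_language': None, 'code_language': None},
    ]},
]}

print([f['name'] for f in iter_files(tree)])
# → ['README.md', 'big.dat']
```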

Altogether, this leads to the following possibilities:

- For a directory, 'body' is either an empty list or a list of element dictionaries.
- For a file, 'body' is an empty string, a string of plain-text content, a dictionary of extracted code elements, or None if the file is ignored.

When it comes to non-code text files, if the file is not literally plain text, Extractor extracts the text from it. It currently converts the following formats: HTML, Markdown, AsciiDoc, reStructuredText, RTF, Textile, and LaTeX/TeX. It does this by using a variety of utilities such as BeautifulSoup to convert the formats to plain text, and returns this as a single string. In the case of a code file, the value associated with the 'body' key is a dictionary of elements described in more detail below.

The text language inside files is inferred using langid, and the value for the key text_language is a two-letter ISO 639-1 code (e.g., 'en' for English). The language inference is not perfect, particularly when a file contains little text, but by examining all the text chunks in a file (including all the separate comments) and returning the most frequently inferred language, Extractor can do a reasonable job. If there is no text at all (no headers, no comments), Extractor defaults to 'en', because programming languages themselves are usually written in English.
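The majority-vote step can be sketched as follows. This is a simplification: the real system obtains a per-chunk prediction from langid, while here the predictions are passed in directly.

```python
from collections import Counter

def infer_text_language(chunk_languages):
    '''Return the most frequently inferred language among the text
    chunks, defaulting to 'en' when there are no text chunks at all.'''
    if not chunk_languages:
        return 'en'
    return Counter(chunk_languages).most_common(1)[0][0]

# One chunk was misidentified as German, but the majority wins.
print(infer_text_language(['en', 'de', 'en']))  # → 'en'
# A file with no headers or comments falls back to English.
print(infer_text_language([]))                  # → 'en'
```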

☺︎ More information about the parsed file contents

The file parser attempts to reduce code files to "essentials" by throwing away most things and segregating different types of elements. Some of the entities are just lists of strings, while others are tuples of ('thing', frequency) in which the number of times the 'thing' is found is counted and reported. The elements in the dictionary are as follows:

With respect to names of functions, classes and variables: The system ignores names (variables, functions, etc.) that are less than 3 characters long, and those that are common Python function names, built-in functions, and various commonly-used Python functions like join and append.
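That filtering rule can be sketched like this; the set of common names below is a tiny illustrative sample, not the actual list used by Extractor.

```python
# A tiny illustrative sample; Extractor's real list of ignored common
# Python names is longer.
COMMON_NAMES = {'join', 'append', 'print', 'len', 'range', 'str'}

def keep_name(name):
    '''Keep a name only if it is at least 3 characters long and is not
    a commonly used Python function name.'''
    return len(name) >= 3 and name not in COMMON_NAMES

names = ['x', 'foo', 'join', 'some_variable', 'ab']
print([n for n in names if keep_name(n)])
# → ['foo', 'some_variable']
```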

To illustrate, suppose that we have the following very simple Python file:

#!/usr/bin/env python
# This is a header comment.

import foo
import floop

# This is a comment after the first line of code.

class SomeClass():
    '''Some class doc.'''

    def __init__(self):
        pass

    def some_function_on_class():
        '''Some function doc.'''
        some_variable = 1
        some_variable = foo.func()

if __name__ == '__main__':
    bar = SomeClass()
    print(bar.some_function_on_class())

Then, the dictionary returned by Extractor for this file will look like this:

{'name': 'test.py',
 'type': 'file',
 'code_language': 'Python',
 'text_language': 'en',
 'body':
  {'header': 'This is a header comment.',
   'comments': ['This is a comment after the first line of code.'],
   'docstrings': ['Some class doc.', 'Some function doc.'],
   'strings': [('__main__', 1)],
   'imports': [('foo', 1), ('floop', 1)],
   'classes': [('SomeClass', 1)],
   'functions': [('SomeClass.some_function_on_class', 1)],
   'variables': [('bar', 1), ('some_variable', 1)],
   'calls': [('foo.func', 1), ('SomeClass', 1),
             ('bar.some_function_on_class', 1)],
   },
}

The dictionary of elements can have empty values for the various keys even when a file contains code that we parse (currently, Python). This can happen if the file is empty, or if something goes wrong during parsing, such as a syntax error in the code. (The latter happens because the process parses code into an AST to extract the elements, and this fails if the code is unparseable.)
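The parsing step and its failure mode can be sketched with Python's standard ast module. The element extraction itself is elided here; only the error handling is shown, and the helper name `parse_code` is ours.

```python
import ast

def parse_code(source):
    '''Parse Python source into an AST; return None if the code cannot
    be parsed (for example, if it contains a syntax error).'''
    try:
        return ast.parse(source)
    except SyntaxError:
        return None

print(parse_code('x = 1') is not None)     # → True
print(parse_code('def broken(:') is None)  # → True
```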

☕ More information about text processing performed

There are two text processing situations that Extractor encounters: non-code files that contain text, and code (currently, Python) files. In both cases, Extractor cleans up and processes the text to a limited extent in order to try to make it more easily processed by subsequent natural language procedures.

For non-code files, if the file is not literally plain text, Extractor first extracts the text from it. It currently converts the following formats: HTML, Markdown, AsciiDoc, reStructuredText, RTF, and Textile. Its approach always begins by converting these formats to HTML, and then it post-processes the results. The post-processing performed is to add periods at the ends of titles, paragraphs, list items, and table cells, if periods are missing, to make the result appear more like normal English sentences. (This makes it easier for a sentence segmentation function to take the result and parse the text into sentences.) Extractor also removes <pre> and <img> elements completely.
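The period-insertion heuristic can be sketched as a small helper. This is an assumption-laden simplification of the real post-processing, and `add_final_period` is our name for it.

```python
def add_final_period(fragment):
    '''Append a period to a text fragment if it does not already end
    with sentence-final punctuation, so that downstream sentence
    segmentation sees sentence-like units.'''
    fragment = fragment.strip()
    if fragment and fragment[-1] not in '.!?':
        fragment += '.'
    return fragment

# Titles and list items often lack final punctuation.
fragments = ['Installation', 'Run make to build', 'Done!']
print([add_final_period(f) for f in fragments])
# → ['Installation.', 'Run make to build.', 'Done!']
```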

For code files, the text that appears in (1) the header section, (2) doc strings, (3) comments and (4) other strings, is extracted individually as described in a separate section above. Extractor performs some additional post-processing on the text, again mostly aimed at turning fragments of text into sentences based on heuristics, but also including some miscellaneous cleanup operations like removing non-alphanumeric characters that are not part of identifiers.
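One such cleanup step can be sketched with a regular expression: strip characters that are neither alphanumeric, whitespace, nor the underscores and dots that appear inside identifiers. The pattern is illustrative; Extractor's actual rules are more involved.

```python
import re

def strip_stray_punctuation(text):
    '''Remove non-alphanumeric characters except whitespace and the
    underscores and dots that commonly appear inside identifiers.
    Illustrative only; the real cleanup is more nuanced.'''
    return re.sub(r'[^\w\s.]', '', text)

print(strip_stray_punctuation('Call foo.bar() -- it *always* works!'))
# → 'Call foo.bar  it always works'
```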

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

♬ Contributing — info for developers

A lot remains to be done on CASICS in many areas. We would be happy to receive your help and participation if you are interested. Please feel free to contact the developers either via GitHub or the mailing list casics-team@googlegroups.com.

Everyone is asked to read and respect the code of conduct when participating in this project.

❤️ Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

<br> <div align="center"> <a href="https://www.nsf.gov"> <img width="105" height="105" src=".graphics/NSF.svg"> </a> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href="https://www.caltech.edu"> <img width="100" height="100" src=".graphics/caltech-round.svg"> </a> </div>