Home

Awesome

Spiral<img width="130px" align="right" src=".graphics/spiral.svg">

Spiral is a Python module that provides several different functions for splitting identifiers found in source code files. The name Spiral is a loose acronym based on "SPlitters for IdentifieRs: A Library".

License: GPL v3 Python Latest release DOI DOI

Authors: Michael Hucka<br> Repository: https://github.com/casics/spiral<br> License: Unless otherwise noted, this content is licensed under the GPLv3 license.

🏁 Recent news and activities

April 2018: Version 1.1.0 fixes a bug that prevented importing Spiral, and another bug that cause setup.py to fail to install dependencies automatically. Additional enhancements include improved command-line help and internal code refactoring.

Table of Contents

☀ Introduction

Spiral is a Python 3 package that implements numerous identifier splitting algorithms. Identifier splitting (also known as identifier name tokenization) is the task of breaking apart program identifier strings such as getInt or readUTF8stream into component tokens: [get, int] and [read, utf8, stream]. The need for splitting identifiers arises in a variety of contexts, including natural language processing (NLP) methods applied to source code analysis and program comprehension. Spiral provides some basic naive splitting algorithms, such as a straightforward camel-case splitter, as well as more elaborate heuristic splitters, such as a new algorithm called Ronin.

♥️ Please cite the Spiral paper and the version you use

Article citations are critical for academic developers. If you use Spiral and you publish papers about your software, we ask that you please cite the Spiral paper:

<dl> <dd> Hucka, M. (2018). Spiral: splitters for identifiers in source code files. <i>Journal of Open Source Software</i>, 3(24), 653, <a href="https://doi.org/10.21105/joss.00653">https://doi.org/10.21105/joss.00653</a> </dd> </dl>

Please also indicate the specific version of Spiral you use, to improve other people's ability to reproduce your results. You can use the Zenodo DOIs we provide for this purpose:

✺ Installation instructions

For basic usage, the following is probably the simplest and most direct way to install Spiral on your computer:

sudo pip3 install git+https://github.com/casics/spiral.git

Alternatively, you can clone this GitHub repository and then run setup.py:

git clone https://github.com/casics/spiral.git
cd spiral
sudo python3 -m pip install .

The above should be all you need to run Spiral out of the box. If you plan on experimenting with alternative parameter values or alternative dictionaries, you will additionally need to install the following:

▶︎ Basic operation

Spiral is extremely easy to use. To use a Spiral splitter in a Python program, simply import a splitter and then call it.

from spiral.simple_splitters import pure_camelcase_split
print(pure_camelcase_split('TestString'))

Some splitters take optional parameters, and the more complex splitters have an init() function that you can optionally call to set additional parameters or load data. Currently, only Ronin and Samurai have initialization functions, and calling init() is optional—if you do not call it, Ronin and Samurai will call their init() functions automatically.

Here are some examples of using the Ronin splitter algorithm. The following input,

from spiral import ronin
for s in [ 'mStartCData', 'nonnegativedecimaltype', 'getUtf8Octets', 'GPSmodule', 'savefileas', 'nbrOfbugs']:
    print(ronin.split(s))

produces the following output:

['m', 'Start', 'C', 'Data']
['nonnegative', 'decimal', 'type']
['get', 'Utf8', 'Octets']
['GPS', 'module']
['save', 'file', 'as']
['nbr', 'Of', 'bugs']

Spiral also includes a command-line program named spiral in the bin subdirectory; it will split strings provided on the command line or in a file. (Note: Ronin and Samurai load internal data files when they start up. In normal use, called from an application program, the resulting startup delay will only happen once. In the command-line program, the data is reloaded at every invocation, causing a startup delay every time. The delay is not typical for normal Spiral usage.)

By the way, if your goal is to split identifiers obtained during source code mining and you need to filter out gibberish strings of characters before attempting to split them, you may find Nostril (the Nonsense String Evaluator) useful.

🎯 Performance of Ronin

Splitting identifiers is deceptively difficult. It is a research-grade problem for which no perfect solution exists. Even in cases where the input consists of identifiers that strictly follow conventions such as camel case, ambiguities can arise. For example, correctly splitting J2SEProjectTypeProfiler into J2SE, Project, Type, Profiler requires the reader to recognize J2SE as a unit. The task of splitting identifiers is made more difficult when there are no case transitions or other obvious boundaries in an identifier. And then there is the small problem that humans are often simply inconsistent!

Ronin is an advanced splitter that uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin includes a default table of term frequencies derived from an analysis of over 46,000 randomly selected software projects in GitHub that contained at least one Python source code file. The tokens were extracted using software from Spiral's parent project, CASICS (specifically, the extractor package), and the frequency table constructed using a procedure encoded in the small program create_frequency_file included with Spiral. Ronin also has a number of parameters that need to be tuned; this can be done using the optimization program optimize-ronin in the dev/optimization subdirectory. The default parameter values were derived by optimizing performance against two data sets available from other research groups:

  1. The Loyola University of Delaware Identifier Splitting Oracle (Ludiso)
  2. The INTT data set, extracted from the zip archive of INTT

Spiral includes copies of these data sets in the tests/data subdirectory. The parameters derived primarily by running against the INTT database of 18,772 identifiers and their splits. The following table summarizes the results:

Data setNumber of splits matched by RoninTotal in data setAccuracy
INTT17,28718,77292.09%
Ludiso2,2482,66384.42%

Many of the "failures" against these sets of identifiers are actually not failures, but cases where Ronin gets a more correct answer or where there is a legitimate difference in interpretation. Here are some examples:

IdentifierLudiso resultRonin split
a.ecarta ecarta e cart
ConvertToAUTF8StringConvert To AUTF 8 StringConvert To A UTF8 String
MAPI_BCCMAPI BCCM API BCC
GetWSIsEnabledGet WSIs EnabledGet WS Is Enabled
freadfreadf read
FF_LOSS_COLORSPACEFF LOSS COLORSPACEFF LOSS COLOR SPACE
m_bFilenameModem b File name Modem b Filename Mode

Names like fread could be considered appropriate to leave alone, but the f in fread actually stands for "file", and thus IMHO it makes sense to split it—splitting identifiers into meaningful tokens is the purpose of Ronin, after all. Conversely, tokens such as Utf8 should be left as units because they are meaningful. Differences involving some other terms such as color space are due to Ronin splitting terms that are typically considered separate words but are treated as one word in Ludiso, or vice versa. (The main entry in Wikipedia for "color space" is two words, while for "filename" it is one word.) This sometimes also arises because Ronin recognizes prefixes and sometimes does not split words because they would not be in normal written English, such as nonblocking instead of non blocking. Still other differences between Ronin's output and the expected splits given in Ludiso and INTT are due to inconsistencies in the Ludiso and INTT data sets. For example, sometimes a token within a larger identifier is split in one case but not in another. Finally, some other differences are cases where the Ludiso split seems to favor program-specific names. For example, QWheelEvent is split as [ QWheel, Event ] whereas Ronin will split it as [ Q, Wheel, Event ].

So, Ronin's performance may be better than the pure numbers imply. However, without checking every Ludiso case and manually reinterpreting the splits, we can't say for sure. All that aside, Ronin definitely gets many cases wrong.

⚙️ Other splitters in Spiral

Here is a list of all the splitters implemented in Spiral at this time:

Splitter nameOperation
delimiter_splitsplit only at characters $ ~ _ . : / @
digit_splitsplit only at digits
pure_camelcase_splitsplit at forward camel case transitions (lower to upper case)
safe_simple_splitsplit at hard delimiter characters and forward camel case only; won't split strings that don't follow strict camel case
simple_splitsplit at hard delimiter characters and forward camel case, even if a string doesn't follow strict camel case conventions
elementary_splitsplit by hard delimiters, forward camel case, and digits
heuristic_splitsplit by hard delimiters, forward camel case, and digits, but recognize special cases such as utf8, sha256, etc.
Samuraifrequency-based algorithm published in the literature
Roninfrequency-based algorithm based on Samurai but greatly extended (see previous section)

The following table illustrates the differences between the simpler splitters.

Input stringpure camelsafe simplesimpleelementaryheuristic
allloweralllowerallloweralllowerallloweralllower
ALLUPPERALLUPPERALLUPPERALLUPPERALLUPPERALLUPPER
a_delimitera_delimitera delimitera delimitera delimitera delimiter
a.delimitera.delimitera delimitera delimitera delimitera delimiter
a$delimitera$delimitera delimitera delimitera delimitera delimiter
a:delimitera:delimitera delimitera delimitera delimitera delimiter
a_fooBara_foo Bara foo Bara foo Bara foo Bara foo Bar
fooBarfoo Barfoo Barfoo Barfoo Barfoo Bar
FooBarFoo BarFoo BarFoo BarFoo BarFoo Bar
FoobarFoobarFoobarFoobarFoobarFoobar
fooBARfoo BARfooBARfoo BARfoo BARfoo BAR
fooBARbiffoo BARbiffooBARbiffoo BARbiffoo BARbiffoo BARbif
fooBARzBiffoo BARz BiffooBARzBiffoo BARz Biffoo BARz Biffoo BARz Bif
ABCfooABCfooABCfooABCfooABCfooABCfoo
ABCFooABCFooABCFooABCFooABCFooABCFoo
ABCFooBarABCFoo BarABCFooBarABCFoo BarABCFoo BarABCFoo Bar
ABCfooBarABCfoo BarABCfooBarABCfoo BarABCfoo BarABCfoo Bar
fooBar2dayfoo Bar2dayfoo Bar2dayfoo Bar2dayfoo Bar 2 dayfoo Bar 2 day
fooBar2Dayfoo Bar2Dayfoo Bar2Dayfoo Bar2Dayfoo Bar 2 Dayfoo Bar 2 Day
fooBAR2dayfoo BAR2dayfooBAR2dayfoo BAR2dayfoo BAR 2 dayfoo BAR 2 day
fooBAR2Dayfoo BAR2DayfooBAR2Dayfoo BAR2Dayfoo BAR 2 Dayfoo BAR 2 Day
foo3000foo3000foo3000foo3000foo 3000foo 3000
99foo300099foo300099foo300099foo300099 foo 300099 foo 3000
foo2Barfoo2Barfoo2Barfoo2Barfoo 2 Barfoo 2 Bar
foo2bar2foo2bar2foo2bar2foo2bar2foo 2 bar 2foo 2 bar 2
Foo2Bar2Foo2Bar2Foo2Bar2Foo2Bar2Foo 2 Bar 2Foo 2 Bar 2
MyInt32My Int32My Int32My Int32My Int 32My Int32
MyInt42My Int42My Int42My Int42My Int 42My Int 42
MyInt32VarMy Int32VarMy Int32VarMy Int32VarMy Int 32 VarMy Int32 Var
2ndvar2ndvar2ndvar2ndvar2 ndvar2nd var
the2ndvarthe2ndvarthe2ndvarthe2ndvarthe 2 ndvarthe 2nd var
the2ndVarthe2nd Varthe2nd Varthe2nd Varthe 2 nd Varthe 2nd Var
row10row10row10row10row 10row 10
utf8utf8utf8utf8utf 8utf8
aUTF8vara UTF8varaUTF8vara UTF8vara UTF 8 vara UTF8 var
J2SE4meJ2SE4meJ2SE4meJ2SE4meJ 2 SE 4 meJ2SE 4 me
IPv4addrIPv4addrIPv4addrIPv4addrIPv 4 addrIPv4 addr

Ronin and Samurai are more advanced than any of the simple splitters above. Ronin in particular does everything heuristic_splitter does, but handles far more difficult cases.

⚠️ Limitations

Ronin is the most advanced splitter in the Spiral package, but it is not perfect by any means. Certain limitations are known:

📚 More information

The name "Ronin" is a play on the use of the name "Samurai" by Enslen et al. (2009) for their identifier splitting algorithm. The core loop of Ronin is based on Samurai, but it would be inappropriate to call this implementation Samurai too. In an effort to imply the lineage of this modified algorithm, I chose "Ronin" (a name referring to a drifter samurai without a master, during the Japanese feudal period).

I implemented Samurai based on the description of the algorithm published in the following paper, then modified it repeatedly in an attempt to improve performance. Ronin is the result. A goal for Ronin was to produce a splitter that had good performance even without using a local frequency table.

<details> <summary><a href="https://dl.acm.org/citation.cfm?id=1591139"> Enslen, E., Hill, E., Pollock, L., & Vijay-Shanker, K. (2009). Mining source code to automatically split identifiers for software analysis. In Proceedings of the 6th IEEE International Working Conference on Mining Software Repositories (MSR'09)</a></summary><br> <em> Abstract: Automated software engineering tools (e.g., program search, concern location, code reuse, quality assessment, etc.) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where space and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state of the art techniques. </em> </details> <br>

The Ludiso data set is described in the following paper:

<details> <summary><a href="https://link.springer.com/article/10.1007/s10664-013-9261-0"> Hill, E., Binkley, D., Lawrie, D., Pollock, L., & Vijay-Shanker, K. (2014). An empirical study of identifier splitting techniques. Empirical Software Engineering, 19(6), 1754–1780.</a></summary><br> <em>Abstract: Researchers have shown that program analyses that drive software development and maintenance tools supporting search, traceability and other tasks can benefit from leveraging the natural language information found in identifiers and comments. Accurate natural language information depends on correctly splitting the identifiers into their component words and abbreviations. While conventions such as camel-casing can ease this task, conventions are not well-defined in certain situations and may be modified to improve readability, thus making automatic splitting more challenging. This paper describes an empirical study of state-of-the-art identifier splitting techniques and the construction of a publicly available oracle to evaluate identifier splitting algorithms. In addition to comparing current approaches, the results help to guide future development and evaluation of improved identifier splitting approaches. </em> </details> <br>

The INTT splitter and data set are described in the following paper:

<details> <summary><a href="https://link.springer.com/chapter/10.1007%2F978-3-642-22655-7_7"> Butler, S., Wermelinger, M., Yu, Y., & Sharp, H. (2011). Improving the Tokenisation of Identifier Names. In ECOOP 2011 – Object-Oriented Programming (pp. 130–154). Springer, Berlin, Heidelberg. </a></summary><br> <em> Abstract: Identifier names are the main vehicle for semantic information during program comprehension. Identifier names are tokenised into their semantic constituents by tools supporting program comprehension tasks, including concept location and requirements traceability. We present an approach to the automated tokenisation of identifier names that improves on existing techniques in two ways. First, it improves tokenisation accuracy for identifier names of a single case and those containing digits. Second, performance gains over existing techniques are achieved using smaller oracles. Accuracy was evaluated by comparing the output of our algorithm to manual tokenisations of 28,000 identifier names drawn from 60 open source Java projects totalling 16.5 MSLOC. We also undertook a study of the typographical features of identifier names (single case, use of digits, etc.) per object-oriented construct (class names, method names, etc.), thus providing an insight into naming conventions in industrial-scale object-oriented code. Our tokenisation tool and datasets are publicly available. </em> </details>

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

♬ Contributing — info for developers

A lot remains to be done on CASICS in many areas. We would be happy to receive your help and participation if you are interested. Please feel free to contact the developers either via GitHub or the mailing list casics-team@googlegroups.com.

Everyone is asked to read and respect the code of conduct when participating in this project.

❤️ Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

The installation of Spiral includes a dictionary containing words from the NTLK project, specifically the words and wordnet dictionaries. The word dictionary is public domain, but the wordnet dictionary is copyright 2006 by Princeton University and made available under license terms that permit free redistribution. For more information, please see Princeton University "About WordNet" (Princeton University, 2010).

The vector artwork of the segmented spiral illusion at the top of this page was obtained from www.123freevectors.com.

<div align="center"> <a href="https://www.nsf.gov"> <img width="105" height="105" src=".graphics/NSF.svg"> </a> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href="https://www.caltech.edu"> <img width="100" height="100" src=".graphics/caltech-round.svg"> </a> </div>