<p align="center"> <img src="https://github.com/annotation/stam/raw/master/logo.png" alt="stam logo" width="320" /> </p>

Project Status: Active. The project has reached a stable, usable state and is being actively developed. Technology Readiness Level 7/9 (Release Candidate): technology ready enough and in initial use by end-users in intended scholarly environments. Further validation in progress.

STAM Library

STAM is a standalone data model for stand-off text annotation. This is a software library to work with the model from Rust, and is the primary library/reference implementation for STAM. It aims to implement the full model as per the STAM specification and most of the extensions.
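
To make the term stand-off concrete, here is a minimal sketch that uses only the Rust standard library and not this library's API. The text is stored untouched and an annotation merely points into it by offsets. The names SketchAnnotation, annotated_text and describe are hypothetical, and the sketch assumes that offsets count characters (Unicode code points) with an exclusive end.

```rust
/// A hypothetical stand-off annotation: it neither contains nor modifies the
/// text, it only points at a span of it and attaches a key/value pair.
struct SketchAnnotation {
    begin: usize, // first character of the span (assumption: counted in code points)
    end: usize,   // one past the last character of the span
    key: &'static str,
    value: &'static str,
}

/// Resolves the text a stand-off annotation points at.
/// Offsets beyond the end of the text are silently clipped in this sketch.
fn annotated_text(text: &str, annotation: &SketchAnnotation) -> String {
    text.chars()
        .skip(annotation.begin)
        .take(annotation.end.saturating_sub(annotation.begin))
        .collect()
}

/// Renders one annotation as "text<TAB>key<TAB>value".
fn describe(text: &str, annotation: &SketchAnnotation) -> String {
    format!(
        "{}\t{}\t{}",
        annotated_text(text, annotation),
        annotation.key,
        annotation.value
    )
}
```

With the text "Hello world" and an annotation spanning offsets 6 to 11, annotated_text yields "world", and the text itself is never altered.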

What can you do with this library?

This STAM library is intended as a foundation upon which further applications can be built that deal with stand-off annotations on text. We implement all the low-level logic for dealing with this, so you no longer have to and can focus on your actual application. The library is written with performance in mind.

Installation

Add stam to your project's Cargo.toml:

$ cargo add stam

Usage

Import the library

use stam;

Or if you prefer losing the namespace:

use stam::*;

Loading a STAM JSON file containing an annotation store:

fn your_function() -> Result<(),stam::StamError> {
    let store = stam::AnnotationStore::from_file("example.stam.json", stam::Config::default())?;
    ...
}

We assume some kind of function returning Result<_,stam::StamError> for all examples in this section.

The annotation store is your workspace: it holds all resources, annotation sets (i.e. keys and annotation data) and of course the actual annotations. It is a memory-based store and you can load as much as you like into it (as long as it fits in memory).

When instantiating an annotation store, you can pass a configuration (stam::Config) which specifies various parameters, such as which indices to generate. Use its with_*() methods (a builder pattern) to set the configuration options.
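
As a rough illustration of the builder pattern mentioned above, the following standard-library-only sketch shows how chained setter methods work. SketchConfig and its two options are hypothetical and do not correspond to the actual options of stam::Config; consult the API reference for those.

```rust
/// Hypothetical configuration struct; the field names are illustrative only
/// and are not the options of the real `stam::Config`.
#[derive(Debug, Clone, PartialEq)]
struct SketchConfig {
    build_reverse_index: bool,
    debug: bool,
}

impl Default for SketchConfig {
    fn default() -> Self {
        Self {
            build_reverse_index: true,
            debug: false,
        }
    }
}

impl SketchConfig {
    /// Each `with_*` method takes ownership, changes one option and returns
    /// the value again, so that calls can be chained.
    fn with_reverse_index(mut self, value: bool) -> Self {
        self.build_reverse_index = value;
        self
    }

    fn with_debug(mut self, value: bool) -> Self {
        self.debug = value;
        self
    }
}
```

A call such as SketchConfig::default().with_debug(true) starts from the defaults and overrides a single option, leaving the others as they were.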

Retrieving items

You can retrieve items by methods that are similarly named to the desired return type:

let annotation =  store.annotation("my-annotation").or_fail()?;
let resource = store.resource("my-resource").or_fail()?;
let annotationset = store.dataset("my-annotationset").or_fail()?;
let key = annotationset.key("my-key").or_fail()?;
let data = annotationset.annotationdata("my-data").or_fail()?;

All of these methods return an Option<ResultItem<T>>, where T is a type in the STAM model like Annotation, TextResource, AnnotationDataSet, DataKey or TextSelection. The or_fail() method transforms it into a Result<ResultItem<T>,StamError>, and the ? operator unwraps it safely into ResultItem<T> or propagates the error further.
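
The conversion from Option to Result followed by ? can be sketched with the standard library alone. The trait OrFail, the error type NotFound and the helper functions below are hypothetical stand-ins and not the library's own definitions.

```rust
/// Hypothetical error type standing in for the library's error type.
#[derive(Debug, PartialEq)]
struct NotFound(&'static str);

/// Hypothetical extension trait that turns an `Option` into a `Result`.
trait OrFail<T> {
    fn or_fail(self) -> Result<T, NotFound>;
}

impl<T> OrFail<T> for Option<T> {
    fn or_fail(self) -> Result<T, NotFound> {
        self.ok_or(NotFound("requested item not found"))
    }
}

/// A lookup that may come up empty, like the retrieval methods above.
fn lookup<'a>(ids: &[&'a str], wanted: &str) -> Option<&'a str> {
    ids.iter().copied().find(|id| *id == wanted)
}

/// `?` either yields the found item or returns the error to the caller.
fn id_length(ids: &[&str], wanted: &str) -> Result<usize, NotFound> {
    let id = lookup(ids, wanted).or_fail()?;
    Ok(id.len())
}
```

The design keeps the lookup itself infallible (an absent item is a normal outcome) and lets the caller decide whether absence should count as an error.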

The ResultItem<T> type holds a reference to T, with a lifetime equal to the store; it also holds a reference to the store itself. You can call as_ref() on any ResultItem<T> instance to get a direct reference with a lifetime equal to the store, which exposes a lower-level API. ResultItem<T> itself always exposes a high-level API, which is what you want in most cases.
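
The idea of a wrapper that carries both an item reference and a store reference can be sketched as follows. This is a simplified illustration under the assumption of a single item type; SketchStore, SketchResource, SketchItem and their methods are hypothetical and much simpler than the real types.

```rust
/// Hypothetical store owning all items.
struct SketchStore {
    resources: Vec<SketchResource>,
}

struct SketchResource {
    id: String,
    text: String,
}

/// Hypothetical wrapper: a reference to the item plus a reference to the
/// store it lives in, both bound to the store's lifetime `'store`.
struct SketchItem<'store, T> {
    item: &'store T,
    store: &'store SketchStore,
}

impl<'store, T> SketchItem<'store, T> {
    /// Direct reference with the lifetime of the store, not of the wrapper.
    fn as_ref(&self) -> &'store T {
        self.item
    }

    /// A higher-level method can consult the store, which a bare `&T` cannot.
    fn store_size(&self) -> usize {
        self.store.resources.len()
    }
}

impl SketchStore {
    fn resource(&self, id: &str) -> Option<SketchItem<'_, SketchResource>> {
        self.resources
            .iter()
            .find(|r| r.id == id)
            .map(|item| SketchItem { item, store: self })
    }
}

/// The wrapper is dropped when this function returns, but the reference
/// obtained through `as_ref()` stays valid for as long as the store lives.
fn resource_text<'a>(store: &'a SketchStore, id: &str) -> Option<&'a str> {
    let wrapped = store.resource(id)?;
    Some(wrapped.as_ref().text.as_str())
}
```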

The wrapping of TextSelection is a bit special: instead of ResultItem<TextSelection>, we typically use a more specialised type, ResultTextSelection.

Adding items

Add a resource from an existing plain text file to an existing store:

let resource_handle = store.add_resource(stam::TextResourceBuilder::new().with_filename("my-text.txt"))?;

Here we see a Builder type that uses a builder pattern to construct instances of its associated type. The actual instances will be built by the underlying store.

A similar pattern works for AnnotationDataSet:

let annotationset_handle = store.add_dataset( stam::AnnotationDataSetBuilder::new().with_filename("myset.json") )?;

The add_* methods take the associated builder types and return handles. There are also with_* variants which can be used in a chained builder pattern, as they return the modified AnnotationStore itself.

We use annotate() (or with_annotation()) to add annotations to an existing store:

let annotation_handle = store.annotate(
    stam::AnnotationBuilder::new()
        .with_target(stam::SelectorBuilder::textselector("testres", stam::Offset::simple(6, 11)))
        .with_data("testdataset", "pos", "noun"),
)?;

Let's now create a store and annotations from scratch, with an explicitly filled AnnotationDataSet:

let store = stam::AnnotationStore::new(stam::Config::default())
    .with_id("test")
    .with_resource( stam::TextResourceBuilder::new().with_id("testres").with_text("Hello world"))?
    .with_dataset( stam::AnnotationDataSetBuilder::new().with_id("testdataset")
           .with_key( "pos")
           .with_key_value_id("pos", "noun", "D1")
    )?
    .with_annotation( stam::Annotation::builder() 
            .with_id("A1")
            .with_target( stam::SelectorBuilder::textselector("testres", stam::Offset::simple(6,11))) 
            .with_existing_data("testdataset", "D1") )?;

And here is the very same thing but the AnnotationDataSet is filled implicitly here:

let store = stam::AnnotationStore::default().with_id("test")
    .with_resource( stam::TextResourceBuilder::new().with_id("testres").with_text("Hello world"))?
    .with_dataset( stam::AnnotationDataSetBuilder::new().with_id("testdataset"))?
    .with_annotation( stam::AnnotationBuilder::new()
            .with_id("A1")
            .with_target( stam::SelectorBuilder::textselector("testres", stam::Offset::simple(6,11))) 
            .with_data_with_id("testdataset","pos","noun","D1")
    )?;

The implementation reuses any already existing AnnotationData whenever possible, as not duplicating data is one of the core characteristics of the STAM model.
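
The principle of not duplicating data can be illustrated with a small interning sketch. It uses only the standard library, and SketchDataSet is a hypothetical structure that does not reflect how the library stores its data internally.

```rust
use std::collections::HashMap;

/// Hypothetical data set that stores each (key, value) pair only once.
#[derive(Default)]
struct SketchDataSet {
    data: Vec<(String, String)>,
    index: HashMap<(String, String), usize>,
}

impl SketchDataSet {
    /// Returns the handle (position) of the pair, inserting it only if it
    /// is not present yet.
    fn intern(&mut self, key: &str, value: &str) -> usize {
        let pair = (key.to_string(), value.to_string());
        if let Some(&handle) = self.index.get(&pair) {
            return handle; // already known: reuse the existing entry
        }
        let handle = self.data.len();
        self.data.push(pair.clone());
        self.index.insert(pair, handle);
        handle
    }

    /// Number of distinct pairs stored.
    fn len(&self) -> usize {
        self.data.len()
    }
}
```

Two annotations that both carry the pair pos = noun would receive the same handle here, so the pair is stored once regardless of how many annotations use it.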

Serialisation to file

You can serialize the entire annotation store (including all sets and annotations) to a STAM JSON file:

store.to_file("example.stam.json")?;

Or to a STAM CSV file (this will actually create separate derived CSV files for sets and annotations):

store.to_file("example.stam.csv")?;

Iterators & Searching

Iterating through all annotations in the store, and outputting a simple tab-separated format with the data and the text of each annotation:

for annotation in store.annotations() {
    let id = annotation.id().unwrap_or("");
    for data in annotation.data() {
        // get the text to which this annotation refers (if any)
        let text: Vec<&str> = annotation.text().collect();
        println!("{}\t{}\t{}\t{}", id, data.key().id().unwrap(), data.value(), text.join(" "));
    }
}

Here is an overview of the most important methods that return an iterator. The iterators in turn all return ResultItem<T> instances (or ResultTextSelection). The overview is divided into two parts: the methods in the top part simply follow STAM's ownership model, whereas those in the bottom part leverage the various reverse indices that are computed:

| Method | T | Description |
| --- | --- | --- |
| AnnotationStore.annotations() | Annotation | all annotations in the store |
| AnnotationStore.resources() | TextResource | all resources in the store |
| AnnotationStore.datasets() | AnnotationDataSet | all annotation sets in the store |
| AnnotationDataSet.keys() | DataKey | all keys in the set |
| AnnotationDataSet.data() | AnnotationData | all data in the set |
| Annotation.data() | AnnotationData | the data pertaining to the annotation |

| Method | T | Description |
| --- | --- | --- |
| TextResource.textselections() | TextSelection | all known text selections in the resource (1) |
| TextResource.annotations() | Annotation | annotations referencing this text using a TextSelector or AnnotationSelector |
| TextResource.annotations_as_metadata() | Annotation | annotations referencing the resource via a ResourceSelector |
| AnnotationDataSet.annotations() | Annotation | all annotations making use of this set |
| AnnotationDataSet.annotations_as_metadata() | Annotation | annotations referencing the set via a DataSetSelector |
| Annotation.annotations() | Annotation | annotations that reference the current one via an AnnotationSelector |
| Annotation.annotations_in_targets() | Annotation | annotations referenced by the current one via an AnnotationSelector |
| Annotation.textselections() | TextSelection | targeted text selections (via TextSelector or AnnotationSelector) |
| AnnotationData.annotations() | Annotation | all annotations that use this data |
| DataKey.data() | AnnotationData | all annotation data that uses this key |
| TextSelection.annotations() | Annotation | all annotations that target this text selection |

Notes:

The main named iterator traits in STAM are:

| Iterator trait | T | Methods that produce the iterator |
| --- | --- | --- |
| AnnotationIterator | Annotation | annotations() / annotations_in_targets() |
| DataIterator | AnnotationData | data() / find_data() |
| TextSelectionIterator | TextSelection | textselections() / related_text() |
| ResourcesIterator | TextResource | resources() |
| KeyIterator | DataKey | keys() |

The iterators expose an API allowing various transformations and filter actions. You can typically transform one type of iterator into another using the methods in the third column. Similarly, you can obtain an iterator from ResultItem instances through equally named methods.

All of these iterators have an owned collection counterpart (Handles<T>) that holds an entire collection in memory. The items are held by reference to a store, so the space overhead is limited. You can go from the former to the latter with .to_handles() and from the latter to the former with .items().

| Iterator trait | Collection |
| --- | --- |
| AnnotationIterator | Annotations |
| DataIterator | Data |
| ResourcesIterator | Resources |
| TextSelectionIterator | TextSelections |
| KeyIterator | Keys |

The iterators can be extended with filters. These are applied in a builder pattern and return an iterator that still implements the same trait, but with the filter applied:

| Filter method | Description |
| --- | --- |
| filter_annotation(&ResultItem<Annotation>) | Filters on a single annotation |
| filter_annotations(Annotations) | Filters on multiple annotations |
| filter_annotationdata(&ResultItem<AnnotationData>) | Filters on a single data item |
| filter_data(Data) | Filters on multiple data items |
| filter_key(&ResultItem<DataKey>) | Filters on a data key |
| filter_value(value) | Filters on a data value, the parameter can be of various types |

All these iterators are lazy, that is to say they don't do anything unless consumed. Once they are being iterated over, internal buffers may be allocated.

When you are not interested in the actual items but merely want to test whether there are results at all, then use the test() method that is available on these iterators.
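
Laziness and early termination can be demonstrated with ordinary standard-library iterators. The sketch below is only an analogue: it does not use this library's iterators, and it is an assumption that an existence test can stop at the first match in the same way. The function names are hypothetical.

```rust
use std::cell::Cell;

/// Builds a filtering iterator over `values` and reports how many items the
/// filter inspected. Nothing is inspected unless `consume` is true.
fn inspected_items(values: &[i32], consume: bool) -> usize {
    let inspected = Cell::new(0);
    let iter = values.iter().filter(|v| {
        inspected.set(inspected.get() + 1);
        **v > 1
    });
    if consume {
        // Consuming the iterator is what triggers the actual work.
        let _matches: Vec<&i32> = iter.collect();
    }
    inspected.get()
}

/// Checks whether there is any match at all, stopping at the first one.
/// Returns whether a match was found and how many items were inspected.
fn any_match(values: &[i32]) -> (bool, usize) {
    let inspected = Cell::new(0);
    let found = values
        .iter()
        .filter(|v| {
            inspected.set(inspected.get() + 1);
            **v > 1
        })
        .next()
        .is_some();
    (found, inspected.get())
}
```

For the input [1, 2, 3], building the iterator without consuming it inspects nothing, collecting it inspects all three items, and the existence check stops after the second item.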

For improved performance, you can add .parallel() to an iterator; any subsequent iterator methods (generic ones like map() and filter(), not STAM-specific ones) will then run in parallel over multiple cores.

Examples

Example retrieving all annotations that have part-of-speech noun (fictitious model):

let dataset = store.dataset("linguistic-features").or_fail()?;
let key = dataset.key("part-of-speech").or_fail()?;
let annotationsiter = key.data().filter_value("noun".into()).annotations();

Alternatively, this can also be done as follows, following a slightly different path to get to the same results. Sometimes one version is more performant than the other, depending on how your data is modelled:

let annotationsiter = key.annotations().filter_value("noun".into());

Example testing whether a word is annotated with part-of-speech noun (fictitious model):

let dataset = store.dataset("linguistic-features").or_fail()?;
let key = dataset.key("part-of-speech").or_fail()?;
if word.annotations().filter_key(&key).filter_value("noun".into()).test() {
   ...    
}

Searching data

The above methods already allow you to find data, but the find_data() method on AnnotationStore and AnnotationDataSet provides a shortcut to quickly get data instances (via a DataIter).

Example:

let data = store.find_data("linguistic-features", "part-of-speech", "noun".into()).next();

Here and in examples before we use the into() method to coerce a &str into a DataOperator::Equals(&str). There are also other data operators available allowing for various types and various kinds of comparison (equality, inequality, greater than, less than, logical and/or etc).
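
How a plain string can be coerced into an operator via into() is sketched below with the standard library only. SketchOperator is a hypothetical, much reduced stand-in for the library's data operators; only equality, negation and logical or are modelled, and only for strings.

```rust
/// Hypothetical operator type; only a few kinds of comparison are sketched.
#[derive(Debug, PartialEq)]
enum SketchOperator<'a> {
    Equals(&'a str),
    Not(Box<SketchOperator<'a>>),
    Or(Vec<SketchOperator<'a>>),
}

/// Lets a plain `&str` stand in for an equality test.
impl<'a> From<&'a str> for SketchOperator<'a> {
    fn from(value: &'a str) -> Self {
        SketchOperator::Equals(value)
    }
}

impl<'a> SketchOperator<'a> {
    /// Evaluates the operator against a candidate value.
    fn test(&self, candidate: &str) -> bool {
        match self {
            SketchOperator::Equals(expected) => *expected == candidate,
            SketchOperator::Not(inner) => !inner.test(candidate),
            SketchOperator::Or(options) => options.iter().any(|op| op.test(candidate)),
        }
    }
}

/// Accepting `impl Into<...>` allows callers to pass either a ready-made
/// operator or a bare string that is converted to an equality test.
fn count_matches<'a>(values: &[&str], operator: impl Into<SketchOperator<'a>>) -> usize {
    let operator = operator.into();
    values.iter().filter(|v| operator.test(**v)).count()
}
```

In this sketch, count_matches(&["noun", "verb", "noun"], "noun") counts two matches, because the string is converted into an equality operator before testing.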

Searching text

The following methods are available to search for text. They return iterators producing ResultItem<T> items.

| Method | T | Description |
| --- | --- | --- |
| TextResource.find_text() | TextSelection | Finds a particular substring in the resource's text. |
| TextSelection.find_text() | TextSelection | Finds a particular substring within the specified text selection. |
| TextResource.find_text_regex() | TextSelection | Idem, but as a powerful regular-expression-based search. |
| TextSelection.find_text_regex() | TextSelection | Idem, but as a powerful regular-expression-based search. |
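
What a substring search that yields selections amounts to can be sketched with the standard library: each hit is reported as a pair of offsets into the text rather than as a copy of the text. The function find_offsets is hypothetical, and the sketch assumes that offsets count characters (Unicode code points) with an exclusive end.

```rust
/// Returns the (begin, end) offsets of every non-overlapping occurrence of
/// `needle` in `text`, counted in characters, with an exclusive end.
fn find_offsets(text: &str, needle: &str) -> Vec<(usize, usize)> {
    if needle.is_empty() {
        return Vec::new();
    }
    let needle_chars = needle.chars().count();
    text.match_indices(needle)
        .map(|(byte_pos, _)| {
            // `match_indices` reports byte positions; convert to a character
            // offset by counting the characters before the match. This is
            // simple rather than fast.
            let begin = text[..byte_pos].chars().count();
            (begin, begin + needle_chars)
        })
        .collect()
}
```

Searching "Hello world" for "world" gives the single pair (6, 11), which is the same span as Offset::simple(6,11) in the earlier examples.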

Searching related text

The related_text() method allows finding text selections that are in a certain relation with the current one(s). It takes a TextSelectionOperator as parameter, which distinguishes various variants.

The variants are typically constructed via a helper function on TextSelectionOperator (simply the name of the variant in lowercase), e.g. TextSelectionOperator::equals().

Example, select all words in a sentence (sentence may be either an Annotation or TextSelection in this case):

let dataset = store.dataset("structure-type").or_fail()?;
let key_word = dataset.key("word").or_fail()?;
for word in sentence.related_text(stam::TextSelectionOperator::embeds()).annotations().filter_key(&key_word) {
    ...
}
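
Relations between text selections boil down to comparisons on their offsets, as the following standard-library sketch illustrates. Span, SketchRelation and the two functions are hypothetical, and the exact semantics chosen here (for instance that a span embeds an identical span) are assumptions of the sketch that may differ from the library's operators.

```rust
/// A span of text given as character offsets; `end` is exclusive.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    begin: usize,
    end: usize,
}

/// Hypothetical relations between two spans `a` and `b`.
#[derive(Debug, Clone, Copy)]
enum SketchRelation {
    Equals,   // a and b cover exactly the same span
    Embeds,   // b lies entirely within a
    Overlaps, // a and b share at least one character
    Before,   // a ends before, or exactly where, b begins
}

/// Tests whether the relation holds between `a` and `b`.
fn holds(a: Span, b: Span, relation: SketchRelation) -> bool {
    match relation {
        SketchRelation::Equals => a == b,
        SketchRelation::Embeds => a.begin <= b.begin && b.end <= a.end,
        SketchRelation::Overlaps => a.begin < b.end && b.begin < a.end,
        SketchRelation::Before => a.end <= b.begin,
    }
}

/// Selects all candidates that stand in the given relation to `anchor`.
fn related(anchor: Span, candidates: &[Span], relation: SketchRelation) -> Vec<Span> {
    candidates
        .iter()
        .copied()
        .filter(|candidate| holds(anchor, *candidate, relation))
        .collect()
}
```

With a sentence spanning 0 to 11 and candidate spans 0 to 5, 6 to 11 and 12 to 17, the Embeds relation selects the first two candidates, which mirrors selecting the words inside a sentence.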

Querying

Rather than searching programmatically, you can also express queries via the STAM Query Language (STAMQL). Do note that this incurs a performance penalty due to extra overhead:

let query: Query = "SELECT ANNOTATION ?a WHERE DATA myset type = phrase;".try_into()?;
let iter = store.query(query);
let names = iter.names();
for results in iter {
    if let Ok(result) = results.get_by_name(&names, "a") {
        if let QueryResultItem::Annotation(annotation) = result {
            ...
        }
    }
}

API Reference Documentation

Please consult the API reference documentation for in-depth explanation on all structures, traits and methods, along with some examples.

Extensions

This library implements the following STAM extensions:

Python binding

This library comes with a binding for Python, see here.

Acknowledgements

This work is conducted at the KNAW Humanities Cluster's Digital Infrastructure department, and funded by the CLARIAH project (CLARIAH-PLUS, NWO grant 184.034.023) as part of the FAIR Annotations track of the Shared Development Roadmap.