Awesome

Graphony

Graphony is a Python library for doing high-performance graph analysis using the GraphBLAS over sparse and hypersparse data sets.

Graphony uses pygraphblas to store graph data in sparse GraphBLAS Matrices and node and edge properties in PostgreSQL.

Graphony's primary role is to easily construnct graph matrices and manage symbolic names and properties for graphs, nodes, properties and edges, and can be used to easily construct, save and manage graph data in a simple project directory format.

Graphs can be:

Simple: an edge connects one source to one destination.
Hypergraph: a graph with at lest one hyperedge connecting multiple source nodes to multiple destinations.
Multigraph: multiple edges can exist between a source and destination.
Property Graph: Nodes and and Edges can have arbitrary JSON properties.

Creating Graphs

This documentation is also a runnable Python test called a doctest. In order to run and verify this documentation, we must first create some helper objects like a function p() that will iterate results into a list and "pretty print" them. We also have to setup a test PostgreSQL database and initialize it with the base Graphony tables.

The core object of Graphony is a Graph(). A new Graph can be created with a connection string to an existing initialized database:

import os
import pprint
import postgresql
from pathlib import Path
from pygraphblas import FP64, INT64, gviz
from graphony import Graph, Node
p = lambda r: pprint.pprint(sorted(list(r)))
pgdata, conn = postgresql.setup()
postgresql.psql(f'-d "{conn}" -f dbinit/01_init_graphony.sql -f dbinit/02_karate_demo.sql')
G = Graph(conn)

The Graph object G is used throughout the following documenation to demonstrate the features of Graphony. A Graphony Graph consists of four concepts:

Graph: Top level object that contains all graph data in sub-graphs called properties.
Property: A named, typed sub-graph that holds edges. A property consists of two GraphBLAS Incidence Matrices that can be multiplied to project an adjacency with themselves, or any other combination of properties.
Edge: Property edges can be simple point to point edges or hyperedges that represent properties between multiple incoming and outgoing nodes.
Node: A node in the graph.

Simple Graphs

Graphs consist of multiple typed subgraphs called properties. All properties share node ids across a Graph object, but each property can store edge weights of different data types like int, float, or bool. Internally, "simple" graphs (non-hyper) are stored as Adjacency Matrices.

Before you can add an edge, a property to hold it must be declared first. The default edge type is bool if you don't specify one:

>>> G.add_property('friend')

Edges can be added directly into the Graph with the += method. In their simplest form, an edge is a Python tuple with 2 elements a source and a destination:

>>> G.friend += ('bob', 'alice')

>>> G.friend.draw(weights=False, filename='docs/imgs/G_friend_1')
<graphviz.dot.Digraph object at ...>

Using strings like 'bob' and 'alice' as edge endpoints creates new graph nodes automatically. You can also create nodes explicity and provide attributes for them:

>>> jane = Node(G, 'jane', favorite_color='blue')
>>> jane.favorite_color
'blue'
>>> G.friend += ('alice', jane)

>>> G.friend.draw(weights=False, filename='docs/imgs/G_friend_2')
<graphviz.dot.Digraph object at ...>

Now there are two edges in the friend property, one from bob to alice and the other from alice to jane.

>>> p(G.friend)
[friend(bob, alice), friend(alice, jane)]

An iterator of property tuples can also be provided to the += operator which will consume them and add them to the property:

>>> G.friend += [('bob', 'sal'), ('alice', 'rick')]

>>> G.friend.draw(weights=False, filename='docs/imgs/G_friend_3')
<graphviz.dot.Digraph object at ...>

As shown above, tuples are stored as boolean edges whose weights are always True and therefore can be ommited.

Hypergraphs

A Hypergraph is a generalization of a graph in which an edge can join any number of vertices in constrast to a simple graph, shown above, where an edge has exactly two endpoints and can only connect only one vertex to one other vertex.

In Graphony a hypergraph can created in any incidence property by passing the incidence=True flag. This causes the property to be stored internally as two Incidence Matrices which can represent non-simple graphs like the hypergraph shown here:

>>> G.add_property('manages', incidence=True)

New hyperedges can be defined by passing a nested tuple of nodes as either the source or destinations, or both, for a hyperedge.

>>> G.manages += [('bob', ('rick', 'alice')), (('alice', 'bob'), 'jane')]

>>> G.manages.draw(weights=True, filename='docs/imgs/G_manages_1')
<graphviz.dot.Digraph object at ...>

Here a hyperedge with one source and two destinations is created from bob to jane and alice, and another with two sources and one destination is created from alice and bob to jane.

Property Graph

Graphs can have any number of properties, each with a particular GraphBLAS type. In general this is referred to as a Property Graph. As shown above the default property type is bool which created unweighted edges, but graph edge types can be specified on a per-property basis:

>>> G.add_property('distance', int)
>>> G.distance += [('bob', 'alice', 422), ('alice', 'jane', 42)]

>>> G.distance.draw(weights=True, filename='docs/imgs/G_distance_2')
<graphviz.dot.Digraph object at ...>

Supported python types include bool, int, float and complex which are converted into the GraphBLAS types GrB_BOOL, GrB_INT64, GrB_FP64 and GxB_FC64 for storage. You can also pass a specific GraphBLAS type if you want different precision or a custom type.

Graph Querying

Currently our graph looks like this, it contains 3 properties, friend, manages and distance:

>>> G.draw(weights=True, filename='docs/imgs/G_all_1')
<graphviz.dot.Digraph object at ...>

Graphs have a call interface like G(...) that can be used to query individual edges. A query consists of three optional arguments for source, property and destination. The default value for all three is None, which acts as a wildcard to matches all values.

>>> p(G())
[friend(bob, alice),
 friend(bob, sal),
 friend(alice, jane),
 friend(alice, rick),
 manages((bob), (alice, rick), (True, True)),
 manages((bob, alice), (jane), (True)),
 distance(bob, alice, 422),
 distance(alice, jane, 42)]

Only print edges where bob is the src:

>>> p(G(source='bob'))
[friend(bob, alice),
 friend(bob, sal),
 manages((bob), (alice, rick), (True, True)),
 manages((bob, alice), (jane), (True)),
 distance(bob, alice, 422)]

Only print edges where manages is the property:

>>> p(G(property='manages'))
[manages((bob), (alice, rick), (True, True)),
 manages((bob, alice), (jane), (True))]

Only print edges where jane is the destination:

>>> p(G(destination='jane'))
[friend(alice, jane),
 manages((bob, alice), (jane), (True)),
 distance(alice, jane, 42)]

Only print edges that match that bob is a manages of jane. Note in this case it returns two hyperedges, as in both cases bob is a source and jane is a destination:

>>> p(G(source='bob', property='manages', destination='jane'))
[manages((bob, alice), (jane), (True))]

Loading Graphs from SQL

Any tuple producing iterator can be used to construct Graphs. Graphony offers a shorthand helper for this. Any query that produces 2 or 3 columns can be used to produce edges into the graph.

>>> G.add_property('karate')
>>> G.karate += G.sql("select 'k_' || s_id, 'k_' || d_id from graphony.karate")

>>> G.karate.draw(weights=False, filename='docs/imgs/G_karate_3',
...               directed=False, graph_attr=dict(layout='sfdp'))
<graphviz.dot.Graph object at ...>

All the edges are in the karate property, as defined in the sql query above:

>>> len(G.karate)
78

Multigraphs

In a Multigraph multiple edges can exist between two nodes. A good example is a De Bruijn graph, a directed graph that represents overlapping sequences of symbols.

These graphs are used in bioinformatics to analyze and assemble long sequences of genetic data. Construction involves iterating a sequence of genetic information and constructing multiple edges between pairs of nodes.

>>> from more_itertools import windowed
>>> G.add_property('debruijn', incidence=True)
>>> def kmer(t, k=3): 
...     return (tuple(map("".join, windowed(i, k-1))) for i in map("".join, windowed(t, k)))

>>> G.debruijn += kmer('ATCGATCGGATGACAGACACAATTC')
>>> G.debruijn.draw(graph_attr=dict(layout='circo'), weights=False, concentrate=True, filename='docs/imgs/G_debruijn_1')
<graphviz...>

Once the graph is built up it can be "collapsed" into a weighted graph, where the multi-edges between nodes are summed up into a single edge. In the GraphBLAS this can be accomplished by calling the properties with a semiring:

>>> M = G.debruijn(INT64.plus_pair)
>>> gviz.draw_graph(M, weights=True, label_vector=G.debruijn.label_vector(M), 
...                 graph_attr=dict(layout='circo'), filename='docs/imgs/G_debruijn_2')
<graphviz...>

Example Weighted De Bruijn using BioPython

Here's an example or using Biopython to create an weighted De Bruijn graph of a Circovirus:

>>> from Bio import SeqIO, Entrez
>>> Entrez.email = "info@graphegon.com"
>>> handle = Entrez.efetch(db="nucleotide", id="MZ299081", rettype="gb", retmode="text")
>>> record = SeqIO.read(handle, "genbank")
>>> handle.close()
>>> from more_itertools import windowed
>>> G.add_property('circovirus', incidence=True)
>>> def kmer(t, k=3): 
...     return (tuple(map("".join, windowed(i, k-1))) for i in map("".join, windowed(t, k)))
>>> seq = str(record.seq)
>>> G.circovirus += kmer(seq, 3)
>>> M = G.circovirus(INT64.plus_pair)
>>> gviz.draw_graph(M, weights=True, labels=True, label_vector=G.circovirus.label_vector(M),
...                 graph_attr=dict(layout='sfdp'), filename='docs/imgs/G_circovirus_1')
<graphviz...>

postgresql.teardown(pgdata)