Amazon DynamoDB Storage Backend for JanusGraph

JanusGraph: Distributed Graph Database is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. JanusGraph is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time. -- JanusGraph Homepage

Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale. It is a fully managed database and supports both document and key-value data models. Its flexible data model and reliable performance make it a great fit for mobile, web, gaming, ad-tech, IoT, and many other applications. -- AWS DynamoDB Homepage

JanusGraph + DynamoDB = Distributed Graph Database - Cluster Host Management

Features

The following is a list of features of the Amazon DynamoDB Storage Backend for JanusGraph.

Getting Started

This example populates a JanusGraph database backed by DynamoDB Local using the Marvel Universe Social Graph. The graph has a vertex per comic book character with an edge to each of the comic books in which they appeared.

Load a subset of the Marvel Universe Social Graph

  1. Install the prerequisites for this tutorial (Git, JDK 1.8, Maven, Docker, wget, gpg). The command below runs a convenience script for Amazon Linux on EC2 instances that installs Git, OpenJDK 1.8, Maven, Docker, and Docker Compose. It also adds ec2-user to the docker group so that you can run Docker commands without sudo. Log out and back in for the group change to take effect.

    curl https://raw.githubusercontent.com/awslabs/dynamodb-janusgraph-storage-backend/master/src/test/resources/install-reqs.sh | bash
    exit
    
  2. Clone the repository and change directories.

    git clone https://github.com/awslabs/dynamodb-janusgraph-storage-backend.git && cd dynamodb-janusgraph-storage-backend
    
  3. Use Docker and Docker Compose to bake DynamoDB Local into a container and start Gremlin Server with the DynamoDB Storage Backend for JanusGraph installed.

    docker build -t awslabs/dynamodblocal ./src/test/resources/dynamodb-local-docker \
    && src/test/resources/install-gremlin-server.sh \
    && cp server/dynamodb-janusgraph-storage-backend-*.zip src/test/resources/dynamodb-janusgraph-docker \
    && mvn docker:build -Pdynamodb-janusgraph-docker \
    && docker-compose -f src/test/resources/docker-compose.yml up -d \
    && docker exec -i -t dynamodb-janusgraph /var/jg/bin/gremlin.sh
    
  4. After the Gremlin shell starts, set it up to execute commands remotely.

    :remote connect tinkerpop.server conf/remote.yaml session
    :remote console
    
  5. Load the first 100 lines of the Marvel graph using the Gremlin shell.

    com.amazon.janusgraph.example.MarvelGraphFactory.load(graph, 100, false)
    
  6. Print the characters and the comic-books they appeared in where the characters had a weapon that was a shield or claws.

    g.V().has('weapon', within('shield','claws')).as('weapon', 'character', 'book').select('weapon', 'character','book').by('weapon').by('character').by(__.out('appeared').values('comic-book'))
    
  7. Print the characters and the comic-books they appeared in where the characters had a weapon that was not a shield or claws.

    g.V().has('weapon').has('weapon', without('shield','claws')).as('weapon', 'character', 'book').select('weapon', 'character','book').by('weapon').by('character').by(__.out('appeared').values('comic-book'))
    
  8. Print a sorted list of the characters that appear in comic-book AVF 4.

    g.V().has('comic-book', 'AVF 4').in('appeared').values('character').order()
    
  9. Print a sorted list of the characters that appear in comic-book AVF 4 that have a weapon that is not a shield or claws.

    g.V().has('comic-book', 'AVF 4').in('appeared').has('weapon', without('shield','claws')).values('character').order()
    
  10. Toggle out of remote mode with the command below, then press Control-C to quit.

    :remote console
    
  11. Clean up the composed Docker containers.

    docker-compose -f src/test/resources/docker-compose.yml stop
    

Load the Graph of the Gods

  1. Repeat steps 3 and 4 of the Marvel graph section, cleaning up the server directory beforehand with rm -rf server.

  2. Load the Graph of the Gods.

    GraphOfTheGodsFactory.loadWithoutMixedIndex(graph, true)
    
  3. Now you can follow the rest of the JanusGraph Getting Started documentation, starting from the Global Graph Indices section. See the scriptEngines/gremlin-groovy/scripts list element in the Gremlin Server YAML file for more information about what is in scope in the remote environment.

  4. Alternatively, repeat steps 1 through 8 of the Marvel graph section and follow the examples in the TinkerPop documentation. Skip the TinkerGraph.open() step, as the remote execution environment already has a graph variable set up. TinkerPop has other tutorials available as well.

Run Gremlin on Gremlin Server in EC2 using CloudFormation templates

The DynamoDB Storage Backend for JanusGraph includes CloudFormation templates that create a VPC and an EC2 instance in the VPC, install Gremlin Server with the DynamoDB Storage Backend for JanusGraph, and start the Gremlin Server WebSocket endpoint. Also included are templates that create the graph's DynamoDB tables. The Network ACL of the VPC includes just enough access to allow:

Requirements for running this CloudFormation template include two items.

Note that this CloudFormation template downloads the JanusGraph zip files available on the JanusGraph downloads page, then builds and adds the DynamoDB Storage Backend for JanusGraph with its dependencies.

CloudFormation Template table

Below you can find a list of CloudFormation templates discussed in this document, and links to launch each stack in CloudFormation and to view the stack in the designer.

| Template name | Description | View |
| --- | --- | --- |
| Single-Item Model Tables | Set up six graph tables with the single-item data model. | View |
| Multiple-Item Model Tables | Set up six graph tables with the multiple-item data model. | View |
| Gremlin Server on DynamoDB | Set up a VPC and an EC2 instance in it running Gremlin Server with the DynamoDB Storage Backend for JanusGraph. | View |

Instructions to Launch CloudFormation Stacks

  1. Choose between the single and multiple item data models and create your graph tables with the corresponding CloudFormation template above by downloading it and passing it to the CloudFormation console. Note, the configuration provided in src/test/resources/dynamodb.properties assumes that you will deploy the stack in us-west-2 and that you will use the multiple item model.
  2. Inspect the latest version of the Gremlin Server on DynamoDB stack in the third row above.
  3. Download the template from the third row to your computer and use it to create the Gremlin Server on DynamoDB stack.
  4. On the Select Template page, name your Gremlin Server stack and select the CloudFormation template that you just downloaded.
  5. On the Specify Parameters page, you need to specify the following:
  6. On the Options page, click Next.
  7. On the Review page, select "I acknowledge that this template might cause AWS CloudFormation to create IAM resources." Then, click Create.
  8. Start the Gremlin console on the host through SSH. You can copy and paste the GremlinShell output of the CloudFormation template and run it on your command line.
  9. Repeat steps 4 and onwards of the Marvel graph section above.

Data Model

The Amazon DynamoDB Storage Backend for JanusGraph has a flexible data model that allows clients to select the data model for each JanusGraph backend table. Clients can configure tables to use either a single-item model or a multiple-item model.

Single-Item Model

The single-item model uses a single DynamoDB item to store all values for a single key. In terms of JanusGraph backend implementations, the key becomes the DynamoDB hash key, and each column becomes an attribute name and the column value is stored in the respective attribute value.

This is the most efficient implementation, but DynamoDB limits each item to 400 KB. Use it only on tables that you are sure will not exceed the item size limit. Graphs with low vertex degree and a low number of items per index can take advantage of this implementation.

Multiple-Item Model

The multiple-item model uses multiple DynamoDB items to store all values for a single key. In terms of JanusGraph backend implementations, the key becomes the DynamoDB hash key, and each column becomes the range key in its own item. Each column value is stored in its own attribute.

The multiple-item model is less efficient than the single-item model during initial graph loads, but it avoids the 400 KB item size limit. The multiple-item model uses range Query calls instead of GetItem calls to get the necessary column values.
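The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the mapping described above, not the backend's actual code: the attribute names `hk`, `rk`, and `v` and both helper functions are assumptions made for this sketch.

```python
# Illustrative only: models how one JanusGraph key with several columns
# could be laid out as DynamoDB items under each data model.
# The attribute names "hk", "rk", and "v" are assumptions of this sketch.

def to_single_item(key, columns):
    """Single-item model: one item per key, one attribute per column."""
    item = {"hk": key}
    # Each column name becomes an attribute name; its value is the attribute value.
    item.update(columns)
    return [item]

def to_multiple_items(key, columns):
    """Multiple-item model: one item per column, with the column as range key."""
    return [{"hk": key, "rk": column, "v": value}
            for column, value in sorted(columns.items())]

columns = {"c1": "v1", "c2": "v2", "c3": "v3"}
single = to_single_item("key-1", columns)
multi = to_multiple_items("key-1", columns)
print(len(single), "item under the single-item model")
print(len(multi), "items under the multiple-item model")
```

Under the single-item layout, every column added to a key grows the same item, which is why the 400 KB item limit matters for high-degree vertices.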

DynamoDB Specific Configuration

Each configuration option has a certain mutability level that governs whether and how it can be modified after the database is opened for the first time. The following listing describes the mutability levels.

  1. FIXED - Once the database has been opened, these configuration options cannot be changed for the entire life of the database
  2. GLOBAL_OFFLINE - These options can only be changed for the entire database cluster at once when all instances are shut down
  3. GLOBAL - These options can only be changed globally across the entire database cluster
  4. MASKABLE - These options are global but can be overwritten by a local configuration file
  5. LOCAL - These options can only be provided through a local configuration file

Leading namespace names are shortened, and spaces are sometimes inserted in long strings, so that the tables below are formatted correctly. Remove those spaces when copying a value into your configuration.

General DynamoDB Configuration Parameters

All of the following parameters are in the storage (s) namespace, and most are in the storage.dynamodb (s.d) namespace subset.

| Name | Description | Datatype | Default Value | Mutability |
| --- | --- | --- | --- | --- |
| s.backend | The primary persistence provider used by JanusGraph. To use DynamoDB you must set this to com.amazon.janusgraph.diskstorage. dynamodb.DynamoDBStoreManager | String | | LOCAL |
| s.d.prefix | A prefix to put before the JanusGraph table name. This allows clients to have multiple graphs in the same AWS DynamoDB account in the same region. | String | jg | LOCAL |
| s.d.metrics-prefix | Prefix on the codahale metric names emitted by DynamoDBDelegate. | String | d | LOCAL |
| s.d.force-consistent-read | This feature sets the force consistent read property on DynamoDB calls. | Boolean | true | LOCAL |
| s.d.enable-parallel-scan | This feature changes the scan behavior from a sequential scan (with consistent key order) to a segmented, parallel scan. Enabling this feature will make full graph scans faster, but it may cause this backend to be incompatible with Titan's OLAP library. | Boolean | false | LOCAL |
| s.d.max-self-throttled-retries | The number of retries that the backend should attempt and self-throttle. | Integer | 60 | LOCAL |
| s.d.initial-retry-millis | The amount of time to initially wait (in milliseconds) when retrying self-throttled DynamoDB API calls. | Integer | 25 | LOCAL |
| s.d.control-plane-rate | The rate in permits per second at which to issue DynamoDB control plane requests (CreateTable, UpdateTable, DeleteTable, ListTables, DescribeTable). | Double | 10 | LOCAL |
| s.d.native-locking | Set this to false if you need to use JanusGraph's locking mechanism for remote lock expiry. | Boolean | true | LOCAL |
| s.d.use-titan-ids | Set this to true if you are migrating from Titan to JanusGraph so that you do not have to copy your titan_ids table. | Boolean | false | LOCAL |
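Because the option names in this document are abbreviated, it can help to see how they expand into the full property names a configuration file would use. The Python sketch below is a hypothetical helper (the names `PREFIXES`, `expand`, and `render_properties` are not part of the backend); the prefix mapping is taken from the namespace introductions in this document, and the example values are illustrative.

```python
# Sketch: expand the shortened namespace prefixes used in this document
# into full property names. Helper names are hypothetical.

PREFIXES = [
    # Longest prefixes first so "s.d.c.p." wins over "s.d.c." and "s.d.".
    ("s.d.c.p.", "storage.dynamodb.client.proxy."),
    ("s.d.c.s.", "storage.dynamodb.client.socket."),
    ("s.d.c.e.", "storage.dynamodb.client.executor."),
    ("s.d.c.c.", "storage.dynamodb.client.credentials."),
    ("s.d.c.", "storage.dynamodb.client."),
    ("s.d.s.", "storage.dynamodb.stores."),
    ("s.d.", "storage.dynamodb."),
    ("s.", "storage."),
]

def expand(name):
    """Return the full property name for an abbreviated one."""
    for short, full in PREFIXES:
        if name.startswith(short):
            return full + name[len(short):]
    return name  # already a full name, e.g. ids.store-name

def render_properties(settings):
    """Render a dict of abbreviated settings as properties-file lines."""
    return "\n".join(f"{expand(key)}={value}" for key, value in settings.items())

settings = {
    "s.backend": "com.amazon.janusgraph.diskstorage.dynamodb.DynamoDBStoreManager",
    "s.d.prefix": "jg",
    "s.d.force-consistent-read": "true",
    "s.d.c.signing-region": "us-west-2",  # example region
}
print(render_properties(settings))
```

Matching on the trailing dot keeps `s.d.c.signing-region` in the client namespace rather than mistaking it for a socket option.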

DynamoDB KeyColumnValue Store Configuration Parameters

Some configurations require specifications for each of the JanusGraph backend Key-Column-Value stores. Here is a list of the default JanusGraph backend Key-Column-Value stores:

Any store you define in the umbrella storage.dynamodb.stores.* namespace that starts with ulog_ will be used for user-defined transaction logs.

If you opt out of storage-native locking with the storage.dynamodb.native-locking = false configuration, you will need to configure the data model, initial capacity and rate limiters for the three following stores:

You can configure the initial read and write capacity, rate limits, scan limits and data model for each KCV graph store. You can always scale the read and write capacity of your tables up and down in the DynamoDB console. If you have a write-once, read-many workload, or you are running a bulk data load, it is useful to raise the capacity of the edgestore and graphindex tables as necessary in the DynamoDB console, and to decrease the allocated capacity and rate limiters afterwards.

For details about these Key-Column-Value stores, please see Store Mapping and JanusGraph Data Model. All of these configuration parameters are in the storage.dynamodb.stores (s.d.s) umbrella namespace subset. In the tables below these configurations have the text t where the JanusGraph store name should go.
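As a rough illustration of this naming scheme, the Python sketch below renders per-store settings with the store name substituted for `t`. The helper `store_properties` is hypothetical, the capacity values are arbitrary examples, and setting the rate limiters equal to the provisioned capacity is an assumption of the sketch rather than a requirement of the backend.

```python
# Sketch: render per-store settings under storage.dynamodb.stores.<store>.
# Store names and capacity values below are examples only.

def store_properties(store, data_model="MULTI", read_capacity=4, write_capacity=4):
    """Return properties-file lines for one Key-Column-Value store."""
    prefix = f"storage.dynamodb.stores.{store}."
    settings = {
        "data-model": data_model,
        "initial-capacity-read": read_capacity,
        "initial-capacity-write": write_capacity,
        # Assumption: rate limiters mirror the provisioned capacity.
        "read-rate": read_capacity,
        "write-rate": write_capacity,
    }
    return [f"{prefix}{name}={value}" for name, value in settings.items()]

lines = store_properties("edgestore", read_capacity=100, write_capacity=100)
lines += store_properties("graphindex", read_capacity=50, write_capacity=50)
print("\n".join(lines))
```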

When upgrading from Titan 1.0.0, you will need to set the ids.store-name configuration to titan_ids to avoid re-using id ranges that are already assigned.

NameDescriptionDatatypeDefault ValueMutability
| Name | Description | Datatype | Default Value | Mutability |
| --- | --- | --- | --- | --- |
| s.d.s.t.data-model | SINGLE means that all the values for a given key are put into a single DynamoDB item. A SINGLE is efficient because all the updates for a single key can be done atomically. However, the tradeoff is that DynamoDB has a 400 KB limit per item so it cannot hold much data. MULTI means that each 'column' is used as a range key in DynamoDB so a key can span multiple items. A MULTI implementation is slightly less efficient than SINGLE because it must use DynamoDB Query rather than a direct lookup. It is HIGHLY recommended to use MULTI for edgestore and graphindex unless your graph has very low max degree. | String | MULTI | FIXED |
| s.d.s.t.initial-capacity-read | Define the initial read capacity for a given DynamoDB table. Make sure to replace the t with your actual store name. | Integer | 4 | LOCAL |
| s.d.s.t.initial-capacity-write | Define the initial write capacity for a given DynamoDB table. Make sure to replace the t with your actual store name. | Integer | 4 | LOCAL |
| s.d.s.t.read-rate | The max number of reads per second. | Double | 4 | LOCAL |
| s.d.s.t.write-rate | Used to throttle write rate of given table. The max number of writes per second. | Double | 4 | LOCAL |
| s.d.s.t.scan-limit | The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. | Integer | 10000 | LOCAL |
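The scan-limit behavior can be sketched with a small in-memory simulation. The function `scan_page` below is a hypothetical stand-in for a DynamoDB Scan call, not an SDK function, and it models only the item-count limit, not the 1 MB cutoff. It shows that the limit caps the number of items evaluated per page, so a page may return fewer matches than the limit.

```python
# Sketch of limit-based pagination as described for scan-limit.
# scan_page is a hypothetical in-memory stand-in for a DynamoDB Scan call.

def scan_page(items, limit, start_key=None, predicate=lambda item: True):
    """Evaluate at most `limit` items; return matches and a resume key."""
    start = 0 if start_key is None else start_key
    end = min(start + limit, len(items))
    matches = [item for item in items[start:end] if predicate(item)]
    # A resume key is returned only when items remain to be evaluated.
    last_evaluated_key = end if end < len(items) else None
    return matches, last_evaluated_key

def scan_all(items, limit, predicate=lambda item: True):
    """Request pages until no resume key is returned."""
    results, pages, key = [], 0, None
    while True:
        matches, key = scan_page(items, limit, key, predicate)
        results.extend(matches)
        pages += 1
        if key is None:
            return results, pages

table = list(range(25))
evens, pages = scan_all(table, limit=10, predicate=lambda n: n % 2 == 0)
print(len(evens), "matching items in", pages, "pages")
```

In this run the first page evaluates ten items but returns only the five that match, which is the distinction the description draws between evaluated and matching items.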

DynamoDB Client Configuration Parameters

All of these configuration parameters are in the storage.dynamodb.client (s.d.c) namespace subset, and are related to the DynamoDB SDK client configuration.

NameDescriptionDatatypeDefault ValueMutability
| Name | Description | Datatype | Default Value | Mutability |
| --- | --- | --- | --- | --- |
| s.d.c.connection-timeout | The amount of time to wait (in milliseconds) when initially establishing a connection before giving up and timing out. | Integer | 60000 | LOCAL |
| s.d.c.connection-ttl | The expiration time (in milliseconds) for a connection in the connection pool. | Integer | 60000 | LOCAL |
| s.d.c.connection-max | The maximum number of allowed open HTTP connections. | Integer | 10 | LOCAL |
| s.d.c.retry-error-max | The maximum number of retry attempts for failed retryable requests (ex: 5xx error responses from services). | Integer | 0 | LOCAL |
| s.d.c.use-gzip | Sets whether gzip compression should be used. | Boolean | false | LOCAL |
| s.d.c.use-reaper | Sets whether the IdleConnectionReaper is to be started as a daemon thread. | Boolean | true | LOCAL |
| s.d.c.user-agent | The HTTP user agent header to send with all requests. | String | | LOCAL |
| s.d.c.endpoint | Sets the service endpoint to use for connecting to DynamoDB. | String | | LOCAL |
| s.d.c.signing-region | Sets the signing region to use for signing requests to DynamoDB. Required. | String | | LOCAL |

DynamoDB Client Proxy Configuration Parameters

All of these configuration parameters are in the storage.dynamodb.client.proxy (s.d.c.p) namespace subset, and are related to the DynamoDB SDK client proxy configuration.

| Name | Description | Datatype | Default Value | Mutability |
| --- | --- | --- | --- | --- |
| s.d.c.p.domain | The optional Windows domain name for configuring an NTLM proxy. | String | | LOCAL |
| s.d.c.p.workstation | The optional Windows workstation name for configuring NTLM proxy support. | String | | LOCAL |
| s.d.c.p.host | The optional proxy host the client will connect through. | String | | LOCAL |
| s.d.c.p.port | The optional proxy port the client will connect through. | String | | LOCAL |
| s.d.c.p.username | The optional proxy user name to use if connecting through a proxy. | String | | LOCAL |
| s.d.c.p.password | The optional proxy password to use when connecting through a proxy. | String | | LOCAL |

DynamoDB Client Socket Configuration Parameters

All of these configuration parameters are in the storage.dynamodb.client.socket (s.d.c.s) namespace subset, and are related to the DynamoDB SDK client socket configuration.

| Name | Description | Datatype | Default Value | Mutability |
| --- | --- | --- | --- | --- |
| s.d.c.s.buffer-send-hint | The optional size hints (in bytes) for the low level TCP send and receive buffers. | Integer | 1048576 | LOCAL |
| s.d.c.s.buffer-recv-hint | The optional size hints (in bytes) for the low level TCP send and receive buffers. | Integer | 1048576 | LOCAL |
| s.d.c.s.timeout | The amount of time to wait (in milliseconds) for data to be transferred over an established, open connection before the connection times out and is closed. | Long | 50000 | LOCAL |
| s.d.c.s.tcp-keep-alive | Sets whether or not to enable TCP KeepAlive support at the socket level. Not used at the moment. | Boolean | | LOCAL |

DynamoDB Client Executor Configuration Parameters

All of these configuration parameters are in the storage.dynamodb.client.executor (s.d.c.e) namespace subset, and are related to the DynamoDB SDK client executor / thread-pool configuration.

| Name | Description | Datatype | Default Value | Mutability |
| --- | --- | --- | --- | --- |
| s.d.c.e.core-pool-size | The core number of threads for the DynamoDB async client. | Integer | 25 | LOCAL |
| s.d.c.e.max-pool-size | The maximum allowed number of threads for the DynamoDB async client. | Integer | 50 | LOCAL |
| s.d.c.e.keep-alive | The time limit for which threads may remain idle before being terminated for the DynamoDB async client. | Integer | | LOCAL |
| s.d.c.e.max-queue-length | The maximum size of the executor queue before requests start getting run in the caller. | Integer | 1024 | LOCAL |
| s.d.c.e.max-concurrent-operations | The number of threads expected to be using a single JanusGraph instance. Used to allocate threads to batch operations. | Integer | 1 | LOCAL |

DynamoDB Client Credential Configuration Parameters

All of these configuration parameters are in the storage.dynamodb.client.credentials (s.d.c.c) namespace subset, and are related to the DynamoDB SDK client credential configuration.

| Name | Description | Datatype | Default Value | Mutability |
| --- | --- | --- | --- | --- |
| s.d.c.c.class-name | Specify the fully qualified class that implements AWSCredentialsProvider or AWSCredentials. | String | com.amazonaws.auth. BasicAWSCredentials | LOCAL |
| s.d.c.c.constructor-args | Comma separated list of strings to pass to the credentials constructor. | String | accessKey,secretKey | LOCAL |

Upgrading from Titan 1.0.0

Earlier versions of this software supported Titan 1.0.0. To upgrade from the DynamoDB Storage Backend for Titan 1.0.0, update your configuration with the steps below.

  1. Set the JanusGraph configuration option ids.store-name=titan_ids. This allows you to reuse your titan_ids table.
  2. Update the storage.backend setting so that it uses the latest package name, storage.backend=com.amazon.janusgraph.diskstorage.dynamodb.DynamoDBStoreManager.
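The two configuration changes above can be expressed as a small transformation over an existing set of properties. The Python sketch below assumes the configuration is held as a plain dict; `upgrade_from_titan` is a hypothetical helper, and the old backend value is a placeholder rather than a real class name.

```python
# Sketch: apply the two upgrade steps to a configuration held as a dict.
# upgrade_from_titan is a hypothetical helper, not part of the backend.

JANUSGRAPH_BACKEND = "com.amazon.janusgraph.diskstorage.dynamodb.DynamoDBStoreManager"

def upgrade_from_titan(config):
    """Return a copy of `config` with the Titan upgrade settings applied."""
    upgraded = dict(config)  # leave the caller's configuration untouched
    upgraded["ids.store-name"] = "titan_ids"          # step 1: reuse the titan_ids table
    upgraded["storage.backend"] = JANUSGRAPH_BACKEND  # step 2: latest package name
    return upgraded

old = {
    "storage.backend": "<old Titan store manager class>",  # placeholder value
    "storage.dynamodb.prefix": "jg",
}
new = upgrade_from_titan(old)
for key in sorted(new):
    print(f"{key}={new[key]}")
```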

Run all tests against DynamoDB Local on an EC2 Amazon Linux AMI

  1. Install dependencies. For Amazon Linux:

    sudo wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo \
      -O /etc/yum.repos.d/epel-apache-maven.repo
    sudo sed -i s/\$releasever/6/g /etc/yum.repos.d/epel-apache-maven.repo
    sudo yum update -y && sudo yum upgrade -y
    sudo yum install -y apache-maven sqlite-devel git java-1.8.0-openjdk-devel
    sudo alternatives --set java /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
    sudo alternatives --set javac /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/javac
    git clone https://github.com/awslabs/dynamodb-janusgraph-storage-backend.git
    cd dynamodb-janusgraph-storage-backend && mvn install
    
  2. Open a screen session with screen so that you can log out of the EC2 instance while the tests run.

  3. Run the single-item data model tests.

    mvn verify -P integration-tests \
    -Dexclude.category=com.amazon.janusgraph.testcategory.MultipleItemTestCategory \
    -Dinclude.category="**/*.java" > o 2>&1
    
  4. Run the multiple-item data model tests.

    mvn verify -P integration-tests \
    -Dexclude.category=com.amazon.janusgraph.testcategory.SingleItemTestCategory \
    -Dinclude.category="**/*.java" > o 2>&1
    
  5. Run other miscellaneous tests.

    mvn verify -P integration-tests -Dinclude.category="**/*.java" \
        -Dgroups=com.amazon.janusgraph.testcategory.IsolateRemainingTestsCategory > o 2>&1
    
  6. Detach from the screen with CTRL-A D and log out of the EC2 instance.

  7. Monitor the CPU usage of your EC2 instance in the EC2 console. The single-item tests may take at least 1 hour and the multiple-item tests may take at least 2 hours to run. When CPU usage goes to zero, that means the tests are done.

  8. Log back into the EC2 instance and resume the screen with screen -r to review the test results.

    cd target/surefire-reports && grep testcase *.xml | grep -v "\/"
    
  9. Terminate the instance when done.
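The grep in step 8 lists testcase elements that are not self-closing, which in Surefire XML reports are typically the ones carrying a failure, error, or other child element. As an alternative, the Python sketch below parses a report and returns only the test cases with a failure or error child. The function name `failed_testcases` and the sample report are made up for illustration; real reports would be read from the files in target/surefire-reports.

```python
# Sketch: list failed or errored test cases from a Surefire-style XML report.
import xml.etree.ElementTree as ET

def failed_testcases(report_xml):
    """Return (classname, name) for each testcase with a failure or error child."""
    root = ET.fromstring(report_xml)
    failed = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append((case.get("classname"), case.get("name")))
    return failed

# Made-up sample report with one passing and one failing test case.
sample = """<testsuite name="ExampleSuite" tests="2">
  <testcase classname="ExampleSuite" name="passes" time="0.01"/>
  <testcase classname="ExampleSuite" name="fails" time="0.02">
    <failure message="boom">stack trace</failure>
  </testcase>
</testsuite>"""
print(failed_testcases(sample))
```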