## What is NebulaGraph Importer?

NebulaGraph Importer is a tool to import data into NebulaGraph.
## Features

- Support multiple data sources. Currently `local`, `s3`, `oss`, `ftp`, `sftp`, `hdfs`, and `gcs` are supported.
- Support multiple file formats. Currently only `csv` files are supported.
- Support files containing multiple tags, multiple edges, and a mixture of both.
- Support data transformations.
- Support record filtering.
- Support multiple modes, including `INSERT`, `UPDATE`, and `DELETE`.
- Support connecting to multiple Graph services with automatic load balancing.
- Support retry after failure.
- Human-readable status printing.

See the configuration instructions below for more features.
## How to Install

### From Releases

Download the package from the Releases page and grant it execute permission.

You can choose according to your needs; the following installation packages are supported:

- binary
- archives
- apk
- deb
- rpm
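
For example, after downloading the standalone binary on Linux, a minimal sketch looks like this (the downloaded file name is hypothetical; substitute the asset you actually picked):

```shell
$ chmod +x nebula-importer
$ ./nebula-importer --config <config_file>
```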
### From go install

```shell
$ go install github.com/vesoft-inc/nebula-importer/cmd/nebula-importer@latest
```
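
The binary is installed into `$GOBIN` (by default `$(go env GOPATH)/bin`), so make sure that directory is on your `PATH`.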
### From docker

```shell
$ docker pull vesoft/nebula-importer:<version>
$ docker run --rm -ti \
      --network=host \
      -v <config_file>:<config_file> \
      -v <data_dir>:<data_dir> \
      vesoft/nebula-importer:<version> \
      --config <config_file>

# config_file: the absolute path to the configuration file.
# data_dir: the absolute path to the data directory; ignore it if the data source is not a local file.
# version: the version of NebulaGraph Importer.
```
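
For instance, a hypothetical run with a configuration file at `/path/to/nebula-importer.yaml` and local CSV files under `/path/to/data` (the paths and version tag are illustrative):

```shell
$ docker run --rm -ti \
      --network=host \
      -v /path/to/nebula-importer.yaml:/path/to/nebula-importer.yaml \
      -v /path/to/data:/path/to/data \
      vesoft/nebula-importer:v4.1.0 \
      --config /path/to/nebula-importer.yaml
```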
### From Source Code

```shell
$ git clone https://github.com/vesoft-inc/nebula-importer
$ cd nebula-importer
$ make build
```

You can find a binary named `nebula-importer` in the `bin` directory.
## Configuration Instructions

NebulaGraph Importer's configuration file is in YAML format. You can find some examples in the `examples` directory.

Configuration options are divided into four groups:

- `client` contains the configuration options related to the NebulaGraph connection client.
- `manager` contains the global control configuration options for NebulaGraph Importer.
- `log` contains the configuration options related to printing logs.
- `sources` contains the data source configuration items.
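
Putting the four groups together, a configuration file has the following overall shape (a minimal sketch; the values are placeholders):

```yaml
client:
  version: v3
  address: "127.0.0.1:9669"
manager:
  spaceName: your_space   # hypothetical space name
log:
  level: INFO
sources:
  - path: ./person.csv    # one entry per data source
    csv:
      delimiter: ","
    tags:
      - name: Person
        id:
          type: "STRING"
          index: 0
```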
### client

```yaml
client:
  version: v3
  address: "127.0.0.1:9669"
  user: root
  password: nebula
  ssl:
    enable: true
    certPath: "your/cert/file/path"
    keyPath: "your/key/file/path"
    caPath: "your/ca/file/path"
    insecureSkipVerify: false
  concurrencyPerAddress: 16
  reconnectInitialInterval: 1s
  retry: 3
  retryInitialInterval: 1s
```
- `client.version`: Required. Specifies which version of NebulaGraph to connect to; currently only `v3` is supported.
- `client.address`: Required. The address of the Graph service in NebulaGraph.
- `client.user`: Optional. The user of NebulaGraph. The default value is `root`.
- `client.password`: Optional. The password of NebulaGraph. The default value is `nebula`.
- `client.ssl`: Optional. SSL-related configuration.
- `client.ssl.enable`: Optional. Specifies whether to enable SSL authentication. The default value is `false`.
- `client.ssl.certPath`: Required. Specifies the path of the certificate file.
- `client.ssl.keyPath`: Required. Specifies the path of the private key file.
- `client.ssl.caPath`: Required. Specifies the path of the certification authority file.
- `client.ssl.insecureSkipVerify`: Optional. Specifies whether the client verifies the server's certificate chain and host name. The default value is `false`.
- `client.concurrencyPerAddress`: Optional. The number of client connections to each Graph service in NebulaGraph. The default value is `10`.
- `client.reconnectInitialInterval`: Optional. The initial interval for reconnecting to NebulaGraph. The default value is `1s`.
- `client.retry`: Optional. The number of retries for failed nGQL queries in the NebulaGraph client. The default value is `3`.
- `client.retryInitialInterval`: Optional. The initial retry interval. The default value is `1s`.
### manager

```yaml
manager:
  spaceName: basic_int_examples
  batch: 128
  readerConcurrency: 50
  importerConcurrency: 512
  statsInterval: 10s
  hooks:
    before:
      - statements:
          - UPDATE CONFIGS storage:wal_ttl=3600;
          - UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
      - statements:
          - |
            DROP SPACE IF EXISTS basic_int_examples;
            CREATE SPACE IF NOT EXISTS basic_int_examples(partition_num=5, replica_factor=1, vid_type=int);
            USE basic_int_examples;
        wait: 10s
    after:
      - statements:
          - |
            UPDATE CONFIGS storage:wal_ttl=86400;
            UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
```
- `manager.spaceName`: Required. Specifies which space the data is imported into.
- `manager.batch`: Optional. Specifies the batch size of the inserted data for all sources. The default value is `128`.
- `manager.readerConcurrency`: Optional. Specifies the concurrency with which the reader reads from sources. The default value is `50`.
- `manager.importerConcurrency`: Optional. Specifies the concurrency for generating the insert nGQL statements, which are then executed by the client. The default value is `512`.
- `manager.statsInterval`: Optional. Specifies the interval at which statistics are printed. The default value is `10s`.
- `manager.hooks.before`: Optional. Configures the statements to execute before the import begins.
  - `manager.hooks.before.[].statements`: Defines the list of statements.
  - `manager.hooks.before.[].wait`: Optional. Defines the waiting time after executing the above statements.
- `manager.hooks.after`: Optional. Configures the statements to execute after the import is complete.
  - `manager.hooks.after.[].statements`: Optional. Defines the list of statements.
  - `manager.hooks.after.[].wait`: Optional. Defines the waiting time after executing the above statements.
### log

```yaml
log:
  level: INFO
  console: true
  files:
    - logs/nebula-importer.log
```
- `log.level`: Optional. Specifies the log level; optional values are `DEBUG`, `INFO`, `WARN`, `ERROR`, `PANIC`, and `FATAL`. The default value is `INFO`.
- `log.console`: Optional. Specifies whether to print logs to the console. The default value is `true`.
- `log.files`: Optional. Specifies which files to print logs to.
### sources

`sources` is the configuration of the data source list; each data source contains the data source information, data processing, and schema mapping.

The following are the relevant configuration items. A complete source entry is sketched after this list.

- `batch` specifies the batch size of the inserted data for this source. It takes priority over `manager.batch`.
- `path`, `s3`, `oss`, `ftp`, `sftp`, `hdfs`, and `gcs` are the information configurations of the various data sources; only one of them can be configured per source.
- `csv` describes the CSV file format information.
- `tags` describes the schema definition for tags.
- `edges` describes the schema definition for edges.
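
A single local-file source entry combines these items as follows (a minimal sketch; the file, tag name, and column layout are hypothetical):

```yaml
sources:
  - path: ./person.csv   # data source information
    batch: 256           # overrides manager.batch for this source
    csv:                 # file format information
      delimiter: ","
      withHeader: false
    tags:                # schema mapping
      - name: Person
        id:
          type: "STRING"
          index: 0
        props:
          - name: "firstName"
            type: "STRING"
            index: 1
```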
#### path

It only needs to be configured for local file data sources.

```yaml
path: ./person.csv
```

- `path`: Required. Specifies the path where the data files are stored. A relative path is resolved against the directory of the current configuration file. Wildcards in file names are also supported, for example `./follower-*.csv`; please make sure that all matching files have the same schema.
#### s3

It only needs to be configured for s3 data sources.

```yaml
s3:
  endpoint: <endpoint>
  region: <region>
  bucket: <bucket>
  key: <key>
  accessKeyID: <Access Key ID>
  accessKeySecret: <Access Key Secret>
```

- `endpoint`: Optional. The endpoint of the s3 service; it can be omitted when using AWS s3.
- `region`: Required. The region of the s3 service.
- `bucket`: Required. The bucket of the file in the s3 service.
- `key`: Required. The object key of the file in the s3 service.
- `accessKeyID`: Optional. The `Access Key ID` of the s3 service. If it is public data, there is no need to configure it.
- `accessKeySecret`: Optional. The `Access Key Secret` of the s3 service. If it is public data, there is no need to configure it.
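
For example, reading public data from AWS s3 needs only the region, bucket, and key (the values here are hypothetical):

```yaml
s3:
  region: us-east-1
  bucket: a-public-bucket
  key: data/person.csv
```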
#### oss

It only needs to be configured for oss data sources.

```yaml
oss:
  endpoint: <endpoint>
  bucket: <bucket>
  key: <key>
  accessKeyID: <Access Key ID>
  accessKeySecret: <Access Key Secret>
```

- `endpoint`: Required. The endpoint of the oss service.
- `bucket`: Required. The bucket of the file in the oss service.
- `key`: Required. The object key of the file in the oss service.
- `accessKeyID`: Required. The `Access Key ID` of the oss service.
- `accessKeySecret`: Required. The `Access Key Secret` of the oss service.
#### ftp

It only needs to be configured for ftp data sources.

```yaml
ftp:
  host: 192.168.0.10
  port: 21
  user: <user>
  password: <password>
  path: <path of file>
```

- `host`: Required. The host of the ftp service.
- `port`: Required. The port of the ftp service.
- `user`: Required. The user of the ftp service.
- `password`: Required. The password of the ftp service.
- `path`: Required. The path of the file in the ftp service.
#### sftp

It only needs to be configured for sftp data sources.

```yaml
sftp:
  host: 192.168.0.10
  port: 22
  user: <user>
  password: <password>
  keyFile: <keyFile>
  keyData: <keyData>
  passphrase: <passphrase>
  path: <path of file>
```

- `host`: Required. The host of the sftp service.
- `port`: Required. The port of the sftp service.
- `user`: Required. The user of the sftp service.
- `password`: Optional. The password of the sftp service.
- `keyFile`: Optional. The ssh key file path of the sftp service.
- `keyData`: Optional. The ssh key file content of the sftp service.
- `passphrase`: Optional. The ssh key passphrase of the sftp service.
- `path`: Required. The path of the file in the sftp service.
#### hdfs

It only needs to be configured for hdfs data sources.

```yaml
hdfs:
  address: 192.168.0.10:8020
  user: <user>
  servicePrincipalName: <Kerberos Service Principal Name>
  krb5ConfigFile: <Kerberos config file>
  ccacheFile: <Kerberos ccache file>
  keyTabFile: <Kerberos keytab file>
  password: <Kerberos password>
  dataTransferProtection: <Kerberos Data Transfer Protection>
  disablePAFXFAST: false
  path: <path of file>
```

- `address`: Required. The address of the hdfs service.
- `user`: Optional. The user of the hdfs service.
- `servicePrincipalName`: Optional. The Kerberos service principal name of the hdfs service when Kerberos is enabled.
- `krb5ConfigFile`: Optional. The Kerberos config file of the hdfs service when Kerberos is enabled. The default is `/etc/krb5.conf`.
- `ccacheFile`: Optional. The Kerberos ccache file of the hdfs service when Kerberos is enabled.
- `keyTabFile`: Optional. The Kerberos keytab file of the hdfs service when Kerberos is enabled.
- `password`: Optional. The Kerberos password of the hdfs service when Kerberos is enabled.
- `dataTransferProtection`: Optional. The data transfer protection of the hdfs service.
- `disablePAFXFAST`: Optional. Whether to forbid the client to use PA_FX_FAST.
- `path`: Required. The path of the file in the hdfs service.
#### gcs

It only needs to be configured for gcs data sources.

```yaml
gcs:
  endpoint: <endpoint>
  bucket: <bucket>
  key: <key>
  credentialsFile: <Service account or refresh token JSON credentials file>
  credentialsJSON: <Service account or refresh token JSON credentials>
  withoutAuthentication: <false | true>
```

- `endpoint`: Optional. The endpoint of the GCS service.
- `bucket`: Required. The bucket of the file in the GCS service.
- `key`: Required. The object key of the file in the GCS service.
- `credentialsFile`: Optional. Path to the service account or refresh token JSON credentials file. Not required for public data.
- `credentialsJSON`: Optional. Content of the service account or refresh token JSON credentials file. Not required for public data.
- `withoutAuthentication`: Optional. Specifies that no authentication should be used. The default is `false`.
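
Public GCS data, for instance, can be read without any credentials (the bucket and key are hypothetical):

```yaml
gcs:
  bucket: a-public-bucket
  key: data/person.csv
  withoutAuthentication: true
```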
#### batch

```yaml
batch: 256
```

- `batch`: Optional. Specifies the batch size of the inserted data for this source. It takes priority over `manager.batch`.
#### csv

```yaml
csv:
  delimiter: ","
  withHeader: false
  lazyQuotes: false
  comment: ""
```

- `delimiter`: Optional. Specifies the delimiter for the CSV files. The default value is `","`. Only a one-character string delimiter is supported.
- `withHeader`: Optional. Specifies whether to ignore the first record in the CSV file. The default value is `false`.
- `lazyQuotes`: Optional. If `lazyQuotes` is true, a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field.
- `comment`: Optional. Specifies the comment character. Lines beginning with the comment character without preceding whitespace are ignored.
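
For instance, a pipe-delimited file with a header row and `#`-prefixed comment lines could be described like this (the file layout is hypothetical):

```yaml
csv:
  delimiter: "|"
  withHeader: true   # skip the first record (the header row)
  comment: "#"       # ignore lines starting with '#'
```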
#### tags

```yaml
tags:
  - name: Person
    mode: INSERT
    filter:
      expr: (Record[1] == "Mahinda" or Record[1] == "Michael") and Record[3] == "male"
    id:
      type: "STRING"
      function: "hash"
      index: 0
    ignoreExistedIndex: true
    props:
      - name: "firstName"
        type: "STRING"
        index: 1
      - name: "lastName"
        type: "STRING"
        index: 2
      - name: "gender"
        type: "STRING"
        index: 3
        nullable: true
        defaultValue: male
      - name: "birthday"
        type: "DATE"
        index: 4
        nullable: true
        nullValue: _NULL_
      - name: "creationDate"
        type: "DATETIME"
        index: 5
      - name: "locationIP"
        type: "STRING"
        index: 6
      - name: "browserUsed"
        type: "STRING"
        index: 7
        nullable: true
        alternativeIndices:
          - 6
```

```yaml
# concatItems examples
tags:
  - name: Person
    id:
      type: "STRING"
      concatItems:
        - "abc"
        - 1
      function: hash
```
- `name`: Required. The tag name.
- `mode`: Optional. The mode for processing data; optional values are `INSERT`, `UPDATE`, and `DELETE`. The default is `INSERT`.
- `filter`: Optional. The data filtering configuration.
  - `expr`: Required. The filter expression. See the Filter Expression for details.
- `id`: Required. Describes the tag ID information.
  - `type`: Optional. The type of the ID. The default value is `STRING`.
  - `index`: Optional. The column number in the records. Required if `concatItems` is not configured.
  - `concatItems`: Optional. The items to concatenate into the ID. An item can be a string or an int, and the two can be mixed: a string represents a constant, and an int represents an index column. All items are joined in order. If set, the `index` above has no effect. See the sketch after this list.
  - `function`: Optional. The function used to generate the IDs. Currently only the `hash` function is supported.
- `ignoreExistedIndex`: Optional. Specifies whether to enable `IGNORE_EXISTED_INDEX`. The default value is `true`.
- `props`: Required. Describes the tag props definition.
  - `name`: Required. The property name; it must be the same as the tag property in NebulaGraph.
  - `type`: Optional. The property type. Currently `BOOL`, `INT`, `FLOAT`, `DOUBLE`, `STRING`, `TIME`, `TIMESTAMP`, `DATE`, `DATETIME`, `GEOGRAPHY`, `GEOGRAPHY(POINT)`, `GEOGRAPHY(LINESTRING)`, and `GEOGRAPHY(POLYGON)` are supported. The default value is `STRING`.
  - `index`: Required. The column number in the records.
  - `nullable`: Optional. Whether this property can be `NULL`; optional values are `true` and `false`. The default is `false`.
  - `nullValue`: Optional. Ignored when `nullable` is `false`. The property is set to `NULL` when the record value equals `nullValue`. The default is `""`.
  - `alternativeIndices`: Optional. Ignored when `nullable` is `false`. The property is fetched from the records according to these indices, in order, until a value not equal to `nullValue` is found.
  - `defaultValue`: Optional. Ignored when `nullable` is `false`. The property default value, used when all the values obtained by `index` and `alternativeIndices` equal `nullValue`.
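
To make `concatItems` concrete: in the second example above, each ID is built by joining the constant `"abc"` with column 1 of the record, and `hash` is then applied to the joined string (a minimal sketch with a hypothetical record):

```yaml
# Given the record ["p100", "Mahinda"]:
#   concatItems ["abc", 1]  ->  "abc" + Record[1] = "abcMahinda"
#   function hash           ->  the ID is generated as hash("abcMahinda")
id:
  type: "STRING"
  concatItems:
    - "abc"
    - 1
  function: hash
```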
#### edges

```yaml
edges:
  - name: KNOWS
    mode: INSERT
    filter:
      expr: (Record[1] == "Mahinda" or Record[1] == "Michael") and Record[3] == "male"
    src:
      id:
        type: "INT"
        index: 0
    dst:
      id:
        type: "INT"
        index: 1
    rank:
      index: 0
    ignoreExistedIndex: true
    props:
      - name: "creationDate"
        type: "DATETIME"
        index: 2
        nullable: true
        nullValue: _NULL_
        defaultValue: 0000-00-00T00:00:00
```
- `name`: Required. The edge name.
- `mode`: Optional. The `mode` here is similar to `mode` in the `tags` above.
- `filter`: Optional. The `filter` here is similar to `filter` in the `tags` above.
- `src`: Required. Describes the source definition for the edge.
- `src.id`: Required. The `id` here is similar to `id` in the `tags` above.
- `dst`: Required. Describes the destination definition for the edge.
- `dst.id`: Required. The `id` here is similar to `id` in the `tags` above.
- `rank`: Optional. Describes the rank definition for the edge.
- `rank.index`: Required. The column number in the records.
- `props`: Optional. Similar to the `props` in the `tags` above, but for edges.
See the Configuration Reference for details on the configurations.