Awesome

Extended XYZ specification and parsing tools

This repository contains a specification of the extended XYZ (extxyz) file format, and tools for reading and writing to it from programs written in C, Fortran, Python and Julia.

Installation

Python

The latest development version can be installed via

pip install git+https://github.com/libAtoms/extxyz

This requires Python 3.6+ and a working C compiler, plus the PCRE2 and libcleri libraries. libcleri is included here as a submodule and will be compiled automatically, but you may need to install PCRE2 with something similar to one of the following commands.

brew install pcre2 # Mac OS, with homebrew
sudo apt-get install libpcre2-dev # Ubuntu

Binary wheels for Linux and MacOS which do not require PCRE2 or libcleri are built in the GitHub CI for each tagged release.

Stable releases are made to PyPI, so you can install with

pip install extxyz

`libextxyz` C library

The underlying parser is written in C. The C code is compiled automatically when you build the Python package, but you can also compile it as shared library libextxyz.so as follows

make -C libextxyz
make -C libextxyz install

The Makefile respects the usual environment variables CC, CFLAGS, LDFLAGS, etc, plus prefix (default /usr/local).

Fortran bindings

To build the fextxyz exectuable demonstrating the Fortran bindings, you first need to download and compile QUIP -- see the CI for an example of how to do that automatically. Then, set QUIP_ROOT and QUIP_ARCH

export QUIP_ARCH=linux_x86_64_gfortran
export QUIP_ROOT=/path/to/QUIP
make -C libextxyz fextxyz

The Fortran bindings will later be moved to QUIP, since they are tied to QUIP's Dictionary and Atoms types.

Julia bindings

Julia bindings are distributed in a separate package, named ExtXYZ.jl. See its documentation for further details.

Usage

Usage of the Python package is similar to the ase.io.read() and ase.io.write() functions, e.g:

from extxyz import read, iread, write, ExtXYZTrajectoryWriter
from ase.build import bulk
from ase.optimize import BFGS
from ase.calculators.emt import EMT

atoms = bulk("Cu") * 3
frames = [atoms.copy() for frame in range(3)]
for frame in frames:
    frame.rattle()
    
write("filename.xyz", frames)

frames = read("filename.xyz") # all frames in file
atoms = read("filename.xyz", index=0) # first frame in file
write("newfile.xyz", frames)

traj = ExtXYZTrajectoryWriter("traj.xyz", atoms=atoms)
atoms.calc = EMT()
opt = BFGS(atoms, trajectory=traj)
opt.run(fmax=1e-3)

There is also an extxyz command line tool for testing purposes, see extxyz -h for help. This can alternatively be invoked via python -m extxyz.

Remaining issues

~~make treatement of 9 elem old-1d consistent: now extxyz.py always reshapes (not just Lattice) to 3x3, but extxyz.c does not.~~
Since we're using python regexp/PCRE, we could make per-atom strings be more complex, e.g. bare or quoted strings from key-value pairs. Should we?
Decide what to do about unparseable comment lines. Just assume an old fashioned xyz with an arbitrary line, or fail? I don't think we really want every parsing breaking typo to result in plain xyz.
Used to be able to quote with {}. Do we want to support this?

Extended XYZ specifcation

General formatting

Allowed characters: printable subset of ASCII, single byte
Allowed whitespace: plain space and tab (no fancy unicode nonbreaking space, etc)
Allowed end-of line (EOL) characters set by implementation + OS
- pure python: whatever is used to return lines by file object iterator
- low level c: fgets()
Blank lines: allowed only as 2nd line of each frame (for plain xyz) and at end of file

General definitions

regex: PCRE/python regular expression
Whitespace: regex \s, i.e. space and tab

Primitive Data Types

String

Sequence of one or more allowed characters, optionally quoted, but must be quoted in some circumstances.

Allowed characters - all except newline
Entire string may be surrounded by double quotes, as first and last characters (must match). Quotes inside string that are same as containing quotes must be escaped with backslash. Outermost double quotes are not considered part of string value.
Strings that contain any of the following characters must be quoted (not just backslash escaped)
- whitespace (regex \s)
- equals =
- double quote ", must be represented by \"
- comma ,
- open or close square bracket [ ] or curly brackets { }
- backslash, must be represented by double backslash \\
- newline, must be represented by \n
Backslash \: only present in quoted strings, only used for escaping next character. All backslash escaped characters are the following character itself except \n, which encodes a newline.
Must conform to one of the following regex
- quoted string: (")(?:(?=(\\?))\2.)*?\1
- bare (unquoted) string: (?:[^\s=",}{\]\[\\]|(?:\\[\s=",}{\]\[\\]))+
only used in comment line key-value pairs, not per-atom data

Simple string

Sequence of one or more allowed characters, unquoted (so even outermost quotes are part of string), and without whitespace

allowed characters - regex \S, i.e. all except newline and whitespace
regex \S+
only used in per-atom data, not comment line key-value pairs

Logical/boolean

T or F or [tT]rue or [fF]alse or TRUE or FALSE
regex
- true: (?:[tT]rue|TRUE|T)\b
- false: (?:[fF]alse|FALSE|F)\b

Integer number

string of one or more decimal digits, optionally preceded by sign

regex [+-]?+(?:0|[1-9][0-9]*)+\b

Floating point number

optional leading sign [+-], decimal number including optional decimal point ., optional [dDeE] folllowed by exponent consisting of optional sign followed by string of one or more digits
regex
- integer without leading sign bare_int = '(?:0|[1-9][0-9]*)'
- optional sign opt_sign = '[+-]?'
- floating number with decimal point float_dec = '(?:' + bare_int + '\.|\.)[0-9]*'
- exponent exp = '(?:[dDeE]'+opt_sign+'[0-9]+)?'
- end of number num_end = '(?:\b|(?=\W)|$)'
- combined float regexp opt_sign + '(?:' + float_dec + exp + '|' + bare_int + exp + '|' + bare_int + ')' + num_end

Order for identifying primitive data types, accept first one that matches

int
float
bool
bare string (containing no whitespace or special characters)
quoted string (starting and ending with double quote and containing only allowed characters)

one dimensional array (vector)

sequence of one or more of the same primitive type

new style: opens with [, one or more of the same primitive type separated by commas and optional whitespace, ends with ]
backward compatible: opens with " or {, one or more of the same primitive types (all types allowed in {}, all except string in "") separated by whitespace, ends with matching " or }. For backward compatibility, a single element backward compatible array is interpreted as a scalar of the same type.
primitive data type is determined by same priority as single primitive item, but must be satisfied by entire list simultaneously. E.g. all integers will result in an integer array, but a mix of integer and float will result in a float array, and a mix of integer and valid strings will results in a string array.

two dimensional array (matrix)

sequence of one or more new style one dimensional arrays of the same length and type

opens with [, one or more new style one dimensional arrays separated by commas, ends with ]
all contained one dimensional arrays in a single two dimensional array must have same number and primitive data type elements, and will be promoted to other possible types if necessary to parse entire array. E.g. a row of integers followed by a row of strings will be promoted to a 2-d string array.

XYZ file

A concatenation of 1 or more FRAMES (below), with optional blank lines at the end (but not between frames)

FRAME

Line 1: a single integer <N> preceded and followed by optional whitespace
Line 2: zero or more per-config key=value pairs (see key-value pairs below)
Lines 3..N+2: per-atom data lines with M columns each (see Properties and Per-Atom Data below)

key=value pairs on second ("comment") line

Associates per-configuration value with key. Spaces are allowed around = sign, which do not become part of the key or value.

Key: bare or quoted string

Value: primitive type, 1-D array, or 2-D array. Type is determined from context according to order specified above.

Special key "Properties”: defines the columns in the subsequent lines in the frame.

Value is a string with the format of a series of triplets, separated by “:”, each triplet having the format: “<name>:<T>:<m>”.
- The <name> (string) names the column(s), <T> is a one of “S”, “I”, “R”, “L”, and indicates the type in the column, “string”, “integer”, “real”, “logical”, respectively. <m> is an integer > 0 specifying how many consecutive columns are being referred to.
- The sum of the counts "m" must equal number of per-atom columns M (as defined in FRAME)
If after full parsing the key “Properties” is missing, the format is retroactively assumed to be plain xyz (4 columns, Z/species x y z), the entire second line is stored as a per-config “comment” property, and columns beyond the 4th are not read.

Per-atom data lines

Each column contains a sequence of primitive types, except string, which is replaced with simple string, separated by one or more whitespace characters, ending with EOL (optional for last line). The total number of columns in each row must be equal to the M and to the sum of the counts "m" in the "Properties" value string.

READING `ase.atoms.Atoms` FROM THIS FORMAT

Specific keys indicate special values, with specific order for overriding

Key-value pairs:

Lattice -> Atoms.cell, optional [do we want to accept "cell" also?]
- 3x3 matrix - rows are cell vectors [preferred]
- 9-vector - 3 cell vectors concatenated [only for backward compat]
- 3-vector - diagonal entries of cell matrix [?]
pbc -> Atoms.pbc, optional
- 3-vector of bool
- default [False]*3 if no Lattice, otherwise [True]*3
Calculator results, used to set SinglePointCalculator.results dict
- all per-config properties in ase.calculator.all_properties, with same name
- scalars, vectors - directly stored
- stress
  - 6-vector Voigt
  - 9-vector, 3x3 matrix, stored as stress Voigt-6, fail if not symmetric
- virial -> stress (to convert multiply by -1/cell_vol), same format as stress [warn/fail if stress also present, perhaps only if inconsistent?]

Properties keys (all types are per-atom), types are simple

Atoms
- Z -> numbers
- species -> numbers, fail if not valid chemical symbol [warn/fail if conflict with Z?]
- pos -> positions
- mass -> masses
- velo -> momenta (get mass from atomic number if missing)
- same name: initial_charges, initial_magmoms
Calculator.results
- local_energy -> energies
- forces -> forces [also support “force”? What about overriding, complain if inconsistent?]
- same name: magmoms (scalar or 3-vector), charges

WRITING ase.atoms.Atoms TO THIS FORMAT

General considerations

platform-appropriate EOL
[require some specific whitespace convention?]
scalars
- all strings are quoted
- otherwise stored unquoted
arrays
- use {} [or []?] container marks, comma separated (not backward compatible " and space separated forms)
Definitely store (naming as described below)
- all "first-class" Atoms properties (cell, pbc, numbers, masses, positions, momenta [any others?])
- all info keys that are scalar, 1-D, 2-D array of prim type
- all arrays that are scalar (Natoms x 1) or 1-D array( Natoms x (m > 1)) of prim type, shape[1] mapped to number of columns and space separated, not using regular array notation
- [optionally warn about un-representable quantities?]
all Calculator.results key-value pairs, per-config same as info, per-atom same as arrays
Perhaps store
- all info keys, per-config calculator results that are not representable (i.e. not prim type scalar, 1-D, or 2-D for per-config only) but can be mapped to JSON, as string starting with "_JSON "
- same for arrays [?]
In general, keep ASE data type/dimension, invert mapping of names for reading. For quantities that have multiple possible names, use:
- Lattice, not cell, 3x3 matrix
- velo, not momenta
- stress, not virial, as 3x3 matrix [are we OK with this?]