Comparisons with other transport protocols

There are already several solutions with considerable overlap with the suggested protocol. In order to get a clearer view as to whether it is worth implementing another new protocol, some existing are described and compared against the goals of the suggested SAYDX protocol. In this document, only the transport layer should be discussed here. Possible higher level protocols based on SAYDX (e.g. for exchanging data between atomistic simulation tools) should be described and compared against existing options in a different chapter.

HDF5

HDF5 is an open-source library (partially BSD, partially MIT licensed) using (and defining) its own data storage format. It is widespread in the scientific community and bindings exist for almost all commonly used programming languages.

SAYDX is probably closest to HDF5 in spirit. The main similarities are:

  • Basic building blocks are arrays of data. Type information about elements in an array is stored only once for an array (together with the shape of the array).
  • Data is stored in binary format.
  • Arrays can be arranged in a tree like structure with named nodes.
  • HDF5 allows storage of data (of arbitrary type) at a node. This should be possible in the SAYDX tree as well (although, only as string attributes).

Differences:

  • HDF5 is pretty much file I/O oriented. Although it is possible to build up a tree in memory, it is not clear to me (B.A.), whether this tree can then be communicated to a routine in a library, other than being written to disc and being reread again. In environments supporting FIFO, named pipes might be able to partly address this, but this is operating system specific (BH). SAYDX should allow for passing trees between program components (even if written in different languages, e.g. Fortran and C) and running on a range of OSes.

  • HDF5 is optimized for handling large amount of data. It had very advanced features, like parallel I/O. On the other hand, it is a very complex library which is not straightforward to build or link. SAYDX should be optimized for more moderate data amounts (kilobytes to few megabytes, maybe up to a few hundred megabytes). It should have much less features than HDF5 (e.g. no parallel I/O), but hopefully then be much easier to maintain, port, build and to link.

  • HDF5 requires the path to each node in the tree to be unique. SAYDX should be more like XML, allowing a node to have several children with the same name (although open for discussions). Using XML-notation to indicate nodes, in SAYDX one could have:

    <simulation>
      <frame>
        <timestep>1</timestep>
        <coordinates>...</coordinates>
        [...]
      </frame>
      <frame>
        <timestep>2</timestep>
        <coordinates>...</coordinates>
        [...]
      </frame>
    </simulation>
    

    while in HDF5 (using the same notation) one would have to write

    <simulation>
      <frame1>
        <coordinates>...</coordinates>
        [...]
      </frame1>
      <frame2>
        <coordinates>...</coordinates>
        [...]
      </frame2>
    </simulation>
    

JSON

The JSON protocol has become very popular in the last few years. It allows a standardized data exchange between different application. Being used a lot in informatics (and also in science), libraries exist for (probably) all programming languages used in science.

Similarities:

  • Data can be arranged in a structure, which one may interpret as a tree with named nodes (in JSON they are actually a combination of lists and dictionaries).

Differences:

  • JSON was not designed to exchange scientific (numeric) data. Its JavaScript implementation has only one numerical data type (double precision float), although other implementations may handle numerical data differently. SAYDX should offer all important numeric data types (e.g. from half-precision up to quadruple precision – given appropriate compiler support)
  • JSON was not designed to exchange array data. Its “array” construct is basically a flat list where each element can be of arbitrary data type. Type information must therefore be stored for each element separately. Shape information (e.g. for a multi-dimensional array) must also be stored separately in the tree.
  • JSON was not designed to exchange large amount of numeric data. It is a text based format, so binary data would have to be converted to text first and then back again to binary format.

MessagePack

MessagePack is a binary protocol to exchange data between applications. Implementations seem to be present in nearly all relevant programming languages, but apparently not Fortran so far. It uses similar concepts to JSON, but is binary based and has a much better handling of numerical data than JSON. It has support for single and double precision float numbers and allows definition of extended types.

Similarities:

  • Data can be arranged in a structure, which one may interpreted as a tree with named nodes (actually a combination of lists and dictionaries).
  • Data is stored in / serialized to a binary format.

Differences:

  • Similar to JSON, MessagePack’s array concept is a flat list containing objects of arbitrary types, with the same disadvantages as above.
  • As a consequence of the additional type information is stored with each element, hence adding a native (Fortran or C) array to the tree always requires copying and reading out the array from the tree into a native array as well. This is may not be a problem if the tree is communicated via sockets, but could raise efficiency problems if the tree is passed via an API between various components of an application.

CSlib

CSlib is a client-server library for interchanging data between applications. It allows for exchanging data via files, sockets (via ZeroMQ) or MPI. It is already part of LAMMPS and should also exist as a separate project under GitHub (apparently it has not been uploaded yet). Its licensing is unclear, probably modified BSD, although some documents mention it as GPL-licensed.

Similarities:

  • CSlib passes data in binary form. It has data types suited for scientific applications.
  • It is possible to transmit multi-dimensional arrays with CSlib.

Differences:

  • CSlibs messages are composed of fields, each field being assigned to have an arbitrary data type with zero or more entries of that type. It does not have the concept of a hierarchical tree. However, with an appropriate wrapper, it could be probably used to transmit a tree.
  • While it is possible to transmit multi-dimensional arrays with CSlib, it seems that the array shape is not transmitted explicitly (only the number of elements). This shape data would therefore have to be communicated as an extra message.
  • CSlib is not designed for passing a tree via an API between parts of an application (e.g. caller passes a tree to a library routine and receives another tree in response), but concentrates on sending it via sockets, file I/O or MPI-messaging.

MxUI

The MxUI library wraps MPI calls for simplifying Multiple-Program Multiple-Data communication. The library provides a C++ header only implementation. It can also interpolate the transmitted data.

Similarities:

  • Data can be arranged in structures, of arbitrary type

Differences:

  • Templated push and fetch operations
  • Processing (interpolation) of data by transmission