Storage Specification

hdmf-zarr currently uses the Zarr DirectoryStore, which uses directories and files on a standard file system to serialize data.

Format Mapping

Here we describe the mapping of HDMF primitives (e.g., Groups, Datasets, Attributes, and Links) used by the HDMF schema language to Zarr storage primitives. HDMF data modeling primitives were originally designed with HDF5 in mind. However, Zarr uses very similar primitives, so the high-level mapping between the HDMF schema and Zarr storage is fairly simple. The main complication is that Zarr does not support links and references (see Zarr issue #389), so these have to be implemented by hdmf-zarr.

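Mapping of primitives

==============  ===================================
HDMF Primitive  Zarr Primitive
==============  ===================================
Group           Group
Dataset         Dataset
Attribute       Attribute
Link            Stored as JSON formatted Attributes
==============  ===================================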

Mapping of HDMF specification language keys

Here we describe the mapping of keys from the HDMF specification language to Zarr storage objects:

Groups

Mapping of groups

======================  ========================================================
HDMF Specification Key  Zarr
======================  ========================================================
name                    Name of the Group in Zarr
doc                     Zarr attribute doc on the Zarr group
groups                  Zarr groups within the Zarr group
datasets                Zarr datasets within the Zarr group
attributes              Zarr attributes on the Zarr group
links                   Stored as JSON formatted attributes on the Zarr Group
linkable                Not mapped; Stored in schema only
quantity                Not mapped; Number of appearances of the group
neurodata_type          Attribute neurodata_type on the Zarr Group
namespace ID            Attribute namespace on the Zarr Group
object ID               Attribute object_id on the Zarr Group
======================  ========================================================
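As a rough illustration of the group mapping, the sketch below assembles the attributes that would accompany one group and checks that they are JSON serializable. It uses only the standard library and does not call hdmf-zarr or Zarr; the helper name build_group_attributes and the example values are hypothetical.

```python
import json
import uuid


def build_group_attributes(doc, neurodata_type, namespace, extra=None):
    """Assemble the attribute dict for one group (hypothetical helper)."""
    attrs = {
        "doc": doc,
        "neurodata_type": neurodata_type,
        "namespace": namespace,
        "object_id": str(uuid.uuid4()),  # freshly generated example ID
    }
    attrs.update(extra or {})
    json.dumps(attrs)  # raises TypeError if a value is not JSON serializable
    return attrs


# Example values are made up for illustration.
group_attrs = build_group_attributes(
    doc="Example group",
    neurodata_type="ElectricalSeries",
    namespace="core",
)
print(json.dumps(group_attrs, indent=4))
```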

Reserved groups

The ZarrIO backend typically caches the schema used to create a file in the group /specifications (see also Caching format specifications).

Datasets

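Mapping of datasets

======================  ==============================================================================================================
HDMF Specification Key  Zarr
======================  ==============================================================================================================
name                    Name of the dataset in Zarr
doc                     Zarr attribute doc on the Zarr dataset
dtype                   Data type of the Zarr dataset (see dtype mappings table); also stored in the zarr_dtype reserved attribute
shape                   Shape of the Zarr dataset if the shape is fixed, otherwise shape defines the maxshape
dims                    Not mapped
attributes              Zarr attributes on the Zarr dataset
linkable                Not mapped; Stored in schema only
quantity                Not mapped; Number of appearances of the dataset
neurodata_type          Attribute neurodata_type on the Zarr dataset
namespace ID            Attribute namespace on the Zarr dataset
object ID               Attribute object_id on the Zarr dataset
======================  ==============================================================================================================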

Note

  • TODO Update mapping of dims

Attributes

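Mapping of attributes

======================  ==========================================================================================
HDMF Specification Key  Zarr
======================  ==========================================================================================
name                    Name of the attribute in Zarr
doc                     Not mapped; Stored in schema only
dtype                   Data type of the Zarr attribute
shape                   Shape of the Zarr attribute if the shape is fixed, otherwise shape defines the maxshape
dims                    Not mapped; Reflected by the shape of the attribute data
required                Not mapped; Stored in schema only
value                   Data value of the attribute
======================  ==========================================================================================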

Note

Attributes are stored as JSON documents in Zarr (using the DirectoryStore). As such, all attributes must be JSON serializable. The ZarrIO backend attempts to cast types (e.g., numpy arrays) to JSON serializable types as much as possible, but not all possible types may be supported.
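The following is a minimal sketch of the kind of type casting described in this note, not the actual ZarrIO logic. It uses a fallback hook for json.dumps; the function name to_json_compatible is hypothetical, and array.array stands in for a numpy array (both expose tolist()) so that the example needs only the standard library.

```python
import json
from array import array


def to_json_compatible(value):
    """Fallback for json.dumps: cast some non-JSON types (sketch only)."""
    if hasattr(value, "tolist"):  # e.g., numpy arrays; array.array here
        return value.tolist()
    if isinstance(value, (set, frozenset)):
        return list(value)
    if isinstance(value, bytes):
        return value.decode("utf-8")
    # Anything else is reported as unsupported.
    raise TypeError(f"Cannot serialize {type(value).__name__} to JSON")


attributes = {"rate": array("d", [1.0, 2.5]), "unit": b"volts"}
encoded = json.dumps(attributes, default=to_json_compatible)
print(encoded)
```

Types that the hook does not recognize raise a TypeError, which mirrors the caveat that not all possible types may be supported.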

Reserved attributes

The ZarrIO backend defines a set of reserved attribute names in __reserve_attribute. These reserved attributes are used to implement functionality (e.g., links and object references, which are not natively supported by Zarr) and may be added on any Group or Dataset in the file.

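=======================  ========================================================================
Reserved Attribute Name  Usage
=======================  ========================================================================
zarr_link                Attribute used to store links. See Links for details.
zarr_dtype               Attribute used to specify the data type of a dataset. This is used to
                         implement the storage of object references as part of datasets. See
                         Object References.
=======================  ========================================================================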

In addition, the following reserved attributes are added to the root Group of the file only:

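=======================  ========================================================================
Reserved Attribute Name  Usage
=======================  ========================================================================
.specloc                 Attribute storing the path to the Group where the schema for the file
                         is cached. See SPEC_LOC_ATTR.
=======================  ========================================================================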
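To make the role of the reserved names concrete, here is a small sketch that separates reserved attributes from ordinary ones. The constants and the split_attributes helper are hypothetical and only mirror the names listed above; they are not taken from the ZarrIO source, and the example attribute values are made up.

```python
# Names taken from the tables above; the constants themselves are hypothetical.
RESERVED_ATTRIBUTES = ("zarr_link", "zarr_dtype")
ROOT_ONLY_RESERVED_ATTRIBUTES = (".specloc",)


def split_attributes(attrs, is_root=False):
    """Return (reserved, user) attribute dicts for a group or dataset."""
    reserved_names = set(RESERVED_ATTRIBUTES)
    if is_root:
        reserved_names.update(ROOT_ONLY_RESERVED_ATTRIBUTES)
    reserved = {k: v for k, v in attrs.items() if k in reserved_names}
    user = {k: v for k, v in attrs.items() if k not in reserved_names}
    return reserved, user


# Made-up attribute values for a root group.
root_attrs = {".specloc": "specifications", "zarr_link": [], "description": "demo"}
reserved, user = split_attributes(root_attrs, is_root=True)
print(sorted(reserved), sorted(user))
```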

Object References

Object references behave much the same way as Links, with the key difference that they are stored as part of datasets or attributes. This approach allows for storage of large collections of references as values of multi-dimensional arrays (i.e., the data type of the array is a reference type).

Storing object references in Datasets

To identify that a dataset contains object references, the reserved attribute zarr_dtype is set to 'object' (see also Reserved attributes). In this way, we can unambiguously determine whether a dataset stores references that need to be resolved.

Similar to Links, object references are defined via dicts, which are stored as elements of the Dataset. In contrast to links, individual object references do not have a name but are identified by their location (i.e., index) in the dataset. As such, object references only have the source, with the relative path to the target Zarr file, and the path identifying the object within the source Zarr file. The individual object references are defined in ZarrIO as hdmf_zarr.utils.ZarrReference objects created via the __get_ref() helper function.

By default, ZarrIO uses the numcodecs.pickles.Pickle codec to encode object references defined as hdmf_zarr.utils.ZarrReference dicts in datasets. Users may set the codec used to encode objects in Zarr datasets via the object_codec_class parameter of the __init__() constructor of ZarrIO. For example, we could use ZarrIO( ... , object_codec_class=numcodecs.JSON) to serialize objects using JSON.
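The idea of storing references as dict elements and encoding them with a pluggable codec can be sketched with the standard library alone. In the example below, plain dicts stand in for ZarrReference objects, and the stdlib pickle and json modules stand in for the numcodecs codecs; make_reference and the second target path are hypothetical.

```python
import json
import pickle


def make_reference(source, path):
    """Build a reference dict (hypothetical stand-in for ZarrReference)."""
    return {"source": source, "path": path}


# Elements of a reference dataset have no names, only positions.
reference_dataset = [
    make_reference(".", "/general/extracellular_ephys/electrodes"),
    make_reference(".", "/acquisition/example_series"),  # made-up path
]

# Round trip through pickle (stand-in for the default Pickle codec).
from_pickle = pickle.loads(pickle.dumps(reference_dataset))

# Round trip through JSON (stand-in for object_codec_class=numcodecs.JSON).
json_bytes = json.dumps(reference_dataset).encode("utf-8")
from_json = json.loads(json_bytes.decode("utf-8"))

# A reference is looked up by its index in the dataset.
print(from_json[1]["path"])
```

Both encodings recover the same dicts here; one practical difference is that the JSON form is human-readable, whereas the pickle form is specific to Python.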

Storing object references in Attributes

Object references are stored in attributes as dicts with the following keys:

  • zarr_dtype: Indicates the data type of the attribute. For object references, zarr_dtype is set to "object" (or "region" for Region references).

  • value: The value of the object reference, i.e., the hdmf_zarr.utils.ZarrReference dictionary with the source, path, object_id, and source_object_id keys defining the object reference. The keys are defined in the same way as for Links.

For example in NWB, the attribute ElectricalSeries.electrodes.table would be defined as follows:

"table": {
    "value": {
        "path": "/general/extracellular_ephys/electrodes",
        "source": ".",
        "object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
        "source_object_id": "6224bb89-578a-4839-b31c-83f11009292c"
    },
    "zarr_dtype": "object"
}
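As a sketch of how a reader might detect and unpack such an attribute, the snippet below parses the JSON shown above and extracts the source and path. The function name parse_object_reference is hypothetical; the actual resolution logic in ZarrIO is not shown here.

```python
import json

# The attribute from the example above, wrapped in braces to form valid JSON.
attributes_json = """
{
    "table": {
        "value": {
            "path": "/general/extracellular_ephys/electrodes",
            "source": ".",
            "object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
            "source_object_id": "6224bb89-578a-4839-b31c-83f11009292c"
        },
        "zarr_dtype": "object"
    }
}
"""


def parse_object_reference(attr_value):
    """Return (source, path) for an object reference, else None."""
    if isinstance(attr_value, dict) and attr_value.get("zarr_dtype") == "object":
        ref = attr_value["value"]
        return ref["source"], ref["path"]
    return None


attributes = json.loads(attributes_json)
table_ref = parse_object_reference(attributes["table"])
print(table_ref)
```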

Region references

Region references are similar to object references, but instead of referencing other Datasets or Groups, region references link to subsets of another Dataset. To identify region references, the reserved attribute zarr_dtype is set to 'region' (see also Reserved attributes). In addition to the source and path, the hdmf_zarr.utils.ZarrReference object will also need to store the definition of the region that is being referenced, e.g., a slice or a list of point indices.

Warning

Region references are not yet fully implemented in ZarrIO. Implementing region references will require updating:

  • hdmf_zarr.utils.ZarrReference, to add a region key that stores the selection for the region

  • __get_ref(), to support passing in the region definition to be added to the ZarrReference

  • write_dataset(), which already partially implements the required logic for creating region references by checking for hdmf.build.RegionBuilder inputs but will likely need updates as well

  • __read_dataset(), to support reading region references, which may also require updates to __parse_ref() and __resolve_ref()

  • possibly other parts of ZarrIO

  • the hdmf_zarr.zarr_utils.ContainerZarrRegionDataset class, which will also need to be finalized to support region references
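Since region references are not yet implemented, the following is purely a hypothetical sketch of how a region key could carry a slice or a list of point indices in JSON-serializable form. None of the helper names or the encoding of the region (the "kind", "start", "stop", "step", and "indices" keys) come from hdmf-zarr; they are assumptions made for illustration.

```python
import json


def encode_region(region):
    """Turn a slice or a list of point indices into a JSON-friendly dict."""
    if isinstance(region, slice):
        return {"kind": "slice", "start": region.start,
                "stop": region.stop, "step": region.step}
    return {"kind": "points", "indices": [int(i) for i in region]}


def decode_region(encoded):
    """Inverse of encode_region."""
    if encoded["kind"] == "slice":
        return slice(encoded["start"], encoded["stop"], encoded["step"])
    return list(encoded["indices"])


def make_region_reference(source, path, region):
    """Build an attribute-style region reference (assumed layout)."""
    return {
        "zarr_dtype": "region",
        "value": {"source": source, "path": path, "region": encode_region(region)},
    }


def select(data, region):
    """Apply a decoded region to a one-dimensional sequence."""
    if isinstance(region, slice):
        return data[region]
    return [data[i] for i in region]


ref = make_region_reference(".", "/general/extracellular_ephys/electrodes", slice(2, 5))
restored = json.loads(json.dumps(ref))  # survives a JSON round trip
target = list(range(10))  # stand-in for the referenced dataset
selection = select(target, decode_region(restored["value"]["region"]))
print(selection)
```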

dtype mappings

The data types are mapped as follows:

================================  =========================================  ========
dtype spec value                  storage type                               size
================================  =========================================  ========
"float", "float32"                single precision floating point            32 bit
"double", "float64"               double precision floating point            64 bit
"long", "int64"                   signed 64 bit integer                      64 bit
"int", "int32"                    signed 32 bit integer                      32 bit
"int16"                           signed 16 bit integer                      16 bit
"int8"                            signed 8 bit integer                       8 bit
"uint32"                          unsigned 32 bit integer                    32 bit
"uint16"                          unsigned 16 bit integer                    16 bit
"uint8"                           unsigned 8 bit integer                     8 bit
"bool"                            boolean                                    8 bit
"text", "utf", "utf8", "utf-8"    unicode                                    variable
"ascii", "str"                    ascii                                      variable
"ref", "reference", "object"      Reference to another group or dataset.
                                  See Object References
"region"                          Reference to a region of another dataset.
                                  See Region references
compound dtype                    Compound data type
"isodatetime"                     ASCII ISO 8601 datetime string.            variable
                                  For example 2018-09-28T14:43:54.123+02:00
================================  =========================================  ========
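The table can be read as an alias lookup: several spec values resolve to the same storage type. The sketch below encodes a subset of the rows as plain Python data and resolves an alias to its storage type and size in bits (None where the size is variable). The names DTYPE_ROWS and describe_dtype are hypothetical and are not part of hdmf-zarr. The last lines parse the isodatetime example string with the standard library.

```python
from datetime import datetime, timedelta

# (aliases, storage type, size in bits or None if variable); rows from the table.
DTYPE_ROWS = [
    (("float", "float32"), "single precision floating point", 32),
    (("double", "float64"), "double precision floating point", 64),
    (("long", "int64"), "signed 64 bit integer", 64),
    (("int", "int32"), "signed 32 bit integer", 32),
    (("int16",), "signed 16 bit integer", 16),
    (("int8",), "signed 8 bit integer", 8),
    (("uint32",), "unsigned 32 bit integer", 32),
    (("uint16",), "unsigned 16 bit integer", 16),
    (("uint8",), "unsigned 8 bit integer", 8),
    (("bool",), "boolean", 8),
    (("text", "utf", "utf8", "utf-8"), "unicode", None),
    (("ascii", "str"), "ascii", None),
]

# Flatten the rows so that every alias maps to (storage type, bits).
DTYPE_LOOKUP = {
    alias: (storage, bits)
    for aliases, storage, bits in DTYPE_ROWS
    for alias in aliases
}


def describe_dtype(spec_value):
    """Return (storage type, bits) for a dtype spec value."""
    try:
        return DTYPE_LOOKUP[spec_value]
    except KeyError:
        raise ValueError(f"Unknown dtype spec value: {spec_value!r}") from None


print(describe_dtype("double"))

# The isodatetime example from the table parses with the standard library.
timestamp = datetime.fromisoformat("2018-09-28T14:43:54.123+02:00")
print(timestamp.utcoffset())
```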

Caching format specifications

In practice it is useful to cache the specification a file was created with (including extensions) directly in the Zarr file. Caching the specification in the file ensures that users can access the specification directly if necessary without requiring external resources. For the Zarr backend, caching of the schema is implemented as follows.

The ZarrIO backend adds the reserved top-level group /specifications, in which all format specifications (including extensions) are cached. The default name for this group is defined in DEFAULT_SPEC_LOC_DIR, and caching of specifications is implemented in ZarrIO.__cache_spec. For each specification namespace, the /specifications group contains a subgroup /specifications/<namespace-name>/<version> in which the specification for a particular version of the namespace is stored (e.g., /specifications/core/2.0.1 in the case of the NWB core namespace at version 2.0.1). The actual specification data is then stored as a JSON string in scalar datasets with a binary, variable-length string data type. The specification of the namespace is stored in /specifications/<namespace-name>/<version>/namespace, while additional source files are stored in /specifications/<namespace-name>/<version>/<source-filename>. Here <source-filename> refers to the main name of the source file without the file extension (e.g., the core namespace defines nwb.ecephys.yaml as a source, which would be stored in /specifications/core/2.0.1/nwb.ecephys).
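The path layout described above can be sketched as a small helper. This is an illustration only: the function spec_cache_path is hypothetical, and the value "specifications" for DEFAULT_SPEC_LOC_DIR is assumed from the /specifications group name rather than read from the hdmf-zarr source.

```python
import posixpath

DEFAULT_SPEC_LOC_DIR = "specifications"  # assumed value, see lead-in


def spec_cache_path(namespace, version, source_filename=None,
                    spec_loc=DEFAULT_SPEC_LOC_DIR):
    """Return the path of a cached specification dataset.

    Without a source filename, the path of the namespace dataset is returned.
    Otherwise the file extension is dropped from the source filename.
    """
    if source_filename is None:
        name = "namespace"
    else:
        name = posixpath.splitext(source_filename)[0]
    return posixpath.join("/", spec_loc, namespace, version, name)


print(spec_cache_path("core", "2.0.1"))
print(spec_cache_path("core", "2.0.1", "nwb.ecephys.yaml"))
```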