Storage Specification

hdmf-zarr currently uses the Zarr DirectoryStore, which uses directories and files on a standard file system to serialize data.

Format Mapping

Here we describe the mapping of HDMF primitives (e.g., Groups, Datasets, Attributes, and Links) used by the HDMF schema language to Zarr storage primitives. HDMF data modeling primitives were originally designed with HDF5 in mind. However, Zarr uses very similar primitives, so the high-level mapping between HDMF schema and Zarr storage is fairly simple. The main complication is that Zarr does not support links and references (see Zarr issue #389), which therefore have to be implemented by hdmf-zarr.

Mapping of primitives
HDMF Primitive	Zarr Primitive
Group	Group
Dataset	Dataset
Attribute	Attribute
Link	Stored as JSON formatted Attributes

Mapping of HDMF specification language keys

Here we describe the mapping of keys from the HDMF specification language to Zarr storage objects:

Groups

Mapping of groups
HDMF Specification Key	Zarr
name	Name of the Group in Zarr
doc	Zarr attribute `doc` on the Zarr group
groups	Zarr groups within the Zarr group
datasets	Zarr datasets within the Zarr group
attributes	Zarr attributes on the Zarr group
links	Stored as JSON formatted attributes on the Zarr Group
quantity	Not mapped; Number of appearances of the group
neurodata_type	Attribute `neurodata_type` on the Zarr Group
namespace ID	Attribute `namespace` on the Zarr Group
object ID	Attribute `object_id` on the Zarr Group

Reserved groups

The ZarrIO backend typically caches the schema used to create a file in the group /specifications (see also Caching format specifications).

Datasets

Mapping of datasets
HDMF Specification Key	Zarr
name	Name of the dataset in Zarr
doc	Zarr attribute `doc` on the Zarr dataset
dtype	Data type of the Zarr dataset (see dtype mappings table) and stored in the `zarr_dtype` reserved attribute
shape	Shape of the Zarr dataset if the shape is fixed, otherwise shape defines the maxshape
dims	Not mapped
attributes	Zarr attributes on the Zarr dataset
quantity	Not mapped; Number of appearances of the dataset
neurodata_type	Attribute `neurodata_type` on the Zarr dataset
namespace ID	Attribute `namespace` on the Zarr dataset
object ID	Attribute `object_id` on the Zarr dataset

Note

TODO Update mapping of dims

Attributes

Mapping of attributes
HDMF Specification Key	Zarr
name	Name of the attribute in Zarr
doc	Not mapped; Stored in schema only
dtype	Data type of the Zarr attribute
shape	Shape of the Zarr attribute if the shape is fixed, otherwise shape defines the maxshape
dims	Not mapped; Reflected by the shape of the attribute data
required	Not mapped; Stored in schema only
value	Data value of the attribute

Note

Attributes are stored as JSON documents in Zarr (using the DirectoryStore). As such, all attributes must be JSON serializable. The ZarrIO backend attempts to cast types (e.g., numpy arrays) to JSON serializable types as much as possible, but not all possible types may be supported.
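As a rough illustration of the kind of casting involved, the sketch below converts a few common non-JSON types before serialization. The helper `to_json_serializable` is hypothetical and is not the function used by ZarrIO. It relies only on the Python standard library, with `array.array` standing in for a numpy array (both expose a `tolist()` method).

```python
import json
from array import array


def to_json_serializable(value):
    """Best-effort cast of common non-JSON types (hypothetical helper)."""
    if isinstance(value, dict):
        return {str(k): to_json_serializable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_serializable(v) for v in value]
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if hasattr(value, "tolist"):
        # Array-like objects (e.g., array.array, numpy arrays) become lists.
        return to_json_serializable(value.tolist())
    return value


attrs = {"shape": (2, 3), "data": array("d", [1.0, 2.5]), "unit": b"mV"}
encoded = json.dumps(to_json_serializable(attrs))
```

Values that fall through every branch are passed to `json.dumps` unchanged, so an unsupported type still raises `TypeError` at serialization time rather than being silently dropped.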

Reserved attributes

The ZarrIO backend defines a set of reserved attribute names in __reserve_attribute. These reserved attributes are used to implement functionality (e.g., links and object references, which are not natively supported by Zarr) and may be added to any Group or Dataset in the file.

Reserved Attribute Name	Usage
zarr_link	Attribute used to store links. See Links for details.
zarr_dtype	Attribute used to specify the data type of a dataset. This is used to implement the storage of object references as part of datasets. See Object References.

In addition, the following reserved attributes are added to the root Group of the file only:

Reserved Attribute Name	Usage
.specloc	Attribute storing the path to the Group where the schema for the file is cached. See SPEC_LOC_ATTR.

Links

Similar to soft links in a file system, a link is an object in a Group that links to another Group or Dataset, either within the same Zarr file or in another, external Zarr file. Links and references are not natively supported by Zarr but are implemented in ZarrIO in an OS-independent fashion using the zarr_link reserved attribute (see __reserve_attribute) to store a list of dicts serialized as JSON. Each dict (i.e., element) in the list defines a link and contains the following keys:

name : Name of the link
source : Relative path to the root of the Zarr file containing the linked object. For links pointing to an object within the same Zarr file, the value of source will be ".". For external links that point to an object in another Zarr file, the value of source will be the path to the other Zarr file relative to the root path of the Zarr file containing the link.
path : Path to the linked object within the Zarr file identified by the source key
object_id: Object id of the referenced object. May be None if the referenced object does not have an assigned object_id (e.g., when referencing a dataset with a fixed name but without an assigned data_type, or neurodata_type in the case of NWB).
source_object_id: Object id of the source Zarr file indicated by the source key. The source should always have an object_id (at least if the source file is a valid HDMF formatted file).

For example:

"zarr_link": [
    {
        "name": "device",
        "source": ".",
        "path": "/general/devices/array",
        "object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
        "source_object_id": "6224bb89-578a-4839-b31c-83f11009292c"
    }
]
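To make the role of the source and path keys concrete, the following sketch resolves a link dict to the Zarr file and in-file path it points to. The function `resolve_link` and the file locations are hypothetical and only illustrate the rules described above; they are not the ZarrIO implementation. POSIX-style paths are assumed.

```python
import posixpath


def resolve_link(link, containing_file):
    """Return (target_file, path_in_target) for a zarr_link entry.

    `containing_file` is the root path of the Zarr file that holds the link.
    """
    # "source" is relative to the root of the file containing the link;
    # "." therefore resolves to the containing file itself.
    target_file = posixpath.normpath(posixpath.join(containing_file, link["source"]))
    return target_file, link["path"]


internal = {"name": "device", "source": ".", "path": "/general/devices/array"}
external = {"name": "device", "source": "../other.zarr", "path": "/general/devices/array"}

internal_target = resolve_link(internal, "/data/session.zarr")
external_target = resolve_link(external, "/data/session.zarr")
```

Because source is stored relative to the containing file, moving both Zarr files together keeps external links valid.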

Mapping of links
HDMF Specification Key	Zarr
name	Name of the link
doc	Not mapped; Stored in schema only
target_type	Not mapped. The target type is determined by the type of the target of the link

Hint

In Zarr, attributes are stored in JSON as part of the hidden .zattrs file in the folder defining the Group or Dataset.
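Since the attributes live in a plain JSON file, links can be inspected without any Zarr tooling. The sketch below writes a hand-made .zattrs file into a temporary folder that stands in for a group directory and reads the zarr_link entries back. The helper `read_links` and the folder layout are assumptions made for illustration only.

```python
import json
import os
import tempfile


def read_links(group_dir):
    """Return the zarr_link entries stored in a group's .zattrs file."""
    attrs_path = os.path.join(group_dir, ".zattrs")
    if not os.path.exists(attrs_path):
        return []
    with open(attrs_path) as f:
        attrs = json.load(f)
    return attrs.get("zarr_link", [])


with tempfile.TemporaryDirectory() as store:
    # Hypothetical group folder inside a DirectoryStore-like layout.
    group_dir = os.path.join(store, "acquisition", "series")
    os.makedirs(group_dir)
    attrs = {
        "zarr_link": [
            {"name": "device", "source": ".", "path": "/general/devices/array"}
        ]
    }
    with open(os.path.join(group_dir, ".zattrs"), "w") as f:
        json.dump(attrs, f, indent=4)

    links = read_links(group_dir)
    link_names = [link["name"] for link in links]
```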

Hint

In ZarrIO, links are written by the __write_link__() function, which also uses the helper functions i) _create_ref() to construct a ZarrReference (hdmf_zarr.utils.ZarrReference) and ii) __add_link__() to add a link to the Zarr file. __read_links() then parses links and uses the __resolve_ref() helper function to resolve the paths stored in links.

Object References

Object references behave much the same way as Links, with the key difference that they are stored as part of datasets or attributes. This approach allows for storage of large collections of references as values of multi-dimensional arrays (i.e., the data type of the array is a reference type).

Storing object references in Datasets

To identify that a dataset contains object references, the reserved attribute zarr_dtype is set to 'object' (see also Reserved attributes). In this way, we can unambiguously determine whether a dataset stores references that need to be resolved.

Similar to Links, object references are defined via dicts, which are stored as elements of the Dataset. In contrast to links, individual object references do not have a name but are identified by their location (i.e., index) in the dataset. As such, object references only have the source with the relative path to the target Zarr file, and the path identifying the object within the source Zarr file. The individual object references are defined in ZarrIO as ZarrReference (hdmf_zarr.utils.ZarrReference) objects created via the _create_ref() helper function.

By default, ZarrIO uses the numcodecs.pickles.Pickle codec to encode object references defined as ZarrReference (hdmf_zarr.utils.ZarrReference) dicts in datasets. Users may set the codec used to encode objects in Zarr datasets via the object_codec_class parameter of the __init__() constructor of ZarrIO. For example, we could use ZarrIO( ... , object_codec_class=numcodecs.JSON) to serialize objects using JSON.

Storing object references in Attributes

Object references are stored in attributes as dicts with the following keys:

zarr_dtype : Indicating the data type for the attribute. For object references zarr_dtype is set to "object"
value: The value of the object reference, i.e., here the ZarrReference (hdmf_zarr.utils.ZarrReference) dictionary with the source, path, object_id, and source_object_id keys defining the object reference, with the definition of the keys being the same as for Links.

For example, in NWB, the attribute ElectricalSeries.electrodes.table would be defined as follows:

"table": {
    "value": {
        "path": "/general/extracellular_ephys/electrodes",
        "source": ".",
        "object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
        "source_object_id": "6224bb89-578a-4839-b31c-83f11009292c"
    },
    "zarr_dtype": "object"
}
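A reader of the raw attributes can tell object references apart from ordinary values by checking for this structure. The sketch below is one possible check under the layout described above. The helper names `is_object_reference` and `find_object_references` are hypothetical and not part of hdmf-zarr.

```python
def is_object_reference(attr_value):
    """True if an attribute value follows the object-reference layout."""
    return (
        isinstance(attr_value, dict)
        and attr_value.get("zarr_dtype") == "object"
        and isinstance(attr_value.get("value"), dict)
    )


def find_object_references(attrs):
    """Map attribute name -> reference dict for all reference attributes."""
    return {
        name: value["value"]
        for name, value in attrs.items()
        if is_object_reference(value)
    }


attrs = {
    "description": "electrodes used for this series",
    "table": {
        "value": {
            "path": "/general/extracellular_ephys/electrodes",
            "source": ".",
        },
        "zarr_dtype": "object",
    },
}
refs = find_object_references(attrs)
```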

dtype mappings

The mapping of data types is as follows:

dtype spec value	storage type	size
"float", "float32"	single precision floating point	32 bit
"double", "float64"	double precision floating point	64 bit
"long", "int64"	signed 64 bit integer	64 bit
"int", "int32"	signed 32 bit integer	32 bit
"int16"	signed 16 bit integer	16 bit
"int8"	signed 8 bit integer	8 bit
"uint32"	unsigned 32 bit integer	32 bit
"uint16"	unsigned 16 bit integer	16 bit
"uint8"	unsigned 8 bit integer	8 bit
"bool"	boolean	8 bit
"text", "utf", "utf8", "utf-8"	unicode	variable
"ascii", "str"	ascii	variable
"ref", "reference", "object"	Reference to another group or dataset. See Object References	
compound dtype	Compound data type. Stored in zarr_dtype as a list of dicts with "name" and "dtype" keys (see example below).	
"isodatetime"	ASCII ISO 8601 datetime string. For example 2018-09-28T14:43:54.123+02:00	variable
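The table groups several spec values under one storage type, which amounts to an alias lookup. A minimal sketch of such a lookup is shown below. The dictionary, the function `storage_dtype`, and the labels chosen for the string, reference, and datetime rows are illustrative assumptions, not the tables used inside hdmf-zarr.

```python
# Storage label -> spec values that map to it (taken from the table above).
_DTYPE_GROUPS = {
    "float32": ("float", "float32"),
    "float64": ("double", "float64"),
    "int64": ("long", "int64"),
    "int32": ("int", "int32"),
    "int16": ("int16",),
    "int8": ("int8",),
    "uint32": ("uint32",),
    "uint16": ("uint16",),
    "uint8": ("uint8",),
    "bool": ("bool",),
    "unicode": ("text", "utf", "utf8", "utf-8"),
    "ascii": ("ascii", "str"),
    "object": ("ref", "reference", "object"),
    "isodatetime": ("isodatetime",),
}

# Invert the groups into a flat alias -> storage label lookup.
SPEC_TO_STORAGE = {
    alias: storage
    for storage, aliases in _DTYPE_GROUPS.items()
    for alias in aliases
}


def storage_dtype(spec_value):
    """Return the storage label for a dtype spec value."""
    try:
        return SPEC_TO_STORAGE[spec_value]
    except KeyError:
        raise ValueError(f"Unknown dtype spec value: {spec_value!r}") from None
```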

Note

For compound dtypes, the zarr_dtype attribute is stored as a list of dictionaries, where each dictionary describes a field in the compound type. For example:

"zarr_dtype": [
    {
        "dtype": "uint32",
        "name": "x"
    },
    {
        "dtype": "uint32",
        "name": "y"
    },
    {
        "dtype": "float32",
        "name": "weight"
    }
]
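To show how such a field list can be consumed, the sketch below turns the zarr_dtype entries into (name, dtype) tuples, which is the form numpy.dtype accepts for structured types, and then derives a fixed-size record layout with the standard-library struct module. The helper names and the struct-based layout are assumptions for illustration; they say nothing about how the data are actually laid out on disk.

```python
import struct

# struct format codes for the fixed-size numeric storage types.
STRUCT_CODES = {
    "int8": "b", "uint8": "B",
    "int16": "h", "uint16": "H",
    "int32": "i", "uint32": "I",
    "int64": "q",
    "float32": "f", "float64": "d",
}


def fields_from_zarr_dtype(zarr_dtype):
    """Convert the list of dicts into ordered (name, dtype) tuples."""
    return [(field["name"], field["dtype"]) for field in zarr_dtype]


def struct_format(zarr_dtype):
    """Build a little-endian, unpadded struct format for one record."""
    codes = (STRUCT_CODES[dtype] for _name, dtype in fields_from_zarr_dtype(zarr_dtype))
    return "<" + "".join(codes)


zarr_dtype = [
    {"dtype": "uint32", "name": "x"},
    {"dtype": "uint32", "name": "y"},
    {"dtype": "float32", "name": "weight"},
]

fields = fields_from_zarr_dtype(zarr_dtype)
fmt = struct_format(zarr_dtype)
record_size = struct.calcsize(fmt)  # bytes per record
record = struct.unpack(fmt, struct.pack(fmt, 3, 7, 0.5))
```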

Caching format specifications

In practice it is useful to cache the specification a file was created with (including extensions) directly in the Zarr file. Caching the specification in the file ensures that users can access the specification directly if necessary without requiring external resources. For the Zarr backend, caching of the schema is implemented as follows.

The ZarrIO backend adds the reserved top-level group /specifications in which all format specifications (including extensions) are cached. The default name for this group is defined in DEFAULT_SPEC_LOC_DIR and caching of specifications is implemented in ZarrIO.__cache_spec. For each specification namespace, the /specifications group contains a subgroup /specifications/<namespace-name>/<version> in which the specification for a particular version of the namespace is stored (e.g., /specifications/core/2.0.1 in the case of the NWB core namespace at version 2.0.1). The actual specification data is then stored as a JSON string in scalar datasets with a binary, variable-length string data type. The specification of the namespace is stored in /specifications/<namespace-name>/<version>/namespace while additional source files are stored in /specifications/<namespace-name>/<version>/<source-filename>. Here <source-filename> refers to the main name of the source file without file extension (e.g., the core namespace defines nwb.ecephys.yaml as source, which would be stored in /specifications/core/2.0.1/nwb.ecephys).
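The naming scheme above can be summarized as a small path-building routine. The function `spec_cache_paths` below is a hypothetical helper that only reproduces the layout described in this section; it does not call into hdmf-zarr, and the default group name is taken from the text rather than read from DEFAULT_SPEC_LOC_DIR.

```python
import posixpath


def spec_cache_paths(namespace, version, source_files, spec_group="/specifications"):
    """Return the dataset paths used to cache one namespace version."""
    base = posixpath.join(spec_group, namespace, version)
    paths = {"namespace": posixpath.join(base, "namespace")}
    for fname in source_files:
        # Only the final extension is dropped: nwb.ecephys.yaml -> nwb.ecephys
        stem, _ext = posixpath.splitext(fname)
        paths[fname] = posixpath.join(base, stem)
    return paths


paths = spec_cache_paths("core", "2.0.1", ["nwb.ecephys.yaml"])
```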

Consolidating Metadata

Zarr allows users to consolidate all metadata for groups and arrays within a given store. By default, every file consolidates all metadata into a single .zmetadata file, stored in the root group. This reduces the number of read operations when retrieving certain metadata in read mode.

Note

When updating a file, the consolidated metadata will also need to be updated via zarr.consolidate_metadata(path) to ensure the consolidated metadata is consistent with the file.
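To see why this matters, the sketch below checks whether a .zmetadata file still matches the per-group metadata files on disk. It assumes the Zarr v2 directory layout (.zgroup, .zarray, and .zattrs files, with .zmetadata holding them under a "metadata" key) and uses a hand-written temporary folder in place of a real file. The helper names are hypothetical, and the check is a diagnostic illustration rather than a replacement for zarr.consolidate_metadata(path).

```python
import json
import os
import tempfile


def collect_metadata(root):
    """Gather all metadata documents keyed by store-relative path."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            if fname in (".zgroup", ".zarray", ".zattrs"):
                full = os.path.join(dirpath, fname)
                key = os.path.relpath(full, root).replace(os.sep, "/")
                with open(full) as f:
                    found[key] = json.load(f)
    return found


def consolidated_is_current(root):
    """True if .zmetadata matches the individual metadata files."""
    with open(os.path.join(root, ".zmetadata")) as f:
        consolidated = json.load(f)["metadata"]
    return consolidated == collect_metadata(root)


def write_json(path, doc):
    with open(path, "w") as f:
        json.dump(doc, f)


with tempfile.TemporaryDirectory() as root:
    write_json(os.path.join(root, ".zgroup"), {"zarr_format": 2})
    write_json(os.path.join(root, ".zattrs"), {"doc": "root group"})
    write_json(
        os.path.join(root, ".zmetadata"),
        {"zarr_consolidated_format": 1, "metadata": collect_metadata(root)},
    )
    fresh = consolidated_is_current(root)

    # Editing an attribute without re-consolidating leaves .zmetadata stale.
    write_json(os.path.join(root, ".zattrs"), {"doc": "edited"})
    stale = not consolidated_is_current(root)
```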