Storage Specification
hdmf-zarr currently uses the Zarr DirectoryStore, which uses directories and files on a standard file system to serialize data.
Format Mapping
Here we describe the mapping of HDMF primitives (e.g., Groups, Datasets, Attributes, Links, etc.) used by the HDMF schema language to Zarr storage primitives. HDMF data modeling primitives were originally designed with HDF5 in mind. However, Zarr uses very similar primitives, and as such the high-level mapping between HDMF schema and Zarr storage is overall fairly simple. The main complication is that Zarr does not natively support links and references (see Zarr issue #389), so these have to be implemented by hdmf-zarr.
| NWB Primitive | Zarr Primitive |
|---|---|
| Group | Group |
| Dataset | Dataset |
| Attribute | Attribute |
| Link | Stored as JSON formatted Attributes |
Mapping of HDMF specification language keys
Here we describe the mapping of keys from the HDMF specification language to Zarr storage objects:
Groups
| NWB Key | Zarr |
|---|---|
| name | Name of the Group in Zarr |
| doc | Zarr attribute |
| groups | Zarr groups within the Zarr group |
| datasets | Zarr datasets within the Zarr group |
| attributes | Zarr attributes on the Zarr group |
| links | Stored as JSON formatted attributes on the Zarr Group |
| quantity | Not mapped; Number of appearances of the group |
| neurodata_type | Attribute |
| namespace ID | Attribute |
| object ID | Attribute |
Reserved groups
The ZarrIO backend typically caches the schema used to create a file in the
group /specifications (see also Caching format specifications).
Datasets
| HDMF Specification Key | Zarr |
|---|---|
| name | Name of the dataset in Zarr |
| doc | Zarr attribute |
| dtype | Data type of the Zarr dataset (see dtype mappings table) and stored in the zarr_dtype reserved attribute |
| shape | Shape of the Zarr dataset if the shape is fixed, otherwise shape defines the maxshape |
| dims | Not mapped |
| attributes | Zarr attributes on the Zarr dataset |
| quantity | Not mapped; Number of appearances of the dataset |
| neurodata_type | Attribute |
| namespace ID | Attribute |
| object ID | Attribute |
Note
TODO Update mapping of dims
Attributes
| HDMF Specification Key | Zarr |
|---|---|
| name | Name of the attribute in Zarr |
| doc | Not mapped; Stored in schema only |
| dtype | Data type of the Zarr attribute |
| shape | Shape of the Zarr attribute if the shape is fixed, otherwise shape defines the maxshape |
| dims | Not mapped; Reflected by the shape of the attribute data |
| required | Not mapped; Stored in schema only |
| value | Data value of the attribute |
Note
Attributes are stored as JSON documents in Zarr (using the DirectoryStore). As such, all attributes
must be JSON serializable. The ZarrIO backend attempts to cast types
(e.g., numpy arrays) to JSON serializable types as much as possible, but not all possible types may
be supported.
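The kind of casting described in the note above can be sketched with the standard library alone. This is a minimal illustration, not the logic ZarrIO actually uses: the helper name `to_json_safe` is hypothetical, and `array.array` stands in for a numpy array here (both expose a `tolist()` method).

```python
import json
from array import array


def to_json_safe(value):
    """Fallback hook for json.dumps (hypothetical helper, not part of ZarrIO)."""
    # Array-like objects (array.array here; numpy arrays also expose tolist())
    # are converted to plain Python lists.
    if hasattr(value, "tolist"):
        return value.tolist()
    # Byte strings are decoded to text.
    if isinstance(value, bytes):
        return value.decode("utf-8")
    # Anything else is rejected, mirroring "not all possible types may be supported".
    raise TypeError(f"Cannot store {type(value).__name__} as a JSON attribute")


attrs = {"rate": array("d", [1.0, 2.5]), "unit": b"Hz"}
encoded = json.dumps(attrs, default=to_json_safe)
print(encoded)
```

Types that the hook does not recognize raise a `TypeError` instead of being written in a lossy form.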
Reserved attributes
The ZarrIO backend uses a set of reserved attribute names, which are defined in
__reserve_attribute. These reserved attributes are used to implement
functionality (e.g., links and object references, which are not natively supported by Zarr) and may be
added on any Group or Dataset in the file.
| Reserved Attribute Name | Usage |
|---|---|
| zarr_link | Attribute used to store links. See Links for details. |
| zarr_dtype | Attribute used to specify the data type of a dataset. This is used to implement the storage of object references as part of datasets. See Object References. |
In addition, the following reserved attributes are added to the root Group of the file only:
| Reserved Attribute Name | Usage |
|---|---|
| .specloc | Attribute storing the path to the Group where the schema for the file is cached. See SPEC_LOC_ATTR |
Links
Similar to soft links in a file system, a link is an object in a Group that links to another Group or Dataset,
either within the same Zarr file or another external Zarr file. Links and references are not natively supported by
Zarr but are implemented in ZarrIO in an OS independent fashion using the zarr_link
reserved attribute (see __reserve_attribute) to store a list of dicts serialized
as JSON. Each dict (i.e., element) in the list defines a link, with each dict containing the following keys:
- name: Name of the link
- source: Relative path to the root of the Zarr file containing the linked object. For links pointing to an object within the same Zarr file, the value of source will be ".". For external links that point to an object in another Zarr file, the value of source will be the path to the other Zarr file relative to the root path of the Zarr file containing the link.
- path: Path to the linked object within the Zarr file identified by the source key
- object_id: Object id of the referenced object. May be None in case the referenced object does not have an assigned object_id (e.g., when we reference a dataset with a fixed name but without an assigned data_type (or neurodata_type in the case of NWB)).
- source_object_id: Object id of the source Zarr file indicated by the source key. The source should always have an object_id (at least if the source file is a valid HDMF formatted file).
For example:
"zarr_link": [
{
"name": "device",
"source": ".",
"path": "/general/devices/array",
"object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
"source_object_id": "6224bb89-578a-4839-b31c-83f11009292c"
}
]
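To make the roles of the source and path keys concrete, here is a minimal sketch of how a link entry could be turned into a target location. The helper `resolve_link_target` and the file names are hypothetical; ZarrIO uses its own internal routine for this, which is not reproduced here.

```python
import posixpath


def resolve_link_target(zarr_root, link):
    """Return (target_file, object_path) for one zarr_link entry (hypothetical helper)."""
    # "source" is relative to the root of the Zarr file that holds the link,
    # so "." resolves to that same file.
    target_file = posixpath.normpath(posixpath.join(zarr_root, link["source"]))
    # "path" addresses the linked object inside the target file.
    return target_file, link["path"]


internal = {"name": "device", "source": ".", "path": "/general/devices/array"}
external = {"name": "device", "source": "../other.zarr", "path": "/general/devices/array"}

print(resolve_link_target("/data/session.zarr", internal))
print(resolve_link_target("/data/session.zarr", external))
```

Because source is stored relative to the file containing the link, a set of linked Zarr files can be moved together without rewriting the links.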
| HDMF Specification Key | Zarr |
|---|---|
| name | Name of the link |
| doc | Not mapped; Stored in schema only |
| target_type | Not mapped. The target type is determined by the type of the target of the link |
Hint
In Zarr, attributes are stored in JSON as part of the hidden .zattrs file in the folder defining
the Group or Dataset.
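Since the attributes live in a plain JSON file, the links of a Group can be inspected without any Zarr tooling. The following sketch assumes the DirectoryStore layout described above; the helper `read_zarr_links` is hypothetical, and a temporary directory stands in for a real Group folder.

```python
import json
import os
import tempfile


def read_zarr_links(group_dir):
    """Return the link dicts stored on a Group folder, or [] if there are none."""
    attrs_file = os.path.join(group_dir, ".zattrs")
    if not os.path.exists(attrs_file):
        return []
    with open(attrs_file) as f:
        attrs = json.load(f)
    return attrs.get("zarr_link", [])


# Build a throwaway folder that mimics a single Group with one link.
with tempfile.TemporaryDirectory() as group_dir:
    example_attrs = {
        "zarr_link": [
            {"name": "device", "source": ".", "path": "/general/devices/array"}
        ]
    }
    with open(os.path.join(group_dir, ".zattrs"), "w") as f:
        json.dump(example_attrs, f)
    link_names = [link["name"] for link in read_zarr_links(group_dir)]

print(link_names)
```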
Hint
In ZarrIO, links are written by the
__write_link__() function, which also uses the helper functions
i) _create_ref() to construct a hdmf_zarr.utils.ZarrReference
and ii) __add_link__() to add a link to the Zarr file.
__read_links() then parses links and also uses the
__resolve_ref() helper function to resolve the paths stored in links.
Object References
Object references behave much the same way as Links, with the key difference that they are stored as part of datasets or attributes. This approach allows for storage of large collections of references as values of multi-dimensional arrays (i.e., the data type of the array is a reference type).
Storing object references in Datasets
To identify that a dataset contains object references, the reserved attribute zarr_dtype is set to
'object' (see also Reserved attributes). In this way, we can unambiguously
determine whether a dataset stores references that need to be resolved.
Similar to Links, object references are defined via dicts, which are stored as elements of
the Dataset. In contrast to links, individual object references do not have a name but are identified
by their location (i.e., index) in the dataset. As such, object references only have the source with
the relative path to the target Zarr file, and the path identifying the object within the source
Zarr file. The individual object references are defined in
ZarrIO as hdmf_zarr.utils.ZarrReference objects created via
the _create_ref() helper function.
By default, ZarrIO uses the numcodecs.pickles.Pickle codec to
encode object references defined as hdmf_zarr.utils.ZarrReference dicts in datasets.
Users may set the codec used to encode objects in Zarr datasets via the object_codec_class
parameter of the __init__() constructor of
ZarrIO. E.g., we could use
ZarrIO( ... , object_codec_class=numcodecs.JSON) to serialize objects using JSON.
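To give a feel for what the codec choice changes, the sketch below encodes one reference dict with Python's built-in pickle and json modules. These stand in for numcodecs.pickles.Pickle and numcodecs.JSON purely for illustration; the snippet does not call ZarrIO or numcodecs, and the bytes produced by the real codecs may differ.

```python
import json
import pickle

# A reference dict with the same keys used for links and object references.
reference = {
    "source": ".",
    "path": "/general/extracellular_ephys/electrodes",
    "object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
    "source_object_id": "6224bb89-578a-4839-b31c-83f11009292c",
}

# Pickle produces a Python-specific binary payload.
pickled = pickle.dumps(reference)

# JSON produces UTF-8 text that any JSON parser can read.
as_json = json.dumps(reference).encode("utf-8")

# Both encodings round-trip to the original dict.
print(pickle.loads(pickled) == reference, json.loads(as_json) == reference)
```

The practical difference is portability: JSON output can be parsed by tools written in any language, while pickle output is tied to Python.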
Storing object references in Attributes
Object references are stored in attributes as dicts with the following keys:
- zarr_dtype: Indicating the data type for the attribute. For object references, zarr_dtype is set to "object".
- value: The value of the object reference, i.e., here the hdmf_zarr.utils.ZarrReference dictionary with the source, path, object_id, and source_object_id keys defining the object reference, with the definition of the keys being the same as for Links.
For example in NWB, the attribute ElectricalSeries.electrodes.table would be defined as follows:
"table": {
"value": {
"path": "/general/extracellular_ephys/electrodes",
"source": ".",
"object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
"source_object_id": "6224bb89-578a-4839-b31c-83f11009292c"
},
"zarr_dtype": "object"
}
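A reader of such attributes needs to tell object references apart from ordinary values. A minimal sketch of that check follows; the helper `as_object_reference` is hypothetical and is not how ZarrIO itself is structured.

```python
def as_object_reference(attr_value):
    """Return the reference dict if attr_value encodes an object reference, else None."""
    # Object references are dicts whose zarr_dtype is "object";
    # the reference itself sits under the "value" key.
    if isinstance(attr_value, dict) and attr_value.get("zarr_dtype") == "object":
        return attr_value.get("value")
    return None


table_attr = {
    "value": {
        "path": "/general/extracellular_ephys/electrodes",
        "source": ".",
        "object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
        "source_object_id": "6224bb89-578a-4839-b31c-83f11009292c",
    },
    "zarr_dtype": "object",
}

ref = as_object_reference(table_attr)
print(ref["path"])
```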
dtype mappings
The mapping of data types is as follows:
| dtype spec value | storage type | size |
|---|---|---|
| "float", "float32" | single precision floating point | 32 bit |
| "double", "float64" | double precision floating point | 64 bit |
| "long", "int64" | signed 64 bit integer | 64 bit |
| "int", "int32" | signed 32 bit integer | 32 bit |
| "int16" | signed 16 bit integer | 16 bit |
| "int8" | signed 8 bit integer | 8 bit |
| "uint32" | unsigned 32 bit integer | 32 bit |
| "uint16" | unsigned 16 bit integer | 16 bit |
| "uint8" | unsigned 8 bit integer | 8 bit |
| "bool" | boolean | 8 bit |
| "text", "utf", "utf8", "utf-8" | unicode | variable |
| "ascii", "str" | ascii | variable |
| "ref", "reference", "object" | Reference to another group or dataset. See Object References | |
| compound dtype | Compound data type. Stored in zarr_dtype as a list of dicts with "name" and "dtype" keys (see example below). | |
| "isodatetime" | ASCII ISO 8601 datetime string. For example 2018-09-28T14:43:54.123+02:00 | variable |
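The alias structure of the table above can be sketched as a lookup. The names `DTYPE_ALIASES` and `storage_type`, and the choice of canonical names on the right-hand side, are illustrative assumptions rather than the identifiers used inside ZarrIO.

```python
# Spec value -> canonical storage type (canonical names chosen for illustration).
DTYPE_ALIASES = {
    "float": "float32", "float32": "float32",
    "double": "float64", "float64": "float64",
    "long": "int64", "int64": "int64",
    "int": "int32", "int32": "int32",
    "int16": "int16", "int8": "int8",
    "uint32": "uint32", "uint16": "uint16", "uint8": "uint8",
    "bool": "bool",
    "text": "unicode", "utf": "unicode", "utf8": "unicode", "utf-8": "unicode",
    "ascii": "ascii", "str": "ascii",
    "ref": "object", "reference": "object", "object": "object",
    "isodatetime": "ascii",
}


def storage_type(spec_dtype):
    """Look up the storage type for a dtype value from the specification."""
    try:
        return DTYPE_ALIASES[spec_dtype]
    except KeyError:
        raise ValueError(f"Unknown dtype spec value: {spec_dtype!r}") from None


print(storage_type("double"), storage_type("utf8"), storage_type("ref"))
```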
Note
For compound dtypes, the zarr_dtype attribute is stored as a list of dictionaries,
where each dictionary describes a field in the compound type. For example:
"zarr_dtype": [
{
"dtype": "uint32",
"name": "x"
},
{
"dtype": "uint32",
"name": "y"
},
{
"dtype": "float32",
"name": "weight"
}
]
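A reader could turn such a zarr_dtype list into a field description as sketched below. The (name, dtype) pair list is the form that numpy.dtype accepts, although numpy is not imported here; the `FIELD_BITS` table is an illustrative subset of the sizes listed in the dtype mappings table.

```python
# Sizes in bits, taken from the dtype mappings table (illustrative subset).
FIELD_BITS = {"uint32": 32, "float32": 32, "float64": 64, "int32": 32, "int64": 64}

zarr_dtype = [
    {"dtype": "uint32", "name": "x"},
    {"dtype": "uint32", "name": "y"},
    {"dtype": "float32", "name": "weight"},
]

# One (name, dtype) pair per field, in declaration order.
fields = [(field["name"], field["dtype"]) for field in zarr_dtype]

# Unpadded size of one record in bytes.
record_bytes = sum(FIELD_BITS[dtype] for _name, dtype in fields) // 8

print(fields, record_bytes)
```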
Caching format specifications
In practice it is useful to cache the specification a file was created with (including extensions) directly in the Zarr file. Caching the specification in the file ensures that users can access the specification directly if necessary without requiring external resources. For the Zarr backend, caching of the schema is implemented as follows.
The ZarrIO backend adds the reserved top-level group /specifications
in which all format specifications (including extensions) are cached. The default name for this group is
defined in DEFAULT_SPEC_LOC_DIR and caching of
specifications is implemented in ZarrIO.__cache_spec.
The /specifications group contains for each specification namespace a subgroup
/specifications/<namespace-name>/<version> in which the specification for a particular version of a namespace
is stored (e.g., /specifications/core/2.0.1 in the case of the NWB core namespace at version 2.0.1).
The actual specification data is then stored as a JSON string in scalar datasets with a binary, variable-length string
data type. The specification of the namespace is stored in
/specifications/<namespace-name>/<version>/namespace while additional source files are stored in
/specifications/<namespace-name>/<version>/<source-filename>. Here <source-filename> refers to the main name
of the source file without file extension (e.g., the core namespace defines nwb.ecephys.yaml as a source, which would
be stored in /specifications/core/2.0.1/nwb.ecephys).
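The naming scheme for cached specifications can be sketched as a small path builder. The function `spec_cache_path` is hypothetical and only mirrors the layout described above; the actual caching is done by ZarrIO.__cache_spec.

```python
import posixpath


def spec_cache_path(namespace, version, source_filename=None, spec_root="/specifications"):
    """Build the in-file path for a cached namespace or one of its source files."""
    base = posixpath.join(spec_root, namespace, version)
    if source_filename is None:
        # The namespace definition itself is stored under the fixed name "namespace".
        return posixpath.join(base, "namespace")
    # Source files are stored under their name without the file extension.
    stem, _extension = posixpath.splitext(source_filename)
    return posixpath.join(base, stem)


print(spec_cache_path("core", "2.0.1"))
print(spec_cache_path("core", "2.0.1", "nwb.ecephys.yaml"))
```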
Consolidating Metadata
Zarr allows users to consolidate all metadata for groups and arrays within the given store. By default, every file consolidates all of its metadata into a single .zmetadata file, stored in the root group. This reduces the number of read operations when retrieving certain metadata in read mode.
Note
When updating a file, the consolidated metadata will also need to be updated via zarr.consolidate_metadata(path) to ensure the consolidated metadata is consistent with the file.
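As a conceptual illustration of what consolidation gathers, the sketch below walks a directory tree and collects the per-node metadata files into one dict, keyed by their path relative to the store root. It assumes the Zarr version 2 file names (.zgroup, .zarray, .zattrs) and is not a replacement for zarr.consolidate_metadata; the helper `collect_metadata` is hypothetical.

```python
import json
import os
import tempfile

METADATA_FILES = {".zgroup", ".zarray", ".zattrs"}


def collect_metadata(store_root):
    """Gather every per-node metadata file under store_root into a single dict."""
    collected = {}
    for dirpath, _dirnames, filenames in os.walk(store_root):
        for filename in filenames:
            if filename in METADATA_FILES:
                full_path = os.path.join(dirpath, filename)
                key = os.path.relpath(full_path, store_root).replace(os.sep, "/")
                with open(full_path) as f:
                    collected[key] = json.load(f)
    return collected


# Build a tiny throwaway store with a root group and one subgroup.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "general"))
    example_files = [
        (".zgroup", {"zarr_format": 2}),
        ("general/.zgroup", {"zarr_format": 2}),
        ("general/.zattrs", {"doc": "example"}),
    ]
    for rel_path, content in example_files:
        with open(os.path.join(root, *rel_path.split("/")), "w") as f:
            json.dump(content, f)
    consolidated = {"zarr_consolidated_format": 1, "metadata": collect_metadata(root)}

print(sorted(consolidated["metadata"]))
```

Reading one consolidated document instead of many small files is what reduces the number of read operations, and it is also why the consolidated copy goes stale when the underlying files change.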