Storage Specification¶
hdmf-zarr currently uses the Zarr DirectoryStore, which uses directories and files on a standard file system to serialize data.
Format Mapping¶
Here we describe the mapping of HDMF primitives (e.g., Groups, Datasets, Attributes, Links, etc.) used by the HDMF schema language to Zarr storage primitives. HDMF data modeling primitives were originally designed with HDF5 in mind. However, Zarr uses very similar primitives, and as such the high-level mapping between HDMF schema and Zarr storage is overall fairly simple. The main complication is that Zarr does not support links and references (see Zarr issue #389) and as such have to implemented by hdmf-zarr.
NWB Primitive |
Zarr Primitive |
---|---|
Group |
Group |
Dataset |
Dataset |
Attribute |
Attribute |
Link |
Stored as JSON formatted Attributes |
Mapping of HDMF specification language keys¶
Here we describe the mapping of keys from the HDMF specification language to Zarr storage objects:
Groups¶
NWB Key |
Zarr |
---|---|
name |
Name of the Group in Zarr |
doc |
Zarr attribute |
groups |
Zarr groups within the Zarr group |
datasets |
Zarr datasets within the Zarr group |
attributes |
Zarr attributes on the Zarr group |
links |
Stored as JSON formatted attributes on the Zarr Group |
linkable |
Not mapped; Stored in schema only |
quantity |
Not mapped; Number of appearances of the group |
neurodata_type |
Attribute |
namespace ID |
Attribute |
object ID |
Attribute |
Reserved groups¶
The ZarrIO
backend typically caches the schema used to create a file in the
group /specifications
(see also Caching format specifications)
Datasets¶
HDMF Specification Key |
Zarr |
---|---|
name |
Name of the dataset in Zarr |
doc |
Zarr attribute |
dtype |
Data type of the Zarr dataset (see dtype mappings table) and stored in the |
shape |
Shape of the Zarr dataset if the shape is fixed, otherwise shape defines the maxshape |
dims |
Not mapped |
attributes |
Zarr attributes on the Zarr dataset |
linkable |
Not mapped; Stored in schema only |
quantity |
Not mapped; Number of appearances of the dataset |
neurodata_type |
Attribute |
namespace ID |
Attribute |
object ID |
Attribute |
Note
TODO Update mapping of dims
Attributes¶
HDMF Specification Key |
Zarr |
---|---|
name |
Name of the attribute in Zarr |
doc |
Not mapped; Stored in schema only |
dtype |
Data type of the Zarr attribute |
shape |
Shape of the Zarr attribute if the shape is fixed, otherwise shape defines the maxshape |
dims |
Not mapped; Reflected by the shape of the attribute data |
required |
Not mapped; Stored in schema only |
value |
Data value of the attribute |
Note
Attributes are stored as JSON documents in Zarr (using the DirectoryStore). As such, all attributes
must be JSON serializable. The ZarrIO
backend attempts to cast types
(e.g., numpy arrays) to JSON serializable types as much as possible, but not all possible types may
be supported.
Reserved attributes¶
The ZarrIO
backend defines a set of reserved attribute names defined in
__reserve_attribute
. These reserved attributes are used to implement
functionality (e.g., links and object references, which are not natively supported by Zarr) and may be
added on any Group or Dataset in the file.
Reserved Attribute Name
Usage
zarr_link
Attribute used to store links. See Links for details.
zarr_dtype
Attribute used to specify the data type of a dataset. This is used to implement the storage of object references as part of datasets. See Object References
In addition, the following reserved attributes are added to the root Group of the file only:
Reserved Attribute Name
Usage
.specloc
Attribute storing the path to the Group where the scheme for the file are cached. See
SPEC_LOC_ATTR
Links¶
Similar to soft links in a file system, a link is an object in a Group that links to another Group or Dataset,
either within the same Zarr file or another external Zarr file. Links and reference are not natively supported by
Zarr but are implemented in ZarrIO
in an OS independent fashion using the zarr_link
reserved attribute (see __reserve_attribute
) to store a list of dicts serialized
as JSON. Each dict (i.e., element) in the list defines a link, with each dict containing the following keys:
name
: Name of the linksource
: Relative path to the root of the Zarr file containing the linked object. For links pointing to an object within the same Zarr file, the value of source will be"."
. For external links that point to object in another Zarr file, the value of source will be the path to the other Zarr file relative to the root path of the Zarr file containing the link.path
: Path to the linked object within the Zarr file idenfied by thesource
keyobject_id
: Object id of the reference object. May be None in case the referenced object does not have an assigned object_id (e.g., in the case we reference a dataset with a fixed name but without and assigneddata_type
(orneurodata_type
in the case of NWB).source_object_id
: Object id of the source Zarr file indicated by thesource
key. Thesource
should always have anobject_id
(at least if thesource
file is a valid HDMF formatted file).
For example:
"zarr_link": [
{
"name": "device",
"source": ".",
"path": "/general/devices/array",
"object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
"source_object_id": "6224bb89-578a-4839-b31c-83f11009292c"
}
]
HDMF Specification Key |
Zarr |
---|---|
name |
Name of the link |
doc |
Not mapped; Stored in schema only |
target_type |
Not mapped. The target type is determined by the type of the target of the link |
Hint
In Zarr, attributes are stored in JSON as part of the hidden .zattrs
file in the folder defining
the Group or Dataset.
Hint
In ZarrIO
, links are written by the
__write_link__()
function, which also uses the helper functions
i) __get_ref()
to construct py:meth:~hdmf_zarr.utils.ZarrRefernce
and ii) __add_link__()
to add a link to the Zarr file.
__read_links()
then parses links and also uses the
__resolve_ref()
helper function to resolve the paths stored in links.
Object References¶
Object reference behave much the same way as Links, with the key difference that they are stored as part of datasets or attributes. This approach allows for storage of large collections of references as values of multi-dimensional arrays (i.e., the data type of the array is a reference type).
Storing object references in Datasets¶
To identify that a dataset contains object reference, the reserved attribute zarr_dtype
is set to
'object'
(see also Reserved attributes). In this way, we can unambiguously
if a dataset stores references that need to be resolved.
Similar to Links, object references are defined via dicts, which are stored as elements of
the Dataset. In contrast to links, individual object reference do not have a name
but are identified
by their location (i.e., index) in the dataset. As such, object references only have the source
with
the relative path to the target Zarr file, and the path
identifying the object within the source
Zarr file. The individual object references are defined in the
ZarrIO
as py:class:~hdmf_zarr.utils.ZarrReference object created via
the __get_ref()
helper function.
By default, ZarrIO
uses the numcodecs.pickles.Pickle
codec to
encode object references defined as py:class:~hdmf_zarr.utils.ZarrReference dicts in datasets.
Users may set the codec used to encode objects in Zarr datasets via the object_codec_class
parameter of the __init__()
constructor of
ZarrIO
. E.g., we could use
ZarrIO( ... , object_codec_class=numcodecs.JSON)
to serialize objects using JSON.
Storing object references in Attributes¶
Object references are stored in a attributes as dicts with the following keys:
zarr_dtype
: Indicating the data type for the attribute. For object referenceszarr_dtype
is set to"object"
(or"region"
for Region references)value
: The value of the object references, i.e., here the py:class:~hdmf_zarr.utils.ZarrReference dictionary with thesource
,path
,object_id
, andsource_object_id
keys defining the object reference, with the definition of the keys being the same as for Links.
For example in NWB, the attribute ElectricalSeries.electrodes.table
would be defined as follows:
"table": {
"value": {
"path": "/general/extracellular_ephys/electrodes",
"source": ".",
"object_id": "f6685427-3919-4e06-b195-ccb7ab42f0fa",
"source_object_id": "6224bb89-578a-4839-b31c-83f11009292c"
},
"zarr_dtype": "object"
}
Region references¶
Region references are similar to object references, but instead of references other Datasets or Groups,
region references link to subsets of another Dataset. To identify region references, the reserved attribute
zarr_dtype
is set to 'region'
(see also Reserved attributes). In addition
to the source
and path
, the py:class:~hdmf_zarr.utils.ZarrReference object will also need to
store the definition of the region
that is being referenced, e.g., a slice or list on point indices.
Warning
Region references are not yet fully implemented in ZarrIO
.
To implement region references will require updating:
1) py:class:~hdmf_zarr.utils.ZarrReference to add a region
key to support storing
the selection for the region,
2) __get_ref()
to support passing in the region definition to
be added to the py:class:~hdmf_zarr.utils.ZarrReference,
3) write_dataset()
already partially implements the required
logic for creating region references by checking for hdmf.build.RegionBuilder
inputs
but will likely need updates as well
4) __read_dataset()
to support reading region references,
which may also require updates to __parse_ref()
and
__resolve_ref()
, and
5) and possibly other parts of ZarrIO
.
6) The py:class:~hdmf_zarr.zarr_utils.ContainerZarrRegionDataset and
py:class:~hdmf_zarr.zarr_utils.ContainerZarrRegionDataset classes will also need to be finalized
to support region references.
dtype mappings¶
The mappings of data types is as follows
dtype
spec valuestorage type
size
“float”
“float32”
single precision floating point
32 bit
“double”
“float64”
double precision floating point
64 bit
“long”
“int64”
signed 64 bit integer
64 bit
“int”
“int32”
signed 32 bit integer
32 bit
“int16”
signed 16 bit integer
16 bit
“int8”
signed 8 bit integer
8 bit
“uint32”
unsigned 32 bit integer
32 bit
“uint16”
unsigned 16 bit integer
16 bit
“uint8”
unsigned 8 bit integer
8 bit
“bool”
boolean
8 bit
“text”
“utf”
“utf8”
“utf-8”
unicode
variable
“ascii”
“str”
ascii
variable
“ref”
“reference”
“object”
Reference to another group or dataset. See Object References
region
Reference to a region of another dataset. See :ref:sec-zarr-storage-references`
compound dtype
Compound data type
“isodatetime”
ASCII ISO8061 datetime string. For example
2018-09-28T14:43:54.123+02:00
variable
Caching format specifications¶
In practice it is useful to cache the specification a file was created with (including extensions) directly in the Zarr file. Caching the specification in the file ensures that users can access the specification directly if necessary without requiring external resources. For the Zarr backend, caching of the schema is implemented as follows.
The ZarrIO`
backend adds the reserved top-level group /specifications
in which all format specifications (including extensions) are cached. The default name for this group is
defined in DEFAULT_SPEC_LOC_DIR
and caching of
specifications is implemented in ZarrIO.__cache_spec
.
The /specifications
group contains for each specification namespace a subgroup
/specifications/<namespace-name>/<version>
in which the specification for a particular version of a namespace
are stored (e.g., /specifications/core/2.0.1
in the case of the NWB core namespace at version 2.0.1).
The actual specification data is then stored as a JSON string in scalar datasets with a binary, variable-length string
data type. The specification of the namespace is stored in
/specifications/<namespace-name>/<version>/namespace
while additional source files are stored in
/specifications/<namespace-name>/<version>/<source-filename>
. Here <source-filename>
refers to the main name
of the source-file without file extension (e.g., the core namespace defines nwb.ephys.yaml
as source which would
be stored in /specifications/core/2.0.1/nwb.ecephys
).