Converting NWB HDF5 files to/from Zarr

This tutorial illustrates how to convert data between HDF5 and Zarr, using a Neurodata Without Borders (NWB) file from the DANDI data archive as an example. We convert the example file from HDF5 to Zarr and then back to HDF5. The NWB standard is defined using HDMF and uses HDMF's HDF5IO backend for HDF5 storage.

Setup

As an example, we use a small NWB file from Dandiset 001333 in the DANDI neurophysiology data archive. To download the file directly from DANDI, we can use:

import os
from dandi.dandiapi import DandiAPIClient

dandiset_id = "001333"
filepath = "sub-healthy-simulated-beta/sub-healthy-simulated-beta_ses-162_ecephys.nwb"   # 220 KiB file
with DandiAPIClient() as client:
    asset = client.get_dandiset(dandiset_id, 'draft').get_asset_by_path(filepath)

s3_path = asset.get_content_url(follow_redirects=1, strip_query=True)
filename = os.path.basename(asset.path)
asset.download(filename)

Here, we use a local copy of a small file from this Dandiset as our example:

import os
import shutil
from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

# Input file to convert
basedir = "resources"
filename = os.path.join(basedir, "sub-healthy-simulated-beta_ses-162_ecephys.nwb")
# Zarr file to generate for converting from HDF5 to Zarr
zarr_filename = "test_zarr_" + os.path.basename(filename) + ".zarr"
# HDF5 file to generate for converting from Zarr to HDF5
hdf_filename = "test_hdf5_" + os.path.basename(filename)

# Delete our converted HDF5 and Zarr file from previous runs of this notebook
for fname in [zarr_filename, hdf_filename]:
    if os.path.exists(fname):
        print("Removing %s" % fname)
        if os.path.isfile(fname):  # Remove a single file (here the HDF5 file)
            os.remove(fname)
        else:  # Remove whole directory and subtree (here the Zarr file)
            shutil.rmtree(fname)

Convert the NWB file from HDF5 to Zarr

To convert files between storage backends, we use HDMF’s export functionality. Because this is an NWB file, we use the pynwb.NWBHDF5IO backend to read the file from HDF5 and the NWBZarrIO backend to export it to Zarr.

with NWBHDF5IO(filename, 'r') as read_io:  # Create HDF5 IO object for read
    with NWBZarrIO(zarr_filename, 'w') as export_io:  # Create Zarr IO object for write
        export_io.export(src_io=read_io, write_args=dict(link_data=False))  # Export from HDF5 to Zarr

Note

When converting between backends, we need to set link_data=False because linking from Zarr to HDF5 (and vice versa) is not supported.

Read the Zarr file back in

After reading the Zarr file with NWBZarrIO, the basic behavior of the NWBFile object is the same as when reading from HDF5.

zarr_io = NWBZarrIO(zarr_filename, 'r')  # Open the Zarr file for reading
nwb_zarr = zarr_io.read()
# Print the NWBFile to illustrate that
print(nwb_zarr)
root pynwb.file.NWBFile at 0x135214958219200
Fields:
  devices: {
    NEURON_Simulator <class 'pynwb.device.Device'>
  }
  electrode_groups: {
    shank0 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank1 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank2 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank3 <class 'pynwb.ecephys.ElectrodeGroup'>
  }
  electrodes: electrodes <class 'pynwb.ecephys.ElectrodesTable'>
  experiment_description: The PESD dataset is generated from a cortico-basal-ganglia network for a Parkinsonian computational model. The computational model of the cortico-basal-ganglia is originally presented by Fleming et al. in the article: 'Simulation of Closed-Loop Deep Brain Stimulation Control Schemes for Suppression of Pathological Beta Oscillations in Parkinson's Disease'.
  experimenter: ['Ananna Biswas']
  file_create_date: [datetime.datetime(2025, 3, 27, 16, 53, 28, 55430, tzinfo=tzoffset(None, -14400))]
  identifier: 7a68ea11-865a-481a-a5fd-d91fe6def653
  institution: Michigan Technological University
  keywords: <zarr.core.Array '/general/keywords' (4,) object read-only>
  lab: BrainX Lab
  processing: {
    ecephys <class 'pynwb.base.ProcessingModule'>
  }
  related_publications: ['https://arxiv.org/abs/2407.17756' 'DOI: 10.3389/fnins.2020.00166']
  session_description: Parkinson's Electrophysiological Signal Dataset (PESD) Generated from Simulation
  session_start_time: 2025-03-27 16:53:27.990500-04:00
  subject: subject pynwb.file.Subject at 0x135214958217072
Fields:
  age: P0D
  age__reference: birth
  description: This is a simulated dataset generated from a computational model.
  sex: U
  species: Homo sapiens
  subject_id: healthy-simulated-beta

  timestamps_reference_time: 2025-03-27 16:53:27.990500-04:00

The main difference is that datasets are now represented by Zarr arrays, whereas reading from HDF5 yields h5py Datasets.

print(type(nwb_zarr.electrodes['label'].data))
<class 'zarr.core.Array'>

For illustration purposes, we show the NWB electrodes table here, converted to a pandas DataFrame (for example with nwb_zarr.electrodes.to_dataframe()).

                                            location  ...         label
id                                                    ...
0   Simulated Cortico-basal-ganglia network of brain  ...  shank0_elec0
1   Simulated Cortico-basal-ganglia network of brain  ...  shank0_elec1
2   Simulated Cortico-basal-ganglia network of brain  ...  shank0_elec2
3   Simulated Cortico-basal-ganglia network of brain  ...  shank1_elec0
4   Simulated Cortico-basal-ganglia network of brain  ...  shank1_elec1
5   Simulated Cortico-basal-ganglia network of brain  ...  shank1_elec2
6   Simulated Cortico-basal-ganglia network of brain  ...  shank2_elec0
7   Simulated Cortico-basal-ganglia network of brain  ...  shank2_elec1
8   Simulated Cortico-basal-ganglia network of brain  ...  shank2_elec2
9   Simulated Cortico-basal-ganglia network of brain  ...  shank3_elec0
10  Simulated Cortico-basal-ganglia network of brain  ...  shank3_elec1
11  Simulated Cortico-basal-ganglia network of brain  ...  shank3_elec2

[12 rows x 4 columns]

Convert the Zarr file back to HDF5

Using the same approach as above, we can now convert our Zarr file back to HDF5.

with NWBZarrIO(zarr_filename, 'r') as read_io:  # Create Zarr IO object for read
    with NWBHDF5IO(hdf_filename, 'w') as export_io:  # Create HDF5 IO object for write
        export_io.export(src_io=read_io, write_args=dict(link_data=False))  # Export from Zarr to HDF5

Read the new HDF5 file back

Our file has now been converted from HDF5 to Zarr and back again to HDF5. Here we check that we can still read the new HDF5 file with NWBHDF5IO by printing the resulting NWBFile.

root pynwb.file.NWBFile at 0x135214957317008
Fields:
  devices: {
    NEURON_Simulator <class 'pynwb.device.Device'>
  }
  electrode_groups: {
    shank0 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank1 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank2 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank3 <class 'pynwb.ecephys.ElectrodeGroup'>
  }
  electrodes: electrodes <class 'pynwb.ecephys.ElectrodesTable'>
  experiment_description: The PESD dataset is generated from a cortico-basal-ganglia network for a Parkinsonian computational model. The computational model of the cortico-basal-ganglia is originally presented by Fleming et al. in the article: 'Simulation of Closed-Loop Deep Brain Stimulation Control Schemes for Suppression of Pathological Beta Oscillations in Parkinson's Disease'.
  experimenter: ['Ananna Biswas']
  file_create_date: [datetime.datetime(2025, 3, 27, 16, 53, 28, 55430, tzinfo=tzoffset(None, -14400))]
  identifier: 7a68ea11-865a-481a-a5fd-d91fe6def653
  institution: Michigan Technological University
  keywords: <StrDataset for HDF5 dataset "keywords": shape (4,), type "|O">
  lab: BrainX Lab
  processing: {
    ecephys <class 'pynwb.base.ProcessingModule'>
  }
  related_publications: ['https://arxiv.org/abs/2407.17756' 'DOI: 10.3389/fnins.2020.00166']
  session_description: Parkinson's Electrophysiological Signal Dataset (PESD) Generated from Simulation
  session_start_time: 2025-03-27 16:53:27.990500-04:00
  subject: subject pynwb.file.Subject at 0x135214957319024
Fields:
  age: P0D
  age__reference: birth
  description: This is a simulated dataset generated from a computational model.
  sex: U
  species: Homo sapiens
  subject_id: healthy-simulated-beta

  timestamps_reference_time: 2025-03-27 16:53:27.990500-04:00
