Streaming NWB Zarr files from S3
This tutorial demonstrates how to stream NWB files stored in Zarr format from Amazon S3 cloud storage. Streaming from S3 allows you to access large datasets without downloading the entire file, which is particularly useful for exploring data, reading specific subsets, or working with datasets too large for local storage.
Prerequisites
To stream data from S3, you need to install the optional dependencies fsspec and s3fs:
pip install "hdmf-zarr[full]"
Or install the dependencies separately:
pip install fsspec s3fs
Streaming from a Public S3 Bucket
To read an NWB Zarr file from a public S3 bucket, you can provide the S3 URL
to NWBZarrIO. For HTTPS URLs (https://), no
additional configuration is needed. For s3:// protocol URLs, you need to
specify storage_options=dict(anon=True) to enable anonymous access.
Here we demonstrate reading from a public dataset in the DANDI Archive using an HTTPS URL:
from hdmf_zarr import NWBZarrIO
# Public S3 URL from DANDI Archive (DANDISET 000719)
# Path: sub-R6_ses-20200206T210000_behavior+ophys_DirectoryStore_rechunked.nwb.zarr
s3_url = "https://dandiarchive.s3.amazonaws.com/zarr/c8c6b848-fbc6-4f58-85ff-e3f2618ee983/"
# Open the file from S3
try:
    with NWBZarrIO(s3_url, mode="r") as io:
        nwbfile = io.read()
        print(f"Session Description: {nwbfile.session_description}")
        print(f"Identifier: {nwbfile.identifier}")
        print(f"Subject ID: {nwbfile.subject.subject_id if nwbfile.subject else 'N/A'}")
except Exception as e:
    print(f"Note: Could not access S3 file (network access may be required): {e}")
Session Description: TwoTower_foraging_002_002
Identifier: 7208f856-f527-479f-973d-e6e72326a8ea
Subject ID: R6
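The s3:// form of the same read can be sketched as follows. The helper https_to_s3_url is hypothetical (it is not part of hdmf-zarr) and assumes a virtual-hosted-style URL of the form https://<bucket>.s3.amazonaws.com/<key>. The hdmf_zarr import is deferred into the reading function so that the URL helper can be used on its own.

```python
from urllib.parse import urlparse


def https_to_s3_url(https_url):
    """Convert https://<bucket>.s3.amazonaws.com/<key> to s3://<bucket>/<key>.

    Hypothetical helper for illustration; not part of hdmf-zarr.
    """
    parsed = urlparse(https_url)
    suffix = ".s3.amazonaws.com"
    if not parsed.netloc.endswith(suffix):
        raise ValueError(f"Not a virtual-hosted-style S3 URL: {https_url}")
    bucket = parsed.netloc[: -len(suffix)]
    return f"s3://{bucket}{parsed.path}"


def read_session_description(s3_url):
    """Open an s3:// URL anonymously and return the session description."""
    from hdmf_zarr import NWBZarrIO  # deferred: needs hdmf-zarr, fsspec, s3fs

    # anon=True enables anonymous access, as required for s3:// URLs of public buckets
    with NWBZarrIO(s3_url, mode="r", storage_options=dict(anon=True)) as io:
        return io.read().session_description


# Derive the s3:// URL from the HTTPS URL used in the example above
s3_protocol_url = https_to_s3_url(
    "https://dandiarchive.s3.amazonaws.com/zarr/c8c6b848-fbc6-4f58-85ff-e3f2618ee983/"
)
print(s3_protocol_url)
```

Calling read_session_description(s3_protocol_url) then performs the actual read and requires network access.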
Accessing Private S3 Buckets
To access files in private S3 buckets, you need to provide AWS credentials. There are several ways to do this:
Option 1: Use AWS credentials from environment or ~/.aws/credentials
If your AWS credentials are configured via environment variables
(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or in the AWS credentials file,
you can simply omit the anon=True option:
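A minimal sketch of this option, assuming a hypothetical private store at s3://my-private-bucket/data/session.nwb.zarr. The helper has_env_credentials is illustrative only: it checks the two environment variables named above and does not detect credentials stored in ~/.aws/credentials.

```python
import os


def has_env_credentials(env=None):
    """Return True if both AWS credential environment variables are set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("AWS_ACCESS_KEY_ID")) and bool(env.get("AWS_SECRET_ACCESS_KEY"))


def read_private_identifier(s3_url):
    """Read a private store using whatever credentials are configured."""
    from hdmf_zarr import NWBZarrIO  # deferred: needs hdmf-zarr, fsspec, s3fs

    # No storage_options: anon=True is omitted, so the configured credentials are used
    with NWBZarrIO(s3_url, mode="r") as io:
        return io.read().identifier


# Hypothetical bucket and path, for illustration only
PRIVATE_URL = "s3://my-private-bucket/data/session.nwb.zarr"
print(f"Environment credentials found: {has_env_credentials()}")
```

With credentials in place, read_private_identifier(PRIVATE_URL) would open the store without any further configuration.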
Option 2: Provide credentials explicitly
You can also provide credentials directly via the storage_options parameter:
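One way to do this without writing secrets into the script is to build storage_options from environment variables at run time. The sketch below assumes the s3fs option names key, secret, and token; the helper credentials_from_env and the function read_with_credentials are hypothetical names introduced for illustration.

```python
import os


def credentials_from_env(env=None):
    """Build an s3fs-style storage_options dict from AWS environment variables."""
    env = os.environ if env is None else env
    options = {
        "key": env["AWS_ACCESS_KEY_ID"],
        "secret": env["AWS_SECRET_ACCESS_KEY"],
    }
    # A session token is only present when temporary credentials are in use
    if env.get("AWS_SESSION_TOKEN"):
        options["token"] = env["AWS_SESSION_TOKEN"]
    return options


def read_with_credentials(s3_url, storage_options):
    """Read a private store, passing the credentials explicitly."""
    from hdmf_zarr import NWBZarrIO  # deferred: needs hdmf-zarr, fsspec, s3fs

    with NWBZarrIO(s3_url, mode="r", storage_options=storage_options) as io:
        return io.read().identifier
```

A call would then look like read_with_credentials(url, credentials_from_env()), where url points at your own private store.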
Note: Never hardcode credentials in your scripts. Use environment variables or AWS credentials files instead.
The Importance of Consolidated Metadata
Zarr files store metadata for each array and group in separate files. When reading from S3, each metadata access requires a separate network request, which can significantly slow down file opening and data access.
Consolidated metadata addresses this by storing all metadata in a single
.zmetadata file at the root of the Zarr store. This helps improve read performance
by reducing the number of S3 requests needed to open a file.
By default, NWBZarrIO consolidates metadata when
writing files, and automatically uses consolidated metadata when available
during read operations.
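To make the idea concrete, the sketch below inspects the consolidated metadata of a local Zarr store using only the standard library. It assumes the Zarr v2 layout, in which .zmetadata is a JSON document whose "metadata" entry maps each per-group and per-array metadata path to its contents. The helper summarize_consolidated is hypothetical.

```python
import json
from pathlib import Path


def summarize_consolidated(store_path):
    """Count the metadata entries recorded in <store>/.zmetadata.

    Returns None if the store has no consolidated metadata file.
    """
    zmetadata = Path(store_path) / ".zmetadata"
    if not zmetadata.is_file():
        return None
    entries = json.loads(zmetadata.read_text())["metadata"]
    # Keys look like ".zgroup", "acquisition/.zattrs", or "acquisition/data/.zarray"
    names = [key.rsplit("/", 1)[-1] for key in entries]
    return {
        "groups": names.count(".zgroup"),
        "arrays": names.count(".zarray"),
        "attrs": names.count(".zattrs"),
    }
```

Every entry counted here is a file that a reader would otherwise have to request from S3 individually.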
Generating and Updating Consolidated Metadata
When you create or modify a Zarr file, consolidate the metadata so that
readers, especially those streaming from S3, can open it efficiently.
NWBZarrIO does this automatically when writing files; see the
consolidate_metadata parameter of the write() method for details.
Note
If you modify a Zarr file after creation (e.g., by directly using zarr APIs), you need to manually update the consolidated metadata:
import zarr
path = "myfile.nwb.zarr"
zarr.consolidate_metadata(path)
This ensures that the .zmetadata file reflects the current state of the
Zarr store. This step is critical before uploading modified files to S3.
For more details on consolidated metadata, see the Zarr documentation and the Storage Specification section of the hdmf-zarr documentation.
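Before uploading a modified local store, a quick staleness check can be sketched with the standard library. This assumes a Zarr v2 directory layout with .zgroup, .zarray, and .zattrs files on disk, and the helper find_unconsolidated is hypothetical. It only detects metadata files that are absent from .zmetadata; it does not detect entries whose contents have changed or entries for nodes that have been deleted.

```python
import json
import os
from pathlib import Path

METADATA_NAMES = {".zgroup", ".zarray", ".zattrs"}


def find_unconsolidated(store_path):
    """Return on-disk metadata paths that are missing from .zmetadata."""
    store = Path(store_path)
    consolidated = json.loads((store / ".zmetadata").read_text())["metadata"]
    on_disk = set()
    for root, _dirs, files in os.walk(store):
        for name in files:
            if name in METADATA_NAMES:
                # Store-relative path with forward slashes, matching .zmetadata keys
                on_disk.add((Path(root) / name).relative_to(store).as_posix())
    return sorted(on_disk - set(consolidated))
```

A non-empty result suggests that zarr.consolidate_metadata(path) should be run again before the store is uploaded.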
Using the Convenience Method
NWBZarrIO provides a convenience static method
read_nwb() for quick read access:
# Read file directly using the convenience static method
try:
    nwbfile = NWBZarrIO.read_nwb(s3_url)
    print(f"Session Start Time: {nwbfile.session_start_time}")
except Exception as e:
    print(f"Note: Could not access S3 file (network access may be required): {e}")
Session Start Time: 2020-02-06 21:00:00-08:00
Note
PyNWB also provides a more general read() method that
can automatically detect and use the appropriate IO class (HDF5 or Zarr) based
on the file path or URL.
Best Practices for S3 Streaming
Always use consolidated metadata for files stored on S3. This is the default when writing with NWBZarrIO.
Use HTTPS URLs (https://) for public buckets when possible, as they work without additional configuration.
For private buckets, configure AWS credentials properly using environment variables or the AWS credentials file rather than hardcoding them.
After modifying Zarr files, always run zarr.consolidate_metadata(path) before uploading to S3.
Test your S3 URLs to ensure they are accessible before sharing them with collaborators.
Consider network costs: While streaming is convenient, repeated access to the same data may be less efficient than downloading the file once for local access.
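As one possible way to test a public HTTPS URL before sharing it, the sketch below sends a HEAD request for the store's .zmetadata file. It assumes the store is served over HTTPS and has consolidated metadata at its root; zmetadata_url and is_publicly_readable are hypothetical helpers, not part of hdmf-zarr.

```python
import urllib.request


def zmetadata_url(store_url):
    """Return the URL of the consolidated metadata file for an HTTPS store URL."""
    return store_url.rstrip("/") + "/.zmetadata"


def is_publicly_readable(store_url, timeout=10):
    """Send a HEAD request for .zmetadata and report whether it succeeds."""
    request = urllib.request.Request(zmetadata_url(store_url), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors (e.g. 403, 404), DNS failures, and timeouts
        return False
```

A False result means either that the URL is not reachable anonymously or that the store has no consolidated metadata, both of which are worth resolving before handing the URL to collaborators.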