zarrdata

A convenient way to store a publicly accessible Zarr dataset that is versioned and optionally tied to a Zenodo DOI:

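import xarray as xr
import fsspec  # not called directly; must be installed so the store can be read over HTTP

# URL of the Zarr store served by GitHub Pages
uri = 'https://scottyhq.github.io/zarrdata/air_temperature.zarr'
# consolidated=True reads all metadata from the single .zmetadata file
ds = xr.open_dataset(uri, engine="zarr", consolidated=True)
# Plot the second time step with longitude on the x-axis
ds.air.isel(time=1).plot(x="lon")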

The basic idea is to host a smallish citable record (<1 GB) on a static GitHub Pages website so that your tutorial, research code, benchmarking suite, etc. can run against a citable dataset.

The key limitation of this approach is that each Zarr chunk must be smaller than 100 MB, per GitHub repository limits, and the total size of the repo/Zarr store should be less than 1 GB, per GitHub Pages limits. If you're dealing with more than 1 GB of data or want high performance, you probably want to store the data files on AWS S3, GCS, etc.
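Before committing a store, it can help to check it against the two limits above. The following is a minimal sketch, not part of this repo: the function names `check_store` and `fits_github_pages` are hypothetical, the store is assumed to be a plain directory on disk, and the limits are assumed to be decimal megabytes/gigabytes as quoted in the text.

```python
from pathlib import Path

# Limits as quoted in the text (assumed here to be decimal MB / GB).
MAX_FILE_BYTES = 100 * 1000**2   # per-file limit for a GitHub repository
MAX_STORE_BYTES = 1000**3        # total size limit for a GitHub Pages site


def check_store(store_dir, max_file_bytes=MAX_FILE_BYTES):
    """Return (total_bytes, oversized) for a Zarr directory store.

    `oversized` lists (relative_path, size) for files at or above the limit.
    Hidden files such as .zmetadata are included in the walk.
    """
    root = Path(store_dir)
    total = 0
    oversized = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            total += size
            if size >= max_file_bytes:
                oversized.append((path.relative_to(root).as_posix(), size))
    return total, oversized


def fits_github_pages(store_dir):
    """True if no file reaches the per-file limit and the total stays under 1 GB."""
    total, oversized = check_store(store_dir)
    return total < MAX_STORE_BYTES and not oversized
```

For example, `fits_github_pages("air_temperature.zarr")` would be run from the repository root before `git add`. If it returns False, rechunking the dataset into smaller chunks is the usual remedy.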

Configuration steps

  1. Add Zarr data. In the create_zarr.py script I just create a Zarr store from the Xarray tutorial dataset, but if you already have a data.zarr you can simply add it to your repo.

  2. Add a Jekyll configuration file. GitHub Pages automatically deploys your repository and serves it over HTTP as a static site via Jekyll. Because Jekyll ignores hidden files (.zattrs, .zmetadata, etc.) by default, you need a _config.yml to ensure those files are included.

  3. Enable GitHub Pages. To publish the site you just need to enable GitHub Pages for the repository. It’s as simple as going to repository Settings->Pages->Source (select the ‘main’ branch and click ‘Save’)! Then you’ll have a live HTTP website with the repo README.md rendered! For this repo, https://github.com/scottyhq/zarrdata, the website is https://scottyhq.github.io/zarrdata .
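For step 2, one way to produce the Jekyll configuration is to scan the store for hidden file names and list them under Jekyll's `include` setting. This is a rough sketch under assumptions: the helper names `hidden_names` and `write_jekyll_config` are hypothetical, the `_config.yml` actually used in this repo may look different, and it assumes Jekyll applies the listed names to files in nested directories as well.

```python
from pathlib import Path


def hidden_names(store_dir):
    """Collect the distinct dot-file names (e.g. .zattrs) found inside a store."""
    root = Path(store_dir)
    return sorted(
        {p.name for p in root.rglob("*") if p.is_file() and p.name.startswith(".")}
    )


def write_jekyll_config(store_dir, config_path="_config.yml"):
    """Write a _config.yml whose `include` list names every hidden file found.

    Overwrites any existing file at config_path and returns the names written.
    """
    names = hidden_names(store_dir)
    lines = ["include:"] + [f'  - "{name}"' for name in names]
    Path(config_path).write_text("\n".join(lines) + "\n")
    return names
```

Calling `write_jekyll_config("air_temperature.zarr")` from the repository root would write a `_config.yml` listing names such as `.zattrs` and `.zmetadata`. Since this overwrites the file, merge by hand if the repo already has other Jekyll settings.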