A gentle introduction

map_blocks is inspired by the dask.array function of the same name and lets you map a function over the blocks (chunks) of an xarray object (including Datasets!).

At compute time, your function will receive an xarray object with concrete (computed) values along with appropriate metadata. This function should return an xarray object.

Setup#

import dask
import numpy as np
import xarray as xr

First, let's set up a LocalCluster using dask.distributed.

You can use any kind of dask cluster. This step is completely independent of xarray. While not strictly necessary, the dashboard provides a nice learning tool.

from dask.distributed import Client

client = Client()
client

Client

Client-6fc0f5fd-2ffd-11ef-8e15-000d3a353dfd

Connection method: Cluster object Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

👆 Click the Dashboard link above, or click the "Search" button in the dashboard.

Let’s test that the dashboard is working.

import dask.array

dask.array.ones((1000, 4), chunks=(2, 1)).compute()  # should see activity in dashboard
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       ...,
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

Let’s open a dataset. We specify chunks so that the DataArrays are backed by dask arrays.

ds = xr.tutorial.open_dataset("air_temperature", chunks={"time": 100})
ds
<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB dask.array<chunksize=(100, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Simple example#

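Here is a simple example. Despite its name, time_mean averages over lat rather than time; lat is not chunked, so each block can be reduced independently of the others.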

def time_mean(obj):
    # use xarray's convenient API here
    # you could convert to a pandas dataframe and use pandas' extensive API
    # or use .plot() and plt.savefig to save visualizations to disk in parallel.
    return obj.mean("lat")


ds.map_blocks(time_mean)  # this is lazy!
<xarray.Dataset> Size: 1MB
Dimensions:  (time: 2920, lon: 53)
Coordinates:
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Data variables:
    air      (time, lon) float64 1MB dask.array<chunksize=(100, 53), meta=np.ndarray>
# this will calculate values and will return True if the computation works as expected
ds.map_blocks(time_mean).identical(ds.mean("lat"))
True
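
To make the earlier point about concrete values tangible, here is a minimal sketch on a small synthetic array. The names demo and add_one are hypothetical and not part of the tutorial. The function asserts that the block it receives is backed by NumPy rather than dask, while the result of map_blocks stays lazy until it is computed.

```python
import dask.array
import numpy as np
import xarray as xr

# Small synthetic array, chunked along "time" (illustrative data only)
demo = xr.DataArray(
    np.arange(12.0).reshape(6, 2),
    dims=("time", "x"),
    coords={"time": np.arange(6), "x": [0, 1]},
    name="demo",
).chunk({"time": 3})


def add_one(block):
    # Inside the function the data is a plain NumPy array, not a dask array
    assert isinstance(block.data, np.ndarray)
    return block + 1


lazy = demo.map_blocks(add_one)
is_lazy = isinstance(lazy.data, dask.array.Array)  # no values computed yet
computed = lazy.compute()  # runs add_one once per block
```

Here is_lazy confirms that map_blocks only builds a task graph; the per-block work happens in compute().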

Exercise#

Try applying the following function with map_blocks. Specify scale as an argument and offset as a kwarg.

The docstring should help: https://docs.xarray.dev/en/stable/generated/xarray.map_blocks.html

def time_mean_scaled(obj, scale, offset):
    return obj.mean("lat") * scale + offset
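
One possible solution is sketched below on a small synthetic dataset, so that it runs without downloading anything. The name ds_small is a hypothetical stand-in for ds, and the scale and offset values are arbitrary. It relies on the args and kwargs parameters described in the linked docstring.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the tutorial dataset, chunked along "time"
rng = np.random.default_rng(0)
ds_small = xr.Dataset(
    {"air": (("time", "lat", "lon"), rng.normal(size=(20, 3, 4)))},
    coords={
        "time": np.arange(20),
        "lat": [10.0, 20.0, 30.0],
        "lon": [0.0, 1.0, 2.0, 3.0],
    },
).chunk({"time": 5})


def time_mean_scaled(obj, scale, offset):
    return obj.mean("lat") * scale + offset


# scale is passed positionally via args, offset by name via kwargs
scaled = ds_small.map_blocks(time_mean_scaled, args=[5], kwargs={"offset": 10})

# Reference result computed without map_blocks, for comparison
expected = ds_small.mean("lat") * 5 + 10
```

Comparing scaled.compute() with expected.compute() is a quick way to check that the arguments were forwarded as intended.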

More advanced functions#

map_blocks needs to know exactly what the returned object looks like. It does so by passing a 0-shaped xarray object to the function and examining the result. This approach cannot work in all cases. For such advanced use cases, map_blocks accepts a template kwarg. See https://docs.xarray.dev/en/stable/user-guide/dask.html#map-blocks for more details.
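
As a minimal sketch of when template is needed, consider a function that selects the second time step of each block. It cannot run on a 0-shaped input, so the output structure has to be supplied by hand. The array da and the function second_step are hypothetical, and the data is synthetic.

```python
import numpy as np
import xarray as xr

# Synthetic array with three blocks of 10 time steps each
da = xr.DataArray(
    np.arange(30 * 4, dtype="float64").reshape(30, 4),
    dims=("time", "lat"),
    coords={"time": np.arange(30), "lat": [10.0, 20.0, 30.0, 40.0]},
    name="air",
).chunk({"time": 10})


def second_step(block):
    # Indexing position 1 raises on a zero-length "time" axis,
    # so the output cannot be inferred from a 0-shaped input
    return block.isel(time=[1])


# The template describes the expected output: one time step per input block.
# It stays dask-backed, with one chunk along "time" for each input chunk.
template = da.isel(time=[1, 11, 21])

result = da.map_blocks(second_step, template=template)
```

The template only has to carry the right dimensions, coordinates, dtype, and chunk structure; its values are not used for the result.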

client.close()