# Xarray's data structures

In this lesson, we cover the basics of Xarray data structures. Our
learning goals are as follows. By the end of the lesson, we will be able to:

- Understand the basic data structures (`DataArray` and `Dataset` objects) in Xarray

---

## Introduction

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)
are an essential part of computational science. They are encountered in a wide
range of fields, including physics, astronomy, geoscience, bioinformatics,
engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)
provides the fundamental data structure and API for working with raw ND arrays.
However, real-world datasets are usually more than just raw numbers; they have
labels which encode information about how the array values map to locations in
space, time, etc.

Here is an example of how we might structure a dataset for a weather forecast:

<img src="https://docs.xarray.dev/en/stable/_images/dataset-diagram.png" align="center" width="80%">

You'll notice multiple data variables (temperature, precipitation), coordinate
variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these
fit into Xarray's data structures below.

Xarray doesn’t just keep track of labels on arrays – it uses them to provide a
powerful and concise interface. For example:

- Apply operations over dimensions by name: `x.sum('time')`.

- Select values by label (or logical location) instead of integer location:
  `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.

- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions
  (array broadcasting) based on dimension names, not shape.

- Easily use the split-apply-combine paradigm with groupby:
  `x.groupby('time.dayofyear').mean()`.

- Database-like alignment based on coordinate labels that smoothly handles
  missing values: `x, y = xr.align(x, y, join='outer')`.

- Keep track of arbitrary metadata in the form of a Python dictionary:
  `x.attrs`.
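
To make the list above concrete, here is a minimal sketch on synthetic data. The array `x`, its `station` dimension, and all values are made up for illustration; only the Xarray calls themselves come from the list.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic data: 4 daily values at 3 hypothetical stations.
times = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "station"),
    coords={"time": times, "station": ["a", "b", "c"]},
    attrs={"units": "K"},
)

# Reduce over a dimension by name rather than by axis number.
total = x.sum("time")

# Select by label; both spellings pick the same data.
by_loc = x.loc["2014-01-01"]
by_sel = x.sel(time="2014-01-01")

# Split-apply-combine: group by day of year, then average each group.
daily_mean = x.groupby("time.dayofyear").mean()

# Outer alignment pads labels missing from one operand with NaN.
y = x.isel(time=slice(1, 3))
x_aligned, y_aligned = xr.align(x, y, join="outer")

# Arbitrary metadata travels with the array as a plain dictionary.
units = x.attrs["units"]
```

After the outer join, `y_aligned` covers all four time steps, with missing values where `y` had no data.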

The N-dimensional nature of Xarray’s data structures makes it suitable for
dealing with multi-dimensional scientific data, and its use of dimension names
instead of axis numbers (`dim='time'` instead of `axis=0`) makes such arrays much
more manageable than the raw NumPy `ndarray`: with Xarray, you don’t need to keep
track of the order of an array’s dimensions or insert dummy dimensions of size 1
to align arrays (e.g., using `np.newaxis`).
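
As a small illustration of that difference, the sketch below performs the same subtraction and reduction with Xarray and with plain NumPy. The arrays `field` and `offset_by_time` are hypothetical and exist only for this comparison.

```python
import numpy as np
import xarray as xr

# Illustrative data: a (time, lat) field and one offset per time step.
field = xr.DataArray(np.ones((3, 2)), dims=("time", "lat"))
offset_by_time = xr.DataArray([1.0, 2.0, 3.0], dims="time")

# Xarray matches operands by dimension name, so no reshaping is needed.
anomaly = field - offset_by_time

# The NumPy equivalent needs a dummy axis so the shapes line up.
anomaly_np = field.data - offset_by_time.data[:, np.newaxis]

# Reductions name the dimension instead of relying on its position.
mean_named = field.mean(dim="time")
mean_positional = field.data.mean(axis=0)
```

Both routes give the same numbers; the Xarray version simply does not depend on `time` being the first axis.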

The immediate payoff of using xarray is that you’ll write less code. The
long-term payoff is that you’ll understand what you were thinking when you come
back to look at it weeks or months later.


## Data structures

Xarray provides two data structures: the `DataArray` and `Dataset`. The
`DataArray` class attaches dimension names, coordinates and attributes to
multi-dimensional arrays while `Dataset` combines multiple arrays.

Both classes are most commonly created by reading data.
To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial.

Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.
We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name.

In [None]:
import numpy as np
import xarray as xr

### Dataset

`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.


In [None]:
ds = xr.tutorial.load_dataset("air_temperature")
ds

We can access "layers" of the Dataset (individual DataArrays) with dictionary syntax:

In [None]:
ds["air"]

We can save some typing by using the "attribute" or "dot" notation. This won't work for variable names that clash with built-in
method names (for example, `mean`).

In [None]:
ds.air
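
To see the dictionary-like behaviour and the name clash without downloading anything, here is a sketch using a tiny hand-built Dataset. The object `toy` and its variable names are hypothetical; one variable is deliberately called `mean` to show why dot notation fails for it.

```python
import numpy as np
import xarray as xr

# A small hypothetical Dataset; the variable names are made up for illustration.
toy = xr.Dataset(
    {
        "air": (("time", "lat"), np.zeros((2, 3))),
        "mean": (("lat",), np.ones(3)),  # name deliberately clashes with Dataset.mean
    }
)

names = list(toy.data_vars)           # variable names, like dictionary keys
has_air = "air" in toy                # membership test on variable names
same = toy.air.identical(toy["air"])  # dot access returns the same DataArray

# `toy.mean` resolves to the method, so this variable needs dictionary syntax.
clash_is_method = callable(toy.mean)
mean_var = toy["mean"]
```

Dictionary syntax always works, so it is the safer choice in scripts where variable names are not known in advance.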

#### What is all this anyway? (String representations)

Xarray has two representation types: `"html"` (which is only available in
notebooks) and `"text"`. To choose between them, use the `display_style` option.

So far, our notebook has automatically displayed the `"html"` representation (which we will continue using).
The `"html"` representation is interactive, allowing you to collapse sections (left arrows) and
view attributes and values for each variable (right hand sheet icon and data symbol).

In [None]:
with xr.set_options(display_style="html"):
    display(ds)

The output consists of:

- a summary of all *dimensions* of the `Dataset` `(lat: 25, time: 2920, lon: 53)`: this tells us that the first
  dimension is named `lat` and has a size of `25`, the second dimension is named
  `time` and has a size of `2920`, and the third dimension is named `lon` and has a size
  of `53`. Because we will access the dimensions by name, the order doesn't matter.
- an unordered list of *coordinates*, with one item per line. Each item has a
  name, one or more dimensions in parentheses, a dtype, and a preview of the
  values. Dimension coordinates are marked with a `*`.
- an alphabetically sorted list of *dimensions without coordinates* (if there are any)
- a list of *data variables*, shown in the same format as the coordinates
- an unordered list of *attributes*, or metadata

Compare that with the text representation, which is very similar except that dimension coordinates are marked with a `*` prefix instead of bold text, and you cannot collapse or expand the output.

In [None]:
with xr.set_options(display_style="text"):
    display(ds)

To understand each of the components better, we'll explore the "air" variable of our Dataset.

### DataArray

The `DataArray` class consists of an array (data) and its associated dimension names, labels, and attributes (metadata).


In [None]:
da = ds["air"]
da

#### String representations

We can use the same two representations (`"html"`, which is only available in
notebooks, and `"text"`) to display our `DataArray`.

In [None]:
with xr.set_options(display_style="html"):
    display(da)

In [None]:
with xr.set_options(display_style="text"):
    display(da)

In the string representation of a `DataArray` (versus a `Dataset`), we also see:
- the `DataArray` name ('air')
- a preview of the array data (collapsible in the `"html"` representation)

We can also access the underlying array directly:

In [None]:
ds.air.data  # (or equivalently, `da.data`)
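
As a quick check of what `.data` hands back, here is a sketch on a small in-memory array. The name `da_small` is hypothetical, and the array is assumed to be backed by NumPy.

```python
import numpy as np
import xarray as xr

# A tiny NumPy-backed DataArray built by hand for illustration.
da_small = xr.DataArray(np.arange(6).reshape(2, 3), dims=("y", "x"), name="demo")

raw = da_small.data                           # the wrapped array, without labels
is_numpy = isinstance(raw, np.ndarray)        # True for this NumPy-backed example
shape_matches = raw.shape == da_small.shape   # same shape, dimension names are gone
```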

#### Named dimensions 

`.dims` are the named axes of your data. They may (dimension coordinates) or may not (dimensions without coordinates) have associated values. Names can be any hashable object (anything that fits into a Python `set`, i.e. calling `hash()` on it doesn't raise an error), but to be
useful they should be strings.

In this case we have 2 spatial dimensions (`latitude` and `longitude` are stored with shorthand names `lat` and `lon`) and one temporal dimension (`time`).

In [None]:
ds.air.dims
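
The distinction between dimensions with and without coordinates can be sketched on a small hand-built array. The names `da_small`, `time`, and `station` below are illustrative assumptions, not part of the tutorial dataset.

```python
import numpy as np
import xarray as xr

# "time" gets coordinate values; "station" is a dimension without coordinates.
da_small = xr.DataArray(
    np.zeros((2, 3)),
    dims=("time", "station"),
    coords={"time": [10, 20]},
)

dim_names = da_small.dims                           # ('time', 'station')
dim_sizes = dict(da_small.sizes)                    # {'time': 2, 'station': 3}
labelled = list(da_small.coords)                    # ['time']
axis_of_station = da_small.get_axis_num("station")  # 1
```

`get_axis_num` is useful when a named dimension has to be passed to a function that only understands positional axes.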

#### Coordinates


`.coords` is a simple [dict-like](https://docs.python.org/3/glossary.html#term-mapping) [data container](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#coordinates)
for mapping coordinate names to values. These values can be:
- another `DataArray` object
- a tuple of the form `(dims, data, attrs)` where `attrs` is optional. This is
  roughly equivalent to creating a new `DataArray` object with
  `DataArray(dims=dims, data=data, attrs=attrs)`
- a 1-dimensional `numpy` array (or anything that can be coerced to one using [`numpy.array`](https://numpy.org/doc/stable/reference/generated/numpy.array.html), such as a `list`) containing numbers, datetime objects, strings, etc. to label each point.

Here we see the actual timestamps and spatial positions of our air temperature data:


In [None]:
ds.air.coords
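
The three ways of specifying coordinate values listed above can be sketched as follows. Everything here (the array `da_small`, the coordinate values, the `units` attributes) is made up for illustration.

```python
import numpy as np
import xarray as xr

# Form 1: another DataArray, carrying its own attributes.
lat_values = xr.DataArray([10.0, 20.0], dims="lat", attrs={"units": "degrees_north"})

da_small = xr.DataArray(
    np.zeros((2, 2, 3)),
    dims=("time", "lat", "lon"),
    coords={
        "lat": lat_values,
        # Form 2: a (dims, data, attrs) tuple.
        "lon": ("lon", [100.0, 110.0, 120.0], {"units": "degrees_east"}),
        # Form 3: a plain list (or 1-D array) named after its dimension.
        "time": [0, 1],
    },
)

coord_names = sorted(da_small.coords)  # ['lat', 'lon', 'time']
```

Whichever form is used, the values end up as coordinate variables reachable through `da_small.coords`.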

The difference between dimension coordinates (the dimension labels) and normal
coordinates is that, for now, indexing operations (`sel`, `reindex`, etc.) are
only possible with dimension coordinates. Also, while coordinates can
have arbitrary dimensions, dimension coordinates have to be one-dimensional.
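
A minimal sketch of that distinction, using a hypothetical array with two dimension coordinates and one two-dimensional non-dimension coordinate (`lon2d` is an invented name):

```python
import numpy as np
import xarray as xr

da_small = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("y", "x"),
    coords={
        "y": [0, 1],        # dimension coordinate: 1-D, named after its dimension
        "x": [10, 20, 30],  # dimension coordinate
        # Non-dimension coordinate: may span several dimensions.
        "lon2d": (("y", "x"), np.arange(6.0).reshape(2, 3) + 100),
    },
)

row = da_small.sel(y=1)                 # label-based selection on a dimension coordinate
indexed = sorted(da_small.indexes)      # ['x', 'y']
lon2d_ndim = da_small["lon2d"].ndim     # 2
```

Only `x` and `y` show up in `da_small.indexes`, which is why `sel` works on them out of the box, while `lon2d` simply rides along with the data.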

#### Attributes 

`.attrs` is a dictionary that can contain arbitrary Python objects (strings, lists, integers, dictionaries, etc.) containing information about your data. Your only
limitation is that some attributes may not be writeable to certain file formats.

In [None]:
ds.air.attrs
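
Because `.attrs` behaves like an ordinary dictionary, it can be read and updated in place. The keys and values in this sketch are invented for illustration.

```python
import numpy as np
import xarray as xr

da_small = xr.DataArray(np.zeros(3), dims="x", attrs={"units": "K"})

# Add metadata of different Python types.
da_small.attrs["long_name"] = "toy temperature"
da_small.attrs["valid_range"] = [150.0, 350.0]

attr_keys = list(da_small.attrs)  # ['units', 'long_name', 'valid_range']
```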

## To Pandas and back

`DataArray` and `Dataset` objects are frequently created by converting from
other libraries such as [pandas](https://pandas.pydata.org/) or by reading from
data storage formats such as
[NetCDF](https://www.unidata.ucar.edu/software/netcdf/) or
[zarr](https://zarr.readthedocs.io/en/stable/).

To convert from / to `pandas`, we can use the
<code>[to_xarray](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_xarray.html)</code>
methods on [pandas](https://pandas.pydata.org/) objects or the
<code>[to_pandas](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.to_pandas.html)</code>
methods on `xarray` objects:


In [None]:
import pandas as pd

In [None]:
series = pd.Series(np.ones((10,)), index=list("abcdefghij"))
series

In [None]:
arr = series.to_xarray()
arr

In [None]:
arr.to_pandas()
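
The type returned by `to_pandas` depends on the number of dimensions. A short sketch on a hypothetical two-dimensional array (`da2` and its labels are made up):

```python
import numpy as np
import pandas as pd
import xarray as xr

da2 = xr.DataArray(
    np.zeros((2, 3)),
    dims=("row", "col"),
    coords={"row": ["r0", "r1"], "col": list("abc")},
)

as_series = da2.isel(row=0).to_pandas()  # 1-D array -> pandas.Series
as_frame = da2.to_pandas()               # 2-D array -> pandas.DataFrame
```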

We can also control what `pandas` object is used by calling `to_series` /
`to_dataframe`:


**<code>to_series</code>**: This will always convert `DataArray` objects to
`pandas.Series`, using a `MultiIndex` for arrays with more than one dimension.


In [None]:
ds.air.to_series()
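
The air temperature example produces a very long Series, so here is a sketch on a small invented array where the `MultiIndex` is easy to inspect. The names `da2`, `row`, and `col` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

da2 = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("row", "col"),
    coords={"row": [0, 1], "col": [10, 20, 30]},
    name="demo",
)

s = da2.to_series()                 # one entry per (row, col) pair
level_names = list(s.index.names)   # ['row', 'col']

# Converting back recovers the two dimensions from the index levels.
back = s.to_xarray()
```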

**<code>to_dataframe</code>**: This will always convert `DataArray` or `Dataset`
objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named
for this.


In [None]:
ds.air.to_dataframe()
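
For an unnamed `DataArray`, a name has to be supplied before the conversion. The sketch below shows two ways to do that; the array and the column name `value` are hypothetical.

```python
import numpy as np
import xarray as xr

unnamed = xr.DataArray(np.arange(3), dims="x")

# Option 1: pass the column name directly.
frame = unnamed.to_dataframe(name="value")

# Option 2: name the array first, then convert.
named = unnamed.rename("value").to_dataframe()
```

Both calls produce a single-column `DataFrame` indexed by `x`.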

Since all columns in a `DataFrame` need to share the same index, variables with
fewer dimensions are broadcast across the full index.


In [None]:
ds.to_dataframe()
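
Broadcasting is easier to see when the variables have different dimensions. In this sketch the Dataset `toy` and its variables `temp` and `elevation` are invented for illustration.

```python
import numpy as np
import xarray as xr

toy = xr.Dataset(
    {
        "temp": (("time", "lat"), np.arange(6.0).reshape(2, 3)),
        "elevation": (("lat",), [5.0, 50.0, 500.0]),  # has no "time" dimension
    },
    coords={"time": [0, 1], "lat": [10, 20, 30]},
)

df = toy.to_dataframe()

# Each latitude keeps a single elevation, repeated for every time step.
unique_per_lat = df["elevation"].groupby(level="lat").nunique()
```

The resulting frame has one row per `(time, lat)` pair, so the three `elevation` values each appear twice.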