Multidimensional Arrays: Reference

Key Points

datasets for the xarray tutorial
  • refer to this page for access to the tutorial data

Introduction to multidimensional arrays
  • unlabelled, N-dimensional arrays of numbers (e.g. NumPy’s ndarray) are the most widely used data structure in scientific computing

  • these arrays lack meaningful metadata, so users must remember by convention alone what each axis and index represents

  • in-memory operations, which are needed to process and visualize arrays, are reaching their limits as datasets grow in size
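
The bookkeeping burden described in the points above can be sketched with plain NumPy. The array shape and the meaning assigned to each axis (time, lat, lon) are assumptions made up for this illustration; nothing in the array itself records them.

```python
import numpy as np

# Hypothetical data: 3 time steps on a 2 x 4 (lat, lon) grid.
temps = np.arange(24, dtype=float).reshape(3, 2, 4)

# The array does not record that axis 0 is time, so the user has to
# remember the axis order whenever aggregating or indexing.
time_mean = temps.mean(axis=0)   # average over time, shape (2, 4)
first_step = temps[0, :, :]      # first time step, shape (2, 4)
```

If the axis order were ever changed (say, to lat, lon, time), both lines would still run but would silently compute something else.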

xarray architecture
  • xarray is built on the netCDF data model

  • xarray has two main data structures: DataArray and Dataset

  • DataArrays store the multi-dimensional arrays

  • Datasets are the multi-dimensional equivalent of a Pandas DataFrame
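
As a minimal sketch of the two data structures named above, the following builds a DataArray and then groups two DataArrays into a Dataset. The variable names, coordinate values, and units are invented for illustration and are not taken from the tutorial data.

```python
import numpy as np
import xarray as xr

# A DataArray: an N-dimensional array plus dimension names,
# coordinate labels, and free-form attributes.
temperature = xr.DataArray(
    np.arange(24, dtype=float).reshape(3, 2, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": [0, 1, 2],
        "lat": [10.0, 20.0],
        "lon": [100.0, 110.0, 120.0, 130.0],
    },
    name="temperature",
    attrs={"units": "degC"},   # hypothetical metadata
)

# A second variable on the same grid (made-up constant values).
pressure = xr.DataArray(
    np.full((3, 2, 4), 1013.0),
    dims=("time", "lat", "lon"),
    coords=temperature.coords,
)

# A Dataset: a dict-like container of DataArrays sharing dimensions.
ds = xr.Dataset({"temperature": temperature, "pressure": pressure})
```

Individual variables come back out as DataArrays, for example `ds["temperature"]`.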

label-based indexing
  • xarray’s labeled dimensions free the user from having to track the positional ordering of dimensions when accessing data, which simplifies the workflow

  • xarray has plotting functionality that is a thin wrapper around the Matplotlib library

  • xarray uses syntax and function names from Matplotlib whenever possible
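
Label-based indexing, as described in the first point above, might look like the following sketch. The array and its coordinate values are hypothetical; `sel` selects by coordinate label and `isel` by integer position.

```python
import numpy as np
import xarray as xr

# Hypothetical array on a small (time, lat, lon) grid.
da = xr.DataArray(
    np.arange(24, dtype=float).reshape(3, 2, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": [0, 1, 2],
        "lat": [10.0, 20.0],
        "lon": [100.0, 110.0, 120.0, 130.0],
    },
)

# Select by coordinate label; the order of the keywords is irrelevant.
by_label = da.sel(lon=110.0, lat=20.0)

# The equivalent positional selection, still addressed by dimension name.
by_position = da.isel(lat=1, lon=1)

# Pick the grid point whose longitude label is closest to 112.0.
nearest = da.sel(lon=112.0, method="nearest")
```

Neither call requires knowing that `lat` is the second axis and `lon` the third.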

arithmetic and aggregation
  • xarray’s labeled dimensions simplify arithmetic and data aggregation: operations refer to dimensions by name rather than by axis number
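
A small sketch of name-based aggregation and arithmetic follows. The data are invented; the point is that reductions take a dimension name and that arithmetic between arrays of different shapes aligns on dimension names.

```python
import numpy as np
import xarray as xr

# Hypothetical array on a small (time, lat, lon) grid.
da = xr.DataArray(
    np.arange(24, dtype=float).reshape(3, 2, 4),
    dims=("time", "lat", "lon"),
)

# Aggregate by naming the dimension(s) instead of passing axis numbers.
time_mean = da.mean(dim="time")              # dims: (lat, lon)
spatial_mean = da.mean(dim=("lat", "lon"))   # dims: (time,)

# Arithmetic broadcasts by dimension name: the (lat, lon) mean is
# subtracted from every time step without any manual reshaping.
anomaly = da - time_mean
```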

groupby processing
  • xarray provides Pandas-like methods for performing data aggregation over defined groupings in the data
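
The groupby pattern mentioned above can be sketched as follows. The `season` labels and the values are made up for illustration; the grouping variable is an ordinary coordinate attached to the `time` dimension.

```python
import xarray as xr

# Hypothetical 1-D series with a label attached to each time step.
da = xr.DataArray(
    [1.0, 3.0, 5.0, 7.0, 9.0, 11.0],
    dims="time",
    coords={
        "time": [0, 1, 2, 3, 4, 5],
        "season": ("time", ["DJF", "DJF", "MAM", "MAM", "JJA", "JJA"]),
    },
)

# Split by label, reduce each group over time, and recombine.
season_means = da.groupby("season").mean(dim="time")
```

The result has one value per distinct label, indexed by a new `season` dimension, so `season_means.sel(season="MAM")` retrieves a single group's mean.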

out-of-core computation
  • Dask integration with xarray allows you to work with large datasets that only need to “fit on disk” rather than “fit in memory”.

  • It is important to chunk the data correctly for this to work.

  • xarray provides tools for creating and analyzing masked data.

  • A summary of everything so far
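
As one possible illustration of the masked-data point above, `where` keeps values that satisfy a condition and replaces the rest with NaN, which reductions skip by default. The data and the threshold are assumptions for this sketch; the Dask chunking step is not shown here.

```python
import xarray as xr

# Hypothetical 2 x 2 field with some negative values.
da = xr.DataArray([[-1.0, 2.0], [3.0, -4.0]], dims=("y", "x"))

# Keep only positive values; everything else becomes NaN.
masked = da.where(da > 0)

# Reductions ignore NaN by default, so only unmasked cells contribute.
masked_mean = masked.mean()
valid_cells = masked.count()

# Alternatively, fill masked cells with a chosen value instead of NaN.
filled = da.where(da > 0, other=0.0)
```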
