HDF5 dataspaces and selections

While datatype declare what is stored the dataspace defines how it is stored. At the current state of the HDF5 library there are only two storage schemes available

  • scalar which means that only a single value of a particular datatype is stored

  • simple which is a simple n-dimensional regular array of data elements of a given type.

Simple and scalar dataspaces are represented by the Scalar and Simple classes in the pninexus.h5cpp.dataspace package. Dataspaces are required when creating attributes and datasets (we will discuss this in more detail in HDF5 attributes and HDF5 datasets).

Scalar dataspaces are simple to create

from pninexus.h5cpp.dataspace import Scalar

dataspace = Scalar()

as its constructor does not require any additional arguments. For simple dataspaces a bit more effort has to be made. There are three different configurations for a simple dataspace

  • a fixed size dataspace whose size (and thus the size of the dataset constructed with it) cannot be changed once it has been created

  • an extensible dataspace of finite size

  • and an extensible dataspace of infinite size.

The first case is fairly simple

from pninexus.h5cpp.dataspace import Simple

dataspace = Simple((12,3))

which will create a 2-dimensional dataspace with 12 elemnts along the first and 3 elements along the second dimension. To cover the second sitatuation we have to pass a second argument with the maximum number of elements along each dimension

dataspace = Simple((12,3),(24,6))

The newly created dataspace has 12 elements along the first and 3 along the second dimension. However, it would be possible to extend it up to 24 and 6 dimensions along the two dimensions (we will see later in HDF5 datasets how this is done practically).

The last case finally where we can extend a dataset indefinitely along one or more of its dimensions requires a special constant UNLIMITED from the dataspace package

from pninexus.h5cpp.dataspace import Simple, UNLIMITED

dataspace = Simple((0,10),(UNLIMITED,10))

This code snippet shows a typical use case for such a dataspace. We start with no elements along the first dimension but with the option to extend the dataspace indefinitely along this first dimension. As a result we are able to extend a dataset using such a dataspace indefinitely along this dimension and thus append data as it drops in. This has the nice advantage that we do not have to know the number of recorded data points in advance.

Selections

A topic closely related to dataspaces are selections which is related to HDF5s partial IO feature. In many cases the data stored to disk is far larger than the memory available on the machine the data should be analyzed. Or, not the entire data is required but only a relatively small part of it. HDF5 allows to apply selections on a dataspace and subsequently read only the selected data from a dataset.

Note

Another missconception about HDF5 is that a selection is applied to a dataset. This is wrong. The selection is applied to a dataspace which is then used to describe the data stored on disk.

HDF5 supports two kinds of selections

  • point selections where an arbitrary set of data elements can be selected

  • and hyperslab selections where a regular pattern of data elements is selected.

Currently only the hyperslab selection is implemented in h5cpp and thus in this Python wrapper. The class in charge is Hyperslab.

A Hyperslab has 4 parameters

  • an offset determining the start of the selection in the dataspace

  • a block array which determines how many elements are selected in each individual block

  • a stride giving the distance between the blocks

  • and a count value which determines how many blocks to read.