Deferred Columns
Motivation. When working with multimodal datasets, the data in some columns may fit easily in memory, while the data in other columns are best kept on disk and loaded only when needed. For example, in an image dataset, the image labels and metadata are small and may fit in memory, while the images themselves are large and should stay on disk until they are needed.
In Meerkat, columns like ImageColumn and AudioColumn make it easy to work with complex data types that can’t fit in memory. If you check out the implementation of these classes, you’ll notice that they are straightforward subclasses of DeferredColumn.
What’s a DeferredColumn? A DeferredColumn wraps around another column and represents what you would get if you applied a function to its content. You can think of it as a deferred map operation.
Consider the following example, where we create a simple Meerkat column…
In [1]: import meerkat as mk
In [2]: col = mk.column(list(range(10)))
…and create a deferred column, dcol, based on it:
In [3]: dcol = col.defer(function=lambda x: x + 10)
In [4]: dcol
Out[4]: column([DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), DeferredCell(fn=<lambda>), ...], backend=DeferredColumn
Like other columns, deferred columns can be subselected.
In [5]: small_dcol = dcol[:5]
Unlike other columns, deferred columns are callable. When we call a deferred column, we apply the function to the underlying column.
In [6]: small_dcol()
Out[6]: column([10, 11, 12, 13, 14], backend=PandasScalarColumn
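The same behavior can be mimicked in plain Python. The sketch below is only an illustration of the idea, not Meerkat's implementation: LazyMap, add_ten, and calls are hypothetical names introduced here. It records every input the function sees, which shows that nothing is computed when the object is created or sliced.

```python
from typing import Callable, List, Sequence


class LazyMap:
    """Toy stand-in for a deferred column (hypothetical, not part of Meerkat)."""

    def __init__(self, data: Sequence, function: Callable):
        self.data = data
        self.function = function

    def __getitem__(self, index: slice) -> "LazyMap":
        # Selecting only slices the underlying data; the function is not called.
        return LazyMap(self.data[index], self.function)

    def __call__(self) -> List:
        # Calling materializes the result by applying the function to each item.
        return [self.function(x) for x in self.data]


calls = []  # records every input the function actually sees


def add_ten(x: int) -> int:
    calls.append(x)
    return x + 10


lazy = LazyMap(list(range(10)), add_ten)
small = lazy[:5]
calls_before = len(calls)  # still 0: creation and slicing computed nothing
result = small()           # the function runs only now, on the 5 selected items
```

Because the slice is taken before the call, the function runs on five items rather than ten.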
Critically, the function inside a deferred column is called neither on creation nor on selection, but only later, once the column is called! This is very useful for columns with large data types that we don’t want to load into memory all at once. For example, we could create a DeferredColumn that lazily loads images…
In [7]: from PIL import Image
In [8]: df = mk.DataFrame(
   ...:     {
   ...:         "filepath": ["/abs/path/to/image0.jpg", ...],
   ...:         "image_id": ["image0", ...]
   ...:     }
   ...: )
   ...:
In [9]: df["image"] = df["filepath"].defer(function=Image.open)
Notice how we provide an absolute path to the images. This makes the column usable from any working directory.
However, using absolute paths is in other ways not ideal: what if we want to share the DataFrame and open it on a different machine? In the section below, we discuss a subclass of DeferredColumn that makes it easy to manage filepaths.
FileColumn
As mentioned above, FileColumn is a simple subclass of DeferredColumn.
The FileColumn constructor takes an additional argument, base_dir, which is the base directory that all file paths are relative to.
When base_dir is provided, the paths passed to filepaths should be relative to base_dir:
In [10]: from PIL import Image
In [11]: df = mk.DataFrame(
   ....:     {
   ....:         "filepath": ["image0.jpg", ...],
   ....:         "image_id": ["image0", ...]
   ....:     }
   ....: )
   ....:
In [12]: df["image"] = mk.FileColumn.from_filepaths(
   ....:     filepaths=df["filepath"],
   ....:     loader=Image.open,
   ....:     base_dir="/abs/path/to",
   ....: )
   ....:
The base_dir can then be changed at any time, so if we wanted to share the DataFrame with another user, we could instruct them to reset the base_dir using df["image"].base_dir = "/other/users/abs/path/to". Introducing this additional step isn’t ideal though, so we recommend using the environment variables technique described below.
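Changing base_dir after the fact works because the full path only needs to be known when a file is actually loaded. The following plain-Python sketch illustrates that idea under stated assumptions: LazyFiles is a hypothetical class, not Meerkat's FileColumn, and an identity function stands in for a real loader such as Image.open so that no files are needed.

```python
import posixpath
from typing import Callable, List, Sequence


class LazyFiles:
    """Toy stand-in for a file-backed deferred column (hypothetical, not Meerkat's class)."""

    def __init__(self, filepaths: Sequence[str], loader: Callable, base_dir: str = ""):
        self.filepaths = list(filepaths)
        self.loader = loader
        self.base_dir = base_dir

    def __call__(self) -> List:
        # Paths are joined with base_dir at load time,
        # so a later change to base_dir takes effect on the next call.
        return [self.loader(posixpath.join(self.base_dir, p)) for p in self.filepaths]


# The identity function stands in for a real loader: it returns the resolved path.
files = LazyFiles(["image0.jpg"], loader=lambda path: path, base_dir="/abs/path/to")
before = files()

files.base_dir = "/other/users/abs/path/to"
after = files()
```

Here before holds the path under the first base directory and after holds the path under the second, even though the stored relative paths never changed.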
Using Environment Variables in base_dir
Environment variables in the base_dir argument are automatically expanded. For example, if you set the environment variable MEERKAT_BASE_DIR to "/abs/path/to", then you can use df["image"].base_dir = "$MEERKAT_BASE_DIR", and the relative file paths will be resolved against "/abs/path/to". Each user only needs to set the variable on their own machine, which makes this ideal for sharing DataFrames between different users and machines.
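To make the expansion step concrete, here is a small sketch using Python's standard os.path.expandvars. This page does not show how Meerkat performs the expansion internally, so treat the use of expandvars as an assumption; resolve is a hypothetical helper written for this example.

```python
import os
import posixpath

# Each user sets this to wherever the data lives on their machine.
os.environ["MEERKAT_BASE_DIR"] = "/abs/path/to"


def resolve(base_dir: str, relative_path: str) -> str:
    """Expand environment variables in base_dir, then join the relative path."""
    return posixpath.join(os.path.expandvars(base_dir), relative_path)


resolved = resolve("$MEERKAT_BASE_DIR", "image0.jpg")
```

The stored base_dir string stays "$MEERKAT_BASE_DIR" on every machine; only the value of the environment variable differs.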
Note that the Meerkat dataset registry relies heavily on this technique, using a special environment variable MEERKAT_DATASET_DIR that points to the mk.config.datasets.root_dir.
An ImageColumn is just a FileColumn like this one, with a few more bells and whistles!