Column

A Column is a sequential data structure (analogous to a Series in Pandas or a Vector in R). Meerkat supports a diverse set of column types (e.g., TensorColumn, ImageColumn), each intended for a different kind of data.

Below we create a simple column that holds a set of images stored on disk. To create it, we pass the file paths, along with the directory that contains them, to mk.image.

import meerkat as mk

# abs_path_to_img_dir: absolute path to the directory containing the images
img_col = mk.image(
    ["img_0.jpg", "img_1.jpg", "img_2.jpg"],
    base_dir=abs_path_to_img_dir,
)
img_col
(FileColumn)
0
1
2

All Meerkat columns are subclasses of Column and share a common interface, which includes __len__, __getitem__, __setitem__, filter, map, and concat. Below we get the length of the column we just created.

len(img_col)
3
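
To make the shared interface concrete, here is a minimal sketch of a list-backed column written in plain Python. ToyColumn and every method signature in it are hypothetical and exist only for illustration; Meerkat's own classes are more elaborate, and the exact signatures of its filter, map, and concat may differ.

```python
from typing import Any, Callable, Iterable, List


class ToyColumn:
    """Hypothetical list-backed column; not part of Meerkat."""

    def __init__(self, data: Iterable[Any]):
        self.data: List[Any] = list(data)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index):
        # An integer returns a single row; a slice returns a new column.
        if isinstance(index, slice):
            return ToyColumn(self.data[index])
        return self.data[index]

    def __setitem__(self, index: int, value: Any) -> None:
        self.data[index] = value

    def filter(self, predicate: Callable[[Any], bool]) -> "ToyColumn":
        # Keep only the rows for which the predicate is true.
        return ToyColumn(row for row in self.data if predicate(row))

    def map(self, function: Callable[[Any], Any]) -> "ToyColumn":
        # Apply the function to every row and collect the results.
        return ToyColumn(function(row) for row in self.data)

    @staticmethod
    def concat(columns: Iterable["ToyColumn"]) -> "ToyColumn":
        # Stack the rows of several columns end to end.
        return ToyColumn(row for column in columns for row in column.data)


col = ToyColumn([0, 1, 2])
col[0] = 10                                # __setitem__
doubled = col.map(lambda x: x * 2)         # rows: 20, 2, 4
large = doubled.filter(lambda x: x > 2)    # rows: 20, 4
combined = ToyColumn.concat([col, large])  # rows: 10, 1, 2, 20, 4
```

Because every operation returns another ToyColumn, calls can be chained, which is the practical benefit of having all column types share one interface.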

Certain column types may expose additional functionality. For example, TensorColumn inherits most of the functionality of a NumPy ndarray.

id_col = mk.TensorColumn([0, 1, 2])
id_col.sum()
id_col == 1
(NumPyTensorColumn)
0 False
1 True
2 False

If you don’t know which column type to use, you can pass a familiar data structure such as a list, np.ndarray, pd.Series, or torch.Tensor to Column.from_data, and Meerkat will automatically pick an appropriate column type.

import torch
tensor = torch.tensor([1,2,3])
mk.Column.from_data(tensor)
(TorchTensorColumn)
0 tensor(1)
1 tensor(2)
2 tensor(3)

Column Types

There are four core column types in Meerkat, each with a different interface.

  1. ScalarColumn: Each row stores a single numeric or string value. These columns have an interface similar to a Pandas Series.

  2. TensorColumn: Each row stores an identically shaped multi-dimensional array (e.g. vector, matrix, or tensor). These columns have an interface similar to a NumPy ndarray.

  3. ObjectColumn: Each row stores an arbitrary Python object. These columns should be used sparingly, as they are significantly slower than the columns above. However, they may be useful in small DataFrames.

  4. DeferredColumn: Represents a deferred map operation. A DeferredColumn maintains a single function and a pointer to another column. Each row represents (but does not actually store) the value returned from applying the function to the corresponding row of the other column.
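
The deferred behavior described in the last item can be sketched in a few lines of plain Python. ToyDeferredColumn and fake_load are hypothetical names invented for this illustration and are not Meerkat's implementation; the sketch only shows the idea of holding a function and a reference to another column, and computing a row when it is requested.

```python
from typing import Any, Callable, List, Sequence


class ToyDeferredColumn:
    """Hypothetical deferred column: one function plus a reference to a source column."""

    def __init__(self, source: Sequence[Any], function: Callable[[Any], Any]):
        self.source = source      # pointer to the other column; rows are not copied
        self.function = function  # applied only when a row is requested

    def __len__(self) -> int:
        return len(self.source)

    def __getitem__(self, index: int) -> Any:
        # The value is computed on access and never stored.
        return self.function(self.source[index])


calls: List[str] = []  # records which paths were actually "loaded"


def fake_load(path: str) -> str:
    """Stand-in for an expensive operation such as decoding an image."""
    calls.append(path)
    return f"<pixels of {path}>"


paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]
images = ToyDeferredColumn(paths, fake_load)
calls_after_creation = list(calls)  # empty: creating the column loads nothing
second = images[1]                  # only now is fake_load called, once
```

Creating the column costs almost nothing, and only the rows that are indexed are ever computed. The trade-off in this sketch is that indexing the same row twice runs the function twice, since nothing is cached.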

Flexibility in Implementation

Meerkat columns are simple wrappers around well-optimized data structures from other libraries. These libraries (e.g. NumPy) run compiled machine code that is significantly faster than routines written in Python.

The data structure underlying a column is available through the .data attribute of the column. For example, the following code creates a TensorColumn and then accesses the underlying NumPy array.

Meerkat is unopinionated about the choice of data structure underlying a column. This gives users the flexibility to choose the best data structure for their use case. For example, a TensorColumn can be backed by either a NumPy array or a PyTorch tensor.

Each ScalarColumn object in Meerkat is actually an instance of one of its subclasses (PandasScalarColumn, ArrowScalarColumn). These subclasses are responsible for implementing the ScalarColumn interface for a particular choice of data structure. Similarly, each TensorColumn object is an instance of one of its subclasses (NumPyTensorColumn, TorchTensorColumn).
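
The arrangement of one interface with several backend-specific subclasses can be pictured with a toy example that uses only the Python standard library. Everything here is hypothetical: ToyScalarColumn, ListBackedColumn, ArrayBackedColumn, and toy_column are invented names, and a plain list and an array.array stand in for the Pandas and Arrow structures that Meerkat actually wraps.

```python
from abc import ABC, abstractmethod
from array import array
from typing import Any, List


class ToyScalarColumn(ABC):
    """Hypothetical shared interface; each subclass wraps one data structure."""

    def __init__(self, data: Any):
        self.data = data  # the underlying structure, exposed unchanged

    def __len__(self) -> int:
        return len(self.data)

    @abstractmethod
    def to_list(self) -> List[Any]:
        """Return the rows as a plain Python list."""


class ListBackedColumn(ToyScalarColumn):
    def to_list(self) -> List[Any]:
        return list(self.data)


class ArrayBackedColumn(ToyScalarColumn):
    def to_list(self) -> List[Any]:
        # array.array provides its own conversion method.
        return self.data.tolist()


def toy_column(data: Any) -> ToyScalarColumn:
    """Pick a subclass from the type of the input data."""
    if isinstance(data, array):
        return ArrayBackedColumn(data)
    if isinstance(data, list):
        return ListBackedColumn(data)
    raise TypeError(f"unsupported input type: {type(data).__name__}")


raw = [0, 1, 2]
auto = toy_column(raw)                         # subclass chosen from the input type
explicit = ArrayBackedColumn(array("d", raw))  # subclass chosen by the caller
```

Code that relies only on the shared interface (here, len and to_list) works with either subclass, while the wrapped structure itself stays reachable through the data attribute.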

How to pick a subclass? In general, users should not have to think about which subclass to use. Meerkat chooses a subclass based on the data structure of the input data. For example, the following code creates a ScalarColumn backed by a Pandas Series:

You can also explicitly specify the subclass to use. For example, the following code creates a ScalarColumn backed by an Arrow array: