A Column is a sequential data structure (analogous to a Series in Pandas or a Vector in R). Meerkat supports a diverse set of column types (e.g., ImageColumn), each intended for a different kind of data.
Below we create a simple column to hold a set of images stored on disk. To create it, we pass the filepaths to mk.image, as the code shows:
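    import meerkat as mk

    # abs_path_to_img_dir is the absolute path to the image directory, defined elsewhere
    img_col = mk.image(
        ["img_0.jpg", "img_1.jpg", "img_2.jpg"],
        base_dir=abs_path_to_img_dir,
    )
    img_col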
All Meerkat columns are subclasses of Column and share a common interface, which includes operations such as concat. Below we get the length of the column we just created.
    id_col = mk.TensorColumn([0, 1, 2])
    id_col.sum()
    id_col == 1
If you don’t know which column type to use, you can pass a familiar data structure, such as a PyTorch tensor, to Column.from_data, and Meerkat will automatically pick an appropriate column type.
    import torch

    tensor = torch.tensor([1, 2, 3])
    mk.Column.from_data(tensor)
There are four core column types in Meerkat, each with a different interface.
ScalarColumn: Each row stores a single numeric or string value. These columns have an interface similar to a Pandas Series.
TensorColumn: Each row stores an identically shaped multi-dimensional array (e.g., a vector, matrix, or tensor). These columns have an interface similar to a NumPy ndarray.
ObjectColumn: Each row stores an arbitrary Python object. These columns should be used sparingly, as they are significantly slower than the columns above. However, they may be useful in small DataFrames.
DeferredColumn: Represents a deferred map operation. A DeferredColumn maintains a single function and a pointer to another column. Each row represents (but does not actually store) the value returned from applying the function to the corresponding row of the other column.
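The deferred-map idea behind DeferredColumn can be illustrated without Meerkat at all. The sketch below is a plain-Python toy, not Meerkat's implementation: the class name ToyDeferredColumn, the load function, and the calls list are all hypothetical names introduced for illustration. It shows a column that holds only a function and a reference to a source column, and computes a row only when that row is requested.

```python
from typing import Any, Callable, List, Sequence


class ToyDeferredColumn:
    """Toy stand-in (hypothetical): one function plus a pointer to a source column."""

    def __init__(self, source: Sequence[Any], fn: Callable[[Any], Any]) -> None:
        self.source = source  # reference to the other column; nothing is copied
        self.fn = fn          # applied only when a row is requested

    def __len__(self) -> int:
        return len(self.source)

    def __getitem__(self, index: int) -> Any:
        # Compute the row on demand instead of storing the result.
        return self.fn(self.source[index])


calls: List[str] = []  # records which rows were actually computed


def load(path: str) -> str:
    """Placeholder for an expensive step such as decoding an image from disk."""
    calls.append(path)
    return path.upper()


paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]
deferred = ToyDeferredColumn(paths, load)

first = deferred[0]  # only this row triggers a call to load
```

After this runs, calls contains a single entry even though the column has three rows, which is the property that makes a deferred column cheap to create over large collections of files.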
Flexibility in Implementation
Meerkat columns are simple wrappers around well-optimized data structures from other libraries. These libraries (e.g. NumPy) run compiled machine code that is significantly faster than routines written in Python.
The data structure underlying a column is available through the .data attribute of the column. For example, the following code creates a TensorColumn and then accesses the underlying NumPy array.
Meerkat is unopinionated when it comes to the choice of data structure underlying columns. This gives users the flexibility to choose the best data structure for their use case. For example, a TensorColumn can be backed by either a NumPy array or a PyTorch tensor.
Each ScalarColumn object in Meerkat is actually an instance of one of its subclasses (for example, ArrowScalarColumn). These subclasses are responsible for implementing the ScalarColumn interface for a particular choice of data structure. Similarly, each TensorColumn object is an instance of one of its subclasses.
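As a rough illustration of this subclass-per-data-structure pattern, here is a plain-Python sketch that uses only the standard library. It is not Meerkat's code: ToyScalarColumn, ListScalarColumn, ArrayScalarColumn, make_column, and the backend argument are hypothetical names chosen for this example. Each subclass implements the same small interface for one underlying data structure, the data stays reachable through a data attribute, and a factory picks the subclass from the input type unless a backend is named explicitly.

```python
from abc import ABC, abstractmethod
from array import array


class ToyScalarColumn(ABC):
    """Hypothetical shared interface; each subclass handles one data structure."""

    def __init__(self, data):
        self.data = data  # the underlying data structure, exposed as-is

    def __len__(self):
        return len(self.data)

    @abstractmethod
    def to_list(self):
        """Return the values as a new Python list."""


class ListScalarColumn(ToyScalarColumn):
    def to_list(self):
        return list(self.data)  # shallow copy of the backing list


class ArrayScalarColumn(ToyScalarColumn):
    def to_list(self):
        return self.data.tolist()  # array.array provides its own conversion


def make_column(data, backend=None):
    """Pick a subclass from the input type, unless a backend is given explicitly."""
    backends = {"list": ListScalarColumn, "array": ArrayScalarColumn}
    if backend is None:
        backend = "array" if isinstance(data, array) else "list"
    # Convert the input so that it matches the chosen backend.
    if backend == "array" and not isinstance(data, array):
        data = array("d", data)
    elif backend == "list" and not isinstance(data, list):
        data = list(data)
    return backends[backend](data)


inferred = make_column(array("d", [1.0, 2.0, 3.0]))  # subclass inferred from the input
explicit = make_column([1, 2, 3], backend="array")   # subclass requested explicitly
```

Callers only ever use make_column and the shared interface, so a new backing data structure can be supported by adding one subclass and one dictionary entry.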
How to pick a subclass? In general, users should not have to think about which subclass to use. Meerkat chooses a subclass based on the data structure of the input data. For example, the following code creates a
ScalarColumn backed by a Pandas Series:
You can also explicitly specify the subclass to use. For example, the following code creates a
ScalarColumn backed by an Arrow array: