Column¶
A Column is a sequential data structure (analogous to a Series in Pandas or a Vector in R). Meerkat supports a diverse set of column types (e.g., TensorColumn, ImageColumn), each intended for a different kind of data.
Below we create a simple column to hold a set of images stored on disk. To create it, we pass the file paths to mk.image.
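import meerkat as mk

# abs_path_to_img_dir is assumed to hold the absolute path of the image directory.
img_col = mk.image(
    ["img_0.jpg", "img_1.jpg", "img_2.jpg"],
    base_dir=abs_path_to_img_dir,
)
img_col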
(FileColumn) | |
---|---|
0 | |
1 | |
2 | |
All Meerkat columns are subclasses of Column and share a common interface, which includes __len__, __getitem__, __setitem__, filter, map, and concat. Below we get the length of the column we just created.
len(img_col)
3
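To make the shared interface more concrete, here is a toy, list-backed sketch of what such an interface could look like. It is not Meerkat's implementation: ToyColumn and its method bodies are hypothetical, and the real filter, map, and concat may accept different arguments.

```python
from typing import Any, Callable, Iterable, List


class ToyColumn:
    """Hypothetical list-backed column; not a Meerkat class."""

    def __init__(self, data: Iterable[Any]):
        self.data: List[Any] = list(data)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index):
        # An integer selects one row; a slice selects a sub-column.
        if isinstance(index, slice):
            return ToyColumn(self.data[index])
        return self.data[index]

    def __setitem__(self, index: int, value: Any) -> None:
        self.data[index] = value

    def filter(self, predicate: Callable[[Any], bool]) -> "ToyColumn":
        # Keep only the rows for which the predicate is true.
        return ToyColumn(row for row in self.data if predicate(row))

    def map(self, function: Callable[[Any], Any]) -> "ToyColumn":
        # Apply the function to every row and collect the results.
        return ToyColumn(function(row) for row in self.data)

    @staticmethod
    def concat(columns: Iterable["ToyColumn"]) -> "ToyColumn":
        # Stack several columns end to end.
        return ToyColumn(row for column in columns for row in column.data)


paths = ToyColumn(["img_0.jpg", "img_1.jpg", "img_2.jpg"])
num_rows = len(paths)
first_row = paths[0]
stems = paths.map(lambda name: name.split(".")[0])
late = paths.filter(lambda name: name != "img_0.jpg")
combined = ToyColumn.concat([paths, late])
```

Because every operation returns another ToyColumn, calls can be chained regardless of what the rows contain, which is the main benefit of a common interface.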
Certain column types may expose additional functionality. For example, TensorColumn inherits most of the functionality of an ndarray.
id_col = mk.TensorColumn([0, 1, 2])
id_col.sum()
id_col == 1
(NumPyTensorColumn) | |
---|---|
0 | False |
1 | True |
2 | False |
If you don’t know which column type to use, you can pass a familiar data structure such as a list, np.ndarray, pd.Series, or torch.Tensor to Column.from_data, and Meerkat will automatically pick an appropriate column type.
import torch
tensor = torch.tensor([1,2,3])
mk.Column.from_data(tensor)
(TorchTensorColumn) | |
---|---|
0 | tensor(1) |
1 | tensor(2) |
2 | tensor(3) |
Column Types¶
There are four core column types in Meerkat, each with a different interface.
ScalarColumn
Each row stores a single numeric or string value. These columns have an interface similar to a Pandas Series.
TensorColumn
Each row stores an identically shaped multi-dimensional array (e.g., a vector, matrix, or tensor). These columns have an interface similar to a NumPy ndarray.
ObjectColumn
Each row stores an arbitrary Python object. These columns should be used sparingly, as they are significantly slower than the columns above. However, they may be useful in small DataFrames.
DeferredColumn
Represents a deferred map operation. A DeferredColumn maintains a single function and a pointer to another column. Each row represents (but does not actually store) the value returned from applying the function to the corresponding row of the other column.
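The deferred-map idea can be illustrated with a small standard-library sketch. This is a toy under stated assumptions, not Meerkat's DeferredColumn: the class ToyDeferredColumn and the loader fake_load are hypothetical names, and a plain list stands in for the other column.

```python
from typing import Any, Callable, List, Sequence


class ToyDeferredColumn:
    """Hypothetical lazy column; not Meerkat's DeferredColumn."""

    def __init__(self, source: Sequence[Any], function: Callable[[Any], Any]):
        self.source = source      # pointer to the other column
        self.function = function  # the single deferred function

    def __len__(self) -> int:
        return len(self.source)

    def __getitem__(self, index: int) -> Any:
        # The value is computed on access and is not stored.
        return self.function(self.source[index])


calls: List[str] = []


def fake_load(path: str) -> str:
    """Stand-in for an expensive loader such as reading an image."""
    calls.append(path)
    return f"<pixels of {path}>"


paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]
images = ToyDeferredColumn(paths, fake_load)

calls_before_access = len(calls)   # nothing has been loaded yet
second_image = images[1]           # loads only row 1
calls_after_access = list(calls)
```

Creating the column does no work. The function runs only for the rows that are actually accessed, which is what makes this pattern attractive for data such as images that are costly to load.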
Flexibility in Implementation
Meerkat columns are simple wrappers around well-optimized data structures from other libraries. These libraries (e.g. NumPy) run compiled machine code that is significantly faster than routines written in Python.
The data structure underlying a column is available through the .data attribute of the column. For example, the following code creates a TensorColumn and then accesses the underlying NumPy array.
Meerkat is unopinionated about the choice of data structure underlying a column. This gives users the flexibility to choose the best data structure for their use case. For example, a TensorColumn can be backed by either a NumPy array or a PyTorch tensor.
Each ScalarColumn object in Meerkat is actually an instance of one of its subclasses (PandasScalarColumn, ArrowScalarColumn). These subclasses are responsible for implementing the ScalarColumn interface for a particular choice of data structure. Similarly, each TensorColumn object is an instance of one of its subclasses (NumPyTensorColumn, TorchTensorColumn).
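One common way to get this behavior in Python is for the base class's __new__ to return an instance of a subclass. The sketch below shows that pattern using only the standard library. It is an illustration, not Meerkat's code: every class name, the _BACKENDS table, and the backend keyword are hypothetical, with a list and an array.array standing in for the real backing structures.

```python
from array import array
from typing import Any, Optional


class ToyScalarColumn:
    """Hypothetical base class; not Meerkat's ScalarColumn."""

    def __new__(cls, data: Any, backend: Optional[str] = None):
        target = cls
        # Dispatch only when the base class itself is called.
        if cls is ToyScalarColumn:
            if backend is not None:
                target = _BACKENDS[backend]        # explicit choice
            elif isinstance(data, array):
                target = ToyArrayScalarColumn      # inferred from input type
            else:
                target = ToyListScalarColumn       # default backend
        return super().__new__(target)

    def __init__(self, data: Any, backend: Optional[str] = None):
        # Each subclass converts the input to its own backing structure.
        self.data = self._convert(data)

    def __len__(self) -> int:
        return len(self.data)


class ToyListScalarColumn(ToyScalarColumn):
    @staticmethod
    def _convert(data: Any) -> list:
        return list(data)


class ToyArrayScalarColumn(ToyScalarColumn):
    @staticmethod
    def _convert(data: Any) -> array:
        return data if isinstance(data, array) else array("d", data)


_BACKENDS = {"list": ToyListScalarColumn, "array": ToyArrayScalarColumn}

inferred = ToyScalarColumn([1.0, 2.0, 3.0])
from_array = ToyScalarColumn(array("d", [1.0, 2.0]))
explicit = ToyScalarColumn([1.0, 2.0, 3.0], backend="array")
```

Callers always write ToyScalarColumn(...), yet the object they get back is a concrete subclass, so isinstance checks against the base class still succeed while the storage details stay hidden.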
How to pick a subclass? In general, users should not have to think about which subclass to use. Meerkat chooses a subclass based on the data structure of the input data. For example, the following code creates a ScalarColumn backed by a Pandas Series:
You can also explicitly specify the subclass to use. For example, the following code creates a ScalarColumn backed by an Arrow array: