{
"cells": [
{
"cell_type": "markdown",
"id": "78a1c0b4",
"metadata": {},
"source": [
"# Column\n",
"\n",
"A {class}`~meerkat.Column` is a sequential data structure (analagous to a [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) in Pandas or a [`Vector`](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Simple-manipulations-numbers-and-vectors) in R). Meerkat supports a diverse set of column types (*e.g.,* {class}`~meerkat.TensorColumn`, {class}`~meerkat.ImageColumn`), each intended for different kinds of data.\n",
"\n",
"Below we create a simple column to hold a set of images stored on disk. To create it, we simply pass filepaths to the {class}`~meerkat.ImageColumn` constructor."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e4e7d265",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import os\n",
"import meerkat as mk\n",
"abs_path_to_img_dir = os.path.join(os.path.dirname(os.path.dirname(mk.__file__)), \"docs/assets/guide/data_structures\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7a648096",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
" \n",
" \n",
" | \n",
" (FileColumn) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
"  | \n",
"
\n",
" \n",
" 1 | \n",
"  | \n",
"
\n",
" \n",
" 2 | \n",
"  | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"column([FileCell(fn=<...7f19e84b82b0>), FileCell(fn=<...7f19e84b82b0>), FileCell(fn=<...7f19e84b82b0>)], backend=FileColumn"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"img_col = mk.image(\n",
" [\"img_0.jpg\", \"img_1.jpg\", \"img_2.jpg\"], \n",
" base_dir=abs_path_to_img_dir\n",
")\n",
"img_col"
]
},
{
"cell_type": "markdown",
"id": "0ea23ee8",
"metadata": {},
"source": [
"All Meerkat columns are subclasses of {class}`~meerkat.Column` and share a common interface, which includes \n",
"{py:meth}`__len__ `,\n",
"{py:meth}`__getitem__ `, \n",
"{py:meth}`__setitem__ `, \n",
"{py:meth}`filter `, \n",
"{py:meth}`map `, \n",
"and {py:meth}`concat `. Below we get the length of the column we just created."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d0e03859",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(img_col)"
]
},
{
"cell_type": "markdown",
"id": "be4de4be",
"metadata": {},
"source": [
"Certain column types may expose additional functionality. For example, {class}`~meerkat.TensorColumn` inherits most of the functionality of an [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0f0cd55f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" (NumPyTensorColumn) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" False | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"column([False, True, False], backend=NumPyTensorColumn"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"id_col = mk.TensorColumn([0, 1, 2])\n",
"id_col.sum()\n",
"id_col == 1"
]
},
{
"cell_type": "markdown",
"id": "ef579205",
"metadata": {},
"source": [
"If you don't know which column type to use, you can just pass a familiar data structure like a ``list``, ``np.ndarray``, ``pd.Series``, and ``torch.Tensor`` to {py:meth}`Column.from_data ` and Meerkat will automatically pick an appropriate column type."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "1e4a1ebd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" (TorchTensorColumn) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" tensor(1) | \n",
"
\n",
" \n",
" 1 | \n",
" tensor(2) | \n",
"
\n",
" \n",
" 2 | \n",
" tensor(3) | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
"column([tensor(1), tensor(2), tensor(3)], backend=TorchTensorColumn"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import torch\n",
"tensor = torch.tensor([1,2,3])\n",
"mk.Column.from_data(tensor)"
]
},
{
"cell_type": "markdown",
"id": "db57082e",
"metadata": {},
"source": [
"# Column Types\n",
"\n",
"There are four core column types in Meerkat, each with a different interface.\n",
"\n",
"1. {class}`~meerkat.ScalarColumn` Each row stores a single numeric or string value. These columns have an interface similar to a Pandas Series. \n",
"2. {class}`~meerkat.TensorColumn` Each row stores an identically shaped multi-dimensional array (*e.g.* vector, matrix, or tensor). These columns have an interface similar to a NumPy ndarray. \n",
"3. {class}`~meerkat.ObjectColumn` Each row stores an arbitrary Python object. These columns should be used sparingly, as they are significantly slower than the columns above. However, they may be useful in small DataFrames. \n",
"4. {class}`~meerkat.DeferredColumn` Represents a *deferred* map operations. A DeferredColumn maintains a single function and a pointer to another column. Each row represents (*but does not actually store*) the value returned from applying the function to the corresponding row of the other column.\n",
"\n",
"````{admonition} Flexibility in Implementation\n",
"\n",
"Meerkat columns are simple wrappers around well-optimized data structures from other libraries. These libraries (e.g. NumPy) run compiled machine code that is significantly faster than routines written in Python. \n",
"\n",
"The data structure underlying a column is available through the ``.data`` attribute of the column. For example, the following code creates a {class}`~meerkat.TensorColumn` and then accesses the underlying NumPy array.\n",
"\n",
"```{code-cell} ipython3\n",
"import meerkat as mk;\n",
"col = mk.TensorColumn([0,1,2]);\n",
"col.data\n",
"```\n",
"\n",
"Meerkat is unopinionated when it comes to the choice of data structure underlying columns. This provides users the **flexibility** to choose the best data structure for their use case. For example, a `TensorColumn` can be backed by either a [NumPy Array](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)) or a [PyTorch Tensor](https://pytorch.org/docs/stable/tensors.html).\n",
"\n",
"Each `ScalarColumn` object in Meerkat is actually an instance of one of its subclasses ({class}`~meerkat.PandasScalarColumn`, {class}`~meerkat.ArrowScalarColumn`). These subclasses are responsible for implementing the {class}`~meerkat.ScalarColumn` interface for a particular choice of data structure. Similarly, each `TensorColumn` object is an instance of its subclasses ({class}`~meerkat.NumPyTensorColumn`, {class}`~meerkat.TorchTensorColumn`). \n",
"\n",
"*How to pick a subclass?* In general, users should not have to think about which subclass to use. Meerkat chooses a subclass based on the data structure of the input data. For example, the following code creates a `ScalarColumn` backed by a Pandas Series:\n",
"\n",
"```{code-cell} ipython3\n",
"mk.column([0,1,2])\n",
"```\n",
"\n",
"You can also explicitly specify the subclass to use. For example, the following code creates a `ScalarColumn` backed by an Arrow array:\n",
"\n",
"```{code-cell} ipython3\n",
"mk.ArrowScalarColumn([0,1,2])\n",
"```\n",
"````"
]
}
],
"metadata": {
"file_format": "mystnb",
"kernelspec": {
"display_name": "python3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
},
"source_map": [
5,
13,
20,
26,
36,
38,
43,
47,
51,
55
]
},
"nbformat": 4,
"nbformat_minor": 5
}