Data Selection

As discussed in the intro, there are two key data structures in Meerkat: the Column and the DataFrame. In this guide, we’ll demonstrate how to access the data stored within them.

Throughout, we’ll be selecting data from the following DataFrame, which holds the Imagenette dataset, a small subset of the original ImageNet. This DataFrame includes a column holding images, a column holding their labels, and a few others.

import meerkat as mk
df = mk.get("imagenette", version="160px")

Below is an overview of the data selection methods discussed in this guide.

Selecting Columns

The columns in a DataFrame are uniquely identified by str names. The code below displays the column names in the Imagenette data frame we loaded above:

df.columns
['path',
 'noisy_labels_0',
 'noisy_labels_1',
 'noisy_labels_5',
 'noisy_labels_25',
 'noisy_labels_50',
 'is_valid',
 'label_id',
 'label',
 'label_idx',
 'split',
 'img_path',
 'img_id',
 'index',
 'img']

Using these column names, we can pull out an individual column or a subset of them as a new DataFrame.

Selecting a Single Column

str -> Column

To select a single column, we simply pass its name to the index operator. For example,

   col = df["img"]
   col.head()
(FileColumn)
0
1
2
3
4

Passing a str that isn’t among the column names will raise a KeyError.

It may be helpful to think of a DataFrame as a dictionary mapping column names to columns.

Indeed, a DataFrame implements other parts of the dict interface, including keys(), values(), and items(). Unlike a dictionary, multiple columns in a DataFrame can be selected at once.
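The dictionary analogy can be made concrete with a plain Python dict. The sketch below is a toy stand-in, not Meerkat's implementation: the name `columns` is hypothetical, and the sample values are copied from the outputs shown in this guide.

```python
# Toy model: a "DataFrame" as a dict mapping column names to columns (lists).
# This is a plain-Python analogy, not Meerkat's implementation.
columns = {
    "img_id": ["n02979186_9036", "n02979186_11957", "n02979186_9715"],
    "label": ["cassette player", "cassette player", "cassette player"],
}

# Selecting a single column by name, analogous to df["label"].
label_col = columns["label"]

# The dict interface exposes the names and the (name, column) pairs.
names = list(columns.keys())
pairs = list(columns.items())

# An unknown name raises KeyError, as the guide describes for a DataFrame.
try:
    columns["not_a_column"]
    missing_raised = False
except KeyError:
    missing_raised = True
```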

Selecting Multiple Columns

List[str] -> DataFrame

You can select multiple columns by passing a list of column names. Doing so will return a new DataFrame with a subset of the columns in the original. For example,

    df = df[["img", "img_id", "label"]]
    df.head()
img img_id label
0 n02979186_9036 cassette player
1 n02979186_11957 cassette player
2 n02979186_9715 cassette player
3 n02979186_21736 cassette player
4 ILSVRC2012_val_00046953 cassette player

Passing a list that contains a str that isn’t among the column names will raise a KeyError.
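Continuing the dictionary analogy, selecting several columns amounts to building a new mapping restricted to the requested names. The following is a minimal plain-Python sketch; `select_columns` and `toy` are hypothetical names introduced for illustration and are not part of Meerkat.

```python
from typing import Dict, List


def select_columns(columns: Dict[str, list], names: List[str]) -> Dict[str, list]:
    """Return a new mapping holding only the requested columns (toy model)."""
    missing = [name for name in names if name not in columns]
    if missing:
        raise KeyError(f"unknown column(s): {missing}")
    return {name: columns[name] for name in names}


toy = {"img_id": ["a", "b"], "label": ["x", "y"], "split": ["train", "valid"]}

# The result has only the requested columns; the original keeps all three.
subset = select_columns(toy, ["img_id", "label"])

# A single unknown name in the list is enough to raise KeyError.
try:
    select_columns(toy, ["img_id", "not_a_column"])
    raised = False
except KeyError:
    raised = True
```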

Copy vs. Reference

See copying for more information.

You may be wondering whether the columns returned by indexing are copies of the columns in the original DataFrame. The columns returned by the index operator reference the same columns in the original DataFrame. This means that modifying the columns returned by the index operator will modify the columns in the original DataFrame.
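Reference semantics can be illustrated with ordinary Python objects. In this toy sketch (not Meerkat code; all names are hypothetical), the selection holds the very same list object as the original, so an in-place change made through one is visible through the other.

```python
# Toy model of reference semantics using plain Python objects.
original = {"img_id": ["a", "b", "c"], "label": ["x", "y", "z"]}

# Column selection builds a new container that points at the same column objects.
selected = {name: original[name] for name in ["label"]}
same_object = selected["label"] is original["label"]

# An in-place change made through the selection is visible in the original.
selected["label"][0] = "changed"
```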

Selecting Rows by Position

In Meerkat, the rows of a DataFrame or Column are ordered. This means that rows are uniquely identified by their position in the DataFrame or Column (similar to how the elements of a Python list are uniquely identified by their position in the list).

Row indices range from 0 to the number of rows in the DataFrame or Column minus one. To see how many rows a DataFrame or a column has we can use len(). For example,

   len(df)
13394

Above we mentioned how a DataFrame could be viewed as a dictionary mapping column names to columns. Equivalently, it also may be helpful to think of a DataFrame as a list of dictionaries mapping column names to values. The DataFrame interface supports both of these views – under the hood, storage is organized so as to make both column and row accesses fast.
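The two views can be written out in plain Python. This is only an analogy for the interface; it says nothing about how Meerkat organizes storage internally. The names `by_column` and `by_row` are hypothetical.

```python
# Column-oriented view: column name -> list of values.
by_column = {
    "img_id": ["n02979186_9036", "n02979186_11957", "n02979186_9715"],
    "label": ["cassette player", "cassette player", "cassette player"],
}

# Row-oriented view: one dict per row, built by zipping the columns together.
by_row = [dict(zip(by_column, values)) for values in zip(*by_column.values())]

# Both views address the same cell.
cell_via_column = by_column["img_id"][2]
cell_via_row = by_row[2]["img_id"]
```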

Selecting a Single Row by Position

int -> Row

To select a single row from a DataFrame, we simply pass its position to the index operator.

   row = df[2]
   row
{'img': FileCell(fn=<meerkat.columns.deferred.file.FileLoader object at 0x7f2b2ad21df0>),
 'img_id': 'n02979186_9715',
 'label': 'cassette player'}

Passing an int that is less than 0 or greater than or equal to len(df) will raise an IndexError.

Notice that row holds a FileCell object, not a PIL Image or other in-memory image object. The “image” has not yet been loaded from disk into memory. The FileCell knows how to load the image into memory, but stops just short of doing so. Later on, when we want to access the image, we can call the row or cell to load the image into memory.

   row()
{'img': <PIL.Image.Image image mode=RGB size=160x216>,
 'img_id': 'n02979186_9715',
 'label': 'cassette player'}
   row["img"]()
[Output: the loaded image for this row, rendered inline]

Why do we wait to load the image into memory? Image datasets often don’t fit into memory. By deferring the loading of images until they are needed, we can manipulate large image datasets quickly.
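The idea of deferring a load until a cell is called can be sketched in a few lines. This is a toy illustration, not Meerkat's FileCell: `LazyCell`, `fake_load`, and the path `example.jpg` are all hypothetical.

```python
from typing import Any, Callable, List


class LazyCell:
    """Toy deferred cell: stores how to load a value, not the value itself."""

    def __init__(self, loader: Callable[[str], Any], path: str) -> None:
        self.loader = loader
        self.path = path

    def __call__(self) -> Any:
        # The (potentially expensive) load only happens here.
        return self.loader(self.path)


load_calls: List[str] = []


def fake_load(path: str) -> str:
    """Hypothetical loader; a real one would read and decode the file."""
    load_calls.append(path)
    return f"<pixels of {path}>"


cell = LazyCell(fake_load, "example.jpg")
calls_before = len(load_calls)  # creating the cell loads nothing
image = cell()                  # calling the cell triggers the load
calls_after = len(load_calls)
```

Because creating a cell costs almost nothing, a column of such cells can be sliced and filtered without touching the files behind it.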

Materializing Deferred Columns

The images in df are stored in a subclass of DeferredColumn called ImageColumn. Deferred columns are a special type of column that defer the materialization of data until it is needed. They play a central role in Meerkat as they make it easy to work with large data types like images and videos. Learn more in the deferred guide.

int -> Any

The same position-based indexing works for selecting a single cell from a Column.

col = df["label"]
col[2]
'cassette player'

Passing an int that is less than 0 or greater than or equal to len(df["label"]) will raise an IndexError.

Selecting Multiple Rows by Position

There are three different ways we can select a subset of rows from a DataFrame or Column: via slice, Sequence[int], or Sequence[bool].

slice -> DataFrame

To select a set of contiguous rows from a DataFrame, we can use an integer slice [start:end]. The subset of rows will be returned as a new DataFrame.

    df[50:100]
img img_id label
0 n02979186_8227 cassette player
1 n02979186_4313 cassette player
2 n02979186_1148 cassette player
3 n02979186_4266 cassette player
4 n02979186_9873 cassette player
... ... ... ...
45 n02979186_734 cassette player
46 n02979186_9863 cassette player
47 n02979186_27494 cassette player
48 n02979186_11839 cassette player
49 n02979186_27347 cassette player

We can also use integer slices to select a set of evenly spaced rows from a DataFrame [start:end:step]. For example, below we select every tenth row from the first 100 rows in the DataFrame.

    df[0:100:10]
img img_id label
0 n02979186_9036 cassette player
1 n02979186_12419 cassette player
2 n02979186_6725 cassette player
3 n02979186_14793 cassette player
4 n02979186_9858 cassette player
5 n02979186_8227 cassette player
6 n02979186_16667 cassette player
7 n02979186_10993 cassette player
8 n02979186_4704 cassette player
9 n02979186_2163 cassette player
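The row counts in the two outputs above (50 rows and 10 rows) are consistent with standard Python slice semantics, in which the end position is exclusive. Assuming that correspondence, which positions a slice selects can be checked with a plain `range`, with no Meerkat required:

```python
# Row positions for a frame with 13394 rows, matching len(df) above.
positions = range(13394)

contiguous = list(positions[50:100])  # positions 50 through 99
strided = list(positions[0:100:10])   # every tenth position among the first 100

# slice.indices resolves a slice against a given length.
start, stop, step = slice(0, 100, 10).indices(len(positions))
```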

Sequence[int] -> DataFrame

To select multiple rows from a DataFrame we can also pass a list of int.

    small_df = df[[0, 2, 5, 8, 17]]
    small_df
img img_id label
0 n02979186_9036 cassette player
1 n02979186_9715 cassette player
2 n02979186_10568 cassette player
3 n02979186_10756 cassette player
4 n02979186_21779 cassette player

Other valid sequences of int that can be used to index are:

  • Tuple[int] - a tuple of integers.

  • np.ndarray[np.integer] - a NumPy NDArray with dtype np.integer.

  • pd.Series[np.integer] - a Pandas Series with dtype np.integer.

  • torch.Tensor[torch.int64] - a PyTorch Tensor with dtype torch.int64.

  • mk.Column - a Meerkat column whose cells are int, np.integer, or torch.int64.

This is useful when the rows are neither contiguous nor evenly spaced (otherwise slice indexing, described above, is faster).
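Positional selection with a sequence of int can be modeled as gathering the listed positions from every column. The sketch below is a toy on plain lists, not Meerkat's implementation; `take` and `toy` are hypothetical names.

```python
from typing import Dict, Sequence


def take(columns: Dict[str, list], positions: Sequence[int]) -> Dict[str, list]:
    """Toy positional selection: gather the given rows from every column."""
    return {name: [values[i] for i in positions] for name, values in columns.items()}


toy = {"img_id": ["a", "b", "c", "d", "e", "f"], "label": list("uvwxyz")}

# Any sequence of int works in this toy, here a list and a tuple.
from_list = take(toy, [0, 2, 5])
from_tuple = take(toy, (0, 2, 5))
```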

Sequence[bool] -> DataFrame

To select multiple rows from a DataFrame we can also pass a list of bool the same length as the DataFrame. Below we select the first and last rows from the smaller DataFrame small_df that we created above.

small_df[[True, False, False, False, True]]
img img_id label
0 n02979186_9036 cassette player
1 n02979186_21779 cassette player

Other valid sequences of bool that can be used to select are:

  • Tuple[bool] - a tuple of bool.

  • np.ndarray[bool] - a NumPy NDArray with dtype bool.

  • pd.Series[bool] - a Pandas Series with dtype bool.

  • torch.Tensor[torch.bool] - a PyTorch Tensor with dtype torch.bool.

  • mk.Column - a Meerkat column whose cells are int, bool, or torch.bool.

This is very useful for quickly selecting a subset of rows that satisfy a predicate (like you might do with a WHERE clause in SQL). For example, say we want to select all rows that have a value of "parachute" in the "label" column. We could do this using the following code:

    parachute_df = df[df["label"] == "parachute"]
    parachute_df.head()
img img_id label
0 n03888257_45616 parachute
1 n03888257_2919 parachute
2 n03888257_37776 parachute
3 n03888257_10639 parachute
4 n03888257_17133 parachute
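The predicate pattern has two steps: an elementwise comparison produces one bool per row, and that mask then selects the rows. A plain-Python sketch of the same two steps follows. It is a toy on lists, and the ids are hypothetical placeholders.

```python
from itertools import compress

labels = ["cassette player", "parachute", "gas pump", "parachute"]
img_ids = ["id_0", "id_1", "id_2", "id_3"]  # hypothetical ids

# Elementwise comparison yields one bool per row, like df["label"] == "parachute".
mask = [label == "parachute" for label in labels]

# Keep only the rows whose mask entry is True.
selected_ids = list(compress(img_ids, mask))
```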

Copy vs. Reference

See advanced/copying.rst for more information.

You may be wondering whether the rows returned by indexing are copies or references of the rows in the original DataFrame. This depends on (1) which of the selection strategies above you use (slice vs. Sequence[int] vs. Sequence[bool]) and (2) the column type (e.g. PandasSeriesColumn, TensorColumn).

In general, columns inherit the copying behavior of their underlying data structure. For example, a TensorColumn has the copying behavior of a NumPy array, as described in the NumPy indexing documentation (https://numpy.org/doc/stable/reference/arrays.indexing.html). See a more detailed discussion in advanced/copying.rst.
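The copy-versus-view distinction can be seen with standard-library types alone. In the sketch below, a list slice is an independent copy, while a memoryview slice shares memory with its source. NumPy draws a similar line: basic slicing returns a view, whereas indexing with an integer or boolean array returns a copy. This is an analogy and does not use Meerkat columns.

```python
from array import array

values = array("i", [10, 20, 30, 40])

# A list slice is an independent copy: writes do not reach the source list.
as_list = values.tolist()
list_slice = as_list[0:2]
list_slice[0] = -1

# A memoryview slice is a view: writes go through to the source buffer.
view_slice = memoryview(values)[0:2]
view_slice[0] = 99
```

After running this, `as_list[0]` is still 10, while `values[0]` has become 99.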

Selecting Rows by Key

It is also possible to select rows from a DataFrame by a key column. In Meerkat, a key column is a ScalarColumn containing str or int values that uniquely identify each row. The primary key in Meerkat is analogous to the primary key in a SQL database or the index in a Pandas DataFrame.

The primary key of df is the "img_id" column.

    print(df.primary_key_name)
    df.primary_key
img_id
(PandasScalarColumn)
0 n02979186_9036
1 n02979186_11957
2 n02979186_9715
3 n02979186_21736
4 ILSVRC2012_val_00046953
... ...
13389 n03425413_17521
13390 n03425413_20711
13391 n03425413_19050
13392 n03425413_13831
13393 n03425413_1242

The primary key can be set using set_primary_key(), which takes a column name or a ScalarColumn as input.

Selecting a Single Row by Key

str|int -> Row

To select a single row from a DataFrame by key, we can use the .loc[] operator and pass a key value.

    df.loc["n03888257_37776"]
{'img': FileCell(fn=<meerkat.columns.deferred.file.FileLoader object at 0x7f2b2ad21df0>),
 'img_id': 'n03888257_37776',
 'label': 'parachute'}

Selecting Multiple Rows by Key

Sequence[str|int] -> DataFrame

We can also select a subset of rows in a DataFrame by passing a list of key values to .loc[].

    df.loc[["n03888257_37776", "n03425413_20711", "n03425413_1242"]]
img img_id label
0 n03888257_37776 parachute
1 n03425413_20711 gas pump
2 n03425413_1242 gas pump

Passing a str|int that isn’t in the primary key will raise a KeyError.
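Key-based selection can be thought of as a lookup from key value to row position, followed by ordinary positional selection. The following toy sketch on plain Python data shows the idea. The names `position_of` and `loc` are hypothetical, and this is not how Meerkat necessarily implements .loc[].

```python
from typing import Dict, List, Sequence

rows: List[Dict[str, str]] = [
    {"img_id": "n03888257_37776", "label": "parachute"},
    {"img_id": "n03425413_20711", "label": "gas pump"},
    {"img_id": "n03425413_1242", "label": "gas pump"},
]

# Map each primary-key value to its row position (keys must be unique).
position_of = {row["img_id"]: pos for pos, row in enumerate(rows)}


def loc(keys: Sequence[str]) -> List[Dict[str, str]]:
    """Toy key-based selection; an unknown key raises KeyError."""
    return [rows[position_of[key]] for key in keys]


# Rows come back in the order the keys were requested.
picked = loc(["n03425413_1242", "n03888257_37776"])

try:
    loc(["not_a_key"])
    raised = False
except KeyError:
    raised = True
```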

For Pandas Users

index vs. primary key: Pandas DataFrames maintain an index object that is separate from the DataFrame’s columns. The index object is used to select rows by key using the .loc[] indexer. In Meerkat, there is no separate index object. Instead, we designate one of the columns the primary key and can select rows based on the values in that column using .loc[]. The Meerkat approach, where the primary key is a column in the DataFrame, resembles the approach taken by most SQL databases.

.iloc: Pandas users are likely familiar with the .iloc and .loc properties of DataFrame and Series. These properties are used to select data by integer position and by key in the index, respectively. In Meerkat, we do not support .iloc. To index by position, simply apply the index operator [] directly to the object.