Data Selection

As discussed in the intro, there are two key data structures in Meerkat: the Column and the DataFrame. In this guide, we’ll demonstrate how to access the data stored within them.

Throughout, we’ll be selecting data from the following DataFrame, which holds the Imagenette dataset, a small subset of the original ImageNet. This DataFrame includes a column holding images, a column holding their labels, and a few others.

import meerkat as mk
df = mk.get("imagenette", version="160px")

Below is an overview of the data selection methods discussed in this guide.

Selecting Columns

The columns in a DataFrame are uniquely identified by str names. The code below displays the column names in the Imagenette data frame we loaded above:

df.columns

['path',
 'noisy_labels_0',
 'noisy_labels_1',
 'noisy_labels_5',
 'noisy_labels_25',
 'noisy_labels_50',
 'is_valid',
 'label_id',
 'label',
 'label_idx',
 'split',
 'img_path',
 'img_id',
 'index',
 'img']

Using these column names, we can pull out an individual column or a subset of them as a new DataFrame.

Selecting a Single Column

`str` -> `Column`

To select a single column, we simply pass its name to the index operator. For example,

   col = df["img"]
   col.head()

	(FileColumn)
0
1
2
3
4

Passing a str that isn’t among the column names will raise a KeyError.

It may be helpful to think of a DataFrame as a dictionary mapping column names to columns.

Indeed, a DataFrame implements other parts of the dict interface, including keys(), values(), and items(). Unlike a dictionary, multiple columns in a DataFrame can be selected at once.
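The dictionary analogy can be made concrete with a plain Python dict. The sketch below is a toy model, not Meerkat's implementation; the name `toy_df` and the values in it are made up for illustration.

```python
# Toy model only: a plain dict standing in for a DataFrame.
toy_df = {
    "img_id": ["a_001", "a_002", "b_001"],
    "label": ["cassette player", "cassette player", "parachute"],
}

# Selecting a single column is a dictionary lookup by name.
labels = toy_df["label"]

# keys(), values(), and items() behave as they do on any dict.
column_names = list(toy_df.keys())
column_lengths = {name: len(col) for name, col in toy_df.items()}

# A name that is not among the keys raises a KeyError.
try:
    toy_df["missing"]
    missing_raised = False
except KeyError:
    missing_raised = True
```

A plain dict has no equivalent of selecting several columns in one indexing operation, which is where the analogy stops.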

Selecting Multiple Columns

`List[str]` -> `DataFrame`

You can select multiple columns by passing a list of column names. Doing so will return a new DataFrame with a subset of the columns in the original. For example,

    df = df[["img", "img_id", "label"]]
    df.head()

	img_id	label
0	n02979186_9036	cassette player
1	n02979186_11957	cassette player
2	n02979186_9715	cassette player
3	n02979186_21736	cassette player
4	ILSVRC2012_val_00046953	cassette player

Passing a list containing a str that isn’t among the column names will raise a KeyError.

Copy vs. Reference

See copying for more information.

You may be wondering whether the columns returned by indexing are copies of the columns in the original DataFrame. The columns returned by the index operator reference the same columns in the original DataFrame. This means that modifying the columns returned by the index operator will modify the columns in the original DataFrame.
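Reference semantics of this kind can be illustrated with plain Python objects. This is a toy sketch with columns modeled as lists in a dict, not Meerkat's internals; all names and values are hypothetical.

```python
# Toy model only: columns are plain lists held in a dict.
toy_df = {
    "img_id": ["a_001", "a_002"],
    "label": ["cassette player", "parachute"],
    "split": ["train", "valid"],
}

# Selecting a subset of columns builds a new container,
# but it holds references to the same column objects.
subset = {name: toy_df[name] for name in ["img_id", "label"]}
same_object = subset["label"] is toy_df["label"]

# Mutating the column through the subset is visible in the original.
subset["label"][0] = "gas pump"
original_first_label = toy_df["label"][0]
```

If an independent column is needed, copy it explicitly before modifying it.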

Selecting Rows by Position

In Meerkat, the rows of a DataFrame or Column are ordered. This means that rows are uniquely identified by their position in the DataFrame or Column (similar to how the elements of a Python list are uniquely identified by their position in the list).

Row indices range from 0 to the number of rows in the DataFrame or Column minus one. To see how many rows a DataFrame or Column has, we can use len(). For example,

   len(df)

Above, we mentioned that a DataFrame can be viewed as a dictionary mapping column names to columns. It may also be helpful to think of a DataFrame as a list of dictionaries, each mapping column names to values. The DataFrame interface supports both of these views: under the hood, storage is organized so as to make both column and row accesses fast.
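The two views can be sketched with built-in containers. This is only an illustration of the idea, not of how Meerkat stores data; the names and values are invented.

```python
# Column-oriented view: column name -> list of values.
columns = {
    "img_id": ["a_001", "a_002", "b_001"],
    "label": ["cassette player", "cassette player", "parachute"],
}

# Row-oriented view: one dict per row, built by zipping the columns.
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]

# Both views agree on the value at row 2, column "label".
from_columns = columns["label"][2]
from_rows = rows[2]["label"]
n_rows = len(rows)
```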

Selecting a Single Row by Position

`int` -> `Row`

To select a single row from a DataFrame, we simply pass its position to the index operator.

   row = df[2]
   row

{'img': FileCell(fn=<meerkat.columns.deferred.file.FileLoader object at 0x7f2b2ad21df0>),
 'img_id': 'n02979186_9715',
 'label': 'cassette player'}

Passing an int that is less than 0 or greater than or equal to len(df) will raise an IndexError.

Notice that row holds a FileCell object, not a PIL Image or other in-memory image object. The “image” has not yet been loaded from disk into memory. The FileCell knows how to load the image into memory, but stops just short of doing so. Later on, when we want to access the image, we can call the row or cell to load the image into memory.

   row()

{'img': <PIL.Image.Image image mode=RGB size=160x216>,
 'img_id': 'n02979186_9715',
 'label': 'cassette player'}

   row["img"]()

(Output: the loaded image.)

Why do we wait to load the image into memory? Image datasets often don’t fit into memory. By deferring the loading of images until they are needed, we can manipulate large image datasets quickly.
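The idea of deferring a load until a cell is called can be sketched in a few lines. `ToyDeferredCell` and `fake_load` are hypothetical names invented for this illustration; they are not Meerkat classes, and the "load" is a placeholder string rather than a real image read.

```python
from typing import Any, Callable


class ToyDeferredCell:
    """Hypothetical stand-in for a deferred cell: stores how to load, not the data."""

    def __init__(self, loader: Callable[[str], Any], path: str):
        self.loader = loader
        self.path = path

    def __call__(self) -> Any:
        # The load happens only when the cell is called.
        return self.loader(self.path)


load_log = []


def fake_load(path: str) -> str:
    # Placeholder for an expensive read such as opening an image file.
    load_log.append(path)
    return f"<pixels of {path}>"


cell = ToyDeferredCell(fake_load, "images/example.jpg")
loads_before_call = len(load_log)  # nothing has been loaded yet
image = cell()                     # loading happens here
loads_after_call = len(load_log)
```

Creating, slicing, or reordering many such cells costs almost nothing, because only the loading recipe is stored until a cell is called.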

Materializing Deferred Columns

The images in df are stored in a subclass of DeferredColumn called ImageColumn. Deferred columns are a special type of column that defer the materialization of data until it is needed. They play a central role in Meerkat as they make it easy to work with large data types like images and videos. Learn more in the deferred guide.

`int` -> `Any`

The same position-based indexing works for selecting a single cell from a Column.

col = df["label"]
col[2]

'cassette player'

Passing an int that is less than 0 or greater than or equal to len(df["label"]) will raise an IndexError.

Selecting Multiple Rows by Position

There are three different ways we can select a subset of rows from a DataFrame or Column: via slice, Sequence[int], or Sequence[bool].
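As a rough mental model, each of the three index types resolves to a list of row positions. The function below is a toy sketch in plain Python, not Meerkat's code; `select_positions` is a hypothetical helper, and the length check with its `ValueError` is an assumption made for this example.

```python
from typing import Sequence, Union

Index = Union[slice, Sequence[int], Sequence[bool]]


def select_positions(n_rows: int, index: Index) -> list:
    """Return the row positions that an index selects (toy model)."""
    if isinstance(index, slice):
        return list(range(*index.indices(n_rows)))
    index = list(index)
    # Check bool first: in Python, bool is a subclass of int.
    if index and all(isinstance(i, bool) for i in index):
        if len(index) != n_rows:
            raise ValueError("a boolean index must have one entry per row")
        return [pos for pos, keep in enumerate(index) if keep]
    return index


by_slice = select_positions(6, slice(1, 4))
by_ints = select_positions(6, (0, 2, 5))
by_bools = select_positions(6, [True, False, False, False, False, True])
```

The bool check has to come before any int handling, since `True` and `False` would otherwise be read as positions 1 and 0.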

`slice` -> `DataFrame`

To select a set of contiguous rows from a DataFrame, we can use an integer slice [start:end]. The subset of rows will be returned as a new DataFrame.

    df[50:100]

	img	img_id	label
0		n02979186_8227	cassette player
1		n02979186_4313	cassette player
2		n02979186_1148	cassette player
3		n02979186_4266	cassette player
4		n02979186_9873	cassette player
...	...	...	...
45		n02979186_734	cassette player
46		n02979186_9863	cassette player
47		n02979186_27494	cassette player
48		n02979186_11839	cassette player
49		n02979186_27347	cassette player

We can also use integer slices to select a set of evenly spaced rows from a DataFrame [start:end:step]. For example, below we select every tenth row from the first 100 rows in the DataFrame.

    df[0:100:10]

	img_id	label
0	n02979186_9036	cassette player
1	n02979186_12419	cassette player
2	n02979186_6725	cassette player
3	n02979186_14793	cassette player
4	n02979186_9858	cassette player
5	n02979186_8227	cassette player
6	n02979186_16667	cassette player
7	n02979186_10993	cassette player
8	n02979186_4704	cassette player
9	n02979186_2163	cassette player

`Sequence[int]` -> `DataFrame`

To select multiple rows from a DataFrame we can also pass a list of int.

    small_df = df[[0, 2, 5, 8, 17]]
    small_df

	img_id	label
0	n02979186_9036	cassette player
1	n02979186_9715	cassette player
2	n02979186_10568	cassette player
3	n02979186_10756	cassette player
4	n02979186_21779	cassette player

Other valid sequences of int that can be used to index are:

Tuple[int] – a tuple of integers.
np.ndarray[np.integer] – a NumPy NDArray with dtype np.integer.
pd.Series[np.integer] – a Pandas Series with dtype np.integer.
torch.Tensor[torch.int64] – a PyTorch Tensor with dtype torch.int64.
mk.Column – a Meerkat column whose cells are int, np.integer, or torch.int64.

This is useful when the rows are neither contiguous nor evenly spaced (otherwise slice indexing, described above, is faster).

`Sequence[bool]` -> `DataFrame`

To select multiple rows from a DataFrame, we can also pass a list of bool with the same length as the DataFrame. Below, we select the first and last rows from the smaller DataFrame small_df that we selected in the example above.

small_df[[True, False, False, False, True]]

	img	img_id	label
0		n02979186_9036	cassette player
1		n02979186_21779	cassette player

Other valid sequences of bool that can be used to select are:

Tuple[bool] – a tuple of bool.
np.ndarray[bool] – a NumPy NDArray with dtype bool.
pd.Series[bool] – a Pandas Series with dtype bool.
torch.Tensor[torch.bool] – a PyTorch Tensor with dtype torch.bool.
mk.Column – a Meerkat column whose cells are bool, np.bool_, or torch.bool.

This is very useful for quickly selecting a subset of rows that satisfy a predicate (like you might do with a WHERE clause in SQL). For example, say we want to select all rows that have a value of "parachute" in the "label" column. We could do this using the following code:

    parachute_df = df[df["label"] == "parachute"]
    parachute_df.head()

	img_id	label
0	n03888257_45616	parachute
1	n03888257_2919	parachute
2	n03888257_37776	parachute
3	n03888257_10639	parachute
4	n03888257_17133	parachute
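Predicate-based selection works in two steps: the comparison produces one bool per row, and that mask then picks the rows. The sketch below mimics both steps with plain lists and `itertools.compress`; it illustrates the idea rather than Meerkat's implementation, and the data is made up.

```python
from itertools import compress

labels = ["cassette player", "parachute", "gas pump", "parachute"]
img_ids = ["a_001", "b_001", "c_001", "b_002"]

# Step 1: evaluate the predicate on every row to get one bool per row.
mask = [label == "parachute" for label in labels]

# Step 2: keep only the rows whose mask entry is True.
parachute_ids = list(compress(img_ids, mask))
```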

Copy vs. Reference

See the copying guide (advanced/copying.rst) for more information.

You may be wondering whether the rows returned by indexing are copies or references of the rows in the original DataFrame. This depends on (1) which of the selection strategies above you use (slice vs. Sequence[int] vs. Sequence[bool]) and (2) the column type (e.g. PandasSeriesColumn, TensorColumn).

In general, columns inherit the copying behavior of their underlying data structure. For example, a TensorColumn has the copying behavior of a NumPy array, as described in the NumPy indexing documentation (https://numpy.org/doc/stable/reference/arrays.indexing.html). A more detailed discussion appears in the copying guide (advanced/copying.rst).
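The NumPy behavior referred to here can be checked directly on a bare array. The example uses only NumPy, not Meerkat: basic slicing returns a view, while indexing with an integer list or a boolean mask returns a copy.

```python
import numpy as np

data = np.arange(10)

by_slice = data[2:5]        # basic slicing returns a view
by_ints = data[[2, 3, 4]]   # integer-array indexing returns a copy
by_bools = data[data < 3]   # boolean-mask indexing returns a copy

slice_shares = np.shares_memory(data, by_slice)
ints_share = np.shares_memory(data, by_ints)
bools_share = np.shares_memory(data, by_bools)

# Writing through the view changes the original array; writing to a copy does not.
by_slice[0] = -1
by_ints[1] = -2
```

After these writes, `data[2]` is `-1` because the slice is a view, while `data[3]` is still `3` because `by_ints` is a copy.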

Selecting Rows by Key

It is also possible to select rows from a DataFrame by a key column. In Meerkat, a key column is a ScalarColumn containing str or int values that uniquely identify each row. The primary key in Meerkat is analogous to the primary key in a SQL database or the index in a Pandas DataFrame.

The primary key of df is the "img_id" column.

    print(df.primary_key_name)
    df.primary_key

img_id

	(PandasScalarColumn)
0	n02979186_9036
1	n02979186_11957
2	n02979186_9715
3	n02979186_21736
4	ILSVRC2012_val_00046953
...	...
13389	n03425413_17521
13390	n03425413_20711
13391	n03425413_19050
13392	n03425413_13831
13393	n03425413_1242

The primary key can be set using set_primary_key(), which takes a column name or a ScalarColumn as input.

Selecting a Single Row by Key

`str|int` -> `Row`

To select a single row from a DataFrame by key, we can use the .loc[] operator and pass a key value.

    df.loc["n03888257_37776"]

{'img': FileCell(fn=<meerkat.columns.deferred.file.FileLoader object at 0x7f2b2ad21df0>),
 'img_id': 'n03888257_37776',
 'label': 'parachute'}

Selecting Multiple Rows by Key

`Sequence[str|int]` -> `DataFrame`

We can also select a subset of rows in a DataFrame by passing a list of key values to .loc[].

    df.loc[["n03888257_37776", "n03425413_20711", "n03425413_1242"]]

	img_id	label
0	n03888257_37776	parachute
1	n03425413_20711	gas pump
2	n03425413_1242	gas pump

Passing a str|int that isn’t in the primary key will raise a KeyError.
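Key-based lookup can be pictured as a mapping from key values to row positions. The sketch below is a toy model in plain Python; `position_of` and `toy_loc` are hypothetical names, and nothing here describes how Meerkat actually implements .loc[].

```python
img_ids = ["a_001", "a_002", "b_001"]
labels = ["cassette player", "cassette player", "parachute"]

# Map each key value to its row position; keys must be unique.
assert len(set(img_ids)) == len(img_ids)
position_of = {key: pos for pos, key in enumerate(img_ids)}


def toy_loc(key):
    """Look up one row by key; an unknown key raises KeyError."""
    pos = position_of[key]
    return {"img_id": img_ids[pos], "label": labels[pos]}


single = toy_loc("b_001")
several = [toy_loc(key) for key in ["b_001", "a_001"]]
```

The uniqueness check matters: if two rows shared a key value, the mapping would silently keep only the last position.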

For Pandas Users

index vs. primary key: Pandas DataFrames maintain an index object that is separate from the DataFrame’s columns. The index object is used to select rows by key using the .loc[] indexer. In Meerkat, there is no separate index object. Instead, we designate one of the columns the primary key and can select rows based on the values in that column using .loc[]. The Meerkat approach, where the primary key is a column in the DataFrame, resembles the approach taken by most SQL databases.

.iloc: Pandas users are likely familiar with the .iloc and .loc properties of DataFrame and Series. These properties are used to select data by integer position and by key in the index, respectively. In Meerkat, we do not support .iloc; to index by position, simply apply the index operator [] directly to the object.

🔮 v0.4.11
