Data Selection
Contents
Data Selection¶
As discussed in the intro, there are two key data structures in Meerkat: the Column and the DataFrame. In this guide, we’ll demonstrate how to access the data stored within them.
Throughout, we’ll be selecting data from the following DataFrame, which holds the Imagenette dataset, a small subset of the original ImageNet. This DataFrame includes a column holding images, a column holding their labels, and a few others.
import meerkat as mk
df = mk.get("imagenette", version="160px")
Below is an overview of the data selection methods discussed in this guide.
Selecting Columns¶
The columns in a DataFrame are uniquely identified by str
names. The code
below displays the column names in the Imagenette data frame we loaded above:
df.columns
['path',
'noisy_labels_0',
'noisy_labels_1',
'noisy_labels_5',
'noisy_labels_25',
'noisy_labels_50',
'is_valid',
'label_id',
'label',
'label_idx',
'split',
'img_path',
'img_id',
'index',
'img']
Using these column names, we can pull out an individual column or a subset of them as a new DataFrame.
Selecting a Single Column¶
str
-> Column
¶
To select a single column, we simply pass it’s name to the index operator. For example,
col = df["img"]
col.head()
(FileColumn) | |
---|---|
0 | |
1 | |
2 | |
3 | |
4 |
Passing a str
that isn’t among the column names will raise a KeyError
.
It may be helpful to think of a DataFrame as a dictionary mapping column names to columns.
Indeed, a DataFrame implements other parts of the dict
interface including :meth:~meerkat.DataFrame.keys()
, :meth:~meerkat.DataFrame.values()
, and :meth:~meerkat.DataFrame.items()
. Unlike a dictionary, multiple columns in a DataFrame can be selected at once.
Selecting Multiple Columns¶
List[str]
-> DataFrame
¶
You can select multiple columns by passing a list of column names. Doing so will return a new DataFrame with a subset of the columns in the original. For example,
df = df[["img", "img_id", "label"]]
df.head()
img | img_id | label | |
---|---|---|---|
0 | n02979186_9036 | cassette player | |
1 | n02979186_11957 | cassette player | |
2 | n02979186_9715 | cassette player | |
3 | n02979186_21736 | cassette player | |
4 | ILSVRC2012_val_00046953 | cassette player |
Passing a str
that isn’t among the column names will raise a KeyError
.
Copy vs. Reference
See copying for more information.
You may be wondering whether the columns returned by indexing are copies of the columns in the original DataFrame. The columns returned by the index operator reference the same columns in the original DataFrame. This means that modifying the columns returned by the index operator will modify the columns in the original DataFrame.
Selecting Rows by Position¶
In Meerkat, the rows of a DataFrame or Column are ordered. This means that rows are uniquely identified by their position in the DataFrame or Column (similar to how the elements of a Python List are uniquely identified by their position in the list).
Row indices range from 0 to the number of rows in the DataFrame or Column minus one. To see how many rows a DataFrame or a column has we can use len()
. For example,
len(df)
13394
Above we mentioned how a DataFrame could be viewed as a dictionary mapping column names to columns. Equivalently, it also may be helpful to think of a DataFrame as a list of dictionaries mapping column names to values. The DataFrame interface supports both of these views – under the hood, storage is organized so as to make both column and row accesses fast.
Selecting a Single Row by Position¶
int
-> Row
¶
To select a single row from a DataFrame, we simply pass it’s position to the index operator.
row = df[2]
row
{'img': FileCell(fn=<meerkat.columns.deferred.file.FileLoader object at 0x7f2b2ad21df0>),
'img_id': 'n02979186_9715',
'label': 'cassette player'}
Passing an int
that is less than 0
or greater than len(df)
will raise an IndexError
.
Notice that row
holds a FileCell
object, not a PIL Image or other in-memory image object.
The “image” has not yet been loaded from disk into memory. The FileCell
knows how to load the image into memory, but stops just short of doing so.
Later on, when we want to access the image, we can call the row or cell to load the image into memory.
row()
{'img': <PIL.Image.Image image mode=RGB size=160x216>,
'img_id': 'n02979186_9715',
'label': 'cassette player'}
row["img"]()
Why do we wait to load the image into memory? Image datasets often don’t fit into memory. By deferring the loading of images until they are needed, we can manipulate large image datasets quickly.
Materializing Deferred Columns
The images in df
are stored in a subclass of DeferredColumn
called ImageColumn
.
Deferred columns are a special type of column that defer the materialization of data until it is needed. They play a central role in Meerkat as they make it easy to work with large data types like images and videos.
Learn more in the deferred guide.
int
-> Any
¶
The same position-based indexing works for selecting a single cell from a Column.
col = df["label"]
col[2]
'cassette player'
Passing an int
that is less than 0
or greater than len(df["label"])
will raise an IndexError
.
Selecting Multiple Rows by Position¶
There are three different ways we can select a subset of rows from a DataFrame or Column: via slice
, Sequence[int]
, or Sequence[bool]
.
slice
-> DataFrame
¶
To select a set of contiguous rows from a DataFrame, we can use an integer slice [start:end]
.
The subset of rows will be returned as a new DataFrame.
df[50:100]
img | img_id | label | |
---|---|---|---|
0 | n02979186_8227 | cassette player | |
1 | n02979186_4313 | cassette player | |
2 | n02979186_1148 | cassette player | |
3 | n02979186_4266 | cassette player | |
4 | n02979186_9873 | cassette player | |
... | ... | ... | ... |
45 | n02979186_734 | cassette player | |
46 | n02979186_9863 | cassette player | |
47 | n02979186_27494 | cassette player | |
48 | n02979186_11839 | cassette player | |
49 | n02979186_27347 | cassette player |
We can also use integer slices to select a set of evenly spaced rows from a DataFrame [start:end:step]
. For example, below we select every tenth row from the first 100 rows in the DataFrame.
df[0:100:10]
img | img_id | label | |
---|---|---|---|
0 | n02979186_9036 | cassette player | |
1 | n02979186_12419 | cassette player | |
2 | n02979186_6725 | cassette player | |
3 | n02979186_14793 | cassette player | |
4 | n02979186_9858 | cassette player | |
5 | n02979186_8227 | cassette player | |
6 | n02979186_16667 | cassette player | |
7 | n02979186_10993 | cassette player | |
8 | n02979186_4704 | cassette player | |
9 | n02979186_2163 | cassette player |
Sequence[int]
-> DataFrame
¶
To select multiple rows from a DataFrame we can also pass a list of int
.
small_df = df[[0, 2, 5, 8, 17]]
small_df
img | img_id | label | |
---|---|---|---|
0 | n02979186_9036 | cassette player | |
1 | n02979186_9715 | cassette player | |
2 | n02979186_10568 | cassette player | |
3 | n02979186_10756 | cassette player | |
4 | n02979186_21779 | cassette player |
Other valid sequences of int
that can be used to index are:
Tuple[int]
– a tuple of integers.np.ndarray[np.integer]
- a NumPy NDArray withdtype
np.integer
.pd.Series[np.integer]
- a Pandas Series withdtype
np.integer
.torch.Tensor[torch.int64]
- a PyTorch Tensor withdtype
torch.int
.mk.Column
- a Meerkat column who’s cells areint
,np.integer
, ortorch.int64
.
This is useful when the rows are neither contiguous nor evenly spaced (otherwise slice indexing, described above, is faster).
Sequence[bool]
-> DataFrame
¶
To select multiple rows from a DataFrame we can also pass a list of bool
the
same length as the DataFrame. Below we select the first and last rows from
the smaller DataFrame small_df
that we selected in the panel above.
small_df[[True, False, False, False, True]]
img | img_id | label | |
---|---|---|---|
0 | n02979186_9036 | cassette player | |
1 | n02979186_21779 | cassette player |
Other valid sequences of bool
that can be used to select are:
Tuple[bool]
– a tuple of bool.np.ndarray[bool]
- a NumPy NDArray withdtype
bool
.pd.Series[bool]
- a Pandas Series withdtype
bool
.torch.Tensor[torch.bool]
- a PyTorch Tensor withdtype
torch.bool
.mk.Column
- a Meerkat column who’s cells areint
,bool
, ortorch.bool
.
This is very useful for quickly selecting a subset of rows that satisfy a predicate
(like you might do with a WHERE
clause in SQL).
For example, say we want to select all rows that have a value of "parachute"
in
the "label"
column. We could do this using the following code:
parachute_df = df[df["label"] == "parachute"]
parachute_df.head()
img | img_id | label | |
---|---|---|---|
0 | n03888257_45616 | parachute | |
1 | n03888257_2919 | parachute | |
2 | n03888257_37776 | parachute | |
3 | n03888257_10639 | parachute | |
4 | n03888257_17133 | parachute |
Copy vs. Reference
See advanced/copying.rst for more information.
You may be wondering whether the rows returned by indexing are copies or references of the rows in the original DataFrame.
This depends on (1) which of the selection strategies above you use (slice
vs. Sequence[int]
vs. Sequence[bool]
) and (2) the column type (e.g. PandasSeriesColumn
, TensorColumn
).
In general, columns inherit the copying behavior of their underlying data structure.
For example, a TensorColumn
has the copying behavior of a NumPy array, as described in the Numpy indexing documentation <https://numpy.org/doc/stable/reference/arrays.indexing.html>
_.
See a more detailed discussion in advanced/copying.rst .
Selecting Rows by Key¶
It is also possible to select rows from a DataFrame by a key column.
In Meerkat, a key column is a ScalarColumn
containing str
or int
values that uniquely identify each row. The primary key in Meerkat is analogous to the primary key in a SQL database or the index in a Pandas DataFrame.
The primary key of df
is the "img_id"
column.
print(df.primary_key_name)
df.primary_key
img_id
(PandasScalarColumn) | |
---|---|
0 | n02979186_9036 |
1 | n02979186_11957 |
2 | n02979186_9715 |
3 | n02979186_21736 |
4 | ILSVRC2012_val_00046953 |
... | ... |
13389 | n03425413_17521 |
13390 | n03425413_20711 |
13391 | n03425413_19050 |
13392 | n03425413_13831 |
13393 | n03425413_1242 |
The primary key can be set using set_primary_key()
, which takes a column name or a ScalarColumn
as input.
Selecting a Single Row by Key¶
str|int
-> Row
¶
To select a single row from a DataFrame by key, we can use the .loc[]
operator and pass a key value.
df.loc["n03888257_37776"]
{'img': FileCell(fn=<meerkat.columns.deferred.file.FileLoader object at 0x7f2b2ad21df0>),
'img_id': 'n03888257_37776',
'label': 'parachute'}
Selecting Multiple Rows by Key¶
Sequence[str|int]
-> DataFrame
¶
We can also select a subset of rows in a DataFrame by passing a list of key values to .loc[]
.
df.loc[["n03888257_37776", "n03425413_20711", "n03425413_1242"]]
img | img_id | label | |
---|---|---|---|
0 | n03888257_37776 | parachute | |
1 | n03425413_20711 | gas pump | |
2 | n03425413_1242 | gas pump |
Passing a str|int
that isn’t in the primary key will raise a KeyError
.
For Pandas Users
index vs. primary key
:
Pandas DataFrames maintain an index object that is separate from the DataFrame’s columns.
The index object is used to select rows by key using the .loc[]
indexer.
In Meerkat, there is no separate index object.
Instead, we designate one of the columns the primary key and can select rows based on the values in that column using .loc[]
.
The Meerkat approach, where the primary key is a column in the DataFrame, resembles the approach taken by most SQL databases.
.iloc
:
Pandas users are likely familiar with .loc
properties of DataFrame and Series.
These properties are used to select data by integer position and by key in the index, respectively.
In Meerkat, we do not support .iloc
– to index by position, simply apply the index operator []
directly to the object.