Datasets

Meerkat provides a dataset registry that makes it easy to download datasets and load them into Meerkat data structures. For example, using get() we can download and prepare the Imagenette dataset:

In [1]: import meerkat as mk

In [2]: df = mk.datasets.get("imagenette")

Some datasets have multiple versions. For example, Imagenette provides a full-size version as well as 320-pixel and 160-pixel versions. You can list a dataset’s available versions with versions() and then request one with the version argument of get():

In [3]: mk.datasets.versions("imagenette")
Out[3]: ['full', '320px', '160px']

In [4]: mk.datasets.get("imagenette", version="160px")
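The two calls above can be combined so that a mistyped version fails before any download starts. The following is a minimal sketch, not part of Meerkat: get_checked is a hypothetical helper, and the listing and loading functions are passed in as arguments so the snippet runs without Meerkat installed.

```python
from typing import Any, Callable, List


def get_checked(
    name: str,
    version: str,
    list_versions: Callable[[str], List[str]],
    get_dataset: Callable[..., Any],
) -> Any:
    """Load `name` at `version`, failing early if the version is not listed.

    `get_checked` is a hypothetical helper for illustration only.
    """
    available = list_versions(name)
    if version not in available:
        raise ValueError(
            f"{name!r} has no version {version!r}; available: {available}"
        )
    # Forward the version as a keyword, matching the get() call shown above.
    return get_dataset(name, version=version)
```

With Meerkat installed, this would presumably be called as get_checked("imagenette", "160px", mk.datasets.versions, mk.datasets.get).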

By default, datasets are downloaded to ~/.meerkat/datasets/{name}/{version}. However, if you have already downloaded the dataset elsewhere, or you want to download it to a different location, you can pass the dataset_dir argument:

df = mk.datasets.get("imagenette", dataset_dir="/local/download/of/imagenette/full")

You can also configure Meerkat to use a different default root directory. For example, after setting mk.config.datasets.root_dir = "/local/download/of", the default location for datasets becomes /local/download/of/datasets/{name}/{version}.
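The directory layout described above can be sketched in plain Python. This is an illustration of the path pattern only, not Meerkat's actual implementation: resolve_dataset_dir is a hypothetical helper, and the default root of ~/.meerkat is an assumption inferred from the default download path.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_dataset_dir(
    name: str,
    version: str,
    root_dir: Union[str, Path] = "~/.meerkat",  # assumed default root
    dataset_dir: Optional[Union[str, Path]] = None,
) -> Path:
    """Return the directory a dataset would live in (hypothetical helper)."""
    # An explicit dataset_dir takes precedence over the default layout.
    if dataset_dir is not None:
        return Path(dataset_dir).expanduser()
    # Otherwise follow the {root_dir}/datasets/{name}/{version} pattern.
    return Path(root_dir).expanduser() / "datasets" / name / version


print(resolve_dataset_dir("imagenette", "160px", root_dir="/local/download/of"))
```

The printed path follows the {root_dir}/datasets/{name}/{version} pattern, while passing dataset_dir bypasses that pattern entirely.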

How does Meerkat’s dataset registry fit in with other dataset hubs? The purpose of the Meerkat dataset registry is to provide code for downloading datasets and loading them into DataFrame objects. The Meerkat registry, like Torchvision Datasets, doesn’t actually host any data. In contrast, dataset hubs like HuggingFace Datasets and Activeloop Hub are great community efforts that do host data. The Meerkat registry is therefore complementary to these hubs: in fact, we can currently load any dataset in the HuggingFace hub directly through our registry. For example, we can load the IMDB dataset hosted on HuggingFace with mk.datasets.get("imdb").

Contributing Datasets

We encourage users to contribute datasets to the Meerkat registry. If you’re already using Meerkat with your dataset, contributing it to the registry is straightforward: you just share the code that you’re already using to load the dataset into Meerkat. Please follow the instructions in Contributing Datasets.

The table below lists all of the datasets currently in the Meerkat registry. You can also list these datasets programmatically with mk.datasets.catalog.

name        tags                                          versions            homepage
celeba      image, face recognition                       main                link
coco        image, object recognition                     2014                link
expw        image, classification                         main                link
fer         image, facial emotion recognition             plus                link
imagenet    image, classification                         ilsvrc2012          link
imagenette  image, classification                         full, 320px, 160px  link
lvis        image, object recognition                     v1                  link
mirflickr   image, retrieval                              25k                 link
ngoa        art                                           main                link
pascal      image, object recognition                     2012                link
fer         image, facial recognition, algorithmic bias   plus                link
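As a rough illustration of working with this listing in code, the sketch below transcribes the rows of the table into plain Python and filters them by tag. It deliberately does not call mk.datasets.catalog, since the structure of the object it returns is not described here. CATALOG, names_with_tag and names_with_multiple_versions are hypothetical names used only for this example.

```python
# Rows transcribed from the table above (homepage links omitted).
CATALOG = [
    {"name": "celeba", "tags": ["image", "face recognition"], "versions": ["main"]},
    {"name": "coco", "tags": ["image", "object recognition"], "versions": ["2014"]},
    {"name": "expw", "tags": ["image", "classification"], "versions": ["main"]},
    {"name": "fer", "tags": ["image", "facial emotion recognition"], "versions": ["plus"]},
    {"name": "imagenet", "tags": ["image", "classification"], "versions": ["ilsvrc2012"]},
    {"name": "imagenette", "tags": ["image", "classification"],
     "versions": ["full", "320px", "160px"]},
    {"name": "lvis", "tags": ["image", "object recognition"], "versions": ["v1"]},
    {"name": "mirflickr", "tags": ["image", "retrieval"], "versions": ["25k"]},
    {"name": "ngoa", "tags": ["art"], "versions": ["main"]},
    {"name": "pascal", "tags": ["image", "object recognition"], "versions": ["2012"]},
    {"name": "fer", "tags": ["image", "facial recognition", "algorithmic bias"],
     "versions": ["plus"]},
]


def names_with_tag(catalog, tag):
    """Return dataset names whose tag list contains `tag`, in table order."""
    return [row["name"] for row in catalog if tag in row["tags"]]


def names_with_multiple_versions(catalog):
    """Return dataset names that list more than one version."""
    return [row["name"] for row in catalog if len(row["versions"]) > 1]


print(names_with_tag(CATALOG, "object recognition"))
print(names_with_multiple_versions(CATALOG))
```

The same kind of filtering could presumably be applied to whatever mk.datasets.catalog returns, once its columns are known.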